This book covers the major and most recent techniques of data mining. It deals in detail with algorithms for discovering association rules, for clustering and for building decision trees, and with techniques such as neural networks, genetic algorithms, rough set theory and support vector machines as used in data mining. The algorithmic details of techniques such as Apriori, Pincer-search, Dynamic Itemset Counting, FP-Tree growth, SLIQ, SPRINT, BOAT, CART, RainForest, BIRCH, CURE, BUBBLE, ROCK, STIRR, PAM, CLARANS, DBSCAN, GSP, SPADE and SPIRIT are covered. The book also discusses the mining of web, spatial, temporal and text data. In the third edition, the chapter on data warehousing concepts was thoroughly revised to include multidimensional data modelling and cube computation, and the discussion of genetic algorithms was expanded into a separate chapter. The fourth edition adds a chapter on the ROC curve for visualizing the performance of a binary classifier, together with methods for computing AUC and its uses.
Students of computer science, mathematical science and management will find this introductory textbook beneficial for a first course on the subject; the exposition of concepts with supporting illustrative examples and exercises makes it suitable for self-study as well.
Arun K Pujari, a faculty member and Dean of the School of Computer and Information Sciences, University of Hyderabad (UoH), is currently serving as the vice-chancellor of the Central University of Rajasthan. He completed his postgraduate degree in mathematics at Sambalpur University (1974) and received his PhD from IIT Kanpur (1980). He joined UoH in 1985 as a reader and became a professor in 1990. Professor Pujari has wide experience as an administrator. He has served as a member of UGC, DST, DRDO, ISRO and AICTE, and as vice-chancellor of Sambalpur University (November 2008 to November 2011). He has also held visiting assignments at several institutions, including the Institute of Industrial Sciences, University of Tokyo; International Institute of Software Technology, United Nations University, Macau; University of Memphis, USA; and Griffith University, Australia.
Foreword Prologue Preface to the Fourth Edition Preface to the First Edition Acknowledgements
1. INTRODUCTION 1.1 Introduction 1.2 Data Mining as a Subject 1.3 Guide to this Book
2. DATA WAREHOUSING 2.1 Introduction 2.2 Data Warehouse Architecture 2.3 Dimensional Modelling 2.4 Categorisation of Hierarchies 2.5 Aggregate Function 2.6 Summarisability 2.7 Fact–Dimension Relationships 2.8 OLAP Operations 2.9 Lattice of Cuboids 2.10 OLAP Server 2.11 ROLAP 2.12 MOLAP 2.13 Cube Computation 2.14 Multiway Simultaneous Aggregation (ArrayCube) 2.15 BUC - Bottom-Up Cubing Algorithm 2.16 Condensed Cube 2.17 Coalescing 2.18 Dwarf 2.19 Other Cubing Techniques 2.20 Skycube 2.21 View Selection - Partial Materialisation 2.22 Data Marting 2.23 ETL 2.24 Data Cleaning 2.25 ELT vs. ETL 2.26 Cloud Data Warehousing Further Reading Exercises Bibliography
3. DATA MINING 3.1 Introduction 3.2 What is Data Mining? 3.3 Data Mining: Definitions 3.4 KDD vs. Data Mining 3.5 DBMS vs. DM 3.6 Other Related Areas 3.7 DM Techniques 3.8 Other Mining Problems 3.9 Issues and Challenges in DM 3.10 DM Application Areas 3.11 DM Applications - Case Studies 3.12 Conclusions Further Reading Exercises Bibliography
4. ASSOCIATION RULES 4.1 Introduction 4.2 What is an Association Rule? 4.3 Methods to Discover Association Rules 4.4 Apriori Algorithm 4.5 Partition Algorithm 4.6 Pincer-Search Algorithm 4.7 Dynamic Itemset Counting Algorithm 4.8 FP-tree Growth Algorithm 4.9 Eclat and dEclat 4.10 Rapid Association Rule Mining (RARM) 4.11 Discussion on Different Algorithms 4.12 Incremental Algorithm 4.13 Border Algorithm 4.14 Generalised Association Rule 4.15 Association Rules with Item Constraints 4.16 Summary Further Reading Exercises Bibliography
5. CLUSTERING TECHNIQUES 5.1 Introduction 5.2 Clustering Paradigms 5.3 Partitioning Algorithms 5.4 k-Medoid Algorithms 5.5 CLARA 5.6 CLARANS 5.7 Hierarchical Clustering 5.8 DBSCAN 5.9 BIRCH 5.10 CURE 5.11 Categorical Clustering Algorithms 5.12 STIRR 5.13 ROCK 5.14 CACTUS 5.15 Conclusions Further Reading Exercises Bibliography
6. DECISION TREES 6.1 Introduction 6.2 What is a Decision Tree? 6.3 Tree Construction Principle 6.4 Best Split 6.5 Splitting Indices 6.6 Splitting Criteria 6.7 Decision Tree Construction Algorithms 6.8 CART 6.9 ID3 6.10 C4.5 6.11 CHAID 6.12 Summary 6.13 Decision Tree Construction with Presorting 6.14 RainForest 6.15 Approximate Methods 6.16 CLOUDS 6.17 BOAT 6.18 Pruning Technique 6.19 Integration of Pruning and Construction 6.20 Summary: An Ideal Algorithm 6.21 Other Topics 6.22 Conclusions Further Reading Exercises Bibliography
7. ROUGH SET THEORY 7.1 Introduction 7.2 Definitions 7.3 Example 7.4 Reduct 7.5 Propositional Reasoning and PIAP to Compute Reducts 7.6 Types of Reducts 7.7 Rule Extraction 7.8 Decision Tree 7.9 Rough Sets and Fuzzy Sets 7.10 Granular Computing Further Reading Exercises Bibliography
8. GENETIC ALGORITHM 8.1 Introduction 8.2 Basic Steps of GA 8.3 Selection 8.4 Crossover 8.5 Mutation 8.6 Data Mining Using GA 8.7 GA for Rule Discovery 8.8 GA and Decision Tree 8.9 Clustering Using GA Conclusions Further Reading Exercises Bibliography
9. OTHER TECHNIQUES 9.1 Introduction 9.2 What is a Neural Network? 9.3 Learning in NN 9.4 Unsupervised Learning 9.5 Data Mining Using NN: A Case Study 9.6 Support Vector Machines 9.7 Conclusions Further Reading Exercises Bibliography
10. PERFORMANCE EVALUATION - ROC CURVE 10.1 Introduction 10.2 Classification Accuracy 10.3 ROC Space 10.4 ROC Curves 10.5 ROC Curves and Class Distribution 10.6 ROC Convex Hull (ROCCH) 10.7 Method to Find the Optimal Threshold Point 10.8 Combining Classifiers 10.9 Area Under the ROC Curve (AUC) 10.10 Methods to Compute AUC 10.11 Averaging ROC Curves 10.12 ROC for Multi-class Classifiers 10.13 Precision–Recall Graph 10.14 DET Curves 10.15 Cost Curves Further Reading Exercises Bibliography
11. WEB MINING 11.1 Introduction 11.2 Web Mining 11.3 Web Content Mining 11.4 Web Structure Mining 11.5 Web Usage Mining 11.6 Text Mining 11.7 Unstructured Text 11.8 Episode Rule Discovery for Texts 11.9 Hierarchy of Categories 11.10 Text Clustering 11.11 Conclusions Further Reading Exercises Bibliography
12. TEMPORAL AND SPATIAL DATA MINING 12.1 Introduction 12.2 What is Temporal Data Mining? 12.3 Temporal Association Rules 12.4 Sequence Mining 12.5 The GSP Algorithm 12.6 SPADE 12.7 SPIRIT 12.8 WUM 12.9 Episode Discovery 12.10 Event Prediction Problem 12.11 Time-series Analysis 12.12 Spatial Mining 12.13 Spatial Mining Tasks 12.14 Spatial Clustering 12.15 Spatial Trends 12.16 Conclusions Further Reading Exercises Bibliography
Index