Data mining and predictive analysis
(DMPA.AE1)
/ ISBN: 9781644593745
Data mining is the process of discovering useful patterns and trends in large data sets, and predictive analytics is the process of extracting information from large data sets in order to make predictions and estimates about future outcomes. Data mining is becoming more widespread every day because it empowers companies to uncover profitable patterns and trends from their existing databases. With uCertify's Data Mining and Predictive Analysis course, you get hands-on experience in data mining and learn which types of analysis will uncover the most profitable nuggets of knowledge from the data while avoiding the potential pitfalls that may cost your company millions of dollars.
Lessons

34+ Lessons

58+ Exercises

120+ Quizzes

164+ Flashcards

164+ Glossary of terms
TestPrep
LiveLab

63+ LiveLab

63+ Video tutorials

02:02+ Hours
 What is Data Mining? What is Predictive Analytics?
 Why is this Course Needed?
 Who Will Benefit from this Course?
 Danger! Data Mining is Easy to do Badly
 “White-Box” Approach
 Algorithm Walk-Throughs
 Exciting New Topics
 The R Zone
 Appendix: Data Summarization and Visualization
 The Case Study: Bringing it all Together
 How the Course is Structured
 What is Data Mining? What Is Predictive Analytics?
 Wanted: Data Miners
 The Need For Human Direction of Data Mining
 The Cross-Industry Standard Process for Data Mining: CRISP-DM
 Fallacies of Data Mining
 What Tasks can Data Mining Accomplish
 The R Zone
 R References
 Exercises
 Why do We Need to Preprocess the Data?
 Data Cleaning
 Handling Missing Data
 Identifying Misclassifications
 Graphical Methods for Identifying Outliers
 Measures of Center and Spread
 Data Transformation
 Min–Max Normalization
 Z-Score Standardization
 Decimal Scaling
 Transformations to Achieve Normality
 Numerical Methods for Identifying Outliers
 Flag Variables
 Transforming Categorical Variables into Numerical Variables
 Binning Numerical Variables
 Reclassifying Categorical Variables
 Adding an Index Field
 Removing Variables that are not Useful
 Variables that Should Probably not be Removed
 Removal of Duplicate Records
 A Word About ID Fields
 The R Zone
 R Reference
 Exercises
 Hypothesis Testing Versus Exploratory Data Analysis
 Getting to Know The Data Set
 Exploring Categorical Variables
 Exploring Numeric Variables
 Exploring Multivariate Relationships
 Selecting Interesting Subsets of the Data for Further Investigation
 Using EDA to Uncover Anomalous Fields
 Binning Based on Predictive Value
 Deriving New Variables: Flag Variables
 Deriving New Variables: Numerical Variables
 Using EDA to Investigate Correlated Predictor Variables
 Summary of Our EDA
 The R Zone
 R References
 Exercises
 Need for Dimension-Reduction in Data Mining
 Principal Components Analysis
 Applying PCA to the Houses Data Set
 How Many Components Should We Extract?
 Profiling the Principal Components
 Communalities
 Validation of the Principal Components
 Factor Analysis
 Applying Factor Analysis to the Adult Data Set
 Factor Rotation
 User-Defined Composites
 An Example of a User-Defined Composite
 The R Zone
 R References
 Exercises
 Data Mining Tasks in Discovering Knowledge in Data
 Statistical Approaches to Estimation and Prediction
 Statistical Inference
 How Confident are We in Our Estimates?
 Confidence Interval Estimation of the Mean
 How to Reduce the Margin of Error
 Confidence Interval Estimation of the Proportion
 Hypothesis Testing for the Mean
 Assessing The Strength of Evidence Against The Null Hypothesis
 Using Confidence Intervals to Perform Hypothesis Tests
 Hypothesis Testing for The Proportion
 Reference
 The R Zone
 R Reference
 Exercises
 Two-Sample t-Test for Difference in Means
 Two-Sample Z-Test for Difference in Proportions
 Test for the Homogeneity of Proportions
 Chi-Square Test for Goodness of Fit of Multinomial Data
 Analysis of Variance
 Reference
 The R Zone
 R Reference
 Exercises
 Supervised Versus Unsupervised Methods
 Statistical Methodology and Data Mining Methodology
 Cross-Validation
 Overfitting
 Bias–Variance Trade-Off
 Balancing The Training Data Set
 Establishing Baseline Performance
 The R Zone
 R Reference
 Exercises
 An Example of Simple Linear Regression
 Dangers of Extrapolation
 How Useful is the Regression? The Coefficient of Determination, r²
 Standard Error of the Estimate, s
 Correlation Coefficient r
 ANOVA Table for Simple Linear Regression
 Outliers, High Leverage Points, and Influential Observations
 Population Regression Equation
 Verifying The Regression Assumptions
 Inference in Regression
 t-Test for the Relationship Between x and y
 Confidence Interval for the Slope of the Regression Line
 Confidence Interval for the Correlation Coefficient ρ
 Confidence Interval for the Mean Value of y Given x
 Prediction Interval for a Randomly Chosen Value of y Given x
 Transformations to Achieve Linearity
 Box–Cox Transformations
 The R Zone
 R References
 Exercises
 An Example of Multiple Regression
 The Population Multiple Regression Equation
 Inference in Multiple Regression
 Regression With Categorical Predictors, Using Indicator Variables
 Adjusting R²: Penalizing Models For Including Predictors That Are Not Useful
 Sequential Sums of Squares
 Multicollinearity
 Variable Selection Methods
 Gas Mileage Data Set
 An Application of Variable Selection Methods
 Using the Principal Components as Predictors in Multiple Regression
 The R Zone
 R References
 Exercises
 Classification Task
 k-Nearest Neighbor Algorithm
 Distance Function
 Combination Function
 Quantifying Attribute Relevance: Stretching the Axes
 Database Considerations
 k-Nearest Neighbor Algorithm for Estimation and Prediction
 Choosing k
 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
 The R Zone
 R References
 Exercises
 What is a Decision Tree?
 Requirements for Using Decision Trees
 Classification and Regression Trees
 C4.5 Algorithm
 Decision Rules
 Comparison of the C5.0 and CART Algorithms Applied to Real Data
 The R Zone
 R References
 Exercises
 Input and Output Encoding
 Neural Networks for Estimation and Prediction
 Simple Example of a Neural Network
 Sigmoid Activation Function
 Back-Propagation
 Gradient-Descent Method
 Back-Propagation Rules
 Example of Back-Propagation
 Termination Criteria
 Learning Rate
 Momentum Term
 Sensitivity Analysis
 Application of Neural Network Modeling
 The R Zone
 R References
 Exercises
 Simple Example of Logistic Regression
 Maximum Likelihood Estimation
 Interpreting Logistic Regression Output
 Inference: Are the Predictors Significant?
 Odds Ratio and Relative Risk
 Interpreting Logistic Regression for a Dichotomous Predictor
 Interpreting Logistic Regression for a Polychotomous Predictor
 Interpreting Logistic Regression for a Continuous Predictor
 Assumption of Linearity
 Zero-Cell Problem
 Multiple Logistic Regression
 Introducing Higher Order Terms to Handle Nonlinearity
 Validating the Logistic Regression Model
 WEKA: Hands-On Analysis Using Logistic Regression
 The R Zone
 R References
 Exercises
 Bayesian Approach
 Maximum A Posteriori (MAP) Classification
 Posterior Odds Ratio
 Balancing The Data
 Naïve Bayes Classification
 Interpreting The Log Posterior Odds Ratio
 Zero-Cell Problem
 Numeric Predictors for Naïve Bayes Classification
 WEKA: Hands-On Analysis Using Naïve Bayes
 Bayesian Belief Networks
 Clothing Purchase Example
 Using The Bayesian Network to Find Probabilities
 The R Zone
 R References
 Exercises
 Model Evaluation Techniques for the Description Task
 Model Evaluation Techniques for the Estimation and Prediction Tasks
 Model Evaluation Measures for the Classification Task
 Accuracy and Overall Error Rate
 Sensitivity and Specificity
 False-Positive Rate and False-Negative Rate
 Proportions of True Positives, True Negatives, False Positives, and False Negatives
 Misclassification Cost Adjustment to Reflect Real-World Concerns
 Decision Cost/Benefit Analysis
 Lift Charts and Gains Charts
 Interweaving Model Evaluation with Model Building
 Confluence of Results: Applying a Suite of Models
 The R Zone
 R References
 Exercises
 Hands-On Analysis
 Decision Invariance Under Row Adjustment
 Positive Classification Criterion
 Demonstration Of The Positive Classification Criterion
 Constructing The Cost Matrix
 Decision Invariance Under Scaling
 Direct Costs and Opportunity Costs
 Case Study: Cost-Benefit Analysis Using Data-Driven Misclassification Costs
 Rebalancing as a Surrogate for Misclassification Costs
 The R Zone
 R References
 Exercises
 Classification Evaluation Measures for a Generic Trinary Target
 Application of Evaluation Measures for Trinary Classification to the Loan Approval Problem
 Data-Driven Cost-Benefit Analysis for Trinary Loan Classification Problem
 Comparing CART Models With and Without Data-Driven Misclassification Costs
 Classification Evaluation Measures for a Generic k-Nary Target
 Example of Evaluation Measures and Data-Driven Misclassification Costs for k-Nary Classification
 The R Zone
 R References
 Exercises
 Review of Lift Charts and Gains Charts
 Lift Charts and Gains Charts Using Misclassification Costs
 Response Charts
 Profits Charts
 Return on Investment (ROI) Charts
 The R Zone
 R References
 Exercises
 Hands-On Exercises
 The Clustering Task
 Hierarchical Clustering Methods
 Single-Linkage Clustering
 Complete-Linkage Clustering
 k-Means Clustering
 Example of k-Means Clustering at Work
 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
 Application of k-Means Clustering Using SAS Enterprise Miner
 Using Cluster Membership to Predict Churn
 The R Zone
 R References
 Exercises
 Hands-On Analysis
 Self-Organizing Maps
 Kohonen Networks
 Example of a Kohonen Network Study
 Cluster Validity
 Application of Clustering Using Kohonen Networks
 Interpreting The Clusters
 Using Cluster Membership as Input to Downstream Data Mining Models
 The R Zone
 R References
 Exercises
 Rationale for BIRCH Clustering
 Cluster Features
 Cluster Feature Tree
 Phase 1: Building The CF Tree
 Phase 2: Clustering The Sub-Clusters
 Example of BIRCH Clustering, Phase 1: Building The CF Tree
 Example of BIRCH Clustering, Phase 2: Clustering The Sub-Clusters
 Evaluating The Candidate Cluster Solutions
 Case Study: Applying BIRCH Clustering to The Bank Loans Data Set
 The R Zone
 R References
 Exercises
 Rationale for Measuring Cluster Goodness
 The Silhouette Method
 Silhouette Example
 Silhouette Analysis of the IRIS Data Set
 The Pseudo-F Statistic
 Example of the Pseudo-F Statistic
 Pseudo-F Statistic Applied to the IRIS Data Set
 Cluster Validation
 Cluster Validation Applied to the Loans Data Set
 The R Zone
 R References
 Exercises
 Affinity Analysis and Market Basket Analysis
 Support, Confidence, Frequent Itemsets, and the A Priori Property
 How Does The A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
 How Does The A Priori Algorithm Work (Part 2)? Generating Association Rules
 Extension From Flag Data to General Categorical Data
 Information-Theoretic Approach: Generalized Rule Induction Method
 Association Rules are Easy to do Badly
 How Can We Measure the Usefulness of Association Rules?
 Do Association Rules Represent Supervised or Unsupervised Learning?
 Local Patterns Versus Global Models
 The R Zone
 R References
 Exercises
 The Segmentation Modeling Process
 Segmentation Modeling Using EDA to Identify the Segments
 Segmentation Modeling using Clustering to Identify the Segments
 The R Zone
 R References
 Exercises
 Rationale for Using an Ensemble of Classification Models
 Bias, Variance, and Noise
 When to Apply, and not to apply, Bagging
 Bagging
 Boosting
 Application of Bagging and Boosting Using IBM/SPSS Modeler
 References
 The R Zone
 R Reference
 Exercises
 Simple Model Voting
 Alternative Voting Methods
 Model Voting Process
 An Application of Model Voting
 What is Propensity Averaging?
 Propensity Averaging Process
 An Application of Propensity Averaging
 The R Zone
 R References
 Exercises
 Hands-On Analysis
 Introduction To Genetic Algorithms
 Basic Framework of a Genetic Algorithm
 Simple Example of a Genetic Algorithm at Work
 Modifications and Enhancements: Selection
 Modifications and Enhancements: Crossover
 Genetic Algorithms for Real-Valued Variables
 Using Genetic Algorithms to Train a Neural Network
 WEKA: Hands-On Analysis Using Genetic Algorithms
 The R Zone
 R References
 Exercises
 Need for Imputation of Missing Data
 Imputation of Missing Data: Continuous Variables
 Standard Error of the Imputation
 Imputation of Missing Data: Categorical Variables
 Handling Patterns in Missingness
 Reference
 The R Zone
 R References
 Cross-Industry Standard Practice for Data Mining
 Business Understanding Phase
 Data Understanding Phase, Part 1: Getting a Feel for the Data Set
 Data Preparation Phase
 Data Understanding Phase, Part 2: Exploratory Data Analysis
 Partitioning the Data
 Developing the Principal Components
 Validating the Principal Components
 Profiling the Principal Components
 Choosing the Optimal Number of Clusters Using BIRCH Clustering
 Choosing the Optimal Number of Clusters Using k-Means Clustering
 Application of k-Means Clustering
 Validating the Clusters
 Profiling the Clusters
 Do You Prefer The Best Model Performance, Or A Combination Of Performance And Interpretability?
 Modeling And Evaluation Overview
 Cost-Benefit Analysis Using Data-Driven Costs
 Variables to be Input To The Models
 Establishing The Baseline Model Performance
 Models That Use Misclassification Costs
 Models That Need Rebalancing as a Surrogate for Misclassification Costs
 Combining Models Using Voting and Propensity Averaging
 Interpreting The Most Profitable Model
 Variables to be Input to the Models
 Models that use Misclassification Costs
 Models that Need Rebalancing as a Surrogate for Misclassification Costs
 Combining Models using Voting and Propensity Averaging
 Lessons Learned
 Conclusions
 Data Summarization and Visualization
 Part 1: Summarization 1: Building Blocks Of Data Analysis
 Part 2: Visualization: Graphs and Tables For Summarizing And Organizing Data
 Part 3: Summarization 2: Measures Of Center, Variability, and Position
 Part 4: Summarization And Visualization Of Bivariate Relationships
Hands-on Activities (Live Labs)
 Analyzing a Dataset
 Handling Missing Data
 Creating a Histogram
 Creating a Scatterplot
 Creating a Normal Q-Q Plot
 Creating Indicator Variables
 Analyzing the churn Dataset
 Exploring Categorical Variables
 Exploring Numeric Variables
 Exploring Multivariate Relationships
 Investigating Correlation Values and p-values in Matrix Form
 Creating a Scree Plot
 Profiling the Principal Components
 Calculating Communalities
 Validating the Principal Components
 Applying Factor Analysis to a Dataset
 Estimating the Confidence Interval for the Mean
 Estimating the Confidence Interval of the Population Proportion
 Performing a t-test for Finding the Difference in Means
 Performing a z-test for Finding the Difference in Proportions
 Performing a Chi-Square Test for Homogeneity of Proportions
 Performing a Chi-Square Test for Goodness of Fit of Multinomial Data
 Analyzing a Variance
 Balancing the Training and Testing Datasets
 Plotting Data with a Regression Line
 Measuring the Goodness of Fit of the Regression
 Performing Regression with Other Hikers
 Verifying the Regression Assumptions
 Determining Prediction and Confidence Intervals
 Assessing Normality in Scrabble
 Applying Box-Cox Transformations
 Approximating the Relationship
 Identifying Confidence Intervals
 Creating a Dot Plot
 Determining the Sequential Sums of Squares
 Analyzing Multicollinearity
 Applying the Best Subsets Procedure in a Regression Model
 Applying the Stepwise Selection Procedure in a Regression Model
 Applying Backward Elimination Procedure
 Applying Forward Selection Procedure
 Using the Principal Components as Predictors in Multiple Regression
 Running KNN
 Calculating the Euclidean Distance
 Plotting a Classification Tree
 Running a Neural Network
 Creating a Plot for Logistic Regression
 Interpreting Logistic Regression and Odds Ratio for a Dichotomous Predictor
 Calculating Posterior Odds Ratio
 Calculating the Log Posterior Odds Ratio
 Calculating the Numeric Predictors for Naive Bayes Classification
 Estimating Costs for Benefit Analysis
 Analyzing Cost-Benefit Using Data-Driven Misclassification Costs
 Analyzing the Cost-Benefit for the Trinary Loan Classification Problem
 Using Single-Linkage Clustering
 Using Complete-Linkage Clustering
 Finding Clusters in Data
 Using a 3x2 Kohonen Network
 Interpreting Clusters
 Plotting Silhouette Values of a Dataset
 Applying Cluster Validation to a Dataset
 Viewing the Output Sorted by Support
 Predicting Income using Caps and No Caps Groups
 Using Genetic Algorithms to Train a Neural Network