Questions on Logistic Regression using R
From the text book attached below, I need solutions for the questions below from Chapter10: Logistic Regression
Problem 1: Financial conditions of banks (b, c, e)
Problem 4: Competitive auctioning on ebay (a, b, c, d, e, f)
DATA MINING
FOR BUSINESS ANALYTICS
DATA MINING
FOR BUSINESS ANALYTICS
Concepts, Techniques, and Applications in R
Galit Shmueli
Peter C. Bruce
Inbal Yahav
Nitin R. Patel
Kenneth C. Lichtendahl, Jr.
This edition first published 2018
© 2018 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
law. Advice on how to obtain permission to reuse material from this title is available at
http://www.wiley.com/go/permissions.
The right of Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl Jr. to be
identified as the authors of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at
www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness
of the contents of this work and specifically disclaim all warranties; including without limitation any implied
warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not
engaged in rendering professional services. The advice and strategies contained herein may not be suitable for
every situation. In view of on-going research, equipment modifications, changes in governmental regulations, and
the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader
is urged to review and evaluate the information provided in the package insert or instructions for each chemical,
piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of
usage and for added warnings and precautions. The fact that an organization or website is referred to in this work
as a citation and/or potential source of further information does not mean that the author or the publisher
endorses the information the organization or website may provide or recommendations it may make. Further,
readers should be aware that websites listed in this work may have changed or disappeared between when this
works was written and when it is read. No warranty may be created or extended by any promotional statements
for this work. Neither the publisher nor the author shall be liable for any damages arising here from.
Library of Congress Cataloging-in-Publication Data applied for
Hardback: 9781118879368
Cover Design: Wiley
Cover Image: © Achim Mittler, Frankfurt am Main/Gettyimages
Set in 11.5/14.5pt BemboStd by Aptara Inc., New Delhi, India
Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1
http://www.wiley.com/go/permissions
http://www.wiley.com
The beginning of wisdom is this:
Get wisdom, and whatever else you get, get insight.
– Proverbs 4:7
Contents
Foreword by Gareth James xix
Foreword by Ravi Bapna xxi
Preface to the R Edition xxiii
Acknowledgments xxvii
PART I PRELIMINARIES
CHAPTER 1 Introduction 3
1.1 What Is Business Analytics? . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 What Is Data Mining? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Data Mining and Related Terms . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Data Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Why Are There So Many Different Methods? . . . . . . . . . . . . . . . . . . . 8
1.7 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.8 Road Maps to This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Order of Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER 2 Overview of the Data Mining Process 15
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Core Ideas in Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Association Rules and Recommendation Systems . . . . . . . . . . . . . . . . . 16
Predictive Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Data Reduction and Dimension Reduction . . . . . . . . . . . . . . . . . . . . 17
Data Exploration and Visualization . . . . . . . . . . . . . . . . . . . . . . . . 17
Supervised and Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 18
2.3 The Steps in Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Preliminary Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Organization of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Predicting Home Values in the West Roxbury Neighborhood . . . . . . . . . . . 21
vii
viii CONTENTS
Loading and Looking at the Data in R . . . . . . . . . . . . . . . . . . . . . . 22
Sampling from a Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Oversampling Rare Events in Classification Tasks . . . . . . . . . . . . . . . . . 25
Preprocessing and Cleaning the Data . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Predictive Power and Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . 33
Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Creation and Use of Data Partitions . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Building a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Modeling Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Using R for Data Mining on a Local Machine . . . . . . . . . . . . . . . . . . . 43
2.8 Automating Data Mining Solutions . . . . . . . . . . . . . . . . . . . . . . . . 43
Data Mining Software: The State of the Market (by Herb Edelstein) . . . . . . . . 45
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
PART II DATA EXPLORATION AND DIMENSION REDUCTION
CHAPTER 3 Data Visualization 55
3.1 Uses of Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Base R or ggplot? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Data Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Example 1: Boston Housing Data . . . . . . . . . . . . . . . . . . . . . . . . 57
Example 2: Ridership on Amtrak Trains . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots . . . . . . . . . . . . . 59
Distribution Plots: Boxplots and Histograms . . . . . . . . . . . . . . . . . . . 61
Heatmaps: Visualizing Correlations and Missing Values . . . . . . . . . . . . . . 64
3.4 Multidimensional Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Adding Variables: Color, Size, Shape, Multiple Panels, and Animation . . . . . . . 67
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering . . . . 70
Reference: Trend Lines and Labels . . . . . . . . . . . . . . . . . . . . . . . . 74
Scaling up to Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Multivariate Plot: Parallel Coordinates Plot . . . . . . . . . . . . . . . . . . . . 75
Interactive Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5 Specialized Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Visualizing Networked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Visualizing Hierarchical Data: Treemaps . . . . . . . . . . . . . . . . . . . . . 82
Visualizing Geographical Data: Map Charts . . . . . . . . . . . . . . . . . . . . 83
3.6 Summary: Major Visualizations and Operations, by Data Mining Goal . . . . . . . 86
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Time Series Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
CHAPTER 4 Dimension Reduction 91
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2 Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
CONTENTS ix
4.3 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Example 1: House Prices in Boston . . . . . . . . . . . . . . . . . . . . . . . 93
4.4 Data Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Aggregation and Pivot Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.6 Reducing the Number of Categories in Categorical Variables . . . . . . . . . . . 99
4.7 Converting a Categorical Variable to a Numerical Variable . . . . . . . . . . . . 99
4.8 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Example 2: Breakfast Cereals . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Normalizing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Using Principal Components for Classification and Prediction . . . . . . . . . . . 109
4.9 Dimension Reduction Using Regression Models . . . . . . . . . . . . . . . . . . 111
4.10 Dimension Reduction Using Classification and Regression Trees . . . . . . . . . . 111
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
PART III PERFORMANCE EVALUATION
CHAPTER 5 Evaluating Predictive Performance 117
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.2 Evaluating Predictive Performance . . . . . . . . . . . . . . . . . . . . . . . . 118
Naive Benchmark: The Average . . . . . . . . . . . . . . . . . . . . . . . . . 118
Prediction Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Comparing Training and Validation Performance . . . . . . . . . . . . . . . . . 121
Lift Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.3 Judging Classifier Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Benchmark: The Naive Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Class Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
The Confusion (Classification) Matrix . . . . . . . . . . . . . . . . . . . . . . . 124
Using the Validation Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Accuracy Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Propensities and Cutoff for Classification . . . . . . . . . . . . . . . . . . . . . 127
Performance in Case of Unequal Importance of Classes . . . . . . . . . . . . . . 131
Asymmetric Misclassification Costs . . . . . . . . . . . . . . . . . . . . . . . . 133
Generalization to More Than Two Classes . . . . . . . . . . . . . . . . . . . . . 135
5.4 Judging Ranking Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Lift Charts for Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Decile Lift Charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Beyond Two Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Lift Charts Incorporating Costs and Benefits . . . . . . . . . . . . . . . . . . . 139
Lift as a Function of Cutoff . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.5 Oversampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Oversampling the Training Set . . . . . . . . . . . . . . . . . . . . . . . . . . 144
x CONTENTS
Evaluating Model Performance Using a Non-oversampled Validation Set . . . . . . 144
Evaluating Model Performance if Only Oversampled Validation Set Exists . . . . . 144
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
PART IV PREDICTION AND CLASSIFICATION METHODS
CHAPTER 6 Multiple Linear Regression 153
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Explanatory vs. Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . 154
6.3 Estimating the Regression Equation and Prediction . . . . . . . . . . . . . . . . 156
Example: Predicting the Price of Used Toyota Corolla Cars . . . . . . . . . . . . 156
6.4 Variable Selection in Linear Regression . . . . . . . . . . . . . . . . . . . . . 161
Reducing the Number of Predictors . . . . . . . . . . . . . . . . . . . . . . . 161
How to Reduce the Number of Predictors . . . . . . . . . . . . . . . . . . . . . 162
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
CHAPTER 7 k-Nearest Neighbors (kNN) 173
7.1 The k-NN Classifier (Categorical Outcome) . . . . . . . . . . . . . . . . . . . . 173
Determining Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Classification Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Example: Riding Mowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Choosing k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Setting the Cutoff Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
k-NN with More Than Two Classes . . . . . . . . . . . . . . . . . . . . . . . . 180
Converting Categorical Variables to Binary Dummies . . . . . . . . . . . . . . . 180
7.2 k-NN for a Numerical Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.3 Advantages and Shortcomings of k-NN Algorithms . . . . . . . . . . . . . . . . 182
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
CHAPTER 8 The Naive Bayes Classifier 187
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Cutoff Probability Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Example 1: Predicting Fraudulent Financial Reporting . . . . . . . . . . . . . . 188
8.2 Applying the Full (Exact) Bayesian Classifier . . . . . . . . . . . . . . . . . . . 189
Using the “Assign to the Most Probable Class” Method . . . . . . . . . . . . . . 190
Using the Cutoff Probability Method . . . . . . . . . . . . . . . . . . . . . . . 190
Practical Difficulty with the Complete (Exact) Bayes Procedure . . . . . . . . . . 190
Solution: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
The Naive Bayes Assumption of Conditional Independence . . . . . . . . . . . . 192
Using the Cutoff Probability Method . . . . . . . . . . . . . . . . . . . . . . . 192
Example 2: Predicting Fraudulent Financial Reports, Two Predictors . . . . . . . 193
Example 3: Predicting Delayed Flights . . . . . . . . . . . . . . . . . . . . . . 194
8.3 Advantages and Shortcomings of the Naive Bayes Classifier . . . . . . . . . . . 199
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
CONTENTS xi
CHAPTER 9 Classification and Regression Trees 205
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9.2 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Recursive Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Example 1: Riding Mowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Measures of Impurity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Tree Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Classifying a New Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.3 Evaluating the Performance of a Classification Tree . . . . . . . . . . . . . . . . 215
Example 2: Acceptance of Personal Loan . . . . . . . . . . . . . . . . . . . . . 215
9.4 Avoiding Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
Stopping Tree Growth: Conditional Inference Trees . . . . . . . . . . . . . . . . 221
Pruning the Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Best-Pruned Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.5 Classification Rules from Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.6 Classification Trees for More Than Two Classes . . . . . . . . . . . . . . . . . . 227
9.7 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Measuring Impurity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
Evaluating Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
9.8 Improving Prediction: Random Forests and Boosted Trees . . . . . . . . . . . . 229
Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Boosted Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.9 Advantages and Weaknesses of a Tree . . . . . . . . . . . . . . . . . . . . . . 232
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
CHAPTER 10 Logistic Regression 237
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
10.2 The Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . 239
10.3 Example: Acceptance of Personal Loan . . . . . . . . . . . . . . . . . . . . . . 240
Model with a Single Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Estimating the Logistic Model from Data: Computing Parameter Estimates . . . . 243
Interpreting Results in Terms of Odds (for a Profiling Goal) . . . . . . . . . . . . 244
10.4 Evaluating Classification Performance . . . . . . . . . . . . . . . . . . . . . . 247
Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
10.5 Example of Complete Analysis: Predicting Delayed Flights . . . . . . . . . . . . 250
Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Model-Fitting and Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.6 Appendix: Logistic Regression for Profiling . . . . . . . . . . . . . . . . . . . . 259
Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome . . . 259
xii CONTENTS
Appendix B: Evaluating Explanatory Power . . . . . . . . . . . . . . . . . . . . 261
Appendix C: Logistic Regression for More Than Two Classes . . . . . . . . . . . . 264
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
CHAPTER 11 Neural Nets 271
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
11.2 Concept and Structure of a Neural Network . . . . . . . . . . . . . . . . . . . . 272
11.3 Fitting a Network to Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Example 1: Tiny Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
Computing Output of Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Preprocessing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
Example 2: Classifying Accident Severity . . . . . . . . . . . . . . . . . . . . . 282
Avoiding Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Using the Output for Prediction and Classification . . . . . . . . . . . . . . . . 283
11.4 Required User Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
11.5 Exploring the Relationship Between Predictors and Outcome . . . . . . . . . . . 287
11.6 Advantages and Weaknesses of Neural Networks . . . . . . . . . . . . . . . . . 288
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
CHAPTER 12 Discriminant Analysis 293
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Example 1: Riding Mowers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
Example 2: Personal Loan Acceptance . . . . . . . . . . . . . . . . . . . . . . 294
12.2 Distance of a Record from a Class . . . . . . . . . . . . . . . . . . . . . . . . 296
12.3 Fisher’s Linear Classification Functions . . . . . . . . . . . . . . . . . . . . . . 297
12.4 Classification Performance of Discriminant Analysis . . . . . . . . . . . . . . . 300
12.5 Prior Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
12.6 Unequal Misclassification Costs . . . . . . . . . . . . . . . . . . . . . . . . . 302
12.7 Classifying More Than Two Classes . . . . . . . . . . . . . . . . . . . . . . . . 303
Example 3: Medical Dispatch to Accident Scenes . . . . . . . . . . . . . . . . . 303
12.8 Advantages and Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling 311
13.1 Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Why Ensembles Can Improve Predictive Power . . . . . . . . . . . . . . . . . . 312
Simple Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Bagging and Boosting in R . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Advantages and Weaknesses of Ensembles . . . . . . . . . . . . . . . . . . . . 315
13.2 Uplift (Persuasion) Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
A-B Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
CONTENTS xiii
Uplift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Gathering the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
A Simple Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Modeling Individual Uplift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Computing Uplift with R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Using the Results of an Uplift Model . . . . . . . . . . . . . . . . . . . . . . . 322
13.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
PART V MINING RELATIONSHIPS AMONG RECORDS
CHAPTER 14 Association Rules and Collaborative Filtering 329
14.1 Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Discovering Association Rules in Transaction Databases . . . . . . . . . . . . . 330
Example 1: Synthetic Data on Purchases of Phone Faceplates . . . . . . . . . . 330
Generating Candidate Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Selecting Strong Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
The Process of Rule Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 336
Interpreting the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
Rules and Chance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Example 2: Rules for Similar Book Purchases . . . . . . . . . . . . . . . . . . . 340
14.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Data Type and Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Example 3: Netflix Prize Contest . . . . . . . . . . . . . . . . . . . . . . . . . 343
User-Based Collaborative Filtering: “People Like You” . . . . . . . . . . . . . . 344
Item-Based Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . 347
Advantages and Weaknesses of Collaborative Filtering . . . . . . . . . . . . . . 348
Collaborative Filtering vs. Association Rules . . . . . . . . . . . . . . . . . . . 349
14.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
CHAPTER 15 Cluster Analysis 357
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Example: Public Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
15.2 Measuring Distance Between Two Records . . . . . . . . . . . . . . . . . . . . 361
Euclidean Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
Normalizing Numerical Measurements . . . . . . . . . . . . . . . . . . . . . . 362
Other Distance Measures for Numerical Data . . . . . . . . . . . . . . . . . . . 362
Distance Measures for Categorical Data . . . . . . . . . . . . . . . . . . . . . . 365
Distance Measures for Mixed Data . . . . . . . . . . . . . . . . . . . . . . . . 366
15.3 Measuring Distance Between Two Clusters . . . . . . . . . . . . . . . . . . . . 366
Minimum Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
Maximum Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
xiv CONTENTS
Average Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Centroid Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
15.4 Hierarchical (Agglomerative) Clustering . . . . . . . . . . . . . . . . . . . . . 368
Single Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Complete Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Average Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Centroid Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Ward’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
Dendrograms: Displaying Clustering Process and Results . . . . . . . . . . . . . 371
Validating Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Limitations of Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . 375
15.5 Non-Hierarchical Clustering: The k-Means Algorithm . . . . . . . . . . . . . . . 376
Choosing the Number of Clusters (k) . . . . . . . . . . . . . . . . . . . . . . . 377
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
PART VI FORECASTING TIME SERIES
CHAPTER 16 Handling Time Series 387
16.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
16.2 Descriptive vs. Predictive Modeling . . . . . . . . . . . . . . . . . . . . . . . 389
16.3 Popular Forecasting Methods in Business . . . . . . . . . . . . . . . . . . . . . 389
Combining Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
16.4 Time Series Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Example: Ridership on Amtrak Trains . . . . . . . . . . . . . . . . . . . . . . . 390
16.5 Data-Partitioning and Performance Evaluation . . . . . . . . . . . . . . . . . . 395
Benchmark Performance: Naive Forecasts . . . . . . . . . . . . . . . . . . . . 395
Generating Future Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
CHAPTER 17 Regression-Based Forecasting 401
17.1 A Model with Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Linear Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Exponential Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
Polynomial Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
17.2 A Model with Seasonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
17.3 A Model with Trend and Seasonality . . . . . . . . . . . . . . . . . . . . . . . 411
17.4 Autocorrelation and ARIMA Models . . . . . . . . . . . . . . . . . . . . . . . . 412
Computing Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Improving Forecasts by Integrating Autocorrelation Information . . . . . . . . . 416
Evaluating Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
CONTENTS xv
CHAPTER 18 Smoothing Methods 433
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
18.2 Moving Average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
Centered Moving Average for Visualization . . . . . . . . . . . . . . . . . . . . 434
Trailing Moving Average for Forecasting . . . . . . . . . . . . . . . . . . . . . 435
Choosing Window Width (w) . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
18.3 Simple Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . 439
Choosing Smoothing Parameter α . . . . . . . . . . . . . . . . . . . . . . . . 440
Relation Between Moving Average and Simple Exponential Smoothing . . . . . . 440
18.4 Advanced Exponential Smoothing . . . . . . . . . . . . . . . . . . . . . . . . 442
Series with a Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
Series with a Trend and Seasonality . . . . . . . . . . . . . . . . . . . . . . . 443
Series with Seasonality (No Trend) . . . . . . . . . . . . . . . . . . . . . . . . 443
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
PART VII DATA ANALYTICS
CHAPTER 19 Social Network Analytics 455
19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
19.2 Directed vs. Undirected Networks . . . . . . . . . . . . . . . . . . . . . . . . 457
19.3 Visualizing and Analyzing Networks . . . . . . . . . . . . . . . . . . . . . . . 458
Graph Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Edge List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Adjacency Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
Using Network Data in Classification and Prediction . . . . . . . . . . . . . . . 461
19.4 Social Data Metrics and Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . 462
Node-Level Centrality Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Egocentric Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Network Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
19.5 Using Network Metrics in Prediction and Classification . . . . . . . . . . . . . . 467
Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Entity Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
19.6 Collecting Social Network Data with R . . . . . . . . . . . . . . . . . . . . . . 471
19.7 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . 474
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
CHAPTER 20 Text Mining 479
20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words” . 480
20.3 Bag-of-Words vs. Meaning Extraction at Document Level . . . . . . . . . . . . . 481
20.4 Preprocessing the Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
Text Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
xvi CONTENTS
Presence/Absence vs. Frequency . . . . . . . . . . . . . . . . . . . . . . . . . 487
Term Frequency–Inverse Document Frequency (TF-IDF) . . . . . . . . . . . . . . 487
From Terms to Concepts: Latent Semantic Indexing . . . . . . . . . . . . . . . 488
Extracting Meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
20.5 Implementing Data Mining Methods . . . . . . . . . . . . . . . . . . . . . . . 489
20.6 Example: Online Discussions on Autos and Electronics . . . . . . . . . . . . . . 490
Importing and Labeling the Records . . . . . . . . . . . . . . . . . . . . . . . 490
Text Preprocessing in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
Producing a Concept Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
Fitting a Predictive Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
20.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
PART VIII CASES
CHAPTER 21 Cases 499
21.1 Charles Book Club . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
The Book Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
Database Marketing at Charles . . . . . . . . . . . . . . . . . . . . . . . . . . 500
Data Mining Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
21.2 German Credit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
21.3 Tayko Software Cataloger . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
The Mailing Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
21.4 Political Persuasion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513
Predictive Analytics Arrives in US Politics . . . . . . . . . . . . . . . . . . . . 513
Political Targeting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
Uplift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
21.5 Taxi Cancellations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Business Situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
21.6 Segmenting Consumers of Bath Soap . . . . . . . . . . . . . . . . . . . . . . . 518
Business Situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
Key Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
CONTENTS xvii
Measuring Brand Loyalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
21.7 Direct-Mail Fundraising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
21.8 Catalog Cross-Selling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
21.9 Predicting Bankruptcy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Predicting Corporate Bankruptcy . . . . . . . . . . . . . . . . . . . . . . . . . 525
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
21.10 Time Series Case: Forecasting Public Transportation Demand . . . . . . . . . . . 528
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Available Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Assignment Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
Tips and Suggested Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
References 531
Data Files Used in the Book 533
Index 535
Foreword by Gareth James
T he field of statistics has existed in one form or another for 200 years, and bythe second half of the 20th century had evolved into a well-respected and
essential academic discipline. However, its prominence expanded rapidly in the
1990s with the explosion of new, and enormous, data sources. For the first part
of this century, much of this attention was focused on biological applications,
in particular, genetics data generated as a result of the sequencing of the human
genome. However, the last decade has seen a dramatic increase in the availability
of data in the business disciplines, and a corresponding interest in business-related
statistical applications.
The impact has been profound. Ten years ago, when I was able to attract a
full class of MBA students to my new statistical learning elective, my colleagues
were astonished because our department struggled to fill most electives. Today,
we offer a Masters in Business Analytics, which is the largest specialized masters
program in the school and has application volume rivaling those of our MBA
programs. Our department’s faculty size and course offerings have increased
dramatically, yet the MBA students are still complaining that the classes are all
full. Google’s chief economist, Hal Varian, was indeed correct in 2009 when he
stated that “the sexy job in the next 10 years will be statisticians.”
This demand is driven by a simple, but undeniable, fact. Business analyt-
ics solutions have produced significant and measurable improvements in business
performance, on multiple dimensions and in numerous settings, and as a result,
there is a tremendous demand for individuals with the requisite skill set. How-
ever, training students in these skills is challenging given that, in addition to
the obvious required knowledge of statistical methods, they need to understand
business-related issues, possess strong communication skills, and be comfortable
dealing with multiple computational packages. Most statistics texts concentrate
on abstract training in classical methods, without much emphasis on practical,
let alone business, applications.
This book has by far the most comprehensive review of business analytics
methods that I have ever seen, covering everything from classical approaches
such as linear and logistic regression, through to modern methods like neural
xix
xx FOREWORD BY GARETH JAMES
networks, bagging and boosting, and even much more business specific proce-
dures such as social network analysis and text mining. If not the bible, it is at
the least a definitive manual on the subject. However, just as important as the
list of topics, is the way that they are all presented in an applied fashion using
business applications. Indeed the last chapter is entirely dedicated to 10 separate
cases where business analytics approaches can be applied.
In this latest edition, the authors have added an important new dimension
in the form of the R software package. Easily the most widely used and influ-
ential open source statistical software, R has become the go-to tool for such
purposes. With literally hundreds of freely available add-on packages, R can
be used for almost any business analytics related problem. The book provides
detailed descriptions and code involving applications of R in numerous business
settings, ensuring that the reader will actually be able to apply their knowledge
to real-life problems.
We recently introduced a business analytics course into our required MBA
core curriculum and I intend to make heavy use of this book in developing the
syllabus. I’m confident that it will be an indispensable tool for any such course.
GARETH JAMES
Marshall School of Business, University of Southern California, 2017
Foreword by Ravi Bapna
D ata is the new gold—and mining this gold to create business value in today’scontext of a highly networked and digital society requires a skillset that we
haven’t traditionally delivered in business or statistics or engineering programs
on their own. For those businesses and organizations that feel overwhelmed by
today’s Big Data, the phrase you ain’t seen nothing yet comes to mind. Yester-
day’s three major sources of Big Data—the 20+ years of investment in enterprise
systems (ERP, CRM, SCM, …), the 3 billion plus people on the online social
grid, and the close to 5 billion people carrying increasingly sophisticated mobile
devices—are going to be dwarfed by tomorrow’s smarter physical ecosystems
fueled by the Internet of Things (IoT) movement.
The idea that we can use sensors to connect physical objects such as homes,
automobiles, roads, even garbage bins and streetlights, to digitally optimized
systems of governance goes hand in glove with bigger data and the need for
deeper analytical capabilities. We are not far away from a smart refrigerator
sensing that you are short on, say, eggs, populating your grocery store’s mobile
app’s shopping list, and arranging a Task Rabbit to do a grocery run for you.
Or the refrigerator negotiating a deal with an Uber driver to deliver an evening
meal to you. Nor are we far away from sensors embedded in roads and vehicles
that can compute traffic congestion, track roadway wear and tear, record vehicle
use and factor these into dynamic usage-based pricing, insurance rates, and even
taxation. This brave new world is going to be fueled by analytics and the ability
to harness data for competitive advantage.
Business Analytics is an emerging discipline that is going to help us ride this
new wave. This new Business Analytics discipline requires individuals who are
grounded in the fundamentals of business such that they know the right questions
to ask, who have the ability to harness, store, and optimally process vast datasets
from a variety of structured and unstructured sources, and who can then use an
array of techniques from machine learning and statistics to uncover new insights
for decision-making. Such individuals are a rare commodity today, but their
creation has been the focus of this book for a decade now. This book’s forte is
that it relies on explaining the core set of concepts required for today’s business
analytics professionals using real-world data-rich cases in a hands-on manner,
xxi
xxii FOREWORD BY RAVI BAPNA
without sacrificing academic rigor. It provides a modern day foundation for
Business Analytics, the notion of linking the x’s to the y’s of interest in a predictive
sense. I say this with the confidence of someone who was probably the first
adopter of the zeroth edition of this book (Spring 2006 at the Indian School of
Business).
I can’t say enough about the long-awaited R edition. R is my go-to platform
for analytics these days. It’s also used by a wide variety of instructors in our MS-
Business Analytics program. The open-innovation paradigm used by R is one
key part of the analytics perfect storm, the other components being the advances
in computing and the business appetite for data-driven decision-making.
I look forward to using the book in multiple fora, in executive education,
in MBA classrooms, in MS-Business Analytics programs, and in Data Science
bootcamps. I trust you will too!
RAVI BAPNA
Carlson School of Management, University of Minnesota, 2017
Preface to the R Edition
T his textbook first appeared in early 2007 and has been used by numerousstudents and practitioners and in many courses, ranging from dedicated
data mining classes to more general business analytics courses (including our
own experience teaching this material both online and in person for more than
10 years). The first edition, based on the Excel add-in XLMiner, was followed
by two more XLMiner editions, a JMP edition, and now this R edition, with
its companion website, www.dataminingbook.com.
This new R edition, which relies on the free and open-source R soft-
ware, presents output from R, as well as the code used to produce that output,
including specification of a variety of packages and functions. Unlike computer-
science or statistics-oriented textbooks, the focus in this book is on data mining
concepts, and how to implement the associated algorithms in R. We assume a
basic facility with R.
For this R edition, two new co-authors stepped on board—Inbal Yahav and
Casey Lichtendahl—bringing both expertise teaching business analytics courses
using R and data mining consulting experience in business and government.
Such practical experience is important, since the open-source nature of R soft-
ware makes available a plethora of approaches, packages, and functions available
for data mining. Given the main goal of this book—to introduce data min-
ing concepts using R software for illustration—our challenge was to choose an
R code cocktail that supports highlighting the important concepts. In addi-
tion to providing R code and output, this edition also incorporates updates and
new material based on feedback from instructors teaching MBA, undergraduate,
diploma, and executive courses, and from their students as well.
One update, compared to the first two editions of the book, is the title:
we now use Business Analytics in place of Business Intelligence. This reflects the
change in terminology since the second edition: Business Intelligence today
refers mainly to reporting and data visualization (“what is happening now”),
while Business Analytics has taken over the “advanced analytics,” which include
predictive analytics and data mining. In this new edition, we therefore use the
updated terms.
xxiii
http://www.dataminingbook.com
xxiv PREFACE TO THE R EDITION
This R edition includes the material that was recently added in the third
edition of the original (XLMiner-based) book:
• Social network analysis
• Text mining
• Ensembles
• Uplift modeling
• Collaborative filtering
Since the appearance of the (XLMiner-based) second edition, the landscape
of the courses using the textbook has greatly expanded: whereas initially, the
book was used mainly in semester-long elective MBA-level courses, it is now
used in a variety of courses in Business Analytics degrees and certificate programs,
ranging from undergraduate programs, to post-graduate and executive education
programs. Courses in such programs also vary in their duration and coverage. In
many cases, this textbook is used across multiple courses. The book is designed to
continue supporting the general “Predictive Analytics” or “Data Mining” course
as well as supporting a set of courses in dedicated business analytics programs.
A general “Business Analytics,” “Predictive Analytics,” or “Data Mining”
course, common in MBA and undergraduate programs as a one-semester elec-
tive, would cover Parts I–III, and choose a subset of methods from Parts IV
and V. Instructors can choose to use cases as team assignments, class discussions,
or projects. For a two-semester course, Part VI might be considered, and we
recommend introducing the new Part VII (Data Analytics).
For a set of courses in a dedicated business analytics program, here are a few
courses that have been using our book:
Predictive Analytics: Supervised Learning In a dedicated Business
Analytics program, the topic of Predictive Analytics is typically instructed
across a set of courses. The first course would cover Parts I–IV and instruc-
tors typically choose a subset of methods from Part IV according to the
course length. We recommend including the new Chapter 13 in such a
course, as well as the new “Part VII: Data Analytics.”
Predictive Analytics: Unsupervised Learning This course introduces
data exploration and visualization, dimension reduction, mining relation-
ships, and clustering (Parts III and V). If this course follows the Predictive
Analytics: Supervised Learning course, then it is useful to examine examples
and approaches that integrate unsupervised and supervised learning, such as
the new part on “Data Analytics.”
Forecasting Analytics A dedicated course on time series forecasting
would rely on Part VI.
PREFACE TO THE R EDITION xxv
Advanced Analytics A course that integrates the learnings from Predictive
Analytics (supervised and unsupervised learning). Such a course can focus
on Part VII: Data Analytics, where social network analytics and text mining
are introduced. Some instructors choose to use the Cases (Chapter 21) in
such a course.
In all courses, we strongly recommend including a project component,
where data are either collected by students according to their interest or pro-
vided by the instructor (e.g., from the many data mining competition datasets
available). From our experience and other instructors’ experience, such projects
enhance the learning and provide students with an excellent opportunity to
understand the strengths of data mining and the challenges that arise in the
process.
Acknowledgments
We thank the many people who assisted us in improving the first three edi-
tions of the initial XLMiner version of this book and the JMP edition, as well
as those who helped with comments on early drafts of this R edition. Anthony
Babinec, who has been using earlier editions of this book for years in his data
mining courses at Statistics.com, provided us with detailed and expert correc-
tions. Dan Toy and John Elder IV greeted our project with early enthusiasm
and provided detailed and useful comments on initial drafts. Ravi Bapna, who
used an early draft in a data mining course at the Indian School of Business, has
provided invaluable comments and helpful suggestions since the book’s start.
Many of the instructors, teaching assistants, and students using earlier edi-
tions of the book have contributed invaluable feedback both directly and indi-
rectly, through fruitful discussions, learning journeys, and interesting data min-
ing projects that have helped shape and improve the book. These include MBA
students from the University of Maryland, MIT, the Indian School of Business,
National Tsing Hua University, and Statistics.com. Instructors from many uni-
versities and teaching programs, too numerous to list, have supported and helped
improve the book since its inception.
Several professors have been especially helpful with this R edition:
Hayri Tongarlak, Prashant Joshi, Jay Annadatha, Roger Bohn, and Sridhar
Vaithianathan provided detailed comments and R code files for the compan-
ion website; Scott Nestler has been a helpful friend of this book project from
the beginning.
Kuber Deokar, instructional operations supervisor at Statistics.com, has been
unstinting in his assistance, support, and detailed attention. We also thank
Shweta Jadhav and Dhanashree Vishwasrao, assistant teachers. Valerie Troiano
has shepherded many instructors and students through the Statistics.com courses
that have helped nurture the development of these books.
Colleagues and family members have been providing ongoing feedback
and assistance with this book project. Boaz Shmueli and Raquelle Azran gave
detailed editorial comments and suggestions on the first two editions; Bruce
McCullough and Adam Hughes did the same for the first edition. Noa Shmueli
provided careful proofs of the third edition. Ran Shenberger offered design tips.
xxvii
xxviii ACKNOWLEDGMENTS
Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of
targeting for the 2004 Kerry campaign and the 2008 Obama campaign, provided
the scenario and data for the section on uplift modeling. We also thank Jen Gol-
beck, director of the Social Intelligence Lab at the University of Maryland and
author of Analyzing the Social Web, whose book inspired our presentation in the
chapter on social network analytics. Randall Pruim contributed extensively to
the chapter on visualization.
Marietta Tretter at Texas A&M shared comments and thoughts on the time
series chapters, and Stephen Few and Ben Shneiderman provided feedback and
suggestions on the data visualization chapter and overall design tips.
Susan Palocsay and Mia Stephens have provided suggestions and feedback
on numerous occasions, as has Margret Bjarnadottir. We also thank Catherine
Plaisant at the University of Maryland’s Human–Computer Interaction Lab, who
helped out in a major way by contributing exercises and illustrations to the data
visualization chapter. Gregory Piatetsky-Shapiro, founder of KDNuggets.com,
has been generous with his time and counsel over the many years of this project.
This book would not have seen the light of day without the nurturing sup-
port of the faculty at the Sloan School of Management at MIT. Our special
thanks to Dimitris Bertsimas, James Orlin, Robert Freund, Roy Welsch, Gor-
don Kaufmann, and Gabriel Bitran. As teaching assistants for the data mining
course at Sloan, Adam Mersereau gave detailed comments on the notes and cases
that were the genesis of this book, Romy Shioda helped with the preparation
of several cases and exercises used here, and Mahesh Kumar helped with the
material on clustering.
Colleagues at the University of Maryland’s Smith School of Business: Shri-
vardhan Lele, Wolfgang Jank, and Paul Zantek provided practical advice and
comments. We thank Robert Windle, and University of Maryland MBA stu-
dents Timothy Roach, Pablo Macouzet, and Nathan Birckhead for invaluable
datasets. We also thank MBA students Rob Whitener and Daniel Curtis for the
heatmap and map charts.
Anand Bodapati provided both data and advice. Jake Hofman from Micro-
soft Research and Sharad Borle assisted with data access. Suresh Ankolekar and
Mayank Shah helped develop several cases and provided valuable pedagogical
comments. Vinni Bhandari helped write the Charles Book Club case.
We would like to thank Marvin Zelen, L. J. Wei, and Cyrus Mehta at Har-
vard, as well as Anil Gore at Pune University, for thought-provoking discussions
on the relationship between statistics and data mining. Our thanks to Richard
Larson of the Engineering Systems Division, MIT, for sparking many stimu-
lating ideas on the role of data mining in modeling complex systems. Over a
decade ago, they helped us develop a balanced philosophical perspective on the
emerging field of data mining.
ACKNOWLEDGMENTS xxix
Lastly, we thank the folks at Wiley for the decade-long successful journey
of this book. Steve Quigley at Wiley showed confidence in this book from
the beginning and helped us navigate through the publishing process with great
speed. Curt Hinrichs’ vision, tips, and encouragement helped bring this book to
the starting gate. Jon Gurstelle, Kathleen Pagliaro, and Katrina Maceda greatly
assisted us in pushing ahead and finalizing this R edition. We are also especially
grateful to Amy Hendrickson, who assisted with typesetting and making this
book beautiful.
Part I
Preliminaries
CHAPTER 1
Introduction
1.1 What Is Business Analytics?
Business Analytics (BA) is the practice and art of bringing quantitative data to bear
on decision-making. The term means different things to different organizations.
Consider the role of analytics in helping newspapers survive the transition
to a digital world. One tabloid newspaper with a working-class readership in
Britain had launched a web version of the paper, and did tests on its home
page to determine which images produced more hits: cats, dogs, or monkeys.
This simple application, for this company, was considered analytics. By contrast,
the Washington Post has a highly influential audience that is of interest to big
defense contractors: it is perhaps the only newspaper where you routinely see
advertisements for aircraft carriers. In the digital environment, the Post can
track readers by time of day, location, and user subscription information. In this
fashion, the display of the aircraft carrier advertisement in the online paper may
be focused on a very small group of individuals—say, the members of the House
and Senate Armed Services Committees who will be voting on the Pentagon’s
budget.
Business Analytics, or more generically, analytics, include a range of data
analysis methods. Many powerful applications involve little more than count-
ing, rule-checking, and basic arithmetic. For some organizations, this is what is
meant by analytics.
The next level of business analytics, now termed Business Intelligence (BI),
refers to data visualization and reporting for understanding “what happened and
what is happening.” This is done by use of charts, tables, and dashboards to
display, examine, and explore data. BI, which earlier consisted mainly of gener-
ating static reports, has evolved into more user-friendly and effective tools and
practices, such as creating interactive dashboards that allow the user not only to
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
3
4 INTRODUCTION
access real-time data, but also to directly interact with it. Effective dashboards
are those that tie directly into company data, and give managers a tool to quickly
see what might not readily be apparent in a large complex database. One such
tool for industrial operations managers displays customer orders in a single two-
dimensional display, using color and bubble size as added variables, showing
customer name, type of product, size of order, and length of time to produce.
Business Analytics now typically includes BI as well as sophisticated data
analysis methods, such as statistical models and data mining algorithms used for
exploring data, quantifying and explaining relationships between measurements,
and predicting new records. Methods like regression models are used to describe
and quantify “on average” relationships (e.g., between advertising and sales),
to predict new records (e.g., whether a new patient will react positively to a
medication), and to forecast future values (e.g., next week’s web traffic).
Readers familiar with earlier editions of this book may have noticed that the
book title has changed from Data Mining for Business Intelligence to Data Mining
for Business Analytics in this edition. The change reflects the more recent term
BA, which overtook the earlier term BI to denote advanced analytics. Today,
BI is used to refer to data visualization and reporting.
W H O U S E S P R E D I C T I V E A N A L Y T I C S ?
The widespread adoption of predictive analytics, coupled with the accelerating avail-
ability of data, has increased organizations’ capabilities throughout the economy.
A few examples:
Credit scoring: One long-established use of predictive modeling techniques for
business prediction is credit scoring. A credit score is not some arbitrary judgment
of credit-worthiness; it is based mainly on a predictive model that uses prior data
to predict repayment behavior.
Future purchases: A more recent (and controversial) example is Target’s use of
predictive modeling to classify sales prospects as “pregnant” or “not-pregnant.”
Those classified as pregnant could then be sent sales promotions at an early stage
of pregnancy, giving Target a head start on a significant purchase stream.
Tax evasion: The US Internal Revenue Service found it was 25 times more likely to
find tax evasion when enforcement activity was based on predictive models, allowing
agents to focus on the most-likely tax cheats (Siegel, 2013).
The Business Analytics toolkit also includes statistical experiments, the most
common of which is known to marketers as A-B testing. These are often used
for pricing decisions:
• Orbitz, the travel site, found that it could price hotel options higher for
Mac users than Windows users.
• Staples online store found it could charge more for staplers if a customer
lived far from a Staples store.
WHAT IS DATA MINING? 5
Beware the organizational setting where analytics is a solution in search of
a problem: A manager, knowing that business analytics and data mining are hot
areas, decides that her organization must deploy them too, to capture that hidden
value that must be lurking somewhere. Successful use of analytics and data
mining requires both an understanding of the business context where value is to
be captured, and an understanding of exactly what the data mining methods do.
1.2 What Is Data Mining?
In this book, data mining refers to business analytics methods that go beyond
counts, descriptive techniques, reporting, and methods based on business rules.
While we do introduce data visualization, which is commonly the first step into
more advanced analytics, the book focuses mostly on the more advanced data
analytics tools. Specifically, it includes statistical and machine-learning meth-
ods that inform decision-making, often in an automated fashion. Prediction is
typically an important component, often at the individual level. Rather than
“what is the relationship between advertising and sales,” we might be interested
in “what specific advertisement, or recommended product, should be shown to
a given online shopper at this moment?” Or we might be interested in clustering
customers into different “personas” that receive different marketing treatment,
then assigning each new prospect to one of these personas.
The era of Big Data has accelerated the use of data mining. Data mining
methods, with their power and automaticity, have the ability to cope with huge
amounts of data and extract value.
1.3 Data Mining and Related Terms
The field of analytics is growing rapidly, both in terms of the breadth of appli-
cations, and in terms of the number of organizations using advanced analytics.
As a result, there is considerable overlap and inconsistency of definitions.
The term data mining itself means different things to different people. To the
general public, it may have a general, somewhat hazy and pejorative meaning
of digging through vast stores of (often personal) data in search of something
interesting. One major consulting firm has a “data mining department,” but its
responsibilities are in the area of studying and graphing past data in search of
general trends. And, to confuse matters, their more advanced predictive models
are the responsibility of an “advanced analytics department.” Other terms that
organizations use are predictive analytics, predictive modeling, and machine learning.
Data mining stands at the confluence of the fields of statistics and machine
learning (also known as artificial intelligence). A variety of techniques for explor-
ing data and building models have been around for a long time in the world of
6 INTRODUCTION
statistics: linear regression, logistic regression, discriminant analysis, and princi-
pal components analysis, for example. But the core tenets of classical statistics—
computing is difficult and data are scarce—do not apply in data mining applica-
tions where both data and computing power are plentiful.
This gives rise to Daryl Pregibon’s description of data mining as “statistics
at scale and speed” (Pregibon, 1999). Another major difference between the
fields of statistics and machine learning is the focus in statistics on inference from
a sample to the population regarding an “average effect”—for example, “a $1
price increase will reduce average demand by 2 boxes.” In contrast, the focus in
machine learning is on predicting individual records—“the predicted demand
for person i given a $1 price increase is 1 box, while for person j it is 3 boxes.”
The emphasis that classical statistics places on inference (determining whether a
pattern or interesting result might have happened by chance in our sample) is
absent from data mining.
In comparison to statistics, data mining deals with large datasets in an open-
ended fashion, making it impossible to put the strict limits around the question
being addressed that inference would require. As a result, the general approach
to data mining is vulnerable to the danger of overfitting, where a model is fit
so closely to the available sample of data that it describes not merely structural
characteristics of the data, but random peculiarities as well. In engineering terms,
the model is fitting the noise, not just the signal.
In this book, we use the term machine learning to refer to algorithms that
learn directly from data, especially local patterns, often in layered or iterative
fashion. In contrast, we use statistical models to refer to methods that apply global
structure to the data. A simple example is a linear regression model (statistical) vs.
a k-nearest-neighbors algorithm (machine learning). A given record would be
treated by linear regression in accord with an overall linear equation that applies
to all the records. In k-nearest neighbors, that record would be classified in
accord with the values of a small number of nearby records.
Lastly, many practitioners, particularly those from the IT and computer sci-
ence communities, use the term machine learning to refer to all the methods dis-
cussed in this book.
1.4 Big Data
Data mining and Big Data go hand in hand. Big Data is a relative term—data
today are big by reference to the past, and to the methods and devices available
to deal with them. The challenge Big Data presents is often characterized by the
four V’s—volume, velocity, variety, and veracity. Volume refers to the amount of
data. Velocity refers to the flow rate—the speed at which it is being generated and
changed. Variety refers to the different types of data being generated (currency,
DATA SCIENCE 7
dates, numbers, text, etc.). Veracity refers to the fact that data is being generated
by organic distributed processes (e.g., millions of people signing up for services
or free downloads) and not subject to the controls or quality checks that apply
to data collected for a study.
Most large organizations face both the challenge and the opportunity of Big
Data because most routine data processes now generate data that can be stored
and, possibly, analyzed. The scale can be visualized by comparing the data in a
traditional statistical analysis (say, 15 variables and 5000 records) to the Walmart
database. If you consider the traditional statistical study to be the size of a period
at the end of a sentence, then the Walmart database is the size of a football field.
And that probably does not include other data associated with Walmart—social
media data, for example, which comes in the form of unstructured text.
If the analytical challenge is substantial, so can be the reward:
• OKCupid, the online dating site, uses statistical models with their data
to predict what forms of message content are most likely to produce a
response.
• Telenor, a Norwegian mobile phone service company, was able to reduce
subscriber turnover 37% by using models to predict which customers were
most likely to leave, and then lavishing attention on them.
• Allstate, the insurance company, tripled the accuracy of predicting injury
liability in auto claims by incorporating more information about vehicle
type.
The above examples are from Eric Siegel’s book Predictive Analytics (2013, Wiley).
Some extremely valuable tasks were not even feasible before the era of Big
Data. Consider web searches, the technology on which Google was built. In
early days, a search for “Ricky Ricardo Little Red Riding Hood” would have
yielded various links to the I Love LucyTV show, other links to Ricardo’s career as
a band leader, and links to the children’s story of Little Red Riding Hood. Only
once the Google database had accumulated sufficient data (including records of
what users clicked on) would the search yield, in the top position, links to the
specific I Love Lucy episode in which Ricky enacts, in a comic mixture of Spanish
and English, Little Red Riding Hood for his infant son.
1.5 Data Science
The ubiquity, size, value, and importance of Big Data has given rise to a new
profession: the data scientist. Data science is a mix of skills in the areas of statistics,
machine learning, math, programming, business, and IT. The term itself is thus
broader than the other concepts we discussed above, and it is a rare individual
who combines deep skills in all the constituent areas. In their book Analyzing
8 INTRODUCTION
the Analyzers (Harris et al., 2013), the authors describe the skill sets of most data
scientists as resembling a ‘T’—deep in one area (the vertical bar of the T), and
shallower in other areas (the top of the T).
At a large data science conference session (Strata+Hadoop World, Octo-
ber 2014), most attendees felt that programming was an essential skill, though
there was a sizable minority who felt otherwise. And, although Big Data is the
motivating power behind the growth of data science, most data scientists do not
actually spend most of their time working with terabyte-size or larger data.
Data of the terabyte or larger size would be involved at the deployment stage
of a model. There are manifold challenges at that stage, most of them IT and
programming issues related to data-handling and tying together different compo-
nents of a system. Much work must precede that phase. It is that earlier piloting
and prototyping phase on which this book focuses—developing the statistical and
machine learning models that will eventually be plugged into a deployed system.
What methods do you use with what sorts of data and problems? How do the
methods work? What are their requirements, their strengths, their weaknesses?
How do you assess their performance?
1.6 Why Are There So Many Different
Methods?
As can be seen in this book or any other resource on data mining, there are
many different methods for prediction and classification. You might ask yourself
why they coexist, and whether some are better than others. The answer is that
each method has advantages and disadvantages. The usefulness of a method
can depend on factors such as the size of the dataset, the types of patterns that
exist in the data, whether the data meet some underlying assumptions of the
method, how noisy the data are, and the particular goal of the analysis. A small
illustration is shown in Figure 1.1, where the goal is to find a combination of
household income level and household lot size that separates buyers (solid circles) from
FIGURE 1.1 TWO METHODS FOR SEPARATING OWNERS FROM NONOWNERS
TERMINOLOGY AND NOTATION 9
nonbuyers (hollow circles) of riding mowers. The first method (left panel) looks
only for horizontal and vertical lines to separate buyers from nonbuyers, whereas
the second method (right panel) looks for a single diagonal line.
Different methods can lead to different results, and their performance can
vary. It is therefore customary in data mining to apply several different methods
and select the one that appears most useful for the goal at hand.
1.7 Terminology and Notation
Because of the hybrid parentry of data mining, its practitioners often use multiple
terms to refer to the same thing. For example, in the machine learning (artificial
intelligence) field, the variable being predicted is the output variable or target
variable. To a statistician, it is the dependent variable or the response. Here is a
summary of terms used:
Algorithm A specific procedure used to implement a particular data mining
technique: classification tree, discriminant analysis, and the like.
Attribute see Predictor.
Case see Observation.
Confidence A performance measure in association rules of the type “IF A and
B are purchased, THEN C is also purchased.” Confidence is the conditional
probability that C will be purchased IF A and B are purchased.
Confidence also has a broader meaning in statistics (confidence interval), concern-
ing the degree of error in an estimate that results from selecting one sample
as opposed to another.
Dependent Variable see Response.
Estimation see Prediction.
Feature see Predictor.
Holdout Data (or holdout set) A sample of data not used in fitting a model,
but instead used to assess the performance of that model. This book uses the
terms validation set and test set instead of holdout set.
Input Variable see Predictor.
Model An algorithm as applied to a dataset, complete with its settings (many of
the algorithms have parameters that the user can adjust).
Observation The unit of analysis on which the measurements are taken (a cus-
tomer, a transaction, etc.), also called instance, sample, example, case, record,
pattern, or row. In spreadsheets, each row typically represents a record; each
column, a variable. Note that the use of the term “sample” here is dif-
ferent from its usual meaning in statistics, where it refers to a collection of
observations.
10 INTRODUCTION
Outcome Variable see Response.
Output Variable see Response.
P (A | B) The conditional probability of event A occurring given that event
B has occurred, read as “the probability that A will occur given that B has
occurred.”
Prediction The prediction of the numerical value of a continuous output vari-
able; also called estimation.
Predictor A variable, usually denoted by X , used as an input into a predic-
tive model, also called a feature, input variable, independent variable, or from a
database perspective, a field.
Profile A set of measurements on an observation (e.g., the height, weight, and
age of a person).
Record see Observation.
Response A variable, usually denoted by Y , which is the variable being pre-
dicted in supervised learning, also called dependent variable, output variable,
target variable, or outcome variable.
Sample In the statistical community, “sample” means a collection of observa-
tions. In the machine learning community, “sample” means a single obser-
vation.
Score A predicted value or class. Scoring new data means using a model devel-
oped with training data to predict output values in new data.
Success Class The class of interest in a binary outcome (e.g., purchasers in the
outcome purchase/no purchase).
Supervised Learning The process of providing an algorithm (logistic regres-
sion, regression tree, etc.) with records in which an output variable of inter-
est is known and the algorithm “learns” how to predict this value with new
records where the output is unknown.
Target see Response.
Test Data (or test set) The portion of the data used only at the end of the
model building and selection process to assess how well the final model might
perform on new data.
Training Data (or training set) The portion of the data used to fit a model.
Unsupervised Learning An analysis in which one attempts to learn patterns
in the data other than predicting an output value of interest.
Validation Data (or validation set) The portion of the data used to assess
how well the model fits, to adjust models, and to select the best model from
among those that have been tried.
Variable Any measurement on the records, including both the input (X) vari-
ables and the output (Y ) variable.
ROAD MAPS TO THIS BOOK 11
1.8 Road Maps to This Book
The book covers many of the widely used predictive and classification methods
as well as other data mining tools. Figure 1.2 outlines data mining from a process
perspective and where the topics in this book fit in. Chapter numbers are indi-
cated beside the topic. Table 1.1 provides a different perspective: it organizes
data mining procedures according to the type and structure of the data.
Order of Topics
The book is divided into five parts: Part I (Chapters 1–2) gives a general
overview of data mining and its components. Part II (Chapters 3–4) focuses
on the early stages of data exploration and dimension reduction.
Part III (Chapter 5) discusses performance evaluation. Although it contains
only one chapter, we discuss a variety of topics, from predictive performance
metrics to misclassification costs. The principles covered in this part are crucial
for the proper evaluation and comparison of supervised learning methods.
Part IV includes eight chapters (Chapters 6–13), covering a variety of popular
supervised learning methods (for classification and/or prediction). Within this
part, the topics are generally organized according to the level of sophistication
of the algorithms, their popularity, and ease of understanding. The final chapter
introduces ensembles and combinations of methods.
FIGURE 1.2 DATA MINING FROM A PROCESS PERSPECTIVE. NUMBERS IN PARENTHESES
INDICATE CHAPTER NUMBERS
12 INTRODUCTION
TABLE 1.1 ORGANIZATION OF DATA MINING METHODS IN THIS BOOK, ACCORDING TO THE
NATURE OF THE DATA∗
Supervised Unsupervised
Continuous Categorical
Response Response No Response
Continuous Linear regression (6) Logistic regression (10) Principal components (4)
predictors Neural nets (11) Neural nets (11) Cluster analysis (15)
k-Nearest neighbors (7) Discriminant analysis (12) Collaborative filtering (14)
Ensembles (13) k-Nearest neighbors (7)
Ensembles (13)
Categorical Linear regression (6) Neural nets (11) Association rules (14)
predictors Neural nets (11) Classification trees (9) Collaborative filtering (14)
Regression trees (9) Logistic regression (10)
Ensembles (13) Naive Bayes (8)
Ensembles (13)
∗Numbers in parentheses indicate chapter number.
Part V focuses on unsupervised mining of relationships. It presents associa-
tion rules and collaborative filtering (Chapter 14) and cluster analysis (Chapter
15).
Part VI includes three chapters (Chapters 16–18), with the focus on fore-
casting time series. The first chapter covers general issues related to handling
and understanding time series. The next two chapters present two popular fore-
casting approaches: regression-based forecasting and smoothing methods.
Part VII (Chapters 19–20) presents two broad data analytics topics: social
network analysis and text mining. These methods apply data mining to special-
ized data structures: social networks and text.
Finally, part VIII includes a set of cases.
Although the topics in the book can be covered in the order of the chapters,
each chapter stands alone. We advise, however, to read parts I–III before pro-
ceeding to chapters in parts IV–V. Similarly, Chapter 16 should precede other
chapters in part VI.
U S I N G R A N D R S T U D I O
To facilitate a hands-on data mining experience, this book uses R, a free software
environment for statistical computing and graphics, and RStudio, an integrated
development environment (IDE) for R. The R programming language is widely used
in academia and industry for data mining and data analysis. R offers a variety of
methods for analyzing data, provided by a variety of separate packages. Among the
numerous packages, R has extensive coverage of statistical and data mining tech-
niques for classification, prediction, mining associations and text, forecasting, and
ROAD MAPS TO THIS BOOK 13
data exploration and reduction. It offers a variety of supervised data mining tools:
neural nets, classification and regression trees, k-nearest-neighbor classification,
naive Bayes, logistic regression, linear regression, and discriminant analysis, all for
predictive modeling. R’s packages also cover unsupervised algorithms: association
rules, collaborative filtering, principal components analysis, k-means clustering,
and hierarchical clustering, as well as visualization tools and data-handling util-
ities. Often, the same method is implemented in multiple packages, as we will
discuss throughout the book. The illustrations, exercises, and cases in this book
are written in relation to R.
Download: To download R and RStudio, visit www.r-project.org and www.rstudio.
com/products/RStudio and follow the instructions there.
Installation: Install both R and RStudio. Note that R releases new versions fairly
often. When a new version is released, some packages might require a new instal-
lation of R (this is rare).
Use: To start using R, open RStudio, then open a new script under File > New
File > R Script. RStudio contains four panels as shown in Figure 1.3: Script (top
left), Console (bottom left), Environment (top right), and additional information,
such as plot and help (bottom right). To run a selected code line from the Script
panel, press ctrl+r. Code lines starting with # are comments.
Package Installation: To start using an R package, you will first need to install it.
Installation is done via the information panel (tab ”packages”) or using command
install.packages(). New packages might not support old R versions and
require a new R installation.
FIGURE 1.3 RSTUDIO SCREEN
http://www.r-project.org
http://www.rstudio.com/products/RStudio
http://www.rstudio.com/products/RStudio
CHAPTER 2
Overview of the Data Mining
Process
In this chapter, we give an overview of the steps involved in data mining, starting
from a clear goal definition and ending with model deployment. The general
steps are shown schematically in Figure 2.1. We also discuss issues related to
data collection, cleaning, and preprocessing. We introduce the notion of data
partitioning, where methods are trained on a set of training data and then their
performance is evaluated on a separate set of validation data, as well as explain
how this practice helps avoid overfitting. Finally, we illustrate the steps of model
building by applying them to data.
Define
Purpose
Obtain
Data
Explore
& Clean
Data
Determine
DM Task
Choose
DM
Methods
Apply Methods
& Select Final
Model
Evaluate
Performance
Deploy
FIGURE 2.1 SCHEMATIC OF THE DATA MODELING PROCESS
2.1 Introduction
In Chapter 1, we saw some very general definitions of data mining. In this chap-
ter, we introduce the variety of methods sometimes referred to as data mining.
The core of this book focuses on what has come to be called predictive analyt-
ics, the tasks of classification and prediction as well as pattern discovery, which
have become key elements of a “business analytics” function in most large firms.
These terms are described and illustrated below.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
15
16 OVERVIEW OF THE DATA MINING PROCESS
Not covered in this book to any great extent are two simpler database meth-
ods that are sometimes considered to be data mining techniques: (1) OLAP
(online analytical processing) and (2) SQL (structured query language). OLAP
and SQL searches on databases are descriptive in nature and are based on business
rules set by the user (e.g., “find all credit card customers in a certain zip code
with annual charges > $20,000, who own their home and who pay the entire
amount of their monthly bill at least 95% of the time.”) Although SQL queries
are often used to obtain the data in data mining, they do not involve statistical
modeling or automated algorithmic methods.
2.2 Core Ideas in Data Mining
Classification
Classification is perhaps the most basic form of data analysis. The recipient of
an offer can respond or not respond. An applicant for a loan can repay on time,
repay late, or declare bankruptcy. A credit card transaction can be normal or
fraudulent. A packet of data traveling on a network can be benign or threatening.
A bus in a fleet can be available for service or unavailable. The victim of an illness
can be recovered, still be ill, or be deceased.
A common task in data mining is to examine data where the classification
is unknown or will occur in the future, with the goal of predicting what that
classification is or will be. Similar data where the classification is known are
used to develop rules, which are then applied to the data with the unknown
classification.
Prediction
Prediction is similar to classification, except that we are trying to predict the
value of a numerical variable (e.g., amount of purchase) rather than a class (e.g.,
purchaser or nonpurchaser). Of course, in classification we are trying to predict
a class, but the term prediction in this book refers to the prediction of the value of a
continuous variable. (Sometimes in the data mining literature, the terms estima-
tion and regression are used to refer to the prediction of the value of a continuous
variable, and prediction may be used for both continuous and categorical data.)
Association Rules and Recommendation Systems
Large databases of customer transactions lend themselves naturally to the analysis
of associations among items purchased, or “what goes with what.” Association
rules, or affinity analysis, is designed to find such general associations patterns
between items in large databases. The rules can then be used in a variety of ways.
For example, grocery stores can use such information for product placement.
CORE IDEAS IN DATA MINING 17
They can use the rules for weekly promotional offers or for bundling products.
Association rules derived from a hospital database on patients’ symptoms during
consecutive hospitalizations can help find “which symptom is followed by what
other symptom” to help predict future symptoms for returning patients.
Online recommendation systems, such as those used on Amazon.com and
Netflix.com, use collaborative filtering, a method that uses individual users’ pref-
erences and tastes given their historic purchase, rating, browsing, or any other
measurable behavior indicative of preference, as well as other users’ history. In
contrast to association rules that generate rules general to an entire population,
collaborative filtering generates “what goes with what” at the individual user
level. Hence, collaborative filtering is used in many recommendation systems
that aim to deliver personalized recommendations to users with a wide range of
preferences.
Predictive Analytics
Classification, prediction, and to some extent, association rules and collaborative
filtering constitute the analytical methods employed in predictive analytics. The
term predictive analytics is sometimes used to also include data pattern identifi-
cation methods such as clustering.
Data Reduction and Dimension Reduction
The performance of data mining algorithms is often improved when the num-
ber of variables is limited, and when large numbers of records can be grouped
into homogeneous groups. For example, rather than dealing with thousands of
product types, an analyst might wish to group them into a smaller number of
groups and build separate models for each group. Or a marketer might want to
classify customers into different “personas,” and must therefore group customers
into homogeneous groups to define the personas. This process of consolidating
a large number of records (or cases) into a smaller set is termed data reduction.
Methods for reducing the number of cases are often called clustering.
Reducing the number of variables is typically called dimension reduction.
Dimension reduction is a common initial step before deploying data min-
ing methods, intended to improve predictive power, manageability, and inter-
pretability.
Data Exploration and Visualization
One of the earliest stages of engaging with a dataset is exploring it. Exploration
is aimed at understanding the global landscape of the data, and detecting unusual
values. Exploration is used for data cleaning and manipulation as well as for
visual discovery and “hypothesis generation.”
18 OVERVIEW OF THE DATA MINING PROCESS
Methods for exploring data include looking at various data aggregations and
summaries, both numerically and graphically. This includes looking at each
variable separately as well as looking at relationships among variables. The pur-
pose is to discover patterns and exceptions. Exploration by creating charts and
dashboards is called Data Visualization or Visual Analytics. For numerical vari-
ables, we use histograms and boxplots to learn about the distribution of their
values, to detect outliers (extreme observations), and to find other information
that is relevant to the analysis task. Similarly, for categorical variables, we use
bar charts. We can also look at scatter plots of pairs of numerical variables to
learn about possible relationships, the type of relationship, and again, to detect
outliers. Visualization can be greatly enhanced by adding features such as color
and interactive navigation.
Supervised and Unsupervised Learning
A fundamental distinction among data mining techniques is between supervised
and unsupervised methods. Supervised learning algorithms are those used in classi-
fication and prediction. We must have data available in which the value of the
outcome of interest (e.g., purchase or no purchase) is known. Such data are
also called “labeled data,” since they contain the label (outcome value) for each
record. These training data are the data from which the classification or predic-
tion algorithm “learns,” or is “trained,” about the relationship between predictor
variables and the outcome variable. Once the algorithm has learned from the
training data, it is then applied to another sample of labeled data (the validation
data) where the outcome is known but initially hidden, to see how well it does
in comparison to other models. If many different models are being tried out, it
is prudent to save a third sample, which also includes known outcomes (the test
data) to use with the model finally selected to predict how well it will do. The
model can then be used to classify or predict the outcome of interest in new
cases where the outcome is unknown.
Simple linear regression is an example of a supervised learning algorithm
(although rarely called that in the introductory statistics course where you prob-
ably first encountered it). The Y variable is the (known) outcome variable and
the X variable is a predictor variable. A regression line is drawn to minimize the
sum of squared deviations between the actual Y values and the values predicted
by this line. The regression line can now be used to predict Y values for new
values of X for which we do not know the Y value.
Unsupervised learning algorithms are those used where there is no outcome
variable to predict or classify. Hence, there is no “learning” from cases where
such an outcome variable is known. Association rules, dimension reduction
methods, and clustering techniques are all unsupervised learning methods.
THE STEPS IN DATA MINING 19
Supervised and unsupervised methods are sometimes used in conjunction.
For example, unsupervised clustering methods are used to separate loan appli-
cants into several risk-level groups. Then, supervised algorithms are applied
separately to each risk-level group for predicting propensity of loan default.
S U P E R V I S E D L E A R N I N G R E Q U I R E S G O O D S U P E R V I S I O N
In some cases, the value of the outcome variable (the ‘label’) is known because it
is an inherent component of the data. Web logs will show whether a person clicked
on a link or not. Bank records will show whether a loan was paid on time or not. In
other cases, the value of the known outcome must be supplied by a human labeling
process to accumulate enough data to train a model. E-mail must be labeled as
spam or legitimate, documents in legal discovery must be labeled as relevant or
irrelevant. In either case, the data mining algorithm can be led astray if the quality
of the supervision is poor.
Gene Weingarten reported in the January 5, 2014 Washington Post magazine
how the strange phrase “defiantly recommend” is making its way into English via
auto-correction. “Defiantly” is closer to the common misspelling definatly than
is definitely, so Google.com, in the early days, offered it as a correction when
users typed the misspelled word “definatly.” In the ideal supervised learning model,
humans guide the auto-correction process by rejecting defiantly and substituting
definitely. Google’s algorithm would then learn that this is the best first-choice
correction of “definatly.” The problem was that too many people were lazy, just
accepting the first correction that Google presented. All these acceptances then
cemented “defiantly” as the proper correction.
2.3 The Steps in Data Mining
This book focuses on understanding and using data mining algorithms (Steps
4 to 7 below). However, some of the most serious errors in analytics projects
result from a poor understanding of the problem—an understanding that must
be developed before we get into the details of algorithms to be used. Here is a
list of steps to be taken in a typical data mining effort:
1. Develop an understanding of the purpose of the data mining project. How will
the stakeholder use the results? Who will be affected by the results? Will
the analysis be a one-shot effort or an ongoing procedure?
2. Obtain the dataset to be used in the analysis. This often involves sampling
from a large database to capture records to be used in an analysis. How
well this sample reflects the records of interest affects the ability of the
data mining results to generalize to records outside of this sample. It
may also involve pulling together data from different databases or sources.
20 OVERVIEW OF THE DATA MINING PROCESS
The databases could be internal (e.g., past purchases made by customers)
or external (credit ratings). While data mining deals with very large
databases, usually the analysis to be done requires only thousands or tens
of thousands of records.
3. Explore, clean, and preprocess the data. This step involves verifying that the
data are in reasonable condition. How should missing data be handled?
Are the values in a reasonable range, given what you would expect for
each variable? Are there obvious outliers? The data are reviewed graph-
ically: for example, a matrix of scatterplots showing the relationship of
each variable with every other variable. We also need to ensure con-
sistency in the definitions of fields, units of measurement, time periods,
and so on. In this step, new variables are also typically created from exist-
ing ones. For example, “duration” can be computed from start and end
dates.
4. Reduce the data dimension, if necessary. Dimension reduction can involve
operations such as eliminating unneeded variables, transforming variables
(e.g., turning “money spent” into “spent > $100” vs. “spent ≤ $100”),
and creating new variables (e.g., a variable that records whether at least
one of several products was purchased). Make sure that you know what
each variable means and whether it is sensible to include it in the model.
5. Determine the data mining task. (classification, prediction, clustering, etc.).
This involves translating the general question or problem of Step 1 into
a more specific data mining question.
6. Partition the data (for supervised tasks). If the task is supervised (classification
or prediction), randomly partition the dataset into three parts: training,
validation, and test datasets.
7. Choose the data mining techniques to be used. (regression, neural nets, hier-
archical clustering, etc.).
8. Use algorithms to perform the task. This is typically an iterative process—
trying multiple variants, and often using multiple variants of the same
algorithm (choosing different variables or settings within the algorithm).
Where appropriate, feedback from the algorithm’s performance on vali-
dation data is used to refine the settings.
9. Interpret the results of the algorithms. This involves making a choice as to
the best algorithm to deploy, and where possible, testing the final choice
on the test data to get an idea as to how well it will perform. (Recall
that each algorithm may also be tested on the validation data for tuning
purposes; in this way, the validation data become a part of the fitting
process and are likely to underestimate the error in the deployment of
the model that is finally chosen.)
PRELIMINARY STEPS 21
10. Deploy the model. This step involves integrating the model into oper-
ational systems and running it on real records to produce decisions or
actions. For example, the model might be applied to a purchased list of
possible customers, and the action might be “include in the mailing if the
predicted amount of purchase is > $10.” A key step here is “scoring” the
new records, or using the chosen model to predict the outcome value
(“score”) for each new record.
The foregoing steps encompass the steps in SEMMA, a methodology developed
by the software company SAS:
Sample Take a sample from the dataset; partition into training, validation,
and test datasets.
Explore Examine the dataset statistically and graphically.
Modify Transform the variables and impute missing values.
Model Fit predictive models (e.g., regression tree, neural network).
Assess Compare models using a validation dataset.
IBM SPSS Modeler (previously SPSS-Clementine) has a similar method-
ology, termed CRISP-DM (CRoss-Industry Standard Process for Data Min-
ing). All these frameworks include the same main steps involved in predictive
modeling.
2.4 Preliminary Steps
Organization of Datasets
Datasets are nearly always constructed and displayed so that variables are in
columns and records are in rows. We will illustrate this with home values in West
Roxbury, Boston, in 2014. 14 variables are recorded for over 5000 homes. The
spreadsheet is organized so that each row represents a home—the first home’s
assessed value was $344,200, its tax was $4430, its size was 9965 ft2, it was built
in 1880, and so on. In supervised learning situations, one of these variables will
be the outcome variable, typically listed in the first or last column (in this case
it is TOTAL VALUE, in the first column).
Predicting Home Values in the West Roxbury Neighborhood
The Internet has revolutionized the real estate industry. Realtors now list houses
and their prices on the web, and estimates of house and condominium prices
have become widely available, even for units not on the market. At this time of
22 OVERVIEW OF THE DATA MINING PROCESS
writing, Zillow (www.zillow.com) is the most popular online real estate infor-
mation site in the United States1, and in 2014 they purchased their major rival,
Trulia. By 2015, Zillow had become the dominant platform for checking house
prices and, as such, the dominant online advertising venue for realtors. What
used to be a comfortable 6% commission structure for realtors, affording them
a handsome surplus (and an oversupply of realtors), was being rapidly eroded by
an increasing need to pay for advertising on Zillow. (This, in fact, is the key to
Zillow’s business model—redirecting the 6% commission away from realtors and
to itself.)
Zillow gets much of the data for its “Zestimates” of home values directly
from publicly available city housing data, used to estimate property values for
tax assessment. A competitor seeking to get into the market would likely take
the same approach. So might realtors seeking to develop an alternative to Zillow.
A simple approach would be a naive, model-less method—just use the
assessed values as determined by the city. Those values, however, do not nec-
essarily include all properties, and they might not include changes warranted by
remodeling, additions, etc. Moreover, the assessment methods used by cities
may not be transparent or always reflect true market values. However, the city
property data can be used as a starting point to build a model, to which additional
data (such as that collected by large realtors) can be added later.
Let’s look at how Boston property assessment data, available from the city
of Boston, might be used to predict home values. The data in WestRoxbury.csv
includes information on single family owner-occupied homes in West Roxbury,
a neighborhood in southwest Boston, MA, in 2014. The data include values
for various predictor variables, and for an outcome—assessed home value (“total
value”). This dataset has 14 variables, and a description of each variable is given
in Table 2.1 (the full data dictionary provided by the City of Boston is available
at http://goo.gl/QBRlYF; we have modified a few variable names). The dataset
includes 5802 homes. A sample of the data is shown in Table 2.2, and the “data
dictionary” describing each variable is in Table 2.1.
As we saw earlier, below the header row, each row in the data represents
a home. For example, the first home was assessed at a total value of $344.2
thousand (TOTAL VALUE). Its tax bill was $4330. It has a lot size of 9965
square feet (ft2), was built in the year 1880, has two floors, six rooms, and so on.
Loading and Looking at the Data in R
To load data into R, we will typically want to have the data available as a csv
(comma separated values) file. If the data are in an xls (or xlsx) file, we can save
1“Harney, K., Zestimates may not be as right as you’d like”, Washington Post, Feb. 7, 2015, p. T10.
http://www.zillow.com
http://goo.gl/QBRlYF
PRELIMINARY STEPS 23
TABLE 2.1 DESCRIPTION OF VARIABLES IN WEST ROXBURY (BOSTON) HOME VALUE DATASET
TOTAL VALUE Total assessed value for property, in thousands of USD
TAX Tax bill amount based on total assessed value multiplied by the tax rate, in USD
LOT SQ FT Total lot size of parcel in square feet
YR BUILT Year the property was built
GROSS AREA Gross floor area
LIVING AREA Total living area for residential properties (ft2)
FLOORS Number of floors
ROOMS Total number of rooms
BEDROOMS Total number of bedrooms
FULL BATH Total number of full baths
HALF BATH Total number of half baths
KITCHEN Total number of kitchens
FIREPLACE Total number of fireplaces
REMODEL When the house was remodeled (Recent/Old/None)
that same file in Excel as a csv file: go to File > Save as > Save as type: CSV
(Comma delimited) (*.csv) > Save.
Note: When dealing with .csv files in Excel, beware of two things:
• Opening a .csv file in Excel strips off leading 0’s, which corrupts zipcode
data.
• Saving a .csv file in Excel saves only the digits that are displayed; if you
need precision to a certain number of decimals, you need to ensure they
are displayed before saving.
Once we have R and RStudio installed on our machine and the West Rox-
bury.csv file saved as a csv file, we can run the code in Table 2.3 to load the data
into R.
TABLE 2.2 FIRST 10 RECORDS IN THE WEST ROXBURY HOME VALUES DATASET
TOTAL TAX LOT YR GROSS LIVING FLOORS ROOMS BED FULL HALF KIT FIRE REMODEL
VALUE SQ FT BUILT AREA AREA ROOMS BATH BATH CHEN PLACE
344.2 4330 9965 1880 2436 1352 2 6 3 1 1 1 0 None
412.6 5190 6590 1945 3108 1976 2 10 4 2 1 1 0 Recent
330.1 4152 7500 1890 2294 1371 2 8 4 1 1 1 0 None
498.6 6272 13,773 1957 5032 2608 1 9 5 1 1 1 1 None
331.5 4170 5000 1910 2370 1438 2 7 3 2 0 1 0 None
337.4 4244 5142 1950 2124 1060 1 6 3 1 0 1 1 Old
359.4 4521 5000 1954 3220 1916 2 7 3 1 1 1 0 None
320.4 4030 10,000 1950 2208 1200 1 6 3 1 0 1 0 None
333.5 4195 6835 1958 2582 1092 1 5 3 1 0 1 1 Recent
409.4 5150 5093 1900 4818 2992 2 8 4 2 0 1 0 None
24 OVERVIEW OF THE DATA MINING PROCESS
TABLE 2.3 WORKING WITH FILES IN R
To start, open RStudio, go to File > New File > R Script. It opens a new tab.
Then save your Untitled1.R file into the directory where your csv is saved. Give it the name WestRoxbury.R.
From the Menu Bar, go to Session > Set Working Directory > To Source File Location; This sets the working
directory as the place where both the R file and csv file are saved.
code for loading and creating subsets from the data
housing.df <- read.csv("WestRoxbury.csv", header = TRUE) # load data
dim(housing.df) # find the dimension of data frame
head(housing.df) # show the first six rows
View(housing.df) # show all the data in a new tab
# Practice showing different subsets of the data
housing.df[1:10, 1] # show the first 10 rows of the first column only
housing.df[1:10, ] # show the first 10 rows of each of the columns
housing.df[5, 1:10] # show the fifth row of the first 10 columns
housing.df[5, c(1:2, 4, 8:10)] # show the fifth row of some columns
housing.df[, 1] # show the whole first column
housing.df$TOTAL_VALUE # a different way to show the whole first column
housing.df$TOTAL_VALUE[1:10] # show the first 10 rows of the first column
length(housing.df$TOTAL_VALUE) # find the length of the first column
mean(housing.df$TOTAL_VALUE) # find the mean of the first column
summary(housing.df) # find summary statistics for each column
Data from a csv file is stored in R as a data frame (e.g., housing.df ). If our
csv file has column headers, these headers get automatically stored as the column
names of our data. A data frame is the fundamental object almost all analyses
begin with in R. A data frame has rows and columns. The rows are the obser-
vations for each case (e.g., house), and the columns are the variables of interest
(e.g., TOTAL VALUE, TAX). The code in Table 2.3 walks you through some
basic steps you will want to perform prior to doing any analysis: finding the size
and dimension of your data (number of rows and columns), viewing all the data,
displaying only selected rows and columns, and computing summary statistics
for variables of interest. Note that comments are preceded with the # symbol.
Sampling from a Database
Typically, we perform data mining on less than the complete database. Data
mining algorithms will have varying limitations on what they can handle in
terms of the numbers of records and variables, limitations that may be specific
to computing power and capacity as well as software limitations. Even within
those limits, many algorithms will execute faster with smaller samples.
Accurate models can often be built with as few as several thousand records.
Hence, we will want to sample a subset of records for model building. Table 2.4
provides code for sampling in R.
PRELIMINARY STEPS 25
TABLE 2.4 SAMPLING IN R
code for sampling and over/under-sampling
# random sample of 5 observations
s <- sample(row.names(housing.df), 5)
housing.df[s,]
# oversample houses with over 10 rooms
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS>10, 0.9, 0.01))
housing.df[s,]
Oversampling Rare Events in Classification Tasks
If the event we are interested in classifying is rare, for example, customers pur-
chasing a product in response to a mailing, or fraudulent credit card transactions,
sampling a random subset of records may yield so few events (e.g., purchases)
that we have little information on them. We would end up with lots of data
on nonpurchasers and non-fraudulent transactions but little on which to base a
model that distinguishes purchasers from nonpurchasers or fraudulent from non-
fraudulent. In such cases, we would want our sampling procedure to overweight
the rare class (purchasers or frauds) relative to the majority class (nonpurchasers,
non-frauds) so that our sample would end up with a healthy complement of
purchasers or frauds.
Assuring an adequate number of responder or “success” cases to train the
model is just part of the picture. A more important factor is the costs of mis-
classification. Whenever the response rate is extremely low, we are likely to
attach more importance to identifying a responder than to identifying a non-
responder. In direct-response advertising (whether by traditional mail, e-mail,
or web advertising), we may encounter only one or two responders for every
hundred records—the value of finding such a customer far outweighs the costs of
reaching him or her. In trying to identify fraudulent transactions, or customers
unlikely to repay debt, the costs of failing to find the fraud or the nonpaying
customer are likely to exceed the cost of more detailed review of a legitimate
transaction or customer.
If the costs of failing to locate responders are comparable to the costs of
misidentifying responders as non-responders, our models would usually achieve
highest overall accuracy if they identified everyone as a non-responder (or almost
everyone, if it is easy to identify a few responders without catching many non-
responders). In such a case, the misclassification rate is very low—equal to the
rate of responders—but the model is of no value.
26 OVERVIEW OF THE DATA MINING PROCESS
More generally, we want to train our model with the asymmetric costs in
mind so that the algorithm will catch the more valuable responders, probably
at the cost of “catching” and misclassifying more non-responders as responders
than would be the case if we assume equal costs. This subject is discussed in
detail in Chapter 5.
Preprocessing and Cleaning the Data
Types of Variables There are several ways of classifying variables. Vari-
ables can be numerical or text (character/string). They can be continuous (able
to assume any real numerical value, usually in a given range), integer (taking
only integer values), categorical (assuming one of a limited number of values),
or date. Categorical variables can be either coded as numerical (1, 2, 3) or text
(payments current, payments not current, bankrupt). Categorical variables can
be unordered (called nominal variables) with categories such as North America,
Europe, and Asia; or they can be ordered (called ordinal variables) with categories
such as high value, low value, and nil value.
Continuous variables can be handled by most data mining routines with the
exception of the naive Bayes classifier, which deals exclusively with categorical
predictor variables. The machine learning roots of data mining grew out of
problems with categorical outcomes; the roots of statistics lie in the analysis of
continuous variables. Sometimes, it is desirable to convert continuous variables
to categorical variables. This is done most typically in the case of outcome
variables, where the numerical variable is mapped to a decision (e.g., credit scores
above a certain threshold mean “grant credit,” a medical test result above a certain
threshold means “start treatment”).
For the West Roxbury data, Table 2.5 presents some R statements to review
the variables and determine what type (class) R thinks they are, and to determine
the number of levels in a factor variable.
Handling Categorical Variables Categorical variables can also be han-
dled by most data mining routines, but often require special handling. If the
categorical variable is ordered (age group, degree of creditworthiness, etc.), we
can sometimes code the categories numerically (1, 2, 3, …) and treat the vari-
able as if it were a continuous variable. The smaller the number of categories,
and the less they represent equal increments of value, the more problematic this
approach becomes, but it often works well enough.
Nominal categorical variables, however, often cannot be used as is. In many
cases, they must be decomposed into a series of binary variables, called dummy
PRELIMINARY STEPS 27
TABLE 2.5 REVIEWING VARIABLES IN R
code for reviewing variables
names(housing.df) # print a list of variables to the screen.
t(t(names(housing.df))) # print the list in a useful column format
colnames(housing.df)[1] <- c("TOTAL_VALUE") # change the first column's name
class(housing.df$REMODEL) # REMODEL is a factor variable
class(housing.df[ ,14]) # Same.
levels(housing.df[, 14]) # It can take one of three levels
class(housing.df$BEDROOMS) # BEDROOMS is an integer variable
class(housing.df[, 1]) # Total_Value is a numeric variable
Partial Output
> t(t(names(housing.df)))
[,1]
[1,] “TOTAL_VALUE”
[2,] “TAX”
[3,] “LOT.SQFT”
[4,] “YR.BUILT”
[5,] “GROSS.AREA”
[6,] “LIVING.AREA”
[7,] “FLOORS”
[8,] “ROOMS”
[9,] “BEDROOMS”
[10,] “FULL.BATH”
[11,] “HALF.BATH”
[12,] “KITCHEN”
[13,] “FIREPLACE”
[14,] “REMODEL”
> class(housing.df$REMODEL)
[1] “factor”
> levels(housing.df[, 14])
[1] “None” “Old” “Recent”
variables. For example, a single categorical variable that can have possible values
of “student,” “unemployed,” “employed,” or “retired” would be split into four
separate dummy variables:
Student—Yes/No
Unemployed—Yes/No
Employed—Yes/No
Retired—Yes/No
In many cases, only three of the dummy variables need to be used; if the val-
ues of three are known, the fourth is also known. For example, given that these
four values are the only possible ones, we can know that if a person is neither
28 OVERVIEW OF THE DATA MINING PROCESS
student, unemployed, nor employed, he or she must be retired. In some rou-
tines (e.g., linear regression and logistic regression), you should not use all four
variables—the redundant information will cause the algorithm to fail. Note,
also, that typical methods of creating dummy variables will leave the original
categorical variable intact; obviously you should not use both the original vari-
able and the dummies. The R code to create binary dummies from a categorical
(factor) variable is given in Table 2.6.
TABLE 2.6 CREATING DUMMY VARIABLES IN R
code for creating binary dummies (indicators)
# use model.matrix() to convert all categorical variables in the data frame into
# a set of dummy variables. We must then turn the resulting data matrix back into
# a data frame for further work.
xtotal <- model.matrix(~ 0 + BEDROOMS + REMODEL, data = housing.df)
xtotal$BEDROOMS[1:5] # will not work because xtotal is a matrix
xtotal <- as.data.frame(xtotal)
t(t(names(xtotal))) # check the names of the dummy variables
head(xtotal)
xtotal <- xtotal[, -4] # drop one of the dummy variables.
# In this case, drop REMODELRecent.
Partial Output
> t(t(names(xtotal))) # Check the names of the dummy variables.
[,1]
[1,] “BEDROOMS”
[2,] “REMODELNone”
[3,] “REMODELOld”
[4,] “REMODELRecent”
> head(xtotal)
BEDROOMS REMODELNone REMODELOld REMODELRecent
1 3 1 0 0
2 4 0 0 1
3 4 1 0 0
4 5 1 0 0
5 3 1 0 0
6 3 0 1 0
Variable Selection More is not necessarily better when it comes to select-
ing variables for a model. Other things being equal, parsimony, or compactness,
is a desirable feature in a model. For one thing, the more variables we include
and the more complex the model, the greater the number of records we will
need to assess relationships among the variables. Fifteen records may suffice to
give us a rough idea of the relationship between Y and a single predictor vari-
able X . If we now want information about the relationship between Y and 15
predictor variables X1, . . . , X15, 15 records will not be enough (each estimated
PRELIMINARY STEPS 29
relationship would have an average of only one record’s worth of information,
making the estimate very unreliable). In addition, models based on many vari-
ables are often less robust, as they require the collection of more variables in the
future, are subject to more data quality and availability issues, and require more
data cleaning and preprocessing.
How Many Variables and How Much Data? Statisticians give us proce-
dures to learn with some precision how many records we would need to achieve
a given degree of reliability with a given dataset and a given model. These
are called “power calculations” and are intended to assure that an average pop-
ulation effect will be estimated with sufficient precision from a sample. Data
miners’ needs are usually different, because the focus is not on identifying an
average effect but rather on predicting individual records. This purpose typically
requires larger samples than those used for statistical inference. A good rule of
thumb is to have 10 records for every predictor variable. Another rule, used by
Delmaster and Hancock (2001, p. 68) for classification procedures, is to have at
least 6×m× p records, where m is the number of outcome classes and p is the
number of variables.
In general, compactness or parsimony is a desirable feature in a data mining
model. Even when we start with a small number of variables, we often end
up with many more after creating new variables (such as converting a categor-
ical variable into a set of dummy variables). Data visualization and dimension
reduction methods help reduce the number of variables so that redundancies and
information overlap are reduced.
Even when we have an ample supply of data, there are good reasons to pay
close attention to the variables that are included in a model. Someone with
domain knowledge (i.e., knowledge of the business process and the data) should
be consulted, as knowledge of what the variables represent is typically critical
for building a good model and avoiding errors. For example, suppose we’re
trying to predict the total purchase amount spent by customers, and we have a
few predictor columns that are coded X1, X2, X3, . . ., where we don’t know
what those codes mean. We might find that X1 is an excellent predictor of the
total amount spent. However, if we discover that X1 is the amount spent on
shipping, calculated as a percentage of the purchase amount, then obviously a
model that uses shipping amount cannot be used to predict purchase amount,
because the shipping amount is not known until the transaction is completed.
Another example is if we are trying to predict loan default at the time a customer
applies for a loan. If our dataset includes only information on approved loan
applications, we will not have information about what distinguishes defaulters
from non-defaulters among denied applicants. A model based on approved loans
alone can therefore not be used to predict defaulting behavior at the time of loan
application, but rather only once a loan is approved.
30 OVERVIEW OF THE DATA MINING PROCESS
Outliers The more data we are dealing with, the greater the chance of
encountering erroneous values resulting from measurement error, data-entry
error, or the like. If the erroneous value is in the same range as the rest of
the data, it may be harmless. If it is well outside the range of the rest of the data
(a misplaced decimal, for example), it may have a substantial effect on some of
the data mining procedures we plan to use.
Values that lie far away from the bulk of the data are called outliers. The term
far away is deliberately left vague because what is or is not called an outlier is
an arbitrary decision. Analysts use rules of thumb such as “anything over three
standard deviations away from the mean is an outlier,” but no statistical rule can
tell us whether such an outlier is the result of an error. In this statistical sense,
an outlier is not necessarily an invalid data point, it is just a distant one.
The purpose of identifying outliers is usually to call attention to values that
need further review. We might come up with an explanation looking at the
data—in the case of a misplaced decimal, this is likely. We might have no expla-
nation, but know that the value is wrong—a temperature of 178◦F for a sick
person. Or, we might conclude that the value is within the realm of possibility
and leave it alone. All these are judgments best made by someone with domain
knowledge, knowledge of the particular application being considered: direct mail,
mortgage finance, and so on, as opposed to technical knowledge of statistical or
data mining procedures. Statistical procedures can do little beyond identifying
the record as something that needs review.
If manual review is feasible, some outliers may be identified and corrected.
In any case, if the number of records with outliers is very small, they might
be treated as missing data. How do we inspect for outliers? One technique is
to sort the records by the first column (e.g., using the R function order(), then
review the data for very large or very small values in that column. Then repeat
for each successive column. Another option is to examine the minimum and
maximum values of each column using R’s min() and max() functions. For a
more automated approach that considers each record as a unit, rather than each
column in isolation, clustering techniques (see Chapter 14) could be used to
identify clusters of one or a few records that are distant from others. Those
records could then be examined.
Missing Values Typically, some records will contain missing values. If the
number of records with missing values is small, those records might be omitted.
However, if we have a large number of variables, even a small proportion of
missing values can affect a lot of records. Even with only 30 variables, if only
5% of the values are missing (spread randomly and independently among cases
and variables), almost 80% of the records would have to be omitted from the
analysis. (The chance that a given record would escape having a missing value is
0.9530 = 0.215.)
PRELIMINARY STEPS 31
An alternative to omitting records with missing values is to replace the miss-
ing value with an imputed value, based on the other values for that variable across
all records. For example, if among 30 variables, household income is missing for
a particular record, we might substitute the mean household income across all
records. Doing so does not, of course, add any information about how house-
hold income affects the outcome variable. It merely allows us to proceed with
the analysis and not lose the information contained in this record for the other
29 variables. Note that using such a technique will understate the variability in
a dataset. However, we can assess variability and the performance of our data
mining technique using the validation data, and therefore this need not present
a major problem. One option is to replace missing values using fairly simple
substitutes (e.g., mean, median). More sophisticated procedures do exist—for
example, using linear regression, based on other variables, to fill in the missing
values. These methods have been elaborated mainly for analysis of medical and
scientific studies, where each patient or subject record comes at great expense.
In data mining, where data are typically plentiful, simpler methods usually suf-
fice. Table 2.7 shows some R code to illustrate the use of the median to replace
TABLE 2.7 MISSING DATA
code for imputing missing data with median
# To illustrate missing data procedures, we first convert a few entries for
# bedrooms to NA’s. Then we impute these missing values using the median of the
# remaining values.
rows.to.missing <- sample(row.names(housing.df), 10)
housing.df[rows.to.missing,]$BEDROOMS <- NA
summary(housing.df$BEDROOMS) # Now we have 10 NA's and the median of the
# remaining values is 3.
# replace the missing values using the median of the remaining values.
# use median() with na.rm = TRUE to ignore missing values when computing the median.
housing.df[rows.to.missing,]$BEDROOMS <- median(housing.df$BEDROOMS, na.rm = TRUE)
summary(housing.df$BEDROOMS)
Partial Output
> housing.df[rows.to.missing,]$BEDROOMS <- NA
> summary(housing.df$BEDROOMS)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
1.00 3.00 3.00 3.23 4.00 9.00 10
> housing.df[rows.to.missing,]$BEDROOMS <- median(housing.df$BEDROOMS, na.rm = TRUE)
> summary(housing.df$BEDROOMS)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.00 3.00 3.23 4.00 9.00
32 OVERVIEW OF THE DATA MINING PROCESS
missing values. Since the data are complete to begin with, the first step is an arti-
ficial one of creating some missing records for illustration purposes. The median
is used for imputation, rather than the mean, to preserve the integer nature of
the counts for bedrooms.
Some datasets contain variables that have a very large number of missing
values. In other words, a measurement is missing for a large number of records.
In that case, dropping records with missing values will lead to a large loss of data.
Imputing the missing values might also be useless, as the imputations are based on
a small number of existing records. An alternative is to examine the importance
of the predictor. If it is not very crucial, it can be dropped. If it is important,
perhaps a proxy variable with fewer missing values can be used instead. When
such a predictor is deemed central, the best solution is to invest in obtaining the
missing data.
Significant time may be required to deal with missing data, as not all situ-
ations are susceptible to automated solutions. In a messy dataset, for example,
a “0” might mean two things: (1) the value is missing, or (2) the value is actu-
ally zero. In the credit industry, a “0” in the “past due” variable might mean a
customer who is fully paid up, or a customer with no credit history at all—two
very different situations. Human judgment may be required for individual cases
or to determine a special rule to deal with the situation.
Normalizing (Standardizing) and Rescaling Data Some algorithms
require that the data be normalized before the algorithm can be implemented
effectively. To normalize a variable, we subtract the mean from each value
and then divide by the standard deviation. This operation is also sometimes
called standardizing. In R, function scale() performs this operation. In effect, we
are expressing each value as the “number of standard deviations away from the
mean,” also called a z-score.
Normalizing is one way to bring all variables to the same scale. Another
popular approach is rescaling each variable to a [0,1] scale. This is done by
subtracting the minimum value and then dividing by the range. Subtracting the
minimum shifts the variable origin to zero. Dividing by the range shrinks or
expands the data to the range [0,1]. In R, rescaling can be done using function
rescale() in the scales package.
To consider why normalizing or scaling to [0,1] might be necessary, consider
the case of clustering. Clustering typically involves calculating a distance measure
that reflects how far each record is from a cluster center or from other records.
With multiple variables, different units will be used: days, dollars, counts, and
so on. If the dollars are in the thousands and everything else is in the tens,
the dollar variable will come to dominate the distance measure. Moreover,
changing units from, say, days to hours or months, could alter the outcome
completely.
PREDICTIVE POWER AND OVERFITTING 33
Data mining software typically have an option to normalize the data in those
algorithms where it may be required. It is an option rather than an automatic
feature of such algorithms, because there are situations where we want each
variable to contribute to the distance measure in proportion to its original scale.
2.5 Predictive Power and Overfitting
In supervised learning, a key question presents itself: How well will our pre-
diction or classification model perform when we apply it to new data? We are
particularly interested in comparing the performance of various models so that
we can choose the one we think will do the best when it is implemented in prac-
tice. A key concept is to make sure that our chosen model generalizes beyond
the dataset that we have at hand. To assure generalization, we use the concept
of data partitioning and try to avoid overfitting. These two important concepts are
described next.
Overfitting
The more variables we include in a model, the greater the risk of overfitting the
particular data used for modeling. What is overfitting?
In Table 2.8, we show hypothetical data about advertising expenditures in
one time period and sales in a subsequent time period. A scatter plot of the data
is shown in Figure 2.2. We could connect up these points with a smooth but
complicated function, one that interpolates all these data points perfectly and
leaves no error (residuals). This can be seen in Figure 2.3. However, we can see
that such a curve is unlikely to be accurate, or even useful, in predicting future
sales on the basis of advertising expenditures. For instance, it is hard to believe
that increasing expenditures from $400 to $500 will actually decrease revenue.
A basic purpose of building a model is to represent relationships among vari-
ables in such a way that this representation will do a good job of predicting future
outcome values on the basis of future predictor values. Of course, we want the
model to do a good job of describing the data we have, but we are more inter-
ested in its performance with future data.
TABLE 2.8
Advertising Sales
239 514
364 789
602 550
644 1386
770 1394
789 1440
911 1354
34 OVERVIEW OF THE DATA MINING PROCESS
FIGURE 2.2 SCATTER PLOT FOR ADVERTISING AND SALES DATA
FIGURE 2.3 OVERFITTING: THIS FUNCTION FITS THE DATA WITH NO ERROR
In the hypothetical advertising example, a simple straight line might do a
better job than the complex function in terms of predicting future sales on the
basis of advertising. Instead, we devised a complex function that fit the data per-
fectly, and in doing so, we overreached. We ended up modeling some variation
in the data that is nothing more than chance variation. We mistreated the noise
in the data as if it were a signal.
Similarly, we can add predictors to a model to sharpen its performance with
the data at hand. Consider a database of 100 individuals, half of whom have
contributed to a charitable cause. Information about income, family size, and zip
code might do a fair job of predicting whether or not someone is a contributor.
If we keep adding additional predictors, we can improve the performance of the
model with the data at hand and reduce the misclassification error to a negligible
level. However, this low error rate is misleading, because it probably includes
spurious effects, which are specific to the 100 individuals, but not beyond that
sample.
For example, one of the variables might be height. We have no basis in
theory to suppose that tall people might contribute more or less to charity, but if
there are several tall people in our sample and they just happened to contribute
PREDICTIVE POWER AND OVERFITTING 35
heavily to charity, our model might include a term for height—the taller you
are, the more you will contribute. Of course, when the model is applied to
additional data, it is likely that this will not turn out to be a good predictor.
If the dataset is not much larger than the number of predictor variables, it
is very likely that a spurious relationship like this will creep into the model.
Continuing with our charity example, with a small sample just a few of whom
are tall, whatever the contribution level of tall people may be, the algorithm is
tempted to attribute it to their being tall. If the dataset is very large relative to the
number of predictors, this is less likely to occur. In such a case, each predictor
must help predict the outcome for a large number of cases, so the job it does is
much less dependent on just a few cases, which might be flukes.
Somewhat surprisingly, even if we know for a fact that a higher-degree curve
is the appropriate model, if the model-fitting dataset is not large enough, a lower-
degree function (that is not as likely to fit the noise) is likely to perform better in
terms of predicting new values. Overfitting can also result from the application
of many different models, from which the best performing model is selected.
Creation and Use of Data Partitions
At first glance, we might think it best to choose the model that did the best
job of classifying or predicting the outcome variable of interest with the data
at hand. However, when we use the same data both to develop the model and
to assess its performance, we introduce an “optimism” bias. This is because
when we choose the model that works best with the data, this model’s superior
performance comes from two sources:
• A superior model
• Chance aspects of the data that happen to match the chosen model better
than they match other models
The latter is a particularly serious problem with techniques (such as trees and
neural nets) that do not impose linear or other structure on the data, and thus
end up overfitting it.
To address the overfitting problem, we simply divide (partition) our data and
develop our model using only one of the partitions. After we have a model, we
try it out on another partition and see how it performs, which we can measure
in several ways. In a classification model, we can count the proportion of held-
back records that were misclassified. In a prediction model, we can measure the
residuals (prediction errors) between the predicted values and the actual values.
This evaluation approach in effect mimics the deployment scenario, where our
model is applied to data that it hasn’t “seen.”
We typically deal with two or three partitions: a training set, a validation
set, and sometimes an additional test set. Partitioning the data into training,
36 OVERVIEW OF THE DATA MINING PROCESS
validation, and test sets is done either randomly according to predetermined
proportions or by specifying which records go into which partition according to
some relevant variable (e.g., in time-series forecasting, the data are partitioned
according to their chronological order). In most cases, the partitioning should be
done randomly to minimize the chance of getting a biased partition. Note the
varying nomenclature—the training partition is nearly always called “training”
but the names for the other partitions can vary and overlap.
Training Partition The training partition, typically the largest partition,
contains the data used to build the various models we are examining. The same
training partition is generally used to develop multiple models.
Validation Partition The validation partition (sometimes called the test
partition) is used to assess the predictive performance of each model so that you
can compare models and choose the best one. In some algorithms (e.g., classi-
fication and regression trees, k-nearest neighbors), the validation partition may
be used in an automated fashion to tune and improve the model.
Test Partition The test partition (sometimes called the holdout or evalua-
tion partition) is used to assess the performance of the chosen model with new
data.
Why have both a validation and a test partition? When we use the validation
data to assess multiple models and then choose the model that performs best with
the validation data, we again encounter another (lesser) facet of the overfitting
problem—chance aspects of the validation data that happen to match the cho-
sen model better than they match other models. In other words, by using the
validation data to choose one of several models, the performance of the chosen
model on the validation data will be overly optimistic.
The random features of the validation data that enhance the apparent perfor-
mance of the chosen model will probably not be present in new data to which
the model is applied. Therefore, we may have overestimated the accuracy of our
model. The more models we test, the more likely it is that one of them will be
particularly effective in modeling the noise in the validation data. Applying the
model to the test data, which it has not seen before, will provide an unbiased
estimate of how well the model will perform with new data. Figure 2.4 shows
the three data partitions and their use in the data mining process. When we are
concerned mainly with finding the best model and less with exactly how well it
will do, we might use only training and validation partitions. Table 2.9 shows R
code to partition the West Roxbury data into two sets (training and validation)
or into three sets (training, validation, and test). This is done by first drawing
PREDICTIVE POWER AND OVERFITTING 37
FIGURE 2.4 THREE DATA PARTITIONS AND THEIR ROLE IN THE DATA MINING PROCESS
a random sample of records into the training set, then assigning the remaining
records as validation. In the case of three partitions, the validation records are
chosen randomly from the data after excluding the records already sampled into
the training set.
Note that with some algorithms, such as nearest-neighbor algorithms,
records in the validation and test partitions, and in new data, are compared to
records in the training data to find the nearest neighbor(s). As k-nearest neigh-
bors is discussed in this book, the use of two partitions is an essential part of
the classification or prediction process, not merely a way to improve or assess it.
Nonetheless, we can still interpret the error in the validation data in the same
way that we would interpret error from any other model.
Cross-Validation When the number of records in our sample is small,
data partitioning might not be advisable as each partition will contain too few
records for model building and performance evaluation. An alternative to data
partitioning is cross-validation, which is especially useful with small samples.
Cross-validation is a procedure that starts with partitioning the data into “folds,”
or non-overlapping subsamples. Often we choose k = 5 folds, meaning that
the data are randomly partitioned into 5 equal parts, where each fold has 20%
of the observations. A model is then fit k times. Each time, one of the folds
is used as the validation set and the remaining k − 1 folds serve as the training
set. The result is that each fold is used once as the validation set, thereby pro-
ducing predictions for every observation in the dataset. We can then combine
38 OVERVIEW OF THE DATA MINING PROCESS
TABLE 2.9 DATA PARTITIONING IN R
code for partitioning the West Roxbury data into training, validation (and test) sets
# use set.seed() to get the same partitions when re-running the R code.
set.seed(1)
## partitioning into training (60%) and validation (40%)
# randomly sample 60% of the row IDs for training; the remaining 40% serve as
# validation
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.6)
# collect all the columns with training row ID into training set:
train.data <- housing.df[train.rows, ]
# assign row IDs that are not already in the training set, into validation
valid.rows <- setdiff(rownames(housing.df), train.rows)
valid.data <- housing.df[valid.rows, ]
# alternative code for validation (works only when row names are numeric):
# collect all the columns without training row ID into validation set
valid.data <- housing.df[-train.rows, ] # does not work in this case
## partitioning into training (50%), validation (30%), test (20%)
# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)
# sample 30% of the row IDs into the validation set, drawing only from records
# not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
dim(housing.df)[1]*0.3)
# assign the remaining 20% row IDs serve as test
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))
# create the 3 data frames by collecting all columns from the appropriate rows
train.data <- housing.df[train.rows, ]
valid.data <- housing.df[valid.rows, ]
test.data <- housing.df[test.rows, ]
the model’s predictions on each of the k validation sets in order to evaluate the
overall performance of the model. Sometimes cross-validation is built into a data
mining algorithm, with the results of the cross-validation used for choosing the
algorithm’s parameters (see, e.g., Chapter 9).
2.6 Building a Predictive Model
Let us go through the steps typical to many data mining tasks using a familiar
procedure: multiple linear regression. This will help you understand the overall
process before we begin tackling new algorithms.
BUILDING A PREDICTIVE MODEL 39
Modeling Process
We now describe in detail the various model stages using the West Roxbury
home values example.
1. Determine the purpose. Let’s assume that the purpose of our data mining
project is to predict the value of homes in West Roxbury for new records.
2. Obtain the data. We will use the 2014 West Roxbury housing data. The
dataset in question is small enough that we do not need to sample from
it—we can use it in its entirety.
3. Explore, clean, and preprocess the data. Let’s look first at the description
of the variables, also known as the “data dictionary,” to be sure that we
understand them all. These descriptions are available on the “descrip-
tion” worksheet in the Excel file and in Table 2.2. The variable names
and descriptions in this dataset all seem fairly straightforward, but this is
not always the case. Often, variable names are cryptic and their descrip-
tions may be unclear or missing.
It is useful to pause and think about what the variables mean and
whether they should be included in the model. Consider the variable
TAX. At first glance, we consider that the tax on a home is usually a
function of its assessed value, so there is some circularity in the model—
we want to predict a home’s value using TAX as a predictor, yet TAX
itself is determined by a home’s value. TAX might be a very good pre-
dictor of home value in a numerical sense, but would it be useful if we
wanted to apply our model to homes whose assessed value might not be
known? For this reason, we will exclude TAX from the analysis.
It is also useful to check for outliers that might be errors. For exam-
ple, suppose that the column FLOORS (number of floors) looked like
the one in Table 2.10, after sorting the data in descending order based
on floors. We can tell right away that the 15 is in error—it is unlikely
that a home has 15 floors. Since all other values are between 1 and 2 the
decimal was probably misplaced and the value should be 1.5.
Lastly, we create dummy variables for categorical variables. Here we
have one categorical variable: REMODEL, which has three categories.
TABLE 2.10 OUTLIER IN WEST ROXBURY
DATA
FLOORS ROOMS
15 8
2 10
1.5 6
1 6
40 OVERVIEW OF THE DATA MINING PROCESS
4. Reduce the data dimension. The West Roxbury dataset has been prepared
for presentation with fairly low dimension—it has only 13 variables, and
the single categorical variable considered has only three categories (and
hence adds two dummy variables when used in a linear regression model).
If we had many more variables, at this stage we might want to apply a vari-
able reduction technique, such as condensing multiple categories into a
smaller number, or applying principal components analysis to consolidate
multiple similar numerical variables (e.g., LIVING AREA, ROOMS,
BEDROOMS, BATH, HALF BATH) into a smaller number of vari-
ables.
5. Determine the data mining task. The specific task is to predict the value
of TOTAL VALUE using the predictor variables. This is a supervised
prediction task. For simplicity, we excluded several additional vari-
ables present in the original dataset, which have many categories (BLDG
TYPE, ROOF TYPE, and EXT FIN). We therefore use all the numer-
ical variables (except TAX) and the dummies created for the remaining
categorical variables.
6. Partition the data (for supervised tasks). In this case we divide the data into
two partitions: training and validation (see Table 2.9). The training parti-
tion is used to build the model, and the validation partition is used to see
how well the model does when applied to new data. We need to specify
the percent of the data used in each partition. Note: Although not used
in our example, a test partition might also be used.
7. Choose the technique. In this case, it is multiple linear regression. Having
divided the data into training and validation partitions, we can build a
multiple linear regression model with the training data. We want to pre-
dict the value of a house in West Roxbury on the basis of all the other
predictors (except TAX).
8. Use the algorithm to perform the task. In R, we use the lm() function to
predict house value with the training data, then use the same model to
predict values for the validation data. Chapter 6 on linear regression goes
into more detail. Table 2.11 shows the predicted values for the first few
records in the training data along with the actual values and the residuals
(prediction errors). Note that the predicted values are often called the
fitted values, since they are for the records to which the model was fit. The
results for the validation data are shown in Table 2.12. The prediction
errors for the training and validation data are compared in Table 2.13.
Prediction error can be aggregated in several ways. Five common mea-
sures are shown in Table 2.13. The first is mean error (ME), simply the
average of the residuals (errors). In both cases, it is quite small relative
to the units of TOTAL VALUE, indicating that, on balance, predictions
BUILDING A PREDICTIVE MODEL 41
TABLE 2.11 PREDICTIONS (FITTED VALUES) FOR A SAMPLE OF TRAINING DATA
code for fitting a regression model to training data (West Roxbury)
reg <- lm(TOTAL_VALUE ~ ., data = housing.df, subset = train.rows)
tr.res <- data.frame(train.data$TOTAL_VALUE, reg$fitted.values, reg$residuals)
head(tr.res)
Partial Output
> head(tr.res)
train.data.TOTAL_VALUE reg.fitted.values reg.residuals
3651 371.6 371.5818 0.018235205
359 299.4 299.4014 -0.001431463
1195 294.5 294.4762 0.023835688
1024 249.4 249.4029 -0.002874472
3984 505.5 505.5246 -0.024612237
2227 410.5 410.5323 -0.032339156
average about right—our predictions are “unbiased.” Of course, this sim-
ply means that the positive and negative errors balance out. It tells us
nothing about how large these errors are.
TheRMS error (RMSE) (root-mean-squared error) is more informative
of the error magnitude: it takes the square root of the average squared
error, so it gives an idea of the typical error (whether positive or negative)
in the same scale as that used for the original outcome variable. As we
might expect, the RMS error for the validation data (161.5 thousand
TABLE 2.12 PREDICTIONS FOR A SAMPLE OF VALIDATION DATA
code for applying the regression model to predict validation set (West Roxbury)
pred <- predict(reg, newdata = valid.data)
vl.res <- data.frame(valid.data$TOTAL_VALUE, pred, residuals =
valid.data$TOTAL_VALUE - pred)
head(vl.res)
Partial Output
> head(vl.res)
valid.data.TOTAL_VALUE pred residuals
1 344.2 344.2388 -0.038842075
14 575.0 575.0025 -0.002509638
16 298.2 298.2107 -0.010697442
17 313.1 313.0764 0.023567596
18 344.9 344.8719 0.028111825
19 330.7 330.7222 -0.022234883
42 OVERVIEW OF THE DATA MINING PROCESS
TABLE 2.13 PREDICTION ERROR METRICS FOR TRAINING AND VALIDATION DATA (ERROR
FIGURES ARE IN THOUSANDS OF $)
code for computing model evaluation metrics
library(forecast)
# compute accuracy on training set
accuracy(reg$fitted.values, train.data$TOTAL_VALUE)
# compute accuracy on prediction set
pred <- predict(reg, newdata = valid.data)
accuracy(pred, valid.data$TOTAL_VALUE)
Partial Output
> accuracy(reg$fitted.values, train.data$TOTAL_VALUE)
ME RMSE MAE MPE MAPE
Test set 1.388101e-16 0.02268016 0.01956465 5.193036e-06 0.00528389
> accuracy(pred, valid.data$TOTAL_VALUE)
ME RMSE MAE MPE MAPE
Test set 90.86934 161.5043 118.6455 15.14207 24.45668
dollars), which the model is seeing for the first time in making these
predictions, is larger than for the training data (≈ 0 thousand dollars),
which were used in training the model. The other measures are discussed
in Chapter 5.
9. Interpret the results. At this stage, we would typically try other prediction
algorithms (e.g., regression trees) and see how they do error-wise. We
might also try different “settings” on the various models (e.g., we could
use the best subsets option in multiple linear regression to choose a reduced
set of variables that might perform better with the validation data). After
choosing the best model—typically, the model with the lowest error on
the validation data while also recognizing that “simpler is better”—we
use that model to predict the output variable in fresh data. These steps
are covered in more detail in the analysis of cases.
10. Deploy the model. After the best model is chosen, it is applied to new data
to predict TOTAL VALUE for homes where this value is unknown. This
was, of course, the original purpose. Predicting the output value for new
records is called scoring. For predictive tasks, scoring produces predicted
numerical values. For classification tasks, scoring produces classes and/or
propensities. Table 2.14 shows an example of a data frame with three
homes to be scored using our linear regression model. Note that all
the required predictor columns are present, and the output column is
absent.
USING R FOR DATA MINING ON A LOCAL MACHINE 43
TABLE 2.14 DATA FRAME WITH THREE RECORDS TO BE SCORED
# new.data can be read from a csv file, or defined directly in R.
> new.data
TAX LOT.SQFT YR.BUILT GROSS.AREA LIVING.AREA FLOORS ROOMS BEDROOMS
100 3818 4200 1960 2670 1710 2.0 10 4
101 3791 6444 1940 2886 1474 1.5 6 3
102 4275 5035 1925 3264 1523 1.0 6 2
FULL.BATH HALF.BATH KITCHEN FIREPLACE REMODEL
100 1 1 1 1 None
101 1 1 1 1 None
102 1 0 1 0 Recent
> pred <- predict(reg, newdata = new.data)
> pred
100 101 102
303.5358 301.3919 339.8642
2.7 Using R for Data Mining on a Local
Machine
An important aspect of the data mining process is that the heavy-duty analysis
does not necessarily require a huge number of records. The dataset to be analyzed
may have millions of records, of course, but in applying multiple linear regression
or applying a classification tree, the use of a sample of 20,000 is likely to yield as
accurate an answer as that obtained when using the entire dataset. The principle
involved is the same as the principle behind polling: If sampled judiciously, 2000
voters can give an estimate of the entire population’s opinion within one or
two percentage points. (See “How Many Variables and How Much Data” in
Section 2.4 for further discussion.)
Therefore, in most cases, the number of records required in each partition
(training, validation, and test) can be accommodated within the memory limit
allowed in R (to check and increase memory limit in R use function mem-
ory.limit()).
When we apply Big Data analytics in R, it might be useful to remove unused
objects (function rm()) and call the garbage collection (function gc()) afterwards.
2.8 Automating Data Mining Solutions
In most supervised data mining applications, the goal is not a static, one-time
analysis of a particular dataset. Rather, we want to develop a model that can be
used on an ongoing basis to predict or classify new records. Our initial analysis
will be in prototype mode, while we explore and define the problem and test
different models. We will follow all the steps outlined earlier in this chapter.
44 OVERVIEW OF THE DATA MINING PROCESS
At the end of that process, we will typically want our chosen model to be
deployed in automated fashion. For example, the US Internal Revenue Service
(IRS) receives several hundred million tax returns per year—it does not want to
have to pull each tax return out into an Excel sheet or other environment separate
from its main database to determine the predicted probability that the return is
fraudulent. Rather, it would prefer that determination to be made as part of
the normal tax filing environment and process. Music streaming services, such
as Pandora or Spotify, need to determine “recommendations” for next songs
quickly for each of millions of users; there is no time to extract the data for
manual analysis.
In practice, this is done by building the chosen algorithm into the com-
putational setting in which the rest of the process lies. A tax return is entered
directly into the IRS system by a tax preparer, a predictive algorithm is imme-
diately applied to the new data in the IRS system, and a predicted classification
is decided by the algorithm. Business rules would then determine what happens
with that classification. In the IRS case, the rule might be “if no predicted fraud,
continue routine processing; if fraud is predicted, alert an examiner for possible
audit.”
This flow of the tax return from data entry, into the IRS system, through
a predictive algorithm, then back out to a human user is an example of a
“data pipeline.” The different components of the system communicate with
one another via Application Programming Interfaces (APIs) that establish locally
valid rules for transmitting data and associated communications. An API for
a data mining algorithm would establish the required elements for a predictive
algorithm to work—the exact predictor variables, their order, data formats, etc.
It would also establish the requirements for communicating the results of the
algorithm. Algorithms to be used in an automated data pipeline will need to be
compliant with the rules of the APIs where they operate.
Finally, once the computational environment is set and functioning, the data
miner’s work is not done. The environment in which a model operates is typ-
ically dynamic, and predictive models often have a short shelf life—one leading
consultant finds they rarely continue to function effectively for more than a year.
So, even in a fully deployed state, models must be periodically checked and re-
evaluated. Once performance flags, it is time to return to prototype mode and
see if a new model can be developed.
In this book, our focus will be on the prototyping phase—all the steps that
go into properly defining the model and developing and selecting a model. You
should be aware, though, that most of the actual work of implementing a data
mining model lies in the automated deployment phase. Much of this work is not
in the analytic domain; rather, it lies in the domains of databases and computer
science, to assure that detailed nuts and bolts of an automated dataflow all work
properly.
AUTOMATING DATA MINING SOLUTIONS 45
D A T A M I N I N G S O F T W A R E : T H E S T A T E O F T H E M A R K E T
by Herb Edelstein∗
The data mining market has changed in some important ways since the last
edition of this book. The most significant trends have been the increasing volume
of information available and the growing use of the cloud for data storage and
analytics. Data mining and analysis have evolved to meet these new demands.
The term “Big Data” reflects the surge in the amount and types of data col-
lected. There is still an enormous amount of transactional data, data warehous-
ing data, scientific data, and clickstream data. However, adding to the massive
storage requirements of traditional data is the influx of information from unstruc-
tured sources (e.g., customer service calls and images), social media, and more
recently the Internet of Things, which produces a flood of sensor data. The num-
ber of organizations collecting such data has greatly increased as virtually every
business is expanding in these areas.
Rapid technological change has been an important factor. The price of data
storage has dropped precipitously. At this writing, hard disk storage has fallen
to about $50 per terabyte and solid state drives are about $200 per terabyte.
Concomitant with the decrease in storage costs has been a dramatic increase
in bandwidth at ever lower costs. This has enabled the spread of cloud-based
computing. Cloud-based computing refers to using remote platforms for storage,
data management, and now analysis. Because scaling up the hardware and soft-
ware infrastructure for Big Data is so complex, many organizations are entrusting
their data to outside vendors. The largest cloud players (Amazon, Google, IBM,
and Microsoft) are each reporting annual revenues in excess of US$5 billion,
according to Forbes magazine.
Managing this much data is a challenging task. While traditional relational
DBMSs—such as Oracle, Microsofts SQL Server, IBMs DB2, and SAPs Adaptive
Server Enterprise (formerly Sybase)—are still among the leading data manage-
ment tools, open source DBMSs, such as Oracles MySQL are becoming increasingly
popular. In addition, nonrelational data storage is making headway in storing
extremely large amounts of data. For example, Hadoop, an open source tool
for managing large distributed database architectures, has become an important
player in the cloud database space. However, Hadoop is a tool for the appli-
cation developer community rather than end users. Consequently, many of the
data mining analytics vendors such as SAS have built interfaces to Hadoop.
All the major database management system vendors offer data mining capa-
bilities, usually integrated into their DBMS. Leading products include Microsoft
SQL Server Analysis Services, Oracle Data Mining, and Teradata Warehouse Miner.
The target user for embedded data mining is a database professional. Not surpris-
ingly, these products take advantage of database functionality, including using
the DBMS to transform variables, storing models in the database, and extending
the data access language to include model-building and scoring the database.
A few products also supply a separate graphical interface for building data min-
ing models. Where the DBMS has parallel processing capabilities, embedded data
mining tools will generally take advantage of it, resulting in greater performance.
As with the data mining suites described below, these tools offer an assortment
46 OVERVIEW OF THE DATA MINING PROCESS
of algorithms. Not only does IBM have embedded analytics in DB2, but follow-
ing its acquisition of SPSS, IBM has incorporated Clementine and SPSS into IBM
Modeler.
There are still a large number of stand-alone data mining tools based on
a single algorithm or on a collection of algorithms called a suite. Target users
include both statisticians and business intelligence analysts. The leading suites
include SAS Enterprise Miner, SAS JMP, IBM Modeler, Salford Systems SPM, Statis-
tica, XLMiner, and RapidMiner. Suites are characterized by providing a wide range
of functionality, frequently accessed via a graphical user interface designed to
enhance model-building productivity. A popular approach for many of these GUIs
is to provide a workflow interface in which the data mining steps and analysis
are linked together.
Many suites have outstanding visualization tools and links to statistical
packages that extend the range of tasks they can perform. They provide interac-
tive data transformation tools as well as a procedural scripting language for more
complex data transformations. The suite vendors are working to link their tools
more closely to underlying DBMSs; for example, data transformations might be
handled by the DBMS. Data mining models can be exported to be incorporated
into the DBMS through generating SQL, procedural language code (e.g., C++ or
Java), or a standardized data mining model language called Predictive Model
Markup Language (PMML).
In contrast to the general-purpose suites, application-specific tools are
intended for particular analytic applications such as credit scoring, customer
retention, and product marketing. Their focus may be further sharpened to
address the needs of specialized markets such as mortgage lending or finan-
cial services. The target user is an analyst with expertise in the application
domain. Therefore the interfaces, the algorithms, and even the terminology are
customized for that particular industry, application, or customer. While less flex-
ible than general-purpose tools, they offer the advantage of already incorporating
domain knowledge into the product design, and can provide very good solutions
with less effort. Data mining companies including SAS, IBM, and RapidMiner
offer vertical market tools, as do industry specialists such as Fair Isaac. Other
companies, such as Domo, are focusing on creating dashboards with analytics
and visualizations for business intelligence.
Another technological shift has occurred with the spread of open source
model building tools and open core tools. A somewhat simplified view of open
source software is that the source code for the tool is available at no charge
to the community of users and can be modified or enhanced by them. These
enhancements are submitted to the originator or copyright holder, who can add
them to the base package. Open core is a more recent approach in which a core
set of functionality remains open and free, but there are proprietary extensions
that are not free.
The most important open source statistical analysis software is R. R is
descended from a Bell Labs program called S, which was commercialized as S+.
Many data mining algorithms have been added to R, along with a plethora of
statistics, data management tools, and visualization tools. Because it is essen-
tially a programming language, R has enormous flexibility but a steeper learning
AUTOMATING DATA MINING SOLUTIONS 47
curve than many of the GUI-based tools. Although there are some GUIs for R,
the overwhelming majority of use is through programming.
Some vendors, as well as the open source community, are adding statisti-
cal and data mining tools to Python, a popular programming language that is
generally easier to use than C++ or Java, and faster than R.
As mentioned above, the cloud-computing vendors have moved into the data
mining/predictive analytics business by offering AaaS (Analytics as a Service) and
pricing their products on a transaction basis. These products are oriented more
toward application developers than business intelligence analysts. A big part
of the attraction of mining data in the cloud is the ability to store and man-
age enormous amounts of data without requiring the expense and complexity of
building an in-house capability. This can also enable a more rapid implementa-
tion of large distributed multi-user applications. Cloud based data can be used
with non-cloud-based analytics if the vendors analytics do not meet the users
needs.
Amazon has added Amazon Machine Learning to its Amazon Web Services
(AWS), taking advantage of predictive modeling tools developed for Amazons
internal use. AWS supports both relational databases and Hadoop data manage-
ment. Models cannot be exported, because they are intended to be applied to
data stored on the Amazon cloud.
Google is very active in cloud analytics with its BigQuery and Prediction API.
BigQuery allows the use of Google infrastructure to access large amounts of data
using a SQL-like interface. The Prediction API can be accessed from a variety
of languages including R and Python. It uses a variety of machine learning
algorithms and automatically selects the best results. Unfortunately, this is not
a transparent process. Furthermore, as with Amazon, models cannot be exported.
Microsoft is an active player in cloud analytics with its Azure Machine Learn-
ing Studio and Stream Analytics. Azure works with Hadoop clusters as well as
with traditional relational databases. Azure ML offers a broad range of algorithms
such as boosted trees and support vector machines as well as supporting R scripts
and Python. Azure ML also supports a workflow interface making it more suit-
able for the nonprogrammer data scientist. The real-time analytics component
is designed to allow streaming data from a variety of sources to be analyzed
on the fly. XLMiner’s cloud version is based on Microsoft Azure. Microsoft also
acquired Revolution Analytics, a major player in the R analytics business, with a
view to integrating Revolution’s “R Enterprise” with SQL Server and Azure ML. R
Enterprise includes extensions to R that eliminate memory limitations and take
advantage of parallel processing.
One drawback of the cloud-based analytics tools is a relative lack of trans-
parency and user control over the algorithms and their parameters. In some
cases, the service will simply select a single model that is a black box to the
user. Another drawback is that for the most part cloud-based tools are aimed at
more sophisticated data scientists who are systems savvy.
Data science is playing a central role in enabling many organizations to
optimize everything from production to marketing. New storage options and
analytical tools promise even greater capabilities. The key is to select technology
that’s appropriate for an organization’s unique goals and constraints. As always,
human judgment is the most important component of a data mining solution.
48 OVERVIEW OF THE DATA MINING PROCESS
This book’s focus is on a comprehensive understanding of the different tech-
niques and algorithms used in data mining, and less on the data management
requirements of real-time deployment of data mining models. R makes it ideal
for this purpose, and for exploration, prototyping, and piloting of solutions.
***
∗Herb Edelstein is president of Two Crows Consulting (www.twocrows.com), a leading data mining
consulting firm near Washington, DC. He is an internationally recognized expert in data mining and
data warehousing, a widely published author on these topics, and a popular speaker.
Copyright © 2015 Herb Edelstein.
PROBLEMS 49
PROBLEMS
2.1 Assuming that data mining techniques are to be used in the following cases, identify
whether the task required is supervised or unsupervised learning.
a. Deciding whether to issue a loan to an applicant based on demographic and financial
data (with reference to a database of similar data on prior customers).
b. In an online bookstore, making recommendations to customers concerning addi-
tional items to buy based on the buying patterns in prior transactions.
c. Identifying a network data packet as dangerous (virus, hacker attack) based on com-
parison to other packets whose threat status is known.
d. Identifying segments of similar customers.
e. Predicting whether a company will go bankrupt based on comparing its financial
data to those of similar bankrupt and nonbankrupt firms.
f. Estimating the repair time required for an aircraft based on a trouble ticket.
g. Automated sorting of mail by zip code scanning.
h. Printing of custom discount coupons at the conclusion of a grocery store checkout
based on what you just bought and what others have bought previously.
2.2 Describe the difference in roles assumed by the validation partition and the test parti-
tion.
2.3 Consider the sample from a database of credit applicants in Table 2.15. Comment on
the likelihood that it was sampled randomly, and whether it is likely to be a useful
sample.
TABLE 2.15 SAMPLE FROM A DATABASE OF CREDIT APPLICATIONS
OBS CHECK DURATION HISTORY NEW USED FURNITURE RADIO EDUC RETRAIN AMOUNT SAVE RESPONSE
ACCT CAR CAR TV ACCT
1 0 6 4 0 0 0 1 0 0 1169 4 1
8 1 36 2 0 1 0 0 0 0 6948 0 1
16 0 24 2 0 0 0 1 0 0 1282 1 0
24 1 12 4 0 1 0 0 0 0 1804 1 1
32 0 24 2 0 0 1 0 0 0 4020 0 1
40 1 9 2 0 0 0 1 0 0 458 0 1
48 0 6 2 0 1 0 0 0 0 1352 2 1
56 3 6 1 1 0 0 0 0 0 783 4 1
64 1 48 0 0 0 0 0 0 1 14421 0 0
72 3 7 4 0 0 0 1 0 0 730 4 1
80 1 30 2 0 0 1 0 0 0 3832 0 1
88 1 36 2 0 0 0 0 1 0 12612 1 0
96 1 54 0 0 0 0 0 0 1 15945 0 0
104 1 9 4 0 0 1 0 0 0 1919 0 1
112 2 15 2 0 0 0 0 1 0 392 0 1
2.4 Consider the sample from a bank database shown in Table 2.16; it was selected ran-
domly from a larger database to be the training set. Personal Loan indicates whether a
solicitation for a personal loan was accepted and is the response variable. A campaign
is planned for a similar solicitation in the future and the bank is looking for a model
that will identify likely responders. Examine the data carefully and indicate what your
next step would be.
50 OVERVIEW OF THE DATA MINING PROCESS
TABLE 2.16 SAMPLE FROM A BANK DATABASE
OBS AGE EXPERIENCE INCOME ZIP FAMILY CC EDUC MORTGAGE PERSONAL SECURITIES
CODE AVG LOAN ACCT
1 25 1 49 91107 4 1.6 1 0 0 1
4 35 9 100 94112 1 2.7 2 0 0 0
5 35 8 45 91330 4 1 2 0 0 0
9 35 10 81 90089 3 0.6 2 104 0 0
10 34 9 180 93023 1 8.9 3 0 1 0
12 29 5 45 90277 3 0.1 2 0 0 0
17 38 14 130 95010 4 4.7 3 134 1 0
18 42 18 81 94305 4 2.4 1 0 0 0
21 56 31 25 94015 4 0.9 2 111 0 0
26 43 19 29 94305 3 0.5 1 97 0 0
29 56 30 48 94539 1 2.2 3 0 0 0
30 38 13 119 94104 1 3.3 2 0 1 0
35 31 5 50 94035 4 1.8 3 0 0 0
36 48 24 81 92647 3 0.7 1 0 0 0
37 59 35 121 94720 1 2.9 1 0 0 0
38 51 25 71 95814 1 1.4 3 198 0 0
39 42 18 141 94114 3 5 3 0 1 1
41 57 32 84 92672 3 1.6 3 0 0 1
2.5 Using the concept of overfitting, explain why when a model is fit to training data,
zero error with those data is not necessarily good.
2.6 In fitting a model to classify prospects as purchasers or nonpurchasers, a certain com-
pany drew the training data from internal data that include demographic and purchase
information. Future data to be classified will be lists purchased from other sources,
with demographic (but not purchase) data included. It was found that “refund issued”
was a useful predictor in the training data. Why is this not an appropriate variable to
include in the model?
2.7 A dataset has 1000 records and 50 variables with 5% of the values missing, spread
randomly throughout the records and variables. An analyst decides to remove records
with missing values. About how many records would you expect to be removed?
2.8 Normalize the data in Table 2.17, showing calculations.
TABLE 2.17
Age Income ($)
25 49,000
56 156,000
65 99,000
32 192,000
41 39,000
49 57,000
2.9 Statistical distance between records can be measured in several ways. Consider
Euclidean distance, measured as the square root of the sum of the squared differences.
For the first two records in Table 2.17, it is√
(25− 56)2 + (49, 000− 156, 000)2.
PROBLEMS 51
Can normalizing the data change which two records are farthest from each other in
terms of Euclidean distance?
2.10 Two models are applied to a dataset that has been partitioned. Model A is considerably
more accurate than model B on the training data, but slightly less accurate than model
B on the validation data. Which model are you more likely to consider for final
deployment?
2.11 The dataset ToyotaCorolla.csv contains data on used cars on sale during the late summer
of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes,
including Price, Age, Kilometers, HP, and other specifications.
a. Explore the data using the data visualization capabilities of R.Which of the pairs
among the variables seem to be correlated?
b. We plan to analyze the data using various data mining techniques described in future
chapters. Prepare the data for use as follows:
i. The dataset has two categorical attributes, Fuel Type andMetallic. Describe how
you would convert these to binary variables. Confirm this using R’s functions
to transform categorical data into dummies.
ii. Prepare the dataset (as factored into dummies) for data mining techniques of
supervised learning by creating partitions in R. Select all the variables and use
default values for the random seed and partitioning percentages for training
(50%), validation (30%), and test (20%) sets. Describe the roles that these par-
titions will play in modeling.
Part II
Data Exploration and
Dimension Reduction
CHAPTER 3
Data Visualization
In this chapter, we describe a set of plots that can be used to explore the multidi-
mensional nature of a data set. We present basic plots (bar charts, line graphs, and
scatter plots), distribution plots (boxplots and histograms), and different enhance-
ments that expand the capabilities of these plots to visualize more information.
We focus on how the different visualizations and operations can support data
mining tasks, from supervised tasks (prediction, classification, and time series
forecasting) to unsupervised tasks, and provide a few guidelines on specific visu-
alizations to use with each data mining task. We also describe the advantages
of interactive visualization over static plots. The chapter concludes with a pre-
sentation of specialized plots suitable for data with special structure (hierarchical,
network, and geographical).
3.1 Uses of Data Visualization1
The popular saying “a picture is worth a thousand words” refers to the ability
to condense diffused verbal information into a compact and quickly understood
graphical image. In the case of numbers, data visualization and numerical sum-
marization provide us with both a powerful tool to explore data and an effective
way to present results.
Where do visualization techniques fit into the data mining process, as
described so far? They are primarily used in the preprocessing portion of the
data mining process. Visualization supports data cleaning by finding incorrect
values (e.g., patients whose age is 999 or −1), missing values, duplicate rows,
1Randall Pruim assisted with the ggplot code in this chapter. This and subsequent sections in this
chapter copyright ©2017 Statistics.com and Galit Shmueli. Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
55
56 DATA VISUALIZATION
columns with all the same value, and the like. Visualization techniques are also
useful for variable derivation and selection: they can help determine which vari-
ables to include in the analysis and which might be redundant. They can also
help with determining appropriate bin sizes, should binning of numerical vari-
ables be needed (e.g., a numerical outcome variable might need to be converted
to a binary variable if a yes/no decision is required). They can also play a role in
combining categories as part of the data reduction process. Finally, if the data
have yet to be collected and collection is expensive (as with the Pandora project
at its outset, see Chapter 7), visualization methods can help determine, using a
sample, which variables and metrics are useful.
In this chapter, we focus on the use of graphical presentations for the purpose
of data exploration, particularly with relation to predictive analytics. Although our
focus is not on visualization for the purpose of data reporting, this chapter offers
ideas as to the effectiveness of various graphical displays for the purpose of data
presentation. These offer a wealth of useful presentations beyond tabular sum-
maries and basic bar charts, which are currently the most popular form of data
presentation in the business environment. For an excellent discussion of using
graphs to report business data, see Few (2004). In terms of reporting data min-
ing results graphically, we describe common graphical displays elsewhere in the
book, some of which are technique-specific [e.g., dendrograms for hierarchical
clustering (Chapter 15), network charts for social network analysis (Chapter 19),
and tree charts for classification and regression trees (Chapter 9)] while others
are more general [e.g., receiver operating characteristic (ROC) curves and lift
charts for classification (Chapter 5) and profile plots and heatmaps for clustering
(Chapter 15)].
Note: The term “graph” can have two meanings in statistics. It can refer,
particularly in popular usage, to any of a number of figures to represent data (e.g.,
line chart, bar plot, histogram, etc.). In a more technical use, it refers to the data
structure and visualization in networks (see Chapter 19). Using the term “plot”
for the visualizations we explore in this chapter avoids this confusion.
Data exploration is a mandatory initial step whether or not more formal
analysis follows. Graphical exploration can support free-form exploration for the
purpose of understanding the data structure, cleaning the data (e.g., identifying
unexpected gaps or “illegal” values), identifying outliers, discovering initial pat-
terns (e.g., correlations among variables and surprising clusters), and generating
interesting questions. Graphical exploration can also be more focused, geared
toward specific questions of interest. In the data mining context, a combination
is needed: free-form exploration performed with the purpose of supporting a
specific goal.
Graphical exploration can range from generating very basic plots to using
operations such as filtering and zooming interactively to explore a set of
interconnected visualizations that include advanced features such as color and
DATA EXAMPLES 57
multiple-panels. This chapter is not meant to be an exhaustive guidebook on
visualization techniques, but instead discusses main principles and features that
support data exploration in a data mining context. We start by describing vary-
ing levels of sophistication in terms of visualization, and show the advantages of
different features and operations. Our discussion is from the perspective of how
visualization supports the subsequent data mining goal. In particular, we distin-
guish between supervised and unsupervised learning; within supervised learning,
we also further distinguish between classification (categorical outcome variable)
and prediction (numerical outcome variable).
Base R or ggplot?
The ggplot package by Hadley Wickham has become the most popular dedi-
cated graphics package in R for creating presentation-quality visualization in a
wide variety of contexts. The “gg” in ggplot refers to “Grammar of Graphics,”
a term coined by Leland Wilkinson to define a system of plotting theory and
nomenclature. Learning ggplot effectively means becoming familiar with this
philosophy and technical language of plotting. This brings with it flexibility and
power. It also entails a non-trivial learning curve. If you are likely to be using
data visualizations on a regular basis in communicating with others with high-
quality graphics, it is worth the time and effort to get up to speed on ggplot. If
your uses of visualization are likely to be exploratory and informal, perhaps as an
initial part of analysis that will not be formally written up, you may decide not
to undertake the investment of learning ggplot.
In this chapter, which is primarily an introduction, base R syntax is shown
for nearly all the visualizations that we present. In most examples, ggplot syntax
is also shown in the R code that follows each Figure.
3.2 Data Examples
To illustrate data visualization, we use two datasets used in additional chapters in
the book.
Example 1: Boston Housing Data
The Boston Housing data contain information on census tracts in Boston2 for
which several measurements are taken (e.g., crime rate, pupil/teacher ratio). It
has 14 variables. A description of each variable is given in Table 3.1 and a
sample of the first nine records is shown in Table 3.2. In addition to the original
2The Boston Housing dataset was originally published by Harrison and Rubinfeld in “Hedonic prices
and the demand for clean air”, Journal of Environmental Economics & Management, vol. 5, p. 81–102,
1978.
58 DATA VISUALIZATION
TABLE 3.1 DESCRIPTION OF VARIABLES IN BOSTON HOUSING DATASET
CRIM Crime rate
ZN Percentage of residential land zoned for lots over 25,000 ft2
INDUS Percentage of land occupied by nonretail business
CHAS Does tract bound Charles River (= 1 if tract bounds river, = 0 otherwise)
NOX Nitric oxide concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Percentage of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property tax rate per $10,000
PTRATIO Pupil-to-teacher ratio by town
LSTAT Percentage of lower status of the population
MEDV Median value of owner-occupied homes in $1000s
CAT.MEDV Is median value of owner-occupied homes in tract above $30,000 (CAT.MEDV = 1) or not
(CAT.MEDV = 0)
13 variables, the dataset also contains the additional variable CAT.MEDV, which
has been created by categorizing median value (MEDV) into two categories:
high and low.
We consider three possible tasks:
1. A supervised predictive task, where the outcome variable of interest is
the median value of a home in the tract (MEDV).
TABLE 3.2 FIRST NINE RECORDS IN THE BOSTON HOUSING DATA
code for opening the Boston Housing file and viewing the first 9 records
housing.df <- read.csv("BostonHousing.csv")
head(housing.df, 9)
Output
> head(housing.df, 9)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT.MEDV
1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0 0
2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6 0
3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 1
4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4 1
5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2 1
6 0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7 0
7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 12.43 22.9 0
8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 19.15 27.1 0
9 0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 29.93 16.5 0
BASIC CHARTS: BAR CHARTS, LINE GRAPHS, AND SCATTER PLOTS 59
2. A supervised classification task, where the outcome variable of interest is
the binary variable CAT.MEDV that indicates whether the home value
is above or below $30,000.
3. An unsupervised task, where the goal is to cluster census tracts.
(MEDV and CAT.MEDV are not used together in any of the three cases).
Example 2: Ridership on Amtrak Trains
Amtrak, a US railway company, routinely collects data on ridership. Here
we focus on forecasting future ridership using the series of monthly ridership
between January 1991 and March 2004. The data and their source are described
in Chapter 16. Hence our task here is (numerical) time series forecasting.
3.3 Basic Charts: Bar Charts, Line Graphs,
and Scatter Plots
The three most effective basic plots are bar charts, line graphs, and scatter plots.
These plots are easy to create in R and are the plots most commonly used in the
current business world, in both data exploration and presentation (unfortunately,
pie charts are also popular, although they are usually ineffective visualizations).
Basic charts support data exploration by displaying one or two columns of data
(variables) at a time. This is useful in the early stages of getting familiar with the
data structure, the amount and types of variables, the volume and type of missing
values, etc.
The nature of the data mining task and domain knowledge about the data
will affect the use of basic charts in terms of the amount of time and effort allo-
cated to different variables. In supervised learning, there will be more focus on
the outcome variable. In scatter plots, the outcome variable is typically associ-
ated with the y-axis. In unsupervised learning (for the purpose of data reduction
or clustering), basic plots that convey relationships (such as scatter plots) are pre-
ferred.
The top-left panel in Figure 3.1 displays a line chart for the time series of
monthly railway passengers on Amtrak. Line graphs are used primarily for show-
ing time series. The choice of time frame to plot, as well as the temporal scale,
should depend on the horizon of the forecasting task and on the nature of the
data.
Bar charts are useful for comparing a single statistic (e.g., average, count,
percentage) across groups. The height of the bar (or length in a horizontal
display) represents the value of the statistic, and different bars correspond to
different groups. Two examples are shown in the bottom panels in Figure 3.1.
The left panel shows a bar chart for a numerical variable (MEDV) and the right
60 DATA VISUALIZATION
Year
R
id
er
sh
ip
(
in
0
00
s)
1992 1996 2000 2004
14
00
18
00
22
00
10 20 30
10
20
30
40
50
MDEV
LS
TA
T
0 1
CHAS
A
vg
. M
E
D
V
0
5
10
15
20
25
0 1
CHAS
%
o
f C
AT
.M
E
D
V
0
5
10
15
20
25
30
FIGURE 3.1 BASIC PLOTS: LINE GRAPH (TOP LEFT), SCATTER PLOT (TOP RIGHT), BAR CHART
FOR NUMERICAL VARIABLE (BOTTOM LEFT), AND BAR CHART FOR CATEGORICAL
VARIABLE (BOTTOM RIGHT)
code for creating Figure 3.1
## line chart for the Amtrak data
Amtrak.df <- read.csv("Amtrak.csv")
# use time series analysis
library(forecast)
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
## Boston housing data
housing.df <- read.csv("BostonHousing.csv")
## scatter plot with axes names
plot(housing.df$MEDV ~ housing.df$LSTAT, xlab = "MDEV", ylab = "LSTAT")
# alternative plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = LSTAT, y = MEDV), colour = "navy", alpha = 0.7)
## barchart of CHAS vs. mean MEDV
# compute mean MEDV per CHAS = (0, 1)
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS,
xlab = "CHAS", ylab = "Avg. MEDV")
# alternative plot with ggplot
ggplot(data.for.plot) + geom_bar(aes(x = CHAS, y = MeanMEDV), stat = "identity")
## barchart of CHAS vs. % CAT.MEDV
data.for.plot <- aggregate(housing.df$CAT..MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanCATMEDV")
barplot(data.for.plot$MeanCATMEDV * 100, names.arg = data.for.plot$CHAS,
xlab = "CHAS", ylab = "% of CAT.MEDV")
BASIC CHARTS: BAR CHARTS, LINE GRAPHS, AND SCATTER PLOTS 61
panel shows a bar chart for a categorical variable (CAT.MEDV). In each, separate
bars are used to denote homes in Boston that are near the Charles River vs. those
that are not (thereby comparing the two categories of CHAS). The chart with
the numerical output MEDV (bottom left) uses the average MEDV on the y-
axis. This supports the predictive task: the numerical outcome is on the y-axis
and the x-axis is used for a potential categorical predictor.3 (Note that the x-axis
on a bar chart must be used only for categorical variables, because the order of
bars in a bar chart should be interchangeable.) For the classification task (bottom
right), the y-axis indicates the percent of tracts with median value above $30K
and the x-axis is a binary variable indicating proximity to the Charles. This
graph shows us that the tracts bordering the Charles are much more likely to
have median values above $30K.
The top-right panel in Figure 3.1 displays a scatter plot of MEDV vs.
LSTAT. This is an important plot in the prediction task. Note that the output
MEDV is again on the y-axis (and LSTAT on the x-axis is a potential predic-
tor). Because both variables in a basic scatter plot must be numerical, it cannot
be used to display the relation between CAT.MEDV and potential predictors
for the classification task (but we can enhance it to do so—see Section 3.4).
For unsupervised learning, this particular scatter plot helps study the associa-
tion between two numerical variables in terms of information overlap as well as
identifying clusters of observations.
All three basic plots highlight global information such as the overall level
of ridership or MEDV, as well as changes over time (line chart), differences
between subgroups (bar chart), and relationships between numerical variables
(scatter plot).
Distribution Plots: Boxplots and Histograms
Before moving on to more sophisticated visualizations that enable multidimen-
sional investigation, we note two important plots that are usually not considered
“basic charts” but are very useful in statistical and data mining contexts. The box-
plot and the histogram are two plots that display the entire distribution of a numer-
ical variable. Although averages are very popular and useful summary statistics,
there is usually much to be gained by looking at additional statistics such as the
median and standard deviation of a variable, and even more so by examining
the entire distribution. Whereas bar charts can only use a single aggregation,
boxplots and histograms display the entire distribution of a numerical variable.
3We refer here to a bar chart with vertical bars. The same principles apply if using a bar chart with
horizontal lines, except that the x-axis is now associated with the numerical variable and the y-axis
with the categorical variable.
62 DATA VISUALIZATION
Boxplots are also effective for comparing subgroups by generating side-by-side
boxplots, or for looking at distributions over time by creating a series of boxplots.
Distribution plots are useful in supervised learning for determining potential
data mining methods and variable transformations. For example, skewed numer-
ical variables might warrant transformation (e.g., moving to a logarithmic scale)
if used in methods that assume normality (e.g., linear regression, discriminant
analysis).
A histogram represents the frequencies of all x values with a series of vertical
connected bars. For example, in the left panel of Figure 3.2, there are over 150
tracts where the median value (MEDV) is between $20K–$25K.
A boxplot represents the variable being plotted on the y-axis (although the
plot can potentially be turned in a 90 degrees angle, so that the boxes are parallel
to the x-axis). In the right panel of Figure 3.2 there are two boxplots (called a
side-by-side boxplot). The box encloses 50% of the data—for example, in the
Histogram of housing.df$MEDV
MEDV
F
re
qu
en
cy
10 20 30 40 50
0
50
10
0
15
0
0 1
10
20
30
40
50
CHAS
M
E
D
V
FIGURE 3.2 DISTRIBUTION CHARTS FOR NUMERICAL VARIABLE MEDV. LEFT: HISTOGRAM,
RIGHT: BOXPLOT
code for creating Figure 3.2
## histogram of MEDV
hist(housing.df$MEDV, xlab = "MEDV")
# alternative plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_histogram(aes(x = MEDV), binwidth = 5)
## boxplot of MEDV for different values of CHAS
boxplot(housing.df$MEDV ~ housing.df$CHAS, xlab = "CHAS", ylab = "MEDV")
# alternative plot with ggplot
ggplot(housing.df) + geom_boxplot(aes(x = as.factor(CHAS), y = MEDV)) + xlab("CHAS")
BASIC CHARTS: BAR CHARTS, LINE GRAPHS, AND SCATTER PLOTS 63
right-hand box half of the tracts have median values (MEDV) between $20,000–
$33,000. The horizontal line inside the box represents the median (50th per-
centile). The top and bottom of the box represent the 75th and 25th percentiles,
respectively. Lines extending above and below the box cover the rest of the data
range; outliers may be depicted as points or circles. Sometimes the average is
marked by a + (or similar) sign. Comparing the average and the median helps in
assessing how skewed the data are. Boxplots are often arranged in a series with
a different plot for each of the various values of a second variable, shown on the
x-axis.
Because histograms and boxplots are geared toward numerical variables, their
basic form is useful for prediction tasks. Boxplots can also support unsupervised
learning by displaying relationships between a numerical variable (y-axis) and a
categorical variable (x-axis). To illustrate these points, look again at Figure 3.2.
The left panel shows a histogram of MEDV, revealing a skewed distribution.
Transforming the output variable to log(MEDV) might improve results of a linear
regression predictor.
The right panel in Figure 3.2 shows side-by-side boxplots comparing the
distribution of MEDV for homes that border the Charles River (1) or not (0),
similar to Figure 3.1. We see that not only is the average MEDV for river-
bounding homes higher than the non-river-bounding homes, the entire distri-
bution is higher (median, quartiles, min, and max). We can also see that all river-
bounding homes have MEDV above $10 thousand, unlike non-river-bounding
homes. This information is useful for identifying the potential importance of
this predictor (CHAS), and for choosing data mining methods that can capture
the non-overlapping area between the two distributions (e.g., trees).
Boxplots and histograms applied to numerical variables can also provide
directions for deriving new variables, for example, they can indicate how to
bin a numerical variable (for example, binning a numerical outcome in order
to use a naive Bayes classifier, or in the Boston Housing example, choosing the
cutoff to convert MEDV to CAT.MEDV).
Finally, side-by-side boxplots are useful in classification tasks for evaluating
the potential of numerical predictors. This is done by using the x-axis for the
categorical outcome and the y-axis for a numerical predictor. An example is
shown in Figure 3.3, where we can see the effects of four numerical predictors
on CAT.MEDV. The pairs that are most separated (e.g., PTRATIO and INDUS)
indicate potentially useful predictors.
The main weakness of basic charts and distribution plots, in their basic form
(that is, using position in relation to the axes to encode values), is that they can
only display two variables and therefore cannot reveal high-dimensional infor-
mation. Each of the basic charts has two dimensions, where each dimension is
dedicated to a single variable. In data mining, the data are usually multivariate
by nature, and the analytics are designed to capture and measure multivariate
64 DATA VISUALIZATION
0 1
0.
4
0.
5
0.
6
0.
7
0.
8
CAT.MEDV
N
O
X
0 1
10
20
30
CAT.MEDV
LS
TA
T
0 1
14
16
18
20
22
CAT.MEDV
P
T
R
AT
IO
0 1
0
5
10
15
20
25
CAT.MEDV
IN
D
U
S
FIGURE 3.3 SIDE-BY-SIDE BOXPLOTS FOR EXPLORING THE CAT.MEDV OUTPUT VARIABLE BY
DIFFERENT NUMERICAL PREDICTORS. IN A SIDE-BY-SIDE BOXPLOT, ONE AXIS IS
USED FOR A CATEGORICAL VARIABLE, AND THE OTHER FOR A NUMERICAL
VARIABLE. PLOTTING A CATEGORICAL OUTCOME VARIABLE AND A NUMERICAL
PREDICTOR COMPARES THE PREDICTOR’S DISTRIBUTION ACROSS THE OUTCOME
CATEGORIES. PLOTTING A NUMERICAL OUTCOME VARIABLE AND A CATEGORICAL
PREDICTOR DISPLAYS THE DISTRIBUTION OF THE OUTCOME VARIABLE ACROSS
DIFFERENT LEVELS OF THE PREDICTOR
code for creating Figure 3.3
## side-by-side boxplots
# use par() to split the plots into panels.
par(mfcol = c(1, 4))
boxplot(housing.df$NOX ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "NOX")
boxplot(housing.df$LSTAT ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "LSTAT")
boxplot(housing.df$PTRATIO ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "PTRATIO")
boxplot(housing.df$INDUS ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "INDUS")
information. Visual exploration should therefore also incorporate this important
aspect. In the next section, we describe how to extend basic charts (and distribu-
tion plots) to multidimensional data visualization by adding features, employing
manipulations, and incorporating interactivity. We then present several special-
ized charts that are geared toward displaying special data structures (Section 3.5).
Heatmaps: Visualizing Correlations and Missing Values
A heatmap is a graphical display of numerical data where color is used to denote
values. In a data mining context, heatmaps are especially useful for two pur-
poses: for visualizing correlation tables and for visualizing missing values in the
data. In both cases the information is conveyed in a two-dimensional table. A
correlation table for p variables has p rows and p columns. A data table contains
p columns (variables) and n rows (observations). If the number of rows is huge,
then a subset can be used. In both cases, it is much easier and faster to scan
the color-coding rather than the values. Note that heatmaps are useful when
examining a large number of values, but they are not a replacement for more
precise graphical display, such as bar charts, because color differences cannot be
perceived accurately.
BASIC CHARTS: BAR CHARTS, LINE GRAPHS, AND SCATTER PLOTS 65
C
R
IM Z
N
IN
D
U
S
C
H
A
S
N
O
X
R
M
A
G
E
D
IS
R
A
D
TA
X
P
T
R
AT
IO
LS
TA
T
M
E
D
V
C
AT
..M
E
D
V
CAT..MEDV
MEDV
LSTAT
PTRATIO
TAX
RAD
DIS
AGE
RM
NOX
CHAS
INDUS
ZN
CRIM
−0.15 0.37 −0.37 0.11 −0.23 0.64 −0.19 0.12 −0.2 −0.27 −0.44 −0.47 0.79 1
−0.39 0.36 −0.48 0.18 −0.43 0.7 −0.38 0.25 −0.38 −0.47 −0.51 −0.74 1 0.79
0.46 −0.41 0.6 −0.05 0.59 −0.61 0.6 −0.5 0.49 0.54 0.37 1 −0.74 −0.47
0.29 −0.39 0.38 −0.12 0.19 −0.36 0.26 −0.23 0.46 0.46 1 0.37 −0.51 −0.44
0.58 −0.31 0.72 −0.04 0.67 −0.29 0.51 −0.53 0.91 1 0.46 0.54 −0.47 −0.27
0.63 −0.31 0.6 −0.01 0.61 −0.21 0.46 −0.49 1 0.91 0.46 0.49 −0.38 −0.2
−0.38 0.66 −0.71 −0.1 −0.77 0.21 −0.75 1 −0.49 −0.53 −0.23 −0.5 0.25 0.12
0.35 −0.57 0.64 0.09 0.73 −0.24 1 −0.75 0.46 0.51 0.26 0.6 −0.38 −0.19
−0.22 0.31 −0.39 0.09 −0.3 1 −0.24 0.21 −0.21 −0.29 −0.36 −0.61 0.7 0.64
0.42 −0.52 0.76 0.09 1 −0.3 0.73 −0.77 0.61 0.67 0.19 0.59 −0.43 −0.23
−0.06 −0.04 0.06 1 0.09 0.09 0.09 −0.1 −0.01 −0.04 −0.12 −0.05 0.18 0.11
0.41 −0.53 1 0.06 0.76 −0.39 0.64 −0.71 0.6 0.72 0.38 0.6 −0.48 −0.37
−0.2 1 −0.53 −0.04 −0.52 0.31 −0.57 0.66 −0.31 −0.31 −0.39 −0.41 0.36 0.37
1 −0.2 0.41 −0.06 0.42 −0.22 0.35 −0.38 0.63 0.58 0.29 0.46 −0.39 −0.15
FIGURE 3.4 HEATMAP OF A CORRELATION TABLE. DARKER VALUES DENOTE STRONGER
CORRELATION
code for creating Figure 3.4
## simple heatmap of correlations (without values)
heatmap(cor(housing.df), Rowv = NA, Colv = NA)
## heatmap with values
library(gplots)
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(housing.df),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))
# alternative plot with ggplot
library(ggplot2)
library(reshape) # to generate input for the plot
cor.mat <- round(cor(housing.df),2) # rounded correlation matrix
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x = X1, y = X2, fill = value)) +
geom_tile() +
geom_text(aes(x = X1, y = X2, label = value))
An example of a correlation table heatmap is shown in Figure 3.4, showing
all the pairwise correlations between 13 variables (MEDV and 12 predictors).
Darker shades correspond to stronger (positive or negative) correlation. It is easy
to quickly spot the high and low correlations.
In a missing value heatmap, rows correspond to records and columns to
variables. We use a binary coding of the original dataset where 1 denotes a
missing value and 0 otherwise. This new binary table is then colored such that
66 DATA VISUALIZATION
FIGURE 3.5 HEATMAP OF MISSING VALUES IN A DATASET. BLACK DENOTES MISSING VALUE
code for generating a heatmap of missing values
# replace dataFrame with your data.
# is.na() returns a Boolean (TRUE/FALSE) output indicating the location of missing
# values.
# multiplying the Boolean value by 1 converts the output into binary (0/1).
heatmap(1 * is.na(dataFrame), Rowv = NA, Colv = NA)
only missing value cells (with value 1) are colored. Figure 3.5 shows an example
of a missing value heatmap for a dataset with over 1000 columns. The data
include economic, social, political, and “well-being” information on different
countries around the world (each row is a country). The variables were merged
from multiple sources, and for each source information was not always available
on every country. The missing data heatmap helps visualize the level and amount
of “missingness” in the merged data file. Some patterns of “missingness” easily
emerge: variables that are missing for nearly all observations, as well as clusters of
rows (countries) that are missing many values. Variables with little missingness
are also visible. This information can then be used for determining how to
handle the missingness (e.g., dropping some variables, dropping some records,
imputing, or via other techniques).
MULTIDIMENSIONAL VISUALIZATION 67
3.4 Multidimensional Visualization
Basic plots can convey richer information with features such as color, size, and
multiple panels, and by enabling operations such as rescaling, aggregation, and
interactivity. These additions allow looking at more than one or two variables
at a time. The beauty of these additions is their effectiveness in displaying com-
plex information in an easily understandable way. Effective features are based on
understanding how visual perception works (see Few (2009) for a discussion).
The purpose is to make the information more understandable, not just to rep-
resent the data in higher dimensions (such as three-dimensional plots that are
usually ineffective visualizations).
Adding Variables: Color, Size, Shape, Multiple Panels, and Animation
In order to include more variables in a plot, we must consider the type of variable
to include. To represent additional categorical information, the best way is to
use hue, shape, or multiple panels. For additional numerical information, we can
use color intensity or size. Temporal information can be added via animation.
Incorporating additional categorical and/or numerical variables into the
basic (and distribution) plots means that we can now use all of them for both
prediction and classification tasks. For example, we mentioned earlier that a
basic scatter plot cannot be used for studying the relationship between a cate-
gorical outcome and predictors (in the context of classification). However, a
very effective plot for classification is a scatter plot of two numerical predictors
color-coded by the categorical outcome variable. An example is shown in the
top panel of Figure 3.6, with color denoting CAT.MEDV.
In the context of prediction, color-coding supports the exploration of the
conditional relationship between the numerical outcome (on the y-axis) and
a numerical predictor. Color-coded scatter plots then help assess the need for
creating interaction terms (for example, is the relationship between MEDV and
LSTAT different for homes near vs. away from the river?).
Color can also be used to include further categorical variables into a bar
chart, as long as the number of categories is small. When the number of cat-
egories is large, a better alternative is to use multiple panels. Creating multiple
panels (also called “trellising”) is done by splitting the observations according to
a categorical variable, and creating a separate plot (of the same type) for each
category. An example is shown in the right-hand panel of Figure 3.6, where a
bar chart of average MEDV by RAD is broken down into two panels by CHAS.
We see that the average MEDV for different highway accessibility levels (RAD)
behaves differently for homes near the river (lower panel) compared to homes
away from the river (upper panel). This is especially salient for RAD = 1. We
68 DATA VISUALIZATION
10 20 30
0
.4
0
.5
0
.6
0
.7
0
.8
LSTAT
N
O
X
CAT.MEDV = 1
CAT.MEDV = 0
1 2 3 4 5 6 7 8 24
CHAS = 0
RAD
Av
g.
M
ED
V
0
15
1 2 3 4 5 6 7 8 24
CHAS = 1
RAD
Av
g.
M
ED
V
0
30
FIGURE 3.6 ADDING CATEGORICAL VARIABLES BY COLOR-CODING AND MULTIPLE PANELS.
LEFT: SCATTER PLOT OF TWO NUMERICAL PREDICTORS, COLOR-CODED BY THE
CATEGORICAL OUTCOME CAT.MEDV. RIGHT: BAR CHART OF MEDV BY TWO
CATEGORICAL PREDICTORS (CHAS AND RAD), USING MULTIPLE PANELS FOR
CHAS
code for creating Figure 3.6
## color plot
par(xpd=TRUE) # allow legend to be displayed outside of plot area
plot(housing.df$NOX ~ housing.df$LSTAT, ylab = "NOX", xlab = "LSTAT",
col = ifelse(housing.df$CAT..MEDV == 1, "black", "gray"))
# add legend outside of plotting area
# In legend() use argument inset = to control the location of the legend relative
# to the plot.
legend("topleft", inset=c(0, -0.2),
legend = c("CAT.MEDV = 1", "CAT.MEDV = 0"), col = c("black", "gray"),
pch = 1, cex = 0.5)
# alternative plot with ggplot
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour= CAT..MEDV)) +
geom_point(alpha = 0.6)
## panel plots
# compute mean MEDV per RAD and CHAS
# In aggregate() use argument drop = FALSE to include all combinations
# (exiting and missing) of RAD X CHAS.
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$RAD, housing.df$CHAS),
FUN = mean, drop = FALSE)
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")
# plot the data
par(mfcol = c(2,1))
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 0],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 0],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 0")
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 1],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 1],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 1")
# alternative plot with ggplot
ggplot(data.for.plot) +
geom_bar(aes(x = as.factor(RAD), y = `meanMEDV`), stat = "identity") +
xlab("RAD") + facet_grid(CHAS ~ .)
MULTIDIMENSIONAL VISUALIZATION 69
also see that there are no near-river homes in RAD levels 2, 6, and 7. Such infor-
mation might lead us to create an interaction term between RAD and CHAS,
and to consider condensing some of the bins in RAD. All these explorations are
useful for prediction and classification.
A special plot that uses scatter plots with multiple panels is the scatter plot
matrix. In it, all pairwise scatter plots are shown in a single display. The panels in
a matrix scatter plot are organized in a special way, such that each column and
each row correspond to a variable, thereby the intersections create all the possi-
ble pairwise scatter plots. The scatter plot matrix is useful in unsupervised learn-
ing for studying the associations between numerical variables, detecting outliers
and identifying clusters. For supervised learning, it can be used for examining
pairwise relationships (and their nature) between predictors to support variable
transformations and variable selection (see Correlation Analysis in Chapter 4).
For prediction, it can also be used to depict the relationship of the outcome with
the numerical predictors.
An example of a scatter plot matrix is shown in Figure 3.7, with MEDV
and three predictors. Below the diagonal are the scatter plots. Variable name
indicates the y-axis variable. For example, the plots in the bottom row all have
MEDV on the y-axis (which allows studying the individual outcome–predictor
C
R
IM
IN
D
U
S
LS
TA
T
M
E
D
V
CRIM INDUS LSTAT MEDV
0.0
0.1
0.2
0.3 Corr:
0.407
Corr:
0.456
Corr:
−0.388
0
10
20 Corr:
0.604
Corr:
−0.484
0
10
20
30 Corr:
−0.738
10
20
30
40
50
0 25 50 75 0 10 20 0 10 20 30 10 20 30 40 50
FIGURE 3.7 SCATTER PLOT MATRIX FOR MEDV AND THREE NUMERICAL PREDICTORS
code for creating Figure 3.7
## simple plot
# use plot() to generate a matrix of 4X4 panels with variable name on the diagonal,
# and scatter plots in the remaining panels.
plot(housing.df[, c(1, 3, 12, 13)])
# alternative, nicer plot (displayed)
library(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])
70 DATA VISUALIZATION
relations). We can see different types of relationships from the different shapes
(e.g., an exponential relationship between MEDV and LSTAT and a highly
skewed relationship between CRIM and INDUS), which can indicate needed
transformations. Along the diagonal, where just a single variable is involved, the
frequency distribution for that variable is displayed. Above the diagonal are the
correlation coefficients corresponding to the two variables.
Once hue is used, further categorical variables can be added via shape and
multiple panels. However, one must proceed cautiously in adding multiple vari-
ables, as the display can become over-cluttered and then visual perception is lost.
Adding a numerical variable via size is useful especially in scatter plots
(thereby creating “bubble plots”), because in a scatter plot, points represent indi-
vidual observations. In plots that aggregate across observations (e.g., boxplots,
histograms, bar charts), size and hue are not normally incorporated.
Finally, adding a temporal dimension to a plot to show how the information
changes over time can be achieved via animation. A famous example is Rosling’s
animated scatter plots showing how world demographics changed over the years
(www.gapminder.org). However, while animations of this type work for “sta-
tistical storytelling,” they are not very effective for data exploration.
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering
Most of the time spent in data mining projects is spent in preprocessing. Typi-
cally, considerable effort is expended getting all the data in a format that can actu-
ally be used in the data mining software. Additional time is spent processing the
data in ways that improve the performance of the data mining procedures. This
preprocessing step in data mining includes variable transformation and deriva-
tion of new variables to help models perform more effectively. Transformations
include changing the numeric scale of a variable, binning numerical variables,
and condensing categories in categorical variables. The following manipula-
tions support the preprocessing step as well the choice of adequate data mining
methods. They do so by revealing patterns and their nature.
Rescaling Changing the scale in a display can enhance the plot and illu-
minate relationships. For example, in Figure 3.8, we see the effect of changing
both axes of the scatter plot (top) and the y-axis of a boxplot (bottom) to loga-
rithmic (log) scale. Whereas the original plots (left) are hard to understand, the
patterns become visible in log scale (right). In the histograms, the nature of the
relationship between MEDV and CRIM is hard to determine in the original
scale, because too many of the points are “crowded” near the y-axis. The re-
scaling removes this crowding and allows a better view of the linear relationship
between the two log-scaled variables (indicating a log–log relationship). In the
boxplot displaying the crowding toward the x-axis in the original units does not
allow us to compare the two box sizes, their locations, lower outliers, and most
http://www.gapminder.org
MULTIDIMENSIONAL VISUALIZATION 71
0 20 40 60 80
10
20
30
40
50
CRIM
M
E
D
V
0.01 0.10 1.00 10.00 100.00
5
10
20
50
CRIM
M
E
D
V
0 1
0
20
40
60
80
CAT.MEDV
C
R
IM
0 1
0.
01
1.
00
10
0.
0 0
CAT.MEDV
C
R
IM
FIGURE 3.8 RESCALING CAN ENHANCE PLOTS AND REVEAL PATTERNS. LEFT: ORIGINAL
SCALE, RIGHT: LOGARITHMIC SCALE
code for creating Figure 3.8
options(scipen=999) # avoid scientific notation
## scatter plot: regular and log scale
plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV")
# to use logarithmic scale set argument log = to either 'x', 'y', or 'xy'.
plot(housing.df$MEDV ~ housing.df$CRIM,
xlab = "CRIM", ylab = "MEDV", log = 'xy')
# alternative log-scale plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = CRIM, y = MEDV)) +
scale_x_log10(breaks = 10^(-2:2),
labels = format(10^(-2:2), scientific = FALSE, drop0trailing = TRUE)) +
scale_y_log10(breaks = c(5, 10, 20, 40))
## boxplot: regular and log scale
boxplot(housing.df$CRIM ~ housing.df$CAT..MEDV,
xlab = "CAT.MEDV", ylab = "CRIM")
boxplot(housing.df$CRIM ~ housing.df$CAT..MEDV,
xlab = "CAT.MEDV", ylab = "CRIM", log = 'y')
of the distribution information. Rescaling removes the “crowding to the x-axis”
effect, thereby allowing a comparison of the two boxplots.
Aggregation and Hierarchies Another useful manipulation of scaling is
changing the level of aggregation. For a temporal scale, we can aggregate by
72 DATA VISUALIZATION
different granularity (e.g., monthly, daily, hourly) or even by a “seasonal” factor
of interest such as month-of-year or day-of-week. A popular aggregation for
time series is a moving average, where the average of neighboring values within
a given window size is plotted. Moving average plots enhance visualizing a global
trend (see Chapter 16).
Non-temporal variables can be aggregated if some meaningful hierarchy
exists: geographical (tracts within a zip code in the Boston Housing exam-
ple), organizational (people within departments within units), etc. Figure 3.9
illustrates two types of aggregation for the railway ridership time series. The
original monthly series is shown in the top-left panel. Seasonal aggregation (by
month-of-year) is shown in the top-right panel, where it is easy to see the peak
in ridership in July–August and the dip in January–February. The bottom-right
panel shows temporal aggregation, where the series is now displayed in yearly
aggregates. This plot reveals the global long-term trend in ridership and the
generally increasing trend from 1996 on.
Examining different scales, aggregations, or hierarchies supports both super-
vised and unsupervised tasks in that it can reveal patterns and relationships at
various levels, and can suggest new sets of variables with which to work.
Zooming and Panning The ability to zoom in and out of certain areas of
the data on a plot is important for revealing patterns and outliers. We are often
interested in more detail on areas of dense information or of special interest.
Panning refers to the operation of moving the zoom window to other areas
(popular in mapping applications such as Google Maps). An example of zooming
is shown in the bottom-left panel of Figure 3.9, where the ridership series is
zoomed in to the first two years of the series.
Zooming and panning support supervised and unsupervised methods by
detecting areas of different behavior, which may lead to creating new interac-
tion terms, new variables, or even separate models for data subsets. In addi-
tion, zooming and panning can help choose between methods that assume
global behavior (e.g., regression models) and data-driven methods (e.g., expo-
nential smoothing forecasters and k-nearest-neighbors classifiers), and indicate
the level of global/local behavior (as manifested by parameters such as k in k-
nearest neighbors, the size of a tree, or the smoothing parameters in exponential
smoothing).
Filtering Filtering means removing some of the observations from the
plot. The purpose of filtering is to focus the attention on certain data while
eliminating “noise” created by other data. Filtering supports supervised and
unsupervised learning in a similar way to zooming and panning: it assists in
identifying different or unusual local behavior.
MULTIDIMENSIONAL VISUALIZATION 73
Year
R
id
er
sh
ip
(
in
0
00
s)
1992 1994 1996 1998 2000 2002 2004
14
00
18
00
22
00
Year
R
id
er
sh
ip
(
in
0
00
s)
1991.0 1991.5 1992.0 1992.5
14
00
18
00
22
00
14
00
18
00
22
00
Month
A
ve
ra
ge
R
id
er
sh
ip
Jan Mar May Jul Sep Nov
Year
A
ve
ra
ge
R
id
er
sh
ip
1992 1994 1996 1998 2000 2002
14
00
18
00
22
00
FIGURE 3.9 TIME SERIES LINE GRAPHS USING DIFFERENT AGGREGATIONS (RIGHT PANELS),
ADDING CURVES (TOP-LEFT PANEL), AND ZOOMING IN (BOTTOM-LEFT PANEL)
code for creating Figure 3.9
library(forecast)
Amtrak.df <- read.csv("Amtrak data.csv")
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
## fit curve
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
# alternative plot with ggplot
library(ggplot2)
ggplot(Amtrak.df, aes(y = Ridership, x = Month, group = 12)) +
geom_line() + geom_smooth(formula = y ~ poly(x, 2), method= "lm",
colour = "navy", se = FALSE, na.rm = TRUE)
## zoom in, monthly, and annual plots
ridership.2yrs <- window(ridership.ts, start = c(1991,1), end = c(1992,12))
plot(ridership.2yrs, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
monthly.ridership.ts <- tapply(ridership.ts, cycle(ridership.ts), mean)
plot(monthly.ridership.ts, xlab = "Month", ylab = "Average Ridership",
ylim = c(1300, 2300), type = "l", xaxt = 'n')
## set x labels
axis(1, at = c(1:12), labels = c("Jan","Feb","Mar", "Apr","May","Jun",
"Jul","Aug","Sep", "Oct","Nov","Dec"))
annual.ridership.ts <- aggregate(ridership.ts, FUN = mean)
plot(annual.ridership.ts, xlab = "Year", ylab = "Average Ridership",
ylim = c(1300, 2300))
74 DATA VISUALIZATION
Reference: Trend Lines and Labels
Trend lines and using in-plot labels also help to detect patterns and outliers.
Trend lines serve as a reference, and allow us to more easily assess the shape of
a pattern. Although linearity is easy to visually perceive, more elaborate rela-
tionships such as exponential and polynomial trends are harder to assess by eye.
Trend lines are useful in line graphs as well as in scatter plots. An example is
shown in the top-left panel of Figure 3.9, where a polynomial curve is overlaid
on the original line graph (see also Chapter 16).
In displays that are not overcrowded, the use of in-plot labels can be useful
for better exploration of outliers and clusters. An example is shown in Figure
3.10 (a reproduction of Figure 15.1 with the addition of labels). The figure
shows different utilities on a scatter plot that compares fuel cost with total sales.
We might be interested in clustering the data, and using clustering algorithms
to identify clusters that differ markedly with respect to fuel cost and sales. Fig-
ure 3.10, with the labels, helps visualize these clusters and their members (e.g.,
Nevada and Puget are part of a clear cluster with low fuel costs and high sales).
For more on clustering and on this example, see Chapter 15.
Scaling up to Large Datasets
When the number of observations (rows) is large, plots that display each indi-
vidual observation (e.g., scatter plots) can become ineffective. Aside from using
aggregated charts such as boxplots, some alternatives are:
1. Sampling—drawing a random sample and using it for plotting
2. Reducing marker size
3. Using more transparent marker colors and removing fill
4. Breaking down the data into subsets (e.g., by creating multiple panels)
5. Using aggregation (e.g., bubble plots where size corresponds to number
of observations in a certain range)
6. Using jittering (slightly moving each marker by adding a small amount
of noise)
An example of the advantage of plotting a sample over the large dataset
is shown in Figure 12.2 in Chapter 12, where a scatter plot of 5000 records
is plotted alongside a scatter plot of a sample. Those plots were generated in
Excel. Figure 3.11 illustrates an improved plot of the full dataset by using smaller
markers, using jittering to uncover overlaid points, and more transparent colors.
We can see that larger areas of the plot are dominated by the grey class, the black
class is mainly on the right, while there is a lot of overlap in the top-right area.
MULTIDIMENSIONAL VISUALIZATION 75
5000 10000 15000 20000
0.
5
1.
0
1.
5
2.
0
Sales
F
ue
l C
os
t
Ariz
ona
Bos
ton
Cen
tral
Com
mon
wea
lth
NY
Flor
ida
Haw
aiian
Idah
o
Ken
tuck
y
Mad
ison
Nev
ada
New
Eng
land
Nort
hern
Okla
hom
a
Paci
fic
Pug
et
San
Die
go
Sou
ther
n
Texa
s
Wisc
onsi
n
Unit
ed
Virg
inia
FIGURE 3.10 SCATTER PLOT WITH LABELED POINTS
code for creating Figure 3.10
utilities.df <- read.csv("Utilities.csv")
plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)
# alternative with ggplot
library(ggplot2)
ggplot(utilities.df, aes(y = Fuel_Cost, x = Sales)) + geom_point() +
geom_text(aes(label = paste(" ", Company)), size = 4, hjust = 0.0, angle = 15) +
ylim(0.25, 2.25) + xlim(3000, 18000)
Multivariate Plot: Parallel Coordinates Plot
Another approach toward presenting multidimensional information in a two-
dimensional plot is via specialized plots such as the parallel coordinates plot. In this
76 DATA VISUALIZATION
10 20 50 100 200
0.
1
0.
2
0.
5
1.
0
2.
0
5.
0
10
.0
Income
C
C
A
vg
FIGURE 3.11 SCATTER PLOT OF LARGE DATASET WITH REDUCED MARKER SIZE, JITTERING, AND
MORE TRANSPARENT COLORING
code for creating Figure 3.11
# use function alpha() in library scales to add transparent colors
library(scales)
plot(jitter(universal.df$CCAvg, 1) ~ jitter(universal.df$Income, 1),
col = alpha(ifelse(universal.df$Securities.Account == 0, "gray", "black"), 0.4),
pch = 20, log = 'xy', ylim = c(0.1, 10),
xlab = "Income", ylab = "CCAvg")
# alternative with ggplot
library(ggplot2)
ggplot(universal.df) +
geom_jitter(aes(x = Income, y = CCAvg, colour = Securities.Account)) +
scale_x_log10(breaks = c(10, 20, 40, 60, 100, 200)) +
scale_y_log10(breaks = c(0.1, 0.2, 0.4, 0.6, 1.0, 2.0, 4.0, 6.0))
plot a vertical axis is drawn for each variable. Then each observation is repre-
sented by drawing a line that connects its values on the different axes, thereby
creating a “multivariate profile.” An example is shown in Figure 3.12 for the
Boston Housing data. In this display, separate panels are used for the two values
of CAT.MEDV, in order to compare the profiles of homes in the two classes
(for a classification task). We see that the more expensive homes (bottom panel)
consistently have low CRIM, low LSAT, and high RM compared to cheaper
MULTIDIMENSIONAL VISUALIZATION 77
CAT.MEDV = 0
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV
CAT.MEDV = 1
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV
FIGURE 3.12 PARALLEL COORDINATES PLOT FOR BOSTON HOUSING DATA. EACH OF THE
VARIABLES (SHOWN ON THE HORIZONTAL AXIS) IS SCALED TO 0–100%. PANELS
ARE USED TO DISTINGUISH CAT.MEDV (TOP PANEL = HOMES BELOW $30,000)
code for creating Figure 3.12
library(MASS)
par(mfcol = c(2,1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
homes (top panel), which are more mixed on CRIM, and LSAT, and have a
medium level of RM. This observation gives indication of useful predictors and
suggests possible binning for some numerical predictors.
Parallel coordinates plots are also useful in unsupervised tasks. They can
reveal clusters, outliers, and information overlap across variables. A useful manip-
ulation is to reorder the columns to better reveal observation clusterings.
Interactive Visualization
Similar to the interactive nature of the data mining process, interactivity is key
to enhancing our ability to gain information from graphical visualization. In the
words of Stephen Few (Few, 2009), an expert in data visualization:
We can only learn so much when staring at a static visualization such as a
printed graph …If we can’t interact with the data …we hit the wall.
By interactive visualization, we mean an interface that supports the following
principles:
1. Making changes to a chart is easy, rapid, and reversible.
2. Multiple concurrent charts and tables can be easily combined and dis-
played on a single screen.
78 DATA VISUALIZATION
3. A set of visualizations can be linked, so that operations in one display are
reflected in the other displays.
Let us consider a few examples where we contrast a static plot generator
(e.g., Excel) with an interactive visualization interface.
Histogram rebinning Consider the need to bin a numerical variables and using
a histogram for that purpose. A static histogram would require replotting for each
new binning choice. If the user generates multiple plots, then the screen becomes
cluttered. If the same plot is recreated, then it is hard to compare different
binning choices. In contrast, an interactive visualization would provide an easy
way to change bin width interactively (see, e.g., the slider below the histogram
in Figure 3.13), and then the histogram would automatically and rapidly replot
as the user changes the bin width.
Aggregation and Zooming Consider a time series forecasting task, given a
long series of data. Temporal aggregation at multiple levels is needed for deter-
mining short and long term patterns. Zooming and panning are used to iden-
tify unusual periods. A static plotting software requires the user to compute
new variables for each temporal aggregation (e.g., aggregate daily data to obtain
weekly aggregates). Zooming and panning requires manually changing the min
and max values on the axis scale of interest (thereby losing the ability to quickly
move between different areas without creating multiple charts). An interactive
visualization would provide immediate temporal hierarchies which the user can
easily switch between. Zooming would be enabled as a slider near the axis (see,
e.g., the sliders on the top-left panel in Figure 3.13), thereby allowing direct
manipulation and rapid reaction.
Combining Multiple Linked Plots That Fit in a Single Screen To support a clas-
sification task, multiple plots are created of the outcome variable vs. potential
categorical and numerical predictors. These can include side-by-side boxplots,
color-coded scatter plots, and multipanel bar charts. The user wants to detect
possible multidimensional relationships (and identify possible outliers) by select-
ing a certain subset of the data (e.g., a single category of some variable) and
locating the observations on the other plots. In a static interface, the user would
have to manually organize the plots of interest and resize them in order to fit
within a single screen. A static interface would usually not support inter-plot
linkage, and even if it did, the entire set of plots would have to be regenerated
each time a selection is made. In contrast, an interactive visualization would
provide an easy way to automatically organize and resize the set of plots to fit
within a screen. Linking the set of plots would be easy, and in response to the
users selection on one plot, the appropriate selection would be automatically
highlighted in the other plots (see example in Figure 3.13).
MULTIDIMENSIONAL VISUALIZATION 79
FIGURE 3.13 MULTIPLE INTER-LINKED PLOTS IN A SINGLE VIEW (USING SPOTFIRE). NOTE THE
MARKED OBSERVATION IN THE TOP-LEFT PANEL, WHICH IS ALSO HIGHLIGHTED
IN ALL OTHER PLOTS
In earlier sections, we used plots to illustrate the advantages of visualizations,
because “a picture is worth a thousand words.” The advantages of an interactive
visualization are even harder to convey in words. As Ben Shneiderman, a well-
known researcher in information visualization and interfaces, puts it:
A picture is worth a thousand words. An interface is worth a thousand pictures.
The ability to interact with plots, and link them together turns plotting
into an analytical tool that supports continuous exploration of the data. Sev-
eral commercial visualization tools provide powerful capabilities along these
lines; two very popular ones are Spotfire (http://spotfire.tibco.com) and Tableau
(www.tableausoftware.com); Figure 3.13 was generated using Spotfire.
Tableau and Spotfire have spent hundreds of millions of dollars on software
R&D and review of interactions with customers to hone interfaces that allow
analysts to interact with their data via plots smoothly and efficiently. It is difficult
to replicate in a programming language like R the sophisticated and highly engi-
neered user interface required for rapid progression through different exploratory
views of the data. The need is there, however, and the R community is moving
http://spotfire.tibco.com
http://www.tableausoftware.com
80 DATA VISUALIZATION
to provide the capability to provide interactivity in plots. The widespread use of
javascript in web development has led some programmers to provide R wrap-
pers for javascript plotting tools such as Highcharts and Plotly. See “Interactive
Charts in R” at http://flowingdata.com/2016/10/21/interactive-charts-in-r/.
The tool ggvis, from Hadley Wickham at RStudio, also provides interactivity
capabilities for R plots. As of the time of this writing, interactivity tools in R
were evolving, and R programmers are likely to see more and higher level tools
that will allow them to develop custom interactive plots quickly, plots that can
be deployed and used by other non-programming analysts in the organization.
3.5 Specialized Visualizations
In this section, we mention a few specialized visualizations that are able to capture
data structures beyond the standard time series and cross-sectional structures—
special types of relationships that are usually hard to capture with ordinary plots.
In particular, we address hierarchical data, network data, and geographical data—
three types of data that are becoming increasingly available.
Visualizing Networked Data
Network analysis techniques were spawned by the explosion of social and prod-
uct network data. Examples of social networks are networks of sellers and buyers
on eBay and networks of users on Facebook. An example of a product network
is the network of products on Amazon (linked through the recommendation
system). Network data visualization is available in various network-specialized
software, and also in general-purpose software.
A network diagram consists of actors and relations between them. “Nodes”
are the actors (e.g., users in a social network or products in a product network),
and represented by circles. “Edges” are the relations between nodes, and are
represented by lines connecting nodes. For example, in a social network such as
Facebook, we can construct a list of users (nodes) and all the pairwise relations
(edges) between users who are “Friends.” Alternatively, we can define edges as
a posting that one user posts on another user’s Facebook page. In this setup,
we might have more than a single edge between two nodes. Networks can also
have nodes of multiple types. A common structure is networks with two types of
nodes. An example of a two-type node network is shown in Figure 3.14, where
we see a set of transactions between a network of sellers and buyers on the online
auction site www.eBay.com [the data are for auctions selling Swarovski beads,
and took place during a period of several months; from Jank and Yahav (2010)].
The black circles represent sellers and the grey circles represent buyers. Circle size
represents the number of transactions that the node (seller or buyer) was involved
in within this network. Line width represents the number of auctions that the
http://flowingdata.com/2016/10/21/interactive-charts-in-r/
http://www.eBay.com
SPECIALIZED VISUALIZATIONS 81
FIGURE 3.14 NETWORK GRAPH OF EBAY SELLERS (BLACK CIRCLES) AND BUYERS (GREY
CIRCLES) OF SWAROVSKI BEADS. CIRCLE SIZE REPRESENTS THE NODE’S NUMBER
OF TRANSACTIONS. LINE WIDTH REPRESENTS THE NUMBER OF TRANSACTIONS
BETWEEN THAT PAIR OF SELLER–BUYER
code for creating Figure 3.14
library(igraph)
ebay.df <- read.csv("eBayNetwork.csv")
# transform node ids to factors
ebay.df[,1] <- as.factor(ebay.df[,1])
ebay.df[,2] <- as.factor(ebay.df[,2])
graph.edges <- as.matrix(ebay.df[,1:2])
g <- graph.edgelist(graph.edges, directed = FALSE)
isBuyer <- V(g)$name %in% graph.edges[,2]
plot(g, vertex.label = NA, vertex.color = ifelse(isBuyer, "gray", "black"),
vertex.size = ifelse(isBuyer, 7, 10))
bidder–seller pair interacted in. We can see that this marketplace is dominated
by three or four high-volume sellers. We can also see that many buyers interact
with a single seller. The market structures for many individual products could
82 DATA VISUALIZATION
be reviewed quickly in this way. Network providers could use the information,
for example, to identify possible partnerships to explore with sellers.
Figure 3.14 was produced using R’s igraph package. Another useful pack-
age, especially for social network analysis, is sna. Using these packages, networks
can be imported from social network websites such as Twitter and Facebook.
The graph’s appearance can be customized and various features are available such
as filtering nodes and edges, 3D visualization, altering the graph’s layout, find-
ing clusters of related nodes, calculating graph metrics, and performing network
analysis (see Chapter 19 for details and examples).
Network graphs can be potentially useful in the context of association rules
(see Chapter 14). For example, consider a case of mining a dataset of consumers’
grocery purchases to learn which items are purchased together (“what goes with
what”). A network can be constructed with items as nodes and edges connecting
items that were purchased together. After a set of rules is generated by the data
mining algorithm (which often contains an excessive number of rules, many of
which are unimportant), the network graph can help visualize different rules for
the purpose of choosing the interesting ones. For example, a popular “beer and
diapers” combination would appear in the network graph as a pair of nodes with
very high connectivity. An item which is almost always purchased regardless of
other items (e.g., milk) would appear as a very large node with high connectivity
to all other nodes.
Visualizing Hierarchical Data: Treemaps
We discussed hierarchical data and the exploration of data at different hierarchy
levels in the context of plot manipulations. Treemaps are useful visualizations
specialized for exploring large data sets that are hierarchically structured (tree-
structured). They allow exploration of various dimensions of the data while
maintaining the hierarchical nature of the data. An example is shown in Fig-
ure 3.15, which displays a large set of auctions from eBay.com,4 hierarchically
ordered by item category, sub-category, and brand. The levels in the hierarchy
of the treemap are visualized as rectangles containing sub-rectangles. Categorical
variables can be included in the display by using hue. Numerical variables can
be included via rectangle size and color intensity (ordering of the rectangles is
sometimes used to reinforce size). In the example in Figure 3.15, size is used to
represent the average closing price (which reflects item value) and color intensity
represents the percent of sellers with negative feedback (a negative seller feed-
back indicates buyer dissatisfaction in past transactions and is often indicative of
fraudulent seller behavior). Consider the task of classifying ongoing auctions
in terms of a fraudulent outcome. From the treemap, we see that the highest
4We thank Sharad Borle for sharing this dataset.
SPECIALIZED VISUALIZATIONS 83
negative.feedback
0.000 0.002 0.004 0.006 0.008 0.010 0.012 0.014
Staplers
Dewalt_Cordless_Drill
Bausch_and_Laumb_Microscope
Oakley_Sunglasses
Brioni_Tie Zegna_TieCross_PenWaterman_Pen
HP_Inkjet_Color_Printer
Casio_Calculator Sharp_Calculator
Celestron_Telescope
Hair_DryerShaver
Cartier_WristwatchRolex_Wristwatch
Samsonite_Luggage
Rookwood_Vase
Roseville_Vase
Callaway_Golfbag
Callaway_Golfballs
Ping_Golfbag
Titleist_Golfballs
Desktop
accessoriesElectric Drills
Microscopes
Sunglasses
Neck tiesPremium Pens
Computer
AccessoriesCalculators
Telescopes
Hair
Care
Men's electric
shavers
Premium wristwatches
Luggage bags
Collectible Pottery Golf
Business & Industrial
Clothing & accessories
Clothing shoes
& accessories
Collectibles
ComputersConsumer Electronics
Health & Beauty
Jewelry & watches
Luggage
Pottery & Glass Sports
FIGURE 3.15 TREEMAP SHOWING NEARLY 11,000 EBAY AUCTIONS, ORGANIZED BY ITEM
CATEGORY, SUBCATEGORY, AND BRAND. RECTANGLE SIZE REPRESENTS AVERAGE
CLOSING PRICE (REFLECTING ITEM VALUE). SHADE REPRESENTS PERCENTAGE OF
SELLERS WITH NEGATIVE FEEDBACK (DARKER = HIGHER)
code for creating Figure 3.15
library(treemap)
tree.df <- read.csv("EbayTreemap.csv")
# add column for negative feedback
tree.df$negative.feedback <- 1* (tree.df$Seller.Feedback < 0)
# draw treemap
treemap(tree.df, index = c("Category","Sub.Category", "Brand"),
vSize = "High.Bid", vColor = "negative.feedback", fun.aggregate = "mean",
align.labels = list(c("left", "top"), c("right", "bottom"), c("center", "center")),
palette = rev(gray.colors(3)), type = "manual", title = "")
proportion of sellers with negative ratings (black) is concentrated in expensive
item auctions (Rolex and Cartier wristwatches).
Ideally, treemaps should be explored interactively, zooming to different levels
of the hierarchy. One example of an interactive online application of treemaps
is currently available at www.drasticdata.nl. One of their treemap examples dis-
plays player-level data from the 2014 World Cup, aggregated to team level. The
user can choose to explore players and team data.
Visualizing Geographical Data: Map Charts
Many datasets used for data mining now include geographical information. Zip
codes are one example of a categorical variable with many categories, where it
is not straightforward to create meaningful variables for analysis. Plotting the
data on a geographic map can often reveal patterns that are harder to identify
http://www.drasticdata.nl
84 DATA VISUALIZATION
otherwise. A map chart uses a geographical map as its background; then color,
hue, and other features are used to include categorical or numerical variables.
Besides specialized mapping software, maps are now becoming part of general-
purpose software, and Google Maps provides APIs (application programming
interfaces) that allow organizations to overlay their data on a Google map. While
Google Maps is readily available, resulting map charts (such as Figure 3.16) are
0
20
40
60
−150 −125 −100 −75 −50
lon
la
t
FIGURE 3.16 MAP CHART OF STUDENTS’ AND INSTRUCTORS’ LOCATIONS ON A GOOGLE MAP
(FROM STATISTICS.COM)
code for creating Figure 3.16
library(ggmap)
SCstudents <- read.csv("SC-US-students-GPS-data-2016.csv")
Map <- get_map("Denver, CO", zoom = 3)
ggmap(Map) + geom_point(aes(x = longitude, y = latitude), data = SCstudents,
alpha = 0.4, colour = "red", size = 0.5)
SPECIALIZED VISUALIZATIONS 85
somewhat inferior in terms of effectiveness compared to map charts in dedicated
interactive visualization software.
Figure 3.17 shows two world map charts (created with Spotfire), comparing
countries’ “well-being” (according to a 2006 Gallup survey) in the top map, to
gross domestic product (GDP) in the bottom map. Lighter shade means higher
value.
4
6
8
Happiness
5.0e+12
1.0e+13
1.5e+13
GDP2015
FIGURE 3.17 WORLD MAPS COMPARING “WELL-BEING” (TOP) TO GDP (BOTTOM). SHADING BY
AVERAGE “GLOBAL WELL-BEING” SCORE (TOP) OR GDP (BOTTOM) OF COUNTRY.
LIGHTER CORRESPONDS TO HIGHER SCORE OR LEVEL. DATA FROM VEENHOVEN’S
WORLD DATABASE OF HAPPINESS
code for creating Figure 3.17
library(mosaic)
gdp.df <- read.csv("gdp.csv", skip = 4, stringsAsFactors = FALSE)
names(gdp.df)[5] <- "GDP2015"
happiness.df <- read.csv("Veerhoven.csv")
# gdp map
mWorldMap(gdp.df, key = "Country.Name", fill = "GDP2015") + coord_map()
# eell-being map
mWorldMap(happiness.df, key = "Nation", fill = "Score") + coord_map() +
scale_fill_continuous(name = "Happiness")
86 DATA VISUALIZATION
3.6 Summary: Major Visualizations and
Operations, by Data Mining Goal
Prediction
• Plot outcome on the y-axis of boxplots, bar charts, and scatter plots.
• Study relation of outcome to categorical predictors via side-by-side box-
plots, bar charts, and multiple panels.
• Study relation of outcome to numerical predictors via scatter plots.
• Use distribution plots (boxplot, histogram) for determining needed trans-
formations of the outcome variable (and/or numerical predictors).
• Examine scatter plots with added color/panels/size to determine the need
for interaction terms.
• Use various aggregation levels and zooming to determine areas of the
data with different behavior, and to evaluate the level of global vs. local
patterns.
Classification
• Study relation of outcome to categorical predictors using bar charts with
the outcome on the y-axis.
• Study relation of outcome to pairs of numerical predictors via color-coded
scatter plots (color denotes the outcome).
• Study relation of outcome to numerical predictors via side-by-side box-
plots: Plot boxplots of a numerical variable by outcome. Create similar
displays for each numerical predictor. The most separable boxes indicate
potentially useful predictors.
• Use color to represent the outcome variable on a parallel coordinate plot.
• Use distribution plots (boxplot, histogram) for determining needed trans-
formations of numerical predictor variables.
• Examine scatter plots with added color/panels/size to determine the need
for interaction terms.
• Use various aggregation levels and zooming to determine areas of the
data with different behavior, and to evaluate the level of global vs. local
patterns.
Time Series Forecasting
• Create line graphs at different temporal aggregations to determine types
of patterns.
• Use zooming and panning to examine various shorter periods of the series
to determine areas of the data with different behavior.
SUMMARY: MAJOR VISUALIZATIONS AND OPERATIONS, BY DATA MINING GOAL 87
• Use various aggregation levels to identify global and local patterns.
• Identify missing values in the series (that will require handling).
• Overlay trend lines of different types to determine adequate modeling
choices.
Unsupervised Learning
• Create scatter plot matrices to identify pairwise relationships and cluster-
ing of observations.
• Use heatmaps to examine the correlation table.
• Use various aggregation levels and zooming to determine areas of the data
with different behavior.
• Generate a parallel coordinates plot to identify clusters of observations.
88 DATA VISUALIZATION
PROBLEMS
3.1 Shipments of Household Appliances: Line Graphs. The file ApplianceShip-
ments.csv contains the series of quarterly shipments (in millions of dollars) of US house-
hold appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using R.
b. Does there appear to be a quarterly pattern? For a closer view of the patterns, zoom
in to the range of 3500–5000 on the y-axis.
c. Using R, create one chart with four separate lines, one line for each of Q1, Q2, Q3,
and Q4. In R, this can be achieved by generating a data.frame for each quarter Q1,
Q2, Q3, Q4, and then plotting them as separate series on the line graph. Zoom
in to the range of 3500–5000 on the y-axis. Does there appear to be a difference
between quarters?
d. Using R, create a line graph of the series at a yearly aggregated level (i.e., the total
shipments in each year).
3.2 Sales of Riding Mowers: Scatter Plots. A company that manufactures riding
mowers wants to identify the best sales prospects for an intensive sales campaign. In
particular, the manufacturer is interested in classifying households as prospective own-
ers or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000 ft2). The
marketing expert looked at a random sample of 24 households, given in the file Riding-
Mowers.csv.
a. Using R, create a scatter plot of Lot Size vs. Income, color-coded by the outcome
variable owner/nonowner. Make sure to obtain a well-formatted plot (create legi-
ble labels and a legend, etc.).
3.3 Laptop Sales at a London Computer Chain: Bar Charts and Boxplots. The
file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain
in London in January 2008. This is a subset of the full dataset that includes data for
the entire year.
a. Create a bar chart, showing the average retail price by store. Which store has the
highest average? Which has the lowest?
b. To better compare retail prices across stores, create side-by-side boxplots of retail
price by store. Now compare the prices in the two stores from (a). Does there
seem to be a difference between their price distributions?
3.4 Laptop Sales at a London Computer Chain: Interactive Visualization. The
next exercises are designed for using an interactive visualization tool. The file LaptopSales.txt is
a comma-separated file with nearly 300,000 rows. ENBIS (the European Network for Business
and Industrial Statistics) provided these data as part of a contest organized in the fall of 2009.
Scenario: Imagine that you are a new analyst for a company called Acell (a company
selling laptops). You have been provided with data about products and sales. You need
to help the company with their business goal of planning a product strategy and pricing
policies that will maximize Acell’s projected revenues in 2009. Using an interactive
visualization tool, answer the following questions.
a. Price Questions:
i. At what price are the laptops actually selling?
ii. Does price change with time? (Hint: Make sure that the date column is recog-
nized as such. The software should then enable different temporal aggregation
PROBLEMS 89
choices, e.g., plotting the data by weekly or monthly aggregates, or even by
day of week.)
iii. Are prices consistent across retail outlets?
iv. How does price change with configuration?
b. Location Questions:
i. Where are the stores and customers located?
ii. Which stores are selling the most?
iii. How far would customers travel to buy a laptop?
◦ Hint 1: You should be able to aggregate the data, for example, plot the sum
or average of the prices.
◦ Hint 2: Use the coordinated highlighting between multiple visualizations in
the same page, for example, select a store in one view to see the matching
customers in another visualization.
◦ Hint 3: Explore the use of filters to see differences. Make sure to filter in
the zoomed out view. For example, try to use a “store location” slider as an
alternative way to dynamically compare store locations. This might be more
useful to spot outlier patterns if there were 50 store locations to compare.
iv. Try an alternative way of looking at how far customers traveled. Do this by
creating a new data column that computes the distance between customer and
store.
c. Revenue Questions:
i. How do the sales volume in each store relate to Acell’s revenues?
ii. How does this relationship depend on the configuration?
d. Configuration Questions:
i. What are the details of each configuration? How does this relate to price?
ii. Do all stores sell all configurations?
CHAPTER 4
Dimension Reduction
In this chapter, we describe the important step of dimension reduction. The
dimension of a dataset, which is the number of variables, must be reduced for
the data mining algorithms to operate efficiently. This process is part of the
pilot/prototype phase of data mining and is done before deploying a model.
We present and discuss several dimension reduction approaches: (1) Incorporat-
ing domain knowledge to remove or combine categories, (2) using data sum-
maries to detect information overlap between variables (and remove or combine
redundant variables or categories), (3) using data conversion techniques such
as converting categorical variables into numerical variables, and (4) employing
automated reduction techniques, such as principal components analysis (PCA),
where a new set of variables (which are weighted averages of the original vari-
ables) is created. These new variables are uncorrelated and a small subset of them
usually contains most of their combined information (hence, we can reduce
dimension by using only a subset of the new variables). Finally, we mention
data mining methods such as regression models and classification and regression
trees, which can be used for removing redundant variables and for combining
“similar” categories of categorical variables.
4.1 Introduction
In data mining, one often encounters situations where there is a large number of
variables in the database. Even when the initial number of variables is small, this
set quickly expands in the data preparation step, where new derived variables
are created (e.g., dummies for categorical variables and new forms of existing
variables). In such situations, it is likely that subsets of variables are highly cor-
related with each other. Including highly correlated variables in a classification
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
91
92 DIMENSION REDUCTION
or prediction model, or including variables that are unrelated to the outcome of
interest, can lead to overfitting, and accuracy and reliability can suffer. A large
number of variables also poses computational problems for some supervised as
well as unsupervised algorithms (aside from questions of correlation). In model
deployment, superfluous variables can increase costs due to the collection and
processing of these variables.
4.2 Curse of Dimensionality
The dimensionality of a model is the number of predictors or input variables used
by the model. The curse of dimensionality is the affliction caused by adding vari-
ables to multivariate data models. As variables are added, the data space becomes
increasingly sparse, and classification and prediction models fail because the avail-
able data are insufficient to provide a useful model across so many variables. An
important consideration is the fact that the difficulties posed by adding a variable
increase exponentially with the addition of each variable. One way to think of
this intuitively is to consider the location of an object on a chessboard. It has
two dimensions and 64 squares or choices. If you expand the chessboard to a
cube, you increase the dimensions by 50%—from 2 dimensions to 3 dimen-
sions. However, the location options increase by 800%, to 512 (8 × 8 × 8).
In statistical distance terms, the proliferation of variables means that nothing is
close to anything else anymore—too much noise has been added and patterns
and structure are no longer discernible. The problem is particularly acute in Big
Data applications, including genomics, where, for example, an analysis might
have to deal with values for thousands of different genes. One of the key steps in
data mining, therefore, is finding ways to reduce dimensionality with minimal
sacrifice of accuracy. In the artificial intelligence literature, dimension reduction
is often referred to as factor selection or feature extraction.
4.3 Practical Considerations
Although data mining prefers automated methods over domain knowledge, it is
important at the first step of data exploration to make sure that the variables mea-
sured are reasonable for the task at hand. The integration of expert knowledge
through a discussion with the data provider (or user) will probably lead to better
results. Practical considerations include: Which variables are most important
for the task at hand, and which are most likely to be useless? Which variables
are likely to contain much error? Which variables will be available for mea-
surement (and what will it cost to measure them) in the future if the analysis is
PRACTICAL CONSIDERATIONS 93
repeated? Which variables can actually be measured before the outcome occurs?
For example, if we want to predict the closing price of an ongoing online auc-
tion, we cannot use the number of bids as a predictor because this will not be
known until the auction closes.
Example 1: House Prices in Boston
We return to the Boston Housing example introduced in Chapter 3. For each
neighborhood, a number of variables are given, such as the crime rate, the stu-
dent/teacher ratio, and the median value of a housing unit in the neighborhood.
A description of all 14 variables is given in Table 4.1. The first nine records of
the data are shown in Table 4.2. The first row represents the first neighborhood,
which had an average per capita crime rate of 0.006, 18% of the residential land
zoned for lots over 25,000 ft2, 2.31% of the land devoted to nonretail business,
no border on the Charles River, and so on.
TABLE 4.1 DESCRIPTION OF VARIABLES IN THE BOSTON HOUSING DATASET
CRIM Crime rate
ZN Percentage of residential land zoned for lots over 25,000 ft2
INDUS Percentage of land occupied by nonretail business
CHAS Does tract bound Charles River? (= 1 if tract bounds river, = 0 otherwise)
NOX Nitric oxide concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Percentage of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property tax rate per $10,000
PTRATIO Pupil-to-teacher ratio by town
LSTAT Percentage of lower status of the population
MEDV Median value of owner-occupied homes in $1000s
CAT.MEDV Is median value of owner-occupied homes in tract above $30,000 (CAT.MEDV = 1) or
not (CAT.MEDV = 0)?
TABLE 4.2 FIRST NINE RECORDS IN THE BOSTON HOUSING DATA
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT.MEDV
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 4.98 24.0 0
0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6 0
0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 1
0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4 1
0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2 1
0.02985 0.0 2.18 0 0.458 6.43 58.7 6.0622 3 222 18.7 5.21 28.7 0
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 12.43 22.9 0
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 19.15 27.1 0
0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 29.93 16.5 0
94 DIMENSION REDUCTION
4.4 Data Summaries
As we have seen in the chapter on data visualization, an important initial step of
data exploration is getting familiar with the data and their characteristics through
summaries and graphs. The importance of this step cannot be overstated. The
better you understand the data, the better the results from the modeling or min-
ing process.
Numerical summaries and graphs of the data are very helpful for data reduc-
tion. The information that they convey can assist in combining categories of
a categorical variable, in choosing variables to remove, in assessing the level of
information overlap between variables, and more. Before discussing such strate-
gies for reducing the dimension of a data set, let us consider useful summaries
and tools.
Summary Statistics
R has several functions and facilities that assist in summarizing data. The func-
tion summary() gives an overview of the entire set of variables in the data. The
functions mean(), sd(), min(), max(), median(), and length() are also very helpful
for learning about the characteristics of each variable. First, they give us infor-
mation about the scale and type of values that the variable takes. The min and
max functions can be used to detect extreme values that might be errors. The
mean and median give a sense of the central values of that variable, and a large
deviation between the two also indicates skew. The standard deviation gives a
sense of how dispersed the data are (relative to the mean). Further options, such
as sum(is.na(variable)), which gives the number of null values, can tell us about
missing values.
Table 4.3 shows summary statistics (and R code) for the Boston Housing
example. We immediately see that the different variables have very different
ranges of values. We will soon see how variation in scale across variables can
distort analyses if not treated properly. Another observation that can be made
is that the mean of the first variable, CRIM (as well as several others), is much
larger than the median, indicating right skew. None of the variables have missing
values. There also do not appear to be indications of extreme values that might
result from typing errors.
Next, we summarize relationships between two or more variables. For
numerical variables, we can compute a complete matrix of correlations between
each pair of variables, using the R function cor(). Table 4.4 shows the correlation
matrix for the Boston Housing variables. We see that most correlations are low
and that many are negative. Recall also the visual display of a correlation matrix
via a heatmap (see Figure 3.4 in Chapter 3 for the heatmap corresponding to this
DATA SUMMARIES 95
TABLE 4.3 SUMMARY STATISTICS FOR THE BOSTON HOUSING DATA
code for summary statistics
boston.housing.df <- read.csv("BostonHousing.csv", header = True)
head(boston.housing.df, 9)
summary(boston.housing.df)
# compute mean, standard dev., min, max, median, length, and missing values of CRIM
mean(boston.housing.df$CRIM)
sd(boston.housing.df$CRIM)
min(boston.housing.df$CRIM)
max(boston.housing.df$CRIM)
median(boston.housing.df$CRIM)
length(boston.housing.df$CRIM)
# find the number of missing values of variable CRIM
sum(is.na(boston.housing.df$CRIM))
# compute mean, standard dev., min, max, median, length, and missing values for all
# variables
data.frame(mean=sapply(boston.housing.df, mean), +
+ sd=sapply(boston.housing.df, sd), +
+ min=sapply(boston.housing.df, min), +
+ max=sapply(boston.housing.df, max), +
+ median=sapply(boston.housing.df, median), +
+ length=sapply(boston.housing.df, length) +
+ miss.val=sapply(boston.housing.df, function(x)
+ sum(length(which(is.na(x))))))) )
Output
> data.frame(mean=sapply(boston.housing.df, mean),
+ sd=sapply(boston.housing.df, sd),
+ min=sapply(boston.housing.df, min),
+ max=sapply(boston.housing.df, max),
+ median=sapply(boston.housing.df, median),
+ length=sapply(boston.housing.df, length)
+ miss.val=sapply(boston.housing.df,
function(x) sum(length(which(is.na(x)))))))
mean sd min max median length miss.val
CRIM 3.61352356 8.6015451 0.00632 88.9762 0.25651 506 0
ZN 11.36363636 23.3224530 0.00000 100.0000 0.00000 506 0
INDUS 11.13677866 6.8603529 0.46000 27.7400 9.69000 506 0
CHAS 0.06916996 0.2539940 0.00000 1.0000 0.00000 506 0
NOX 0.55469506 0.1158777 0.38500 0.8710 0.53800 506 0
RM 6.28463439 0.7026171 3.56100 8.7800 6.20850 506 0
AGE 68.57490119 28.1488614 2.90000 100.0000 77.50000 506 0
DIS 3.79504269 2.1057101 1.12960 12.1265 3.20745 506 0
RAD 9.54940711 8.7072594 1.00000 24.0000 5.00000 506 0
TAX 408.23715415 168.5371161 187.00000 711.0000 330.00000 506 0
PTRATIO 18.45553360 2.1649455 12.60000 22.0000 19.05000 506 0
LSTAT 12.65306324 7.1410615 1.73000 37.9700 11.36000 506 0
MEDV 22.53280632 9.1971041 5.00000 50.0000 21.20000 506 0
CAT.MEDV 0.16600791 0.3724560 0.00000 1.0000 0.00000 506 0
96 DIMENSION REDUCTION
TABLE 4.4 CORRELATION TABLE FOR BOSTON HOUSING DATA
> round(cor(boston.housing.df),2)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT MEDV CAT.MEDV
CRIM 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29 0.46 -0.39 -0.15
ZN -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39 -0.41 0.36 0.37
INDUS 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38 0.60 -0.48 -0.37
CHAS -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12 -0.05 0.18 0.11
NOX 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19 0.59 -0.43 -0.23
RM -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36 -0.61 0.70 0.64
AGE 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26 0.60 -0.38 -0.19
DIS -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23 -0.50 0.25 0.12
RAD 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46 0.49 -0.38 -0.20
TAX 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46 0.54 -0.47 -0.27
PTRATIO 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00 0.37 -0.51 -0.44
LSTAT 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37 1.00 -0.74 -0.47
MEDV -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51 -0.74 1.00 0.79
CAT.MEDV -0.15 0.37 -0.37 0.11 -0.23 0.64 -0.19 0.12 -0.20 -0.27 -0.44 -0.47 0.79 1.00
correlation table). We will return to the importance of the correlation matrix
soon, in the context of correlation analysis.
Aggregation and Pivot Tables
Another very useful approach for exploring the data is aggregation by one or
more variables. For aggregation by a single variable, we can use table(). For
example, Table 4.5 shows the number of neighborhoods that bound the Charles
River vs. those that do not (the variable CHAS is chosen as the grouping vari-
able). It appears that the majority of neighborhoods (471 of 506) do not bound
the river.
The aggregate() function can be used for aggregating by one or more variables,
and computing a range of summary statistics (count, average, percentage, etc.).
For categorical variables, we obtain a breakdown of the records by the combina-
tion of categories. For instance, in Table 4.6, we compute the average MEDV
by CHAS and RM. Note that the numerical variable RM (the average number
of rooms per dwelling in the neighborhood) should be first grouped into bins of
size 1 (0–1, 1–2, and so on). Note the empty values, denoting that there are no
neighborhoods in the dataset with those combinations (e.g., bounding the river
and having on average 3 rooms).
TABLE 4.5 NUMBER OF NEIGHBORHOODS THAT BOUND THE CHARLES RIVER VS. THOSE THAT
DO NOT
> boston.housing.df <- read.csv("BostonHousing.csv")
> table(boston.housing.df$CHAS)
0 1
471 35
CORRELATION ANALYSIS 97
TABLE 4.6 AVERAGE MEDV BY CHAS AND RM
code for aggregating MEDV by CHAS and RM
# create bins of size 1
boston.housing.df$RM.bin <- .bincode(boston.housing.df$RM, c(1:9))
# compute the average of MEDV by (binned) RM and CHAS
# in aggregate() use the argument by= to define the list of aggregating variables,
# and FUN= as an aggregating function.
aggregate(boston.housing.df$MEDV, by=list(RM=boston.housing.df$RM.bin,
CHAS=boston.housing.df$CHAS), FUN=mean)
Output
RM CHAS x
1 3 0 25.30000
2 4 0 15.40714
3 5 0 17.2000
4 6 0 21.76917
5 7 0 35.96444
6 8 0 45.70000
7 5 1 22.21818
8 6 1 25.91875
9 7 1 44.06667
10 8 1 35.95000
Another useful set of functions are melt() and cast() in the reshape package,
that allow the creation of pivot tables. melt() takes a set of columns and stacks
them into a single column. cast() then reshapes the single column into multiple
columns by the aggregating variables of our choice. For example, Table 4.7 com-
putes the average of MEDV by CHAS and RM and presents it as a pivot table.
In classification tasks, where the goal is to find predictor variables that dis-
tinguish between two classes, a good exploratory step is to produce summaries
for each class. This can assist in detecting useful predictors that display some
separation between the two classes. Data summaries are useful for almost any
data mining task and are therefore an important preliminary step for cleaning
and understanding the data before carrying out further analyses.
4.5 Correlation Analysis
In datasets with a large number of variables (which are likely to serve as pre-
dictors), there is usually much overlap in the information covered by the set of
variables. One simple way to find redundancies is to look at a correlation matrix.
This shows all the pairwise correlations between variables. Pairs that have a very
strong (positive or negative) correlation contain a lot of overlap in information
and are good candidates for data reduction by removing one of the variables.
98 DIMENSION REDUCTION
TABLE 4.7 PIVOT TABLES IN R
code for creating pivot tables using functions melt() and cast()
# use install.packages("reshape") the first time the package is used
library(reshape)
boston.housing.df <- read.csv("BostonHousing.csv")
# create bins of size 1
boston.housing.df$RM.bin <- .bincode(boston.housing.df$RM, c(1:9))
# use melt() to stack a set of columns into a single column of data.
# stack MEDV values for each combination of (binned) RM and CHAS
mlt <- melt(boston.housing.df, id=c("RM.bin", "CHAS"), measure=c("MEDV"))
head(mlt, 5)
# use cast() to reshape data and generate pivot table
cast(mlt, RM.bin ~ CHAS, subset=variable=="MEDV",
margins=c("grand_row", "grand_col"), mean)
Output
> mlt <- melt(boston.housing.df, id=c("RM.bin", "CHAS"), measure=c("MEDV"))
> head(mlt, 5)
RM.bin CHAS variable value
1 6 0 MEDV 24.0
2 6 0 MEDV 21.6
3 7 0 MEDV 34.7
4 6 0 MEDV 33.4
5 7 0 MEDV 36.2
> cast(mlt, RM.bin ~ CHAS, subset=variable==”MEDV”,
margins=c(“grand_row”, “grand_col”), mean)
RM.bin 0 1 (all)
1 3 25.30000 NaN 25.30000
2 4 15.40714 NaN 15.40714
3 5 17.20000 22.21818 17.55159
4 6 21.76917 25.91875 22.01599
5 7 35.96444 44.06667 36.91765
6 8 45.70000 35.95000 44.20000
7 (all) 22.09384 28.44000 22.53281
Removing variables that are strongly correlated to others is useful for avoiding
multicollinearity problems that can arise in various models. (Multicollinearity is
the presence of two or more predictors sharing the same linear relationship with
the outcome variable; R handles this automatically in regression.)
Correlation analysis is also a good method for detecting duplications of vari-
ables in the data. Sometimes, the same variable appears accidentally more than
once in the dataset (under a different name) because the dataset was merged from
multiple sources, the same phenomenon is measured in different units, and so
REDUCING THE NUMBER OF CATEGORIES IN CATEGORICAL VARIABLES 99
on. Using correlation table heatmaps, as shown in Chapter 3, can make the task
of identifying strong correlations easier.
4.6 Reducing the Number of Categories in
Categorical Variables
When a categorical variable has many categories, and this variable is destined
to be a predictor, many data mining methods will require converting it into
many dummy variables. In particular, a variable with m categories will be trans-
formed into either m or m − 1 dummy variables (depending on the method).
This means that even if we have very few original categorical variables, they can
greatly inflate the dimension of the dataset. One way to handle this is to reduce
the number of categories by combining close or similar categories. Combining
categories requires incorporating expert knowledge and common sense. Pivot
tables are useful for this task: We can examine the sizes of the various categories
and how the outcome variable behaves in each category. Generally, categories
that contain very few observations are good candidates for combining with other
categories. Use only the categories that are most relevant to the analysis and label
the rest as “other.” In classification tasks (with a categorical outcome variable), a
pivot table broken down by the outcome classes can help identify categories that
do not separate the classes. Those categories too are candidates for inclusion in
the “other” category. An example is shown in Figure 4.1, where the distribution
of outcome variable CAT.MEDV is broken down by ZN (treated here as a cate-
gorical variable). We can see that the distribution of CAT.MEDV is identical for
ZN = 17.5, 90, 95, and 100 (where all neighborhoods have CAT.MEDV = 1).
These four categories can then be combined into a single category. Similarly,
categories ZN = 12.5, 25, 28, 30, and 70 can be combined. Further combina-
tion is also possible based on similar bars.
In a time series context where we might have a categorical variable denoting
season (such as month, or hour of day) that will serve as a predictor, reducing
categories can be done by examining the time series plot and identifying similar
periods. For example, the time plot in Figure 4.2 shows the quarterly revenues
of Toys “R” Us between 1992–1995. Only quarter 4 periods appear different,
and therefore, we can combine quarters 1–3 into a single category.
4.7 Converting a Categorical Variable to a
Numerical Variable
Sometimes the categories in a categorical variable represent intervals. Common
examples are age group or income bracket. If the interval values are known
(e.g., category 2 is the age interval 20–30), we can replace the categorical value
100 DIMENSION REDUCTION
0 17.5 20 22 28 33 35 45 55 70 80 85 95
Distribution of CAT.MEDV by ZN
ZN
0%
20
%
40
%
60
%
80
%
10
0%
FIGURE 4.1 DISTRIBUTION OF CAT.MEDV (BLACK DENOTES CAT.MEDV = 0) BY ZN. SIMILAR
BARS INDICATE LOW SEPARATION BETWEEN CLASSES, AND CAN BE COMBINED
code for creating Figure 4.1
library(ggmap)
boston.housing.df <- read.csv("BostonHousing.csv")
tbl <- table(boston.housing.df$CAT..MEDV, boston.housing.df$ZN)
prop.tbl <- prop.table(tbl, margin=2)
barplot(prop.tbl, xlab="ZN", ylab="", yaxt="n",main="Distribution of CAT.MEDV by ZN")
axis(2, at=(seq(0,1, 0.2)), paste(seq(0,100,20), "%"))
Time
R
ev
en
ue
(
$
m
ill
io
ns
)
1992 1993 1994 1995 1996
10
00
15
00
20
00
25
00
30
00
35
00
40
00
FIGURE 4.2 QUARTERLY REVENUES OF TOYS “R” US, 1992–1995
PRINCIPAL COMPONENTS ANALYSIS 101
(“2” in the example) with the mid-interval value (here “25”). The result will
be a numerical variable which no longer requires multiple dummy variables.
4.8 Principal Components Analysis
Principal components analysis (PCA) is a useful method for dimension reduction,
especially when the number of variables is large. PCA is especially valuable when
we have subsets of measurements that are measured on the same scale and are
highly correlated. In that case, it provides a few variables (often as few as three)
that are weighted linear combinations of the original variables, and that retain
the majority of the information of the full original set. PCA is intended for
use with numerical variables. For categorical variables, other methods such as
correspondence analysis are more suitable.
Example 2: Breakfast Cereals
Data were collected on the nutritional information and consumer rating of 77
breakfast cereals.1 The consumer rating is a rating of cereal “healthiness” for
consumer information (not a rating by consumers). For each cereal, the data
include 13 numerical variables, and we are interested in reducing this dimen-
sion. For each cereal, the information is based on a bowl of cereal rather than a
serving size, because most people simply fill a cereal bowl (resulting in constant
volume, but not weight). A snapshot of these data is given in Figure 4.3, and
the description of the different variables is given in Table 4.8.
We focus first on two variables: calories and consumer rating. These are given
in Table 4.9. The average calories across the 77 cereals is 106.88 and the average
consumer rating is 42.67. The estimated covariance matrix between the two
variables is
S =
[
379.63 −188.68
−188.68 197.32
]
.
It can be seen that the two variables are strongly correlated with a negative
correlation of
−0.69 = −188.68√
(379.63)(197.32)
.
Roughly speaking, 69% of the total variation in both variables is actually “co-
variation,” or variation in one variable that is duplicated by similar variation in
the other variable. Can we use this fact to reduce the number of variables, while
making maximum use of their unique contributions to the overall variation?
Since there is redundancy in the information that the two variables contain, it
1The data are available at http://lib.stat.cmu.edu/DASL/Stories/HealthyBreakfast.html.
http://lib.stat.cmu.edu/DASL/Stories/HealthyBreakfast.html
102 DIMENSION REDUCTION
Cereal Name mfr type calories protein fat sodium fiber carbo sugars potass vitamins
100% Bran N C 70 4 1 130 10 5 6 280 25
100% Natural Bran Q C 120 3 5 15 2 8 8 135 0
All-Bran K C 70 4 1 260 9 7 5 320 25
All-Bran with Extra Fiber K C 50 4 0 140 14 8 0 330 25
Almond Delight R C 110 2 2 200 1 14 8 25
Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5 10 70 25
Apple Jacks K C 110 2 0 125 1 11 14 30 25
Basic 4 G C 130 3 2 210 2 18 8 100 25
Bran Chex R C 90 2 1 200 4 15 6 125 25
Bran Flakes P C 90 3 0 210 5 13 5 190 25
Cap'n'Crunch Q C 120 1 2 220 0 12 12 35 25
Cheerios G C 110 6 2 290 2 17 1 105 25
Cinnamon Toast Crunch G C 120 1 3 210 0 13 9 45 25
Clusters G C 110 3 2 140 2 13 7 105 25
Cocoa Puffs G C 110 1 1 180 0 12 13 55 25
Corn Chex R C 110 2 0 280 0 22 3 25 25
Corn Flakes K C 100 2 0 290 1 21 2 35 25
Corn Pops K C 110 1 0 90 1 13 12 20 25
Count Chocula G C 110 1 1 180 0 12 13 65 25
Cracklin' Oat Bran K C 110 3 3 140 4 10 7 160 25
FIGURE 4.3 SAMPLE FROM THE 77 BREAKFAST CEREALS DATASET
might be possible to reduce the two variables to a single variable without losing
too much information. The idea in PCA is to find a linear combination of the
two variables that contains most, even if not all, of the information, so that this
new variable can replace the two original variables. Information here is in the
sense of variability: What can explain the most variability among the 77 cereals?
The total variability here is the sum of the variances of the two variables, which
TABLE 4.8 DESCRIPTION OF THE VARIABLES IN THE BREAKFAST CEREAL DATASET
Variable Description
mfr Manufacturer of cereal (American Home Food Products, General Mills, Kellogg, etc.)
type Cold or hot
calories Calories per serving
protein Grams of protein
fat Grams of fat
sodium Milligrams of sodium
fiber Grams of dietary fiber
carbo Grams of complex carbohydrates
sugars Grams of sugars
potass Milligrams of potassium
vitamins Vitamins and minerals: 0, 25, or 100, indicating the typical percentage of FDA
recommended
shelf Display shelf (1, 2, or 3, counting from the floor)
weight Weight in ounces of one serving
cups Number of cups in one serving
rating Rating of the cereal calculated by consumer reports
PRINCIPAL COMPONENTS ANALYSIS 103
TABLE 4.9 CEREAL CALORIES AND RATINGS
Cereal Calories Rating Cereal Calories Rating
100% Bran 70 68.40297 Just Right Fruit & Nut 140 36.471512
100% Natural Bran 120 33.98368 Kix 110 39.241114
All-Bran 70 59.42551 Life 100 45.328074
All-Bran with Extra
Fiber
50 93.70491 Lucky Charms 110 26.734515
Almond Delight 110 34.38484 Maypo 100 54.850917
Apple Cinnamon
Cheerios
110 29.50954 Muesli Raisins,
Dates & Almonds
150 37.136863
Apple Jacks 110 33.17409 Muesli Raisins,
Peaches & Pecans
150 34.139765
Basic 4 130 37.03856 Mueslix Crispy Blend 160 30.313351
Bran Chex 90 49.12025 Multi-Grain Cheerios 100 40.105965
Bran Flakes 90 53.31381 Nut&Honey Crunch 120 29.924285
Cap’n’Crunch 120 18.04285 Nutri-Grain
Almond-Raisin
140 40.69232
Cheerios 110 50.765 Nutri-grain Wheat 90 59.642837
Cinnamon Toast
Crunch
120 19.82357 Oatmeal Raisin Crisp 130 30.450843
Clusters 110 40.40021 Post Nat. Raisin Bran 120 37.840594
Cocoa Puffs 110 22.73645 Product 19 100 41.50354
Corn Chex 110 41.44502 Puffed Rice 50 60.756112
Corn Flakes 100 45.86332 Puffed Wheat 50 63.005645
Corn Pops 110 35.78279 Quaker Oat Squares 100 49.511874
Count Chocula 110 22.39651 Quaker Oatmeal 100 50.828392
Cracklin’ Oat Bran 110 40.44877 Raisin Bran 120 39.259197
Cream of Wheat (Quick) 100 64.53382 Raisin Nut Bran 100 39.7034
Crispix 110 46.89564 Raisin Squares 90 55.333142
Crispy Wheat & Raisins 100 36.1762 Rice Chex 110 41.998933
Double Chex 100 44.33086 Rice Krispies 110 40.560159
Froot Loops 110 32.20758 Shredded Wheat 80 68.235885
Frosted Flakes 110 31.43597 Shredded Wheat
’n’Bran
90 74.472949
Frosted Mini-Wheats 100 58.34514 Shredded Wheat
spoon size
90 72.801787
Fruit & Fibre Dates,
Walnuts & Oats
120 40.91705 Smacks 110 31.230054
Fruitful Bran 120 41.01549 Special K 110 53.131324
Fruity Pebbles 110 28.02577 Strawberry Fruit
Wheats
90 59.363993
Golden Crisp 100 35.25244 Total Corn Flakes 110 38.839746
Golden Grahams 110 23.80404 Total Raisin Bran 140 28.592785
Grape Nuts Flakes 100 52.0769 Total Whole Grain 100 46.658844
Grape-Nuts 110 53.37101 Triples 110 39.106174
Great Grains Pecan 120 45.81172 Trix 110 27.753301
Honey Graham Ohs 120 21.87129 Wheat Chex 100 49.787445
Honey Nut Cheerios 110 31.07222 Wheaties 100 51.592193
Honey-comb 110 28.74241 Wheaties Honey Gold 110 36.187559
Just Right Crunchy
Nuggets
110 36.52368
104 DIMENSION REDUCTION
in this case is 379.63 + 197.32 = 577. This means that calories accounts for 66%
= 379.63 / 577 of the total variability, and rating for the remaining 34%. If we
drop one of the variables for the sake of dimension reduction, we lose at least
34% of the total variability. Can we redistribute the total variability between two
new variables in a more polarized way? If so, it might be possible to keep only
the one new variable that (hopefully) accounts for a large portion of the total
variation.
Figure 4.4 shows a scatter plot of rating vs. calories. The line z1 is the direction
in which the variability of the points is largest. It is the line that captures the
most variation in the data if we decide to reduce the dimensionality of the data
from two to one. Among all possible lines, it is the line for which, if we project
the points in the dataset orthogonally to get a set of 77 (one-dimensional) values,
the variance of the z1 values will be maximum. This is called the first principal
component. It is also the line that minimizes the sum-of-squared perpendicular
distances from the line. The z2-axis is chosen to be perpendicular to the z1-axis.
In the case of two variables, there is only one line that is perpendicular to z1,
and it has the second largest variability, but its information is uncorrelated with
z1. This is called the second principal component. In general, when we have more
than two variables, once we find the direction z1 with the largest variability,
FIGURE 4.4 SCATTER PLOT OF RATING VS. CALORIES FOR 77 BREAKFAST CEREALS, WITH THE
TWO PRINCIPAL COMPONENT DIRECTIONS
PRINCIPAL COMPONENTS ANALYSIS 105
we search among all the orthogonal directions to z1 for the one with the next-
highest variability. That is z2. The idea is then to find the coordinates of these
lines and to see how they redistribute the variability.
Running PCA in R is done with the function prcomp(). Table 4.10 shows
the output from running PCA on the two variables calories and rating. The value
.rot of this function is the rotation matrix, which gives the weights that are used
to project the original points onto the two new directions. The weights for z1
are given by (0.847, −0.532), and for z2 they are given by (0.532, 0.847). The
function summary() gives the reallocated variance: z1 accounts for 86% of the
total variability and z2 for the remaining 14%. Therefore, if we drop z2, we still
maintain 86% of the total variability.
TABLE 4.10 PCA ON THE TWO VARIABLES CALORIES AND RATING
code for running PCA
cereals.df <- read.csv("Cereals.csv")
# compute PCs on two dimensions
pcs <- prcomp(data.frame(cereals.df$calories, cereals.df$rating))
summary(pcs)
pcs$rot
scores <- pcs$x
head(scores, 5)
Output
> summary(pcs)
Importance of components:
PC1 PC2
Standard deviation 22.3165 8.8844
Proportion of Variance 0.8632 0.1368
Cumulative Proportion 0.8632 1.0000
> pcs$rot
PC1 PC2
cereals.df.calories 0.8470535 0.5315077
cereals.df.rating -0.5315077 0.8470535
> scores <- pcs$x
> head(scores, 5)
PC1 PC2
[1,] -44.921528 2.1971833
[2,] 15.725265 -0.3824165
[3,] -40.149935 -5.4072123
[4,] -75.310772 12.9991256
[5,] 7.041508 -5.3576857
106 DIMENSION REDUCTION
The weights are used to compute principal component scores, which are
the projected values of calories and rating onto the new axes (after subtracting
the means). In R, scores for the two dimensions are depicted by the PCA
value .x. The first column is the projection onto z1 using the weights (0.847,
−0.532). The second column is the projection onto z2 using the weights (0.532,
0.847). For example, the first score for the 100% Bran cereal (with 70 calories
and a rating of 68.4) is (0.847)(70 − 106.88) + (−0.532)(68.4 − 42.67) =
−44.92.
Note that the means of the new variables z1 and z2 are zero, because
we’ve subtracted the mean of each variable. The sum of the variances
var(z1) + var(z2) is equal to the sum of the variances of the original variables,
var(calories)+var(rating). Furthermore, the variances of z1 and z2 are 498 and 79,
respectively, so the first principal component, z1, accounts for 86% of the total
variance. Since it captures most of the variability in the data, it seems reasonable
to use one variable, the first principal score, to represent the two variables in the
original data. Next, we generalize these ideas to more than two variables.
Principal Components
Let us formalize the procedure described above so that it can easily be generalized
to p > 2 variables. Denote the original p variables by X1, X2, . . . , Xp. In
PCA, we are looking for a set of new variables Z1, Z2, . . . , Zp that are weighted
averages of the original variables (after subtracting their mean):
Zi = ai,1(X1 − X̄1) + ai,2(X2 − X̄2) + · · ·+ ai,p(Xp − X̄p), i = 1, . . . , p
(4.1)
where each pair of Z’s has correlation = 0. We then order the resulting Z’s
by their variance, with Z1 having the largest variance and Zp having the small-
est variance. The software computes the weights ai,j , which are then used in
computing the principal component scores.
A further advantage of the principal components compared to the original
data is that they are uncorrelated (correlation coefficient = 0). If we construct
regression models using these principal components as predictors, we will not
encounter problems of multicollinearity.
Let us return to the breakfast cereal dataset with all 15 variables, and apply
PCA to the 13 numerical variables. The resulting output is shown in Table 4.11.
Note that the first three components account for more than 96% of the total
variation associated with all 13 of the original variables. This suggests that we
can capture most of the variability in the data with less than 25% of the original
dimensions in the data. In fact, the first two principal components alone capture
92.6% of the total variation. However, these results are influenced by the scales
of the variables, as we describe next.
PRINCIPAL COMPONENTS ANALYSIS 107
TABLE 4.11 PCA OUTPUT USING ALL 13 NUMERICAL VARIABLES IN THE BREAKFAST CEREALS
DATASET. THE TABLE SHOWS RESULTS FOR THE FIRST FIVE PRINCIPAL
COMPONENTS
> pcs <- prcomp(na.omit(cereals.df[,-c(1:3)]))
> summary(pcs)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 83.7641 70.9143 22.64375 19.18148 8.42323 2.09167 1.69942
Proportion of Variance 0.5395 0.3867 0.03943 0.02829 0.00546 0.00034 0.00022
Cumulative Proportion 0.5395 0.9262 0.96560 0.99389 0.99935 0.99968 0.99991
PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 0.77963 0.65783 0.37043 0.1864 0.06302 5.334e-08
Proportion of Variance 0.00005 0.00003 0.00001 0.0000 0.00000 0.000e+00
Cumulative Proportion 0.99995 0.99999 1.00000 1.0000 1.00000 1.000e+00
> pcs$rot[,1:5]
PC1 PC2 PC3 PC4 PC5
calories 0.0779841812 0.0093115874 -0.6292057595 -0.6010214629 0.454958508
protein -0.0007567806 -0.0088010282 -0.0010261160 0.0031999095 0.056175970
fat -0.0001017834 -0.0026991522 -0.0161957859 -0.0252622140 -0.016098458
sodium 0.9802145422 -0.1408957901 0.1359018583 -0.0009680741 0.013948118
fiber -0.0054127550 -0.0306807512 0.0181910456 0.0204721894 0.013605026
carbo 0.0172462607 0.0167832981 -0.0173699816 0.0259482087 0.349266966
sugars 0.0029888631 0.0002534853 -0.0977049979 -0.1154809105 -0.299066459
potass -0.1349000039 -0.9865619808 -0.0367824989 -0.0421757390 -0.047150529
vitamins 0.0942933187 -0.0167288404 -0.6919777623 0.7141179984 -0.037008623
shelf -0.0015414195 -0.0043603994 -0.0124888415 0.0056471836 -0.007876459
weight 0.0005120017 -0.0009992138 -0.0038059565 -0.0025464145 0.003022113
cups 0.0005101111 0.0015910125 -0.0006943214 0.0009853800 0.002148458
rating -0.0752962922 -0.0717421528 0.3079471212 0.3345338994 0.757708025
comment on missing values
Use function na.omit() to remove observations that contain missing values.
Normalizing the Data
A further use of PCA is to understand the structure of the data. This is done
by examining the weights to see how the original variables contribute to the
different principal components. In our example, it is clear that the first principal
component is dominated by the sodium content of the cereal: it has the highest
(in this case, positive) weight. This means that the first principal component is in
fact measuring how much sodium is in the cereal. Similarly, the second principal
component seems to be measuring the amount of potassium. Since both these
variables are measured in milligrams, whereas the other nutrients are measured
in grams, the scale is obviously leading to this result. The variances of potassium
and sodium are much larger than the variances of the other variables, and thus
the total variance is dominated by these two variances. A solution is to normalize
the data before performing the PCA. Normalization (or standardization) means
replacing each original variable by a standardized version of the variable that
108 DIMENSION REDUCTION
has unit variance. This is easily accomplished by dividing each variable by its
standard deviation. The effect of this normalization is to give all variables equal
importance in terms of variability.
When should we normalize the data like this? It depends on the nature
of the data. When the units of measurement are common for the variables
(e.g., dollars), and when their scale reflects their importance (sales of jet fuel,
sales of heating oil), it is probably best not to normalize (i.e., not to rescale the
data so that they have unit variance). If the variables are measured in different
units so that it is unclear how to compare the variability of different variables
(e.g., dollars for some, parts per million for others) or if for variables measured
in the same units, scale does not reflect importance (earnings per share, gross
revenues), it is generally advisable to normalize. In this way, the differences in
units of measurement do not affect the principal components’ weights. In the
rare situations where we can give relative weights to variables, we multiply the
normalized variables by these weights before doing the principal components
analysis.
Thus far, we have calculated principal components using the covariance
matrix. An alternative to normalizing and then performing PCA is to perform
PCA on the correlation matrix instead of the covariance matrix. Most software
programs allow the user to choose between the two. Remember that using the
correlation matrix means that you are operating on the normalized data.
Returning to the breakfast cereal data, we normalize the 13 variables due
to the different scales of the variables and then perform PCA (or equiva-
lently, we use PCA applied to the correlation matrix). The output is given in
Table 4.12.
Now we find that we need 7 principal components to account for more than
90% of the total variability. The first 2 principal components account for only
52% of the total variability, and thus reducing the number of variables to two
would mean losing a lot of information. Examining the weights, we see that
the first principal component measures the balance between 2 quantities: (1)
calories and cups (large positive weights) vs. (2) protein, fiber, potassium, and
consumer rating (large negative weights). High scores on principal component
1 mean that the cereal is high in calories and the amount per bowl, and low
in protein, and potassium. Unsurprisingly, this type of cereal is associated with
a low consumer rating. The second principal component is most affected by
the weight of a serving, and the third principal component by the carbohydrate
content. We can continue labeling the next principal components in a similar
fashion to learn about the structure of the data.
When the data can be reduced to two dimensions, a useful plot is a scatter
plot of the first vs. second principal scores with labels for the observations (if
PRINCIPAL COMPONENTS ANALYSIS 109
TABLE 4.12 PCA OUTPUT USING ALL NORMALIZED 13 NUMERICAL VARIABLES IN THE
BREAKFAST CEREALS DATASET. THE TABLE SHOWS RESULTS FOR THE FIRST FIVE
PRINCIPAL COMPONENTS
> pcs.cor <- prcomp(na.omit(cereals.df[,-c(1:3)]), scale. = T)
> summary(pcs.cor)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 1.9062 1.7743 1.3818 1.00969 0.9947 0.84974 0.81946 0.64515
Proportion of Variance 0.2795 0.2422 0.1469 0.07842 0.0761 0.05554 0.05166 0.03202
Cumulative Proportion 0.2795 0.5217 0.6685 0.74696 0.8231 0.87861 0.93026 0.96228
PC9 PC10 PC11 PC12 PC13
Standard deviation 0.56192 0.30301 0.25194 0.13897 1.499e-08
Proportion of Variance 0.02429 0.00706 0.00488 0.00149 0.000e+00
Cumulative Proportion 0.98657 0.99363 0.99851 1.00000 1.000e+00
> pcs.cor$rot[,1:5]
PC1 PC2 PC3 PC4 PC5
calories 0.29954236 0.3931479 -0.114857453 0.20435870 0.20389885
protein -0.30735632 0.1653233 -0.277281953 0.30074318 0.31974897
fat 0.03991542 0.3457243 0.204890102 0.18683311 0.58689327
sodium 0.18339651 0.1372205 -0.389431009 0.12033726 -0.33836424
fiber -0.45349036 0.1798119 -0.069766079 0.03917361 -0.25511906
carbo 0.19244902 -0.1494483 -0.562452458 0.08783547 0.18274252
sugars 0.22806849 0.3514345 0.355405174 -0.02270716 -0.31487243
potass -0.40196429 0.3005442 -0.067620183 0.09087843 -0.14836048
vitamins 0.11598020 0.1729092 -0.387858660 -0.60411064 -0.04928672
shelf -0.17126336 0.2650503 0.001531036 -0.63887859 0.32910135
weight 0.05029930 0.4503085 -0.247138314 0.15342874 -0.22128334
cups 0.29463553 -0.2122479 -0.139999705 0.04748909 0.12081645
rating -0.43837841 -0.2515389 -0.181842433 0.03831622 0.05758420
comment on normalization
Use function prcomp() with scale. = T to run PCA on normalized data.
the dataset is not too large). To illustrate this, Figure 4.5 displays the first two
principal component scores for the breakfast cereals.
We can see that as we move from right (bran cereals) to left, the cereals are
less “healthy” in the sense of high calories, low protein and fiber, and so on.
Also, moving from bottom to top, we get heavier cereals (moving from puffed
rice to raisin bran). These plots are especially useful if interesting clusters of
observations can be found. For instance, we see here that children’s cereals are
close together on the middle-left part of the plot.
Using Principal Components for Classification and Prediction
When the goal of the data reduction is to have a smaller set of variables that will
serve as predictors, we can proceed as following: Apply PCA to the predictors
110 DIMENSION REDUCTION
FIGURE 4.5 SCATTER PLOT OF THE SECOND VS. FIRST PRINCIPAL COMPONENTS SCORES FOR
THE NORMALIZED BREAKFAST CEREAL OUTPUT (CREATED USING TABLEAU)
using the training data. Use the output to determine the number of principal
components to be retained. The predictors in the model now use the (reduced
number of) principal scores columns. For the validation set, we can use the
weights computed from the training data to obtain a set of principal scores by
applying the weights to the variables in the validation set. These new variables
are then treated as the predictors.
One disadvantage of using a subset of principal components as predictors in
a supervised task, is that we might lose predictive information that is nonlinear
DIMENSION REDUCTION USING REGRESSION MODELS 111
(e.g., a quadratic effect of a predictor on the outcome variable or an interac-
tion between predictors). This is because PCA produces linear transformations,
thereby capturing linear relationships between the original variables.
4.9 Dimension Reduction Using Regression
Models
In this chapter, we discussed methods for reducing the number of columns using
summary statistics, plots, and principal components analysis. All these are con-
sidered exploratory methods. Some of them completely ignore the outcome
variable (e.g., PCA), whereas in other methods we informally try to incorporate
the relationship between the predictors and the outcome variable (e.g., combin-
ing similar categories, in terms of their outcome variable behavior). Another
approach to reducing the number of predictors, which directly considers the
predictive or classification task, is by fitting a regression model. For prediction,
a linear regression model is used (see Chapter 6) and for classification, a logistic
regression model (see Chapter 10). In both cases, we can employ subset selection
procedures that algorithmically choose a subset of predictor variables among the
larger set (see details in the relevant chapters).
Fitted regression models can also be used to further combine similar cate-
gories: categories that have coefficients that are not statistically significant (i.e.,
have a high p-value) can be combined with the reference category, because their
distinction from the reference category appears to have no significant effect on
the outcome variable. Moreover, categories that have similar coefficient values
(and the same sign) can often be combined, because their effect on the outcome
variable is similar. See the example in Chapter 10 on predicting delayed flights
for an illustration of how regression models can be used for dimension reduction.
4.10 Dimension Reduction Using
Classification and Regression Trees
Another method for reducing the number of columns and for combining cat-
egories of a categorical variable is by applying classification and regression trees
(see Chapter 9). Classification trees are used for classification tasks and regression
trees for prediction tasks. In both cases, the algorithm creates binary splits on the
predictors that best classify/predict the outcome variable (e.g., above/below age
30). Although we defer the detailed discussion to Chapter 9, we note here that
the resulting tree diagram can be used for determining the important predic-
tors. Predictors (numerical or categorical) that do not appear in the tree can be
removed. Similarly, categories that do not appear in the tree can be combined.
112 DIMENSION REDUCTION
PROBLEMS
4.1 Breakfast Cereals. Use the data for the breakfast cereals example in Section 4.8 to
explore and summarize the data as follows:
a. Which variables are quantitative/numerical? Which are ordinal? Which are
nominal?
b. Compute the mean, median, min, max, and standard deviation for each of the
quantitative variables. This can be done through R’s sapply() function (e.g., sap-
ply(data, mean, na.rm = TRUE)).
c. Use R to plot a histogram for each of the quantitative variables. Based on the
histograms and summary statistics, answer the following questions:
i. Which variables have the largest variability?
ii. Which variables seem skewed?
iii. Are there any values that seem extreme?
d. Use R to plot a side-by-side boxplot comparing the calories in hot vs. cold cereals.
What does this plot show us?
e. Use R to plot a side-by-side boxplot of consumer rating as a function of the shelf
height. If we were to predict consumer rating from shelf height, does it appear that
we need to keep all three categories of shelf height?
f. Compute the correlation table for the quantitative variable (function cor()). In addi-
tion, generate a matrix plot for these variables (function plot(data)).
i. Which pair of variables is most strongly correlated?
ii. How can we reduce the number of variables based on these correlations?
iii. How would the correlations change if we normalized the data first?
g. Consider the first PC of the analysis of the 13 numerical variables in Table 4.11.
Describe briefly what this PC represents.
4.2 University Rankings. The dataset on American college and university rankings
(available from www.dataminingbook.com) contains information on 1302 American
colleges and universities offering an undergraduate program. For each university, there
are 17 measurements that include continuous measurements (such as tuition and grad-
uation rate) and categorical measurements (such as location by state and whether it is
a private or a public school).
a. Remove all categorical variables. Then remove all records with missing numerical
measurements from the dataset.
b. Conduct a principal components analysis on the cleaned data and comment on the
results. Should the data be normalized? Discuss what characterizes the components
you consider key.
4.3 Sales of Toyota Corolla Cars. The file ToyotaCorolla.csv contains data on used cars
(Toyota Corollas) on sale during late summer of 2004 in the Netherlands. It has 1436
records containing details on 38 attributes, including Price, Age, Kilometers, HP, and
other specifications. The goal will be to predict the price of a used Toyota Corolla
based on its specifications.
a. Identify the categorical variables.
b. Explain the relationship between a categorical variable and the series of binary
dummy variables derived from it.
http://www.dataminingbook.com
PROBLEMS 113
c. How many dummy binary variables are required to capture the information in a
categorical variable with N categories?
d. Use R to convert the categorical variables in this dataset into dummy variables, and
explain in words, for one record, the values in the derived binary dummies.
e. Use R to produce a correlation matrix and matrix plot. Comment on the relation-
ships among variables.
4.4 Chemical Features of Wine. Table 4.13 shows the PCA output on data (non-
normalized) in which the variables represent chemical characteristics of wine, and
each case is a different wine.
TABLE 4.13 PRINCIPAL COMPONENTS OF NON-NORMALIZED WINE DATA
code for running PCA on the wine data
wine.df <- read.csv("Wine.csv")
pcs.cor <- prcomp(wine.df[,-1])
summary(pcs.cor)
pcs.cor$rot[,1:4]
Output
> summary(pcs.cor)
importance of components:
PC1 PC2 PC3 PC4 PC5
Standard deviation 314.9632 13.13527 3.07215 2.23409 1.10853
Proportion of Variance 0.9981 0.00174 0.00009 0.00005 0.00001
Cumulative Proportion 0.9981 0.99983 0.99992 0.99997 0.99998
PC6 PC7 PC8 PC9 PC10
Standard deviation 0.91710 0.5282 0.3891 0.3348 0.2678
Proportion of Variance 0.00001 0.0000 0.0000 0.0000 0.0000
Cumulative Proportion 0.99999 1.0000 1.0000 1.0000 1.0000
PC11 PC12 PC13
Standard deviation 0.1938 0.1452 0.09057
Proportion of Variance 0.0000 0.0000 0.00000
Cumulative Proportion 1.0000 1.0000 1.00000
> pcs.cor$rot[,1:4]
PC1 PC2 PC3 PC4
Alcohol -0.0016592647 -1.203406e-03 -0.016873809 0.141446778
Malic_Acid 0.0006810156 -2.154982e-03 -0.122003373 0.160389543
Ash -0.0001949057 -4.593693e-03 -0.051987430 -0.009772810
Ash_Alcalinity 0.0046713006 -2.645039e-02 -0.938593003 -0.330965260
Magnesium -0.0178680075 -9.993442e-01 0.029780248 -0.005393756
Total_Phenols -0.0009898297 -8.779622e-04 0.040484644 -0.074584656
Flavanoids -0.0015672883 5.185073e-05 0.085443339 -0.169086724
Nonflavanoid_Phenols 0.0001230867 1.354479e-03 -0.013510780 0.010805561
Proanthocyanins -0.0006006078 -5.004400e-03 0.024659382 -0.050120952
Color_Intensity -0.0023271432 -1.510035e-02 -0.291398464 0.878893693
Hue -0.0001713800 7.626731e-04 0.025977662 -0.060034945
OD280_OD315 -0.0007049316 3.495364e-03 0.070323969 -0.178200254
Proline -0.9998229365 1.777381e-02 -0.004528682 -0.003112916
114 DIMENSION REDUCTION
a. The data are in the file Wine.csv. Consider the rows labeled “Proportion of Vari-
ance.” Explain why the value for PC1 is so much greater than that of any other
column.
b. Comment on the use of normalization (standardization) in part (a).
Part III
Performance Evaluation
CHAPTER 5
Evaluating Predictive Performance
In this chapter, we discuss how the predictive performance of data mining meth-
ods can be assessed. We point out the danger of overfitting to the training data,
and the need to test model performance on data that were not used in the training
step. We discuss popular performance metrics. For prediction, metrics include
Average Error, MAPE, and RMSE (based on the validation data). For classi-
fication tasks, metrics based on the confusion matrix include overall accuracy,
specificity and sensitivity, and metrics that account for misclassification costs.
We also show the relation between the choice of cutoff value and classification
performance, and present the ROC curve, which is a popular chart for assessing
method performance at different cutoff values. When the goal is to accurately
classify the most interesting or important records, called ranking, rather than accu-
rately classify the entire sample (e.g., the 10% of customers most likely to respond
to an offer, or the 5% of claims most likely to be fraudulent), lift charts are used
to assess performance. We also discuss the need for oversampling rare classes
and how to adjust performance metrics for the oversampling. Finally, we men-
tion the usefulness of comparing metrics based on the validation data to those
based on the training data for the purpose of detecting overfitting. While some
differences are expected, extreme differences can be indicative of overfitting.
5.1 Introduction
In supervised learning, we are interested in predicting the outcome variable for
new records. Three main types of outcomes of interest are:
Predicted numerical value : when the outcome variable is numerical
(e.g., house price)
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
117
118 EVALUATING PREDICTIVE PERFORMANCE
Predicted class membership : when the outcome variable is categorical
(e.g., buyer/nonbuyer)
Propensity : the probability of class membership, when the outcome vari-
able is categorical (e.g., the propensity to default)
Prediction methods are used for generating numerical predictions, while classi-
fication methods (“classifiers”) are used for generating propensities and, using a
cutoff value on the propensities, we can generate predicted class memberships.
A subtle distinction to keep in mind is the two distinct predictive uses of
classifiers: one use, classification, is aimed at predicting class membership for new
records. The other, ranking, is detecting among a set of new records the ones
most likely to belong to a class of interest.
Let’s now examine the approach for judging the usefulness of a prediction
method used for generating numerical predictions (Section 5.2), a classifier used
for classification (Section 5.3), and a classifier used for ranking (Section 5.4). In
Section 5.5, we’ll look at evaluating performance under the scenario of over-
sampling.
5.2 Evaluating Predictive Performance
First, let us emphasize that predictive accuracy is not the same as goodness-of-
fit. Classical statistical measures of performance are aimed at finding a model
that fits well to the data on which the model was trained. In data mining, we
are interested in models that have high predictive accuracy when applied to new
records. Measures such as R2 and standard error of estimate are common metrics
in classical regression modeling, and residual analysis is used to gauge goodness-
of-fit in that situation. However, these measures do not tell us much about the
ability of the model to predict new records.
For assessing prediction performance, several measures are used. In all cases,
the measures are based on the validation set, which serves as a more objective
ground than the training set to assess predictive accuracy. This is because records
in the validation set are more similar to the future records to be predicted, in
the sense that they are not used to select predictors or to estimate the model
parameters. Models are trained on the training data, applied to the validation
data, and measures of accuracy then use the prediction errors on that validation
set.
Naive Benchmark: The Average
The benchmark criterion in prediction is using the average outcome value
(thereby ignoring all predictor information). In other words, the prediction for
a new record is simply the average across the outcome values of the records in
EVALUATING PREDICTIVE PERFORMANCE 119
the training set (ȳ). This is sometimes called a naive benchmark. A good pre-
dictive model should outperform the benchmark criterion in terms of predictive
accuracy.
Prediction Accuracy Measures
The prediction error for record i is defined as the difference between its actual
outcome value and its predicted outcome value: ei = yi − ŷi. A few popular
numerical measures of predictive accuracy are:
MAE (mean absolute error/deviation) = 1
n
∑n
i=1 |ei|. This gives the mag-
nitude of the average absolute error.
Mean Error = 1
n
∑n
i=1 ei. This measure is similar to MAE except that
it retains the sign of the errors, so that negative errors cancel out positive
errors of the same magnitude. It therefore gives an indication of whether the
predictions are on average over- or underpredicting the outcome variable.
MPE (mean percentage error) = 100× 1
n
∑n
i=1 ei/yi. This gives the per-
centage score of how predictions deviate from the actual values (on average),
taking into account the direction of the error.
MAPE (mean absolute percentage error) = 100 × 1
n
∑n
i=1 |ei/yi|. This
measure gives a percentage score of how predictions deviate (on average)
from the actual values.
RMSE (root mean squared error) =
√
1
n
∑n
i=1 e
2
i . This is similar to the
standard error of estimate in linear regression, except that it is computed on
the validation data rather than on the training data. It has the same units as
the outcome variable.
Such measures can be used to compare models and to assess their degree of
prediction accuracy. Note that all these measures are influenced by outliers. To
check outlier influence, we can compute median-based measures (and compare
to the above mean-based measures) or simply plot a histogram or boxplot of the
errors. Plotting the prediction errors’ distribution is in fact very useful and can
highlight more information than the metrics alone.
To illustrate the use of predictive accuracy measures and charts of prediction
error distribution, consider the error metrics and charts shown in Table 5.1 and
Figure 5.1. These are the result of fitting a certain predictive model to prices of
used Toyota Corolla cars. The training set includes 600 cars and the validation set
includes 400 cars. Results are displayed separately for the training and validation
sets. We can see from the histogram and boxplot corresponding to the validation
set that most errors are in the [−1000, 1000] range, with a few large positive
(under-prediction) errors.
120 EVALUATING PREDICTIVE PERFORMANCE
TABLE 5.1 PREDICTION ERROR METRICS FROM A MODEL FOR TOYOTA CAR PRICES. TRAINING
AND VALIDATION
code for accuracy measure
# package forecast is required to evaluate performance
library(forecast)
# load file
toyota.corolla.df <- read.csv("ToyotaCorolla.csv")
# randomly generate training and validation sets
training <- sample(toyota.corolla.df$Id, 600)
validation <- sample(setdiff(toyota.corolla.df$Id, training), 400)
# run linear regression model
reg <- lm(Price~., data=toyota.corolla.df[,-c(1,2,8,11)], subset=training,
na.action=na.exclude)
pred_t <- predict(reg, na.action=na.pass)
pred_v <- predict(reg, newdata=toyota.corolla.df[validation,-c(1,2,8,11)],
na.action=na.pass)
## evaluate performance
# training
accuracy(pred_t, toyota.corolla.df[training,]$Price)
# validation
accuracy(pred_v, toyota.corolla.df[validation,]$Price)
Output
> # training
> accuracy(pred_t, toyota.corolla.df[training,]$Price)
ME RMSE MAE MPE MAPE
Test set -3.542119e-11 1012.578 784.1166 -0.8180893 7.773598
> # validation
> accuracy(pred_v, toyota.corolla.df[validation,]$Price)
ME RMSE MAE MPE MAPE
Test set -107.8089 1262.623 833.3621 -2.345028 8.846669
Tr
ai
ni
ng
−10000 −5000 0 5000 10000
0
10
20
30
40
50
V
al
id
at
io
n
−10000 −5000 0 5000 10000
0
10
20
30
40
Training Validation
−
60
00
−
40
00
−
20
00
0
20
00
40
00
FIGURE 5.1 HISTOGRAMS AND BOXPLOTS OF TOYOTA PRICE PREDICTION ERRORS, FOR
TRAINING AND VALIDATION SETS
EVALUATING PREDICTIVE PERFORMANCE 121
Comparing Training and Validation Performance
Errors that are based on the training set tell us about model fit, whereas those that
are based on the validation set (called “prediction errors”) measure the model’s
ability to predict new data (predictive performance). We expect training errors
to be smaller than the validation errors (because the model was fitted using the
training set), and the more complex the model, the greater the likelihood that
it will overfit the training data (indicated by a greater difference between the
training and validation errors). In an extreme case of overfitting, the training
errors would be zero (perfect fit of the model to the training data), and the
validation errors would be non-zero and non-negligible. For this reason, it is
important to compare the error plots and metrics (RMSE, MAE, etc.) of the
training and validation sets. Table 5.1 illustrates this comparison: the training set
performance measures appear much lower (better) than those for the validation
set. However, the charts reveal more than the metrics alone: looking at the
charts in Figure 5.1, the discrepancies are most likely due to some outliers in
the validation set, especially the asymmetric outliers in the validation set. The
positive validation errors (under-predictions) are slightly larger than the training
errors, as reflected by the medians and outliers.
Lift Chart
In some applications, the goal is to search, among a set of new records, for a
subset of records that gives the highest cumulative predicted values. In such
cases, a graphical way to assess predictive performance is through a lift chart.
This compares the model’s predictive performance to a baseline model that has
no predictors. A lift chart for a continuous response is relevant only when we are
searching for a set of records that gives the highest cumulative predicted values.
A lift chart is not relevant if we are interested in predicting the outcome value
for each new record.
To illustrate this type of goal, called ranking, consider a car rental firm that
renews its fleet regularly so that customers drive late-model cars. This entails
disposing of a large quantity of used vehicles on a continuing basis. Since the
firm is not primarily in the used car sales business, it tries to dispose of as much
of its fleet as possible through volume sales to used car dealers. However, it is
profitable to sell a limited number of cars through its own channels. Its volume
deals with the used car dealers allow it flexibility to pick and choose which cars
to sell in this fashion, so it would like to have a model for selecting cars for resale
through its own channels. Since all cars were purchased some time ago and the
deals with the used car dealers are for fixed prices (specifying a given number of
cars of a certain make and model class), the cars’ costs are now irrelevant and the
dealer is interested only in maximizing revenue. This is done by selecting for its
122 EVALUATING PREDICTIVE PERFORMANCE
own resale, the cars likely to generate the most revenue. The lift chart in this
case gives the predicted lift for revenue.
The lift chart is based on ordering the set of records of interest (typically
validation data) by their predicted value, from high to low. Then, we accumulate
the actual values and plot their cumulative value on the y-axis as a function of
the number of records accumulated (the x-axis value). This curve is compared
to assigning a naive prediction (ȳ) to each record and accumulating these average
values, which results in a diagonal line. The further away the lift curve from the
diagonal benchmark line, the better the model is doing in separating records
with high value outcomes from those with low value outcomes. The same
information can be presented in a decile lift chart, where the ordered records
are grouped into ten deciles, and for each decile, the chart presents the ratio of
model lift to naive benchmark lift.
Figure 5.2 shows a lift chart and decile lift chart based on fitting a linear
regression model to the Toyota data. The charts are based on the validation data
of 400 cars. It can be seen that the model’s predictive performance in terms of
lift is better than the baseline model, since its lift curve is higher than that of
the baseline model. The lift and decile charts in Figure 5.2 would be useful in
the following scenario: choosing the top 10% of the cars that gave the highest
predicted sales, for example, we would gain 1.7 times the amount of revenue,
compared to choosing 10% of the cars at random. This can be seen from the
decile chart (Figure 5.2). This number can also be computed from the lift chart
by comparing the sales for 40 random cars (the value of the baseline curve at
x = 40), which is $486,871 (= the sum of the actual sales for the 400 validation
set cars divided by 10) with the actual sales of the 40 cars that have the highest
predicted values (the value of the lift curve at x = 40), $835,883. The ratio
between these numbers is 1.7.
5.3 Judging Classifier Performance
The need for performance measures arises from the wide choice of classifiers and
predictive methods. Not only do we have several different methods, but even
within a single method there are usually many options that can lead to completely
different results. A simple example is the choice of predictors used within a
particular predictive algorithm. Before we study these various algorithms in
detail and face decisions on how to set these options, we need to know how we
will measure success.
A natural criterion for judging the performance of a classifier is the prob-
ability of making a misclassification error. Misclassification means that the record
belongs to one class but the model classifies it as a member of a different class.
A classifier that makes no errors would be perfect, but we do not expect to be
JUDGING CLASSIFIER PERFORMANCE 123
0 100 200 300 400
0
10
00
00
0
30
00
00
0
Lift Chart
# cases
C
um
ul
at
iv
e
P
ric
e
10 30 50 70 90
Decile-wise lift chart
Percentile
M
ea
n
R
es
po
ns
e
0.
0
0.
5
1.
0
1.
5
FIGURE 5.2 LIFT CHART (LEFT) AND DECILE LIFT CHART (RIGHT) FOR CONTINUOUS
OUTCOME VARIABLE (SALES OF TOYOTA CARS)
code for generating a lift chart and decile-wise lift chart
toyota.corolla.df <- read.csv("ToyotaCorolla.csv")
# remove missing Price data
toyota.corolla.df <-
toyota.corolla.df[!is.na(toyota.corolla.df[validation,]$Price),]
# generate random Training and Validation sets
training <- sample(toyota.corolla.df$Id, 600)
validation <- sample(toyota.corolla.df$Id, 400)
# regression model based on all numerical predictors
reg <- lm(Price~., data = toyota.corolla.df[,-c(1,2,8,11)], subset = training)
# predictions
pred_v <- predict(reg, newdata = toyota.corolla.df[validation,-c(1,2,8,11)])
# load package gains, compute gains (we will use package caret for categorical y later)
library(gains)
gain <- gains(toyota.corolla.df[validation,]$Price[!is.na(pred_v)], pred_v[!is.na(pred_v)])
# cumulative lift chart
options(scipen=999) # avoid scientific notation
# we will compute the gain relative to price
price <- toyota.corolla.df[validation,]$Price[!is.na(toyota.corolla.df[validation,]$Price)]
plot(c(0,gain$cume.pct.of.total*sum(price))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative Price", main="Lift Chart", type="l")
# baseline
lines(c(0,sum(price))~c(0,dim(toyota.corolla.df[validation,])[1]), col="gray", lty=2)
# Decile-wise lift chart
barplot(gain$mean.resp/mean(price), names.arg = gain$depth,
xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise lift chart")
124 EVALUATING PREDICTIVE PERFORMANCE
able to construct such classifiers in the real world due to “noise” and not hav-
ing all the information needed to classify records precisely. Is there a minimal
probability of misclassification that we should require of a classifier?
Benchmark: The Naive Rule
A very simple rule for classifying a record into one of m classes, ignoring all pre-
dictor information (x1, x2, . . . , xp) that we may have, is to classify the record
as a member of the majority class. In other words, “classify as belonging to the
most prevalent class.” The naive rule is used mainly as a baseline or benchmark
for evaluating the performance of more complicated classifiers. Clearly, a clas-
sifier that uses external predictor information (on top of the class membership
allocation) should outperform the naive rule. There are various performance
measures based on the naive rule that measure how much better than the naive
rule a certain classifier performs. One example is multiple R2, which measures
the distance between the fit of the classifier to the data and the fit of the naive
rule to the data (for further details, see Section 10.6).
Similar to using the sample mean (ȳ) as the naive benchmark in the numerical
outcome case, the naive rule for classification relies solely on the y information
and excludes any additional predictor information.
Class Separation
If the classes are well separated by the predictor information, even a small dataset
will suffice in finding a good classifier, whereas if the classes are not separated at
all by the predictors, even a very large dataset will not help. Figure 5.3 illustrates
this for a two-class case. The top panel includes a small dataset (n = 24 records)
where two predictors (income and lot size) are used for separating owners from
nonowners [we thank Dean Wichern for this example, described in Johnson
and Wichern (2002)]. Here, the predictor information seems useful in that it
separates the two classes (owners/nonowners). The bottom panel shows a much
larger dataset (n = 5000 records) where the two predictors (income and monthly
average credit card spending) do not separate the two classes well in most of the
higher ranges (loan acceptors/nonacceptors).
The Confusion (Classification) Matrix
In practice, most accuracy measures are derived from the confusion matrix, also
called classification matrix. This matrix summarizes the correct and incorrect clas-
sifications that a classifier produced for a certain dataset. Rows and columns
of the confusion matrix correspond to the predicted and true (actual) classes,
respectively. Table 5.2 shows an example of a classification (confusion) matrix
for a two-class (0/1) problem resulting from applying a certain classifier to 3000
JUDGING CLASSIFIER PERFORMANCE 125
FIGURE 5.3 HIGH (TOP) AND LOW (BOTTOM) LEVELS OF SEPARATION BETWEEN TWO
CLASSES, USING TWO PREDICTORS
TABLE 5.2 CONFUSION MATRIX BASED ON
3000 RECORDS AND TWO CLASSES
Actual Class
0 1
Predicted Class 0 2689 85
1 25 201
126 EVALUATING PREDICTIVE PERFORMANCE
records. The two diagonal cells (upper left, lower right) give the number of
correct classifications, where the predicted class coincides with the actual class
of the record. The off-diagonal cells give counts of misclassification. The lower
left cell gives the number of class 1 members that were misclassified as 0’s (in this
example, there were 85 such misclassifications). Similarly, the top-right cell gives
the number of class 0 members that were misclassified as 1’s (25 such records). In
R, we can obtain a confusion matrix using the function confusionMatrix() in the
caret package. This function creates the cross-tabulation of actual and predicted
classes. We will see an example later in this chapter.
The confusion matrix gives estimates of the true classification and misclas-
sification rates. Of course, these are estimates and they can be incorrect, but if
we have a large enough dataset and neither class is very rare, our estimates will
be reliable. Sometimes, we may be able to use public data such as US Census
data to estimate these proportions. However, in most business settings, we will
not know them.
Using the Validation Data
To obtain an honest estimate of future classification error, we use the confusion
matrix that is computed from the validation data. In other words, we first partition
the data into training and validation sets by random selection of records. We then
construct a classifier using the training data, and then apply it to the validation
data. This will yield the predicted classifications for records in the validation
set (see Figure 2.4 in Chapter 2). We then summarize these classifications in a
confusion matrix. Although we can summarize our results in a confusion matrix
for training data as well, the resulting confusion matrix is not useful for getting
an honest estimate of the misclassification rate for new data due to the danger of
overfitting.
In addition to examining the validation data confusion matrix to assess the
classification performance on new data, we compare the training data confusion
matrix to the validation data confusion matrix, in order to detect overfitting:
although we expect somewhat inferior results on the validation data, a large
discrepancy in training and validation performance might be indicative of over-
fitting.
Accuracy Measures
Different accuracy measures can be derived from the classification matrix. Con-
sider a two-class case with classes C1 and C2 (e.g., buyer/non-buyer). The
schematic confusion matrix in Table 5.3 uses the notation ni,j to denote the
number of records that are class Ci members and were classified as Cj members.
Of course, if i ̸= j, these are counts of misclassifications. The total number of
records is n = n1,1 + n1,2 + n2,1 + n2,2.
JUDGING CLASSIFIER PERFORMANCE 127
TABLE 5.3 CONFUSION MATRIX: MEANING OF EACH CELL
Actual Class
C1 C2
Predicted Class C1 n1,1 = number of C1 records n2,1 = number of C2 records
classified correctly classified incorrectly as C1
C2 n1,2 = number of C1 records n2,2 = number of C2 records
classified incorrectly as C2 classified correctly
A main accuracy measure is the estimated misclassification rate, also called the
overall error rate. It is given by
err =
n1,2 + n2,1
n
,
where n is the total number of records in the validation dataset. In the example
in Table 5.2, we get err = (25 + 85)/3000 = 3.67%.
We can measure accuracy by looking at the correct classifications—the full
half of the cup—instead of the misclassifications. The overall accuracy of a classifier
is estimated by
accuracy = 1− err = n1,1 + n2,2
n
.
In the example, we have (201 + 2689)/3000 = 96.33%.
Propensities and Cutoff for Classification
The first step in most classification algorithms is to estimate the probability that
a record belongs to each of the classes. These probabilities are also called propen-
sities. Propensities are typically used either as an interim step for generating
predicted class membership (classification), or for rank-ordering the records by
their probability of belonging to a class of interest. Let us consider their first use
in this section. The second use is discussed in Section 5.4.
If overall classification accuracy (involving all the classes) is of interest, the
record can be assigned to the class with the highest probability. In many records,
a single class is of special interest, so we will focus on that particular class and
compare the propensity of belonging to that class to a cutoff value set by the
analyst. This approach can be used with two classes or more than two classes,
though it may make sense in such cases to consolidate classes so that you end up
with two: the class of interest and all other classes. If the probability of belonging
to the class of interest is above the cutoff, the record is assigned to that class.
128 EVALUATING PREDICTIVE PERFORMANCE
C U T O F F V A L U E S F O R T R I A G E
In some cases, it is useful to have two cutoffs, and allow a “cannot say” option for
the classifier. In a two-class situation, this means that for a record, we can make
one of three predictions: The record belongs to C1, or the record belongs to C2,
or we cannot make a prediction because there is not enough information to pick
C1 or C2 confidently. Records that the classifier cannot classify are subjected to
closer scrutiny either by using expert judgment or by enriching the set of predic-
tor variables by gathering additional information that is perhaps more difficult or
expensive to obtain. An example is classification of documents found during legal
discovery (reciprocal forced document disclosure in a legal proceeding). Under tra-
ditional human-review systems, qualified legal personnel are needed to review what
might be tens of thousands of documents to determine their relevance to a case.
Using a classifier and a triage outcome, documents could be sorted into clearly rel-
evant, clearly not relevant, and the gray area documents requiring human review.
This substantially reduces the costs of discovery.
The default cutoff value in two-class classifiers is 0.5. Thus, if the probability
of a record being a class C1 member is greater than 0.5, that record is classified
as a C1. Any record with an estimated probability of less than 0.5 would be
classified as a C2. It is possible, however, to use a cutoff that is either higher or
lower than 0.5. A cutoff greater than 0.5 will end up classifying fewer records as
C1’s, whereas a cutoff less than 0.5 will end up classifying more records as C1.
Typically, the misclassification rate will rise in either case.
Consider the data in Table 5.4, showing the actual class for 24 records, sorted
by the probability that the record is an “owner” (as estimated by a data mining
algorithm). If we adopt the standard 0.5 as the cutoff, our misclassification rate
is 3/24, whereas if we instead adopt a cutoff of 0.25, we classify more records
as owners and the misclassification rate goes up (comprising more nonowners
TABLE 5.4 24 RECORDS WITH THEIR ACTUAL CLASS AND THE PROBABILITY (PROPENSITY)
OF THEM BEING CLASS “OWNER” MEMBERS, AS ESTIMATED BY A CLASSIFIER
Actual Class Probability of Class “owner” Actual Class Probability of Class “owner”
owner 0.9959 owner 0.5055
owner 0.9875 nonowner 0.4713
owner 0.9844 nonowner 0.3371
owner 0.9804 owner 0.2179
owner 0.9481 nonowner 0.1992
owner 0.8892 nonowner 0.1494
owner 0.8476 nonowner 0.0479
nonowner 0.7628 nonowner 0.0383
owner 0.7069 nonowner 0.0248
owner 0.6807 nonowner 0.0218
owner 0.6563 nonowner 0.0161
nonowner 0.6224 nonowner 0.0031
JUDGING CLASSIFIER PERFORMANCE 129
misclassified as owners) to 5/24. Conversely, if we adopt a cutoff of 0.75, we
classify fewer records as owners. The misclassification rate goes up (comprising
more owners misclassified as nonowners) to 6/24. All this can be seen in the
classification tables in Table 5.5.
To see the entire range of cutoff values and how the accuracy or misclassi-
fication rates change as a function of the cutoff, we can plot the performance
measure of interest vs. the cutoff. The results for the riding mowers example are
shown in Figure 5.4. We can see that the accuracy level is pretty stable around
0.8 for cutoff values between 0.2 and 0.8.
Why would we want to use cutoff values different from 0.5 if they increase
the misclassification rate? The answer is that it might be more important to
classify owners properly than nonowners, and we would tolerate a greater mis-
classification of the latter. Or the reverse might be true; in other words, the
costs of misclassification might be asymmetric. We can adjust the cutoff value
in such a case to classify more records as the high-value class, that is, accept
TABLE 5.5 CONFUSION MATRICES BASED ON CUTOFFS OF 0.5, 0.25, AND 0.75 (RIDING
MOWERS EXAMPLE)
> owner.df <- read.csv("ownerExample.csv")
## cutoff = 0.5
> confusionMatrix(ifelse(owner.df$Probability>0.5, ‘owner’, ‘nonowner’), owner.df$Class)
# note: “reference” = “actual”
Confusion Matrix and Statistics
Reference
Prediction nonowner owner
nonowner 10 1
owner 2 11
Accuracy : 0.875
## cutoff = 0.25
> confusionMatrix(ifelse(owner.df$Probability>0.25, ‘owner’, ‘nonowner’), owner.df$Class)
Confusion Matrix and Statistics
Reference
Prediction nonowner owner
nonowner 8 1
owner 4 11
Accuracy : 0.7916667
## cutoff = 0.75
> confusionMatrix(ifelse(owner.df$Probability>0.75, ‘owner’, ‘nonowner’), owner.df$Class)
Confusion Matrix and Statistics
Reference
Prediction nonowner owner
nonowner 11 5
owner 1 7
Accuracy : 0.75
130 EVALUATING PREDICTIVE PERFORMANCE
0.0 0.2 0.4 0.6 0.8 1.0
0.
0
0.
2
0.
4
0.
6
0.
8
1.
0
Cutoff Value
accuracy
overall error
FIGURE 5.4 PLOTTING ACCURACY AND OVERALL ERROR AS A FUNCTION OF THE CUTOFF
VALUE (RIDING MOWERS EXAMPLE)
code for creating Figure 5.4
# create empty accuracy table
accT = c()
# compute accuracy per cutoff
for (cut in seq(0,1,0.1)){
cm <- confusionMatrix(1 * (df$prob > cut), df$actual)
accT = c(accT, cm$overall[1])
}
# plot accuracy
plot(accT ~ seq(0,1,0.1), xlab = “Cutoff Value”, ylab = “”, type = “l”, ylim = c(0, 1))
lines(1-accT ~ seq(0,1,0.1), type = “l”, lty = 2)
legend(“topright”, c(“accuracy”, “overall error”), lty = c(1, 2), merge = TRUE)
more misclassifications where the misclassification cost is low. Keep in mind
that we are doing so after the data mining model has already been selected—we
are not changing that model. It is also possible to incorporate costs into the
picture before deriving the model. These subjects are discussed in greater detail
below.
JUDGING CLASSIFIER PERFORMANCE 131
Performance in Case of Unequal Importance of Classes
Suppose that it is more important to predict membership correctly in class C1
than in class C2. An example is predicting the financial status (bankrupt/solvent)
of firms. It may be more important to predict correctly a firm that is going
bankrupt than to predict correctly a firm that is going to remain solvent. The
classifier is essentially used as a system for detecting or signaling bankruptcy. In
such a case, the overall accuracy is not a good measure for evaluating the classifier.
Suppose that the important class is C1. The following pair of accuracy measures
are the most popular:
The sensitivity (also termed recall) of a classifier is its ability to detect the
important class members correctly. This is measured by n1,1/(n1,1 + n1,2),
the percentage of C1 members classified correctly.
The specificity of a classifier is its ability to rule out C2 members correctly.
This is measured by n2,2/(n2,1+n2,2), the percentage of C2 members clas-
sified correctly.
It can be useful to plot these measures against the cutoff value in order to
find a cutoff value that balances these measures.
C O M P U T I N G R A T E S : F R O M W H O S E P O I N T O F V I E W ?
Sensitivity and specificity measure the performance of a classifier from the
point of view of the “classifying agency” (e.g., a company classifying cus-
tomers or a hospital classifying patients). They answer the question “how well
does the classifier segregate the important class members?”. It is also possible
to measure accuracy from the perspective of the entity being classified (e.g.,
the customer or the patient), who asks “given my predicted class, what is my
chance of actually belonging to that class?”, although this question is usually
less relevant in a data mining application. The terms “false discovery rate” and
“false omission rate” are measures of performance from the perspective of the
individual entity. If C1 is the important (positive) class, then they are defined
as
The false discovery rate (FDR) is the proportion of C1 predictions that are wrong,
equal to n2,1/(n1,1 + n2,1). Note that this is a ratio within the row of C1
predictions (i.e., it uses only records that were classified as C1).
The false omission rate (FOR) is the proportion of C2 predictions that are wrong,
equal to n1,2/(n1,2 + n2,2). Note that this is a ratio within the row of C2
predictions (i.e., it uses only records that were classified as C2).
ROC Curve A more popular method for plotting the two measures is
through ROC (Receiver Operating Characteristic) curves. Starting from the
132 EVALUATING PREDICTIVE PERFORMANCE
lower left, the ROC curve plots the pairs {sensitivity, specificity} as the cut-
off value descends from 1 to 0. (A typical alternative presentation is to plot
1-specificity on the x-axis, which allows 0 to be placed on the left end of the
axis, and 1 on the right.) Better performance is reflected by curves that are closer
to the top-left corner. The comparison curve is the diagonal, which reflects the
performance of the naive rule, using varying cutoff values (i.e., setting different
thresholds on the level of majority used by the majority rule). A common met-
ric to summarize an ROC curve is “area under the curve (AUC),” which ranges
from 1 (perfect discrimination between classes) to 0.5 (no better than the naive
rule). The ROC curve for the owner/nonowner example and its corresponding
AUC are shown in Figure 5.5.
Specificity
S
en
si
tiv
ity
0.
0
0.
2
0.
4
0.
6
0.
8
1.
0
1.0 0.8 0.6 0.4 0.2 0.0
FIGURE 5.5 ROC CURVE FOR RIDING MOWERS EXAMPLE
code for generating ROC curve and computing AUC
library(pROC)
r <- roc(df$actual, df$prob)
plot.roc(r)
# compute auc
auc(r)
Output
> auc(r)
Area under the curve: 0.9375
JUDGING CLASSIFIER PERFORMANCE 133
Asymmetric Misclassification Costs
Implicit in our discussion of the lift curve, which measures how effective we
are in identifying the members of one particular class, is the assumption that the
error of misclassifying a record belonging to one class is more serious than for
the other class. For example, misclassifying a household as unlikely to respond to
a sales offer when it belongs to the class that would respond incurs a greater cost
(the opportunity cost of the foregone sale) than the converse error. In the former
case, you are missing out on a sale worth perhaps tens or hundreds of dollars.
In the latter, you are incurring the costs of contacting someone who will not
purchase. In such a scenario, using the misclassification rate as a criterion can be
misleading.
Note that we are assuming that the cost (or benefit) of making correct classi-
fications is zero. At first glance, this may seem incomplete. After all, the benefit
(negative cost) of classifying a buyer correctly as a buyer would seem substantial.
And in other circumstances (e.g., scoring our classification algorithm to fresh
data to implement our decisions), it will be appropriate to consider the actual
net dollar impact of each possible classification (or misclassification). Here, how-
ever, we are attempting to assess the value of a classifier in terms of classification
error, so it greatly simplifies matters if we can capture all cost/benefit information
in the misclassification cells. So, instead of recording the benefit of classifying
a respondent household correctly, we record the cost of failing to classify it as
a respondent household. It amounts to the same thing and our goal becomes
the minimization of costs, whether the costs are actual costs or missed benefits
(opportunity costs).
Consider the situation where the sales offer is mailed to a random sample of
people for the purpose of constructing a good classifier. Suppose that the offer is
accepted by 1% of those households. For these data, if a classifier simply classifies
every household as a non-responder, it will have an error rate of only 1% but it
will be useless in practice. A classifier that misclassifies 2% of buying households
as nonbuyers and 20% of the nonbuyers as buyers would have a higher error rate
but would be better if the profit from a sale is substantially higher than the cost
of sending out an offer. In these situations, if we have estimates of the cost of
both types of misclassification, we can use the confusion matrix to compute the
expected cost of misclassification for each record in the validation data. This
enables us to compare different classifiers using overall expected costs (or profits)
as the criterion.
Suppose that we are considering sending an offer to 1000 more people,
where on average 1% of whom respond (1). Naively classifying everyone as
a 0 has an error rate of only 1%. Using a data mining routine, suppose that we
can produce these classifications:
134 EVALUATING PREDICTIVE PERFORMANCE
Actual 0 Actual 1
Predicted 0 970 2
Predicted 1 20 8
These classifications have an error rate of 100× (20 + 2)/1000 = 2.2%—
higher than the naive rate.
Now suppose that the profit from a responder is $10 and the cost of sending
the offer is $1. Classifying everyone as a 0 still has a misclassification rate of only
1%, but yields a profit of $0. Using the data mining routine, despite the higher
misclassification rate, yields a profit of $60.
The matrix of profit is as follows (nothing is sent to the predicted 0’s so there
are no costs or sales in that column):
Profit Actual 0 Actual 1
Predicted 0 0 0
Predicted 1 − $20 $80
Looked at purely in terms of costs, when everyone is classified as a 0, there
are no costs of sending the offer; the only costs are the opportunity costs of
failing to make sales to the ten 1’s = $100. The cost (actual costs of sending the
offer, plus the opportunity costs of missed sales) of using the data mining routine
to select people to send the offer to is only $48, as follows:
Costs Actual 0 Actual 1
Predicted 0 0 $20
Predicted 1 $20 $8
However, this does not improve the actual classifications themselves. A better
method is to change the classification rules (and hence the misclassification rates)
as discussed in the preceding section, to reflect the asymmetric costs.
A popular performance measure that includes costs is the average misclassifica-
tion cost, which measures the average cost of misclassification per classified record.
Denote by q1 the cost of misclassifying a class C1 record (as belonging to class
C2) and by q2 the cost of misclassifying a class C2 record (as belonging to class
C1). The average misclassification cost is
q1n1,2 + q2n2,1
n
.
Thus, we are looking for a classifier that minimizes this quantity. This can be
computed, for instance, for different cutoff values.
JUDGING CLASSIFIER PERFORMANCE 135
It turns out that the optimal parameters are affected by the misclassification
costs only through the ratio of these costs. This can be seen if we write the
foregoing measure slightly differently:
q1n1,2 + q2n2,1
n
=
n1,2
n1,1 + n1,2
n1,1 + n1,2
n
q1 +
n2,1
n2,1 + n2,2
n2,1 + n2,2
n
q2.
Minimizing this expression is equivalent to minimizing the same expression
divided by a constant. If we divide by q1, it can be seen clearly that the mini-
mization depends only on q2/q1 and not on their individual values. This is very
practical, because in many cases it is difficult to assess the costs associated with
misclassifying a C1 member and a C2 member, but estimating the ratio is easier.
This expression is a reasonable estimate of future misclassification cost if the
proportions of classes C1 and C2 in the sample data are similar to the proportions
of classesC1 andC2 that are expected in the future. If instead of a random sample,
we draw a sample such that one class is oversampled (as described in the next
section), then the sample proportions ofC1’s andC2’s will be distorted compared
to the future or population. We can then correct the average misclassification
cost measure for the distorted sample proportions by incorporating estimates of
the true proportions (from external data or domain knowledge), denoted by
p(C1) and p(C2), into the formula:
n1,2
n1,1 + n1,2
p(C1) q1 +
n2,1
n2,1 + n2,2
p(C2) q2.
Using the same logic as above, it can be shown that optimizing this quantity
depends on the costs only through their ratio (q2/q1) and on the prior probabil-
ities only through their ratio [p(C2)/p(C1)]. This is why software packages that
incorporate costs and prior probabilities might prompt the user for ratios rather
than actual costs and probabilities.
Generalization to More Than Two Classes
All the comments made above about two-class classifiers extend readily to clas-
sification into more than two classes. Let us suppose that we have m classes
C1, C2, . . . , Cm. The confusion matrix has m rows and m columns. The mis-
classification cost associated with the diagonal cells is, of course, always zero.
Incorporating prior probabilities of the various classes (where now we have m
such numbers) is still done in the same manner. However, evaluating misclas-
sification costs becomes much more complicated: For an m-class case we have
m(m− 1) types of misclassifications. Constructing a matrix of misclassification
costs thus becomes prohibitively complicated.
136 EVALUATING PREDICTIVE PERFORMANCE
5.4 Judging Ranking Performance
We now turn to the predictive goal of detecting, among a set of new records,
the ones most likely to belong to a class of interest. Recall that this differs from
the goal of predicting class membership for each new record.
Lift Charts for Binary Data
We already introduced lift charts in the context of a numerical outcome (Section
5.2). We now describe lift charts, also called lift curves, gains curves, or gains
charts, for a binary outcome. This is a more common usage than for predicted
continuous outcomes. The lift curve helps us determine how effectively we can
“skim the cream” by selecting a relatively small number of records and getting a
relatively large portion of the responders. The input required to construct a lift
curve is a validation dataset that has been “scored” by appending to each record
the propensity that it will belong to a given class.
Let’s continue with the case in which a particular class is relatively rare and of
much more interest than the other class: tax cheats, debt defaulters, or respon-
ders to a mailing. We would like our classification model to sift through the
records and sort them according to which ones are most likely to be tax cheats,
responders to the mailing, and so on. We can then make more informed deci-
sions. For example, we can decide how many and which tax returns to examine
if looking for tax cheats. The model will give us an estimate of the extent to
which we will encounter more and more non-cheaters as we proceed through
the sorted data starting with the records most likely to be tax cheats. Or we
can use the sorted data to decide to which potential customers a limited-budget
mailing should be targeted. In other words, we are describing the case when
our goal is to obtain a rank ordering among the records according to their class
membership propensities.
Sorting by Propensity To construct a lift chart, we sort the set of records
by propensity, in descending order. This is the propensity to belong to the
important class, say C1. Then, in each row, we compute the cumulative number
ofC1 members (Actual Class =C1). For example, Table 5.6 shows the 24 records
ordered in descending class “1” propensity. The right-most column accumulates
the number of actual 1’s. The lift chart then plots this cumulative column against
the number of records.
In R, there are multiple libraries for creating lift charts. Two useful options
include the caret library, which is straightforward and easy to use, but cannot
be used for a numerical outcome variable. The second option is with the gains
library. Figure 5.6 shows lift curves (and R code) for these two methods. Note
that the caret package lift chart uses a percentage on the y-axis.
JUDGING RANKING PERFORMANCE 137
TABLE 5.6 RECORDS SORTED BY PROPENSITY OF
OWNERSHIP (HIGH TO LOW) FOR THE
MOWER EXAMPLE
Obs Propensity Actual Cumulative
of 1 Class Actual Class
1 0.995976726 1 1
2 0.987533139 1 2
3 0.984456382 1 3
4 0.980439587 1 4
5 0.948110638 1 5
6 0.889297203 1 6
7 0.847631864 1 7
8 0.762806287 0 7
9 0.706991915 1 8
10 0.680754087 1 9
11 0.656343749 1 10
12 0.622419543 0 10
13 0.505506928 1 11
14 0.471340450 0 11
15 0.337117362 0 11
16 0.217967810 1 12
17 0.199240432 0 12
18 0.149482655 0 12
19 0.047962588 0 12
20 0.038341401 0 12
21 0.024850999 0 12
22 0.021806029 0 12
23 0.016129906 0 12
24 0.003559986 0 12
Interpreting the lift chart What is considered good or bad performance?
The ideal ranking performance would place all the 1’s at the beginning (the
actual 1’s would have the highest propensities and be at the top of the table),
and all the 0’s at the end. A lift chart corresponding to this ideal case would
be a diagonal line with slope 1 which turns into a horizontal line (once all
the 1’s were accumulated). In the example, the lift curve for the best possible
classifier—a classifier that makes no errors—would overlap the existing curve at
the start, continue with a slope of 1 until it reached all the 12 1’s, then continue
horizontally to the right.
In contrast, a useless model would be one that randomly assigns propensities
(shuffling the 1’s and 0’s randomly in the Actual Class column). Such behavior
would increase the cumulative number of 1’s, on average, by #1
′s
n
in each row.
And in fact, this is the diagonal line joining the points (0,0) to (24,12) seen in
Figure 5.6. This serves as a reference line. For any given number of records (the
x-axis value), it represents the expected number of 1 classifications if we did not
138 EVALUATING PREDICTIVE PERFORMANCE
% Samples Tested
%
S
am
pl
es
F
ou
nd
0
20
40
60
80
100
0 20 40 60 80 100 0 5 10 15 20
0
2
4
6
8
10
12
# cases
C
um
ul
at
iv
e
FIGURE 5.6 LIFT CHART FOR THE MOWER EXAMPLE USING CARET PACKAGE (TOP) AND GAINS
PACKAGE (BOTTOM)
code for creating a lift chart: two options
# first option with ‘caret’ library:
library(caret)
lift.example <- lift(relevel(as.factor(actual), ref="1") ~ prob, data = df)
xyplot(lift.example, plot = "gain")
# Second option with 'gains' library:
library(gains)
df <- read.csv("liftExample.csv")
gain <- gains(df$actual, df$prob, groups=dim(df)[1])
plot(c(0, gain$cume.pct.of.total*sum(df$actual)) ~ c(0, gain$cume.obs),
xlab = "# cases", ylab = "Cumulative", type="l")
lines(c(0,sum(df$actual))~c(0,dim(df)[1]), col="gray", lty=2)
have a model but simply selected records at random. It provides a benchmark
against which we can evaluate the ranking performance of the model. In this
example, although our model is not perfect, it seems to perform much better
than the random benchmark.
How do we read a lift chart? For a given number of records (x-axis), the lift
curve value on the y-axis tells us how much better we are doing compared to
random assignment. For example, looking at Figure 5.6, if we use our model to
choose the top 10 records, the lift curve tells us that we would be right for about
nine of them (or 18% using the caret package lift curve). If we simply select
10 records at random, we expect to be right for 10× 12/24 = 5 records. The
model gives us a “lift” in detecting class 1 members of 9/5 = 1.8. The lift will
vary with the number of records we choose to act on. A good classifier will give
us a high lift when we act on only a few records. As we include more records,
the lift will decrease.
Decile Lift Charts
The information from the lift chart can be portrayed as a decile chart, as shown in
Figure 5.7, which is widely used in direct marketing predictive modeling. The
JUDGING RANKING PERFORMANCE 139
8 17 29 38 50 58 67 79 88 100
Decile−wise lift chart
Percentile
M
ea
n
R
es
po
ns
e
0.
0
0.
5
1.
0
1.
5
2.
0
FIGURE 5.7 DECILE LIFT CHART
code for creating a decile lift chart
# use gains() to compute deciles.
# when using the caret package, deciles must be computed manually.
gain <- gains(df$actual, df$prob,)
barplot(gain$mean.resp / mean(df$actual), names.arg = gain$depth, xlab = "Percentile",
ylab = "Mean Response", main = "Decile-wise lift chart")
decile chart aggregates all the lift information into 10 buckets. The dots show,
on the y-axis, the factor by which our model outperforms a random assignment
of 0’s and 1’s, taking one decile at a time. Reading the bar on the left, we see
that taking 8% of the records that are ranked by the model as “the most probable
1’s” (having the highest propensities) yields twice as many 1’s as would a random
selection of 8% of the records. In this example, the decile chart indicates that we
can even use the model to select the top 29% records with the highest propensities
and still perform twice as well as random.
Beyond Two Classes
A lift chart cannot be used with a multiclass classifier unless a single “important
class” is defined and the classifications are reduced to “important” and “unim-
portant” classes.
Lift Charts Incorporating Costs and Benefits
When the benefits and costs of correct and incorrect classification are known or
can be estimated, the lift chart is still a useful presentation and decision tool. As
140 EVALUATING PREDICTIVE PERFORMANCE
before, we need a classifier that assigns to each record a propensity that it belongs
to a particular class. The procedure is then as follows:
1. Sort the records in descending order of predicted probability of success
(where success = belonging to the class of interest).
2. For each record, record the cost (benefit) associated with the actual out-
come.
3. For the highest propensity (i.e., first) record, its x-axis value is 1 and its
y-axis value is its cost or benefit (computed in Step 2) on the lift curve.
4. For the next record, again calculate the cost (benefit) associated with the
actual outcome. Add this to the cost (benefit) for the previous record.
This sum is the y-axis coordinate of the second point on the lift curve.
Its x-axis value is 2.
5. Repeat Step 4 until all records have been examined. Connect all the
points, and this is the lift curve.
6. The reference line is a straight line from the origin to the point y = total
net benefit and x = n(n = number of records).
Note: It is entirely possible for a reference line that incorporates costs and ben-
efits to have a negative slope if the net value for the entire dataset is negative. For
example, if the cost of mailing to a person is $0.65, the value of a responder is
$25, and the overall response rate is 2%, the expected net value of mailing to a list
of 10,000 is (0.02× $25× 10, 000)− ($0.65× 10, 000) = $5000− $6500 =
−$1500. Hence, the y-value at the far right of the lift curve (x = 10,000) is
−1500, and the slope of the reference line from the origin will be negative. The
optimal point will be where the lift curve is at a maximum (i.e., mailing to about
3000 people) in Figure 5.8.
Lift as a Function of Cutoff
We could also plot the lift as a function of the cutoff value. The only difference
is the scale on the x-axis. When the goal is to select the top records based on a
certain budget, the lift vs. number of records is preferable. In contrast, when the
goal is to find a cutoff that distinguishes well between the two classes, the lift vs.
cutoff value is more useful.
5.5 Oversampling
As we saw briefly in Chapter 2, when classes are present in very unequal pro-
portions, simple random sampling may produce too few of the rare class to yield
useful information about what distinguishes them from the dominant class. In
such cases, stratified sampling is often used to oversample the records from the
OVERSAMPLING 141
FIGURE 5.8 LIFT CURVE INCORPORATING COSTS
rarer class and improve the performance of classifiers. It is often the case that the
rarer events are the more interesting or important ones: responders to a mailing,
those who commit fraud, defaulters on debt, and the like. This same stratified
sampling procedure is sometimes called weighted sampling or undersampling, the
latter referring to the fact that the more plentiful class is undersampled, relative
to the rare class. We shall stick to the term oversampling.
In all discussions of oversampling, we assume the common situation in which there
are two classes, one of much greater interest than the other. Data with more than
two classes do not lend themselves to this procedure.
Consider the data in Figure 5.9, where × represents non-responders, and ◦,
responders. The two axes correspond to two predictors. The dashed vertical line
does the best job of classification under the assumption of equal costs: It results
in just one misclassification (one ◦ is misclassified as an ×). If we incorporate
more realistic misclassification costs—let’s say that failing to catch a ◦ is five times
as costly as failing to catch an ×—the costs of misclassification jump to 5. In
such a case, a horizontal line as shown in Figure 5.10, does a better job: It results
in misclassification costs of just 2.
Oversampling is one way of incorporating these costs into the training pro-
cess. In Figure 5.11, we can see that classification algorithms would automatically
determine the appropriate classification line if four additional ◦’s were present at
each existing ◦. We can achieve appropriate results either by taking five times
as many ◦’s as we would get from simple random sampling (by sampling with
replacement if necessary), or by replicating the existing ◦’s fourfold.
142 EVALUATING PREDICTIVE PERFORMANCE
FIGURE 5.9 CLASSIFICATION ASSUMING EQUAL COSTS OF MISCLASSIFICATION
FIGURE 5.10 CLASSIFICATION ASSUMING UNEQUAL COSTS OF MISCLASSIFICATION
OVERSAMPLING 143
FIGURE 5.11 CLASSIFICATION USING OVERSAMPLING TO ACCOUNT FOR UNEQUAL COSTS
Oversampling without replacement in accord with the ratio of costs (the first
option above) is the optimal solution, but may not always be practical. There may
not be an adequate number of responders to assure that there will be enough of
them to fit a model if they constitute only a small proportion of the total. Also, it
is often the case that our interest in discovering responders is known to be much
greater than our interest in discovering non-responders, but the exact ratio of
costs is difficult to determine. When faced with very low response rates in a
classification problem, practitioners often sample equal numbers of responders
and non-responders as a relatively effective and convenient approach. Whatever
approach is used, when it comes time to assess and predict model performance,
we will need to adjust for the oversampling in one of two ways:
1. score the model to a validation set that has been selected without over-
sampling (i.e., via simple random sampling), or
2. score the model to an oversampled validation set, and reweight the results
to remove the effects of oversampling.
The first method is more straightforward and easier to implement. We
describe how to oversample and how to evaluate performance for each of the
two methods.
144 EVALUATING PREDICTIVE PERFORMANCE
When classifying data with very low response rates, practitioners typically:
• train models on data that are 50% responder, 50% non-responder
• validate the models with an unweighted (simple random) sample from the original
data
Oversampling the Training Set
How is weighted sampling done? When responders are sufficiently scarce that
you will want to use all of them, one common procedure is as follows:
1. First, the response and non-response data are separated into two distinct
sets, or strata.
2. Records are then randomly selected for the training set from each stratum.
Typically, one might select half the (scarce) responders for the training set,
then an equal number of non-responders.
3. The remaining responders are put in the validation set.
4. Non-responders are randomly selected for the validation set in sufficient
numbers to maintain the original ratio of responders to non-responders.
5. If a test set is required, it can be taken randomly from the validation set.
Evaluating Model Performance Using a Non-oversampled Validation Set
Although the oversampled data can be used to train models, they are often not
suitable for evaluating model performance, because the number of responders
will (of course) be exaggerated. The most straightforward way of gaining an
unbiased estimate of model performance is to apply the model to regular data
(i.e., data not oversampled). In short, train the model on oversampled data, but
validate it with regular data.
Evaluating Model Performance if Only Oversampled Validation Set Exists
In some cases, very low response rates may make it more practical to use over-
sampled data not only for the training data, but also for the validation data. This
might happen, for example, if an analyst is given a dataset for exploration and
prototyping that is already oversampled to boost the proportion with the rare
response of interest (perhaps because it is more convenient to transfer and work
with a smaller dataset). In such cases, it is still possible to assess how well the
model will do with real data, but this requires the oversampled validation set
to be reweighted, in order to restore the class of records that were underrepre-
sented in the sampling process. This adjustment should be made to the confusion
matrix and to the lift chart in order to derive good accuracy measures. These
adjustments are described next.
OVERSAMPLING 145
I. Adjusting the Confusion Matrix for Oversampling Suppose the
response rate in the data as a whole is 2%, and that the data were oversampled,
yielding a sample in which the response rate is 25 times higher (50% responders).
The relationship is as follows:
Responders: 2% of the whole data; 50% of the sample
Non-responders: 98% of the whole data, 50% of the sample
Each responder in the whole data is worth 25 responders in the sample
(50/2). Each non-responder in the whole data is worth 0.5102 non-responders
in the sample (50/98). We call these values oversampling weights.
Assume that the validation confusion matrix looks like this:
CONFUSION MATRIX, OVERSAMPLED DATA (VALIDATION)
Actual 0 Actual 1 Total
Predicted 0 390 80 470
Predicted 1 110 420 530
Total 500 500 1000
At this point, the misclassification rate appears to be (80 + 110)/1000 =
19%, and the model ends up classifying 53% of the records as 1’s. However, this
reflects the performance on a sample where 50% are responders.
To estimate predictive performance when this model is used to score the
original population (with 2% responders), we need to undo the effects of the
oversampling. The actual number of responders must be divided by 25, and the
actual number of non-responders divided by 0.5102.
The revised confusion matrix is as follows:
CONFUSION MATRIX, REWEIGHTED
Actual 0 Actual 1 Total
Predicted 0 390/0.5102 = 764.4 80/25 = 3.2
Predicted 1 110/0.5102 = 215.6 420/25 = 16.8
Total 980 20 1000
The adjusted misclassification rate is (3.2 + 215.6)/1000 = 21.9%. The
model ends up classifying (215.6+16.8)/1000 = 23.24% of the records as 1’s,
when we assume 2% responders.
146 EVALUATING PREDICTIVE PERFORMANCE
II. Adjusting the Lift Curve for Oversampling The lift curve is likely
to be a more useful measure in low-response situations, where our interest lies
not so much in classifying all the records correctly as in finding a model that
guides us toward those records most likely to contain the response of interest
(under the assumption that scarce resources preclude examining or contacting
all the records). Typically, our interest in such a case is in maximizing value
or minimizing cost, so we will show the adjustment process incorporating the
benefit/cost element. The following procedure can be used:
1. Sort the validation records in order of the predicted probability of success
(where success = belonging to the class of interest).
2. For each record, record the cost (benefit) associated with the actual out-
come.
3. Divide that value by the oversampling rate. For example, if responders
are overweighted by a factor of 25, divide by 25.
4. For the highest probability (i.e., first) record, the value above is the y-
coordinate of the first point on the lift chart. The x-coordinate is index
number 1.
5. For the next record, again calculate the adjusted value associated with the
actual outcome. Add this to the adjusted cost (benefit) for the previous
record. This sum is the y-coordinate of the second point on the lift curve.
The x-coordinate is index number 2.
6. Repeat Step 5 until all records have been examined. Connect all the
points, and this is the lift curve.
7. The reference line is a straight line from the origin to the point y = total
net benefit and x = n (n = number of records).
PROBLEMS 147
PROBLEMS
5.1 A data mining routine has been applied to a transaction dataset and has classified 88
records as fraudulent (30 correctly so) and 952 as non-fraudulent (920 correctly so).
Construct the confusion matrix and calculate the overall error rate.
5.2 Suppose that this routine has an adjustable cutoff (threshold) mechanism by which you
can alter the proportion of records classified as fraudulent. Describe how moving the
cutoff up or down would affect
a. the classification error rate for records that are truly fraudulent
b. the classification error rate for records that are truly nonfraudulent
5.3 FiscalNote is a startup founded by a Washington, DC entrepreneur and funded by a
Singapore sovereign wealth fund, the Winklevoss twins of Facebook fame, and oth-
ers. It uses machine learning and data mining techniques to predict for its clients
whether legislation in the US Congress and in US state legislatures will pass or not.
The company reports 94% accuracy. (Washington Post, November 21, 2014, “Capital
Business”)
Considering just bills introduced in the US Congress, do a bit of internet research
to learn about numbers of bills introduced and passage rates. Identify the possible types
of misclassifications, and comment on the use of overall accuracy as a metric. Include
a discussion of other possible metrics and the potential role of propensities.
5.4 Consider Figure 5.12, the decile-wise lift chart for the transaction data model, applied
to new data.
1 2 3 4 5 6 7 8 9 10
Percentile
M
ea
n
R
es
po
ns
e
0
1
2
3
4
5
6
FIGURE 5.12 DECILE-WISE LIFT CHART FOR TRANSACTION DATA
a. Interpret the meaning of the first and second bars from the left.
b. Explain how you might use this information in practice.
c. Another analyst comments that you could improve the accuracy of the model by
classifying everything as nonfraudulent. If you do that, what is the error rate?
d. Comment on the usefulness, in this situation, of these two metrics of model per-
formance (error rate and lift).
5.5 A large number of insurance records are to be examined to develop a model for pre-
dicting fraudulent claims. Of the claims in the historical database, 1% were judged
148 EVALUATING PREDICTIVE PERFORMANCE
to be fraudulent. A sample is taken to develop a model, and oversampling is used to
provide a balanced sample in light of the very low response rate. When applied to
this sample (n = 800), the model ends up correctly classifying 310 frauds, and 270
nonfrauds. It missed 90 frauds, and classified 130 records incorrectly as frauds when
they were not.
a. Produce the confusion matrix for the sample as it stands.
b. Find the adjusted misclassification rate (adjusting for the oversampling).
c. What percentage of new records would you expect to be classified as fraudulent?
5.6 A firm that sells software services has been piloting a new product and has records
of 500 customers who have either bought the services or decided not to. The target
value is the estimated profit from each sale (excluding sales costs). The global mean
is $2128. However, the cost of the sales effort is not cheap—the company figures it
comes to $2500 for each of the 500 customers (whether they buy or not). The firm
developed a predictive model in hopes of being able to identify the top spenders in
the future. The lift and decile charts for the validation set are shown in Figure 5.13.
0 50 100 150 200
0
10
00
00
30
00
00
50
00
00
# cases
C
um
ul
at
iv
e
10 20 30 40 50 60 70 80 90 100
Decile-wise lift chart
Percentile
M
ea
n
R
es
po
ns
e
0.
0
0.
5
1.
0
1.
5
2.
0
FIGURE 5.13 LIFT AND DECILE-WISE LIFT CHARTS FOR SOFTWARE SERVICES PRODUCT SALES
a. If the company begins working with a new set of 1000 leads to sell the same services,
similar to the 500 in the pilot study, without any use of predictive modeling to target
sales efforts, what is the estimated profit?
b. If the firm wants the average profit on each sale to at least double the sales effort cost,
and applies an appropriate cutoff with this predictive model to a new set of 1000
leads, how far down the new list of 1000 should it proceed (how many deciles)?
c. Still considering the new list of 1000 leads, if the company applies this predictive
model with a lower cutoff of $2500, how far should it proceed down the ranked
leads, in terms of deciles?
PROBLEMS 149
d. Why use this two-stage process for predicting sales—why not simply develop a
model for predicting profit for the 1000 new leads?
5.7 Table 5.7 shows a small set of predictive model validation results for a classification
model, with both actual values and propensities.
a. Calculate error rates, sensitivity, and specificity using cutoffs of 0.25, 0.5, and 0.75.
b. Create a decile-wise lift chart in R.
TABLE 5.7 PROPENSITIES AND ACTUAL CLASS
MEMBERSHIP FOR VALIDATION DATA
Propensity of 1 Actual
0.03 0
0.52 0
0.38 0
0.82 1
0.33 0
0.42 0
0.55 1
0.59 0
0.09 0
0.21 0
0.43 0
0.04 0
0.08 0
0.13 0
0.01 0
0.79 1
0.42 0
0.29 0
0.08 0
0.02 0
Part IV
Prediction and Classification Methods
CHAPTER 6
Multiple Linear Regression
In this chapter, we introduce linear regression models for the purpose of pre-
diction. We discuss the differences between fitting and using regression models
for the purpose of inference (as in classical statistics) and for prediction. A pre-
dictive goal calls for evaluating model performance on a validation set, and for
using predictive metrics. We then raise the challenges of using many predictors
and describe variable selection algorithms that are often implemented in linear
regression procedures.
6.1 Introduction
The most popular model for making predictions is the multiple linear regres-
sion model encountered in most introductory statistics courses and textbooks.
This model is used to fit a relationship between a numerical outcome variable
Y (also called the response, target, or dependent variable) and a set of predictors
X1, X2, . . . , Xp (also referred to as independent variables, input variables, regres-
sors, or covariates). The assumption is that the following function approximates
the relationship between the predictors and outcome variable:
Y = β0 + β1x1 + β2x2 + · · ·+ βpxp + ϵ, (6.1)
where β0, . . . , βp are coefficients and ϵ is the noise or unexplained part. Data are
then used to estimate the coefficients and to quantify the noise. In predictive
modeling, the data are also used to evaluate model performance.
Regression modeling means not only estimating the coefficients but also
choosing which predictors to include and in what form. For example, a numer-
ical predictor can be included as is, or in logarithmic form [log(X)], or in a
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
153
154 MULTIPLE LINEAR REGRESSION
binned form (e.g., age group). Choosing the right form depends on domain
knowledge, data availability, and needed predictive power.
Multiple linear regression is applicable to numerous predictive modeling sit-
uations. Examples are predicting customer activity on credit cards from their
demographics and historical activity patterns, predicting expenditures on vaca-
tion travel based on historical frequent flyer data, predicting staffing require-
ments at help desks based on historical data and product and sales information,
predicting sales from cross-selling of products from historical information, and
predicting the impact of discounts on sales in retail outlets.
6.2 Explanatory vs. Predictive Modeling
Before introducing the use of linear regression for prediction, we must clarify an
important distinction that often escapes those with earlier familiarity with linear
regression from courses in statistics. In particular, the two popular but different
objectives behind fitting a regression model are:
1. Explaining or quantifying the average effect of inputs on an outcome
(explanatory or descriptive task, respectively)
2. Predicting the outcome value for new records, given their input values
(predictive task)
The classical statistical approach is focused on the first objective. In that scenario,
the data are treated as a random sample from a larger population of interest.
The regression model estimated from this sample is an attempt to capture the
average relationship in the larger population. This model is then used in decision-
making to generate statements such as “a unit increase in service speed (X1) is
associated with an average increase of 5 points in customer satisfaction (Y ), all
other factors (X2, X3, . . . , Xp) being equal.” If X1 is known to cause Y , then
such a statement indicates actionable policy changes—this is called explanatory
modeling. When the causal structure is unknown, then this model quantifies the
degree of association between the inputs and outcome variable, and the approach
is called descriptive modeling.
In predictive analytics, however, the focus is typically on the second goal:
predicting new individual records. Here we are not interested in the coefficients
themselves, nor in the “average record,” but rather in the predictions that this
model can generate for new records. In this scenario, the model is used for
micro-decision-making at the record level. In our previous example, we would
use the regression model to predict customer satisfaction for each new customer
of interest.
EXPLANATORY VS. PREDICTIVE MODELING 155
Both explanatory and predictive modeling involve using a dataset to fit a
model (i.e., to estimate coefficients), checking model validity, assessing its per-
formance, and comparing to other models. However, the modeling steps and
performance assessment differ in the two cases, usually leading to different final
models. Therefore, the choice of model is closely tied to whether the goal is
explanatory or predictive.
In explanatory and descriptive modeling, where the focus is on modeling
the average record, we try to fit the best model to the data in an attempt to learn
about the underlying relationship in the population. In contrast, in predictive
modeling (data mining), the goal is to find a regression model that best predicts
new individual records. A regression model that fits the existing data too well is
not likely to perform well with new data. Hence, we look for a model that has
the highest predictive power by evaluating it on a holdout set and using predictive
metrics (see Chapter 5).
Let us summarize the main differences in using a linear regression in the two
scenarios:
1. A good explanatory model is one that fits the data closely, whereas a good
predictive model is one that predicts new records accurately. Choices of
input variables and their form can therefore differ.
2. In explanatory models, the entire dataset is used for estimating the best-
fit model, to maximize the amount of information that we have about
the hypothesized relationship in the population. When the goal is to
predict outcomes of new individual records, the data are typically split
into a training set and a validation set. The training set is used to estimate
the model, and the validation or holdout set is used to assess this model’s
predictive performance on new, unobserved data.
3. Performance measures for explanatory models measure how close the
data fit the model (how well the model approximates the data) and how
strong the average relationship is, whereas in predictive models perfor-
mance is measured by predictive accuracy (how well the model predicts
new individual records).
4. In explanatory models the focus is on the coefficients (β), whereas in
predictive models the focus is on the predictions (ŷ).
For these reasons, it is extremely important to know the goal of the analysis
before beginning the modeling process. A good predictive model can have a
looser fit to the data on which it is based, and a good explanatory model can
have low prediction accuracy. In the remainder of this chapter, we focus on
predictive models because these are more popular in data mining and because
most statistics textbooks focus on explanatory modeling.
156 MULTIPLE LINEAR REGRESSION
6.3 Estimating the Regression Equation and
Prediction
Once we determine the predictors to include and their form, we estimate the
coefficients of the regression formula from the data using a method called ordinary
least squares (OLS). This method finds values β̂0, β̂1, β̂2, . . . , β̂p that minimize
the sum of squared deviations between the actual outcome values (Y ) and their
predicted values based on that model (Ŷ ).
To predict the value of the outcome variable for a record with predictor
values x1, x2, . . . , xp, we use the equation
Ŷ = β̂0 + β̂1x1 + β̂2x2 + · · ·+ β̂pxp. (6.2)
Predictions based on this equation are the best predictions possible in the sense
that they will be unbiased (equal to the true values on average) and will have the
smallest mean squared error compared to any unbiased estimates if we make the
following assumptions:
1. The noise ϵ (or equivalently, Y ) follows a normal distribution.
2. The choice of predictors and their form is correct (linearity).
3. The records are independent of each other.
4. The variability in the outcome values for a given set of predictors is the
same regardless of the values of the predictors (homoskedasticity).
An important and interesting fact for the predictive goal is that even if we drop
the first assumption and allow the noise to follow an arbitrary distribution, these estimates
are very good for prediction, in the sense that among all linear models, as defined by
equation (6.1), the model using the least squares estimates, β̂0, β̂1, β̂2, . . . , β̂p,
will have the smallest mean squared errors. The assumption of a normal dis-
tribution is required in explanatory modeling, where it is used for constructing
confidence intervals and statistical tests for the model parameters.
Even if the other assumptions are violated, it is still possible that the resulting
predictions are sufficiently accurate and precise for the purpose they are intended
for. The key is to evaluate predictive performance of the model, which is the
main priority. Satisfying assumptions is of secondary interest and residual analysis
can give clues to potential improved models to examine.
Example: Predicting the Price of Used Toyota Corolla Cars
A large Toyota car dealership offers purchasers of new Toyota cars the option to
buy their used car as part of a trade-in. In particular, a new promotion promises
ESTIMATING THE REGRESSION EQUATION AND PREDICTION 157
TABLE 6.1 VARIABLES IN THE TOYOTA COROLLA
EXAMPLE
Variable Description
Price Offer price in Euros
Age Age in months as of August 2004
Kilometers Accumulated kilometers on odometer
Fuel Type Fuel type (Petrol, Diesel, CNG)
HP Horsepower
Metallic Metallic color? (Yes = 1, No = 0)
Automatic Automatic (Yes = 1, No = 0)
CC Cylinder volume in cubic centimeters
Doors Number of doors
QuartTax Quarterly road tax in Euros
Weight Weight in kilograms
to pay high prices for used Toyota Corolla cars for purchasers of a new car.
The dealer then sells the used cars for a small profit. To ensure a reasonable
profit, the dealer needs to be able to predict the price that the dealership will
get for the used cars. For that reason, data were collected on all previous sales
of used Toyota Corollas at the dealership. The data include the sales price and
other information on the car, such as its age, mileage, fuel type, and engine
size. A description of each of these variables is given in Table 6.1. A sam-
ple of this dataset is shown in Table 6.2. The total number of records in the
dataset is 1000 cars (we use the first 1000 cars from the dataset ToyotoCorolla.csv).
After partitioning the data into training (60%) and validation (40%) sets, we fit
a multiple linear regression model between price (the outcome variable) and
the other variables (as predictors) using only the training set. Table 6.3 shows
the estimated coefficients. Notice that the Fuel Type predictor has three cate-
gories (Petrol, Diesel, and CNG). We therefore have two dummy variables in the
model: Fuel_TypePetrol (0/1) and Fuel_TypeDiesel (0/1); the third, for CNG
(0/1), is redundant given the information on the first two dummies. Including
the redundant dummy would cause the regression to fail, since the redundant
dummy will be a perfect linear combination of the other two; R’s “lm” routine
handles this issue automatically.
The regression coefficients are then used to predict prices of individual used
Toyota Corolla cars based on their age, mileage, and so on. Table 6.4 shows a
sample of predicted prices for 20 cars in the validation set, using the estimated
model. It gives the predictions and their errors (relative to the actual prices)
for these 20 cars. Below the predictions, we have overall measures of predictive
158 MULTIPLE LINEAR REGRESSION
TABLE 6.2 PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS
(SELECTED ROWS AND COLUMNS ONLY)
Fuel Auto- Quart
Price Age Kilometers Type HP Metallic matic CC Doors Tax Weight
13500 23 46986 Diesel 90 1 0 2000 3 210 1165
13750 23 72937 Diesel 90 1 0 2000 3 210 1165
13950 24 41711 Diesel 90 1 0 2000 3 210 1165
14950 26 48000 Diesel 90 0 0 2000 3 210 1165
13750 30 38500 Diesel 90 0 0 2000 3 210 1170
12950 32 61000 Diesel 90 0 0 2000 3 210 1170
16900 27 94612 Diesel 90 1 0 2000 3 210 1245
18600 30 75889 Diesel 90 1 0 2000 3 210 1245
21500 27 19700 Petrol 192 0 0 1800 3 100 1185
12950 23 71138 Diesel 69 0 0 1900 3 185 1105
20950 25 31461 Petrol 192 0 0 1800 3 100 1185
19950 22 43610 Petrol 192 0 0 1800 3 100 1185
19600 25 32189 Petrol 192 0 0 1800 3 100 1185
21500 31 23000 Petrol 192 1 0 1800 3 100 1185
22500 32 34131 Petrol 192 1 0 1800 3 100 1185
22000 28 18739 Petrol 192 0 0 1800 3 100 1185
22750 30 34000 Petrol 192 1 0 1800 3 100 1185
17950 24 21716 Petrol 110 1 0 1600 3 85 1105
16750 24 25563 Petrol 110 0 0 1600 3 19 1065
16950 30 64359 Petrol 110 1 0 1600 3 85 1105
15950 30 67660 Petrol 110 1 0 1600 3 85 1105
16950 29 43905 Petrol 110 0 1 1600 3 100 1170
15950 28 56349 Petrol 110 1 0 1600 3 85 1120
16950 28 32220 Petrol 110 1 0 1600 3 85 1120
16250 29 25813 Petrol 110 1 0 1600 3 85 1120
15950 25 28450 Petrol 110 1 0 1600 3 85 1120
17495 27 34545 Petrol 110 1 0 1600 3 85 1120
15750 29 41415 Petrol 110 1 0 1600 3 85 1120
11950 39 98823 CNG 110 1 0 1600 5 197 1119
accuracy. Note that the mean error (ME) is $ − 40 and RMSE = $1321. A
histogram of the residuals (Figure 6.1) shows that most of the errors are between
±$2000. This error magnitude might be small relative to the car price, but
should be taken into account when considering the profit. Another observation
of interest is the large positive residuals (under-predictions), which may or may
not be a concern, depending on the application. Measures such as the mean
error, and error percentiles are used to assess the predictive performance of a
model and to compare models.
ESTIMATING THE REGRESSION EQUATION AND PREDICTION 159
TABLE 6.3 LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES
code for fitting a regression model
car.df <- read.csv("ToyotaCorolla.csv")
# use first 1000 rows of data
car.df <- car.df[1:1000, ]
# select variables for regression
selected.var <- c(3, 4, 7, 8, 9, 10, 12, 13, 14, 17, 18)
# partition data
set.seed(1) # set seed for reproducing the partition
train.index <- sample(c(1:1000), 600)
train.df <- car.df[train.index, selected.var]
valid.df <- car.df[-train.index, selected.var]
# use lm() to run a linear regression of Price on all 11 predictors in the
# training set.
# use . after ~ to include all the remaining columns in train.df as predictors.
car.lm <- lm(Price ~ ., data = train.df)
# use options() to ensure numbers are not displayed in scientific notation.
options(scipen = 999)
summary(car.lm)
Partial Output
> summary(car.lm)
Call:
lm(formula = Price ~ ., data = train.df)
Residuals:
Min 1Q Median 3Q Max
-8212.5 -839.2 -14.3 831.5 7270.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1774.877829 1643.744823 -1.080 0.2807
Age_08_04 -135.430875 4.875906 -27.776 < 0.0000000000000002 ***
KM -0.019003 0.002341 -8.116 0.00000000000000283 ***
Fuel_TypeDiesel 1208.339159 534.431400 2.261 0.0241 *
Fuel_TypePetrol 2425.876714 520.587979 4.660 0.00000391697679667 ***
HP 38.985537 5.587183 6.978 0.00000000000811621 ***
Met_Color 84.792715 126.883452 0.668 0.5042
Automatic 306.684154 289.433138 1.060 0.2898
CC 0.031966 0.099075 0.323 0.7471
Doors -44.157742 64.056530 -0.689 0.4909
Quarterly_Tax 16.677343 2.602668 6.408 0.00000000030287017 ***
Weight 12.667487 1.536587 8.244 0.00000000000000109 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1406 on 588 degrees of freedom
Multiple R-squared: 0.8567, Adjusted R-squared: 0.854
F-statistic: 319.6 on 11 and 588 DF, p-value: < 0.00000000000000022
160 MULTIPLE LINEAR REGRESSION
TABLE 6.4 PREDICTED PRICES (AND ERRORS) FOR 20 CARS IN VALIDATION SET AND
SUMMARY PREDICTIVE MEASURES FOR ENTIRE VALIDATION SET (CALLED TEST
SET IN R)
code for prediction and measuring accuracy
library(forecast)
# use predict() to make predictions on a new set.
car.lm.pred <- predict(car.lm, valid.df)
options(scipen=999, digits = 0)
some.residuals <- valid.df$Price[1:20] - car.lm.pred[1:20]
data.frame("Predicted" = car.lm.pred[1:20], "Actual" = valid.df$Price[1:20],
"Residual" = some.residuals)
options(scipen=999, digits = 3)
# use accuracy() to compute common accuracy measures.
accuracy(car.lm.pred, valid.df$Price)
Output
> data.frame(“Predicted” = car.lm.pred[1:20],
+ “Actual” = valid.df$Price[1:20], “Residual” = some.residuals)
Predicted Actual Residual
3 17175 13950 -3225
6 15704 12950 -2754
8 16727 18600 1873
9 20709 21500 791
10 14668 12950 -1718
11 20756 20950 194
13 20743 19600 -1143
16 20592 22000 1408
17 20116 22750 2634
18 16695 17950 1255
19 14930 16750 1820
20 15072 16950 1878
25 16130 16250 120
26 16622 15950 -672
27 16235 17495 1260
29 15832 16950 1118
33 15564 15950 386
34 15639 14950 -689
37 15836 15950 114
38 16477 14950 -1527
> accuracy(car.lm.pred, valid.df$Price)
ME RMSE MAE MPE MAPE
Test set -40.1 1321 1012 -1.72 9.01
VARIABLE SELECTION IN LINEAR REGRESSION 161
code for plotting histogram of validation errors
library(forecast)
car.lm.pred <- predict(car.lm, valid.df)
all.residuals <- valid.df$Price - car.lm.pred
length(all.residuals[which(all.residuals > -1406 & all.residuals < 1406)])/400
hist(all.residuals, breaks = 25, xlab = "Residuals", main = "")
Residuals
F
re
qu
en
cy
−4000 −2000 0 2000 4000 6000
0
10
20
30
40
50
60
70
FIGURE 6.1 HISTOGRAM OF MODEL ERRORS (BASED ON VALIDATION SET)
6.4 Variable Selection in Linear Regression
Reducing the Number of Predictors
A frequent problem in data mining is that of using a regression equation to
predict the value of a dependent variable when we have many variables available
to choose as predictors in our model. Given the high speed of modern algorithms
for multiple linear regression calculations, it is tempting in such a situation to take
a kitchen-sink approach: Why bother to select a subset? Just use all the variables
in the model.
Another consideration favoring the inclusions of numerous variables is the
hope that a previously hidden relationship will emerge. For example, a company
found that customers who had purchased anti-scuff protectors for chair and table
162 MULTIPLE LINEAR REGRESSION
legs had lower credit risks. However, there are several reasons for exercising
caution before throwing all possible variables into a model.
• It may be expensive or not feasible to collect a full complement of pre-
dictors for future predictions.
• We may be able to measure fewer predictors more accurately (e.g., in
surveys).
• The more predictors, the higher the chance of missing values in the data.
If we delete or impute records with missing values, multiple predictors
will lead to a higher rate of record deletion or imputation.
• Parsimony is an important property of good models. We obtain more
insight into the influence of predictors in models with few parameters.
• Estimates of regression coefficients are likely to be unstable, due to mul-
ticollinearity in models with many variables. (Multicollinearity is the pres-
ence of two or more predictors sharing the same linear relationship with
the outcome variable.) Regression coefficients are more stable for parsi-
monious models. One very rough rule of thumb is to have a number of
records n larger than 5(p+ 2), where p is the number of predictors.
• It can be shown that using predictors that are uncorrelated with the out-
come variable increases the variance of predictions.
• It can be shown that dropping predictors that are actually correlated with
the outcome variable can increase the average error (bias) of predictions.
The last two points mean that there is a trade-off between too few and too
many predictors. In general, accepting some bias can reduce the variance in pre-
dictions. This bias–variance trade-off is particularly important for large numbers
of predictors, because in that case, it is very likely that there are variables in the
model that have small coefficients relative to the standard deviation of the noise
and also exhibit at least moderate correlation with other variables. Dropping
such variables will improve the predictions, as it reduces the prediction variance.
This type of bias–variance trade-off is a basic aspect of most data mining proce-
dures for prediction and classification. In light of this, methods for reducing the
number of predictors p to a smaller set are often used.
How to Reduce the Number of Predictors
The first step in trying to reduce the number of predictors should always be to use
domain knowledge. It is important to understand what the various predictors are
measuring and why they are relevant for predicting the outcome variable. With
this knowledge, the set of predictors should be reduced to a sensible set that
reflects the problem at hand. Some practical reasons for predictor elimination
VARIABLE SELECTION IN LINEAR REGRESSION 163
are the expense of collecting this information in the future; inaccuracy; high
correlation with another predictor; many missing values; or simply irrelevance.
Also helpful in examining potential predictors are summary statistics and graphs,
such as frequency and correlation tables, predictor-specific summary statistics
and plots, and missing value counts.
The next step makes use of computational power and statistical performance
metrics. In general, there are two types of methods for reducing the number
of predictors in a model. The first is an exhaustive search for the “best” subset
of predictors by fitting regression models with all the possible combinations of
predictors. The exhaustive search approach is not practical in many applications,
and implementation in R can be tedious and unstable. The second approach
is to search through a partial set of models. We describe these two approaches
next.
Exhaustive Search The idea here is to evaluate all subsets of predictors.
Since the number of subsets for even moderate values of p is very large, after
the algorithm creates the subsets and runs all the models, we need some way
to examine the most promising subsets and to select from them. The challenge
is to select a model that is not too simplistic in terms of excluding important
parameters (the model is under-fit), nor overly complex thereby modeling random
noise (the model is over-fit). Several criteria for evaluating and comparing models
are based on metrics computed from the training data:
One popular criterion is the adjusted R2, which is defined as
R2adj = 1−
n− 1
n− p− 1
(1−R2),
where R2 is the proportion of explained variability in the model (in a model
with a single predictor, this is the squared correlation). Like R2, higher values
of R2adj indicate better fit. Unlike R
2, which does not account for the number
of predictors used, R2adj uses a penalty on the number of predictors. This avoids
the artificial increase in R2 that can result from simply increasing the number of
predictors but not the amount of information. It can be shown that using R2adj
to choose a subset is equivalent to picking the subset that minimizes σ̂2.
A second popular set of criteria for balancing under-fitting and over-fitting
are the Akaike Information Criterion (AIC) and Schwartz’s Bayesian Information
Criterion (BIC). AIC and BIC measure the goodness of fit of a model, but also
include a penalty that is a function of the number of parameters in the model.
As such, they can be used to compare various models for the same data set. AIC
and BIC are estimates of prediction error based in information theory. Their
derivation is beyond the scope of this book, but suffice it to say that models with
smaller AIC and BIC values are considered better.
164 MULTIPLE LINEAR REGRESSION
A third criterion often used for subset selection is Mallow’s Cp (see formula
below1). This criterion assumes that the full model (with all predictors) is unbi-
ased, although it may have predictors that if dropped would reduce prediction
variability. With this assumption, we can show that if a subset model is unbiased,
the average Cp value equals p+ 1 (= number of predictors + 1), the size of the
subset. So a reasonable approach to identifying subset models with small bias is
to examine those with values of Cp that are near p+ 1. Good models are those
that have values of Cp near p + 1 and that have small p (i.e., are of small size).
Cp is computed from the formula
Cp =
SSE
σ̂2full
+ 2(p+ 1)− n, (6.3)
where σ̂2full is the estimated value of σ
2 in the full model that includes all predic-
tors. It is important to remember that the usefulness of this approach depends
heavily on the reliability of the estimate of σ2 for the full model. This requires
that the training set contain a large number of records relative to the number of
predictors.
Note: It can be shown that for linear regression, in large samples Mallows’s Cp
is equivalent to AIC.
Finally, a useful point to note is that for a fixed size of subset, R2, R2adj, Cp,
AIC, and BIC all select the same subset. In fact, there is no difference between
them in the order of merit they ascribe to subsets of a fixed size. This is good to
know if comparing models with the same number of predictors, but often we
want to compare models with different numbers of predictors.
Table 6.5 gives the results of applying an exhaustive search on the Toyota
Corolla price data (with the 11 predictors). It reports the best model with a
single predictor, two predictors, and so on. It can be seen that the R2adj increases
until eight predictors are used (number of coefficients = 9) and then stabilizes.
The Cp indicates that a model with 7 to 8 predictors is good. The dominant
predictor in all models is the age of the car, with horsepower and mileage playing
important roles as well.
Popular Subset Selection Algorithms The second method of finding
the best subset of predictors relies on a partial, iterative search through the space
of all possible regression models. The end product is one best subset of pre-
dictors (although there do exist variations of these methods that identify several
1Mallow’s Cp is unrelated to the CP, or complexity parameter, used in classification and regression
trees, described in Chapter 9).
VARIABLE SELECTION IN LINEAR REGRESSION 165
TABLE 6.5 EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE
code for best subset
# use regsubsets() in package leaps to run an exhaustive search.
# unlike with lm, categorical predictors must be turned into dummies manually.
library(leaps)
# create dummies for fuel type
Fuel_Type <- as.data.frame(model.matrix(~ 0 + Fuel_Type, data=train.df))
# replace Fuel_Type column with 2 dummies
train.df <- cbind(train.df[,-4], Fuel_Type[,])
head(train.df)
search <- regsubsets(Price ~ ., data = train.df, nbest = 1, nvmax = dim(train.df)[2],
method = "exhaustive")
sum <- summary(search)
# show models
sum$which
# show metrics
sum$rsq
sum$adjr2
sum$Cp
Output
> sum$which
(Intercept) Age_08_04 KM HP Met_Color Automatic CC Doors Quarterly_Tax
1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
4 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
5 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
6 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
7 TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
8 TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
9 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
10 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
11 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Weight Fuel_TypeCNG Fuel_TypeDiesel Fuel_TypePetrol
1 FALSE FALSE FALSE FALSE
2 TRUE FALSE FALSE FALSE
3 TRUE FALSE FALSE FALSE
4 TRUE FALSE FALSE FALSE
5 TRUE FALSE FALSE FALSE
6 TRUE FALSE FALSE TRUE
7 TRUE TRUE TRUE FALSE
8 TRUE TRUE TRUE FALSE
9 TRUE TRUE TRUE FALSE
10 TRUE FALSE TRUE TRUE
11 TRUE TRUE TRUE FALSE
> sum$rsq
[1] 0.7560439 0.7929293 0.8276610 0.8447333 0.8506850 0.8549587
[7] 0.8561788 0.8564857 0.8565820 0.8566933 0.8567187
> sum$adjr2
[1] 0.7556359 0.7922356 0.8267935 0.8436895 0.8494282 0.8534911
[7] 0.8544782 0.8545430 0.8543943 0.8542602 0.8540382
> sum$cp
[1] 403.45 254.33 114.04 46.10 23.72 8.21 5.21 5.95 7.56 9.10 11.00
166 MULTIPLE LINEAR REGRESSION
close-to-best choices for different sizes of predictor subsets). This approach is
computationally cheaper, but it has the potential of missing “good” combina-
tions of predictors. None of the methods guarantee that they yield the best
subset for any criterion, such as R2adj. They are reasonable methods for situations
with a large number of predictors, but for a moderate number of predictors, the
exhaustive search is preferable.
Three popular iterative search algorithms are forward selection, backward elimi-
nation, and stepwise regression. In forward selection, we start with no predictors and
then add predictors one by one. Each predictor added is the one (among all
predictors) that has the largest contribution to R2 on top of the predictors that
are already in it. The algorithm stops when the contribution of additional pre-
dictors is not statistically significant. The main disadvantage of this method is
that the algorithm will miss pairs or groups of predictors that perform very well
together but perform poorly as single predictors. This is similar to interviewing
job candidates for a team project one by one, thereby missing groups of candi-
dates who perform superiorly together (“colleagues”), but poorly on their own
or with non-colleagues.
In backward elimination, we start with all predictors and then at each step,
eliminate the least useful predictor (according to statistical significance). The
algorithm stops when all the remaining predictors have significant contributions.
The weakness of this algorithm is that computing the initial model with all
predictors can be time-consuming and unstable. Stepwise regression is like forward
selection except that at each step, we consider dropping predictors that are not
statistically significant, as in backward elimination.
R has several libraries with stepwise functions: function regsubsets() in the
leaps package implements (in addition to exhaustive search) forward selection,
backward elimination, and stepwise regression. Predictors are added/dropped
based on eitherR2,R2adj, orCp. In contrast, function step() in the stats package,
as well as function stepAIC() in the MASS package perform model selection using
the AIC criterion (stepAIC offers a wider range of object classes).
Table 6.6 shows the result of backward elimination for the Toyota Corolla
example. The chosen seven-predictor model is identical to the best seven-
predictor model chosen by the exhaustive search. However, recall that the
exhaustive search indicated a higher R2adj for the eight-predictor model. In com-
parison, forward selection (Table 6.7) selected an 11-predictor model, thereby
not eliminating any predictor. The results for stepwise selection, seen in Table
6.8, are the same as those obtained by backward elimination.
Finally, additional ways to reduce the dimension of the data are by using
principal components (Chapter 4) and regression trees (Chapter 9).
VARIABLE SELECTION IN LINEAR REGRESSION 167
TABLE 6.6 BACKWARD ELIMINATION FOR REDUCING PREDICTORS IN TOYOTA COROLLA
EXAMPLE
code for stepwise regression
# use step() to run stepwise regression.
# set directions = to either “backward”, “forward”, or “both”.
car.lm.step <- step(car.lm, direction = "backward")
summary(car.lm.step) # Which variables did it drop?
car.lm.step.pred <- predict(car.lm.step, valid.df)
accuracy(car.lm.step.pred, valid.df$Price)
Output
> summary(car.lm.step)
Call:
lm(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Quarterly_Tax +
Weight, data = train.df)
Residuals:
Min 1Q Median 3Q Max
-8263 -825 1 839 7312
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1853.36897 1620.35672 -1.14 0.253
Age_08_04 -135.72630 4.83995 -28.04 < 0.0000000000000002 ***
KM -0.01912 0.00233 -8.19 0.0000000000000016 ***
Fuel_TypeDiesel 1179.35368 526.25097 2.24 0.025 *
Fuel_TypePetrol 2374.05722 517.80593 4.58 0.0000055461532557 ***
HP 39.27366 5.51783 7.12 0.0000000000031903 ***
Quarterly_Tax 16.43837 2.58633 6.36 0.0000000004140248 ***
Weight 12.74441 1.47320 8.65 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1400 on 592 degrees of freedom
Multiple R-squared: 0.856, Adjusted R-squared: 0.854
F-statistic: 503 on 7 and 592 DF, p-value: <0.0000000000000002
> car.lm.step.pred <- predict(car.lm.step, valid.df)
> accuracy(car.lm.step.pred, valid.df$Price)
ME RMSE MAE MPE MAPE
Test set -38.9 1321 1016 -1.67 9.05
168 MULTIPLE LINEAR REGRESSION
TABLE 6.7 FORWARD SELECTION FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE
> summary(car.lm.step)
Call:
lm(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Met_Color +
Automatic + CC + Doors + Quarterly_Tax + Weight, data = train.df)
Residuals:
Min 1Q Median 3Q Max
-8213 -839 -14 831 7271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1774.87783 1643.74482 -1.08 0.281
Age_08_04 -135.43088 4.87591 -27.78 < 0.0000000000000002 ***
KM -0.01900 0.00234 -8.12 0.0000000000000028 ***
Fuel_TypeDiesel 1208.33916 534.43140 2.26 0.024 *
Fuel_TypePetrol 2425.87671 520.58798 4.66 0.0000039169767967 ***
HP 38.98554 5.58718 6.98 0.0000000000081162 ***
Met_Color 84.79272 126.88345 0.67 0.504
Automatic 306.68415 289.43314 1.06 0.290
CC 0.03197 0.09908 0.32 0.747
Doors -44.15774 64.05653 -0.69 0.491
Quarterly_Tax 16.67734 2.60267 6.41 0.0000000003028702 ***
Weight 12.66749 1.53659 8.24 0.0000000000000011 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1410 on 588 degrees of freedom
Multiple R-squared: 0.857, Adjusted R-squared: 0.854
F-statistic: 320 on 11 and 588 DF, p-value: <0.0000000000000002
TABLE 6.8 STEPWISE REGRESSION FOR REDUCING PREDICTORS IN TOYOTA COROLLA
EXAMPLE
> summary(car.lm.step)
Call:
lm(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Quarterly_Tax +
Weight, data = train.df)
Residuals:
Min 1Q Median 3Q Max
-8263 -825 1 839 7312
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1853.36897 1620.35672 -1.14 0.253
Age_08_04 -135.72630 4.83995 -28.04 < 0.0000000000000002 ***
KM -0.01912 0.00233 -8.19 0.0000000000000016 ***
Fuel_TypeDiesel 1179.35368 526.25097 2.24 0.025 *
Fuel_TypePetrol 2374.05722 517.80593 4.58 0.0000055461532557 ***
HP 39.27366 5.51783 7.12 0.0000000000031903 ***
Quarterly_Tax 16.43837 2.58633 6.36 0.0000000004140248 ***
Weight 12.74441 1.47320 8.65 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1400 on 592 degrees of freedom
Multiple R-squared: 0.856, Adjusted R-squared: 0.854
F-statistic: 503 on 7 and 592 DF, p-value: <0.0000000000000002
PROBLEMS 169
PROBLEMS
6.1 Predicting Boston Housing Prices. The file BostonHousing.csv contains informa-
tion collected by the US Bureau of the Census concerning housing in the area of
Boston, Massachusetts. The dataset includes information on 506 census housing tracts
in the Boston area. The goal is to predict the median house price in new tracts based
on information such as crime rate, pollution, and number of rooms. The dataset con-
tains 13 predictors, and the response is the median house price (MEDV). Table 6.9
describes each of the predictors and the response.
TABLE 6.9 DESCRIPTION OF VARIABLES FOR BOSTON HOUSING
EXAMPLE
CRIM Per capita crime rate by town
ZN Proportion of residential land zoned for lots over 25,000 ft2
INDUS Proportion of nonretail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; = 0 otherwise)
NOX Nitric oxide concentration (parts per 10 million)
RM Average number of rooms per dwelling
AGE Proportion of owner-occupied units built prior to 1940
DIS Weighted distances to five Boston employment centers
RAD Index of accessibility to radial highways
TAX Full-value property-tax rate per $10,000
PTRATIO Pupil/teacher ratio by town
LSTAT Percentage lower status of the population
MEDV Median value of owner-occupied homes in $1000s
a. Why should the data be partitioned into training and validation sets? What will the
training set be used for? What will the validation set be used for?
b. Fit a multiple linear regression model to the median house price (MEDV) as a
function of CRIM, CHAS, and RM. Write the equation for predicting the median
house price from the predictors in the model.
c. Using the estimated regression model, what median house price is predicted for a
tract in the Boston area that does not bound the Charles River, has a crime rate of
0.1, and where the average number of rooms per house is 6? What is the prediction
error?
d. Reduce the number of predictors:
i. Which predictors are likely to be measuring the same thing among the 13
predictors? Discuss the relationships among INDUS, NOX, and TAX.
ii. Compute the correlation table for the 12 numerical predictors and search for
highly correlated pairs. These have potential redundancy and can cause multi-
collinearity. Choose which ones to remove based on this table.
iii. Use stepwise regression with the three options (backward, forward, both) to reduce
the remaining predictors as follows: Run stepwise on the training set. Choose
the top model from each stepwise run. Then use each of these models separately
to predict the validation set. Compare RMSE, MAPE, and mean error, as well
as lift charts. Finally, describe the best model.
6.2 Predicting Software Reselling Profits. Tayko Software is a software catalog firm
that sells games and educational software. It started out as a software manufacturer
170 MULTIPLE LINEAR REGRESSION
and then added third-party titles to its offerings. It recently revised its collection of
items in a new catalog, which it mailed out to its customers. This mailing yielded
2000 purchases. Based on these data, Tayko wants to devise a model for predicting
the spending amount that a purchasing customer will yield. The file Tayko.csv contains
information on 2000 purchases. Table 6.10 describes the variables to be used in the
problem (the Excel file contains additional variables).
TABLE 6.10 DESCRIPTION OF VARIABLES FOR TAYKO SOFTWARE EXAMPLE
FREQ Number of transactions in the preceding year
LAST_UPDATE Number of days since last update to customer record
WEB Whether customer purchased by Web order at least once
GENDER Male or female
ADDRESS_RES Whether it is a residential address
ADDRESS_US Whether it is a US address
SPENDING (response) Amount spent by customer in test mailing (in dollars)
a. Explore the spending amount by creating a pivot table for the categorical variables
and computing the average and standard deviation of spending in each category.
b. Explore the relationship between spending and each of the two continuous
predictors by creating two scatterplots (Spending vs. Freq, and Spending vs.
last_update_days_ago. Does there seem to be a linear relationship?
c. To fit a predictive model for Spending:
i. Partition the 2000 records into training and validation sets.
ii. Run a multiple linear regression model for Spending vs. all six predictors. Give
the estimated predictive equation.
iii. Based on this model, what type of purchaser is most likely to spend a large
amount of money?
iv. If we used backward elimination to reduce the number of predictors, which
predictor would be dropped first from the model?
v. Show how the prediction and the prediction error are computed for the first
purchase in the validation set.
vi. Evaluate the predictive accuracy of the model by examining its performance
on the validation set.
vii. Create a histogram of the model residuals. Do they appear to follow a normal
distribution? How does this affect the predictive performance of the model?
6.3 Predicting Airfare on New Routes.
The following problem takes place in the United States in the late 1990s, when many
major US cities were facing issues with airport congestion, partly as a result of the 1978
deregulation of airlines. Both fares and routes were freed from regulation, and low-fare
carriers such as Southwest (SW) began competing on existing routes and starting non-
stop service on routes that previously lacked it. Building completely new airports is
generally not feasible, but sometimes decommissioned military bases or smaller munic-
ipal airports can be reconfigured as regional or larger commercial airports. There are
numerous players and interests involved in the issue (airlines, city, state and federal
authorities, civic groups, the military, airport operators), and an aviation consulting
firm is seeking advisory contracts with these players. The firm needs predictive models
PROBLEMS 171
to support its consulting service. One thing the firm might want to be able to predict
is fares, in the event a new airport is brought into service. The firm starts with the
file Airfares.csv, which contains real data that were collected between Q3-1996 and
Q2-1997. The variables in these data are listed in Table 6.11, and are believed to be
important in predicting FARE. Some airport-to-airport data are available, but most
data are at the city-to-city level. One question that will be of interest in the analysis is
the effect that the presence or absence of Southwest has on FARE.
TABLE 6.11 DESCRIPTION OF VARIABLES FOR AIRFARE EXAMPLE
S_CODE Starting airport’s code
S_CITY Starting city
E_CODE Ending airport’s code
E_CITY Ending city
COUPON Average number of coupons (a one-coupon flight is a nonstop flight,
a two-coupon flight is a one-stop flight, etc.) for that route
NEW Number of new carriers entering that route between Q3-96 and Q2-97
VACATION Whether (Yes) or not (No) a vacation route
SW Whether (Yes) or not (No) Southwest Airlines serves that route
HI Herfindahl index: measure of market concentration
S_INCOME Starting city’s average personal income
E_INCOME Ending city’s average personal income
S_POP Starting city’s population
E_POP Ending city’s population
SLOT Whether or not either endpoint airport is slot-controlled
(this is a measure of airport congestion)
GATE Whether or not either endpoint airport has gate constraints
(this is another measure of airport congestion)
DISTANCE Distance between two endpoint airports in miles
PAX Number of passengers on that route during period of data collection
FARE Average fare on that route
a. Explore the numerical predictors and response (FARE) by creating a correlation
table and examining some scatterplots between FARE and those predictors. What
seems to be the best single predictor of FARE?
b. Explore the categorical predictors (excluding the first four) by computing the per-
centage of flights in each category. Create a pivot table with the average fare in
each category. Which categorical predictor seems best for predicting FARE?
c. Find a model for predicting the average fare on a new route:
i. Convert categorical variables (e.g., SW) into dummy variables. Then, partition
the data into training and validation sets. The model will be fit to the training
data and evaluated on the validation set.
ii. Use stepwise regression to reduce the number of predictors. You can ignore
the first four predictors (S_CODE, S_CITY, E_CODE, E_CITY). Report the
estimated model selected.
iii. Repeat (ii) using exhaustive search instead of stepwise regression. Compare the
resulting best model to the one you obtained in (ii) in terms of the predictors
that are in the model.
iv. Compare the predictive accuracy of both models (ii) and (iii) using measures
such as RMSE and average error and lift charts.
172 MULTIPLE LINEAR REGRESSION
v. Using model (iii), predict the average fare on a route with the following char-
acteristics: COUPON = 1.202, NEW = 3, VACATION = No, SW = No,
HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP =
4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12,782,
DISTANCE = 1976 miles.
vi. Predict the reduction in average fare on the route in (v) if Southwest decides
to cover this route [using model (iii)].
vii. In reality, which of the factors will not be available for predicting the average
fare from a new airport (i.e., before flights start operating on those routes)?
Which ones can be estimated? How?
viii. Select a model that includes only factors that are available before flights begin
to operate on the new route. Use an exhaustive search to find such a model.
ix. Use the model in (viii) to predict the average fare on a route with character-
istics COUPON = 1.202, NEW = 3, VACATION = No, SW = No, HI
= 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_ POP =
4,557,004, E_POP = 3,195,503, SLOT = Free, GATE = Free, PAX = 12782,
DISTANCE = 1976 miles.
x. Compare the predictive accuracy of this model with model (iii). Is this model
good enough, or is it worthwhile reevaluating the model once flights begin on
the new route?
d. In competitive industries, a new entrant with a novel business plan can have a
disruptive effect on existing firms. If a new entrant’s business model is sustainable,
other players are forced to respond by changing their business practices. If the goal
of the analysis was to evaluate the effect of Southwest Airlines’ presence on the
airline industry rather than predicting fares on new routes, how would the analysis
be different? Describe technical and conceptual aspects.
6.4 Predicting Prices of Used Cars. The file ToyotaCorolla.csv contains data on used
cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has
1436 records containing details on 38 attributes, including Price, Age, Kilometers,
HP, and other specifications. The goal is to predict the price of a used Toyota Corolla
based on its specifications. (The example in Section 6.3 is a subset of this dataset.)
Split the data into training (50%), validation (30%), and test (20%) datasets.
Run a multiple linear regression with the outcome variable Price and pre-
dictor variables Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quar-
terly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player,
Powered_Windows, Sport_Model, and Tow_Bar.
a. What appear to be the three or four most important car specifications for predicting
the car’s price?
b. Using metrics you consider useful, assess the performance of the model in predict-
ing prices.
CHAPTER 7
k-Nearest Neighbors (k-NN)
In this chapter, we describe the k-nearest-neighbors algorithm that can be used
for classification (of a categorical outcome) or prediction (of a numerical out-
come). To classify or predict a new record, the method relies on finding “similar”
records in the training data. These “neighbors” are then used to derive a classifi-
cation or prediction for the new record by voting (for classification) or averaging
(for prediction). We explain how similarity is determined, how the number of
neighbors is chosen, and how a classification or prediction is computed. k-NN
is a highly automated data-driven method. We discuss the advantages and weak-
nesses of the k-NN method in terms of performance and practical considerations
such as computational time.
7.1 The k-NN Classifier (Categorical
Outcome)
The idea in k-nearest-neighbors methods is to identify k records in the training
dataset that are similar to a new record that we wish to classify. We then use these
similar (neighboring) records to classify the new record into a class, assigning the
new record to the predominant class among these neighbors. Denote the values
of the predictors for this new record by x1, x2, . . . , xp. We look for records
in our training data that are similar or “near” the record to be classified in the
predictor space (i.e., records that have values close to x1, x2, . . . , xp). Then,
based on the classes to which those proximate records belong, we assign a class
to the record that we want to classify.
Determining Neighbors
The k-nearest-neighbors algorithm is a classification method that does not make
assumptions about the form of the relationship between the class membership
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
173
174 k-NEAREST NEIGHBORS (k-NN)
(Y ) and the predictors X1, X2, . . . , Xp. This is a nonparametric method
because it does not involve estimation of parameters in an assumed function
form, such as the linear form assumed in linear regression (Chapter 6). Instead,
this method draws information from similarities between the predictor values of
the records in the dataset.
A central question is how to measure the distance between records based on
their predictor values. The most popular measure of distance is the Euclidean
distance. The Euclidean distance between two records (x1, x2, . . . , xp) and
(u1, u2, . . . , up) is√
(x1 − u1)2 + (x2 − u2)2 + · · ·+ (xp − up)2. (7.1)
You will find a host of other distance metrics in Chapters 12 and 15 for
both numerical and categorical variables. However, the k-NN algorithm relies
on many distance computations (between each record to be predicted and every
record in the training set), and therefore the Euclidean distance, which is com-
putationally cheap, is the most popular in k-NN.
To equalize the scales that the various predictors may have, note that in
most cases, predictors should first be standardized before computing a Euclidean
distance. Also note that the means and standard deviations used to standardize
new records are those of the training data, and the new record is not included
in calculating them. The validation data, like new data, are also not included in
this calculation.
Classification Rule
After computing the distances between the record to be classified and existing
records, we need a rule to assign a class to the record to be classified, based on
the classes of its neighbors. The simplest case is k = 1, where we look for
the record that is closest (the nearest neighbor) and classify the new record as
belonging to the same class as its closest neighbor. It is a remarkable fact that this
simple, intuitive idea of using a single nearest neighbor to classify records can be
very powerful when we have a large number of records in our training set. It
turns out that the misclassification error of the 1-nearest neighbor scheme has a
misclassification rate that is no more than twice the error when we know exactly
the probability density functions for each class.
The idea of the 1-nearest neighbor can be extended to k > 1 neighbors as
follows:
1. Find the nearest k neighbors to the record to be classified.
2. Use a majority decision rule to classify the record, where the record is
classified as a member of the majority class of the k neighbors.
THE k-NN CLASSIFIER (CATEGORICAL OUTCOME) 175
Example: Riding Mowers
A riding-mower manufacturer would like to find a way of classifying families
in a city into those likely to purchase a riding mower and those not likely to
buy one. A pilot random sample is undertaken of 12 owners and 12 nonowners
in the city. The data are shown in Table 7.1. We first partition the data into
training data (14 households) and validation data (10 households). Obviously,
this dataset is too small for partitioning, which can result in unstable results, but
we will continue with this partitioning for illustration purposes. A scatter plot
of the training data is shown in Figure 7.1.
TABLE 7.1 LOT SIZE, INCOME, AND OWNERSHIP OF A RIDING MOWER
FOR 24 HOUSEHOLDS
Household Income Lot Size Ownership of
Number ($000s) (000s ft2) Riding Mower
1 60.0 18.4 Owner
2 85.5 16.8 Owner
3 64.8 21.6 Owner
4 61.5 20.8 Owner
5 87.0 23.6 Owner
6 110.1 19.2 Owner
7 108.0 17.6 Owner
8 82.8 22.4 Owner
9 69.0 20.0 Owner
10 93.0 20.8 Owner
11 51.0 22.0 Owner
12 81.0 20.0 Owner
13 75.0 19.6 Nonowner
14 52.8 20.8 Nonowner
15 64.8 17.2 Nonowner
16 43.2 20.4 Nonowner
17 84.0 17.6 Nonowner
18 49.2 17.6 Nonowner
19 59.4 16.0 Nonowner
20 66.0 18.4 Nonowner
21 47.4 16.4 Nonowner
22 33.0 18.8 Nonowner
23 51.0 14.0 Nonowner
24 63.0 14.8 Nonowner
25 60.0 20.0 ?
Now consider a new household with $60,000 income and lot size 20,000 ft2
(also shown in Figure 7.1). Among the households in the training set, the one
closest to the new household (in Euclidean distance after normalizing income
and lot size) is household 9, with $69,000 income and lot size 20,000 ft2. If
we use a 1-NN classifier, we would classify the new household as an owner,
like household 9. If we use k = 3, the three nearest households are 9, 14, and
1, as can be seen visually in the scatter plot, and as computed by the software
(see output in Table 7.2). Two of these neighbors are owners of riding mowers,
176 k-NEAREST NEIGHBORS (k-NN)
code for loading and partitioning the riding mower data, and plotting scatter plot
mower.df <- read.csv("RidingMowers.csv")
set.seed(111)
train.index <- sample(row.names(mower.df), 0.6*dim(mower.df)[1])
valid.index <- setdiff(row.names(mower.df), train.index)
train.df <- mower.df[train.index, ]
valid.df <- mower.df[valid.index, ]
## new household
new.df <- data.frame(Income = 60, Lot_Size = 20)
## scatter plot
plot(Lot_Size ~ Income, data=train.df, pch=ifelse(train.df$Ownership=="Owner", 1, 3))
text(train.df$Income, train.df$Lot_Size, rownames(train.df), pos=4)
text(60, 20, "X")
legend("topright", c("owner", "non-owner", "newhousehold"), pch = c(1, 3, 4))
50 60 70 80 90 100 110
16
17
18
19
20
21
22
Income
Lo
t_
S
iz
e
15
17
9
11
8
201
10
7
2
19
14
18
12X
owner
nonowner
newhousehold
FIGURE 7.1 SCATTER PLOT OF LOT SIZE VS. INCOME FOR THE 18 HOUSEHOLDS IN THE
TRAINING SET AND THE NEW HOUSEHOLD TO BE CLASSIFIED
and the last is a nonowner. The majority vote is therefore owner, and the new
household would be classified as an owner (see bottom of output in Table 7.2).
Choosing k
The advantage of choosing k > 1 is that higher values of k provide smoothing
that reduces the risk of overfitting due to noise in the training data. Generally
speaking, if k is too low, we may be fitting to the noise in the data. However,
if k is too high, we will miss out on the method’s ability to capture the local
structure in the data, one of its main advantages. In the extreme, k = n =
the number of records in the training dataset. In that case, we simply assign
THE k-NN CLASSIFIER (CATEGORICAL OUTCOME) 177
TABLE 7.2 RUNNING k-NN
code for normalizing data and finding nearest neighbors
# initialize normalized training, validation data, complete data frames to originals
train.norm.df <- train.df
valid.norm.df <- valid.df
mower.norm.df <- mower.df
# use preProcess() from the caret package to normalize Income and Lot_Size.
norm.values <- preProcess(train.df[, 1:2], method=c("center", "scale"))
train.norm.df[, 1:2] <- predict(norm.values, train.df[, 1:2])
valid.norm.df[, 1:2] <- predict(norm.values, valid.df[, 1:2])
mower.norm.df[, 1:2] <- predict(norm.values, mower.df[, 1:2])
new.norm.df <- predict(norm.values, new.df)
# use knn() to compute knn.
# knn() is available in library FNN (provides a list of the nearest neighbors)
# and library class (allows a numerical output variable).
library(FNN)
nn <- knn(train = train.norm.df[, 1:2], test = new.norm.df,
cl = train.norm.df[, 3], k = 3)
row.names(train.df)[attr(nn, "nn.index")]
Output
> row.names(train.df)[attr(nn, “nn.index”)]
[1] “9” “14” “1”
< nn
[1] Owner
attr(,"nn.index")
[,1][,2][,3]
[1,] 3 12 7
attr(,"nn.dist")
[,1] [,2] [,3]
[,1] 0.5137338 0.5716287 0.7946045
Levels: Owner
all records to the majority class in the training data, irrespective of the values
of (x1, x2, . . . , xp), which coincides with the naive rule! This is clearly a case
of oversmoothing in the absence of useful information in the predictors about
the class membership. In other words, we want to balance between overfitting
to the predictor information and ignoring this information completely. A bal-
anced choice greatly depends on the nature of the data. The more complex and
irregular the structure of the data, the lower the optimum value of k. Typically,
values of k fall in the range 1 to 20. We will use odd numbers to avoid ties.
So how is k chosen? Answer: We choose the k with the best classification
performance. We use the training data to classify the records in the validation
data, then compute error rates for various choices of k. For our example, if
we choose k = 1, we will classify in a way that is very sensitive to the local
characteristics of the training data. On the other hand, if we choose a large
178 k-NEAREST NEIGHBORS (k-NN)
TABLE 7.3 ACCURACY (OR CORRECT RATE) OF k-NN PREDICTIONS IN VALIDATION SET FOR
VARIOUS CHOICES OF k.
code for measuring the accuracy of different k values
library(caret)
# initialize a data frame with two columns: k, and accuracy.
accuracy.df <- data.frame(k = seq(1, 14, 1), accuracy = rep(0, 14))
# compute knn for different k on validation.
for(i in 1:14) {
knn.pred <- knn(train.norm.df[, 1:2], valid.norm.df[, 1:2],
cl = train.norm.df[, 3], k = i)
accuracy.df[i, 2] <- confusionMatrix(knn.pred, valid.norm.df[, 3])$overall[1]
}
Output
> accuracy.df
k accuracy
1 1 0.7
2 2 0.7
3 3 0.8
4 4 0.9
5 5 0.8
6 6 0.9
7 7 0.9
8 8 1.0
9 9 0.9
10 10 0.9
11 11 0.9
12 12 0.8
13 13 0.4
14 14 0.4
value of k, such as k = 18, we would simply predict the most frequent class in
the dataset in all cases. This is a very stable prediction but it completely ignores
the information in the predictors. To find a balance, we examine the accuracy (of
predictions in the validation set) that results from different choices of k between
1 and 14. For an even number k, if there is a tie in classifying a household, the
tie is broken randomly.1 This is shown in Table 7.3. We would choose k = 4,
which maximizes our accuracy in the validation set.2 Note, however, that now
the validation set is used as part of the training process (to set k) and does not
reflect a true holdout set as before. Ideally, we would want a third test set to
evaluate the performance of the method on data that it did not see.
1If you are interested in reproducibility of results, check the accuracy only for each odd k.
2Partitioning such a small dataset is unwise in practice, as results will heavily rely on the particular
partition. For instance, if you use a different partitioning, you might obtain a different “optimal” k.
We use this example for illustration only.
THE k-NN CLASSIFIER (CATEGORICAL OUTCOME) 179
TABLE 7.4 CLASSIFYING A NEW HOUSEHOLD USING THE “BEST K” = 4
code for running the k-NN algorithm to classify the new household
knn.pred.new <- knn(mower.norm.df[, 1:2], new.norm.df,
cl = mower.norm.df[, 3], k = 4)
row.names(train.df)[attr(nn, "nn.index")]
Partial Output
> row.names(train.df)[attr(nn, “nn.index”)]
[1] “9” “14” “1” “20”
> knn.pred.new
[1] Owner
attr(,”nn.index”)
[,1] [,2] [,3] [,4]
[1,] 3 12 7 6
attr(,”nn.dist”)
[,1] [,2] [,3] [,4]
[1,] 0.514 0.572 0.795 0.865
Levels: Owner
Once k is chosen, we rerun the algorithm on the combined training and
testing sets in order to generate classifications of new records. An example is
shown in Table 7.4, where the four nearest neighbors are used to classify the
new household.
Setting the Cutoff Value
k-NN uses a majority decision rule to classify a new record, where the record is
classified as a member of the majority class of the k neighbors. The definition of
“majority” is directly linked to the notion of a cutoff value applied to the class
membership probabilities. Let us consider a binary outcome case. For a new
record, the proportion of class 1 members among its neighbors is an estimate of
its propensity (probability) of belonging to class 1. In the riding mowers example
with k = 4, we found that the four nearest neighbors to the new household (with
income = $60,000 and lot size = 20,000 ft2) are households 9, 14, 1, and 20.
Since two of these are owners and the other two are nonowners, we can estimate
for the new household a probability of 0.5 of being an owner (and 0.5 for being
a nonowner). Using a simple majority rule is equivalent to setting the cutoff
value to 0.5. In Table 7.4, we see that the software broke the tie by (randomly)
assigning class owner to this record.
As mentioned in Chapter 5, changing the cutoff value affects the confusion
matrix (i.e., the error rates). Hence, in some cases we might want to choose a
cutoff other than the default 0.5 for the purpose of maximizing accuracy or for
incorporating misclassification costs.
180 k-NEAREST NEIGHBORS (k-NN)
k-NN with More Than Two Classes
The k-NN classifier can easily be applied to an outcome with m classes, where
m > 2. The “majority rule” means that a new record is classified as a member of
the majority class of its k neighbors. An alternative, when there is a specific class
that we are interested in identifying (and are willing to “overidentify” records
as belonging to this class), is to calculate the proportion of the k neighbors
that belong to this class of interest, use that as an estimate of the probability
(propensity) that the new record belongs to that class, and then refer to a user-
specified cutoff value to decide whether to assign the new record to that class.
For more on the use of cutoff value in classification where there is a single class
of interest, see Chapter 5.
Converting Categorical Variables to Binary Dummies
It usually does not make sense to calculate Euclidean distance between two non-
numeric categories (e.g., cookbooks and maps, in a bookstore). Therefore,
before k-NN can be applied, categorical variables must be converted to binary
dummies. In contrast to the situation with statistical models such as regression,
allm binaries should be created and used with k-NN. While mathematically this
is redundant, sincem−1 dummies contain the same information asm dummies,
this redundant information does not create the multicollinearity problems that it
does for linear models. Moreover, in k-NN the use of m−1 dummies can yield
different classifications than the use of m dummies, and lead to an imbalance in
the contribution of the different categories to the model.
7.2 k-NN for a Numerical Outcome
The idea of k-NN can readily be extended to predicting a continuous value (as
is our aim with multiple linear regression models). The first step of determining
neighbors by computing distances remains unchanged. The second step, where
a majority vote of the neighbors is used to determine class, is modified such that
we take the average outcome value of the k-nearest neighbors to determine the
prediction. Often, this average is a weighted average, with the weight decreasing
with increasing distance from the point at which the prediction is required. In
R, we can use function knn() in the class package to compute k-NN numerical
predictions for the validation set.
Another modification is in the error metric used for determining the “best
k.” Rather than the overall error rate used in classification, RMS error or another
prediction error metric should be used in prediction (see Chapter 5).
k-NN FOR A NUMERICAL OUTCOME 181
P A N D O R A
Pandora is an Internet music radio service that allows users to build customized
“stations” that play music similar to a song or artist that they have specified. Pan-
dora uses a k-NN style clustering/classification process called the Music Genome
Project to locate new songs or artists that are close to the user-specified song or
artist.
Pandora was the brainchild of Tim Westergren, who worked as a musician and a nanny
when he graduated from Stanford in the 1980s. Together with Nolan Gasser, who
was studying medieval music, he developed a “matching engine” by entering data
about a song’s characteristics into a spreadsheet. The first result was surprising—
a Beatles song matched to a Bee Gees song, but they built a company around the
concept. The early days were hard—Westergren racked up over $300,000 in personal
debt, maxed out 11 credit cards, and ended up in the hospital once due to stress-
induced heart palpitations. A venture capitalist finally invested funds in 2004 to
rescue the firm, and as of 2013, it is listed on the NY Stock Exchange.
In simplified terms, the process works roughly as follows for songs:
1. Pandora has established hundreds of variables on which a song can be measured
on a scale from 0–5. Four such variables from the beginning of the list are
• Acid Rock Qualities
• Accordion Playing
• Acousti-Lectric Sonority
• Acousti-Synthetic Sonority
2. Pandora pays musicians to analyze tens of thousands of songs, and rate each
song on each of these attributes. Each song will then be represented by
a row vector of values between 0 and 5, for example, for Led Zeppelin’s Kashmir:
Kashmir 4 0 3 3 … (high on acid rock attributes, no accordion, etc.)
This step represents a costly investment, and lies at the heart of Pandora’s value
because these variables have been tested and selected because they accurately
reflect the essence of a song, and provide a basis for defining highly individu-
alized preferences.
3. The online user specifies a song that s/he likes (the song must be in Pandora’s
database).
4. Pandora then calculates the statistical distance1 between the user’s song, and
the songs in its database. It selects a song that is close to the user-specified
song and plays it.
5. The user then has the option of saying “I like this song,” “I don’t like this song,”
or saying nothing.
6. If “like” is chosen, the original song, plus the new song are merged into a 2-
song cluster2 that is represented by a single vector, comprised of means of the
variables in the original two song vectors.
182 k-NEAREST NEIGHBORS (k-NN)
7. If “dislike” is chosen, the vector of the song that is not liked is stored for
future reference. (If the user does not express an opinion about the song, in
our simplified example here, the new song is not used for further comparisons.)
8. Pandora looks in its database for a new song, one whose statistical distance
is close to the “like” song cluster,3 and not too close to the “dislike” song.
Depending on the user’s reaction, this new song might be added to the “like”
cluster or “dislike” cluster.
Over time, Pandora develops the ability to deliver songs that match a particular taste
of a particular user. A single user might build up multiple stations around different
song clusters. Clearly, this is a less limiting approach than selecting music in terms
of which “genre” it belongs to.
While the process described above is a bit more complex than the basic “classi-
fication of new data” process described in this chapter, the fundamental process—
classifying a record according to its proximity to other records—is the same at its
core. Note the role of domain knowledge in this machine learning process—the vari-
ables have been tested and selected by the project leaders, and the measurements
have been made by human experts.
Further reading: See www.pandora.com, Wikipedia’s article on the Music
Genome Project, and Joyce John’s article “Pandora and the Music Genome Project,”
Scientific Computing, vol. 23, no. 10: 14, p. 40–41, Sep. 2006.
1See Section 12.5 in Chapter 12 for an explanation of statistical distance.
2See Chapter 15 for more on clusters.
3See Case 21.6 “Segmenting Consumers of Bath Soap” for an exercise involving the identification of
clusters, which are then used for classification purposes.
7.3 Advantages and Shortcomings of k-NN
Algorithms
The main advantage of k-NN methods is their simplicity and lack of paramet-
ric assumptions. In the presence of a large enough training set, these methods
perform surprisingly well, especially when each class is characterized by multiple
combinations of predictor values. For instance, in real-estate databases, there are
likely to be multiple combinations of {home type, number of rooms, neighbor-
hood, asking price, etc.} that characterize homes that sell quickly vs. those that
remain for a long period on the market.
There are three difficulties with the practical exploitation of the power of the
k-NN approach. First, although no time is required to estimate parameters from
the training data (as would be the case for parametric models such as regression),
the time to find the nearest neighbors in a large training set can be prohibitive. A
http://www.pandora.com
ADVANTAGES AND SHORTCOMINGS OF k-NN ALGORITHMS 183
number of ideas have been implemented to overcome this difficulty. The main
ideas are:
• Reduce the time taken to compute distances by working in a reduced
dimension using dimension reduction techniques such as principal com-
ponents analysis (Chapter 4).
• Use sophisticated data structures such as search trees to speed up identifi-
cation of the nearest neighbor. This approach often settles for an “almost
nearest” neighbor to improve speed. An example is using bucketing, where
the records are grouped into buckets so that records within each bucket
are close to each other. For a to-be-predicted record, buckets are ordered
by their distance to the record. Starting from the nearest bucket, the dis-
tance to each of the records within the bucket is measured. The algorithm
stops when the distance to a bucket is larger than the distance to the closest
record thus far.
Second, the number of records required in the training set to qualify as large
increases exponentially with the number of predictors p. This is because the
expected distance to the nearest neighbor goes up dramatically with p unless
the size of the training set increases exponentially with p. This phenomenon
is known as the curse of dimensionality, a fundamental issue pertinent to all clas-
sification, prediction, and clustering techniques. This is why we often seek to
reduce the number of predictors through methods such as selecting subsets of
the predictors for our model or by combining them using methods such as prin-
cipal components analysis, singular value decomposition, and factor analysis (see
Chapter 4).
Third, k-NN is a “lazy learner”: the time-consuming computation is
deferred to the time of prediction. For every record to be predicted, we compute
its distances from the entire set of training records only at the time of prediction.
This behavior prohibits using this algorithm for real-time prediction of a large
number of records simultaneously.
184 k-NEAREST NEIGHBORS (k-NN)
PROBLEMS
7.1 Calculating Distance with Categorical Predictors. This exercise with a tiny
dataset illustrates the calculation of Euclidean distance, and the creation of binary
dummies. The online education company Statistics.com segments its customers and
prospects into three main categories: IT professionals (IT), statisticians (Stat), and other
(Other). It also tracks, for each customer, the number of years since first contact
(years). Consider the following customers; information about whether they have taken
a course or not (the outcome to be predicted) is included:
Customer 1: Stat, 1 year, did not take course
Customer 2: Other, 1.1 year, took course
a. Consider now the following new prospect:
Prospect 1: IT, 1 year
Using the above information on the two customers and one prospect, create one
dataset for all three with the categorical predictor variable transformed into 2 bina-
ries, and a similar dataset with the categorical predictor variable transformed into 3
binaries.
b. For each derived dataset, calculate the Euclidean distance between the prospect and
each of the other two customers. (Note: while it is typical to normalize data for k-
NN, this is not an iron-clad rule and you may proceed here without normalization.)
c. Using k-NN with k = 1, classify the prospect as taking or not taking a course
using each of the two derived datasets. Does it make a difference whether you use
2 or 3 dummies?
7.2 Personal Loan Acceptance. Universal Bank is a relatively young bank growing
rapidly in terms of overall customer acquisition. The majority of these customers are
liability customers (depositors) with varying sizes of relationship with the bank. The
customer base of asset customers (borrowers) is quite small, and the bank is interested
in expanding this base rapidly to bring in more loan business. In particular, it wants
to explore ways of converting its liability customers to personal loan customers (while
retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy
conversion rate of over 9% success. This has encouraged the retail marketing depart-
ment to devise smarter campaigns with better target marketing. The goal is to use
k-NN to predict whether a new customer will accept a loan offer. This will serve as
the basis for the design of a new campaign.
The file UniversalBank.csv contains data on 5000 customers. The data include
customer demographic information (age, income, etc.), the customer’s relationship
with the bank (mortgage, securities account, etc.), and the customer response to the
last personal loan campaign (Personal Loan). Among these 5000 customers, only 480
(= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
Partition the data into training (60%) and validation (40%) sets.
a. Consider the following customer:
Age = 40, Experience = 10, Income = 84, Family = 2, CCAvg = 2, Education_1
= 0, Education_2 = 1, Education_3 = 0, Mortgage = 0, Securities Account = 0,
CD Account = 0, Online = 1, and Credit Card = 1. Perform a k-NN classification
with all predictors except ID and ZIP code using k = 1. Remember to transform
categorical predictors with more than two categories into dummy variables first.
PROBLEMS 185
Specify the success class as 1 (loan acceptance), and use the default cutoff value of
0.5. How would this customer be classified?
b. What is a choice of k that balances between overfitting and ignoring the predictor
information?
c. Show the confusion matrix for the validation data that results from using the best
k.
d. Consider the following customer: Age = 40, Experience = 10, Income = 84,
Family = 2, CCAvg = 2, Education_1 = 0, Education_2 = 1, Education_3 = 0,
Mortgage = 0, Securities Account = 0, CD Account = 0, Online = 1 and Credit
Card = 1. Classify the customer using the best k.
e. Repartition the data, this time into training, validation, and test sets (50% : 30% :
20%). Apply the k-NN method with the k chosen above. Compare the confusion
matrix of the test set with that of the training and validation sets. Comment on the
differences and their reason.
7.3 Predicting Housing Median Prices. The file BostonHousing.csv contains infor-
mation on over 500 census tracts in Boston, where for each tract multiple variables
are recorded. The last column (CAT.MEDV) was derived from MEDV, such that it
obtains the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting
the median value (MEDV) of a tract, given the information in the first 12 columns.
Partition the data into training (60%) and validation (40%) sets.
a. Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV col-
umn), trying values of k from 1 to 5. Make sure to normalize the data, and choose
function knn() from package class rather than package FNN. To make sure R is
using the class package (when both packages are loaded), use class::knn(). What
is the best k? What does it mean?
b. Predict the MEDV for a tract with the following information, using the best k:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO LSTAT
0.2 0 7 0 0.538 6 62 4.7 4 307 21 10
c. If we used the above k-NN algorithm to score the training data, what would be
the error of the training set?
d. Why is the validation data error overly optimistic compared to the error rate when
applying this k-NN predictor to new data?
e. If the purpose is to predict MEDV for several thousands of new tracts, what would
be the disadvantage of using k-NN prediction? List the operations that the algo-
rithm goes through in order to produce each prediction.
CHAPTER 8
The Naive Bayes Classifier
In this chapter, we introduce the naive Bayes classifier, which can be applied to
data with categorical predictors. We review the concept of conditional prob-
abilities, then present the complete, or exact, Bayesian classifier. We next see
how it is impractical in most cases, and learn how to modify it and use instead
the naive Bayes classifier, which is more generally applicable.
8.1 Introduction
The naive Bayes method (and, indeed, an entire branch of statistics) is named
after the Reverend Thomas Bayes (1702–1761). To understand the naive Bayes
classifier, we first look at the complete, or exact, Bayesian classifier. The basic
principle is simple. For each record to be classified:
1. Find all the other records with the same predictor profile (i.e., where the
predictor values are the same).
2. Determine what classes the records belong to and which class is most
prevalent.
3. Assign that class to the new record.
Alternatively (or in addition), it may be desirable to tweak the method so
that it answers the question: “What is the propensity of belonging to the class of
interest?” instead of “Which class is the most probable?” Obtaining class prob-
abilities allows using a sliding cutoff to classify a record as belonging to class
Ci, even if Ci is not the most probable class for that record. This approach is
useful when there is a specific class of interest that we are interested in iden-
tifying, and we are willing to “overidentify” records as belonging to this class.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
187
188 THE NAIVE BAYES CLASSIFIER
(See Chapter 5 for more details on the use of cutoffs for classification and on
asymmetric misclassification costs)
Cutoff Probability Method
1. Establish a cutoff probability for the class of interest above which we con-
sider that a record belongs to that class.
2. Find all the training records with the same predictor profile as the new
record (i.e., where the predictor values are the same).
3. Determine the probability that those records belong to the class of
interest.
4. If that probability is above the cutoff probability, assign the new record
to the class of interest.
Conditional Probability
Both procedures incorporate the concept of conditional probability, or the prob-
ability of event A given that event B has occurred [denoted P (A|B)]. In this
case, we will be looking at the probability of the record belonging to class Ci
given that its predictor values are x1, x2, . . . , xp. In general, for a response with
m classes C1, C2, . . . , Cm, and the predictor values x1, x2, . . . , xp, we want to
compute
P (Ci|x1, . . . , xp). (8.1)
To classify a record, we compute its probability of belonging to each of the classes
in this way, then classify the record to the class that has the highest probability
or use the cutoff probability to decide whether it should be assigned to the class
of interest.
From this definition, we see that the Bayesian classifier works only with cat-
egorical predictors. If we use a set of numerical predictors, then it is highly
unlikely that multiple records will have identical values on these numerical pre-
dictors. Therefore, numerical predictors must be binned and converted to cate-
gorical predictors. The Bayesian classifier is the only classification or prediction method
presented in this book that is especially suited for (and limited to) categorical predictor
variables.
Example 1: Predicting Fraudulent Financial Reporting
An accounting firm has many large companies as customers. Each customer
submits an annual financial report to the firm, which is then audited by the
accounting firm. For simplicity, we will designate the outcome of the audit
as “fraudulent” or “truthful,” referring to the accounting firm’s assessment of
the customer’s financial report. The accounting firm has a strong incentive to
APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER 189
be accurate in identifying fraudulent reports—if it passes a fraudulent report as
truthful, it would be in legal trouble.
The accounting firm notes that, in addition to all the financial records, it
also has information on whether or not the customer has had prior legal trouble
(criminal or civil charges of any nature filed against it). This information has not
been used in previous audits, but the accounting firm is wondering whether it
could be used in the future to identify reports that merit more intensive review.
Specifically, it wants to know whether having had prior legal trouble is predictive
of fraudulent reporting.
In this case, each customer is a record, and the outcome variable of interest,
Y = {fraudulent, truthful}, has two classes into which a company can be classi-
fied: C1 = fraudulent and C2 = truthful. The predictor variable—“prior legal
trouble”—has two values: 0 (no prior legal trouble) and 1 (prior legal trouble).
The accounting firm has data on 1500 companies that it has investigated
in the past. For each company, it has information on whether the financial
report was judged fraudulent or truthful and whether the company had prior
legal trouble. The data were partitioned into a training set (1000 firms) and a
validation set (500 firms). Counts in the training set are shown in Table 8.1.
TABLE 8.1 PIVOT TABLE FOR FINANCIAL REPORTING EXAMPLE
Prior Legal No Prior Legal
(X = 1) (X = 0) Total
Fraudulent (C1) 50 50 100
Truthful (C2) 180 720 900
Total 230 770 1000
8.2 Applying the Full (Exact) Bayesian
Classifier
Now consider the financial report from a new company, which we wish to clas-
sify as either fraudulent or truthful by using these data. To do this, we compute
the probabilities, as above, of belonging to each of the two classes.
If the new company had had prior legal trouble, the probability of belonging
to the fraudulent class would be P (fraudulent | prior legal) = 50/230 (of the 230
companies with prior legal trouble in the training set, 50 had fraudulent financial
reports). The probability of belonging to the other class, “truthful,” is, of course,
the remainder = 180/230.
190 THE NAIVE BAYES CLASSIFIER
Using the “Assign to the Most Probable Class” Method
If a company had prior legal trouble, we assign it to the “truthful” class. Similar
calculations for the case of no prior legal trouble are left as an exercise to the
reader. In this example, using the rule “assign to the most probable class,” all
records are assigned to the “truthful” class. This is the same result as the naive
rule of “assign all records to the majority class.”
Using the Cutoff Probability Method
In this example, we are more interested in identifying the fraudulent reports—
those are the ones that can land the auditor in jail. We recognize that, in order
to identify the fraudulent reports, some truthful reports will be misidentified as
fraudulent, and the overall classification accuracy may decline. Our approach is,
therefore, to establish a cutoff value for the probability of being fraudulent, and
classify all records above that value as fraudulent. The Bayesian formula for the
calculation of this probability that a record belongs to class Ci is as follows:
P (Ci|x1, . . . , xp) =
P (x1, . . . , xp|Ci)P (Ci)
P (x1, . . . , xp|C1)P (C1) + · · ·+ P (x1, . . . , xp|Cm)P (Cm)
.
(8.2)
In this example (where frauds are rarer), if the cutoff were established at 0.20,
we would classify a prior legal trouble record as fraudulent because P (fraudulent
| prior legal) = 50/230 = 0.22. The user can treat this cutoff as a “slider” to
be adjusted to optimize performance, like other parameters in any classification
model.
Practical Difficulty with the Complete (Exact) Bayes Procedure
The approach outlined above amounts to finding all the records in the sample
that are exactly like the new record to be classified in the sense that all the
predictor values are all identical. This was easy in the small example presented
above, where there was just one predictor.
When the number of predictors gets larger (even to a modest number like
20), many of the records to be classified will be without exact matches. This
can be understood in the context of a model to predict voting on the basis of
demographic variables. Even a sizable sample may not contain even a single
match for a new record who is a male Hispanic with high income from the US
Midwest who voted in the last election, did not vote in the prior election, has
three daughters and one son, and is divorced. And this is just eight variables, a
small number for most data mining exercises. The addition of just a single new
variable with five equally frequent categories reduces the probability of a match
by a factor of 5.
APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER 191
Solution: Naive Bayes
In the naive Bayes solution, we no longer restrict the probability calculation to
those records that match the record to be classified. Instead we use the entire
dataset.
Returning to our original basic classification procedure outlined at the
beginning of the chapter, recall that the procedure for classifying a new record
was:
1. Find all the other records with the same predictor profile (i.e., where the
predictor values are the same).
2. Determine what classes the records belong to and which class is most
prevalent.
3. Assign that class to the new record.
The naive Bayes modification (for the basic classification procedure) is as follows:
1. For class C1, estimate the individual conditional probabilities for each
predictor P (xj|C1)—these are the probabilities that the predictor value
in the record to be classified occurs in class C1. For example, for X1
this probability is estimated by the proportion of x1 values among the C1
records in the training set.
2. Multiply these probabilities by each other, then by the proportion of
records belonging to class C1.
3. Repeat Steps 1 and 2 for all the classes.
4. Estimate a probability for class Ci by taking the value calculated in Step
2 for class Ci and dividing it by the sum of such values for all classes.
5. Assign the record to the class with the highest probability for this set of
predictor values.
The above steps lead to the naive Bayes formula for calculating the probability
that a record with a given set of predictor values x1, . . . , xp belongs to class C1
among m classes. The formula can be written as follows:
Pnb(C1 | x1, . . . xp) =
P (C1)[P (x1 | C1)P (x2 | C1) · · ·P (xp | C1)]
P (C1)[P (x1 | C1)P (x2 | C1) · · ·P (xp | C1)] + · · ·+ P (Cm)[P (x1 | Cm)P (x2 | Cm) · · ·P (xp | Cm)]
.
(8.3)
This is a somewhat formidable formula; see Example 2 for a simpler numeri-
cal version. Note that all the needed quantities can be obtained from pivot tables
of Y vs. each of the categorical predictors.
192 THE NAIVE BAYES CLASSIFIER
The Naive Bayes Assumption of Conditional Independence
In probability terms, we have made a simplifying assumption that the exact con-
ditional probability of seeing a record with predictor profile x1, x2, . . . , xp within
a certain class, P (x1, x2, . . . , xp|Ci), is well approximated by the product of
the individual conditional probabilities P (x1|Ci)× P (x2|Ci) · · · × P (xp|Ci).
These two quantities are identical when the predictors are independent within
each class.
For example, suppose that “lost money last year” is an additional variable in
the accounting fraud example. The simplifying assumption we make with naive
Bayes is that, within a given class, we no longer need to look for the records
characterized both by “prior legal trouble” and “lost money last year.” Rather,
assuming that the two are independent, we can simply multiply the probability
of “prior legal trouble” by the probability of “lost money last year.” Of course,
complete independence is unlikely in practice, where some correlation between
predictors is expected.
In practice, despite the assumption violation, the procedure works quite
well—primarily because what is usually needed is not a propensity for each
record that is accurate in absolute terms but just a reasonably accurate rank order-
ing of propensities. Even when the assumption is violated, the rank ordering of
the records’ propensities is typically preserved.
Note that if all we are interested in is a rank ordering, and the denominator
remains the same for all classes, it is sufficient to concentrate only on the numer-
ator. The disadvantage of this approach is that the probability values it yields
(the propensities), while ordered correctly, are not on the same scale as the exact
values that the user would anticipate.
Using the Cutoff Probability Method
The above procedure is for the basic case where we seek maximum classification
accuracy for all classes. In the case of the relatively rare class of special interest, the
procedure is:
1. Establish a cutoff probability for the class of interest above which we con-
sider that a record belongs to that class.
2. For the class of interest, compute the probability that each individual
predictor value in the record to be classified occurs in the training data.
3. Multiply these probabilities times each other, then times the proportion
of records belonging to the class of interest.
4. Estimate the probability for the class of interest by taking the value cal-
culated in Step 3 for the class of interest and dividing it by the sum of the
similar values for all classes.
APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER 193
5. If this value falls above the cutoff, assign the new record to the class of
interest, otherwise not.
6. Adjust the cutoff value as needed, as a parameter of the model.
Example 2: Predicting Fraudulent Financial Reports, Two Predictors
Let us expand the financial reports example to two predictors, and, using a small
subset of data, compare the complete (exact) Bayes calculations to the naive Bayes
calculations.
Consider the 10 customers of the accounting firm listed in Table 8.2. For
each customer, we have information on whether it had prior legal trouble,
whether it is a small or large company, and whether the financial report was
found to be fraudulent or truthful. Using this information, we will calculate the
conditional probability of fraud, given each of the four possible combinations
{y, small}, {y, large}, {n, small}, {n, large}.
TABLE 8.2 INFORMATION ON 10 COMPANIES
Company Prior Legal Trouble Company Size Status
1 Yes Small Truthful
2 No Small Truthful
3 No Large Truthful
4 No Large Truthful
5 No Small Truthful
6 No Small Truthful
7 Yes Small Fraudulent
8 Yes Large Fraudulent
9 No Large Fraudulent
10 Yes Large Fraudulent
Complete (Exact) Bayes Calculations: The probabilities are computed as
P (fraudulent|PriorLegal = y, Size = small) = 1/2 = 0.5
P (fraudulent|PriorLegal = y, Size = large) = 2/2 = 1
P (fraudulent|PriorLegal = n, Size = small) = 0/3 = 0
P (fraudulent|PriorLegal = n, Size = large) = 1/3 = 0.33
Naive Bayes Calculations: Now we compute the naive Bayes probabilities.
For the conditional probability of fraudulent behaviors given {PriorLegal = y,
Size = small}, the numerator is a multiplication of the proportion of {Prior-
Legal = y} instances among the fraudulent companies, times the proportion of
{Size = small} instances among the fraudulent companies, times the proportion
of fraudulent companies: (3/4)(1/4)(4/10) = 0.075. To get the actual prob-
abilities, we must also compute the numerator for the conditional probability
194 THE NAIVE BAYES CLASSIFIER
of truthful behaviors given {PriorLegal = y, Size = small}: (1/6)(4/6)(6/10) =
0.067. The denominator is then the sum of these two conditional probabilities
(0.075 + 0.067 = 0.14). The conditional probability of fraudulent behaviors
given {PriorLegal = y, Size = small} is therefore 0.075/0.14 = 0.53. In a similar
fashion, we compute all four conditional probabilities:
Pnb(fraudulent|PriorLegal = y, Size = small) =
(3/4)(1/4)(4/10)
(3/4)(1/4)(4/10) + (1/6)(4/6)(6/10)
= 0.53
Pnb(fraudulent|PriorLegal = y, Size = large) = 0.87
Pnb(fraudulent|PriorLegal = n, Size = small) = 0.07
Pnb(fraudulent|PriorLegal = n, Size = large) = 0.31
Note how close these naive Bayes probabilities are to the exact Bayes probabil-
ities. Although they are not equal, both would lead to exactly the same classi-
fication for a cutoff of 0.5 (and many other values). It is often the case that the
rank ordering of probabilities is even closer to the exact Bayes method than the
probabilities themselves, and for classification purposes it is the rank orderings
that matter.
We now consider a larger numerical example, where information on flights
is used to predict flight delays.
Example 3: Predicting Delayed Flights
Predicting flight delays can be useful to a variety of organizations: airport author-
ities, airlines, and aviation authorities. At times, joint task forces have been
formed to address the problem. If such an organization were to provide ongo-
ing real-time assistance with flight delays, it would benefit from some advance
notice about flights that are likely to be delayed.
In this simplified illustration, we look at five predictors (see Table 8.3). The
outcome of interest is whether or not the flight is delayed (delayed here means
arrived more than 15 minutes late). Our data consist of all flights from the Wash-
ington, DC area into the New York City area during January 2004. A record
is a particular flight. The percentage of delayed flights among these 2201 flights
is 19.5%. The data were obtained from the Bureau of Transportation Statistics
(available on the web at www.transtats.bts.gov). The goal is to accurately predict
whether or not a new flight (not in this dataset), will be delayed. The outcome
variable is whether the flight was delayed, and thus it has two classes (1 = delayed
and 0 = on time). In addition, information is collected on the predictors listed
in Table 8.3.
The data were first partitioned into training (60%) and validation (40%) sets,
and then a naive Bayes classifier was applied to the training set (we use package
e1071).
http://www.transtats.bts.gov
APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER 195
TABLE 8.3 DESCRIPTION OF VARIABLES FOR FLIGHT DELAYS EXAMPLE
Day of Week Coded as 1 = Monday, 2 = Tuesday, …, 7 = Sunday
Sch. Dep. Time Broken down into 18 intervals between 6:00 AM and 10:00 PM
Origin Three airport codes: DCA (Reagan National), IAD (Dulles),
BWI (Baltimore–Washington Int’l)
Destination Three airport codes: JFK (Kennedy), LGA (LaGuardia), EWR (Newark)
Carrier Eight airline codes: CO (Continental), DH (Atlantic Coast), DL (Delta),
MQ (American Eagle), OH (Comair), RU (Continental Express),
UA (United), and US (USAirways)
The first part of the output in Table 8.4 shows the ratios of delayed flights
and on time flights in the training set (called a priori probabilities), followed by
the conditional probabilities for each class, as a function of the predictor values.
Note that the conditional probabilities in the naive Bayes output can be replicated
simply by using pivot tables on the training data, looking at the proportion of
records for each value relative to the entire class. This is illustrated in Table
8.5, which displays the proportion of delayed (or on time) flights by destination
airport (each row adds up to 1).
Note that in this example, there are no predictor values that were not rep-
resented in the training data.
To classify a new flight, we compute the probability that it will be delayed and
the probability that it will be on time. Recall that since both will have the same
denominator, we can just compare the numerators. Each numerator is computed
by multiplying all the conditional probabilities of the relevant predictor values
and, finally, multiplying by the proportion of that class (in this case P̂ (delayed)
= 0.197). Let us use an example: to classify a Delta flight from DCA to LGA
departing between 10:00 AM and 11:00 AM on a Sunday, we first compute the
numerators:
P̂ (delayed|Carrier = DL, Day_Week = 7, Dep_Time = 10, Dest = LGA, Origin = DCA)
∝ (0.115)(0.146)(0.027)(0.400)(0.519)(0.197) = 0.000019
P̂ (ontime|Carrier = DL, Day_Week = 7, Dep_Time = 10, Dest = LGA, Origin = DCA)
∝ (0.198)(0.106)(0.049)(0.537)(0.651)(0.803) = 0.00029
The symbol ∝ means “is proportional to,” reflecting the fact that this calcula-
tion deals only with the numerator in the naive Bayes formula (8.3). Compar-
ing the numerators, it is therefore, more likely that the flight will be on time.
Note that a record with such a combination of predictor values does not exist
in the training set, and therefore we use the naive Bayes rather than the exact
Bayes.
196 THE NAIVE BAYES CLASSIFIER
TABLE 8.4 NAIVE BAYES CLASSIFIER APPLIED TO FLIGHT DELAYS (TRAINING) DATA
code for running naive Bayes
library(e1071)
delays.df <- read.csv("FlightDelays.csv")
# change numerical variables to categorical first
delays.df$DAY_WEEK <- factor(delays.df$DAY_WEEK)
delays.df$DEP_TIME <- factor(delays.df$DEP_TIME)
# create hourly bins departure time
delays.df$CRS_DEP_TIME <- factor(round(delays.df$CRS_DEP_TIME/100))
# Create training and validation sets.
selected.var <- c(10, 1, 8, 4, 2, 13)
train.index <- sample(c(1:dim(delays.df)[1]), dim(delays.df)[1]*0.6)
train.df <- delays.df[train.index, selected.var]
valid.df <- delays.df[-train.index, selected.var]
# run naive bayes
delays.nb <- naiveBayes(Flight.Status ~ ., data = train.df)
delays.nb
Partial output
A-priori probabilities:
Y
delayed ontime
0.197 0.803
Conditional probabilities:
DAY_WEEK
Y 1 2 3 4 5 6 7
delayed 0.2154 0.1346 0.1269 0.1269 0.1846 0.0654 0.1462
ontime 0.1311 0.1377 0.1453 0.1755 0.1783 0.1264 0.1057
CRS_DEP_TIME
Y 6 7 8 9 10 11 12 13 14
delayed 0.0308 0.0538 0.0692 0.0269 0.0269 0.0154 0.0615 0.0346 0.0462
ontime 0.0632 0.0575 0.0736 0.0557 0.0491 0.0368 0.0613 0.0783 0.0623
CRS_DEP_TIME
Y 15 16 17 18 19 20 21
delayed 0.2077 0.0731 0.1346 0.0308 0.0846 0.0192 0.0846
ontime 0.1142 0.0840 0.0981 0.0406 0.0415 0.0283 0.0557
ORIGIN
Y BWI DCA IAD
delayed 0.0885 0.5192 0.3923
ontime 0.0632 0.6519 0.2849
DEST
Y EWR JFK LGA
delayed 0.381 0.219 0.400
ontime 0.289 0.174 0.538
CARRIER
Y CO DH DL MQ OH RU UA US
delayed 0.0692 0.3308 0.1154 0.1731 0.0115 0.2154 0.0154 0.0692
ontime 0.0377 0.2302 0.1981 0.1292 0.0142 0.1840 0.0132 0.1934
APPLYING THE FULL (EXACT) BAYESIAN CLASSIFIER 197
TABLE 8.5 PIVOT TABLE OF FLIGHT STATUS BY DESTINATION AIRPORT (TRAINING DATA)
# use prop.table() with margin = 1 to convert a count table to a proportion table,
# where each row sums up to 1 (use margin = 2 for column sums).
> prop.table(table(train.df$Flight.Status, train.df$DEST), margin = 1)
EWR JFK LGA
delayed 0.381 0.219 0.400
ontime 0.289 0.174 0.538
To compute the actual probability, we divide each of the numerators by their
sum:
P̂ (delayed|Carrier =DL, Day_Week= 7, Dep_Time= 10, Dest = LGA, Origin =DCA)=
=
0.000019
0.000019 + 0.00029
=0.06
P̂ (on time|Carrier =DL, Day_Week= 7, Dep_Time= 10, Dest = LGA, Origin =DCA)=
=
0.00029
0.000019 + 0.00029
=0.94
Of course, we rely on software to compute these probabilities for any records
of interest (in the training set, the validation set, or for scoring new data). Table
8.6 shows the estimated probability and class for the example flight.
TABLE 8.6 SCORING THE EXAMPLE FLIGHT (PROBABILITY AND CLASS)
code for scoring data using naive Bayes
## predict probabilities
pred.prob <- predict(delays.nb, newdata = valid.df, type = "raw")
## predict class membership
pred.class <- predict(delays.nb, newdata = valid.df)
df <- data.frame(actual = valid.df$Flight.Status, predicted = pred.class, pred.prob)
df[valid.df$CARRIER == "DL" & valid.df$DAY_WEEK == 7 & valid.df$CRS_DEP_TIME == 10 &
valid.df$DEST == "LGA" & valid.df$ORIGIN == "DCA",]
Output
> df[valid.df$CARRIER == “DL” & valid.df$DAY_WEEK == 7 & valid.df$CRS_DEP_TIME == 10 &
+ valid.df$DEST == “LGA” & valid.df$ORIGIN == “DCA”,]
actual predicted delayed ontime
69 ontime ontime 0.0604 0.94
700 ontime ontime 0.0604 0.94
198 THE NAIVE BAYES CLASSIFIER
Finally, to evaluate the performance of the naive Bayes classifier for our data,
we use the confusion matrix, lift charts, and all the measures that were described
in Chapter 5. For our example, the confusion matrices for the training and
validation sets are shown in Table 8.7. We see that the overall accuracy level is
around 80% for both the training and validation data. In comparison, a naive
rule that would classify all 880 flights in the validation set as on time would
have missed the 172 delayed flights, also resulting in a 80% accuracy. Thus,
by a simple accuracy measure, the naive Bayes model does no better than the
naive rule. However, examining the lift chart (Figure 8.1) shows the strength
of the naive Bayes in capturing the delayed flights effectively, when the goal is
ranking.
TABLE 8.7 CONFUSION MATRICES FOR FLIGHT DELAY USING A NAIVE BAYES CLASSIFIER
code for confusion matrices
library(caret)
# training
pred.class <- predict(delays.nb, newdata = train.df)
confusionMatrix(pred.class, train.df$Flight.Status)
# validation
pred.class <- predict(delays.nb, newdata = valid.df)
confusionMatrix(pred.class, valid.df$Flight.Status)
Partial output
> # training
> confusionMatrix(pred.class, train.df$Flight.Status)
Confusion Matrix and Statistics
Reference
Prediction delayed ontime
delayed 30 39
ontime 230 1021
Accuracy : 0.796
> # validation
> confusionMatrix(pred.class, valid.df$Flight.Status)
Confusion Matrix and Statistics
Reference
Prediction delayed ontime
delayed 21 25
ontime 147 688
Accuracy : 0.805
ADVANTAGES AND SHORTCOMINGS OF THE NAIVE BAYES CLASSIFIER 199
code for creating Figure 8.1
library(gains)
gain <- gains(ifelse(valid.df$Flight.Status=="delayed",1,0), pred.prob[,1], groups=100)
plot(c(0,gain$cume.pct.of.total*sum(valid.df$Flight.Status=="delayed"))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$Flight.Status=="delayed"))~c(0, dim(valid.df)[1]), lty=2)
0 200 400 600 800
0
50
10
0
15
0
# cases
C
um
ul
at
iv
e
FIGURE 8.1 LIFT CHART OF NAIVE BAYES CLASSIFIER APPLIED TO FLIGHT DELAYS DATA
8.3 Advantages and Shortcomings of the
Naive Bayes Classifier
The naive Bayes classifier’s beauty is in its simplicity, computational efficiency,
good classification performance, and ability to handle categorical variables
directly. In fact, it often outperforms more sophisticated classifiers even when
the underlying assumption of independent predictors is far from true. This
advantage is especially pronounced when the number of predictors is very large.
Three main issues should be kept in mind, however. First, the naive Bayes
classifier requires a very large number of records to obtain good results.
Second, where a predictor category is not present in the training data, naive
Bayes assumes that a new record with that category of the predictor has zero
probability. This can be a problem if this rare predictor value is important. One
example is the binary predictorWeather in the flights delay dataset, which we did
not use for analysis, and which denotes bad weather. When the weather was bad,
all flights were delayed. Consider another example, where the outcome variable
is bought high-value life insurance and a predictor category is owns yacht. If the
200 THE NAIVE BAYES CLASSIFIER
training data have no records with owns yacht = 1, for any new records where owns
yacht = 1, naive Bayes will assign a probability of 0 to the outcome variable bought
high-value life insurance. With no training records with owns yacht = 1, of course,
no data mining technique will be able to incorporate this potentially important
variable into the classification model—it will be ignored. With naive Bayes,
however, the absence of this predictor actively “outvotes” any other information
in the record to assign a 0 to the outcome value (when, in this case, it has a
relatively good chance of being a 1). The presence of a large training set (and
judicious binning of continuous predictors, if required) helps mitigate this effect.
A popular solution in such cases is to replace zero probabilities with non-zero
values using a method called smoothing (e.g., Laplace smoothing can be applied
by using argument laplace = 0 in function naiveBayes()).
Finally, good performance is obtained when the goal is classification or ranking
of records according to their probability of belonging to a certain class. How-
ever, when the goal is to estimate the probability of class membership (propensity), this
method provides very biased results. For this reason, the naive Bayes method is
rarely used in credit scoring (Larsen, 2005).
S P A M F I L T E R I N G
Filtering spam in e-mail has long been a widely familiar application of data min-
ing. Spam filtering, which is based in large part on natural language vocabulary,
is a natural fit for a naive Bayesian classifier, which uses exclusively categorical
variables. Most spam filters are based on this method, which works as follows:
1. Humans review a large number of e-mails, classify them as “spam” or “not spam,”
and from these select an equal (also large) number of spam e-mails and non-
spam e-mails. This is the training data.
2. These e-mails will contain thousands of words; for each word, compute the fre-
quency with which it occurs in the spam dataset, and the frequency with which
it occurs in the non-spam dataset. Convert these frequencies into estimated
probabilities (i.e., if the word “free” occurs in 500 out of 1000 spam e-mails,
and only 100 out of 1000 non-spam e-mails, the probability that a spam e-mail
will contain the word “free” is 0.5, and the probability that a non-spam e-mail
will contain the word “free” is 0.1).
3. If the only word in a new message that needs to be classified as spam or not spam
is “free,” we would classify the message as spam, since the Bayesian posterior
probability is 0.5/(0.5+01) or 5/6 that, given the appearance of “free,” the
message is spam.
4. Of course, we will have many more words to consider. For each such word, the
probabilities described in Step 2 are calculated, and multiplied together, and
formula (8.3) is applied to determine the naive Bayes probability of belonging
to the classes. In the simple version, class membership (spam or not spam) is
determined by the higher probability.
ADVANTAGES AND SHORTCOMINGS OF THE NAIVE BAYES CLASSIFIER 201
5. In a more flexible interpretation, the ratio between the “spam” and “not spam”
probabilities is treated as a score for which the operator can establish (and
change) a cutoff threshold—anything above that level is classified as spam.
6. Users have the option of building a personalized training database by classify-
ing incoming messages as spam or not spam, and adding them to the training
database. One person’s spam may be another person’s substance.
It is clear that, even with the “Naive” simplification, this is an enormous com-
putational burden. Spam filters now typically operate at two levels—at servers
(intercepting some spam that never makes it to your computer) and on individual
computers (where you have the option of reviewing it). Spammers have also found
ways to ”poison” the vocabulary-based Bayesian approach, by including sequences
of randomly selected irrelevant words. Since these words are randomly selected,
they are unlikely to be systematically more prevalent in spam than in non-spam,
and they dilute the effect of key spam terms such as “Viagra” and “free.” For this rea-
son, sophisticated spam classifiers also include variables based on elements other
than vocabulary, such as the number of links in the message, the vocabulary in
the subject line, determination of whether the “From:” e-mail address is the real
originator (anti-spoofing), use of HTML and images, and origination at a dynamic
or static IP address (the latter are more expensive and cannot be set up quickly).
202 THE NAIVE BAYES CLASSIFIER
PROBLEMS
8.1 Personal Loan Acceptance. The file UniversalBank.csv contains data on 5000 cus-
tomers of Universal Bank. The data include customer demographic information (age,
income, etc.), the customer’s relationship with the bank (mortgage, securities account,
etc.), and the customer response to the last personal loan campaign (Personal Loan).
Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was
offered to them in the earlier campaign. In this exercise, we focus on two predictors:
Online (whether or not the customer is an active user of online banking services) and
Credit Card (abbreviated CC below) (does the customer hold a credit card issued by
the bank), and the outcome Personal Loan (abbreviated Loan below).
Partition the data into training (60%) and validation (40%) sets.
a. Create a pivot table for the training data with Online as a column variable, CC as
a row variable, and Loan as a secondary row variable. The values inside the table
should convey the count. In R use functions melt() and cast(), or function table().
b. Consider the task of classifying a customer who owns a bank credit card and is
actively using online banking services. Looking at the pivot table, what is the
probability that this customer will accept the loan offer? [This is the probability of
loan acceptance (Loan = 1) conditional on having a bank credit card (CC = 1) and
being an active user of online banking services (Online = 1)].
c. Create two separate pivot tables for the training data. One will have Loan (rows) as
a function of Online (columns) and the other will have Loan (rows) as a function
of CC.
d. Compute the following quantities [P (A | B) means “the probability of A given B”]:
i. P (CC = 1 | Loan = 1) (the proportion of credit card holders among the loan
acceptors)
ii. P (Online = 1 | Loan = 1)
iii. P (Loan = 1) (the proportion of loan acceptors)
iv. P (CC = 1 | Loan = 0)
v. P (Online = 1 | Loan = 0)
vi. P (Loan = 0)
e. Use the quantities computed above to compute the naive Bayes probability
P (Loan = 1 | CC = 1, Online = 1).
f. Compare this value with the one obtained from the pivot table in (b). Which is a
more accurate estimate?
g. Which of the entries in this table are needed for computing P (Loan = 1 | CC = 1,
Online = 1)? In R, run naive Bayes on the data. Examine the model output on
training data, and find the entry that corresponds to P (Loan = 1 | CC = 1, Online
= 1). Compare this to the number you obtained in (e).
8.2 Automobile Accidents. The file Accidents.csv contains information on 42,183 actual
automobile accidents in 2001 in the United States that involved one of three levels of
injury: NO INJURY, INJURY, or FATALITY. For each accident, additional infor-
mation is recorded, such as day of week, weather conditions, and road type. A firm
might be interested in developing a system for quickly classifying the severity of an
accident based on initial reports and associated data in the system (some of which rely
on GPS-assisted reporting).
PROBLEMS 203
Our goal here is to predict whether an accident just reported will involve an injury
(MAX_SEV_IR = 1 or 2) or will not (MAX_SEV_IR = 0). For this purpose, create
a dummy variable called INJURY that takes the value “yes” if MAX_SEV_IR = 1 or
2, and otherwise “no.”
a. Using the information in this dataset, if an accident has just been reported and no
further information is available, what should the prediction be? (INJURY = Yes
or No?) Why?
b. Select the first 12 records in the dataset and look only at the response (INJURY)
and the two predictors WEATHER_R and TRAF_CON_R.
i. Create a pivot table that examines INJURY as a function of the two predictors
for these 12 records. Use all three variables in the pivot table as rows/columns.
ii. Compute the exact Bayes conditional probabilities of an injury (INJURY =
Yes) given the six possible combinations of the predictors.
iii. Classify the 12 accidents using these probabilities and a cutoff of 0.5.
iv. Compute manually the naive Bayes conditional probability of an injury given
WEATHER_R = 1 and TRAF_CON_R = 1.
v. Run a naive Bayes classifier on the 12 records and two predictors using R.
Check the model output to obtain probabilities and classifications for all 12
records. Compare this to the exact Bayes classification. Are the resulting clas-
sifications equivalent? Is the ranking (= ordering) of observations equivalent?
c. Let us now return to the entire dataset. Partition the data into training (60%) and
validation (40%).
i. Assuming that no information or initial reports about the accident itself are
available at the time of prediction (only location characteristics, weather con-
ditions, etc.), which predictors can we include in the analysis? (Use the
Data_Codes sheet.)
ii. Run a naive Bayes classifier on the complete training set with the relevant pre-
dictors (and INJURY as the response). Note that all predictors are categorical.
Show the confusion matrix.
iii. What is the overall error for the validation set?
iv. What is the percent improvement relative to the naive rule (using the validation
set)?
v. Examine the conditional probabilities output. Why do we get a probability of
zero for P(INJURY = No | SPD_LIM = 5)?
CHAPTER 9
Classification and Regression Trees
This chapter describes a flexible data-driven method that can be used for both
classification (called classification tree) and prediction (called regression tree). Among
the data-driven methods, trees are the most transparent and easy to interpret.
Trees are based on separating records into subgroups by creating splits on pre-
dictors. These splits create logical rules that are transparent and easily under-
standable, for example, “IF Age < 55 AND Education > 12 THEN class = 1.”
The resulting subgroups should be more homogeneous in terms of the outcome
variable, thereby creating useful prediction or classification rules. We discuss the
two key ideas underlying trees: recursive partitioning (for constructing the tree) and
pruning (for cutting the tree back). In the context of tree construction, we also
describe a few metrics of homogeneity that are popular in tree algorithms, for
determining the homogeneity of the resulting subgroups of records. We explain
that pruning is a useful strategy for avoiding overfitting and show how it is done.
We also describe alternative strategies for avoiding overfitting. As with other
data-driven methods, trees require large amounts of data. However, once con-
structed, they are computationally cheap to deploy even on large samples. They
also have other advantages such as being highly automated, robust to outliers,
and able to handle missing values. In addition to prediction and classification,
we describe how trees can be used for dimension reduction. Finally, we intro-
duce random forests and boosted trees, which combine results from multiple trees
to improve predictive power.
9.1 Introduction
If one had to choose a classification technique that performs well across a wide
range of situations without requiring much effort from the analyst while being
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
205
206 CLASSIFICATION AND REGRESSION TREES
Income < 106
Education < 1.5
Family < 2.5
Income < 114
Income < 116
CCAvg < 2.4
2712 288
2312 31
400 257
366 45
359 0
7 45
7 4 0 41
34 212
34 22
29 7 5 15
0 190
0
0
0
0
0
1
0 1
1
0
0 1
1
yes no
FIGURE 9.1 BEST-PRUNED TREE OBTAINED BY FITTING A FULL TREE TO THE TRAINING DATA
readily understandable by the consumer of the analysis, a strong contender would
be the tree methodology developed by Breiman et al. (1984). We discuss this
classification procedure first, then in later sections we show how the proce-
dure can be extended to prediction of a numerical outcome. The program that
Breiman et al. created to implement these procedures was called CART (Clas-
sification And Regression Trees). A related procedure is called C4.5.
What is a classification tree? Figure 9.1 shows a tree for classifying bank
customers who receive a loan offer as either acceptors or nonacceptors, based
on information such as their income, education level, and average credit card
expenditure. One of the reasons that tree classifiers are very popular is that
they provide easily understandable classification rules (at least if the trees are not
too large). Consider the tree in the example. The gray terminal nodes are marked
with 0 or 1 corresponding to a nonacceptor (0) or acceptor (1). The values above
the white nodes give the splitting value on a predictor. The values below the
nodes gives the number of records in the split. This tree can easily be translated
into a set of rules for classifying a bank customer. For example, the bottom-left
rectangle node under the “Family” circle in this tree gives us the following rule:
IF(Income ≥ 106) AND (Education < 1.5) AND (Family < 2.5)
THEN Class = 0 (nonacceptor).
In the following, we show how trees are constructed and evaluated.
CLASSIFICATION TREES 207
9.2 Classification Trees
Two key ideas underlie classification trees. The first is the idea of recursive parti-
tioning of the space of the predictor variables. The second is the idea of pruning
using validation data. In the next few sections, we describe recursive partitioning
and in subsequent sections explain the pruning methodology.
Recursive Partitioning
Let us denote the outcome variable by Y and the input (predictor) variables by
X1, X2, X3, . . . , Xp. In classification, the outcome variable will be a categori-
cal variable. Recursive partitioning divides up the p-dimensional space of the X
predictor variables into nonoverlapping multidimensional rectangles. The pre-
dictor variables here are considered to be continuous, binary, or ordinal. This
division is accomplished recursively (i.e., operating on the results of prior divi-
sions). First, one of the predictor variables is selected, say Xi, and a value of
Xi, say si, is chosen to split the p-dimensional space into two parts: one part
that contains all the points with Xi < si and the other with all the points
with Xi ≥ si. Then, one of these two parts is divided in a similar manner by
again choosing a predictor variable (it could be Xi or another variable) and a
split value for that variable. This results in three (multidimensional) rectangular
regions. This process is continued so that we get smaller and smaller rectangu-
lar regions. The idea is to divide the entire X-space up into rectangles such
that each rectangle is as homogeneous or “pure” as possible. By pure, we mean
containing records that belong to just one class. (Of course, this is not always
possible, as there may be records that belong to different classes but have exactly
the same values for every one of the predictor variables.)
Let us illustrate recursive partitioning with an example.
Example 1: Riding Mowers
We again use the riding-mower example presented in Chapter 3. A riding-
mower manufacturer would like to find a way of classifying families in a city
into those likely to purchase a riding mower and those not likely to buy one. A
pilot random sample of 12 owners and 12 nonowners in the city is undertaken.
The data are shown and plotted in Table 9.1 and Figure 9.2.
If we apply the classification tree procedure to these data, the procedure will
choose Income for the first split with a splitting value of 60. The (X1, X2) space
is now divided into two rectangles, one with Income < 60 and the other with
Income ≥ 60. This is illustrated in Figure 9.3.
Notice how the split has created two rectangles, each of which is much more
homogeneous than the rectangle before the split. The left rectangle contains
208 CLASSIFICATION AND REGRESSION TREES
TABLE 9.1 LOT SIZE, INCOME, AND OWNERSHIP OF A
RIDING MOWER FOR 24 HOUSEHOLDS
Household Income Lot Size Ownership of
Number ($000s) (000s ft2) Riding Mower
1 60.0 18.4 Owner
2 85.5 16.8 Owner
3 64.8 21.6 Owner
4 61.5 20.8 Owner
5 87.0 23.6 Owner
6 110.1 19.2 Owner
7 108.0 17.6 Owner
8 82.8 22.4 Owner
9 69.0 20.0 Owner
10 93.0 20.8 Owner
11 51.0 22.0 Owner
12 81.0 20.0 Owner
13 75.0 19.6 Nonowner
14 52.8 20.8 Nonowner
15 64.8 17.2 Nonowner
16 43.2 20.4 Nonowner
17 84.0 17.6 Nonowner
18 49.2 17.6 Nonowner
19 59.4 16.0 Nonowner
20 66.0 18.4 Nonowner
21 47.4 16.4 Nonowner
22 33.0 18.8 Nonowner
23 51.0 14.0 Nonowner
24 63.0 14.8 Nonowner
FIGURE 9.2 SCATTER PLOT OF LOT SIZE VS. INCOME FOR 24 OWNERS AND NONOWNERS OF
RIDING MOWERS
CLASSIFICATION TREES 209
FIGURE 9.3 SPLITTING THE 24 RECORDS BY INCOME VALUE OF 60
points that are mostly nonowners (seven nonowners and one owner) and the
right rectangle contains mostly owners (11 owners and five nonowners).
How was this particular split selected? The algorithm examined each pre-
dictor variable (in this case, Income and Lot Size) and all possible split values
for each variable to find the best split. What are the possible split values for
a variable? They are simply the values for each predictor. The possible split
points for Income are {33.0, 43.2, 47.4, . . . , 110.1} and those for Lot Size are
{14.0, 14.8, 16.0, . . . , 23.6}. These split points are ranked according to how
much they reduce impurity (heterogeneity) in the resulting rectangle. A pure
rectangle is one that is composed of a single class (e.g., owners). The reduction
in impurity is defined as overall impurity before the split minus the sum of the
impurities for the two rectangles that result from a split.
Categorical Predictors The previous description used numerical predic-
tors; however, categorical predictors can also be used in the recursive partitioning
context. To handle categorical predictors, the split choices for a categorical pre-
dictor are all ways in which the set of categories can be divided into two subsets.
For example, a categorical variable with four categories, say {a, b, c, d}, can be
split in seven ways into two subsets: {a} and {b, c, d}; {b} and {a, c, d}; {c}
and {a, b, d}; {d} and {a, b, c}; {a, b} and {c, d}; {a, c} and {b, d}; and finally
{a, d} and {b, c}. When the number of categories is large, the number of splits
becomes very large.
210 CLASSIFICATION AND REGRESSION TREES
Measures of Impurity
There are a number of ways to measure impurity. The two most popular mea-
sures are the Gini index and an entropy measure. We describe both next. Denote
the m classes of the response variable by k = 1, 2, . . . ,m.
The Gini impurity index for a rectangle A is defined by
I(A) = 1−
m∑
k=1
p2k,
where pk is the proportion of records in rectangle A that belong to class k. This
measure takes values between 0 (when all the records belong to the same class)
and (m− 1)/m (when all m classes are equally represented). Figure 9.4 shows
the values of the Gini index for a two-class case as a function of pk. It can be
seen that the impurity measure is at its peak when pk = 0.5 (i.e., when the
rectangle contains 50% of each of the two classes).
FIGURE 9.4 VALUES OF THE GINI INDEX FOR A TWO-CLASS CASE AS A FUNCTION OF THE
PROPORTION OF RECORDS IN CLASS 1 (p1)
A second impurity measure is the entropy measure. The entropy for a rect-
angle A is defined by
entropy(A) = −
m∑
k=1
pk log2(pk)
(to compute log2(x) in R, use function log2()). This measure ranges between 0
(most pure, all records belong to the same class) and log2(m) (when all m classes
are represented equally). In the two-class case, the entropy measure is maximized
(like the Gini index) at pk = 0.5.
Let us compute the impurity in the riding mower example before and after
the first split (using Income with the value of 60). The unsplit dataset contains
CLASSIFICATION TREES 211
12 owners and 12 nonowners. This is a two-class case with an equal number of
records from each class. Both impurity measures are therefore at their maximum
value: Gini = 0.5 and entropy = log2(2) = 1. After the split, the left rectan-
gle contains seven nonowners and one owner. The impurity measures for this
rectangle are:
Gini_left = 1− (7/8)2 − (1/8)2 = 0.219
entropy_left = −(7/8) log2(7/8)− (1/8) log2(1/8) = 0.544
The right rectangle contains 11 owners and five nonowners. The impurity mea-
sures of the right rectangle are therefore
Gini_right = 1− (11/16)2 − (5/16)2 = 0.430
entropy_right = −(11/16) log2(11/16)− (5/16) log2(5/16) = 0.896
The combined impurity of the two rectangles that were created by the split
is a weighted average of the two impurity measures, weighted by the number of
records in each:
Gini = (8/24)(0.219) + (16/24)(0.430) = 0.359
entropy = (8/24)(0.544) + (16/24)(0.896) = 0.779
Thus, the Gini impurity index decreased from 0.5 before the split to 0.359 after
the split. Similarly, the entropy impurity measure decreased from 1 before the
split to 0.779 after the split.
By comparing the reduction in impurity across all possible splits in all possible
predictors, the next split is chosen. If we continue splitting the mower data,
the next split is on the Lot Size variable at the value 21. Figure 9.5 shows
that once again the tree procedure has astutely chosen to split a rectangle to
increase the purity of the resulting rectangles. The lower-left rectangle, which
contains data points with Income < 60 and Lot Size < 21, has all points that are
nonowners; whereas the upper left rectangle, which contains data points with
Income < 60 and Lot Size ≥ 21, consists exclusively of a single owner. In other
words, the two left rectangles are now “pure.” We can see how the recursive
partitioning is refining the set of constituent rectangles to become purer as the
algorithm proceeds. The final stage of the recursive partitioning is shown in
Figure 9.6. Notice that each rectangle is now pure: it contains data points from
just one of the two classes.
The reason the method is called a classification tree algorithm is that each split
can be depicted as a split of a node into two successor nodes. The first split is
shown as a branching of the root node of a tree in Figure 9.7. The full-grown
tree is shown in Figure 9.8. (Note that in R the split values are integers).
212 CLASSIFICATION AND REGRESSION TREES
FIGURE 9.5 SPLITTING THE 24 RECORDS FIRST BY INCOME VALUE OF 60 AND THEN LOT SIZE
VALUE OF 21
FIGURE 9.6 FINAL STAGE OF RECURSIVE PARTITIONING; EACH RECTANGLE
CONSISTING OF A SINGLE CLASS (OWNERS OR NONOWNERS)
CLASSIFICATION TREES 213
code for running and plotting classification tree with single split
library(rpart)
library(rpart.plot)
mower.df <- read.csv("RidingMowers.csv")
# use rpart() to run a classification tree.
# define rpart.control() in rpart() to determine the depth of the tree.
class.tree <- rpart(Ownership ~ ., data = mower.df,
control = rpart.control(maxdepth = 2), method = "class")
## plot tree
# use prp() to plot the tree. You can control plotting parameters such as color, shape,
# and information displayed (which and where).
prp(class.tree, type = 1, extra = 1, split.font = 1, varlen = -10)
Income < 60
12 12
7 1 5 11
Nonowner
Nonowner Owner
yes no
FIGURE 9.7 TREE REPRESENTATION OF FIRST SPLIT (CORRESPONDS TO FIGURE 9.3)
Income < 60
Lot_Size < 21 Lot_Size < 20
Income < 85
Income >= 62
12 12
7 1
7 0 0 1
5 11
5 4
5 1
5 0 0 1
0 3
0 7
Nonowner
Nonowner
Nonowner Owner
Owner
Nonowner
Nonowner
Nonowner Owner
Owner
Owner
yes no
FIGURE 9.8 TREE REPRESENTATION AFTER ALL SPLITS (CORRESPONDS TO FIGURE 9.6). THIS
IS THE FULL GROWN TREE
214 CLASSIFICATION AND REGRESSION TREES
Tree Structure
We have two types of nodes in a tree: decision nodes and terminal nodes. Nodes
that have successors are called decision nodes because if we were to use a tree to
classify a new record for which we knew only the values of the predictor vari-
ables, we would “drop” the record down the tree so that at each decision node,
the appropriate branch is taken until we get to a node that has no successors.
Such nodes are called the terminal nodes (or leaves of the tree), and represent the
partitioning of the data by predictors.
It is useful to note that the type of trees grown by R’s rpart() function, also
known as CART or binary trees, have the property that the number of terminal
nodes is exactly one more than the number of decision nodes.
When using the R function prp() for plotting a tree, decision nodes are
depicted by white ovals, and terminal nodes by gray ovals.
The name of the variable chosen for splitting and its splitting value are above
each decision node. The numbers below a node are the number of records in
that node that had values lesser than (left side) or larger or equal to (right side)
the splitting value. (“yes” and “no” tags at the top node clarify on which side
we find the lesser values).
Classifying a New Record
To classify a new record, it is “dropped” down the tree. When it has dropped
all the way down to a terminal node, we can assign its class simply by taking
a “vote” of all the training data that belonged to the terminal node when the
tree was grown. The class with the highest vote is assigned to the new record.
For instance, a new record reaching the rightmost terminal node in Figure 9.8,
which has a majority of records that belong to the owner class, would be classified
as “owner.” Alternatively, if a single class is of interest, the algorithm counts the
number of “votes” for this class, converts it to a proportion (propensity), then
compares it to a user-specified cutoff value. See Chapter 5 for further discussion
of the use of a cutoff value in classification, for cases where a single class is of
interest.
In a binary classification situation (typically, with a success class that is rel-
atively rare and of particular interest), we can also establish a lower cutoff to
better capture those rare successes (at the cost of lumping in more failures as suc-
cesses). With a lower cutoff, the votes for the success class only need attain that
lower cutoff level for the entire terminal node to be classified as a success. The
cutoff therefore determines the proportion of votes needed for determining the
terminal node class.
EVALUATING THE PERFORMANCE OF A CLASSIFICATION TREE 215
9.3 Evaluating the Performance of a
Classification Tree
We have seen with previous methods that the modeling job is not completed by
fitting a model to training data; we need out-of-sample data to assess and tune
the model. This is particularly true with classification and regression trees, for
two reasons:
• Tree structure can be quite unstable, shifting substantially depending on
the sample chosen.
• A fully-fit tree will invariably lead to overfitting.
To visualize the first challenge, potential instability, imagine that we partition
the data randomly into two samples, A and B, and we build a tree with each. If
there are several predictors of roughly equal predictive power, you can see that it
would be easy for samples A and B to select different predictors for the top level
split, just based on which records ended up in which sample. And a different
split at the top level would likely cascade down and yield completely different
sets of rules. So we should view the results of a single tree with some caution.
To illustrate the second challenge, overfitting, let’s examine another example.
Example 2: Acceptance of Personal Loan
Universal Bank is a relatively young bank that is growing rapidly in terms of
overall customer acquisition. The majority of these customers are liability cus-
tomers with varying sizes of relationship with the bank. The customer base of
asset customers is quite small, and the bank is interested in growing this base
rapidly to bring in more loan business. In particular, it wants to explore ways of
converting its liability (deposit) customers to personal loan customers.
A campaign the bank ran for liability customers showed a healthy conversion
rate of over 9% successes. This has encouraged the retail marketing department to
devise smarter campaigns with better target marketing. The goal of our analysis
is to model the previous campaign’s customer behavior to analyze what combi-
nation of factors make a customer more likely to accept a personal loan. This
will serve as the basis for the design of a new campaign.
Our predictive model will be a classification tree. To assess the accuracy of
the tree in classifying new records, we start with the tools and criteria discussed
in Chapter 5—partitioning the data into training and validation sets, and later
introduce the idea of cross-validation.
The bank’s dataset includes data on 5000 customers. The data include cus-
tomer demographic information (age, income, etc.), customer response to the
last personal loan campaign (Personal Loan), and the customer’s relationship with
216 CLASSIFICATION AND REGRESSION TREES
the bank (mortgage, securities account, etc.). Table 9.2 shows a sample of the
bank’s customer database for 20 customers, to illustrate the structure of the data.
Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that
was offered to them in the earlier campaign.
After randomly partitioning the data into training (3000 records) and vali-
dation (2000 records), we use the training data to construct a tree. A tree with
7 splits is shown in Figure 9.9 (this is the default tree produced by rpart() for this
data). The top node refers to all the records in the training set, of which 2709
customers did not accept the loan and 291 customers accepted the loan. The “0”
in the top node’s rectangle represents the majority class (“did not accept” = 0).
The first split, which is on the income variable, generates left and right child
nodes. To the left is the child node with customers who have income less than
114. Customers with income greater than or equal to 114 go to the right. The
splitting process continues; where it stops depends on the parameter settings of
the algorithm. The eventual classification of customer appears in the terminal
nodes. Of the eight terminal nodes, four lead to classification of “did not accept”
and four lead to classification of “accept.”
A full-grown tree is shown in Figure 9.10. Although it is difficult to see
the exact splits, let us assess the performance of this tree with the validation data
and compare it to the smaller default tree. Each record in the validation data is
“dropped down” the tree and classified according to the terminal node it reaches.
These predicted classes can then be compared to the actual memberships via a
confusion matrix. When a particular class is of interest, a lift chart is useful for
assessing the model’s ability to capture those members.
Table 9.3 displays the confusion matrices for the training and validation sets
of the small default tree (top) and for the full tree (bottom). Comparing the
two training matrices, we see that the full tree has higher accuracy: it is 100%
accurate in classifying the training data, which means it has completely pure
terminal nodes. In contrast, the confusion matrices for the validation data show
that the smaller tree is more accurate. The main reason is that the full-grown tree
overfits the training data (to perfect accuracy!). This motivates the next section,
where we describe ways to avoid overfitting by either stopping the growth of
the tree before it is fully grown or by pruning the full-grown tree.
9.4 Avoiding Overfitting
One danger in growing deeper trees on the training data is overfitting. As dis-
cussed in Chapter 5, overfitting will lead to poor performance on new data. If we
look at the overall error at the various sizes of the tree, it is expected to decrease
as the number of terminal nodes grows until the point of overfitting. Of course,
for the training data the overall error decreases more and more until it is zero
AVOIDING OVERFITTING 217
T
A
B
L
E
9
.2
SA
M
PL
E
OF
DA
TA
FO
R
20
CU
ST
OM
ER
S
OF
U
N
IV
ER
SA
L
BA
N
K
Pr
of
es
si
on
al
Fa
m
ily
CC
Pe
rs
on
al
Se
cu
ri
ti
es
CD
On
lin
e
Cr
ed
it
ID
Ag
e
Ex
pe
ri
en
ce
In
co
m
e
Si
ze
Av
g
Ed
uc
at
io
n
M
or
tg
ag
e
Lo
an
Ac
co
un
t
Ac
co
un
t
Ba
nk
in
g
Ca
rd
1
25
1
49
4
1.
60
U
G
0
No
Ye
s
No
No
No
2
45
19
34
3
1.
50
U
G
0
No
Ye
s
No
No
No
3
39
15
11
1
1.
00
U
G
0
No
No
No
No
No
4
35
9
10
0
1
2.
70
Gr
ad
0
No
No
No
No
No
5
35
8
45
4
1.
00
Gr
ad
0
No
No
No
No
Ye
s
6
37
13
29
4
0.
40
Gr
ad
15
5
No
No
No
Ye
s
No
7
53
27
72
2
1.
50
Gr
ad
0
No
No
No
Ye
s
No
8
50
24
22
1
0.
30
Pr
of
0
No
No
No
No
Ye
s
9
35
10
81
3
0.
60
Gr
ad
10
4
No
No
No
Ye
s
No
10
34
9
18
0
1
8.
90
Pr
of
0
Ye
s
No
No
No
No
11
65
39
10
5
4
2.
40
Pr
of
0
No
No
No
No
No
12
29
5
45
3
0.
10
Gr
ad
0
No
No
No
Ye
s
No
13
48
23
11
4
2
3.
80
Pr
of
0
No
Ye
s
No
No
No
14
59
32
40
4
2.
50
Gr
ad
0
No
No
No
Ye
s
No
15
67
41
11
2
1
2.
00
U
G
0
No
Ye
s
No
No
No
16
60
30
22
1
1.
50
Pr
of
0
No
No
No
Ye
s
Ye
s
17
38
14
13
0
4
4.
70
Pr
of
13
4
Ye
s
No
No
No
No
18
42
18
81
4
2.
40
U
G
0
No
No
No
No
No
19
46
21
19
3
2
8.
10
Pr
of
0
Ye
s
No
No
No
No
20
55
28
21
1
0.
50
Gr
ad
0
No
Ye
s
No
No
Ye
s
218 CLASSIFICATION AND REGRESSION TREES
code for creating a default classification tree
library(rpart)
library(rpart.plot)
bank.df <- read.csv("UniversalBank.csv")
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.
# partition
set.seed(1)
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
# classification tree
default.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class")
# plot tree
prp(default.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10)
Income < 114
CCAvg < 3
CD.Account < 0.5
Income < 92
Family < 2.5
Education < 1.5
Family < 2.5
2709 291
2360 50
2233 9
127 41
123 31
89 10
34 21
32 7 2 14
4 10
349 241
342 39
342 0 0 39
7 202
0
0
0
0
0
0
0
0 1
1
0
0
0 1
1
yes no
FIGURE 9.9 DEFAULT CLASSIFICATION TREE FOR THE LOAN ACCEPTANCE DATA USING THE
TRAINING SET (3000 RECORDS)
AVOIDING OVERFITTING 219
In
co
m
e
<
1
14
C
C
A
vg
<
3
In
co
m
e
<
1
06
Fa
m
ily
<
3
.5
E
xp
er
ie
nc
e
>
=
1
6
A
ge
<
4
2
In
co
m
e
>
=
1
08
Fa
m
ily
<
2
.5
A
ge
>
=
2
8
A
ge
<
2
8
A
ge
<
3
3
C
C
A
vg
>
=
1
.1
E
du
ca
tio
n
>
=
1
.5
In
co
m
e
>
=
1
08
A
ge
<
4
2
C
D
.A
cc
ou
nt
<
0
.5
In
co
m
e
<
9
2
A
ge
>
=
3
0
E
du
ca
tio
n
>
=
2
.5
Fa
m
ily
>
=
2
.5
C
C
A
vg
>
=
3
.2
C
C
A
vg
<
3
C
C
A
vg
>
=
3
.2
A
ge
<
4
4
A
ge
>
=
3
4
A
ge
<
3
2
M
or
tg
ag
e
>
=
1
09
A
ge
>
=
4
4
Fa
m
ily
>
=
1
.5
A
ge
>
=
6
1
In
co
m
e
>
=
9
2
Fa
m
ily
<
2
.5
E
du
ca
tio
n
<
1
.5
A
ge
<
5
6
E
xp
er
ie
nc
e
>
=
3
4
In
co
m
e
<
1
04
A
ge
<
4
6
Fa
m
ily
<
1
.5
In
co
m
e
>
=
1
10
A
ge
>
=
6
0
A
ge
<
6
3
C
C
A
vg
>
=
4
M
or
tg
ag
e
<
1
96
In
co
m
e
>
=
9
2
A
ge
<
4
2
E
du
ca
tio
n
<
1
.5
Fa
m
ily
<
2
.5
In
co
m
e
<
1
16
C
C
A
vg
<
2
.2
O
nl
in
e
>
=
0
.5
A
ge
>
=
5
9
M
or
tg
ag
e
>
=
1
42
27
09
2
91
23
60
5
0
22
33
9
21
83
0
50
9
39
4
23
0
16
4
16
3
16
2
15
1
12
0
3
1
3
0
0
1
1
1
1
0
0
1
0
1
0
1
11
5
11
2
11
1
10
0
1
1
1
0
0
10
1
0
3
12
7
4
1
12
3
3
1
89
1
0
88
7
38
0
50
7
30
2
21
0
9
2
9
0
0
2
20
5
20
4
13
1
12
0
1
1
1
0
0
1
7
3
4
0
3
3
3
2
2
0
1
2
1
0
0
20
1
0
1
1
3
1
0
0
3
34
2
1
32
7
26
1
25
0
1
1
1
0
0
1
6
6
6
2
5
0
1
2
1
0
0
20
4
2
1
4
1
0
1
1
4
1
1
1
0
0
1
0
1
3
4
1
0
4
4
4
1
3
0
1
1
1
0
0
10
3
0
6
34
9
2
41
34
2
3
9
34
2
0
0
3
9
7
2
02
7
1
0
6
2
5
0
1
2
1
0
0
2
1
8
1
0
0
8
0
1
92
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
1
1
0
0
0
0
0
0
11
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
1
0
11
1
1
0
1
0
0
0
0
0
0
1
0
0
0
1
0
11
1
0
1
0
0
1
1
1
0
0
0
0
0
11
1
0
0
0
1
1
1
0
0
1
0
1
1
0
1
1
ye
s
no
F
IG
U
R
E
9
.1
0
A
FU
LL
TR
EE
FO
R
TH
E
LO
AN
AC
CE
PT
AN
CE
DA
TA
U
SI
N
G
TH
E
TR
AI
N
IN
G
SE
T
(3
00
0
RE
CO
RD
S)
code for creating a deeper classification tree
deeper.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class", cp = 0, minsplit = 1)
# count number of leaves
length(deeper.ct$frame$var[deeper.ct$frame$var == "
# plot tree
prp(deeper.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,
box.col=ifelse(deeper.ct$frame$var == “
220 CLASSIFICATION AND REGRESSION TREES
TABLE 9.3 CONFUSION MATRICES AND ACCURACY FOR THE DEFAULT (SMALL) AND DEEPER
(FULL) CLASSIFICATION TREES, ON THE TRAINING AND VALIDATION SETS OF THE
PERSONAL LOAN DATA
code for classifying the validation data using a tree and computing the confusion matrices and
accuracy for the training and validation data
# classify records in the validation data.
# set argument type = “class” in predict() to generate predicted class membership.
default.ct.point.pred.train <- predict(default.ct,train.df,type = "class")
# generate confusion matrix for training data
confusionMatrix(default.ct.point.pred.train, train.df$PersonalLoan)
### repeat the code for the validation set, and the deeper tree
Output
> # default tree: training
> confusionMatrix(default.ct.point.pred.train, train.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2696 26
1 13 265
Accuracy : 0.987
> # default tree: validation
> confusionMatrix(default.ct.point.pred.valid, valid.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1792 18
1 19 171
Accuracy : 0.9815
> # deeper tree: training
> confusionMatrix(deeper.ct.point.pred.train, train.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2709 0
1 0 291
Accuracy : 1
> # deeper tree: validation
> confusionMatrix(deeper.ct.point.pred.valid, valid.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1788 19
1 23 170
Accuracy : 0.979
AVOIDING OVERFITTING 221
FIGURE 9.11 ERROR RATE AS A FUNCTION OF THE NUMBER OF SPLITS FOR TRAINING VS.
VALIDATION DATA: OVERFITTING
at the maximum level of the tree. However, for new data, the overall error is
expected to decrease until the point where the tree fully models the relationship
between class and the predictors. After that, the tree starts to model the noise
in the training set, and we expect the overall error for the validation set to start
increasing. This is depicted in Figure 9.11. One intuitive reason a large tree
may overfit is that its final splits are based on very small numbers of records. In
such cases, class difference is likely to be attributed to noise rather than predictor
information.
Stopping Tree Growth: Conditional Inference Trees
One can think of different criteria for stopping the tree growth before it starts
overfitting the data. Examples are tree depth (i.e., number of splits), minimum
number of records in a terminal node, and minimum reduction in impurity. In
R’s rpart(), for example, we can control the depth of the tree with the complexity
parameter (CP). The problem is that it is not simple to determine what is a good
stopping point using such rules.
Previous methods developed were based on the idea of recursive partition-
ing, using rules to prevent the tree from growing excessively and overfitting
the training data. One popular method called CHAID (chi-squared automatic
interaction detection) is a recursive partitioning method that predates classifica-
tion and regression tree (CART) procedures by several years and is widely used
in database marketing applications to this day. It uses a well-known statistical
test (the chi-square test for independence) to assess whether splitting a node
improves the purity by a statistically significant amount. In particular, at each
node, we split on the predictor with the strongest association with the outcome
variable. The strength of association is measured by the p-value of a chi-squared
test of independence. If for the best predictor the test does not show a sig-
nificant improvement, the split is not carried out, and the tree is terminated.
222 CLASSIFICATION AND REGRESSION TREES
This method is more suitable for categorical predictors, but it can be adapted to
continuous predictors by binning the continuous values into categorical bins.
A more general class of trees based on this idea is called conditional inference
trees (see Hothorn et al., 2006). The R implementation is given in the party and
partykit packages, and is suitable for both numerical and categorical outcome
and predictor variables.
Pruning the Tree
An alternative popular solution that has proven to be more successful than stop-
ping tree growth is pruning the full-grown tree. This is the basis of methods such
as CART (developed by Breiman et al., implemented in multiple data mining
software packages such as R’s rpart package, SAS Enterprise Miner, CART,
and MARS) and C4.5 (developed by Quinlan and implemented in packages such
as IBM SPSS Modeler). In C4.5, the training data are used both for growing
and pruning the tree. In CART, the innovation is to use the validation data to
prune back the tree that is grown from training data. CART and CART-like
procedures use validation data to prune back the tree that has deliberately been
overgrown using the training data.
The idea behind pruning is to recognize that a very large tree is likely to
overfit the training data, and that the weakest branches, which hardly reduce the
error rate, should be removed. In the mower example, the last few splits resulted
in rectangles with very few points (four rectangles in the full tree had a single
record). We can see intuitively that these last splits are likely just capturing noise
in the training set rather than reflecting patterns that would occur in future data,
such as the validation data. Pruning consists of successively selecting a decision
node and redesignating it as a terminal node [lopping off the branches extending
beyond that decision node (its subtree) and thereby reducing the size of the tree].
The pruning process trades off misclassification error in the validation dataset
against the number of decision nodes in the pruned tree to arrive at a tree that
captures the patterns—but not the noise—in the training data.
Cross-Validation
Pruning the tree with the validation data solves the problem of overfitting, but it
does not address the problem of instability. Recall that the CART algorithm may
be unstable in choosing one or another variable for the top-level splits, and this
effect then cascades down and produces highly variable rule sets. The solution
is to avoid relying on just one partition of the data into training and validation.
Rather, we do so repeatedly using cross-validation (see below), then pool the
results. Of course, just accumulating a set of different trees with their different
rules will not do much by itself. However, we can use the results from all those
trees to learn how deep to grow the original tree. In this process, we introduce
AVOIDING OVERFITTING 223
a parameter that can measure, and control, how deep we grow the tree. We will
note this parameter value for each minimum-error tree in the cross-validation
process, take an average, then apply that average to limit tree growth to this
optimal depth when working with new data.
The cost complexity (CC) of a tree is equal to its misclassification error
(based on the training data) plus a penalty factor for the size of the tree. For a
tree T that has L(T ) terminal nodes, the cost complexity can be written as
CC(T ) = err(T ) + αL(T ),
where err(T ) is the fraction of training records that are misclassified by tree T
and α is a penalty factor for tree size. When α = 0, there is no penalty for having
too many nodes in a tree, and this yields a tree using the cost complexity criterion
that is the full-grown unpruned tree. When we increase α to a very large value
the penalty cost component swamps the misclassification error component of
the cost complexity criterion, and the result is simply the tree with the fewest
terminal nodes: namely, the tree with one node. So there is a range of trees,
from tiny to large, corresponding to a range of α, from large to small.
Returning to the cross-validation process, we can now associate a value of
α with the minimum error tree developed in each iteration of that process.
Here is a simple version of the algorithm:
1. Partition the data into training and validation sets.
2. Grow the tree with the training data.
3. Prune it successively, step by step, recording CP (using the training data)
at each step.
4. Note the CP that corresponds to the minimum error on the validation
data.
5. Repartition the data into training and validation, and repeat the growing,
pruning and CP recording process.
6. Do this again and again, and average the CP’s that reflect minimum error
for each tree.
7. Go back to the original data, or future data, and grow a tree, stopping at
this optimum CP value.
Typically, cross-validation is done such that the partitions (also called “folds”)
used for validation are non-overlapping. R automatically does this cross-
validation process to select CP and build a default pruned tree, but the user
can, instead, specify an alternate value of CP (e.g., if you wanted to see what a
deeper tree looked like).
224 CLASSIFICATION AND REGRESSION TREES
In Table 9.4, we see the complexity-parameter table of cross-validation errors
for eight trees of increasing depth grown on the Universal Bank data. We can
simply choose the tree with the lowest cross-validation error (xerror). In this case,
the tree in row 6 has the lowest cross-validation error. Figure 9.12 displays the
pruned tree with 15 terminal nodes. This tree was pruned back from the largest
tree using the complexity parameter value CP = 0.003436426, which yielded
the lowest cross-validation error in Table 9.4.
TABLE 9.4 TABLE OF COMPLEXITY PARAMETER (CP) VALUES AND ASSOCIATED TREE ERRORS
code for tabulating tree error as a function of the complexity parameter (CP)
# argument xval refers to the number of folds to use in rpart’s built-in
# cross-validation procedure
# argument cp sets the smallest value for the complexity parameter.
cv.ct <- rpart(Personal.Loan ~ ., data = train.df, method = "class",
cp = 0.00001, minsplit = 5, xval = 5)
# use printcp() to print the table.
printcp(cv.ct)
Output
CP nsplit rel error xerror xstd
1 0.3350515 0 1.000000 1.00000 0.055705
2 0.1340206 2 0.329897 0.37457 0.035220
3 0.0154639 3 0.195876 0.19931 0.025917
4 0.0068729 7 0.134021 0.17182 0.024096
5 0.0051546 12 0.099656 0.17182 0.024096
6 0.0034364 14 0.089347 0.16838 0.023858
7 0.0022910 19 0.072165 0.17182 0.024096
8 0.0000100 25 0.058419 0.17182 0.024096
Best-Pruned Tree
A further enhancement, in the interest of model parsimony, is to incorporate
the sampling error which might cause this minimum to vary if we had a differ-
ent sample. The enhancement uses the estimated standard error of the cross-
validation error (xstd) to prune the tree even further; we can add one standard
error to the minimum xerror. This is sometimes called the Best-Pruned Tree. For
example, xstd for the tree in row 6 is 0.023858. We can choose a smaller tree by
going up to the row with a cross-validation error that is larger, but still within
one standard error—that is, xerror plus xstd (0.16838 + 0.023858 = 0.192238).
Here, the tree in row 4 is a smaller tree with the lowest xerror in this range. The
best-pruned tree for the loan acceptance example is shown in Figure 9.13. In
this case it coincides with the default tree (Figure 9.9).
AVOIDING OVERFITTING 225
In
co
m
e
<
1
14
C
C
A
vg
<
3
In
co
m
e
<
1
06 F
am
ily
<
3
.5 C
C
A
vg
>
=
1
.1
C
D
.A
cc
ou
nt
<
0
.5
In
co
m
e
<
9
2
A
ge
>
=
3
0
Fa
m
ily
<
2
.5
E
du
ca
tio
n
<
1
.5
In
co
m
e
<
1
04
A
ge
<
4
6
C
C
A
vg
>
=
4
M
or
tg
ag
e
<
1
96
E
du
ca
tio
n
<
1
.5
Fa
m
ily
<
2
.5
In
co
m
e
<
1
16
C
C
A
vg
<
2
.2
O
nl
in
e
>
=
0
.5
27
09
2
91
23
60
5
0
22
33
9
21
83
0
50
9
39
4
11
5
11
2
0
3
12
7
4
1
12
3
3
1
89
1
0
88
7
1
3
34
2
1
32
7
26
1
6
6
6
2
5
0
1
2
0
4
2
1
4
4
1
0
4
4
4
1
0
3
0
6
34
9
2
41
34
2
3
9
34
2
0
0
3
9
7
2
02
7
1
0
6
2
5
0
1
2
1
8
0
1
92
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
1
11
1
0
0
1
1
0
0
0
1
1
1
0
0
1
1
1
ye
s
no
F
IG
U
R
E
9
.1
2
PR
U
N
ED
CL
AS
SI
FI
CA
TI
ON
TR
EE
FO
R
TH
E
LO
AN
AC
CE
PT
AN
CE
DA
TA
U
SI
N
G
CP
TH
AT
YI
EL
DE
D
LO
W
ES
T
XE
RR
OR
IN
TA
BL
E
9.
4
code for pruning the tree
# prune by lower cp
pruned.ct <- prune(cv.ct,
cp = cv.ct$cptable[which.min(cv.ct$cptable[,"xerror"]),"CP"])
length(pruned.ct$frame$var[pruned.ct$frame$var == "
prp(pruned.ct, type = 1, extra = 1, split.font = 1, varlen = -10)
226 CLASSIFICATION AND REGRESSION TREES
Income < 114
CCAvg < 3
CD.Account < 0.5
Income < 92
Family < 2.5
Education < 1.5
Family < 2.5
2709 291
2360 50
2233 9
127 41
123 31
89 10
34 21
32 7 2 14
4 10
349 241
342 39
342 0 0 39
7 202
0
0
0
0
0
0
0
0 1
1
0
0
0 1
1
yes no
FIGURE 9.13 BEST-PRUNED TREE OBTAINED BY FITTING A FULL TREE TO THE TRAINING DATA,
PRUNING IT USING THE CROSS-VALIDATION DATA, AND CHOOSING THE
SMALLEST TREE WITHIN ONE STANDARD ERROR OF THE MINIMUM XERROR TREE
9.5 Classification Rules from Trees
As described in Section 9.1, classification trees provide easily understandable
classification rules (if the trees are not too large). Each terminal node is equivalent
to a classification rule. Returning to the example, the right-most terminal node
in the best-pruned tree (Figure 9.13) gives us the rule
IF (Income ≥ 114) AND (Education ≥ 1.5)
THEN Class = 1.
However, in many cases, the number of rules can be reduced by removing redun-
dancies. For example, consider the rule from the second-from-bottom-left-most
terminal node in Figure 9.13:
IF (Income < 114) AND (CCAvg ≥ 3) AND (CD.Account < 0.5)
AND (Income < 92)
THEN Class = 0
This rule can be simplified to
IF (Income < 92) AND (CCAvg ≥ 3) AND (CD.Account < 0.5)
THEN Class = 0
This transparency in the process and understandability of the algorithm that leads
to classifying a record as belonging to a certain class is very advantageous in
settings where the final classification is not the only thing of interest. Berry
CLASSIFICATION TREES FOR MORE THAN TWO CLASSES 227
and Linoff (2000) give the example of health insurance underwriting, where the
insurer is required to show that coverage denial is not based on discrimination.
By showing rules that led to denial (e.g., income < $20K AND low credit
history), the company can avoid law suits. Compared to the output of other
classifiers, such as discriminant functions, tree-based classification rules are easily
explained to managers and operating staff. Their logic is certainly far more
transparent than that of weights in neural networks!
9.6 Classification Trees for More Than Two
Classes
Classification trees can be used with an outcome variable that has more than
two classes. In terms of measuring impurity, the two measures presented earlier
(the Gini impurity index and the entropy measure) were defined for m classes
and hence can be used for any number of classes. The tree itself would have
the same structure, except that its terminal nodes would take one of the m–class
labels.
9.7 Regression Trees
The tree method can also be used for a numerical outcome variable. Regression
trees for prediction operate in much the same fashion as classification trees. The
outcome variable (Y ) is a numerical variable in this case, but both the principle
and the procedure are the same: Many splits are attempted, and for each, we
measure “impurity” in each branch of the resulting tree. The tree procedure
then selects the split that minimizes the sum of such measures. To illustrate
a regression tree, consider the example of predicting prices of Toyota Corolla
automobiles (from Chapter 6). The dataset includes information on 1000 sold
Toyota Corolla cars (We use the first 1000 cars from the dataset ToyotoCorolla.csv.
The goal is to find a predictive model of price as a function of 10 predictors
(including mileage, horsepower, number of doors, etc.). A regression tree for
these data was built using a training set of 600 records. The best-pruned tree is
shown in Figure 9.14.
We see that from the 12 input variables (including dummies), only three
predictors show up as useful for predicting price: the age of the car, its weight,
and horsepower.
Three details differ between regression trees and classification trees: pre-
diction, impurity measures, and evaluating performance. We describe these
next.
228 CLASSIFICATION AND REGRESSION TREES
Age_08_0 >= 32
Age_08_0 >= 54
Age_08_0 >= 44
Weight < 1258
HP < 98
11756
10386
9358
11696
10952 12332
18443
17811
16417 18882
24262
yes no
FIGURE 9.14 BEST-PRUNED REGRESSION TREE FOR TOYOTA COROLLA PRICES
Prediction
Predicting the outcome value for a record is performed in a fashion similar to the
classification case: The predictor information is used for “dropping” the record
down the tree until reaching a terminal node. For instance, to predict the price
of a Toyota Corolla with Age = 60, Horse_Power = 100, and Weight = 1200,
we drop it down the tree and reach the node that has the value $9358. This is
the price prediction for this car according to the tree. In classification trees, the
value of the terminal node (which is one of the categories) is determined by the
“voting” of the training records that were in that terminal node. In regression
trees, the value of the terminal node is determined by the average outcome value
of the training records that were in that terminal node. In the example above,
the value $9358 is the average of the 279 cars in the training set that fall in the
category of Age ≥ 54.
Measuring Impurity
We described two types of impurity measures for nodes in classification trees:
the Gini index and the entropy-based measure. In both cases, the index is a
function of the ratio between the categories of the records in that node. In
regression trees, a typical impurity measure is the sum of the squared deviations
from the mean of the terminal node. This is equivalent to the sum of the squared
errors, because the mean of the terminal node is exactly the prediction. In the
example above, the impurity of the node with the value $9358 is computed by
subtracting 9358 from the price of each of the 279 cars in the training set that
IMPROVING PREDICTION: RANDOM FORESTS AND BOOSTED TREES 229
fell in that terminal node, then squaring these deviations and summing them up.
The lowest impurity possible is zero, when all values in the node are equal.
Evaluating Performance
As stated above, predictions are obtained by averaging the outcome values in the
nodes. We therefore have the usual definition of predictions and errors. The
predictive performance of regression trees can be measured in the same way that
other predictive methods are evaluated (e.g., linear regression), using summary
measures such as RMSE.
9.8 Improving Prediction: Random Forests
and Boosted Trees
Notwithstanding the transparency advantages of a single tree as described above,
in a pure prediction application, where visualizing a set of rules does not matter,
better performance is provided by several extensions to trees that combine results
from multiple trees. These are examples of ensembles (see Chapter 13). One
popular multitree approach is random forests, introduced by Breiman and Cutler.1
Random forests are a special case of bagging, a method for improving predictive
power by combining multiple classifiers or prediction algorithms. See Chapter
13 for further details on bagging.
Random Forests
The basic idea in random forests is to:
1. Draw multiple random samples, with replacement, from the data (this
sampling approach is called the bootstrap).
2. Using a random subset of predictors at each stage, fit a classification (or
regression) tree to each sample (and thus obtain a “forest”).
3. Combine the predictions/classifications from the individual trees to
obtain improved predictions. Use voting for classification and averaging
for prediction.
The code and output in Figure 9.15 illustrates applying a random forest in R
to the personal loan example. The accuracy of the random forest is slightly
higher than the single default tree that we fit earlier (compare to the validation
performance in Table 9.3).
Unlike a single tree, results from a random forest cannot be displayed in a
tree-like diagram, thereby losing the interpretability that a single tree provides.
1For further details on random forests, see www.stat.berkeley.edu/users/breiman/RandomForests/cc_home.htm.
http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_home.htm
230 CLASSIFICATION AND REGRESSION TREES
code for running a random forest, plotting variable importance plot, and computing accuracy
library(randomForest)
## random forest
rf <- randomForest(as.factor(Personal.Loan) ~ ., data = train.df, ntree = 500,
mtry = 4, nodesize = 5, importance = TRUE)
## variable importance plot
varImpPlot(rf, type = 1)
## confusion matrix
rf.pred <- predict(rf, valid.df)
confusionMatrix(rf.pred, valid.df$Personal.Loan)
Partial Output
> confusionMatrix(rf.pred, valid.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1801 19
1 10 170
Accuracy : 0.986
Securities.Account
Online
CreditCard
Mortgage
Age
CD.Account
Experience
CCAvg
Family
Education
Income
0 20 40 60 80 100 120 140
rf
MeanDecreaseAccuracy
FIGURE 9.15 VARIABLE IMPORTANCE PLOT FROM RANDOM FOREST (PERSONAL LOAN
EXAMPLE)
However, random forests can produce “variable importance” scores, which mea-
sure the relative contribution of the different predictors. The importance score
for a particular predictor is computed by summing up the decrease in the Gini
index for that predictor over all the trees in the forest. Figure 9.15 shows the vari-
able importance plots generated from the random forest model for the personal
loan example. We see that Income and Education have the highest scores, with
IMPROVING PREDICTION: RANDOM FORESTS AND BOOSTED TREES 231
Family being third. Importance scores for the other predictors are considerably
lower.
Boosted Trees
The second type of multitree improvement is boosted trees. Here a sequence of
trees is fitted, so that each tree concentrates on misclassified records from the
previous tree.
1. Fit a single tree.
2. Draw a sample that gives higher selection probabilities to misclassified
records.
3. Fit a tree to the new sample.
4. Repeat Steps 2 and 3 multiple times.
5. Use weighted voting to classify records, with heavier weight for later
trees.
Table 9.5 shows the result of running a boosted tree on the loan acceptance
example that we saw earlier. We can see that compared to the performance of
the single pruned tree (Table 9.3), the boosted tree has better performance on
the validation data in terms of overall accuracy and especially in terms of cor-
rect classification of 1’s—the rare class of special interest. Where does boosting’s
special talent for finding 1’s come from? When one class is dominant (0’s con-
stitute over 90% of the data here), basic classifiers are tempted to classify cases
as belonging to the dominant class, and the 1’s in this case constitute most of
TABLE 9.5 BOOSTED TREE: CONFUSION MATRIX FOR THE VALIDATION SET (LOAN DATA)
code for running boosted trees
library(adabag)
library(rpart)
library(caret)
boost <- boosting(Personal.Loan ~ ., data = train.df)
pred <- predict(boost, valid.df)
confusionMatrix(pred$class, valid.df$Personal.Loan)
Output
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1805 15
1 6 174
Accuracy : 0.9895
232 CLASSIFICATION AND REGRESSION TREES
the misclassifications with the single best-pruned tree. The boosting algorithm
concentrates on the misclassifications (which are mostly 1’s), so it is naturally
going to do well in reducing the misclassification of 1’s (from 18 in the single
tree to 15 in the boosted tree, in the validation set).
9.9 Advantages and Weaknesses of a Tree
Tree methods are good off-the-shelf classifiers and predictors. They are also
useful for variable selection, with the most important predictors usually showing
up at the top of the tree. Trees require relatively little effort from users in the
following senses: First, there is no need for transformation of variables (any
monotone transformation of the variables will give the same trees). Second,
variable subset selection is automatic since it is part of the split selection. In the
loan example, note that the best-pruned tree has automatically selected just three
variables (Income, Education, and Family) out of the set of 14 variables available.
Trees are also intrinsically robust to outliers, since the choice of a split
depends on the ordering of values and not on the absolute magnitudes of these
values. However, they are sensitive to changes in the data, and even a slight
change can cause very different splits!
Unlike models that assume a particular relationship between the outcome
and predictors (e.g., a linear relationship such as in linear regression and linear
discriminant analysis), classification and regression trees are nonlinear and non-
parametric. This allows for a wide range of relationships between the predictors
and the outcome variable. However, this can also be a weakness: Since the
splits are done on one predictor at a time, rather than on combinations of pre-
dictors, the tree is likely to miss relationships between predictors, in particular
linear structures like those in linear or logistic regression models. Classifica-
tion trees are useful classifiers in cases where horizontal and vertical splitting of
the predictor space adequately divides the classes. But consider, for instance, a
dataset with two predictors and two classes, where separation between the two
classes is most obviously achieved by using a diagonal line (as shown in Figure
9.16). In such cases, a classification tree is expected to have lower performance
than methods such as discriminant analysis. One way to improve performance
is to create new predictors that are derived from existing predictors, which can
capture hypothesized relationships between predictors (similar to interactions in
regression models). Random forests are another solution in such situations.
Another performance issue with classification trees is that they require a large
dataset in order to construct a good classifier. From a computational aspect, trees
can be relatively expensive to grow, because of the multiple sorting involved
in computing all possible splits on every variable. Pruning the data using the
validation sets adds further computation time.
ADVANTAGES AND WEAKNESSES OF A TREE 233
FIGURE 9.16 SCATTER PLOT DESCRIBING A TWO-PREDICTOR CASE WITH TWO CLASSES. THE
BEST SEPARATION IS ACHIEVED WITH A DIAGONAL LINE, WHICH
CLASSIFICATION TREES CANNOT DO
Although trees are useful for variable selection, one challenge is that they
“favor” predictors with many potential split points. This includes categorical
predictors with many categories and numerical predictors with many different
values. Such predictors have a higher chance of appearing in a tree. One simplis-
tic solution is to combine multiple categories into a smaller set and bin numerical
predictors with many values. Alternatively, some special algorithms avoid this
problem by using a different splitting criterion [e.g., conditional inference trees
in the R package party—see Hothorn et al. (2006)—and QUEST classification
trees—see Loh and Shih (1997) and www.math.ccu.edu.tw/~yshih/quest.html].
An appealing feature of trees is that they handle missing data without having
to impute values or delete records with missing values. Finally, a very important
practical advantage of trees is the transparent rules that they generate. Such
transparency is often useful in managerial applications, though this advantage is
lost in the ensemble versions of trees (random forests, boosted trees).
http://www.math.ccu.edu.tw/~yshih/quest.html]
234 CLASSIFICATION AND REGRESSION TREES
PROBLEMS
9.1 Competitive Auctions on eBay.com. The file eBayAuctions.csv contains informa-
tion on 1972 auctions that transacted on eBay.com during May–June 2004. The goal
is to use these data to build a model that will classify auctions as competitive or non-
competitive. A competitive auction is defined as an auction with at least two bids placed
on the item auctioned. The data include variables that describe the item (auction cat-
egory), the seller (his/her eBay rating), and the auction terms that the seller selected
(auction duration, opening price, currency, day-of-week of auction close). In addi-
tion, we have the price at which the auction closed. The task is to predict whether or
not the auction will be competitive.
Data Preprocessing. Convert variable Duration into a categorical variable. Split the
data into training (60%) and validation (40%) datasets.
a. Fit a classification tree using all predictors, using the best-pruned tree. To avoid
overfitting, set the minimum number of records in a terminal node to 50 (in R:
minbucket = 50). Also, set the maximum number of levels to be displayed at seven
(in R: maxdepth = 7).Write down the results in terms of rules. (Note: If you had to
slightly reduce the number of predictors due to software limitations, or for clarity
of presentation, which would be a good variable to choose?)
b. Is this model practical for predicting the outcome of a new auction?
c. Describe the interesting and uninteresting information that these rules provide.
d. Fit another classification tree (using the best-pruned tree, with a minimum number
of records per terminal node = 50 and maximum allowed number of displayed levels
= 7), this time only with predictors that can be used for predicting the outcome of
a new auction. Describe the resulting tree in terms of rules. Make sure to report
the smallest set of rules required for classification.
e. Plot the resulting tree on a scatter plot: Use the two axes for the two best (quan-
titative) predictors. Each auction will appear as a point, with coordinates corre-
sponding to its values on those two predictors. Use different colors or symbols
to separate competitive and noncompetitive auctions. Draw lines (you can sketch
these by hand or use R) at the values that create splits. Does this splitting seem
reasonable with respect to the meaning of the two predictors? Does it seem to do
a good job of separating the two classes?
f. Examine the lift chart and the confusion matrix for the tree. What can you say
about the predictive performance of this model?
g. Based on this last tree, what can you conclude from these data about the chances
of an auction obtaining at least two bids and its relationship to the auction settings
set by the seller (duration, opening price, ending day, currency)? What would you
recommend for a seller as the strategy that will most likely lead to a competitive
auction?
9.2 Predicting Delayed Flights. The file FlightDelays.csv contains information on all
commercial flights departing the Washington, DC area and arriving at New York
during January 2004. For each flight, there is information on the departure and arrival
airports, the distance of the route, the scheduled time and date of the flight, and so
on. The variable that we are trying to predict is whether or not a flight is delayed. A
delay is defined as an arrival that is at least 15 minutes later than scheduled.
PROBLEMS 235
Data Preprocessing. Transform variable day of week (DAY_WEEK) info a cate-
gorical variable. Bin the scheduled departure time into eight bins (in R use function
cut()). Use these and all other columns as predictors (excluding DAY_OF_MONTH).
Partition the data into training and validation sets.
a. Fit a classification tree to the flight delay variable using all the relevant predictors.
Do not include DEP_TIME (actual departure time) in the model because it is
unknown at the time of prediction (unless we are generating our predictions of
delays after the plane takes off, which is unlikely). Use a pruned tree with maximum
of 8 levels, setting cp = 0.001. Express the resulting tree as a set of rules.
b. If you needed to fly between DCA and EWR on a Monday at 7:00 AM, would you
be able to use this tree? What other information would you need? Is it available in
practice? What information is redundant?
c. Fit the same tree as in (a), this time excluding the Weather predictor. Display both
the pruned and unpruned tree. You will find that the pruned tree contains a single
terminal node.
i. How is the pruned tree used for classification? (What is the rule for classifying?)
ii. To what is this rule equivalent?
iii. Examine the unpruned tree. What are the top three predictors according to
this tree?
iv. Why, technically, does the pruned tree result in a single node?
v. What is the disadvantage of using the top levels of the unpruned tree as opposed
to the pruned tree?
vi. Compare this general result to that from logistic regression in the example in
Chapter 10. What are possible reasons for the classification tree’s failure to find
a good predictive model?
9.3 Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.csv
contains the data on used cars (Toyota Corolla) on sale during late summer of 2004
in the Netherlands. It has 1436 records containing details on 38 attributes, including
Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price
of a used Toyota Corolla based on its specifications. (The example in Section 9.7 is a
subset of this dataset).
Data Preprocessing. Split the data into training (60%), and validation (40%) datasets.
a. Run a regression tree (RT) with outcome variable Price and predictors Age_08_04,
KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee,
Guarantee_Period, Airco, Automatic_Airco, CD_Player, Powered_Windows,
Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal
node to 1, maximum number of tree levels to 100, and cp = 0.001, to make the
run least restrictive.
i. Which appear to be the three or four most important car specifications for
predicting the car’s price?
ii. Compare the prediction errors of the training and validation sets by examining
their RMS error and by plotting the two boxplots. What is happening with the
training set predictions? How does the predictive performance of the validation
set compare to the training set? Why does this occur?
iii. How can we achieve predictions for the training set that are not equal to the
actual prices?
236 CLASSIFICATION AND REGRESSION TREES
iv. Prune the full tree using the cross-validation error. Compared to the full tree,
what is the predictive performance for the validation set?
b. Let us see the effect of turning the price variable into a categorical variable. First,
create a new variable that categorizes price into 20 bins. Now repartition the data
keeping Binned_Price instead of Price. Run a classification tree with the same set
of input variables as in the RT, and with Binned_Price as the output variable. Keep
the minimum number of records in a terminal node to 1.
i. Compare the tree generated by the CT with the one generated by the RT. Are
they different? (Look at structure, the top predictors, size of tree, etc.) Why?
ii. Predict the price, using the RT and the CT, of a used Toyota Corolla with the
specifications listed in Table 9.6.
TABLE 9.6 SPECIFICATIONS FOR A PARTICULAR
TOYOTA COROLLA
Variable Value
Age_-08_-04 77
KM 117,000
Fuel_Type Petrol
HP 110
Automatic No
Doors 5
Quarterly_Tax 100
Mfg_Guarantee No
Guarantee_Period 3
Airco Yes
Automatic_Airco No
CD_Player No
Powered_Windows No
Sport_Model No
Tow_Bar Yes
iii. Compare the predictions in terms of the predictors that were used, the mag-
nitude of the difference between the two predictions, and the advantages and
disadvantages of the two methods.
CHAPTER 10
Logistic Regression
In this chapter, we describe the highly popular and powerful classification
method called logistic regression. Like linear regression, it relies on a specific
model relating the predictors with the outcome. The user must specify the pre-
dictors to include as well as their form (e.g., including any interaction terms).
This means that even small datasets can be used for building logistic regression
classifiers, and that once the model is estimated, it is computationally fast and
cheap to classify even large samples of new records. We describe the logistic
regression model formulation and its estimation from data. We also explain the
concepts of “logit,” “odds,” and “probability” of an event that arise in the logistic
model context and the relations among the three. We discuss variable importance
using coefficient and statistical significance and also mention variable selection
algorithms for dimension reduction. Our presentation is strictly from a data
mining perspective, where classification is the goal and performance is evaluated
on a separate validation set. However, because logistic regression is also heav-
ily used in statistical analyses for purposes of inference, we give a brief review
of key concepts related to coefficient interpretation, goodness-of-fit evaluation,
inference, and multiclass models in the Appendix at the end of this chapter.
10.1 Introduction
Logistic regression extends the ideas of linear regression to the situation where
the outcome variable, Y , is categorical. We can think of a categorical variable as
dividing the records into classes. For example, if Y denotes a recommendation
on holding/selling/buying a stock, we have a categorical variable with three
categories. We can think of each of the stocks in the dataset (the records) as
belonging to one of three classes: the hold class, the sell class, and the buy class.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
237
238 LOGISTIC REGRESSION
Logistic regression can be used for classifying a new record, where its class is
unknown, into one of the classes, based on the values of its predictor variables
(called classification). It can also be used in data where the class is known, to
find factors distinguishing between records in different classes in terms of their
predictor variables, or “predictor profile” (called profiling). Logistic regression is
used in applications such as
1. Classifying customers as returning or nonreturning (classification)
2. Finding factors that differentiate between male and female top executives
(profiling)
3. Predicting the approval or disapproval of a loan based on information such
as credit scores (classification)
The logistic regression model is used in a variety of fields: whenever a
structured model is needed to explain or predict categorical (in particular,
binary) outcomes. One such application is in describing choice behavior in
econometrics.
In this chapter, we focus on the use of logistic regression for classification.
We deal only with a binary outcome variable having two possible classes. In
the Appendix, we show how the results can be extended to the case where Y
assumes more than two possible classes. Popular examples of binary outcomes
are success/failure, yes/no, buy/don’t buy, default/don’t default, and survive/die.
For convenience, we often code the values of the binary outcome variable Y as
0 and 1.
Note that in some cases we may choose to convert a continuous outcome
variable or an outcome variables with multiple classes into a binary outcome vari-
able for purposes of simplification, reflecting the fact that decision-making may
be binary (approve the loan/don’t approve, make an offer/don’t make an offer).
As with multiple linear regression, the predictor variables X1, X2, . . . , Xk may
be categorical variables, continuous variables, or a mixture of these two types.
While in multiple linear regression the aim is to predict the value of the contin-
uous Y for a new record, in logistic regression the goal is to predict which class a
new record will belong to, or simply to classify the record into one of the classes.
In the stock example, we would want to classify a new stock into one of the three
recommendation classes: sell, hold, or buy. Or, we might want to compute for
a new record its propensity (= the probability) to belong to each class, and then
possibly rank a set of new records from highest to lowest propensity in order to
act on those with the highest propensity.
In logistic regression, we take two steps: the first step yields estimates of the
propensities or probabilities of belonging to each class. In the binary case, we get
an estimate of p = P(Y = 1), the probability of belonging to class 1 (which also
tells us the probability of belonging to class 0). In the next step, we use a cutoff
THE LOGISTIC REGRESSION MODEL 239
value on these probabilities in order to classify each case into one of the classes.
For example, in a binary case, a cutoff of 0.5 means that cases with an estimated
probability of P(Y = 1) ≥ 0.5 are classified as belonging to class 1, whereas cases
with P(Y = 1) < 0.5 are classified as belonging to class 0. This cutoff does not
need to be set at 0.5. When the event in question is a low probability but notable
or important event (say, 1 = fraudulent transaction), a lower cutoff may be used
to classify more cases as belonging to class 1.
10.2 The Logistic Regression Model
The idea behind logistic regression is straightforward: Instead of using Y directly
as the outcome variable, we use a function of it, which is called the logit. The
logit, it turns out, can be modeled as a linear function of the predictors. Once
the logit has been predicted, it can be mapped back to a probability.
To understand the logit, we take several intermediate steps: First, we look
at p = P(Y = 1), the probability of belonging to class 1 (as opposed to class 0).
In contrast to the binary variable Y , which only takes the values 0 and 1, p can
take any value in the interval [0, 1]. However, if we express p as a linear function
of the q predictors1 in the form
p = β0 + β1x1 + β2x2 + · · ·+ βqxq, (10.1)
it is not guaranteed that the right-hand side will lead to values within the interval
[0, 1]. The solution is to use a nonlinear function of the predictors in the form
p =
1
1 + e−(β0+β1x1+β2x2+···+βqxq)
. (10.2)
This is called the logistic response function. For any values x1, . . . , xq, the right-
hand side will always lead to values in the interval [0, 1]. Next, we look at a
different measure of belonging to a certain class, known as odds. The odds of
belonging to class 1 are defined as the ratio of the probability of belonging to class 1
to the probability of belonging to class 0:
Odds(Y = 1) =
p
1− p
. (10.3)
This metric is very popular in horse races, sports, gambling, epidemiology, and
other areas. Instead of talking about the probability of winning or contacting a
disease, people talk about the odds of winning or contacting a disease. How are
these two different? If, for example, the probability of winning is 0.5, the odds
1Unlike elsewhere in the book, where p denotes the number of predictors, in this chapter we use q,
to avoid confusion with the probability p.
240 LOGISTIC REGRESSION
of winning are 0.5/0.5 = 1. We can also perform the reverse calculation: Given
the odds of an event, we can compute its probability by manipulating equation
(10.3):
p =
odds
1 + odds
. (10.4)
Substituting (10.2) into (10.4), we can write the relationship between the odds
and the predictors as
Odds(Y = 1) = eβ0+β1x1+β2x2+···+βqxq . (10.5)
This last equation describes a multiplicative (proportional) relationship between
the predictors and the odds. Such a relationship is interpretable in terms of
percentages, for example, a unit increase in predictor Xj is associated with an
average increase of βj×100% in the odds (holding all other predictors constant).
Now, if we take a natural logarithm2 on both sides, we get the standard
formulation of a logistic model:
log(odds) = β0 + β1x1 + β2x2 + · · ·+ βqxq. (10.6)
The log(odds), called the logit, takes values from −∞ (very low odds) to ∞
(very high odds).3 A logit of 0 corresponds to even odds of 1 (probability =
0.5). Thus, our final formulation of the relation between the outcome and the
predictors uses the logit as the outcome variable and models it as a linear function
of the q predictors.
To see the relationship between the probability, odds, and logit of belonging
to class 1, look at Figure 10.1, which shows the odds (top) and logit (bottom) as
a function of p. Notice that the odds can take any non-negative value, and that
the logit can take any real value.
10.3 Example: Acceptance of Personal Loan
Recall the example described in Chapter 9 of acceptance of a personal loan
by Universal Bank. The bank’s dataset includes data on 5000 customers. The
data include the customer’s response to the last personal loan campaign (Personal
Loan), as well as customer demographic information (Age, Income, etc.) and
the customer’s relationship with the bank (mortgage, securities account, etc.).
See Table 10.1. Among these 5000 customers, only 480 (= 9.6%) accepted the
personal loan offered to them in a previous campaign. The goal is to build a
model that identifies customers who are most likely to accept the loan offer in
future mailings.
2The natural logarithm function is typically denoted ln() or log(). In this book, we use log().
3We use the terms odds and odds(Y = 1) interchangeably.
EXAMPLE: ACCEPTANCE OF PERSONAL LOAN 241
FIGURE 10.1 (a) ODDS AND (b) LOGIT AS A FUNCTION OF p
Model with a Single Predictor
Consider first a simple logistic regression model with just one predictor. This is
conceptually analogous to the simple linear regression model in which we fit a
straight line to relate the outcome, Y , to a single predictor, X .
Let us construct a simple logistic regression model for classification of cus-
tomers using the single predictor Income. The equation relating the outcome
TABLE 10.1 DESCRIPTION OF PREDICTORS FOR ACCEPTANCE OF PERSONAL LOAN EXAMPLE
Age Customer’s age in completed years
Experience Number of years of professional experience
Income Annual income of the customer ($000s)
Family Size Family size of the customer
CCAvg Average spending on credit cards per month ($000s)
Education Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage Value of house mortgage if any ($000s)
Securities Account Coded as 1 if customer has securities account with bank
CD Account Coded as 1 if customer has certificate of deposit (CD) account with bank
Online Banking Coded as 1 if customer uses Internet banking facilities
Credit Card Coded as 1 if customer uses credit card issued by Universal Bank
242 LOGISTIC REGRESSION
variable to the predictor in terms of probabilities is
P(Personal Loan = Yes | Income = x) = 1
1 + e−(β0+β1x)
,
or equivalently, in terms of odds,
Odds(Personal Loan = Yes | Income = x) = eβ0+β1x. (10.7)
The estimated coefficients for the model are b0 = −6.16715 and b1 =
0.03757. So the fitted model is
P(Personal Loan = Yes | Income = x) = 1
1 + e6.16715−0.03757x
. (10.8)
Although logistic regression can be used for prediction in the sense that
we predict the probability of a categorical outcome, it is most often used for
classification. To see the difference between the two, consider predicting the
probability of a customer accepting the loan offer as opposed to classifying the
customer as an acceptor/nonacceptor. From Figure 10.2, it can be seen that
the loan acceptance probabilities produced by the logistic regression model (the
s-shaped curve in Figure 10.2) can yield values between 0 and 1. To end up
with classifications into either 1 or 0 (e.g., a customer either accepts the loan
offer or not), we need a threshold, or cutoff value (see section on “Propensities
and Cutoff for Classification” in Chapter 5). This is true in the case of multiple
predictor variables as well.
In the Universal Bank example, in order to classify a new customer as an
acceptor/nonacceptor of the loan offer, we use the information on his/her
FIGURE 10.2 PLOT OF DATA POINTS (PERSONAL LOAN AS A FUNCTION OF INCOME) AND THE
FITTED LOGISTIC CURVE
EXAMPLE: ACCEPTANCE OF PERSONAL LOAN 243
income by plugging it into the fitted equation in (10.8). This yields an esti-
mated probability of accepting the loan offer. We then compare it to the cutoff
value. The customer is classified as an acceptor if the probability of his/her
accepting the offer is above the cutoff.4
Estimating the Logistic Model from Data: Computing Parameter Estimates
In logistic regression, the relation between Y and the β parameters is nonlinear.
For this reason, the β parameters are not estimated using the method of least
squares (as in multiple linear regression). Instead, a method called maximum
likelihood is used. The idea, in brief, is to find the estimates that maximize
the chance of obtaining the data that we have. This requires iterations using a
computer program.5
Algorithms to compute the coefficient estimates are less robust than algo-
rithms for linear regression. Computed estimates are generally reliable for well-
behaved datasets where the number of records with outcome variable values of
both 0 and 1 are large; their ratio is “not too close” to either 0 or 1; and when
the number of coefficients in the logistic regression model is small relative to
the sample size (say, no more than 10%). As with linear regression, collinearity
(strong correlation among the predictors) can lead to computational difficulties.
Computationally intensive algorithms have been developed recently that circum-
vent some of these difficulties. For technical details on the maximum likelihood
estimation in logistic regression, see Hosmer and Lemeshow (2000).
To illustrate a typical output from such a procedure, we fit a logistic model
to the training set of 3000 Universal Bank customers. The outcome variable
is Personal Loan, with Yes defined as the success (this is equivalent to setting the
outcome variable to 1 for an acceptor and 0 for a nonacceptor).
Data Preprocessing We start by converting predictor variable Education
into a factor variable. In the dataset, it is coded as an integer, taking on val-
ues 1, 2, or 3. To turn it into a factor variable, we use the R function factor().
Then when we include this predictor variable in R’s logistic regression, it will
4Here we compared the probability to a cutoff c. If we prefer to look at odds of accepting rather
than the probability, an equivalent method is to use the equation in (10.7) and compare the odds to
c/(1 − c). If the odds are higher than this number, the customer is classified as an acceptor. If it is
lower, we classify the customer as a nonacceptor.
5The method of maximum likelihood ensures good asymptotic (large sample) properties for the esti-
mates. Under very general conditions, maximum likelihood estimators are: (1) Consistent—The prob-
ability of the estimator differing from the true value approaches zero with increasing sample size,
(2) Asymptotically efficient—The variance is the smallest possible among consistent estimators, and (3)
Asymptotically normally distributed—This allows us to compute confidence intervals and perform statis-
tical tests in a manner analogous to the analysis of multiple linear regression models, provided that the
sample size is large.
244 LOGISTIC REGRESSION
automatically create two dummy variables from the factor’s three levels. The
logistic regression will only use two of the three levels because using all three
would create a multicollinearity issue (see Chapter 6). In total, the logistic regres-
sion function in R will include 6 = 2 + 1 + 1 + 1 + 1 dummy variables to
describe the five categorical predictors from Table 10.1. Together with the six
numerical predictors, we have a total of 12 predictors.
Next, we partition the data randomly into training (60%) and validation
(40%) sets. We use the training set to fit a logistic regression model and the
validation set to assess the model’s performance.
Estimated Model Table 10.2 presents the output from running a logistic
regression using the 12 predictors on the training data.
Ignoring p-values for the coefficients, a model based on all 12 predictors has
the estimated logistic equation
Logit(Personal Loan = Yes) = (10.9)
−12.6806− 0.0369 Age + 0.0491 Experience
+0.0613 Income + 0.5435 Family + 0.2166 CCAvg
+4.2681 EducationGraduate + 4.4408 EducationAdvanced/Professional
+0.0015 Mortgage − 1.1457 Securities.Account + 4.5856 CD.Account
−0.8588 Online − 1.2514 Credit Card
The positive coefficients for the dummy variables EducationGraduate,
EducProf, and CD.Account mean that holding a CD account and having graduate
or professional education (all marked by 1 in the dummy variables) are associated
with higher probabilities of accepting the loan offer. In contrast, having a secu-
rities account, using online banking, and owning a Universal Bank credit card
are associated with lower acceptance rates. For the continuous predictors, posi-
tive coefficients indicate that a higher value on that predictor is associated with
a higher probability of accepting the loan offer (e.g., Income: higher-income
customers tend more to accept the offer). Similarly, negative coefficients indi-
cate that a higher value on that predictor is associated with a lower probability
of accepting the loan offer (e.g., Age: older customers are less likely to accept
the offer).
Interpreting Results in Terms of Odds (for a Profiling Goal)
Logistic models, when they are appropriate for the data, can give useful informa-
tion about the roles played by different predictor variables. For example, suppose
we want to know how increasing family income by one unit will affect the prob-
ability of loan acceptance. This can be found straightforwardly if we consider
not probabilities, but odds.
EXAMPLE: ACCEPTANCE OF PERSONAL LOAN 245
TABLE 10.2 LOGISTIC REGRESSION MODEL FOR LOAN ACCEPTANCE (TRAINING DATA)
code for fitting a logistic regression model
bank.df <- read.csv("UniversalBank.csv")
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.
# treat Education as categorical (R will create dummy variables)
bank.df$Education <- factor(bank.df$Education, levels = c(1, 2, 3),
labels = c("Undergrad", "Graduate", "Advanced/Professional"))
# partition data
set.seed(2)
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
# run logistic regression
# use glm() (general linear model) with family = "binomial" to fit a logistic
# regression.
logit.reg <- glm(Personal.Loan ~ ., data = train.df, family = "binomial")
options(scipen=999)
summary(logit.reg)
Output
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.6805628 2.2903370 -5.537 0.0000000308 ***
Age -0.0369346 0.0848937 -0.435 0.66351
Experience 0.0490645 0.0844410 0.581 0.56121
Income 0.0612953 0.0039762 15.416 < 0.0000000000000002 ***
Family 0.5434657 0.0994936 5.462 0.0000000470 ***
CCAvg 0.2165942 0.0601900 3.599 0.00032 ***
EducationGraduate 4.2681068 0.3703378 11.525 < 0.0000000000000002 ***
EducationAdvanced/Professional 4.4408154 0.3723360 11.927 < 0.0000000000000002 ***
Mortgage 0.0015499 0.0007926 1.955 0.05052 .
Securities.Account -1.1457476 0.3955796 -2.896 0.00377 **
CD.Account 4.5855656 0.4777696 9.598 < 0.0000000000000002 ***
Online -0.8588074 0.2191217 -3.919 0.0000888005 ***
CreditCard -1.2514213 0.2944767 -4.250 0.0000214111 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1901.71 on 2999 degrees of freedom
Residual deviance: 682.19 on 2987 degrees of freedom
AIC: 708.19
Number of Fisher Scoring iterations: 8
Recall that the odds are given by
Odds = eβ0+β1x1+β2x2+···+βqxq .
At first, let us return to the single predictor example, where we model a cus-
tomer’s acceptance of a personal loan offer as a function of his/her income:
Odds(Personal Loan = Yes | Income) = eβ0+β1Income.
246 LOGISTIC REGRESSION
We can think of the model as a multiplicative model of odds. The odds
that a customer with income zero will accept the loan is estimated by
e−6.16715+(0.03757)(0) = 0.0021. These are the base case odds. In this exam-
ple, it is obviously economically meaningless to talk about a zero income; the
value zero and the corresponding base-case odds could be meaningful, how-
ever, in the context of other predictors. The odds of accepting the loan with an
income of $100K will increase by a multiplicative factor of e(0.039)(100) = 42.8
over the base case, so the odds that such a customer will accept the offer are
e−6.16715+(0.03757)(100) = 0.0898.
Suppose that the value of Income, or in general X1, is increased by one unit
from x1 to x1 + 1, while the other predictors are held at their current value
(x2, . . . , x12). We get the odds ratio
odds(x1 + 1, x2, . . . , x12)
odds(x1, . . . , x12)
=
eβ0+β1(x1+1)+β2x2+···+β12x12
eβ0+β1x1+β2x2+···+β12x12
= eβ1 .
This tells us that a single unit increase in X1, holding X2, . . . , X12 constant,
is associated with an increase in the odds that a customer accepts the offer by a
factor of eβ1 . In other words, eβ1 is the multiplicative factor by which the odds
(of belonging to class 1) increase when the value of X1 is increased by 1 unit,
holding all other predictors constant. If β1 < 0, an increase in X1 is associated with
a decrease in the odds of belonging to class 1, whereas a positive value of β1 is
associated with an increase in the odds.
When a predictor is a dummy variable, the interpretation is technically the
same but has a different practical meaning. For instance, the coefficient for
CD.Account was estimated from the data to be 4.023356. Recall that the refer-
ence group is customers not holding a CD account. We interpret this coefficient
as follows: e4.023356 = 55.9 are the odds that a customer who has a CD account
will accept the offer relative to a customer who does not have a CD account,
holding all other variables constant. This means that customers who hold CD
accounts at Universal Bank are more likely to accept the offer than customers
without a CD account (holding all other variables constant).
The advantage of reporting results in odds as opposed to probabilities is that
statements such as those above are true for any value of X1. Unless X1 is a
dummy variable, we cannot apply such statements about the effect of increasing
X1 by a single unit to probabilities. This is because the result depends on the
actual value of X1. So if we increase X1 from, say, 3 to 4, the effect on p,
the probability of belonging to class 1, will be different than if we increase X1
from 30 to 31. In short, the change in the probability, p, for a unit increase
in a particular predictor variable, while holding all other predictors constant, is
not a constant—it depends on the specific values of the predictor variables. We
therefore talk about probabilities only in the context of specific records.
EVALUATING CLASSIFICATION PERFORMANCE 247
10.4 Evaluating Classification Performance
The general measures of performance that were described in Chapter 5 are used
to assess the logistic model performance. Recall that there are several perfor-
mance measures, the most popular being those based on the confusion matrix
(accuracy alone or combined with costs) and the lift chart. As in other classifi-
cation methods, the goal is to find a model that accurately classifies records to
their class, using only the predictor information. A variant of this goal is ranking,
or finding a model that does a superior job of identifying the members of a par-
ticular class of interest for a set of new records (which might come at some cost
to overall accuracy). Since the training data are used for selecting the model, we
expect the model to perform quite well for those data, and therefore prefer to
test its performance on the validation set. Recall that the data in the validation
set were not involved in the model building, and thus we can use them to test
the model’s ability to classify data that it has not “seen” before.
To obtain the confusion matrix from a logistic regression analysis, we use the
estimated equation to predict the probability of class membership (the propensi-
ties) for each record in the validation set, and use the cutoff value to decide on
the class assignment of these records. We then compare these classifications to
the actual class memberships of these records. In the Universal Bank case, we
use the estimated model in equation (10.10) to predict the probability of offer
acceptance in a validation set that contains 2000 customers (these data were not
used in the modeling step). Technically, this is done by predicting the logit using
the estimated model in equation (10.10) and then obtaining the probabilities p
through the relation p = elogit/1 + elogit. We then compare these probabilities
to our chosen cutoff value in order to classify each of the 2000 validation records
as acceptors or nonacceptors.
Table 10.3 shows propensities for the first 5 records in the validation set.
Suppose that we use a cutoff of 0.5. We see that the first three customers have a
probability of accepting the offer that is lower than the cutoff of 0.5, and there-
fore they are classified as nonacceptors (0). And indeed, they were nonacceptors
(actual = 0). The fourth and fifth customers’ probability of acceptance is esti-
mated by the model to exceed 0.5, and they are therefore classified as acceptors
(1). While the fourth customer was indeed an acceptor (actual = 1), our model
misclassified the fifth customer as an acceptor, when in fact s/he was a nonac-
ceptor (actual = 0).
Another useful tool for assessing model classification performance are the lift
(gains) chart and decile-wise lift chart (see Chapter 5). Figure 10.3 illustrates the
lift chart obtained for the personal loan offer logistic model using the validation
set. The “lift” over the base curve indicates for a given number of cases (read on
the x-axis), the additional responders that you can identify by using the model.
The same information is portrayed in in Figure 10.3: Taking the 10% of the
248 LOGISTIC REGRESSION
TABLE 10.3 PROPENSITIES FOR THE FIRST FIVE CUSTOMERS IN VALIDATION DATA
code for using logistic regression to generate predicted probabilities
# use predict() with type = "response" to compute predicted probabilities.
logit.reg.pred <- predict(logit.reg, valid.df[, -8], type = "response")
# first 5 actual and predicted records
data.frame(actual = valid.df$Personal.Loan[1:5], predicted = logit.reg.pred[1:5])
Output
> data.frame(actual = valid.df$Personal.Loan[1:5],
+ predicted = logit.reg.pred[1:5])
actual predicted
2 0 0.00002707663
6 0 0.00326343313
9 0 0.03966293189
10 1 0.98846040544
11 0 0.59933974797
records that are ranked by the model as “most probable 1’s” yields 7.9 times as
many 1’s as would simply selecting 10% of the records at random.
Variable Selection
The next step includes searching for alternative models. One option is to look
for simpler models by trying to reduce the number of predictors used. We can
also build more complex models that reflect interactions among predictors by
creating and including new variables that are derived from the predictors. For
example, if we hypothesize that there is an interactive effect between income and
family size, we should add an interaction term of the form Income × Family.
The choice among the set of alternative models is guided primarily by perfor-
mance on the validation data. For models that perform roughly equally well,
simpler models are generally preferred over more complex models. Note also
that performance on validation data may be overly optimistic when it comes to
predicting performance on data that have not been exposed to the model at all.
This is because when the validation data are used to select a final model among a
set of model, we are selecting based on how well the model performs with those
data and therefore may be incorporating some of the random idiosyncrasies of
the validation data into the judgment about the best model. The model still may
be the best for the validation data among those considered, but it will probably
not do as well with the unseen data. Therefore, it is useful to evaluate the chosen
model on a new test set to get a sense of how well it will perform on new data. In
addition, one must consider practical issues such as costs of collecting variables,
error-proneness, and model complexity in the selection of the final model.
EVALUATING CLASSIFICATION PERFORMANCE 249
code for creating lift chart and decile-wise lift chart
library(gains)
gain <- gains(valid.df$Personal.Loan, logit.reg.pred, groups=length(logit.reg.pred))
# plot lift chart
plot(c(0,gain$cume.pct.of.total*sum(valid.df$Personal.Loan))~c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$Personal.Loan))~c(0, dim(valid.df)[1]), lty=2)
# compute deciles and plot decile-wise chart
heights <- gain$mean.resp/mean(valid.df$Personal.Loan)
midpoints <- barplot(heights, names.arg = gain$depth, ylim = c(0,9),
xlab = "Percentile", ylab = "Mean Response", main = "Decile-wise lift chart")
# add labels to columns
text(midpoints, heights+0.5, labels=round(heights, 1), cex = 0.8)
0 500 1000 1500 2000
0
50
10
0
15
0
# cases
C
um
ul
at
iv
e
10 20 30 40 50 60 70 80 90 100
Decile-wise lift chart
Percentile
M
ea
n
R
es
po
ns
e
0
2
4
6
8
7.9
1.1
0.4 0.3 0.1 0 0.1 0.1 0.1 0
FIGURE 10.3 LIFT CHART AND DECILE-WISE LIFT CHART FOR THE VALIDATION DATA FOR
UNIVERSAL BANK LOAN OFFER
250 LOGISTIC REGRESSION
As in linear regression, in logistic regression we can use automated variable
selection heuristics such as stepwise selection, forward selection, and backward
elimination (See Section 6.4 in Chapter 6.) In R, use function step() in the
stats package or function stepAIC() in the MASS package) for stepwise, forward,
and backward elimination. If the dataset is not too large, we can even try an
exhaustive search over all possible models (use R function glmulti() in package
glmulti, although it can be slow).
10.5 Example of Complete Analysis:
Predicting Delayed Flights
Predicting flight delays can be useful to a variety of organizations: airport author-
ities, airlines, aviation authorities. At times, joint task forces have been formed
to address the problem. Such an organization, if it were to provide ongoing
real-time assistance with flight delays,would benefit from some advance notice
about flights likely to be delayed.
In this simplified illustration, we look at six predictors (see Table 10.4). The
outcome of interest is whether the flight is delayed or not (delayed means more
than 15 minutes late). Our data consist of all flights from the Washington, DC
area into the New York City area during January 2004. The percent of delayed
flights among these 2201 flights is 19.5%. The data were obtained from the
Bureau of Transportation Statistics website (www.transtats.bts.gov).
The goal is to predict accurately whether a new flight, not in this dataset,
will be delayed or not. The outcome variable is a variable called Flight Status,
coded as delayed or ontime.
Other information available on the website, such as distance and arrival time,
is irrelevant because we are looking at a certain route (distance, flight time, etc.
should be approximately equal for all flights in the data). A sample of the data for
20 flights is shown in Table 10.5. Figures 10.4 and 10.5 show visualizations of
TABLE 10.4 DESCRIPTION OF PREDICTORS FOR FLIGHT DELAYS EXAMPLE
Day of Week Coded as 1 = Monday, 2 = Tuesday,..., 7 = Sunday
Departure Time Broken down into 18 intervals between 6:00 AM and 10:00 PM
Origin Three airport codes: DCA (Reagan National), IAD (Dulles),
BWI (Baltimore–Washington Int’l)
Destination Three airport codes: JFK (Kennedy), LGA (LaGuardia),
EWR (Newark)
Carrier Eight airline codes: CO (Continental), DH (Atlantic Coast),
DL (Delta), MQ (American Eagle), OH (Comair),
RU (Continental Express), UA (United), and US (USAirways)
Weather Coded as 1 if there was a weather-related delay
http://www.transtats.bts.gov
EXAMPLE OF COMPLETE ANALYSIS: PREDICTING DELAYED FLIGHTS 251
TABLE 10.5 SAMPLE OF 20 FLIGHTS
Flight Day of Departure
Status Carrier Week Time Destination Origin Weather
ontime DL 2 728 LGA DCA 0
delayed US 3 1600 LGA DCA 0
ontime DH 5 1242 EWR IAD 0
ontime US 2 2057 LGA DCA 0
ontime DH 3 1603 JFK IAD 0
ontime CO 6 1252 EWR DCA 0
ontime RU 6 1728 EWR DCA 0
ontime DL 5 1031 LGA DCA 0
ontime RU 6 1722 EWR IAD 0
delayed US 1 627 LGA DCA 0
delayed DH 2 1756 JFK IAD 0
ontime MQ 6 1529 JFK DCA 0
ontime US 6 1259 LGA DCA 0
ontime DL 2 1329 LGA DCA 0
ontime RU 2 1453 EWR BWI 0
ontime RU 5 1356 EWR DCA 0
delayed DH 7 2244 LGA IAD 0
ontime US 7 1053 LGA DCA 0
ontime US 2 1057 LGA DCA 0
ontime US 4 632 LGA DCA 0
the relationships between flight delays and different predictors or combinations
of predictors. From Figure 10.4, we see that Sundays and Mondays saw the
largest proportion of delays. Delay rates also seem to differ by carrier, by time
of day, as well as by origin and destination airports. For Weather, we see a
strong distinction between delays when Weather = 1 (in that case there is always
a delay) and Weather = 0. The heatmap in Figure 10.5 reveals some specific
combinations with high rates of delays, such as Sunday flights by carrier RU,
departing from BWI, or Sunday flights by MQ departing from DCA. We can
also see combinations with very low delay rates.
Our main goal is to find a model that can obtain accurate classifications of
new flights based on their predictor information. An alternative goal is finding
a certain percentage of flights that are most/least likely to get delayed (ranking).
And a third different goal is profiling flights: finding out which factors are asso-
ciated with a delay (not only in this sample but in the entire population of flights
on this route), and for those factors we would like to quantify these effects. A
logistic regression model can be used for all these goals, albeit in different ways.
Data Preprocessing
Create a binary outcome variable called isDelay that takes the value 1 if Flight
Status = delayed and 0 otherwise. Transform day of week into a categorical
252 LOGISTIC REGRESSION
code for generating bar charts of average delay vs. predictors
# code for generating top-right bar chart
# for other plots, replace aggregating variable by setting argument by = in
# aggregate().
# in function barplot(), set the x-label (argument xlab =) and y-label
# (argument names.arg =)
# according to the variable of choice.
barplot(aggregate(delays.df$Flight.Status == "delayed", by = list(delays.df$DAY_WEEK),
mean, rm.na = T)[,2], xlab = "Day of Week", ylab = "Average Delay",
names.arg = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
Mon Tue Wed Thu Fri Sat Sun
Day of Week
A
ve
ra
ge
D
el
ay
0.
00
0.
10
0.
20
6 8 10 12 14 16 18 20
Departure Time
A
ve
ra
ge
D
el
ay
0.
0
0.
1
0.
2
0.
3
IAD BWI DCA
Origin
A
ve
ra
ge
D
el
ay
0.
00
0.
10
0.
20
LGA EWR JFK
Destination
A
ve
ra
ge
D
el
ay
0.
00
0.
10
0.
20
US CO DH DL MQ OH RU UA
Carrier
A
ve
ra
ge
D
el
ay
0.
00
0.
10
0.
20
0 1
Weather
A
ve
ra
ge
D
el
ay
0.
0
0.
4
0.
8
FIGURE 10.4 PROPORTION OF DELAYED FLIGHTS BY EACH OF THE SIX PREDICTORS. TIME OF
DAY IS DIVIDED INTO HOURLY BINS
EXAMPLE OF COMPLETE ANALYSIS: PREDICTING DELAYED FLIGHTS 253
code for generating heatmap for exploring flight delays
library(reshape)
library(ggplot2)
# create matrix for plot
agg <- aggregate(delays.df$Is.Delay,
by = list(delays.df$DAY_WEEK, delays.df$CARRIER, delays.df$ORIGIN),
FUN = mean, na.rm = TRUE)
m <- melt(agg)
names(m)[1:3] <- c("DAY_WEEK", "CARRIER", "ORIGIN")
# plot with ggplot
# use facet_grid() with arguments scales = "free" and space = "free" to skip
# missing values.
ggplot(m, aes(y = CARRIER, x = DAY_WEEK, fill = value)) + geom_tile() +
facet_grid(ORIGIN ~ ., scales = "free", space = "free") +
scale_fill_gradient(low="white", high="black")
OH
RU
CO
DH
DL
MQ
RU
US
DH
RU
UA
B
W
I
D
C
A
IA
D
Mon Tue Wed Thu Fri Sat Sun
DAY_WEEK
C
A
R
R
IE
R
0.0
0.1
0.2
0.3
0.4
0.5
value
FIGURE 10.5 PERCENT OF DELAYED FLIGHTS (DARKER = HIGHER %DELAYS) BY DAY OF WEEK,
ORIGIN, AND CARRIER
variable, and bin and categorize the departure time into hourly intervals between
6:00 AM and 10:00 PM. Set reference categories for categorical variables: IAD
for departure airport, LGA for arrival, USAirways for carrier, and Wednesday
for day (see Figure 10.4 for R code). This yields a total of 34 dummies. In
addition, we have a single dummy for Weather. While this is a large number of
predictors, we start by using all of them, look at performance, and then explore
reducing the dimension. We then partition the data into training set (60%) and
validation set (40%). We use the training set to fit a model and the validation set
to assess the model’s performance.
254 LOGISTIC REGRESSION
Model-Fitting and Estimation
The estimated model with 34 predictors is shown in Table 10.6. Note how
negative coefficients in the logit model (the “Coefficient” column) translate into
odds coefficients lower than 1, and positive logit coefficients translate into odds
coefficients larger than 1.
Model Interpretation
The coefficient for Arrival Airport JFK (DESTJFK) is estimated as −0.19.
Recall that the reference group is LGA. We interpret this coefficient as follows:
e−0.19 = 0.83 are the odds of a flight arriving at JFK being delayed relative to
a flight to LGA being delayed (= the base-case odds), holding all other variables
constant. This means that flights to LGA are more likely to be delayed than
those to JFK (holding everything else constant). If we consider statistical sig-
nificance of the coefficients, we see that in general, the origin and destination
airports are not associated with the chance of delays. For carriers, two carriers
(CO, MQ) are significantly different from the base carrier (USAirways), with
odds of delay ranging between 4 and 5.5 relative to the other airlines. Weather
has an enormous coefficient, which is not statistically significant. Flights leaving
on Sunday or Monday have, on average, odds of 1.8 of delays relative to other
days of the week (the other days seem statistically similar to the reference group
Wednesday). Also, odds of delays appear to change over the course of the day,
with the most noticeable difference from the reference category (6–7 AM) being
3–4 PM.
Model Performance
How should we measure the performance of models? One possible measure is
“percent of flights correctly classified.” Accurate classification can be obtained
from the confusion matrix for the validation data. The confusion matrix gives a
sense of the classification accuracy and what type of misclassification is more fre-
quent. From the confusion matrix and error rates in Figure 10.6, it can be seen
that the model more accurately classifies nondelayed flights and is less accurate
in classifying flights that were delayed. (Note: The same pattern appears in the
confusion matrix for the training data, so it is not surprising to see it emerge for
new data.) If there is an asymmetric cost structure so that one type of misclassifi-
cation is more costly than the other, the cutoff value can be selected to minimize
the cost. Of course, this tweaking should be carried out on the training data and
assessed only using the validation data.
In most conceivable situations, the purpose of the model would be to identify
those flights most likely to be delayed among a set of flights, so that resources
can be directed toward either reducing the delay or mitigating its effects. Air
EXAMPLE OF COMPLETE ANALYSIS: PREDICTING DELAYED FLIGHTS 255
TABLE 10.6 ESTIMATED LOGISTIC REGRESSION MODEL FOR DELAYED FLIGHTS (BASED ON THE
TRAINING SET)
code for data preprocessing and running logistic regression
delays.df <- read.csv("FlightDelays.csv")
# transform variables and create bins
delays.df$DAY_WEEK <- factor(delays.df$DAY_WEEK, levels = c(1:7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
delays.df$CRS_DEP_TIME <- factor(round(delays.df$CRS_DEP_TIME/100))
# create reference categories
delays.df$ORIGIN <- relevel(delays.df$ORIGIN, ref = "IAD")
delays.df$DEST <- relevel(delays.df$DEST, ref = "LGA")
delays.df$CARRIER <- relevel(delays.df$CARRIER, ref = "US")
delays.df$DAY_WEEK <- relevel(delays.df$DAY_WEEK, ref = "Wed")
delays.df$isDelay <- 1 * (delays.df$Flight.Status == "delayed")
# create training and validation sets
selected.var <- c(10, 1, 8, 4, 2, 9, 14)
train.index <- sample(c(1:dim(delays.df)[1]), dim(delays.df)[1]*0.6)
train.df <- delays.df[train.index, selected.var]
valid.df <- delays.df[-train.index, selected.var]
# run logistic model, and show coefficients and odds
lm.fit <- glm(isDelay ~ ., data = train.df, family = "binomial")
data.frame(summary(lm.fit)$coefficients, odds = exp(coef(lm.fit)))
> round(data.frame(summary(lm.fit)$coefficients, odds = exp(coef(lm.fit))), 5)
Estimate Std..Error z.value Pr…z.. odds
(Intercept) -2.60619 0.61423 -4.24302 0.00002 7.382000e-02
DAY_WEEKMon 0.52118 0.26915 1.93644 0.05281 1.684020e+00
DAY_WEEKTue 0.14391 0.28607 0.50307 0.61491 1.154780e+00
DAY_WEEKThu -0.05467 0.27159 -0.20128 0.84048 9.468000e-01
DAY_WEEKFri -0.00027 0.27089 -0.00098 0.99922 9.997300e-01
DAY_WEEKSat -0.74215 0.35108 -2.11392 0.03452 4.760900e-01
DAY_WEEKSun 0.66789 0.27670 2.41378 0.01579 1.950120e+00
CRS_DEP_TIME7 -0.04694 0.51691 -0.09081 0.92764 9.541400e-01
CRS_DEP_TIME8 0.29127 0.48747 0.59751 0.55017 1.338130e+00
CRS_DEP_TIME9 -0.36183 0.60420 -0.59886 0.54927 6.964000e-01
CRS_DEP_TIME10 -0.34262 0.59867 -0.57231 0.56711 7.099100e-01
CRS_DEP_TIME11 -0.58081 0.82897 -0.70064 0.48353 5.594500e-01
CRS_DEP_TIME12 0.62069 0.46722 1.32848 0.18402 1.860210e+00
CRS_DEP_TIME13 0.03636 0.51341 0.07082 0.94354 1.037030e+00
CRS_DEP_TIME14 0.25826 0.50724 0.50915 0.61065 1.294680e+00
CRS_DEP_TIME15 1.04599 0.41727 2.50677 0.01218 2.846230e+00
CRS_DEP_TIME16 0.55740 0.45512 1.22474 0.22067 1.746130e+00
CRS_DEP_TIME17 0.67856 0.42316 1.60355 0.10881 1.971040e+00
CRS_DEP_TIME18 0.28052 0.59063 0.47494 0.63483 1.323810e+00
CRS_DEP_TIME19 0.75145 0.50054 1.50126 0.13329 2.120070e+00
CRS_DEP_TIME20 0.92481 0.68216 1.35570 0.17519 2.521380e+00
CRS_DEP_TIME21 0.88665 0.44103 2.01041 0.04439 2.426980e+00
ORIGINBWI 0.40424 0.38936 1.03821 0.29917 1.498170e+00
ORIGINDCA -0.43414 0.36966 -1.17443 0.24022 6.478200e-01
DESTEWR -0.06771 0.33162 -0.20417 0.83822 9.345400e-01
DESTJFK -0.18893 0.25224 -0.74900 0.45386 8.278500e-01
CARRIERCO 1.70665 0.53019 3.21895 0.00129 5.510500e+00
CARRIERDH 0.92781 0.49642 1.86898 0.06163 2.528960e+00
CARRIERDL 0.30436 0.33592 0.90605 0.36491 1.355750e+00
CARRIERMQ 1.38772 0.32314 4.29444 0.00002 4.005710e+00
CARRIEROH -0.40804 0.84431 -0.48327 0.62890 6.649600e-01
CARRIERRU 0.94661 0.49224 1.92309 0.05447 2.576970e+00
CARRIERUA 0.51400 0.83733 0.61385 0.53931 1.671970e+00
Weather 17.85034 502.07524 0.03555 0.97164 5.653314e+07
256 LOGISTIC REGRESSION
code for evaluating performance of all-predictor model
library(gains)
pred <- predict(lm.fit, valid.df)
gain <- gains(valid.df$isDelay, lm.fit$fitted.values, groups=100)
plot(c(0,gain$cume.pct.of.total*sum(valid.df$isDelay))~
c(0,gain$cume.obs),
xlab="# cases", ylab="Cumulative", main="", type="l")
lines(c(0,sum(valid.df$isDelay))~c(0, dim(valid.df)[1]), lty=2)
Output
> confusionMatrix(ifelse(pred > 0.5, 1, 0), valid.df$isDelay)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 714 156
1 0 11
Accuracy : 0.8229
0 200 400 600 800
0
50
10
0
15
0
# cases
C
um
ul
at
iv
e
FIGURE 10.6 CONFUSION MATRIX AND LIFT CHART FOR THE FLIGHT DELAY VALIDATION DATA
USING ALL PREDICTORS
traffic controllers might work to open up additional air routes or allocate more
controllers to a specific area for a short time. Airlines might bring on personnel to
rebook passengers and to activate standby flight crews and aircraft. Hotels might
allocate space for stranded travellers. In all cases, the resources available are going
to be limited and might vary over time and from organization to organization.
In this situation, the most useful model would provide an ordering of flights by
their probability of delay, letting the model users decide how far down that list to
go in taking action. Therefore, model lift is a useful measure of performance—
as you move down that list of flights, ordered by their delay probability, how
much better does the model do in predicting delay than would a naive model
which is simply the average delay rate for all flights? From the lift curve for the
EXAMPLE OF COMPLETE ANALYSIS: PREDICTING DELAYED FLIGHTS 257
validation data (Figure 10.6), we see that our model is superior to the baseline
(simple random selection of flights).
Variable Selection
From the data exploration charts (Figures 10.4 and 10.5) and from the coef-
ficient table for the flights delay model (Table 10.6), it appears that several of
the predictors could be dropped or coded differently. Additionally, we look
at the number of flights in different categories to identify categories with very
few or no flights—such categories are candidates for removal or merger (see
Table 10.7).
First, we find that most carriers depart from a single airport (DCA): For
those that depart from all three airports, the delay rates are similar regardless
of airport. We therefore drop the departure airport distinction by excluding
Origin dummies and find that the model performance and fit is not harmed.
We also drop the destination airport for a practical reason: Not all carriers fly
to all airports. Our model would then be invalid for prediction in nonexistent
combinations of carrier and destination airport. We also try grouping carriers,
day of week, and hour of day into fewer categories that are more distinguishable
with respect to delays. For example, Sundays and Mondays seem to have a
similar rate of delays, which differs from the lower rate on Tuesday–Saturday. We
therefore group the days of week into Sunday+Monday and Other, resulting in
a single dummy variable.
Table 10.8 displays the estimated smaller model, with its training and valida-
tion confusion matrices and error rates. Figure 10.7 presents the lift chart on the
validation set. It can be seen that this model competes well with the larger model
in terms of classification accuracy and lift, while using much less information.
We therefore conclude with a six-predictor model that requires only knowl-
edge of the carrier, the day of week, the hour of the day, and whether it is
likely that there will be a delay due to weather. However, this weather vari-
able refers to actual weather at flight time, not a forecast, and is not known in
advance! If the aim is to predict in advance whether a particular flight will be
TABLE 10.7 NUMBER OF FLIGHTS BY CARRIER AND ORIGIN
BWI DCA IAD Total
CO 94 94
DH 27 524 551
DL 388 388
MQ 295 295
OH 30 30
RU 115 162 131 408
UA 31 31
US 404 404
Total 145 1370 686 2201
258 LOGISTIC REGRESSION
TABLE 10.8 LOGISTIC REGRESSION MODEL WITH FEWER PREDICTORS
code for logistic regression with fewer predictors
# fewer predictors
delays.df$Weekend <- delays.df$DAY_WEEK %in% c("Sun", "Sat")
delays.df$CARRIER_CO_MQ_DH_RU <- delays.df$CARRIER %in% c("CO", "MQ", "DH", "RU")
delays.df$MORNING <- delays.df$CRS_DEP_TIME %in% c(6, 7, 8, 9)
delays.df$NOON <- delays.df$CRS_DEP_TIME %in% c(10, 11, 12, 13)
delays.df$AFTER2P <- delays.df$CRS_DEP_TIME %in% c(14, 15, 16, 17, 18)
delays.df$EVENING <- delays.df$CRS_DEP_TIME %in% c(19, 20)
set.seed(1) # Set the seed for the random number generator for reproducing the
# partition.
train.index <- sample(c(1:dim(delays.df)[1]), dim(delays.df)[1]*0.6)
valid.index <- setdiff(c(1:dim(delays.df)[1]), train.index)
train.df <- delays.df[train.index, ]
valid.df <- delays.df[valid.index, ]
lm.fit <- glm(isDelay ~ Weekend + Weather + CARRIER_CO_MQ_DH_RU + MORNING + NOON +
AFTER2P + EVENING, data = train.df, family = "binomial")
summary(lm.fit)
# evaluate
pred <- predict(lm.fit, valid.df)
confusionMatrix(ifelse(pred > 0.5, 1, 0), valid.df$isDelay)
Output
> summary(lm.fit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.974836 0.287812 -6.862 6.81e-12 ***
WeekendTRUE -0.008063 0.173019 -0.047 0.9628
Weather 18.100254 501.012404 0.036 0.9712
CARRIER_CO_MQ_DH_RUTRUE 1.130216 0.178381 6.336 2.36e-10 ***
MORNINGTRUE -0.739863 0.292933 -2.526 0.0115 *
NOONTRUE -0.567475 0.298423 -1.902 0.0572 .
AFTER2PTRUE -0.106968 0.259789 -0.412 0.6805
EVENINGTRUE 0.181173 0.357225 0.507 0.6120
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> confusionMatrix(ifelse(pred > 0.5, 1, 0), valid.df$isDelay)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 714 156
1 0 11
Accuracy : 0.8229
APPENDIX: LOGISTIC REGRESSION FOR PROFILING 259
0 200 400 600 800
0
50
10
0
15
0
# cases
C
um
ul
at
iv
e
FIGURE 10.7 LIFT CHART FOR LOGISTIC REGRESSION MODEL WITH FEWER PREDICTORS ON THE
VALIDATION SET
delayed, a model without Weather must be used. In contrast, if the goal is
profiling delayed vs. ontime flights, we can keep Weather in the model to allow
evaluating the impact of the other factors while holding weather constant [that
is, (approximately) comparing days with weather delays to days without weather
delays].
To conclude, based on the model built from January 2004 data, the highest
chance of an ontime flight from DC to New York is on Tuesday–Saturday around
noon, on Delta, Comair, United, or USAirways. And clearly, good weather is
advantageous!
10.6 Appendix: Logistic Regression for
Profiling
The presentation of logistic regression in this chapter has been primarily from a
data mining perspective where classification or ranking is the goal, and perfor-
mance is evaluated by reviewing results with a validation sample. For reference,
some key concepts of a classical statistical perspective are included below.
Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome
Now that you have seen how logistic regression works, we explain why it is
considered preferable over linear regression for a binary outcome. Technically,
one can apply a multiple linear regression model to this problem, treating the
outcome variable Y as continuous. This is called a Linear Probability Model. Of
course, Y must be coded numerically (e.g., 1 for customers who accepted the
loan offer and 0 for customers who did not accept it). Although software will
260 LOGISTIC REGRESSION
yield an output that at first glance may seem standard (e.g., Table 10.9), a closer
look will reveal several anomalies:
1. Using the model to predict Y for each of the records (or classify them)
yields predictions that are not necessarily 0 or 1.
2. A look at the histogram or probability plot of the residuals reveals that
the assumption that the outcome variable (or residuals) follows a normal
distribution is violated. Clearly, if Y takes only the values 0 and 1, it
cannot be normally distributed. In fact, a more appropriate distribution
for the number of 1’s in the dataset is the binomial distribution with
p = P (Y = 1).
3. The assumption that the variance of Y is constant across all classes is
violated. Since Y follows a binomial distribution, its variance is np(1−
p). This means that the variance will be higher for classes where the
probability of adoption, p, is near 0.5 than where it is near 0 or 1.
The first anomaly is the main challenge when the goal is classification, especially
if we are interested in propensities (p = P (Y = 1)). The second and third
anomalies are relevant to profiling, where we use statistical inference that relies
on standard errors.
Below you will find partial output from running a multiple linear regression
of Personal Loan (PL, coded as PL = 1 for customers who accepted the loan
offer and PL = 0 otherwise) on three of the predictors. The estimated model is
P̂L = −0.2326196 + 0.0030989 Income + 0.0344897 Family + 0.2872284 CD
TABLE 10.9 OUTPUT FOR MULTIPLE LINEAR REGRESSION MODEL OF PERSONAL LOAN ON
THREE PREDICTORS
> reg <- lm(Personal.Loan ~ Income + Family + CD.Account, data = bank.df)
> summary(reg)
Call:
lm(formula = Personal.Loan ~ Income + Family + CD.Account, data = bank.df)
Residuals:
Min 1Q Median 3Q Max
-0.81774 -0.12536 -0.02930 0.06407 0.99670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2326196 0.0103867 -22.40 <0.0000000000000002 ***
Income 0.0030989 0.0000765 40.51 <0.0000000000000002 ***
Family 0.0344897 0.0030243 11.40 <0.0000000000000002 ***
CD.Account 0.2872284 0.0145981 19.68 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2421 on 4996 degrees of freedom
Multiple R-squared: 0.325, Adjusted R-squared: 0.3246
F-statistic: 801.9 on 3 and 4996 DF, p-value: < 0.00000000000000022
APPENDIX: LOGISTIC REGRESSION FOR PROFILING 261
To predict whether a new customer will accept the personal loan offer (PL = 1)
or not (PL = 0), we input the information on its values for these three pre-
dictors. For example, we would predict the loan offer acceptance of a cus-
tomer with an annual income of $50K with two family members, who does not
hold CD accounts in Universal Bank to be −0.2326196+ (0.0030989)(50)+
(0.0344897)(2) = −0.0086952. Clearly, this is not a valid “loan acceptance”
value. Furthermore, the histogram of the residuals (Figure 10.8) reveals that the
residuals are probably not normally distributed. Therefore, our estimated model
is based on violated assumptions and cannot be used for inference.
Residuals
F
re
qu
en
cy
−0.5 0.0 0.5 1.0
0
40
0
80
0
12
00
FIGURE 10.8 HISTOGRAM OF RESIDUALS FROM A MULTIPLE LINEAR REGRESSION FOR
UNIVERSAL BANK ON THE THREE PREDICTORS. WE SEE THE RESIDUALS DO NOT
FOLLOW THE NORMAL DISTRIBUTION
Appendix B: Evaluating Explanatory Power
When the purpose of the analysis is profiling (identifying predictor profiles that
distinguish the two classes, or explaining the differences between the classes in
terms of predictor values), we are less interested in how well the model classi-
fies new data than in how well the model fits the data it was trained on. For
example, if we are interested in characterizing the average loan offer acceptor
vs. nonacceptor in terms of income, education, and so on, we want to find a
model that fits the data best. We therefore mention popular measures used to
assess how well the model fits the data. Clearly, we look at the training set in
order to evaluate goodness of fit (and in fact, we do not need to partition the
data).
Overall Strength-of-Fit As in multiple linear regression, we first evaluate
the overall explanatory power of the model before looking at single predictors.
262 LOGISTIC REGRESSION
We ask: Is this set of predictors better than a simple naive model for explaining
the difference between classes?6
The deviance D is a statistic that measures overall goodness of fit. It is
similar to the concept of sum of squared errors (SSE) in the case of least squares
estimation (used in linear regression). We compare the deviance of our model,
D (called Residual deviance in R), to the deviance of the naive (Null) model, D0.
For example, in Table 10.10, we see D = 682.19 and D0 = 1901.71. If the
reduction in deviance is statistically significant (as indicated by a low p-value7),
we consider our model to provide a good overall fit, better than a model with
no explanatory (X) variables.
TABLE 10.10 MEASURES OF EXPLANATORY POWER FOR UNIVERSAL BANK TRAINING DATA
WITH A 12 PREDICTOR MODEL
> summary(logit.reg)
Call:
glm(formula = Personal.Loan ~ ., family = “binomial”, data = train.df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0380 -0.1847 -0.0627 -0.0183 3.9810
Coefficients:
[OMITTED]
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1901.71 on 2999 degrees of freedom
Residual deviance: 682.19 on 2987 degrees of freedom
AIC: 708.19
Number of Fisher Scoring iterations: 8
Finally, the confusion matrix and lift chart for the training data (Figure 10.9)
give a sense of how accurately the model classifies the data. If the model fits the
data well, we expect it to classify these data accurately into their actual classes.
Impact of Single Predictors As in multiple linear regression, the output
from a logistic regression procedure typically yields a coefficient table, where
6In a naive model, no explanatory variables (X ’s) exist and each record is classified as belonging to the
majority class.
7The difference between the deviance of a naive model and deviance of the model at hand approxi-
mately follows a chi-squared distribution with k degrees of freedom, where k is the number of predic-
tors in the model at hand. Therefore, to get the p-value, compute the difference between the deviances
(d) and then compute the probability that a chi-squared variable with k degrees of freedom is larger
than d.
APPENDIX: LOGISTIC REGRESSION FOR PROFILING 263
> confusionMatrix(ifelse(logit.reg$fitted > 0.5, 1, 0), train.df[, 8])
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2687 92
1 24 197
Accuracy : 0.9613
0 500 1000 1500 2000
0
50
10
0
15
0
# cases
C
um
ul
at
iv
e
FIGURE 10.9 CONFUSION MATRIX AND LIFT CHART FOR UNIVERSAL BANK TRAINING DATA
WITH 12 PREDICTORS
for each predictor Xi, we have an estimated coefficient bi and an associated
standard error. The associated p-value indicates the statistical significance of the
predictor Xi, with very low p-values indicating a statistically significant relation-
ship between the predictor and the outcome (given that the other predictors are
accounted for), a relationship that is most likely not a result of chance. Three
important points to remember are:
1. A statistically significant relationship is not necessarily a practically significant
one, in which the predictor has great impact. If the sample is very large,
the p-value will be very small simply because the chance uncertainty
associated with fitting a model to a particular data set, which can be
considerable in a small sample, is negligible in a large sample.
2. Comparing the coefficient magnitudes, or equivalently the odds mag-
nitudes, is meaningless unless all predictors have the same scale. Recall
that each coefficient is multiplied by the predictor value, so that different
predictor scales lead to different coefficient scales.
264 LOGISTIC REGRESSION
3. A statistically significant predictor means that on average, a unit increase in
that predictor is associated with a certain effect on the outcome (holding
all other predictors constant). It does not, however, indicate predictive
power. Statistical significance is of major importance in explanatory mod-
eling, or profiling, but of secondary importance in predictive modeling
(classification). In predictive modeling, statistically significant predictors
might give hints as to more and less important predictors, but the even-
tual choice of predictors should be based on predictive measures, such as
the validation set confusion matrix (for classification) or the validation set
lift chart (for ranking).
Appendix C: Logistic Regression for More Than Two Classes
The logistic model for a binary outcome can be extended for more than two
classes. Suppose that there are m classes. Using a logistic regression model,
for each record we would have m probabilities of belonging to each of the m
classes. Since the m probabilities must add up to 1, we need estimate only m−1
probabilities.
Ordinal Classes Ordinal classes are classes that have a meaningful order.
For example, in stock recommendations, the three classes buy, hold, and sell can
be treated as ordered. As a simple rule, if classes can be numbered in a meaningful
way, we consider them ordinal. When the number of classes is large (typically,
more than 5), we can treat the outcome variable as continuous and perform
multiple linear regression. When m = 2, the logistic model described above
is used. We therefore need an extension of the logistic regression for a small
number of ordinal classes (3 ≤ m ≤ 5). There are several ways to extend the
binary-class case. Here, we describe the proportional odds or cumulative logit method.
For other methods, see Hosmer and Lemeshow (2000).
For simplicity of interpretation and computation, we look at cumulative prob-
abilities of class membership. For example, in the stock recommendations, we
have m = 3 classes. Let us denote them by 1 = buy, 2 = hold, and 3 = sell. The
probabilities estimated by the model are P (Y ≤ 1), (the probability of a buy
recommendation) and P (Y ≤ 2) (the probability of a buy or hold recommen-
dation). The three noncumulative probabilities of class membership can easily
be recovered from the two cumulative probabilities:
P (Y = 1) = P (Y ≤ 1),
P (Y = 2) = P (Y ≤ 2)− P (Y ≤ 1),
P (Y = 3) = 1− P (Y ≤ 2).
APPENDIX: LOGISTIC REGRESSION FOR PROFILING 265
Next, we want to model each logit as a function of the predictors. Corre-
sponding to each of them−1 cumulative probabilities is a logit. In our example,
we would have
logit(buy) = log
P (Y ≤ 1)
1− P (Y ≤ 1)
,
logit(buy or hold) = log
P (Y ≤ 2)
1− P (Y ≤ 2)
.
Each of the logits is then modeled as a linear function of the predictors (as in
the two-class case). If in the stock recommendations we have a single predictor
value x, we compute two logit values using two equations
logit(buy) = α0 + β1x,
logit(buy or hold) = β0 + β1x.
This means that both lines have the same slope (β1) but different intercepts. Once
the coefficients α0, β0, β1 are estimated, we can compute the class membership
probabilities by rewriting the logit equations in terms of probabilities. For the
three-class case, for example, we would have
P (Y = 1) = P (Y ≤ 1) = 1
1 + e−(a0+b1x)
,
P (Y = 2) = P (Y ≤ 2)− P (Y ≤ 1) = 1
1 + e−(b0+b1x)
− 1
1 + e−(a0+b1x)
,
P (Y = 3) = 1− P (Y ≤ 2) = 1− 1
1 + e−(b0+b1x)
,
where a0, b0, and b1 are the estimates obtained from the training set.
For each record, we now have the estimated probabilities that it belongs to
each of the classes. In our example, each stock would have three probabilities:
for a buy recommendation, a hold recommendation, and a sell recommendation.
The last step is to classify the record into one of the classes. This is done by
assigning it to the class with the highest membership probability. For example,
if a stock had estimated probabilities P (Y = 1) = 0.2, P (Y = 2) = 0.3, and
P (Y = 3) = 0.5, we would classify it as getting a sell recommendation.
Nominal Classes When the classes cannot be ordered and are simply
different from one another, we are in the case of nominal classes. An example
is the choice between several brands of cereal. A simple way to verify that the
classes are nominal is when it makes sense to tag them as A,B,C, . . ., and the
assignment of letters to classes does not matter. For simplicity, let us assume that
there are m = 3 brands of cereal that consumers can choose from (assuming that
266 LOGISTIC REGRESSION
each consumer chooses one). Then we estimate the probabilities P (Y = A),
P (Y = B), and P (Y = C). As before, if we know two of the probabilities,
the third probability is determined. We therefore use one of the classes as the
reference class. Let us use C as the reference brand.
The goal, once again, is to model the class membership as a function of
predictors. So in the cereals example, we might want to predict which cereal
will be chosen if we know the cereal’s price x.
Next, we form m−1 pseudologit equations that are linear in the predictors.
In our example, we would have
logit(A) = log
P (Y = A)
P (Y = C)
= α0 + α1x,
logit(B) = log
P (Y = B)
P (Y = C)
= β0 + β1x.
Once the four coefficients are estimated from the training set, we can estimate
the class membership probabilities8:
P (Y = A) =
ea0+a1x
1 + ea0+a1x + eb0+b1x
,
P (Y = B) =
eb0+b1x
1 + ea0+a1x + eb0+b1x
,
P (Y = C) = 1− P (Y = A)− P (Y = B),
where a0, a1, b0, and b1 are the coefficient estimates obtained from the training
set. Finally, a record is assigned to the class that has the highest probability.
Table 10.11 presents the R code for ordinal and nominal multinomial
regression.
8From the two logit equations, we see that
P (Y = A) = P (Y = C)eα0+α1x
P (Y = B) = P (Y = C)eβ0+β1x
Since P (Y = A) + P (Y = B) + P (Y = C) = 1, we get
P (Y = C) = 1− P (Y = C)eα0+α1x − P (Y = C)eβ0+β1x
=
1
eα0+α1x+e
β0+β1x
.
By plugging this form into the two equations above it, we also obtain the membership probabilities
in classes A and B.
APPENDIX: LOGISTIC REGRESSION FOR PROFILING 267
TABLE 10.11 ORDINAL AND NOMINAL MULTINOMIAL REGRESSION IN R
code for logistic regression with more than 2 classes
# simulate simple data
Y = rep(c(“a”, “b”, “c”), 100)
x = rep(c(1, 2, 3), 100) + rnorm(300, 0, 1)
# ordinal logistic regression
library(MASS)
Y = factor(Y, ordered = T)
polr(Y ~ x)
# nominal logistic regression
library(nnet)
Y = factor(Y, ordered = F)
multinom(Y ~ x)
Output
> polr(Y ~ x)
Call:
polr(formula = Y ~ x)
Coefficients:
x
1.110152
Intercepts:
a|b b|c
1.328406 3.218809
Residual Deviance: 543.3761
AIC: 549.3761
> multinom(Y ~ x)
# weights: 9 (4 variable)
initial value 329.583687
iter 10 value 268.610761
iter 10 value 268.610760
final value 268.610760
converged
Call:
multinom(formula = Y ~ x)
Coefficients:
(Intercept) x
b -1.715653 1.058281
c -3.399917 1.738026
Residual Deviance: 537.2215
AIC: 545.2215
268 LOGISTIC REGRESSION
PROBLEMS
10.1 Financial Condition of Banks. The file Banks.csv includes data on a sample of
20 banks. The “Financial Condition” column records the judgment of an expert
on the financial condition of each bank. This outcome variable takes one of two
possible values—weak or strong—according to the financial condition of the bank. The
predictors are two ratios used in the financial analysis of banks: TotLns&Lses/Assets is
the ratio of total loans and leases to total assets and TotExp/Assets is the ratio of total
expenses to total assets. The target is to use the two ratios for classifying the financial
condition of a new bank.
Run a logistic regression model (on the entire dataset) that models the status of
a bank as a function of the two financial measures provided. Specify the success class
as weak (this is similar to creating a dummy that is 1 for financially weak banks and 0
otherwise), and use the default cutoff value of 0.5.
a. Write the estimated equation that associates the financial condition of a bank with
its two predictors in three formats:
i. The logit as a function of the predictors
ii. The odds as a function of the predictors
iii. The probability as a function of the predictors
b. Consider a new bank whose total loans and leases/assets ratio = 0.6 and total
expenses/assets ratio = 0.11. From your logistic regression model, estimate the
following four quantities for this bank (use R to do all the intermediate calcula-
tions; show your final answers to four decimal places): the logit, the odds, the
probability of being financially weak, and the classification of the bank (use cutoff
= 0.5).
c. The cutoff value of 0.5 is used in conjunction with the probability of being finan-
cially weak. Compute the threshold that should be used if we want to make a
classification based on the odds of being financially weak, and the threshold for the
corresponding logit.
d. Interpret the estimated coefficient for the total loans & leases to total assets ratio
(TotLns&Lses/Assets) in terms of the odds of being financially weak.
e. When a bank that is in poor financial condition is misclassified as financially strong,
the misclassification cost is much higher than when a financially strong bank is
misclassified as weak. To minimize the expected cost of misclassification, should the
cutoff value for classification (which is currently at 0.5) be increased or decreased?
10.2 Identifying Good System Administrators. A management consultant is study-
ing the roles played by experience and training in a system administrator’s ability to
complete a set of tasks in a specified amount of time. In particular, she is interested
in discriminating between administrators who are able to complete given tasks within
a specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. They are stored in the file SystemAdministrators.csv.
The variable Experience measures months of full-time system administrator expe-
rience, while Training measures the number of relevant training credits. The outcome
variable Completed is either Yes or No, according to whether or not the administrator
completed the tasks.
PROBLEMS 269
a. Create a scatter plot of Experience vs. Training using color or symbol to distinguish
programmers who completed the task from those who did not complete it. Which
predictor(s) appear(s) potentially useful for classifying task completion?
b. Run a logistic regression model with both predictors using the entire dataset as
training data. Among those who completed the task, what is the percentage of
programmers incorrectly classified as failing to complete the task?
c. To decrease the percentage in part (b), should the cutoff probability be increased
or decreased?
d. How much experience must be accumulated by a programmer with 4 years of
training before his or her estimated probability of completing the task exceeds 0.5?
10.3 Sales of Riding Mowers. A company that manufactures riding mowers wants to
identify the best sales prospects for an intensive sales campaign. In particular, the man-
ufacturer is interested in classifying households as prospective owners or nonowners
on the basis of Income (in $1000s) and Lot Size (in 1000 ft2). The marketing expert
looked at a random sample of 24 households, given in the file RidingMowers.csv. Use
all the data to fit a logistic regression of ownership on the two predictors.
a. What percentage of households in the study were owners of a riding mower?
b. Create a scatter plot of Income vs. Lot Size using color or symbol to distinguish
owners from nonowners. From the scatter plot, which class seems to have a higher
average income, owners or nonowners?
c. Among nonowners, what is the percentage of households classified correctly?
d. To increase the percentage of correctly classified nonowners, should the cutoff
probability be increased or decreased?
e. What are the odds that a household with a $60K income and a lot size of 20,000
ft2 is an owner?
f. What is the classification of a household with a $60K income and a lot size of
20,000 ft2? Use cutoff = 0.5.
g. What is the minimum income that a household with 16,000 ft2 lot size should have
before it is classified as an owner?
10.4 Competitive Auctions on eBay.com. The file eBayAuctions.csv contains informa-
tion on 1972 auctions transacted on eBay.com during May–June 2004. The goal is to
use these data to build a model that will distinguish competitive auctions from non-
competitive ones. A competitive auction is defined as an auction with at least two
bids placed on the item being auctioned. The data include variables that describe the
item (auction category), the seller (his or her eBay rating), and the auction terms that
the seller selected (auction duration, opening price, currency, day of week of auction
close). In addition, we have the price at which the auction closed. The goal is to
predict whether or not an auction of interest will be competitive.
Data preprocessing. Create dummy variables for the categorical predictors.
These include Category (18 categories), Currency (USD, GBP, Euro), EndDay
(Monday–Sunday), and Duration (1, 3, 5, 7, or 10 days).
a. Create pivot tables for the mean of the binary outcome (Competitive?) as a function
of the various categorical variables (use the original variables, not the dummies).
Use the information in the tables to reduce the number of dummies that will be
used in the model. For example, categories that appear most similar with respect
to the distribution of competitive auctions could be combined.
270 LOGISTIC REGRESSION
b. Split the data into training (60%) and validation (40%) datasets. Run a logistic
model with all predictors with a cutoff of 0.5.
c. If we want to predict at the start of an auction whether it will be competitive, we
cannot use the information on the closing price. Run a logistic model with all
predictors as above, excluding price. How does this model compare to the full
model with respect to predictive accuracy?
d. Interpret the meaning of the coefficient for closing price. Does closing price have
a practical significance? Is it statistically significant for predicting competitiveness
of auctions? (Use a 10% significance level.)
e. Use stepwise selection (use function step() in the stats package or function
stepAIC() in the MASS package) and an exhaustive search (use function glmulti() in
package glmulti) to find the model with the best fit to the training data. Which
predictors are used?
f. Use stepwise selection and an exhaustive search to find the model with the lowest
predictive error rate (use the validation data). Which predictors are used?
g. What is the danger of using the best predictive model that you found?
h. Explain why the best-fitting model and the best predictive models are the same or
different.
i. If the major objective is accurate classification, what cutoff value should be used?
j. Based on these data, what auction settings set by the seller (duration, opening price,
ending day, currency) would you recommend as being most likely to lead to a
competitive auction?
CHAPTER 11
Neural Nets
In this chapter, we describe neural networks, a flexible data-driven method that
can be used for classification or prediction. Although considered a “blackbox” in
terms of interpretability, neural nets have been highly successful in terms of pre-
dictive accuracy. We discuss the concepts of “nodes” and “layers” (input layers,
output layers, and hidden layers) and how they connect to form the structure
of a network. We then explain how a neural network is fitted to data using a
numerical example. Because overfitting is a major danger with neural nets, we
present a strategy for avoiding it. We describe the different parameters that a user
must specify and explain the effect of each on the process. Finally, we discuss
the usefulness of neural nets and their limitations.
11.1 Introduction
Neural networks, also called artificial neural networks, are models for classification
and prediction. The neural network is based on a model of biological activity in
the brain, where neurons are interconnected and learn from experience. Neural
networks mimic the way that human experts learn. The learning and memory
properties of neural networks resemble the properties of human learning and
memory, and they also have a capacity to generalize from particulars.
A number of successful applications have been reported in financial appli-
cations (see Trippi and Turban, 1996) such as bankruptcy predictions, currency
market trading, picking stocks and commodity trading, detecting fraud in credit
card and monetary transactions, and customer relationship management (CRM).
There have also been a number of successful applications of neural nets in
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
271
272 NEURAL NETS
engineering applications. One of the best known is ALVINN, an autonomous
vehicle driving application for normal speeds on highways. Using as input a
30×32 grid of pixel intensities from a fixed camera on the vehicle, the classifier
provides the direction of steering. The outcome variable is a categorical one
with 30 classes, such as sharp left, straight ahead, and bear right.
The main strength of neural networks is their high predictive performance.
Their structure supports capturing very complex relationships between predic-
tors and an outcome variable, which is often not possible with other predictive
models.
11.2 Concept and Structure of a Neural
Network
The idea behind neural networks is to combine the predictor information in a
very flexible way that captures complicated relationships among these variables
and between them and the outcome variable. For instance, recall that in linear
regression models, the form of the relationship between the outcome and the
predictors is specified directly by the user (see Chapter 6.) In many cases, the
exact form of the relationship is very complicated, or is generally unknown. In
linear regression modeling we might try different transformations of the predic-
tors, interactions between predictors, and so on, but the specified form of the
relationship remains linear. In comparison, in neural networks the user is not
required to specify the correct form. Instead, the network tries to learn about
such relationships from the data. In fact, linear regression and logistic regression
can be thought of as special cases of very simple neural networks that have only
input and output layers and no hidden layers.
Although researchers have studied numerous different neural network archi-
tectures, the most successful applications of neural networks in data mining have
been multilayer feedforward networks. These are networks in which there is an input
layer consisting of nodes (sometimes called neurons) that simply accept the pre-
dictor values, and successive layers of nodes that receive input from the previous
layers. The outputs of nodes in each layer are inputs to nodes in the next layer.
The last layer is called the output layer. Layers between the input and output
layers are known as hidden layers. A feedforward network is a fully connected
network with a one-way flow and no cycles. Figure 11.1 shows a diagram for
this architecture, with two hidden layers and one node in the output layer rep-
resenting the outcome value to be predicted. In a classification problem with m
classes, there would be m output nodes (or m− 1 output nodes, depending on
the software).
FITTING A NETWORK TO DATA 273
FIGURE 11.1 MULTILAYER FEEDFORWARD NEURAL NETWORK
11.3 Fitting a Network to Data
To illustrate how a neural network is fitted to data, we start with a very small
illustrative example. Although the method is by no means operational in such
a small example, it is useful for explaining the main steps and operations, for
showing how computations are done, and for integrating all the different aspects
of neural network data fitting. We will later discuss a more realistic setting.
Example 1: Tiny Dataset
Consider the following very small dataset. Table 11.1 includes information on a
tasting score for a certain processed cheese. The two predictors are scores for fat
and salt, indicating the relative presence of fat and salt in the particular cheese
sample (where 0 is the minimum amount possible in the manufacturing process,
and 1 the maximum). The outcome variable is the cheese sample’s consumer
taste preference, where like or dislike indicate whether the consumer likes the
cheese or not.
Figure 11.2 describes an example of a typical neural net that could be used
for predicting cheese preference (like/dislike) by new consumers, based on these
data. We numbered the nodes in the example from 1 to 7. Nodes 1 and 2
belong to the input layer, nodes 3 to 5 belong to the hidden layer, and nodes
6 and 7 belong to the output layer. The values on the connecting arrows are
called weights, and the weight on the arrow from node i to node j is denoted
by wi,j . The additional bias nodes, denoted by θj , serve as an intercept for the
output from node j. These are all explained in further detail below.
274 NEURAL NETS
TABLE 11.1 TINY EXAMPLE ON TASTING SCORES FOR
SIX CONSUMERS AND TWO PREDICTORS
Obs. Fat Score Salt Score Acceptance
1 0.2 0.9 like
2 0.1 0.1 dislike
3 0.2 0.4 dislike
4 0.2 0.5 dislike
5 0.4 0.5 like
6 0.3 0.8 like
Computing Output of Nodes
We discuss the input and output of the nodes separately for each of the three
types of layers (input, hidden, and output). The main difference is the function
used to map from the input to the output of the node.
Input nodes take as input the values of the predictors. Their output is the
same as the input. If we have p predictors, the input layer will usually include p
nodes. In our example, there are two predictors, and therefore the input layer
(shown in Figure 11.2) includes two nodes, each feeding into each node of the
hidden layer. Consider the first record: The input into the input layer is Fat =
0.2 and Salt = 0.9, and the output of this layer is also x1 = 0.2 and x2 = 0.9.
Hidden layer nodes take as input the output values from the input layer.
The hidden layer in this example consists of three nodes, each receiving input
from all the input nodes. To compute the output of a hidden layer node, we
compute a weighted sum of the inputs and apply a certain function to it. More
formally, for a set of input values x1, x2, . . . , xp, we compute the output of
1
2
3
4
5
6
Input Layer Hidden Layer Output Layer
w13
Fat
Salt
w14
w23
w25
w24
w15
w36
w46
w56
θ3
θ4
θ5
θ6
dislike
like7
w47
w37
w57
θ 7
FIGURE 11.2 NEURAL NETWORK FOR THE TINY EXAMPLE. CIRCLES REPRESENT NODES
(“NEURONS”), Wi,j ON ARROWS ARE WEIGHTS, AND θj ARE NODE BIAS
VALUES
FITTING A NETWORK TO DATA 275
node j by taking the weighted sum1 θj +
∑p
i=1 wijxi, where θj, w1,j, . . . , wp,j
are weights that are initially set randomly, then adjusted as the network “learns.”
Note that θj , also called the bias of node j, is a constant that controls the level of
contribution of node j. In the next step, we take a function g of this sum. The
function g, also called a transfer function or activation function is some monotone
function. Examples include the linear function [g(s) = bs], an exponential
function [g(s) = exp(bs)], and a logistic/sigmoidal function [g(s) = 1/1 +
e−s]. This last function is by far the most popular one in neural networks. Its
practical value arises from the fact that it has a squashing effect on very small or
very large values but is almost linear in the range where the value of the function
is between 0.1 and 0.9.
If we use a logistic activation function, we can write the output of node j
in the hidden layer as
Outputj = g
(
θj +
p∑
i=1
wijxi
)
=
1
1 + e−(θj+
∑p
i=1 wijxi)
. (11.1)
Initializing the Weights The values of θj and wij are initialized to small,
usually random, numbers (typically, but not always, in the range 0.00 ± 0.05).
Such values represent a state of no knowledge by the network, similar to a model
with no predictors. The initial weights are used in the first round of training.
Returning to our example, suppose that the initial weights for node 3 are
θ3 = −0.3, w1,3 = 0.05, and w2,3 = 0.01 (as shown in Figure 11.3). Using
1
2
3
4
5
Input Layer Hidden Layer Output Layer
0.05
–0.01
0.2
0.030.01
–0.01
0.01
0.05
0.015
Fat = 0.2
Salt = 0.9
0.2
0.9
0.43
0.51
0.52
0.506
–0.3
0.02
0.05
–0.015
dislike
like
6
7
–0.02
–0.03
–0.02
–0.04
0.481
FIGURE 11.3 COMPUTING NODE OUTPUTS (IN BOLDFACE TYPE) USING THE FIRST RECORD IN
THE TINY EXAMPLE AND A LOGISTIC FUNCTION
1Other options exist for combining inputs, such as taking the maximum or minimum of the weighted
inputs rather than their sum, but they are much less popular.
276 NEURAL NETS
the logistic function, we can compute the output of node 3 in the hidden layer
(using the first record) as
Output3 =
1
1 + e−[−0.3+(0.05)(0.2)+(0.01)(0.9)]
= 0.43.
Figure 11.3 shows the initial weights, inputs, and outputs for the first record
in our tiny example. If there is more than one hidden layer, the same calculation
applies, except that the input values for the second, third, and so on, hidden
layers would be the output of the preceding hidden layer. This means that the
number of input values into a certain node is equal to the number of nodes in
the preceding layer. (If there was an additional hidden layer in our example, its
nodes would receive input from the three nodes in the first hidden layer.)
Finally, the output layer obtains input values from the (last) hidden layer. It
applies the same function as above to create the output. In other words, it takes a
weighted sum of its input values and then applies the function g. In our example,
output nodes 6 and 7 receive input from the three hidden layer nodes. We can
compute the output of these nodes by
Output6 =
1
1 + e−[−0.04+ (−0.02)(0.43)+ (−0.03)(0.51)+ (0.015)(0.52)]
= 0.481
Output7 =
1
1 + e−[−0.015+ (0.01)(0.430)+ (0.05)(0.507)+ (0.015)(0.511)]
= 0.506.
These two numbers are almost the propensities P(Y = dislike | Fat = 0.2,
Salt = 0.9) and P(Y = like | Fat = 0.2. Salt = 0.9). The last step involves
normalizing these two values so that they add up to 1. In other words,
P(Y = dislike) = Output6/(Output6 + Output7) = 0.481/(0.481 + 0.506) = 0.49
P(Y = like) = 1 − P(Y = dislike) = 0.506/(0.481 + 0.506) = 0.51
For classification, we use a cutoff value (for a binary outcome) on the propensity.
Using a cutoff of 0.5, we would classify this record as like. For applications with
more than two classes, we choose the output node with the largest value.
Relation to Linear and Logistic Regression Consider a neural net-
work with a single output node and no hidden layers. For a dataset with p
predictors, the output node receives x1, x2, . . . , xp, takes a weighted sum of
these, and applies the g function. The output of the neural network is therefore
g (θ +
∑p
i=1 wixi).
First, consider a numerical outcome variable Y . If g is the identity function
[g(s) = s], the output is simply
Ŷ = θ +
p∑
i=1
wixi.
FITTING A NETWORK TO DATA 277
This is exactly equivalent to the formulation of a multiple linear regression! This
means that a neural network with no hidden layers, a single output node, and an
identity function g searches only for linear relationships between the outcome
and the predictors.
Now consider a binary output variable Y . If g is the logistic function, the
output is simply
P̂ (Y = 1) =
1
1 + e−(θ+
∑p
i=1 wixi)
,
which is equivalent to the logistic regression formulation!
In both cases, although the formulation is equivalent to the linear and logistic
regression models, the resulting estimates for the weights (coefficients in linear and
logistic regression) can differ, because the estimation method is different. The
neural net estimation method is different from least squares, the method used
to calculate coefficients in linear regression, or the maximum likelihood method
used in logistic regression. We explain below the method by which the neural
network learns.
Preprocessing the Data
When using a logistic activation function (option act.fct = ’logistic’ in R), neural
networks perform best when the predictors and outcome variable are on a scale
of [0,1]. For this reason, all variables should be scaled to a [0,1] interval before
entering them into the network. For a numerical variable X that takes values in
the range [a, b] where a < b, we normalize the measurements by subtracting a
and dividing by b− a. The normalized measurement is then
Xnorm =
X − a
b− a
.
Note that if [a, b] is within the [0,1] interval, the original scale will be stretched.
If a and b are unknown, we can estimate them from the minimal and maximal
values of X in the data. Even if new data exceed this range by a small amount,
yielding normalized values slightly lower than 0 or larger than 1, this will not
affect the results much.
For binary variables, no adjustment is needed other than creating dummy
variables. For categorical variables with m categories, if they are ordinal in
nature, a choice of m fractions in [0,1] should reflect their perceived ordering.
For example, if four ordinal categories are equally distant from each other, we
can map them to [0, 0.25, 0.5, 1]. If the categories are nominal, transforming
into m− 1 dummies is a good solution.
Another operation that improves the performance of the network is to trans-
form highly skewed predictors. In business applications, there tend to be many
highly right-skewed variables (such as income). Taking a log transform of a
278 NEURAL NETS
right-skewed variable (before converting to a [0,1] scale) will usually spread out
the values more symmetrically.
Another common sigmoidal function is the hyperbolic tangent (option act.fct
= ’tanh’ in R). When using this function, it is usually better to scale predictors
to a [-1,1] scale.
Training the Model
Training the model means estimating the weights θj and wij that lead to the
best predictive results. The process that we described earlier (Section 11.1) for
computing the neural network output for a record is repeated for all records
in the training set. For each record, the model produces a prediction which
is then compared with the actual outcome value. Their difference is the error
for the output node. However, unlike least squares or maximum likelihood,
where a global function of the errors (e.g., sum of squared errors) is used for
estimating the coefficients, in neural networks, the estimation process uses the
errors iteratively to update the estimated weights.
In particular, the error for the output node is distributed across all the hidden
nodes that led to it, so that each node is assigned “responsibility” for part of the
error. Each of these node-specific errors is then used for updating the weights.
Back Propagation of Error The most popular method for using model
errors to update weights (“learning”) is an algorithm called back propagation. As
the name implies, errors are computed from the last layer (the output layer) back
to the hidden layers.
Let us denote by ŷk the output from output node k. The error associated
with output node k is computed by
errk = ŷk(1− ŷk)(yk − ŷk).
Notice that this is similar to the ordinary definition of an error (yk − ŷk) multi-
plied by a correction factor. The weights are then updated as follows:
θnewj = θ
old
j + l × errj, (11.2)
wnewi,j = w
old
i,j + l × errj,
where l is a learning rate or weight decay parameter, a constant ranging typically
between 0 and 1, which controls the amount of change in weights from one
iteration to the next.
In our example, the error associated with output node 7 for the first record
is (0.506)(1 − 0.506)(1 − 0.506) = 0.123. For output node 6 the error is
0.481(1− 0.481)(1− 0.481) = 0.129. These errors are then used to compute
the errors associated with the hidden layer nodes, and those weights are updated
accordingly using a formula similar to (11.2).
FITTING A NETWORK TO DATA 279
Two methods for updating the weights are case updating and batch updating.
In case updating, the weights are updated after each record is run through the
network (called a trial). For example, if we used case updating in the tiny exam-
ple, the weights would first be updated after running record 1 as follows: Using
a learning rate of 0.5, the weights θ7, w3,7, w4,7, and w5,7 are updated to
θ7 = −0.015 + (0.5)(0.123) = 0.047
w3,7 = 0.01 + (0.5)(0.123) = 0.072
w4,7 = 0.05 + (0.5)(0.123) = 0.112
w5,7 = 0.015 + (0.5)(0.123) = 0.077
Similarly, we obtain updated weights θ6 = 0.025, w3,6 = 0.045, w4,6 = 0.035,
and w5,6 = 0.045. These new weights are next updated after the second record
is run through the network, the third, and so on, until all records are used. This
is called one epoch, sweep, or iteration through the data. Typically, there are many
iterations.
In batch updating, the entire training set is run through the network before
each updating of weights takes place. In that case, the errors errk in the updating
equation is the sum of the errors from all records. In practice, case updating
tends to yield more accurate results than batch updating, but requires a longer
run time. This is a serious consideration, since even in batch updating, hundreds
or even thousands of sweeps through the training data are executed.
When does the updating stop? The most common conditions are one of the
following:
1. When the new weights are only incrementally different from those of the
preceding iteration
2. When the misclassification rate reaches a required threshold
3. When the limit on the number of runs is reached
Let us examine the output from running a neural network on the tiny data.
Following Figures 11.2 and 11.3, we used a single hidden layer with three nodes.
R has several packages for neural nets, the most common ones are nnet and
neuralnet (note that package nnet does not enable multilayer networks and has
no plotting option). We used neuralnet in this example. The weights and
model output are shown in Table 11.2. Figure 11.4 shows these weights in a
format similar to that of our previous diagrams.
The first 3×3 table shows the weights that connect the input layer and the
hidden layer. The Bias nodes (first row of the weights table) are the weights θ3, θ4,
and θ5. The weights in this table are used to compute the output of the hidden
layer nodes. They were computed iteratively after choosing a random initial set
of weights (like the set we chose in Figure 11.3). We use the weights in the way
280 NEURAL NETS
TABLE 11.2 NEURAL NETWORK WITH A SINGLE HIDDEN LAYER (THREE NODES) FOR THE TINY
DATA EXAMPLE
code for running a neural network
nn <- neuralnet(Like + Dislike ~ Salt + Fat, data = df, linear.output = F, hidden = 3)
# display weights
nn$weights
# display predictions
prediction(nn)
# plot network
plot(nn, rep="best")
Output
> nn$weights
[[1]]
[[1]][[1]]
[,1] [,2] [,3]
[1,] -1.061143694 3.057021840 3.337952001
[2,] 2.326024132 -3.408663181 -4.293213530
[3,] 4.106434697 -6.525668384 -5.929418648
[[1]][[2]]
[,1] [,2]
[1,] -0.3495332882 -1.677855862
[2,] 5.8777145665 -3.606625360
[3,] -5.3529200726 5.620329700
[4,] -6.1115038896 6.696286857
> prediction(nn)
Data Error: 0;
$rep1
Salt Fat Like Dislike
1 0.1 0.1 0.0002415535993 0.99965512479
2 0.4 0.2 0.0344215786564 0.96556787694
3 0.5 0.2 0.1248666747740 0.87816827940
4 0.9 0.2 0.9349452648141 0.07022732257
5 0.8 0.3 0.9591361793188 0.04505630529
6 0.5 0.4 0.8841904620140 0.12672437721
−5.92942
−6
.52
56
7
4.
10
64
3
Fat
−4.29321
−3.40866
2.32
602
Salt
6.69
629−6
.1
11
5
5.62033
−5
.35
29
2
−3.60663
5.87771
Dislike
Like
3.33795
3.05702
−1.06114
1
−1.67786
−0.34953
1
Error: 0.03757 Steps: 76
FIGURE 11.4 NEURAL NETWORK FOR THE TINY EXAMPLE WITH FINAL WEIGHTS FROM R
OUTPUT. VALUES ON THE FIVE DOWNWARDS ARROWS DENOTE THE BIAS
FITTING A NETWORK TO DATA 281
we described earlier to compute the hidden layer’s output. For instance, for the
first record, the output of our previous node 3 is:
Output3 =
1
1 + e−[−1.06+(2.33)(0.9)+(4.11)(0.2)]
= 0.86.
Similarly, we can compute the output from the two other hidden nodes for
the same record and get Output4 = 0.21 and Output5 = 0.15. The second
4×2 table gives the weights connecting the hidden and output layer nodes. To
compute the probability (= propensity) for the dislike output node for the first
record, we use the outputs from the hidden layer that we computed above, and
get
Outputdislike =
1
1 + e−[−1.68+ (−3.61)(0.86)+ (5.62)(0.21)+ (6.70)(0.15)]
= 0.07.
Similarly, we can compute the probability for the like output node, obtaining the
value Output7 = 0.93.
The probabilities for the other five records are computed in the same manner,
replacing the input value in the computation of the hidden layer outputs and then
plugging these outputs into the computation for the output layer. The confusion
matrix based on these probabilities, using and a cutoff of 0.5, is given in Table
11.3. We can see that the network correctly classifies all six records.
TABLE 11.3 CONFUSION MATRIX FOR THE TINY EXAMPLE
code for the confusion matrix
library(caret)
predict <- compute(nn, data.frame(df$Salt, df$Fat))
predicted.class=apply(predict$net.result,1,which.max)-1
confusionMatrix(ifelse(predicted.class=="1", "dislike", "like"), df$Acceptance)
Output
> confusionMatrix(ifelse(predicted.class==”1″, “dislike”, “like”),
+ df$Acceptance)
Confusion Matrix and Statistics
Reference
Prediction dislike like
dislike 3 0
like 0 3
Accuracy : 1
282 NEURAL NETS
Example 2: Classifying Accident Severity
Let’s apply the network training process to some real data: US automobile acci-
dents that have been classified by their level of severity as no injury, injury, or
fatality. A firm might be interested in developing a system for quickly classifying
the severity of an accident, based on initial reports and associated data in the
system (some of which rely on GPS-assisted reporting). Such a system could
be used to assign emergency response team priorities. Table 11.4 shows a small
extract (10 records, four predictor variables) from a US government database.
The explanation of the four predictor variables and outcome variable is given
in Table 11.5. For the analysis, we converted ALCHL_I to a 0/1 dummy variable
(1 = presence of alcohol) and created four dummies for SUR_COND. This gives
us a total of seven predictors.
With the exception of alcohol involvement and a few other variables in the
larger database, most of the variables are ones that we might reasonably expect to
be available at the time of the initial accident report, before accident details and
severity have been determined by first responders. A data mining model that
could predict accident severity on the basis of these initial reports would have
value in allocating first responder resources.
To use a neural net architecture for this classification problem, we use seven
nodes in the input layer, one for each of the seven predictors, and three neurons
TABLE 11.4 SUBSET FROM THE ACCIDENTS DATA, FOR A HIGH-
FATALITY REGION
Obs. ALCHL_I PROFIL_I_R SUR_COND VEH_INVL MAX_SEV_IR
1 1 1 1 1 1
2 2 1 1 1 0
3 2 1 1 1 1
4 1 1 1 1 0
5 2 1 1 1 2
6 2 0 1 1 1
7 2 0 1 3 1
8 2 0 1 4 1
9 2 0 1 2 0
10 2 0 1 2 0
TABLE 11.5 DESCRIPTION OF VARIABLES FOR AUTOMOBILE ACCIDENT EXAMPLE
ALCHL_I Presence (1) or absence (2) of alcohol
PROFIL_I_R Profile of the roadway: level (1), other (0)
SUR_COND Surface condition of the road: dry (1), wet (2), snow/slush (3), ice (4), unknown (9)
VEH_INVL Number of vehicles involved
MAX_SEV_IR Presence of injuries/fatalities: no injuries (0), injury (1), fatality (2)
FITTING A NETWORK TO DATA 283
(one for each class) in the output layer. We use a single hidden layer and experi-
ment with the number of nodes. If we increase the number of nodes from one to
five and examine the resulting confusion matrices, we find that two nodes gives
a good balance between improving the predictive performance on the training
set without deteriorating the performance on the validation set. (Networks with
more than two nodes in the hidden layer performed as well as the two-node net-
work, but add undesirable complexity.) Tables 11.6 and 11.7 show the R code
and output for the accidents data.
Our results can depend on how we set the different parameters, and there
are a few pitfalls to avoid. We discuss these next.
Avoiding Overfitting
A weakness of the neural network is that it can easily overfit the data, causing
the error rate on validation data (and most important, on new data) to be too
large. It is therefore important to limit the number of training iterations and not
to over-train on the data (e.g., in R’s neuralnet() function you can control the
number of iterations using argument stepmax). As in classification and regression
trees, overfitting can be detected by examining the performance on the validation
set, or better, on a cross-validation set, and seeing when it starts deteriorating,
while the training set performance is still improving. This approach is used
in some algorithms, to limit the number of training iterations. The validation
error decreases in the early iterations of the training but after a while, it begins
to increase. The point of minimum validation error is a good indicator of the
best number of iterations for training, and the weights at that stage are likely to
provide the best error rate in new data.
Using the Output for Prediction and Classification
When the neural network is used for predicting a numerical outcome variable,
the resulting output needs to be scaled back to the original units of that outcome
variable. Recall that numerical variables (both predictor and outcome variables)
are usually rescaled to a [0,1] interval before being used by the network. The
output will therefore also be on a [0,1] scale. To transform the prediction back
to the original y units, which were in the range [a, b], we multiply the network
output by b− a and add a.
When the neural net is used for classification and we have m classes, we
will obtain an output from each of the m output nodes (or m − 1 output
nodes depending on the software.2 How do we translate these m outputs into a
2Function neuralnet() by default generated m− 1 output nodes for an outcome with m classes. To get
m output nodes, specify each class on the left-hand-side of the formula—see Table 11.2)
284 NEURAL NETS
TABLE 11.6 A NEURAL NETWORK WITH TWO NODES IN THE HIDDEN LAYER (ACCIDENTS DATA)
code for running and evaluating a neural net on the accidents data
library(neuralnet,nnet,caret)
accidents.df <- read.csv("Accidents.csv")
# selected variables
vars=c("ALCHL_I", "PROFIL_I_R", "VEH_INVL")
# partition the data
set.seed(2)
training=sample(row.names(accidents.df), dim(accidents.df)[1]*0.6)
validation=setdiff(row.names(accidents.df), training)
# when y has multiple classes - need to dummify
trainData <- cbind(accidents.df[training,c(vars)],
class.ind(accidents.df[training,]$SUR_COND),
class.ind(accidents.df[training,]$MAX_SEV_IR))
names(trainData)=c(vars,
paste("SUR_COND_", c(1, 2, 3, 4, 9), sep=""), paste("MAX_SEV_IR_", c(0, 1, 2), sep=""))
validData <- cbind(accidents.df[validation,c(vars)],
class.ind(accidents.df[validation,]$SUR_COND),
class.ind(accidents.df[validation,]$MAX_SEV_IR))
names(validData)=c(vars,
paste("SUR_COND_", c(1, 2, 3, 4, 9), sep=""), paste("MAX_SEV_IR_", c(0, 1, 2), sep=""))
# run nn with 2 hidden nodes
# use hidden= with a vector of integers specifying number of hidden nodes in each layer
nn <- neuralnet(MAX_SEV_IR_0 + MAX_SEV_IR_1 + MAX_SEV_IR_2 ~
ALCHL_I + PROFIL_I_R + VEH_INVL + SUR_COND_1 + SUR_COND_2
+ SUR_COND_3 + SUR_COND_4, data = trainData, hidden = 2)
training.prediction=compute(nn, trainData[,-c(8:11)])
training.class=apply(training.prediction$net.result,1,which.max)-1
confusionMatrix(training.class, accidents.df[training,]$MAX_SEV_IR)
validation.prediction=compute(nn, validData[,-c(8:11)])
validation.class=apply(validation.prediction$net.result,1,which.max)-1
confusionMatrix(validation.class, accidents.df[validation,]$MAX_SEV_IR)
Output
> confusionMatrix(training.class, accidents.df[training,]$MAX_SEV_IR)
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 332 0 34
1 0 167 40
2 1 7 18
> confusionMatrix(validation.class, accidents.df[validation,]$MAX_SEV_IR)
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 217 0 20
1 0 121 22
2 1 4 15
Overall Statistics
Accuracy : 0.8825
REQUIRED USER INPUT 285
TABLE 11.7 OUTPUT FOR NEURAL NETWORK FOR ACCIDENT DATA, WITH TWO NODES IN THE
HIDDEN LAYER
> confusionMatrix(training.class,
+ accidents.df[training,]$MAX_SEV_IR)
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 332 0 34
1 0 167 40
2 1 7 18
> confusionMatrix(validation.class,
+ accidents.df[validation,]$MAX_SEV_IR)
Confusion Matrix and Statistics
Reference
Prediction 0 1 2
0 217 0 20
1 0 121 22
2 1 4 15
Overall Statistics
Accuracy : 0.8825
classification rule? Usually, the output node with the largest value determines
the net’s classification.
In the case of a binary outcome (m = 2), we use two output nodes with a
cutoff value to map a predicted probability to one of the two classes. Although
we typically use a cutoff of 0.5 with other classifiers, in neural networks there is
a tendency for values to cluster around 0.5 (from above and below). An alter-
native is to use the validation set to determine a cutoff that produces reasonable
predictive performance.
11.4 Required User Input
One of the time-consuming and complex aspects of training a model using back
propagation is that we first need to decide on a network architecture. This means
specifying the number of hidden layers and the number of nodes in each layer.
The usual procedure is to make intelligent guesses using past experience and to
do several trial-and-error runs on different architectures. Algorithms exist that
grow the number of nodes selectively during training or trim them in a manner
analogous to what is done in classification and regression trees (see Chapter 9).
Research continues on such methods. As of now, no automatic method seems
clearly superior to the trial-and-error approach. A few general guidelines for
choosing an architecture follow.
286 NEURAL NETS
Number of hidden layers. The most popular choice for the number of hidden
layers is one. A single hidden layer is usually sufficient to capture even very
complex relationships between the predictors.
Size of hidden layer. The number of nodes in the hidden layer also determines
the level of complexity of the relationship between the predictors that the
network captures. The trade-off is between under- and overfitting. On the
one hand, using too few nodes might not be sufficient to capture complex
relationships (recall the special cases of a linear relationship such as in linear
and logistic regression, in the extreme case of zero nodes or no hidden layer).
On the other hand, too many nodes might lead to overfitting. A rule of
thumb is to start with p (number of predictors) nodes and gradually decrease
or increase while checking for overfitting. The number of hidden layers and
the size of each hidden layer can be specified in R’s neuralnet() function using
argument hidden.
Number of output nodes. For a categorical outcome withm classes, the number
of nodes should equal m or m − 1. For a numerical outcome, typically a
single output node is used unless we are interested in predicting more than
one function.
In addition to the choice of architecture, the user should pay attention to
the choice of predictors. Since neural networks are highly dependent on the quality
of their input, the choice of predictors should be done carefully, using domain
knowledge, variable selection, and dimension reduction techniques before using
the network. We return to this point in the discussion of advantages and
weaknesses.
Depending on the software, other parameters that the user might be able
to control are the learning rate (a.k.a. weight decay), l, and the momentum. The
first is used primarily to avoid overfitting, by down-weighting new information.
This helps to tone down the effect of outliers on the weights and avoids getting
stuck in local optima. This parameter typically takes a value in the range [0, 1].
Berry and Linoff (2000) suggest starting with a large value (moving away from
the random initial weights, thereby “learning quickly” from the data) and then
slowly decreasing it as the iterations progress and as the weights are more reliable.
Han and Kamber (2001) suggest the more concrete rule of thumb of setting l =
1/(current number of iterations). This means that at the start, l = 1, during the
second iteration it is l = 0.5, and then it keeps decaying toward l = 0. In R,
you can set the learning rate in function neuralnet() using argument learningrate.
The second parameter, called momentum, is used to “keep the ball rolling”
(hence the term momentum) in the convergence of the weights to the optimum.
The idea is to keep the weights changing in the same direction as they did in
the preceding iteration. This helps to avoid getting stuck in a local optimum.
EXPLORING THE RELATIONSHIP BETWEEN PREDICTORS AND OUTCOME 287
High values of momentum mean that the network will be “reluctant” to learn
from data that want to change the direction of the weights, especially when we
consider case updating. In general, values in the range 0–2 are used.
11.5 Exploring the Relationship Between
Predictors and Outcome
Neural networks are known to be “black boxes” in the sense that their output
does not shed light on the patterns in the data that it models (like our brains). In
fact, that is one of the biggest criticisms of the method. However, in some cases,
it is possible to learn more about the relationships that the network captures by
conducting a sensitivity analysis on the validation set. This is done by setting all
predictor values to their mean and obtaining the network’s prediction. Then,
the process is repeated by setting each predictor sequentially to its minimum,
and then maximum, value. By comparing the predictions from different levels
of the predictors, we can get a sense of which predictors affect predictions more
and in what way.
U N S U P E R V I S E D F E A T U R E E X T R A C T I O N
A N D D E E P L E A R N I N G
Most of the data we have dealt with in this book has been either numeric or cat-
egorical, and the available predictor variables or features have been inherently
informative (e.g., the size of a car’s engine, or a car’s gas mileage). It has been
relatively straightforward to discover their impact on the target variable, and focus
on the most meaningful ones. With some data, however, we begin with a huge
mass of values that are not “predictors” in the same sense. With image data, one
“variable” would be the color and intensity of the pixel in a particular position, and
one 2-inch square image might have 40,000 pixels. We are now in the realm of “big
data.”
How can we derive meaningful higher-level features such as edges, curves,
shapes, and even higher-level features such as faces? Deep learning has made big
strides in this area. Deep learning networks (DLNs) refer to neural nets with many
hidden layers used to self-learn features from complex data. DLNs are especially
effective at capturing local structure and dependencies in complex data, such as in
images or audio. This is done by using neural nets in an unsupervised way, where
the data are used as both the input (with some added noise) and the output. When
the multiple hidden layers are used, the DLN’s learning is hierarchical—typically,
from a single pixel to edges, from edges to higher-level features, and so on. The
network then “pools” similar high-level features across images to build up clusters of
similar features. If they appear often enough in large numbers of images, distinctive
and regular structures like faces, vehicles, and houses will emerge as highest-level
features. See Le et al. (2012) for a technical description of how this method yielded
288 NEURAL NETS
labels not just of faces but of different types of faces (e.g., cat faces, of which there
are many on the Internet).
Other applications include speech and handwriting recognition. Facebook and
Google are effectively using deep learning to identify faces from images, and high-
end retail stores use facial recognition technology involving deep learning to iden-
tify VIP customers and give them special sales attention.3
While DLNs are not new, they have only gained momentum recently, thanks
to dramatic improvements in computing power, allowing us to deal both with the
huge amount of data and the extreme complexity of the networks. The abundance
of unlabeled data such as images and audio has also helped tremendously. As with
neural networks, DLNs suffer from overfitting and slow runtime. Solutions include
using regularization (removing “useless” nodes by zeroing out their coefficients)
and tricks to improve over the back-propagation algorithm.
11.6 Advantages and Weaknesses of Neural
Networks
The most prominent advantage of neural networks is their good predictive per-
formance. They are known to have high tolerance to noisy data and the ability
to capture highly complicated relationships between the predictors and an out-
come variable. Their weakest point is in providing insight into the structure of
the relationship, hence their blackbox reputation.
Several considerations and dangers should be kept in mind when using neural
networks. First, although they are capable of generalizing from a set of examples,
extrapolation is still a serious danger. If the network sees only records in a certain
range, its predictions outside this range can be completely invalid.
Second, neural networks do not have a built-in variable selection mecha-
nism. This means that there is a need for careful consideration of predictors.
Combination with classification and regression trees (see Chapter 9) and other
dimension reduction techniques (e.g., principal components analysis in Chapter
4) is often used to identify key predictors.
Third, the extreme flexibility of the neural network relies heavily on hav-
ing sufficient data for training purposes. A related issue is that in classification
problems, the network requires sufficient records of the minority class in order
to learn it. This is achieved by oversampling, as explained in Chapter 2.
Fourth, a technical problem is the risk of obtaining weights that lead to a local
optimum rather than the global optimum, in the sense that the weights converge
3www.npr.org/sections/alltechconsidered/2013/07/21/203273764/high-end-stores-use-facial-
recognition-tools-to-spot-vips, accessed July 21, 2013.
http://www.npr.org/sections/alltechconsidered/2013/07/21/203273764/high-end-stores-use-facial-recognition-tools-to-spot-vips
ADVANTAGES AND WEAKNESSES OF NEURAL NETWORKS 289
to values that do not provide the best fit to the training data. We described several
parameters used to try to avoid this situation (such as controlling the learning
rate and slowly reducing the momentum). However, there is no guarantee that
the resulting weights are indeed the optimal ones.
Finally, a practical consideration that can determine the usefulness of a neural
network is the computation time. Neural networks are relatively heavy on com-
putation time, requiring a longer runtime than other classifiers. This runtime
grows greatly when the number of predictors is increased (as there will be many
more weights to compute). In applications where real-time or near-real-time
prediction is required, runtime should be measured to make sure that it does not
cause unacceptable delay in the decision-making.
290 NEURAL NETS
PROBLEMS
11.1 Credit Card Use. Consider the hypothetical bank data in Table 11.8 on consumers’
use of credit card credit facilities. Create a small worksheet in Excel, like that used in
Example 1, to illustrate one pass through a simple neural network.
TABLE 11.8 DATA FOR CREDIT CARD EXAMPLE AND VARIABLE
DESCRIPTIONS
Years Salary Used Credit
4 43 0
18 65 1
1 53 0
3 95 0
15 88 1
6 112 1
Years: number of years the customer has been with the bank
Salary: customer’s salary (in thousands of dollars)
Used Credit: 1 = customer has left an unpaid credit card balance at the end of
at least one month in the prior year, 0 = balance was paid off at the end of each
month
11.2 Neural Net Evolution. A neural net typically starts out with random coefficients;
hence, it produces essentially random predictions when presented with its first case.
What is the key ingredient by which the net evolves to produce a more accurate
prediction?
11.3 Car Sales. Consider the data on used cars (ToyotaCorolla.csv) with 1436 records and
details on 38 attributes, including Price, Age, KM, HP, and other specifications. The
goal is to predict the price of a used Toyota Corolla based on its specifications.
a. Fit a neural network model to the data. Use a single hidden layer with 2 nodes.
• Use predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax,
Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player,
Powered_Windows, Sport_Model, and Tow_Bar.
• Remember to first scale the numerical predictor and outcome variables to a 0–1
scale (use function preprocess() with method = “range”—see Chapter 7) and convert
categorical predictors to dummies.
Record the RMS error for the training data and the validation data. Repeat
the process, changing the number of hidden layers and nodes to {single layer with
5 nodes}, {two layers, 5 nodes in each layer}.
i. What happens to the RMS error for the training data as the number of layers
and nodes increases?
ii. What happens to the RMS error for the validation data?
iii. Comment on the appropriate number of layers and nodes for this application.
11.4 Direct Mailing to Airline Customers. East-West Airlines has entered into a part-
nership with the wireless phone company Telcon to sell the latter’s service via direct
mail. The file EastWestAirlinesNN.csv contains a subset of a data sample of who has
already received a test offer. About 13% accepted.
PROBLEMS 291
You are asked to develop a model to classify East–West customers as to whether
they purchase a wireless phone service contract (outcome variable Phone_Sale). This
model will be used to classify additional customers.
a. Run a neural net model on these data, using a single hidden layer with 5 nodes.
Remember to first convert categorical variables into dummies and scale numerical
predictor variables to a 0–1 (use function preprocess() with method = “range”—see
Chapter 7). Generate a decile-wise lift chart for the training and validation sets.
Interpret the meaning (in business terms) of the leftmost bar of the validation decile-
wise lift chart.
b. Comment on the difference between the training and validation lift charts.
c. Run a second neural net model on the data, this time setting the number of hidden
nodes to 1. Comment now on the difference between this model and the model
you ran earlier, and how overfitting might have affected results.
d. What sort of information, if any, is provided about the effects of the various vari-
ables?
CHAPTER 12
Discriminant Analysis
In this chapter, we describe the method of discriminant analysis, which is a
model-based approach to classification. We discuss the main principle, where
classification is based on the distance of a record from each of the class means. We
explain the underlying measure of “statistical distance,” which takes into account
the correlation between predictors. The output of a discriminant analysis proce-
dure generates estimated “classification functions,” which are then used to pro-
duce classification scores that can be translated into classifications or propensities
(probabilities of class membership). One can also directly integrate misclassi-
fication costs into the discriminant analysis setup, and we explain how this is
achieved. Finally, we discuss the underlying model assumptions, the practical
robustness to some assumption violations, and the advantages of discriminant
analysis when the assumptions are reasonably met (for example, the sufficiency
of a small training sample).
12.1 Introduction
Discriminant analysis is a classification method. Like logistic regression, it is a
classical statistical technique that can be used for classification and profiling. It
uses sets of measurements on different classes of records to classify new records
into one of those classes (classification). Common uses of the method have been
in classifying organisms into species and subspecies; classifying applications for
loans, credit cards, and insurance into low- and high-risk categories; classifying
customers of new products into early adopters, early majority, late majority, and
laggards; classifying bonds into bond rating categories; classifying skulls of human
fossils; as well as in research studies involving disputed authorship, decisions on
college admission, medical studies involving alcoholics and nonalcoholics, and
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
293
294 DISCRIMINANT ANALYSIS
methods to identify human fingerprints. Discriminant analysis can also be used
to highlight aspects that distinguish the classes (profiling).
We return to two examples that were described in earlier chapters, the Rid-
ing Mowers and Personal Loan Acceptance examples. In each of these, the
outcome variable has two classes. We close with a third example involving more
than two classes.
Example 1: Riding Mowers
We return to the example from Chapter 7, where a riding mower manufacturer
would like to find a way of classifying families in a city into those likely to
purchase a riding mower and those not likely to purchase one. A pilot random
sample of 12 owners and 12 nonowners in the city is undertaken. The data are
given in Chapter 7 (Table 7.1), and a scatter plot is shown in Figure 12.1. We can
think of a linear classification rule as a line that separates the two-dimensional
region into two parts, with most of the owners in one half-plane and most
nonowners in the complementary half-plane. A good classification rule would
separate the data so that the fewest points are misclassified: The line shown in
Figure 12.1 seems to do a good job in discriminating between the two classes as
it makes four misclassifications out of 24 points. Can we do better?
FIGURE 12.1 SCATTER PLOT OF LOT SIZE VS. INCOME FOR 24 OWNERS AND NONOWNERS OF
RIDING MOWERS. THE (AD HOC) LINE TRIES TO SEPARATE OWNERS FROM
NONOWNERS
Example 2: Personal Loan Acceptance
The riding mowers example is a classic example and is useful in describing the
concept and goal of discriminant analysis. However, in today’s business applica-
tions, the number of records is much larger, and their separation into classes is
much less distinct. To illustrate this, we return to the Universal Bank example
INTRODUCTION 295
FIGURE 12.2 PERSONAL LOAN ACCEPTANCE AS A FUNCTION OF INCOME AND CREDIT CARD
SPENDING FOR 5000 CUSTOMERS OF THE UNIVERSAL BANK (IN LOG SCALE)
described in Chapter 9, where the bank’s goal is to identify new customers most
likely to accept a personal loan. For simplicity, we will consider only two predic-
tor variables: the customer’s annual income (Income, in $000s), and the average
monthly credit card spending (CCAvg, in $000s). The first part of Figure 12.2
shows the acceptance of a personal loan by a subset of 200 customers from the
bank’s database as a function of Income and CCAvg. We use a logarithmic scale
on both axes to enhance visibility because there are many points condensed in
the low-income, low-CC spending area. Even for this small subset, the sepa-
ration is not clear. The second figure shows all 5000 customers and the added
complexity of dealing with large numbers of records.
296 DISCRIMINANT ANALYSIS
12.2 Distance of a Record from a Class
Finding the best separation between records involves measuring their distance
from their class. The general idea is to classify a record to the class to which it
is closest. Suppose that we are required to classify a new customer of Universal
Bank as being an acceptor or a nonacceptor of their personal loan offer, based on an
income of x. From the bank’s database we find that the mean income for loan
acceptors was $144.75K and for nonacceptors $66.24K. We can use Income as a
predictor of loan acceptance via a simple Euclidean distance rule: If x is closer to the
mean income of the acceptor class than to the mean income of the nonacceptor
class, classify the customer as an acceptor; otherwise, classify the customer as a
nonacceptor. In other words, if |x− 144.75| < |x− 66.24|, then classification =
acceptor; otherwise, nonacceptor. Moving from a single predictor variable (income)
to two or more predictor variables, the equivalent of the mean of a class is the
centroid of a class. This is simply the vector of means x = [x1, . . . , xp]. The
Euclidean distance between a record with p measurements x = [x1, . . . , xp] and
the centroid x is defined as the square root of the sum of the squared differences
between the individual values and the means:
DEuclidean(x, x) =
√
(x1 − x1)2 + · · ·+ (xp − xp)2. (12.1)
Using the Euclidean distance has three drawbacks. First, the distance
depends on the units we choose to measure the predictor variables. We will
get different answers if we decide to measure income in dollars, for instance,
rather than in thousands of dollars.
Second, Euclidean distance does not take into account the variability of the
variables. For example, if we compare the variability in income in the two classes,
we find that for acceptors, the standard deviation is lower than for nonacceptors
($31.6K vs. $40.6K). Therefore, the income of a new customer might be closer
to the acceptors’ mean income in dollars, but because of the large variability
in income for nonacceptors, this customer is just as likely to be a nonacceptor.
We therefore want the distance measure to take into account the variance of the
different variables and measure a distance in standard deviations rather than in
the original units. This is equivalent to z-scores.
Third, Euclidean distance ignores the correlation between the variables. This
is often a very important consideration, especially when we are using many pre-
dictor variables to separate classes. In this case, there will often be variables,
which by themselves are useful discriminators between classes, but in the pres-
ence of other predictor variables are practically redundant, as they capture the
same effects as the other variables.
A solution to these drawbacks is to use a measure called statistical distance (or
Mahalanobis distance). Let us denote by S the covariance matrix between the p
FISHER’S LINEAR CLASSIFICATION FUNCTIONS 297
variables. The definition of a statistical distance is
DStatistical(x, x) = [x − x]′S−1[x − x]
= [(x1 − x1), (x2 − x2), . . . , (xp − xp)]S−1
x1 − x1
x2 − x2
...
xp − xp
(12.2)
(the notation ′, which represents transpose operation, simply turns the column
vector into a row vector). S−1 is the inverse matrix of S, which is the p-
dimension extension to division. When there is a single predictor (p = 1),
this formula reduces to a (squared) z-score calcualtion, since we subtract the
mean and divide by the standard deviation. The statistical distance takes into
account not only the predictor means, but also the spread of the predictor values
and the correlations between the different predictors. To compute a statistical
distance between a record and a class, we must compute the predictor means (the
centroid) and the covariances between each pair of predictors. These are used
to construct the distances. The method of discriminant analysis uses statistical
distance as the basis for finding a separating line (or, if there are more than two
variables, a separating hyperplane) that is equally distant from the different class
means.1 It is based on measuring the statistical distances of a record to each of
the classes and allocating it to the closest class. This is done through classification
functions, which are explained next.
12.3 Fisher’s Linear Classification Functions
Linear classification functions were proposed in 1936 by the noted statistician
R. A. Fisher as the basis for improved separation of records into classes. The
idea is to find linear functions of the measurements that maximize the ratio of
between-class variability to within-class variability. In other words, we would
obtain classes that are very homogeneous and differ the most from each other.
For each record, these functions are used to compute scores that measure the
proximity of that record to each of the classes. A record is classified as belonging
to the class for which it has the highest classification score (equivalent to the
smallest statistical distance).
The classification functions are estimated using software. For example, Table
12.1 shows the classification functions obtained from running discriminant anal-
ysis on the riding mowers data, using two predictors. Note that the number
1An alternative approach finds a separating line or hyperplane that is “best” at separating the different
clouds of points. In the case of two classes, the two methods coincide.
298 DISCRIMINANT ANALYSIS
TABLE 12.1 DISCRIMINANT ANALYSIS FOR RIDING-MOWER DATA, DISPLAYING THE
ESTIMATED CLASSIFICATION FUNCTIONS
code for linear discriminant analysis
library(DiscriMiner)
mowers.df <- read.csv("RidingMowers.csv")
da.reg <- linDA(mowers.df[,1:2], mowers.df[,3])
da.reg$functions
Output
> da.reg$functions
Nonowner Owner
constant -51.4214499777 -73.1602116488
Income 0.3293554091 0.4295857129
Lot_Size 4.6815655074 5.4667502174
of classification functions is equal to the number of classes (in this case, two:
owner/nonowner).
U S I N G C L A S S I F I C A T I O N F U N C T I O N S C O R E S
T O C L A S S I F Y R E C O R D S
For each record, we calculate the value of the classification function (one for
each class); whichever class’s function has the highest value (= score) is the class
assigned to that record.
To classify a family into the class of owners or nonowners, we use the clas-
sification functions to compute the family’s classification scores: A family is
classified into the class of owners if the owner function score is higher than
the nonowner function score, and into nonowners if the reverse is the case.
These functions are specified in a way that can be easily generalized to more
than two classes. The values given for the functions are simply the weights
to be associated with each variable in the linear function in a manner anal-
ogous to multiple linear regression. For instance, the first household has an
income of $60K and a lot size of 18.4K ft2. Their owner score is therefore
−73.16 + (0.43)(60) + (5.47)(18.4) = 53.2, and their nonowner score is
−51.42 + (0.33)(60) + (4.68)(18.4) = 54.48. Since the second score is
higher, the household is (mis)classified by the model as a nonowner. Such cal-
culations are done by the software and do not need to be done manually. For
example, the scores and classifications for all 24 households produced by R’s
linDA() function are given in Table 12.2.
FISHER’S LINEAR CLASSIFICATION FUNCTIONS 299
TABLE 12.2 CLASSIFICATION SCORES, PREDICTED CLASSES, AND PROBABILITIES FOR
RIDING-MOWER DATA
code for obtaining classification scores, predicted classes, and probabilities
da.reg <- linDA(mowers.df[,1:2], mowers.df[,3])
# compute probabilities manually (below); or, use lda() in package MASS with predict()
propensity.owner <- exp(da.reg$scores[,2])/(exp(da.reg$scores[,1])+exp(da.reg$scores[,2]))
data.frame(Actual=mowers.df$Ownership,
da.reg$classification, da.reg$scores, propensity.owner=propensity.owner)
Output
Actual da.reg.classification Nonowner Owner propensity.owner
1 Owner Nonowner 54.48067990 53.20313512 0.21796844638
2 Owner Owner 55.38873802 55.41077045 0.50550788510
3 Owner Owner 71.04259549 72.75874724 0.84763249334
4 Owner Owner 66.21047023 66.96771421 0.68075507343
5 Owner Owner 87.71741659 93.22905050 0.99597675014
6 Owner Owner 74.72663830 79.09877951 0.98753320250
7 Owner Owner 66.54448713 69.44984917 0.94811086581
8 Owner Owner 80.71624526 84.86469025 0.98445646665
9 Owner Owner 64.93538340 65.81620689 0.70699283969
10 Owner Owner 76.58516562 80.49966417 0.98043968864
11 Owner Owner 68.37011705 69.01716449 0.65634480272
12 Owner Owner 68.88764830 70.97123544 0.88929767077
13 Nonowner Owner 65.03888965 66.20702108 0.76280709688
14 Nonowner Nonowner 63.34507817 63.23031851 0.47134152980
15 Nonowner Nonowner 50.44370726 48.70504628 0.14948309577
16 Nonowner Nonowner 58.31064004 56.91959558 0.19924106677
17 Nonowner Owner 58.63995731 59.13979206 0.62242049465
18 Nonowner Nonowner 47.17838908 44.19020925 0.04796273478
19 Nonowner Nonowner 43.04730944 39.82518317 0.03834151004
20 Nonowner Nonowner 56.45681236 55.78064940 0.33711822839
21 Nonowner Nonowner 40.96767073 36.85685471 0.01612995010
22 Nonowner Nonowner 47.46071006 43.79102096 0.02485107730
23 Nonowner Nonowner 30.91759299 25.28316275 0.00355999342
24 Nonowner Nonowner 38.61511030 34.81159148 0.02180608595
An alternative way for classifying a record into one of the classes is to com-
pute the probability of belonging to each of the classes and assigning the record
to the most likely class. If we have two classes, we need only compute a single
probability for each record (of belonging to owners, for example). Using a cutoff
of 0.5 is equivalent to assigning the record to the class with the highest classifica-
tion score. The advantage of this approach is that we obtain propensities, which
can be used for goals such as ranking: we sort the records in order of descending
probabilities and generate lift curves.
Let us assume that there arem classes. To compute the probability of belong-
ing to a certain class k, for a certain record i, we need to compute all the clas-
sification scores c1(i), c2(i), . . . , cm(i) and combine them using the following
300 DISCRIMINANT ANALYSIS
formula:
P[record i(with measurements x1, x2, ..., xp) belongs to class k]
=
eck(i)
ec1(i) + ec2(i) + · · ·+ ecm(i)
.
These probabilities and their computation for the riding mower data are given
in Table 12.2. In R, function linDA() in package DiscriMiner provides scores
and classifications, but does not directly provide propensities, so we need to use
the above formula to compute probabilities from the scores (see R code in Table
12.2). Alternatively, use function lda in package MASS to automatically compute
probabilities.
We now have three misclassifications, compared to four in our original (ad
hoc) classification. This can be seen in Figure 12.3, which includes the line
resulting from the discriminant model.2
FIGURE 12.3 CLASS SEPARATION OBTAINED FROM THE DISCRIMINANT MODEL (COMPARED TO
AD HOC LINE FROM FIGURE 12.1)
12.4 Classification Performance of
Discriminant Analysis
The discriminant analysis method relies on two main assumptions to arrive at
classification scores: First, it assumes that the predictor measurements in all classes
come from a multivariate normal distribution. When this assumption is reason-
ably met, discriminant analysis is a more powerful tool than other classification
2The slope of the line is given by −a1/a2 and the intercept is a1/a2 x1 + x2, where ai is the
difference between the ith classification function coefficients of owners and nonowners (e.g., here
aincome = 0.43− 0.33).
CLASSIFICATION PERFORMANCE OF DISCRIMINANT ANALYSIS 301
methods, such as logistic regression. In fact, Efron (1975) showed that discrim-
inant analysis is 30% more efficient than logistic regression if the data are mul-
tivariate normal, in the sense that we require 30% less records to arrive at the
same results. In practice, it has been shown that this method is relatively robust
to departures from normality in the sense that predictors can be non-normal and
even dummy variables. This is true as long as the smallest class is sufficiently large
(approximately more than 20 records). This method is also known to be sensitive
to outliers in both the univariate space of single predictors and in the multivari-
ate space. Exploratory analysis should therefore be used to locate extreme cases
and determine whether they can be eliminated.
The second assumption behind discriminant analysis is that the correlation
structure between the different predictors within a class is the same across classes.
This can be roughly checked by computing the correlation matrix between the
predictors separately for each class and comparing matrices. If the correlations
differ substantially across classes, the classifier will tend to classify records into the
class with the largest variability. When the correlation structure differs signifi-
cantly and the dataset is very large, an alternative is to use quadratic discriminant
analysis.3
Notwithstanding the caveats embodied in these statistical assumptions, recall
that in a predictive modeling environment, the ultimate test is whether the model
works effectively. A reasonable approach is to conduct some exploratory anal-
ysis with respect to normality and correlation, train and evaluate a model, then,
depending on classification accuracy and what you learned from the initial explo-
ration, circle back and explore further whether outliers should be examined or
choice of predictor variables revisited.
With respect to the evaluation of classification accuracy, we once again use
the general measures of performance that were described in Chapter 5 (judging
the performance of a classifier), with the principal ones based on the confusion
matrix (accuracy alone or combined with costs) for classification and the lift chart
for ranking. The same argument for using the validation set for evaluating per-
formance still holds. For example, in the riding mowers example, families 1, 13,
and 17 are misclassified. This means that the model yields an error rate of 12.5%
for these data. However, this rate is a biased estimate—it is overly optimistic,
because we have used the same data for fitting the classification functions and for
estimating the error. Therefore, as with all other models, we test performance
on a validation set that includes data that were not involved in estimating the
classification functions.
3In practice, quadratic discriminant analysis has not been found useful except when the difference in
the correlation matrices is large and the number of records available for training and testing is large.
The reason is that the quadratic model requires estimating many more parameters that are all subject to
error [for m classes and p variables, the total number of parameters to be estimated for all the different
correlation matrices is mp(p+ 1)/2].
302 DISCRIMINANT ANALYSIS
To obtain the confusion matrix from a discriminant analysis, we either use
the classification scores directly or the propensities (probabilities of class member-
ship) that are computed from the classification scores. In both cases, we decide
on the class assignment of each record based on the highest score or probability.
We then compare these classifications to the actual class memberships of these
records. This yields the confusion matrix.
12.5 Prior Probabilities
So far we have assumed that our objective is to minimize the classification error.
The method presented above assumes that the chances of encountering a record
from either class is the same. If the probability of encountering a record for clas-
sification in the future is not equal for the different classes, we should modify
our functions to reduce our expected (long-run average) error rate. The modi-
fication is done as follows: Let us denote by pj the prior or future probability of
membership in class j (in the two-class case we have p1 and p2 = 1− p1). We
modify the classification function for each class by adding log(pj). To illustrate
this, suppose that the percentage of riding mower owners in the population is
15%, compared to 50% in the sample. This means that the model should classify
fewer households as owners. To account for this distortion, we adjust the constants
in the classification functions from Table 12.1 and obtain the adjusted constants
−73.16+ log(0.15) = −75.06 for owners and −51.42+ log(0.85) = −50.58
for nonowners. To see how this can affect classifications, consider family 13,
which was misclassified as an owner in the case involving equal probability of
class membership. When we account for the lower probability of owning a
mower in the population, family 13 is classified properly as a nonowner (its
owner classification score is below the nonowner score).
12.6 Unequal Misclassification Costs
A second practical modification is needed when misclassification costs are not
symmetrical. If the cost of misclassifying a class 1 record is very different from
the cost of misclassifying a class 2 record, we may want to minimize the expected
cost of misclassification rather than the simple error rate (which does not account
for unequal misclassification costs). In the two-class case, it is easy to manipu-
late the classification functions to account for differing misclassification costs (in
addition to prior probabilities). We denote by q1 the cost of misclassifying a class
1 member (into class 2). Similarly, q2 denotes the cost of misclassifying a class 2
member (into class 1). These costs are integrated into the constants of the classi-
fication functions by adding log(q1) to the constant for class 1 and log(q2) to the
constant of class 2. To incorporate both prior probabilities and misclassification
costs, add log(p1q1) to the constant of class 1 and log(p2q2) to that of class 2.
CLASSIFYING MORE THAN TWO CLASSES 303
In practice, it is not always simple to come up with misclassification costs q1
and q2 for each class. It is usually much easier to estimate the ratio of costs q2/q1
(e.g., the cost of misclassifying a credit defaulter is 10 times more expensive
than that of misclassifying a nondefaulter). Luckily, the relationship between the
classification functions depends only on this ratio. Therefore, we can set q1 = 1
and q2 = ratio and simply add log(q2/q1) to the constant for class 2.
12.7 Classifying More Than Two Classes
Example 3: Medical Dispatch to Accident Scenes
Ideally, every automobile accident call to the emergency number 911 results
in the immediate dispatch of an ambulance to the accident scene. However, in
some cases the dispatch might be delayed (e.g., at peak accident hours or in some
resource-strapped towns or shifts). In such cases, the 911 dispatchers must make
decisions about which units to send based on sketchy information. It is useful
to augment the limited information provided in the initial call with additional
information in order to classify the accident as minor injury, serious injury, or
death. For this purpose, we can use data that were collected on automobile
accidents in the United States in 2001 that involved some type of injury. For
each accident, additional information is recorded, such as day of week, weather
conditions, and road type. Figure 12.4 shows a small sample of records with 11
measurements of interest.
Accident #
RushH
our
WRK_
ZONE
WKDY INT_HWY LGTCON LEVEL
SPD_
LIM
SUR_C
OND
TRAF_WAY WEATHER MAX_SEV
11 10 1 dark_light 701 ice one_way adverse no-injury
12 10 0 dark_light 700 ice divided adverse no-injury
13 10 0 dark_light 650 ice divided adverse non-fatal
14 10 0 dark_light 550 ice two_way not_adverse non-fatal
15 00 0 dark_light 350 snow one_way adverse no-injury
16 10 0 dark_light 351 wet divided adverse no-injury
07 10 1 dark_light 701 wet divided adverse non-fatal
08 10 0 dark_light 351 wet two_way adverse no-injury
19 10 0 dark_light 250 wet one_way adverse non-fatal
10 01 01 dark_light 350 wet divided adverse non-fatal
11 01 01 dark_light 300 wet divided adverse non-fatal
12 01 01 dark_light 600 wet divided not_adverse no-injury
13 01 01 dark_light 400 wet two_way not_adverse no-injury
14 00 01 day 651 dry two_way not_adverse fatal
15 01 00 day 550 dry two_way not_adverse fatal
16 01 01 day 550 dry two_way not_adverse non-fatal
17 01 00 day 550 dry two_way not_adverse non-fatal
18 00 01 dark 550 ice two_way not_adverse no-injury
19 00 00 dark 500 ice two_way adverse no-injury
20 00 00 dark 551 snow divided adverse no-injury
FIGURE 12.4 SAMPLE OF 20 AUTOMOBILE ACCIDENTS FROM THE 2001 DEPARTMENT OF
TRANSPORTATION DATABASE. EACH ACCIDENT IS CLASSIFIED AS ONE OF THREE
INJURY TYPES (NO-INJURY, NONFATAL, OR FATAL), AND HAS 10 MORE
MEASUREMENTS (EXTRACTED FROM A LARGER SET OF MEASUREMENTS)
304 DISCRIMINANT ANALYSIS
The goal is to see how well the predictors can be used to classify injury type
correctly. To evaluate this, a sample of 1000 records was drawn and partitioned
into training and validation sets, and a discriminant analysis was performed on the
training data. The output structure is very similar to that for the two-class case.
The only difference is that each record now has three classification functions
(one for each injury type), and the confusion and error matrices are of size 3×3
to account for all the combinations of correct and incorrect classifications (see
Table 12.3). The rule for classification is still to classify a record to the class
that has the highest corresponding classification score. The classification scores
are computed, as before, using the classification function coefficients. This can
TABLE 12.3 DISCRIMINANT ANALYSIS FOR THE THREE-CLASS INJURY EXAMPLE:
CLASSIFICATION FUNCTIONS AND CONFUSION MATRIX FOR TRAINING SET
code for running linear discriminant analysis on the accidents data
library(DiscriMiner)
library(caret)
accidents.df <- read.csv("Accidents.csv")
da.reg <- linDA(accidents.df[,1:10], accidents.df[,11])
da.reg$functions
confusionMatrix(da.reg$classification, accidents.df$MAX_SEV)
Output
> da.reg$functions
fatal no-injury non-fatal
constant -25.5958095610 -24.514323034 -24.2336221574
RushHour 0.9225623509 1.952403425 1.9031991672
WRK_ZONE 0.5178609440 1.195060274 0.7705682214
WKDY 4.7801494470 6.417633787 6.1165223679
INT_HWY -1.8418782656 -2.673037935 -2.5366224810
LGTCON_day 3.7070124215 3.666075602 3.7276207831
LEVEL 2.6268937732 1.567550702 1.7138656960
SPD_LIM 0.5051317221 0.461479676 0.4520847732
SUR_COND_dry 9.9988600752 15.833794528 16.2565639740
TRAF_two_way 7.1079766143 6.342147286 6.3549435330
WEATHER_adverse 9.6880211017 16.363876853 16.3172755675
> confusionMatrix(da.reg$classification, accidents.df$MAX_SEV)
Confusion Matrix and Statistics
Reference
Prediction fatal no-injury non-fatal
fatal 1 6 6
no-injury 1 114 95
non-fatal 3 172 202
Overall Statistics
Accuracy : 0.5283333
CLASSIFYING MORE THAN TWO CLASSES 305
TABLE 12.4 CLASSIFICATION SCORES, MEMBERSHIP PROBABILITIES, AND CLASSIFICATIONS
FOR THE THREE-CLASS INJURY TRAINING DATASET
code for producing linear discriminant analysis scores and propensities
prob <- exp(da.reg$scores[,1:3])/
(exp(da.reg$scores[,1])+exp(da.reg$scores[,2])+exp(da.reg$scores[,3]))
res <- data.frame(Classification = lda.reg$classification,
Actual = accidents.df$MAX_SEV,
Score = round(da.reg$scores,2),
Propensity = round(propensity,2))
head(res)
Output
> head(res)
Classification Actual Score.fatal Score.no.injury Score.non.fatal
1 no-injury no-injury 25.94 31.42 30.93
2 no-injury non-fatal 15.00 15.58 15.01
3 no-injury no-injury 2.69 9.95 9.81
4 no-injury no-injury 10.10 17.94 17.64
5 no-injury non-fatal 2.42 11.76 11.41
6 no-injury non-fatal 7.47 16.37 15.93
Propensity.fatal Propensity.no.injury Propensity.non.fatal
1 0.00 0.62 0.38
2 0.26 0.47 0.27
3 0.00 0.54 0.46
4 0.00 0.57 0.43
5 0.00 0.59 0.41
6 0.00 0.61 0.39
be seen in Table 12.4. For instance, the no-injury classification score for the
first accident in the training set is −24.51 + (1.95)(1) + (1.19)(0) + · · · +
(16.36)(1) = 31.42. The nonfatal score is similarly computed as 30.93 and the
fatal score as 25.94. Since the no-injury score is highest, this accident is (correctly)
classified as having no injuries.
We can also compute for each accident, the propensities (estimated probabil-
ities) of belonging to each of the three classes using the same relationship between
classification scores and probabilities as in the two-class case. For instance, the
probability of the above accident involving nonfatal injuries is estimated by the
model as
e30.93
e31.42 + e30.93 + e25.94
= 0.38. (12.3)
The probabilities of an accident involving no injuries or fatal injuries are com-
puted in a similar manner. For the first accident in the training set, the highest
probability is that of involving no injuries, and therefore it is classified as a no-
injury accident.
306 DISCRIMINANT ANALYSIS
12.8 Advantages and Weaknesses
Discriminant analysis is typically considered more of a statistical classification
method than a data mining method. This is reflected in its absence or short
mention in many data mining resources. However, it is very popular in social
sciences and has shown good performance. The use and performance of dis-
criminant analysis are similar to those of multiple linear regression. The two
methods therefore share several advantages and weaknesses.
Like linear regression, discriminant analysis searches for the optimal weight-
ing of predictors. In linear regression, weighting is with relation to the numerical
outcome variable, whereas in discriminant analysis, it is with relation to separat-
ing the classes. Both use least squares for estimation and the resulting estimates
are robust to local optima.
In both methods, an underlying assumption is normality. In discriminant
analysis, we assume that the predictors are approximately from a multivariate
normal distribution. Although this assumption is violated in many practical sit-
uations (such as with commonly-used binary predictors), the method is surpris-
ingly robust. According to Hastie et al. (2001), the reason might be that data
can usually support only simple separation boundaries, such as linear boundaries.
However, for continuous variables that are found to be very skewed (as can be
seen through a histogram), transformations such as the log transform can improve
performance. In addition, the method’s sensitivity to outliers commands explor-
ing the data for extreme values and removing those records from the analysis.
An advantage of discriminant analysis as a classifier (like logistic regression
in this respect) is that it provides estimates of single-predictor contributions.4
This is useful for obtaining a ranking of predictor importance, and for variable
selection.
Finally, the method is computationally simple, parsimonious, and especially
useful for small datasets. With its parametric form, discriminant analysis makes
the most out of the data and is therefore especially useful with small samples (as
explained in Section 12.4).
4Comparing predictor contribution requires normalizing all the predictors before running discriminant
analysis. Then, compare each coefficient across the two classification functions: coefficients with large
differences indicate a predictor with high separation power.
PROBLEMS 307
PROBLEMS
12.1 Personal Loan Acceptance. Universal Bank is a relatively young bank growing
rapidly in terms of overall customer acquisition. The majority of these customers are
liability customers with varying sizes of relationship with the bank. The customer
base of asset customers is quite small, and the bank is interested in expanding this
base rapidly to bring in more loan business. In particular, it wants to explore ways of
converting its liability customers to personal loan customers.
A campaign the bank ran for liability customers last year showed a healthy conver-
sion rate of over 9% successes. This has encouraged the retail marketing department
to devise smarter campaigns with better target marketing. The goal of our analysis is
to model the previous campaign’s customer behavior to analyze what combination of
factors make a customer more likely to accept a personal loan. This will serve as the
basis for the design of a new campaign.
The file UniversalBank.csv contains data on 5000 customers. The data include
customer demographic information (e.g., age, income), the customer’s relationship
with the bank (e.g., mortgage, securities account), and the customer response to the
last personal loan campaign (Personal Loan). Among these 5000 customers, only 480
(= 9.6%) accepted the personal loan that was offered to them in the previous campaign.
Partition the data (60% training and 40% validation) and then perform a discrim-
inant analysis that models Personal Loan as a function of the remaining predictors
(excluding zip code). Remember to turn categorical predictors with more than two
categories into dummy variables first. Specify the success class as 1 (personal loan
acceptance), and use the default cutoff value of 0.5.
a. Compute summary statistics for the predictors separately for loan acceptors and
nonacceptors. For continuous predictors, compute the mean and standard devi-
ation. For categorical predictors, compute the percentages. Are there predictors
where the two classes differ substantially?
b. Examine the model performance on the validation set.
i. What is the accuracy rate?
ii. Is one type of misclassification more likely than the other?
iii. Select three customers who were misclassified as acceptors and three who were
misclassified as nonacceptors. The goal is to determine why they are misclassified.
First, examine their probability of being classified as acceptors: is it close to the
threshold of 0.5? If not, compare their predictor values to the summary statistics
of the two classes to determine why they were misclassified.
c. As in many marketing campaigns, it is more important to identify customers who
will accept the offer rather than customers who will not accept it. Therefore, a
good model should be especially accurate at detecting acceptors. Examine the lift
chart and decile-wise lift chart for the validation set and interpret them in light of
this ranking goal.
d. Compare the results from the discriminant analysis with those from a logistic regres-
sion (both with cutoff 0.5 and the same predictors). Examine the confusion matri-
ces, the lift charts, and the decile charts. Which method performs better on your
validation set in detecting the acceptors?
e. The bank is planning to continue its campaign by sending its offer to 1000 additional
customers. Suppose that the cost of sending the offer is $1 and the profit from an
accepted offer is $50. What is the expected profitability of this campaign?
308 DISCRIMINANT ANALYSIS
f. The cost of misclassifying a loan acceptor customer as a nonacceptor is much higher
than the opposite misclassification cost. To minimize the expected cost of misclas-
sification, should the cutoff value for classification (which is currently at 0.5) be
increased or decreased?
12.2 Identifying Good System Administrators. A management consultant is study-
ing the roles played by experience and training in a system administrator’s ability to
complete a set of tasks in a specified amount of time. In particular, she is interested
in discriminating between administrators who are able to complete given tasks within
a specified time and those who are not. Data are collected on the performance of 75
randomly selected administrators. They are stored in the file SystemAdministrators.csv.
Using these data, the consultant performs a discriminant analysis. The vari-
able Experience measures months of full time system administrator experience, while
Training measures number of relevant training credits. The dependent variable Com-
pleted is either Yes or No, according to whether or not the administrator completed
the tasks.
a. Create a scatter plot of Experience vs. Training using color or symbol to differ-
entiate administrators who completed the tasks from those who did not complete
them. See if you can identify a line that separates the two classes with minimum
misclassification.
b. Run a discriminant analysis with both predictors using the entire dataset as training
data. Among those who completed the tasks, what is the percentage of adminis-
trators who are classified incorrectly as failing to complete the tasks?
c. Compute the two classification scores for an administrator with 4 months of expe-
rience and 6 credits of training. Based on these, how would you classify this admin-
istrator?
d. How much experience must be accumulated by an administrator with 4 training
credits before his or her estimated probability of completing the tasks exceeds 0.5?
e. Compare the classification accuracy of this model to that resulting from a logistic
regression with cutoff 0.5.
12.3 Detecting Spam E-mail (from the UCI Machine Learning Repository). A
team at Hewlett-Packard collected data on a large number of e-mail messages from
their postmaster and personal e-mail for the purpose of finding a classifier that can
separate e-mail messages that are spam vs. nonspam (a.k.a. “ham”). The spam con-
cept is diverse: It includes advertisements for products or websites, “make money fast”
schemes, chain letters, pornography, and so on. The definition used here is “unso-
licited commercial e-mail.” The file Spambase.csv contains information on 4601 e-mail
messages, among which 1813 are tagged “spam.” The predictors include 57 attributes,
most of them are the average number of times a certain word (e.g., mail, George) or
symbol (e.g., #, !) appears in the e-mail. A few predictors are related to the number
and length of capitalized words.
a. To reduce the number of predictors to a manageable size, examine how each pre-
dictor differs between the spam and nonspam e-mails by comparing the spam-class
average and nonspam-class average. Which are the 11 predictors that appear to vary
the most between spam and nonspam e-mails? From these 11, which words or signs
occur more often in spam?
b. Partition the data into training and validation sets, then perform a discriminant
analysis on the training data using only the 11 predictors.
PROBLEMS 309
c. If we are interested mainly in detecting spam messages, is this model useful? Use the
confusion matrix, lift chart, and decile chart for the validation set for the evaluation.
d. In the sample, almost 40% of the e-mail messages were tagged as spam. However,
suppose that the actual proportion of spam messages in these e-mail accounts is
10%. Compute the constants of the classification functions to account for this
information.
e. A spam filter that is based on your model is used, so that only messages that are
classified as nonspam are delivered, while messages that are classified as spam are
quarantined. In this case, misclassifying a nonspam e-mail (as spam) has much
heftier results. Suppose that the cost of quarantining a nonspam e-mail is 20 times
that of not detecting a spam message. Compute the constants of the classification
functions to account for these costs (assume that the proportion of spam is reflected
correctly by the sample proportion).
CHAPTER 13
Combining Methods: Ensembles and
Uplift Modeling
In this chapter, we look at two useful approaches that combine methods for
improving predictive power: ensembles and uplift modeling. An ensemble com-
bines multiple supervised models into a “super-model.” The previous chapters
in this part of the book introduced different supervised methods for prediction
and classification. Earlier, in Chapter 5, we learned about evaluating predic-
tive performance, which can be used to compare several models and choose the
best one. An ensemble is based on the powerful notion of combining models.
Instead of choosing a single predictive model, we can combine several models
to achieve improved predictive accuracy. In this chapter, we explain the under-
lying logic of why ensembles can improve predictive accuracy and introduce
popular approaches for combining models, including simple averaging, bagging,
and boosting.
In uplift modeling, we combine supervised modeling with A-B testing, which
is a simple type of a randomized experiment. We describe the basics of A-B
testing and how it is used along with predictive models in persuasion messaging
not to predict outcomes but to predict who should receive which message or
treatment.
13.1 Ensembles1
Ensembles played a major role in the million-dollar Netflix Prize contest that
started in 2006. At the time, Netflix, the largest DVD rental service in the
1This and subsequent sections in this chapter copyright © 2017 Datastats, LLC, and Galit Shmueli.
Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
311
312 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
United States, wanted to improve their movie recommendation system (from
www.netflixprize.com):
Netflix is all about connecting people to the movies they love. To help
customers find those movies, we’ve developed our world-class movie
recommendation system: CinematchSM…And while Cinematch is doing
pretty well, it can always be made better.
In a bold move, the company decided to share a large amount of data on
movie ratings by their users, and set up a contest, open to the public, aimed at
improving their recommendation system:
We provide you with a lot of anonymous rating data, and a prediction accuracy
bar that is 10% better than what Cinematch can do on the same training data
set.
During the contest, an active leader-board showed the results of the com-
peting teams. An interesting behavior started appearing: Different teams joined
forces to create combined, or ensemble predictions, which proved more accurate
than the individual predictions. The winning team, called “BellKor’s Pragmatic
Chaos” combined results from the “BellKor” and “Big Chaos” teams alongside
additional members. In a 2010 article in Chance magazine, the Netflix Prize
winners described the power of their ensemble approach:
An early lesson of the competition was the value of combining sets of
predictions from multiple models or algorithms. If two prediction sets achieved
similar RMSEs, it was quicker and more effective to simply average the two
sets than to try to develop a new model that incorporated the best of each
method. Even if the RMSE for one set was much worse than the other, there
was almost certainly a linear combination that improved on the better set.
Why Ensembles Can Improve Predictive Power
The principle of combining methods is popular for reducing risk. For example,
in finance, portfolios are created for reducing investment risk. The return from
a portfolio is typically less risky, because the variation is smaller than each of the
individual components.
In predictive modeling, “risk” is equivalent to variation in prediction error.
The more our prediction errors vary, the more volatile our predictive model.
Consider predictions from two different models for a set of n records. e1,i is the
prediction error for the ith record by method 1 and e2,i is the prediction error
for the same record by method 2.
Suppose that each model produces prediction errors that are, on average,
zero (for some records the model over-predicts and for some it under-predicts,
http://www.netflixprize.com
ENSEMBLES 313
but on average the error is zero):
E(e1,i) = E(e2,i) = 0.
If, for each record, we take an average of the two predictions: yi =
ŷ1,i+ŷ2,i
2
,
then the expected mean error will also be zero:
E (yi − yi) = E
(
yi −
ŷ1,i + ŷ2,i
2
)
(13.1)
= E
(
yi − ŷ1,i
2
+
yi − ŷ2,i
2
)
= E
(
e1,i + e2,i
2
)
= 0.
This means that the ensemble has the same mean error as the individual models.
Now let us examine the variance of the ensemble’s prediction errors:
Var
(
e1,i + e2,i
2
)
=
1
4
(Var(e1,i) + Var(e2,i)) +
1
4
× 2Cov (e1,i, e2,i) .
(13.2)
This variance can be lower than each of the individual variances Var(e1,i) and
Var(e2,i) under some circumstances. A key component is the covariance (or
equivalently, correlation) between the two prediction errors. The case of no
correlation leaves us with a quantity that can be smaller than each of the individ-
ual variances. The variance of the average prediction error will be even smaller
when the two prediction errors are negatively correlated.
In summary, using an average of two predictions can potentially lead to
smaller error variance, and therefore better predictive power. These results gen-
eralize to more than two methods; you can combine results from multiple pre-
diction methods or classifiers.
T H E W I S D O M O F C R O W D S
In his book The Wisdom of Crowds, James Surowiecki recounts how Francis Galton,
a prominent statistician from the 19th century, watched a contest at a county fair
in England. The contest’s objective was to guess the weight of an ox. Individual
contest entries were highly variable, but the mean of all the estimates was surpris-
ingly accurate—within 1% of the true weight of the ox. On balance, the errors from
multiple guesses tended to cancel one another out. You can think of the output of
a predictive model as a more informed version of these guesses. Averaging together
multiple guesses will yield a more precise answer than the vast majority of the indi-
vidual guesses. Note that in Galton’s story, there were a few (lucky) individuals
who scored better than the average. An ensemble estimate will not always be more
accurate than all the individual estimates in all cases, but it will be more accurate
most of the time.
314 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
Simple Averaging
The simplest approach for creating an ensemble is to combine the predictions,
classifications, or propensities from multiple models. For example, we might
have a linear regression model, a regression tree, and a k-NN algorithm. We use
each of the three methods to score, say, a test set. We then combine the three
sets of results.
The three models can also be variations that use the same algorithm. For
example, we might have three linear regression models, each using a different
set of predictors.
Combining Predictions In prediction tasks, where the outcome variable
is numerical, we can combine the predictions from the different methods simply
by taking an average. In the above example, for each record in the test set, we
have three predictions (one from each model). The ensemble prediction is then
the average of the three values.
One alternative to a simple average is taking the median prediction, which
would be less affected by extreme predictions. Another possibility is computing
a weighted average, where weights are proportional to a quantity of interest. For
instance, weights can be proportional to the accuracy of the model, or if differ-
ent data sources are used, the weights can be proportional to the quality of the
data.
Ensembles for prediction are useful not only in cross-sectional prediction,
but also in time series forecasting (see Chapters 16–18). In forecasting, the same
approach of combining future forecasts from multiple methods can lead to more
precise predictions. One example is the weather forecasting application Fore-
cast.io (www.forecast.io), which describes their algorithm as follows:
Forecast.io is backed by a wide range of data sources, which are aggregated
together statistically to provide the most accurate forecast possible for a given
location.
Combining Classifications In the case of classification, combining the
results from multiple classifiers can be done using “voting”: For each record, we
have multiple classifications. A simple rule would be to choose the most popular
class among these classifications. For example, we might use a classification tree,
a naive Bayes classifier, and discriminant analysis for classifying a binary outcome.
For each record we then generate three predicted classes. Simple voting would
choose the most common class among the three.
As in prediction, we can assign heavier weights to scores from some models,
based on considerations such as model accuracy or data quality. This would be
done by setting a “majority rule” that is different from 50%.
http://www.forecast.io
ENSEMBLES 315
Combining Propensities Similar to predictions, propensities can be
combined by taking a simple (or weighted) average. Recall that some algo-
rithms, such as naive Bayes (see Chapter 8), produce biased propensities and
should therefore not be simply averaged with propensities from other methods.
Bagging
Another form of ensembles is based on averaging across multiple random data
samples. Bagging, short for “bootstrap aggregating,” comprises two steps:
1. Generate multiple random samples (by sampling with replacement from
the original data)—this method is called “bootstrap sampling.”
2. Running an algorithm on each sample and producing scores.
Bagging improves the performance stability of a model and helps avoid over-
fitting by separately modeling different data samples and then combining the
results. It is therefore especially useful for algorithms such as trees and neural
networks.
Boosting
Boosting is a slightly different approach to creating ensembles. Here the goal is
to directly improve areas in the data where our model makes errors, by forcing
the model to pay more attention to those records. The steps in boosting are:
1. Fit a model to the data.
2. Draw a sample from the data so that misclassified records (or records with
large prediction errors) have higher probabilities of selection.
3. Fit the model to the new sample.
4. Repeat Steps 2–3 multiple times.
Bagging and Boosting in R
In Chapter 9, we described random forests, an ensemble based on bagged trees.
We illustrated a random forest implementation for the personal loan example.
The adabag package in R can be used to generate bagged and boosted trees.
Tables 13.1 and 13.2 show the R code and output producing a bagged tree and
a boosted tree for the personal loan data, and how they are used to generate
classifications for the validation set.
Advantages and Weaknesses of Ensembles
Combining scores from multiple models is aimed at generating more precise pre-
dictions (lowering the prediction error variance). The ensemble approach is most
useful when the combined models generate prediction errors that are negatively
316 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
TABLE 13.1 EXAMPLE OF BAGGING AND BOOSTING CLASSIFICATION TREES ON THE PERSONAL
LOAN DATA: R CODE
code for bagging and boosing trees
library(adabag)
library(rpart)
library(caret)
bank.df <- read.csv("UniversalBank.csv")
bank.df <- bank.df[ , -c(1, 5)] # Drop ID and zip code columns.
# transform Personal.Loan into categorical variable
bank.df$Personal.Loan = as.factor(bank.df$Personal.Loan)
# partition the data
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
# single tree
tr <- rpart(Personal.Loan ~ ., data = train.df)
pred <- predict(tr, valid.df, type = "class")
confusionMatrix(pred, valid.df$Personal.Loan)
# bagging
bag <- bagging(Personal.Loan ~ ., data = train.df)
pred <- predict(bag, valid.df, type = "class")
confusionMatrix(pred$class, valid.df$Personal.Loan)
# boosting
boost <- boosting(Personal.Loan ~ ., data = bank.df)
pred <- predict(boost, valid.df, type = "class")
confusionMatrix(pred$class, valid.df$Personal.Loan)
associated, but it can also be useful when the correlation is low. Ensembles can
use simple averaging, weighted averaging, voting, medians, etc. Models can be
based on the same algorithm or on different algorithms, using the same sample
or different samples. Ensembles have become a major strategy for participants in
data mining contests, where the goal is to optimize some predictive measure. In
that sense, ensembles also provide an operational way to obtain solutions with
high predictive power in a fast way, by engaging multiple teams of “data crunch-
ers” working in parallel and combining their results.
Ensembles that are based on different data samples help avoid overfitting.
However, remember that you can also overfit the data with an ensemble if you
tweak it (e.g., choosing the “best” weights when using a weighted average).
The major disadvantage of an ensemble is the resources that it requires: com-
putationally, as well as in terms of software availability and the analyst’s skill
and time investment. Ensembles that combine results from different algorithms
UPLIFT (PERSUASION) MODELING 317
TABLE 13.2 EXAMPLE OF BAGGING AND BOOSTING CLASSIFICATION TREES ON THE PERSONAL
LOAN DATA: OUTPUT
> # single tree
> confusionMatrix(pred, valid.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1792 18
1 19 171
Accuracy : 0.9815
> # bagging
> confusionMatrix(pred$class, valid.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1797 23
1 14 166
Accuracy : 0.9815
> # boosting
> confusionMatrix(pred$class, valid.df$Personal.Loan)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1804 16
1 7 173
Accuracy : 0.9885
require developing each of the models and evaluating them. Boosting-type
ensembles and bagging-type ensembles do not require such effort, but they do
have a computational cost (although boosting can be parallelized easily). Ensem-
bles that rely on multiple data sources require collecting and maintaining mul-
tiple data sources. And finally, ensembles are “blackbox” methods, in that the
relationship between the predictors and the outcome variable usually becomes
nontransparent.
13.2 Uplift (Persuasion) Modeling
Long before the advent of the Internet, sending messages directly to individuals
(i.e., direct mail) held a big share of the advertising market. Direct marketing
affords the marketer the ability to invite and monitor direct responses from con-
sumers. This, in turn, allows the marketer to learn whether the messaging is
paying off. A message can be tested with a small section of a large list and, if it
318 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
pays off, the message can be rolled out to the entire list. With predictive model-
ing, we have seen that the rollout can be targeted to that portion of the list that
is most likely to respond or behave in a certain way. None of this was possible
with traditional media advertising (television, radio, newspaper, magazine).
Direct response also made it possible to test one message against another and
find out which does better.
A-B Testing
A-B testing is the marketing industry’s term for a standard scientific experiment
in which results can be tracked for each individual. The idea is to test one
treatment against another, or a treatment against a control. “Treatment” is simply
the term for the intervention you are testing: In a medical trial it is typically
a drug, device, or other therapy; in marketing it is typically an offering to a
consumer—for example, an e-mail, or a web page shown to a consumer. A
general display ad in a magazine would not generally qualify, unless it had a
specific call to action that allowed the marketer to trace the action (e.g., purchase)
to a given ad, plus the ability to split the magazine distribution randomly and
provide a different offer to each segment.
An important element of A-B testing is random allocation—the treatments
are assigned or delivered to individuals randomly. That way, any difference
between treatment A and treatment B can be attributed to the treatment (unless
it is due to chance).
Uplift
An A-B test tells you which treatment does better on average, but says nothing
about which treatment does better for which individual. A classic example is
in political campaigns. Consider the following scenario: The campaign director
for Smith, a Democratic Congressional candidate, would like to know which
voters should be called to encourage to support Smith. Voters that tend to vote
Democratic but are not activists might be more inclined to vote for Smith if
they got a call. Active Democrats are probably already supportive of him, and
therefore a call to them would be wasted. Calls to Republicans are not only
wasteful, but they could be harmful.
Campaigns now maintain extensive data on voters to help guide decisions
about outreach to individual voters. Prior to the 2008 Obama campaign, the
practice was to make rule-based decisions based on expert political judgment.
Since 2008, it has increasingly been recognized that, rather than relying on judg-
ment or supposition to determine whether an individual should be called, it is
best to use the data to develop a model that can predict whether a voter will
respond positively to outreach.
UPLIFT (PERSUASION) MODELING 319
Gathering the Data
US states maintain publicly available files of voters, as part of the transparent
oversight process for elections. The voter file contains data such as name, address,
and date of birth. Political parties have “poll-watchers” at elections to record
who votes, so they have additional data on which elections voters voted in.
Census data for neighborhoods can be appended, based on voter address. Finally,
commercial demographic data can be purchased and matched to the voter data.
Table 13.3 shows a small extract of data derived from the voter file for the US state
of Delaware.2 The actual data used in this problem are in the file Persuasion-A.csv
and contain 10,000 records and many additional variables beyond those shown
in Table 13.3.
First, the campaign director conducts a survey of 10,000 voters to deter-
mine their inclination to vote Democratic. Then she conducts an experiment,
randomly splitting the sample of 10,000 voters in half and mailing a message
promoting Smith to half the list (treatment A), and nothing to the other half
(treatment B). The control group that gets no message is essential, since other
campaigns or news events might cause a shift in opinion. The goal is to measure
the change in opinion after the message is sent out, relative to the no-message
control group.
The next step is conducting a post-message survey of the same sample of
10,000 voters, to measure whether each voter’s opinion of Smith has shifted in
TABLE 13.3 DATA ON VOTERS (SMALL SUBSET OF VARIABLES AND RECORDS) AND
DATA DICTIONARY
Voter Age NH_White Comm_PT H_F1 Reg_Days PR_Pelig E_Elig Political_C
1 28 70 0 0 3997 0 20 1
2 23 67 3 0 300 0 0 1
3 57 64 4 0 2967 0 0 0
4 70 53 2 1 16620 100 90 1
5 37 76 2 0 3786 0 20 0
Data Dictionary
Age Voter age in years
NH_White Neighborhood average of % non-Hispanic white in household
Comm_PT Neighborhood % of workers who take public transit
H_F1 Single female household (1 = yes)
Reg_Days Days since voter registered at current address
PR_Pelig Voted in what % of non-presidential primaries
E_Pelig Voted in what % of any primaries
Political_C Is there a political contributor in the home? (1 = yes)
2Thanks to Ken Strasma, founder of the microtargeting firm HaystaqDNA and director of targeting
for the 2004 Kerry campaign and the 2008 Obama campaign, for these data.
320 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
a positive direction. A binary variable, Moved_AD, will be added to the above
data, indicating whether opinion has moved in a Democratic direction (1) or
not (0).
Table 13.4 summarizes the results of the survey, by comparing the move-
ment in a Democratic direction for each of the treatments. Overall, the message
(Message = 1) is modestly effective.
Movement in a Democratic direction among those who got no message is
34.4%. This probably reflects the approach of the election, the heightening cam-
paign activity, and the reduction in the “no opinion” category. It also illustrates
the need for a control group. Among those who did get the message, the move-
ment in a Democratic direction is 40.2%. So, overall, the lift from the message
is 5.8%.
TABLE 13.4 RESULTS OF SENDING A PRO-DEMOCRATIC MESSAGE TO
VOTERS
#Voters # Moved Dem. % Moved Dem.
Message = 1 (message sent) 5000 2012 40.2%
Message = 0 (no message sent) 5000 1722 34.4%
We can now append two variables to the voter data shown earlier in Table
13.3: message [whether they received the message (1) or not (0)] and Moved_AD
[whether they moved in a Democratic direction (1) or not (0)]. The augmented
data are shown in Table 13.5.
TABLE 13.5 OUTCOME VARIABLE (MOVED_AD) AND TREATMENT VARIABLE (MESSAGE) ADDED
TO VOTER DATA
Voter Age NH_White Comm_PT H_F1 Reg_Days PR_Pelig E_Elig Political_C Message Moved_AD
1 28 70 0 0 3997 0 20 1 0 1
2 23 67 3 0 300 0 0 1 1 1
3 57 64 4 0 2967 0 0 0 0 0
4 70 53 2 1 16620 100 90 1 0 0
5 37 76 2 0 3786 0 20 0 1 0
A Simple Model
We can develop a predictive model with Moved_AD as the outcome variable,
and various predictor variables, including the treatment Message. Any classification
method can be used; Table 13.6 shows the first few lines from the output of a
logistic regression model used to predict Moved_AD.
However, our interest is not just how the message did overall, nor is it
whether we can predict the probability that a voter’s opinion will move in a
favorable direction. Rather our goal is to predict how much (positive) impact
the message will have on a specific voter. That way the campaign can direct
UPLIFT (PERSUASION) MODELING 321
TABLE 13.6 CLASSIFICATIONS AND PROPENSITIES FROM PREDICTIVE
MODEL (SMALL EXTRACT)
Voter Message Actual Moved_AD Predicted Moved_AD Predicted Prob.
1 0 1 1 0.5975
2 1 1 1 0.5005
3 0 0 0 0.2235
4 0 0 0 0.3052
5 1 0 0 0.4140
its limited resources toward the voters who are the most persuadable—those for
whom sending the message will have the greatest positive effect.
Modeling Individual Uplift
To answer the question about the message’s impact on each voter, we need to
model the effect of the message at the individual voter level. For each voter,
uplift is defined as follows:
Uplift = increase in propensity of favorable opinion after receiving message
To build an uplift model, we follow the following steps to estimate the change
in probability of “success” (propensity) that comes from receiving the treatment
(the message):
1. Randomly split a data sample into treatment and control groups, conduct
an A-B test, and record the outcome (in our example: Moved_AD)
2. Recombining the data sample, partition it into training and validation
sets; build a predictive model with this outcome variable and include a
predictor variable that denotes treatment status (in our example: Message)
3. Score this predictive model to a partition of the data; you can use the vali-
dation partition. This will yield, for each validation record, its propensity
of success given its treatment.
4. Reverse the value of the treatment variable and re-score the same model
to that partition. This will yield for each validation record its propensity
of success had it received the other treatment.
5. Uplift is estimated for each individual by P(Success | Treatment = 1) −
P(Success | Treatment = 0)
6. For new data where no experiment has been performed, simply add a
synthetic predictor variable for treatment and assign first a ‘1,’ score the
model, then a ‘0,’ and score the model again. Estimate uplift for the new
record(s) as above.
Continuing with the small voter example, the results from Step 3 were shown
in Table 13.6—the right column shows the propensities from the model. Next,
322 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
TABLE 13.7 CLASSIFICATIONS AND PROPENSITIES FROM PREDICTIVE
MODEL (SMALL EXTRACT) WITH MESSAGE VALUES
REVERSED
Voter Message Actual Moved_AD Predicted Moved_AD Predicted Prob.
1 1 1 1 0.6908
2 0 1 1 0.3996
3 1 0 0 0.3022
4 1 0 0 0.3980
5 0 0 0 0.3194
we re-estimate the logistic model, but with the values of the treatment variable
Message reversed for each row. Table 13.7 shows the propensities with variable
Message reversed (you can see the reversed values in column Message). Finally,
in Step 5, we calculate the uplift for each voter.
Table 13.8 shows the uplift for each voter—the success (Moved_AD = 1)
propensity given Message = 1 minus the success propensity given Message = 0.
TABLE 13.8 UPLIFT: CHANGE IN PROPENSITIES FROM SENDING
MESSAGE VS. NOT SENDING MESSAGE
Voter Prob. if Message = 1 Prob. if Message = 0 Uplift
1 0.6908 0.5975 0.0933
2 0.5005 0.3996 0.1009
3 0.3022 0.2235 0.0787
4 0.3980 0.3052 0.0928
5 0.4140 0.3194 0.0946
Computing Uplift with R
This entire process that we showed manually can be done using R’s uplift
package. One difference between our manual computation and R’s uplift pack-
age is that we used a logistic regression whereas the uplift package uses either a
random forest (upliftRF()) or a k-nearest neighbors classifier (upliftKNN()). Table
13.9 shows the result of implementing uplift analysis in R using a random forest.
The output shows the two conditional probabilities, P(Success | Treatment = 1)
and P(Success | Treatment = 0), estimated for each record. The difference
between the values in Tables 13.8 and 13.9 are due to the use of two differ-
ent predictive algorithms (logistic regression vs. random forests).
Using the Results of an Uplift Model
Once we have estimated the uplift for each individual, the results can be ordered
by uplift. The message could then be sent to all those voters with a positive uplift,
or, if resources are limited, only to a subset—those with the greatest uplift.
UPLIFT (PERSUASION) MODELING 323
TABLE 13.9 UPLIFT IN R APPLIED TO THE VOTERS DATA
code for uplift
library(uplift)
voter.df <- read.csv("Voter-Persuasion.csv")
# transform variable MOVED_AD to numerical
voter.df$MOVED_AD_NUM <- ifelse(voter.df$MOVED_AD == "Y", 1, 0)
set.seed(1)
train.index <- sample(c(1:dim(voter.df)[1]), dim(voter.df)[1]*0.6)
train.df <- voter.df[train.index, ]
valid.df <- voter.df[-train.index, ]
# use upliftRF to apply a Random Forest (alternatively use upliftKNN() to apply kNN).
up.fit <- upliftRF(MOVED_AD_NUM ~ AGE + NH_WHITE + COMM_PT + H_F1 + REG_DAYS+
PR_PELIG + E_PELIG + POLITICALC + trt(MESSAGE_A),
data = train.df, mtry = 3, ntree = 100, split_method = "KL",
minsplit = 200, verbose = TRUE)
pred <- predict(up.fit, newdata = valid.df)
# first colunm: p(y | treatment)
# second colunm: p(y | control)
head(data.frame(pred, "uplift" = pred[,1] - pred[,2]))
Output
pr.y1_ct1 pr.y1_ct0 uplift
1 0.356284 0.319376 0.036908
2 0.489685 0.437061 0.052624
3 0.380408 0.365436 0.014972
4 0.379597 0.341727 0.037870
5 0.371465 0.293659 0.077806
6 0.405424 0.349785 0.055639
Uplift modeling is used mainly in marketing and, more recently, in political
campaigns. It has two main purposes:
• To determine whether to send someone a persuasion message, or just leave
them alone.
• When a message is definitely going to be sent, to determine which mes-
sage, among several possibilities, to send.
Technically this amounts to the same thing—“send no message” is simply
another category of treatment, and an experiment can be constructed with mul-
tiple treatments, for example, no message, message A, and message B. However,
practitioners tend to think of the two purposes as distinct, and tend to focus
on the first. Marketers want to avoid sending discount offers to customers who
would make a purchase anyway, or renew a subscription anyway. Political cam-
paigns, likewise, want to avoid calling voters who would vote for their candidate
in any case. And both parties especially want to avoid sending messages or offers
where the effect might be antagonistic—where the uplift is negative.
324 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
13.3 Summary
In practice, the methods discussed in this book are often used not in isolation,
but as building blocks in an analytic process whose goal is always to inform and
provide insight.
In this chapter, we looked at two ways that multiple models are deployed. In
ensembles, multiple models are weighted and combined to produce improved
predictions. In uplift modeling, the results of A-B testing are folded into the
predictive modeling process as a predictor variable to guide choices not just about
whether to send an offer or persuasion message, but also as to who to send it to.
PROBLEMS 325
PROBLEMS
13.1 Acceptance of Consumer Loan Universal Bank has begun a program to encourage
its existing customers to borrow via a consumer loan program. The bank has promoted
the loan to 5000 customers, of whom 480 accepted the offer. The data are available
in file UniversalBank.csv. The bank now wants to develop a model to predict which
customers have the greatest probability of accepting the loan, to reduce promotion
costs and send the offer only to a subset of its customers.
We will develop several models, then combine them in an ensemble. The models
we will use are (1) logistic regression, (2) k-nearest neighbors with k = 3, and (3)
classification trees. Preprocess the data as follows:
• Zip code can be ignored.
• Partition the data: 60% training, 40% validation.
a. Fit models to the data for (1) logistic regression, (2) k-nearest neighbors with k = 3,
and (3) classification trees. Use Personal Loan as the outcome variable. Report the
validation confusion matrix for each of the three models.
b. Create a data frame with the actual outcome, predicted outcome, and each of the
three models. Report the first 10 rows of this data frame.
c. Add two columns to this data frame for (1) a majority vote of predicted outcomes,
and (2) the average of the predicted probabilities. Using the classifications generated
by these two methods derive a confusion matrix for each method and report the
overall accuracy.
d. Compare the error rates for the three individual methods and the two ensemble
methods.
13.2 eBay Auctions—Boosting and Bagging Using the eBay auction data (file eBayAuc-
tions.csv) with variable Competitive as the outcome variable, partition the data into
training (60%) and validation (40%).
a. Run a classification tree, using the default controls of rpart(). Looking at the vali-
dation set, what is the overall accuracy? What is the lift on the first decile?
b. Run a boosted tree with the same predictors (use function boosting() in the adabag
package). For the validation set, what is the overall accuracy? What is the lift on
the first decile?
c. Run a bagged tree with the same predictors (use function bagging() in the adabag
package). For the validation set, what is the overall accuracy? What is the lift on
the first decile?
d. Run a random forest (use function randomForest() in package randomForest with
argument mtry = 4). Compare the bagged tree to the random forest in terms of
validation accuracy and lift on first decile. How are the two methods conceptually
different?
13.3 Predicting Delayed Flights (Boosting). The file FlightDelays.csv contains informa-
tion on all commercial flights departing the Washington, DC area and arriving at New
York during January 2004. For each flight there is information on the departure and
arrival airports, the distance of the route, the scheduled time and date of the flight, and
so on. The variable that we are trying to predict is whether or not a flight is delayed.
A delay is defined as an arrival that is at least 15 minutes later than scheduled.
326 COMBINING METHODS: ENSEMBLES AND UPLIFT MODELING
Data Preprocessing. Transform variable day of week info a categorical variable. Bin
the scheduled departure time into eight bins (in R use function cut()). Partition the
data into training and validation sets.
Run a boosted classification tree for delay. Leave the default number of weak
learners, and select resampling. Set maximum levels to display at 6, and minimum
number of records in a terminal node to 1.
a. Compared with the single tree, how does the boosted tree behave in terms of overall
accuracy?
b. Compared with the single tree, how does the boosted tree behave in terms of
accuracy in identifying delayed flights?
c. Explain why this model might have the best performance over the other models
you fit.
13.4 Hair Care Product—Uplift Modeling This problem uses the data set in Hair-Care-
Product.csv, courtesy of SAS. In this hypothetical case, a promotion for a hair care
product was sent to some members of a buyers club. Purchases were then recorded for
both the members who got the promotion and those who did not.
a. What is the purchase propensity
i. among those who received the promotion?
ii. among those who did not receive the promotion?
b. Partition the data into training (60%) and validation (40%) and fit:
i. Uplift using a Random Forest.
ii. Uplift using k-NN.
c. Report the two models’ recommendations for the first three members.
Part V
Mining Relationships Among Records
CHAPTER 14
Association Rules and Collaborative
Filtering
In this chapter, we describe the unsupervised learning methods of association
rules (also called “affinity analysis” and “market basket analysis”) and collabora-
tive filtering. Both methods are popular in marketing for cross-selling products
associated with an item that a consumer is considering.
In association rules, the goal is to identify item clusters in transaction-type
databases. Association rule discovery in marketing is termed “market basket anal-
ysis” and is aimed at discovering which groups of products tend to be purchased
together. These items can then be displayed together, offered in post-transaction
coupons, or recommended in online shopping. We describe the two-stage pro-
cess of rule generation and then assessment of rule strength to choose a subset.
We look at the popular rule-generating Apriori algorithm, and then criteria for
judging the strength of rules.
In collaborative filtering, the goal is to provide personalized recommen-
dations that leverage user-level information. User-based collaborative filtering
starts with a user, then finds users who have purchased a similar set of items or
ranked items in a similar fashion, and makes a recommendation to the initial user
based on what the similar users purchased or liked. Item-based collaborative fil-
tering starts with an item being considered by a user, then locates other items
that tend to be co-purchased with that first item. We explain the technique and
the requirements for applying it in practice.
14.1 Association Rules
Put simply, association rules, or affinity analysis, constitute a study of “what goes
with what.” This method is also called market basket analysis because it originated
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
329
330 ASSOCIATION RULES AND COLLABORATIVE FILTERING
with the study of customer transactions databases to determine dependencies
between purchases of different items. Association rules are heavily used in retail
for learning about items that are purchased together, but they are also useful
in other fields. For example, a medical researcher might want to learn what
symptoms appear together. In law, word combinations that appear too often
might indicate plagiarism.
Discovering Association Rules in Transaction Databases
The availability of detailed information on customer transactions has led to
the development of techniques that automatically look for associations between
items that are stored in the database. An example is data collected using bar-code
scanners in supermarkets. Such market basket databases consist of a large number
of transaction records. Each record lists all items bought by a customer on a
single-purchase transaction. Managers are interested to know if certain groups
of items are consistently purchased together. They could use such information
for making decisions on store layouts and item placement, for cross-selling, for
promotions, for catalog design, and for identifying customer segments based on
buying patterns. Association rules provide information of this type in the form
of “if–then” statements. These rules are computed from the data; unlike the
if–then rules of logic, association rules are probabilistic in nature.
Association rules are commonly encountered in online recommendation sys-
tems (or recommender systems), where customers examining an item or items for
possible purchase are shown other items that are often purchased in conjunction
with the first item(s). The display from Amazon.com’s online shopping system
illustrates the application of rules like this under “Frequently bought together.”
In the example shown in Figure 14.1, a user browsing a Samsung Galaxy S5 cell
phone is shown a case and a screen protector that are often purchased along with
this phone.
We introduce a simple artificial example and use it throughout the chapter
to demonstrate the concepts, computations, and steps of association rules. We
end by applying association rules to a more realistic example of book purchases.
Example 1: Synthetic Data on Purchases of Phone Faceplates
A store that sells accessories for cellular phones runs a promotion on faceplates.
Customers who purchase multiple faceplates from a choice of six different colors
get a discount. The store managers, who would like to know what colors of
faceplates customers are likely to purchase together, collected the transaction
database as shown in Table 14.1.
Generating Candidate Rules
The idea behind association rules is to examine all possible rules between items
in an if–then format, and select only those that are most likely to be indicators
ASSOCIATION RULES 331
FIGURE 14.1 RECOMMENDATIONS UNDER “FREQUENTLY BOUGHT TOGETHER” ARE BASED ON
ASSOCIATION RULES
of true dependence. We use the term antecedent to describe the IF part, and
consequent to describe the THEN part. In association analysis, the antecedent
and consequent are sets of items (called itemsets) that are disjoint (do not have
any items in common). Note that itemsets are not records of what people buy;
they are simply possible combinations of items, including single items.
332 ASSOCIATION RULES AND COLLABORATIVE FILTERING
TABLE 14.1 TRANSACTIONS DATABASE FOR PURCHASES OF DIFFERENT-COLORED CELLULAR
PHONE FACEPLATES
Transaction Faceplate Colors Purchased
1 red white green
2 white orange
3 white blue
4 red white orange
5 red blue
6 white blue
7 red blue
8 red white blue green
9 red white blue
10 yellow
Returning to the phone faceplate purchase example, one example of a pos-
sible rule is “if red, then white,” meaning that if a red faceplate is purchased, a
white one is, too. Here the antecedent is red and the consequent is white. The
antecedent and consequent each contain a single item in this case. Another pos-
sible rule is “if red and white, then green.” Here the antecedent includes the
itemset {red, white} and the consequent is {green}.
The first step in association rules is to generate all the rules that would be
candidates for indicating associations between items. Ideally, we might want
to look at all possible combinations of items in a database with p distinct items
(in the phone faceplate example, p = 6). This means finding all combinations
of single items, pairs of items, triplets of items, and so on, in the transactions
database. However, generating all these combinations requires a long computa-
tion time that grows exponentially1 in p. A practical solution is to consider only
combinations that occur with higher frequency in the database. These are called
frequent itemsets.
Determining what qualifies as a frequent itemset is related to the concept of
support. The support of a rule is simply the number of transactions that include
both the antecedent and consequent itemsets. It is called a support because it
measures the degree to which the data “support” the validity of the rule. The
support is sometimes expressed as a percentage of the total number of records in
the database. For example, the support for the itemset {red,white} in the phone
faceplate example is 4 (or, 100× 4
10
= 40%).
What constitutes a frequent itemset is therefore defined as an itemset that
has a support that exceeds a selected minimum support, determined by the
user.
1The number of rules that one can generate for p items is 3p−2p+1+1. Computation time therefore
grows by a factor for each additional item. For 6 items we have 602 rules, while for 7 items the number
of rules grows to 1932.
ASSOCIATION RULES 333
The Apriori Algorithm
Several algorithms have been proposed for generating frequent itemsets, but the
classic algorithm is the Apriori algorithm of Agrawal et al. (1993). The key idea
of the algorithm is to begin by generating frequent itemsets with just one item
(one-itemsets) and to recursively generate frequent itemsets with two items, then
with three items, and so on, until we have generated frequent itemsets of all sizes.
It is easy to generate frequent one-itemsets. All we need to do is to count,
for each item, how many transactions in the database include the item. These
transaction counts are the supports for the one-itemsets. We drop one-itemsets
that have support below the desired minimum support to create a list of the
frequent one-itemsets.
To generate frequent two-itemsets, we use the frequent one-itemsets. The
reasoning is that if a certain one-itemset did not exceed the minimum support,
any larger size itemset that includes it will not exceed the minimum support.
In general, generating k-itemsets uses the frequent (k − 1)-itemsets that were
generated in the preceding step. Each step requires a single run through the
database, and therefore the Apriori algorithm is very fast even for a large number
of unique items in a database.
Selecting Strong Rules
From the abundance of rules generated, the goal is to find only the rules that
indicate a strong dependence between the antecedent and consequent itemsets.
To measure the strength of association implied by a rule, we use the measures of
confidence and lift ratio, as described below.
Support and Confidence In addition to support, which we described
earlier, there is another measure that expresses the degree of uncertainty about
the if–then rule. This is known as the confidence2 of the rule. This measure
compares the co-occurrence of the antecedent and consequent itemsets in the
database to the occurrence of the antecedent itemsets. Confidence is defined as
the ratio of the number of transactions that include all antecedent and consequent
itemsets (namely, the support) to the number of transactions that include all the
antecedent itemsets:
Confidence =
no. transactions with both antecedent and consequent itemsets
no. transactions with antecedent itemset
.
2The concept of confidence is different from and unrelated to the ideas of confidence intervals and
confidence levels used in statistical inference.
334 ASSOCIATION RULES AND COLLABORATIVE FILTERING
For example, suppose that a supermarket database has 100,000 point-of-sale
transactions. Of these transactions, 2000 include both orange juice and (over-
the-counter) flu medication, and 800 of these include soup purchases. The
association rule “IF orange juice and flu medication are purchased THEN soup
is purchased on the same trip” has a support of 800 transactions (alternatively,
0.8% = 800/100,000) and a confidence of 40% (= 800/2000).
To see the relationship between support and confidence, let us think about
what each is measuring (estimating). One way to think of support is that it is
the (estimated) probability that a transaction selected randomly from the database
will contain all items in the antecedent and the consequent:
Support = P̂ (antecedent AND consequent).
In comparison, the confidence is the (estimated) conditional probability that a trans-
action selected randomly will include all the items in the consequent given that
the transaction includes all the items in the antecedent:
Confidence=
P̂ (antecedent AND consequent)
P̂ (antecedent)
= P̂ (consequent | antecedent).
A high value of confidence suggests a strong association rule (in which we
are highly confident). However, this can be deceptive because if the antecedent
and/or the consequent has a high level of support, we can have a high value
for confidence even when the antecedent and consequent are independent! For
example, if nearly all customers buy bananas and nearly all customers buy ice
cream, the confidence level of a rule such as “IF bananas THEN ice-cream”
will be high regardless of whether there is an association between the items.
Lift Ratio A better way to judge the strength of an association rule is to
compare the confidence of the rule with a benchmark value, where we assume
that the occurrence of the consequent itemset in a transaction is independent of
the occurrence of the antecedent for each rule. In other words, if the antecedent
and consequent itemsets are independent, what confidence values would we
expect to see? Under independence, the support would be
P (antecedent AND consequent) = P (antecedent) × P (consequent),
and the benchmark confidence would be
P (antecedent) × P (consequent)
P (antecedent)
= P (consequent).
The estimate of this benchmark from the data, called the benchmark confidence
value for a rule, is computed by
Benchmark confidence =
no. transactions with consequent itemset
no. transactions in database
.
ASSOCIATION RULES 335
We compare the confidence to the benchmark confidence by looking at
their ratio: this is called the lift ratio of a rule. The lift ratio is the confidence of
the rule divided by the confidence, assuming independence of consequent from
antecedent:
lift ratio =
confidence
benchmark confidence
.
A lift ratio greater than 1.0 suggests that there is some usefulness to the rule.
In other words, the level of association between the antecedent and consequent
itemsets is higher than would be expected if they were independent. The larger
the lift ratio, the greater the strength of the association.
To illustrate the computation of support, confidence, and lift ratio for the
cellular phone faceplate example, we introduce an alternative presentation of the
data that is better suited to this purpose.
Data Format
Transaction data are usually displayed in one of two formats: a transactions
database (with each row representing a list of items purchased in a single trans-
action), or a binary incidence matrix in which columns are items, rows again
represent transactions, and each cell has either a 1 or a 0, indicating the presence
or absence of an item in the transaction. For example, Table 14.1 displays the
data for the cellular faceplate purchases in a transactions database. We translate
these into binary incidence matrix format in Table 14.2.
Now suppose that we want association rules between items for this database
that have a support count of at least 2 (equivalent to a percentage support of
2/10 = 20%): In other words, rules based on items that were purchased together
in at least 20% of the transactions. By enumeration, we can see that only the
itemsets listed in Table 14.3 have a count of at least 2.
The first itemset {red} has a support of 6, because six of the transactions
included a red faceplate. Similarly, the last itemset {red, white, green} has a
TABLE 14.2 PHONE FACEPLATE DATA IN BINARY INCIDENCE
MATRIX FORMAT
Transaction Red White Blue Orange Green Yellow
1 1 1 0 0 1 0
2 0 1 0 1 0 0
3 0 1 1 0 0 0
4 1 1 0 1 0 0
5 1 0 1 0 0 0
6 0 1 1 0 0 0
7 1 0 1 0 0 0
8 1 1 1 0 1 0
9 1 1 1 0 0 0
10 0 0 0 0 0 1
336 ASSOCIATION RULES AND COLLABORATIVE FILTERING
TABLE 14.3 ITEMSETS WITH SUPPORT
COUNT OF AT LEAST TWO
Itemset Support (Count)
{red} 6
{white} 7
{blue} 6
{orange} 2
{green} 2
{red, white} 4
{red, blue} 4
{red, green} 2
{white, blue} 4
{white, orange} 2
{white, green} 2
{red, white, blue} 2
{red, white, green} 2
support of 2, because only two transactions included red, white, and green face-
plates.
In R, the user will input data using the package arules in the transactions database
format. The package only creates rules with one item as the consequent. It calls
the consequent the right-hand side (RHS) of the rule, and the antecedent the left-
hand-side (LHS) of the rule.
The Process of Rule Selection
The process of selecting strong rules is based on generating all association rules
that meet stipulated support and confidence requirements. This is done in two
stages. The first stage, described earlier, consists of finding all “frequent” item-
sets, those itemsets that have a requisite support. In the second stage, we gener-
ate, from the frequent itemsets, association rules that meet a confidence require-
ment. The first step is aimed at removing item combinations that are rare in
the database. The second stage then filters the remaining rules and selects only
those with high confidence. For most association analysis data, the computa-
tional challenge is the first stage, as described in the discussion of the Apriori
algorithm.
The computation of confidence in the second stage is simple. Since any
subset (e.g., {red} in the phone faceplate example) must occur at least as fre-
quently as the set it belongs to (e.g., {red, white}), each subset will also be in
the list. It is then straightforward to compute the confidence as the ratio of the
support for the itemset to the support for each subset of the itemset. We retain
the corresponding association rule only if it exceeds the desired cutoff value for
confidence. For example, from the itemset {red, white, green} in the phone
ASSOCIATION RULES 337
faceplate purchases, we get the following single-consequent association rules,
confidence values, and lift values:
Rule Confidence Lift
{red, white} ⇒ {green}
support of {red, white, green}
support of {red, white}
= 2/4 = 50%
confidence of rule
benchmark confidence
=
50%
20%
= 2.5
{green} ⇒ {red}
support of {green, red}
support of {green}
= 2/2 = 100%
confidence of rule
benchmark confidence
=
100%
60%
= 1.67
{white, green} ⇒ {red}
support of {white, green, red}
support of {white, green}
= 2/2 = 100%
confidence of rule
benchmark confidence
=
100%
60%
= 1.67
If the desired minimum confidence is 70%, we would report only the second
and third rules.
We can generate association rules in R by coercing the binary incidence
matrix in Table 14.2 into a transaction database like in Table 14.1. We specify
the minimum support (20%) and minimum confidence level percentage (50%).
Table 14.4 shows the output. The output includes information on each rule and
its support, confidence, and lift (Note that here we consider all possible itemsets,
not just {red, white, green} as above.).
Interpreting the Results
We can translate each of the rules from Table 14.4 into an understandable sen-
tence that provides information about performance. For example, we can read
rule #4 as follows:
If orange is purchased, then with confidence 100% white will also be
purchased. This rule has a lift ratio of 1.43.
In interpreting results, it is useful to look at the various measures. The
support for the rule indicates its impact in terms of overall size: How many
transactions are affected? If only a small number of transactions are affected, the
rule may be of little use (unless the consequent is very valuable and/or the rule
is very efficient in finding it).
The lift ratio indicates how efficient the rule is in finding consequents, com-
pared to random selection. A very efficient rule is preferred to an inefficient rule,
but we must still consider support: A very efficient rule that has very low support
may not be as desirable as a less efficient rule with much greater support.
The confidence tells us at what rate consequents will be found, and is useful
in determining the business or operational usefulness of a rule: A rule with low
confidence may find consequents at too low a rate to be worth the cost of (say)
promoting the consequent in all the transactions that involve the antecedent.
338 ASSOCIATION RULES AND COLLABORATIVE FILTERING
TABLE 14.4 BINARY INCIDENCE MATRIX, TRANSACTIONS DATABASE, AND RULES FOR
FACEPLATE EXAMPLE
code for running the Apriori algorithm
fp.df <- read.csv("Faceplate.csv")
# remove first column and convert to matrix
fp.mat <- as.matrix(fp.df[, -1])
# convert the binary incidence matrix into a transactions database
fp.trans <- as(fp.mat, "transactions")
inspect(fp.trans)
## get rules
# when running apriori(), include the minimum support, minimum confidence, and target
# as arguments.
rules <- apriori(fp.trans, parameter = list(supp = 0.2, conf = 0.5, target = "rules"))
# inspect the first six rules, sorted by their lift
inspect(head(sort(rules, by = "lift"), n = 6))
Output
> fp.mat
Red White Blue Orange Green Yellow
[1,] 1 1 0 0 1 0
[2,] 0 1 0 1 0 0
[3,] 0 1 1 0 0 0
[4,] 1 1 0 1 0 0
[5,] 1 0 1 0 0 0
[6,] 0 1 1 0 0 0
[7,] 1 0 1 0 0 0
[8,] 1 1 1 0 1 0
[9,] 1 1 1 0 0 0
[10,] 0 0 0 0 0 1
> inspect(fp.trans)
items
1 {Red,White,Green}
2 {White,Orange}
3 {White,Blue}
4 {Red,White,Orange}
5 {Red,Blue}
6 {White,Blue}
7 {Red,Blue}
8 {Red,White,Blue,Green}
9 {Red,White,Blue}
10 {Yellow}
> inspect(head(sort(rules, by = “lift”), n = 6))
lhs rhs support confidence lift
15 {Red,White} => {Green} 0.2 0.5 2.500000
5 {Green} => {Red} 0.2 1.0 1.666667
14 {White,Green} => {Red} 0.2 1.0 1.666667
4 {Orange} => {White} 0.2 1.0 1.428571
6 {Green} => {White} 0.2 1.0 1.428571
13 {Red,Green} => {White} 0.2 1.0 1.428571
ASSOCIATION RULES 339
Rules and Chance
What about confidence in the nontechnical sense? How sure can we be that
the rules we develop are meaningful? Considering the matter from a statistical
perspective, we can ask: Are we finding associations that are really just chance
occurrences?
Let us examine the output from an application of this algorithm to a small
database of 50 transactions, where each of the nine items is assigned randomly
to each transaction. The data are shown in Table 14.5, and the association rules
generated are shown in Table 14.6. In looking at these tables, remember that
“lhs” and “rhs” refer to itemsets, not records.
In this example, the lift ratios highlight Rule [105] as most interesting, as it
suggests that purchase of item 4 is almost five times as likely when items 3 and 8
TABLE 14.5 FIFTY TRANSACTIONS OF RANDOMLY ASSIGNED ITEMS
Transaction Items Transaction Items Transaction Items
1 8 18 8 35 3 4 6 8
2 3 4 8 19 36 1 4 8
3 8 20 9 37 4 7 8
4 3 9 21 2 5 6 8 38 8 9
5 9 22 4 6 9 39 4 5 7 9
6 1 8 23 4 9 40 2 8 9
7 6 9 24 8 9 41 2 5 9
8 3 5 7 9 25 6 8 42 1 2 7 9
9 8 26 1 6 8 43 5 8
10 27 5 8 44 1 7 8
11 1 7 9 28 4 8 9 45 8
12 1 4 5 8 9 29 9 46 2 7 9
13 5 7 9 30 8 47 4 6 9
14 6 7 8 31 1 5 8 48 9
15 3 7 9 32 3 6 9 49 9
16 1 4 9 33 7 9 50 6 7 8
17 6 7 8 34 7 8 9
TABLE 14.6 ASSOCIATION RULES OUTPUT FOR RANDOM DATA
Min. Support: 2 = 4%
Min. Conf. % : 70
> rules.tbl[rules.tbl$support >= 0.04 & rules.tbl$confidence >= 0.7,]
lhs rhs support confidence lift
[18] {item.2} => {item.9} 0.08 0.8 1.481481
[89] {item.2,item.7} => {item.9} 0.04 1.0 1.851852
[104] {item.3,item.4} => {item.8} 0.04 1.0 1.851852
[105] {item.3,item.8} => {item.4} 0.04 1.0 5.000000
[113] {item.3,item.7} => {item.9} 0.04 1.0 1.851852
[119] {item.1,item.5} => {item.8} 0.04 1.0 1.851852
[149] {item.4,item.5} => {item.9} 0.04 1.0 1.851852
[155] {item.5,item.7} => {item.9} 0.06 1.0 1.851852
[176] {item.6,item.7} => {item.8} 0.06 1.0 1.851852
R code is provided in the next example
340 ASSOCIATION RULES AND COLLABORATIVE FILTERING
are purchased than if item 4 was not associated with the itemset {3,8}. Yet
we know there is no fundamental association underlying these data—they were
generated randomly.
Two principles can guide us in assessing rules for possible spuriousness due
to chance effects:
1. The more records the rule is based on, the more solid the conclusion.
2. The more distinct rules we consider seriously (perhaps consolidating mul-
tiple rules that deal with the same items), the more likely it is that at least
some will be based on chance sampling results. For one person to toss a
coin 10 times and get 10 heads would be quite surprising. If 1000 people
toss a coin 10 times each, it would not be nearly so surprising to have
one get 10 heads. Formal adjustment of “statistical significance” when
multiple comparisons are made is a complex subject in its own right, and
beyond the scope of this book. A reasonable approach is to consider rules
from the top-down in terms of business or operational applicability, and
not consider more than what can reasonably be incorporated in a human
decision-making process. This will impose a rough constraint on the
dangers that arise from an automated review of hundreds or thousands of
rules in search of “something interesting.”
We now consider a more realistic example, using a larger database and real
transactional data.
Example 2: Rules for Similar Book Purchases
The following example (drawn from the Charles Book Club case; see Chapter
21) examines associations among transactions involving various types of books.
The database includes 2000 transactions, and there are 11 different types of
books. The data, in binary incidence matrix form, are shown in Table 14.7.
TABLE 14.7 SUBSET OF BOOK PURCHASE TRANSACTIONS IN BINARY MATRIX FORMAT
ChildBks YouthBks CookBks DoItYBks cefBks ArtBks GeogBks ItalCook ItalAtlas ItalArt Florence
0 1 0 1 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 1 0 1 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
ASSOCIATION RULES 341
For instance, the first transaction included YouthBks (youth books) DoItYBks
(do-it-yourself books), and GeogBks (geography books). Table 14.8 shows some
of the rules generated for these data, given that we specified a minimal support
of 200 transactions (out of 4000 transactions) and a minimal confidence of 50%.
This resulted in 21 rules (see Table 14.8).
TABLE 14.8 RULES FOR BOOK PURCHASE TRANSACTIONS
code for running the Apriori algorithm
all.books.df <- read.csv("CharlesBookClub.csv")
# create a binary incidence matrix
count.books.df <- all.books.df[, 8:18]
incid.books.df <- ifelse(count.books.df > 0, 1, 0)
incid.books.mat <- as.matrix(incid.books.df[, -1])
# convert the binary incidence matrix into a transactions database
books.trans <- as(incid.books.mat, "transactions")
inspect(books.trans)
# plot data
itemFrequencyPlot(books.trans)
# run apriori function
rules <- apriori(books.trans,
parameter = list(supp= 200/4000, conf = 0.5, target = "rules"))
# inspect rules
inspect(sort(rules, by = "lift"))
Output
> inspect(sort(rules, by = “lift”))
lhs rhs support confidence lift
16 {DoItYBks,GeogBks} => {YouthBks} 0.05450 0.5396040 2.264864
18 {CookBks,GeogBks} => {YouthBks} 0.08025 0.5136000 2.155719
13 {CookBks,RefBks} => {DoItYBks} 0.07450 0.5330948 2.092619
14 {YouthBks,GeogBks} => {DoItYBks} 0.05450 0.5215311 2.047227
20 {YouthBks,CookBks} => {DoItYBks} 0.08375 0.5201863 2.041948
10 {YouthBks,RefBks} => {CookBks} 0.06825 0.8400000 2.021661
15 {YouthBks,DoItYBks} => {GeogBks} 0.05450 0.5278450 1.978801
19 {YouthBks,DoItYBks} => {CookBks} 0.08375 0.8111380 1.952197
12 {DoItYBks,RefBks} => {CookBks} 0.07450 0.8054054 1.938400
11 {RefBks,GeogBks} => {CookBks} 0.06450 0.7889908 1.898895
17 {YouthBks,GeogBks} => {CookBks} 0.08025 0.7679426 1.848237
21 {DoItYBks,GeogBks} => {CookBks} 0.07750 0.7673267 1.846755
7 {YouthBks,ArtBks} => {CookBks} 0.05150 0.7410072 1.783411
9 {DoItYBks,ArtBks} => {CookBks} 0.05300 0.7114094 1.712177
3 {RefBks} => {CookBks} 0.13975 0.6825397 1.642695
8 {ArtBks,GeogBks} => {CookBks} 0.05525 0.6800000 1.636582
4 {YouthBks} => {CookBks} 0.16100 0.6757608 1.626380
6 {DoItYBks} => {CookBks} 0.16875 0.6624141 1.594258
1 {ItalCook} => {CookBks} 0.06875 0.6395349 1.539193
5 {GeogBks} => {CookBks} 0.15625 0.5857545 1.409758
2 {ArtBks} => {CookBks} 0.11300 0.5067265 1.219558
342 ASSOCIATION RULES AND COLLABORATIVE FILTERING
In reviewing these rules, we see that the information can be compressed.
First, rule #1, which appears from the confidence level to be a very promis-
ing rule, is probably meaningless. It says: “If Italian cooking books have been
purchased, then cookbooks are purchased.” It seems likely that Italian cooking
books are simply a subset of cookbooks. Rules 14,15, and 16 involve the same
trio of books, with different antecedents and consequents. The same is true of
rules 17 and 18 as well as 19 and 20. (Pairs and groups like this are easy to track
down by looking for rows that share the same support.) This does not mean that
the rules are not useful. On the contrary, it can reduce the number of itemsets
to be considered for possible action from a business perspective.
14.2 Collaborative Filtering3
Recommendation systems are a critically important part of websites that offer
a large variety of products or services. Examples include Amazon.com, which
offers millions of different products; Netflix has thousands of movies for rental;
Google searches over huge numbers of webpages; Internet radio websites such
as Spotify and Pandora include a large variety of music albums by various artists;
travel websites offer many destinations and hotels; social network websites have
many groups. The recommender engine provides personalized recommenda-
tions to a user based on the user’s information as well as on similar users’ infor-
mation. Information means behaviors indicative of preference, such as purchase,
ratings, and clicking.
The value that recommendation systems provide to users helps online com-
panies convert browsers into buyers, increase cross-selling, and build loyalty.
Collaborative filtering is a popular technique used by such recommendation
systems. The term collaborative filtering is based on the notions of identifying
relevant items for a specific user from the very large set of items (“filtering”) by
considering preferences of many users (“collaboration”).
The Fortune.com article “Amazon’s Recommendation Secret” (June 30,
2012) describes the company’s use of collaborative filtering not only for provid-
ing personalized product recommendations, but also for customizing the entire
website interface for each user:
At root, the retail giant’s recommendation system is based on a number of
simple elements: what a user has bought in the past, which items they have in
their virtual shopping cart, items they’ve rated and liked, and what other
customers have viewed and purchased. Amazon calls this homegrown math
“item-to-item collaborative filtering,” and it’s used this algorithm to heavily
customize the browsing experience for returning customers.
3This section copyright © 2017 Datastats, LLC, Galit Shmueli, and Peter Bruce.
COLLABORATIVE FILTERING 343
Data Type and Format
Collaborative filtering requires availability of all item–user information. Specif-
ically, for each item–user combination, we should have some measure of the
user’s preference for that item. Preference can be a numerical rating or a binary
behavior such as a purchase, a ‘like’, or a click.
For n users (u1, u2, . . . , un) and p items (i1, i2, . . . ip), we can think of the
data as an n×p matrix of n rows (users) by p columns (items). Each cell includes
the rating or the binary event corresponding to the user’s preference of the item
(see schematic in Table 14.9). Typically not every user purchases or rates every
item, and therefore a purchase matrix will have many zeros (it is sparse), and
a rating matrix will have many missing values. Such missing values sometimes
convey “uninterested” (as opposed to non-missing values that convey interest).
When both n and p are large, it is not practical to store the preferences data
(ru,i) in an n× p table. Instead, the data can be stored in many rows of triplets
of the form (Uu, Ii, ru,i), where each triplet contains the user ID, the item ID,
and the preference information.
Item ID
User ID I1 I2 · · · Ip
U1 r1,1 r1,2 · · · r1,p
U2 r2,1 r2,2 · · · r2,p
…
Un rn,1 rn,2 · · · rn,p
TABLE 14.9 SCHEMATIC OF MATRIX FORMAT WITH RATINGS DATA
Example 3: Netflix Prize Contest
We have been considering both association rules and collaborative filtering as
unsupervised techniques, but it is possible to judge how well they do by looking
at holdout data to see what users purchase and how they rate items. The famous
Netflix contest, mentioned in Chapter 13, did just this and provides a useful
example to illustrate collaborative filtering, though the extension into training
and validation is beyond the scope of this book.
In 2006, Netflix, the largest movie rental service in North America,
announced a one million USD contest (www.netflixprize.com) for the purpose
of improving its recommendation system called Cinematch. Participants were
provided with a number of datasets, one for each movie. Each dataset included
all the customer ratings for that movie (and the timestamp). We can think of one
large combined dataset of the form [customer ID, movie ID, rating, date] where
each record includes the rating given by a certain customer to a certain movie
http://www.netflixprize.com
344 ASSOCIATION RULES AND COLLABORATIVE FILTERING
TABLE 14.10 SAMPLE OF RECORDS FROM THE NETFLIX PRIZE
CONTEST, FOR A SUBSET OF 10 CUSTOMERS AND
9 MOVIES
Movie ID
Customer ID 1 5 8 17 18 28 30 44 48
30878 4 1 3 3 4 5
124105 4
822109 5
823519 3 1 4 4 5
885013 4 5
893988 3 4 4
1248029 3 2 4 3
1503895 4
1842128 4 3
2238063 3
on a certain date. Ratings were on a 1–5 star scale. Contestants were asked to
develop a recommendation algorithm that would improve over the existing Net-
flix system. Table 14.10 shows a small sample from the contest data, organized
in matrix format. Rows indicate customers and columns are different movies.
It is interesting to note that the winning team was able to improve their
system by considering not just the ratings for a movie, but whether a movie was
rated by a particular customer or not. In other words, the information on which
movies a customer decided to rate turned out to be critically informative of cus-
tomers’ preferences, more than simply considering the 1–5 rating information4:
Collaborative filtering methods address the sparse set of rating values.
However, much accuracy is obtained by also looking at other features of the
data. First is the information on which movies each user chose to rate,
regardless of specific rating value (“the binary view”). This played a decisive
role in our 2007 solution, and reflects the fact that the movies to be rated are
selected deliberately by the user, and are not a random sample.
This is an example where converting the rating information into a binary
matrix of rated/unrated proved to be useful.
User-Based Collaborative Filtering: “People Like You”
One approach to generating personalized recommendations for a user using col-
laborative filtering is based on finding users with similar preferences, and recom-
mending items that they liked but the user hasn’t purchased. The algorithm has
two steps:
4Bell, R. M., Koren, Y., and Volinsky, C., “The BellKor 2008 Solution to the Netflix Prize”, www.
netflixprize.com/assets/ProgressPrize2008_BellKor .
http://www.netflixprize.com/assets/ProgressPrize2008_BellKor
http://www.netflixprize.com/assets/ProgressPrize2008_BellKor
COLLABORATIVE FILTERING 345
1. Find users who are most similar to the user of interest (neighbors). This
is done by comparing the preference of our user to the preferences of
other users.
2. Considering only the items that the user has not yet purchased, recom-
mend the ones that are most preferred by the user’s neighbors.
This is the approach behind Amazon’s “Customers Who Bought This Item Also
Bought…” (see Figure 14.1). It is also used in a Google search for generating
the “Similar pages” link shown near each search result.
Step 1 requires choosing a distance (or proximity) metric to measure the
distance between our user and the other users. Once the distances are computed,
we can use a threshold on the distance or on the number of required neighbors
to determine the nearest neighbors to be used in Step 2. This approach is called
“user-based top-N recommendation.”
A nearest-neighbors approach measures the distance of our user to each of
the other users in the database, similar to the k-nearest-neighbors algorithm
(see Chapter 7). The Euclidean distance measure we discussed in that chap-
ter does not perform as well for collaborative filtering as some other mea-
sures. A popular proximity measure between two users is the Pearson correlation
between their ratings. We denote the ratings of items I1, . . . , Ip by user U1 as
r1,1, r1,2, . . . , r1,p and their average by r1. Similarly, the ratings by user U2 are
r2,1, r2,2, . . . , r2,p, with average r2. The correlation proximity between the two
users is defined by
Corr(U1, U2) =
∑
(r1,i − r1)(r2,i − r2)√∑
(r1,i − r1)2
√∑
(r2,i − r2)2
, (14.1)
where the summations are only over the items co-rated by both users.
To illustrate this, let us compute the correlation between customer 30878
and customer 823519 in the small Netflix sample in Table 14.10. We’ll assume
that the data shown in the table is the entire information. First, we compute the
average rating by each of these users:
r30878 = (4 + 1 + 3 + 3 + 4 + 5)/6 = 3.333
r823519 = (3 + 1 + 4 + 4 + 5)/5 = 3.4
Note that the average is computed over a different number of movies for each of
these customers, because they each rated a different set of movies. The average
for a customer is computed over all the movies that a customer rated. The
calculations for the correlation involve the departures from the average, but
346 ASSOCIATION RULES AND COLLABORATIVE FILTERING
only for the items that they co-rated. In this case the co-rated movie IDs are 1,
28, and 30:
Corr(U30878, U823519) =
(4− 3.333)(3− 3.4) + (3− 3.333)(4− 3.4) + (4− 3.333)(5− 3.4)√
(4− 3.333)2 + (3− 3.333)2 + (4− 3.333)2
√
(3− 3.4)2 + (4− 3.4)2 + (5− 3.4)2
= 0.6/1.75 = 0.34
The same approach can be used when the data are in the form of a binary
matrix (e.g., purchased or didn’t purchase.)
Another popular measure is a variant of the Pearson correlation called cosine
similarity. It differs from the correlation formula by not subtracting the means.
Subtracting the mean in the correlation formula adjusts for users’ different overall
approaches to rating—for example, a customer who always rates highly vs. one
who tends to give low ratings.5
For example, the cosine similarity between the two Netflix customers is:
Cos Sim(U30878, U823519) =
4× 3 + 3× 4 + 4× 5√
42 + 32 + 42
√
32 + 42 + 52
= 44/45.277 = 0.972
Note that when the data are in the form of a binary matrix, say, for purchase
or no-purchase, the cosine similarity must be calculated over all items that either
user has purchased; it cannot be limited to just the items that were co-purchased.
Collaborative filtering suffers from what is called a cold start: it cannot be used as is
to create recommendations for new users or new items. For a user who rated a single
item, the correlation coefficient between this and other users (in user-generated
collaborative filtering) will have a denominator of zero and the cosine proximity
will be 1 regardless of the rating. In a similar vein, users with just one item, and
items with just one user, do not qualify as candidates for nearby neighbors.
For a user of interest, we compute his/her similarity to each of the users in
our database using a correlation, cosine similarity, or another measure. Then, in
Step 2, we look only at the k nearest users, and among all the other items that
they rated/purchased, we choose the best one and recommend it to our user.
What is the best one? For binary purchase data, it is the item most purchased.
5Correlation and cosine similarity are popular in collaborative filtering because they are computation-
ally fast for high-dimensional sparse data, and they account both for the rating values and the number
of rated items.
COLLABORATIVE FILTERING 347
For rating data, it could be the highest rated, most rated, or a weighting of the
two.
The nearest-neighbors approach can be computationally expensive when
we have a large database of users. One solution is to use clustering methods
(see Chapter 15) to group users into homogeneous clusters in terms of their
preferences, and then to measure the distance of our user to each of the clusters.
This approach places the computational load on the clustering step that can
take place earlier and offline; it is then cheaper (and faster) to compare our
user to each of the clusters in real time. The price of clustering is less accurate
recommendations, because not all the members of the closest cluster are the most
similar to our user.
Item-Based Collaborative Filtering
When the number of users is much larger than the number of items, it is com-
putationally cheaper (and faster) to find similar items rather than similar users.
Specifically, when a user expresses interest in a particular item, the item-based
collaborative filtering algorithm has two steps:
1. Find the items that were co-rated, or co-purchased, (by any user) with
the item of interest.
2. Recommend the most popular or correlated item(s) among the similar
items.
Similarity is now computed between items, instead of users. For example, in
our small Netflix sample (Table 14.10), the correlation between movie 1 (with
average r1 = 3.7) and movie 5 (with average r5 = 3) is:
Corr(I1, I5) =
(4− 3.7)(1− 3) + (4− 3.7)(5− 3)√
(4− 3.7)2 + (4− 3.7)2
√
(1− 3)2 + (5− 3)2
= 0
The zero correlation is due to the two opposite ratings of movie 5 by the users
who also rated 1. One user rated it 5 stars and the other gave it a 1 star.
In like fashion, we can compute similarity between all the movies. This can
be done offline. In real time, for a user who rates a certain movie highly, we can
look up the movie correlation table and recommend the movie with the highest
positive correlation to the user’s newly rated movie.
According to an industry report6 by researchers who developed the Amazon
item-to-item recommendation system,
“[The item-based] algorithm produces recommendations in real time, scales to
massive data sets, and generates high-quality recommendations.”
6“Linden, G., Smith, B., and York J., Amazon.com Recommendations: Item-to-Item Collaborative
Filtering”, IEEE Internet Computing, vol. 7, no. 1, p. 76–80, 2003.
348 ASSOCIATION RULES AND COLLABORATIVE FILTERING
The disadvantage of item-based recommendations is that there is less diversity
between items (compared to users’ taste), and therefore, the recommendations
are often obvious.
Table 14.11 shows R code for collaborative filtering on a simulated data
similar to the Netflix data using the recommenderlab package.
TABLE 14.11 COLLABORATIVE FILTERING IN R
code for collaborative filtering
# simulate matrix with 1000 users and 100 movies
m <- matrix(nrow = 1000, ncol = 100)
# simulated ratings (1% of the data)
m[sample.int(100*1000, 1000)] <- ceiling(runif(1000, 0, 5))
## convert into a realRatingMatrix
r <- as(m, "realRatingMatrix")
library(recommenderlab)
# user-based collaborative filtering
UB.Rec <- Recommender(r, "UBCF")
pred <- predict(UB.Rec, r, type="ratings")
as(pred, "matrix")
# item-based collaborative filtering
IB.Rec <- Recommender(r, "IBCF")
pred <- predict(IB.Rec, r, type="ratings")
as(pred, "matrix")
Advantages and Weaknesses of Collaborative Filtering
Collaborative filtering relies on the availability of subjective information regard-
ing users’ preferences. It provides useful recommendations, even for “long tail”
items, if our database contains sufficient similar users (not necessarily many, but
at least a few per user), so that each user can find other users with similar tastes.
Similarly, the data should include sufficient per-item ratings or purchases. One
limitation of collaborative filtering is therefore that it cannot generate recom-
mendations for new users, nor for new items. There are various approaches for
tackling this challenge.
User-based collaborative filtering looks for similarity in terms of highly-rated
or preferred items. However, it is blind to data on low-rated or unwanted items.
We can therefore not expect to use it as is for detecting unwanted items.
User-based collaborative filtering helps leverage similarities between peo-
ple’s tastes for providing personalized recommendations. However, when the
number of users becomes very large, collaborative filtering becomes computa-
tionally difficult. Solutions include item-based algorithms, clustering of users,
COLLABORATIVE FILTERING 349
and dimension reduction. The most popular dimension reduction method used
in such cases is singular value decomposition (SVD), a computationally superior
form of principal components analysis (see Chapter 4).
Although the term “prediction” is often used to describe the output of col-
laborative filtering, this method is unsupervised by nature. It can be used to
generate predicted ratings or purchase indication for a user, but usually we do
not have the true outcome value in practice. One important way to improve rec-
ommendations generated by collaborative filtering is by getting user feedback.
Once a recommendation is generated, the user can indicate whether the recom-
mendation was adequate or not. For this reason, many recommender systems
entice users to provide feedback on their recommendations.
Collaborative Filtering vs. Association Rules
While collaborative filtering and association rules are both unsupervised methods
used for generating recommendations, they differ in several ways:
Frequent itemsets vs. personalized recommendations Association
rules look for frequent item combinations and will provide recommendations
only for those items. In contrast, collaborative filtering provides personalized
recommendations for every item, thereby catering to users with unusual
taste. In this sense, collaborative filtering is useful for capturing the “long
tail” of user preferences, while association rules look for the “head.” This
difference has implications for the data needed: association rules require data
on a very large number of “baskets” (transactions) in order to find a sufficient
number of baskets that contain certain combinations of items. In contrast,
collaborative filtering does not require many “baskets,” but does require data
on as many items as possible for many users. Also, association rules operate
at the basket level (our database can include multiple transactions for each
user), while collaborative filtering operates at the user level.
Because association rules produce generic, impersonal rules (association-
based recommendations such as Amazon’s “Frequently Bought Together”
display the same recommendations to all users searching for a specific item),
they can be used for setting common strategies such as product placement
in a store or sequencing of diagnostic tests in hospitals. In contrast, col-
laborative filtering generates user-specific recommendations (e.g., Amazon’s
“Customers Who Bought This Item Also Bought…”) and is therefore a tool
designed for personalization.
Transactional data vs. user data Association rules provide recommen-
dations of items based on their co-purchase with other items in many trans-
actions/baskets. In contrast, collaborative filtering provides recommendations
of items based on their co-purchase or co-rating by even a small number of
350 ASSOCIATION RULES AND COLLABORATIVE FILTERING
other users. Considering distinct baskets is useful when the same items are
purchased over and over again (e.g., in grocery shopping). Considering dis-
tinct users is useful when each item is typically purchased/rated once (e.g.,
purchases of books, music, and movies).
Binary data and ratings data Association rules treat items as binary data
(1 = purchase, 0 = nonpurchase), whereas collaborative filtering can operate
on either binary data or on numerical ratings.
Two or more items In association rules, the antecedent and consequent
can each include one or more items (e.g., IF milk THEN cookies and corn-
flakes). Hence, a recommendation might be a bundle of the item of interest
with multiple items (“buy milk, cookies, and cornflakes and receive 10% dis-
count”). In contrast, in collaborative filtering, similarity is measured between
pairs of items or pairs of users. A recommendation will therefore be either
for a single item (the most popular item purchased by people like you, which
you haven’t purchased), or for multiple single items which do not necessarily
relate to each other (the top two most popular items purchased by people
like you, which you haven’t purchased).
These distinctions are sharper for purchases and recommendations of non-
popular items, especially when comparing association rules to user-based collab-
orative filtering. When considering what to recommend to a user who purchased
a popular item, then association rules and item-based collaborative filtering might
yield the same recommendation for a single item. But a user-based recommen-
dation will likely differ. Consider a customer who purchases milk every week
as well as gluten-free products (which are rarely purchased by other customers).
Suppose that using association rules on the transaction database we identify the
rule “IF milk THEN cookies.” Then, the next time our customer purchases
milk, s/he will receive a recommendation (e.g., a coupon) to purchase cookies,
whether or not s/he purchased cookies, and irrespective of his/her gluten-free
item purchases. In item-based collaborative filtering, we would look at all items
co-purchased with milk across all users and recommend the most popular item
among them (which was not purchased by our customer). This might also lead
to a recommendation of cookies, because this item was not purchased by our
customer.7 Now consider user-based collaborative filtering. User-based collab-
orative filtering searches for similar customers—those who purchased the same
set of items—and then recommend the item most commonly purchased by these
neighbors, which was not purchased by our customer. The user-based recommenda-
tion is therefore unlikely to recommend cookies and more likely to recommend
popular gluten-free items that the customer has not purchased.
7If the rule is “IF milk, THEN cookies and cornflakes” then the association rules would recommend
cookies and cornflakes to a milk purchaser, while item-based collaborative filtering would recommend
the most popular single item purchased with milk.
SUMMARY 351
14.3 Summary
Association rules (also called market basket analysis) and collaborative filtering
are unsupervised methods for deducing associations between purchased items
from databases of transactions. Association rules search for generic rules about
items that are purchased together. The main advantage of this method is that it
generates clear, simple rules of the form “IF X is purchased, THEN Y is also
likely to be purchased.” The method is very transparent and easy to understand.
The process of creating association rules is two-staged. First, a set of candi-
date rules based on frequent itemsets is generated (the Apriori algorithm being
the most popular rule-generating algorithm). Then from these candidate rules,
the rules that indicate the strongest association between items are selected. We
use the measures of support and confidence to evaluate the uncertainty in a rule.
The user also specifies minimal support and confidence values to be used in
the rule generation and selection process. A third measure, the lift ratio, com-
pares the efficiency of the rule to detect a real association compared to a random
combination.
One shortcoming of association rules is the profusion of rules that are gen-
erated. There is therefore a need for ways to reduce these to a small set of useful
and strong rules. An important nonautomated method to condense the infor-
mation involves examining the rules for uninformative and trivial rules as well
as for rules that share the same support. Another issue that needs to be kept in
mind is that rare combinations tend to be ignored, because they do not meet the
minimum support requirement. For this reason, it is better to have items that
are approximately equally frequent in the data. This can be achieved by using
higher-level hierarchies as the items. An example is to use types of books rather
than titles of individual books in deriving association rules from a database of
bookstore transactions.
Collaborative filtering is a popular technique used in online recommenda-
tion systems. It is based on the relationship between items formed by users who
acted similarly on an item, such as purchasing or rating an item highly. User-
based collaborative filtering operates on data on item–user combinations, calcu-
lates the similarities between users and provides personalized recommendations
to users. An important component for the success of collaborative filtering is that
users provide feedback about the recommendations provided and have sufficient
information on each item. One disadvantage of collaborative filtering methods
is that they cannot generate recommendations for new users or new items. Also,
with a huge number of users, user-based collaborative filtering becomes compu-
tationally challenging, and alternatives such as item-based methods or dimension
reduction are popularly used.
352 ASSOCIATION RULES AND COLLABORATIVE FILTERING
PROBLEMS
14.1 Satellite Radio Customers. An analyst at a subscription-based satellite radio com-
pany has been given a sample of data from their customer database, with the goal of
finding groups of customers who are associated with one another. The data consist
of company data, together with purchased demographic data that are mapped to the
company data (see Table 14.12). The analyst decides to apply association rules to learn
more about the associations between customers. Comment on this approach.
TABLE 14.12 SAMPLE OF DATA ON SATELLITE RADIO CUSTOMERS
Row ID zipconvert_2 zipconvert_3 zipconvert_4 zipconvert_5 homeowner NUMCHLD INCOME gender WEALTH
dummy dummy
17 0 1 0 0 1 1 5 1 9
25 1 0 0 0 1 1 1 0 7
29 0 0 0 1 0 2 5 1 8
38 0 0 0 1 1 1 3 0 4
40 0 1 0 0 1 1 4 0 8
53 0 1 0 0 1 1 4 1 8
58 0 0 0 1 1 1 4 1 8
61 1 0 0 0 1 1 1 0 7
71 0 0 1 0 1 1 4 0 5
87 1 0 0 0 1 1 4 1 8
100 0 0 0 1 1 1 4 1 8
104 1 0 0 0 1 1 1 1 5
121 0 0 1 0 1 1 4 1 5
142 1 0 0 0 0 1 5 0 8
14.2 Identifying Course Combinations. The Institute for Statistics Education at Statis-
tics.com offers online courses in statistics and analytics, and is seeking information that
will help in packaging and sequencing courses. Consider the data in the file Course-
Topics.csv, the first few rows of which are shown in Table 14.13. These data are for
purchases of online statistics courses at Statistics.com. Each row represents the courses
attended by a single customer. The firm wishes to assess alternative sequencings and
bundling of courses. Use association rules to analyze these data, and interpret several
of the resulting rules.
TABLE 14.13 DATA ON PURCHASES OF ONLINE STATISTICS COURSES
Intro DataMining Survey CatData Regression Forecast DOE SW
1 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 1 0 1 1 0 0 1
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 1 0 1 1 1
1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0
PROBLEMS 353
14.3 Recommending Courses We again consider the data in CourseTopics.csv describing
course purchases at Statistics.com (see Problem 14.2 and data sample in Table 14.13).
We want to provide a course recommendation to a student who purchased the Regres-
sion and Forecast courses. Apply user-based collaborative filtering to the data. You
will get a Null matrix. Explain why this happens.
14.4 Cosmetics Purchases. The data shown in Table 14.14 and the output in Table 14.15
are based on a subset of a dataset on cosmetic purchases (Cosmetics.csv) at a large chain
TABLE 14.14 EXCERPT FROM DATA ON COSMETICS PURCHASES IN BINARY
MATRIX FORM
Trans. # Bag Blush Nail Polish Brushes Concealer Eyebrow Pencils Bronzer
1 0 1 1 1 1 0 1
2 0 0 1 0 1 0 1
3 0 1 0 0 1 1 1
4 0 0 1 1 1 0 1
5 0 1 0 0 1 0 1
6 0 0 0 0 1 0 0
7 0 1 1 1 1 0 1
8 0 0 1 1 0 0 1
9 0 0 0 0 1 0 0
10 1 1 1 1 0 0 0
11 0 0 1 0 0 0 1
12 0 0 1 1 1 0 1
TABLE 14.15 ASSOCIATION RULES FOR COSMETICS PURCHASES DATA
lhs rhs support confidence lift
1 {Blush,
Concealer,
Mascara,
Eye.shadow,
Lipstick} => {Eyebrow.Pencils} 0.013 0.3023255814 7.198228128
2 {Trans.,
Blush,
Concealer,
Mascara,
Eye.shadow,
Lipstick} => {Eyebrow.Pencils} 0.013 0.3023255814 7.198228128
3 {Blush,
Concealer,
Mascara,
Lipstick} => {Eyebrow.Pencils} 0.013 0.2888888889 6.878306878
4 {Trans.,
Blush,
Concealer,
Mascara,
Lipstick} => {Eyebrow.Pencils} 0.013 0.2888888889 6.878306878
5 {Blush,
Concealer,
Eye.shadow,
Lipstick} => {Eyebrow.Pencils} 0.013 0.2826086957 6.728778468
6 {Trans.,
Blush,
Concealer,
Eye.shadow,
Lipstick} => {Eyebrow.Pencils} 0.013 0.2826086957 6.728778468
354 ASSOCIATION RULES AND COLLABORATIVE FILTERING
drugstore. The store wants to analyze associations among purchases of these items
for purposes of point-of-sale display, guidance to sales personnel in promoting cross-
sales, and guidance for piloting an eventual time-of-purchase electronic recommender
system to boost cross-sales. Consider first only the data shown in Table 14.14, given
in binary matrix form.
a. Select several values in the matrix and explain their meaning.
b. Consider the results of the association rules analysis shown in Table 14.15.
i. For the first row, explain the “confidence” output and how it is calculated.
ii. For the first row, explain the “support” output and how it is calculated.
iii. For the first row, explain the “lift” and how it is calculated.
iv. For the first row, explain the rule that is represented there in words.
c. Now, use the complete dataset on the cosmetics purchases (in the file Cosmetics.csv).
Using R, apply association rules to these data (use the default parameters).
i. Interpret the first three rules in the output in words.
ii. Reviewing the first couple of dozen rules, comment on their redundancy and
how you would assess their utility.
14.5 Course ratings. The Institute for Statistics Education at Statistics.com asks students to
rate a variety of aspects of a course as soon as the student completes it. The Institute
is contemplating instituting a recommendation system that would provide students
with recommendations for additional courses as soon as they submit their rating for
a completed course. Consider the excerpt from student ratings of online statistics
courses shown in Table 14.16, and the problem of what to recommend to student
E.N.
a. First consider a user-based collaborative filter. This requires computing correlations
between all student pairs. For which students is it possible to compute correlations
with E.N.? Compute them.
b. Based on the single nearest student to E.N., which single course should we recom-
mend to E.N.? Explain why.
c. Use R (function similarity()) to compute the cosine similarity between users.
TABLE 14.16 RATINGS OF ONLINE STATISTICS COURSES: 4 = BEST, 1 = WORST, BLANK = NOT
TAKEN
SQL Spatial PA 1 DM in R Python Forecast R Prog Hadoop Regression
L N 4 3 2 4 2
M H 3 4 4
J H 2 2
E N 4 4 4 3
D U 4 4
F L 4
G L 4
A H 3
S A 4
R W 2 4
B A 4
M G 4 4
A F 4
K G 3
D S 4 2 4
PROBLEMS 355
d. Based on the cosine similarities of the nearest students to E.N., which course should
be recommended to E.N.?
e. What is the conceptual difference between using the correlation as opposed to
cosine similarities? [Hint: how are the missing values in the matrix handled in each
case?]
f. With large datasets, it is computationally difficult to compute user-based recom-
mendations in real time, and an item-based approach is used instead. Returning to
the rating data (not the binary matrix), let’s now take that approach.
i. If the goal is still to find a recommendation for E.N., for which course pairs is
it possible and useful to calculate correlations?
ii. Just looking at the data, and without yet calculating course pair correlations,
which course would you recommend to E.N., relying on item-based filter-
ing? Calculate two course pair correlations involving your guess and report the
results.
g. Apply item-based collaborative filtering to this dataset (using R) and based on the
results, recommend a course to E.N.
CHAPTER 15
Cluster Analysis
This chapter is about the popular unsupervised learning task of clustering, where
the goal is to segment the data into a set of homogeneous clusters of records for
the purpose of generating insight. Separating a dataset into clusters of homoge-
neous records is also useful for improving performance of supervised methods, by
modeling each cluster separately rather than the entire, heterogeneous dataset.
Clustering is used in a vast variety of business applications, from customized
marketing to industry analysis. We describe two popular clustering approaches:
hierarchical clustering and k-means clustering. In hierarchical clustering, records are
sequentially grouped to create clusters, based on distances between records and
distances between clusters. We describe how the algorithm works in terms of the
clustering process and mention several common distance metrics used. Hierar-
chical clustering also produces a useful graphical display of the clustering process
and results, called a dendrogram. We present dendrograms and illustrate their
usefulness. k-means clustering is widely used in large dataset applications. In
k-means clustering, records are allocated to one of a prespecified set of clusters,
according to their distance from each cluster. We describe the k-means cluster-
ing algorithm and its computational advantages. Finally, we present techniques
that assist in generating insight from clustering results.
15.1 Introduction
Cluster analysis is used to form groups or clusters of similar records based on
several measurements made on these records. The key idea is to characterize the
clusters in ways that would be useful for the aims of the analysis. This idea has
been applied in many areas, including astronomy, archaeology, medicine, chem-
istry, education, psychology, linguistics, and sociology. Biologists, for example,
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
357
358 CLUSTER ANALYSIS
have made extensive use of classes and subclasses to organize species. A spectac-
ular success of the clustering idea in chemistry was Mendeleev’s periodic table
of the elements.
One popular use of cluster analysis in marketing is for market segmentation:
customers are segmented based on demographic and transaction history infor-
mation, and a marketing strategy is tailored for each segment. In countries such
as India, where customer diversity is extremely location-sensitive, chain stores
often perform market segmentation at the store level, rather than chain-wide
(called “micro segmentation”). Another use is for market structure analysis: iden-
tifying groups of similar products according to competitive measures of similar-
ity. In marketing and political forecasting, clustering of neighborhoods using US
postal zip codes has been used successfully to group neighborhoods by lifestyles.
Claritas, a company that pioneered this approach, grouped neighborhoods into
40 clusters using various measures of consumer expenditure and demographics.
Examining the clusters enabled Claritas to come up with evocative names, such
as “Bohemian Mix,” “Furs and Station Wagons,” and “Money and Brains,” for
the groups that captured the dominant lifestyles. Knowledge of lifestyles can be
used to estimate the potential demand for products (such as sports utility vehicles)
and services (such as pleasure cruises). Similarly, sales organizations will derive
customer segments and give them names—“personas”—to focus sales efforts.
In finance, cluster analysis can be used for creating balanced portfolios: Given
data on a variety of investment opportunities (e.g., stocks), one may find clus-
ters based on financial performance variables such as return (daily, weekly, or
monthly), volatility, beta, and other characteristics, such as industry and market
capitalization. Selecting securities from different clusters can help create a bal-
anced portfolio. Another application of cluster analysis in finance is for industry
analysis: For a given industry, we are interested in finding groups of similar firms
based on measures such as growth rate, profitability, market size, product range,
and presence in various international markets. These groups can then be ana-
lyzed in order to understand industry structure and to determine, for instance,
who is a competitor.
An interesting and unusual application of cluster analysis, described in Berry
and Linoff (1997), is the design of a new set of sizes for army uniforms for
women in the US Army. The study came up with a new clothing size system
with only 20 sizes, where different sizes fit different body types. The 20 sizes are
combinations of five measurements: chest, neck, and shoulder circumference,
sleeve outseam, and neck-to-buttock length (for further details, see McCullugh
et al., 1998). This example is important because it shows how a completely new
insightful view can be gained by examining clusters of records.
Cluster analysis can be applied to huge amounts of data. For instance, Inter-
net search engines use clustering techniques to cluster queries that users submit.
These can then be used for improving search algorithms. The objective of this
INTRODUCTION 359
chapter is to describe the key ideas underlying the most commonly used tech-
niques for cluster analysis and to lay out their strengths and weaknesses.
Typically, the basic data used to form clusters are a table of measurements on
several variables, where each column represents a variable and a row represents
a record. Our goal is to form groups of records so that similar records are in the
same group. The number of clusters may be prespecified or determined from
the data.
Example: Public Utilities
Table 15.1 gives corporate data on 22 public utilities in the United States (the
variable definitions are given in the table footnote). We are interested in forming
groups of similar utilities. The records to be clustered are the utilities, and the
clustering will be based on the eight measurements on each utility. An example
TABLE 15.1 DATA ON 22 PUBLIC UTILITIES
Company Fixed RoR Cost Load Demand Sales Nuclear Fuel Cost
Arizona Public Service 1.06 9.2 151 54.4 1.6 9077 0.0 0.628
Boston Edison Co. 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central Louisiana Co. 1.43 15.4 113 53.0 3.4 9212 0.0 1.058
Commonwealth Edison Co. 1.02 11.2 168 56.0 0.3 6423 34.3 0.700
Consolidated Edison Co. (NY) 1.49 8.8 192 51.2 1.0 3300 15.6 2.044
Florida Power & Light Co. 1.32 13.5 111 60.0 −2.2 11127 22.5 1.241
Hawaiian Electric Co. 1.22 12.2 175 67.6 2.2 7642 0.0 1.652
daho Power Co. 1.10 9.2 245 57.0 3.3 13082 0.0 0.309
Kentucky Utilities Co. 1.34 13.0 168 60.4 7.2 8406 0.0 0.862
Madison Gas & Electric Co. 1.12 12.4 197 53.0 2.7 6455 39.2 0.623
Nevada Power Co. 0.75 7.5 173 51.5 6.5 17441 0.0 0.768
New England Electric Co. 1.13 10.9 178 62.0 3.7 6154 0.0 1.897
Northern States Power Co. 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma Gas & Electric Co. 1.09 12.0 96 49.8 1.4 9673 0.0 0.588
Pacific Gas & Electric Co. 0.96 7.6 164 62.2 −0.1 6468 0.9 1.400
Puget Sound Power & Light Co. 1.16 9.9 252 56.0 9.2 15991 0.0 0.620
San Diego Gas & Electric Co. 0.76 6.4 136 61.9 9.0 5714 8.3 1.920
The Southern Co. 1.05 12.6 150 56.7 2.7 10140 0.0 1.108
Texas Utilities Co. 1.16 11.7 104 54.0 −2.1 13507 0.0 0.636
Wisconsin Electric Power Co. 1.20 11.8 148 59.9 3.5 7287 41.1 0.702
United Illuminating Co. 1.04 8.6 204 61.0 3.5 6650 0.0 2.116
Virginia Electric & Power Co. 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
Fixed = fixed-charge covering ratio (income/debt); RoR = rate of return on capital
Cost = cost per kilowatt capacity in place; Load = annual load factor
Demand = peak kilowatthour demand growth from 1974 to 1975
Sales = sales (kilowatthour use per year)
Nuclear = percent nuclear
Fuel Cost = total fuel costs (cents per kilowatthour)
360 CLUSTER ANALYSIS
FIGURE 15.1 SCATTER PLOT OF FUEL COST VS. SALES FOR THE 22 UTILITIES
where clustering would be useful is a study to predict the cost impact of dereg-
ulation. To do the requisite analysis, economists would need to build a detailed
cost model of the various utilities. It would save a considerable amount of time
and effort if we could cluster similar types of utilities and build detailed cost
models for just one “typical” utility in each cluster and then scale up from these
models to estimate results for all utilities.
For simplicity, let us consider only two of the measurements: Sales and Fuel
Cost. Figure 15.1 shows a scatter plot of these two variables, with labels marking
each company. At first glance, there appear to be two or three clusters of utilities:
one with utilities that have high fuel costs, a second with utilities that have lower
fuel costs and relatively low sales, and a third with utilities with low fuel costs
but high sales.
We can therefore think of cluster analysis as a more formal algorithm that
measures the distance between records, and according to these distances (here,
two-dimensional distances), forms clusters.
Two general types of clustering algorithms for a dataset of n records are
hierarchical and non-hierarchical clustering:
Hierarchical methods can be either agglomerative or divisive. Agglomera-
tive methods begin with n clusters and sequentially merge similar clusters
until a single cluster is obtained. Divisive methods work in the opposite
direction, starting with one cluster that includes all records. Hierarchical
methods are especially useful when the goal is to arrange the clusters into a
natural hierarchy.
MEASURING DISTANCE BETWEEN TWO RECORDS 361
Non-hierarchical methods, such as k-means. Using a prespecified num-
ber of clusters, the method assigns records to each cluster. These methods
are generally less computationally intensive and are therefore preferred with
very large datasets.
We concentrate here on the two most popular methods: hierarchical
agglomerative clustering and k-means clustering. In both cases, we need to
define two types of distances: distance between two records and distance between
two clusters. In both cases, there is a variety of metrics that can be used.
15.2 Measuring Distance Between Two
Records
We denote by dij a distance metric, or dissimilarity measure, between records i
and j. For record i we have the vector of p measurements (xi1, xi2, . . . , xip),
while for record j we have the vector of measurements (xj1, xj2, . . . , xjp). For
example, we can write the measurement vector for Arizona Public Service as
[1.06, 9.2, 151, 54.4, 1.6, 9077, 0, 0.628].
Distances can be defined in multiple ways, but in general, the following
properties are required:
Non-negative: dij ≥ 0
Self-proximity: dii = 0 (the distance from a record to itself is zero)
Symmetry: dij = dji
Triangle inequality: dij ≤ dik + dkj (the distance between any pair can-
not exceed the sum of distances between the other two pairs)
Euclidean Distance
The most popular distance measure is the Euclidean distance, dij , which between
two records, i and j, is defined by
dij =
√
(xi1 − xj1)2 + (xi2 − xj2)2 + · · ·+ (xip − xjp)2.
For instance, the Euclidean distance between Arizona Public Service and
Boston Edison Co. can be computed from the raw data by
d12 =
√
(1.06− 0.89)2 + (9.2− 10.3)2 + (151− 202)2 + · · ·+ (0.628− 1.555)2
= 3989.408.
To compute Euclidean distance in R, see Table 15.2.
362 CLUSTER ANALYSIS
TABLE 15.2 DISTANCE MATRIX BETWEEN PAIRS OF THE FIRST FIVE UTILITIES, USING
EUCLIDEAN DISTANCE
code for computing distance between records
utilities.df <- read.csv("Utilities.csv")
# set row names to the utilities column
row.names(utilities.df) <- utilities.df[,1]
# remove the utility column
utilities.df <- utilities.df[,-1]
# compute Euclidean distance
# (to compute other distance measures, change the value in method = )
d <- dist(utilities.df, method = "euclidean")
Output
> d
Arizona Boston Central Commonwealth NY
Boston 3989.40808
Central 140.40286 4125.04413
Commonwealth 2654.27763 1335.46650 2789.75967
NY 5777.16767 1788.06803 5912.55291 3123.15322
Florida 2050.52944 6039.68908 1915.15515 4704.36310 7827.42921
Normalizing Numerical Measurements
The measure computed above is highly influenced by the scale of each vari-
able, so that variables with larger scales (e.g., Sales) have a much greater influ-
ence over the total distance. It is therefore customary to normalize continuous
measurements before computing the Euclidean distance. This converts all mea-
surements to the same scale. Normalizing a measurement means subtracting the
average and dividing by the standard deviation (normalized values are also called
z-scores). For instance, the average sales amount across the 22 utilities is 8914.045
and the standard deviation is 3549.984. The normalized sales for Arizona Public
Service is therefore (9077− 8914.045)/3549.984 = 0.046.
Returning to the simplified utilities data with only two measurements (Sales
and Fuel Cost), we first normalize the measurements (see Table 15.3), and then
compute the Euclidean distance between each pair. Table 15.4 gives these pair-
wise distances for the first five utilities. A similar table can be constructed for all
22 utilities.
Other Distance Measures for Numerical Data
It is important to note that the choice of the distance measure plays a major role in
cluster analysis. The main guideline is domain dependent: What exactly is being
MEASURING DISTANCE BETWEEN TWO RECORDS 363
TABLE 15.3 ORIGINAL AND NORMALIZED MEASUREMENTS FOR SALES AND FUEL COST
Company Sales Fuel Cost NormSales NormFuel
Arizona Public Service 9077 0.628 0.0459 −0.8537
Boston Edison Co. 5088 1.555 −1.0778 0.8133
Central Louisiana Co. 9212 1.058 0.0839 −0.0804
Commonwealth Edison Co. 6423 0.7 −0.7017 −0.7242
Consolidated Edison Co. (NY) 3300 2.044 −1.5814 1.6926
Florida Power & Light Co. 11127 1.241 0.6234 0.2486
Hawaiian Electric Co. 7642 1.652 −0.3583 0.9877
Idaho Power Co. 13082 0.309 1.1741 −1.4273
Kentucky Utilities Co. 8406 0.862 −0.1431 −0.4329
Madison Gas & Electric Co. 6455 0.623 −0.6927 −0.8627
Nevada Power Co. 17441 0.768 2.4020 −0.6019
New England Electric Co. 6154 1.897 −0.7775 1.4283
Northern States Power Co. 7179 0.527 −0.4887 −1.0353
Oklahoma Gas & Electric Co. 9673 0.588 0.2138 −0.9256
Pacific Gas & Electric Co. 6468 1.4 −0.6890 0.5346
Puget Sound Power & Light Co. 15991 0.62 1.9935 −0.8681
San Diego Gas & Electric Co. 5714 1.92 −0.9014 1.4697
The Southern Co. 10140 1.108 0.3453 0.0095
Texas Utilities Co. 13507 0.636 1.2938 −0.8393
Wisconsin Electric Power Co. 7287 0.702 −0.4583 −0.7206
United Illuminating Co. 6650 2.116 −0.6378 1.8221
Virginia Electric & Power Co. 10093 1.306 0.3321 0.3655
Mean 8914.05 1.10 0.00 0.00
Standard deviation 3549.98 0.56 1.00 1.00
TABLE 15.4 DISTANCE MATRIX BETWEEN PAIRS OF THE FIRST FIVE UTILITIES, USING
EUCLIDEAN DISTANCE AND NORMALIZED MEASUREMENTS
code for normalizing data and computing distance
# normalize input variables
utilities.df.norm <- sapply(utilities.df, scale)
# add row names: utilities
row.names(utilities.df.norm) <- row.names(utilities.df)
# compute normalized distance based on Sales (column 6) and Fuel Cost (column 8)
d.norm <- dist(utilities.df.norm[,c(6,8)], method = "euclidean")
Output
> d.norm
Arizona Boston Central Commonwealth NY
Boston 2.0103293
Central 0.7741795 1.4657027
Commonwealth 0.7587375 1.5828208 1.0157104
NY 3.0219066 1.0133700 2.4325285 2.5719693
Florida 1.2444219 1.7923968 0.6318918 1.6438566 2.6355728
364 CLUSTER ANALYSIS
measured? How are the different measurements related? What scale should each
measurement be treated as (numerical, ordinal, or nominal)? Are there outliers?
Finally, depending on the goal of the analysis, should the clusters be distinguished
mostly by a small set of measurements, or should they be separated by multiple
measurements that weight moderately?
Although Euclidean distance is the most widely used distance, it has three
main features that need to be kept in mind. First, as mentioned earlier, it is
highly scale dependent. Changing the units of one variable (e.g., from cents
to dollars) can have a huge influence on the results. Normalizing is therefore
a common solution. But unequal weighting should be considered if we want
the clusters to depend more on certain measurements and less on others. The
second feature of Euclidean distance is that it completely ignores the relation-
ship between the measurements. Thus, if the measurements are in fact strongly
correlated, a different distance (such as the statistical distance, described later) is
likely to be a better choice. Third, Euclidean distance is sensitive to outliers. If
the data are believed to contain outliers and careful removal is not a choice, the
use of more robust distances (such as the Manhattan distance, described later) is
preferred.
Additional popular distance metrics often used (for reasons such as the ones
above) are:
Correlation-based similarity. Sometimes it is more natural or conve-
nient to work with a similarity measure between records rather than distance,
which measures dissimilarity. A popular similarity measure is the square of
the Pearson correlation coefficient, r2ij , where the correlation coefficient is
defined by
rij =
p∑
m=1
(xim − xm)(xjm − xm)√
p∑
m=1
(xim − xm)2
p∑
m=1
(xjm − xm)2
. (15.1)
Such measures can always be converted to distance measures. In the example
above, we could define a distance measure dij = 1− r2ij .
Statistical distance (also called Mahalanobis distance). This metric has
an advantage over the other metrics mentioned in that it takes into account
the correlation between measurements. With this metric, measurements that
are highly correlated with other measurements do not contribute as much
as those that are uncorrelated or mildly correlated. The statistical distance
between records i and j is defined as
di,j =
√
(xi − xj)′S−1(xi − xj),
MEASURING DISTANCE BETWEEN TWO RECORDS 365
where xi and xj are p-dimensional vectors of the measurements values for
records i and j, respectively; and S is the covariance matrix for these vec-
tors. (′, a transpose operation, simply turns a column vector into a row
vector). S−1 is the inverse matrix of S, which is the p-dimension extension
to division. For further information on statistical distance, see Chapter 12.
Manhattan distance (“city block”). This distance looks at the absolute
differences rather than squared differences, and is defined by
dij =
p∑
m=1
| xim − xjm | .
Maximum coordinate distance. This distance looks only at the mea-
surement on which records i and j deviate most. It is defined by
dij = max
m=1,2,…,p
| xim − xjm | .
Distance Measures for Categorical Data
In the case of measurements with binary values, it is more intuitively appealing
to use similarity measures than distance measures. Suppose that we have binary
values for all the xij ’s, and for records i and j we have the following 2× 2 table:
Record j
0 1
Record i 0 a b a+ b
1 c d c+ d
a+ c b+ d n
where a denotes the number of variables for which records i and j do not have
that attribute (they each have value 0 on that attribute), d is the number of
variables for which the two records have the attribute present, and so on. The
most useful similarity measures in this situation are:
Matching coefficient: (a+ d)/n.
Jaquard’s coefficient: d/(b+c+d). This coefficient ignores zero matches.
This is desirable when we do not want to consider two people to be similar
simply because a large number of characteristics are absent in both. For
example, if owns a Corvette is one of the variables, a matching “yes” would
be evidence of similarity, but a matching “no” tells us little about whether
the two people are similar.
366 CLUSTER ANALYSIS
Distance Measures for Mixed Data
When the measurements are mixed (some continuous and some binary), a simi-
larity coefficient suggested by Gower is very useful. Gower’s similarity measure is a
weighted average of the distances computed for each variable, after scaling each
variable to a [0,1] scale. It is defined as
sij =
p∑
m=1
wijmsijm
p∑
m=1
wijm
,
where sijm is the similarity between records i and j on measurement m, and
wijm is a binary weight given to the corresponding distance.
The similarity measures sijm and weights wijm are computed as follows:
1. For continuous measurements, sijm = 1− |xim−xjm|max(xm)−min(xm) and wijm =
1 unless the value for measurement m is unknown for one or both of the
records, in which case wijm = 0.
2. For binary measurements, sijm = 1 if xim = xjm = 1 and 0 otherwise.
wijm = 1 unless xim = xjm = 0.
3. For nonbinary categorical measurements, sijm = 1 if both records are in
the same category, and otherwise sijm = 0. As in continuous measure-
ments, wijm = 1 unless the category for measurement m is unknown
for one or both of the records, in which case wijm = 0.
15.3 Measuring Distance Between Two
Clusters
We define a cluster as a set of one or more records. How do we measure dis-
tance between clusters? The idea is to extend measures of distance between records
into distances between clusters. Consider cluster A, which includes the m records
A1, A2, . . . , Am and cluster B, which includes n records B1, B2, . . . , Bn. The
most widely used measures of distance between clusters are:
Minimum Distance
The distance between the pair of records Ai and Bj that are closest:
min(distance(Ai, Bj)), i = 1, 2, . . . ,m; j = 1, 2, . . . , n.
Maximum Distance
The distance between the pair of records Ai and Bj that are farthest:
max(distance(Ai, Bj)), i = 1, 2, . . . ,m; j = 1, 2, . . . , n.
MEASURING DISTANCE BETWEEN TWO CLUSTERS 367
Average Distance
The average distance of all possible distances between records in one cluster and
records in the other cluster:
Average(distance(Ai, Bj)), i = 1, 2, . . . ,m; j = 1, 2, . . . , n.
Centroid Distance
The distance between the two cluster centroids. A cluster centroid is the vector of
measurement averages across all the records in that cluster. For cluster A, this is
the vector xA = [(1/m
∑m
i=1 x1i, . . . , 1/m
∑m
i=1 xpi)]. The centroid distance
between clusters A and B is
distance(xA, xB).
Minimum distance, maximum distance, and centroid distance are illustrated
visually for two dimensions with a map of Portugal and France in Figure 15.2.
For instance, consider the first two utilities (Arizona, Boston) as cluster A,
and the next three utilities (Central, Commonwealth, Consolidated) as cluster
B. Using the normalized scores in Table 15.3 and the distance matrix in Table
FIGURE 15.2 TWO-DIMENSIONAL REPRESENTATION OF SEVERAL DIFFERENT DISTANCE
MEASURES BETWEEN PORTUGAL AND FRANCE
368 CLUSTER ANALYSIS
15.4, we can compute each of the distances described above. Using Euclidean
distance for each distance calculation, we get:
• The closest pair is Arizona and Commonwealth, and therefore the mini-
mum distance between clusters A and B is 0.76.
• The farthest pair is Arizona and Consolidated, and therefore the maxi-
mum distance between clusters A and B is 3.02.
• The average distance is (0.77+0.76+3.02+1.47+1.58+1.01)/6 =
1.44.
• The centroid of cluster A is[
0.0459− 1.0778
2
,
−0.8537 + 0.8133
2
]
= [−0.516,−0.020],
and the centroid of cluster B is[
0.0839− 0.7017− 1.5814
3
,
−0.0804− 0.7242 + 1.6926
3
]
= [−0.733, 0.296].
The distance between the two centroids is then√
(−0.516 + 0.733)2 + (−0.020− 0.296)2 = 0.38.
In deciding among clustering methods, domain knowledge is key. If you
have good reason to believe that the clusters might be chain- or sausage-like,
minimum distance would be a good choice. This method does not require that
cluster members all be close to one another, only that the new members being
added be close to one of the existing members. An example of an application
where this might be the case would be characteristics of crops planted in long
rows, or disease outbreaks along navigable waterways that are the main areas
of settlement in a region. Another example is laying and finding mines (land
or marine). Minimum distance is also fairly robust to small deviations in the
distances. However, adding or removing data can influence it greatly.
Maximum and average distance are better choices if you know that the clus-
ters are more likely to be spherical (e.g., customers clustered on the basis of
numerous attributes). If you do not know the probable nature of the cluster,
these are good default choices, since most clusters tend to be spherical in nature.
We now move to a more detailed description of the two major types of
clustering algorithms: hierarchical (agglomerative) and non-hierarchical.
15.4 Hierarchical (Agglomerative)
Clustering
The idea behind hierarchical agglomerative clustering is to start with each cluster
comprising exactly one record and then progressively agglomerating (combining)
HIERARCHICAL (AGGLOMERATIVE) CLUSTERING 369
the two nearest clusters until there is just one cluster left at the end, which consists
of all the records.
Returning to the small example of five utilities and two measures (Sales and
Fuel Cost) and using the distance matrix (Table 15.4), the first step in the hier-
archical clustering would join Arizona and Commonwealth, which are the clos-
est (using normalized measurements and Euclidean distance). Next, we would
recalculate a 4× 4 distance matrix that would have the distances between these
four clusters: {Arizona, Commonwealth}, {Boston}, {Central}, and {Consol-
idated}. At this point, we use a measure of distance between clusters such as the
ones described in Section 15.3. Each of these distances (minimum, maximum,
average, and centroid distance) can be implemented in the hierarchical scheme
as described below.
H I E R A R C H I C A L A G G L O M E R A T I V E C L U S T E R I N G
A L G O R I T H M :
1. Start with n clusters (each record = cluster).
2. The two closest records are merged into one cluster.
3. At every step, the two clusters with the smallest distance are merged. This
means that either single records are added to existing clusters or two existing
clusters are combined.
Single Linkage
In single linkage clustering, the distance measure that we use is the minimum dis-
tance (the distance between the nearest pair of records in the two clusters, one
record in each cluster). In our utilities example, we would compute the dis-
tances between each of {Boston}, {Central}, and {Consolidated} with {Ari-
zona, Commonwealth} to create the 4×4 distance matrix shown in Table 15.5.
The next step would consolidate {Central} with {Arizona, Common-
wealth} because these two clusters are closest. The distance matrix will again be
recomputed (this time it will be 3× 3), and so on.
This method has a tendency to cluster together at an early stage records that
are distant from each other because of a chain of intermediate records in the
TABLE 15.5 DISTANCE MATRIX AFTER ARIZONA AND COMMONWEALTH CONSOLIDATION
CLUSTER TOGETHER, USING SINGLE LINKAGE
Arizona–Commonwealth Boston Central Consolidated
Arizona–Commonwealth 0
Boston min(2.01,1.58) 0
Central min(0.77,1.02) 1.47 0
Consolidated min(3.02,2.57) 1.01 2.43 0
370 CLUSTER ANALYSIS
same cluster. Such clusters have elongated sausage-like shapes when visualized
as objects in space.
Complete Linkage
In complete linkage clustering, the distance between two clusters is the maximum
distance (between the farthest pair of records). If we used complete linkage with
the five-utilities example, the recomputed distance matrix would be equivalent
to Table 15.5, except that the “min” function would be replaced with a “max.”
This method tends to produce clusters at the early stages with records that
are within a narrow range of distances from each other. If we visualize them as
objects in space, the records in such clusters would have roughly spherical shapes.
Average Linkage
Average linkage clustering is based on the average distance between clusters
(between all possible pairs of records). If we used average linkage with the five-
utilities example, the recomputed distance matrix would be equivalent to Table
15.5, except that the “min” function would be replaced with “average.” This
method is also called Unweighted Pair-Group Method using Averages (UPGMA).
Note that unlike average linkage, the results of the single and complete link-
age methods depend only on the ordering of the inter-record distances. Linear
transformations of the distances (and other transformations that do not change
the ordering) do not affect the results.
Centroid Linkage
Centroid linkage clustering is based on centroid distance, where clusters are rep-
resented by their mean values for each variable, which forms a vector of means.
The distance between two clusters is the distance between these two vectors.
In average linkage, each pairwise distance is calculated, and the average of all
such distances is calculated. In contrast, in centroid distance clustering, just one
distance is calculated: the distance between group means. This method is also
called Unweighted Pair-Group Method using Centroids (UPGMC).
Ward’s Method
Ward’s method is also agglomerative, in that it joins records and clusters together
progressively to produce larger and larger clusters, but operates slightly differently
from the general approach described above. Ward’s method considers the “loss
of information” that occurs when records are clustered together. When each
cluster has one record, there is no loss of information and all individual values
remain available. When records are joined together and represented in clusters,
information about an individual record is replaced by the information for the
HIERARCHICAL (AGGLOMERATIVE) CLUSTERING 371
cluster to which it belongs. To measure loss of information, Ward’s method
employs a measure “error sum of squares” (ESS) that measures the difference
between individual records and a group mean.
This is easiest to see in univariate data. For example, consider the values (2,
6, 5, 6, 2, 2, 2, 2, 0, 0, 0) with a mean of 2.5. Their ESS is equal to
(2− 2.5)2 + (6− 2.5)2 + (5− 2.5)2 + . . .+ (0− 2.5)2 = 50.5.
The loss of information associated with grouping the values into a single group
is therefore 50.5. Now group the records into four groups: (0, 0, 0), (2, 2, 2, 2),
(5), (6, 6). The loss of information is the sum of the ESS’s for each group, which
is 0 (each record in each group is equal to the mean for that group, so the ESS for
each group is 0). Thus, clustering the 10 records into 4 clusters results in no loss
of information, and this would be the first step in Ward’s method. In moving
to a smaller number of clusters, Ward’s method would choose the configuration
that results in the smallest incremental loss of information.
Ward’s method tends to result in convex clusters that are of roughly equal
size, which can be an important consideration in some applications (e.g., in
establishing meaningful customer segments).
Dendrograms: Displaying Clustering Process and Results
A dendrogram is a treelike diagram that summarizes the process of clustering.
On the x-axis are the records. Similar records are joined by lines whose ver-
tical length reflects the distance between the records. Figure 15.3 shows the
dendrograms that results from clustering the 22 utilities using the 8 normalized
measurements, Euclidean distance, once with single linkage (top) and once with
average linkage (bottom).
By choosing a cutoff distance on the y-axis, a set of clusters is created. Visu-
ally, this means drawing a horizontal line on a dendrogram. Records with con-
nections below the horizontal line (that is, their distance is smaller than the cutoff
distance) belong to the same cluster. For example, setting the cutoff distance to
2.7 on the single linkage dendrogram in Figure 15.3 (top) results in six clusters.
The six clusters are (from left to right on the dendrogram):
{NY}, {Nevada}, {San Diego}, {Idaho, Puget}, {Central}, {Others}.
If we want six clusters using average linkage, we can choose a cutoff distance
of 3.5. The resulting six clusters are slightly different.
The six (or other number of) clusters can be computed in R by applying the
function cutree() to the dendrogram object. Table 15.6 shows the results for the
single linkage and the average linkage clustering with six clusters. Each record
is assigned a cluster number. While some records remain in the same cluster
372 CLUSTER ANALYSIS
code for running hierarchical clustering and generating a dendrogram
# in hclust() set argument method =
# to “ward.D”, “single”, “complete”, “average”, “median”, or “centroid”
hc1 <- hclust(d.norm, method = "single")
plot(hc1, hang = -1, ann = FALSE)
hc2 <- hclust(d.norm, method = "average")
plot(hc2, hang = -1, ann = FALSE)
N
Y
N
ev
ad
a
S
an
D
ie
go
Id
ah
o
P
ug
et
C
en
tr
al
F
lo
rid
a
V
irg
in
ia
P
ac
ifi
c
H
aw
ai
ia
n
N
ew
E
ng
la
nd
U
ni
te
d
B
os
to
n
W
is
co
ns
in
C
om
m
on
w
ea
lth
M
ad
is
on
N
or
th
er
n
K
en
tu
ck
y
O
kl
ah
om
a
Te
xa
s
A
riz
on
a
S
ou
th
er
n
1.
0
1.
5
2.
0
2.
5
3.
0
3.
5
N
ev
ad
a
Id
ah
o
P
ug
et N
Y
S
an
D
ie
go
P
ac
ifi
c
H
aw
ai
ia
n
N
ew
E
ng
la
nd
U
ni
te
d
M
ad
is
on
N
or
th
er
n
C
om
m
on
w
ea
lth
W
is
co
ns
in
B
os
to
n
V
irg
in
ia
C
en
tr
al
K
en
tu
ck
y
F
lo
rid
a
O
kl
ah
om
a
Te
xa
s
A
riz
on
a
S
ou
th
er
n
1.
0
1.
5
2.
0
2.
5
3.
0
3.
5
4.
0
4.
5
FIGURE 15.3 DENDROGRAM: SINGLE LINKAGE (TOP) AND AVERAGE LINKAGE (BOTTOM) FOR
ALL 22 UTILITIES, USING ALL EIGHT MEASUREMENTS
HIERARCHICAL (AGGLOMERATIVE) CLUSTERING 373
TABLE 15.6 COMPUTING CLUSTER MEMBERSHIP BY “CUTTING” THE DENDROGRAM
Single Linkage
> memb <- cutree(hc1, k = 6)
> memb
Arizona Boston Central Commonwealth NY
1 1 2 1 3
Florida Hawaiian Idaho Kentucky Madison
1 1 4 1 1
Nevada New England Northern Oklahoma Pacific
5 1 1 1 1
Puget San Diego Southern Texas Wisconsin
4 6 1 1 1
United Virginia
1 1
Average Linkage
> memb <- cutree(hc2, k = 6)
> memb
Arizona Boston Central Commonwealth NY
1 2 1 2 3
Florida Hawaiian Idaho Kentucky Madison
1 4 5 1 2
Nevada New England Northern Oklahoma Pacific
5 4 2 1 4
Puget San Diego Southern Texas Wisconsin
5 6 1 1 2
United Virginia
4 2
in both methods (e.g., Arizona, Florida, Kentucky, Oklamona, Texas), others
change.
Note that if we wanted five clusters, they would be identical to the six above,
with the exception that two of the clusters would be merged into a single cluster.
For example, in the single linkage case, the two right-most clusters in the den-
drogram would be merged into one cluster. In general, all hierarchical methods
have clusters that are nested within each other as we decrease the number of
clusters. This is a valuable property for interpreting clusters and is essential in
certain applications, such as taxonomy of varieties of living organisms.
Validating Clusters
One important goal of cluster analysis is to come up withmeaningful clusters. Since
there are many variations that can be chosen, it is important to make sure that the
resulting clusters are valid, in the sense that they really generate some insight. To
see whether the cluster analysis is useful, consider each of the following aspects:
1. Cluster interpretability. Is the interpretation of the resulting clusters reason-
able? To interpret the clusters, explore the characteristics of each cluster
by
374 CLUSTER ANALYSIS
a. Obtaining summary statistics (e.g., average, min, max) from each clus-
ter on each measurement that was used in the cluster analysis
b. Examining the clusters for separation along some common feature
(variable) that was not used in the cluster analysis
c. Labeling the clusters: based on the interpretation, trying to assign a
name or label to each cluster
2. Cluster stability. Do cluster assignments change significantly if some of the
inputs are altered slightly? Another way to check stability is to partition
the data and see how well clusters formed based on one part apply to the
other part. To do this:
a. Cluster partition A.
b. Use the cluster centroids from A to assign each record in partition B
(each record is assigned to the cluster with the closest centroid).
c. Assess how consistent the cluster assignments are compared to the
assignments based on all the data.
3. Cluster separation. Examine the ratio of between-cluster variation to
within-cluster variation to see whether the separation is reasonable.
There exist statistical tests for this task (an F-ratio), but their usefulness is
somewhat controversial.
4. Number of clusters. The number of resulting clusters must be useful, given
the purpose of the analysis. For example, suppose the goal of the clus-
tering is to identify categories of customers and assign labels to them for
market segmentation purposes. If the marketing department can only
manage to sustain three different marketing presentations, it would prob-
ably not make sense to identify more than three clusters.
Returning to the utilities example, we notice that both methods (single and
average linkage) identify {NY} and {San Diego} as singleton clusters. Also, both
dendrograms imply that a reasonable number of clusters in this dataset is four.
One insight that can be derived from the average linkage clustering is that clusters
tend to group geographically. The four non-singleton clusters form (approxi-
mately) a southern group, a northern group, an east/west seaboard group, and
a west group.
We can further characterize each of the clusters by examining the summary
statistics of their measurements, or visually looking at a heatmap of their indi-
vidual measurements. Figure 15.4 shows a heatmap of the four clusters and two
singletons, highlighting the different profile that each cluster has in terms of the
eight measurements. We see, for instance, that cluster 2 is characterized by utili-
ties with a high percent of nuclear power; cluster 1 is characterized by high fixed
charge and RoR; cluster 4 has high fuel costs.
HIERARCHICAL (AGGLOMERATIVE) CLUSTERING 375
code for creating heatmap
# set labels as cluster membership and utility name
row.names(utilities.df.norm) <- paste(memb, ": ", row.names(utilities.df), sep = "")
# plot heatmap
# rev() reverses the color mapping to large = dark
heatmap(as.matrix(utilities.df.norm), Colv = NA, hclustfun = hclust,
col=rev(paste("gray",1:99,sep="")))
F
ix
ed
_c
ha
rg
e
R
oR
C
os
t
Lo
ad
_f
ac
to
r
D
em
an
d_
gr
ow
th
S
al
es
N
uc
le
ar
F
ue
l_
C
os
t
1: Oklahoma
5: Texas
1: Arizona
3: Southern
3: Florida
3: Central
1: Kentucky
4: NY
2: Boston
3: Virginia
1: Commonwealth
1: Wisconsin
1: Madison
1: Northern
4: San Diego
2: Pacific
4: United
4: New England
2: Hawaiian
6: Nevada
5: Idaho
6: Puget
FIGURE 15.4 HEATMAP FOR THE 22 UTILITIES (IN ROWS). ROWS ARE SORTED BY THE SIX
CLUSTERS FROM AVERAGE LINKAGE CLUSTERING. DARKER CELLS DENOTE
HIGHER VALUES WITHIN A COLUMN
Limitations of Hierarchical Clustering
Hierarchical clustering is very appealing in that it does not require specification
of the number of clusters, and in that sense is purely data-driven. The ability
to represent the clustering process and results through dendrograms is also an
advantage of this method, as it is easier to understand and interpret. There are,
however, a few limitations to consider:
376 CLUSTER ANALYSIS
1. Hierarchical clustering requires the computation and storage of an n×n
distance matrix. For very large datasets, this can be expensive and slow.
2. The hierarchical algorithm makes only one pass through the data. This
means that records that are allocated incorrectly early in the process can-
not be reallocated subsequently.
3. Hierarchical clustering also tends to have low stability. Reordering data
or dropping a few records can lead to a different solution.
4. With respect to the choice of distance between clusters, single and com-
plete linkage are robust to changes in the distance metric (e.g., Euclidean,
statistical distance) as long as the relative ordering is kept. In contrast,
average linkage is more influenced by the choice of distance metric, and
might lead to completely different clusters when the metric is changed.
5. Hierarchical clustering is sensitive to outliers.
15.5 Non-hierarchical Clustering: The
k-Means Algorithm
A non-hierarchical approach to forming good clusters is to pre-specify a desired
number of clusters, k, and assign each case to one of the k clusters so as to
minimize a measure of dispersion within the clusters. In other words, the goal is
to divide the sample into a predetermined number k of non-overlapping clusters
so that clusters are as homogeneous as possible with respect to the measurements
used.
A common measure of within-cluster dispersion is the sum of distances (or
sum of squared Euclidean distances) of records from their cluster centroid. The
problem can be set up as an optimization problem involving integer program-
ming, but because solving integer programs with a large number of variables
is time-consuming, clusters are often computed using a fast, heuristic method
that produces good (although not necessarily optimal) solutions. The k-means
algorithm is one such method.
The k-means algorithm starts with an initial partition of the records into k
clusters. Subsequent steps modify the partition to reduce the sum of the distances
of each record from its cluster centroid. The modification consists of allocating
each record to the nearest of the k centroids of the previous partition. This leads
to a new partition for which the sum of distances is smaller than before. The
means of the new clusters are computed and the improvement step is repeated
until the improvement is very small.
NON-HIERARCHICAL CLUSTERING: THE k-MEANS ALGORITHM 377
k- M E A N S C L U S T E R I N G A L G O R I T H M :
1. Start with k initial clusters (user chooses k).
2. At every step, each record is reassigned to the cluster with the “closest” centroid.
3. Recompute the centroids of clusters that lost or gained a record, and repeat
Step 2.
4. Stop when moving any more records between clusters increases cluster disper-
sion.
Returning to the example with the five utilities and two measurements, let
us assume that k = 2 and that the initial clusters are A = {Arizona, Boston} and
B = {Central, Commonwealth, Consolidated}. The cluster centroids were
computed in Section 15.4:
xA = [−0.516,−0.020] and xB = [−0.733, 0.296].
The distance of each record from each of these two centroids is shown in
Table 15.7.
TABLE 15.7 DISTANCE OF EACH RECORD
FROM EACH CENTROID
Distance from Distance from
Centroid A Centroid B
Arizona 1.0052 1.3887
Boston 1.0052 0.6216
Central 0.6029 0.8995
Commonwealth 0.7281 1.0207
Consolidated 2.0172 1.6341
We see that Boston is closer to cluster B, and that Central and Common-
wealth are each closer to cluster A. We therefore move each of these records to
the other cluster and obtain A = {Arizona, Central, Commonwealth} and B =
{Consolidated, Boston}. Recalculating the centroids gives
xA = [−0.191,−0.553] and xB = [−1.33, 1.253].
The distance of each record from each of the newly calculated centroids is
given in Table 15.8. At this point we stop, because each record is allocated to its
closest cluster.
Choosing the Number of Clusters (k)
The choice of the number of clusters can either be driven by external consider-
ations (e.g., previous knowledge, practical constraints, etc.), or we can try a few
378 CLUSTER ANALYSIS
TABLE 15.8 DISTANCE OF EACH RECORD FROM
EACH NEWLY CALCULATED
CENTROID
Distance from Distance from
Centroid A Centroid B
Arizona 0.3827 2.5159
Boston 1.6289 0.5067
Central 0.5463 1.9432
Commonwealth 0.5391 2.0745
Consolidated 2.6412 0.5067
different values for k and compare the resulting clusters. After choosing k, the
n records are partitioned into these initial clusters. If there is external reasoning
that suggests a certain partitioning, this information should be used. Alterna-
tively, if there exists external information on the centroids of the k clusters, this
can be used to initially allocate the records.
In many cases, there is no information to be used for the initial partition.
In these cases, the algorithm can be rerun with different randomly generated
starting partitions to reduce chances of the heuristic producing a poor solution.
The number of clusters in the data is generally not known, so it is a good idea to
run the algorithm with different values for k that are near the number of clusters
that one expects from the data, to see how the sum of distances reduces with
increasing values of k. Note that the clusters obtained using different values of
k will not be nested (unlike those obtained by hierarchical methods).
The results of running the k-means algorithm using function kmeans() for all
22 utilities and eight measurements with k = 6 are shown in Table 15.9. As in
the results from the hierarchical clustering, we see once again that {San Diego} is
a singleton cluster. Two more clusters (cluster #3 = {Arizona, Florida, Central,
Kentucky, Oklahoma, Texas} and cluster #4 = {Virginia, Northern, Common-
wealth, Madison, Wisconsin}) are nearly identical to those that emerged in the
hierarchical clustering with average linkage.
To characterize the resulting clusters, we examine the cluster centroids
(numerically in Table 15.10 and in the line chart (“profile plot”) in Figure 15.5).
We see, for instance, that cluster #5 is characterized by especially low Fixed-
charge and RoR, and high Demand-growth. We can also see which variables
do the best job of separating the clusters. For example, the spread of clusters for
Sales and Fixed-charge is quite high, and not so high for the other variables.
We can also inspect the information on the within-cluster dispersion. From
Table 15.10, we see that cluster #3, with seven records, has the largest within-
cluster sum of squared distances. In comparison, cluster #6, with three records,
NON-HIERARCHICAL CLUSTERING: THE k-MEANS ALGORITHM 379
TABLE 15.9 k-MEANS CLUSTERING OF 22 UTILITIES INTO k = 6 CLUSTERS (SORTED BY
CLUSTER ID
code for k-means
# load and preprocess data
utilities.df <- read.csv("Utilities.csv")
row.names(utilities.df) <- utilities.df[,1]
utilities.df <- utilities.df[,-1]
# normalized distance:
utilities.df.norm <- sapply(utilities.df, scale)
row.names(utilities.df.norm) <- row.names(utilities.df)
# run kmeans algorithm
km <- kmeans(utilities.df.norm, 6)
# show cluster membership
km$cluster
Output
> km$cluster
Arizona Boston Central Commonwealth NY
3 2 3 4 1
Florida Hawaiian Idaho Kentucky Madison
3 2 6 3 4
Nevada New England Northern Oklahoma Pacific
6 2 4 3 2
Puget San Diego Southern Texas Wisconsin
6 5 3 3 4
United Virginia
2 4
has a smaller within-cluster sum of squared distances. This means that cluster #6
is more homogeneous than cluster #3 (although it is also has fewer records).
When the number of clusters is not predetermined by domain requirements,
we can use a graphical approach to evaluate different numbers of clusters. An
“elbow chart” is a line chart depicting the decline in cluster heterogeneity as we
add more clusters. Figure 15.6 shows the overall average within-cluster distance
(normalized) for different choices of k. Moving from 1 to 2 tightens clusters
considerably (reflected by the large reduction in within-cluster distance), and so
does moving from 2 to 3 and even to 4. Adding more clusters beyond 4 brings
less improvement to cluster homogeneity.
From the distances between clusters measured by Euclidean distance between
the centroids (see Table 15.11), we can learn about the separation of the different
clusters. For instance, we can see that clusters #3 and #4 are the closest to one
another, and clusters #1 and #5 (each with a single record) are the most distant
from one another. Cluster #5 is most distant from the other clusters, overall, but
there is no cluster that is a striking outlier.
380 CLUSTER ANALYSIS
TABLE 15.10 CLUSTER CENTROIDS AND SQUARED DISTANCES FOR k-MEANS WITH k = 6
> # centroids
> km$centers
Fixed_charge RoR Cost Load_factor Demand_growth
1 2.03732429 -0.8628882 0.5782326 -1.2950193 -0.7186431
2 -0.35819462 -0.3637904 0.3985832 1.1572643 -0.3017426
3 0.50431607 0.7795509 -0.9858961 -0.3375463 -0.4895769
4 -0.01133215 0.3313815 0.2189339 -0.3580408 0.1664686
5 -1.91907572 -1.9323833 -0.7812761 1.1034665 1.8468982
6 -0.60027572 -0.8331800 1.3389101 -0.4805802 0.9917178
Sales Nuclear Fuel_Cost
1 -1.5814284 0.2143888 1.6926380
2 -0.7080723 -0.4025746 1.1171999
3 0.3518600 -0.5232108 -0.4105368
4 -0.4018738 1.5650384 -0.5954476
5 -0.9014253 -0.2203441 1.4696557
6 1.8565214 -0.7146294 -0.9657660
> # within-cluster sum of squares
> km$withinss
[1] 0.000000 11.935249 26.507769 10.177094 0.000000 9.533522
> # cluster size
> km$size
[1] 1 5 7 5 1 3
code for plotting profile plot of centroids
# plot an empty scatter plot
plot(c(0), xaxt = ‘n’, ylab = “”, type = “l”,
ylim = c(min(km$centers), max(km$centers)), xlim = c(0, 8))
# label x-axes
axis(1, at = c(1:8), labels = names(utilities.df))
# plot centroids
for (i in c(1:6))
lines(km$centers[i,], lty = i, lwd = 2, col = ifelse(i %in% c(1, 3, 5),
“black”, “dark grey”))
# name clusters
text(x = 0.5, y = km$centers[, 1], labels = paste(“Cluster”, c(1:6)))
−
2
−
1
0
1
2
Index
Fixed_charge RoR Cost Load_factor Demand_growth Sales Nuclear Fuel_Cost
Cluster 1
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Cluster 6
FIGURE 15.5 VISUAL PRESENTATION (PROFILE PLOT) OF CLUSTER CENTROIDS
NON-HIERARCHICAL CLUSTERING: THE k-MEANS ALGORITHM 381
TABLE 15.11 EUCLIDEAN DISTANCE BETWEEN CLUSTER CENTROIDS
> dist(km$centers)
1 2 3 4 5
2 3.698906348
3 4.143482338 3.116033813
4 3.983046535 3.170117175 2.704121216
5 5.628591280 3.332157258 5.099264137 4.735628592
6 5.556507647 4.065954404 3.748248485 3.753375354 4.946218349
1 2 3 4 5 6
50
10
0
15
0
Number of Clusters (k)
A
ve
ra
ge
W
ith
in
−
C
lu
st
er
S
qu
ar
ed
D
is
ta
nc
e
FIGURE 15.6 COMPARING DIFFERENT CHOICES OF k IN TERMS OF OVERALL AVERAGE
WITHIN-CLUSTER DISTANCE
Finally, we can use the information on the distance between the final clusters
to evaluate the cluster validity. The ratio of the sum of squared distances for a
given k to the sum of squared distances to the mean of all the records (k = 1)
is a useful measure for the usefulness of the clustering. If the ratio is near 1.0,
the clustering has not been very effective, whereas if it is small, we have well-
separated groups.
382 CLUSTER ANALYSIS
PROBLEMS
15.1 University Rankings. The dataset on American College and University Rankings
(available from www.dataminingbook.com) contains information on 1302 American
colleges and universities offering an undergraduate program. For each university, there
are 17 measurements, including continuous measurements (such as tuition and grad-
uation rate) and categorical measurements (such as location by state and whether it is
a private or public school).
Note that many records are missing some measurements. Our first goal is to
estimate these missing values from “similar” records. This will be done by clustering
the complete records and then finding the closest cluster for each of the partial records.
The missing values will be imputed from the information in that cluster.
a. Remove all records with missing measurements from the dataset.
b. For all the continuous measurements, run hierarchical clustering using complete
linkage and Euclidean distance. Make sure to normalize the measurements. From
the dendrogram: How many clusters seem reasonable for describing these data?
c. Compare the summary statistics for each cluster and describe each cluster in this
context (e.g., “Universities with high tuition, low acceptance rate…”). Hint: To
obtain cluster statistics for hierarchical clustering, use the aggregate() function.
d. Use the categorical measurements that were not used in the analysis (State and Pri-
vate/Public) to characterize the different clusters. Is there any relationship between
the clusters and the categorical information?
e. What other external information can explain the contents of some or all of these
clusters?
f. Consider Tufts University, which is missing some information. Compute the
Euclidean distance of this record from each of the clusters that you found above
(using only the measurements that you have). Which cluster is it closest to? Impute
the missing values for Tufts by taking the average of the cluster on those measure-
ments.
15.2 Pharmaceutical Industry. An equities analyst is studying the pharmaceutical indus-
try and would like your help in exploring and understanding the financial data collected
by her firm. Her main objective is to understand the structure of the pharmaceutical
industry using some basic financial measures.
Financial data gathered on 21 firms in the pharmaceutical industry are available
in the file Pharmaceuticals.csv. For each firm, the following variables are recorded:
1. Market capitalization (in billions of dollars)
2. Beta
3. Price/earnings ratio
4. Return on equity
5. Return on assets
6. Asset turnover
7. Leverage
8. Estimated revenue growth
9. Net profit margin
10. Median recommendation (across major brokerages)
11. Location of firm’s headquarters
12. Stock exchange on which the firm is listed
http://www.dataminingbook.com
PROBLEMS 383
Use cluster analysis to explore and analyze the given dataset as follows:
a. Use only the numerical variables (1 to 9) to cluster the 21 firms. Justify the various
choices made in conducting the cluster analysis, such as weights for different vari-
ables, the specific clustering algorithm(s) used, the number of clusters formed, and
so on.
b. Interpret the clusters with respect to the numerical variables used in forming the
clusters.
c. Is there a pattern in the clusters with respect to the numerical variables (10 to 12)?
(those not used in forming the clusters)
d. Provide an appropriate name for each cluster using any or all of the variables in the
dataset.
15.3 Customer Rating of Breakfast Cereals. The datasetCereals.csv includes nutritional
information, store display, and consumer ratings for 77 breakfast cereals.
Data Preprocessing. Remove all cereals with missing values.
a. Apply hierarchical clustering to the data using Euclidean distance to the normal-
ized measurements. Compare the dendrograms from single linkage and complete
linkage, and look at cluster centroids. Comment on the structure of the clusters
and on their stability. Hint: To obtain cluster centroids for hierarchical clustering,
compute the average values of each cluster members, using the aggregate() function.
b. Which method leads to the most insightful or meaningful clusters?
c. Choose one of the methods. How many clusters would you use? What distance is
used for this cutoff? (Look at the dendrogram.)
d. The elementary public schools would like to choose a set of cereals to include in
their daily cafeterias. Every day a different cereal is offered, but all cereals should
support a healthy diet. For this goal, you are requested to find a cluster of “healthy
cereals.” Should the data be normalized? If not, how should they be used in the
cluster analysis?
15.4 Marketing to Frequent Fliers. The file EastWestAirlinesCluster.csv contains infor-
mation on 3999 passengers who belong to an airline’s frequent flier program. For
each passenger, the data include information on their mileage history and on different
ways they accrued or spent miles in the last year. The goal is to try to identify clus-
ters of passengers that have similar characteristics for the purpose of targeting different
segments for different types of mileage offers.
a. Apply hierarchical clustering with Euclidean distance and Ward’s method. Make
sure to normalize the data first. How many clusters appear?
b. What would happen if the data were not normalized?
c. Compare the cluster centroid to characterize the different clusters, and try to give
each cluster a label.
d. To check the stability of the clusters, remove a random 5% of the data (by taking
a random sample of 95% of the records), and repeat the analysis. Does the same
picture emerge?
e. Use k-means clustering with the number of clusters that you found above. Does
the same picture emerge?
f. Which clusters would you target for offers, and what types of offers would you
target to customers in that cluster?
Part VI
Forecasting Time Series
CHAPTER 16
Handling Time Series
In this chapter, we describe the context of business time series forecasting and
introduce the main approaches that are detailed in the next chapters, and in par-
ticular, regression-based forecasting and smoothing-based methods. Our focus
is on forecasting future values of a single time series. These three chapters are
meant as an introduction to the general forecasting approach and methods.
In this chapter, we discuss the difference between the predictive nature of
time series forecasting vs. the descriptive or explanatory task of time series anal-
ysis. A general discussion of combining forecasting methods or results for added
precision follows. Next, we present a time series in terms of four components
(level, trend, seasonality, and noise) and present methods for visualizing the dif-
ferent components and for exploring time series data. We close with a discussion
of data partitioning (creating training and validation sets), which is performed
differently from cross-sectional data partitioning.
16.1 Introduction1
Time series forecasting is performed in nearly every organization that works with
quantifiable data. Retail stores use it to forecast sales. Energy companies use it to
forecast reserves, production, demand, and prices. Educational institutions use
it to forecast enrollment. Governments use it to forecast tax receipts and spend-
ing. International financial organizations like the World Bank and International
1This and subsequent sections in this chapter, copyright © 2017 Datastats, LLC, and Galit Shmueli.
Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
387
388 HANDLING TIME SERIES
Monetary Fund use it to forecast inflation and economic activity. Transportation
companies use time series forecasting to forecast future travel. Banks and lend-
ing institutions use it (sometimes badly!) to forecast new home purchases. And
venture capital firms use it to forecast market potential and to evaluate business
plans.
Previous chapters in this book deal with classifying and predicting data where
time is not a factor, in the sense that it is not treated differently from other
variables, and where the sequence of measurements over time does not matter.
These are typically called cross-sectional data. In contrast, this chapter deals with
a different type of data: time series.
With today’s technology, many time series are recorded on very frequent
time scales. Stock data are available at ticker level. Purchases at online and
offline stores are recorded in real time. The Internet of Things (IoT) gener-
ates huge numbers of time series, produced by sensors and other measurements
devices. Although data might be available at a very frequent scale, for the pur-
pose of forecasting, it is not always preferable to use this time scale. In con-
sidering the choice of time scale, one must consider the scale of the required
forecasts and the level of noise in the data. For example, if the goal is to fore-
cast next-day sales at a grocery store, using minute-by-minute sales data is likely
to be less useful for forecasting than using daily aggregates. The minute-by-
minute series will contain many sources of noise (e.g., variation by peak and
non-peak shopping hours) that degrade its forecasting power, and these noise
errors, when the data are aggregated to a cruder level, are likely to average
out.
The focus in this part of the book is on forecasting a single time series. In
some cases, multiple time series are to be forecasted (e.g., the monthly sales of
multiple products). Even when multiple series are being forecasted, the most
popular forecasting practice is to forecast each series individually. The advan-
tage of single-series forecasting is its simplicity. The disadvantage is that it does
not take into account possible relationships between series. The statistics litera-
ture contains models for multivariate time series which directly model the cross-
correlations between series. Such methods tend to make restrictive assumptions
about the data and the cross-series structure. They also require statistical expertise
for estimation and maintenance. Econometric models often include information
from one or more series as inputs into another series. However, such models
are based on assumptions of causality that are based on theoretical models. An
alternative approach is to capture the associations between the series of interest
and external information more heuristically. An example is using the sales of
lipstick to forecast some measure of the economy, based on the observation by
Ronald Lauder, chairman of Estee Lauder, that lipstick sales tend to increase
before tough economic times (a phenomenon called the “leading lipstick
indicator”).
DESCRIPTIVE VS. PREDICTIVE MODELING 389
16.2 Descriptive vs. Predictive Modeling
As with cross-sectional data, modeling time series data is done for either descrip-
tive or predictive purposes. In descriptive modeling, or time series analysis, a time
series is modeled to determine its components in terms of seasonal patterns,
trends, relation to external factors, etc. These can then be used for decision-
making and policy formulation. In contrast, time series forecasting uses the infor-
mation in a time series (and perhaps other information) to forecast future values
of that series. The difference between the goals of time series analysis and time
series forecasting leads to differences in the type of methods used and in the
modeling process itself. For example, in selecting a method for describing a time
series, priority is given to methods that produce understandable results (rather
than “blackbox” methods) and sometimes to models based on causal arguments
(explanatory models). Furthermore, describing can be done in retrospect, while
forecasting is prospective in nature. This means that descriptive models might use
“future” information (e.g., averaging the values of yesterday, today, and tomor-
row to obtain a smooth representation of today’s value) whereas forecasting mod-
els cannot.
The focus in this chapter is on time series forecasting, where the goal is to
predict future values of a time series. For information on time series analysis,
see Chatfield (2003).
16.3 Popular Forecasting Methods
in Business
In this part of the book, we focus on two main types of forecasting methods that
are popular in business applications. Both are versatile and powerful, yet rela-
tively simple to understand and deploy. One type of forecasting tool is multiple
linear regression, where the user specifies a certain model and then estimates
it from the time series. The other is the more data-driven tool of smoothing,
where the method learns patterns from the data. Each of the two types of tools
has advantages and disadvantages, as detailed in Chapters 17 and 18. We also note
that data mining methods such as neural networks and others that are intended
for cross-sectional data are also sometimes used for time series forecasting, espe-
cially for incorporating external information into the forecasts (see Shmueli &
Lichtendahl, 2016).
Combining Methods
Before a discussion of specific forecasting methods in the following two chap-
ters, it should be noted that a popular approach for improving predictive per-
formance is to combine forecasting methods. This is similar to the ensembles
390 HANDLING TIME SERIES
approach described in Chapter 13. Combining forecasting methods can be done
via two-level (or multi level) forecasters, where the first method uses the origi-
nal time series to generate forecasts of future values, and the second method uses
the residuals from the first model to generate forecasts of future forecast errors,
thereby “correcting” the first level forecasts. We describe two-level forecasting
in Chapter 17 (Section 17.4). Another combination approach is via “ensem-
bles,” where multiple methods are applied to the time series, and their resulting
forecasts are averaged in some way to produce the final forecast. Combining
methods can take advantage of the strengths of different forecasting methods to
capture different aspects of the time series (also true in cross-sectional data). The
averaging across multiple methods can lead to forecasts that are more robust and
of higher precision.
16.4 Time Series Components
In both types of forecasting methods, regression models and smoothing, and in
general, it is customary to dissect a time series into four components: level, trend,
seasonality, and noise. The first three components are assumed to be invisible,
as they characterize the underlying series, which we only observe with added
noise. Level describes the average value of the series, trend is the change in
the series from one period to the next, and seasonality describes a short-term
cyclical behavior of the series which can be observed several times within the
given series. Finally, noise is the random variation that results from measurement
error or other causes not accounted for. It is always present in a time series to
some degree.
Don’t get confused by the standard conversational meaning of ”season” (winter,
spring, etc.). The statistical meaning of ”season” refers to any time period that
repeats itself in cycles within the larger time series. Also, note that the term
”period” in forecasting means simply a span of time, and not the specific meaning
that it has in physics, of the distance between two equivalent points in a cycle.
In order to identify the components of a time series, the first step is to
examine a time plot. In its simplest form, a time plot is a line chart of the series
values over time, with temporal labels (e.g., calendar date) on the horizontal axis.
To illustrate this, consider the following example.
Example: Ridership on Amtrak Trains
Amtrak, a US railway company, routinely collects data on ridership. Here
we focus on forecasting future ridership using the series of monthly ridership
TIME SERIES COMPONENTS 391
between January 1991 and March 2004. These data are publicly available at
www.forecastingprinciples.com.2
A time plot for the monthly Amtrak ridership series is shown in Figure 16.1.
Note that the values are in thousands of riders.
code for creating a time series plot
library(forecast)
Amtrak.data <- read.csv("Amtrak.csv")
# create time series object using ts()
# ts() takes three arguments: start, end, and freq.
# with monthly data, the frequency of periods per season is 12 (per year).
# arguments start and end are (season number, period number) pairs.
# here start is Jan 1991: start = c(1991, 1); end is Mar 2004: end = c(2004, 3).
ridership.ts <- ts(Amtrak.data$Ridership,
start = c(1991, 1), end = c(2004, 3), freq = 12)
# plot the series
plot(ridership.ts, xlab = "Time", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
Time
R
id
er
sh
ip
(
in
0
00
s)
1992 1994 1996 1998 2000 2002 2004
14
00
16
00
18
00
20
00
22
00
FIGURE 16.1 MONTHLY RIDERSHIP ON AMTRAK TRAINS (IN THOUSANDS) FROM JANUARY
1991 TO MARCH 2004
2To get this series: click on Data. Under T-Competition Data, click “time-series data” (the direct
URL to the data file is www.forecastingprinciples.com/files/MHcomp1.xls as of July 2015). This file
contains many time series. In the Monthly worksheet, column AI contains series M034.
http://www.forecastingprinciples.com
http://www.forecastingprinciples.com/files/MHcomp1.xls
392 HANDLING TIME SERIES
Looking at the time plot reveals the nature of the series components: the
overall level is around 1,800,000 passengers per month. A slight U-shaped trend
is discernible during this period, with pronounced annual seasonality, with peak
travel during summer (July and August).
A second step in visualizing a time series is to examine it more carefully. A
few tools are useful:
Zoom in: Zooming in to a shorter period within the series can reveal patterns
that are hidden when viewing the entire series. This is especially important
when the time series is long. Consider a series of the daily number of vehicles
passing through the Baregg tunnel in Switzerland (data are available in the
same location as the Amtrak Ridership data; series D028). The series from
November 1, 2003 to November 16, 2005 is shown in the top panel of
Figure 16.2. Zooming in to a 4-month period (bottom panel) reveals a
strong day-of-week pattern that is not visible in the time plot of the complete
time series.
60
00
0
10
00
00
14
00
00
Day
N
um
be
r
of
V
eh
ic
le
s
01−Nov−03 01−Jul−04 01−Mar−05 01−Nov−05
60
00
0
10
00
00
14
00
00
Day
N
um
be
r
of
V
eh
ic
le
s
01−Feb−04 01−Mar−04 01−Apr−04 01−May−04 31−May−04
FIGURE 16.2 TIME PLOTS OF THE DAILY NUMBER OF VEHICLES PASSING THROUGH THE
BAREGG TUNNEL, SWITZERLAND. THE BOTTOM PANEL ZOOMS IN TO A 4-MONTH
PERIOD, REVEALING A DAY-OF-WEEK PATTERN
TIME SERIES COMPONENTS 393
Change scale of series: In order to better identify the shape of a trend, it is
useful to change the scale of the series. One simple option is to change the
vertical scale to an exponential scale (in Excel Ribbon, select Chart Tools >
Axes > Primary Vertical Axis > More Primary Vertical Axis Options, and
click “logarithmic scale” in the Format Axis menu). If the trend on the new
scale appears more linear, then the trend in the original series is closer to an
exponential trend.
Add trend lines: Another possibility for better discerning the shape of the
trend is to add a trend line (Excel Ribbon: Chart Tools > Trendline). By
trying different trendlines, one can see what type of trend (e.g., linear, expo-
nential, quadratic) best approximates the data.
Suppress seasonality: It is often easier to see trends in the data when seasonality
is suppressed. Suppressing seasonality patterns can be done by plotting the
series at a cruder time scale (e.g., aggregating monthly data into years) or
creating separate lines or time plots for each season (e.g., separate lines for
each day of week). Another popular option is to use moving average charts.
We will discuss these in Chapter 18 (Section 18.2).
Continuing our example of Amtrak ridership, the charts in Figure 16.3 help
make the series’ components more visible.
Some forecasting methods directly model these components by making
assumptions about their structure. For example, a popular assumption about
trend is that it is linear or exponential over the given time period or part of
it. Another common assumption is about the noise structure: many statisti-
cal methods assume that the noise follows a normal distribution. The advan-
tage of methods that rely on such assumptions is that when the assumptions
are reasonably met, the resulting forecasts will be more robust and the mod-
els more understandable. Other forecasting methods, which are data-adaptive,
make fewer assumptions about the structure of these components, and instead
try to estimate them only from the data. Data-adaptive methods are advanta-
geous when such assumptions are likely to be violated, or when the structure
of the time series changes over time. Another advantage of many data-adaptive
methods is their simplicity and computational efficiency.
A key criterion for deciding between model-driven and data-driven fore-
casting methods is the nature of the series in terms of global vs. local patterns. A
global pattern is one that is relatively constant throughout the series. An example
is a linear trend throughout the entire series. In contrast, a local pattern is one
that occurs only in a short period of the data, and then changes, for example, a
trend that is approximately linear within four neighboring time points, but the
trend size (slope) changes slowly over time.
Model-driven methods are generally preferable for forecasting series with
global patterns as they use all the data to estimate the global pattern. For a local
394 HANDLING TIME SERIES
code for creating Figure 16.3
library(forecast)
# create short time series
# use window() to create a new, shorter time series of ridership.ts
# for the new three-year series, start time is Jan 1997 and end time is Dec 1999
ridership.ts.3yrs <- window(ridership.ts, start = c(1997, 1), end = c(1999, 12))
# fit a linear regression model to the time series
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
# shorter and longer time series
par(mfrow = c(2, 1))
plot(ridership.ts.3yrs, xlab = "Time", ylab = "Ridership (in 000s)",
ylim = c(1300, 2300))
plot(ridership.ts, xlab = "Time", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
# overlay the fitted values of the linear model
lines(ridership.lm$fitted, lwd = 2)
Time
R
id
er
sh
ip
(
in
0
00
s)
1997.0 1997.5 1998.0 1998.5 1999.0 1999.5 2000.0
14
00
18
00
22
00
Time
R
id
er
sh
ip
(
in
0
00
s)
1992 1994 1996 1998 2000 2002 2004
14
00
18
00
22
00
FIGURE 16.3 PLOTS THAT ENHANCE THE DIFFERENT COMPONENTS OF THE TIME SERIES. TOP:
ZOOM-IN TO 3 YEARS OF DATA. BOTTOM: ORIGINAL SERIES WITH OVERLAID
QUADRATIC TRENDLINE
DATA-PARTITIONING AND PERFORMANCE EVALUATION 395
pattern, a model-driven model would require specifying how and when the
patterns change, which is usually impractical and often unknown. Therefore,
data-driven methods are preferable for local patterns. Such methods “learn”
patterns from the data and their memory length can be set to best adapt to the rate
of change in the series. Patterns that change quickly warrant a “short memory,”
whereas patterns that change slowly warrant a “long memory.” In conclusion,
the time plot should be used not only to identify the time series component, but
also the global/local nature of the trend and seasonality.
16.5 Data-Partitioning and Performance
Evaluation
As in the case of cross-sectional data, in order to avoid overfitting and to be able
to assess the predictive performance of the model on new data, we first partition
the data into a training set and a validation set (and perhaps an additional test
set). However, there is one important difference between data partitioning in
cross-sectional and time series data. In cross-sectional data, the partitioning is
usually done randomly, with a random set of records designated as training data
and the remainder as validation data. However, in time series, a random parti-
tion would create two time series with “holes.” Nearly all standard forecasting
methods cannot handle time series with missing values. Therefore, we partition
a time series into training and validation sets differently. The series is trimmed
into two periods; the earlier period is set as the training data and the later period
as the validation data. Methods are then trained on the earlier training period,
and their predictive performance assessed on the later validation period. Evalu-
ation measures typically use the same metrics used in cross-sectional evaluation
(see Chapter 5) with MAE, MAPE, and RMSE being the most popular metrics
in practice. In evaluating and comparing forecasting methods, another impor-
tant tool is visualization: examining time plots of the actual and predicted series
can shed light on performance and hint toward possible improvements.
Benchmark Performance: Naive Forecasts
While it is tempting to apply “sophisticated” forecasting methods, we must eval-
uate their value added compared to a very simple approach: the naive forecast. A
naive forecast is simply the most recent value of the series. In other words, at
time t, our forecast for any future period t+k is simply the value of the series at
time t. While simple, naive forecasts are sometimes surprisingly difficult to out-
perform with more sophisticated models. It is therefore important to benchmark
against results from a naive-forecasting approach.
When a time series has seasonality, a seasonal naive forecast can be generated.
It is simply the last similar value in the season. For example, to forecast April
396 HANDLING TIME SERIES
2001 for Amtrak ridership, we use the ridership from the most recent April,
April 2000. Similarly, to forecast April 2002, we also use April 2000 ridership.
In Figure 16.4, we show naive (horizontal line) and seasonal naive forecasts, as
well as actual values (dotted line), in a 3-year validation set from April 2001 to
March 2004.
Table 16.1 compares the accuracies of these two naive forecasts. Because
Amtrak ridership has monthly seasonality, the seasonal naive forecast is the clear
winner on both training and validation sets and on all popular measures. In
choosing between the two models, the accuracy on the validation set is more
relevant than the accuracy on the training set. Performance on the validation set
is more indicative of how the models will perform in the future.
TABLE 16.1 PREDICTIVE ACCURACY OF NAIVE AND SEASONAL NAIVE FORECASTS IN THE
VALIDATION SET FOR AMTRAK RIDERSHIP
> accuracy(naive.pred, valid.ts)
ME RMSE MAE MPE MAPE
Training set 2.45091 168.1470 125.2975 -0.3460027 7.271393
Test set -14.71772 142.7551 115.9234 -1.2749992 6.021396
> accuracy(snaive.pred, valid.ts)
ME RMSE MAE MPE MAPE
Training set 13.93991 99.26557 82.49196 0.5850656 4.715251
Test set 54.72961 95.62433 84.09406 2.6527928 4.247656
Generating Future Forecasts
Another important difference between cross-sectional and time-series partition-
ing occurs when creating the actual forecasts. Before attempting to forecast
future values of the series, the training and validation sets are recombined into
one long series, and the chosen method/model is rerun on the complete data.
This final model is then used to forecast future values. The three advantages in
recombining are:
1. The validation set, which is the most recent period, usually contains the
most valuable information in terms of being the closest in time to the
forecast period;
2. With more data (the complete time series compared to only the training
set), some models can be estimated more accurately;
3. If only the training set is used to generate forecasts, then it will require
forecasting farther into the future (e.g., if the validation set contains four
time points, forecasting the next period will require a 5-step-ahead fore-
cast from the training set).
DATA-PARTITIONING AND PERFORMANCE EVALUATION 397
code for creating Figure 16.4
nValid <- 36
nTrain <- length(ridership.ts) - nValid
# partition the data
train.ts <- window(ridership.ts, start = c(1991, 1), end = c(1991, nTrain))
valid.ts <- window(ridership.ts, start = c(1991, nTrain + 1),
end = c(1991, nTrain + nValid))
# generate the naive and seasonal naive forecasts
naive.pred <- naive(train.ts, h = nValid)
snaive.pred <- snaive(train.ts, h = nValid)
# plot forecasts and actuals in the training and validation sets
plot(train.ts, ylim = c(1300, 2600), ylab = "Ridership", xlab = "Time", bty = "l",
xaxt = "n", xlim = c(1991,2006.25), main = "")
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(naive.pred$mean, lwd = 2, col = "blue", lty = 1)
lines(snaive.pred$mean, lwd = 2, col = "blue", lty = 1)
lines(valid.ts, col = "grey20", lty = 3)
lines(c(2004.25 - 3, 2004.25 - 3), c(0, 3500))
lines(c(2004.25, 2004.25), c(0, 3500))
text(1996.25, 2500, "Training")
text(2002.75, 2500, "Validation")
text(2005.25, 2500, "Future")
arrows(2004 - 3, 2450, 1991.25, 2450, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2004.5 - 3, 2450, 2004, 2450, code = 3, length = 0.1, lwd = 1,angle = 30)
arrows(2004.5, 2450, 2006, 2450, code = 3, length = 0.1, lwd = 1, angle = 30)
Time
R
id
er
sh
ip
14
00
16
00
18
00
20
00
22
00
24
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 16.4 NAIVE AND SEASONAL NAIVE FORECASTS IN A 3-YEAR VALIDATION SET FOR
AMTRAK RIDERSHIP
398 HANDLING TIME SERIES
PROBLEMS
16.1 Impact of September 11 on Air Travel in the United States. The Research and
Innovative Technology Administration’s Bureau of Transportation Statistics conducted
a study to evaluate the impact of the September 11, 2001 terrorist attack on US trans-
portation. The 2006 study report and the data can be found at http://goo.gl/w2lJPV.
The goal of the study was stated as follows:
The purpose of this study is to provide a greater understanding of the
passenger travel behavior patterns of persons making long distance trips before
and after 9/11.
The report analyzes monthly passenger movement data between January 1990 and
May 2004. Data on three monthly time series are given in file Sept11Travel.csv for this
period:
(1) Actual airline revenue passenger miles (Air),
(2) Rail passenger miles (Rail), and
(3) Vehicle miles traveled (Car).
In order to assess the impact of September 11, BTS took the following approach:
using data before September 11, they forecasted future data (under the assumption of
no terrorist attack). Then, they compared the forecasted series with the actual data
to assess the impact of the event. Our first step, therefore, is to split each of the time
series into two parts: pre- and post September 11. We now concentrate only on the
earlier time series.
a. Is the goal of this study descriptive or predictive?
b. Plot each of the three pre-event time series (Air, Rail, Car).
i. What time series components appear from the plot?
ii. What type of trend appears? Change the scale of the series, add trendlines and
suppress seasonality to better visualize the trend pattern.
16.2 Performance on Training and Validation Data. Two different models were fit to
the same time series. The first 100 time periods were used for the training set and the
last 12 periods were treated as a hold-out set. Assume that both models make sense
practically and fit the data pretty well. Below are the RMSE values for each of the
models:
Training Set Validation Set
Model A 543 690
Model B 669 675
a. Which model appears more useful for explaining the different components of this
time series? Why?
b. Which model appears to be more useful for forecasting purposes? Why?
16.3 Forecasting Department Store Sales. The file DepartmentStoreSales.csv contains
data on the quarterly sales for a department store over a 6-year period (data courtesy
of Chris Albright).
http://goo.gl/w2lJPV
PROBLEMS 399
a. Create a well-formatted time plot of the data.
b. Which of the four components (level, trend, seasonality, noise) seem to be present
in this series?
16.4 Shipments of Household Appliances. The file ApplianceShipments.csv contains the
series of quarterly shipments (in million $) of US household appliances between 1985
and 1989 (data courtesy of Ken Black).
a. Create a well-formatted time plot of the data.
b. Which of the four components (level, trend, seasonality, noise) seem to be present
in this series?
16.5 Canadian Manufacturing Workers Workhours. The time plot in Figure 16.5
describes the average annual number of weekly hours spent by Canadian manufactur-
ing workers (data are available in CanadianWorkHours.csv—thanks to Ken Black for the
data).
Year
H
ou
rs
P
er
W
ee
k
1970 1975 1980 1985 1990 1995 2000
34
.5
35
.0
35
.5
36
.0
36
.5
37
.0
37
.5
38
. 0
FIGURE 16.5 AVERAGE ANNUAL WEEKLY HOURS SPENT BY CANADIAN MANUFACTURING
WORKERS
a. Reproduce the time plot.
b. Which of the four components (level, trend, seasonality, noise) seem to be present
in this series?
16.6 Souvenir Sales: The file SouvenirSales.csv contains monthly sales for a souvenir
shop at a beach resort town in Queensland, Australia, between 1995–2001. (Source:
Hyndman, R.J., Time Series Data Library, http://data.is/TSDLdemo. Accessed on
07/25/15.)
Back in 2001, the store wanted to use the data to forecast sales for the next 12
months (year 2002). They hired an analyst to generate forecasts. The analyst first
partitioned the data into training and validation sets, with the validation set containing
http://data.is/TSDLdemo
400 HANDLING TIME SERIES
the last 12 months of data (year 2001). She then fit a regression model to sales, using
the training set.
a. Create a well-formatted time plot of the data.
b. Change the scale on the x-axis, or on the y-axis, or on both to log-scale in order
to achieve a linear relationship. Select the time plot that seems most linear.
c. Comparing the two time plots, what can be said about the type of trend in the
data?
d. Why were the data partitioned? Partition the data into the training and validation
set as explained above.
16.7 Shampoo Sales. The file ShampooSales.csv contains data on the monthly sales of a
certain shampoo over a 3-year period. Source: Hyndman, R.J., Time Series Data
Library, http://data.is/TSDLdemo. Accessed on 07/25/15).
a. Create a well-formatted time plot of the data.
b. Which of the four components (level, trend, seasonality, noise) seem to be present
in this series?
c. Do you expect to see seasonality in sales of shampoo? Why?
d. If the goal is forecasting sales in future months, which of the following steps should
be taken?
• Partition the data into training and validation sets
• Tweak the model parameters to obtain good fit to the validation data
• Look at MAPE and RMSE values for the training set
• Look at MAPE and RMSE values for the validation set
http://data.is/TSDLdemo
CHAPTER 17
Regression-Based Forecasting
A popular forecasting tool is based on multiple linear regression models, using
suitable predictors to capture trend and/or seasonality. In this chapter, we show
how a linear regression model can be set up to capture a time series with a trend
and/or seasonality. The model, which is estimated from the data, can then pro-
duce future forecasts by inserting the relevant predictor information into the
estimated regression equation. We describe different types of common trends
(linear, exponential, polynomial), as well as two types of seasonality (additive and
multiplicative). Next, we show how a regression model can be used to quantify
the correlation between neighboring values in a time series (called autocorrela-
tion). This type of model, called an autoregressive model, is useful for improving
forecast precision by making use of the information contained in the autocor-
relation (beyond trend and seasonality). It is also useful for evaluating the pre-
dictability of a series, by evaluating whether the series is a “random walk.” The
various steps of fitting linear regression and autoregressive models, using them
to generate forecasts, and assessing their predictive accuracy, are illustrated using
the Amtrak ridership series.
17.1 A Model with Trend1
Linear Trend
To create a linear regression model that captures a time series with a global linear
trend, the outcome variable (Y ) is set as the time series values or some function
of it, and the predictor (X) is set as a time index. Let us consider a simple
1This and subsequent sections in this chapter copyright © 2017 Datastats, LLC, and Galit Shmueli.
Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
401
402 REGRESSION-BASED FORECASTING
code for creating Figure 17.1
library(forecast)
Amtrak.data <- read.csv("Amtrak.csv")
# create time series
ridership.ts <- ts(Amtrak.data$Ridership, start = c(1991,1),
end = c(2004,3), freq = 12)
# produce linear trend model
ridership.lm <- tslm(ridership.ts ~ trend)
# plot the series
plot(ridership.ts, xlab = "Time", ylab = "Ridership", ylim = c(1300,2300),
bty = "l")
lines(ridership.lm$fitted, lwd = 2)
Time
R
id
er
sh
ip
1992 1994 1996 1998 2000 2002 2004
14
00
16
00
18
00
20
00
22
00
FIGURE 17.1 A LINEAR TREND FIT TO AMTRAK RIDERSHIP
example: fitting a linear trend to the Amtrak ridership data. This type of trend
is shown in Figure 17.1.
From the time plot, it is obvious that the global trend is not linear. However,
we use this example to illustrate how a linear trend is fitted, and later we consider
more appropriate models for this series.
A MODEL WITH TREND 403
To obtain a linear relationship between Ridership and Time, we set the
output variable Y as the Amtrak Ridership and create a new variable that is a
time index t = 1, 2, 3, . . . This time index is then used as a single predictor in
the regression model:
Yt = β0 + β1t+ ϵ,
where Yt is the Ridership at period t and ϵ is the standard noise term in a linear
regression. Thus, we are modeling three of the four time series components:
level (β0), trend (β1), and noise (ϵ). Seasonality is not modeled. A snapshot of
the two corresponding columns (Y and t) are shown in Table 17.1.
TABLE 17.1 OUTCOME VARIABLE (MIDDLE)
AND PREDICTOR VARIABLE (RIGHT)
USED TO FIT A LINEAR TREND
Month Ridership (Yt) t
Jan 91 1709 1
Feb 91 1621 2
Mar 91 1973 3
Apr 91 1812 4
May 91 1975 5
Jun 91 1862 6
Jul 91 1940 7
Aug 91 2013 8
Sep 91 1596 9
Oct 91 1725 10
Nov 91 1676 11
Dec 91 1814 12
Jan 92 1615 13
Feb 92 1557 14
After partitioning the data into training and validation sets, the next step is
to fit a linear regression model to the training set, with t as the single predictor
(function tslm() relies on ts() which automatically creates t and calls it trend).
Applying this to the Amtrak ridership data (with a validation set consisting of
the last 12 months) results in the estimated model shown in Figure 17.2. The
actual and fitted values and the residuals (or forecast errors) are shown in the two
time plots.
Table 17.2 contains a report of the estimated coefficients. Note that exam-
ining only the estimated coefficients and their statistical significance can be
misleading! In this example, they would indicate that the linear fit is reason-
able, although it is obvious from the time plots that the trend is not linear. An
inadequate trend shape is easiest to detect by examining the time plot of the
residuals.
404 REGRESSION-BASED FORECASTING
code for creating Figure 17.2
# fit linear trend model to training set and create forecasts
train.lm <- tslm(train.ts ~ trend)
train.lm.pred <- forecast(train.lm, h = nValid, level = 0)
par(mfrow = c(2, 1))
plot(train.lm.pred, ylim = c(1300, 2600), ylab = "Ridership", xlab = "Time",
bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "", flty = 2)
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(train.lm.pred$fitted, lwd = 2, col = "blue")
lines(valid.ts)
plot(train.lm.pred$residuals, ylim = c(-420, 500), ylab = "Forecast Errors",
xlab = "Time", bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "")
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(valid.ts - train.lm.pred$mean, lwd = 1)
Code for data partition is given in Figure 16.4
Time
R
id
er
sh
ip
14
00
20
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
Time
F
or
ec
as
t E
rr
or
s
−
40
0
0
40
0
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 17.2 A LINEAR TREND FIT TO AMTRAK RIDERSHIP IN THE TRAINING PERIOD AND
FORECASTED IN THE VALIDATION PERIOD
A MODEL WITH TREND 405
TABLE 17.2 SUMMARY OF OUTPUT FROM A LINEAR REGRESSION MODEL APPLIED TO THE
AMTRAK RIDERSHIP DATA IN THE TRAINING PERIOD
> summary(train.lm)
Call: lm(formula = formula, data = “train.ts”, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-411.29 -114.02 16.06 129.28 306.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1750.3595 29.0729 60.206 <2e-16 ***
trend 0.3514 0.4069 0.864 0.39
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 160.2 on 121 degrees of freedom
Multiple R-squared: 0.006125, Adjusted R-squared: -0.002089
F-statistic: 0.7456 on 1 and 121 DF, p-value: 0.3896
L I N E A R R E G R E S S I O N F O R T I M E S E R I E S I N R
Fitting a regression model to time series (and using it to generate forecasts) can be
done in R using the function tslm() in the forecast package.
Exponential Trend
Several alternative trend shapes are useful and easy to fit via a linear regression
model. One such shape is an exponential trend. An exponential trend implies a
multiplicative increase/decrease of the series over time (Yt = ceβ1t+ϵ). To fit an
exponential trend, simply replace the outcome variable Y with logY (where log
is the natural logarithm), and fit a linear regression (logYt = β0+β1t+ϵ). In the
Amtrak example, for instance, we would fit a linear regression of log(Ridership)
on the index variable t. Exponential trends are popular in sales data, where they
reflect percentage growth. In R, fitting exponential trend is done by setting
argument lambda = 0 in function tslm().2
Note: As in the general case of linear regression, when comparing the pre-
dictive accuracy of models that have a different output variable, such as a linear
2Argument lambda is used to apply the Box-Cox transformation to the values of the time series: (yλ−
1)/λ if λ ≠ 0. When λ = 0, the transformation is defined as log(y). When λ = 1, the series is not
transformed (except for the subtraction of 1 from each value), so the model has a linear trend.
406 REGRESSION-BASED FORECASTING
trend model (with Y ) and an exponential trend model (with logY ), it is essential
to compare forecasts or forecast errors on the same scale. In R, when using an
exponential trend model, the forecasts of logY are made and then automatically
converted back to the original scale. An example is shown in Figure 17.3, where
an exponential trend is fit to the Amtrak ridership data.
code for creating Figure 17.3
# fit exponential trend using tslm() with argument lambda = 0
train.lm.expo.trend <- tslm(train.ts ~ trend, lambda = 0)
train.lm.expo.trend.pred <- forecast(train.lm.expo.trend, h = nValid, level = 0)
# fit linear trend using tslm() with argument lambda = 1 (no transform of y)
train.lm.linear.trend <- tslm(train.ts ~ trend, lambda = 1)
train.lm.linear.trend.pred <- forecast(train.lm.linear.trend, h = nValid, level = 0)
plot(train.lm.expo.trend.pred, ylim = c(1300, 2600), ylab = "Ridership",
xlab = "Time", bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "", flty = 2)
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(train.lm.expo.trend.pred$fitted, lwd = 2, col = "blue") # Added in 6-5
lines(train.lm.linear.trend.pred$fitted, lwd = 2, col = "black", lty = 3)
lines(train.lm.linear.trend.pred$mean, lwd = 2, col = "black", lty = 3)
lines(valid.ts)
Time
R
id
er
sh
ip
14
00
16
00
18
00
20
00
22
00
24
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 17.3 EXPONENTIAL (AND LINEAR) TREND USED TO FORECAST AMTRAK RIDERSHIP
A MODEL WITH SEASONALITY 407
Polynomial Trend
Another non-linear trend shape that is easy to fit via linear regression is a poly-
nomial trend, and in particular, a quadratic relationship of the form Yt =
β0 + β1t + β2t
2 + ϵ. This is done by creating an additional predictor t2 (the
square of t), and fitting a multiple linear regression with the two predictors t and
t2. In R, we fit a quadratic trend using function I(), which treats an object “as is”
(Figure 17.4). For the Amtrak ridership data, we have already seen a U-shaped
trend in the data. We therefore fit a quadratic model. The fitted and residual
charts are shown in Figure 17.4. We conclude from these plots that this shape
adequately captures the trend. The forecast errors are now devoid of trend and
exhibit only seasonality.
In general, any type of trend shape can be fit as long as it has a mathematical
representation. However, the underlying assumption is that this shape is appli-
cable throughout the period of data that we have and also during the period that
we are going to forecast. Do not choose an overly complex shape. Although it
will fit the training data well, it will in fact be overfitting them. To avoid over-
fitting, always examine the validation performance and refrain from choosing
overly complex trend patterns.
17.2 A Model with Seasonality
A seasonal pattern in a time series means that observations that fall in some
seasons have consistently higher or lower values than those that fall in other sea-
sons. Examples are day-of-week patterns, monthly patterns, and quarterly pat-
terns. The Amtrak ridership monthly time series, as can be seen in the time plot,
exhibits strong monthly seasonality (with highest traffic during summer months).
Seasonality is captured in a regression model by creating a new categorical
variable that denotes the season for each value. This categorical variable is then
turned into dummies, which in turn are included as predictors in the regression
model. To illustrate this, we created a new “Season” column for the Amtrak
ridership data, as shown in Table 17.3. Then, to include the Season categorical
variable as a predictor in a regression model for Y (Ridership), we turn it into
dummies (for m = 12 seasons we create 11 dummies, which are binary vari-
ables that take on the value 1 if the record falls in that particular season, and 0
otherwise3).
In R, function tslm() uses ts() which automatically creates the categorical
Season column (called season) and converts it into dummy variables.
3We use only m-1 dummies because information about the m − 1 seasons is sufficient. If all m − 1
variables are zero, then the season must be themth season. Including themth dummy causes redundant
information and multicollinearity.
408 REGRESSION-BASED FORECASTING
code for creating Figure 17.4
# fit quadratic trend using function I(), which treats an object "as is".
train.lm.poly.trend <- tslm(train.ts ~ trend + I(trend^2))
summary(train.lm.poly.trend)
train.lm.poly.trend.pred <- forecast(train.lm.poly.trend, h = nValid, level = 0)
par(mfrow = c(2,1))
plot(train.lm.poly.trend.pred, ylim = c(1300, 2600), ylab = "Ridership",
xlab = "Time", bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "", flty = 2)
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(train.lm.poly.trend$fitted, lwd = 2)
lines(valid.ts)
plot(train.lm.poly.trend$residuals, ylim = c(-400, 550), ylab = "Forecast Errors",
xlab = "Time", bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "")
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(valid.ts - train.lm.poly.trend.pred$mean, lwd = 1)
Time
R
id
er
sh
ip
14
00
20
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
Time
F
or
ec
as
t E
rr
or
s
−
40
0
0
40
0
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 17.4 QUADRATIC TREND MODEL USED TO FORECAST AMTRAK RIDERSHIP. PLOTS OF
FITTED, FORECASTED, AND ACTUAL VALUES (TOP) AND FORECAST ERRORS
(BOTTOM)
A MODEL WITH SEASONALITY 409
TABLE 17.3 NEW CATEGORICAL VARIABLE (RIGHT) TO BE USED
(VIA DUMMIES) AS PREDICTOR(S) IN A LINEAR
REGRESSION MODEL
Month Ridership Season
Jan 91 1709 Jan
Feb 91 1621 Feb
Mar 91 1973 Mar
Apr 91 1812 Apr
May 91 1975 May
Jun 91 1862 Jun
Jul 91 1940 Jul
Aug 91 2013 Aug
Sep 91 1596 Sep
Oct 91 1725 Oct
Nov 91 1676 Nov
Dec 91 1814 Dec
Jan 92 1615 Jan
Feb 92 1557 Feb
Mar 92 1891 Mar
Apr 92 1956 Apr
May 92 1885 May
Time
R
id
er
sh
ip
14
00
20
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
Time
F
or
ec
as
t E
rr
or
s
−
40
0
0
40
0
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 17.5 REGRESSION MODEL WITH SEASONALITY APPLIED TO THE AMTRAK RIDERSHIP
(TOP) AND ITS FORECAST ERRORS (BOTTOM)
410 REGRESSION-BASED FORECASTING
After partitioning the data into training and validation sets (see Section 16.5),
we fit the regression model to the training data. The fitted series and the residuals
from this model are shown in Figure 17.5. The model appears to capture the
seasonality in the data. However, since we have not included a trend component
in the model (as shown in Section 17.1), the fitted values do not capture the
existing trend. Therefore, the residuals, which are the difference between the
actual and the fitted values, clearly display the remaining U-shaped trend.
When seasonality is added as described above (create categorical seasonal
variable, then create dummies from it, then regress on Y ), it captures additive
seasonality. This means that the average value of Y in a certain season is a fixed
amount more or less than that in another season. Table 17.4 shows the output
of a linear regression fit to Ridership (Y ) with seasonality. For example, in the
TABLE 17.4 SUMMARY OF OUTPUT FROM FITTING ADDITIVE SEASONALITY TO THE AMTRAK
RIDERSHIP DATA IN THE TRAINING PERIOD
> # include season as a predictor in tslm(). Here it creates 11 dummies,
> # one for each month except for the first season, January.
> train.lm.season <- tslm(train.ts ~ season)
> summary(train.lm.season)
Call:
lm(formula = formula, data = “train.ts”, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-276.165 -52.934 5.868 54.544 215.081
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1573.97 30.58 51.475 < 2e-16 ***
season2 -42.93 43.24 -0.993 0.3230
season3 260.77 43.24 6.030 2.19e-08 ***
season4 245.09 44.31 5.531 2.14e-07 ***
season5 278.22 44.31 6.279 6.81e-09 ***
season6 233.46 44.31 5.269 6.82e-07 ***
season7 345.33 44.31 7.793 3.79e-12 ***
season8 396.66 44.31 8.952 9.19e-15 ***
season9 75.76 44.31 1.710 0.0901 .
season10 200.61 44.31 4.527 1.51e-05 ***
season11 192.36 44.31 4.341 3.14e-05 ***
season12 230.42 44.31 5.200 9.18e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 101.4 on 111 degrees of freedom
Multiple R-squared: 0.6348, Adjusted R-squared: 0.5986
F-statistic: 17.54 on 11 and 111 DF, p-value: < 2.2e-16
A MODEL WITH TREND AND SEASONALITY 411
Amtrak ridership, the coefficient for season8 (396.66) indicates that the average
number of passengers in August is higher by 396.66 thousand passengers than
the average in January (the reference category). Using regression models, we
can also capture multiplicative seasonality, where values in a certain season are on
average, higher or lower by a percentage amount compared to another season.
To fit multiplicative seasonality, we use the same model as above, except that we
use log(Y ) as the outcome variable. In R, this is achieved by setting lambda=0
in the tslm() function.
17.3 A Model with Trend and Seasonality
Finally, we can create models that capture both trend and seasonality by including
predictors of both types. For example, from our exploration of the Amtrak
Ridership data, it appears that a quadratic trend and monthly seasonality are
both warranted. We therefore fit a model to the training data with 13 predictors:
11 dummies for month, and t and t2 for trend. The fit and output from this
final model are shown in Figure 17.6 and Table 17.5. If we are satisfied with
Time
R
id
er
sh
ip
14
00
20
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
Time
F
or
ec
as
t E
rr
or
s
−
40
0
0
40
0
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 17.6 REGRESSION MODEL WITH TREND AND SEASONALITY APPLIED TO AMTRAK
RIDERSHIP (TOP) AND ITS FORECAST ERRORS (BOTTOM)
412 REGRESSION-BASED FORECASTING
TABLE 17.5 SUMMARY OF OUTPUT FROM FITTING TREND AND SEASONALITY TO AMTRAK
RIDERSHIP IN THE TRAINING PERIOD
> train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
> summary(train.lm.trend.season)
Call:
tslm(formula = train.ts ~ trend + I(trend^2) + season)
Residuals:
Min 1Q Median 3Q Max
-213.77 -39.36 9.71 42.42 152.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1696.9794 27.6752 61.32 < 0.0000000000000002 ***
trend -7.1559 0.7293 -9.81 < 0.0000000000000002 ***
I(trend^2) 0.0607 0.0057 10.66 < 0.0000000000000002 ***
season2 -43.2458 30.2407 -1.43 0.1556
season3 260.0149 30.2423 8.60 0.00000000000006604 ***
season4 260.6175 31.0210 8.40 0.00000000000018264 ***
season5 293.7966 31.0202 9.47 0.00000000000000069 ***
season6 248.9615 31.0199 8.03 0.00000000000126033 ***
season7 360.6340 31.0202 11.63 < 0.0000000000000002 ***
season8 411.6513 31.0209 13.27 < 0.0000000000000002 ***
season9 90.3162 31.0223 2.91 0.0044 **
season10 214.6037 31.0241 6.92 0.00000000032920793 ***
season11 205.6711 31.0265 6.63 0.00000000133918009 ***
season12 242.9294 31.0295 7.83 0.00000000000344281 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 70.9 on 109 degrees of freedom
Multiple R-squared: 0.825, Adjusted R-squared: 0.804
F-statistic: 39.4 on 13 and 109 DF, p-value: <0.0000000000000002
this model after evaluating its predictive performance on the validation data and
comparing it against alternatives, we would re-fit it to the entire un-partitioned
series. This re-fitted model can then be used to generate k-step-ahead forecasts
(denoted by Ft+k) by plugging in the appropriate month and index terms.
17.4 Autocorrelation and ARIMA Models
When we use linear regression for time series forecasting, we are able to account
for patterns such as trend and seasonality. However, ordinary regression models
do not account for dependence between values in different periods, which in
cross-sectional data is assumed to be absent. Yet, in the time series context,
AUTOCORRELATION AND ARIMA MODELS 413
values in neighboring periods tend to be correlated. Such correlation, called
autocorrelation, is informative and can help in improving forecasts. If we know
that a high value tends to be followed by high values (positive autocorrelation),
then we can use that to adjust forecasts. We will now discuss how to compute the
autocorrelation of a series, and how best to utilize the information for improving
forecasts.
Computing Autocorrelation
Correlation between values of a time series in neighboring periods is called auto-
correlation, because it describes a relationship between the series and itself. To
compute autocorrelation, we compute the correlation between the series and
a lagged version of the series. A lagged series is a “copy” of the original series
which is moved forward one or more time periods. A lagged series with lag-1
is the original series moved forward one time period; a lagged series with lag-2
is the original series moved forward two time periods, etc. Table 17.6 shows
the first 24 months of the Amtrak ridership series, the lag-1 series and the lag-2
series.
Next, to compute the lag-1 autocorrelation, which measures the linear rela-
tionship between values in consecutive time periods, we compute the correlation
TABLE 17.6 FIRST 24 MONTHS OF AMTRAK RIDERSHIP SERIES WITH LAG-1 AND LAG-2
SERIES
Month Ridership Lag-1 Series Lag-2 Series
Jan 91 1709
Feb 91 1621 1709
Mar 91 1973 1621 1709
Apr 91 1812 1973 1621
May 91 1975 1812 1973
Jun 91 1862 1975 1812
Jul 91 1940 1862 1975
Aug 91 2013 1940 1862
Sep 91 1596 2013 1940
Oct 91 1725 1596 2013
Nov 91 1676 1725 1596
Dec 91 1814 1676 1725
Jan 92 1615 1814 1676
Feb 92 1557 1615 1814
Mar 92 1891 1557 1615
Apr 92 1956 1891 1557
May 92 1885 1956 1891
Jun 92 1623 1885 1956
Jul 92 1903 1623 1885
Aug 92 1997 1903 1623
Sep 92 1704 1997 1903
Oct 92 1810 1704 1997
Nov 92 1862 1810 1704
Dec 92 1875 1862 1810
414 REGRESSION-BASED FORECASTING
between the original series and the lag-1 series (e.g., via the function cor()) to be
0.08. Note that although the original series in Table 17.6 has 24 time periods,
the lag-1 autocorrelation will only be based on 23 pairs (because the lag-1 series
does not have a value for January 1991). Similarly, the lag-2 autocorrelation,
measuring the relationship between values that are two time periods apart, is the
correlation between the original series and the lag-2 series (yielding −0.15).
We can use R’s Acf() function in the forecast package to directly compute
and plot the autocorrelation of a series at different lags. For example, the output
for the 24-month ridership is shown in Figure 17.7.
code for creating Figure 17.7
ridership.24.ts <- window(train.ts, start = c(1991, 1), end = c(1991, 24))
Acf(ridership.24.ts, lag.max = 12, main = "")
−
0.
6
−
0.
4
−
0.
2
0.
0
0.
2
0.
4
0.
6
Lag
A
C
F
2 4 6 8 10 12
FIGURE 17.7 AUTOCORRELATION PLOT FOR LAGS 1–12 (FOR FIRST 24 MONTHS OF AMTRAK
RIDERSHIP)
A few typical autocorrelation behaviors that are useful to explore are:
Strong autocorrelation (positive or negative) at a lag k larger than 1
and its multiples (2k, 3k, . . .) typically reflects a cyclical pattern. For exam-
ple, strong positive lag-12 autocorrelation in monthly data will reflect an
annual seasonality (where values during a given month each year are posi-
tively correlated).
AUTOCORRELATION AND ARIMA MODELS 415
Positive lag-1 autocorrelation (called “stickiness”) describes a series
where consecutive values move generally in the same direction. In the pres-
ence of a strong linear trend, we would expect to see a strong and positive
lag-1 autocorrelation.
Negative lag-1 autocorrelation reflects swings in the series, where high
values are immediately followed by low values and vice versa.
Examining the autocorrelation of a series can therefore help to detect sea-
sonality patterns. In Figure 17.7, for example, we see that the strongest autocor-
relation is at lag 6 and is negative. This indicates a bi-annual pattern in ridership,
with 6-month switches from high to low ridership. A look at the time plot
confirms the high-summer low-winter pattern.
In addition to looking at the autocorrelation of the raw series, it is very useful
to look at the autocorrelation of the residual series. For example, after fitting
a regression model (or using any other forecasting method), we can examine
the autocorrelation of the series of residuals. If we have adequately modeled
the seasonal pattern, then the residual series should show no autocorrelation
at the season’s lag. Figure 17.8 displays the autocorrelations for the residuals
from the regression model with seasonality and quadratic trend shown in Figure
17.6. It is clear that the 6-month (and 12-month) cyclical behavior no longer
dominates the series of residuals, indicating that the regression model captured
them adequately. However, we can also see a strong positive autocorrelation
from lag 1 on, indicating a positive relationship between neighboring residuals.
This is valuable information, which can be used to improve forecasts.
−
0.
2
0.
0
0.
2
0.
4
0.
6
Lag
A
C
F
2 4 6 8 10 12
FIGURE 17.8 AUTOCORRELATION PLOT OF FORECAST ERRORS SERIES FROM FIGURE 17.6
416 REGRESSION-BASED FORECASTING
Improving Forecasts by Integrating Autocorrelation Information
In general, there are two approaches to taking advantage of autocorrelation. One
is by directly building the autocorrelation into the regression model, and the
other is by constructing a second-level forecasting model on the residual series.
Among regression-type models that directly account for autocorrelation are
autoregressive (AR) models, or the more general class of models called ARIMA
(Autoregressive Integrated Moving Average) models. AR models are similar to
linear regression models, except that the predictors are the past values of the
series. For example, an autoregressive model of order 2, denoted AR(2), can be
written as
Yt = β0 + β1Yt−1 + β2Yt−2 + ϵ (17.1)
Estimating such models is roughly equivalent to fitting a linear regression
model with the series as the outcome variable, and the two lagged series (at lag 1
and 2 in this example) as the predictors. However, it is better to use designated
ARIMA estimation methods (e.g., those available in R’s forecast package)
over ordinary linear regression estimation, to produce more accurate results.4
Moving from AR to ARIMA models creates a larger set of more flexible fore-
casting models, but also requires much more statistical expertise. Even with the
simpler AR models, fitting them to raw time series that contain patterns such as
trends and seasonality requires the user to perform several initial data transforma-
tions and to choose the order of the model. These are not straightforward tasks.
Because ARIMA modeling is less robust and requires more experience and sta-
tistical expertise than other methods, the use of such models for forecasting raw
series is generally less popular in practical forecasting. We therefore direct the
interested reader to classic time series textbooks [e.g., see Chapter 4 in Chatfield
(2003)].
However, we do discuss one particular use of AR models that is straight-
forward to apply in the context of forecasting, which can provide a significant
improvement to short-term forecasts. This relates to the second approach for
utilizing autocorrelation, which requires constructing a second-level forecasting
model for the residuals, as follows:
1. Generate a k-step-ahead forecast of the series (Ft+k), using any forecast-
ing method
2. Generate a k-step-ahead forecast of the forecast error (residual) (Et+k),
using an AR (or other) model
4ARIMA model estimation differs from ordinary regression estimation by accounting for the depen-
dence between records.
AUTOCORRELATION AND ARIMA MODELS 417
3. Improve the initial k-step-ahead forecast of the series by adjusting it
according to its forecasted error: Improved F ∗t+k = Ft+k + Et+k.
In particular, we can fit low-order AR models to series of residuals (or forecast
errors) which can then be used to forecast future forecast errors. By fitting the
series of residuals, rather than the raw series, we avoid the need for initial data
transformations (because the residual series is not expected to contain any trends
or cyclical behavior besides autocorrelation).
To fit an AR model to the series of residuals, we first examine the auto-
correlations of the residual series. We then choose the order of the AR model
according to the lags in which autocorrelation appears. Often, when autocor-
relation exists at lag 1 and higher, it is sufficient to fit an AR(1) model of the
form
Et = β0 + β1Et−1 + ϵ (17.2)
where Et denotes the residual (or forecast error) at time t. For example, although
the autocorrelations in Figure 17.8 appear large from lags 1 to 10 or so, it is
likely that an AR(1) would capture all of these relationships. The reason is that
if immediate neighboring values are correlated, then the relationship propagates
to values that are two periods away, then three periods away, etc.5
The result of fitting an AR(1) model to the Amtrak ridership residual series
is shown in Table 17.7. The AR(1) coefficient (0.5998) is close to the lag-
1 autocorrelation (0.6041) that we found earlier (Figure 17.8). The forecasted
residual for April 2001 is computed by plugging in the most recent residual from
March 2001 (equal to 12.108) into the AR(1) model6:
0.3728491 + (0.5997814)(12.108− 0.3728491) = 7.411.
You can obtain this number directly by using the forecast() function (see output
in Table 17.7). The positive value tells us that the regression model will produce
a ridership forecast for April 2001 that is too low and that we should adjust
it up by adding 7411 riders. In this particular example, the regression model
(with quadratic trend and seasonality) produced a forecast of 2,004,271 riders,
and the improved two-stage model [regression + AR(1) correction] corrected it
by increasing it to 2,011,906 riders. The actual value for April 2001 turned out
to be 2,023,792 riders—much closer to the improved forecast.
5Partial autocorrelations (use function Pacf()) measure the contribution of each lag series over and above
smaller lags. For example, the lag-2 partial autocorrelation is the contribution of lag-2 beyond that of
lag-1.
6The intercept in the Coefficients table resulting from function Arima() is not exactly an intercept—it
is the estimated mean of the series. Hence, to get a forecast, we must subtract this coefficient from our
value. In this case, we have Ft+1 = intercept + slope (yt− intercept).
418 REGRESSION-BASED FORECASTING
TABLE 17.7 OUTPUT FOR AR(1) MODEL ON RIDERSHIP RESIDUALS
code for running AR(1) model on residuals
# fit linear regression with quadratic trend and seasonality to Ridership
train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
# fit AR(1) model to training residuals
# use Arima() in the forecast package to fit an ARIMA model
# (that includes AR models); order = c(1,0,0) gives an AR(1).
train.res.arima <- Arima(train.lm.trend.season$residuals, order = c(1,0,0))
valid.res.arima.pred <- forecast(train.res.arima, h = 1)
Output
> summary(train.res.arima)
Series: train.lm.trend.season$residuals
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 intercept
0.5997814 0.3728491
s.e. 0.0712246 11.8408218
sigma^2 estimated as 2829: log likelihood=-663.54
AIC=1333.08 AICc=1333.29 BIC=1341.52
> valid.res.arima.pred
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Apr 2001 7.4111097650 -61.31747817 76.13969770 -97.70019491 112.5224144
From the plot of the actual vs. forecasted residual series (Figure 17.9), we can
see that the AR(1) model fits the residual series quite well. Note, however, that
the plot is based on the training data (until March 2001). To evaluate predictive
performance of the two-level model [regression + AR(1)], we would have to
examine performance (e.g., via MAPE or RMSE metrics) on the validation
data, in a fashion similar to the calculation that we performed for April 2001.
Finally, to examine whether we have indeed accounted for the autocor-
relation in the series and that no more information remains in the series, we
examine the autocorrelations of the series of residuals-of-residuals (the residuals
obtained after the AR(1), which was applied to the regression residuals). This is
seen in Figure 17.10. It is clear that no more autocorrelation remains, and that
the addition of the AR(1) model has captured the autocorrelation information
adequately.
We mentioned earlier that improving forecasts via an additional AR layer is
useful for short-term forecasting. The reason is that an AR model of order k
will usually only provide useful forecasts for the next k periods, and after that
forecasts will rely on earlier forecasts rather than on actual data. For example, to
AUTOCORRELATION AND ARIMA MODELS 419
code for creating Figure 17.9
plot(train.lm.trend.season$residuals, ylim = c(-250, 250), ylab = “Residuals”,
xlab = “Time”, bty = “l”, xaxt = “n”, xlim = c(1991,2006.25), main = “”)
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(train.res.arima.pred$fitted, lwd = 2, col = “blue”)
Time
R
es
id
ua
ls
−
20
0
−
10
0
0
10
0
20
0
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 17.9 FITTING AN AR(1) MODEL TO THE RESIDUAL SERIES FROM FIGURE 17.6
−
0.
2
−
0.
1
0.
0
0.
1
0.
2
Lag
A
C
F
2 4 6 8 10 12
FIGURE 17.10 AUTOCORRELATIONS OF RESIDUALS-OF-RESIDUALS SERIES
420 REGRESSION-BASED FORECASTING
forecast the residual of May 2001 when the time of prediction is March 2001,
we would need the residual for April 2001. However, because that value is not
available, it would be replaced by its forecast. Hence, the forecast for May 2001
would be based on the forecast for April 2001.
Evaluating Predictability
Before attempting to forecast a time series, it is important to determine whether
it is predictable, in the sense that its past can be used to predict its future beyond
the naive forecast. One useful way to assess predictability is to test whether the
series is a random walk. A random walk is a series in which changes from one time
period to the next are random. According to the efficient market hypothesis in
economics, asset prices are random walks and therefore predicting stock prices
is a game of chance.7
A random walk is a special case of an AR(1) model, where the slope coeffi-
cient is equal to 1:
Yt = β0 + Yt−1 + ϵt. (17.3)
We can also write this as
Yt − Yt−1 = β0 + ϵt. (17.4)
We see from the last equation that the difference between the values at peri-
ods t−1 and t is random, hence the term “random walk.” Forecasts from such a
model are basically equal to the most recent observed value (the naive forecast),
reflecting the lack of any other information.
To test whether a series is a random walk, we fit an AR(1) model and test the
hypothesis that the slope coefficient is equal to 1 (H0 : β1 = 1 vs. H1 : β1 ̸= 1).
If the null hypothesis is rejected (reflected by a small p-value), then the series is
not a random walk and we can attempt to predict it.
As an example, consider the AR(1) model shown in Figure 17.6. The slope
coefficient (0.5998) is more than 5 standard errors away from 1, indicating that
this is not a random walk. In contrast, consider the AR(1) model fitted to the
series of S&P500 monthly closing prices between May 1995 and August 2003
(in SP500.csv, shown in Table 17.8). Here the slope coefficient is 0.9833, with
a standard error of 0.0145. The coefficient is sufficiently close to 1 (around
one standard error away), indicating that this is a random walk. Forecasting this
series using any of the methods described earlier (aside from the naive forecast)
is therefore futile.
7There is some controversy surrounding the efficient market hypothesis, with claims that there is slight
autocorrelation in asset prices, which does make them predictable to some extent. However, transac-
tion costs and bid-ask spreads tend to offset any prediction benefits.
AUTOCORRELATION AND ARIMA MODELS 421
TABLE 17.8 OUTPUT FOR AR(1) MODEL ON S&P500 MONTHLY CLOSING PRICES
> sp500.df <- read.csv("SP500.csv")
> sp500.ts <- ts(sp500.df$Close, start = c(1995, 5), end = c(2003, 8), freq = 12)
> sp500.arima <- Arima(sp500.ts, order = c(1,0,0))
> sp500.arima
Series: sp500.ts
ARIMA(1,0,0) with non-zero mean
Coefficients:
ar1 intercept
0.9833410 890.0706502
s.e. 0.0145083 221.0280170
sigma^2 estimated as 2833.446: log likelihood=-540.05
AIC=1086.1 AICc=1086.35 BIC=1093.92
422 REGRESSION-BASED FORECASTING
PROBLEMS
17.1 Impact of September 11 on Air Travel in the United States. The Research and
Innovative Technology Administration’s Bureau of Transportation Statistics conducted
a study to evaluate the impact of the September 11, 2001 terrorist attack on US trans-
portation. The 2006 study report and the data can be found at http://goo.gl/w2lJPV.
The goal of the study was stated as follows:
The purpose of this study is to provide a greater understanding of the
passenger travel behavior patterns of persons making long distance trips before
and after 9/11.
The report analyzes monthly passenger movement data between January 1990 and
May 2004. Data on three monthly time series are given in file Sept11Travel.csv for this
period: (1) Actual airline revenue passenger miles (Air), (2) rail passenger miles (Rail),
and (3) vehicle miles traveled (Car).
In order to assess the impact of September 11, BTS took the following approach:
using data before September 11, they forecasted future data (under the assumption of
no terrorist attack). Then, they compared the forecasted series with the actual data
to assess the impact of the event. Our first step, therefore, is to split each of the time
series into two parts: pre- and post September 11. We now concentrate only on the
earlier time series.
a. Plot the pre-event AIR time series. What time series components appear?
b. Figure 17.11 shows a time plot of the seasonally adjusted pre-September-11 AIR
series. Which of the following methods would be adequate for forecasting the series
shown in the figure?
• Linear regression model seasonality
• Linear regression model with trend
• Linear regression model with trend and seasonality
FIGURE 17.11 SEASONALLY ADJUSTED PRE-SEPTEMBER-11 AIR SERIES
http://goo.gl/w2lJPV
PROBLEMS 423
c. Specify a linear regression model for the AIR series that would produce a season-
ally adjusted series similar to the one shown in Figure 17.11, with multiplicative
seasonality. What is the outcome variable? What are the predictors?
d. Run the regression model from (c). Remember to use only pre-event data.
i. What can we learn from the statistical insignificance of the coefficients for
October and September?
ii. The actual value of AIR (air revenue passenger miles) in January 1990 was
35.153577 billion. What is the residual for this month, using the regression
model? Report the residual in terms of air revenue passenger miles.
e. Create an ACF (autocorrelation) plot of the regression residuals.
i. What does the ACF plot tell us about the regression model’s forecasts?
ii. How can this information be used to improve the model?
f. Fit linear regression models to Air, Rail, and to Auto with additive seasonality and
an appropriate trend. For Air and Rail, fit a linear trend. For Rail, use a quadratic
trend. Remember to use only pre-event data. Once the models are estimated, use
them to forecast each of the three post-event series.
i. For each series (Air, Rail, Auto), plot the complete pre-event and post-event
actual series overlayed with the predicted series.
ii. What can be said about the effect of the September 11 terrorist attack on the
three modes of transportation? Discuss the magnitude of the effect, its time
span, and any other relevant aspects.
17.2 Analysis of Canadian Manufacturing Workers Workhours. The time plot in
Figure 17.12 describes the average annual number of weekly hours spent by Canadian
manufacturing workers (data are available in CanadianWorkHours.csv, data courtesy of
Ken Black).
a. Which of the following regression models would fit the series best? (Choose one.)
Year
H
ou
rs
P
er
W
ee
k
1970 1975 1980 1985 1990 1995 2000
34
.5
35
.0
35
.5
36
.0
36
.5
37
.0
37
.5
38
. 0
FIGURE 17.12 AVERAGE ANNUAL WEEKLY HOURS SPENT BY CANADIAN MANUFACTURING
WORKERS
424 REGRESSION-BASED FORECASTING
• Linear trend model
• Linear trend model with seasonality
• Quadratic trend model
• Quadratic trend model with seasonality
b. If we computed the autocorrelation of this series, would the lag-1 autocorrelation
exhibit negative, positive, or no autocorrelation? How can you see this from the
plot?
c. Compute the autocorrelation of the series and produce an ACF plot. Verify your
answer to the previous question.
17.3 Toys “R” Us Revenues. Figure 17.13 is a time plot of the quarterly revenues of
Toys “R” Us between 1992 and 1995 (thanks to Chris Albright for suggesting the use
of these data, which are available in ToysRUsRevenues.csv).
a. Fit a regression model with a linear trend and additive seasonality. Use the entire
series (excluding the last two quarters) as the training set.
b. A partial output of the regression model is shown in Table 17.9 (where season2 is
the Quarter 2 dummy). Use this output to answer the following questions:
i. Which two statistics (and their values) measure how well this model fits the
training data?
ii. Which two statistics (and their values) measure the predictive accuracy of this
model?
iii. After adjusting for trend, what is the average difference between sales in Q3
and sales in Q1?
iv. After adjusting for seasonality, which quarter (Q1, Q2, Q3, or Q4) has the high-
est average sales?
Time
R
ev
en
ue
(
$
m
ill
io
ns
)
1992 1993 1994 1995 1996
10
00
15
00
20
00
25
00
30
00
35
00
40
00
FIGURE 17.13 QUARTERLY REVENUES OF TOYS “R” US, 1992–1995
PROBLEMS 425
TABLE 17.9 REGRESSION MODEL FITTED TO TOYS ”R” US TIME SERIES AND ITS PREDICTIVE
PERFORMANCE IN TRAINING AND VALIDATION PERIODS
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 906.75 115.35 7.861 2.55e-05 ***
trend 47.11 11.26 4.185 0.00236 **
season2 -15.11 119.66 -0.126 0.90231
season3 89.17 128.67 0.693 0.50582
season4 2101.73 129.17 16.272 5.55e-08 ***
—
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 .
ME RMSE MAE MPE MAPE MASE
Training set 0.0000 135.0795 92.53061 0.1614994 5.006914 0.4342122
Test set 183.1429 313.6820 254.66667 3.0193814 7.404655 1.1950571
17.4 Walmart Stock. Figure 17.14 shows the series of Walmart daily closing prices
between February 2001 and February 2002 (Thanks to Chris Albright for sug-
gesting the use of these data, which are publicly available, for example, at http:
//finance.yahoo.com and are in the file WalMartStock.csv).
a. Fit an AR(1) model to the close price series. Report the coefficient table.
b. Which of the following is/are relevant for testing whether this stock is a random
walk?
45
50
55
60
Time
C
lo
se
P
ric
e
($
)
Mar−01 May−01 Jul−01 Sep−01 Nov−01 Jan−02
FIGURE 17.14 DAILY CLOSE PRICE OF WALMART STOCK, FEBRUARY 2001–2002
http://finance.yahoo.com
426 REGRESSION-BASED FORECASTING
• The autocorrelations of the close prices series
• The AR(1) slope coefficient
• The AR(1) constant coefficient
c. Does the AR model indicate that this is a random walk? Explain how you reached
your conclusion.
d. What are the implications of finding that a time-series is a random walk? Choose
the correct statement(s) below.
• It is impossible to obtain forecasts that are more accurate than naive forecasts for
the series
• The series is random
• The changes in the series from one period to the next are random
17.5 Department Store Sales. The time plot in Figure 17.15 describes actual quarterly
sales for a department store over a 6-year period (data are available in DepartmentStore-
Sales.csv, data courtesy of Chris Albright).
a. The forecaster decided that there is an exponential trend in the series. In order to
fit a regression-based model that accounts for this trend, which of the following
operations must be performed?
• Take log of quarter index
• Take log of sales
• Take an exponent of sales
• Take an exponent of quarter index
Quarter
S
al
es
4
00
00
6
00
00
8
00
00
10
00
00
2 4 6 8 10 12 14 16 18 20 22 24
FIGURE 17.15 DEPARTMENT STORE QUARTERLY SALES SERIES
PROBLEMS 427
b. Fit a regression model with an exponential trend and seasonality, using the first 20
quarters as the training data (remember to first partition the series into training and
validation series).
c. A partial output is shown in Table 17.10. From the output, after adjusting for trend,
are Q2 average sales higher, lower, or approximately equal to the average Q1 sales?
TABLE 17.10 OUTPUT FROM REGRESSION MODEL FIT TO DEPARTMENT STORE SALES IN THE
TRAINING PERIOD
> summary(tslm(sales.ts ~ trend + season, lambda = 0))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.748945 0.018725 574.057 < 2e-16 ***
trend 0.011088 0.001295 8.561 3.70e-07 ***
season2 0.024956 0.020764 1.202 0.248
season3 0.165343 0.020884 7.917 9.79e-07 ***
season4 0.433746 0.021084 20.572 2.10e-12 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 .
d. Use this model to forecast sales in quarters 21 and 22.
e. The plots in Figure 17.16 describe the fit (top) and forecast errors (bottom) from
this regression model.
Quarter
S
al
es
4
00
00
7
00
00
10
00
0 0
2 4 6 8 10 12 14 16 18 20
Quarter
R
es
id
ua
ls
−
40
00
0
20
00
2 4 6 8 10 12 14 16 18 20
FIGURE 17.16 FIT OF REGRESSION MODEL FOR DEPARTMENT STORE SALES
428 REGRESSION-BASED FORECASTING
i. Recreate these plots.
ii. Based on these plots, what can you say about your forecasts for quarters 21 and
22? Are they likely to over-forecast, under-forecast, or be reasonably close to
the real sales values?
f. From the forecast errors plot, which of the following statements appear true?
• Seasonality is not captured well
• The regression model fits the data well
• The trend in the data is not captured well by the model
g. Which of the following solutions is adequate and a parsimonious solution for
improving model fit?
• Fit a quadratic trend model to the residuals (with Quarter and Quarter2)
• Fit an AR model to the residuals
• Fit a quadratic trend model to Sales (with Quarter and Quarter2)
17.6 Souvenir Sales. Figure 17.17 shows a time plot of monthly sales for a souvenir
shop at a beach resort town in Queensland, Australia, between 1995 and 2001 (Data
are available in SouvenirSales.csv, source: Hyndman, R.J., Time Series Data Library,
http://data.is/TSDLdemo. Accessed on 07/25/15.). The series is presented twice, in
Australian dollars and in log-scale. Back in 2001, the store wanted to use the data to
forecast sales for the next 12 months (year 2002). They hired an analyst to generate
forecasts. The analyst first partitioned the data into training and validation sets, with
the validation set containing the last 12 months of data (year 2001). She then fit a
regression model to sales, using the training set.
a. Based on the two time plots, which predictors should be included in the regression
model? What is the total number of predictors in the model?
b. Run a regression model with Sales (in Australian dollars) as the outcome variable,
and with a linear trend and monthly seasonality. Remember to fit only the training
data. Call this model A.
i. Examine the estimated coefficients: which month tends to have the highest
average sales during the year? Why is this reasonable?
ii. The estimated trend coefficient in model A is 245.36. What does this mean?
c. Run a regression model with an exponential trend and multiplicative seasonality.
Remember to fit only the training data. Call this model B.
i. Fitting a model to log(Sales) with a linear trend is equivalent to fitting a model
to Sales (in dollars) with what type of trend?
ii. The estimated trend coefficient in model B is 0.02. What does this mean?
iii. Use this model to forecast the sales in February 2002.
d. Compare the two regression models (A and B) in terms of forecast performance.
Which model is preferable for forecasting? Mention at least two reasons based on
the information in the outputs.
e. Continuing with model B, create an ACF plot until lag 15 for the forecast errors.
Now fit an AR model with lag 2 [ARIMA(2,0,0)] to the forecast errors.
http://data.is/TSDLdemo
PROBLEMS 429
Time
1995 1997 1999 2001
0
20
00
0
60
00
0
10
00
00
Time
lo
g(
S
al
es
)
1995 1997 1999 2001
8
9
10
11
S
al
es
(
A
us
tr
al
ia
n
$)
FIGURE 17.17 MONTHLY SALES AT AUSTRALIAN SOUVENIR SHOP IN DOLLARS (TOP) AND IN
LOG-SCALE (BOTTOM)
i. Examining the ACF plot and the estimated coefficients of the AR(2) model
(and their statistical significance), what can we learn about the forecasts that
result from model B?
ii. Use the autocorrelation information to compute an improved forecast for Jan-
uary 2002, using model B and the AR(2) model above.
f. How would you model these data differently if the goal was to understand the
different components of sales in the souvenir shop between 1995–2001? Mention
two differences.
17.7 Shipments of Household Appliances. The time plot in Figure 17.18 shows the
series of quarterly shipments (in million dollars) of US household appliances between
1985-1989 (data are available in ApplianceShipments.csv, data courtesy of Ken Black). If
we compute the autocorrelation of the series, which lag (> 0) is most likely to have
the largest coefficient (in absolute value)? Create an ACF plot and compare with your
answer.
430 REGRESSION-BASED FORECASTING
Time
S
hi
pm
en
ts
(
in
0
00
s)
40
00
42
00
44
00
46
00
48
00
Q1−1985 Q3−1985 Q1−1986 Q3−1986 Q1−1987 Q3−1987 Q1−1988 Q3−1988 Q1−1989 Q3−1989
FIGURE 17.18 QUARTERLY SHIPMENTS OF US HOUSEHOLD APPLIANCES OVER 5 YEARS
17.8 Australian Wine Sales. Figure 17.19 shows time plots of monthly sales of six types
of Australian wines (red, rose, sweet white, dry white, sparkling, and fortified) for
1980–1994 (Data are available in AustralianWines.csv, source: Hyndman, R.J., Time
Series Data Library, http://data.is/TSDLdemo. Accessed on 07/25/15.). The units
are thousands of litres. You are hired to obtain short term forecasts (2–3 months ahead)
for each of the six series, and this task will be repeated every month.
a. Which forecasting method would you choose if you had to choose the same method
for all series? Why?
b. Fortified wine has the largest market share of the above six types of wine. You
are asked to focus on fortified wine sales alone, and produce as accurate as possible
forecasts for the next 2 months.
• Start by partitioning the data: use the period until December 1993 as the training
set.
• Fit a regression model to sales with a linear trend and additive seasonality.
i. Create the “actual vs. forecast” plot. What can you say about the model fit?
ii. Use the regression model to forecast sales in January and February 1994.
c. Create an ACF plot for the residuals from the above model until lag 12. Examining
this plot (only), which of the following statements are reasonable conclusions?
• Decembers (month 12) are not captured well by the model.
• There is a strong correlation between sales on the same calendar month.
• The model does not capture the seasonality well.
• We should try to fit an autoregressive model with lag 12 to the residuals.
http://data.is/TSDLdemo
PROBLEMS 431
FIGURE 17.19 MONTHLY SALES OF SIX TYPES OF AUSTRALIAN WINES BETWEEN 1980–1994
CHAPTER 18
Smoothing Methods
In this chapter, we describe a set of popular and flexible methods for forecasting
time series that rely on smoothing. Smoothing is based on averaging over multiple
periods in order to reduce the noise. We start with two simple smoothers, the
moving average and simple exponential smoother, which are suitable for forecasting
series that contain no trend or seasonality. In both cases, forecasts are averages of
previous values of the series (the length of the series history considered and the
weights used in the averaging differ between the methods). We also show how
a moving average can be used, with a slight adaptation, for data visualization.
We then proceed to describe smoothing methods suitable for forecasting series
with a trend and/or seasonality. Smoothing methods are data-driven, and are
able to adapt to changes in the series over time. Although highly automated, the
user must specify smoothing constants that determine how fast the method adapts
to new data. We discuss the choice of such constants, and their meaning. The
different methods are illustrated using the Amtrak ridership series.
18.1 Introduction1
A second class of methods for time series forecasting is smoothing methods.
Unlike regression models, which rely on an underlying theoretical model for the
components of a time series (e.g., linear trend or multiplicative seasonality),
smoothing methods are data-driven, in the sense that they estimate time series
components directly from the data without assuming a predetermined structure.
Data-driven methods are especially useful in series where patterns change over
1This and subsequent sections in this chapter copyright © 2017 Datastats, LLC, and Galit Shmueli.
Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
433
434 SMOOTHING METHODS
time. Smoothing methods “smooth” out the noise in a series in an attempt to
uncover the patterns. Smoothing is done by averaging the series over multiple
periods, where different smoothers differ by the number of periods averaged,
how the average is computed, how many times averaging is performed, and so
on. We now describe two types of smoothing methods that are popular in busi-
ness applications due to their simplicity and adaptability. These are the moving
average method and exponential smoothing.
18.2 Moving Average
The moving average is a simple smoother: it consists of averaging values across
a window of consecutive periods, thereby generating a series of averages. A
moving average with window width w means averaging across each set of w
consecutive values, where w is determined by the user.
In general, there are two types of moving averages: a centered moving average
and a trailing moving average. Centered moving averages are powerful for visual-
izing trends, because the averaging operation can suppress seasonality and noise,
thereby making the trend more visible. In contrast, trailing moving averages
are useful for forecasting. The difference between the two is in terms of the
window’s location on the time series.
Centered Moving Average for Visualization
In a centered moving average, the value of the moving average at time t (MAt)
is computed by centering the window around time t and averaging across the w
values within the window:
MAt =
(
Yt−(w−1)/2 + · · ·+ Yt−1 + Yt + Yt+1 + · · ·+ Yt+(w−1)/2
)
/w.
(18.1)
For example, with a window of width w = 5, the moving average at time point
t = 3 means averaging the values of the series at time points 1, 2, 3, 4, 5; at
time point t = 4, the moving average is the average of the series at time points
2, 3, 4, 5, 6, and so on.2 This is illustrated in the top panel of Figure 18.1.
Choosing the window width in a seasonal series is straightforward: because
the goal is to suppress seasonality for better visualizing the trend, the default
choice should be the length of a seasonal cycle. Returning to the Amtrak rid-
ership data, the annual seasonality indicates a choice of w = 12. Figure 18.2
(smooth black line) shows a centered moving average line overlaid on the origi-
nal series. We can see a global U-shape, but unlike the regression model that fits
2For an even window width, for example, w = 4, obtaining the moving average at time point t = 3
requires averaging across two windows: across time points 1, 2, 3, 4; across time points 2, 3, 4, 5; and
finally the average of the two averages is the final moving average.
MOVING AVERAGE 435
Centered window (w = 5)
t – 2 t + 1 t + 2 t – 1 t
t – 1 tt – 3 t – 2t – 4
Trailing window (w = 5)
FIGURE 18.1 SCHEMATIC OF CENTERED MOVING AVERAGE (TOP) AND TRAILING MOVING
AVERAGE (BOTTOM), BOTH WITH WINDOW WIDTH W = 5
a strict U-shape, the moving average shows some deviation, such as the slight
dip during the last year.
Trailing Moving Average for Forecasting
Centered moving averages are computed by averaging across data in the past and
the future of a given time point. In that sense, they cannot be used for forecasting
because at the time of forecasting, the future is typically unknown. Hence, for
purposes of forecasting, we use trailing moving averages, where the window of
width w is set on the most recent available w values of the series. The k-step
ahead forecast Ft+k (k = 1, 2, 3, . . .) is then the average of these w values (see
also bottom plot in Figure 18.1):
Ft+k = (Yt + Yt−1 + · · ·+ Yt−w+1) /w
For example, in the Amtrak ridership series, to forecast ridership in February
1992 or later months, given information until January 1992 and using a moving
average with window widthw = 12, we would take the average ridership during
the most recent 12 months (February 1991 to January 1992). Figure 18.2 (broken
black line) shows a trailing moving average line overlaid on the original series.
Next, we illustrate a 12-month moving average forecaster for the Amtrak
ridership. We partition the Amtrak ridership time series, leaving the last 36
months as the validation period. Applying a moving average forecaster with
window w = 12, we obtained the output shown in Figure 18.3. Note that for
the first 12 records of the training period, there is no forecast (because there are
less than 12 past values to average). Also, note that the forecasts for all months in
the validation period are identical (1938.481) because the method assumes that
information is known only until March 2001.
In this example, it is clear that the moving average forecaster is inadequate
for generating monthly forecasts because it does not capture the seasonality in
the data. Seasons with high ridership are under-forecasted, and seasons with
low ridership are over-forecasted. A similar issue arises when forecasting a series
with a trend: the moving average “lags behind,” thereby under-forecasting in
436 SMOOTHING METHODS
code for creating Figure 18.2
library(zoo)
# centered moving average with window order = 12
ma.centered <- ma(ridership.ts, order = 12)
# trailing moving average with window k = 12
# in rollmean(), use argument align = right to calculate a trailing moving average.
ma.trailing <- rollmean(ridership.ts, k = 12, align = "right")
# generate a plot
plot(ridership.ts, ylim = c(1300, 2200), ylab = "Ridership",
xlab = "Time", bty = "l", xaxt = "n",
xlim = c(1991,2004.25), main = "")
axis(1, at = seq(1991, 2004.25, 1), labels = format(seq(1991, 2004.25, 1)))
lines(ma.centered, lwd = 2)
lines(ma.trailing, lwd = 2, lty = 2)
legend(1994,2200, c("Ridership","Centered Moving Average", "Trailing Moving Average"),
lty=c(1,1,2), lwd=c(1,2,2), bty = "n")
Time
R
id
er
sh
ip
14
00
16
00
18
00
20
00
22
00
1991 1993 1995 1997 1999 2001 2003
Ridership
Centered Moving Average
Trailing Moving Average
FIGURE 18.2 CENTERED MOVING AVERAGE (SMOOTH BLACK LINE) AND TRAILING MOVING
AVERAGE (BROKEN BLACK LINE) WITH WINDOW W = 12, OVERLAID ON
AMTRAK RIDERSHIP SERIES
MOVING AVERAGE 437
code for creating Figure 18.3
# partition the data
nValid <- 36
nTrain <- length(ridership.ts) - nValid
train.ts <- window(ridership.ts, start = c(1991, 1), end = c(1991, nTrain))
valid.ts <- window(ridership.ts, start = c(1991, nTrain + 1),
end = c(1991, nTrain + nValid))
# moving average on training
ma.trailing <- rollmean(train.ts, k = 12, align = "right")
# obtain the last moving average in the training period
last.ma <- tail(ma.trailing, 1)
# create forecast based on last MA
ma.trailing.pred <- ts(rep(last.ma, nValid), start = c(1991, nTrain + 1),
end = c(1991, nTrain + nValid), freq = 12)
# plot the series
plot(train.ts, ylim = c(1300, 2600), ylab = "Ridership", xlab = "Time", bty = "l",
xaxt = "n", xlim = c(1991,2006.25), main = "")
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(ma.trailing, lwd = 2, col = "blue")
lines(ma.trailing.pred, lwd = 2, col = "blue", lty = 2)
lines(valid.ts)
Time
R
id
er
sh
ip
14
00
16
00
18
00
20
00
22
00
24
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 18.3 TRAILING MOVING AVERAGE FORECASTER WITH W = 12 APPLIED TO AMTRAK
RIDERSHIP SERIES
438 SMOOTHING METHODS
the presence of an increasing trend and over-forecasting in the presence of a
decreasing trend. This “lagging behind” of the trailing moving average can also
be seen in Figure 18.2.
In general, the moving average should be used for forecasting only in series that
lack seasonality and trend. Such a limitation might seem impractical. However,
there are a few popular methods for removing trends (de-trending) and remov-
ing seasonality (de-seasonalizing) from a series, such as regression models. The
moving average can then be used to forecast such de-trended and de-seasonalized
series, and then the trend and seasonality can be added back to the forecast. For
example, consider the regression model shown in Figure 17.6 in Chapter 17,
which yields residuals devoid of seasonality and trend (see bottom chart). We
can apply a moving average forecaster to that series of residuals (also called fore-
cast errors), thereby creating a forecast for the next forecast error. For example,
to forecast ridership in April 2001 (the first period in the validation set), assum-
ing that we have information until March 2001, we use the regression model in
Table 17.5 to generate a forecast for April 2001 (which yields 2004.271 thousand
riders). We then use a 12-month moving average (using the period April 2000
to March 2001) to forecast the forecast error for April 2001, which yields 30.78068
(manually, or using R, as shown in Table 18.1). The positive value implies that
the regression model’s forecast for April 2001 is too low, and therefore we should
adjust it by adding approximately 31 thousand riders to the regression model’s
forecast of 2004.271 thousand riders.
TABLE 18.1 APPLYING MA TO THE RESIDUALS FROM THE REGRESSION MODEL (WHICH LACK
TREND AND SEASONALITY), TO FORECAST THE APRIL 2001 RESIDUAL
code for applying moving average to residuals
# fit regression model with trend and seasonality
train.lm.trend.season <- tslm(train.ts ~ trend + I(trend^2) + season)
# create single-point forecast
train.lm.trend.season.pred <- forecast(train.lm.trend.season, h = 1, level = 0)
# apply MA to residuals
ma.trailing <- rollmean(train.lm.trend.season$residuals, k = 12, align = "right")
last.ma <- tail(ma.trailing, 1)
Output
> train.lm.trend.season.pred
Point Forecast Lo 0 Hi 0
Apr 2001 2004.271 2004.271 2004.271
> last.ma
[1] 30.78068
SIMPLE EXPONENTIAL SMOOTHING 439
Choosing Window Width (w)
With moving average forecasting or visualization, the only choice that the user
must make is the width of the window (w). As with other methods such as k-
nearest neighbors, the choice of the smoothing parameter is a balance between
under-smoothing and over-smoothing. For visualization (using a centered win-
dow), wider windows will expose more global trends, while narrow windows
will reveal local trends. Hence, examining several window widths is useful for
exploring trends of differing local/global nature. For forecasting (using a trail-
ing window), the choice should incorporate domain knowledge in terms of
relevance of past values and how fast the series changes. Empirical predictive
evaluation can also be done by experimenting with different values of w and
comparing performance. However, care should be taken not to overfit!
18.3 Simple Exponential Smoothing
A popular forecasting method in business is exponential smoothing. Its popu-
larity derives from its flexibility, ease of automation, cheap computation, and
good performance. Simple exponential smoothing is similar to forecasting with
a moving average, except that instead of taking a simple average over the w most
recent values, we take a weighted average of all past values, such that the weights
decrease exponentially into the past. The idea is to give more weight to recent
information, yet not to completely ignore older information.
Like the moving average, simple exponential smoothing should only be used
for forecasting series that have no trend or seasonality. As mentioned earlier, such
series can be obtained by removing trend and/or seasonality from raw series,
and then applying exponential smoothing to the series of residuals (which are
assumed to contain no trend or seasonality).
The exponential smoother generates a forecast at time t+1 (Ft+1) as follows:
Ft+1 = αYt + α(1− α)Yt−1 + α(1− α)2Yt−2 + . . . , (18.2)
where α is a constant between 0 and 1 called the smoothing parameter. The above
formulation displays the exponential smoother as a weighted average of all past
observations, with exponentially decaying weights.
It turns out that we can write the exponential forecaster in another way,
which is very useful in practice:
Ft+1 = Ft + αEt, (18.3)
where Et is the forecast error at time t. This formulation presents the exponen-
tial forecaster as an “active learner”: It looks at the previous forecast (Ft) and
how far it was from the actual value (Et), and then corrects the next forecast
440 SMOOTHING METHODS
based on that information. If in one period the forecast was too high, the next
period is adjusted down. The amount of correction depends on the value of the
smoothing parameter α. The formulation in (18.3) is also advantageous in terms
of data storage and computation time: it means that we need to store and use
only the forecast and forecast error from the most recent period, rather than the
entire series. In applications where real-time forecasting is done, or many series
are being forecasted in parallel and continuously, such savings are critical.
Note that forecasting further into the future yields the same forecast as a one-
step-ahead forecast. Because the series is assumed to lack trend and seasonality,
forecasts into the future rely only on information until the time of prediction.
Hence, the k-step ahead forecast is equal to
Ft+k = Ft+1.
Choosing Smoothing Parameter α
The smoothing parameter α, which is set by the user, determines the rate of
learning. A value close to 1 indicates fast learning (that is, only the most recent
values have influence on forecasts) whereas a value close to 0 indicates slow learn-
ing (past values have a large influence on forecasts). This can be seen by plugging
0 or 1 into equation (18.2) or (18.3). Hence, the choice of α depends on the
required amount of smoothing, and on how relevant the history is for gener-
ating forecasts. Default values that have been shown to work well are around
0.1–0.2. Some trial and error can also help in the choice of α: examine the
time plot of the actual and predicted series, as well as the predictive accuracy
(e.g., MAPE or RMSE of the validation set). Finding the α value that optimizes
predictive accuracy on the validation set can be used to determine the degree
of local vs. global nature of the trend. However, beware of choosing the “‘best
α” for forecasting purposes, as this will most likely lead to model overfitting and
low predictive accuracy on future data.
To illustrate forecasting with simple exponential smoothing, we return to
the residuals from the regression model, which are assumed to contain no trend
or seasonality. To forecast the residual on April 2001, we apply exponential
smoothing to the entire period until March 2001, and use the default α = 0.2
value. The forecasts of this model are shown in Figure 18.4. The forecast for the
residual (the horizontal broken line) is 14.143 (in thousands of riders), implying
that we should adjust the regression’s forecast by adding 14,143 riders from that
forecast.
Relation Between Moving Average and Simple Exponential Smoothing
In both smoothing methods, the user must specify a single parameter: In moving
averages, the window width (w) must be set; in exponential smoothing, the
SIMPLE EXPONENTIAL SMOOTHING 441
code for creating Figure 18.4
# get residuals
residuals.ts <- train.lm.trend.season$residuals
# run simple exponential smoothing
# use ets() with model = "ANN" (additive error (A), no trend (N), no seasonality (N))
# and alpha = 0.2 to fit simple exponential smoothing.
ses <- ets(residuals.ts, model = "ANN", alpha = 0.2)
ses.pred <- forecast(ses, h = nValid, level = 0)
plot(ses.pred, ylim = c(-250, 300), ylab = "Ridership", xlab = "Time",
bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "", flty = 2)
lines(train.lm.trend.season.pred$fitted, lwd = 2, col = "blue")
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(ses.pred$fitted, lwd = 2, col = "blue")
lines(valid.ts)
Time
R
id
er
sh
ip
−
20
0
0
10
0
30
0
1991 1994 1997 2000 2003 2006
FIGURE 18.4 OUTPUT FOR SIMPLE EXPONENTIAL SMOOTHING FORECASTER WITH α = 0.2,
APPLIED TO THE SERIES OF RESIDUALS FROM THE REGRESSION MODEL (WHICH
LACK TREND AND SEASONALITY). THE FORECAST VALUE IS 14.143.
smoothing parameter (α) must be set. In both cases, the parameter determines
the importance of fresh information over older information. In fact, the two
smoothers are approximately equal if the window width of the moving average
is equal to w = 2/α− 1.
442 SMOOTHING METHODS
18.4 Advanced Exponential Smoothing
As mentioned earlier, both the moving average and simple exponential smooth-
ing should only be used for forecasting series with no trend or seasonality; series
that have only a level and noise. One solution for forecasting series with trend
and/or seasonality is first to remove those components (e.g., via regression mod-
els). Another solution is to use a more sophisticated version of exponential
smoothing, which can capture trend and/or seasonality.
Series with a Trend
For series that contain a trend, we can use “double exponential smoothing.”
Unlike in regression models, the trend shape is not assumed to be global, but
rather, it can change over time. In double exponential smoothing, the local
trend is estimated from the data and is updated as more data arrive. Simi-
lar to simple exponential smoothing, the level of the series is also estimated
from the data, and is updated as more data arrive. The k-step-ahead forecast is
given by combining the level estimate at time t (Lt) and the trend estimate at
time t (Tt):
Ft+k = Lt + kTt (18.4)
Note that in the presence of a trend, one-, two-, three-step-ahead (etc.),
forecasts are no longer identical. The level and trend are updated through a pair
of updating equations:
Lt = αYt + (1− α)(Lt−1 + Tt−1) (18.5)
Tt = β (Lt − Lt−1) + (1− β)Tt−1. (18.6)
The first equation means that the level at time t is a weighted average of the
actual value at time t and the level in the previous period, adjusted for trend (in
the presence of a trend, moving from one period to the next requires factoring
in the trend). The second equation means that the trend at time t is a weighted
average of the trend in the previous period and the more recent information on
the change in level.3 Here there are two smoothing parameters, α and β, which
determine the rate of learning. As in simple exponential smoothing, they are
both constants in the range [0,1], set by the user, with higher values leading to
faster learning (more weight to most recent information).
3There are various ways to estimate the initial values L1 and T1, but the differences among these ways
usually disappear after a few periods.
ADVANCED EXPONENTIAL SMOOTHING 443
Series with a Trend and Seasonality
For series that contain both trend and seasonality, the “Holt–Winter’s Expo-
nential Smoothing” method can be used. This is a further extension of double
exponential smoothing, where the k-step-ahead forecast also takes into account
the seasonality at period t + k. Assuming seasonality with M seasons (e.g., for
weekly seasonality M = 7), the forecast is given by
Ft+k = (Lt + kTt)St+k−M (18.7)
(Note that by the time of forecasting t, the series must have included at least
one full cycle of seasons in order to produce forecasts using this formula, that is,
t > M .)
Being an adaptive method, Holt–Winter’s exponential smoothing allows the
level, trend, and seasonality patterns to change over time. These three compo-
nents are estimated and updated as more information arrives. The three updating
equations are given by
Lt = αYt/St−M + (1− α)(Lt−1 + Tt−1) (18.8)
Tt = β (Lt − Lt−1) + (1− β)Tt−1 (18.9)
St = γYt/Lt + (1− γ)St−M . (18.10)
The first equation is similar to that in double exponential smoothing, except
that it uses the seasonally-adjusted value at time t rather than the raw value.
This is done by dividing Yt by its seasonal index, as estimated in the last cycle.
The second equation is identical to double exponential smoothing. The third
equation means that the seasonal index is updated by taking a weighted average of
the seasonal index from the previous cycle and the current trend-adjusted value.
Note that this formulation describes a multiplicative seasonal relationship, where
values on different seasons differ by percentage amounts. There is also an additive
seasonality version of Holt–Winter’s exponential smoothing, where seasons differ
by a constant amount (for more detail, see Shmueli and Lichtendahl, 2016).
To illustrate forecasting a series with the Holt–Winter’s method, consider
the raw Amtrak ridership data. As we observed earlier, the data contain both
a trend and monthly seasonality. Figure 18.5 depicts the fitted and forecasted
values. Table 18.2 presents a summary of the model.
Series with Seasonality (No Trend)
Finally, for series that contain seasonality but no trend, we can use a Holt–
Winter’s exponential smoothing formulation that lacks a trend term, by deleting
the trend term in the forecasting equation and updating equations.
444 SMOOTHING METHODS
code for creating Figure 18.5
# run Holt-Winters exponential smoothing
# use ets() with option model = “MAA” to fit Holt-Winter’s exponential smoothing
# with multiplicative error, additive trend, and additive seasonality.
hwin <- ets(train.ts, model = "MAA")
# create predictions
hwin.pred <- forecast(hwin, h = nValid, level = 0)
# plot the series
plot(hwin.pred, ylim = c(1300, 2600), ylab = "Ridership", xlab = "Time",
bty = "l", xaxt = "n", xlim = c(1991,2006.25), main = "", flty = 2)
axis(1, at = seq(1991, 2006, 1), labels = format(seq(1991, 2006, 1)))
lines(hwin.pred$fitted, lwd = 2, col = "blue")
lines(valid.ts)
Time
R
id
er
sh
ip
14
00
16
00
18
00
20
00
22
00
24
00
26
00
1991 1993 1995 1997 1999 2001 2003 2005
Training Validation Future
FIGURE 18.5 OUTPUT FOR HOLT–WINTERS, EXPONENTIAL SMOOTHING APPLIED TO AMTRAK
RIDERSHIP SERIES
ADVANCED EXPONENTIAL SMOOTHING 445
TABLE 18.2 SUMMARY OF A HOLT–WINTER’S EXPONENTIAL SMOOTHING MODEL APPLIED TO
THE AMTRAK RIDERSHIP DATA. INCLUDED ARE THE INITIAL AND FINAL STATES
> hwin
ETS(M,A,A)
Call:
ets(y = train.ts, model = “MAA”)
Smoothing parameters:
alpha = 0.5483
beta = 1e-04
gamma = 1e-04
Initial states:
l = 1881.6423
b = 0.4164
s=27.1143 -10.6847 -2.9465 -121.1763 201.1625 147.3359
37.6688 75.8711 60.4021 44.4779 -252.047 -207.1783
sigma: 0.0317
AIC AICc BIC
1614.219 1619.351 1659.214
E X P O N E N T I A L S M O O T H I N G U S I N G e t s ( ) I N R
In R, forecasting using exponential smoothing can be done via the ets() function
in the forecast package. The three letters in ets stand for error, trend, and sea-
sonality. Applying this function to a time series will yield forecasts and residuals
for both the training and validation periods. You can use the default values for
the smoothing parameters, set them to other values, or choose to find the optimal
values (which optimize AIC—see Chapter 5). We also choose the type of trend,
seasonality, and error. The three choices are made in the method = argument in
the form of a three-letter combination (e.g., “MAA”). The first letter denotes the
error type (A, M, or Z); the second letter denotes the trend type (N, A, M, or Z); and
the third letter denotes the season type (N, A, M, or Z). In all cases, N = none, A =
additive, M = multiplicative, and Z = automatically selected. For example, method
= “MAA” indicates a multiplicative error, additive trend, and additive seasonality.
446 SMOOTHING METHODS
PROBLEMS
18.1 Impact of September 11 on Air Travel in the United States. The Research and
Innovative Technology Administration’s Bureau of Transportation Statistics conducted
a study to evaluate the impact of the September 11, 2001 terrorist attack on US trans-
portation. The 2006 study report and the data can be found at http://goo.gl/w2lJPV.
The goal of the study was stated as follows:
The purpose of this study is to provide a greater understanding of the
passenger travel behavior patterns of persons making long distance trips before
and after 9/11.
The report analyzes monthly passenger movement data between January 1990
and May 2004. Data on three monthly time series are given in file Sept11Travel.csv for
this period: (1) Actual airline revenue passenger miles (Air), (2) Rail passenger miles
(Rail), and (3) Vehicle miles traveled (Car).
In order to assess the impact of September 11, BTS took the following approach:
using data before September 11, they forecasted future data (under the assumption of
no terrorist attack). Then, they compared the forecasted series with the actual data
to assess the impact of the event. Our first step, therefore, is to split each of the time
series into two parts: pre- and post-September 11. We now concentrate only on the
earlier time series.
a. Create a time plot for the pre-event AIR time series. What time series components
appear from the plot?
b. Figure 18.6 shows a time plot of the seasonally adjusted pre-September-11 AIR
series. Which of the following smoothing methods would be adequate for fore-
casting this series?
• Moving average (with what window width?)
• Simple exponential smoothing
• Holt exponential smoothing
• Holt–Winter’s exponential smoothing
FIGURE 18.6 SEASONALLY ADJUSTED PRE-SEPTEMBER-11 AIR SERIES
http://goo.gl/w2lJPV
PROBLEMS 447
18.2 Relation Between Moving Average and Exponential Smoothing. Assume that
we apply a moving average to a series, using a very short window span. If we wanted
to achieve an equivalent result using simple exponential smoothing, what value should
the smoothing coefficient take?
18.3 Forecasting with a Moving Average. For a given time series of sales, the training
set consists of 50 months. The first 5 months’ data are shown below:
Month Sales
Sept 98 27
Oct 98 31
Nov 98 58
Dec 98 63
Jan 99 59
a. Compute the sales forecast for January 1999 based on a moving average withw = 4.
b. Compute the forecast error for the above forecast.
18.4 Optimizing Holt–Winter’s Exponential Smoothing. The table below shows
the optimal smoothing constants from applying exponential smoothing to data, using
automated model selection:
Level 1.000
Trend 0.000
Seasonality 0.246
a. The value of zero that is obtained for the trend smoothing constant means that
(choose one of the following):
• There is no trend.
• The trend is estimated only from the first two periods.
• The trend is updated throughout the data.
• The trend is statistically insignificant.
b. What is the danger of using the optimal smoothing constant values?
18.5 Department Store Sales. The time plot in Figure 18.7 describes actual quarterly
sales for a department store over a 6-year period (data are available in DepartmentStore-
Sales.csv, data courtesy of Chris Albright).
a. Which of the following methods would not be suitable for forecasting this series?
• Moving average of raw series
• Moving average of deseasonalized series
• Simple exponential smoothing of the raw series
• Double exponential smoothing of the raw series
448 SMOOTHING METHODS
Quarter
S
al
es
4
00
00
6
00
00
8
00
00
10
00
00
2 4 6 8 10 12 14 16 18 20 22 24
FIGURE 18.7 DEPARTMENT STORE QUARTERLY SALES SERIES
• Holt–Winter’s exponential smoothing of the raw series
• Regression model fit to the raw series
• Random walk model fit to the raw series
b. The forecaster was tasked to generate forecasts for 4 quarters ahead. He therefore
partitioned the data such that the last 4 quarters were designated as the valida-
tion period. The forecaster approached the forecasting task by using multiplica-
tive Holt–Winter’s exponential smoothing. The smoothing parameters used were
α = 0.2, β = 0.15, γ = 0.05.
i. Run this method on the data.
ii. The forecasts for the validation set are given in Table 18.3. Compute the MAPE
values for the forecasts of quarters 21 and 22.
TABLE 18.3 FORECASTS FOR VALIDATION SERIES USING
EXPONENTIAL SMOOTHING
Quarter Actual Forecast Error
21 60,800 59,384.56586 1415.434145
22 64,900 61,656.49426 3243.505741
23 76,997 71,853.01442 5143.985579
24 103,337 95,074.69842 8262.301585
c. The fit and residuals from the exponential smoothing are shown in Figure 18.8.
Using all the information thus far, which model is more suitable for forecasting
quarters 21 and 22?
PROBLEMS 449
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
1 3 5 7 9 11 13 15 17 19
S
al
es
($
)
Quarter
Exp. Smoothing: Actual Vs. Forecast (Training Data)
Forecast Actual
–4000
–3000
–2000
–1000
0
1000
2000
3000
4000
5000
6000
1 3 5 7 9 11 13 15 17 19
F
o
re
ca
st
E
rr
o
r
Quarter
Exp. Smoothing Forecast Errors (Training Data)
FIGURE 18.8 FORECASTS AND ACTUALS (TOP) AND FORECAST ERRORS (BOTTOM) USING
EXPONENTIAL SMOOTHING
18.6 Shipments of Household Appliances. The time plot in Figure 18.9 shows the
series of quarterly shipments (in million dollars) of US household appliances between
1985–1989 (data are available in ApplianceShipments.csv, data courtesy of Ken Black).
Time
S
hi
pm
en
ts
(
in
0
00
s)
40
00
42
00
44
00
46
00
48
00
Q1−1985 Q3−1985 Q1−1986 Q3−1986 Q1−1987 Q3−1987 Q1−1988 Q3−1988 Q1−1989 Q3−1989
FIGURE 18.9 QUARTERLY SHIPMENTS OF US HOUSEHOLD APPLIANCES OVER 5 YEARS
a. Which of the following methods would be suitable for forecasting this series if
applied to the raw data?
• Moving average
• Simple exponential smoothing
• Double exponential smoothing
• Holt–Winter’s exponential smoothing
450 SMOOTHING METHODS
b. Apply a moving average with window span w = 4 to the data. Use all but the last
year as the training set. Create a time plot of the moving average series.
i. What does the MA(4) chart reveal?
ii. Use the MA(4) model to forecast appliance sales in Q1-1990.
iii. Use the MA(4) model to forecast appliance sales in Q1-1991.
iv. Is the forecast for Q1-1990 most likely to under-estimate, over-estimate or
accurately estimate the actual sales on Q1-1990? Explain.
v. Management feels most comfortable with moving averages. The analyst there-
fore plans to use this method for forecasting future quarters. What else should
be considered before using the MA(4) to forecast future quarterly shipments of
household appliances?
c. We now focus on forecasting beyond 1989. In the following, continue to use all but
the last year as the training set, and the last four quarters as the validation set. First,
fit a regression model to sales with a linear trend and quarterly seasonality to the
training data. Next, apply Holt–Winter’s exponential smoothing (with the default
smoothing values) to the training data. Choose an adequate “season length.”
i. Compute the MAPE for the validation data using the regression model.
ii. Compute the MAPE for the validation data using Holt–Winter’s exponential
smoothing.
iii. Which model would you prefer to use for forecasting Q1-1990? Give three
reasons.
iv. If we optimize the smoothing parameters in the Holt–Winter’s method, is it
likely to get values that are close to zero? Why or why not?
18.7 Shampoo Sales. The time plot in Figure 18.10 describes monthly sales of a cer-
tain shampoo over a 3-year period (Data are available in ShampooSales.csv, source:
Hyndman, R.J., Time Series Data Library, http://data.is/TSDLdemo. Accessed on
07/25/15.).
100
200
300
400
500
600
700
U
n
it
s
(i
n
0
00
s)
FIGURE 18.10 MONTHLY SALES OF A CERTAIN SHAMPOO
Which of the following methods would be suitable for forecasting this series if applied
to the raw data?
• Moving average
• Simple exponential smoothing
• Double exponential smoothing
• Holt–Winter’s exponential smoothing
http://data.is/TSDLdemo
PROBLEMS 451
18.8 Natural Gas Sales. Figure 18.11 is a time plot of quarterly natural gas sales (in
billions of BTU) of a certain company, over a period of 4 years (data courtesy of
George McCabe). The company’s analyst is asked to use a moving average to forecast
sales in Winter 2005.
120
130
140
150
160
170
180
W
int
er
2
00
1
Sp
rin
g
20
01
Su
m
m
er
2
00
1
Fa
ll 2
00
1
W
int
er
2
00
2
Sp
rin
g
20
02
Su
m
m
er
2
00
2
Fa
ll 2
00
2
W
int
er
2
00
3
Sp
rin
g
20
03
Su
m
m
er
2
00
1
Fa
ll 2
00
3
W
int
er
2
00
4
Sp
rin
g
20
04
Su
m
m
er
2
00
4
Fa
ll 2
00
4
Season
B
ill
io
n
B
T
U
Gas Sales 4 per. Mov. Avg. (Gas Sales)
FIGURE 18.11 QUARTERLY SALES OF NATURAL GAS OVER 4 YEARS
a. Reproduce the time plot with the overlaying MA(4) line.
b. What can we learn about the series from the MA line?
c. Run a moving average forecaster with adequate season length. Are forecasts gener-
ated by this method expected to over-forecast, under-forecast, or accurately forecast
actual sales? Why?
18.9 Australian Wine Sales. Figure 18.12 shows time plots of monthly sales of six types
of Australian wines (red, rose, sweet white, dry white, sparkling, and fortified) for
1980–1994 (Data are available in AustralianWines.csv, source: Hyndman, R.J., Time
Series Data Library, http://data.is/TSDLdemo. Accessed on 07/25/15.). The units
are thousands of litres. You are hired to obtain short term forecasts (2–3 months ahead)
for each of the six series, and this task will be repeated every month.
a. Which forecasting method would you choose if you had to choose the same method
for all series? Why?
b. Fortified wine has the largest market share of the above six types of wine. You
are asked to focus on fortified wine sales alone, and produce as accurate as possible
forecasts for the next 2 months.
• Start by partitioning the data using the period until December 1993 as the
training set.
• Apply Holt–Winter’s exponential smoothing to sales with an appropriate sea-
son length (use the default values for the smoothing constants).
c. Create an ACF plot for the residuals from the Holt–Winter’s exponential smoothing
until lag 12.
i. Examining this plot, which of the following statements are reasonable conclu-
sions?
• Decembers (month 12) are not captured well by the model.
• There is a strong correlation between sales on the same calendar month.
http://data.is/TSDLdemo
452 SMOOTHING METHODS
FIGURE 18.12 MONTHLY SALES OF SIX TYPES OF AUSTRALIAN WINES BETWEEN 1980 AND 1994
• The model does not capture the seasonality well.
• We should try to fit an autoregressive model with lag 12 to the residuals.
• We should first deseasonalize the data and then apply Holt–Winter’s expo-
nential smoothing.
ii. How can you handle the above effect without adding another layer to your
model?
Part VII
Data Analytics
CHAPTER 19
Social Network Analytics1
In this chapter, we examine the basic ways to visualize and describe social net-
works, measure linkages, and analyze the network with both supervised and
unsupervised techniques. The methods we use long predate the Internet, but
gained widespread use with the explosion in social media data. Twitter, for
example, makes its feed available for public analysis, and some other social media
firms make some of their data available to programmers and developers via an
application programming interface (API).
19.1 Introduction2
The use of social media began its rapid growth in the early 2000s with the advent
of Friendster and MySpace, and, in 2004, Facebook. LinkedIn, catering to pro-
fessionals, soon followed, as did Twitter, Tumblr, Instagram, Yelp, TripAdvisor,
and others. These information-based companies quickly began generating a del-
uge of data—especially data concerning links among people (friends, followers,
connections, etc.).
For some companies, like Facebook, Twitter, and LinkedIn, nearly the entire
value of the company lies in the analytic and predictive value of this data from
their social networks. As of this writing (March 2017), Facebook was worth
more than double General Motors and Ford combined. Other companies, like
1The organization of ideas in this chapter owes much to Jennifer Golbeck and her
Analyzing the Social Web. The contribution of Marc Smith, developer and shepherd of NodeXL, is also
acknowledged.
2This and subsequent sections in this chapter copyright © 2017 Datastats, LLC, and Galit Shmueli.
Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
455
456 SOCIAL NETWORK ANALYTICS
Amazon and Pandora, use social network data as important components of pre-
dictive engines aimed at selling products and services.
Social networks are basically entities (e.g., people) and the connections
among them. Let’s look at the building blocks for describing, depicting, and
analyzing networks. The basic elements of a network are:
• Nodes (also called vertices or vertexes)
• Edges (connections or links between nodes)
A very simple LinkedIn network might be depicted as shown in Figure 19.1.
This network has six nodes depicting members, with edges connecting some, but
not all, of the pairs of nodes.
code for plotting hypothetical LinkedIn network
library(igraph)
# define links in data
edges <- rbind(
c("Dave", "Jenny"), c("Peter", "Jenny"), c("John", "Jenny"),
c("Dave", "Peter"), c("Dave", "John"), c("Peter", "Sam"),
c("Sam", "Albert"), c("Peter", "John")
)
# generate and plot graph
# set argument directed = FALSE in graph.edgelist() to plot an undirected graph.
g <- graph.edgelist(edges, directed = FALSE)
plot(g, vertex.size = 1, vertex.label.dist = 0.5)
Dave
Jenny
Peter
John
Sam
Albert
FIGURE 19.1 TINY HYPOTHETICAL LINKEDIN NETWORK; THE EDGES REPRESENT CONNECTIONS
AMONG THE MEMBERS
DIRECTED VS. UNDIRECTED NETWORKS 457
19.2 Directed vs. Undirected Networks
In the graph shown in Figure 19.1, edges are bidirectional or undirected, mean-
ing that if John is connected to Peter, then Peter must also be connected to John,
and there is no difference in the nature of these connections. You can see from
this graph that there is a group of well-connected members (Peter, John, Dave
and Jenny), plus two less-connected members (Sam and Albert).
Connections might also be directional, or directed. For example, in Twitter,
Dave might follow Peter, but Peter might not follow Dave. A simple Twitter
network (using the same members and connections) might be depicted using
edges with arrows as shown in Figure 19.2.
Edges can also be weighted to reflect attributes of the connection. For exam-
ple, the thickness of the edge might represent the level of e-mail traffic between
two members in a network, or the bandwidth capacity between two nodes in
a digital network (as illustrated in Figure 19.3). The length of an edge can also
code for tiny Twitter network
library(igraph)
# generate and plot graph
# set argument directed = TRUE in graph.edgelist() to plot a directed graph.
g <- graph.edgelist(edges, directed = TRUE)
plot(g, vertex.size = 1, vertex.label.dist = 0.5)
Dave
Jenny
Peter
John
Sam
Albert
FIGURE 19.2 TINY HYPOTHETICAL TWITTER NETWORK WITH DIRECTED EDGES (ARROWS)
SHOWING WHO FOLLOWS WHOM
458 SOCIAL NETWORK ANALYTICS
A
B
C
FIGURE 19.3 EDGE WEIGHTS REPRESENTED BY LINE THICKNESS, FOR EXAMPLE, BANDWIDTH
CAPACITY BETWEEN NODES IN A DIGITAL NETWORK
be used to represent attributes such as physical distance between two points on
a map.
19.3 Visualizing and Analyzing Networks
You have probably seen graphs used as a tool for visualizing and exploring net-
works; they are used widely in the news media. Jason Buch and Guillermo Con-
treras, reporters for the San Antonio Express News,3 pored over law enforcement
records and produced the network diagram shown in Figure 19.4 to understand
and illustrate the connections used to launder drug money.4 You can see that
there is a well-connected central node; it is the address of an expensive residence
in the gated Dominion community in San Antonio, owned by accused laun-
derers Mauricio and Alejandro Sanchez Garza. There are several entities in the
lower left connected only to themselves, and one singleton. Nodes are sized
according to how central they are to the network (specifically, in proportion to
their eigenvector centrality, which is discussed in Section 19.4).
Graph Layout
It is important to note that x, y coordinates usually carry no meaning in network
graphs; the meaning is conveyed in other elements such as node size, edge width,
labels, and directional arrows. Consequently, the same network may be depicted
by two very different looking graphs. For example, Figure 19.5 presents two
different layouts of the hypothetical LinkedIn network.
3San Antonio Express News, May 23, 2012, accessed June 16, 2014.
4The original visualization can be seen at www.google.com/fusiontables/DataSource?snapid=
S457047pVkn, accessed June 13, 2014.
http://www.google.com/fusiontables/DataSource?snapid=S457047pVkn
VISUALIZING AND ANALYZING NETWORKS 459
code for plotting the drug money laundries network
library(igraph)
drug.df <- read.csv("Drug.csv")
# convert edges to edge list matrix
edges <- as.matrix(drug.df[, c(1,2)])
g <- graph.edgelist(edges,directed=FALSE)
# plot graph
# nodes' size is proportional to their eigenvector centrality
plot(g, vertex.label = NA, vertex.size = eigen_centrality(g)$vector * 20)
FIGURE 19.4 DRUG LAUNDRY NETWORK IN SAN ANTONIO, TX
As a result, visualization tools face innumerable choices in graph layout. The
first step in making a choice is to establish what principles should govern the
layout. Dunne and Shneiderman (2009, cited in Golbeck, 2013) list these four
graph readability principles:
1. Every node should be visible.
2. For every node, you should be able to count its degree (explained below).
460 SOCIAL NETWORK ANALYTICS
code for plotting different layouts
# Building on the code presented in Figure 19.1
plot(g, layout = layout_in_circle, vertex.size = 1, vertex.label.dist = 0.5)
plot(g, layout = layout_on_grid, vertex.size = 1, vertex.label.dist = 0.5)
Dave
JennyPeter
John
Sam Albert Dave Jenny Peter
John Sam Albert
FIGURE 19.5 TWO DIFFERENT LAYOUTS OF THE TINY LINKEDIN NETWORK PRESENTED IN
FIGURE 19.1
3. For every link, you should be able to follow it from source to destination.
4. Clusters and outliers should be identifiable.
These general principles are then translated into readability metrics by which
graphs can be judged. Two simple layouts are circular (all nodes lie in a circle)
and grid (all nodes lie at the intersection of grid lines in a rectangular grid).
You can probably think of alternate layouts that more clearly reveal structures
such as clusters and singletons, and so can computers using a variety of algo-
rithms. These algorithms typically use a combination of fixed arbitrary starting
structures, random tweaking, analogs to physical properties (e.g., springs con-
necting nodes), and a sequence of iteration and measurement against the read-
ability principles. A detailed discussion of layout algorithms is beyond the scope
of this chapter; see Golbeck (2013, chapter 4) for an introduction to layout issues,
including algorithms, the use of size, shape and color, scaling issues, and labeling.
Edge List
A network graph such as the one in Figure 19.4 (drug laundry network) is always
tied to a data table called an edge list, or an adjacency list. Table 19.1 shows an
excerpt from the data table used to generate Figure 19.4. All the entities in both
columns are nodes, and each row represents a link between the two nodes. If
VISUALIZING AND ANALYZING NETWORKS 461
TABLE 19.1 EDGE LIST EXCERPT CORRESPONDING TO THE DRUG-LAUNDERING NETWORK IN
FIGURE 19.4
6451 Babcock Road Q & M LLC
Q & M LLC 10 Kings Heath
Maurico Sanchez Q & M LLC
Hilda Riebeling Q & M LLC
Ponte Vedra Apartments Q & M LLC
O S F STEAK HOUSE, LLC Mauricio Sanchez
Arturo Madrigal O S F STEAK HOUSE
HARBARD BAR, LLC Arturo Madrigal
10223 Sahara Street O S F STEAK HOUSE
HARBARD BAR, LLC Maurico Sanchez
9510 Tioga Drive, Suite 206 Mauricio Sanchez
FDA FIBER, INC Arturo Madrigal
10223 Sahara Street O S F STEAK HOUSE
A G Q FULL SERVICE, LLC Alvaro Garcia de Quevedo
19510 Gran Roble Arturo Madrigal
Lorenza Madrigal Cristan 19519 Gran Roble
Laredo National Bank 19519 Gran Roble
the network is directional, the link is usually structured from the left column to
the right column.
In a typical network visualization tool, you can select a row from the data
table and see its node and connections highlighted in the network graph. Like-
wise, in the graph, you can click on a node and see it highlighted in the data
table.
Adjacency Matrix
The same relationships can be presented in a matrix. The adjacency matrix for
the small directed graph for Twitter in Figure 19.2 is shown in Table 19.2.
Each cell in the matrix indicates an edge, with the originating node in the
left header column and the destination node in the top row of headers. Reading
the first row, we see that Dave is following three people—Peter, Jenny, and John.
Using Network Data in Classification and Prediction
In our discussion of classification and prediction, as well as clustering and data
reduction, we were dealing mostly with highly structured data in the form of a
TABLE 19.2 ADJACENCY MATRIX EXCERPT CORRESPONDING TO THE
TWITTER DATA IN FIGURE 19.2
Dave Peter Jenny Sam John Albert
Dave 0 1 1 0 1 0
Peter 0 0 1 1 1 0
Jenny 0 0 0 0 0 0
Sam 0 0 0 0 0 1
John 0 1 1 0 0 0
Albert 0 0 0 0 0 0
462 SOCIAL NETWORK ANALYTICS
data frame—columns were variables (features), and rows were records. We saw
how to use R to sample from relational databases to bring data into the form of
a data frame.
Highly structured data can be used for network analysis, but network data
often start out in a more unstructured or semi-structured format. Twitter pro-
vides a public feed of a portion of its voluminous stream of tweets, which has
captured researchers’ attention and accelerated interest in the application of net-
work analytics to social media data. Network analysis can take this unstructured
data and turn it into structured data with usable metrics.
We now turn our attention to those metrics. These metrics can be used not
only to describe the attributes of networks, but as inputs to more traditional data
mining methods.
19.4 Social Data Metrics and Taxonomy
Several popular network metrics are used in network analysis. Before introduc-
ing them, we introduce some basic network terminology used for constructing
the metrics.
Edge weight measures the strength of the relationship between the two
connected nodes. For example, in an e-mail network, there might be an
edge weight that reflects the number of e-mails exchanged between two
individuals linked by that edge.
Path and path length are important for measuring distance between nodes.
A path is the route of nodes needed to go from node A to node B; path
length is the number of edges in that route. Typically these terms refer to the
shortest route. In a weighted graph, the shortest path does not necessarily
reflect the path with the fewest edges, but rather the path with the least
weight. For example, if the weights reflect a cost factor, the shortest path
would reflect the minimal cost.
Connected network A network is connected if each node in the network
has a path, of any length, to all other nodes. A network may be unconnected
in its entirety, but consist of segments that are connected within themselves.
In the money laundering visualization (Figure 19.4), the network as a whole
is unconnected—the nodes are not all connected to one another. You can
see one large connected segment, and, in the lower left, a small connected
segment and a singleton.
A clique is a network in which each node is directly connected by an edge
to every other node. The connections must all be single edges—a connec-
tion via a multi-node path does not count.
SOCIAL DATA METRICS AND TAXONOMY 463
A singleton is an unconnected node. It might arise when an individual
signs up for a social network service (e.g., to read reviews) and does not
participate in any networking activities.
Node-Level Centrality Metrics
Often we may be interested in the importance or influence of a particular indi-
vidual or node, which is reflected in how central that node is in the network.
The most common way to calculate this is by degree—how many edges
are connected to the node. Nodes with many connections are more central. In
Figure 19.1, the Albert node is of degree 1, Sam of degree 2, and Jenny of degree
3. In a directed network, we are interested in both indegree and outdegree—the
number of incoming and outgoing connections of a node. In Figure 19.2, Peter
has indegree of 2, and outdegree of 1.
Another metric for a node’s centrality is closeness—how close the node is to
the other nodes in the network. This is measured by finding the shortest path
from that node to all the other nodes, then taking the average path length.
Still another metric is betweenness—the extent to which a given node lies on
the shortest path between pairs of nodes. The calculation starts with the given
node, say, node A, and two other nodes, say B and C, out of perhaps many nodes
in a network. The shortest paths between B and C are listed, and the proportion
of paths that include A is recorded. This proportion is recorded also for all other
nodal pairs, and betweenness is the average proportion.
An aphorism relevant for social media networks is “it’s not what you know,
but who you know.” A more accurate rendition would qualify it further—“it’s
who you know and who they know.” A link to a member that has many other
connections can be more valuable than a link to a member with few connections.
A metric that measures this connectivity aspect is eigenvector centrality, which fac-
tors in both the number of links from a node and the number of onward con-
nections from those links. The details of the calculation is not discussed here,
but the result always lies between 0 (not central) and 1 (maximum centrality).
Centrality can be depicted on a network graph by the size of a node—the
more central the node, the larger it is.
Code for computing centrality measures in R for the small directed LinkedIn
data is given in Table 19.3.
Egocentric Network
It is often important to gain information that comes only from the analysis of
individuals and their connections. For example, an executive recruiting firm may
be interested in individuals with certain job titles and the people those individuals
are connected to.
464 SOCIAL NETWORK ANALYTICS
TABLE 19.3 COMPUTING CENTRALITY IN R
> degree(g)
Dave Jenny Peter John Sam Albert
3 3 4 3 2 1
> betweenness(g)
Dave Jenny Peter John Sam Albert
0 0 2 0 2 0
> closeness(g)
Dave Jenny Peter John Sam Albert
0.12500000 0.03333333 0.09090909 0.04000000 0.04000000 0.03333333
> eigen_centrality(g)
$vector
Dave Jenny Peter John Sam Albert
0.9119867 0.9119867 1.0000000 0.9119867 0.3605471 0.1164367
An egocentric network is the network of connections centered around an
individual node. A degree 1 egocentric network consists of all the edges con-
nected to the individual node, plus their connections. A degree 2 egocentric
network is the network of all those nodes and edges, plus the edges and nodes
connected to them. The degree 1 and degree 2 egocentric network for Peter in
the LinkedIn graph are shown in Figure 19.6. Note that the degree 2 egocentric
network for Peter is the entire graph shown in Figure 19.1.
code for computing egocentric network
# get Peter’s 1-level ego network
# for a 2-level ego network set argument order = 2 in make_ego_graph().
peter.ego <- make_ego_graph(g, order = 1, nodes = "Peter")
plot(peter.ego[[1]], vertex.size = 1, vertex.label.dist = 0.5)
Dave
Jenny
Peter
John
Sam
Dave
Jenny
Peter
John
Sam
Albert
FIGURE 19.6 THE DEGREE 1 (LEFT) AND DEGREE 2 (RIGHT) EGOCENTRIC NETWORKS FOR
PETER, FROM THE LINKEDIN GRAPH IN FIGURE 19.1
SOCIAL DATA METRICS AND TAXONOMY 465
Network Metrics
To this point, we have discussed metrics and terms that apply to nodes and edges.
We can also measure attributes of the network as a whole. Two main network
metrics are degree distribution and density.
Degree distribution describes the range of connectedness of the nodes—
how many nodes have (for example) 5 connections, how many have 4 con-
nections, how many have 3, etc. In the tiny LinkedIn network (Figure 19.1),
we see that Peter, Jenny, and Dave have three connections, John and Sam
have 2 connections, and Albert has one. A table of this degree distribution
is shown in Table 19.4.
Density is another way to describe the overall connectedness of a graph
which focuses on the edges, not the nodes. The metric looks at the ratio
of the actual number of edges to the maximum number of potential edges
(i.e., if every node were connected to every other node) in a network with
a fixed number of nodes. For a directed network with n nodes, there can be
a maximum of n(n− 1) edges. For an undirected network, the number is
n(n−1)/2. More formally, density calculations for directed and undirected
networks are as follows:
density (directed) =
e
n(n− 1)
(19.1)
density (undirected) =
e
n(n− 1)/2
(19.2)
where e is the number of edges, and n is the number of nodes. This metric
ranges between just above 0 (not dense at all) and 1 (as dense as possible).
Figures 19.7 and 19.8 illustrate a sparse and dense network, respectively.
Code for computing network measures in R for the small LinkedIn data is
given in Table 19.5.
TABLE 19.4 DEGREE DISTRIBUTION
OF THE TINY LINKEDIN
NETWORK
Degree Frequency
Degree 0 0
Degree 1 1
Degree 2 1
Degree 3 3
Degree 4 1
466 SOCIAL NETWORK ANALYTICS
FIGURE 19.7 A RELATIVELY SPARSE NETWORK
FIGURE 19.8 A RELATIVELY DENSE NETWORK
TABLE 19.5 COMPUTING CENTRALITY IN R
> degree.distribution(g) # normalized
[1] 0.0000000 0.1666667 0.1666667 0.5000000 0.1666667
> edge_density(g)
[1] 0.2666667
USING NETWORK METRICS IN PREDICTION AND CLASSIFICATION 467
19.5 Using Network Metrics in Prediction
and Classification
Network attributes can be used along with other predictors in standard classifi-
cation and prediction procedures. The most common applications involve the
concept of matching. Online dating services, for example, will predict for their
members which other members might be potentially compatible. Their algo-
rithms typically involve calculation of a distance measure between a member
seeking a relationship and candidate matches. It might also go beyond the mem-
bers’ self-reported features and incorporate information about links between the
member and candidate matches. A link might represent the action “viewed
candidate profile.”
Link Prediction
Social networks such as Facebook and LinkedIn use network information to rec-
ommend new connections. The translation of this goal into an analytics problem
is:
“If presented with a network, can you predict the next link to form?”
Prediction algorithms list all possible node pairs, then assign a score to each
pair that reflects the similarity of the two nodes. The pair that scores as most
similar (closest) is the next link predicted to form, if it does not already exist.
See Chapter 15 for a discussion of such distance measures. Some variables used
in calculating similarity measures are the same as those based on non-network
information (e.g., years of education, age, sex, location). Other metrics used in
link prediction apply specifically to network data:
• shortest path
• number of common neighbors
• edge weight
Link prediction is also used in targeting intelligence surveillance. “Collecting
everything” may be technically, politically, or legally unfeasible, and an agency
must therefore identify apriori a smaller set of individuals requiring surveillance.
The agency will often start with known targets, then use link prediction to
identify additional targets and prioritize collection efforts.
Entity Resolution
Governments use network analysis to track terrorist networks, and a key part
of that effort is identification of individuals. The same individual may appear
multiple times from different data sources, and the agencies want to know, for
example, whether individual A identified by French surveillance in Tunisia is the
468 SOCIAL NETWORK ANALYTICS
same as individual AA identified by Israeli intelligence in Lebanon and individual
AAA identified by US intelligence in Pakistan.
One way to evaluate whether an individual appears in multiple databases is
to measure distances and use them in a similar fashion to nearest neighbors or
clustering. In Chapter 15, we looked in particular at Euclidean distance, and
discussed this metric not in terms of the network an individual belongs to, but
rather in terms of the profile (predictor values) of the individual. When basing
entity resolution on these variables, it is useful to bring domain knowledge
into the picture to weight the importance of each variable. For example, two
variables in an individual’s record might be street address and zip code. A match
on street address is more definitive than a match on zip code, so we would
probably want to give street address more weight in the scoring algorithm. For
a more detailed discussion of automated weight calculation and assignment, see
p. 137 in Golbeck (2013).
In addition to measuring distance based on individual profiles, we can bring
network attributes into the picture. Consider the simple networks for each indi-
vidual in Figure 19.9, showing connections to known individuals: Based on
network connections, you would conclude that A and AA are likely the same
person, while AAA is probably a different person. The metrics that can formal-
ize this search, and be used in automated fashion where human-intermediated
visualization is not practical, are the same ones that are used in link prediction.
Entity resolution is also used extensively in customer record management
and search. For example, a customer might contact a company inquiring about
a product or service, triggering the creation of a customer record. The same cus-
tomer might later inquire again, or purchase something. The customer database
system should flag the second interaction as belonging to the first customer. Ide-
ally, the customer will enter his or her information exactly the same way in each
interaction, facilitating the match, but this does not necessarily happen. Failing
an exact match, other customers may be proposed as matches based on proximity.
Another area where entity resolution is used is fraud detection. For exam-
ple, a large telecom used link resolution to detect customers who “disappeared”
after accumulating debt but then reappeared by opening a new account. The
network of phone calls to and from such people tends to remain stable, assisting
the company to identify them.
In traditional business operations, the match may be based on variables such
as name, address and postal code. In social media products, matches may also be
calculated on the basis of network similarity.
Collaborative Filtering
We saw in Chapter 14 that collaborative filtering uses similarity metrics to iden-
tify similar individuals, and thereby develop recommendations for a particular
USING NETWORK METRICS IN PREDICTION AND CLASSIFICATION 469
FIGURE 19.9 THE NETWORKS FOR SUSPECT A (TOP), SUSPECT AA (MIDDLE), AND SUSPECT
AAA (BOTTOM)
470 SOCIAL NETWORK ANALYTICS
individual. Companies that have a social media component to their business can
use information about network connections to augment other data in measuring
similarity.
For example, in a company where Internet advertising revenue is important,
a key question is what ads to show consumers.
Consider the following small illustration for a company whose business is
centered around online users: User A has just logged on, and is to be compared
to users B–D. Table 19.6 shows some demographic and user data for each of the
users.
TABLE 19.6 FOUR MEASUREMENTS FOR USERS A, B, C, AND D
User Months as Cust. Age Spending Education
A 7 23 0 3
B 3 45 0 2
C 5 29 100 3
D 11 59 0 3
Our initial step is to compute distances based on these values, to determine
which user is closest to user A. First, we convert the raw data to normalized
values to place all the measurements on the same scale (for Education, 1 = high
school, 2 = college, 3 = post college degree). Normalizing means subtracting
the mean and dividing by the standard deviation. The normalized data are shown
in Table 19.7.
Next, we calculate the Euclidean distance between A and each of the other
users (see Table 19.8). Based on these calculations, which only take into account
the demographic and user data, user C is the closest one to the new user A.
TABLE 19.7 NORMALIZED MEASUREMENTS FOR USERS A, B, C,
AND D
User Months as Cust. Age Spending Education
A 0.17 −1.14 −0.58 0.58
B −1.18 0.43 −0.58 −1.73
C −0.51 −0.71 1.73 0.58
D 1.52 1.42 −0.58 0.58
TABLE 19.8 EUCLIDEAN DISTANCE BETWEEN EACH PAIR OF USERS
Pair Months as Cust. Age Spending Education Euclidean Dist.
A-B 1.83 2.44 0 5.33 3.1
A-C 0.46 0.18 5.33 0 2.44
A-D 1.83 6.55 0 0 2.89
COLLECTING SOCIAL NETWORK DATA WITH R 471
TABLE 19.9 NETWORK METRICS
Pair Shortest Path
A-B 2
A-C 4
A-D 3
Let’s now bring in network metrics, and suppose that the inter-user distances
in terms of shortest path are A-B = 2, A-C = 4, and A-D = 3 (Table 19.9).
Finally, we can combine this network metric with the other user measure-
ment distances calculated earlier. There is no need to calculate and normalize
differences, as we did with the other user measurements, since the shortest-path
metric itself already measures distance between records. We therefore take a
weighted average of the network and non-network metrics, using weights that
reflect the relative importance we want to attach to each. Using equal weights
(Table 19.10), user B is scored as the closest to A, and could be recommended
as a link for user A (in a social environment), or could be used as a source for
product and service recommendations. With different weights, it is possible to
obtain different results. Choosing weights is not a scientific process; rather, it
depends on the business judgment about the relative importance of the network
metrics vs. the non-network user measurements.
TABLE 19.10 COMBINING THE NETWORK AND NON-
NETWORK METRICS
Pair Shortest Path Weight Non-network Weight Mean
A-B 2 0.5 3.1 0.5 2.55
A-C 4 0.5 2.44 0.5 3.22
A-D 3 0.5 2.89 0.5 2.95
In addition to recommending social media connections and providing data
for recommendations, network analysis has been used for identifying clusters
of similar individuals based on network data (e.g., for marketing purposes),
identifying influential individuals (e.g., to target with promotions or outreach),
and understanding—in some cases, controlling—the propagation of disease and
information.
19.6 Collecting Social Network Data with R
In this section, we briefly show how to collect data from two of the most used
social networks: Twitter and Facebook. Interfaces to these social networks
are provided in R via two packages: TwitteR and Rfacebook. Both require
472 SOCIAL NETWORK ANALYTICS
pre-registration as a developer, to obtain application authorization codes:
• Before crawling for Twitter data, create a new Twitter application and
obtain a developer key and secret code from https://apps.twitter.com/.
For more information on the Twitter interface in R, see https://cran.
r-project.org/web/packages/twitteR/twitteR
• Before crawling for Facebook data, create a new Facebook appli-
cation and obtain a application id and secret number from https://
developers.facebook.com/apps. For more information on the Facebook
interface in R, see https://cran.r-project.org/web/packages/Rfacebook/
Rfacebook .
Tables 19.11 and 19.12 provide simple examples for the TwitteR and
Rfacebook packages. In Table 19.11, the code collects data on 25 recent tweets
that contain the words “text mining.” The output shows the content of the
resulting first two tweets. In Table 19.12, the code collects information about
the Facebook page “dataminingbook” and on the 25 most recent posts on that
page. The partial output shows the page details, as well as the number of com-
ments, shares, and likes for one post.
TABLE 19.11 TWITTER INTERFACE IN R
code for Twitter interface
library(twitteR)
# replace key and secret number with those you obtained from Twitter
setup_twitter_oauth(consumer_key = “XXX”, consumer_secret = “XXX”)
# get recent tweets
recent.25.tweets <- searchTwitter("text mining", resultType="recent", n = 25)
Output
> recent.25.tweets
[[1]]
[1] “Solvonauts: Image from ‘A Text-Book of Coal-Mining … Second edition, etc’,
001758763 : Image from ‘A Tex… https://t.co/8DMuZuSYNY #randomoer”
[[2]]
[1] “Apple_News1: Call Center Text mining New- Complete Text mining – Resources
https://t.co/YLntkKGhro
https://apps.twitter.com/
https://cran.r-project.org/web/packages/twitteR/twitteR
https://developers.facebook.com/apps
https://developers.facebook.com/apps
https://cran.r-project.org/web/packages/Rfacebook/Rfacebook
https://t.co/8DMuZuSYNY
https://t.co/YLntkKGhro
COLLECTING SOCIAL NETWORK DATA WITH R 473
TABLE 19.12 FACEBOOK INTERFACE IN R
code for Facebook interface
library(Rfacebook)
# replace the app id and secret number with those you obtained from Facebook
fb_oauth <- fbOAuth(app_id = "XXX", app_secret = "XXX")
fb_oauth_credentials <- fromJSON(names(fb_oauth$credentials))
# get recent posts on page "dataminingbook"
fb_page <- getPage(page = "dataminingbook", token = fb_oauth_credentials$access_token)
# a facebook page contains the following information:
t(t(names(fb_page)))
fb_page[1,]
# get information about most recent post
post <- getPost(post=fb_page$id[1], n=20, token=fb_oauth_credentials$access_token)
post$likes
post$comments
Ouput
> t(t(names(fb_page)))
[,1]
[1,] “from_id”
[2,] “from_name”
[3,] “message”
[4,] “created_time”
[5,] “type”
[6,] “link”
[7,] “id”
[8,] “likes_count”
[9,] “comments_count”
[10,] “shares_count
> fb_page[1,]
from_id from_name
1 130888926966051 Data Mining for Business Analytics
message
1 Our book is being used in over 300 courses around the world.
Check out the Google map at\nhttp://www.dataminingbook.com/map
created_time type
1 2016-07-28T06:39:57+0000 photo
link
1 https://www.facebook.com/dataminingbook/photos/
a.307011686020440.82052.130888926966051/1056786057709662/?type=3
id likes_count
1 130888926966051_1062626980458903 2
comments_count shares_count
1 0 0
nhttp://www.dataminingbook.com/map
https://www.facebook.com/dataminingbook/photos/
474 SOCIAL NETWORK ANALYTICS
19.7 Advantages and Disadvantages
The key value of social network data to businesses is the information it provides
about individuals and their needs, desires, and tastes. This information can be
used to improve target advertising—and perfectly targeted advertising is the holy
grail of the advertising business. People often disdain or ignore traditional “broad
spectrum” advertising, whereas they pay close attention when presented with
information about something specific that they are interested in. The power of
network data is that they allow capturing information on individual needs and
tastes without measuring or collecting such data directly. Moreover, network
data are often user-provided rather than actively collected.
To see the power of social network data, one need look no further than the
data-based social media giants of the 21st century—Facebook, LinkedIn, Twit-
ter, Yelp, and others. They have built enormous value based almost exclusively
on the information contained in their social data—they manufacture no prod-
ucts and sell no services (in the traditional sense) to their users. The main value
they generate is the ability to generate real-time information at the level of the
individual to better target ads.
Other companies saw this value, and have tried to add a social component
to what they produce. Google added a variety of social and sharing facilities
with Google+. Amazon has experimented with adding a social component to its
recommendation system.
It is important to distinguish between social network engagement and the
use of social network analytics. Many organizations have social media policies
and use social media as part of their communications and advertising strategies;
however, this does not mean they have access to detailed social network data
that they can use for analysis. Abrams Research, an Internet marketing agency,
cited the edgy online retailer Zappos in 2009 for the best use of social media, but
Zappos’ orientation was engagement—use of social media for product support
and customer service—rather than analytics.
Reliance on social network analytics comes with hazards and challenges.
One major hazard is the dynamic, faddish, and somewhat ephemeral nature of
social media use. In part this is because social media involvement is relatively
new, and the landscape is changing with the arrival of new players and new
technologies. It also stems from the essential nature of social media. People use
social media not to provide essential needs like food and shelter, but as a voluntary
avocation to provide entertainment and interaction with others. Tastes change,
and what’s in and out of fashion shifts rapidly in these spheres.
Facebook was a pioneer, and its initial appeal was to the college crowd and
later to young adults. Eight years later, nearly half its users were of age 45 years
or older, significantly higher than in the rest of the social media industry. These
ADVANTAGES AND DISADVANTAGES 475
users spend more time per visit and are wealthier than college students, so Face-
book may be benefiting from this demographic trend. However, this rapid shift
shows how fast the essentials of their business model can shift.
Other challenges lie in the public and personal nature of the data involved.
Although individuals nearly always engage with social media voluntarily, this
does not mean they do so responsibly or with knowledge of the potential con-
sequences. Information posted on Facebook became evidence in 33 % of US
divorce cases, and numerous cases have been cited of individuals being fired
because of information they posted on Facebook.
One enterprising programmer created the website pleaserobme.com (since
removed) that listed the real-time location of people who had “checked in” via
FourSquare to a cafe or restaurant, thus letting the world know that they were
not at home. FourSquare was not making this information easily available to
hackers, but the check-ins were auto-posted to Twitter, where they could be
harvested via an API (application programming interface). Thus, information
collected for the purpose of enriching the “targetability” of advertising to users
ended up creating a liability for both FourSquare and Twitter. Twitter’s liability
came about indirectly, and it would have been easy for Twitter to overlook this
risk in setting up their connection with FourSquare.
In summary, despite the potential for abuse and exposure to risk, the value
provided by social network analytics seems large enough to ensure that the ana-
lytics methods outlined in this chapter will continue to be in demand.
476 SOCIAL NETWORK ANALYTICS
PROBLEMS
19.1 Describing a Network. Consider an undirected network for individuals A, B, C,
D, and E. A is connected to B and C. B is connected to A and C. C is connected to
A, B, and D. D is connected to C and E. E is connected to D.
a. Produce a network graph for this network.
b. What node(s) would need to be removed from the graph for the remaining nodes
to constitute a clique?
c. What is the degree for node A?
d. Which node(s) have the lowest degree?
e. Tabulate the degree distribution for this network.
f. Is this network connected?
g. Calculate the betweenness centrality for nodes A and C.
h. Calculate the density of the network.
19.2 Network Density and Size. Imagine that two new nodes are added to the undi-
rected network in the previous exercise.
a. By what percent has the number of nodes increased?
b. By what percent has the number of possible edges increased?
c. Suppose the new node has a typical (median) number of connections. What will
happen to network density?
d. Comment on comparing densities in networks of different sizes.
e. Tabulate the degree distribution for this network.
19.3 Link Prediction. Consider the network shown in Figure 19.10.
a. Using the number of common neighbors score, predict the next link to form (that
is, suggest which new link has the best chance of success).
b. Using the shortest path score, identify the link that is least likely to form.
19.4 Startup Market Exploration. Social media has given a boost to the artisan and
craft trades. Many people embrace pastimes such as beer brewing, bread making, and
pottery making. Social media provides both a mechanism for a diaspora community
to form online, and a medium for instructions and advice to spread quickly. It also
provides a vehicle for new ideas, products, and services to go viral. This has opened
up opportunities for entrepreneurs and established companies to provide equipment
and supplies to the artisanal community, and become small-medium scale producers
themselves. Before the advent of social media, the high cost of advertising and market-
ing confined most artisans to the ranks of dedicated hobbyists. Now the availability of
social media as a marketing tool brings commercialization and expansion within reach
of hundreds of thousands of would-be entrepreneurs, and their financial backers. And
because social media has become the communication vehicle of choice for many in
the artisan community, it is an important channel for established supply companies
whose marketing budgets are not similarly constrained.
In the following problems, use social network graph visualizations and metrics
along with inspection of the data, to learn something about the network. This is an
exploratory task, with no correct or incorrect answers.
PROBLEMS 477
FIGURE 19.10 NETWORK FOR LINK PREDICTION EXERCISE
a. Use R’s Rfacebook package to import data from a Facebook page on a topic of
your choice. Note: With Facebook (and Twitter), choosing a topic that is too
popular risks submerging the interesting information in a sea of irrelevant chatter.
• Get the 50 most recent posts by setting n = 50. Collect the list of commenters on
these 50 posts.
• Now collect 50 more posts from the same Facebook page, this time from exactly
1 year ago. Use arguments since and until. Collect the list of commenters on these
50 posts.
Note: Be careful not to send too many frequent requests; otherwise, Facebook will
require you to wait 1 hour until you can again download data.
i. Who are the influential users in each of the two samples? Report the metrics
you used.
ii. Compare the list of influential users in the two samples to detect loyal users.
iii. Select the three most influential commenters in each sample and collect infor-
mation on these users (using function getUsers).
iv. Can you discern and describe marketing campaigns based on this analysis?
b. Explore whether similar explorations are useful for other products and services that
you are familiar with, such as consulting, software, and mobile apps.
CHAPTER 20
Text Mining
In this chapter, we introduce text as a form of data. First we discuss a tabular rep-
resentation of text data in which each column is a word, each row is a document,
and each cell is a 0 or 1, indicating whether that column’s word is present in that
row’s document. Then we consider how to move from unstructured documents
to this structured matrix. Finally, we illustrate how to integrate this process into
the standard data mining procedures covered in earlier parts of the book.
20.1 Introduction1
Up to this point, and in data mining in general, we have been dealing with three
types of data:
• numerical
• binary (yes/no)
• multicategory
In some common predictive analytics applications, though, data come in
text form. An Internet service provider, for example, might want to use an
automated algorithm to classify support tickets as urgent or routine, so that the
urgent ones can receive immediate human review. A law firm facing a mas-
sive discovery process (review of large numbers of documents) would benefit
from a document review algorithm that could classify documents as relevant or
1This and subsequent sections in this chapter copyright © 2017 Datastats, LLC, and Galit Shmueli.
Used by permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
479
480 TEXT MINING
irrelevant. In both of these cases, the predictor variables (features) are embedded
as text in documents.
Text mining methods have gotten a boost from the availability of social
media data—the Twitter feed, for example, as well as blogs, online forums,
review sites, and news articles. According to the Rexer Analytics 2013 Data
Mining Survey, between 25% and 40% of data miners responding to the sur-
vey use data mining with text from these sources. Their public availability and
high profile has provided a huge repository of data on which researchers can
hone text mining methods. One area of growth has been the application of
text mining methods to notes and transcripts from contact centers and service
centers.
20.2 The Tabular Representation of Text:
Term-Document Matrix and
“Bag-of-Words”
Consider the following three sentences:
S1. this is the first sentence.
S2. this is a second sentence.
S3. the third sentence is here.
We can represent the words (called terms) in these three sentences (called
documents) in a term-document matrix, where each row is a word and each column
is a sentence. In R, converting a set of documents into a term-document matrix
can be done by first collecting the documents into a collection (using function
corpus) and then using function tdm() in the tm package. Table 20.1 illustrates
this process for the three-sentence example.
Note that all the words in all three sentences (except “is,” which is removed
by R’s TermDocumentMatrix function) are represented in the matrix and each
word has exactly one row. Order is not important, and the number in the cell
indicates the frequency of that term in that sentence. This is the bag-of-words
approach, where the document is treated simply as a collection of words in which
order, grammar, and syntax do not matter.
The three sentences have now been transformed into a tabular data for-
mat just like those we have seen to this point. In a simple world, this binary
matrix could be used in clustering, or, with the appending of an outcome vari-
able, for classification or prediction. However, the text mining world is not a
simple one. Even confining our analysis to the bag-of-words approach, con-
siderable thinking and preprocessing may be required. Some human review of
BAG-OF-WORDS VS. MEANING EXTRACTION AT DOCUMENT LEVEL 481
TABLE 20.1 TERM-DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1–S3
code for term frequency
library(tm)
# define vector of sentences (“docs”)
text <- c("this is the first sentence",
"this is a second sentence",
"the third sentence is here")
# convert sentences into a corpus
corp <- Corpus(VectorSource(text))
# compute term frequency
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
Output
> inspect(tdm)
<
Non-/sparse entries: 12/9
Sparsity : 43%
Maximal term length: 8
Weighting : term frequency (tf)
Docs
Terms 1 2 3
first 1 0 0
here 0 0 1
second 0 1 0
sentence 1 1 1
the 1 1 1
third 0 0 1
this 1 1 0
the documents, beyond simply classifying them for training purposes, may be
indispensable.
20.3 Bag-of-Words vs. Meaning Extraction at
Document Level
We can distinguish between two undertakings in text mining:
• Labeling a document as belonging to a class, or clustering similar
documents
• Extracting more detailed meaning from a document
482 TEXT MINING
The first goal requires a sizable collection of documents, or a corpus,2 the ability
to extract predictor variables from documents, and for the classification task, lots
of pre-labeled documents to train a model. The models that are used, though,
are the standard statistical and machine learning predictive models that we have
already dealt with for numerical and categorical data.
The second goal might involve a single document, and is much more ambi-
tious. The computer must learn at least some version of the complex “algo-
rithms” that make up human language comprehension: grammar, syntax, punc-
tuation, etc. In other words, it must undertake the processing of a natural (that
is, noncomputer) language to understand documents in that language. Under-
standing the meaning of one document on its own is a far more formidable task
than probabilistically assigning a class to a document based on rules derived from
hundreds or thousands of similar documents.
For one thing, text comprehension requires maintenance and consideration
of word order. “San Francisco beat Boston in last night’s baseball game” is very
different from “Boston beat San Francisco in last night’s baseball game.”
Even identical words in the same order can carry different meanings, depend-
ing on the cultural and social context: “Hitchcock shot The Birds in Bodega
Bay,” to an avid outdoors person indifferent to capitalization and unfamiliar with
Alfred Hitchcock’s films, might be about bird hunting. Ambiguity resolution is
a major challenge in text comprehension—does “bot” mean “bought,” or does
it refer to robots?
Our focus will remain with the overall focus of the book and the easier
goal—probabilistically assigning a class to a document, or clustering similar doc-
uments. The second goal—deriving understanding from a single document—is
the subject of the field of natural language processing (NLP).
20.4 Preprocessing the Text
The simple example we presented had ordinary words separated by spaces, and a
period to denote the end of the each sentence. A fairly simple algorithm could
break the sentences up into the word matrix with a few rules about spaces and
periods. It should be evident that the rules required to parse data from real
world sources will need to be more complex. It should also be evident that
2The term “corpus” is often used to refer to a large, fixed standard set of documents that can be used
by text preprocessing algorithms, often to train algorithms for a specific type of text, or to compare
the results of different algorithms. A specific text mining setting may rely on algorithms trained on a
corpus specially suited to that task. One early general-purpose standard corpus was the Brown corpus
of 500 English language documents of varying types (named Brown because it was compiled at Brown
University in the early 1960s).
PREPROCESSING THE TEXT 483
the preparation of data for text mining is a more involved undertaking than the
preparation of numerical or categorical data for predictive models. For example,
consider the modified example in Table 20.2, based on the following sentences:
S1. this is the first sentence!!
S2. this is a second Sentence 🙂
S3. the third sentence, is here
S4. forth of all sentences
This set of sentences has extra spaces, non-alpha characters, incorrect capitaliza-
tion, and a misspelling of “fourth.”
TABLE 20.2 TERM-DOCUMENT MATRIX REPRESENTATION OF WORDS IN SENTENCES S1–S4
(EXAMPLE 2)
code for term frequency of second example
library(tm)
text <- c("this is the first sentence!!",
"this is a second Sentence :)",
"the third sentence, is here",
"forth of all sentences")
corp <- Corpus(VectorSource(text))
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
Output
> inspect(tdm)
<
Non-/sparse entries: 14/34
Sparsity : 71%
Maximal term length: 10
Weighting : term frequency (tf)
Docs
Terms 1 2 3 4
all 0 0 0 1
first 1 0 0 0
forth 0 0 0 1
here 0 0 1 0
second 0 1 0 0
sentence 0 1 0 0
sentence!! 1 0 0 0
sentence, 0 0 1 0
sentences 0 0 0 1
the 1 0 1 0
third 0 0 1 0
this 1 1 0 0
484 TEXT MINING
Tokenization
Our first simple data set was composed entirely of words found in the dictio-
nary. A real set of documents will have more variety—it will contain numbers,
alphanumeric strings like date stamps or part numbers, web and e-mail addresses,
abbreviations, slang, proper nouns, misspellings, and more.
Tokenization is the process of taking a text and, in an automated fashion,
dividing it into separate “tokens” or terms. A token (term) is the basic unit of
analysis. A word separated by spaces is a token. 2+3 would need to be separated
into three tokens, while 23 would remain as one token. Punctuation might also
stand as its own token (for example, the @ symbol). These tokens become the
row headers in the data matrix. Each text mining software program will have its
own list of delimiters (spaces, commas, colons, etc.) that it uses to divide up the
text into tokens. In Table 20.3, we use R’s tm_map() to remove white space and
punctuation marks from the four sentences shown in Table 20.2.
For a sizeable corpus, tokenization will result in a huge number of variables—
the English language has over a million words, let alone the non-word terms that
TABLE 20.3 TOKENIZATION OF S1–S4 EXAMPLE
code for tokenization
# tokenization
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
Output
> inspect(tdm)
<
Non-/sparse entries: 14/26
Sparsity : 65%
Maximal term length: 9
Weighting : term frequency (tf)
Docs
Terms 1 2 3 4
all 0 0 0 1
first 1 0 0 0
forth 0 0 0 1
here 0 0 1 0
second 0 1 0 0
sentence 1 1 1 0
sentences 0 0 0 1
the 1 0 1 0
third 0 0 1 0
this 1 1 0 0
PREPROCESSING THE TEXT 485
will be encountered in typical documents. Anything that can be done in the
preprocessing stage to reduce the number of terms will aid in the analysis. The
initial focus is on eliminating terms that simply add bulk and noise.
Some of the terms that result from the initial parsing of the corpus might
not be useful in further analyses and can be eliminated in the preprocessing
stage. For example, in a legal discovery case, one corpus of documents might be
e-mails, all of which have company information and some boilerplate as part of
the signature. These terms might be added to a stopword list of terms that are to
be automatically eliminated in the preprocessing stage.
Text Reduction
Most text-processing software (the tm package in R included) come with a
generic stopword list of frequently occurring terms to be removed. If you review
R’s stopword list during preprocessing stages, you will see that it contains a large
number of terms to be removed (Table 20.4 shows the first 174 stopwords). You
can add additional terms or remove existing terms from the generic list.
TABLE 20.4 STEPWORDS IN R
> stopwords(“english”)
[1] “i” “me” “my” “myself” “we” “our”
[7] “ours” “ourselves” “you” “your” “yours” “yourself”
[13] “yourselves” “he” “him” “his” “himself” “she”
[19] “her” “hers” “herself” “it” “its” “itself”
[25] “they” “them” “their” “theirs” “themselves” “what”
[31] “which” “who” “whom” “this” “that” “these”
[37] “those” “am” “is” “are” “was” “were”
[43] “be” “been” “being” “have” “has” “had”
[49] “having” “do” “does” “did” “doing” “would”
[55] “should” “could” “ought” “i’m” “you’re” “he’s”
[61] “she’s” “it’s” “we’re” “they’re” “i’ve” “you’ve”
[67] “we’ve” “they’ve” “i’d” “you’d” “he’d” “she’d”
[73] “we’d” “they’d” “i’ll” “you’ll” “he’ll” “she’ll”
[79] “we’ll” “they’ll” “isn’t” “aren’t” “wasn’t” “weren’t”
[85] “hasn’t” “haven’t” “hadn’t” “doesn’t” “don’t” “didn’t”
[91] “won’t” “wouldn’t” “shan’t” “shouldn’t” “can’t” “cannot”
[97] “couldn’t” “mustn’t” “let’s” “that’s” “who’s” “what’s”
[103] “here’s” “there’s” “when’s” “where’s” “why’s” “how’s”
[109] “a” “an” “the” “and” “but” “if”
[115] “or” “because” “as” “until” “while” “of”
[121] “at” “by” “for” “with” “about” “against”
[127] “between” “into” “through” “during” “before” “after”
[133] “above” “below” “to” “from” “up” “down”
[139] “in” “out” “on” “off” “over” “under”
[145] “again” “further” “then” “once” “here” “there”
[151] “when” “where” “why” “how” “all” “any”
[157] “both” “each” “few” “more” “most” “other”
[163] “some” “such” “no” “nor” “not” “only”
[169] “own” “same” “so” “than” “too” “very”
486 TEXT MINING
Additional techniques to reduce the volume of text (“vocabulary reduction”)
and to focus on the most meaningful text include:
• Stemming, a linguistic method that reduces different variants of words to
a common core.
• Frequency filters can be used to eliminate either terms that occur in a
great majority of documents or very rare terms. Frequencty filters can
also be used to limit the vocabulary to the n most frequent terms.
• Synonyms or synonymous phrases may be consolidated.
• Letter case (uppercase/lowercase) can be ignored.
• A variety of specific terms in a category can be replaced with the cat-
egory name. This is called normalization. For example, different e-mail
addresses or different numbers might all be replaced with “emailtoken”
or “numbertoken.”
Table 20.5 presents the text reduction step applied to the four sentences
example, after tokenization. We can see the number of terms has been reduced
to five.
TABLE 20.5 TEXT REDUCTION OF S1–S4 (AFTER TOKENIZATION)
code for text reduction
# stopwords
library(SnowballC)
corp <- tm_map(corp, removeWords, stopwords("english"))
# stemming
corp <- tm_map(corp, stemDocument)
tdm <- TermDocumentMatrix(corp)
inspect(tdm)
Output
> inspect(tdm)
<
Non-/sparse entries: 8/12
Sparsity : 60%
Maximal term length: 7
Weighting : term frequency (tf)
Docs
Terms 1 2 3 4
first 1 0 0 0
forth 0 0 0 1
second 0 1 0 0
sentenc 1 1 1 1
third 0 0 1 0
PREPROCESSING THE TEXT 487
Presence/Absence vs. Frequency
The bag-of-words approach can be implemented either in terms of frequency of
terms, or presence/absence of terms. The latter might be appropriate in some
circumstances—in a forensic accounting classification model, for example, the
presence or absence of a particular vendor name might be a key predictor vari-
able, without regard to how often it appears in a given document. Frequency
can be important in other circumstances, however. For example, in processing
support tickets, a single mention of “IP address” might be non-meaningful—all
support tickets might involve a user’s IP address as part of the submission. Rep-
etition of the phrase multiple times, however, might provide useful information
that IP address is part of the problem (e.g., DNS resolution). Note: The tm
package in R only implements the frequency option. To implement the pres-
ence/absence option, you would need to convert the frequency data to a binary
matrix (i.e., convert all non-zero counts to 1’s).
Term Frequency–Inverse Document Frequency (TF-IDF)
There are additional popular options that factor in both the frequency of a term
in a document and the frequency of documents with that term. One such pop-
ular option, which measures the importance of a term to a document, is Term
Frequency–Inverse Document Frequency (TF-IDF). For a given document d and
term t, the term frequency is the number of times term t appears in document d:
TF(d, t) = # times term t appears in document d.
To account for terms that appear frequently in the domain of interest, we
compute the Inverse Document Frequency of term t, calculated over the entire
corpus and defined as3
IDF(t) = log
(
total number of documents
# documents containing term t
)
.
TF-IDF(t, d) for a specific term-document pair is the product of term fre-
quency TF(t, d) and inverse document frequency IDF(t):
TF-IDF(t, d) = TF(t, d)× IDF(t). (20.1)
The TF-IDF matrix contains the value for each term-document combination.
The above definition of TF-IDF is a common one, however there are multiple
ways to define and weight both TF and IDF, so there are a variety of possible
definitions of TF-IDF. For example, Table 20.6 shows the TF-IDF matrix for
the four-sentence example (after tokenization and text reduction). The function
3IDF(t) is actually just the fraction, without the logarithm, although using a logarithm is very common.
488 TEXT MINING
TABLE 20.6 TF-IDF MATRIX FOR S1–S4 EXAMPLE (AFTER TOKENIZATION AND TEXT
REDUCTION)
> tfidf <- weightTfIdf(tdm)
> inspect(tfidf)
<
Non-/sparse entries: 4/16
Sparsity : 80%
Maximal term length: 7
Weighting : term frequency – inverse document frequency (normalized) (tf-idf)
Docs
Terms 1 2 3 4
first 1 0 0 0
forth 0 0 0 1
second 0 1 0 0
sentenc 0 0 0 0
third 0 0 1 0
weightTfIdf() uses a base 2 logarithm for the computation of IDF, and normalizes
the numerator by dividing by the total number of terms in document d. For
example, the TF-IDF value for the term “first” in document 1 is computed by
TF-IDF(first, 1) =
1
2
× log2
(
4
1
)
= 1.
The general idea of TF-IDF is that it identifies documents with frequent
occurrences of rare terms. TF-IDF yields high values for documents with a
relatively high frequency for terms that are relatively rare overall, and near-zero
values for terms that are absent from a document, or present in most documents.
From Terms to Concepts: Latent Semantic Indexing
In Chapter 4, we showed how numerous numeric variables can be reduced to a
small number of “principal components” that explain most of the variation in a
set of variables. The principal components are linear combinations of the orig-
inal (typically correlated) variables, and a subset of them serve as new variables
to replace the numerous original variables.
An analogous dimension reduction method—latent semantic indexing [or latent
semantic analysis, (LSA)]—can be applied to text data. The mathematics of the
algorithm are beyond the scope of this chapter,4 but a good intuitive explanation
comes from the user guide for XLMiner software5:
4Generally, in PCA, the dimension is reduced by replacing the covariance matrix by a smaller one; in
LSA, dimension is reduced by replacing the term-document matrix by a smaller one.
5Analytic Solver Platform, XLMiner Platform, Data Mining User Guide, 2014, Frontline Systems,
p. 245
IMPLEMENTING DATA MINING METHODS 489
For example, if we inspected our document collection, we might find that
each time the term “alternator” appeared in an automobile document, the
document also included the terms “battery” and “headlights.” Or each time
the term “brake” appeared in an automobile document, the terms “pads” and
“squeaky” also appeared. However, there is no detectable pattern regarding the
use of the terms “alternator” and “brake” together. Documents including
“alternator” might or might not include “brake” and documents including
“brake” might or might not include “alternator.” Our four terms, battery,
headlights, pads, and squeaky describe two different automobile repair issues:
failing brakes and a bad alternator.
So, in this case, latent semantic indexing would reduce the four terms to two
concepts:
• brake failure
• alternator failure
We discuss illustrate latent semantic indexing using R in the example in
Section 20.6.
Extracting Meaning
In the simple latent semantic indexing example, the concepts to which the terms
map (failing brakes, bad alternator) are clear and understandable. In many cases,
unfortunately, this will not be true—the concepts to which the terms map will
not be obvious. In such cases, latent semantic indexing will greatly enhance the
manageability of the text for purposes of building predictive models, and sharpen
predictive power by reducing noise, but it will turn the model into a blackbox
device for prediction, not so useful for understanding the roles that terms and
concepts play. This is OK for our purposes—as noted earlier, we are focusing
on text mining to classify or cluster new documents, not to extract meaning.
20.5 Implementing Data Mining Methods
After the text has gone through the preprocessing stage, it is then in a numeric
matrix format and you can apply the various data mining methods discussed
earlier in this book. Clustering methods can be used to identify clusters of
documents—for example, large numbers of medical reports can be mined to
identify clusters of symptoms. Prediction methods can be used with tech support
tickets to predict how long it will take to resolve an issue. Perhaps the most
popular application of text mining is for classification—also termed labeling—of
documents.
490 TEXT MINING
20.6 Example: Online Discussions on Autos
and Electronics
This example6 illustrates a classification task—to classify Internet discussion posts
as either auto-related or electronics-related. One post looks like this:
From: smith@logos.asd.sgi.com (Tom Smith) Subject: Ford Explorer 4WD –
do I need performance axle?
We’re considering getting a Ford Explorer XLT with 4WD and we have the
following questions (All we would do is go skiing – no off-roading):
1. With 4WD, do we need the “performance axle” – (limited slip axle). Its
purpose is to allow the tires to act independently when the tires are on
different terrain.
2. Do we need the all-terrain tires (P235/75X15) or will the all-season
(P225/70X15) be good enough for us at Lake Tahoe?
Thanks,
Tom
–
================================================
Tom Smith Silicon Graphics smith@asd.sgi.com 2011 N. Shoreline Rd. MS
8U-815 415-962-0494 (fax) Mountain View, CA 94043
================================================
The posts are taken from Internet groups devoted to autos and electronics, so
are pre-labeled. This one, clearly, is auto-related. A related organizational sce-
nario might involve messages received by a medical office that must be classified
as medical or non-medical (the messages in such a real scenario would probably
have to be labeled by humans as part of the preprocessing).
The posts are in the form a zipped file that contains two folders: auto posts
and electronics posts, each contains a set of 1000 posts organized in small files.
In the following, we describe the main steps from preprocessing to building
a classification model on the data. Table 20.7 provides R code for the text
processing step for this example. We describe each step separately next.
Importing and Labeling the Records
Package tm in R contains multiple load functions for different document for-
mats. In our example, we use function ZipSource() to read the zipped file. Other
options include reading an entire directory, reading PDF or XML files, and
more. We additionally create a label array that corresponds to the order of the
documents—we will use “1” for autos and “0” for electronics. The first 1000
documents should be classified as “1” and the remaining 1000 documents “0”
(see Step 1 in Table 20.7).
6The dataset is taken from www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html, with
minor modifications.
mailto:smith@logos.asd.sgi.com
mailto:smith@asd.sgi.com
http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html
EXAMPLE: ONLINE DISCUSSIONS ON AUTOS AND ELECTRONICS 491
TABLE 20.7 IMPORTING AND LABELING THE RECORDS, PREPROCESSING TEXT, AND
PRODUCING CONCEPT MATRIX
code for importing and labeling records, preprocessing text, and producing concept matrix
library(tm)
# step 1: import and label records
# read zip file into a corpus
corp <- Corpus(ZipSource("AutoElectronics.zip", recursive = T))
# create an array of records labels
label <- c(rep(1, 1000), rep(0, 1000))
# step 2: text preprocessing
# tokenization
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
# stopwords
corp <- tm_map(corp, removeWords, stopwords("english"))
# stemming
corp <- tm_map(corp, stemDocument)
# step 3: TF-IDF and latent semantic analysis
# compute TF-IDF
tdm <- TermDocumentMatrix(corp)
tfidf <- weightTfIdf(tdm)
# extract (20) concepts
library(lsa)
lsa.tfidf <- lsa(tfidf, dim = 20)
# convert to data frame
words.df <- as.data.frame(as.matrix(lsa.tfidf$dk))
Text Preprocessing in R
The first preprocessing step is tokenization, which includes the removal of white
space, punctuations, and numbers. The next step is removing stop words. The
last preprocessing step is stemming, or the consolidation of multiple forms of a
word into a single core. For example, “road” and “Rd.” might stem to “road.”
These two operations are shown in Step 2 in Table 20.7.
Producing a Concept Matrix
The preprocessing step will produce a ‘clean’ corpus and term-document matrix
that can be used to compute the TF-IDF term-document matrix described
492 TEXT MINING
earlier. The TF-IDF matrix incorporates both the frequency of a term and
the frequency with which documents containing that term appear in the overall
corpus.
The resulting matrix is probably too large to efficiently analyze in a predictive
model (specifically, the number of predictors in the example is 21,789, so we
use latent semantic indexing to extract a reduced space, termed “concepts.” For
manageability, we will limit the number of concepts to 20. Table 20.7 shows the
R code for applying these two operations—creating a TF-IDF matrix and latent
semantic indexing—to the example.
Finally, we can use this reduced set of concepts to facilitate the building of a
predictive model. We will not attempt to interpret the concepts for meaning.
Fitting a Predictive Model
At this point, we have transformed the original text data into a familiar form
needed for predictive modeling—a single target variable (1 = autos, 0 = elec-
tronics), and 20 predictor variables (the concepts).
We can now partition the data (60% training, 40% validation), and try apply-
ing several classification models. Table 20.8 shows the performance of a logistic
regression, with “class” as the outcome variable and the 20 concept variables as
the predictors.
The confusion matrix (Table 20.8) shows reasonably high accuracy in sep-
arating the two classes of documents—an accuracy of 96.88%. The decile-wise
lift chart (Figure 20.1) confirms the high separability of the classes and the use-
fulness of this model for a ranking goal. For a two class dataset with a nearly
50/50 split between the classes, the maximum lift per decile is 2, and the lift
shown here is just under 2 for the first 40% of the cases, and close to 0 for
the last 40%. In a decision-making process, human review could be concen-
trated on the middle ranked 20% where the classification error is most likely to
occur.
Next, we would also try other models to see how they compare; this is left
as an exercise.
Prediction
The most prevalent application of text mining is classification (“labeling”), but it
can also be used for prediction of numerical values. For example, maintenance
or support tickets could be used to predict length or cost of repair. The only
step that would be different in the above process is the label that is applied after
the preprocessing, which would be a numeric value rather than a class.
EXAMPLE: ONLINE DISCUSSIONS ON AUTOS AND ELECTRONICS 493
TABLE 20.8 FITTING A PREDICTIVE MODEL TO THE AUTOS AND ELECTRONICS DISCUSSION
DATA
code for fitting and evaluating a logistic regression predictive model
# sample 60% training data
training <- sample(c(1:2000), 0.6*2000)
# run logistic model on training
trainData = cbind(label = label[training], words.df[training,])
reg <- glm(label ~ ., data = trainData, family = 'binomial')
# compute accuracy on validation set
validData = cbind(label = label[-training], words.df[-training,])
pred <- predict(reg, newdata = validData, type = "response")
# produce confusion matrix
library(caret)
confusionMatrix(ifelse(pred>0.5, 1, 0), label[-training])
Output
> confusionMatrix(ifelse(pred>0.5, 1, 0), label[-training])
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 385 16
1 9 390
Accuracy : 0.9688
10 20 30 40 50 60 70 80 90 100
Percentile
M
ea
n
R
es
po
ns
e
0.
0
0.
5
1.
0
1.
5
2.
0
FIGURE 20.1 DECILE-WISE LIFT CHART FOR AUTOS-ELECTRONICS DOCUMENT CLASSIFICATION
494 TEXT MINING
20.7 Summary
In this chapter, we drew a distinction between text processing for the purpose of
extracting meaning from a single document (natural language processing—NLP)
and classifying or labeling numerous documents in probabilistic fashion (text
mining). We concentrated on the latter, and examined the preprocessing steps
that need to occur before text can be mined. Those steps are more varied and
involved than those involved in preparing numerical data. The ultimate goal
is to produce a matrix in which rows are terms and columns are documents.
The nature of language is such that the number of terms is excessive for effec-
tive model-building, so the preprocessing steps include vocabulary reduction. A
final major reduction takes place if we use, instead of the terms, a limited set
of concepts that represents most of the variation in the documents, in the same
way that principal components capture most of the variation in numerical data.
Finally, we end up with a quantitative matrix in which the cells represent the
frequency or presence of terms and the columns represent documents. To this
we append document labels (classes), and then we are ready to use this matrix
for classifying documents using classification methods.
PROBLEMS 495
PROBLEMS
20.1 Tokenization. Consider the following text version of a post to an online learning
forum in a statistics course:
Thanks John!
"Illustrations and demos will be
provided for students to work through on
their own".
Do we need that to finish project? If yes,
where to find the illustration and demos?
Thanks for your help.\
a. Identify 10 non-word tokens in the passage.
b. Suppose this passage constitutes a document to be classified, but you are not certain
of the business goal of the classification task. Identify material (at least 20% of the
terms) that, in your judgment, could be discarded fairly safely without knowing
that goal.
c. Suppose the classification task is to predict whether this post requires the attention
of the instructor, or whether a teaching assistant might suffice. Identify the 20% of
the terms that you think might be most helpful in that task.
d. What aspect of the passage is most problematic from the standpoint of simply using
a bag-of-words approach, as opposed to an approach in which meaning is extracted?
20.2 Classifying Internet Discussion Posts. In this problem, you will use the data and
scenario described in this chapter’s example, in which the task is to develop a model
to classify documents as either auto-related or electronics-related.
a. Load the zipped file into R and create a label vector.
b. Following the example in this chapter, preprocess the documents. Explain what
would be different if you did not perform the “stemming” step.
c. Use the lsa package to create 10 concepts. Explain what is different about the
concept matrix, as opposed to the TF-IDF matrix.
d. Using this matrix, fit a predictive model (different from the model presented in
the chapter illustration) to classify documents as autos or electronics. Compare its
performance to that of the model presented in the chapter illustration.
20.3 Classifying Classified Ads Submitted Online. Consider the case of a website that
caters to the needs of a specific farming community, and carries classified ads intended
for that community. Anyone, including robots, can post an ad via a web interface,
and the site owners have problems with ads that are fraudulent, spam, or simply not
relevant to the community. They have provided a file with 4143 ads, each ad in a row,
and each ad labeled as either −1 (not relevant) or 1 (relevant). The goal is to develop
a predictive model that can classify ads automatically.
• Open the file farm-ads.csv, and briefly review some of the relevant and non-relevant
ads to get a flavor for their contents.
http://lms.statistics.com/pix/smartpix.php/statistics_com_1/s/smiley.gif
http://lms.statistics.com/pix/smartpix.php/statistics_com_1/s/smiley.gif
http://lms.statistics.com/pix/smartpix.php/statistics_com_1/s/smiley.gif
496 TEXT MINING
• Following the example in the chapter, preprocess the data in R, and create a term-
document matrix, and a concept matrix. Limit the number of concepts to 20.
a. Examine the term-document matrix.
i. Is it sparse or dense?
ii. Find two non-zero entries and briefly interpret their meaning, in words (you
do not need to derive their calculation)
b. Briefly explain the difference between the term-document matrix and the concept-
document matrix. Relate the latter to what you learned in the principal compo-
nents chapter (Chapter 4).
c. Using logistic regression, partition the data (60% training, 40% validation), and
develop a model to classify the documents as ‘relevant’ or ‘non-relevant.’ Comment
on its efficacy.
d. Why use the concept-document matrix, and not the term-document matrix, to
provide the predictor variables?
20.4 Clustering auto posts. In this problem, you will use the data and scenario described
in this chapter’s example. The task is to cluster the auto posts.
a. Following the example in this chapter, preprocess the documents, except do not
create a label vector.
b. Use the lsa package to create 10 concepts.
c. Before doing the clustering, state how many natural clusters you expect to find.
d. Perform hierarchical clustering and inspect the dendrogram.
i. From the dendrogram, how many natural clusters appear?
ii. Examining the dendrogram as it branches beyond the number of main clusters,
select a sub-cluster and assess its characteristics.
e. Perform k-means clustering for two clusters and report how distant and separated
they are (using between-cluster distance and within cluster dispersion).
Part VIII
Cases
CHAPTER 21
Cases
21.1 Charles Book Club1
CharlesBookClub.csv is the dataset for this case study.
The Book Industry
Approximately 50,000 new titles, including new editions, are published each
year in the United States, giving rise to a $25 billion industry in 2001. In terms
of percentage of sales, this industry may be segmented as follows:
16% Textbooks
16% Trade books sold in bookstores
21% Technical, scientific, and professional books
10% Book clubs and other mail-order books
17% Mass-market paperbound books
20% All other books
Book retailing in the United States in the 1970s was characterized by the
growth of bookstore chains located in shopping malls. The 1980s saw increased
purchases in bookstores stimulated through the widespread practice of discount-
ing. By the 1990s, the superstore concept of book retailing gained acceptance
and contributed to double-digit growth of the book industry. Conveniently
situated near large shopping centers, superstores maintain large inventories of
30,000–80,000 titles and employ well-informed sales personnel. Book retail-
ing changed fundamentally with the arrival of Amazon, which started out as an
1The Charles Book Club case was derived, with the assistance of Ms. Vinni Bhandari, from The
Bookbinders Club, a Case Study in Database Marketing, prepared by Nissan Levin and Jacob Zahavi, Tel
Aviv University; used with permission.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
499
500 CASES
online bookseller and, as of 2015, was the world’s largest online retailer of any
kind. Amazon’s margins were small and the convenience factor high, putting
intense competitive pressure on all other book retailers. Borders, one of the two
major superstore chains, discontinued operations in 2011.
Subscription-based book clubs offer an alternative model that has persisted,
though it too has suffered from the dominance of Amazon.
Historically, book clubs offered their readers different types of membership
programs. Two common membership programs are the continuity and negative
option programs, which are both extended contractual relationships between the
club and its members. Under a continuity program, a reader signs up by accepting
an offer of several books for just a few dollars (plus shipping and handling) and
an agreement to receive a shipment of one or two books each month thereafter
at more-standard pricing. The continuity program is most common in the chil-
dren’s book market, where parents are willing to delegate the rights to the book
club to make a selection, and much of the club’s prestige depends on the quality
of its selections.
In a negative option program, readers get to select how many and which addi-
tional books they would like to receive. However, the club’s selection of the
month is delivered to them automatically unless they specifically mark “no” on
their order form by a deadline date. Negative option programs sometimes result
in customer dissatisfaction and always give rise to significant mailing and pro-
cessing costs.
In an attempt to combat these trends, some book clubs have begun to offer
books on a positive option basis, but only to specific segments of their customer
base that are likely to be receptive to specific offers. Rather than expanding
the volume and coverage of mailings, some book clubs are beginning to use
database-marketing techniques to target customers more accurately. Information
contained in their databases is used to identify who is most likely to be interested
in a specific offer. This information enables clubs to design special programs
carefully tailored to meet their customer segments’ varying needs.
Database Marketing at Charles
The Club The Charles Book Club (CBC) was established in December
1986 on the premise that a book club could differentiate itself through a deep
understanding of its customer base and by delivering uniquely tailored offerings.
CBC focused on selling specialty books by direct marketing through a variety of
channels, including media advertising (TV, magazines, newspapers) and mailing.
CBC is strictly a distributor and does not publish any of the books that it sells.
In line with its commitment to understanding its customer base, CBC built
and maintained a detailed database about its club members. Upon enrollment,
readers were required to fill out an insert and mail it to CBC. Through this
CHARLES BOOK CLUB 501
process, CBC created an active database of 500,000 readers; most were acquired
through advertising in specialty magazines.
The Problem CBC sent mailings to its club members each month con-
taining the latest offerings. On the surface, CBC appeared very successful:
mailing volume was increasing, book selection was diversifying and growing,
and their customer database was increasing. However, their bottom-line profits
were falling. The decreasing profits led CBC to revisit their original plan of
using database marketing to improve mailing yields and to stay profitable.
A Possible Solution CBC embraced the idea of deriving intelligence
from their data to allow them to know their customers better and enable mul-
tiple targeted campaigns where each target audience would receive appropriate
mailings. CBC’s management decided to focus its efforts on the most profitable
customers and prospects, and to design targeted marketing strategies to best reach
them. The two processes they had in place were:
1. Customer acquisition:
◦ New members would be acquired by advertising in specialty maga-
zines, newspapers, and on TV.
◦ Direct mailing and telemarketing would contact existing club
members.
◦ Every new book would be offered to club members before general
advertising.
2. Data collection:
◦ All customer responses would be recorded and maintained in the
database.
◦ Any information not being collected that is critical would be requested
from the customer.
For each new title, they decided to use a two-step approach:
1. Conduct a market test involving a random sample of 4000 customers from
the database to enable analysis of customer responses. The analysis would
create and calibrate response models for the current book offering.
2. Based on the response models, compute a score for each customer in the
database. Use this score and a cutoff value to extract a target customer
list for direct-mail promotion.
Targeting promotions was considered to be of prime importance. Other
opportunities to create successful marketing campaigns based on customer
behavior data (returns, inactivity, complaints, compliments, etc.) would be
addressed by CBC at a later stage.
502 CASES
Art History of Florence A new title, The Art History of Florence, is ready
for release. CBC sent a test mailing to a random sample of 4000 customers
from its customer base. The customer responses have been collated with past
purchase data. The dataset was randomly partitioned into three parts: Training
Data (1800 customers): initial data to be used to fit models, Validation Data (1400
customers): holdout data used to compare the performance of different models,
and Test Data (800 customers): data to be used only after a final model has been
selected to estimate the probable performance of the model when it is deployed.
Each row (or case) in the spreadsheet (other than the header) corresponds to one
market test customer. Each column is a variable, with the header row giving
the name of the variable. The variable names and descriptions are given in
Table 21.1.
Data Mining Techniques
Various data mining techniques can be used to mine the data collected from the
market test. No one technique is universally better than another. The particu-
lar context and the particular characteristics of the data are the major factors in
determining which techniques perform better in an application. For this assign-
ment, we focus on two fundamental techniques: k-nearest neighbors and logistic
TABLE 21.1 LIST OF VARIABLES IN CHARLES BOOK CLUB DATASET
Variable Name Description
Seq# Sequence number in the partition
ID# Identification number in the full
(unpartitioned) market test dataset
Gender 0 = Male, 1 = Female
M Monetary—Total money spent on books
R Recency—Months since last purchase
F Frequency—Total number of purchases
FirstPurch Months since first purchase
ChildBks Number of purchases from the category child books
YouthBks Number of purchases from the category youth books
CookBks Number of purchases from the category cookbooks
DoItYBks Number of purchases from the category do-it-yourself books
RefBks Number of purchases from the category reference books
(atlases, encyclopedias, dictionaries)
ArtBks Number of purchases from the category art books
GeoBks Number of purchases from the category geography books
ItalCook Number of purchases of book title Secrets of Italian Cooking
ItalAtlas Number of purchases of book title Historical Atlas of Italy
ItalArt Number of purchases of book title Italian Art
Florence = 1 if The Art History of Florence was bought; = 0 if not
Related Purchase Number of related books purchased
CHARLES BOOK CLUB 503
regression. We compare them with each other as well as with a standard industry
practice known as RFM (recency, frequency, monetary) segmentation.
RFM Segmentation The segmentation process in database marketing
aims to partition customers in a list of prospects into homogeneous groups (seg-
ments) that are similar with respect to buying behavior. The homogeneity crite-
rion we need for segmentation is the propensity to purchase the offering. How-
ever, since we cannot measure this attribute, we use variables that are plausible
indicators of this propensity.
In the direct marketing business, the most commonly used variables are the
RFM variables:
R = recency, time since last purchase
F = frequency, number of previous purchases from the company over a period
M = monetary, amount of money spent on the company’s products over a
period
The assumption is that the more recent the last purchase, the more products
bought from the company in the past, and the more money spent in the past
buying the company’s products, the more likely the customer is to purchase the
product offered.
The 1800 observations in the dataset were divided into recency, frequency,
and monetary categories as follows:
Recency:
0–2 months (Rcode = 1)
3–6 months (Rcode = 2)
7–12 months (Rcode = 3)
13 months and up (Rcode = 4)
Frequency:
1 book (Fcode = l)
2 books (Fcode = 2)
3 books and up (Fcode = 3)
Monetary:
$0–$25 (Mcode = 1)
$26–$50 (Mcode = 2)
$51–$100 (Mcode = 3)
$101–$200 (Mcode = 4)
$201 and up (Mcode = 5)
504 CASES
Assignment
Partition the data into training (60%) and validation (40%). Use seed = 1.
1. What is the response rate for the training data customers taken as a whole?
What is the response rate for each of the 4×5×3 = 60 combinations of
RFM categories? Which combinations have response rates in the training
data that are above the overall response in the training data?
2. Suppose that we decide to send promotional mail only to the “above-
average” RFM combinations identified in part 1. Compute the response
rate in the validation data using these combinations.
3. Rework parts 1 and 2 with three segments:
Segment 1: RFM combinations that have response rates that exceed
twice the overall response rate
Segment 2: RFM combinations that exceed the overall response rate
but do not exceed twice that rate
Segment 3: the remaining RFM combinations
Draw the lift curve (consisting of three points for these three segments)
showing the number of customers in the validation dataset on the x-axis
and cumulative number of buyers in the validation dataset on the y-axis.
k-Nearest Neighbors The k-nearest-neighbors technique can be used
to create segments based on product proximity to similar products of the products
offered as well as the propensity to purchase (as measured by the RFM variables).
For The Art History of Florence, a possible segmentation by product proximity
could be created using the following variables:
R: recency—months since last purchase
F: frequency—total number of past purchases
M: monetary—total money (in dollars) spent on books
FirstPurch: months since first purchase
RelatedPurch: total number of past purchases of related books (i.e., sum of
purchases from the art and geography categories and of titles Secrets of Italian
Cooking, Historical Atlas of Italy, and Italian Art)
4. Use the k-nearest-neighbor approach to classify cases with k =
1, 2, …, 11, using Florence as the outcome variable. Based on the val-
idation set, find the best k. Remember to normalize all five variables.
Create a lift curve for the best k model, and report the expected lift for
an equal number of customers from the validation dataset.
GERMAN CREDIT 505
5. The k-NN prediction algorithm gives a numerical value, which is a
weighted average of the values of the Florence variable for the k-nearest
neighbors with weights that are inversely proportional to distance. Using
the best k that you calculated above with k-NN classification, now run a
model with k-NN prediction and compute a lift curve for the validation
data. Use all 5 predictors and normalized data. What is the range within
which a prediction will fall? How does this result compare to the output
you get with the k-nearest-neighbor classification?
Logistic Regression The logistic regression model offers a powerful
method for modeling response because it yields well-defined purchase proba-
bilities. The model is especially attractive in consumer-choice settings because
it can be derived from the random utility theory of consumer behavior.
Use the training set data of 1800 records to construct three logistic regression
models with Florence as the outcome variable and each of the following sets of
predictors:
• The full set of 15 predictors in the dataset
• A subset of predictors that you judge to be the best
• Only the R, F , and M variables
6. Create a lift chart summarizing the results from the three logistic regres-
sion models created above, along with the expected lift for a random
selection of an equal number of customers from the validation dataset.
7. If the cutoff criterion for a campaign is a 30% likelihood of a purchase,
find the customers in the validation data that would be targeted and count
the number of buyers in this set.
21.2 German Credit
GermanCredit.csv is the dataset for this case study.
Background
Money-lending has been around since the advent of money; it is perhaps
the world’s second-oldest profession. The systematic evaluation of credit risk,
though, is a relatively recent arrival, and lending was largely based on reputation
and very incomplete data. Thomas Jefferson, the third President of the United
States, was in debt throughout his life and unreliable in his debt payments, yet
people continued to lend him money. It wasn’t until the beginning of the 20th
century that the Retail Credit Company was founded to share information about
credit. That company is now Equifax, one of the big three credit scoring agen-
cies (the other two are Transunion and Experion).
506 CASES
Individual and local human judgment are now largely irrelevant to the credit
reporting process. Credit agencies and other big financial institutions extending
credit at the retail level collect huge amounts of data to predict whether defaults
or other adverse events will occur, based on numerous customer and transaction
information.
Data
This case deals with an early stage of the historical transition to predictive mod-
eling, in which humans were employed to label records as either good or poor
credit. The German Credit dataset2 has 30 variables and 1000 records, each
record being a prior applicant for credit. Each applicant was rated as “good
credit” (700 cases) or “bad credit” (300 cases). Table 21.2 shows the values of
these variables for the first four records. All the variables are explained in Table
21.5. New applicants for credit can also be evaluated on these 30 predictor vari-
ables and classified as a good or a bad credit risk based on the predictor values.
The consequences of misclassification have been assessed as follows: The
costs of a false positive (incorrectly saying that an applicant is a good credit risk)
outweigh the benefits of a true positive (correctly saying that an applicant is
a good credit risk) by a factor of 5. This is summarized in Table 21.3. The
opportunity cost table was derived from the average net profit per loan as shown
TABLE 21.2 FIRST FOUR RECORDS FROM GERMAN CREDIT DATASET
OB
S#
CH
K_
AC
CT
DU
RA
TI
ON
H
IS
TO
RY
NE
W
_C
AR
US
ED
_C
AR
FU
RN
IT
U
RE
RA
DI
O/
TV
ED
U
CA
TI
ON
RE
TR
AI
NI
NG
AM
OU
NT
SA
V_
AC
CT
EM
PL
OY
M
EN
T
IN
ST
AL
L_
RA
TE
M
AL
E_
DI
V
M
AL
E_
SI
NG
LE
M
AL
E_
M
AR
_W
ID
CO
-A
PP
LI
CA
NT
GU
AR
AN
TO
R
1 0 6 4 0 0 0 1 0 0 1169 4 4 4 0 1 0 0 0
2 1 48 2 0 0 0 1 0 0 5951 0 2 2 0 0 0 0 0
3 3 12 4 0 0 0 0 1 0 2096 0 3 2 0 1 0 0 0
4 0 42 2 0 0 1 0 0 0 7882 0 3 2 0 1 0 0 1
PR
ES
EN
T_
RE
SI
DE
NT
RE
AL
_E
ST
AT
E
PR
OP
_U
NK
N_
NO
NE
AG
E
OT
H
ER
_I
NS
TA
LL
RE
NT
OW
N_
RE
S
NU
M
_C
RE
DI
TS
JO
B
NU
M
_D
EP
EN
DE
NT
S
TE
LE
PH
ON
E
FO
RE
IG
N
RE
SP
ON
SE
4 1 0 67 0 0 1 2 2 1 1 0 1
2 1 0 22 0 0 1 1 2 1 0 0 0
3 1 0 49 0 0 1 1 1 2 0 0 1
4 0 0 45 0 0 0 1 2 2 0 0 1
2This dataset is available from ftp.ics.uci.edu/pub/machine-learning-databases/statlog/.
ftp.ics.uci.edu/pub/machine-learning-databases/statlog/
GERMAN CREDIT 507
TABLE 21.3 OPPORTUNITY COST TABLE (DEUTSCHE MARKS)
Actual
Predicted (Decision) Good Bad
Good (Accept) 0 500
Bad (Reject) 100 0
TABLE 21.4 AVERAGE NET PROFIT (DEUTSCHE MARKS)
Actual
Predicted (Decision) Good Bad
Good (Accept) 100 −500
Bad (Reject) 0 0
in Table 21.4. Because decision makers are used to thinking of their decision
in terms of net profits, we use these tables in assessing the performance of the
various models.
Assignment
1. Review the predictor variables and guess what their role in a credit deci-
sion might be. Are there any surprises in the data?
2. Divide the data into training and validation partitions, and develop classi-
fication models using the following data mining techniques in R: logistic
regression, classification trees, and neural networks.
3. Choose one model from each technique and report the confusion matrix
and the cost/gain matrix for the validation data. Which technique has
the highest net profit?
4. Let us try and improve our performance. Rather than accept the default
classification of all applicants’ credit status, use the estimated probabilities
(propensities) from the logistic regression (where successmeans 1) as a basis
for selecting the best credit risks first, followed by poorer-risk applicants.
Create a vector containing the net profit for each record in the validation
set. Use this vector to create a decile-wise lift chart for the validation set
that incorporates the net profit.
a. How far into the validation data should you go to get maximum net
profit? (Often, this is specified as a percentile or rounded to deciles.)
b. If this logistic regression model is used to score to future applicants,
what “probability of success” cutoff should be used in extending credit?
508 CASES
TABLE 21.5 VARIABLES FOR THE GERMAN CREDIT DATASET
Var. Variable Name Description Variable Type Code Description
1 OBS# Observation numbers Categorical Sequence number in dataset
2 CHK−ACCT Checking account Categorical 0: <0 DM
status
1: 0−200 DM
2 : >200 DM
3: No checking account
3 DURATION Duration of Numerical
credit in months
4 HISTORY Credit history Categorical 0: No credits taken
1: All credits at this bank
paid back duly
2: Existing credits paid
back duly until now
3: Delay in paying
off in the past
4: Critical account
5 NEW−CAR Purpose of Binary Car (new), 0: No, 1: Yes
credit
6 USED−CAR Purpose of Binary Car (used), 0: No, 1: Yes
credit
7 FURNITURE Purpose of Binary Furniture/equipment,
credit 0: No, 1: Yes
8 RADIO/TV Purpose of Binary Radio/television,
credit 0: No, 1: Yes
9 EDUCATION Purpose of Binary Education, 0: No, 1: Yes
credit
10 RETRAINING Purpose of Binary Retraining, 0: No, 1: Yes
credit
11 AMOUNT Credit amount Numerical
12 SAV−ACCT Average balance Categorical 0: <100 DM
in savings 1 : 101−500 DM
account 2 : 501−1000 DM
3 : >1000 DM
4 : Unknown/ no
savings account
13 EMPLOYMENT Present Categorical 0 : Unemployed
employment 1: <1 year
since 2: 1−3 years
3: 4−6 years
4: ≥7 years
14 INSTALL−RATE Installment Numerical
rate as
% of disposable
income
15 MALE−DIV Applicant is male Binary 0: No, 1: Yes
and divorced
(continued)
GERMAN CREDIT 509
TABLE 21.5 (CONTINUED)
Var. Variable Name Description Variable Type Code Description
16 MALE−SINGLE Applicant is male Binary 0: No, 1: Yes
and single
17 MALE−MAR−WID Applicant is male Binary 0: No, 1: Yes
and married
or a widower
18 CO-APPLICANT Application has Binary 0: No, 1: Yes
a coapplicant
19 GUARANTOR Applicant has Binary 0: No, 1: Yes
a guarantor
20 PRESENT−RESIDENT Present resident Categorical 0: ≤1 year
since (years) 1: 1−2 years
2: 2−3 years
3: ≥3 years
21 REAL−ESTATE Applicant owns Binary 0: No, 1: Yes
real estate
22 PROP−UNKN−NONE Applicant owns no Binary 0: No, 1: Yes
property (or unknown)
23 AGE Age in years Numerical
24 OTHER−INSTALL Applicant has Binary 0: No, 1: Yes
other installment
plan credit
25 RENT Applicant rents Binary 0: No, 1: Yes
26 OWN−RES Applicant owns Binary 0: No, 1: Yes
residence
27 NUM−CREDITS Number of Numerical
existing credits
at this bank
28 JOB Nature of job Categorical 0 : Unemployed/
unskilled—
non-resident
1 : Unskilled—
resident
2 : Skilled employee/
official
3 : Management/
self-employed/
highly qualified
employee/officer
29 NUM−DEPENDENTS Number of people Numerical
for whom liable to
provide maintenance
30 TELEPHONE Applicant has Binary 0: No, 1: Yes
phone in his
or her name
31 FOREIGN Foreign worker Binary 0: No, 1: Yes
32 RESPONSE Credit rating Binary 0: No, 1: Yes
is good
(Note: The original dataset had a number of categorical variables, some of which were
transformed into a series of binary variables and some ordered categorical variables were left
as is, to be treated as numerical.)
510 CASES
21.3 Tayko Software Cataloger3
Tayko.csv is the dataset for this case study.
Background
Tayko is a software catalog firm that sells games and educational software. It
started out as a software manufacturer and later added third-party titles to its
offerings. It has recently put together a revised collection of items in a new
catalog, which it is preparing to roll out in a mailing.
In addition to its own software titles, Tayko’s customer list is a key asset. In
an attempt to expand its customer base, it has recently joined a consortium of
catalog firms that specialize in computer and software products. The consortium
affords members the opportunity to mail catalogs to names drawn from a pooled
list of customers. Members supply their own customer lists to the pool, and can
“withdraw” an equivalent number of names each quarter. Members are allowed
to do predictive modeling on the records in the pool so they can do a better job
of selecting names from the pool.
The Mailing Experiment
Tayko has supplied its customer list of 200,000 names to the pool, which totals
over 5,000,000 names, so it is now entitled to draw 200,000 names for a mailing.
Tayko would like to select the names that have the best chance of performing
well, so it conducts a test—it draws 20,000 names from the pool and does a test
mailing of the new catalog.
This mailing yielded 1065 purchasers, a response rate of 0.053. To optimize
the performance of the data mining techniques, it was decided to work with a
stratified sample that contained equal numbers of purchasers and nonpurchasers.
For ease of presentation, the dataset for this case includes just 1000 purchasers
and 1000 nonpurchasers, an apparent response rate of 0.5. Therefore, after using
the dataset to predict who will be a purchaser, we must adjust the purchase rate
back down by multiplying each case’s “probability of purchase” by 0.053/0.5,
or 0.107.
Data
There are two outcome variables in this case. Purchase indicates whether or not
a prospect responded to the test mailing and purchased something. Spending
indicates, for those who made a purchase, how much they spent. The overall
procedure in this case will be to develop two models. One will be used to classify
3Copyright © Resampling Stats, Inc. 2017 used with permission.
TAYKO SOFTWARE CATALOGER 511
TABLE 21.6 FIRST 10 RECORDS FROM TAYKO DATASET
se
qu
en
ce
_n
um
be
r
US so
ur
ce
_a
so
ur
ce
_c
so
ur
ce
_b
so
ur
ce
_d
so
ur
ce
_e
so
ur
ce
_m
so
ur
ce
_o
so
ur
ce
_h
so
ur
ce
_r
so
ur
ce
_s
so
ur
ce
_t
so
ur
ce
_u
so
ur
ce
_p
so
ur
ce
_x
so
ur
ce
_w
Fr
eq
la
st
_u
pd
at
e_
da
ys
_a
go
1s
t_
up
da
te
_d
ay
s_
ag
o
W
eb
or
de
r
Ge
nd
er
=m
al
e
Ad
dr
es
s_
is
_r
es
Pu
rc
ha
se
Sp
en
di
ng
1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 3662 3662 1 0 1 1 127.87
2 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2900 2900 1 1 0 0 0
3 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 3883 3914 0 0 0 1 127.48
4 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 829 829 0 1 0 0 0
5 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 869 869 0 0 0 0 0
6 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1995 2002 0 0 1 0 0.06
7 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1498 1529 0 0 1 0 0.06
8 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 3397 3397 0 1 0 0 0.08
9 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 525 2914 1 1 0 1 488.5
10 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3215 3215 0 0 0 1 173.5
records as purchase or no purchase. The second will be used for those cases that
are classified as purchase and will predict the amount they will spend.
Table 21.6 shows the first few rows of data. Table 21.7 provides a description
of the variables available in this case.
TABLE 21.7 DESCRIPTION OF VARIABLES FOR TAYKO DATASET
Code
Var. Variable Name Description Variable Type Description
1 US Is it a US address? Binary 1: Yes
0: No
2–16 Source−* Source catalog Binary 1: Yes
for the record 0: No
(15 possible sources)
17 Freq. Number of transactions Numerical
in last year at
source catalog
18 last−update−days−ago How many days ago Numerical
last update was made
to customer record
19 1st−update−days−ago How many days Numerical
ago first update
to customer record was made
20 RFM% Recency–frequency– Numerical
monetary percentile,
as reported by
source catalog
(see Section 21.1)
21 Web−order Customer placed at Binary 1: Yes
least one order 0: No
via web
22 Gender=mal Customer is male Binary 1: Yes
0: No
23 Address−is−res Address is Binary 1: Yes
a residence 0: No
24 Purchase Person made purchase Binary 1: Yes
in test mailing 0: No
25 Spending Amount (dollars) spent Numerical
by customer in
test mailing
512 CASES
Assignment
1. Each catalog costs approximately $2 to mail (including printing, postage,
and mailing costs). Estimate the gross profit that the firm could expect
from the remaining 180,000 names if it selects them randomly from the
pool.
2. Develop a model for classifying a customer as a purchaser or nonpur-
chaser.
a. Partition the data randomly into a training set (800 records), validation
set (700 records), and test set (500 records).
b. Run stepwise logistic regression using backward elimination to select
the best subset of variables, then use this model to classify the data into
purchasers and nonpurchasers. Use only the training set for running
the model. (Logistic regression is used because it yields an estimated
“probability of purchase,” which is required later in the analysis.)
3. Develop a model for predicting spending among the purchasers.
a. Create a vector of ID’s of only purchasers’ records (Purchase = 1).
b. Partition this dataset into the training and validation records. (Use the
same training/validation labels from the earlier partitioning; one way
is to use function intersect() to find IDs of purchasers in the original
partitions).
c. Develop models for predicting spending, using:
i. Multiple linear regression (use stepwise regression)
ii. Regression trees
d. Choose one model on the basis of its performance on the validation
data.
4. Return to the original test data partition. Note that this test data partition
includes both purchasers and nonpurchasers. Create a new data frame
called Score Analysis that contains the test data portion of this dataset.
a. Add a column to the data frame with the predicted scores from the
logistic regression.
b. Add another column with the predicted spending amount from he
prediction model chosen.
c. Add a column for “adjusted probability of purchase” by multiplying
“predicted probability of purchase” by 0.107. This is to adjust for over-
sampling the purchasers (see earlier description).
d. Add a column for expected spending: adjusted probability of purchase
× predicted spending.
e. Plot the lift chart of the expected spending.
POLITICAL PERSUASION 513
f. Using this lift curve, estimate the gross profit that would result from
mailing to the 180,000 names on the basis of your data mining models.
Note: Although Tayko is a hypothetical company, the data in this case
(modified slightly for illustrative purposes) were supplied by a real company that
sells software through direct sales. The concept of a catalog consortium is based
on the Abacus Catalog Alliance.
21.4 Political Persuasion4
Voter-Persuasion.csv is the dataset for this case study.
Note: Our thanks to Ken Strasma, President of HaystaqDNA and director of
targeting for the 2004 Kerry campaign and the 2008 Obama campaign, for the
data used in this case, and for sharing the information in the following writeup.
Background
When you think of political persuasion, you may think of the efforts that political
campaigns undertake to persuade you that their candidate is better than the other
candidate. In truth, campaigns are less about persuading people to change their
minds, and more about persuading those who agree with you to actually go out
and vote. Predictive analytics now plays a big role in this effort, but in 2004, it
was a new arrival in the political toolbox.
Predictive Analytics Arrives in US Politics
In January of 2004, candidates in the US presidential campaign were compet-
ing in the Iowa caucuses, part of a lengthy state-by-state primary campaign that
culminates in the selection of the Republican and Democratic candidates for
president. Among the Democrats, Howard Dean was leading in national polls.
The Iowa caucuses, however, are a complex and intensive process attracting only
the most committed and interested voters. Those participating are not a repre-
sentative sample of voters nationwide. Surveys of those planning to take part
showed a close race between Dean and three other candidates, including John
Kerry.
Kerry ended up winning by a surprisingly large margin, and the better than
expected performance was due to his campaign’s innovative and successful use
of predictive analytics to learn more about the likely actions of individual vot-
ers. This allowed the campaign to target voters in such a way as to optimize
4Copyright © Datastats, LLC 2017; used with permission.
514 CASES
performance in the caucuses. For example, once the model showed sufficient
support in a precinct to win that precinct’s delegate to the caucus, money and
time could be redirected to other precincts where the race was closer.
Political Targeting
Targeting of voters is not new in politics. It has traditionally taken three forms:
• Geographic
• Demographic
• Individual
In geographic targeting, resources are directed to a geographic unit—state,
city, county, etc.—on the basis of prior voting patterns or surveys that reveal the
political tendency in that geographic unit. It has significant limitations, though.
If a county is only, say, 52% in your favor, it may be in the greatest need of
attention, but if messaging is directed to everyone in the county, nearly half of
it is reaching the wrong people.
In demographic targeting, the messaging is intended for demographic
groups—for example, older voters, younger women voters, Hispanic voters,
etc. The limitation of this method is that it is often not easy to implement—
messaging is hard to deliver just to single demographic groups.
Traditional individual targeting, the most effective form of targeting, was
done on the basis of surveys asking voters how they plan to vote. The big
limitation of this method is, of course, the cost. The expense of reaching all
voters in a phone or door-to-door survey can be prohibitive.
The use of predictive analytics adds power to the individual targeting
method, and reduces cost. A model allows prediction to be rolled out to the
entire voter base, not just those surveyed, and brings to bear a wealth of infor-
mation. Geographic and demographic data remain part of the picture, but they
are used at an individual level.
Uplift
In a classical predictive modeling application for marketing, a sample of data
is selected and an offer is made (e.g., on the web) or a message is sent (e.g.,
by mail), and a predictive model is developed to classify individuals as respond-
ing or not-responding. The model is then applied to new data, propensities to
respond are calculated, individuals are ranked by their propensity to respond,
and the marketer can then select those most likely to respond to mailings or
offers.
POLITICAL PERSUASION 515
Some key information is missing from this classical approach: how would
the individual respond in the absence of the offer or mailing? Might a high-
propensity customer be inclined to purchase irrespective of the offer? Might a
person’s propensity to buy actually be diminished by the offer? Uplift model-
ing (see Chapter 13) allows us to estimate the effect of “offer vs. no offer” or
“mailing vs. no mailing” at the individual level.
In this case, we will apply uplift modeling to actual voter data that were aug-
mented with the results of a hypothetical experiment. The experiment consisted
of the following steps:
1. Conduct a pre-survey of the voters to determine their inclination to vote
Democratic.
2. Randomly split the voters into two samples—control and treatment.
3. Send a flyer promoting the Democratic candidate to the treatment group.
4. Conduct another survey of the voters to determine their inclination to
vote Democratic.
Data
The data in this case are in the file Voter-Persuasion.csv. The target variable is
MOVED_AD, where a 1 = “opinion moved in favor of the Democratic can-
didate” and 0 = “opinion did not move in favor of the Democratic candidate.”
This variable encapsulates the information from the pre- and post-surveys. The
important predictor variable is Flyer, a binary variable that indicates whether or
not a voter received the flyer. In addition, there are numerous other predictor
variables from these sources:
1. Government voter files
2. Political party data
3. Commercial consumer and demographic data
4. Census neighborhood data
Government voter files are maintained, and made public, to assure the integrity
of the voting process. They contain essential data for identification purposes such
as name, address and date of birth. The file used in this case also contains party
identification (needed if a state limits participation in party primaries to voters
in that party). Parties also staff elections with their own poll-watchers, who
record whether an individual votes in an election. These data (termed “derived”
in the case data) are maintained and curated by each party, and can be readily
matched to the voter data by name. Demographic data at the neighborhood
level are available from the census, and can be appended to the voter data by
address matching. Consumer and additional demographic data (buying habits,
516 CASES
education) can be purchased from marketing firms and appended to the voter
data (matching by name and address).
Assignment
The task in this case is to develop an uplift model that predicts the uplift for
each voter. Uplift is defined as the increase in propensity to move one’s opinion
in a Democratic direction. First, review the variables in Voter-Persuasion.csv and
understand which data source they are probably coming from. Then, answer the
following questions and perform the tasks indicated:
1. Overall, how well did the flyer do in moving voters in a Democratic
direction? (Look at the target variable among those who got the flyer,
compared to those who did not.)
2. Explore the data to learn more about the relationships between the pre-
dictor variables and MOVED_AD using data visualization. Which of
the predictors seem to have good predictive potential? Show supporting
charts and/or tables.
3. Partition the data using the partition variable that is in the dataset,
make decisions about predictor inclusion, and fit three predictive mod-
els accordingly. For each model, give sufficient detail about the method
used, its parameters, and the predictors used, so that your results can be
replicated.
4. Among your three models, choose the best one in terms of predictive
power. Which one is it? Why did you choose it?
5. Using your chosen model, report the propensities for the first three
records in the validation set.
6. Create a derived variable that is the opposite of Flyer. Call it Flyer-reversed.
Using your chosen model, re-score the validation data using the Flyer-
reversed variable as a predictor, instead of Flyer. Report the propensities
for the first three records in the validation set.
7. For each record, uplift is computed based on the following difference:
P(success | Flyer = 1) − P(success | Flyer = 0)
Compute the uplift for each of the voters in the validation set, and report
the uplift for the first three records.
8. If a campaign has the resources to mail the flyer only to 10% of the voters,
what uplift cutoff should be used?
TAXI CANCELLATIONS 517
21.5 Taxi Cancellations5
Taxi-cancellation-case.csv is the dataset for this case study.
Business Situation
In late 2013, the taxi company Yourcabs.com in Bangalore, India was facing a
problem with the drivers using their platform—not all drivers were showing up
for their scheduled calls. Drivers would cancel their acceptance of a call, and,
if the cancellation did not occur with adequate notice, the customer would be
delayed or even left high and dry.
Bangalore is a key tech center in India, and technology was transforming
the taxi industry. Yourcabs.com featured an online booking system (though
customers could phone in as well), and presented itself as a taxi booking portal.
The Uber ride sharing service started its Bangalore operations in mid-2014.
Yourcabs.com had collected data on its bookings from 2011 to 2013, and
posted a contest on Kaggle, in coordination with the Indian School of Business,
to see what it could learn about the problem of cab cancellations.
The data presented for this case are a randomly selected subset of the original
data, with 10,000 rows, one row for each booking. There are 17 input variables,
including user (customer) ID, vehicle model, whether the booking was made
online or via a mobile app, type of travel, type of booking package, geographic
information, and the date and time of the scheduled trip. The target variable
of interest is the binary indicator of whether a ride was canceled. The overall
cancellation rate is between 7% and 8%.
Assignment
1. How can a predictive model based on these data be used by Your-
cabs.com?
2. How can a profiling model (identifying predictors that distinguish can-
celed/uncanceled trips) be used by Yourcabs.com?
3. Explore, prepare, and transform the data to facilitate predictive modeling.
Here are some hints:
• In exploratory modeling, it is useful to move fairly soon to at least an
initial model without solving all data preparation issues. One example
is the GPS information—other geographic information is available so
you could defer the challenge of how to interpret/use the GPS infor-
mation.
5Copyright © Datastats, LLC and Galit Shmueli 2017; used with permission.
518 CASES
• How will you deal with missing data, such as cases where NULL is
indicated?
• Think about what useful information might be held within the date
and time fields (the booking timestamp and the trip timestamp). The
data file contains a worksheet with some hints on how to extract fea-
tures from the date/time field.
• Think also about the categorical variables, and how to deal with them.
Should we turn them all into dummies? Use only some?
4. Fit several predictive models of your choice. Do they provide information
on how the predictor variables relate to cancellations?
5. Report the predictive performance of your model in terms of error rates
(the confusion matrix). How well does the model perform? Can the
model be used in practice?
6. Examine the predictive performance of your model in terms of ranking
(lift). How well does the model perform? Can the model be used in
practice?
21.6 Segmenting Consumers of Bath Soap6
BathSoap.csv is the dataset for this case study.
Business Situation
CRISA is an Asian market research agency that specializes in tracking consumer
purchase behavior in consumer goods (both durable and nondurable). In one
major research project, CRISA tracks numerous consumer product categories
(e.g., “detergents”), and, within each category, perhaps dozens of brands. To
track purchase behavior, CRISA constituted household panels in over 100 cities
and towns in India, covering most of the Indian urban market. The households
were carefully selected using stratified sampling to ensure a representative sample;
a subset of 600 records is analyzed here. The strata were defined on the basis of
socioeconomic status and the market (a collection of cities).
CRISA has both transaction data (each row is a transaction) and household
data (each row is a household), and for the household data it maintains the fol-
lowing information:
• Demographics of the households (updated annually)
• Possession of durable goods (car, washing machine, etc., updated annually;
an “affluence index” is computed from this information)
• Purchase data of product categories and brands (updated monthly)
6Copyright © Cytel, Inc. and Resampling Stats, Inc. 2017; used with permission.
SEGMENTING CONSUMERS OF BATH SOAP 519
CRISA has two categories of clients: (1) advertising agencies that subscribe
to the database services, obtain updated data every month, and use the data to
advise their clients on advertising and promotion strategies; (2) consumer goods
manufacturers, which monitor their market share using the CRISA database.
Key Problems
CRISA has traditionally segmented markets on the basis of purchaser demo-
graphics. They would now like to segment the market based on two key sets of
variables more directly related to the purchase process and to brand loyalty:
1. Purchase behavior (volume, frequency, susceptibility to discounts, and
brand loyalty)
2. Basis of purchase (price, selling proposition)
Doing so would allow CRISA to gain information about what demographic
attributes are associated with different purchase behaviors and degrees of brand
loyalty, and thus deploy promotion budgets more effectively. More effective
market segmentation would enable CRISA’s clients (in this case, a firm called
IMRB) to design more cost-effective promotions targeted at appropriate seg-
ments. Thus, multiple promotions could be launched, each targeted at different
market segments at different times of the year. This would result in a more cost-
effective allocation of the promotion budget to different market segments. It
would also enable IMRB to design more effective customer reward systems and
thereby increase brand loyalty.
Data
The data in Table 21.8 profile each household, each row containing the data for
one household.
Measuring Brand Loyalty
Several variables in this case measure aspects of brand loyalty. The number of
different brands purchased by the customer is one measure of loyalty. However, a
consumer who purchases one or two brands in quick succession, then settles on
a third for a long streak, is different from a consumer who constantly switches
back and forth among three brands. Therefore, how often customers switch
from one brand to another is another measure of loyalty. Yet a third perspective
on the same issue is the proportion of purchases that go to different brands—a
consumer who spends 90% of his or her purchase money on one brand is more
loyal than a consumer who spends more equally among several brands.
All three of these components can be measured with the data in the purchase
summary worksheet.
520 CASES
TABLE 21.8 DESCRIPTION OF VARIABLES FOR EACH HOUSEHOLD
Variable Type Variable Name Description
Member ID Member id Unique identifier for each household
Demographics SEC Socioeconomic class (1 = high, 5 = low)
FEH Eating habits(1 = vegetarian, 2 =
vegetarian but eat eggs,
3 = nonvegetarian, 0 = not specified)
MT Native language (see table in worksheet)
SEX Gender of homemaker (1 = male,
2 = female)
AGE Age of homemaker
EDU Education of homemaker (1 = minimum,
9 = maximum)
HS Number of members in household
CHILD Presence of children in household
(4 categories)
CS Television availability (1 = available,
2 = unavailable)
Affluence
Index
Weighted value of durables possessed
Purchase summary
over the period
No. of Brands Number of brands purchased
Brand Runs Number of instances of consecutive
purchase of brands
Total Volume Sum of volume
No. of Trans Number of purchase transactions
(multiple brands purchased in a month
are counted as separate transactions)
Value Sum of value
Trans/ Brand
Runs
Average transactions per brand run
Vol/Trans Average volume per transaction
Avg. Price Average price of purchase
Purchase within
promotion
Pur Vol Percent of volume purchased
No Promo - % Percent of volume purchased under
no promotion
Pur Vol Promo 6% Percent of volume purchased under
promotion code 6
Pur Vol Other
Promo %
Percent of volume purchased under
other promotions
Brandwise
purchase
Br. Cd. (57, 144),
55, 272, 286, 24,
481, 352, 5,
and 999 (others)
Percent of volume purchased of the brand
Price categorywise
purchase
Price Cat 1 to 4 Percent of volume purchased under the
price category
Selling
propositionwise
purchase
Proposition Cat 5 to
15
Percent of volume purchased under the
product proposition category
DIRECT-MAIL FUNDRAISING 521
Assignment
1. Use k-means clustering to identify clusters of households based on:
a. The variables that describe purchase behavior (including brand loyalty)
b. The variables that describe the basis for purchase
c. The variables that describe both purchase behavior and basis of pur-
chase
Note 1: How should k be chosen? Think about how the clusters would
be used. It is likely that the marketing efforts would support two to five
different promotional approaches.
Note 2: How should the percentages of total purchases comprised by
various brands be treated? Isn’t a customer who buys all brand A just as
loyal as a customer who buys all brand B? What will be the effect on any
distance measure of using the brand share variables as is? Consider using
a single derived variable.
2. Select what you think is the best segmentation and comment on the char-
acteristics (demographic, brand loyalty, and basis for purchase) of these
clusters. (This information would be used to guide the development of
advertising and promotional campaigns.)
3. Develop a model that classifies the data into these segments. Since this
information would most likely be used in targeting direct-mail promo-
tions, it would be useful to select a market segment that would be defined
as a success in the classification model.
21.7 Direct-Mail Fundraising
Fundraising.csv and FutureFundraising.csv are the datasets used for this case study.
Background
Note: Be sure to read the information about oversampling and adjustment in
Chapter 5 before starting to work on this case.
A national veterans’ organization wishes to develop a predictive model to
improve the cost-effectiveness of their direct marketing campaign. The organi-
zation, with its in-house database of over 13 million donors, is one of the largest
direct-mail fundraisers in the United States. According to their recent mailing
records, the overall response rate is 5.1%. Out of those who responded (donated),
the average donation is $13.00. Each mailing, which includes a gift of personal-
ized address labels and assortments of cards and envelopes, costs $0.68 to produce
and send. Using these facts, we take a sample of this dataset to develop a classifi-
cation model that can effectively capture donors so that the expected net profit is
522 CASES
maximized. Weighted sampling is used, under-representing the non-responders
so that the sample has equal numbers of donors and non-donors.
Data
The file Fundraising.csv contains 3120 records with 50% donors (TAR-
GET_B=1) and 50% non-donors (TARGET_B = 0). The amount of donation
(TARGET_D) is also included but is not used in this case. The descriptions for
the 22 variables (including two target variables) are listed in Table 21.9.
TABLE 21.9 DESCRIPTION OF VARIABLES FOR THE FUNDRAISING DATASET
Variable Description
ZIP Zip code group (zip codes were grouped into five groups;
1 = the potential donor belongs to this zip group.)
00000–19999 ⇒ zipconvert_1
20000–39999 ⇒ zipconvert_2
40000–59999 ⇒ zipconvert_3
60000–79999 ⇒ zipconvert_4
80000–99999 ⇒ zipconvert_5
HOMEOWNER 1 = homeowner, 0 = not a homeowner
NUMCHLD Number of children
INCOME Household income
GENDER 0 = male, 1 = female
WEALTH Wealth rating uses median family income and population statistics
from each area to index relative wealth within each state
The segments are denoted 0 to 9, with 9 being the highest-wealth
group and zero the lowest. Each rating has a different
meaning within each state.
HV Average home value in potential donor’s neighborhood in
hundreds of dollars
ICmed Median family income in potential donor’s neighborhood in
hundreds of dollars
ICavg Average family income in potential donor’s neighborhood in hundreds
IC15 Percent earning less than $15K in potential donor’s neighborhood
NUMPROM Lifetime number of promotions received to date
RAMNTALL Dollar amount of lifetime gifts to date
MAXRAMNT Dollar amount of largest gift to date
LASTGIFT Dollar amount of most recent gift
TOTALMONTHS Number of months from last donation to July 1998 (the last time the case
was updated)
TIMELAG Number of months between first and second gift
AVGGIFT Average dollar amount of gifts to date
TARGET−B Outcome variable: binary indicator for response
1 = donor, 0 = non-donor
TARGET−D Outcome variable: donation amount (in dollars). We will NOT be using this
variable for this case.
DIRECT-MAIL FUNDRAISING 523
Assignment
Step 1: Partitioning. Partition the dataset into 60% training and 40% val-
idation (set the seed to 12345).
Step 2: Model Building. Follow the following steps to build, evaluate,
and choose a model.
1. Select classification tool and parameters. Run at least two classification models
of your choosing. Be sure NOT to use TARGET_D in your analysis.
Describe the two models that you chose, with sufficient detail (method,
parameters, variables, etc.) so that it can be replicated.
2. Classification under asymmetric response and cost. What is the reasoning
behind using weighted sampling to produce a training set with equal num-
bers of donors and non-donors? Why not use a simple random sample
from the original dataset?
3. Calculate net profit. For each method, calculate the lift of net profit for both
the training and validation sets based on the actual response rate (5.1%.)
Again, the expected donation, given that they are donors, is $13.00, and
the total cost of each mailing is $0.68. (Hint: To calculate estimated net
profit, we will need to undo the effects of the weighted sampling and
calculate the net profit that would reflect the actual response distribution
of 5.1% donors and 94.9% non-donors. To do this, divide each row’s net
profit by the oversampling weights applicable to the actual status of that
row. The oversampling weight for actual donors is 50%/5.1% = 9.8. The
oversampling weight for actual non-donors is 50%/94.9% = 0.53.)
4. Draw lift curves. Draw each model’s net profit lift curve for the validation
set onto a single graph (net profit on the y-axis, proportion of list or
number mailed on the x-axis). Is there a model that dominates?
5. Select best model. From your answer in (4), what do you think is the “best”
model?
Step 3: Testing. The file FutureFundraising.csv contains the attributes for
future mailing candidates.
6. Using your “best” model from Step 2 (number 5), which of these candi-
dates do you predict as donors and non-donors? List them in descending
order of the probability of being a donor. Starting at the top of this sorted
list, roughly how far down would you go in a mailing campaign?
524 CASES
21.8 Catalog Cross-Selling7
CatalogCrossSell.csv is the dataset for this case study.
Background
Exeter, Inc. is a catalog firm that sells products in a number of different cata-
logs that it owns. The catalogs number in the dozens, but fall into nine basic
categories:
1. Clothing
2. Housewares
3. Health
4. Automotive
5. Personal electronics
6. Computers
7. Garden
8. Novelty gift
9. Jewelry
The costs of printing and distributing catalogs are high. By far the biggest
cost of operation is the cost of promoting products to people who buy nothing.
Having invested so much in the production of artwork and printing of catalogs,
Exeter wants to take every opportunity to use them effectively. One such oppor-
tunity is in cross-selling—once a customer has “taken the bait” and purchases
one product, try to sell them another while you have their attention.
Such cross-promotion might take the form of enclosing a catalog in the
shipment of a purchased product, together with a discount coupon to induce a
purchase from that catalog. Or, it might take the form of a similar coupon sent
by e-mail, with a link to the web version of that catalog.
But which catalog should be enclosed in the box or included as a link in
the e-mail with the discount coupon? Exeter would like it to be an informed
choice—a catalog that has a higher probability of inducing a purchase than simply
choosing a catalog at random.
Assignment
Using the dataset CatalogCrossSell.csv, perform an association rules analysis, and
comment on the results. Your discussion should provide interpretations in
English of the meanings of the various output statistics (lift ratio, confidence,
7Copyright © Resampling Stats, Inc. 2017; used with permission.
PREDICTING BANKRUPTCY 525
support) and include a very rough estimate (precise calculations are not neces-
sary) of the extent to which this will help Exeter make an informed choice about
which catalog to cross-promote to a purchaser.
Acknowledgment The data for this case have been adapted from the data
in a set of cases provided for educational purposes by the Direct Marketing Edu-
cation Foundation (“DMEF Academic Data Set Two, Multi Division Catalog
Company, Code: 02DMEF”); used with permission.
21.9 Predicting Bankruptcy
Bankruptcy.csv is the dataset for this case study.
Predicting Corporate Bankruptcy8
Just as doctors check blood pressure and pulse rate as vital indicators of the health
of a patient, business analysts scour the financial statements of a corporation
to monitor its financial health. Whereas blood pressure, pulse rate, and most
medical vital signs, however, are measured through precisely defined procedures,
financial variables are recorded under much less specific general principles of
accounting. A primary issue in financial analysis, then, is how predictable is the
health of a company?
One difficulty in analyzing financial report information is the lack of disclo-
sure of actual cash receipts and disbursements. Users of financial statements have
had to rely on proxies for cash flow, perhaps the simplest of which is income
(INC) or earnings per share. Attempts to improve INC as a proxy for cash flow
include using income plus depreciation (INCDEP), working capital from oper-
ations (WCFO), and cash flow from operations (CFFO). CFFO is obtained by
adjusting income from operations for all noncash expenditures and revenues and
for changes in the current asset and current liabilities accounts.
8This case was prepared by Professor Mark E. Haskins and Professor Phillip E. Pfeifer. It was written as
a basis for class discussion rather than to illustrate effective or ineffective handling of an administrative
situation. Copyright © 1988 by the University of Virginia Darden School Foundation, Charlottesville,
VA. All rights reserved. To order copies, send an e-mail to sales@dardenpublishing.com. No part of
this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in
any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without
the permission of the Darden School Foundation.
mailto:sales@dardenpublishing.com
526 CASES
A further difficulty in interpreting historical financial disclosure information
is caused whenever major changes are made in accounting standards. For exam-
ple, the Financial Accounting Standards Board issued several promulgations in
the middle 1970s that changed the requirements for reporting accruals pertaining
to such things as equity earnings, foreign currency gains and losses, and deferred
taxes. One effect of changes of this sort was that earnings figures became less
reliable indicators of cash flow.
In the light of these difficulties in interpreting accounting information, just
what are the important vital signs of corporate health? Is cash flow an important
signal? If not, what is? If so, what is the best way to approximate cash flow?
How can we predict the impending demise of a company?
To begin to answer some of these important questions, we conducted a study
of the financial vital signs of bankrupt and healthy companies. We first identified
66 failed firms from a list provided by Dun and Bradstreet. These firms were
in manufacturing or retailing and had financial data available on the Compustat
Research tape. Bankruptcy occurred somewhere between 1970 and 1982.
For each of these 66 failed firms, we selected a healthy firm of approxi-
mately the same size (as measured by the book value of the firm’s assets) from
the same industry (3 digit SIC code) as a basis of comparison. This matched
sample technique was used to minimize the impact of any extraneous factors
(such as industry) on the conclusions of the study.
The study was designed to see how well bankruptcy can be predicted 2 years
in advance. A total of 24 financial ratios were computed for each of the 132 firms
using data from the Compustat tapes and from Moody’s Industrial Manual for
the year that was 2 years prior to the year of bankruptcy. Table 21.10 lists the 24
ratios together with an explanation of the abbreviations used for the fundamental
financial variables. All these variables are contained in a firm’s annual report with
the exception of CFFO. Ratios were used to facilitate comparisons across firms
of various sizes.
The first four ratios using CASH in the numerator might be thought of as
measures of a firm’s cash reservoir with which to pay debts. The three ratios with
CURASS in the numerator capture the firm’s generation of current assets with
which to pay debts. Two ratios, CURDEBT/DEBT and ASSETS/DEBTS,
measure the firm’s debt structure. Inventory and receivables turnover are mea-
sured by COGS/INV and SALES/REC, and SALES/ASSETS measures the
firm’s ability to generate sales. The final 12 ratios are asset flow measures.
Assignment
1. What data mining technique(s) would be appropriate in assessing whether
there are groups of variables that convey the same information and how
important that information is? Conduct such an analysis.
PREDICTING BANKRUPTCY 527
TABLE 21.10 PREDICTING CORPORATE BANKRUPTCY:
FINANCIAL VARIABLES AND RATIOS
Abbreviation Financial Variable Ratio Definition
ASSETS Total assets R1 CASH/CURDEBT
CASH Cash R2 CASH/SALES
CFFO Cash flow from operations R3 CASH/ASSETS
COGS Cost of goods sold R4 CASH/DEBTS
CURASS Current assets R5 CFFO/SALES
CURDEBT Current debt R6 CFFO/ASSETS
DEBTS Total debt R7 CFFO/DEBTS
INC Income R8 COGS/INV
INCDEP Income plus depreciation R9 CURASS/CURDEBT
INV Inventory R10 CURASS/SALES
REC Receivables R11 CURASS/ASSETS
SALES Sales R12 CURDEBT/DEBTS
WCFO Working capital from operations R13 INC/SALES
R14 INC/ASSETS
R15 INC/DEBTS
R16 INCDEP/SALES
R17 INCDEP/ASSETS
R18 INCDEP/DEBTS
R19 SALES/REC
R20 SALES/ASSETS
R21 ASSETS/DEBTS
R22 WCFO/SALES
R23 WCFO/ASSETS
R24 WCFO/DEBTS
2. Comment on the distinct goals of profiling the characteristics of bankrupt
firms versus simply predicting (black box style) whether a firm will go
bankrupt and whether both goals, or only one, might be useful. Also
comment on the classification methods that would be appropriate in each
circumstance.
3. Explore the data to gain a preliminary understanding of which vari-
ables might be important in distinguishing bankrupt from nonbankrupt
firms. (Hint: As part of this analysis, use side-by-side boxplots, with the
bankrupt/not bankrupt variable as the x variable.)
4. Using your choice of classifiers, use R to produce several models to pre-
dict whether or not a firm goes bankrupt, assessing model performance
on a validation partition.
5. Based on the above, comment on which variables are important in clas-
sification, and discuss their effect.
528 CASES
21.10 Time Series Case: Forecasting Public
Transportation Demand
bicup2006.csv is the dataset for this case study.
Background
Forecasting transportation demand is important for multiple purposes such as
staffing, planning, and inventory control. The public transportation system in
Santiago de Chile has gone through a major effort of reconstruction. In this
context, a business intelligence competition took place in October 2006, which
focused on forecasting demand for public transportation. This case is based on
the competition, with some modifications.
Problem Description
A public transportation company is expecting an increase in demand for its ser-
vices and is planning to acquire new buses and to extend its terminals. These
investments require a reliable forecast of future demand. To create such forecasts,
one can use data on historic demand. The company’s data warehouse has data on
each 15-minute interval between 6:30 and 22:00, on the number of passengers
arriving at the terminal. As a forecasting consultant, you have been asked to cre-
ate a forecasting method that can generate forecasts for the number of passengers
arriving at the terminal.
Available Data
Part of the historic information is available in the file bicup2006.csv. The file con-
tains the worksheet “Historic Information” with known demand for a 3-week
period, separated into 15-minute intervals. The second worksheet (“Future”)
contains dates and times for a future 3-day period, for which forecasts should be
generated (as part of the 2006 competition).
Assignment Goal
Your goal is to create a model/method that produces accurate forecasts. To
evaluate your accuracy, partition the given historic data into two periods: a
training period (the first two weeks), and a validation period (the last week).
Models should be fitted only to the training data and evaluated on the validation
data.
Although the competition winning criterion was the lowest Mean Absolute
Error (MAE) on the future 3-day data, this is not the goal for this assignment.
Instead, if we consider a more realistic business context, our goal is to create
a model that generates reasonably good forecasts on any time/day of the week.
TIME SERIES CASE: FORECASTING PUBLIC TRANSPORTATION DEMAND 529
Consider not only predictive metrics such as MAE, MAPE, and RMSE, but also
look at actual and forecasted values, overlaid on a time plot, as well as a time plot
of the forecast errors.
Assignment
For your final model, present the following summary:
1. Name of the method/combination of methods.
2. A brief description of the method/combination.
3. All estimated equations associated with constructing forecasts from this
method.
4. The MAPE and MAE for the training period and the validation period.
5. Forecasts for the future period (March 22–24), in 15-minute bins.
6. A single chart showing the fit of the final version of the model to the
entire period (including training, validation, and future). Note that this
model should be fitted using the combined training plus validation data.
Tips and Suggested Steps
1. Use exploratory analysis to identify the components of this time series. Is
there a trend? Is there seasonality? If so, how many “seasons” are there?
Are there any other visible patterns? Are the patterns global (the same
throughout the series) or local?
2. Consider the frequency of the data from a practical and technical point
of view. What are some options?
3. Compare the weekdays and weekends. How do they differ? Consider
how these differences can be captured by different methods.
4. Examine the series for missing values or unusual values. Think of solu-
tions.
5. Based on the patterns that you found in the data, which models or meth-
ods should be considered?
6. Consider how to handle actual counts of zero within the computation of
MAPE.
References
Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining associations between sets of items
in massive databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on
Management of Data, pp. 207-216, New York: ACM Press.
Berry, M. J. A., and Linoff, G. S. (1997). Data Mining Techniques. New York: Wiley.
Berry, M. J. A., and Linoff, G. S. (2000). Mastering Data Mining. New York: Wiley.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression
Trees. Boca Raton, FL: Chapman Hall/CRC (orig. published by Wadsworth).
Chatfield, C. (2003). The Analysis of Time Series: An Introduction, 6th ed. Boca Raton, FL:
Chapman Hall/CRC.
Delmaster, R., and Hancock, M. (2001). Data Mining Explained. Boston: Digital Press.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant
analysis. Journal of the American Statistical Association, vol. 70, number 352, pp. 892-898.
Few, S. (2009). Now You See It. Analytics Press, Oakland CA, USA.
Few, S. (2012). Show Me the Numbers, 2nd ed. Oakland, CA: Analytics Press.
Golbeck, J. (2013). Analyzing the Social Web. Waltham, MA: Morgan Kaufmann.
Han, J., and Kamber, M. (2001). Data Mining: Concepts and Techniques. San Diego, CA:
Academic Press.
Hand, D., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. Cambridge, MA:
MIT Press.
Harris, H., Murphy, S., and Vaisman, M. (2013). Analyzing the Analyzers: An Introspective
Survey of Data Scientists and Their Work, Cambridge, MA: O’Reilly Media.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. New
York: Springer.
Hosmer, D. W., and Lemeshow, S. (2000). Applied Logistic Regression, 2nd ed. New York:
Wiley-Interscience.
Hothorn, T., Hornik K., and Zeileis, A. (2006). Unbiased Recursive Partitioning: A
Conditional Inference Framework. Journal of Computational and Graphical Statistics, vol.
15, number 3, pp. 651-674.
Jank, W., and Yahav, I. (2010). E-Loyalty networks in online auctions. Annal of Applied
Statistics, vol. 4, number 1, pp. 151-178.
Johnson, W., and Wichern, D. (2002). Applied Multivariate Statistics. Upper Saddle River,
NJ: Prentice Hall.
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
531
532 REFERENCES
Larsen, K. (2005). Generalized naive Bayes classifiers. SIGKDD Explorations, vol. 7, num-
ber 1, pp. 76-81.
Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., and
Ng, A.Y. (2012). Building high-level features using large scale unsupervised learning. In
Proceedings of the Twenty-Ninth International Conference on Machine Learning. Editors: John
Langford and Joelle Pineau. Edinburgh: Omnipress.
Loh, W.-Y., and Shih, Y.-S. (1997). Split selection methods for classification trees, Statistica
Sinica, vol. 7, pp. 815-840.
McCullugh, C. E., Paal, B., and Ashdown, S. P. (1998). An optimisation approach to apparel
sizing. Journal of the Operational Research Society, vol. 49, number 5, pp. 492-499.
Pregibon, D. (1999). 2001: A statistical odyssey. Invited talk at The Fifth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, New York: ACM Press,
p. 4.
Shmueli, G., and Lichtendahl, K. C. (2016). Practical Time Series Forecasting with R: A
Hands-On Guide, 2nd ed. Axelrod-Schnall Publishers.
Siegel, E. (2013). Predictive Analytics, New York: Wiley.
Trippi, R., and Turban, E. (eds.) (1996). Neural Networks in Finance and Investing. New
York: McGraw-Hill.
Veenhoven, R., World Database of Happiness, Erasmus University Rotterdam. Available at
http://worlddatabaseofhappiness.eur.nl.
http://worlddatabaseofhappiness.eur.nl
Data Files Used in the Book
1. Accidents.csv
2. AdSales.csv
3. Airfares.csv
4. Amtrak data.csv
5. ApplianceShipments.csv
6. AustralianWines.csv
7. AutosElectronics.zip
8. Bankruptcy.csv
9. Banks.csv
10. BathSoap.csv
11. bicup2006.csv
12. BostonHousing.csv
13. CanadianWorkHours.csv
14. CatalogCrossSell.csv
15. Cereals.csv
16. CharlesBookClub.csv
17. Cosmetics.csv
18. Cosmetics-small.csv
19. CourseTopics.csv
20. Drug.csv
21. DepartmentStoreSales.csv
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
533
534 DATA FILES USED IN THE BOOK
22. EastWestAirlinesCluster.csv
23. EastWestAirlinesNN.csv
24. eBayAuctions.csv
25. eBayNetwork.csv
26. eBayTreemap.csv
27. Faceplate.csv
28. Farm-ads.csv
29. FlightDelays.csv
30. Fundraising.csv
31. FutureFundraising.csv
32. gdp.csv
33. GermanCredit.csv
34. Hair-Care-Product.csv
35. LaptopSales.txt
36. LaptopSalesJanuary2008.csv
37. Pharmaceuticals.csv
38. Persuasion-A.csv
39. RidingMowers.csv
40. ShampooSales.csv
41. Sept11Travel.csv
42. SouvenirSales.csv
43. SP500.csv
44. Spambase.csv
45. SystemAdministrators.csv
46. Taxi-cancellation-case.csv
47. Tayko.csv
48. ToyotaCorolla.csv
49. ToysRUsRevenues.csv
50. UniversalBank.csv
51. Universities.csv
52. Utilities.csv
53. Voter-Persuasion.csv
54. WalMartStock.csv
55. WestRoxbury.csv
56. Wine.csv
Index
AIC, 163
A-B testing, 311
A-B tests, 318
accident data
discriminant analysis, 303
naive Bayes, 202
neural nets, 282
activation function, 275
additive seasonality, 410
adjusted-R2, 163, 166
affinity analysis, 16, 329
agglomerative, 360, 368
agglomerative algorithm, 368
aggregation, 70, 71, 74, 78
AIC, 445
airfare data
multiple linear regression, 170
Akaike Information Criterion, 163
algorithm, 9
ALVINN, 272
Amtrak data
time series, 390, 402
visualization, 59
Amtrak ridership example, 435
analytics, 3
antecedent, 331
appliance shipments data
time series, 399, 429, 449
visualization, 88
Apriori algorithm, 329, 333
AR models, 416
AR(1), 417
area under the curve, 132
ARIMA models, 416
artificial intelligence, 5, 9, 92
artificial neural networks, 271
association rules, 16, 18, 329, 330
confidence, 333, 337
cutoff, 336
data format, 335
itemset, 331
lift ratio, 333, 335
random selection, 337
statistical significance, 340
support, 332
assumptions, 261
asymmetric cost, 26, 140, 523
asymmetric response, 523
attribute, 9
AUC, 132
Australian wine sales data
time series, 430, 451
Auto posts, 496
autocorrelation, 401, 413, 417
average linkage, 371, 376
average squared errors, 156
back propagation, 278
backward elimination, 166, 250
bag-of-words, 480, 487
bagging, 229, 315
balanced portfolios, 358
bankruptcy data, 525
bar chart, 59
batch updating, 279
bath soap data, 518
benchmark, 118, 138
Data Mining for Business Analytics: Concepts, Techniques, and Applications in R, First Edition.
Galit Shmueli, Peter C. Bruce, Inbal Yahav, Nitin R. Patel, and Kenneth C. Lichtendahl, Jr.
© 2018 John Wiley & Sons, Inc. Published 2018 by John Wiley & Sons, Inc.
535
536 INDEX
benchmark confidence value, 334
best-pruned tree, 224
best subsets, 42
betweenness in a network, 463
bias, 35, 162, 164, 301
bias–variance trade-off, 162
big data, 6
binning, 77, 78
binomial distribution, 260
blackbox, 271, 287, 288, 317, 389, 489
boosted tree, 205
boosting, 315
bootstrap, 315
Boston housing, 93
Boston housing data, 93
multiple linear regression, 169
visualization, 57
Box-Cox transformation, 405
boxplot, 55, 61, 74, 119, 527
side-by-side, 63
bubble plot, 70
Bureau of Transportation Statistics,
250
business analytics, 3, 15
business intelligence, 3
Cp, 164
C4.5, 206, 222
cab cancellations, 517
Canadian manufacturing
workhours data
time series, 399, 423
CART, 206, 214, 222
case, 9
case updating, 279
catalog cross-selling case, 524
catalog cross-selling data, 524
categorical variable, 207
centrality in a network, 463
centroid, 296, 297, 376, 377
cereals data, 101
exploratory data analysis, 112
hierarchical clustering, 383
CHAID, 221
Charles Book Club case, 499
Charles Book Club data, 340, 499
chi-square distribution, 262
chi-square test, 221
city-block distance, 365
classification, 16, 59, 67, 76, 86, 118,
173, 187
discriminant analysis, 293
logistic regression, 238
classification and regression trees, 111,
283, 288
classification functions, 297, 304
classification matrix, 124
classification methods, 247
classification performance
accuracy measures, 126
confusion matrix, 124
training data, 126
validation data, 126
classification rules, 206, 226
classification scores, 299, 302
classification tree, 205
performance, 215
classification trees, 206, 507
classifier, 118, 301, 527
cleaning the data, 26
clique, 462
closeness in a network, 463
cluster
average linkage, 370
complete linkage, 370
single linkage, 369
cluster analysis, 59, 174, 357
allocate the records, 378
average distance, 367
centroid distance, 367
initial partition, 376
labeling, 374
maximum distance, 366
minimum distance, 366
nesting, 373
normalizing, 362
outliers, 376
partition, 374
partitioned, 378
randomly generated starting
partitions, 378
stability, 376
sum of distances, 378
summary statistics, 374
unequal weighting, 364
validating clusters, 373
cluster centroids, 378
cluster dispersion, 377
cluster validity, 381
INDEX 537
clustering, 17, 18, 109, 347, 348
clustering algorithms, 368
clustering techniques, 30
coefficients, 277
collaborative filtering, 17, 342
collaborative filtering using network
data, 468
collinearity, 243
combining categories, 111
conditional inference trees, 221
conditional probability, 9, 10, 187, 192,
334
confidence, 9, 333
confidence interval, 9, 333
confidence levels, 333
confusion matrix, 124, 145, 216, 254,
262, 507
connected network, 462
consequent, 331
consistent, 243
continuous response, 121
corpus, 482
correlated, 91, 364
correlation, 296, 345
correlation analysis, 97
correlation matrix, 97
correlation table, 64
correlation-based similarity, 364
correspondence analysis, 101
cosine similarity, 346
cosmetics data
affinity analysis, 353
association rules, 354
cost/gain matrix, 507
costs, 25
course topics data
association rules, 352
collaborative filtering, 353
covariance matrix, 296, 365
credit card data
neural nets, 290
credit risk score, 506
cross validation, 37, 222
curse of dimensionality, 92, 183
customer segmentation, 358
cutoff, 127, 501, 507
cutoff value, 247, 276
classification tree, 214
logistic regression, 239
dashboards, 3
data driven method, 271, 433
data exploration, 91
data mining, 5
data partitioning, 216, 387
data pipeline, 44
data projection, 104
data reduction, 17
data science, 7
data visualization, 18
data-driven method, 389, 433
database marketing, 500
de-seasonalizing, 438
de-trending, 438
decile chart, 138, 247
decile lift chart, 122, 138
decision making, 238, 289
deep learning, 287
deep learning networks, 287
degree distribution, 465
delayed flight data
classification tree, 234, 325
logistic regression, 250
delimiters for text, 484
dendrogram, 357, 371
density of a network, 465
department store sales data
time series, 398, 426, 447
dependent variable, 9
deviance, 262
deviation, 119
dimension reduction, 17, 18, 91, 92,
205, 237, 349, 488
dimensionality, 92
directed vs. undirected network, 457
discriminant analysis, 6, 174, 232, 293
assumptions, 300
classification performance, 300
classification score, 297
confusion matrix, 301
correlation matrix, 301
cutoff, 299
distance, 296
expected cost of misclassification,
302
lift chart, 301
lift curves, 299
more than two classes, 303
multivariate normal distribution,
300
538 INDEX
discriminant analysis (continued )
outliers, 301
prior probabilities, 302
prior/future probability of
membership, 302
probabilities of class membership,
302
propensities, 302
unequal misclassification costs, 302
validation set, 301
discriminant functions, 227
discriminators, 296
disjoint, 331
distance, 361, 376
distance between records, 174, 366
distance matrix, 367, 369, 376
distances between clusters, 366
distribution plots, 55, 61
divisive, 360
document
text mining, 480
domain knowledge, 29, 30, 92, 162,
368, 377
domain-dependent, 362
double exponential smoothing, 442
dummy variables, 27, 99, 101, 157
East-West Airlines data
cluster analysis, 383
neural nets, 290
eBay auctions data
classification tree, 234
logistic regression, 269
edge in a network, 456
edge list, 460
edge weight, 462
efficient, 243
egocentric network, 463
eigenvector centrality, 463
ensemble, 311
ensemble forecast, 390
ensembles, 229
entity resolution, 467
entropy impurity measure, 210
entropy measure, 210, 228
epoch, 279
equal costs, 26
error
average, 40
back propagation, 278
mean, 119
mean absolute, 119
mean absolute percentage, 119
mean percentage, 119
mean percentage error, 119
overall error rate, 127
prediction, 119
RMS, 41
root mean squared, 41, 119
error rate, 222
estimation, 9, 10
Euclidean distance, 174, 296, 345, 361,
371, 376
evaluating performance, 227, 229
exhaustive search, 163, 166
expert knowledge, 92
explained variability, 163
explanatory modeling, 155
exploratory analysis, 301
explore, 20, 21
exponential smoothing, 433, 439, 445
Holt–Winters, 447
exponential trend, 74, 405
extrapolation, 288
factor analysis, 183
factor selection, 92
factor variable, 243
false discovery rate, 131
false omission rate, 131
farm ads, 495
feature, 9, 10
feature extraction, 92
field, 10
filtering, 56, 70, 72
finance, 358
financial applications, 293
Financial Condition of Banks data
logistic regression, 268
first principal component, 104
Fisher’s linear classification functions,
297
fitting the best model to the data, 155
fold, 37
forecasting, 59, 78, 388
forward selection, 166, 250
fraudulent financial reporting data, 188
naive rule, 124
fraudulent transactions, 25
frequent itemset, 332
INDEX 539
function
nonlinear, 239
fundraising data, 521–523
generalization, 33
German credit case, 505
Gini index, 210, 228
global pattern, 72, 86, 87, 393
goodness of fit, 118, 124
Gower, 366
graphical exploration, 56
heatmap, 64
hidden, 281
hidden layer, 271
hierarchical, 360
hierarchical clustering, 357, 368
hierarchical methods, 373
hierarchies, 70, 71
histogram, 55, 61, 78, 119, 158, 260
holdout data, 9, 36
holdout set, 9, 155
Holt-Winter’s exponential smoothing,
443
home value data, 21
homoskedasticity, 156
impurity, 221, 227
impurity measures, 210, 227
imputation, 162
independent variable, 10
industry analysis, 358
input variable, 9
integer programming, 376
interaction term, 248
interactions, 272
interactive visualization, 55, 77
software
Spotfire, 79
Tableau, 79
iterative search, 164
Jaquard’s coefficient, 365
jittering, 74
k-means clustering, 357, 361, 376, 521
k-nearest neighbor, 504
k-nearest neighbors, 6, 173, 345, 502
k-nearest neighbors algorithm, 173
kitchen-sink approach, 161
labeling documents, 489
labels, 74
lag-1, 417
laptop sales data
visualization, 88
large dataset, 6, 74
latent semantic analysis, 488
latent semantic indexing, 488
lazy learning, 183
learning rate, 278, 286, 289
least squares, 277, 278, 306
level, 387, 390
lift, 256
lift (gains) chart, 247
lift chart, 117, 121, 136, 139, 146,
262, 523
lift curve, 256
lift ratio, 333
line graph, 59
linear, 276
linear classification rule, 294
linear combination, 101, 102
linear regression, 6, 18, 31, 38, 111, 174,
259, 389
linear regression models, 272
linear relationship, 98, 153, 156, 162
link prediction, 467
linked plots, 78
links in a network, 456
loading, 22
local optimum, 288
local pattern, 72, 86, 87, 393
log transform, 277
log-scale, 295
logistic regression, 6, 111, 237, 272, 276,
293, 301, 503, 505, 507, 512
p-value, 263
classification, 242
classification performance, 247
confidence intervals, 243
dummy variable, 246
iterations, 243
maximum likelihood, 243
model interpretation, 254
model performance, 254
negative coefficients, 244
nominal classes, 265
ordinal classes, 264
parameter estimates, 243
positive coefficients, 244
540 INDEX
logistic regression (continued )
prediction, 242
profiling, 261
training data, 262
variable selection, 248, 257
logistic response function, 239
logit, 239, 240, 247
MA, 434
machine learning, 5, 9, 26
MAE, 119
Mahalanobis distance, 296, 364
majority class, 177
majority decision rule, 174
majority vote, 180
Manhattan distance, 364, 365
MAPE, 117, 119
market basket analysis, 329
market segmentation, 358
market structure analysis, 358
marketing, 215, 221, 501
matching, 467
matching coefficient, 365
maximum coordinate distance, 365
maximum likelihood, 243, 277, 278
mean absolute error, 119
mean absolute percentage error, 119
mean error, 119
mean percentage error, 119
measuring impurity, 228
minimum validation error, 283
minority class, 288
misclassification
asymmetric costs, 133
average misclassification cost, 134
estimated misclassification rate, 127
misclassification costs, 129
misclassification error, 122, 174
misclassification rate, 25, 126
missing data, 20, 30
missing values, 30, 162, 163, 205
model, 9, 25, 28, 33
logistic regression, 239
model complexity, 248
model performance, 144
model validity, 155
momentum, 286, 289
moving average, 433–435, 438
centered moving average, 434
trailing moving average, 434, 435
MPE, 119
multi-level forecaster, 390
multicollinearity, 98, 106, 162
multilayer feedforward networks, 272
multiple linear regression, 153, 238, 262,
306, 512
multiple linear regression model, 157
multiple panels, 57
multiplicative factor, 246
multiplicative model, 246
multiplicative seasonality, 411
naive Bayes, 187
naive classification, 133
naive forecast, 395
naive model, 262
naive rule, 124, 177
natural gas sales data
time series, 451
natural hierarchy, 360
nearest neighbor, 174
almost, 183
Netflix Prize contest, 311, 343
network analytics, 455
neural nets, 35
θj , 273
wi,j , 273
activation function, 275
architecture, 272, 282, 285
bias, 273, 275
bias nodes, 279
classification, 271, 276, 283
cutoff, 285
engineering applications, 272
financial applications, 271
guidelines, 285
hidden layer, 272, 274
input layer, 272
iteratively, 278, 279
learning rate, 286
local optima, 286
logistic, 275
momentum, 286
neurons, 272
nodes, 272
output layer, 272, 276
overfitting, 283
oversampling, 288
predicted probabilities, 281
prediction, 271, 283
INDEX 541
preprocessing, 277
random initial set, 279
sigmoidal function, 275
transfer function, 275
updating the weights, 279
user input, 285
variable selection, 288
weighted sum, 275, 276
weights, 273
neural networks, 227, 507
neurons, 271
node, 211
node in a network, 456
noise, 387, 390, 393
noisy data, 288
non-hierarchical, 361
non-hierarchical algorithms, 368
non-hierarchical clustering, 376
nonparametric method, 174
normal distribution, 156, 260
normalization, 486
normalize, 32, 107, 108, 362
numerical outcome, 180, 227
observation, 9
odds, 239, 243
OLAP, 16
ordering of, 256
ordinary least squares, 156
orthogonally, 104
outcome variable, 10
outliers, 20, 30, 39, 119, 364
output layer, 281
output variable, 10
over-fitting, 163
overall accuracy, 127
overall fit, 261
overfitting, 6, 33, 35, 36, 92, 117,
121, 126, 177, 205, 216,
271, 439
oversampling, 25, 118, 135, 140, 141,
144, 146, 512
oversampling without replacement, 143
oversmoothing, 177
overweight, 25
pairwise correlations, 97
Pandora, 181
parallel coordinates plot, 75
parametric assumptions, 182
parsimony, 28, 29, 162, 306
partial autocorrelations, 417
partition, 35, 253
path in a network, 462
pattern, 9
Pearson correlation, 364
performance, 155
personal loan data, 184, 202
CART, 224
classification tree, 215
discriminant analysis, 294, 307
logistic regression, 240
persuasion models, 317
pharmaceuticals data
cluster analysis, 382
pivot table, 99
political persuasion, 513
polynomial trend, 74, 407
predicting bankruptcy case, 525
predicting new records, 154
prediction, 10, 16, 58, 67, 86, 206, 227,
228, 349
prediction error, 119
predictive accuracy, 118
predictive analytics, 5, 15, 17
predictive modeling, 5, 155
predictive performance, 117, 158
accuracy measures, 119
predictor, 10
preprocessing, 20, 26, 243, 251
principal components, 106, 166
principal components analysis, 6, 40,
101, 183, 288, 349
classification and prediction, 109
normalizing the data, 107
training data, 110
validation set, 110
weighted averages, 106
weights, 105–107
principal components scores, 106
principal components weights, 108
prior probability, 135
probabilities
logistic regression, 238
probability plot, 260
profile, 10
profile plot, 378
profiling, 238
discriminant analysis, 293
propensities, 127, 136
542 INDEX
propensity, 179, 187, 214
logistic regression, 238
pruning, 205, 207, 222
public transportation demand case, 528
public transportation demand data, 528
public utilities data
cluster analysis, 359
pure, 207
quadratic discriminant analysis, 301
quadratic model, 407
R2, 118, 163, 166
ets, 445
random forest, 205, 315
random forests, 229
random walk, 401, 420
ranking, 118, 121, 136, 192, 247, 518
ranking of records, 200
ratio of costs, 303
re-scaling, 70
recommendation system, 342
recommender systems, 330
record, 9, 10, 30
record deletion, 162
recursive partitioning, 205, 207
redundancy, 101
reference category, 254
reference line, 137
regression, 153
time series, 401
regression tree, 205
regression trees, 166, 206, 227, 512
rescaling, 277
residual series, 415
residuals
histogram, 261
response, 10
response rate, 25
reweight, 143
RFM segmentation, 503
riding-mower data
CART, 207
discriminant analysis, 294
k-nearest neighbor, 175
logistic regression, 269
visualization, 88
right-skewed, 277
RMSE, 117, 119
robust, 243, 368, 376
robust distances, 364
robust to outliers, 232
ROC curve, 117, 131
root mean squared error, 119
row, 9
rules
association rules, 330
S&P monthly closing prices, 420
sample, 6, 9, 10, 21
sampling, 19, 24, 74
satellite radio customer data
association rules, 352
scale, 108, 283, 362
scaling, 32
scatter plot, 59, 67, 74, 108
animated, 70
color-coded, 67
scatter plot matrix, 69
score, 10, 501
scoring, 21, 42
seasonal variable, 410
seasonality, 387, 390, 417, 435
second principal component, 104
segmentation, 358
segmenting consumers of bath
soap case, 518
self-proximity, 361
SEMMA, 21
sensitivity, 131
sensitivity analysis, 287
separating hyper-plane, 297
separating line, 297
September 11 travel data
time series, 398, 422, 446
shampoo sales data
time series, 400, 450
similarity measures, 364
simple linear regression, 241
simple random sampling, 141
single linkage, 371
singleton, 463
singular value decomposition,
183, 349
smoothing, 176, 389
time series, 433
smoothing constants, 433
smoothing parameter, 440, 441
souvenir sales data
time series, 399, 428
INDEX 543
spam e-mail data
discriminant analysis, 308
specialized visualization, 80
specificity, 131
split points, 209
SQL, 16
standard error of estimate, 118
standardization, 107
standardize, 32, 107, 174
statistical distance, 296, 364, 376
statistics, 5
stemming, 486
steps in data mining, 19
stepwise, 166
stepwise regression, 166, 250
stopping tree growth, 221
stopword, 485
stopword list, 485
stratified sampling, 140
subset selection, 111, 164
subset selection in linear regression,
161
subsets, 183
success class, 10
sum of squared deviations,
156, 228
sum of squared errors, 262
sum of squared perpendicular distances,
104
summary statistics, 163
supervised learning, 10, 18, 55, 58,
59, 69, 117
system administrators data
discriminant analysis, 308
logistic regression, 268
target variable, 10
taxi cancellations, 517
Tayko data, 510
multiple linear regression, 169
Tayko software catalog case, 510
term
text mining, 480
Term Frequency–Inverse Document
Frequency, 487
term-document matrix
text mining, 480
terminal node, 222
test data, 20
test partition, 36
test set, 9, 10, 248
text mining, 480
TF-IDF, 487
time series
dummies, 407
lagged series, 413
residuals, 403
window width, 439, 440
time series forecasting, 86, 387
time series partitioning, 395
time-plot
Spotfire, 73
tokenization, 484
total variability, 102
Toyota Corolla data, 51, 156
classification tree, 227, 235
multiple linear regression, 172
backward elimination, 166
best subsets, 164
neural nets, 290
principal components analysis,
112
Toys R Us revenues data
data reduction, 99
time series, 424
training, 278
training data, 18, 20
training partition, 36
training period, 435, 445
training set, 10, 117, 126, 155, 387
transfer function, 275
transform, 277
transformation, 99
transformation of variables, 232
transformations, 272
transpose, 297
tree depth, 221
trees, 35
conditional inference, 222
search, 183
trend, 72, 74, 387, 390, 405, 435
quadratic, 417
trend lines, 74, 87, 393
trial, 279
triangle inequality, 361
unbiased, 156, 164
under-fitting, 163
undersampling, 141
unequal importance of classes, 131
544 INDEX
Universal Bank data, 184
classification tree, 215, 224
discriminant analysis, 294, 307
logistic regression, 243
university rankings data
cluster analysis, 382
principal components analysis, 112
unsupervised learning, 10, 18, 55, 59,
69, 77, 87, 349
UPGMA, 370
UPGMC, 370
uplift modeling, 311
uplift models, 317
validation data, 20, 222
validation partition, 36
validation period, 435, 445
validation set, 9, 10, 117, 126, 143, 153,
247, 387
variability
between-class, 297
within-class variability, 297
variable, 10
binary outcome, 238
selection, 28, 161
variable importance, 230
variable selection, 153, 250
variables
categorical, 26
continuous, 26
nominal, 26
numerical, 26
ordinal, 26
text, 26
variation
between-cluster, 374
within-cluster, 374
vertex in a network, 456
visualization, 3
animation, 70
color, 67
hue, 67
map chart, 83
multiple panels, 67, 69, 70
networks, 80
shape, 67, 70
size, 67, 70
treemaps, 82
Walmart stock data
time series, 425
Ward’s method, 370
weight decay, 278
weighted average, 180
weighted sampling, 522
West Roxbury housing data, 21
within-cluster dispersion, 378
z-score, 32, 296, 297, 362
zooming, 56, 70, 72, 78
WILEY END USER LICENSE
AGREEMENT
Go to www.wiley.com/go/eula to access Wiley’s ebook
EULA.
http://www.wiley.com/go/eula
DATA MINING FOR BUSINESS ANALYTICS
Contents
Foreword by Gareth James
Foreword by Ravi Bapna
Preface to the R Edition
Acknowledgments
PART I PRELIMINARIES
CHAPTER 1 Introduction
1.1 What Is Business Analytics?
1.2 What Is Data Mining?
1.3 Data Mining and Related Terms
1.4 Big Data
1.5 Data Science
1.6 Why Are There So Many Different Methods?
1.7 Terminology and Notation
1.8 Road Maps to This Book
Order of Topics
CHAPTER 2 Overview of the Data Mining Process
2.1 Introduction
2.2 Core Ideas in Data Mining
Classification
Prediction
Association Rules and Recommendation Systems
Predictive Analytics
Data Reduction and Dimension Reduction
Data Exploration and Visualization
Supervised and Unsupervised Learning
2.3 The Steps in Data Mining
2.4 Preliminary Steps
Organization of Datasets
Predicting Home Values in the West Roxbury Neighborhood
Loading and Looking at the Data in R
Sampling from a Database
Oversampling Rare Events in Classification Tasks
Preprocessing and Cleaning the Data
2.5 Predictive Power and Overfitting
Overfitting
Creation and Use of Data Partitions
2.6 Building a Predictive Model
Modeling Process
2.7 Using R for Data Mining on a Local Machine
2.8 Automating Data Mining Solutions
Data Mining Software: The State of the Market (by Herb Edelstein)
Problems
PART II DATA EXPLORATION AND DIMENSION REDUCTION
CHAPTER 3 Data Visualization
3.1 Uses of Data Visualization
Base R or ggplot?
3.2 Data Examples
Example 1: Boston Housing Data
Example 2: Ridership on Amtrak Trains
3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots
Distribution Plots: Boxplots and Histograms
Heatmaps: Visualizing Correlations and Missing Values
3.4 Multidimensional Visualization
Adding Variables: Color, Size, Shape, Multiple Panels, and Animation
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering
Reference: Trend Lines and Labels
Scaling up to Large Datasets
Multivariate Plot: Parallel Coordinates Plot
Interactive Visualization
3.5 Specialized Visualizations
Visualizing Networked Data
Visualizing Hierarchical Data: Treemaps
Visualizing Geographical Data: Map Charts
3.6 Summary: Major Visualizations and Operations, by Data Mining Goal
Prediction
Classification
Time Series Forecasting
Unsupervised Learning
Problems
CHAPTER 4 Dimension Reduction
4.1 Introduction
4.2 Curse of Dimensionality
4.3 Practical Considerations
Example 1: House Prices in Boston
4.4 Data Summaries
Summary Statistics
Aggregation and Pivot Tables
4.5 Correlation Analysis
4.6 Reducing the Number of Categories in Categorical Variables
4.7 Converting a Categorical Variable to a Numerical Variable
4.8 Principal Components Analysis
Example 2: Breakfast Cereals
Principal Components
Normalizing the Data
Using Principal Components for Classification and Prediction
4.9 Dimension Reduction Using Regression Models
4.10 Dimension Reduction Using Classification and Regression Trees
Problems
PART III PERFORMANCE EVALUATION
CHAPTER 5 Evaluating Predictive Performance
5.1 Introduction
5.2 Evaluating Predictive Performance
Naive Benchmark: The Average
Prediction Accuracy Measures
Comparing Training and Validation Performance
Lift Chart
5.3 Judging Classifier Performance
Benchmark: The Naive Rule
Class Separation
The Confusion (Classification) Matrix
Using the Validation Data
Accuracy Measures
Propensities and Cutoff for Classification
Performance in Case of Unequal Importance of Classes
Asymmetric Misclassification Costs
Generalization to More Than Two Classes
5.4 Judging Ranking Performance
Lift Charts for Binary Data
Decile Lift Charts
Beyond Two Classes
Lift Charts Incorporating Costs and Benefits
Lift as a Function of Cutoff
5.5 Oversampling
Oversampling the Training Set
Evaluating Model Performance Using a Non-oversampled Validation Set
Evaluating Model Performance if Only Oversampled Validation Set Exists
Problems
PART IV PREDICTION AND CLASSIFICATION METHODS
CHAPTER 6 Multiple Linear Regression
6.1 Introduction
6.2 Explanatory vs. Predictive Modeling
6.3 Estimating the Regression Equation and Prediction
Example: Predicting the Price of Used Toyota Corolla Cars
6.4 Variable Selection in Linear Regression
Reducing the Number of Predictors
How to Reduce the Number of Predictors
Problems
CHAPTER 7 k-Nearest Neighbors (kNN)
7.1 The k-NN Classifier (Categorical Outcome)
Determining Neighbors
Classification Rule
Example: Riding Mowers
Choosing k
Setting the Cutoff Value
k-NN with More Than Two Classes
Converting Categorical Variables to Binary Dummies
7.2 k-NN for a Numerical Outcome
7.3 Advantages and Shortcomings of k-NN Algorithms
Problems
CHAPTER 8 The Naive Bayes Classifier
8.1 Introduction
Cutoff Probability Method
Conditional Probability
Example 1: Predicting Fraudulent Financial Reporting
8.2 Applying the Full (Exact) Bayesian Classifier
Using the “Assign to the Most Probable Class” Method
Using the Cutoff Probability Method
Practical Difficulty with the Complete (Exact) Bayes Procedure
Solution: Naive Bayes
The Naive Bayes Assumption of Conditional Independence
Using the Cutoff Probability Method
Example 2: Predicting Fraudulent Financial Reports, Two Predictors
Example 3: Predicting Delayed Flights
8.3 Advantages and Shortcomings of the Naive Bayes Classifier
Problems
CHAPTER 9 Classification and Regression Trees
9.1 Introduction
9.2 Classification Trees
Recursive Partitioning
Example 1: Riding Mowers
Measures of Impurity
Tree Structure
Classifying a New Record
9.3 Evaluating the Performance of a Classification Tree
Example 2: Acceptance of Personal Loan
9.4 Avoiding Overfitting
Stopping Tree Growth: Conditional Inference Trees
Pruning the Tree
Cross-Validation
Best-Pruned Tree
9.5 Classification Rules from Trees
9.6 Classification Trees for More Than Two Classes
9.7 Regression Trees
Prediction
Measuring Impurity
Evaluating Performance
9.8 Improving Prediction: Random Forests and Boosted Trees
Random Forests
Boosted Trees
9.9 Advantages and Weaknesses of a Tree
Problems
CHAPTER 10 Logistic Regression
10.1 Introduction
10.2 The Logistic Regression Model
10.3 Example: Acceptance of Personal Loan
Model with a Single Predictor
Estimating the Logistic Model from Data: Computing Parameter Estimates
Interpreting Results in Terms of Odds (for a Profiling Goal)
10.4 Evaluating Classification Performance
Variable Selection
10.5 Example of Complete Analysis: Predicting Delayed Flights
Data Preprocessing
Model-Fitting and Estimation
Model Interpretation
Model Performance
Variable Selection
10.6 Appendix: Logistic Regression for Profiling
Appendix A: Why Linear Regression Is Problematic for a Categorical Outcome
Appendix B: Evaluating Explanatory Power
Appendix C: Logistic Regression for More Than Two Classes
Problems
CHAPTER 11 Neural Nets
11.1 Introduction
11.2 Concept and Structure of a Neural Network
11.3 Fitting a Network to Data
Example 1: Tiny Dataset
Computing Output of Nodes
Preprocessing the Data
Training the Model
Example 2: Classifying Accident Severity
Avoiding Overfitting
Using the Output for Prediction and Classification
11.4 Required User Input
11.5 Exploring the Relationship Between Predictors and Outcome
11.6 Advantages and Weaknesses of Neural Networks
Problems
CHAPTER 12 Discriminant Analysis
12.1 Introduction
Example 1: Riding Mowers
Example 2: Personal Loan Acceptance
12.2 Distance of a Record from a Class
12.3 Fisher’s Linear Classification Functions
12.4 Classification Performance of Discriminant Analysis
12.5 Prior Probabilities
12.6 Unequal Misclassification Costs
12.7 Classifying More Than Two Classes
Example 3: Medical Dispatch to Accident Scenes
12.8 Advantages and Weaknesses
Problems
CHAPTER 13 Combining Methods: Ensembles and Uplift Modeling
13.1 Ensembles
Why Ensembles Can Improve Predictive Power
Simple Averaging
Bagging
Boosting
Bagging and Boosting in R
Advantages and Weaknesses of Ensembles
13.2 Uplift (Persuasion) Modeling
A-B Testing
Uplift
Gathering the Data
A Simple Model
Modeling Individual Uplift
Computing Uplift with R
Using the Results of an Uplift Model
13.3 Summary
Problems
PART V MINING RELATIONSHIPS AMONG RECORDS
CHAPTER 14 Association Rules and Collaborative Filtering
14.1 Association Rules
Discovering Association Rules in Transaction Databases
Example 1: Synthetic Data on Purchases of Phone Faceplates
Generating Candidate Rules
The Apriori Algorithm
Selecting Strong Rules
Data Format
The Process of Rule Selection
Interpreting the Results
Rules and Chance
Example 2: Rules for Similar Book Purchases
14.2 Collaborative Filtering
Data Type and Format
Example 3: Netflix Prize Contest
User-Based Collaborative Filtering: “People Like You”
Item-Based Collaborative Filtering
Advantages and Weaknesses of Collaborative Filtering
Collaborative Filtering vs. Association Rules
14.3 Summary
Problems
CHAPTER 15 Cluster Analysis
15.1 Introduction
Example: Public Utilities
15.2 Measuring Distance Between Two Records
Euclidean Distance
Normalizing Numerical Measurements
Other Distance Measures for Numerical Data
Distance Measures for Categorical Data
Distance Measures for Mixed Data
15.3 Measuring Distance Between Two Clusters
Minimum Distance
Maximum Distance
Average Distance
Centroid Distance
15.4 Hierarchical (Agglomerative) Clustering
Single Linkage
Complete Linkage
Average Linkage
Centroid Linkage
Ward’s Method
Dendrograms: Displaying Clustering Process and Results
Validating Clusters
Limitations of Hierarchical Clustering
15.5 Non-Hierarchical Clustering: The k-Means Algorithm
Choosing the Number of Clusters (k)
Problems
PART VI FORECASTING TIME SERIES
CHAPTER 16 Handling Time Series
16.1 Introduction
16.2 Descriptive vs. Predictive Modeling
16.3 Popular Forecasting Methods in Business
Combining Methods
16.4 Time Series Components
Example: Ridership on Amtrak Trains
16.5 Data-Partitioning and Performance Evaluation
Benchmark Performance: Naive Forecasts
Generating Future Forecasts
Problems
CHAPTER 17 Regression-Based Forecasting
17.1 A Model with Trend
Linear Trend
Exponential Trend
Polynomial Trend
17.2 A Model with Seasonality
17.3 A Model with Trend and Seasonality
17.4 Autocorrelation and ARIMA Models
Computing Autocorrelation
Improving Forecasts by Integrating Autocorrelation Information
Evaluating Predictability
Problems
CHAPTER 18 Smoothing Methods
18.1 Introduction
18.2 Moving Average
Centered Moving Average for Visualization
Trailing Moving Average for Forecasting
Choosing Window Width (w)
18.3 Simple Exponential Smoothing
Choosing Smoothing Parameter �
Relation Between Moving Average and Simple Exponential Smoothing
18.4 Advanced Exponential Smoothing
Series with a Trend
Series with a Trend and Seasonality
Series with Seasonality (No Trend)
Problems
PART VII DATA ANALYTICS
CHAPTER 19 Social Network Analytics
19.1 Introduction
19.2 Directed vs. Undirected Networks
19.3 Visualizing and Analyzing Networks
Graph Layout
Edge List
Adjacency Matrix
Using Network Data in Classification and Prediction
19.4 Social Data Metrics and Taxonomy
Node-Level Centrality Metrics
Egocentric Network
Network Metrics
19.5 Using Network Metrics in Prediction and Classification
Link Prediction
Entity Resolution
Collaborative Filtering
19.6 Collecting Social Network Data with R
19.7 Advantages and Disadvantages
Problems
CHAPTER 20 Text Mining
20.1 Introduction
20.2 The Tabular Representation of Text: Term-Document Matrix and “Bag-of-Words”
20.3 Bag-of-Words vs. Meaning Extraction at Document Level
20.4 Preprocessing the Text
Tokenization
Text Reduction
Presence/Absence vs. Frequency
Term Frequency–Inverse Document Frequency (TF-IDF)
From Terms to Concepts: Latent Semantic Indexing
Extracting Meaning
20.5 Implementing Data Mining Methods
20.6 Example: Online Discussions on Autos and Electronics
Importing and Labeling the Records
Text Preprocessing in R
Producing a Concept Matrix
Fitting a Predictive Model
Prediction
20.7 Summary
Problems
PART VIII CASES
CHAPTER 21 Cases
21.1 Charles Book Club
The Book Industry
Database Marketing at Charles
Data Mining Techniques
Assignment
21.2 German Credit
Background
Data
Assignment
21.3 Tayko Software Cataloger
Background
The Mailing Experiment
Data
Assignment
21.4 Political Persuasion
Background
Predictive Analytics Arrives in US Politics
Political Targeting
Uplift
Data
Assignment
21.5 Taxi Cancellations
Business Situation
Assignment
21.6 Segmenting Consumers of Bath Soap
Business Situation
Key Problems
Data
Measuring Brand Loyalty
Assignment
21.7 Direct-Mail Fundraising
Background
Data
Assignment
21.8 Catalog Cross-Selling
Background
Assignment
21.9 Predicting Bankruptcy
Predicting Corporate Bankruptcy
Assignment
21.10 Time Series Case: Forecasting Public Transportation Demand
Background
Problem Description
Available Data
Assignment Goal
Assignment
Tips and Suggested Steps
References
Data Files Used in the Book
Index
EULA