Data Science and Big Data Analytics
1. What is a logit and how do we compute class probabilities from the logit?
2. Compare and contrast linear and logistic regression methods.
3. Let x3 be the following vector:
   x3 <- c(0, 1, 1, 2, 2, 2, 3, 3, 4)
   Imagine what a histogram of x3 would look like. Assume that the histogram has a bin width of 1. How many bars will the histogram have? Where will they appear? How high will each be?
   When you are done, plot a histogram of x3 with bin width = 1, and see if you are right (see the sketch below).
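One way to produce the plot in R, assuming each bar should be centered on an integer value (the bin placement here is a choice, not part of the exercise):

x3 <- c(0, 1, 1, 2, 2, 2, 3, 3, 4)
hist(x3, breaks = seq(-0.5, 4.5, by = 1))   # bin width 1, bars centered on 0..4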
Advanced Analytics – Theory and Methods
Module 4: Analytics Theory/Methods
Lesson 4b: Logistic Regression

During this lesson the following topics are covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic regression model

The topics covered in this lesson are listed.
Logistic Regression
• Used to estimate the probability that an event will occur as a function of other variables
  Example: the probability that a borrower will default as a function of his credit score, income, the size of the loan, and his existing debts
• Can be considered a classifier, as well
  Assign the class label with the highest probability
• Input variables can be continuous or discrete
• Output:
  A set of coefficients that indicate the relative impact of each driver
  A linear expression for predicting the log odds ratio of the outcome as a function of drivers (binary classification case)
  The log odds ratio is easily converted to the probability of the outcome
We use logistic regression to estimate the probability that an event will occur as a function of other variables. An example is estimating the probability that a borrower will default as a function of his credit score, income, loan size, and his current debts.
We will be discussing classifiers in the next lesson. Logistic regression can also be considered a classifier. Recall the discussion of classifiers in lesson 1 of this module (Clustering). Classifiers are methods that assign class labels (default or no_default) based on the highest probability.
In logistic regression, input variables can be continuous or discrete. The output is a set of coefficients that indicate the relative impact of each of the input variables.
In a binary classification case (true/false), the output also provides a linear expression for predicting the log odds ratio of the outcome as a function of the drivers. The log odds ratio can be converted to the probability of an outcome, and many packages do this conversion in their outputs automatically.
Logistic Regression Use Cases
• The preferred method for many binary classification problems:
  Especially if you are interested in the probability of an event, not just predicting the "yes or no"
  Try this first; if it fails, then try something more complicated
• Binary classification examples:
  The probability that a borrower will default
  The probability that a customer will churn
• Multi-class example:
  The probability that a politician will vote yes/vote no/not show up to vote on a given bill
Logistic regression is the preferred method for many binary classification problems. Two examples of a binary classification problem are shown in the slide above. Other examples:
• true/false
• approve/deny
• respond to medical treatment/no response
• will purchase from a website/no purchase
• likelihood Spain will win the next World Cup
The third example on the slide, "The probability that a politician will vote yes/vote no/not show up to vote on a given bill," is a multiclass problem. We will only discuss binary problems (such as loan default) in this lesson, for simplicity.
Logistic regression is especially useful if you are interested in the probability of an event, not just in predicting the class labels. In a binary class problem, logistic regression should be tried first to fit a model. Only if it does not work should models such as GAMs (generalized additive models), Support Vector Machines, and Ensemble Methods be tried (these models are out of scope for this course).
Logistic Regression Model – Example
• Training data: default is 0/1
  default = 1 if the loan defaulted
• The model will return the probability that a loan with given characteristics will default
• If you only want a "yes/no" answer, you need a threshold
  The standard threshold is 0.5
The slide shows an example, "Probability of Default." Default (the output for this model) is defined as a function of credit score, income, loan amount, and existing debt.
The training data represents the default as either 0 or 1, where default = 1 if the loan is defaulted.
Fitting and scoring the logistic regression model will return the probability that a loan with a given value for each of the input variables will default.
If only a yes/no type answer is desired, a threshold must be set on the probability value to return the class label. The standard threshold is 0.5. A sketch of fitting and scoring such a model in R follows.
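A minimal sketch using R's glm(); the data here is simulated, and all column names and coefficients are hypothetical stand-ins for real loan records:

# Simulate a small loan dataset (hypothetical columns and effect sizes).
set.seed(1)
n <- 500
loans <- data.frame(
  credit_score  = round(rnorm(n, mean = 650, sd = 50)),
  income        = rlnorm(n, meanlog = 10.5, sdlog = 0.5),
  loan_amount   = rlnorm(n, meanlog = 9.5,  sdlog = 0.7),
  existing_debt = rlnorm(n, meanlog = 8.5,  sdlog = 1.0)
)
loans$default <- rbinom(n, 1, plogis(7 - 0.013 * loans$credit_score))

# Fit the logistic regression model; family = binomial selects logistic.
fit <- glm(default ~ credit_score + income + loan_amount + existing_debt,
           data = loans, family = binomial)

# Score: type = "response" returns the probability of default directly.
p <- predict(fit, type = "response")
label <- ifelse(p > 0.5, 1, 0)   # apply the standard 0.5 threshold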
Logistic Regression – Visualizing the Model
Overall fraction of default: ~20%
Logistic regression returns a score that estimates the probability that a borrower will default.
The graph compares the distribution of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1.
Blue = defaulters
This is an example of how one might visualize the model. Logistic regression returns a score that estimates the probability that a borrower will default. The graph compares the distributions of defaulters and non-defaulters as a function of the model's predicted probability, for borrowers scoring higher than 0.1 and less than 0.98.
The graph is overlaid – think of the blue graph (defaulters) as being transparent and "in front of" the red graph (non-defaulters).
The takeaway from the graph is that the higher a borrower scores, the more likely, empirically, he is to default.
The graph only considers borrowers who score > 0.1 and < 0.98 because the original graph had large spikes near 0 and 1 that made it hard to read. We can see, however, from the overlap that a fraction of low-scoring borrowers do actually default.
Technical Description (Binary Case)
• y = 1 is the case of interest: 'TRUE'
• The LHS is called logit(P(y=1)); hence, "logistic regression"
• logit(P(y=1)) is inverted by the sigmoid function
  standard packages can return the probability for you
• Categorical variables are expanded as with linear regression
• Iterative solution to obtain the coefficient estimates, denoted bj
  "Iteratively re-weighted least squares"

$\ln\left(\frac{P(y=1)}{1 - P(y=1)}\right) = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_{p-1} x_{p-1}$
The quantity on the LHS (left-hand side) is the log odds ratio. We first compute the ratio of the probability of y equal to 1 to the probability of y not equal to 1, and take the log of this ratio. In logistic regression, the log odds ratio is equal to a linear additive combination of the drivers. The LHS is called logit(P(y=1)), and hence this method came to be known as logistic regression. The inverse of the logit is the sigmoid function. The output of the sigmoid is the actual probability. Standard packages give this inverse as a standard output.
Categorical variables are expanded exactly the way we did in linear regression. Computing the estimated coefficients, denoted bj, can also be accomplished as in the least squares method, but implemented as iteratively re-weighted least squares, converging to the true probabilities with every iteration.
Logistic regression has exactly the same problems that an OLS method has, and the computational complexity increases with more input variables and with categorical variables with multiple levels.
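To make the inversion concrete, a one-line sketch in R converting a log odds value to a probability with the sigmoid:

log_odds <- -1.45                  # example value of b0 + b1*x1 + ...
p <- 1 / (1 + exp(-log_odds))      # sigmoid; equivalently plogis(log_odds)
p                                  # about 0.19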
Interpreting the Estimated Coefficients, bj
• Invert the logit expression:
  $\frac{P(y=1)}{1 - P(y=1)} = e^{b_0}\, e^{b_1 x_1} \cdots e^{b_{p-1} x_{p-1}}$
• exp(bj) tells us how the odds ratio of y=1 changes for every unit change in xj
• Example: bcreditScore = -0.69
  exp(bcreditScore) = 0.5 = 1/2
  For the same income, loan, and existing debt, the odds ratio of default is halved for every point increase in credit score
• Standard packages return the significance of the coefficients in the same way as in linear regression
If we invert the logit expression shown in the slide, we obtain the odds of the outcome as a product of the exponents of the coefficients times the drivers.
The exponent of the first coefficient, b0, represents the odds ratio of the outcome in the "reference situation" – the situation represented by all the continuous variables set to zero and the categorical variables at their reference levels.
The exponents of the other coefficients, exp(bj), tell us how the odds ratio of y=1 changes for every unit change in xj.
Suppose we have bcreditScore = -0.69; this implies exp(-0.69) = 0.5 = 1/2.
This means that, for the same income, loan amount, and existing debt, the odds ratio of default is cut in half for every point of increase in credit score. The negative sign on the coefficient indicates that there is a negative relationship between the credit score and the probability of default: a higher credit score implies a lower probability of default.
The significance of each coefficient is returned in the same way as in linear regression, so you should look for very low p-values.
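In R, exponentiating the fitted coefficients gives the odds ratios directly (using the 'fit' object from the earlier sketch):

exp(coef(fit))   # multiplicative change in the odds of y=1 per unit of each x_j
# e.g. a coefficient of -0.69 would give exp(-0.69) = 0.5:
# the odds of the outcome are halved per unit increase in that driver.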
An Interesting Fact About Logistic Regression
"The probability mass equals the counts"
• If 20% of our loan risk training set defaults
  The sum of all the training set scores will be 20% of the number of training examples
• If 40% of applicants with income < $50,000 default
  The sum of the training set scores of people in this income category will be 40% of the number of examples in this income category
"Logistic regression preserves the summary statistics of the training data" – in other words, logistic regression is a very good way of concisely describing the probabilities of all the different possible combinations of features in the training data.
Two examples of this property are shown in the slide. If you sum up everybody's score after putting them through the model, the total will equal the number of positive examples in the training set.
What this means is that the model acts almost like a continuous probability lookup table. Assume that we have all categorical variables and a table of the probability of every possible combination of variables; logistic regression is a concise version of that table. This is what can be called a "well calibrated" model.
Reference: http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/
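This property is easy to verify on the simulated example from earlier:

p <- predict(fit, type = "response")
sum(p)               # total predicted probability mass ...
sum(loans$default)   # ... matches the count of defaults in the training set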
Diagnostics
• Hold-out data:
  Does the model predict well on data it hasn't seen?
• N-fold cross-validation: formal estimate of generalization error
• "Pseudo-R2": 1 – (deviance/null deviance)
  Deviance and null deviance are both reported by most standard packages
  The fraction of "variance" that is explained by the model
  Used the way R2 is used
This is all very similar to linear regression. We use the hold-out data method and N-fold cross-validation on the fitted model. This is exactly what we did with linear regression to determine whether the model predicts well.
Deviance, for the purposes of this discussion, is analogous to "variance" in linear regression. The null deviance is the deviance (or "error") that you would make if you always assumed that the probability of true were simply the global probability. The model should explain more than this simple guess.
1 – (deviance/null deviance) is the "fraction" that defines Pseudo-R2, a measure of how well the model explains the data. Pseudo-R2 is used in logistic regression the same way we use R2 in linear regression: roughly, the fraction of the "variance" explained by the model.
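With R's glm(), both deviances are stored on the fitted object, so the pseudo-R2 is one line (again using 'fit' from the earlier sketch):

pseudo_r2 <- 1 - fit$deviance / fit$null.deviance
pseudo_r2   # fraction of "variance" explained; use like R-squared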
Diagnostics (Cont.)
• Sanity check the coefficients
  Do the signs make sense? Are the coefficients excessively large?
  A wrong sign is an indication of correlated inputs, but doesn't necessarily affect predictive power.
  Excessively large coefficient magnitudes may indicate strongly correlated inputs; you may want to consider eliminating some variables, or using regularized regression techniques.
  Infinite-magnitude coefficients could indicate a variable that strongly predicts a subset of the output (and doesn't predict well on the rest).
  ▪ Try a Decision Tree on that variable, to see if you should segment the data before regressing.
The sanity checks are exactly the same as those we discussed for linear regression. Once we determine that the fit is good, we need to perform the sanity checks. Logistic regression is an explanatory model, and the coefficients provide the required details.
First, check the signs of the coefficients. Do the signs make sense? For example, should income increase with age or years of education? Then those coefficients should be positive; if not, there might be something wrong. A wrong sign is often an indicator that the variables are correlated with each other. Regression works best if all the drivers are independent. Correlated inputs do not in fact affect the predictive power, but the explanatory capability is compromised.
We also need to check whether the magnitudes of the coefficients make sense. They can sometimes become excessively large, and we prefer them not to be very large. This is also an indication of strongly correlated inputs. In this case, consider eliminating some variables, or using regularized regression techniques; as with linear regression, ridge- and lasso-style penalties are available for logistic regression in add-on packages (such as R's glmnet).
Sometimes you may get infinite-magnitude coefficients, which could indicate that there is a variable that strongly predicts a certain subset of the output and does not predict well on the rest. For example, there may be a range of an input variable for which the output is perfectly predicted. In such conditions, plot the output vs. the input and determine the segment where the prediction goes wrong. We should then segment the data before fitting the model. Decision Trees can be used on that variable to see whether you should segment the data before regressing.
Diagnostics: ROC Curve
Area under the curve (AUC) tells you how well the model predicts. (Ideal AUC = 1)
For logistic regression, the ROC curve can help set the classifier threshold.
Logistic models do very well at predicting class probabilities, but if you want to use them as a classifier you have to set a threshold. For a given threshold, the classifier will give false positives and false negatives. The false positive rate (fpr) is the fraction of negative instances that were misclassified. The false negative rate (fnr) is the fraction of positive instances that were misclassified. The true positive rate (tpr) = 1 – fnr.
The ROC (Receiver Operating Characteristic) curve plots (fpr, tpr) as the threshold is varied from 0 (the upper right-hand corner) to 1 (the lower left-hand corner).
As the threshold is raised, the false positive rate decreases, but the true positive rate decreases, too.
The ideal classifier (only true instances have probability near 1) would trace the upper left triangle of the unit square: as the threshold increases, fpr decreases without lowering tpr.
Usually, ROC curves are used only to evaluate prediction quality – how close the AUC is to 1. But they can also be used to set thresholds: if you have upper bounds on your desired fpr and fnr, you can use the ROC curve (or, more accurately, the software that you use to plot the ROC curve) to find the range of thresholds that meets those constraints.
An excellent primer on ROC is available in the following reference:
http://home.comcast.net/~tom.fawcett/public_html/papers/ROC101
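A short sketch of plotting the ROC curve and computing the AUC with the add-on ROCR package (assumed installed), using 'p' and 'loans' from the earlier example:

library(ROCR)
pred <- prediction(p, loans$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf)                              # ROC curve: tpr vs. fpr
performance(pred, "auc")@y.values[[1]]  # area under the curve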
Diagnostics: Plot the Histograms of Scores
[Figure: two pairs of score histograms – the top pair shows good separation between classes; the bottom pair shows substantial overlap]
The next diagnostic method is plotting the histograms of the scores. The graph in the top half is the one we saw earlier in the lesson. The graph tells us how well the model discriminates true instances from false instances. Ideally, true instances score high and false instances score low, so that most of the mass of the two histograms is separated. That is what you see in the graph at the top.
The graph shown at the bottom shows substantial overlap. That model did not predict well; the input variables are not strong predictors of the output.
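A sketch of this diagnostic in R, overlaying the score histograms of the two classes (continuing the simulated example):

brks <- seq(0, 1, by = 0.05)
hist(p[loans$default == 0], breaks = brks, col = rgb(1, 0, 0, 0.5),
     main = "Scores by class", xlab = "Predicted probability")
hist(p[loans$default == 1], breaks = brks, col = rgb(0, 0, 1, 0.5), add = TRUE)
# Well-separated histograms indicate a model that discriminates well.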
Logistic Regression – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Explanatory value: relative impact of each variable on the outcome (in a more complicated way than linear regression)
• Robust with redundant variables, correlated variables (lose some explanatory value)
• Concise representation with the coefficients
• Easy to score data
• Returns good probability estimates of an event
• Preserves the summary statistics of the training data ("the probabilities equal the counts")

Cautions (-):
• Does not handle missing values well
• Assumes that each variable affects the log-odds of the outcome linearly and additively
  Variable transformations and modeling variable interactions can alleviate this
  A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Cannot handle variables that affect the outcome in a discontinuous way (step functions)
• Doesn't work well with discrete drivers that have a lot of distinct values (for example, ZIP code)
Logistic regression has explanatory value: we can easily determine how the variables affect the outcome, although the explanation is a little more complicated than with linear regression. It is robust with redundant and correlated variables; in that case the prediction is not impacted, but we lose some explanatory value in the fitted model.
Logistic regression provides a concise representation of the outcome via the coefficients, and it is easy to score data with the model. Logistic regression returns good probability estimates of an event. It also returns a well-calibrated model: it preserves the summary statistics of the training data.
The Cautions (-) are that logistic regression does not handle missing values well. It assumes that each variable affects the log odds of the outcome linearly and additively, so if some variables affect the outcome non-linearly, or the relationships are not actually additive, the model does not fit well.
Variable transformations and modeling variable interactions can address this to some extent. It is recommended to take the log of monetary amounts, or of any variable with a wide dynamic range. Logistic regression cannot handle variables that affect the outcome in a discontinuous way; we discussed the related issue of infinite-magnitude coefficients earlier, where the prediction is inconsistent across ranges. Also, when you have discrete drivers with a large number of distinct values, the model becomes complex and computationally inefficient.
Check Your Knowledge
1. What is a logit and how do we compute class probabilities from the logit?
2. How is the ROC curve used to diagnose the effectiveness of the logistic regression model?
3. What is Pseudo-R2 and what does it measure in a logistic regression model?
4. How do you describe a binary class problem?
5. Compare and contrast linear and logistic regression methods.
Your Thoughts?
Record your answers here.
Lesson 4b: Logistic Regression – Summary

During this lesson the following topics were covered:
• Technical description of a logistic regression model
• Common use cases for the logistic regression model
• Interpretation and scoring with the logistic regression model
• Diagnostics for validating the logistic regression model
• Reasons to Choose (+) and Cautions (-) of the logistic regression model

This lesson covered these topics. Please take a moment to review them.
Advanced Analytics – Theory and Methods
Lesson 4a: Linear Regression

During this lesson the following topics are covered:
• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear regression model

The topics covered in this lesson are listed.
Regression
• Regression focuses on the relationship between an outcome and its input variables.
  Provides an estimate of the outcome based on the input values.
  Models how changes in the input variables affect the outcome.
• The outcome can be continuous or discrete.
• Possible use cases:
  Estimate the lifetime value (LTV) of a customer and understand what influences LTV.
  Estimate the probability that a loan will default and understand what leads to default.
• Our approaches: linear regression and logistic regression
The term "regression" was coined by Francis Galton in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards an average (a phenomenon also known as regression toward the mean).
Specifically, regression analysis helps one understand how the value of the dependent variable (also referred to as the outcome) changes when any one of the independent (or input) variables changes, while the other independent variables are held fixed. Regression analysis estimates the conditional expectation of the outcome variable given the input variables – that is, the mean value of the outcome variable when the input variables are held fixed.
Regression focuses on the relationship between the outcome and the inputs. It also provides a model that has some explanatory value, in addition to estimating outcomes. Although social scientists use regression primarily for its explanatory value, data scientists apply regression techniques as predictors or classifiers.
The outcome can be continuous or discrete. For continuous outcomes, such as income, this lesson examines the use of linear regression. For discrete outcomes of a categorical attribute, such as success/fail, gender, or political party affiliation, the next lesson presents the use of logistic regression.
Linear Regression
• Used to estimate a continuous value as a linear (additive) function of other variables
  Income as a function of years of education, age, and gender
  House sales price as a function of square footage, number of bedrooms/bathrooms, and lot size
• Outcome variable is continuous.
• Input variables can be continuous or discrete.
• Model output:
  A set of estimated coefficients that indicate the relative impact of each input variable on the outcome
  A linear expression for estimating the outcome as a function of input variables
Linear regression is a commonly used technique for modeling a continuous outcome. It is simple and works well in many instances. It is recommended that linear regression be tried first; if the results are determined to be unreliable, other more complicated models should be considered. Alternative modeling approaches include ridge regression, local linear regression, regression trees, and neural nets (these models are out of scope for this course).
Linear regression models a continuous outcome, such as income or housing sales prices, as a linear or additive function of other input variables. The input variables can be continuous or discrete.
Linear Regression Model

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{p-1} x_{p-1} + \varepsilon$

where $y$ is the outcome variable,
$x_j$ are the input variables, for $j = 1, 2, \ldots, p-1$,
$\beta_0$ is the value of $y$ when each $x_j$ equals zero,
$\beta_j$ is the change in $y$ based on a unit change in $x_j$,
$\varepsilon \sim N(0, \sigma^2)$, and the $\varepsilon$'s are independent of each other.
In linear regression, the outcome variable is expressed as a linear combination of the input variables. For a given set of input variables, the linear regression model provides the expected outcome value. Unless the situation being modeled is purely deterministic, there will be some random variability in the outcome. This random error, denoted by ε, is assumed to be normally distributed with a mean of zero and a constant variance (σ²).
Example: Linear Regression with One Input Variable

$y = \beta_0 + \beta_1 x_1 + \varepsilon$

• x1 – the number of employees reporting to a manager
• y – the hours per week spent in meetings by the manager
In this example, the human resources department decides to examine the effect that the number of employees directly reporting to a manager has on how many hours per week the manager spends in meetings. The expected time spent in meetings is represented by the equation of a line with unknown intercept and slope. Suppose the true value of the intercept is 3.27 hours and the true value of the slope is 2.2 hours per employee. Then a manager can expect to spend an additional 2.2 hours per week in meetings for every additional employee.
The distribution of the error term is represented by the rotated normal distribution plots provided at specific values of x1. For example, a typical manager with 8 employees may be expected to spend 3.27 + 2.2 × 8 = 20.87 hours per week in meetings, but any amount of time from 15 to 27 hours per week is quite probable.
This example illustrates a theoretical regression model. In practice, it is necessary to collect and prepare the necessary data and use a software package such as R to estimate the values of the coefficients. Coefficient estimation is covered later in this lesson.
Additional variables could be added to this model. For example, a categorical attribute can be added to this linear regression model to account for the manager's functional organization, such as engineering, finance, manufacturing, or sales. It may be somewhat tempting to include one variable, x2, to represent the organization and denote engineering by 1, finance by 2, manufacturing by 3, and sales by 4. However, such an approach incorrectly suggests that the interval between the assigned numeric values has meaning (for example, that sales is three more than engineering). The proper implementation of categorical attributes in a regression model is addressed next.
Representing Categorical Attributes
• For a categorical attribute with m possible values
  Add m-1 binary (0/1) variables to the regression model
  The remaining category is represented by setting the m-1 binary variables equal to zero

$y = \beta_0 + \beta_1\,\text{employees} + \beta_2\,\text{finance} + \beta_3\,\text{mfg} + \beta_4\,\text{sales} + \varepsilon$

Possible Situation                        Input Variables
Finance manager with 8 employees          (8, 1, 0, 0)
Manufacturing manager with 8 employees    (8, 0, 1, 0)
Sales manager with 8 employees            (8, 0, 0, 1)
Engineering manager with 8 employees      (8, 0, 0, 0)
In expanding the previous example to include the manager's functional organization, the input variables, denoted earlier by the x's, have been replaced by more meaningful variable names. In addition to the employees variable for the number of employees reporting to a manager, three binary variables have been added to the model to identify finance, manufacturing (mfg), and sales managers. If a manager belongs to any of these functional organizations, the corresponding variable is set to 1; otherwise, the variable is set to 0. Thus, for four functional organizations, engineering is represented by the case where the three binary variables are each set to 0. For this categorical variable, engineering is considered the reference level. For example, the coefficient of finance denotes the relative difference from the reference level. Choosing a different organization as the reference level changes the coefficient values, but not their relative differences. Interpreting the coefficients for categorical variables relative to the reference level is covered later in this lesson.
In general, a categorical attribute with m possible distinct values can be represented in the linear regression model by adding m-1 binary variables. For a categorical attribute with only two possible values, such as gender (female or male), only one binary variable needs to be added, with one value assigned 0 and the other assigned 1.
Suppose it was decided to include the manager's U.S. state of employment in the regression model. Then 49 binary variables would have to be added to the regression model to account for 50 states. However, that many categorical values can be quite cumbersome to interpret or analyze. Alternatively, it may make more sense to group the states into geographic regions or into other groupings, such as type of location: headquarters, plant, field office, or remote. In the latter case, only three binary variables would need to be added.
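A small sketch (hypothetical data) showing how R's lm() performs this expansion automatically when the attribute is stored as a factor:

df <- data.frame(
  hours     = c(21, 18, 25, 19, 30, 22, 17, 28),
  employees = c(8, 6, 10, 7, 12, 9, 5, 11),
  org       = factor(c("finance", "mfg", "sales", "engineering",
                       "finance", "mfg", "sales", "engineering"))
)
df$org <- relevel(df$org, ref = "engineering")   # engineering = reference level
fit_cat <- lm(hours ~ employees + org, data = df)
head(model.matrix(fit_cat))   # intercept, employees, and m-1 = 3 dummy columns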
Fitting a Line with Ordinary Least Squares (OLS)
• Choose the line that minimizes:
  $\sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_{i,1} + \cdots + b_{p-1} x_{i,p-1}) \right]^2$
• Provides the coefficient estimates, denoted bj

Fitted line in the example: $\hat{y} = 3.21 + 2.19\,x_1$
Once a dataset has been collected, the objective is to fit the "best" line to the data points. A very common approach to determining the best-fitting line is to choose the line that minimizes the sum of the squares of the differences between the observed outcomes in the dataset and the estimated outcomes based on the equation of the fitted line. This method is known as Ordinary Least Squares (OLS). In the case of one input variable, the differences, or distances, between the observed outcome values and the estimated values along the fitted regression line are presented in the provided graph as the vertical line segments.
Although this minimization problem can be solved by hand calculations, it becomes very difficult for more than one input variable. Mathematically, the problem involves calculating the inverse of a matrix. However, other methods, such as QR decomposition, are used to minimize numerical round-off errors. Depending on the implementation, the required storage to perform the OLS calculations may grow quadratically as the number of input variables grows. For a large number of observations and many variables, the storage and RAM requirements should be carefully considered.
Note the provided equation of the fitted line. The carat over y, read "y-hat," denotes the estimated outcome for a given set of inputs. This notation helps to distinguish the observed y values from the fitted y values. In this example, the estimated coefficients are b0 = 3.21 and b1 = 2.19.
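A minimal sketch of the one-variable fit in R; lm() performs the OLS minimization, and the data is simulated to roughly match the example's intercept and slope:

set.seed(42)
x1 <- sample(1:15, 40, replace = TRUE)       # employees per manager
y  <- 3.27 + 2.2 * x1 + rnorm(40, sd = 2)    # hours/week, with noise
fit_lm <- lm(y ~ x1)
coef(fit_lm)      # estimated b0 and b1 (close to the true 3.27 and 2.2)
summary(fit_lm)   # coefficients with p-values and R-squared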
Interpreting the Estimated Coefficients, bj

$\hat{y} = 4.0 + 2.2\,\text{employees} + 0.5\,\text{finance} - 1.9\,\text{mfg} + 0.6\,\text{sales}$

• Coefficients for numeric input variables
  Change in outcome due to a unit change in the input variable*
  Example: b1 = 2.2
  An extra 2.2 hrs/wk in meetings for each additional employee managed*
• Coefficients for binary input variables
  Represent the additive difference from the reference level*
  Example: b2 = 0.5
  Finance managers meet 0.5 hr/wk more than engineering managers do*
• Statistical significance of each coefficient
  Are the coefficients significantly different from zero?
  For small p-values (say, < 0.05), the coefficient is statistically significant
* when all other input values remain the same
For numeric variables, the estimated coefficients are interpreted in the same way as the concept of slope introduced in algebra. For a unit change in a numeric variable, the outcome will change by the amount and in the direction of the corresponding coefficient. A fitted linear regression model is provided for the example in which the hours per week managers spend in meetings are modeled as a function of the number of employees and the manager's functional organization. In this case, the coefficient of 2.2, corresponding to the employees variable, is interpreted as follows: the expected amount of time spent in meetings will increase by 2.2 hours per week for each additional employee reporting to a manager.
The interpretation of a binary variable coefficient is slightly different. When a binary variable only assumes a value of 0 or 1, the coefficient is the additive difference, or shift, in the outcome from the reference level. In this example, engineering is the reference level for the functional organizations. So, a manufacturing manager would be expected to spend 1.9 hours per week less in meetings than an engineering manager when the number of employees is the same.
When used to fit linear regression models, many statistical software packages will provide a p-value with each coefficient estimate. This p-value can be used to determine whether the coefficient is significantly different from zero. In other words, the software performs a hypothesis test where the null hypothesis is that the coefficient equals zero and the alternative hypothesis is that the coefficient does not equal zero. For small p-values (say, < 0.05), the null hypothesis would be rejected and the corresponding variable should remain in the linear regression model. If a larger p-value is observed, the null hypothesis would not be rejected and the corresponding variable should be considered for removal from the model.
Diagnostics – Examining Residuals
• Residuals
  Differences between the observed and estimated outcomes
  The observed values of the error term, ε, in the regression model
  Expressed as: $e_i = y_i - \hat{y}_i \quad \text{for } i = 1, 2, \ldots, n$
• Errors are assumed to be normally distributed with
  A mean of zero
  Constant variance
Residuals are the differences between the observed and the estimated outcomes. The residuals are the observed values of the error term in the linear regression model. In linear regression modeling, these error terms are assumed to be normally distributed with a mean of zero and a constant variance, regardless of the input variable values. Although this normality assumption is not required to fit a line using OLS, it is the basis for many of the hypothesis tests and confidence interval calculations performed by statistical software packages such as R. The next few slides address the use of residual plots to evaluate adherence to this assumption, as well as to assess the appropriateness of a linear model for a given dataset.
Diagnostics – Plotting Residuals
[Figure: four residual plots – an ideal residual plot, and plots showing a quadratic trend, non-centered residuals, and non-constant variance]
When plotting the residuals against the estimated or fitted outcome values, the ideal residual
plot will show residuals symmetrically centered around zero with a constant variance and with
no apparent trends. If the ideal residual plot is not observed, it is often necessary to add
additional variables to the model or transform some of the existing input and outcome
variables. Common transformations include the square root and logarithmic functions.
Residual plots are also useful for identifying outliers that may require further investigation or
special handling.
Diagnostics – Residual Normality Assumption
[Figure: an ideal histogram of residuals and an ideal Q-Q plot]
The provided histogram shows that the residuals are centered around zero and appear to be symmetric about zero in a bell-shaped curve, as one would expect for a normally distributed random variable. Another option is to examine a Q-Q plot, which compares the observed data against the quantiles (Q) of the assumed distribution. In this example, the observed residuals follow a theoretical normal distribution, represented by the line. If any significant departures of the plotted points from the line are observed, transformations, such as logarithms, may be required to satisfy the normality assumption.
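The standard residual diagnostics are a few lines in R (using 'fit_lm' from the earlier sketch):

e <- resid(fit_lm)
par(mfrow = c(1, 3))
plot(fitted(fit_lm), e, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)    # residuals should scatter evenly around zero
hist(e, main = "Histogram of residuals", xlab = "Residual")
qqnorm(e); qqline(e)      # points should track the line if errors are normal
par(mfrow = c(1, 1))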
Diagnostics – Using Hold-out Data
• Hold-out data
  Training and testing datasets
  Does the model predict well on data it hasn't seen?
• N-fold cross-validation
  Partition the data into N groups
  Holding out each group in turn: fit the model on the remaining groups, then calculate the residuals on the held-out group
  The estimated prediction error is the average over all the residuals

[Figure: 3-fold cross-validation – the data is split into groups D1, D2, D3; each of the three training sets holds out a different group for testing]
Creating a hold-out dataset before you fit the model (this was discussed with the Apriori diagnostics earlier, in lesson 2 of this module), and using that dataset to estimate prediction error, is by far the easiest thing to do.
N-fold cross-validation tells you whether your set of variables is reasonable. This method is used when you don't have enough data to create a hold-out dataset. N-fold cross-validation is performed by randomly splitting the dataset into N non-overlapping subsets or groups, fitting a model using N-1 groups, and predicting its performance using the group that was held out. This process is repeated a total of N times, holding out each group in turn. After completing the N model fits, you estimate the mean performance of the model (and perhaps also the variance/standard deviation of the performance).
"Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions," by Seni and Elder, provides a succinct description of N-fold cross-validation.
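A minimal sketch of N-fold cross-validation written by hand in R, reusing the simulated one-variable data from earlier (many packages automate this):

d <- data.frame(y = y, x1 = x1)
N <- 3
folds <- sample(rep(1:N, length.out = nrow(d)))   # random group assignment
cv_err <- sapply(1:N, function(k) {
  m <- lm(y ~ x1, data = d[folds != k, ])         # fit on N-1 groups
  pred <- predict(m, newdata = d[folds == k, ])   # predict the held-out group
  mean((d$y[folds == k] - pred)^2)                # held-out squared error
})
mean(cv_err)   # estimated prediction error (mean squared error)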
Diagnostics – Other Considerations
• R2
  The fraction of the variability in the outcome variable explained by the fitted regression model
  Attains values from 0 (poorest fit) to 1 (perfect fit)
• Identify correlated input variables
  Pair-wise scatterplots
• Sanity check the coefficients
  Are the magnitudes excessively large?
  Do the signs make sense?
R2, a goodness-of-fit metric, is reported by all standard packages. It is the fraction of the variability in the outcome variable that the fitted model explains. The definition is $R^2 = 1 - SS_{err}/SS_{tot}$, where $SS_{err} = \sum (y - \hat{y})^2$ and $SS_{tot} = \sum (y - \bar{y})^2$. For a good fit, we want an R2 value near 1.
Regression modeling works best if the input variables are independent of each other. A simple way to look for correlated variables is to examine pair-wise scatterplots, such as the one generated in Module 3 for the Iris dataset. If two input variables, x1 and x2, are linearly related to the outcome variable y, but are also correlated with each other, it may only be necessary to include one of these variables in the model. After fitting a regression model, it is useful to examine the magnitudes and signs of the coefficients. Coefficients with large magnitudes or intuitively incorrect signs are also indications of correlated input variables. If the correlated variables remain in the fitted model, the predictive power of the regression model may not suffer, but its explanatory power will be diminished when the magnitudes and signs of the coefficients do not make sense.
If correlated input variables need to remain in the model, restrictions on the magnitudes of the estimated coefficients can be imposed with alternative regression techniques. Ridge regression, which applies a penalty based on the size of the coefficients, is one technique that can be applied. In fitting a linear regression model, the objective is to find the values of the coefficients that minimize the sum of the residuals squared. In ridge regression, a penalty term proportional to the sum of the squares of the coefficients is added to the sum of the residuals squared. A related technique is lasso regression, in which the penalty is proportional to the sum of the absolute values of the coefficients. Both of these techniques are outside the scope of this course.
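Although ridge and lasso are out of scope for the course, a hedged sketch with the add-on glmnet package (assumed installed) shows how they are typically invoked, reusing the categorical-attribute data frame 'df' from the earlier sketch:

library(glmnet)
X <- model.matrix(hours ~ employees + org, data = df)[, -1]  # drop intercept
ridge <- glmnet(X, df$hours, alpha = 0)   # alpha = 0: ridge; alpha = 1: lasso
coef(ridge, s = 0.5)                      # coefficients at penalty lambda = 0.5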
Linear Regression – Reasons to Choose (+) and Cautions (-)

Reasons to Choose (+):
• Concise representation (the coefficients)
• Robust to redundant or correlated variables (lose some explanatory value)
• Explanatory value: relative impact of each variable on the outcome
• Easy to score data

Cautions (-):
• Does not handle missing values well
• Assumes that each variable affects the outcome linearly and additively
  Variable transformations and modeling variable interactions can alleviate this
  A good idea to take the log of monetary amounts or any variable with a wide dynamic range
• Does not easily handle variables that affect the outcome in a discontinuous way (step functions)
• Does not work well with categorical attributes with a lot of distinct values (for example, ZIP code)
The estimated coefficients provide a concise representation of the outcome variable as a function of the input variables. The estimated coefficients provide the explanatory value of the model and are used to easily determine how the individual input variables affect the outcome. Linear regression is robust to redundant or correlated variables; although the predictive power may not be impacted, the model does lose some explanatory value in the case of correlated variables. With the fitted model, it is also easy to score a given set of input values.
A caution is that linear regression does not handle missing values well. Another caution is that linear regression assumes that each variable affects the outcome linearly and additively. If some variables affect the outcome non-linearly and the relationships are not actually additive, the model will often not explain the data well. Variable transformations and modeling variable interactions can address this issue to some extent.
Hypothesis testing and confidence intervals depend on the normality assumption of the error term. To satisfy the normality assumption, a common practice is to take the log of an outcome variable with a skewed distribution. Also, linear regression models are not ideal for handling variables that affect the outcome in a discontinuous way. In the case of a categorical attribute with a large number of distinct values, the model becomes complex and computationally inefficient.
Check Your Knowledge
1. How is the measure of significance used in determining the explanatory value of a driver (input variable) with linear regression models?
2. Detail the challenges with categorical values in a linear regression model.
3. Describe the N-fold cross-validation method used for diagnosing a fitted model.
4. List two use cases of linear regression models.
5. List and discuss two standard checks that you would perform on the coefficients derived from a linear regression model.
Your Thoughts?
Record your answers here.
Lesson 4a: Linear Regression – Summary

During this lesson the following topics were covered:
• General description of regression models
• Technical description of a linear regression model
• Common use cases for the linear regression model
• Interpretation and scoring with the linear regression model
• Diagnostics for validating the linear regression model
• The Reasons to Choose (+) and Cautions (-) of the linear regression model

This lesson covered these topics. Please take a moment to review them.