Statistic 7
W7: Regression and Correlation
Look back to the raw data you collected in week 1. There are 7 variables listed:
Vehicle type/class
Year
Make
Model
Price
MPG (city)
MPG (highway)
Choose TWO variables that you feel are correlated and explain why you feel that they are correlated. Do you suspect the relation is positive or negative? Why? Which would be considered the independent variable, which the dependent variable? Why?
Run a regression analysis in Excel and provide the results in your post along with your raw data. Looking at the R2 value, explain what this indicates about the strength of the relation. Then write out your Regression Equation, state if your p-value and conclusion.
I encourage you to review the Week 7 Regression PDF at the bottom of the discussions. This will give you a step by step example on how to calculate a correlation and run a Regression using Excel. I DO NOT recommend doing this by hand. Let Excel do the heavy lifting for you. You can also use this PDF in Quizzes section.
There are additional PDFs that were created to help you with the Homework, Lessons and Tests in Quizzes section. I encourage you to review these ASAP! These PDFs are also located at the bottom of the discussion.
“Before you post your initial discussion, you must submit it in the assignment area in a Word file, so its originality can be checked by Turnitin.com. I will take points off if you do not do this. Your score will appear in the same place you submit your file. It can take up to 24 hours for a score to return, but usually, it is less than 30 minutes. Before you post your discussion in the activity, make sure your originality index (%) is less than 15. If it is greater than 15%, rewrite your discussion, submit it again in the assignment area and check the %. Keep doing this until your % is less than 15%. Only post your discussion when the % is less than 15. Here are two hints to get your score below 15%: 1) leave your list of supporting material out of the file you submit for checking (don’t forget to add these back when you post your discussion in the forum) and 2) use your own words, not quotes.
Once you have posted your initial discussion, you must reply to at least two other learner’s post. Each post must be a different topic. So, you will have your initial post from one topic, your first follow-up post from a different topic, and your second follow-up post from one of the other topics. Of course, you are more than welcome to respond to more than two learners.”
Instructions: Make sure you include your data set in your initial post as well. You must also respond to at least 2 other students. Responses may include direct questions.
Peer response #1 – Looking at your peer’s Excel output, and the Regression Equation they wrote out, interpret the slope of their Regression Equation. Use their Regression Equation to make a prediction and show the work for your predicted value based on your expression. For Example, if your peer used Year to predict Price, plug in a Year value into the regression equation and use it to predict the Price of a vehicle. Does this predicted Price value make sense with their data?
Peer response #2 – It is important to remember that typically a two-factor regression model cannot accurately describe the entire situation. Look at the dependent variable that your peer chose. Name at least 2 independent factors you would use to run a Multiple Linear Regression (MLR) and explain why you feel they are related. Then use those factors to run a Multiple Linear Regression (MLR) on your peer’s data and see if the variables you chose are related to the dependent variable they chose. What is your MLR equation? Is your MLR significant? Are any of the Independent factors significant? What is the R2 value? Explain and interpret this value and how it relates to the MLR. Make sure you include your MLR Excel output as an attachment in your response post.
Sheet1
Vehicle type/class | Year | Make | Model | Price | MPG (City) | MPG (hwy) | Max Seating | |||||||||||||
SUV | 20 | BMW | X3 | $ | 4 | 9 | 5 | 25 | 29 | |||||||||||
Sedan | 20 | 19 | Chevrolet | Cruze | $ | 18 | 7 | 28 | 38 | |||||||||||
20 | 21 | Tahoe | $55,095 | 16 | ||||||||||||||||
Coupe | 2020 | Camaro | $ | 26 | 30 | |||||||||||||||
Ford | Edge | $32,345 | ||||||||||||||||||
Explorer | $34,010 | |||||||||||||||||||
Luxury | Porsche | 911 | $111,550 | 24 | ||||||||||||||||
Van/Minivan | Toyota | Sienna | $32,760 | |||||||||||||||||
Dodge | Charger | $31,390 | ||||||||||||||||||
2019 | Grand Caravan | $27,595 | 17 | |||||||||||||||||
Qualitative | Quantitative |
The correlation is the direction and strength of association between 2 variables is
often expressed in a single number called the correlation coefficient. This is
denoted by the variable r.
• r can only be between -1 and 1, -1 ≤ r ≤ 1.
• If r = 0, then there is no linear relationship at all.
• If r = -1, then there is a perfect linear relationship that slopes down.
• If r = 1, then there is a perfect linear relationship that slopes up.
The Coefficient of Determination refers to how much percent Variation is around
the model. This is denoted by R². Note: If you have one you can find the other.
Simple Linear Regression is a data analysis technique that tries to find a linear
pattern in the data. In
linear regression, we use all the data to calculate a straight line which may be
used to predict the values. We will also discuss if the linear regression is
significant and if the independent variable (x) is a significant predictor of the
dependent variable (y).
The equation of line for a Simple Linear Regression (SLR) is:
�̂� = 𝛽1𝑥 + 𝛽0
Where 𝛽1 is the slope coefficient or the coefficient, 𝛽0 is the y-intercept and �̂� is
the predicted y value.
Let review our car price example. From the car price data, we also found out
what year these cars where manufactured in.
Car Price: Year
Observation 1 $ 20,000 2015
Observation 2 $ 25,000 2016
Observation 3 $ 30,000 2018
Observation 4 $ 31,000 2018
Observation 5 $ 22,500 2016
Observation 6 $ 25,000 2016
Observation 7 $ 29,500 2018
Observation 8 $ 24,000 2015
Observation 9 $ 24,500 2017
Observation 10 $ 25,000 2017
Having this information, we first want to see if there is a correlation between Year
and Car Price. Usually the older the car, cheaper the car is. As the age goes up,
the price will go down. The Price of the car depends on what Year it was
manufactured. This describes a negative correlation, but I want to see if my
assumption is correct and what the actual correlation value is.
Before we can do any calculations on the data, we will need to convert the Year to
a numeric value. Keeping the physical Year is going to skew the data and it
doesn’t make sense when we will get into analyzing and interpreting the data. If
the car was made in 2018, then this means the car will be 1 year old. 2019 – 2018
= 1. I am rounding all these to full years for ease of the example. Converting all
these Years will look like:
Car Price: Year Years
Old
Observation 1 $ 20,000 2015 4
Observation 2 $ 25,000 2016 3
Observation 3 $ 30,000 2018 1
Observation 4 $ 31,000 2018 1
Observation 5 $ 22,500 2016 3
Observation 6 $ 25,000 2016 3
Observation 7 $ 29,500 2018 1
Observation 8 $ 24,000 2015 4
Observation 9 $ 24,500 2017 2
Observation 10 $ 25,000 2017 2
Now that we have our data we can start analyzing it. To find the correlation we
will use the =CORREL( ) function in Excel.
In Excel type in the “=” and the CORREL(;put a left parentheses, then highlight the
first column; type in a comma, highlight the second column; close the
parentheses ) and hit Enter. Note: it does not matter which column you highlight
first.
Here we see that the Correlation = -.8846. This is in fact negative correlation and
agrees with our assumption. As the Age of the Car goes up, the Price of the Car
will go down. Now that we have the Correlation we can find R2.
-.8846 * – .8846 = 78.25%
Next, we will run a Regression using Excel. We will use the Data Analysis ToolPak
to run the Regression.
Go to Data – > Data Analysis
When the new window pops ups, scroll to where it says “Regression”, highlight it
and Click “OK”
Once you Click “OK”, a new window pops up
Where it will say “Input”
Input Y Range: Click in the box and highlight the y values.
Input x Range: Click in the box and highlight the x values.
Check the box that say “Labels” this will tell you that the first row has labels in it.
Output Options
Make sure the second bubble is highlighted. “New Worksheet Ply”
Residuals
Make sure you check the box for Residuals and Standardized Results
Then Click “OK”
(Remember, the x-value predicts the y-value. The Year of the Car will predict
what the Price of the Car is. This tells us that Years Old is the x-value and Price is
the y-value. This is very important to understand and remember)
It should look like this:
Once you click OK, here is the Regression Output:
Looking at the output we see the Multiple R is the correlation. We know the
Correlation is negative, but the regression will give us the positive value. Make
sure you look at the coefficients for validation. We also see that the R-squared is
78.25%. Which is what we calculated before. R-squared tells us that:
78.25% of variation in the data between Age and Price, can we have accounted
for by this model. The best R-squared value is 100%, our value is less than that,
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.884606501
R Square 0.782528661
Adjusted R Square 0.755344744
Standard Error 1725.490814
Observations 10
ANOVA
df SS MS F Significance F
Regression 1 85706451.61 85706451.61 28.78646 0.000673381
Residual 8 23818548.39 2977318.548
Total 9 109525000
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 31959.67742 1296.435244 24.65196589 7.83E-09 28970.09239 34949.26245 28970.09239 34949.26245
Years Old -2629.032258 490.0064638 -5.365301179 0.000673 -3758.98919 -1499.075326 -3758.98919 -1499.075326
but it is still high enough to give us a good indication of what the data will look
like and it tells us that we want to interpret the model further.
Next, we want to see if this model is significant and if Years is a significant
predictor for Price. We will look at the Significance F value. Recall: if the p-value
is < alpha, then Yes this is significant.
The p-value associated with this model is .00067338.
.00067338 < .05. The p-value is in fact less than alpha. We can state that Yes,
Years is in fact a significant predictor for Price. This can be valuable information
to know if we are going out to buy a new car. Now that we know the model is
significant let’s write out the Regression Equation and interpret the values.
In the Regression Output if we look like the under Coefficients, this is where we
will find the values to write out the Regression Equation. I highlighted them in
Yellow below.
Next to those value we see the word “Intercept”, this corresponds to the y-
intercept value. And we see the words “Years Old”, this corresponds to the slope
coefficient value. Using this equation
�̂� = 𝛽1𝑥 + 𝛽0, we will write out the regression equation and replace “x” and “y”
with the actual variable names.
𝑃𝑟𝑖𝑐�̂� = −2,629.03 (𝑌𝑒𝑎𝑟𝑠 𝑂𝑙𝑑) + 31,959.68
We see that the y-intercept is $31,959.68. This means when Years Old equals 0,
the Price of a Car should be $31,959.68.
𝑃𝑟𝑖𝑐�̂� = −2,629.03 (0) + 31,959.68
𝑃𝑟𝑖𝑐�̂� = 31,959.68
This makes sense because Year 0 is 2019. So, if you bought this type of car in the
Year 2019, you will expect to pay $31,959.68.
Please note: The y-intercept while in this case does make sense does not always
have a practical meaning. The y-intercept WILL NOT make sense in every
scenario. It is OK for the y-intercept not make sense with certain problems. For
example, if you wanted to use the Weight of a Car to predict the Price, the Weight
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 31959.67742 1296.435244 24.65196589 7.83E-09 28970.09239 34949.26245 28970.09239 34949.26245
Years Old -2629.032258 490.0064638 -5.365301179 0.000673 -3758.98919 -1499.075326 -3758.98919 -1499.075326
of a Car will NEVER be 0 pounds, so the y-intercept is not meaningful and would
not have a practical meaning in the problem.
Next, we want to interpret the Slope. As the Years Old increases by 1 year, then
the Price of the Car will go down by $2,629.03. Or as the car gets older and older
the price will keep decreasing by $2,629.03 every year.
Lastly, I want to use my Regression Equation to predict prices. What would I
expect to pay for a car that was manufactured in 2014? Remember 2019 – 2014 =
5. This means the car is 5 Years Old. This is the value you want to substitute into
the Regression Equation. DO NOT put 2019 into the equation.
𝑃𝑟𝑖𝑐�̂� = −2,629.03 (𝑌𝑒𝑎𝑟𝑠 𝑂𝑙𝑑) + 31,959.68
𝑃𝑟𝑖𝑐�̂� = −2,629.03 (5) + 31,959.68
𝑃𝑟𝑖𝑐�̂� = −13,145.16 + 31,959.68
𝑃𝑟𝑖𝑐�̂� = $18,814.52
In the Year 2014, when the car is 5 Years Old, we will expect to pay $18,814.52 for
a car.
This is a good analysis for a SLR. But if we wanted to analysis the data further?
We could run a Multiple Linear Regression (MLR). Multiple Linear Regression is
just like it sounds. Instead of having only 1 x-variable, we have multiple x-
variables. In our example the x-variable was Years Old, and it did a good job at
predicting Price. But what other values could you use to predict the Price of a
Car?
One this that comes to mind is Total Miles. When you are looking to buy a car,
you also want to look at Total Miles. Usually you want a car with fewer miles on
it. The fewer miles on the car, the higher the price. Or the more miles you have a
car, the lower the price. This appears to be another negative correlation, or
relationship.
Another variable that comes to mind is a 5-star safety rating. The safer the car
the more people are willing to pay for safety. If a car has 5 stars it will be more
expensive than if a car only had 2 or 3 stars. This appears to be a positive
correlation or relationship.
You would want to run a MLR to justify and verify your claims, but these are just a
few variables you could include to turn this SLR to a MLR.