Data Analysis Exercise 5 (DAX5)
Linear Regression Comparison between SAS and Excel
In this exercise, we will compare doing an analysis on the same data in Excel and SAS, and then report on our
analysis. The data that we will use for this analysis is called SASHELP.CARS, and the analysis we will do is called
a linear regression.
I have explored the dataset and have developed a question: Is there a relationship between horsepower and
the length of a car for sedans? To find the answer to this question, I am going to analyze the dataset and
perform a linear regression (LR) using the horsepower as the dependent variable (DV), and the length of the
vehicle as the independent variable (IV). First, a brief refresher on linear regression. This is only a simplified
review. For a better understanding, please refer to a text on statistical methods and linear regression.
A linear regression is a way to model the statistical relationship between a response (or dependent variable)
and one or more explanatory (or independent) variables. The linear relationship between the two variables
may be represented by a straight line, often called the regression line. Simply put, we want to see if the IV
value can be used to accurately predict the DV value. Often, we can visually see if there is a relationship
between the IV and DV by creating a scatterplot based on the data. See the following figures for examples of a
positive, neutral, and negative relationship.
In Figure 2 we can see that as the independent variable
increases on the horizontal axis from left to right, the
value of the dependent variable seems to vary randomly
above and below the regression line. The slope of the
regression line might be close to zero. This would
indicate that for any value of the independent variable,
the value of the dependent variable is equal to the
constant plus some random error value that we do not
know. There may be other explanatory variables (IV)
that have a stronger relationship with the response
variable (DV).
1
Dependent Variable
Postive Relationship
60
40
20
0
0
5
10
15
20
25
30
25
30
Independent Variable
Figure 1. Positive relationship between IV and DV
Neutral Relationship
Dependent Variable
In Figure 1 we can see that as the independent variable
increases on the horizontal axis from left to right, the
dependent variable tends to increase in value, although
the increase is not consistent due to error, or other
explanatory variables not used in our model. We would
need to perform a linear regression to analyze the
strength of the relationship and identify the parameters
(the constant and the slope) of the regression line. We
might suspect a strong positive relationship based on
our observation of this scatterplot between the IV and
DV.
40
30
20
10
0
0
5
10
15
20
Independent Variable
Figure 2. Neutral relationship between IV and DV
Data Analysis Exercise 5 (DAX5)
Linear Regression Comparison between SAS and Excel
Negative Relationship
Dependent Variable
In Figure 3 we can see that as the independent variable
increases on the horizontal axis from left to right, the
dependent variable tends to decrease in value, although
the decrease is not consistent due to error, or other
explanatory variables not used in our model. We would
need to perform a linear regression to analyze the
strength of the relationship and identify the parameters
(the constant and the slope) of the regression line. We
might suspect a strong negative relationship based on
our observation of this scatterplot between the IV and
DV.
60
40
20
0
0
5
10
15
20
25
30
Independent Variable
Figure 3. Negative relationship between IV and DV
Again, a linear regression is way to model a hypothesized statistical relationship between a predictor variable
(IV) and a response variable (DV). What is the difference between a deterministic relationship and a statistical
relationship?
In a deterministic relationship, an equation exactly describes the relationship between two or more variables.
Examples are the relationship between Fahrenheit and Celsius (oF = 9/5*oC + 32), and the relationship
between circumference and diameter (Circumference = π * diameter). In a deterministic relationship, there
are no error terms to consider, and we can create a simple linear equation to model the relationship as
depicted in Figure 4.
Y value = constant + (slope * X value)
Figure 4. Deterministic linear equation
In a statistical linear relationship, there is a trend (positive, neutral, or negative), plus a constant, plus some
error that we see as the “scatter” in a scatterplot. So, we must modify our linear equation to find the best
fitting line that best “describes” the relationship between the predictor variable and the response variable.
Data analysis software like SAS and Excel do this by adjusting the position of the line and the slope until the
sum of all the squared errors (difference between predicted and observed responses) has been minimized.
ŷi = the predicted response (or fitted value)
bo = the estimated Y axis intercept of the best fitting line
b1 = the estimated slope of the best fitting line
xi = the predictor variable value (IV value)
yi = the observed response value (DV value)
β0 = estimated population regression line constant
β1 = estimated population regression line slope
εi = error term (difference between ŷi and yi) aka residuals
ŷi = bo + b1 xi
yi = β0 + β1 xi + εi
Figure 5. Statistical linear equation
So, as we can see in the equations in Figure 5, the statistical linear relationship approximately describes the
relationship between the predictor value and the response value instead of the exact relationship described in
a deterministic linear equation. Thus, we need to determine if β1 is not equal to zero (β1 ≠ 0).
2
Data Analysis Exercise 5 (DAX5)
Linear Regression Comparison between SAS and Excel
In testing the null hypothesis for a simple linear regression, we should generally follow these steps:
1. State the plain language research question: e.g. Is the length of a car in this dataset, a good predictor
of the horsepower of that car? (i.e., Do longer cars usually have more horsepower?)
2. State the hypotheses:
• Null hypothesis – HO: β1 = 0
• Alternative hypothesis – HA: β1 ≠ 0
3. State the criteria for rejecting HO:
• α = 0.05
4. Consider the assumptions for linear regression:
• Assumption that there is a linear relationship between response variable and predictor variable
• Assumption that the errors, εi, are independent (research design)
i. Is your data a “snap shot” or a “video” of your observations? If your data is more of a
“video”, consider a time series analysis.
ii. Non-significant Chi Square
• Assumption that the errors, εi, at each value of the predictor, xi, are normally distributed (not
skewed with a mean of zero) (non-significant Shapiro-Wilks statistic indicates normal
distribution of error terms).
• Assumption that the errors, εi, at each value of the predictor, xi, have equal variances (σ2)
i. No triangular looking patterns between the response variable and the standardized
residuals, and
ii. Non-significant Chi Square
• Other items to consider:
i. Outliers can cause erroneous results (Cook’s D > ±2)
ii. The linear regression may not be the best fit (curvilinear, quadratic, etc.)
iii. Large data sets can result in significance (P value) but not really different from 0
iv. Averages of raw data (e.g. summing a region) can overstate the strength of the
correlation, so be mindful of what you are trying to prove with your analysis.
5. Compute the appropriate statistics:
• Pearson correlation coefficient (remember that correlation does imply causation!)
• F-Value
• Prob > F
• Did you observe any problematic outliers? What (if anything) can you do about them?
6. Decide whether to retain or reject your null hypothesis:
• If p > α, then retain the null hypothesis
• If p < α, then reject the null hypothesis, and accept the alternative hypothesis
• Remember, that statistical significance does not imply practical or meaningful significance!
7. Interpret the parameters (β0 and β1):
• What does a one unit increase in the predictor variable result in the expected response variable
(what is the slope of the regression line)? Is it positive or negative? Is it meaningful?
• What significance does the intercept (β0) have to your understanding of your research
question? Is zero within your predictor variable (IV) value range? What does that mean?
3
Data Analysis Exercise 5 (DAX5)
Linear Regression Comparison between SAS and Excel
Please watch the video linked to this module for detailed instructions. In general, you are going to:
1. Download “Business Memo Template.docx” from the DAX5 Assignment.
2. Save the downloaded file as a Word file and name it “LastName DAX5” without the quotation marks.
The file extension will be added automatically. For example, my file would be “Bohler DAX5.docx”.
Again, ignore the quotation marks in this part of the instructions.
3. Open SAS. In “Task and Utilities / Data” click on the “Filter Data” task.
4. In the “Filter Data” tab that opens, in the “Data” section, select the SASHELP.CARS dataset.
5. In the “Filter 1 / Variable 1” click on “Type”, click OK. (you may have to scroll down to see OK)
6. In “Comparison” chose “Equal”, Value type should be “Enter a value”, and then enter Sedan into the
“Value” text box. In this case, do not put quotes around the word “Sedan”. Capitalization is important,
and Sedan and Sedans are different words.
7. Scroll down to the “OUTPUT DATA SET” section, and change the name “filter” to Sedans, and chose a
library to put the filtered dataset in. I used a folder in my Library named “IS3310”. If you leave it in the
“WORK” Library, it will be deleted when you exit SAS. Click on “Show output data”. Run the task
(Running man or F3). Check the logs, check the results (it will only show 10 rows), and close the task.
8. Find the SEDANS dataset, double click it to open the dataset in a new tab.
9. Explore the SEDANS dataset (e.g. use the Characterize Data task). Choose the Horsepower and Length
variables to explore. In OPTIONS, make sure “Descriptive Statistics” and “Histogram” are checked for
NUMERIC VARIABLES.
10. Run the task (F3). Check the LOG for Errors, Warnings, and Notes, if there are any issues, fix the
problem and rerun the task. Look at the OUTPUT. What are the N, Minimum, Mean, Maximum, and
Standard Deviation of Horsepower and Length for the dataset? Are there any missing values? Add any
appropriate charts and tables to your Business Memo.
11. New task. In the “Tasks and Utilities / Graph” section, double click on “Scatter Plot”. Your SEDANS
dataset should already be in the DATA text box.
12. In the ROLES section, chose Length for “X axis” and Horsepower for “Y axis”. You may need to scroll
down to find OK.
13. In the APPEARANCE section, under “FIT CURVES”, click on the “Regression” option box.
14. Further down, in the TITLE AND FOOTNOTE section, enter “IV=Length DV=Horsepower” without the
quotes.
15. Run the task (F3). Check the LOG for Errors, Warnings, and Notes, if there are any issues, fix the
problem and rerun the task. Look at the OUTPUT. Does there appear to be a linear relationship
between Length and Horsepower in this dataset? If so, then let’s perform a linear regression on our
filtered dataset called SEDANS.
16. To perform a linear regression, we need to make a few assumptions:
a. Linear relationship between IV and DV: The observations seemed to be linearly related.
b. Independence: The observations are random and independent samples from the population.
c. Normality: Each group sample is drawn from a normally distributed population (residuals are
normally distributed).
d. Homogeneity of variance: The variances of the residuals in the populations are equal.
17. Test for linear relationship: Yes, the scatter plot indicates that there is a linear relationship
4
Data Analysis Exercise 5 (DAX5)
Linear Regression Comparison between SAS and Excel
18. Test for Independence: This is a methodological concern and is determined by the set up for the study.
For this assignment, please consider that the assumption of independence has been met in the
research study design. Once we run the Linear Regression task, we can check the Chi Square statistic
for non-significance.
19. Test for Normality: Once we run the Linear Regression task, we will examine the diagnostic plots.
20. Test for Homogeneity of variance: Once we run the Linear Regression task, we will examine the
diagnostic plots.
21. Select the “Linear Regression” task. The SEDANS dataset should already be in the DATA section.
22. In the ROLES section, select Horsepower as the “Dependent variable”, and Length in the “Continuous
variables” section (note: you may have to scroll down to see OK, and make sure you put Length in the
Continuous variable section, not the “Classification variables” section.
23. Next, in MODEL, you must click on the Edit icon to specify the model. When the “Model Effects
Builder” window opens, select Length, click “Add”, and scroll down to click on OK.
24. In the OPTIONS / STATISTICS section, select “Default and selected statistics, and click all of the options.
Select all boxes under Collinearity, and Heteroscedasticity.
25. Run the task (F3). Check the LOG for Errors, Warnings, and Notes, if there are any issues, fix the
problem and rerun the task. Look at the OUTPUT.
26. Review the Output and report your findings in the Business Memo. Is the Model significant? Pr>F is
Purchase answer to see full
attachment