Data Analytics Exercise

User Generated

MFZW

Mathematics

Description

In this exercise, students will conduct a simple Linear Regression using SAS and Excel. The student will create a summary report of their analysis and create an informative graphic that is included in a business memorandum that provides a recommendation based on their analysis of the dataset.

Our research question is "Is the length of a car a good predictor of its horsepower?" You need to turn this question into a statistical question that you can answer using this data. You may need to do additional research to inform your recommendation. Remember, this is just one data set. Is there other data that you could find that would support or not support your findings?

If you need help conducting a Linear Regression in SAS and Excel, you can view the videos listed after this assignment.

Unformatted Attachment Preview

Data Analysis Exercise 5 (DAX5) Linear Regression Comparison between SAS and Excel In this exercise, we will compare doing an analysis on the same data in Excel and SAS, and then report on our analysis. The data that we will use for this analysis is called SASHELP.CARS, and the analysis we will do is called a linear regression. I have explored the dataset and have developed a question: Is there a relationship between horsepower and the length of a car for sedans? To find the answer to this question, I am going to analyze the dataset and perform a linear regression (LR) using the horsepower as the dependent variable (DV), and the length of the vehicle as the independent variable (IV). First, a brief refresher on linear regression. This is only a simplified review. For a better understanding, please refer to a text on statistical methods and linear regression. A linear regression is a way to model the statistical relationship between a response (or dependent variable) and one or more explanatory (or independent) variables. The linear relationship between the two variables may be represented by a straight line, often called the regression line. Simply put, we want to see if the IV value can be used to accurately predict the DV value. Often, we can visually see if there is a relationship between the IV and DV by creating a scatterplot based on the data. See the following figures for examples of a positive, neutral, and negative relationship. In Figure 2 we can see that as the independent variable increases on the horizontal axis from left to right, the value of the dependent variable seems to vary randomly above and below the regression line. The slope of the regression line might be close to zero. This would indicate that for any value of the independent variable, the value of the dependent variable is equal to the constant plus some random error value that we do not know. There may be other explanatory variables (IV) that have a stronger relationship with the response variable (DV). 1 Dependent Variable Postive Relationship 60 40 20 0 0 5 10 15 20 25 30 25 30 Independent Variable Figure 1. Positive relationship between IV and DV Neutral Relationship Dependent Variable In Figure 1 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to increase in value, although the increase is not consistent due to error, or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. We might suspect a strong positive relationship based on our observation of this scatterplot between the IV and DV. 40 30 20 10 0 0 5 10 15 20 Independent Variable Figure 2. Neutral relationship between IV and DV Data Analysis Exercise 5 (DAX5) Linear Regression Comparison between SAS and Excel Negative Relationship Dependent Variable In Figure 3 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to decrease in value, although the decrease is not consistent due to error, or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. We might suspect a strong negative relationship based on our observation of this scatterplot between the IV and DV. 60 40 20 0 0 5 10 15 20 25 30 Independent Variable Figure 3. Negative relationship between IV and DV Again, a linear regression is way to model a hypothesized statistical relationship between a predictor variable (IV) and a response variable (DV). What is the difference between a deterministic relationship and a statistical relationship? In a deterministic relationship, an equation exactly describes the relationship between two or more variables. Examples are the relationship between Fahrenheit and Celsius (oF = 9/5*oC + 32), and the relationship between circumference and diameter (Circumference = π * diameter). In a deterministic relationship, there are no error terms to consider, and we can create a simple linear equation to model the relationship as depicted in Figure 4. Y value = constant + (slope * X value) Figure 4. Deterministic linear equation In a statistical linear relationship, there is a trend (positive, neutral, or negative), plus a constant, plus some error that we see as the “scatter” in a scatterplot. So, we must modify our linear equation to find the best fitting line that best “describes” the relationship between the predictor variable and the response variable. Data analysis software like SAS and Excel do this by adjusting the position of the line and the slope until the sum of all the squared errors (difference between predicted and observed responses) has been minimized. ŷi = the predicted response (or fitted value) bo = the estimated Y axis intercept of the best fitting line b1 = the estimated slope of the best fitting line xi = the predictor variable value (IV value) yi = the observed response value (DV value) β0 = estimated population regression line constant β1 = estimated population regression line slope εi = error term (difference between ŷi and yi) aka residuals ŷi = bo + b1 xi yi = β0 + β1 xi + εi Figure 5. Statistical linear equation So, as we can see in the equations in Figure 5, the statistical linear relationship approximately describes the relationship between the predictor value and the response value instead of the exact relationship described in a deterministic linear equation. Thus, we need to determine if β1 is not equal to zero (β1 ≠ 0). 2 Data Analysis Exercise 5 (DAX5) Linear Regression Comparison between SAS and Excel In testing the null hypothesis for a simple linear regression, we should generally follow these steps: 1. State the plain language research question: e.g. Is the length of a car in this dataset, a good predictor of the horsepower of that car? (i.e., Do longer cars usually have more horsepower?) 2. State the hypotheses: • Null hypothesis – HO: β1 = 0 • Alternative hypothesis – HA: β1 ≠ 0 3. State the criteria for rejecting HO: • α = 0.05 4. Consider the assumptions for linear regression: • Assumption that there is a linear relationship between response variable and predictor variable • Assumption that the errors, εi, are independent (research design) i. Is your data a “snap shot” or a “video” of your observations? If your data is more of a “video”, consider a time series analysis. ii. Non-significant Chi Square • Assumption that the errors, εi, at each value of the predictor, xi, are normally distributed (not skewed with a mean of zero) (non-significant Shapiro-Wilks statistic indicates normal distribution of error terms). • Assumption that the errors, εi, at each value of the predictor, xi, have equal variances (σ2) i. No triangular looking patterns between the response variable and the standardized residuals, and ii. Non-significant Chi Square • Other items to consider: i. Outliers can cause erroneous results (Cook’s D > ±2) ii. The linear regression may not be the best fit (curvilinear, quadratic, etc.) iii. Large data sets can result in significance (P value) but not really different from 0 iv. Averages of raw data (e.g. summing a region) can overstate the strength of the correlation, so be mindful of what you are trying to prove with your analysis. 5. Compute the appropriate statistics: • Pearson correlation coefficient (remember that correlation does imply causation!) • F-Value • Prob > F • Did you observe any problematic outliers? What (if anything) can you do about them? 6. Decide whether to retain or reject your null hypothesis: • If p > α, then retain the null hypothesis • If p < α, then reject the null hypothesis, and accept the alternative hypothesis • Remember, that statistical significance does not imply practical or meaningful significance! 7. Interpret the parameters (β0 and β1): • What does a one unit increase in the predictor variable result in the expected response variable (what is the slope of the regression line)? Is it positive or negative? Is it meaningful? • What significance does the intercept (β0) have to your understanding of your research question? Is zero within your predictor variable (IV) value range? What does that mean? 3 Data Analysis Exercise 5 (DAX5) Linear Regression Comparison between SAS and Excel Please watch the video linked to this module for detailed instructions. In general, you are going to: 1. Download “Business Memo Template.docx” from the DAX5 Assignment. 2. Save the downloaded file as a Word file and name it “LastName DAX5” without the quotation marks. The file extension will be added automatically. For example, my file would be “Bohler DAX5.docx”. Again, ignore the quotation marks in this part of the instructions. 3. Open SAS. In “Task and Utilities / Data” click on the “Filter Data” task. 4. In the “Filter Data” tab that opens, in the “Data” section, select the SASHELP.CARS dataset. 5. In the “Filter 1 / Variable 1” click on “Type”, click OK. (you may have to scroll down to see OK) 6. In “Comparison” chose “Equal”, Value type should be “Enter a value”, and then enter Sedan into the “Value” text box. In this case, do not put quotes around the word “Sedan”. Capitalization is important, and Sedan and Sedans are different words. 7. Scroll down to the “OUTPUT DATA SET” section, and change the name “filter” to Sedans, and chose a library to put the filtered dataset in. I used a folder in my Library named “IS3310”. If you leave it in the “WORK” Library, it will be deleted when you exit SAS. Click on “Show output data”. Run the task (Running man or F3). Check the logs, check the results (it will only show 10 rows), and close the task. 8. Find the SEDANS dataset, double click it to open the dataset in a new tab. 9. Explore the SEDANS dataset (e.g. use the Characterize Data task). Choose the Horsepower and Length variables to explore. In OPTIONS, make sure “Descriptive Statistics” and “Histogram” are checked for NUMERIC VARIABLES. 10. Run the task (F3). Check the LOG for Errors, Warnings, and Notes, if there are any issues, fix the problem and rerun the task. Look at the OUTPUT. What are the N, Minimum, Mean, Maximum, and Standard Deviation of Horsepower and Length for the dataset? Are there any missing values? Add any appropriate charts and tables to your Business Memo. 11. New task. In the “Tasks and Utilities / Graph” section, double click on “Scatter Plot”. Your SEDANS dataset should already be in the DATA text box. 12. In the ROLES section, chose Length for “X axis” and Horsepower for “Y axis”. You may need to scroll down to find OK. 13. In the APPEARANCE section, under “FIT CURVES”, click on the “Regression” option box. 14. Further down, in the TITLE AND FOOTNOTE section, enter “IV=Length DV=Horsepower” without the quotes. 15. Run the task (F3). Check the LOG for Errors, Warnings, and Notes, if there are any issues, fix the problem and rerun the task. Look at the OUTPUT. Does there appear to be a linear relationship between Length and Horsepower in this dataset? If so, then let’s perform a linear regression on our filtered dataset called SEDANS. 16. To perform a linear regression, we need to make a few assumptions: a. Linear relationship between IV and DV: The observations seemed to be linearly related. b. Independence: The observations are random and independent samples from the population. c. Normality: Each group sample is drawn from a normally distributed population (residuals are normally distributed). d. Homogeneity of variance: The variances of the residuals in the populations are equal. 17. Test for linear relationship: Yes, the scatter plot indicates that there is a linear relationship 4 Data Analysis Exercise 5 (DAX5) Linear Regression Comparison between SAS and Excel 18. Test for Independence: This is a methodological concern and is determined by the set up for the study. For this assignment, please consider that the assumption of independence has been met in the research study design. Once we run the Linear Regression task, we can check the Chi Square statistic for non-significance. 19. Test for Normality: Once we run the Linear Regression task, we will examine the diagnostic plots. 20. Test for Homogeneity of variance: Once we run the Linear Regression task, we will examine the diagnostic plots. 21. Select the “Linear Regression” task. The SEDANS dataset should already be in the DATA section. 22. In the ROLES section, select Horsepower as the “Dependent variable”, and Length in the “Continuous variables” section (note: you may have to scroll down to see OK, and make sure you put Length in the Continuous variable section, not the “Classification variables” section. 23. Next, in MODEL, you must click on the Edit icon to specify the model. When the “Model Effects Builder” window opens, select Length, click “Add”, and scroll down to click on OK. 24. In the OPTIONS / STATISTICS section, select “Default and selected statistics, and click all of the options. Select all boxes under Collinearity, and Heteroscedasticity. 25. Run the task (F3). Check the LOG for Errors, Warnings, and Notes, if there are any issues, fix the problem and rerun the task. Look at the OUTPUT. 26. Review the Output and report your findings in the Business Memo. Is the Model significant? Pr>F is
Purchase answer to see full attachment
User generated content is uploaded by users for the purposes of learning and should be used following Studypool's honor code & terms of service.

Explanation & Answer

Hi,Ple...


Anonymous
Great study resource, helped me a lot.

Studypool
4.7
Trustpilot
4.5
Sitejabber
4.4

Related Tags