IS 3310 Troy University Data Analytics Linear Regression using SAS Report

Description

In this exercise, students will conduct a simple linear regression using SAS. I can provide a SAS account if you don't have one (https://welcome.oda.sas.com/login). The student will create a summary report of their analysis and an informative graphic that is included in a business memorandum, which provides a recommendation based on the analysis of the dataset.

Our research question is "What factors influence $/hr for position 1 and 2 employees?" You need to turn this question into statistical questions that you can answer using this data (the first is supplied to you in the memo).

Note: A Regression Analysis Results Table must be submitted to prove that a regression analysis was conducted. An example of such a table is provided (an example only, not DAX 4 results).

Unformatted Attachment Preview

05 October 2020
From: Joe Student
To: Professor
Subj: Enter a subject (e.g., Labor Regression Model)

1. Our organization, Widget Inc, has recently acquired a new location that is like location 8, and we wish to evaluate the wages of two employees hired by their relatives. The first employee is in position 1 and the second is in position 2; both employees have 5 years of experience and a performance review score of 7, and both make $17/hr. But first we must create an appropriate model that can estimate the salary for the two positions. Our research question is: “What factors influence the wages ($/hr) for positions 1 and 2?” We will examine the data using a linear regression with an α of .05 to create a model. Our statistical questions evaluate Location, Position, Years of Employment, and Performance and are as follows:

• Null hypothesis – H0: βlocation = 0
• Alternative hypothesis – HA: βlocation ≠ 0
• Null hypothesis – H0: β1 = 0
• Alternative hypothesis – HA: β1 ≠ 0
• Null hypothesis – H0: β1 = 0
• Alternative hypothesis – HA: β1 ≠ 0
• Null hypothesis – H0: β1 = 0
• Alternative hypothesis – HA: β1 ≠ 0

2. This paragraph(s) describes where the data came from (this is likely an internal database), the steps that you took to explore the data, and whether the data is appropriate for the model (you may break this into multiple paragraphs if needed). Where is the data from? How did you filter the data? Examine the assumptions in the instructions, determine the tables and plots that are needed, and state whether each assumption appears to be met (boxplot for outliers, scatter plots for continuous variables, observed by predicted for equal variance, and fit diagnostics for normality of error terms). You should reference the appropriate figures and tables in the paragraph. You should always provide a summary statistics table and discuss it (missing data and relevant statistics). The last sentence should state whether the data is appropriate for a linear regression.

3. This paragraph discusses the results. Were the model results significant (report the F value and the p-value from SAS)? What were the R-Squared and Root MSE, and what do they imply? Discuss the parameter estimates and interpret them. You should also discuss whether they are meaningful.

4. What is your recommendation for the use of the model? Is further analysis needed? Evaluate the two employees from paragraph 1 (Employee 1: Position 1, Location 8, 5 years of experience, performance review 7, and $17/hr vs. Employee 2: Position 2, Location 8, 5 years of experience, performance review 7, and $17/hr). Is either employee grossly over- or underpaid according to the model? What action, if any, should be taken to adjust their salary?

After the body of the memo, you should have several figures (histogram and boxplot of $/hr, scatter plots for each continuous variable compared to the dependent variable, observed vs. predicted, and fit diagnostics) and tables (summary statistics and parameter estimates). Format them, number them, and refer to them in the text.

Data Analysis Exercise 5 (DAX5)

The data that we will use for this analysis is from our labor dataset, and the analysis we will do is called a linear regression. But first you must filter the dataset to include both Position 1 and Position 2 employees. We have explored the dataset and have developed a question: What factors influence $/hr of position 1 and 2 employees? To find the answer to this question, I am going to analyze the dataset and perform a linear regression (LR) using $/hr as the dependent variable (DV), and explore other variables of the employee as independent variables (IV).

First, a brief refresher on linear regression. This is only a simplified review. For a better understanding, please refer to DAL 5 and review other statistical methods and linear regression materials as needed.

A linear regression is a way to model the statistical relationship between a response (or dependent) variable and one or more explanatory (or independent) variables.
The linear relationship between the two variables may be represented by a straight line, often called the regression line. Simply put, we want to see if the IV value can be used to accurately predict the DV value. Often, we can visually see if there is a relationship between the IV and DV by creating a scatterplot based on the data. See the following figures for examples of a positive, neutral, and negative relationship/correlation.

In Figure 1 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to increase in value, although the increase is not consistent due to error, or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. We might suspect a strong positive relationship based on our observation of this scatterplot between the IV and DV. The scatterplot also lets us check whether the relationship is linear.

Figure 1. Positive relationship between IV and DV (scatterplot of Dependent Variable vs. Independent Variable)

In Figure 2 we can see that as the independent variable increases, the value of the dependent variable seems to vary randomly above and below the regression line. The slope of the regression line might be close to zero. This would indicate that for any value of the independent variable, the value of the dependent variable is equal to the constant plus some random error value that we do not know. There may be other explanatory variables (IV) that have a stronger relationship with the response variable (DV).

Figure 2. Neutral relationship between IV and DV (scatterplot of Dependent Variable vs. Independent Variable)

In Figure 3 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to decrease in value, although the decrease is not consistent due to error, or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. We might suspect a strong negative relationship based on our observation of this scatterplot between the IV and DV.

Figure 3. Negative relationship between IV and DV (scatterplot of Dependent Variable vs. Independent Variable)

Again, a linear regression is a way to model a hypothesized statistical relationship between a predictor variable (IV) and a response variable (DV). What is the difference between a deterministic relationship and a statistical relationship?

In a deterministic relationship, an equation exactly describes the relationship between two or more variables. Examples are the relationship between Fahrenheit and Celsius (°F = 9/5 × °C + 32), and the relationship between circumference and diameter (Circumference = π × diameter). In a deterministic relationship, there are no error terms to consider, and we can create a simple linear equation to model the relationship as depicted in Figure 4. This will have an R-squared of 1.

Y value = constant + (slope × X value)

Figure 4. Deterministic linear equation

In a statistical linear relationship, there is a trend (positive, neutral, or negative), plus a constant, plus some error that we see as the “scatter” in a scatterplot. So, we must modify our linear equation to find the best-fitting line, the one that best “describes” the relationship between the predictor variable and the response variable.
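The idea of a best-fitting line can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what SAS runs internally: the function name `fit_line` and the small wage dataset are hypothetical, invented for this example. It uses the standard least-squares formulas for one predictor and reports R-squared, so the deterministic Fahrenheit/Celsius case can be compared with a noisy one.

```python
def fit_line(x, y):
    """Least-squares fit for one predictor; returns (intercept, slope, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared = 1 - (sum of squared errors / total sum of squares).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - sse / sst


# Deterministic relationship: Fahrenheit computed exactly from Celsius.
celsius = [0.0, 10.0, 20.0, 30.0, 40.0]
fahrenheit = [9 / 5 * c + 32 for c in celsius]
det_b0, det_b1, det_r2 = fit_line(celsius, fahrenheit)
print("deterministic:", round(det_b0, 4), round(det_b1, 4), round(det_r2, 4))

# Statistical relationship: HYPOTHETICAL years of employment vs. $/hr,
# made up for illustration (not taken from the labor dataset).
years = [1, 2, 3, 4, 5]
wage = [12.0, 13.5, 13.0, 15.5, 16.0]
b0, b1, r2 = fit_line(years, wage)
print("statistical:", round(b0, 4), round(b1, 4), round(r2, 4))
```

In the deterministic case the fit recovers the constant 32 and the slope 9/5 with an R-squared of 1. In the noisy case the line still has a constant and a slope, but R-squared falls below 1 because of the scatter around the line.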
Data analysis software like SAS and Excel finds the best-fitting line by adjusting the position and the slope of the line until the sum of all the squared errors (the differences between predicted and observed responses) has been minimized.

ŷi = the predicted response (or fitted value)
b0 = the estimated Y-axis intercept of the best-fitting line
b1 = the estimated slope of the best-fitting line
xi = the predictor variable value (IV value)
yi = the observed response value (DV value)
β0 = population regression line constant
β1 = population regression line slope
εi = error term (the difference between yi and the population regression line); it is estimated by the residual, yi − ŷi

ŷi = b0 + b1 xi
yi = β0 + β1 xi + εi

Figure 5. Statistical linear equation

So, as we can see in the equations in Figure 5, the statistical linear relationship approximately describes the relationship between the predictor value and the response value instead of the exact relationship described in a deterministic linear equation. Thus, we need to determine if β1 is not equal to zero (β1 ≠ 0).

In testing the null hypothesis for a simple linear regression, we should generally follow these steps:

1. State the plain-language research question: e.g., What factors influence $/hr for position 1 and 2 employees?

2. State the hypotheses:
• Null hypothesis – H0: βperformance = 0
• Alternative hypothesis – HA: βperformance ≠ 0

3. State the criteria for rejecting H0:
• α = 0.05

4. Consider the assumptions for linear regression:
• Assumption that there is a linear relationship between the response variable and the predictor variable (you should use scatter plots of the individual continuous independent variables compared to the dependent variable).
• Assumption that the errors, εi, are independent (research design)
  i. Is your data a “snapshot” or a “video” of your observations? If your data is more of a “video”, consider a time series analysis.
  ii. Non-significant Chi Square
• Assumption that the errors, εi, at each value of the predictor, xi, are normally distributed (not skewed, with a mean of zero). A non-significant Shapiro-Wilk statistic indicates a normal distribution of error terms. (Examine the residual plots.)
• Assumption that the errors, εi, at each value of the predictor, xi, have equal variances (σ²)
  i. No triangular-looking patterns between the response variable and the standardized residuals, and
  ii. Non-significant Chi Square
• Other items to consider:
  i. Outliers can cause erroneous results (e.g., a large Cook’s D, or standardized residuals beyond ±2)
  ii. The linear regression may not be the best fit (curvilinear, quadratic, etc.)
  iii. Large data sets can produce statistically significant results (small p-values) for effects that are not meaningfully different from 0
  iv. Averages of raw data (e.g., summing a region) can overstate the strength of the correlation, so be mindful of what you are trying to prove with your analysis.

5. Compute the appropriate statistics:
• Pearson correlation coefficient (remember that correlation does not imply causation!)
• F value
• Prob > F
• Did you observe any problematic outliers? What (if anything) can you do about them?

6. Decide whether to retain or reject your null hypothesis:
• If p > α, then retain the null hypothesis
• If p < α, then reject the null hypothesis and accept the alternative hypothesis
• Remember that statistical significance does not imply practical or meaningful significance!

7. Interpret the parameters (β0 and β1):
• What change in the expected response variable results from a one-unit increase in the predictor variable (what is the slope of the regression line)? Is it positive or negative? Is it meaningful?
• Is zero within your predictor variable (IV) value range? What does that mean?

Please watch the videos for detailed instructions:
• Filtered Dataset Video
• Running Correlation
• Interpreting Correlation
• Running the Regression
• Interpreting the Regression

In general, you are going to:

1. Download “Business Memo Template.docx” from the DAX4 Assignment.
2. Save the downloaded file as a Word file and name it “LastName DAX4” without the quotation marks. The file extension will be added automatically. For example, my file would be “Larson DAX5.docx”. Again, ignore the quotation marks in this part of the instructions.
3. Open SAS.
4. In “Tasks and Utilities / Data”, click on the “Filter Data” task.
5. In the “Filter Data” tab that opens, in the “Data” section, select the labor dataset that we have been working with (mine was named IS3310.labor f 2020).
6. In “Filter 1 / Variable 1”, click on “Position”, then click OK (you may have to scroll down to see OK). In “Comparison”, choose “Less than”; the value type should be “Enter a value”; then enter 3.
7. Scroll down to the “OUTPUT DATA SET” section, change the name “filter” to “POS 1 & 2”, and choose a library to put the filtered dataset in. I used a folder in my Library named “IS3310”. If you leave it in the “WORK” library, it will be deleted when you exit SAS. Click on “Show output data”.
8. Run the task (Running man or F3). Check the logs, check the results (it will only show 10 rows), and close the task.
9. Find the POS 1 & 2 dataset and double-click it to open the dataset in a new tab.
10. Explore the POS 1 & 2 dataset (e.g., use the Characterize Data task and summary statistics). Choose the Type and all numeric variables to explore. In OPTIONS, make sure “Descriptive Statistics” and “Histogram” are checked for NUMERIC VARIABLES.
11. Run the task (F3). Check the LOG for Errors, Warnings, and Notes; if there are any issues, fix the problem and rerun the task. Look at the OUTPUT. What are the N, Minimum, Mean, Maximum, and Standard Deviation of Performance and Years of Employment for the dataset? Are there any missing values? Add any appropriate charts and tables to your Business Memo.
12. New task. In the “Tasks and Utilities / Graph” section, double-click on “Scatter Plot”.
Your POS 1 & 2 dataset should already be in the DATA text box. (You will run this for both Performance and Years of Employment.) In the ROLES section, choose Years of Employment for “X axis” and $/hr for “Y axis”. You may need to scroll down to find OK.
13. In the APPEARANCE section, under “FIT CURVES”, click on the “Regression” option box.
14. Further down, in the TITLE AND FOOTNOTE section, enter “DV=$/HR IV=Years of Employment” without the quotes (use Years of Employment for the first graph and Performance for the second graph).
15. Run the task (F3). Check the LOG for Errors, Warnings, and Notes; if there are any issues, fix the problem and rerun the task. Look at the OUTPUT.
16. Does there appear to be a linear relationship between the IVs and DV in this dataset? If so, then let’s perform a linear regression on our filtered dataset called POS 1 & 2. (You will note that in the videos I run the data with additional variables.)
17. To perform a linear regression, we need to make a few assumptions:
  a. Linear relationship between IV and DV: The observations appear to be linearly related (no angles or curves in the scatter plots).
  b. Independence: The observations are random and independent samples from the population.
  c. Normality: Each group sample is drawn from a normally distributed population (residuals are normally distributed).
  d. Homogeneity of variance: The variances of the residuals in the populations are equal.
18. Test for linear relationship: Yes, the scatter plots indicate that there is a linear relationship for both variables.
19. Test for Independence: This is a methodological concern and is determined by the setup of the study. For this assignment, please consider that the assumption of independence has been met in the research study design. Once we run the Linear Regression task, we can check the Chi Square statistic for non-significance.
20. Test for Normality: Once we run the Linear Regression task, we will examine the diagnostic plots.
21. Select the correlation analysis option under Statistics. Verify that the POS 1 & 2 dataset is still selected, then add all the numeric variables to the analysis variables and run. Which variables does the correlation matrix suggest are good predictors of $/hr? Are some of the variables so highly correlated that they measure roughly the same thing?
22. Select “Linear Regression” in the “Tasks and Utilities / Linear Models” group. The POS 1 & 2 dataset should already be in the DATA section.
23. In the ROLES section, select $/hr as the “Dependent variable”, add Position and Location in the classification section, and add Performance and Years of Employment in the “Continuous variables” section (note: you may have to scroll down to see OK).
24. Next, in MODEL, you must click on the Edit icon to specify the model. When the “Model Effects Builder” window opens, select all the variables, click “Add”, and scroll down to click on OK.
25. In the OPTIONS / STATISTICS section, select “Default and selected statistics” and click all of the options. Select all boxes under Collinearity and Heteroscedasticity.
26. Run the task (F3). Check the LOG for Errors, Warnings, and Notes; if there are any issues, fix the problem and rerun the task. Look at the OUTPUT. Review the output and report your findings in the Business Memo. Is the model significant? The Pr > F value is .0183, which meets our rejection criterion of Pr > F being less than .05. Therefore, we reject the null hypothesis, accept the alternative hypothesis, and conclude that a statistically significant difference exists.
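For the memo's recommendation paragraph, the parameter estimates can be turned into a predicted wage for each of the two employees and compared with the $17/hr they actually earn. The Python sketch below only shows the arithmetic. Every number in `ESTIMATES` and `ROOT_MSE` is hypothetical and invented for illustration, the choice of reference levels is an assumption, and the "more than 2 Root MSE" flag is a rule of thumb assumed here rather than something the assignment prescribes. Substitute the values from your own SAS Parameter Estimates table.

```python
# HYPOTHETICAL parameter estimates, invented for illustration only.
# Replace them with the Parameter Estimates table from your own SAS output.
ESTIMATES = {
    "intercept": 9.0,
    "position": {1: -2.0, 2: 0.0},  # position 2 assumed to be the reference level
    "location": {8: 0.0},           # location 8 assumed to be the reference level
    "years": 0.8,                   # $/hr per additional year of employment
    "performance": 0.5,             # $/hr per additional performance point
}
ROOT_MSE = 1.5  # HYPOTHETICAL Root MSE, also invented for illustration


def predict_wage(position, location, years, performance, est=ESTIMATES):
    """Predicted $/hr: intercept + class-level effects + slope * value terms."""
    return (est["intercept"]
            + est["position"][position]
            + est["location"][location]
            + est["years"] * years
            + est["performance"] * performance)


def evaluate(actual, predicted, root_mse=ROOT_MSE, threshold=2.0):
    """Compare actual and predicted wage in units of Root MSE (assumed rule of thumb)."""
    residual = actual - predicted
    scaled = residual / root_mse
    if scaled > threshold:
        verdict = "possibly overpaid"
    elif scaled < -threshold:
        verdict = "possibly underpaid"
    else:
        verdict = "within the model's typical error"
    return residual, verdict


# The two employees described in paragraph 1 of the memo, both earning $17/hr.
employees = {
    "Employee 1": {"position": 1, "location": 8, "years": 5, "performance": 7},
    "Employee 2": {"position": 2, "location": 8, "years": 5, "performance": 7},
}
results = {}
for name, traits in employees.items():
    predicted = predict_wage(**traits)
    residual, verdict = evaluate(17.0, predicted)
    results[name] = (predicted, residual, verdict)
    print(f"{name}: predicted {predicted:.2f}, residual {residual:+.2f}, {verdict}")
```

Expressing the gap in units of Root MSE keeps the comparison tied to the model's own typical error, so a difference of a dollar or two is not reported as over- or underpayment when the model itself is that imprecise.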