Data Analytics Exercise

Content Type

User Generated

User

ffl11194

Subject

Computer Science

Description

In this exercise, you will conduct a simple Linear Regression using SAS and Excel. The student will create a summary report of their analysis and create an informative graphic that is included in a business memorandum that provides a recommendation based on their analysis of the data set.

Our research question is "Is there any relationship between dairy and meat consumption and bone density in this data set?" You need to turn this question into a statistical question that you can answer using this data. You may need to do additional research to inform your recommendation (is there any theory or medical evidence supporting what you found in the data?) Remember, this is just one data set. Is there other data that you could find that would support or not support your findings?

If you need help conducting a Linear Regression in SAS and Excel, you can view the videos listed after this assignment.

DAL5 Video Link Links to an external site. http://is3310videos.com/indirect2.html?video=13

Unformatted Attachment Preview

C1-T Country Hip fracture per 100 000a Dairy consumption (kg)b North America United States 595 253.8 Canada 310.9 206.83 Europe United Kingdom 523.5 241.47 Ireland 488 247.17 Sweden 802.8 355.86 Norway 563 261.52 Denmark 853 295.62 Finland 440 361.19 Iceland 385 223.68 Netherlands 368.3 320.15 Belgium 538.7 238.47 Switzerland 413 315.78 Germany 522 247.24 France 443 260.48 Spain 353 177.49 Portugal 408 222.94 Italy 498.4 256.1 Malta 502.5 188.64 Hungary 488 175.59 Russia 249 172.46 Kazakhstan 651.1 260 Oceania Australia 295 235.11 New Zealand 288 110.4 Latin America Mexico 169 115.18 Argentina 298 213.1 Brazil 138 124.61 Venezuela 150 87.29 Asia China 97 29.04 India 159 105.1 South Korea 262.8 71.5 Japan 266 93 Hong Kong 110 13.98 Thailand 7.05 22.48 Turkey 357 138.71 Jordan 198 88.1 Africa Morocco 85.9 50 Cameroon 3 14.4 Kenya 245 98.64 Nigeria 2 5.4 South Africa 20 57.92 Lactase persistence SNPsc Animal protein (kg)d 86.5 126.6 80 108.1 90 83.9 95 106.3 95 77.1 86 65.7 96 61.7 88 67.4 95 84.8 99 89.3 85 86.1 90 72.9 82.5 82.1 65 101.1 85 118.6 75 91.1 81 90.4 80 86.9 63 100.7 75 51 28.5 56.02 90 117.6 91 104 45 62.2 40 36.5 15 80.5 20 56.6 5 53.5 36.5 3.3 10 48.9 2.5 45.4 10 133.9 2 27.9 29 19.3 25 29.8 27 23.8 5 13.5 35 15.4 5 8.6 16 39 Per capita incomee 54 370 44 967 39 826 51 284 46 219 67 166 44 625 40 661 44 029 47 960 43 139 58 149 46 216 40 538 33 835 27 069 35 131 33 198 25 019 24 449 19 744 46 550 35 305 17 950 22 302 16 155 17 759 13 224 5808 35 379 37 519 55 097 15 579 19 698 11 971 7813 3007 3009 6054 13 094 Current Date From: Your Name To: InstructorSalutation InstructorLastName (e.g. Dr. Sherman) Subj: Name of your issue analysis (edit this title) 1. Identify the issue. In this paragraph, summarize the analysis project. In other words, identify the issue under investigation so that the reviewer knows what this analysis is about. This could be a problem or an opportunity. 2. Analyze the issue. In this paragraph, briefly describe: a. where the data came from b. the steps you took to prepare the data for analysis c. how you analyzed the data d. the results of your analysis i. you should have a table with summary statistics ii. you should have some type of graphic (column chart, pie chart, scatter graph, etc.) that illustrates your findings 3. Discussion. In this paragraph, in clear language, what did you find in your analysis, and what are the implications of your findings (for this exercise, you can be creative) 4. Conclusion. In this paragraph, briefly discuss what you learned in doing this exercise. Linear Regression Objectives • Introduce regression assumptions • Explain the value of Linear Regression • Explain the process for Dummy Coding • Explain the basic interpretation of the Linear Regression Model • Explain the value of examining Plots Regression • A linear regression model attempts to explain the relationship between two or more variables using a straight line. • Regression allows the creation of a predictive equation. • Regression allows you to evaluate the significance of a given variable in the presence of other variables. • Simple Regression examines the relationship between two variables. • Y is the dependent variable that changes as the value of X changes. • X is the independent variable • yi=β0+β1χ𝟏𝒊 • Β0 represents the intercept and β1 represents the slope. Regression Assumptions • linearity and additivity of the relationship between dependent and independent variables: • The expected value of dependent variable is a straight-line function of each independent variable. • This eliminates the use of categorical variables even though you may have coded them as numbers. • Categorical variables need to be dummy coded. • The slope of the line does not depend on the values of the other variables. • If two variables depend on each other you can include an interaction as a variable (Not covered in this lecture). • In the example that we use it may be more appropriate to model the interaction between Years of Service and Performance. • The effects of different independent variables on the expected value of the dependent variable are additive. • statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data) • homoscedasticity (constant variance) of the errors • normality of the error distribution. Employee Salary Example • What is the relationship between Years of Service and Salary? • Data includes • Years of Service • Salary Simple Linear Regression Equation and Interpreting Results • Yi=β0+β1χ𝟏𝒊 + ϵi • 𝒀 𝒊𝒔 𝒕𝒉𝒆 𝑫𝒆𝒑𝒆𝒏𝒅𝒆𝒏𝒕 𝑽𝒂𝒓𝒊𝒂𝒃𝒍𝒆 𝒓𝒆𝒑𝒓𝒆𝒔𝒆𝒏𝒕𝒊𝒏𝒈 𝒉𝒐𝒖𝒓𝒍𝒚 𝒔𝒂𝒍𝒂𝒓𝒚 • The subscript i represents a variable that indicates a specific instance or in this case employee. • β0 is the intercept • This should be roughly the base starting salary. • Your stats software allows you to set this value to 0. Do not take this option in this case. • β1χ𝟏𝒊 represents the years of service. • β1 Represents the expected change in Y for every change of 1 in the years of years. • χ𝟏𝒊 Represents the number of years. • ϵi represents the error. • This is the difference between what is expected and what is observed. Employee Salary Example Regression Statistics Multiple R 0.71 R Square 0.50 Adjusted R Square 0.50 Standard Error Observations • The F test is significant indicating that the model is better than a model with no predictor. • R2 indicates that the model explains roughly 50% of the variance. • Ŷ=8.74+.𝟏. 𝟐𝟏χ𝟏 • Ŷ= Expected Salary 2.84 374 • Expected Salary = Beginning Salary + Annual Raise • $8.74/Hour is reasonable as a base salary • Each year of service adds nearly $1.21 per hour ANOVA Regression Residual Total Intercept Years of Service df 1.00 372.00 373.00 Coefficien ts 8.74 1.21 SS 3009.29 2999.14 6008.43 Standard Error MS 3009.29 8.06 F Significance F 373.26 0.28 t Stat 30.77 P-value 0.00 0.06 19.32 0.00 0.00 Lower 95% 8.19 Upper 95% 9.30 1.09 1.34 Employee Salary Example Cont. $/H 14 Years of Employment 5 • Ŷ=8.74+1. 𝟐𝟏 ∗ 𝟓 = 𝟏𝟒. 𝟕𝟗 • Given this model our estimate we can calculate the error term. • Yi=β0+β1χ𝟏𝒊 + ϵi • ϵi=$-.79 = 14-14.79 Valid and Meaningful? • The variance is not consistent as the error terms get wider as the years of employment increase. • Other variables may explain this. Employee Salary Example Expanded • What is an appropriate model to evaluate employee salary? • Data includes • • • • • Gender Position Description Height Years of Service Performance Review Score • Additional Variables will explain more variance and shrink the error if they are significant. Employee Salary Example Expanded Research Questions • Does the company discriminate based on height or gender? • Do different positions make different salaries? Dummy Coding • Dummy coding allows you to use categorical variables in simple regression. • You will assign a base model and the added variables will act to change the intercept for the model. • For n categories you will need n-1 variables • For the category gender assuming only male and female entries • You would need a single numeric variable assigning 1 to Male and 0 for female, or visa versa. • In this case female is the base model and male is adopted when the value of the dummy variable is 1. • For Job Description there are 3 categories. (Manager, Hourly Maintenance, and Hourly Administrative. • We will make hourly administrative the base model and create two variables that represent the manager and the hourly maintenance categories. Dummy Coding Gender M F M F Coded Values Job Category Hourly Maintenance Hourly Administrative Manager Manager Dummy Coded Values Gender Hourly Maintenance Manager 1 1 0 0 0 0 1 0 1 0 0 1 • Here we have coded the Female Hourly Administrative worker as the base model so all the dummy variables are 0. The intercept of the line will change as either the gender or the job description changes. Sample of Raw Data and Coding $/H 14 14 13 12 10 10 15 Performance Review 15 14 14 7 7 6 17 Years of gender Employment Height 0 5 72 0 6 62 1 2 66 0 3 69 1 1 68 0 2 76 1 5 63 Maintenance 1 1 1 1 1 1 1 Manager 0 0 0 0 0 0 0 Model and Interpreting Results • Yi=β0+β1χ𝟏𝒊 + β2χ𝟐𝒊 + β3χ𝟑𝒊 + β4χ𝟒𝒊 + β5χ𝟓𝒊 + β6χ𝟔𝒊 + ϵi • 𝒀 𝒊𝒔 𝒕𝒉𝒆 𝑫𝒆𝒑𝒆𝒏𝒅𝒆𝒏𝒕 𝑽𝒂𝒓𝒊𝒂𝒃𝒍𝒆 𝒓𝒆𝒑𝒓𝒆𝒔𝒆𝒏𝒕𝒊𝒏𝒈 𝒉𝒐𝒖𝒓𝒍𝒚 𝒔𝒂𝒍𝒂𝒓𝒚 • The subscript i represents a variable that indicates a specific instance or in this case employee. • β0 is the intercept • This should be roughly the base starting salary. • Your stats software allows you to set this value to 0. Do not take this option in this case. • β1χ𝟏𝒊 represents the performance review. • β1 Represents the expected change in Y for every change of 1 in the performance review. • χ𝟏𝒊 Represents the value of a performance review. Model and Interpreting Results • Yi=β0+β1χ𝟏𝒊 + β2χ𝟐𝒊 + β3χ𝟑𝒊 + β4χ𝟒𝒊 + β5χ𝟓𝒊 + β6χ𝟔𝒊 + ϵi • β2χ𝟐𝒊 represents the gender of the employee. • β2 Represents the expected change in Y if the employee is male. • χ𝟐𝒊 Represents the value assigned the gender (0 or 1). • β3χ𝟑𝒊 represents the years of employment. • β3 Represents the expected change in Y for every year of service • χ𝟑𝒊 Represents the number of the years of service for an employee • β4χ𝟒𝒊 represents the height. • β4 Represents the expected change in Y for every change of inch taller an employee is. • χ𝟒𝒊 Represents the height of an employee. Model and Interpreting Results • Yi=β0+β1χ𝟏𝒊 + β2χ𝟐𝒊 + β3χ𝟑𝒊 + β4χ𝟒𝒊 + β5χ𝟓𝒊 + β6χ𝟔𝒊 + ϵi • β5χ𝟓𝒊 represents the gender of the employee. • β2 Represents the expected change in Y if the employee is a maintenance worker. • χ𝟐𝒊 Represents if an employee is a maintenance worker. • β6χ𝟔𝒊 represents the years of employment. • β6 Represents the expected change if the employee is a manager. • χ𝟔𝒊 Represents if an employee is a manager or not • ϵi represents the error. • This is the difference between what is expected and what is observed. Model and Interpreting Results Intercept Performance Review Gender Years of Employment Height Maintenance Manager Coefficients 9.95 Standard Error 0.76 t Stat 13.15 P-value 0.00 Lower 95% Upper 95% 8.46 11.44 0.22 0.10 0.02 0.12 12.39 0.77 0.00 0.44 0.19 -0.15 0.26 0.34 0.27 -0.02 0.20 9.77 0.04 0.01 0.20 0.25 6.36 -1.53 0.98 39.06 0.00 0.13 0.33 0.00 0.18 -0.04 -0.20 9.28 0.35 0.00 0.60 10.26 • We see a number of the variables are not significant (gender, height, and Maintenance). P>.05 and the confidence interval contains 0. • In general we should eliminate them one at a time and rerun the results. Model and Interpreting Results • In this model (In the presence of the other variables.) • The gender of the employee does not influence employee salary. • It does not appear that the company discriminates. • Maintenance vs Admin non-supervisory employee does not make a difference in salary. • We can merge these two groups. • Manager did make a difference so it stays as a dummy variable. • The height of the employee does not make a difference in salary. • It does not appear that the company discriminates. Model and Interpreting Results Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept Performance Review Years of Employment 8.89 0.16 54.74 0.00 8.57 9.21 0.22 0.02 12.31 0.00 0.18 0.25 0.27 0.04 6.46 0.00 0.19 0.35 Manager 9.75 0.25 39.26 0.00 9.26 10.24 • Ŷ=8.89+.𝟐𝟏𝟖𝟗𝟐χ𝟏 + .26𝟗𝟎χ𝟑 + 9.748χ𝟔 • • • • • • Ŷ= Expected Salary Expected Salary = Beginning Salary + Performance + Years of Service + Management Promotion Pay Raise $8.89/Hour is reasonable as a base salary Each point of performance adds $.22 per hour Each year of service adds nearly $.27 per hour Management responsibility adds $9.75 per hour Interpreting Results $/H 14 Performance Review 15 Years of Gender Employment Height 0 5 72 Maintenance 1 • Ŷ=8.89+.𝟐𝟏𝟖𝟗𝟐 ∗ 𝟏𝟓 + .26𝟗𝟎 ∗ 𝟓+ 9.748 ∗ 𝟎 = 𝟏𝟑. 𝟓𝟐 • Given this model our estimate we can calculate the error • Yi=β0+β1χ𝟏𝒊 + β3χ𝟑𝒊 + β6χ𝟔𝒊 + ϵi • ϵi=$.52= 14-13.52 Manager 0 Valid and Meaningful? • A Significant F-test indicates that the provided model is statistically more significant than a model with no predictors. • R2 = .9119 indicates that the model explains roughly 91% of the variance. • A valid or important R2 may be high or low depending on the type of analysis. • Set processes will likely have a high R2 • You would expect this salary model to have a high R2 in a given company. Valid and Meaningful? • The manager errors seem larger. • Is including the data appropriate? • Are there more additional variables that can be collected to explain the errors? • If you supervise admin vs maintenance Simple vs Linear • The model seems to be a better fit, but it is not perfect. Valid and Meaningful? • As expected the manager plot has far more variance. The data should perhaps be separately analyzed. Regression • A linear regression model attempts to explain the relationship between two or more variables using a straight line. • Regression allows the creation of a predictive equation. • Regression allows you to evaluate the significance of a given variable in the presence of other variables. • Adding additional variables explains additional variance and reduces error. • Examining the plots enable us to evaluate if the data meets assumptions and also if additional variables are required to explain a phenomenon. DAX5 (1) Criteria Ratings Pts Full Marks Business Memo Submitted. No Marks 0.0 pts 5.0 pts 5.0 pts No Marks Correct Salutation Full Marks 5.0 pts 0.0 pts 5.0 pts Full Marks No Marks Relevant Subject Line 5.0 pts 5.0 pts 0.0 pts Full Marks Paragraph 1 summarizes problem/opportunity No Marks 0.0 pts 10.0 pts 10.0 pts Full Marks No Marks Paragraph 2 provides an Simple Linear Regression analysis of data with summary table. 30.0 pts 30.0 pts 0.0 pts Full Marks No Marks Linear regression graphic 5.0 pts 5.0 pts 0.0 pts No Marks Paragraph 3 provides a clear discussion on analysis findings. Full Marks 20.0 pts 20.0 pts 0.0 pts No Marks Paragraph 4 is a summary Paragraph with Recommendation Full Marks 20.0 pts 20.0 pts 0.0 pts Total Points: 100.0
Purchase answer to see full attachment

Tags: IS3310 Data Analytics Exercise

User generated content is uploaded by users for the purposes of learning and should be used following Studypool's honor code & terms of service.

Explanation & Answer

Attached.

Surname 1

Current Date
From: Student Name
To:

Instructor (e.g. Dr. Sherman)

Subj: Linear Regression Analysis
Introduction
Hip fracture incidences encounter various people globally. Ideally, the cause of the
problem instigated continuous debate, with numerous scholars claiming that degenerative
bone disease such as Osteoporosis, which weakness skeletal mass increases the risk of
fracture. Presently, more than 200 million people globally have experienced a hip fracture,
resulting to a huge burden on fiscal and health-care resources.
Problem statement
In order to confirm the possibility of rare outcomes of hip fracture incidence, this
research project examines whether dairy and meat consumption influences the rate of hip
fracture.
Research Objective
Research Question
Is there any relationship between dairy and meat consumption, and bone density?
Statistical Question
Does dairy and meat consumption influence bone density?
Research Methodology
Generally, using statistics from Africa, Asia, Europe, Asia, North America, and Latin
America, data from 40 countries on hip fracture incidence, dairy consumption, lactase
persistence SNPs, animal protein consumption and per capita income and references.
Utilizing Excel and SAS, An analysis of linear regression was applied to assess the
relationship between meat and dairy consumption towards bone density.

Surname 2

Presentation of Findings
The analysis findings of data from 40 countries containing information of dairy and
meat consumption, and bone density, were computed and presented in summary tables and
scatter plots to show the correlation between the variables. The output independent variable is
dairy and meat consumption, whereas, bone density is the dependent variable.
Table 1.1 Shows correlation between Hip fracture rates per 100,000 and dairy consumption
for the data set of 40 countries globally. The correlation value is 0.85
SUMMARY OUTPUT
Regression Statistics
Multiple R

0.852238764

R Square

0.726310911

Adjusted R Square

0.719108566

Standard Error

111.8497235

Observations

Intercept
X Variable 1

40

Coefficients

Standard
Error

t Stat

P-value

28.5661493

35.5862439

0.80273

0.427123

-43.4744

100.6067

-43.4744

1.783139656

0.17756649

10.0421

3.04E-12

1.423675

2.142604

1.423675

Lower 95%

Upper
95%

Lower
95.0%

The Table 1.1 a summary output computed using Excel, also visualized in Figure 1.1
Table...