[Scatter plot: Pct Alcohol (Y) versus Calories (X) for the 139 sampled beers, with fitted trendline y = 0.0275x + 0.9526 and R² = 0.814. X axis: 0 to 350 calories; Y axis: 0 to 12 percent alcohol.]
[Data columns: paired Calories and Pct Alcohol values for the 139 sampled beers, the source data for the scatter plot above; the complete data set appears in the brand table later in this file.]
Simple Linear Regression Analysis

Regression Statistics
Multiple R           0.9022
R Square             0.8140
Adjusted R Square    0.8126
Standard Error       0.5642
Observations         139

ANOVA
             df    SS        MS        F         Significance F
Regression   1     190.8133  190.8133  599.5098  0.0000
Residual     137   43.6047   0.3183
Total        138   234.4180

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept  0.9526        0.1777          5.3620   0.0000   0.6013     1.3039
Calories   0.0275        0.0011          24.4849  0.0000   0.0253     0.0297

Calculations
b1, b0 Coefficients         0.0275    0.9526
b1, b0 Standard Error       0.0011    0.1777
R Square, Standard Error    0.8140    0.5642
F, Residual df              599.5098  137.0000
Regression SS, Residual SS  190.8133  43.6047
Confidence level            95%
t Critical Value            1.9774
Half Width b0               0.3513
Half Width b1               0.0022
Lower 95% (b0, b1)          0.6013    0.0253
Upper 95% (b0, b1)          1.3039    0.0297
[Data column: Calories for the 139 sampled beers, used for the descriptive statistics below.]
Calories
Mean                 152.3165468
Median               150
Mode                 110
Minimum              55
Maximum              330
Range                275
Variance             1828.0005
Standard Deviation   42.7551
Coeff. of Variation  28.07%
Skewness             1.1860
Kurtosis             3.3511
Count                139
Standard Error       3.6264
[Data columns: Pct Alcohol split by distribution type, National (78 values) and Regional (61 values), used for the descriptive statistics below.]
Percent Alcohol and Dist. Type
                     National     Regional
Mean                 4.935897436  5.404918033
Median               4.7          4.9
Mode                 4.2          4.7
Minimum              0.4          3.8
Maximum              9.6          10.5
Range                9.2          6.7
Variance             1.4327       1.9428
Standard Deviation   1.1970       1.3938
Coeff. of Variation  24.25%       25.79%
Skewness             0.4212       2.1506
Kurtosis             4.8372       4.8715
Count                78           61
Standard Error       0.1355       0.1785
[Data column: Pct Alcohol for the 139 sampled beers, used for the descriptive statistics below.]
Percent Alcohol
                     Pct Alcohol
Mean                 5.141726619
Median               4.9
Mode                 4.7
Minimum              0.4
Maximum              10.5
Range                10.1
Variance             1.6987
Standard Deviation   1.3033
Coeff. of Variation  25.35%
Skewness             1.3832
Kurtosis             5.3178
Count                139
Standard Error       0.1105
Percent Alcohol
Five-Number Summary
Minimum         0.4
First Quartile  4.4
Median          4.9
Third Quartile  5.6
Maximum         10.5

[Boxplot: Pct Alcohol, scale 0 to 12.]
Percent Alcohol and Dist. Type
Five-Number Summary
                National  Regional
Minimum         0.4       3.8
First Quartile  4.2       4.7
Median          4.7       4.9
Third Quartile  5.6       5.7
Maximum         9.6       10.5

[Boxplots: Percent Alcohol by Dist. Type (National and Regional), scale 0 to 12.]
Calories
Five-Number Summary
Minimum         55
First Quartile  124
Median          150
Third Quartile  166
Maximum         330

[Boxplot: Calories, scale 50 to 350.]
[Data column: DistType for the 139 sampled beers (78 National, 61 Regional), followed by a bar chart of counts by distribution type, scale 0 to 90.]

Count of DistType
DistType     Total
National     78
Regional     61
Grand Total  139
[Bar chart: counts of beers by Dist. Type and Light (Yes/No), scale 0 to 60.]

Count of DistType
DistType     Light: No  Light: Yes  Grand Total
National     53         25          78
Regional     55         6           61
Grand Total  108        31          139
[Data sheet: all 139 sampled beers, one row per brand, with columns Pct Alcohol, Calories, Carbohydrates, DistTypeCODE (1 = National, 0 = Regional), DistType, and Light. The first rows are shown below; the sheet runs from Anchor Steam through Yuengling Premium Beer.]

Brand                         Pct Alcohol  Calories  Carbohydrates  DistTypeCODE  DistType  Light
Anchor Steam                  4.9          153       16.0           1             National  No
Anheuser Busch Natural Ice    5.9          157       8.9            1             National  No
Anheuser Busch Natural Light  4.2          95        3.2            1             National  Yes
Bud Dry                       5.0          130       7.8            1             National  No
Bud Ice                       5.5          123       8.9            1             National  No
[134 additional rows]
Frequency Distribution for Pct Alcohol

bins   Midpts.  Frequency  Percentage
-0.01  --       0          0.0%
0.99   0.5      1          0.7%
1.99   1.5      0          0.0%
2.99   2.5      2          1.4%
3.99   3.5      3          2.2%
4.99   4.5      72         51.8%
5.99   5.5      41         29.5%
6.99   6.5      9          6.5%
7.99   7.5      4          2.9%
8.99   8.5      4          2.9%
9.99   9.5      1          0.7%
10.99  10.5     2          1.4%
Total           139        100.0%
Beer Study
(A random sample of 139 beers.)
Scenario
You are employed as a research assistant at the Alcohol and Tobacco Tax and Trade Bureau (U.S.
Department of the Treasury), and your supervisor, R. Tyler Paterson, asks you to study the characteristics
of beers sold throughout the United States. For this purpose you take a sample of 139 beers.
For each beer you collect data on several variables (see the Variable INFO tab), but for the reports you will
be preparing you decide to focus on one key numerical variable, "Pct Alcohol" (i.e., the alcoholic content
in percentage), and one key categorical variable, "Light" (i.e., whether or not the beer product is
considered to be light). You also decide to use a grouping categorical variable, "Distribution Type," since
each beer is distributed nationally or regionally. This will enable you to compare both the percent
alcohol of the beer and whether or not the beer is considered to be light based on its distribution
(national or regional). In addition, you have also selected the numerical variable "Calories" in the beer to
develop a simple linear regression model to predict "Pct Alcohol."
Introduction to Simple Linear Regression Modeling
(Prepared by Mark L. Berenson)
This chapter is an introduction to regression analysis modeling techniques that enable you
to use a numerical independent variable to predict the values of a numerical dependent
variable of interest. For example, the placement officer at your university can predict the
expected starting salary (in thousands of dollars) of a graduating business student by
developing a simple linear regression model that uses cumulative grade point average as
the numerical independent variable.
Regression analysis is fundamental to business decision-making because it involves prediction, estimation, and forecasting (three words used here synonymously). Below are some examples of practical uses of regression analysis:
• An investment analyst can estimate your credit score rating based on current salary.
• A family doctor can forecast your relative's survivability from surgery based on hours in surgery.
• A financial analyst can predict your company's sustainability based on revenues generated through the year.
• A real estate agent can estimate the value of your house (in dollars) based on its size in square feet.
• The admissions director of an MBA program can forecast your chances of success by estimating your graduate grade point average based on your GMAT score.
In a regression analysis, the dependent variable, given by the symbol Y, is the numerical
variable of interest that you want to predict. The dependent variable is often referred to as
the response variable. The independent variable, given by the symbol X, is the numerical
variable used to make the prediction. The independent variable is often referred to as the
predictor or explanatory variable.
1 Developing the Simple Linear Regression Model
A regression analysis begins by visually observing the relationship between the two
numerical variables in a scatter plot. Therefore, to develop a simple linear regression
model you need two numerical measurements on each item in your sample, such as the
expected starting salary of a graduating student and his/her cumulative grade point
average, or the selling price of a house and its size in square feet. Each pair of
measurements is plotted such that the dependent variable of interest is on the vertical or Y
axis, and the independent or predictor variable is on the horizontal or X axis.
Figure 1 shows scatter plots demonstrating both strong and weak linear relationships.
FIGURE 1 Scatter plots of strong and weak linear relationships
In the top two panels of Figure 1 you observe positive relationships between X and Y; in
the bottom two panels you see negative relationships between X and Y.
Sometimes the cloud of points will appear to follow a curved pattern instead of a straight-line pattern. In such circumstances be sure to consult with a professional statistician; curvilinear regression analysis is outside the scope of this course, which focuses on simple linear regression analysis.
Figure 2 depicts scatter plots with linear and curvilinear relationships.
FIGURE 2 Scatter plots of linear and curvilinear relationships
To demonstrate the development of a simple linear regression model, Figure 3 displays part of an Excel worksheet of a data file constructed at Bergen University ("BU"). Figure 4 is the scatter plot representing the test scores achieved by the sample of 93 business students based on the number of hours the students claimed to have studied for their comprehensive (i.e., all topics covered in the semester) final exam in their core-required operations management course. You can see that the cloud of points plots upward and toward the right. Any straight line drawn through these points would therefore indicate a positive relationship between X and Y: a positive correlation and a positive slope.
FIGURE 3 Excel worksheet of the Bergen University data file containing
93 students and displaying student ID number,
test scores, and hours studied
ID Number  Test Scores  Hours Studied
ID0001     66           8.0
ID0002     72           7.5
ID0003     87           9.5
ID0004     55           2.0
ID0005     64           5.0
ID0006     83           9.5
ID0007     99           11.0
ID0008     79           9.0
:          :            :
ID0086     71           7.5
ID0087     75           8.5
ID0088     94           14.0
ID0089     39           0.0
ID0090     77           8.5
ID0091     70           6.5
ID0092     62           4.0
ID0093     80           8.5
FIGURE 4 Scatter plot of test scores with hours studied
The Simple Linear Regression Equation
The straight line developed from a sample of data is described by two statistics, the sample slope $b_1$ and the sample Y intercept $b_0$. The simple linear regression equation or prediction line is given by:

$$\hat{Y}_i = b_0 + b_1 X_i$$

where
$\hat{Y}_i$ is the predicted value of Y for observation i
$X_i$ is the value of X for observation i
$b_0$ is the sample Y intercept
$b_1$ is the sample slope
The Y intercept $b_0$ represents the mean or average value of Y when X equals 0. It is a necessary component of the simple linear regression equation, but it doesn't always have practical value. For instance, it would not be meaningful to predict the selling price of a house that has 0 square feet; the house doesn't exist! Similarly, it would not be meaningful to predict the expected starting salary of a graduating senior based on a cumulative grade point average of 0.0; that student would have flunked out with a straight F average and not be graduating!
The Y intercept $b_0$ is given by:

$$b_0 = \bar{Y} - b_1 \bar{X}$$

It is the difference between the mean or average value of Y and the product of the slope with the mean or average value of X. Therefore, if using a hand-held calculator or programming in Excel, the slope would have to be computed first.
The slope of the line b1 represents the mean or average amount that Y changes, either
positively or negatively, as a result of a one-unit change in X. Therefore, the slope will
be positive if the points on the scatter plot indicate that as X gets larger Y typically also
becomes larger. On the other hand, the slope will be negative if the points on the scatter
plot indicate that as X gets larger Y usually becomes smaller.
The slope $b_1$ is given by:

$$b_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The numerator term is the summation of the product of the difference between each
observation of the independent variable X and its mean with the difference between each
corresponding observation of the dependent variable Y and its mean. This result can be
positive or negative. The denominator term is the summation of the squared differences
between each observation of the independent variable X and its mean. This result can
only be positive. Therefore the slope b1 can be positive or negative.
If you are using Excel, it is easy to obtain $\bar{X}$ and $\bar{Y}$, the means of the respective columns of X and Y, and then compute columns of differences between each X observation and $\bar{X}$ as well as each Y observation and $\bar{Y}$. Once this is accomplished you can compute a column of the product of the latter two columns and, in summing the results, you will have obtained the numerator in the equation for the slope. You can then use your column of differences between each X observation and $\bar{X}$ to form a column of squared differences. In summing the results, you will have obtained the denominator in the equation for the slope. To simplify and expedite matters, however, you could use PHStat to develop your entire simple linear regression model.
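To make the recipe concrete, here is a minimal Python sketch (added for illustration; it is not part of the original chapter) that computes the slope and intercept exactly as described, using the first five student records from the Figure 3 worksheet as sample data:

```python
# A minimal sketch: computing the sample slope b1 and Y intercept b0
# by the column-of-differences recipe described above.
# Data: the first five student records shown in Figure 3.

hours = [8.0, 7.5, 9.5, 2.0, 5.0]    # X: hours studied
scores = [66, 72, 87, 55, 64]        # Y: test scores

n = len(hours)
x_bar = sum(hours) / n               # mean of X
y_bar = sum(scores) / n              # mean of Y

# Numerator: sum of (Xi - Xbar)(Yi - Ybar); denominator: sum of (Xi - Xbar)^2
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
denominator = sum((x - x_bar) ** 2 for x in hours)

b1 = numerator / denominator         # sample slope
b0 = y_bar - b1 * x_bar              # intercept: the slope is computed first

print(f"prediction line: Y-hat = {b0:.4f} + {b1:.4f} X")
```

Applied to the full 93-record data set, this same computation yields the coefficients reported in Figure 5.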
To demonstrate the development of a simple linear regression equation, suppose you are working as a research assistant to the chairperson of the Operations Management Department at Bergen University ("BU"). She wants to predict the test score that will be attained on a comprehensive (i.e., all topics covered in the semester) final examination based on the amount of time (in hours) that a student claims to have studied for that exam in this core-required course. The scatter plot of Figure 4, representing the test scores earned and the study hours claimed by a sample of 93 business students, indicates the slope will be positive: more study time should result in a higher grade. This scatter plot was created with PHStat. The PHStat printout shown in Figure 5 presents the developed simple linear regression model.
FIGURE 5 PHStat simple linear regression model of test scores by hours
studied
Regression Analysis of Test Scores by Hours Studied

Regression Statistics
Multiple R           0.9220
R Square             0.8500
Adjusted R Square    0.8483
Standard Error       3.9958
Observations         93

ANOVA
            df   SS         MS         F         Significance F
Regression  1    8233.0292  8233.0292  515.6524  0.0000
Residual    91   1452.9278  15.9662
Total       92   9685.9570

               Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept      48.7239       1.1474          42.4637  0.0000   46.4447    51.0031
Hours Studied  3.2988        0.1453          22.7080  0.0000   3.0102     3.5874

Note: Only the highlighted portions of the PHStat printout are important in this course.
In the lower left corner of the PHStat worksheet displayed in Figure 5, you obtain the sample Y intercept $b_0$ and the sample slope $b_1$ for the predictor variable Hours Studied under the Coefficients column. Your sample linear regression equation, or prediction line, is:

$$\hat{Y}_i = 48.7239 + 3.2988 X_i$$
Interpreting the Y Intercept and Slope
Note that the Y intercept $b_0$ is measured in the same units as the dependent variable Y, while the slope $b_1$ is measured in units of Y per one unit of X.
The Y intercept $b_0$ = 48.7239 indicates that if a student does not study for the final exam (that is, X = 0), the test score is predicted on average to be 48.7239, a failing grade. In this situation, the Y intercept is a meaningful prediction because it is plausible that a student, perhaps unwisely, will decide not to study. In fact, one of the 93 students in the sample did not study and scored a 39, a failing grade almost 10 points worse than what on average would be predicted!
The slope $b_1$ = +3.2988 indicates that for each additional one hour of study time, the average predicted change in the test score would be an increase of 3.2988 points.
Using the Sample Regression Equation for Prediction
When using the sample regression equation for prediction, you should only select values of the independent variable X within its relevant range, i.e., choose only values of X between the smallest and largest X that were used in developing the regression equation. Do not extrapolate beyond this relevant range.
In the sample of n = 93 Bergen University graduating business students, the predictor variable hours studied ranged from 0 to 15. The prediction line you developed could be used to predict the test score of any student claiming to have studied within that interval. Therefore, if, owing to an examination conflict, a student takes this (or a comparable) exam three days later and tells you he studied 10 hours, you would be able to predict that on average his test score is expected to be:

$$\hat{Y}_i = 48.7239 + 3.2988(10) = 48.7239 + 32.9880 = 81.7119$$

Therefore, a student who studies 10 hours for this exam could be predicted on average to score approximately 82.
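The prediction step, together with the relevant-range caution, can be expressed as a short Python sketch (an added illustration; the function name and the range check are this sketch's own, not from PHStat):

```python
# A minimal sketch: prediction with a relevant-range guard.
# Coefficients are those reported in Figure 5; the range 0 to 15 hours
# is the observed span of the predictor in the BU sample.

B0, B1 = 48.7239, 3.2988
X_MIN, X_MAX = 0.0, 15.0

def predict_score(hours_studied: float) -> float:
    """Predict a test score, refusing to extrapolate beyond the relevant range."""
    if not (X_MIN <= hours_studied <= X_MAX):
        raise ValueError("hours studied is outside the relevant range")
    return B0 + B1 * hours_studied

print(predict_score(10))   # 81.7119, i.e., approximately 82
```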
The scatter plot and the accompanying prediction line are depicted in Figure 6. Note that as the values of X get larger (i.e., more study time hours), the prediction line passes through the points in an upward direction (i.e., a prediction of a higher test score), so the slope $b_1$ is clearly positive. Also note that the points in the scatter plot distribute in a linear (rather than curvilinear) manner, so fitting a straight line to the data (rather than some curve) is appropriate. Not all the observed data points lie on a perfectly straight line, however; there is a scattering of points above and below the line, so the simple linear regression model developed is not a perfect fit. The next section reviews the various descriptive summary measures that enable you to assess how much variation exists around the fitted prediction line and to express how strong (or weak) your simple linear regression equation is.
FIGURE 6 Scatter plot and prediction line for test scores with hours
studied
2 Measures of Variation in Regression
From the Bergen University data file, a portion of which is extracted in the Excel worksheet of Figure 3, there is evidence of considerable variability in your numerical variable of interest, the test scores achieved by the 93 business students. Figures 7 through 10 depict the five-number summary, the boxplot, the stem-and-leaf display, and a set of descriptive summary statistics of the test scores. You can gain a better understanding of the characteristics of your dependent variable Y by observing its properties of central tendency, variation, and shape.
FIGURE 7 Five-Number Summary of Test Scores Obtained from PHStat

Five-Number Summary of Test Scores
Minimum         39.00
First Quartile  67.00
Median          72.00
Third Quartile  79.50
Maximum         99.00
FIGURE 8 Boxplot of Test Scores Obtained from PHStat
FIGURE 9 Stem-and-Leaf Display of Test Scores Obtained from PHStat
Stem & Leaf Test Scores
Stem unit: 10

Statistics
Sample Size     93
Mean            73.02
Median          72.00
Std. Deviation  10.26
Minimum         39.00
Maximum         99.00

 3 | 9
 4 | 7
 5 | 058
 6 | 11223334445566677777889999
 7 | 000000111111111222222333345555567788999
 8 | 00002233334456679
 9 | 014689
10 |
The test score data are very slightly left-skewed in shape. There is one distinct outlier: the test score of 39 earned by the student with ID0089, as seen in Figure 3. The Z score criterion does not flag any other test score as an outlier; all other Z values are between -3.00 and +3.00. On the other hand, the quartiles and interquartile range criterion signals that the poor exam grade of 47 and the truly excellent test score of 99 are also possible outliers.
The bottom line, however, from this assessment of your numerical random variable of interest is that test scores vary from very poor (39) to truly excellent (99) with a clustering of values in the low seventies (i.e., the mean is 73.02 and the median is 72).
FIGURE 10 Descriptive Summary Statistics for Test Scores Obtained from PHStat

Descriptive Analysis: Test Scores
Mean                 73.02
Median               72.00
Mode                 71.00
Minimum              39.00
Maximum              99.00
Range                60.00
Variance             105.28
Standard Deviation   10.26
Coeff. of Variation  0.14
Skewness             -0.05
Kurtosis             1.26
Count                93
Standard Error       1.06

Note: Only the highlighted portions of the PHStat printout are important in this course.
The question now is how much of the observed variation in the dependent variable test score can be explained or accounted for by building a simple linear regression model that uses claimed hours of study as a numerical independent or predictor variable of test score.
The total variation in the dependent variable Y can be divided into two parts: the "good" or explained variation that is due to the simple linear regression model you have developed, and the "bad" or unexplained variation that may be due either to an unaccounted-for curvilinear relationship with the independent variable, to other possible predictor variables not considered in the model, or simply to naturally occurring random variation.
With respect to the Bergen University sample, the total variation in test scores consists of two parts: the explained variation attributable to using the numerical independent variable hours studied in developing a simple linear regression model to predict test score, and the unexplained or residual error variation that may be due to one or more of several factors. Factors that may contribute to residual error include: an unaccounted-for curvilinear relationship with the numerical independent variable hours studied; other unaccounted-for numerical predictor variables such as hours of sleep the night before the exam or ability in mathematics; unaccounted-for categorical predictor variables such as gender, major, or interest in the subject; or simply random variation, such as why a student taking the same test at 9 a.m. might score a few points more or less than if the test was taken at 8 a.m.
The total variation in Y, the dependent variable of interest, is obtained as the summation of the squared difference between each of the $Y_i$ observations in the sample and the mean of the sample $\bar{Y}$. That is,

$$\text{Total Variation} = SST \text{ or Sum of Squares Total} = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

The total variation is often referred to as SST, the sum of squares "total." You should recognize this formula as similar to the numerator in computing the variance $S^2$ and standard deviation $S$ for a numerical variable X that you learned in your basic statistics course.
In this regression analysis, total variation represents the sum of squared differences between each student's test score and the average of all 93 students' test scores in the Bergen University sample. From the Total row of the SS column in the ANOVA (i.e., "analysis of variance") table presented in the PHStat printout in Figure 5, note that the total variation is 9685.9570 squared test score points.
The explained variation in Y, the part of the total variation in Y that is accounted for by the developed simple linear regression model, is obtained as the summation of the squared difference between each of the $\hat{Y}_i$ predicted observations in the sample and the mean of the sample $\bar{Y}$. That is,

$$\text{Explained Variation} = SSR \text{ or Sum of Squares Regression} = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$$

The explained variation is often referred to as SSR, the sum of squares "regression."
In this regression analysis, explained variation represents the sum of squared differences between each student's predicted test score and the average of all 93 students' test scores in the Bergen University sample. From the Regression row of the SS column in the ANOVA table displayed in the PHStat printout in Figure 5, note that the explained variation is 8233.0292 squared test score points. Each of the 93 predicted test scores ($\hat{Y}_i$) is obtained by substituting the 93 students' individual values of $X_i$ (i.e., hours studied) into the simple linear regression equation $\hat{Y}_i = 48.7239 + 3.2988 X_i$.
The unexplained variation in Y, the part of the total variation in Y that is not accounted for by the developed simple linear regression model, is obtained as the summation of the squared difference between each of the actual $Y_i$ observations and their corresponding predicted observations $\hat{Y}_i$ in the sample of size n. That is,

$$\text{Unexplained Variation} = SSE \text{ or Sum of Squares Residual Error} = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$

The unexplained variation is often referred to as SSE, the sum of squares "residual error."
In this regression analysis, unexplained variation represents the sum of squared differences between each student's actual test score and predicted test score. From the Residual row of the SS column in the ANOVA table displayed in the PHStat printout in Figure 5, note that the unexplained variation is 1452.9278 squared test score points.
The unexplained variation is a measure of how "off" the predicted values are from the actual values. From Figure 6, you can see that most of the observations are close to the prediction line and therefore the predicted test scores are, for most students, fairly close to their actual test scores.
In a perfectly fitting model, all the observed Y values would lie on the fitted prediction
line and there would be no scatter or residual error above and below the line. In such a
situation, the explained variation SSR would equal total variation SST and the
unexplained variation SSE would be 0. On the other hand, if the chosen predictor
variable X is completely independent of the dependent variable Y then the unexplained
variation SSE would equal the total variation SST and explained variation SSR would be
0. Given that SSR, SSE, and SST are each obtained by summing a set of squared
observations, it is impossible for SSR, SSE, or SST to ever have a negative result.
To summarize,

$$SST = SSR + SSE$$

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$$

and, in this regression analysis,

$$9685.9570 = 8233.0292 + 1452.9278$$
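This decomposition can be verified numerically. The following Python sketch (added for illustration, with made-up sample data) fits the least-squares line and checks that SST = SSR + SSE; the identity holds exactly only for the least-squares line fitted to the same data:

```python
# A minimal sketch: verifying the decomposition SST = SSR + SSE
# for a least-squares line. The sample data here are illustrative.

hours = [8.0, 7.5, 9.5, 2.0, 5.0]              # X values
scores = [66, 72, 87, 55, 64]                  # observed Y values

n = len(hours)
x_bar, y_bar = sum(hours) / n, sum(scores) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar                        # the least-squares fit

y_hat = [b0 + b1 * x for x in hours]           # predicted values
sst = sum((y - y_bar) ** 2 for y in scores)    # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)   # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(scores, y_hat))  # unexplained variation

print(round(sst, 4), round(ssr + sse, 4))      # the two values agree
```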
Figure 11 displays the different measures of variation for a particular student in the Bergen University sample. This student claims to have studied 11 hours for the exam and achieved the truly excellent grade of 99. Also plotted here are the prediction line $\hat{Y}_i = 48.7239 + 3.2988 X_i$ and the horizontal line at a height of 73.02 representing $\bar{Y}$, the mean test score on this final examination.
From Figure 11 you observe that this student scored 25.98 points higher on the test than the mean of all 93 student performances. The square of this difference represents this student's contribution to SST, the total variation. Moreover, you observe that 85.01, the predicted score for this student and others who also study 11 hours, is, on average, 11.99 points higher on the test than the mean of all 93 student performances. The square of this difference represents this student's contribution to SSR, the explained variation.
Furthermore, you also observe that this overachieving student scored 13.99 points higher on the test than what would be predicted on average for all students who study 11 hours. The square of this difference represents this student's contribution to SSE, the unexplained variation.
Combining this student's three calculations with similar calculations for the test scores of each of the other 92 students in the Bergen University sample would yield the SST, SSR, and SSE results shown here.
FIGURE 11 Measures of variation
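The arithmetic in Figure 11 can be checked directly. This short Python sketch (an added illustration) reproduces the three deviations quoted above for the student who studied 11 hours and scored 99:

```python
# A minimal sketch: one student's contributions to SST, SSR, and SSE.
# Values come from the example in the text (X = 11 hours, Y = 99).

b0, b1 = 48.7239, 3.2988
x_i, y_i, y_bar = 11, 99, 73.02

y_hat_i = b0 + b1 * x_i                   # 85.0107, about 85.01

total_dev = y_i - y_bar                   # 25.98 points above the mean (SST part)
explained_dev = y_hat_i - y_bar           # about 11.99 points (SSR part)
unexplained_dev = y_i - y_hat_i           # about 13.99 points (SSE part)

# Squaring each deviation gives this student's contribution to each sum.
print(round(total_dev, 2), round(explained_dev, 2), round(unexplained_dev, 2))
```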
Coefficient of Determination
The coefficient of determination, given by the symbol $r^2$, is the portion of the total variation in the dependent variable Y that is explained by the variation in the independent variable X in the simple linear regression model developed. That is,

$$r^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$

In other words, $r^2$ is the ratio of the explained variation to the total variation. Note that $r^2$ ranges from 0 (a horribly fitting and useless regression model) to 1 (a perfectly fitting regression model).
In the results shown in the ANOVA table of our regression model in Figure 5, note that

$$r^2 = \frac{SSR}{SST} = \frac{8233.0292}{9685.9570} = 0.8500$$

When $r^2$ is reported, it is often converted to a percentage. Therefore, 85.0% of the variation in student test score is explained or accounted for by the variation in hours studied in the simple linear regression model developed. This large $r^2$ indicates a strong linear relationship between test score and hours studied because the fitted regression model has explained 85.0% of the variability in the test scores. Only 15.0% of the variability in the test scores remains unaccounted for.
From the PHStat printout shown in Figure 5, note in the upper left corner under Regression Statistics that the result is displayed as R Square = 0.8500.
Coefficient of Correlation
The coefficient of correlation, given by the symbol r, measures the strength of the relationship between the two numerical variables, X and Y.
You can obtain the coefficient of correlation r by simply taking the square root of the coefficient of determination $r^2$ and then giving the result a "+" sign if the slope $b_1$ of the regression equation is positive or a "-" sign if the slope $b_1$ of the regression equation is negative. That is,

$$r = \sqrt{r^2}$$

where r is positive if $b_1$ is positive or r is negative if $b_1$ is negative.
Note that the coefficient of correlation r can range from -1 (i.e., a perfect negative correlation) to +1 (i.e., a perfect positive correlation), depending on whether the slope $b_1$ of the simple linear regression equation is, respectively, negative or positive. The closer the coefficient of correlation r is to either -1 or to +1, the stronger the relationship between X and Y; the closer r is to 0, the weaker the relationship. If r = 0, there is no association between X and Y.
For the results of our regression model,

$$r = \sqrt{0.8500} = 0.9220$$
Since our slope, $b_1 = +3.2988$, is positive, the coefficient of correlation r is +0.9220, an indication of a very strong positive association between test score and hours studied for the final examination.
From the PHStat printout shown in Figure 5, note in the upper left corner under
Regression Statistics that the result is displayed as Multiple R = 0.9220.
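Both measures follow directly from the sums of squares and the sign of the slope, as this brief Python sketch (added for illustration) shows using the Figure 5 values:

```python
# A minimal sketch: coefficient of determination and coefficient of
# correlation from the sums of squares reported in Figure 5.

import math

ssr, sst = 8233.0292, 9685.9570
b1 = 3.2988                                   # the slope is positive

r_squared = ssr / sst                         # 0.8500: 85% of variation explained
r = math.copysign(math.sqrt(r_squared), b1)   # r takes the sign of b1

print(round(r_squared, 4), round(r, 4))       # 0.85 0.922
```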
Caution
You must realize that correlation does not imply causation. Just because you find a
strong association between two variables does not mean that one caused the other.
Standard Error of the Estimate
The standard error of the estimate, given by the symbol $S_{YX}$, measures the average scatter or variability in a set of paired observations (i.e., the data points on a scatter plot) around the fitted regression line. You may recall from your introductory statistics course that the standard deviation S measures the average scatter or variability in a set of individual observations around the mean of all the observations in a sample. In other words, the standard error of the estimate $S_{YX}$ for a set of paired observations is just like the standard deviation S for a set of individual observations. The former indicates the average spread above and below the prediction line; the latter measures the average spread above and below the sample mean.
The standard error of the estimate $S_{YX}$ is obtained from:

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}}$$

Therefore, the standard error of the estimate $S_{YX}$ is the square root of $S_{YX}^2$, the variance around the fitted regression line. Note that the numerator is the unexplained variation, i.e., the sum of squares of residual error. Just like the standard deviation S, the standard error of the estimate $S_{YX}$ must yield a positive result. $S_{YX}$ would equal 0 only if all the observed data points in the scatter plot were to lie on a straight line. In such a case where the simple linear regression model perfectly fits the data, SSE would equal 0, SSR would equal SST, and $r^2$ would equal 1. The closer the observed data points are to the prediction line, the closer the value of $S_{YX}$ would be to 0, the closer $r^2$ would be to 1, and the better the fitted regression model would be for its purpose of prediction.
Figure 12 depicts two scatter plots, one indicating a strong positive relationship with a
small standard error of the estimate, the other describing a weak positive relationship with
a large standard error of the estimate.
FIGURE 12 Comparing Standard Errors of the Estimate
For the results shown in the ANOVA table of our regression model in Figure 5, note that

$$S_{YX} = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{1452.9278}{93-2}} = \sqrt{15.9662} = 3.9958$$

$S_{YX}^2$, the variance around the fitted regression line, is 15.9662 squared test score points, so $S_{YX}$, the standard error of the estimate, is 3.9958, or approximately 4 test score points. The standard error of the estimate $S_{YX}$ is measured in the same units as the dependent variable Y.
From the PHStat printout shown in Figure 5, note in the upper left corner under
Regression Statistics that the result is displayed as Standard Error = 3.9958.
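The computation of $S_{YX}$, and its comparison with the standard deviation S discussed next, can be sketched in a few lines of Python (an added illustration using the values reported in Figures 5 and 10):

```python
# A minimal sketch: standard error of the estimate from SSE,
# and its ratio to the standard deviation S of the test scores.

import math

sse, n = 1452.9278, 93
s_yx = math.sqrt(sse / (n - 2))   # sqrt(15.9662) = 3.9958

s = 10.26                         # standard deviation of test scores (Figure 10)
print(round(s_yx, 4))             # about 4 test score points
print(round(s_yx / s, 2))         # 0.39: spread around the line is roughly
                                  # 39% of the spread around the mean
```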
Assessing Variability
The standard error of the estimate $S_{YX}$ is a measure of scatter around the regression line, and the standard deviation S is a measure of scatter around the mean. Here, the standard error of the estimate $S_{YX}$ is approximately 4 test score points. From the PHStat printouts shown in Figures 9 and 10, the standard deviation S for the dependent variable is 10.26 test score points. Therefore, on average, an actual test score is expected to differ from its predicted test score by ±4 points, whereas an actual test score is expected to differ from the mean of all sampled test scores by ±10.26 points. Interestingly, the average spread around the prediction line is only about 39% of the average spread around the mean of all test scores, which shows how much more you learn about the distribution of test scores, and any estimates derived from it, by using a regression model for prediction rather than by studying the test score data alone.
Regression Memo
Assignment Instructions and Supporting Documents

OVERVIEW:
• Prepare a one-page memo with a second-page attachment (this will have the scatterplot) to the target of your assigned data set (so, it is actually a two-page document).
• Explain the regression analysis you have in your Excel file, focusing on what the analysis means to the company and without using any of the statistical terms associated with regression analysis.

SCENARIO:
You are hired as a research assistant to analyze the data from a particular study and write a memo regarding aspects of your analysis and, where appropriate, make recommendations. Your memo should be written to the individual and organization designated in your project theme described in the file BUGN 280 Excel Project Data File Information and should look similar to the sample memo below for Data Set 1 as far as content and formatting.

YOUR MEMO MUST INCLUDE:
1. Your scatterplot with a regression line. By checking the right box in your software, the regression analysis will produce this chart for you. Format the chart to make it easy to read, adding a title and axis labels.
2. Four basic regression statistics: correlation coefficient, coefficient of determination, slope, and y-intercept.
3. An explanation of your regression analysis: why you are using regression, what information is provided by each statistic, and the scatterplot. Explain the difference between correlation and regression, explain what each statistic means, and interpret each in practical terms.
4. Be resourceful. Leverage everything you can to help you. This is what it's like on the job. Deliver your task on time and with the information requested.

This is the Assessment of Learning for the School of Business Learning Goal 2a:
2. Be effective in (a) written and (b) oral communications and in the use of appropriate supporting electronic technologies.
The standard Graduation Writing Requirement Rubric will be used.