EFFECTS OF SMOKING ON NON-ACCIDENTAL DEATH RATES
EC 315 – Quantitative Research Methods
Russ Miller
Fall II 2006
TABLE OF CONTENTS
BACKGROUND
REGRESSION ANALYSIS
CONCLUSIONS
BIBLIOGRAPHY
APPENDIX
I. Background
It is widely accepted that the use of tobacco presents serious health risks. In fact, the Centers
for Disease Control and Prevention (CDC) states that “Tobacco use, including cigarette smoking, cigar
smoking, and smokeless tobacco use, is the single leading preventable cause of death in the
United States” (Center for Disease Control and Prevention, 2006). The purpose of this analysis
is to determine the effects of tobacco use (SMOKE) on the non-accidental death rate (DEATH)
while holding the effects of alcohol consumption (ALCOHOL), drug use (DRUG), and health
insurance (INSUR) constant. This study will use cross-sectional data from the 50 states for the
2002-2003 combined time period. The model (less constant and coefficients) is:
DEATH = SMOKE + ALCOHOL + DRUG + INSUR
The dependent variable, DEATH, is defined as the death rate per 100,000 population for
major causes of death in the United States, excluding non-health related causes such as
automobile accidents, homicide, etc., and is extracted from the National Vital Statistics Reports
(2006).
Data for SMOKE, ALCOHOL, and DRUG were taken from the Census Bureau’s Statistical
Abstract (2006) and are based on results from the National Survey on Drug Use and
Health (NSDUH). SMOKE is defined as the number of people over 12 years of age (in
thousands) who had smoked a cigarette at least once in the month prior to the study. ALCOHOL
and DRUG are similarly defined with ALCOHOL representing binge drinking, and DRUG
representing the use of any illicit drug. These three variables were selected because the CDC has
stated in the Morbidity and Mortality Weekly Report that “tobacco, alcohol, and other drug use is
associated with the leading causes of morbidity and mortality…” (Center for Disease Control
and Prevention, 1992). The relationships between DEATH and SMOKE, ALCOHOL, and
DRUG are all expected to be positive, since tobacco, alcohol, and illicit drug use are each
associated with higher mortality.
The independent variable INSUR is defined as the number of people (in thousands) not
having insurance, and the data for this variable were also extracted from the Census Bureau’s
Statistical Abstract (2006). This variable was selected because an MIT Sloan study indicated
that automobile accident victims without health insurance were more likely to die than their
insured counterparts because of differences in the medical treatment received (MIT Sloan
Management [MIT], 2003). Although this study deals with non-accidental deaths as opposed to
automobile accident victims, it is assumed that the implied lower standard of care for uninsured
patients may occur for other causes of death as well. The relationship between DEATH and
INSUR is expected to be positive, since lacking insurance has been linked to lower-quality
health care.
II. Regression Analysis
The model was estimated by ordinary least squares, and the results are shown in Table 1.
Table 1. Original Regression Results
Dependent Variable: DEATH

Independent Variable    Coefficient    t Statistic    P-Value
SMOKE                       25.4186         7.1866     0.0000
ALCOHOL                     -5.9164        -1.0634     0.2932
DRUG                        37.3578         4.1175     0.0002
INSUR                       -2.1520        -1.1743     0.2463

Adjusted R2 = 0.9802    n = 51
You should now discuss these results…is your R2 good, bad, etc? What percentage of the
variation is explained by the regression? Are the coefficients (signs) as you expected?
Which variables are statistically significant?
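As a hedged illustration of the significance check these questions ask for, the sketch below recomputes each t statistic as coefficient divided by standard error and lists the variables whose reported p-value falls below 0.05. The coefficients, standard errors, and p-values are copied from the Excel output in the Appendix (where SMOKE and INSUR appear as TOBACCO and NO_INS). The 0.05 threshold and the names `results`, `ALPHA`, `t_statistic`, and `significant_variables` are assumptions made for this example, not part of the assignment.

```python
# (coefficient, standard error, reported p-value) for each variable,
# copied from the Appendix output for the original regression.
results = {
    "SMOKE":   (25.4186, 3.5370, 0.0000),
    "ALCOHOL": (-5.9164, 5.5638, 0.2932),
    "DRUG":    (37.3578, 9.0729, 0.0002),
    "INSUR":   (-2.1520, 1.8327, 0.2463),
}

ALPHA = 0.05  # assumed significance level; not stated in the assignment


def t_statistic(coefficient, standard_error):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coefficient / standard_error


def significant_variables(table, alpha=ALPHA):
    """Names of variables whose reported p-value is below alpha."""
    return [name for name, (_, _, p_value) in table.items() if p_value < alpha]


for name, (coef, se, p_value) in results.items():
    print(f"{name:8s} t = {t_statistic(coef, se):8.4f}  p = {p_value:.4f}")

print("Significant at the assumed 5% level:", significant_variables(results))
```

Under these assumptions, only the variables with p-values below the chosen threshold are reported as significant; a different threshold (for example 0.10) would be passed as `alpha`.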
Next, test for multicollinearity. Insert the correlation matrix and comment on the results,
and then compute the Variance Inflation Factors.
Table 2. Cross Correlation Matrix

Independent Variables    SMOKE    ALCOHOL    DRUG    INSUR
SMOKE                      X
ALCOHOL                              X
DRUG                                           X
INSUR                                                  X
The rule of thumb for the cross correlation is that all coefficients should fall between -0.7 and
+0.7. Values in the above table outside that range are problematic. Comment on your
specific results.
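The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the report: the function names `pearson` and `flag_pairs` are hypothetical, and the numbers in `example` are invented placeholder values, not the study data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_pairs(data, limit=0.7):
    """Return (name, name, r) for every pair whose |r| exceeds the limit."""
    names = list(data)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(data[first], data[second])
            if abs(r) > limit:
                flagged.append((first, second, round(r, 4)))
    return flagged


# Hypothetical placeholder values, NOT the study data.
example = {
    "SMOKE":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "ALCOHOL": [2.0, 4.1, 5.9, 8.2, 9.8],
    "INSUR":   [5.0, 1.0, 4.0, 2.0, 3.0],
}

for first, second, r in flag_pairs(example):
    print(f"{first} and {second}: r = {r} is outside the -0.7 to +0.7 range")
```

Any pair printed by the loop would be a candidate for the multicollinearity discussion; pairs inside the range are not listed.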
Variance Inflation Factors
SMOKE    ALCOHOL    DRUG    INSUR
The rule of thumb for VIFs is that they should be less than 10. Comment on your specific results.
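For reference, the Variance Inflation Factor of an independent variable is commonly computed as 1 / (1 - R2), where R2 comes from regressing that variable on the other independent variables. The sketch below only applies that formula and the threshold of 10; the function names and the sample R2 values are assumptions for illustration, since the report does not list the auxiliary regression results.

```python
def variance_inflation_factor(aux_r_squared):
    """VIF = 1 / (1 - R2), with R2 taken from the auxiliary regression of
    one independent variable on all of the others."""
    if not 0.0 <= aux_r_squared < 1.0:
        raise ValueError("auxiliary R2 must be in the range [0, 1)")
    return 1.0 / (1.0 - aux_r_squared)


def exceeds_rule_of_thumb(vif, limit=10.0):
    """True when the VIF is at or above the rule-of-thumb limit."""
    return vif >= limit


# Hypothetical auxiliary R2 values, NOT results from the study data.
for aux_r2 in (0.50, 0.95):
    vif = variance_inflation_factor(aux_r2)
    print(f"auxiliary R2 = {aux_r2:.2f} -> VIF = {vif:.2f}, "
          f"problematic: {exceeds_rule_of_thumb(vif)}")
```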
Based on your findings above regarding statistical significance of independent variables
and multicollinearity issues, attempt to improve the regression by removing independent
variables if appropriate. Only remove one at a time and see if the regression is better or
worse. Try various combinations as appropriate. If all of your variables are statistically
significant and there are no multicollinearity problems, try lagging (time-series data
only) or logging the model and see if you can improve the regression. Even if you do end
up removing some independent variables you can still try lagging or logging for
additional improvement.
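Logging the model means replacing each variable with its logarithm before re-running the regression. The sketch below shows one way that step might look, assuming natural logarithms and assuming the "L" prefix seen in the Appendix variable names (LTOBACCO, LDRUG, and so on) denotes the logged series. The function name `log_transform` and the sample values are hypothetical.

```python
import math


def log_transform(data):
    """Return a new dict in which every column is replaced by its natural
    log and the column name gains an 'L' prefix (e.g. DRUG -> LDRUG)."""
    logged = {}
    for name, values in data.items():
        if any(value <= 0 for value in values):
            raise ValueError(f"{name} contains non-positive values; cannot take logs")
        logged["L" + name] = [math.log(value) for value in values]
    return logged


# Hypothetical placeholder values (in thousands), NOT the study data.
sample = {"TOBACCO": [120.0, 850.0, 2400.0], "DRUG": [40.0, 310.0, 900.0]}

for name, values in log_transform(sample).items():
    print(name, [round(value, 4) for value in values])
```

The logged columns would then be used in place of the originals when the regression is re-estimated; the dependent variable can be transformed the same way.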
After all attempts to improve the regression, compare the “original” and “final”
regressions and discuss the results.
Independent    Original Regression          Final Regression             Comments
Variables      Adjusted R2 = 0.9802         Adjusted R2 = 0.9825
               Coefficient    P-Value       Coefficient    P-Value
SMOKE
ALCOHOL
DRUG
INSUR
Explain why you selected the final regression that you did…why is it better? Were there
any trade-offs?
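When comparing the two regressions, it can help to recompute adjusted R2 from the values in the Appendix. The sketch below uses the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1) / (n - k - 1), with the R Square and Observations values reported in the Appendix and k = 4 independent variables. The function and variable names are assumptions made for this example.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R Square and Observations as reported in the Appendix; k = 4 predictors.
original = adjusted_r_squared(0.9818, n_obs=51, n_predictors=4)
final = adjusted_r_squared(0.9839, n_obs=51, n_predictors=4)

print(f"Original regression: adjusted R2 = {original:.4f}")
print(f"Final regression:    adjusted R2 = {final:.4f}")
print(f"Difference:          {final - original:.4f}")
```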
III. Conclusions
List and discuss your final model.
DEATH = -858.13 + 25.42*SMOKE – 5.92*ALCOHOL + 37.36*DRUG – 2.15*INSUR
What type of relationship did you establish between your primary independent variable
and the dependent variable (e.g., strong positive; strong negative, weak positive,
moderate positive, none, etc)? Based on your final regression, quantify the impact of a
change in your primary independent variable (e.g., “The SMOKE coefficient of 25.42
indicates that for every 1,000 additional smokers approximately 25 additional deaths
would occur.”) If your regression was not very good, what are some possible
explanations for the poor fit? If your regression is near perfect, why is that? If you were
to research this topic further, what would you change, etc.?
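To make the coefficient interpretation concrete, the sketch below evaluates the model equation listed in this section and shows that raising SMOKE by one unit (1,000 smokers) while holding the other variables fixed changes the prediction by the SMOKE coefficient. The coefficients are taken from the equation above; the function name `predict_death` and the input values are hypothetical and are not drawn from the study data.

```python
# Coefficients from the model equation listed in the Conclusions section.
COEFFICIENTS = {
    "INTERCEPT": -858.13,
    "SMOKE": 25.42,
    "ALCOHOL": -5.92,
    "DRUG": 37.36,
    "INSUR": -2.15,
}


def predict_death(smoke, alcohol, drug, insur, coef=COEFFICIENTS):
    """Evaluate the fitted linear equation for one observation.
    All four inputs are measured in thousands of people."""
    return (coef["INTERCEPT"]
            + coef["SMOKE"] * smoke
            + coef["ALCOHOL"] * alcohol
            + coef["DRUG"] * drug
            + coef["INSUR"] * insur)


# Hypothetical inputs for illustration only.
base = predict_death(smoke=1000, alcohol=900, drug=300, insur=600)
one_more = predict_death(smoke=1001, alcohol=900, drug=300, insur=600)

print(f"Predicted DEATH at the hypothetical inputs: {base:.2f}")
print(f"Change from 1,000 additional smokers:       {one_more - base:.2f}")
```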
References
Center for Disease Control and Prevention, United States Department of Health and Human
Services. (Last reviewed August 3, 2006). Healthy Youth! Health Topics: Tobacco Use.
Retrieved December 2, 2006 from http://www.cdc.gov/HealthyYouth/tobacco/index.htm
Center for Disease Control and Prevention, United States Department of Health and Human
Services. (1992, September 18). Morbidity and Mortality Weekly Report: Tobacco,
Alcohol, and Other Drug Use Among High School Students – United States, 1991.
Retrieved December 2, 2006 from
http://www.cdc.gov/mmwr/preview/mmwrhtml/00017652.htm
Center for Disease Control and Prevention, United States Department of Health and Human
Services. (2006, April 19). National Vital Statistics Report, Volume 54, Number 13, Table
29. Retrieved December 2, 2006 from
http://www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54_13.pdf
MIT Sloan Management News Room Press Releases. (2003, January 22). Uninsured auto crash
victims face 37% higher death rate, says MIT Sloan study. Retrieved December 2, 2006
from http://mitsloan.mit.edu/newsroom/2003-doyle.php
United States Census Bureau. (n.d.). The 2006 Statistical Abstract, Table 195: Estimated Use of
Selected Drugs by State: 2002-2003. Retrieved December 2, 2006 from
http://www.census.gov/compendia/statab/health_nutrition/health_risk_factors/
United States Census Bureau. (n.d.). The 2006 Statistical Abstract, Table 143: Persons With and
Without Health Insurance Coverage By State: 2003. Retrieved December 2, 2006 from
http://www.census.gov/compendia/statab/health_nutrition/health_insurance/
Appendix
1. Excel Summary output for original regression:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.9908
R Square              0.9818
Adjusted R Square     0.9802
Standard Error        5247.6630
Observations          51

[Figure: residual plot for NO_INS (vertical axis: Residuals)]

ANOVA
              df    SS             MS             F             Significance F
Regression     4    68271117105    17067779276    619.790823    2.30853E-39
Residual      46    1266746485     27537967.07
Total         50    69537863591

             Coefficients    Standard Error    t Stat     P-value    Lower 95%     Upper 95%
Intercept       -858.1337         1134.6000    -0.7563     0.4533    -3141.9628    1425.6954
NO_INS            -2.1520            1.8327    -1.1743     0.2463       -5.8410       1.5369
TOBACCO           25.4186            3.5370     7.1866     0.0000       18.2991      32.5382
DRUG              37.3578            9.0729     4.1175     0.0002       19.0950      55.6206
ALCOHOL           -5.9164            5.5638    -1.0634     0.2932      -17.1156       5.2829
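As a cross-check on the output above, the summary statistics can be recomputed from the ANOVA sums of squares: each mean square is SS / df, F is the ratio of the two mean squares, the standard error of the regression is the square root of the residual mean square, and R Square is the regression SS over the total SS. The sketch below applies those identities to the values in the ANOVA table; the function name `anova_summary` is an assumption for this example.

```python
from math import sqrt


def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Recompute R2, F, and the regression standard error from ANOVA values."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "r_squared": ss_regression / (ss_regression + ss_residual),
        "f_statistic": ms_regression / ms_residual,
        "standard_error": sqrt(ms_residual),
    }


# Sums of squares and degrees of freedom from the ANOVA table above.
summary = anova_summary(
    ss_regression=68271117105.0,
    ss_residual=1266746485.0,
    df_regression=4,
    df_residual=46,
)

for label, value in summary.items():
    print(f"{label:15s} {value:.4f}")
```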
2. Excel Summary output for final regression:
SUMMARY OUTPUT

Regression Statistics
Multiple R            0.9919
R Square              0.9839
Adjusted R Square     0.9825
Standard Error        0.1417
Observations          51

ANOVA
              df    SS             MS             F              Significance F
Regression     4    56.57493941    14.14373485    704.3991372    1.28213E-40
Residual      46    0.923640829    0.020079148
Total         50    57.49858024

             Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept          3.0154            0.1655    18.2249     0.0000       2.6824       3.3485
LTOBACCO           1.0707            0.1363     7.8548     0.0000       0.7963       1.3451
LDRUG             -0.1750            0.1248    -1.4018     0.1677      -0.4262       0.0763
LALCOHOL           0.1581            0.1398     1.1306     0.2641      -0.1234       0.4396
LNO_INS           -0.0298            0.0821    -0.3625     0.7186      -0.1951       0.1356