STAT 6120 Fall 2018 Data Analysis Project
Due by the end of Dec 10. Please email your report to tianxili@virginia.edu.
The National Health and Nutrition Examination Survey (NHANES) sometimes includes a physical
activity component in which subjects wear a device that records physical activity every
minute over seven days.
The dataset is quite large and can be found here: http://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/PAXRAW_D.ZIP
Some documentation of the data is here: http://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/PAXRAW_D.htm
We will focus only on the good-quality sample, which can be obtained by requiring
PAXSTAT==1 and PAXCAL==1. First, calculate the “activity intensity” number per hour per
subject.
You should merge the physical activity data summaries created above with basic NHANES
demographic data (https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/DEMO_D.htm) such as
age and gender, and body measurement data (https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/BMX_D.htm),
especially BMI. Then consider how physical activity patterns are related to
these factors. Such patterns might simply be the “activity intensity” number per hour, or
some specific metric of its variation over time. Find a reasonable predictive model
and explain it. You do not need to use the case weights in your analysis.
You can use external R packages (if you do use R) to help process the data. Some
possibly helpful packages include dplyr, data.table and foreign; a sketch of one possible
workflow follows.
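For example, a minimal sketch (assumptions: the .XPT files have been downloaded and unzipped locally; PAXINTEN, PAXDAY and PAXHOUR are the per-minute intensity, day and hour fields documented on the pages above; summing within the hour is one reasonable reading of the hourly "activity intensity" number):

library(foreign)  # read.xport() reads SAS transport (.XPT) files
library(dplyr)

pax  <- read.xport("paxraw_d.xpt")  # physical activity monitor data
demo <- read.xport("DEMO_D.XPT")    # demographics
bmx  <- read.xport("BMX_D.XPT")     # body measurements

# Keep the good quality sample, then one intensity number per subject-day-hour.
hourly <- pax %>%
  filter(PAXSTAT == 1, PAXCAL == 1) %>%
  group_by(SEQN, PAXDAY, PAXHOUR) %>%
  summarise(intensity = sum(PAXINTEN)) %>%
  ungroup()

# Merge with age, gender and BMI by the subject identifier SEQN.
merged <- hourly %>%
  inner_join(select(demo, SEQN, RIDAGEYR, RIAGENDR), by = "SEQN") %>%
  inner_join(select(bmx, SEQN, BMXBMI), by = "SEQN")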
Something to keep in mind:
• This will be an individual project - you are on your own for it. Of course, you can use
the internet. The honor code applies.
• Only the methods covered in the course are allowed and you are not supposed to
develop a new method.
• Your report should be at most four pages long (and concise reports are encouraged),
including all graphs and tables. You do not need to submit any code, but your report
should include clear descriptions of all the methods that you used.
Texts in Statistical Science

Linear Models with R

Julian J. Faraway

CHAPMAN & HALL/CRC
A CRC Press Company
Boca Raton  London  New York  Washington, D.C.
Library of Congress Cataloging-in-Publication Data

Faraway, Julian James.
Linear models with R / Julian J. Faraway.
p. cm. (Chapman & Hall/CRC texts in statistical science series; v. 63)
Includes bibliographical references and index.
ISBN 1-58488-425-8 (alk. paper)
1. Analysis of variance. 2. Regression analysis. I. Title. II. Texts in statistical science; v. 63.
QA279.F37 2004
519.5'38–dc22 2004051916

© 2005 by Chapman & Hall/CRC
Contents

Preface

1 Introduction
  1.1 Before You Start
  1.2 Initial Data Analysis
  1.3 When to Use Regression Analysis
  1.4 History

2 Estimation
  2.1 Linear Model
  2.2 Matrix Representation
  2.3 Estimating β
  2.4 Least Squares Estimation
  2.5 Examples of Calculating β̂
  2.6 Gauss-Markov Theorem
  2.7 Goodness of Fit
  2.8 Example
  2.9 Identifiability

3 Inference
  3.1 Hypothesis Tests to Compare Models
  3.2 Testing Examples
  3.3 Permutation Tests
  3.4 Confidence Intervals for β
  3.5 Confidence Intervals for Predictions
  3.6 Designed Experiments
  3.7 Observational Data
  3.8 Practical Difficulties

4 Diagnostics
  4.1 Checking Error Assumptions
  4.2 Finding Unusual Observations
  4.3 Checking the Structure of the Model

5 Problems with the Predictors
  5.1 Errors in the Predictors
  5.2 Changes of Scale
  5.3 Collinearity

6 Problems with the Error
  6.1 Generalized Least Squares
  6.2 Weighted Least Squares
  6.3 Testing for Lack of Fit
  6.4 Robust Regression

7 Transformation
  7.1 Transforming the Response
  7.2 Transforming the Predictors

8 Variable Selection
  8.1 Hierarchical Models
  8.2 Testing-Based Procedures
  8.3 Criterion-Based Procedures
  8.4 Summary

9 Shrinkage Methods
  9.1 Principal Components
  9.2 Partial Least Squares
  9.3 Ridge Regression

10 Statistical Strategy and Model Uncertainty
  10.1 Strategy
  10.2 An Experiment in Model Building
  10.3 Discussion

11 Insurance Redlining—A Complete Example
  11.1 Ecological Correlation
  11.2 Initial Data Analysis
  11.3 Initial Model and Diagnostics
  11.4 Transformation and Variable Selection
  11.5 Discussion

12 Missing Data

13 Analysis of Covariance
  13.1 A Two-Level Example
  13.2 Coding Qualitative Predictors
  13.3 A Multilevel Factor Example

14 One-Way Analysis of Variance
  14.1 The Model
  14.2 An Example
  14.3 Diagnostics
  14.4 Pairwise Comparisons

15 Factorial Designs
  15.1 Two-Way ANOVA
  15.2 Two-Way ANOVA with One Observation per Cell
  15.3 Two-Way ANOVA with More than One Observation per Cell
  15.4 Larger Factorial Experiments

16 Block Designs
  16.1 Randomized Block Design
  16.2 Latin Squares
  16.3 Balanced Incomplete Block Design

A R Installation, Functions and Data

B Quick Introduction to R
  B.1 Reading the Data In
  B.2 Numerical Summaries
  B.3 Graphical Summaries
  B.4 Selecting Subsets of the Data
  B.5 Learning More about R

Bibliography

Index
Preface
There are many books on regression and analysis of variance. These books expect
different levels of preparedness and place different emphases on the material. This book
is not introductory. It presumes some knowledge of basic statistical theory and practice.
Readers are expected to know the essentials of statistical inference such as estimation,
hypothesis testing and confidence intervals. A basic knowledge of data analysis is
presumed. Some linear algebra and calculus are also required.
The emphasis of this text is on the practice of regression and analysis of variance. The
objective is to learn what methods are available and more importantly, when they should
be applied. Many examples are presented to clarify the use of the techniques and to
demonstrate what conclusions can be made. There is relatively less emphasis on
mathematical theory, partly because some prior knowledge is assumed and partly because
the issues are better tackled elsewhere. Theory is important because it guides the
approach we take. I take a wider view of statistical theory. It is not just the formal
theorems. Qualitative statistical concepts are just as important in statistics because these
enable us to actually do it rather than just talk about it. These qualitative principles are
harder to learn because they are difficult to state precisely but they guide the successful
experienced statistician.
Data analysis cannot be learned without actually doing it. This means using a statistical
computing package. There is a wide choice of such packages. They are designed for
different audiences and have different strengths and weaknesses. I have chosen to use R
(Ihaka and Gentleman (1996); R Development Core Team (2003)). Why have I
used R? There are several reasons.
1. Versatility. R is also a programming language, so I am not limited by the procedures that
are preprogrammed by a package. It is relatively easy to program new methods in R.
2. Interactivity. Data analysis is inherently interactive. Some older statistical packages
were designed when computing was more expensive and batch processing of
computations was the norm. Despite improvements in hardware, the old batch
processing paradigm lives on in their use. R does one thing at a time, allowing us to
make changes on the basis of what we see during the analysis.
3. Freedom. R is based on S, from which the commercial package S-plus is derived. R
itself is open-source software and may be obtained free of charge to all. Linux, Macintosh,
Windows and other UNIX versions are maintained and can be obtained from the R-project
at www.r-project.org. R is mostly compatible with S-plus, meaning that S-plus could easily
be used for most of the examples provided in this book.
4. Popularity. SAS is the most common statistics package in general use but R or S is
most popular with researchers in statistics. A look at common statistical journals confirms
this popularity. R is also popular for quantitative applications in finance.
Getting Started with R
R requires some effort to learn. Such effort will be repaid with increased productivity.
You can learn how to obtain R in Appendix A along with instructions on the installation
of additional software and data used in this book.
This book is not an introduction to R. Appendix B provides a brief introduction to the
language, but alone is insufficient. I have intentionally included in the text all the
commands used to produce the output seen in this book. This means that you can
reproduce these analyses and experiment with changes and variations before fully
understanding R. You may choose to start working through this text before learning R
and pick it up as you go. Free introductory guides to R may be obtained from the R
project Web site at www.r-project.org. Introductory books have been written by Dalgaard
(2002) and Maindonald and Braun (2003). Venables and Ripley (2002) also have an
introduction to R along with more advanced material. Fox (2002) is intended as a
companion to a standard regression text. You may also find Becker, Chambers, and Wilks
(1998) and Chambers and Hastie (1991) to be useful references to the S language. Ripley
and Venables (2000) wrote a more advanced text on programming in S or R.
The Web site for this book is at www.stat.lsa.umich.edu/˜faraway/LMR where data
described in this book appear. Updates and errata will appear there also.
Thanks to the builders of R without whom this book would not have been possible.
CHAPTER 1
Introduction
1.1 Before You Start
Statistics starts with a problem, proceeds with the collection of data, continues with the
data analysis and finishes with conclusions. It is a common mistake of inexperienced
statisticians to plunge into a complex analysis without paying attention to what the
objectives are or even whether the data are appropriate for the proposed analysis. Look
before you leap!
The formulation of a problem is often more essential than its solution, which
may be merely a matter of mathematical or experimental skill. (Albert Einstein)
To formulate the problem correctly, you must:
1. Understand the physical background. Statisticians often work in collaboration with
others and need to understand something about the subject area. Regard this as an
opportunity to learn something new rather than a chore.
2. Understand the objective. Again, often you will be working with a collaborator who
may not be clear about what the objectives are. Beware of “fishing expeditions”—if
you look hard enough, you will almost always find something, but that something may
just be a coincidence.
3. Make sure you know what the client wants. You can often do quite different analyses
on the same dataset. Sometimes statisticians perform an analysis far more complicated
than the client really needed. You may find that simple descriptive statistics are all that
are needed.
4. Put the problem into statistical terms. This is a challenging step and where irreparable
errors are sometimes made. Once the problem is translated into the language of
statistics, the solution is often routine. Difficulties with this step explain why artificial
intelligence techniques have yet to make much impact in application to statistics.
Defining the problem is hard to program.
That a statistical method can read in and process the data is not enough. The results of an
inapt analysis may be meaningless.
It is also important to understand how the data were collected.
• Were the data obtained via a designed sample survey? How the data were collected has a
crucial impact on what conclusions can be made.
• Is there nonresponse? The data you do not see may be just as important as the data
you do see.
• Are there missing values? This is a common problem that is troublesome and time
consuming to handle.
• How are the data coded? In particular, how are the qualitative variables represented?
• What are the units of measurement?
• Beware of data entry errors and other corruption of the data. This problem is all too common —
almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks.
1.2 Initial Data Analysis
This is a critical step that should always be performed. It looks simple but it is vital. You
should make numerical summaries such as means, standard deviations (SDs), maximum
and minimum, correlations and whatever else is appropriate to the specific dataset.
Equally important are graphical summaries. There is a wide variety of techniques to
choose from. For one variable at a time, you can make boxplots, histograms, density plots
and more. For two variables, scatterplots are standard while for even more variables,
there are numerous good ideas for display including interactive and dynamic graphics. In
the plots, look for outliers, data-entry errors, skewed or unusual distributions
and structure. Check whether the data are distributed according to prior expectations.
Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is
often time consuming. It often takes more time than the data analysis itself. In this course,
all the data will be ready to analyze, but you should realize that in practice this is rarely the case.
Let’s look at an example. The National Institute of Diabetes and Digestive and Kidney
Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The
following variables were recorded: number of times pregnant, plasma glucose
concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure
(mmHg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass
index (weight in kg/(height in m2)), diabetes pedigree function, age (years) and a test
whether the patient showed signs of diabetes (coded zero if negative, one if positive). The
data may be obtained from UCI Repository of machine learning databases at
www.ics.uci.edu/˜mlearn/MLRepository.html.
Of course, before doing anything else, one should find out the purpose of the study and
more about how the data were collected. However, let’s skip ahead to a look at the data:
> library(faraway)
> data (pima)
> pima
    pregnant glucose diastolic triceps insulin  bmi diabetes age
1          6     148        72      35       0 33.6    0.627  50
2          1      85        66      29       0 26.6    0.351  31
3          8     183        64       0       0 23.3    0.672  32
…much deleted…
768        1      93        70      31       0 30.4    0.315  23
The library (faraway) command makes the data used in this book available. You need to
install this package first as explained in Appendix A. We have explicitly written this
command here, but in all subsequent chapters, we will assume that you have already
issued this command if you plan to use data mentioned in the text. If you get an error
message about data not being found, it may be that you have forgotten to type this.
The command data (pima) calls up this particular dataset. Simply typing the name of
the data frame, pima, prints out the data. It is too long to show it all here. For a dataset of
this size, one can just about visually skim over the data for anything out of place, but it is
certainly easier to use summary methods.
We start with some numerical summaries:
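> summary (pima)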
The summary ( ) command is a quick way to get the usual univariate summary
information. At this stage, we are looking for anything unusual or unexpected, perhaps
indicating a data-entry error. For this purpose, a close look at the minimum and
maximum values of each variable is worthwhile. Starting with pregnant, we see a maximum value of 17. This is large, but not impossible. However, we then see that the next
five variables have minimum values of zero. No blood pressure is not good for the
health—something must be wrong. Let’s look at the sorted values:
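> sort (pima$diastolic)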
We see that the first 35 values are zero. The description that comes with the data says
nothing about it but it seems likely that the zero has been used as a missing value code. For
one reason or another, the researchers did not obtain the blood pressures of 35 patients.
In a real investigation, one would likely be able to question the researchers about what
really happened. Nevertheless, this does illustrate the kind of misunderstanding that can
easily occur. A careless statistician might overlook these presumed missing values and
complete an analysis assuming that these were real observed zeros. If the error was later discovered,
they might then blame the researchers for using zero as a missing value code (not a good choice
since it is a valid value for some of the variables) and not mentioning it in their data description.
Unfortunately such oversights are not uncommon, particularly with datasets of any size
or complexity. The statistician bears some share of responsibility for spotting these mistakes.
We set all zero values of the five variables to NA which is the missing value code
used by R:
> pima$diastolic [pima$diastolic == 0] <- NA
> pima$glucose [pima$glucose == 0] <- NA
> pima$triceps [pima$triceps == 0] <- NA
> pima$insulin [pima$insulin == 0] <- NA
> pima$bmi [pima$bmi == 0] <- NA
The variable test is not quantitative but categorical. Such variables are also called factors.
However, because of the numerical coding, this variable has been treated as if it were
quantitative. It is best to designate such variables as factors so that they are treated
appropriately. Sometimes people forget this and compute stupid statistics such as the
“average zip code.”
> pima$test <- factor (pima$test)
> summary (pima$test)
  0   1
500 268
We now see that 500 cases were negative and 268 were positive. It is even better to use
descriptive labels:
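> levels (pima$test) <- c ("negative", "positive")   # one possible relabeling
> summary (pima$test)
negative positive
     500      268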
Now that we have cleared up the missing values and coded the data appropriately, we are
ready to do some plots. Perhaps the most well-known univariate plot is the histogram:
Figure 1.1 The first panel shows a histogram of the diastolic blood pressures, the second shows a kernel density estimate of the same, while the third shows an index plot of the sorted values.
> hist (pima$diastolic)
as seen in the first panel of Figure 1.1. We see a bell-shaped distribution for the diastolic
blood pressures centered around 70. The construction of a histogram requires the
specification of the number of bins and their spacing on the horizontal axis. Some choices
can lead to histograms that obscure some features of the data. R specifies the number and
spacing of bins given the size and distribution of the data, but this choice is not foolproof
and misleading histograms are possible. For this reason, some prefer to use kernel density
estimates, which are essentially a smoothed version of the histogram (see Simonoff
(1996) for a discussion of the relative merits of histograms and kernel estimates):
> plot (density (pima$diastolic, na.rm=TRUE))
The kernel estimate may be seen in the second panel of Figure 1.1. We see that this plot
avoids the distracting blockiness of the histogram. Another alternative is to simply plot
the sorted data against its index:
> plot (sort (pima$diastolic), pch=".")
The advantage of this is that we can see all the cases individually. We can see the
distribution and possible outliers. We can also see the discreteness in the measurement of
blood pressure—values are rounded to the nearest even number and hence we see the
“steps” in the plot.
Now note a couple of bivariate plots, as seen in Figure 1.2:
> plot (diabetes ˜ diastolic,pima)
> plot (diabetes ˜ test, pima)
Figure 1.2 The first panel shows a scatterplot of the diastolic blood pressures against the diabetes function and the second shows boxplots of diastolic blood pressure broken down by test result.
First, we see the standard scatterplot showing two quantitative variables. Second, we see
a side-by-side boxplot suitable for showing a quantitative and a qualitative variable.
Also useful is a scatterplot matrix, not shown here, produced by:
> pairs (pima)
We will be seeing more advanced plots later, but the numerical and
graphical summaries presented here are sufficient for a first look at the data.
1.3 When to Use Regression Analysis
Regression analysis is used for explaining or modeling the relationship between a single
variable Y, called the response, output or dependent variable; and one or more predictor,
input, independent or explanatory variables, X1,…, Xp. When p=1, it is called simple
regression but when p>1 it is called multiple regression or sometimes multivariate
regression. When there is more than one Y, then it is called multivariate multiple
regression, which we will not be covering explicitly here, although you can just do
separate regressions on each Y.
The response must be a continuous variable, but the explanatory variables can be
continuous, discrete or categorical, although we leave the handling of categorical
explanatory variables to later in the book. Taking the example presented above, a
regression with diastolic and bmi as X’s and diabetes as Y would be a multiple regression
involving only quantitative variables which we shall be tackling shortly. A regression with
diastolic and test as X’s and bmi as Y would have one predictor that is quantitative and one
that is qualitative, which we will consider later in Chapter 13 on analysis of covariance.
A regression with test as X and diastolic as Y involves just qualitative predictors—a topic
called analysis of variance (ANOVA), although this would just be a simple two-sample situation.
A regression of test as Y on diastolic and bmi as predictors would involve a qualitative
response. A logistic regression could be used, but this will not be covered in this book.
Regression analyses have several possible objectives including:
1. Prediction of future observations
2. Assessment of the effect of, or relationship between, explanatory variables and the
response
3. A general description of data structure
Extensions exist to handle multivariate responses, binary responses (logistic regression
analysis) and count responses (Poisson regression) among others.
1.4 History
Regression-type problems were first considered in the 18th century to aid navigation with
the use of astronomy. Legendre developed the method of least squares in 1805. Gauss
claimed to have developed the method a few years earlier and in 1809 showed that least
squares is the optimal solution when the errors are normally distributed. The
methodology was used almost exclusively in the physical sciences until later in the 19th
century. Francis Galton coined the term regression to mediocrity in 1875 in reference to
the simple regression equation in the form:

(y − ȳ)/SDy = r (x − x̄)/SDx
where r is the correlation between x and y. Galton used this equation to explain the
phenomenon that sons of tall fathers tend to be tall but not as tall as their fathers, while
sons of short fathers tend to be short but not as short as their fathers. This phenomenon is
called the regression effect. See Stigler (1986) for more of the history.
We can illustrate this effect with some data on scores from a course taught using this
book. In Figure 1.3, we see a plot of midterm against final scores. We scale each variable
to have mean zero and SD one so that we are not distracted by the relative difficulty of
each exam and the total number of points possible. Furthermore, this simplifies the
regression equation to:
y = rx
> data (stat500)
> stat500 <- data.frame (scale (stat500))
> plot (final ~ midterm, stat500)
> abline (0, 1)
Figure 1.3 The final and midterm scores in standard units. The least squares fit
is shown with a dotted line while y=x is shown with a solid line.
We have added the y=x (solid) line to the plot. Now a student scoring, say 1 SD above
average on the midterm might reasonably expect to do equally well on the final. We
compute the least squares regression fit and plot the regression line (more on the details
later). We also compute the correlations:
> g <- lm (final ~ midterm, stat500)
> abline (coef (g), lty=5)
> cor (stat500)
        midterm    final       hw   total
midterm 1.00000 0.545228 0.272058 0.84446
final   0.54523 1.000000 0.087338 0.77886
hw      0.27206 0.087338 1.000000 0.56443
total   0.84446 0.778863 0.564429 1.00000
The regression fit is the dotted line in Figure 1.3 and is always shallower than the y=x
line. We see that a student scoring 1 SD above average on the midterm is predicted to
score only 0.545 SDs above average on the final.
Correspondingly, a student scoring below average on the midterm might expect to do
relatively better in the final, although still below average.
If exams managed to measure the ability of students perfectly, then provided that
ability remained unchanged from midterm to final, we would expect to see an exact
correlation. Of course, it is too much to expect such a perfect exam and some variation is
inevitably present. Furthermore, individual effort is not constant. Getting a high score on
the midterm can partly be attributed to skill, but also a certain amount of luck. One
cannot rely on this luck to be maintained in the final. Hence we see the “regression to
mediocrity.”
Of course this applies to any (x, y) situation like this—an example is the so-called
sophomore jinx in sports when a new star has a so-so second season after a great first
year. Although in the father-son example, it does predict that successive descendants will
come closer to the mean; it does not imply the same of the population in general since
random fluctuations will maintain the variation. In many other applications of regression,
the regression effect is not of interest, so it is unfortunate that we are now stuck with this
rather misleading name.
Regression methodology developed rapidly with the advent of high-speed computing.
Just fitting a regression model used to require extensive hand calculation. As computing
hardware has improved, the scope for analysis has widened.
Exercises
1. The dataset teengamb concerns a study of teenage gambling in Britain. Make a
numerical and graphical summary of the data, commenting on any features that you
find interesting. Limit the output you present to a quantity that a busy reader would
find sufficient to get a basic understanding of the data.
2. The dataset uswages is drawn as a sample from the Current Population Survey in
1988. Make a numerical and graphical summary of the data as in the previous
question.
3. The dataset prostate is from a study on 97 men with prostate cancer who were due to receive
a radical prostatectomy. Make a numerical and graphical summary of the data as in the first
question.
4. The dataset sat comes from a study entitled “Getting What You Pay For: The Debate
Over Equity in Public School Expenditures.” Make a numerical and graphical
summary of the data as in the first question.
5. The dataset divusa contains data on divorces in the United States from 1920 to 1996.
Make a numerical and graphical summary of the data as in the first question.
CHAPTER 2
Estimation
2.1 Linear Model
Suppose we want to model the response Y in terms of three predictors, X1, X2 and X3. One
very general form for the model would be:

Y = f(X1, X2, X3) + ε

where f is some unknown function and ε is the error in this representation. ε is additive in
this instance, but could enter in some more general form. Still, if we assume that f is a
smooth, continuous function, that still leaves a very wide range of possibilities. Even with
just three predictors, we typically will not have enough data to try to estimate f directly.
So we usually have to assume that it has some more restricted form, perhaps linear as in:

Y = β0 + β1X1 + β2X2 + β3X3 + ε

where βi, i=0, 1, 2, 3 are unknown parameters. β0 is called the intercept term. Thus the
problem is reduced to the estimation of four parameters rather than the infinite
dimensional f. In a linear model the parameters enter linearly—the predictors themselves
do not have to be linear. For example:

Y = β0 + β1X1 + β2 log X2 + ε

is a linear model, but:

Y = β0 + β1X1^β2 + ε

is not. Some relationships can be transformed to linearity—for example, y = β0 x^β1 ε can
be linearized by taking logs. Linear models seem rather restrictive, but because the
predictors can be transformed and combined in any way, they are actually very flexible.
The term linear is often used in everyday speech as almost a synonym for simplicity. This
gives the casual observer the impression that linear models can only handle small simple
datasets. This is far from the truth—linear models can easily be expanded and modified
to handle complex datasets. Linear is also used to refer to straight lines, but linear models
can be curved. Truly nonlinear models are rarely absolutely necessary and most often
arise from a theory about the relationships between the variables, rather than an empirical
investigation.
2.2 Matrix Representation
If we have a response Y and three predictors, X1, X2 and X3, the data might be presented
in tabular form like this:

y1  x11  x21  x31
y2  x12  x22  x32
…   …    …    …
yn  x1n  x2n  x3n

where n is the number of observations, or cases, in the dataset.
Given the actual data values, we may write the model as:

yi = β0 + β1x1i + β2x2i + β3x3i + εi,   i=1,…, n

but the use of subscripts becomes inconvenient and conceptually obscure. We will find it
simpler both notationally and theoretically to use a matrix/vector representation. The
regression equation is written as:

y = Xβ + ε

where y = (y1,…, yn)ᵀ, ε = (ε1,…, εn)ᵀ, β = (β0,…, β3)ᵀ and:

X = [ 1  x11  x21  x31 ]
    [ 1  x12  x22  x32 ]
    [ …   …    …    …  ]
    [ 1  x1n  x2n  x3n ]

The column of ones incorporates the intercept term. One simple example is the null
model where there is no predictor and just a mean, y = μ + ε:

[ y1 ]   [ 1 ]     [ ε1 ]
[ …  ] = [ … ] μ + [ …  ]
[ yn ]   [ 1 ]     [ εn ]

We can assume that Eε = 0 since if this were not so, we could simply absorb the nonzero
expectation for the error into the mean μ to get a zero expectation.
2.3 Estimating β

The regression model, y = Xβ + ε, partitions the response into a systematic component Xβ
and a random component ε. We would like to choose β so that the systematic part explains
as much of the response as possible. Geometrically speaking, the response lies in an
n-dimensional space, that is, y ∈ ℝⁿ, while β ∈ ℝᵖ where p is the number of parameters.
If we include the intercept then p is the number of predictors plus one. We will use this
definition of p from now on. It is easy to get confused as to whether p is the number of
predictors or parameters, as different authors use different conventions, so be careful.

The problem is to find β so that Xβ is as close to Y as possible. The best choice, the
estimate β̂, is apparent in the geometrical representation seen in Figure 2.1. β̂ is, in this
sense, the best estimate of β within the model space. The response predicted by the model
is ŷ = Xβ̂ or Hy, where H is an orthogonal projection matrix.

Figure 2.1 Geometrical representation of the estimation of β. The data vector y is
projected orthogonally onto the model space spanned by X. The fit is represented by the
projection ŷ = Xβ̂, with the difference between the fit and the data represented by the
residual vector ε̂.

The difference between the actual response and the predicted response is denoted by ε̂ and
is called the residuals.

The conceptual purpose of the model is to represent, as accurately as possible,
something complex, y, which is n-dimensional, in terms of something much simpler, the
model, which is p-dimensional. Thus if our model is successful, the structure in the data
should be captured in those p dimensions, leaving just random variation in the residuals
which lie in an (n − p)-dimensional space. We have:

Data = Systematic Structure + Random Variation
n dimensions = p dimensions + (n − p) dimensions
2.4 Least Squares Estimation

The estimation of β can also be considered from a nongeometrical point of view. We
might define the best estimate of β as the one which minimizes the sum of the squared
errors. The least squares estimate of β, called β̂, minimizes:

Σ εi² = εᵀε = (y − Xβ)ᵀ(y − Xβ)

Differentiating with respect to β and setting to zero, we find that β̂ satisfies:

XᵀXβ̂ = Xᵀy

These are called the normal equations. We can derive the same result using the
geometrical approach. Now provided XᵀX is invertible:

β̂ = (XᵀX)⁻¹Xᵀy
Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy

H = X(XᵀX)⁻¹Xᵀ is called the hat-matrix and is the orthogonal projection of y onto the
space spanned by X. H is useful for theoretical manipulations, but you usually do not
want to compute it explicitly, as it is an n × n matrix which could be uncomfortably large
for some datasets. The following useful quantities can now be represented using H.
The predicted or fitted values are ŷ = Hy = Xβ̂, while the residuals are
ε̂ = y − Xβ̂ = y − ŷ = (I − H)y. The residual sum of squares (RSS) is
ε̂ᵀε̂ = yᵀ(I − H)ᵀ(I − H)y = yᵀ(I − H)y.

Later, we will show that the least squares estimate is the best possible estimate of β
when the errors ε are uncorrelated and have equal variance, or more briefly put, var ε = σ²I.
β̂ is unbiased and has variance (XᵀX)⁻¹σ² provided var ε = σ²I. Since β̂ is a vector, its
variance is a matrix.

We also need to estimate σ². We find that E ε̂ᵀε̂ = σ²(n − p), which suggests the
estimator:

σ̂² = ε̂ᵀε̂ / (n − p) = RSS / (n − p)

as an unbiased estimate of σ². n − p is the degrees of freedom of the model. Sometimes
you need the standard error for a particular component of β̂, which can be picked out as:

se(β̂i) = √((XᵀX)ᵢᵢ⁻¹) σ̂
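As a quick numerical check on simulated data (an illustrative aside), H behaves as an orthogonal projection:

> set.seed (1)
> X <- cbind (1, runif (10), runif (10))   # small design matrix with intercept
> y <- X %*% c (1, 2, 3) + rnorm (10)
> H <- X %*% solve (t (X) %*% X) %*% t (X)
> all.equal (H %*% H, H)                   # idempotent: projecting twice changes nothing
> all.equal (H, t (H))                     # symmetric
> crossprod (H %*% y, y - H %*% y)         # fitted values orthogonal to residuals (~ 0)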
2.5 Examples of Calculating β̂

In a few simple models, it is possible to derive explicit formulae for β̂:

1. When y = μ + ε, X = 1 and β = μ so XᵀX = 1ᵀ1 = n, so:

β̂ = (XᵀX)⁻¹Xᵀy = (1/n) 1ᵀy = ȳ

2. Simple linear regression (one predictor):

yi = β0 + β1xi + εi

We can now apply the formula but a simpler approach is to rewrite the equation as:

yi = β0′ + β1(xi − x̄) + εi   where β0′ = β0 + β1x̄

so now:

XᵀX = [ n        0        ]
      [ 0   Σ(xi − x̄)²    ]

Next work through the rest of the calculation to reconstruct the familiar estimates, that is:

β̂1 = Σ(xi − x̄)yi / Σ(xi − x̄)²,   β̂0′ = ȳ
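As an illustrative check on simulated data (an aside), these formulas agree with lm ( ):

> x <- runif (25)
> y <- 2 + 3*x + rnorm (25)
> b1 <- sum ((x - mean (x)) * y) / sum ((x - mean (x))^2)
> b0 <- mean (y) - b1 * mean (x)
> c (b0, b1)           # direct calculation
> coef (lm (y ~ x))    # matches the formula above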
In higher dimensions, it is usually not possible to find such explicit formulae for the
parameter estimates unless XᵀX happens to be a simple form. So typically we need
computers to fit such models. Regression has a long history, so in the time before
computers became readily available, fitting even quite simple models was a tedious,
time-consuming task. When computing was expensive, data analysis was limited. It was
designed to keep calculations to a minimum and restrict the number of plots. This
mindset remained in statistical practice for some time even after computing became
widely and cheaply available. Now it is a simple matter to fit a multitude of models and
make more plots than one could reasonably study. The challenge for the analyst is to
choose among these intelligently to extract the crucial information in the data.
2.6 Gauss-Markov Theorem
β̂ is a plausible estimator, but there are alternatives. Nonetheless, there are three
good reasons to use least squares:

1. It results from an orthogonal projection onto the model space. It makes sense
geometrically.
2. If the errors are independent and identically normally distributed, it is the maximum
likelihood estimator. Loosely put, the maximum likelihood estimate is the value of β
that maximizes the probability of the data that was observed.
3. The Gauss-Markov theorem states that β̂ is the best linear unbiased estimate (BLUE).

To understand the Gauss-Markov theorem we first need to understand the concept of an
estimable function. A linear combination of the parameters ψ = cᵀβ is estimable if and
only if there exists a linear combination aᵀy such that:

E aᵀy = cᵀβ   for all β

Estimable functions include predictions of future observations, which explains why they
are well worth considering. If X is of full rank, then all linear combinations are estimable.

Suppose Eε = 0 and var ε = σ²I. Suppose also that the structural part of the model,
EY = Xβ, is correct. (Clearly these are big assumptions and so we will address the
implications of this later.) Let ψ = cᵀβ be an estimable function; then the Gauss-Markov
theorem states that in the class of all unbiased linear estimates of ψ, ψ̂ = cᵀβ̂ has the
minimum variance and is unique.

We prove this theorem. Suppose aᵀy is some unbiased estimate of cᵀβ so that:

E aᵀy = cᵀβ   for all β
aᵀXβ = cᵀβ   for all β

which means that aᵀX = cᵀ. This implies that c must be in the range space of Xᵀ which in
turn implies that c is also in the range space of XᵀX which means there exists a λ such
that c = XᵀXλ, so:

cᵀβ̂ = λᵀXᵀXβ̂ = λᵀXᵀy

Now we can show that the least squares estimator has the minimum variance—pick an
arbitrary estimate aᵀy and compute its variance:

var(aᵀy) = var(aᵀy − λᵀXᵀy + λᵀXᵀy)
         = var(aᵀy − λᵀXᵀy) + var(λᵀXᵀy) + 2 cov(aᵀy − λᵀXᵀy, λᵀXᵀy)

but

cov(aᵀy − λᵀXᵀy, λᵀXᵀy) = (aᵀ − λᵀXᵀ) σ²I Xλ = (aᵀX − λᵀXᵀX) σ²λ = (cᵀ − cᵀ) σ²λ = 0

so

var(aᵀy) = var(aᵀy − λᵀXᵀy) + var(λᵀXᵀy)

Now since variances cannot be negative, we see that:

var(aᵀy) ≥ var(λᵀXᵀy) = var(cᵀβ̂)

In other words, cᵀβ̂ has minimum variance. It now remains to show that it is unique.
There will be equality in the above relationship if var(aᵀy − λᵀXᵀy) = 0, which would
require that aᵀ − λᵀXᵀ = 0, which means that aᵀy = λᵀXᵀy = cᵀβ̂. So equality occurs only
if aᵀy = cᵀβ̂, so the estimator is unique. This completes the proof.
The Gauss-Markov theorem shows that the least squares estimate β̂ is a good choice,
but it does require that the errors are uncorrelated and have equal variance. Even if the
errors behave, but are nonnormal, then nonlinear or biased estimates may work better. So
this theorem does not tell one to use least squares all the time; it just strongly suggests it
unless there is some strong reason to do otherwise. Situations where estimators other than
ordinary least squares should be considered are:
1. When the errors are correlated or have unequal variance, generalized least squares
should be used. See Section 6.1.
2. When the error distribution is long-tailed, then robust estimates might be used. Robust
estimates are typically not linear in y. See Section 6.4.
3. When the predictors are highly correlated (collinear), then biased estimators such as
ridge regression might be preferable. See Chapter 9.
2.7 Goodness of Fit
It is useful to have some measure of how well the model fits the data. One common
choice is R², the so-called coefficient of determination or percentage of variance explained:

R² = 1 − Σ(ŷi − yi)² / Σ(yi − ȳ)² = 1 − RSS / (Total SS, corrected for the mean)

Its range is 0 ≤ R² ≤ 1—values closer to 1 indicating better fits. For simple linear regression
R² = r² where r is the correlation between x and y. An equivalent definition is the squared
correlation between the fitted values and the response:

R² = cor(ŷ, y)²

Figure 2.2 Variation in the response y when x is known is denoted by
dotted arrows while variation in y when x is unknown is
shown with the solid arrows.

The graphical intuition behind R² is seen in Figure 2.2. Suppose you want to predict y. If
you do not know x, then your best prediction is ȳ, but the variability in this prediction is
high. If you do know x, then your prediction will be given by the regression fit. This
prediction will be less variable provided there is some relationship between x and y. R² is
one minus the ratio of the sum of squares for these two predictions. Thus for perfect
predictions the ratio will be zero and R² will be one.
R² as defined here does not make any sense if you do not have an intercept in your
model. This is because the denominator in the definition of R² has a null model with an
intercept in mind when the sum of squares is calculated. Alternative definitions of R² are
possible when there is no intercept, but the same graphical intuition is not available and
the R²s obtained in this way should not be compared to those for models with an
intercept. Beware of high R²s reported from models without an intercept.
What is a good value of R²? It depends on the area of application. In the biological and
social sciences, variables tend to be more weakly correlated and there is a lot of noise.
We would expect lower values for R² in these areas—a value of 0.6 might be considered
good. In physics and engineering, where most data come from closely controlled
experiments, we expect to get much higher R²s and a value of 0.6 would be considered
low. Of course, I generalize excessively here so some experience with the particular area
is necessary for you to judge your R²s well.
An alternative measure of fit is σ̂. This quantity is directly related to the standard errors
of estimates of β and predictions. The advantage is that σ̂ is measured in the units of
the response and so may be directly interpreted in the context of the particular dataset.
This may also be a disadvantage in that one must understand the practical
significance of this measure, whereas R², being unitless, is easy to understand.
2.8 Example
Now let’s look at an example concerning the number of species of tortoise on the various
Galápagos Islands. There are 30 cases (Islands) and seven variables in the dataset. We
start by reading the data into R and examining it (remember you first need to load the
book data with the library (faraway) command):
> data (gala)
> gala
          Species Endemics  Area Elevation Nearest Scruz
Baltra         58       23 25.09       346     0.6   0.6
Bartolome      31       21  1.24       109     0.6  26.3
. . .
The variables are Species—the number of species of tortoise found on the island,
Endemics—the number of endemic species, Area—the area of the island (km2),
Elevation—the highest elevation of the island (m), Nearest—the distance from the
nearest island (km), Scruz—the distance from Santa Cruz Island (km), Adjacent—the
area of the adjacent island (km2).
The data were presented by Johnson and Raven (1973) and also appear in Weisberg (1985).
I have filled in some missing values for simplicity (see Chapter 12 for how this can be
done). Fitting a linear model in R is done using the lm ( ) command. Notice the syntax for
specifying the predictors in the model. This is part of the Wilkinson-Rogers notation. In
this case, since all the variables are in the gala data frame, we must use the data= argument:

> mdl <- lm (Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data=gala)
> summary (mdl)
Call:
lm (formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

Residuals:
    Min      1Q  Median      3Q     Max
-111.68  -34.90   -7.86   33.46  182.58

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.06822   19.15420    0.37   0.7154
Area        -0.02394    0.02242   -1.07   0.2963
Elevation    0.31946    0.05366    5.95  3.8e-06
Nearest      0.00914    1.05414    0.01   0.9932
Scruz       -0.24052    0.21540   -1.12   0.2752
Adjacent    -0.07480    0.01770   -4.23   0.0003

Residual standard error: 61 on 24 degrees of freedom
Multiple R-Squared: 0.766, Adjusted R-squared: 0.717
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.84e-07
We can identify several useful quantities in this output. Other statistical packages tend to
produce output quite similar to this. One useful feature of R is that it is possible to directly calculate quantities of interest. Of course, it is not necessary here because the lm ( ) function does the
job, but it is very useful when the statistic you want is not part of the prepackaged functions.
First, we make the X-matrix:
> x <- model.matrix ( ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)

and here is the response y:

> y <- gala$Species

Now let's construct (XᵀX)⁻¹. t ( ) does transpose and %*% does matrix multiplication.
solve (A) computes A⁻¹ while solve (A, b) solves Ax = b:

> xtxi <- solve (t (x) %*% x)
We can get β̂ directly, using (XᵀX)⁻¹Xᵀy:

> xtxi %*% t (x) %*% y
                 [,1]
1            7.068221
Area        -0.023938
Elevation    0.319465
Nearest      0.009144
Scruz       -0.240524
Adjacent    -0.074805
This is a very bad way to compute β̂. It is inefficient and can be very inaccurate when the
predictors are strongly correlated. Such problems are exacerbated by large datasets. A
better, but not perfect, way is:

> solve (crossprod (x, x), crossprod (x, y))
                 [,1]
1            7.068221
Area        -0.023938
Elevation    0.319465
Nearest      0.009144
Scruz       -0.240524
Adjacent    -0.074805
where crossprod (x, y) computes xᵀy. Here we get the same result as lm ( ) because the
data are well-behaved. In the long run, you are advised to use carefully programmed code
such as that found in lm ( ). To see the full details, consult a text such as Thisted (1988).
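As an aside, lm ( ) itself works via a QR decomposition of X rather than forming XᵀX, which we can mimic with base R:

> qr.coef (qr (x), y)   # QR-based solution, numerically more stable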
We can extract the regression quantities we need from the model object. Commonly
used are residuals ( ), fitted ( ), deviance ( ) which gives the RSS, df.residual ( ) which
gives the degrees of freedom and coef ( ) which gives the β̂. You can also extract other
needed quantities by examining the model object and its summary:
> names (mdl)
 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "xlevels"
[10] "call"          "terms"         "model"
> mdls <- summary (mdl)
> names (mdls)
 [1] "call"          "terms"         "residuals"
 [4] "coefficients"  "aliased"       "sigma"
 [7] "df"            "r.squared"     "adj.r.squared"
[10] "fstatistic"    "cov.unscaled"
We can estimate σ using the formula in the text above or extract it from the summary
object:

> sqrt (deviance (mdl) / df.residual (mdl))
[1] 60.975
> mdls$sigma
[1] 60.975
We can also extract (XᵀX)⁻¹ and use it to compute the standard errors for the coefficients.
(diag ( ) returns the diagonal of a matrix):

> xtxi <- mdls$cov.unscaled
> sqrt (diag (xtxi)) * 60.975
(Intercept)        Area   Elevation     Nearest       Scruz    Adjacent
  19.154139    0.022422    0.053663    1.054133    0.215402    0.017700
or get them from the summary object:

> mdls$coef [,2]
(Intercept)        Area   Elevation     Nearest       Scruz    Adjacent
  19.154198    0.022422    0.053663    1.054136    0.215402    0.017700
Finally, we may compute or extract R²:

> 1 - deviance (mdl) / sum ((y - mean (y))^2)
[1] 0.76585
> mdls$r.squared
[1] 0.76585
2.9 Identifiability
The least squares estimate β̂ is the solution to the normal equations:

XᵀXβ̂ = Xᵀy

where X is an n × p matrix. If XᵀX is singular and cannot be inverted, then there will be
infinitely many solutions to the normal equations and β̂ is at least partially unidentifiable.
Unidentifiability will occur when X is not of full rank—when its columns are linearly
dependent. With observational data, unidentifiability is usually caused by some oversight.
Here are some examples:
1. A person’s weight is measured both in pounds and kilos and both variables are
entered into the model.
2. For each individual we record the number of years of preuniversity education, the
number of years of university education and also the total number of years of
education and put all three variables into the model.
3. We have more variables than cases, that is, p>n. When p=n, we may perhaps estimate
all the parameters, but with no degrees of freedom left to estimate any standard errors
or do any testing. Such a model is called saturated. When p>n, then the model is
called supersaturated. Oddly enough, such models are considered in large-scale
screening experiments used in product design and manufacture, but there is no hope of
uniquely estimating all the parameters in such a model.
Such problems can be avoided by paying attention. Identifiability is more of an issue in
designed experiments. Consider a simple two-sample experiment, where the treatment
observations are y1,…, yn and the controls are yn+1,…, ym+n. Suppose we try to model the
response by an overall mean μ and group effects α1 and α2:

yj = μ + αi + εj   (i = 1 for the treatment observations, i = 2 for the controls)

Now although X has three columns, it has only rank 2—(μ, α1, α2) are not identifiable
and the normal equations have infinitely many solutions. We can solve this problem by
imposing some constraints, μ = 0 or α1 + α2 = 0, for example.
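We can see the rank deficiency directly in a toy version of this design, here with two observations per group (an illustrative check):

> X <- cbind (1, c (1, 1, 0, 0), c (0, 0, 1, 1))   # columns for μ, α1, α2
> qr (X)$rank                                      # only 2, although X has 3 columns
[1] 2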
Statistics packages handle nonidentifiability differently. In the regression case above,
some may return error messages and some may fit models because rounding error may
remove the exact identifiability. In other cases, constraints may be applied but these may
be different from what you expect. By default, R fits the largest identifiable model by
removing variables in the reverse order of appearance in the model formula.
Here is an example. Suppose we create a new variable for the Galápagos dataset—the
difference in area between the island and its nearest neighbor:
> gala$Adiff <- gala$Area - gala$Adjacent
and add that to the model:
> g <- lm (Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiff, gala)
> summary (g)
Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.06822   19.15420    0.37   0.7154
Area        -0.02394    0.02242   -1.07   0.2963
Elevation    0.31946    0.05366    5.95  3.8e-06
Nearest      0.00914    1.05414    0.01   0.9932
Scruz       -0.24052    0.21540   -1.12   0.2752
Adjacent    -0.07480    0.01770   -4.23   0.0003
Adiff             NA         NA      NA       NA

Residual standard error: 61 on 24 degrees of freedom
Multiple R-Squared: 0.766, Adjusted R-squared: 0.717
F-statistic: 15.7 on 5 and 24 DF, p-value: 6.84e-07
We get a message about one undefined coefficient because the rank of the design matrix X is
six, which is less than its seven columns. In most cases, the cause of unidentifiability can be
revealed with some thought about the variables, but, failing that, an eigendecomposition of
XᵀX will reveal the linear combination(s) that gave rise to the unidentifiability.
Lack of identifiability is obviously a problem, but it is usually easy to identify and work
around. More problematic are cases where we are close to unidentifiability. To demonstrate this,
suppose we add a small random perturbation to the third decimal place of Adiff by adding
a random variate from U[−0.005, 0.005] where U denotes the uniform distribution:

> Adiffe <- gala$Adiff + 0.001*(runif (30) - 0.5)
and now refit the model:
> g <- lm (Species ~ Area + Elevation + Nearest + Scruz + Adjacent + Adiffe, gala)
> summary (g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.14e+00   1.95e+01    0.37     0.72
Area        -2.38e+04   4.70e+04   -0.51     0.62
Elevation    3.12e-01   5.67e-02    5.50  1.4e-05
Nearest      1.38e-01   1.10e+00    0.13     0.90
Scruz       -2.50e-01   2.20e-01   -1.14     0.27
Adjacent     2.38e+04   4.70e+04    0.51     0.62
Adiffe       2.38e+04   4.70e+04    0.51     0.62

Residual standard error: 61.9 on 23 degrees of freedom
Multiple R-Squared: 0.768, Adjusted R-squared: 0.708
F-statistic: 12.7 on 6 and 23 DF, p-value: 2.58e-06
Notice that now all parameters are estimated, but the standard errors are very large
because we cannot estimate them in a stable way. We deliberately caused this problem so
we know the cause but in general we need to be able to identify such situations. We do
this in Section 5.3.
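One quick numerical warning sign (a sketch only; the fuller treatment is deferred to Section 5.3) is the condition number of the model matrix, which becomes enormous for nearly collinear designs:
> x <- model.matrix(g)
> kappa(x)
For the perturbed fit above, kappa() returns a huge value, flagging the near-unidentifiability before the standard errors have to tell the story.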
Exercises
1. The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression
model with the expenditure on gambling as the response and the sex, status, income
and verbal score as predictors. Present the output.
(a) What percentage of variation in the response is explained by these predictors?
(b) Which observation has the largest (positive) residual? Give the case number.
(c) Compute the mean and median of the residuals.
(d) Compute the correlation of the residuals with the fitted values.
(e) Compute the correlation of the residuals with the income.
(f) For all other predictors held constant, what would be the difference in predicted
expenditure on gambling for a male compared to a female?
2. The dataset uswages is drawn as a sample from the Current Population Survey in
1988. Fit a model with weekly wages as the response and years of education and
experience as predictors. Report and give a simple interpretation to the regression
coefficient for years of education. Now fit the same model but with logged weekly
wages. Give an interpretation to the regression coefficient for years of education.
Which interpretation is more natural?
3. In this question, we investigate the relative merits of methods for computing the
coefficients. Generate some artificial data by:
> x <- 1:20
> y <- x + rnorm(20)
Fit a polynomial in x for predicting y. Compute the coefficients in two ways—by lm() and by the direct calculation described in the chapter. At what degree of polynomial does the direct calculation method fail? (Note the need for the I() function in fitting the polynomial, that is, lm(y ~ x + I(x^2)).)
4. The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the R². Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model one at a time. For each model record the residual standard error and the R². Plot the trends in these two statistics.
5. Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa on lcavol and lcavol on lpsa. Display both regression lines on the plot. At what point do the two lines intersect?
CHAPTER 3
Inference
Until now, we have not found it necessary to assume any distributional form for the errors ε. However, if we want to construct any confidence intervals or perform any hypothesis tests, we will need to do so. The common assumption is that the errors are normally distributed, and in practice this is often, although not always, reasonable. We have already assumed that the errors are independent and identically distributed (i.i.d.) with mean 0 and variance σ², so we now have ε ~ N(0, σ²I). Since y = Xβ + ε, it follows that y ~ N(Xβ, σ²I), which is a compact description of the regression model. From this we find, using the fact that linear combinations of normally distributed values are also normal, that:

β̂ = (XᵀX)⁻¹Xᵀy ~ N(β, (XᵀX)⁻¹σ²)
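As an illustration (our simulation sketch with an arbitrary small design, not from the text), we can check this sampling distribution by Monte Carlo:
> set.seed(1)
> n <- 30; sigma <- 2
> X <- cbind(1, runif(n), runif(n))
> beta <- c(1, 2, -1)
> bhat <- t(replicate(5000, {
+   y <- X %*% beta + rnorm(n, sd=sigma)
+   as.vector(qr.solve(X, y))
+ }))
> round(cov(bhat), 3)                  # empirical covariance of the estimates
> round(solve(t(X) %*% X)*sigma^2, 3)  # theoretical (X^T X)^{-1} sigma^2
The two matrices should agree closely.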
3.1 Hypothesis Tests to Compare Models
Given several predictors for a response, we might wonder whether all are needed.
Consider a larger model, Ω, and a smaller model, ω, which consists of a subset of the predictors that are in Ω. If there is not much difference in the fit, we would prefer the smaller model on the principle that simpler explanations are preferred. On the other hand, if the fit of the larger model is appreciably better, we will prefer it. We will take ω to represent the null hypothesis and Ω to represent the alternative. A geometrical view of the problem may be seen in Figure 3.1.
RSS"&RSS# is small, then the fit of the smaller model is almost as good as the larger
model and so we would prefer the smaller model on the grounds of simplicity. On the
other hand, if the difference is large, then the superior fit of the larger model would be
preferred. This suggests that something like:
would be a potentially good test statistic where the denominator is used for scaling
purposes.
As it happens, the same test statistic arises from the likelihood-ratio testing approach. We give an outline of the development: if L(β, σ | y) is the likelihood function, then the likelihood-ratio statistic is:

max_{β,σ ∈ Ω} L(β, σ | y) / max_{β,σ ∈ ω} L(β, σ | y)
The test should reject if this ratio is too large. Working through the details, we find that for each model the maximized likelihood satisfies L(β̂, σ̂ | y) ∝ RSS^(−n/2),
Figure 3.1 Geometrical view of the comparison between big model, Ω, and small model, ω. The squared length of the residual vector for the big model is RSS_Ω while that for the small model is RSS_ω. By Pythagoras' theorem, the squared length of the vector connecting the two fits is RSS_ω − RSS_Ω. A small value for this indicates that the small model fits almost as well as the large model and thus might be preferred due to its simplicity.
which after some manipulation gives us a test that rejects if:

RSS_ω/RSS_Ω > a constant

which is the same statistic suggested by the geometrical view. It remains for us to discover the null distribution of this statistic.
Now suppose that the dimension (or number of parameters) of Ω is p and the dimension of ω is q, then:

F = [(RSS_ω − RSS_Ω)/(p − q)] / [RSS_Ω/(n − p)] ~ F_{p−q, n−p}

Details of the derivation of this statistic may be found in more theoretically oriented texts such as Sen and Srivastava (1990).
Thus we would reject the null hypothesis if F > F_{p−q, n−p}^{(α)}. The degrees of freedom of a model are (usually) the number of observations minus the number of parameters, so this test statistic can also be written:

F = [(RSS_ω − RSS_Ω)/(df_ω − df_Ω)] / [RSS_Ω/df_Ω]
where df_Ω = n − p and df_ω = n − q. The same test statistic applies not just when ω is a subset of Ω, but also to a subspace. This test is very widely used in regression and analysis of variance. When it is applied in different situations, the form of the test statistic may be reexpressed in various ways. The beauty of this approach is that you only need to know the general form: in any particular case, you just need to figure out which models represent the null and alternative hypotheses, fit them and compute the test statistic. It is very versatile.
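A small helper function (our sketch, not the book's code) makes the recipe explicit; deviance() returns the RSS of a linear model fit:
> nested.ftest <- function(small, big) {
+   rss.w <- deviance(small); rss.o <- deviance(big)
+   df.w <- df.residual(small); df.o <- df.residual(big)
+   f <- ((rss.w - rss.o)/(df.w - df.o))/(rss.o/df.o)
+   c(F=f, p.value=1 - pf(f, df.w - df.o, df.o))
+ }
For any pair of nested fits, this reproduces what anova(small, big) reports, as we will see in the examples below.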
3.2 Testing Examples
Test of all the predictors
Are any of the predictors useful in predicting the response? Let the full model (Ω) be y = Xβ + ε, where X is a full-rank n × p matrix, and the reduced model (ω) be y = μ + ε. We would estimate μ by ȳ. We write the null hypothesis as:

H0: β1 = … = βp−1 = 0

Now RSS_Ω = (y − Xβ̂)ᵀ(y − Xβ̂) = ε̂ᵀε̂, the residual sum of squares for the full model, while RSS_ω = (y − ȳ)ᵀ(y − ȳ) = TSS, which is sometimes known as the sum of squares corrected for the mean. So in this case:

F = [(TSS − RSS)/(p − 1)] / [RSS/(n − p)]
We would now refer to F_{p−1, n−p} for a critical value or a p-value. Large values of F indicate rejection of the null. Traditionally, the information in the above test is presented in an analysis of variance table; most computer packages produce a variant on this. See Table 3.1 for the layout. It is not really necessary to compute all the elements of the table specifically. As Fisher, the originator of the table, said in 1931, it is "nothing but a convenient way of arranging the arithmetic." Since he had to do his calculations by hand, the table served a necessary purpose, but it is not essential now.
Source      Deg. of Freedom   Sum of Squares   Mean Square     F
Regression  p − 1             SSreg            SSreg/(p − 1)   F
Residual    n − p             RSS              RSS/(n − p)
Total       n − 1             TSS

Table 3.1 Analysis of variance table.
A failure to reject the null hypothesis is not the end of the game—you must still
investigate the possibility of nonlinear transformations of the variables and of outliers
which may obscure the relationship. Even then, you may just have insufficient data to
demonstrate a real effect, which is why we must be careful to say “fail to reject” the null
rather than “accept” the null. It would be a mistake to conclude that no real relationship
exists. This issue arises when a pharmaceutical company wishes to show that a proposed
generic replacement for a brand-named drug is equivalent. It would not be enough in this
instance just to fail to reject the null. A higher standard would be required.
When the null is rejected, this does not imply that the alternative model is the best model. We
do not know whether all the predictors are required to predict the response or just some
of them. Other predictors might also be added or existing predictors transformed or
recombined. Either way, the overall F-test is just the beginning of an analysis and not the end.
Let’s illustrate this test and others using an old economic dataset on 50 different countries.
These data are averages from 1960 to 1970 (to remove business cycle or other short-term
fluctuations). dpi is per capita disposable income in U.S. dollars; ddpi is the percentage
rate of change in per capita disposable income; sr is aggregate personal saving divided by
disposable income. The percentage of population under 15 (pop15) and over 75 (pop75) is also
recorded. The data come from Belsley, Kuh, and Welsch (1980). Take a look at the data:
> data(savings)
> savings
              sr pop15 pop75     dpi ddpi
Australia  11.43 29.35  2.87 2329.68 2.87
Austria    12.07 23.32  4.41 1507.99 3.93
...cases deleted...
Malaysia    4.71 47.20  0.66  242.69 5.08
First, consider a model with all the predictors:
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.566087   7.354516    3.88  0.00033
pop15       -0.461193   0.144642   -3.19  0.00260
pop75       -1.691498   1.083599   -1.56  0.12553
dpi         -0.000337   0.000931   -0.36  0.71917
ddpi         0.409695   0.196197    2.09  0.04247

Residual standard error: 3.8 on 45 degrees of freedom
Multiple R-Squared: 0.338, Adjusted R-squared: 0.28
F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
We can see directly the result of the test of whether any of the predictors have significance in the model, in other words, whether β1 = β2 = β3 = β4 = 0. Since the p-value, 0.00079, is so small, this null hypothesis is rejected.
We can also do it directly using the F-testing formula:
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> (tss <- sum((savings$sr - mean(savings$sr))^2))
[1] 983.63
> (rss <- deviance(g))
[1] 650.71
> df.residual(g)
[1] 45
> (fstat <- ((tss - rss)/4)/(rss/df.residual(g)))
[1] 5.7557
> 1 - pf(fstat, 4, df.residual(g))
[1] 0.00079038
Verify that the numbers match the regression summary above.
Testing just one predictor
Can one particular predictor be dropped from the model? The null hypothesis would be
H0: βi = 0. Let RSS_Ω be the RSS for the model with all the predictors of interest, which has p parameters, and let RSS_ω be the RSS for the model with the same predictors except predictor i.
The F-statistic may be computed using the standard formula. An alternative approach is to use a t-statistic for testing the hypothesis:

t_i = β̂_i / se(β̂_i)

and check for significance using a t-distribution with n − p degrees of freedom. However, t_i² is exactly the F-statistic here, so the two approaches are numerically identical. The latter is less work and is what typical regression outputs present.
For example, to test the null hypothesis that β1 = 0 (that pop15 is not significant in the full model), we can simply observe that the p-value is 0.0026 from the table and conclude that the null should be rejected.
Let's do the same test using the general F-testing approach. We will need the RSS and df for the full model, which are 650.71 and 45, respectively. We now fit the model that represents the null:
> g2 <- lm(sr ~ pop75 + dpi + ddpi, savings)
and compute the RSS and the F-statistic:
> (rss2 <- deviance(g2))
[1] 797.72
> (fstat <- (deviance(g2) - deviance(g))/(deviance(g)/df.residual(g)))
[1] 10.167
The p-value is then:
> 1 - pf(fstat, 1, df.residual(g))
[1] 0.002603
We can relate this to the t-based test and p-value by:
> sqrt(fstat)
[1] 3.1885
> (tstat <- summary(g)$coef[2, 3])
[1] -3.1885
> 2 * (1 - pt(sqrt(fstat), 45))
[1] 0.002603
A more convenient way to compare two nested models is:
> anova(g2, g)
Analysis of Variance Table
Model 1: sr ~ pop75 + dpi + ddpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        798
2     45        651  1    147    10.2 0.0026
Understand that this test of pop15 is relative to the other predictors in the model, namely pop75, dpi and ddpi. If these other predictors were changed, the result of the test might be different. This means that it is not possible to look at the effect of pop15 in isolation: simply stating the null hypothesis as H0: β_pop15 = 0 is insufficient—we must also specify which other predictors are included under the null.
Testing a pair of predictors
Suppose we wish to test the significance of variables Xj and Xk. We might construct a
table as seen above and find that both variables have p-values greater than 0.05 thus
indicating that individually each one is not significant. Does this mean that both Xj and Xk
can be eliminated from the model? Not necessarily.
Except in special circumstances, dropping one variable from a regression model causes
the estimates of the other parameters to change so that we might find that after dropping
Xj, a test of the significance of Xk shows that it should now be included in the model.
If you really want to check the joint significance of Xj and Xk you should fit a model
with and then without them and use the general F-test discussed before. Remember that
even the result of this test may depend on what other predictors are in the model.
We test the hypothesis that both pop75 and ddpi may be excluded from the model:
> g3 <- lm(sr ~ pop15 + dpi, savings)
> anova(g3, g)
Analysis of Variance Table
Model 1: sr ~ pop15 + dpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     47 744.12
2     45 650.71  2     93.41 3.2299 0.04889
We see that the pair of predictors is just barely significant at the 5% level.
Tests of more than two predictors may be performed in a similar way by comparing the
appropriate models.
Testing a subspace
Consider this example. Suppose that y is the first-year grade point average for a student, Xj is the score on the quantitative part of a standardized test and Xk is the score on the verbal part. There might also be some other predictors. We might wonder whether we need two separate scores—perhaps they can be replaced by the total, Xj + Xk. So if the original model was:

y = β0 + … + βjXj + βkXk + … + ε

then the reduced model is:

y = β0 + … + βl(Xj + Xk) + … + ε

which requires that βj = βk for this reduction to be possible. So the null hypothesis is:

H0: βj = βk
This defines a linear subspace to which the general F-testing procedure applies. In our example, we might hypothesize that the effect of young and old people on the savings rate is the same, in other words:

H0: β_pop15 = β_pop75

In this case the null model would take the form:

y = β0 + β_dep(pop15 + pop75) + β_dpi dpi + β_ddpi ddpi + ε
We can then compare this to the full model as follows:
> g <- lm(sr ~ ., savings)
> gr <- lm(sr ~ I(pop15 + pop75) + dpi + ddpi, savings)
> anova(gr, g)
Analysis of Variance Table
Model 1: sr ~ I(pop15 + pop75) + dpi + ddpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        674
2     45        651  1     23    1.58   0.21
The period in the first model formula is shorthand for all the other variables in the data frame. The function I() ensures that its argument is evaluated arithmetically rather than interpreted as part of the model formula. The p-value of 0.21 indicates that the null cannot be rejected here, meaning that there is no evidence that young and old people need to be treated separately in the context of this particular model.
Suppose we want to test whether one of the coefficients can be set to a particular value.
For example:
H0: β_ddpi = 0.5

Here the null model would take the form:

y = β0 + β_pop15 pop15 + β_pop75 pop75 + β_dpi dpi + 0.5 ddpi + ε

Notice that there is now a fixed coefficient on the ddpi term. Such a fixed term in the regression equation is called an offset. We fit this model and compare it to the full:
> gr <- lm(sr ~ pop15 + pop75 + dpi + offset(0.5*ddpi), savings)
> anova(gr, g)
Analysis of Variance Table
Model 1: sr ~ pop15 + pop75 + dpi + offset(0.5 * ddpi)
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df RSS Df Sum of Sq    F Pr(>F)
1     46 654
2     45 651  1         3 0.21   0.65
We see that the p-value is large and the null hypothesis is not rejected here. A simpler way to test such point hypotheses is to use a t-statistic:

t = (β̂ − c)/se(β̂)

where c is the hypothesized value. So in our example the statistic and corresponding p-value are:
> (tstat <- (0.409695 - 0.5)/0.196197)
[1] -0.46028
> 2*pt(tstat, 45)
[1] 0.64753
We can see the p-value is the same as before and if we square the t-statistic
> tstat^2
[1] 0.21186
we find we get the same F-value as above. This latter approach is preferred in practice, since we do not need to fit two models, but it is important to understand that it is equivalent to the result obtained using the general F-testing approach.
Can we test a hypothesis such as H0: βjβk = 1 using our general theory? No. This hypothesis is not linear in the parameters, so we cannot use the general method; we would need to fit a nonlinear model, which lies beyond the scope of this book.
3.3 Permutation Tests
We can put a different interpretation on the hypothesis tests we are making. For the
Galápagos dataset, we might suppose that if the number of species had no relation to the
five geographic variables, then the observed response values would be randomly
distributed between the islands without relation to the predictors. The F-statistic is a good
measure of the association between the predictors and the response with larger values
indicating stronger associations. We might then ask what the chance would be under this
assumption that an F-statistic would be observed as large, or larger than the one we
actually observed. We could compute this exactly by computing the F-statistic for all
possible (30!) permutations of the response variable and see what proportion exceed the
observed F-statistic. This is a permutation test. If the observed proportion is small, then
we must reject the contention that the response is unrelated to the predictors. Curiously,
this proportion is estimated by the p-value calculated in the usual way based on the
assumption of normal errors, thus saving us from the massive task of actually computing the regression for all those permutations. See Freedman and Lane (1983) for a discussion
of these matters.
Let’s see how we can apply the permutation test to the savings data. I chose a model
with just pop75 and dpi so as to get a p-value for the F-statistic that is not too small (and
therefore less interesting):
> g <- lm(sr ~ pop75 + dpi, savings)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.056619   1.290435    5.47  1.7e-06
pop75        1.304965   0.777533    1.68     0.10
dpi         -0.000341   0.001013   -0.34     0.74

Residual standard error: 4.33 on 47 degrees of freedom
Multiple R-Squared: 0.102, Adjusted R-squared: 0.0642
F-statistic: 2.68 on 2 and 47 DF, p-value: 0.079
We can extract the F-statistic as:
> gs <- summary(g)
> gs$fstat
  value   numdf   dendf
 2.6796  2.0000 47.0000
The function sample() generates random permutations. We compute the F-statistic for 4000 randomly selected permutations and see what proportion exceeds the F-statistic for the original data:
> fstats <- numeric(4000)
> for(i in 1:4000){
+   ge <- lm(sample(sr) ~ pop75 + dpi, data=savings)
+   fstats[i] <- summary(ge)$fstat[1]
+ }
> length(fstats[fstats > 2.6796])/4000
[1] 0.07425
This should take less than a minute on any relatively new computer. If you repeat it, you will get a slightly different result each time because of the random selection of the permutations. So our estimated p-value using the permutation test is 0.07425, which is close to the normal theory-based value of 0.0791. We could reduce the variability in the estimation of the p-value simply by computing more random permutations. Since the permutation test does not depend on the assumption of normality, we might regard it as superior to the normal theory-based value. It does take longer to compute, however, so we might use the normal inference, secure in the knowledge that the results can also be justified with an alternative argument.
Tests involving just one predictor also fall within the permutation test framework. We
permute that predictor rather than the response. Let’s test the pop75 predictor in the
model. We can extract the t-statistic as:
> summary(g)$coef[2, 3]
[1] 1.6783
Now we perform 4000 permutations of pop75 and check what fraction of the t-statistics
exceeds 1.6783 in absolute value:
> tstats <- numeric(4000)
> for(i in 1:4000){
+   ge <- lm(sr ~ sample(pop75) + dpi, savings)
+   tstats[i] <- summary(ge)$coef[2, 3]
+ }
> mean(abs(tstats) > 1.6783)
[1] 0.10475
The outcome is very similar to the observed normal-based p-value of 0.10.
3.4 Confidence Intervals for β
Confidence intervals (CIs) provide an alternative way of expressing the uncertainty in our estimates. They are linked to the tests that we have already constructed. For the CIs and regions that we will consider here, the following relationship holds: for a 100(1 − α)% confidence region, any point that lies within the region represents a null hypothesis that would not be rejected at the 100α% level, while every point outside represents a null hypothesis that would be rejected. So, in one sense, the confidence region provides a lot more information than a single hypothesis test in that it tells us the outcome of a whole range of hypotheses about the parameter values. Of course, by selecting the particular level of confidence for the region, we can only make tests at that level, and we cannot determine the p-value for any given test simply from the region. However, since it is dangerous to read too much into the relative size of p-values (as far as how much evidence they provide against the null), this loss is not particularly important.
The confidence region tells us about plausible values for the parameters in a way that
the hypothesis test cannot. This makes it more valuable. As with testing, we must decide
whether to form confidence regions for parameters individually or simultaneously.
Simultaneous regions are preferable, but for more than two dimensions they are difficult
to display and so there is still some value in computing the one-dimensional CIs.
We can consider each parameter individually, which leads to CIs taking the general form of:

Estimate ± Critical Value × SE of Estimate

or specifically in this case:

β̂_i ± t_{n−p}^{(α/2)} se(β̂_i)

Alternatively, a 100(1 − α)% confidence region for β satisfies:

(β̂ − β)ᵀ XᵀX (β̂ − β) ≤ p σ̂² F_{p, n−p}^{(α)}

These regions are ellipsoidally shaped. Because these ellipsoids live in higher dimensions, they cannot easily be visualized except in the two-dimensional case. It is better to consider the joint CIs when possible, especially when the β̂ are heavily correlated.
Consider the full model for the savings data:
> g <- lm(sr ~ ., savings)
where the output is displayed on page 28. We can construct an individual 95% CI for β_pop75:
> qt(0.975, 45)
[1] 2.0141
> c(-1.69 - 2.01*1.08, -1.69 + 2.01*1.08)
[1] -3.8608  0.4808
CIs have a duality with two-sided hypothesis tests, as we mentioned above. The interval contains zero, which indicates that the null hypothesis H0: β_pop75 = 0 would not be rejected at the 5% level. We can see from the summary that the p-value is 12.5%—greater than 5%—confirming this point.
The CI for β_ddpi is:
> c(0.41 - 2.01*0.196, 0.41 + 2.01*0.196)
[1] 0.01604 0.80396
Because zero is not in this interval, the null hypothesis H0: β_ddpi = 0 is rejected. Nevertheless, this CI is relatively wide in the sense that the upper limit is about 50 times larger than the lower limit. This means that we are not really that confident about what the exact effect of growth on savings is, even though it is statistically significant. A convenient way to obtain all the univariate intervals is:
> confint(g)
                  2.5 %     97.5 %
(Intercept) 13.7533307 43.3788424
pop15       -0.7525175 -0.1698688
pop75       -3.8739780  0.4909826
dpi         -0.0022122  0.0015384
ddpi         0.0145336  0.8048562
Now we construct the joint 95% confidence region for β_pop15 and β_pop75. First, we load in a package for drawing confidence ellipses (which is not part of base R and so may need to be downloaded):
> library(ellipse)
and now the plot:
> plot(ellipse(g, c(2,3)), type="l", xlim=c(-1,0))
We add the origin and the point of the estimates:
> points(0, 0)
> points(coef(g)[2], coef(g)[3], pch=18)
Since the origin lies outside the ellipse, we reject the hypothesis that β_pop15 = β_pop75 = 0. We mark the one-way CIs on the plot for reference:
> abline(v=confint(g)[2,], lty=2)
> abline(h=confint(g)[3,], lty=2)
See the plot in Figure 3.2. Notice that these lines are not tangential to the ellipse. If the
lines were moved out so that they enclosed the ellipse exactly, the CIs would be jointly
correct.
In some circumstances, the origin could lie within both one-way CIs, but lie outside the
ellipse. In this case, both one-at-a-time tests would not reject the null whereas the joint
test would. The latter test would be preferred. It is also possible for the origin to lie
outside the rectangle but inside the ellipse. In this case, the joint test would not reject the
null whereas both one-at-a-time tests would reject. Again we prefer the joint test result.
Examine the correlation of the two predictors:
> cor(savings$pop15, savings$pop75)
[1] -0.90848
Figure 3.2 Confidence ellipse and regions for β_pop75 and β_pop15.

However, from the plot, we see that the coefficients have a positive correlation. We can confirm this from:
> summary(g, corr=TRUE)$corr
            (Intercept)    pop15    pop75      dpi     ddpi
(Intercept)     1.00000 -0.98416 -0.80911 -0.16588 -0.18827
pop15          -0.98416  1.00000  0.76536  0.17991  0.10247
pop75          -0.80911  0.76536  1.00000 -0.36705 -0.05472
dpi            -0.16588  0.17991 -0.36705  1.00000  0.25548
ddpi           -0.18827  0.10247 -0.05472  0.25548  1.00000
where we see that the correlation is 0.765. The correlation between predictors and the correlation between the coefficients of those predictors are often different in sign. Loosely speaking, two positively correlated predictors will attempt to perform the same job of explanation; the more work one does, the less the other needs to do, and hence a negative correlation arises in the coefficients. For negatively correlated predictors, as seen here, the effect is reversed.
3.5 Confidence Intervals for Predictions
Given a new set of predictors, x0, the predicted response is ŷ0 = x0ᵀβ̂. However, we need to assess the uncertainty in this prediction: decision makers need more than just a point estimate to make rational choices. If the prediction has a wide CI, we need to allow for outcomes far from the point estimate. For example, suppose we need to predict the high water mark of a river; we may need to construct barriers high enough to withstand floods much higher than the predicted maximum.
We must also distinguish between predictions of the future mean response and predictions of future observations. To make the distinction clear, suppose we have built a regression model that predicts the selling price of homes in a given area based on predictors such as the number of bedrooms and closeness to a major highway. There are two kinds of predictions that can be made for a given x0:
1. Suppose a specific house comes on the market with characteristics x0. Its selling price will be x0ᵀβ + ε. Since Eε = 0, the predicted price is x0ᵀβ̂, but in assessing the variance of this prediction, we must include the variance of ε.
2. Suppose we ask the question—"What would a house with characteristics x0 sell for on average?" This selling price is x0ᵀβ and is again predicted by x0ᵀβ̂, but now only the variance in β̂ needs to be taken into account.
Most times, we will want the first case, which is called "prediction of a future value," while the second case, called "prediction of the mean response," is less commonly required.
A future observation is predicted to be x0ᵀβ̂ + ε. We have var(x0ᵀβ̂ + ε) = x0ᵀ(XᵀX)⁻¹x0 σ² + σ² (where we do not know what the future ε will be, and we can reasonably assume it to be independent of β̂). So a 100(1 − α)% CI for a single future response is:

ŷ0 ± t_{n−p}^{(α/2)} σ̂ √(1 + x0ᵀ(XᵀX)⁻¹x0)
If, on the other hand, you want a CI for the mean response for given x0, use:

ŷ0 ± t_{n−p}^{(α/2)} σ̂ √(x0ᵀ(XᵀX)⁻¹x0)
We return to the Galápagos data for this example, refitting the five-predictor model (the leading 1 in x0 matches the intercept):
> g <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
> x0 <- c(1, 0.08, 93, 6.0, 12.0, 0.34)
> (y0 <- sum(x0*coef(g)))
[1] 33.92
This is the predicted number of species. If a whole number is preferred, we could round
up to 34.
Now if we want a 95% CI for the prediction, we must decide whether we are predicting the number of species on one new island or the mean response for all islands with the same predictors x0. Suppose that an island was not surveyed for the original dataset; the former interval would be the one we want. For this dataset, the latter interval would be more valuable for "what if" types of calculation. First, we need the t-critical value:
> qt(0.975, 24)
[1] 2.0639
We calculate the (XᵀX)⁻¹ matrix:
> x <- model.matrix(g)
> xtxi <- solve(t(x) %*% x)
The width of the band for the mean response CI is t_{n−p}^{(α/2)} σ̂ √(x0ᵀ(XᵀX)⁻¹x0):
> (bm <- sqrt(x0 %*% xtxi %*% x0) * 2.064 * 60.98)
      [,1]
[1,] 32.89
and the interval is:
> c(y0 - bm, y0 + bm)
[1] 1.0296 66.8097
Now we compute the prediction interval for the single future response:
> bm <- sqrt(1 + x0 %*% xtxi %*% x0) * 2.064 * 60.98
> c(y0 - bm, y0 + bm)
[1] -96.17 164.01
Of course, the number of species cannot be negative. In such instances, impossible values
in the CI can be avoided by transforming the response, say taking logs (explained in a
later chapter), or by using a probability model more appropriate to the response. The
normal distribution is supported on the whole real line and so negative values are always
possible. A better choice for this example might be the Poisson distribution which is
supported on the nonnegative integers.
There is a more direct method for computing the CI. The function predict() requires that its second argument be a data frame with variables named in the same way as in the original dataset:
> x0 <- data.frame(Area=0.08, Elevation=93, Nearest=6.0, Scruz=12, Adjacent=0.34)
> str(predict(g, x0, se=TRUE))
List of 4
 $ fit           : num 33.9
 $ se.fit        : num 15.9
 $ df            : int 24
 $ residual.scale: num 61
> predict(g, x0, interval="confidence")
       fit    lwr    upr
[1,] 33.92 1.0338 66.806
> predict(g, x0, interval="prediction")
       fit     lwr    upr
[1,] 33.92 -96.153 163.99
Extrapolation occurs when we try to predict the response for values of the predictor
which lie outside the range of the original data. There are two different types of
extrapolation:
1. Quantitative extrapolation: We must check whether the new x0 is within the range of
the original data. If not, the prediction may be unrealistic. CIs for predictions get wider
as we move away from the data. We can compute these bands for the Galápagos model
where we vary the Nearest variable while holding the other predictors fixed:
> grid <- seq(0, 100, 1)
> p <- predict(g, data.frame(Area=0.08, Elevation=93, Nearest=grid,
    Scruz=12, Adjacent=0.34), se=T, interval="confidence")
> matplot(grid, p$fit, lty=c(1,2,2), type="l", xlab="Nearest", ylab="Species")
> rug(gala$Nearest)
We see that the confidence bands in Figure 3.3 become wider as we move away from the
range of the data. However, this widening does not reflect the possibility that the structure
of the model may change as we move into new territory. The uncertainty in the
parametric estimates is allowed for, but not uncertainty about the model. The relationship
may become nonlinear outside the range of the data—we have no way of knowing.
Figure 3.3 Predicted Species over a range of Nearest values, with 95% pointwise confidence bands for the mean response shown as dotted lines. A "rug" shows the location of the observed values of Nearest.
2. Qualitative extrapolation: Is the new x0 drawn from the same population from which
the original sample was drawn? If the model was built in the past and is to be used for
future predictions, we must make a difficult judgment as to whether conditions have
remained constant enough for this to work.
3.6 Designed Experiments
In a designed experiment, the user has some control over X. For example, suppose we
wish to compare several physical exercise regimes. The experimental units are the people
we use for the study. We can choose some of the predictors such as the amount of time
spent exercising or the type of equipment used. Some other predictors might not be
controlled, but can be measured, such as the physical characteristics of the people. Other
variables, such as the temperature in the room, might be held constant.
Our control over the conditions of the experiment allows us to make stronger
conclusions from the analysis. Two important design features are orthogonality and
randomization.
Orthogonality is a useful property because it allows us to more easily interpret the effect of one predictor without regard to another. Suppose we can partition X in two, X = [X1 X2], such that X1ᵀX2 = 0. So now:

y = Xβ + ε = X1β1 + X2β2 + ε

and XᵀX is block diagonal:

XᵀX = [ X1ᵀX1  0 ; 0  X2ᵀX2 ]

which means:

β̂1 = (X1ᵀX1)⁻¹X1ᵀy,   β̂2 = (X2ᵀX2)⁻¹X2ᵀy

Notice that β̂1 will be the same regardless of whether X2 is in the model or not (and vice versa), so we can interpret the effect of X1 without concern for X2. Unfortunately, the decoupling is not perfect: suppose we wish to test H0: β1 = 0. The estimate σ̂² = RSS_Ω/df_Ω will be different depending on whether X2 is included in the model or not, but the difference in F is not liable to be as large as in nonorthogonal cases.
Orthogonality is a desirable property, but it will only occur when X is chosen by the experimenter; it is a feature of a good design. In observational data, we do not have direct control over X, and this is the source of many of the interpretational difficulties associated with nonexperimental data.
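Here is a toy sketch (ours, with made-up data) illustrating the decoupling. The two predictor columns sum to zero and are orthogonal to each other, so the x1 estimate is unchanged when x2 is dropped:
> x1 <- rep(c(-1, 1), each=4)
> x2 <- rep(c(-1, 1), times=4)
> sum(x1*x2)            # 0: the columns are orthogonal
> set.seed(2)
> y <- 1 + 2*x1 - x2 + rnorm(8)
> coef(lm(y ~ x1 + x2))["x1"]
> coef(lm(y ~ x1))["x1"]   # identical to the previous estimate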
Here is an example of an experiment to determine the effects of column temperature,
gas/liquid ratio and packing height in reducing the unpleasant odor of a chemical product
that was sold for household use. Read the data in and display:
> data(odor)
> odor
   odor temp gas pack
1    66   -1  -1    0
2    39    1  -1    0
3    43   -1   1    0
4    49    1   1    0
5    58   -1   0   -1
6    17    1   0   -1
7    -5   -1   0    1
8   -40    1   0    1
9    65    0  -1   -1
10    7    0   1   -1
11   43    0  -1    1
12  -22    0   1    1
13  -31    0   0    0
14  -35    0   0    0
15  -26    0   0    0
The three predictors have been transformed from their original scale of measurement; for example, temp = (Fahrenheit − 80)/40, so the original values of the predictor were 40, 80 and 120. The data are presented in John (1971) and give an example of a central composite design. Here is the X-matrix:
> x <- as.matrix(cbind(1, odor[,-1]))
and XᵀX:
> t(x) %*% x
      1 temp gas pack
1    15    0   0    0
temp  0    8   0    0
gas   0    0   8    0
pack  0    0   0    8
The matrix is diagonal. Even if temp were measured on a rescaled version of the original Fahrenheit scale (still centered so that the column sums to zero), the matrix would remain diagonal, but the entry corresponding to temp would change. Now fit a model, while asking for the correlation of the coefficients:
> g <- lm(odor ~ temp + gas + pack, odor)
> summary(g, cor=T)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     15.2        9.3    1.63     0.13
temp           -12.1       12.7   -0.95     0.36
gas            -17.0       12.7   -1.34     0.21
pack           -21.4       12.7   -1.68     0.12

Residual standard error: 36 on 11 degrees of freedom
Multiple R-Squared: 0.334, Adjusted R-squared: 0.152
F-statistic: 1.84 on 3 and 11 DF, p-value: 0.199
Correlation of Coefficients:
     (Intercept) temp gas
temp 0.00
gas  0.00        0.00
pack 0.00        0.00 0.00
We see that, as expected, the pairwise correlation of all the coefficients is zero. Notice that
the SEs for the coefficients are equal due to the balanced design. Now drop one of the
variables:
> g <- lm(odor ~ gas + pack, odor)
> summary(g)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    15.20       9.26    1.64     0.13
gas           -17.00      12.68   -1.34     0.20
pack          -21.37      12.68   -1.69     0.12

Residual standard error: 35.9 on 12 degrees of freedom
Multiple R-Squared: 0.279, Adjusted R-squared: 0.159
F-statistic: 2.32 on 2 and 12 DF, p-value: 0.141
The coefficients themselves do not change, but the residual SE does change slightly,
which causes small changes in the SEs of the coefficients, t-statistics and p-values, but
nowhere near enough to change our qualitative conclusions.
These were data from an experiment so it was possible to control the values of the
predictors to ensure orthogonality. Now consider the savings data, which are
observational:
> g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, savings)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.566087   7.354516    3.88  0.00033
pop15       -0.461193   0.144642   -3.19  0.00260
pop75       -1.691498   1.083599   -1.56  0.12553
dpi         -0.000337   0.000931   -0.36  0.71917
ddpi         0.409695   0.196197    2.09  0.04247

Residual standard error: 3.8 on 45 degrees of freedom
Multiple R-Squared: 0.338, Adjusted R-squared: 0.28
F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079
Drop pop15 from the model:
> g <- update(g, . ~ . - pop15)
> summary(g)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.487494   1.427662    3.84  0.00037
pop75       0.952857   0.763746    1.25  0.21849
dpi         0.000197   0.001003    0.20  0.84499
ddpi        0.473795   0.213727    2.22  0.03162

Residual standard error: 4.16 on 46 degrees of freedom
Multiple R-Squared: 0.189, Adjusted R-squared: 0.136
F-statistic: 3.57 on 3 and 46 DF, p-value: 0.0209
Pay particular attention to pop75. The effect has now become positive whereas before it
was negative. Granted, in both cases it is not significant, but it is not uncommon in other
datasets for such sign changes to occur and for them to be significant.
Another important feature of designed experiments is the ability to randomly assign the
experimental units to our chosen values of the predictors. For example, in our exercise
regime experiment, we might choose an X with some desirable orthogonality properties
and then fit a y = Xβ + ε model. However, the subjects we use for the experiment are likely
to differ in ways that might affect the response, which could, for example, be resting heart rate. Some of these characteristics may be known and measurable, such as age or weight; these can be included in X. Including such known important variables is typically beneficial, as it increases the precision of our estimates of β.
However, there will be other variables, Z, that we cannot measure or may not even suspect. The model we use is y = Xβ + ε, while the true model might be y = Xβ + Zγ + δ, where δ is the measurement error in the response. We can assume that Eε = 0 without any loss of generality, because if Eε = c, we could simply redefine β0 as β0 + c and the error would again have expectation zero. This is another reason why it is generally unwise to remove the intercept term from the model: it acts as a sink for the mean effect of unincluded variables. So we see that ε incorporates both measurement error and the effect of other variables. We will not observe Z, so we cannot estimate γ. If X were orthogonal to Z, that is, XᵀZ = 0, the estimate of β would not depend on the presence of Z. Of course, we cannot achieve this exactly, but because of the random assignment of the experimental units, which each carry an unknown value of Z, to the predictors X, we will have cor(X, Z) = 0, at least on average. This means that the estimate of β will not be biased by Z.
Another compelling argument in favor of randomization is that it justifies the use of permutation tests: we really can argue that any assignment of X to the experimental units was one of n! possible permutations.
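A quick simulation sketch (ours) illustrates the decoupling achieved by randomization: over repeated random assignments of a two-level treatment, the sample correlation between the assignment and an unmeasured subject variable Z is centered at zero.
> set.seed(3)
> z <- rnorm(40)                       # an unobserved subject characteristic
> r <- replicate(2000, cor(sample(rep(0:1, 20)), z))
> c(mean(r), sd(r))                    # mean near 0; spread of order 1/sqrt(n)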
For observational data, no randomization can be used in assigning X to the units and
orthogonality will not just happen. An unmeasured and possibly unsuspected lurking
variable Z may be the real cause of an observed relationship between y and X. For
example, we will observe a positive correlation among the shoe sizes and reading abilities
of elementary school students, but this relationship is driven by a lurking variable—the
age of the child. In this example, the nature of Z is obvious, but in other cases, we may
not be able to identify an important Z. The possible existence of such a lurking variable
casts a shadow over any conclusions from observational data.
Simply using a designed experiment does not ensure success. We may choose a bad
model for the data or find that the variation in the unobserved Z overwhelms the effect of
the...