Data analysis project using the R language

User Generated

znexkh

Mathematics

Description

This is a capstone project, and I want you to treat it as a real data analysis project: just follow the procedure you would normally use for a data analysis project. I have already subset the data (from a much larger dataset), so this should be very straightforward for you. Check the links shown in the document for the meaning of each variable.

We are asked to use the methods we have learned in the course. I will attach a document that lists all the methods we have covered in this course; please do not create a new method! I selected Doctoral as the academic level only because I want the report to be concise and clear, not because the project is extremely hard.

I also need your code so that I can review it and study the way you think. Thank you so much! I have also attached the textbook that covers all the methods that can be used.

Unformatted Attachment Preview

STAT 6120 Fall 2018 Data Analysis Project

Due by the end of Dec 10. Please email your report to tianxili@virginia.edu.

The National Health and Nutrition Examination Survey (NHANES) sometimes includes a physical activity component in which subjects wear a device that records physical activity for every minute over seven days. The dataset is quite large and can be found here: http://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/PAXRAW_D.ZIP. Some documentation of the data is here: http://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/PAXRAW_D.htm. We will only focus on the good quality sample, which can be obtained by requiring PAXSTAT==1 and PAXCAL==1.

First, calculate the "activity intensity" number per hour per subject. You should merge the physical activity summaries created above with basic NHANES demographic data (https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/DEMO_D.htm), such as age and gender, and body measurement data (https://wwwn.cdc.gov/Nchs/Nhanes/2005-2006/BMX_D.htm), especially BMI. Then consider how physical activity patterns are related to these factors. Such patterns might simply be the "activity intensity" number per hour, or some specific metric of its variation over time. Find a reasonable predictive model and explain the model. You do not need to use the case weights in your analysis. You can use external R packages (if you do use R) to help process the data; possibly helpful packages include dplyr, data.table and foreign.

Something to keep in mind:
• This will be an individual project - you are on your own for it. Of course, you can use the internet. The honor code applies.
• Only the methods covered in the course are allowed; you are not supposed to develop a new method.
• Your report should be at most four pages long (concise reports are encouraged), including all graphs and tables. You do not need to submit any code, but your report should include clear descriptions of all the methods that you used.
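One possible way to start the processing in R is sketched below. This is not a required approach: it assumes the NHANES files have been downloaded and unzipped as SAS transport (.xpt) files, and the file names and variable names used (SEQN, PAXSTAT, PAXCAL, PAXDAY, PAXHOUR, PAXINTEN, RIDAGEYR, RIAGENDR, BMXBMI) are assumptions to be checked against the codebooks linked above.

# Sketch only: file names and variable names are assumptions; verify them against
# the PAXRAW_D, DEMO_D and BMX_D documentation before relying on this.
library(foreign)   # read.xport() reads SAS transport (.xpt) files
library(dplyr)

pax  <- read.xport("paxraw_d.xpt")   # minute-level activity counts
demo <- read.xport("DEMO_D.XPT")     # demographics (age, gender)
bmx  <- read.xport("BMX_D.XPT")      # body measurements (BMI)

hourly <- pax %>%
  filter(PAXSTAT == 1, PAXCAL == 1) %>%        # good-quality, calibrated records only
  group_by(SEQN, PAXDAY, PAXHOUR) %>%
  summarise(intensity = sum(PAXINTEN)) %>%     # "activity intensity" per hour
  ungroup()

subject <- hourly %>%
  group_by(SEQN) %>%
  summarise(mean_intensity = mean(intensity),  # overall activity level
            sd_intensity   = sd(intensity))    # one possible "variation over time" metric

dat <- subject %>%
  inner_join(select(demo, SEQN, RIDAGEYR, RIAGENDR), by = "SEQN") %>%
  inner_join(select(bmx,  SEQN, BMXBMI),             by = "SEQN")

fit <- lm(mean_intensity ~ RIDAGEYR + factor(RIAGENDR) + BMXBMI, data = dat)
summary(fit)

The per-subject standard deviation is included only as one example of the "variation over time" metric mentioned in the prompt; diagnostics, transformations and variable selection using the course methods would follow from this starting point.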
[Front matter of the attached textbook follows: a catalogue of other titles in the Chapman & Hall/CRC Texts in Statistical Science series (series editors Chris Chatfield, Martin Tanner and Jim Zidek), then the title page.]

Linear Models with R
Julian J. Faraway
Chapman & Hall/CRC Texts in Statistical Science
A CRC Press Company, Boca Raton, London, New York, Washington, D.C.
This edition published in the Taylor & Francis e-Library, 2009.
Library of Congress Cataloging-in-Publication Data: Faraway, Julian James. Linear models with R. (Chapman & Hall/CRC Texts in Statistical Science; v. 63). Includes bibliographical references and index. ISBN 1-58488-425-8. © 2005 by Chapman & Hall/CRC. [Standard permissions, trademark and e-book ISBN notices omitted.]

Contents

Preface
1 Introduction
  1.1 Before You Start
  1.2 Initial Data Analysis
  1.3 When to Use Regression Analysis
  1.4 History
2 Estimation
  2.1 Linear Model
  2.2 Matrix Representation
  2.3 Estimating β
  2.4 Least Squares Estimation
  2.5 Examples of Calculating β̂
  2.6 Gauss-Markov Theorem
  2.7 Goodness of Fit
  2.8 Example
  2.9 Identifiability
3 Inference
  3.1 Hypothesis Tests to Compare Models
  3.2 Testing Examples
  3.3 Permutation Tests
  3.4 Confidence Intervals for β
  3.5 Confidence Intervals for Predictions
  3.6 Designed Experiments
  3.7 Observational Data
  3.8 Practical Difficulties
4 Diagnostics
  4.1 Checking Error Assumptions
  4.2 Finding Unusual Observations
  4.3 Checking the Structure of the Model
5 Problems with the Predictors
  5.1 Errors in the Predictors
  5.2 Changes of Scale
  5.3 Collinearity
6 Problems with the Error
  6.1 Generalized Least Squares
  6.2 Weighted Least Squares
  6.3 Testing for Lack of Fit
  6.4 Robust Regression
7 Transformation
  7.1 Transforming the Response
  7.2 Transforming the Predictors
8 Variable Selection
  8.1 Hierarchical Models
  8.2 Testing-Based Procedures
  8.3 Criterion-Based Procedures
  8.4 Summary
9 Shrinkage Methods
  9.1 Principal Components
  9.2 Partial Least Squares
  9.3 Ridge Regression
10 Statistical Strategy and Model Uncertainty
  10.1 Strategy
  10.2 An Experiment in Model Building
  10.3 Discussion
11 Insurance Redlining—A Complete Example
  11.1 Ecological Correlation
  11.2 Initial Data Analysis
  11.3 Initial Model and Diagnostics
  11.4 Transformation and Variable Selection
  11.5 Discussion
12 Missing Data
13 Analysis of Covariance
  13.1 A Two-Level Example
  13.2 Coding Qualitative Predictors
  13.3 A Multilevel Factor Example
14 One-Way Analysis of Variance
  14.1 The Model
  14.2 An Example
  14.3 Diagnostics
  14.4 Pairwise Comparisons
15 Factorial Designs
  15.1 Two-Way ANOVA
  15.2 Two-Way ANOVA with One Observation per Cell
  15.3 Two-Way ANOVA with More than One Observation per Cell
  15.4 Larger Factorial Experiments
16 Block Designs
  16.1 Randomized Block Design
  16.2 Latin Squares
  16.3 Balanced Incomplete Block Design
A R Installation, Functions and Data
B Quick Introduction to R
  B.1 Reading the Data In
  B.2 Numerical Summaries
  B.3 Graphical Summaries
  B.4 Selecting Subsets of the Data
  B.5 Learning More about R
Bibliography
Index

Preface

There are many books on regression and analysis of variance. These books expect different levels of preparedness and place different emphases on the material. This book is not introductory. It presumes some knowledge of basic statistical theory and practice. Readers are expected to know the essentials of statistical inference such as estimation, hypothesis testing and confidence intervals. A basic knowledge of data analysis is presumed. Some linear algebra and calculus are also required.

The emphasis of this text is on the practice of regression and analysis of variance. The objective is to learn what methods are available and, more importantly, when they should be applied. Many examples are presented to clarify the use of the techniques and to demonstrate what conclusions can be made. There is relatively less emphasis on mathematical theory, partly because some prior knowledge is assumed and partly because the issues are better tackled elsewhere. Theory is important because it guides the approach we take. I take a wider view of statistical theory. It is not just the formal theorems.
Qualitative statistical concepts are just as important in statistics because these enable us to actually do it rather than just talk about it. These qualitative principles are harder to learn because they are difficult to state precisely but they guide the successful experienced statistician. Data analysis cannot be learned without actually doing it. This means using a statistical computing package. There is a wide choice of such packages. They are designed for different audiences and have different strengths and weaknesses. I have chosen to use R (Ref. Ihaka and Gentleman (1996) and R Development Core Team (2003)). Why have I used R? There are several reasons. 1. Versatility. R is also a programming language, so I am not limited by the procedures that are preprogrammed by a package. It is relatively easy to program new methods in R. 2. Interactivity. Data analysis is inherently interactive. Some older statistical packages were designed when computing was more expensive and batch processing of computations was the norm. Despite improvements in hardware, the old batch processing paradigm lives on in their use. R does one thing at a time, allowing us to make changes on the basis of what we see during the analysis. 3. Freedom. R is based on S from which the commercial package S-plus is derived. R itself is open-source software and may be obtained free of charge to all. Linux, Macintosh, Windows and other UNIX versions are maintained and can be obtained from the R-project at www.r-project.org. R is mostly compatible with S-plus, meaning that Splus could easily be used for most of the examples provided in this book. 4. Popularity. SAS is the most common statistics package in general use but R or S is most popular with researchers in statistics. A look at common statistical journals confirms this popularity. R is also popular for quantitative applications in finance. Getting Started with R R requires some effort to learn. Such effort will be repaid with increased productivity. You can learn how to obtain R in Appendix A along with instructions on the installation of additional software and data used in this book. Downloaded by [University of Toronto] at 16:20 23 May 2014 xii Preface This book is not an introduction to R. Appendix B provides a brief introduction to the language, but alone is insufficient. I have intentionally included in the text all the commands used to produce the output seen in this book. This means that you can reproduce these analyses and experiment with changes and variations before fully understanding R. You may choose to start working through this text before learning R and pick it up as you go. Free introductory guides to R may be obtained from the R project Web site at www.r-project.org. Introductory books have been written by Dalgaard (2002) and Maindonald and Braun (2003). Venables and Ripley (2002) also have an introduction to R along with more advanced material. Fox (2002) is intended as a companion to a standard regression text. You may also find Becker, Chambers, and Wilks (1998) and Chambers and Hastie (1991) to be useful references to the S language. Ripley and Venables (2000) wrote a more advanced text on programming in S or R. The Web site for this book is at www.stat.lsa.umich.edu/˜faraway/LMR where data described in this book appear. Updates and errata will appear there also. Thanks to the builders of R without whom this book would not have been possible. 
CHAPTER 1 Introduction Downloaded by [University of Toronto] at 16:20 23 May 2014 1.1 Before You Start Statistics starts with a problem, proceeds with the collection of data, continues with the data analysis and finishes with conclusions. It is a common mistake of inexperienced statisticians to plunge into a complex analysis without paying attention to what the objectives are or even whether the data are appropriate for the proposed analysis. Look before you leap! The formulation of a problem is often more essential than its solution which may be merely a matter of mathematical or experimental skill. Albert Einstein To formulate the problem correctly, you must: 1. Understand the physical background. Statisticians often work in collaboration with others and need to understand something about the subject area. Regard this as an opportunity to learn something new rather than a chore. 2. Understand the objective. Again, often you will be working with a collaborator who may not be clear about what the objectives are. Beware of “fishing expeditions”—if you look hard enough, you will almost always find something, but that something may just be a coincidence. 3. Make sure you know what the client wants. You can often do quite different analyses on the same dataset. Sometimes statisticians perform an analysis far more complicated than the client really needed. You may find that simple descriptive statistics are all that are needed. 4. Put the problem into statistical terms. This is a challenging step and where irreparable errors are sometimes made. Once the problem is translated into the language of statistics, the solution is often routine. Difficulties with this step explain why artificial intelligence techniques have yet to make much impact in application to statistics. Defining the problem is hard to program. That a statistical method can read in and process the data is not enough. The results of an inapt analysis may be meaningless. It is also important to understand how the data were collected. • were they obtained via a designed sample survey. How the data were collected has a crucial impact on what conclusions can be made. • Is there nonresponse? The data you do not see may be just as important as the data you do see. • Are there missing values? This is a common problem that is troublesome and time consuming to handle. • How are the data coded? In particular, how are the qualitative variables represented? • What are the units of measurement? 2 Linear Models with R • Beware of data entry errors and other corruption of the data. This problem is all too common — almost a certainty in any real dataset of at least moderate size. Perform some data sanity checks. Downloaded by [University of Toronto] at 16:20 23 May 2014 1.2 Initial Data Analysis This is a critical step that should always be performed. It looks simple but it is vital. You should make numerical summaries such as means, standard deviations (SDs), maximum and minimum, correlations and whatever else is appropriate to the specific dataset. Equally important are graphical summaries. There is a wide variety of techniques to choose from. For one variable at a time, you can make boxplots, histograms, density plots and more. For two variables, scatterplots are standard while for even more variables, there are numerous good ideas for display including interactive and dynamic graphics. In the plots, look for outliers, data-entry errors, skewed or unusual distributions and structure. 
Check whether the data are distributed according to prior expectations. Getting data into a form suitable for analysis by cleaning out mistakes and aberrations is often time consuming. It often takes more time than the data analysis itself. In this course, all the data will be ready to analyze, but you should realize that in practice this is rarely the case. Let’s look at an example. The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix. The following variables were recorded: number of times pregnant, plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic blood pressure (mmHg), triceps skin fold thickness (mm), 2-hour serum insulin (mu U/ml), body mass index (weight in kg/(height in m2)), diabetes pedigree function, age (years) and a test whether the patient showed signs of diabetes (coded zero if negative, one if positive). The data may be obtained from UCI Repository of machine learning databases at www.ics.uci.edu/˜mlearn/MLRepository.html. Of course, before doing anything else, one should find out the purpose of the study and more about how the data were collected. However, let’s skip ahead to a look at the data: > library(faraway) > data (pima) > pima pregnant glucose diastolic 1 6 148 72 2 1 85 66 3 8 183 64 …much deleted… 768 1 93 70 triceps insulin bmi diabetes age 35 0 33.6 0.627 50 29 0 26.6 0.351 31 0 0 23.3 0.672 32 31 0 30.4 0.315 23 The library (faraway) command makes the data used in this book available. You need to install this package first as explained in Appendix A. We have explicitly written this command here, but in all subsequent chapters, we will assume that you have already issued this command if you plan to use data mentioned in the text. If you get an error message about data not being found, it may be that you have forgotten to type this. Introduction 3 Downloaded by [University of Toronto] at 16:20 23 May 2014 The command data (pima) calls up this particular dataset. Simply typing the name of the data frame, pima, prints out the data. It is too long to show it all here. For a dataset of this size, one can just about visually skim over the data for anything out of place, but it is certainly easier to use summary methods. We start with some numerical summaries: The summary ( ) command is a quick way to get the usual univariate summary information. At this stage, we are looking for anything unusual or unexpected, perhaps indicating a data-entry error. For this purpose, a close look at the minimum and maximum values of each variable is worthwhile. Starting with pregnant, we see a maximum value of 17. This is large, but not impossible. However, we then see that the next five variables have minimum values of zero. No blood pressure is not good for the health—something must be wrong. Let’s look at the sorted values: Downloaded by [University of Toronto] at 16:20 23 May 2014 4 Linear Models with R We see that the first 35 values are zero. The description that comes with the data says nothing about it but it seems likely that the zero has been used as a missing value code. For one reason or another, the researchers did not obtain the blood pressures of 35 patients. In a real investigation, one would likely be able to question the researchers about what really happened. Nevertheless, this does illustrate the kind of misunderstanding that can easily occur. 
A careless statistician might overlook these presumed missing values and complete an analysis assuming that these were real observed zeros. If the error was later discovered, they might then blame the researchers for using zero as a missing value code (not a good choice since it is a valid value for some of the variables) and not mentioning it in their data description. Unfortunately such oversights are not uncommon, particularly with datasets of any size or complexity. The statistician bears some share of responsibility for spotting these mistakes. We set all zero values of the five variables to NA, which is the missing value code used by R:

> pima$diastolic [pima$diastolic == 0] <- NA
> pima$glucose [pima$glucose == 0] <- NA
> pima$triceps [pima$triceps == 0] <- NA
> pima$insulin [pima$insulin == 0] <- NA
> pima$bmi [pima$bmi == 0] <- NA

The variable test is not quantitative but categorical. Such variables are also called factors. However, because of the numerical coding, this variable has been treated as if it were quantitative. It is best to designate such variables as factors so that they are treated appropriately. Sometimes people forget this and compute stupid statistics such as the "average zip code."

> pima$test <- factor (pima$test)
> summary (pima$test)
  0   1
500 268

We now see that 500 cases were negative and 268 were positive. It is even better to use descriptive labels. Now that we have cleared up the missing values and coded the data appropriately, we are ready to do some plots. Perhaps the most well-known univariate plot is the histogram:

Figure 1.1 The first panel shows a histogram of the diastolic blood pressures, the second shows a kernel density estimate of the same, while the third shows an index plot of the sorted values.

> hist (pima$diastolic)

as seen in the first panel of Figure 1.1. We see a bell-shaped distribution for the diastolic blood pressures centered around 70. The construction of a histogram requires the specification of the number of bins and their spacing on the horizontal axis. Some choices can lead to histograms that obscure some features of the data. R specifies the number and spacing of bins given the size and distribution of the data, but this choice is not foolproof and misleading histograms are possible. For this reason, some prefer to use kernel density estimates, which are essentially a smoothed version of the histogram (see Simonoff (1996) for a discussion of the relative merits of histograms and kernel estimates):

> plot (density (pima$diastolic, na.rm=TRUE))

The kernel estimate may be seen in the second panel of Figure 1.1. We see that this plot avoids the distracting blockiness of the histogram. Another alternative is to simply plot the sorted data against its index:

> plot (sort (pima$diastolic), pch=".")

The advantage of this is that we can see all the cases individually. We can see the distribution and possible outliers. We can also see the discreteness in the measurement of blood pressure—values are rounded to the nearest even number and hence we see the "steps" in the plot.
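If you want to reproduce the three panels of Figure 1.1 side by side, one way is the sketch below; the extract does not show the layout code, so treat this as an illustration rather than the author's code.

> par (mfrow=c(1, 3))                           # one row, three plotting panels
> hist (pima$diastolic)                         # histogram
> plot (density (pima$diastolic, na.rm=TRUE))   # kernel density estimate
> plot (sort (pima$diastolic), pch=".")         # index plot of the sorted values
> par (mfrow=c(1, 1))                           # restore the default layout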
Now note a couple of bivariate plots, as seen in Figure 1.2: Downloaded by [University of Toronto] at 16:20 23 May 2014 > plot (diabetes ˜ diastolic,pima) > plot (diabetes ˜ test, pima) Figure 1.2 The first panel shows scatterplot of the diastolic blood pressures against diabetes function and the second shows boxplots of diastolic blood pressure broken down by test result. First, we see the standard scatterplot showing two quantitative variables. Second, we see a side-by-side boxplot suitable for showing a quantitative and a qualititative variable. Also useful is a scatterplot matrix, not shown here, produced by: > pairs (pima) We will be seeing more advanced plots later, but the numerical and graphical summaries presented here are sufficient for a first look at the data. Introduction 7 1.3 When to Use Regression Analysis Regression analysis is used for explaining or modeling the relationship between a single variable Y, called the response, output or dependent variable; and one or more predictor, Downloaded by [University of Toronto] at 16:20 23 May 2014 input, independent or explanatory variables, X1,…, Xp. When p=1, it is called simple regression but when p>1 it is called multiple regression or sometimes multivariate regression. When there is more than one Y, then it is called multivariate multiple regression which we will not be covering explicity here, although you can just do separate regressions on each Y. The response must be a continuous variable, but the explanatory variables can be continuous, discrete or categorical, although we leave the handling of categorical explanatory variables to later in the book. Taking the example presented above, a regression with diastolic and bmi as X’s and diabetes as Y would be a multiple regression involving only quantitative variables which we shall be tackling shortly. A regression with diastolic and test as X’s and bmi as Y would have one predictor that is quantitative and one that is qualitative, which we will consider later in Chapter 13 on analysis of covariance. A regression with test as X and diastolic as Y involves just qualitative predictors—a topic called analysis of variance (ANOVA), although this would just be a simple two sample situation. A regression of test as Y on diastolic and bmi as predictors would involve a qualitative response. A logistic regression could be used, but this will not be covered in this book. Regression analyses have several possible objectives including: 1. Prediction of future observations 2. Assessment of the effect of, or relationship between, explanatory variables and the response 3. A general description of data structure Extensions exist to handle multivariate responses, binary responses (logistic regression analysis) and count responses (Poisson regression) among others. 1.4 History Regression-type problems were first considered in the 18th century to aid navigation with the use of astronomy. Legendre developed the method of least squares in 1805. Gauss claimed to have developed the method a few years earlier and in 1809 showed that least squares is the optimal solution when the errors are normally distributed. The methodology was used almost exclusively in the physical sciences until later in the 19th century. Francis Galton coined the term regression to mediocrity in 1875 in reference to the simple regression equation in the form: 8 Linear Models with R Downloaded by [University of Toronto] at 16:20 23 May 2014 where r is the correlation between x and y. 
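For reference, the regression line Galton referred to can be written in standardized form as below; this is the standard statement of the relation described in the text, not a formula reproduced from the book.

\[
\frac{\hat{y}-\bar{y}}{s_y} \;=\; r\,\frac{x-\bar{x}}{s_x},
\qquad\text{equivalently}\qquad
\hat{y} \;=\; \bar{y} + r\,\frac{s_y}{s_x}\,(x-\bar{x}),
\]

which reduces to \(\hat{y}=r\,x\) when both variables are scaled to mean zero and SD one, as in the example that follows.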
Galton used this equation to explain the phenomenon that sons of tall fathers tend to be tall but not as tall as their fathers, while sons of short fathers tend to be short but not as short as their fathers. This phenomenom is called the regression effect. See Stigler (1986) for more of the history. We can illustrate this effect with some data on scores from a course taught using this book. In Figure 1.3, we see a plot of midterm against final scores. We scale each variable to have mean zero and SD one so that we are not distracted by the relative difficulty of each exam and the total number of points possible. Furthermore, this simplifies the regression equation to: y = rx > > > > data (stat500) stat500 < - data.frame (scale (stat500)) plot (final ˜ midterm, stat500) abline (0, l) Figure 1.3 The final and midterm scores in standard units. The least squares fit is shown witha dotted line while y=x is shown with a solid line. Introduction 9 Downloaded by [University of Toronto] at 16:20 23 May 2014 We have added the y=x (solid) line to the plot. Now a student scoring, say 1 SD above average on the midterm might reasonably expect to do equally well on the final. We compute the least squares regression fit and plot the regression line (more on the details later). We also compute the correlations: > g < - lm (final ˜ midterm, stat500) > abline (coef (g), lty=5) > cor (stat500) midterm final hw total midterm 1.00000 0.545228 0.272058 0.84446 final 0.54523 1.000000 0.087338 0.77886 hw 0.27206 0.087338 1.000000 0.56443 total 0.84446 0.778863 0.564429 1.00000 The regression fit is the dotted line in Figure 1.3 and is always shallower than the y=x line. We see that a student scoring 1 SD above average on the midterm is predicted to score only 0.545 SDs above average on the final Correspondingly, a student scoring below average on the midterm might expect to do relatively better in the final, although still below average. If exams managed to measure the ability of students perfectly, then provided that ability remained unchanged from midterm to final, we would expect to see an exact correlation. Of course, it is too much to expect such a perfect exam and some variation is inevitably present. Furthermore, individual effort is not constant. Getting a high score on the midterm can partly be attributed to skill, but also a certain amount of luck. One cannot rely on this luck to be maintained in the final. Hence we see the “regression to mediocrity.” Of course this applies to any (x, y) situation like this—an example is the so-called sophomore jinx in sports when a new star has a so-so second season after a great first year. Although in the father-son example, it does predict that successive descendants will come closer to the mean; it does not imply the same of the population in general since random fluctuations will maintain the variation. In many other applications of regression, the regression effect is not of interest, so it is unfortunate that we are now stuck with this rather misleading name. Regression methodology developed rapidly with the advent of high-speed computing. Just fitting a regression model used to require extensive hand calculation. As computing hardware has improved, the scope for analysis has widened. Exercises 1. The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. 
Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data. 2. The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Make a numerical and graphical summary of the data as in the previous question. 10 Linear Models with R Downloaded by [University of Toronto] at 16:20 23 May 2014 3. The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question. 4. The dataset sat comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question. 5. The dataset divusa contains data on divorces in the United States from 1920 to 1996. Make a numerical and graphical summary of the data as in the first question. Downloaded by [University of Toronto] at 16:20 23 May 2014 CHAPTER 2 Estimation 2.1 Linear Model Suppose we want to model the response Y in terms of three predictors, X1, X2 and X3. One Downloaded by [University of Toronto] at 16:20 23 May 2014 very general form for the model would be: where f is some unknown function and " is the error in this representation. " is additive in this instance, but could enter in some more general form. Still, if we assume that f is a smooth, continuous function, that still leaves a very wide range of possibilities. Even with just three predictors, we typically will not have enough data to try to estimate f directly. So we usually have to assume that it has some more restricted form, perhaps linear as in: where !i, i=0, 1, 2, 3 are unknown parameters. !0 is called the intercept term. Thus the problem is reduced to the estimation of four parameters rather than the infinite dimensional f. In a linear model the parameters enter linearly—the predictors themselves do not have to be linear. For example: is a linear model, but: is not. Some relationships can be transformed to linearity—for example, can be linearized by taking logs. Linear models seem rather restrictive, but because the predictors can be transformed and combined in any way, they are actually very flexible. The term linear is often used in everyday speech as almost a synonym for simplicity. This gives the casual observer the impression that linear models can only handle small simple datasets. This is far from the truth—linear models can easily be expanded and modified to handle complex datasets. Linear is also used to refer to straight lines, but linear models can be curved. Truly nonlinear models are rarely absolutely necessary and most often arise from a theory about the relationships between the variables, rather than an empirical investigation. Estimation 13 2.2 Matrix Representation If we have a response Y and three predictors, X1, X2 and X3, the data might be presented Downloaded by [University of Toronto] at 16:20 23 May 2014 in tabular form like this: where n is the number of observations, or cases, in the dataset. Given the actual data values, we may write the model as: yi=!0+!1x1i+!2x2i+!3x3i+"i i=1,…, n but the use of subscripts becomes inconvenient and conceptually obscure. We will find it simpler both notationally and theoretically to use a matrix/vector representation. The regression equation is written as: y=X!+" where y=(y1,…, yn)T, "=("1,…, "n)T, !=(!0,…,!3)T and: The column of ones incorporates the intercept term. 
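For concreteness, the three-predictor model of this section can be written out as follows (standard notation, with the usual leading column of ones in the design matrix):

\[
y = X\beta + \varepsilon,
\qquad
X=\begin{pmatrix}
1 & x_{11} & x_{21} & x_{31}\\
1 & x_{12} & x_{22} & x_{32}\\
\vdots & \vdots & \vdots & \vdots\\
1 & x_{1n} & x_{2n} & x_{3n}
\end{pmatrix},
\]

so that row \(i\) reproduces \(y_i=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\beta_3 x_{3i}+\varepsilon_i\).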
One simple example is the null model where there is no predictor and just a mean y=!+": We can assume that E"=0 since if this were not so, we could simply absorb the nonzero expectation for the error into the mean ! to get a zero expectation. 2.3 Estimating ! The regression model, y=X!+", partitions the response into a systematic component X! and a random component " We would like to choose ! so that the systematic part explains 14 Linear Models with R as much of the response as possible. Geometrically speaking, the response lies in an Downloaded by [University of Toronto] at 16:20 23 May 2014 n-dimensional space, that is, while where p is the number of parameters. If we include the intercept then p is the number of predictors plus one. We will use this definition of p from now on. It is easy to get confused as to whether p is the number of predictors or parameters, as different authors use different conventions, so be careful. The problem is to find ! so that X! is as close to Y as possible. The best choice, the estimate , is apparent in the geometrical representation seen in Figure 2.1. is, in this sense, the best estimate of ! within the model space. The response predicted by the model is or Hy where H is an orthogonal projection matrix. The difference Figure 2.1 Geometrical representation of the estimation !. The data vector Y is projected orthogonally onto the model space spanned by X. The fit is represented by projection . data represented with the difference between the fit and the by the residual vector between the actual response and the predicted response is denoted by and is called the residuals. The conceptual purpose of the model is to represent, as accurately as possible, something complex, y, which is n-dimensional, in terms of something much simpler, the model, which is p-dimensional. Thus if our model is successful, the structure in the data should be captured in those p dimensions, leaving just random variation in the residuals which lie in an (n#p)-dimensional space. We have: Data=Systematic Structure+Random Variation n dimensions=p dimensions+(n#p)dimensions 2.4 Least Squares Estimation The estimation of ! can also be considered from a nongeometrical point of view. We might define the best estimate of ! as the one which minimizes the sum of the squared errors. The least squares estimate of !, called minimizes: Estimation 15 Downloaded by [University of Toronto] at 16:20 23 May 2014 Differentiating with respect to ! and setting to zero, we find that satisfies: These are called the normal equations. We can derive the same result using the geometrical approach. Now provided XTX is invertible: H=X(XTX)!1XT is called the hat-matrix and is the orthogonal projection of y onto the space spanned by X. H is useful for theoretical manipulations, but you usually do not want to compute it explicitly, as it is an n!n matrix which could be uncomfortably large for some datasets. The following useful quantities can now be represented using H. The predicted or fitted values are while the residuals are The residual sum of squares (RSS) is Later, we will show that the least squares estimate is the best possible estimate of ! when the errors " are uncorrelated and have equal variance or more briefly put var "=#2I. is unbiased and has variance (XTX)–1#2 provided var " = #2I. Since is a variance is a matrix. We also need to estimate #2. We find that estimator: vector, its which suggests the as an unbiased estimate of #2. n#p is the degrees of freedom of the model. 
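Collected in one place, the least squares quantities discussed in this section are (standard results, stated here for reference): \(\hat\beta\) minimizes the sum of squared errors, setting the derivative to zero gives the normal equations and their solution, and the fitted values, residuals and variance estimate follow.

\[
\mathrm{RSS}(\beta) = (y - X\beta)^{\top}(y - X\beta),\qquad
X^{\top}X\hat\beta = X^{\top}y,\qquad
\hat\beta = (X^{\top}X)^{-1}X^{\top}y,
\]
\[
\hat{y} = X\hat\beta = Hy,\quad H = X(X^{\top}X)^{-1}X^{\top},\qquad
\hat\varepsilon = y - \hat{y},\quad \mathrm{RSS} = \hat\varepsilon^{\top}\hat\varepsilon,\qquad
\hat\sigma^{2} = \frac{\mathrm{RSS}}{n-p}.
\]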
Sometimes you need the standard error for a particular component of which can be picked out as 16 Linear Models with R 2.5 Examples of Calculating In a few simple models, it is possible to derive explicit formulae for 1. When y = !+", X=1 and !=! so XTX=1T1=n so: Downloaded by [University of Toronto] at 16:20 23 May 2014 2. Simple linear regression (one predictor): We can now apply the formula but a simpler approach is to rewrite the equation as: so now: Next work through the rest of the calculation to reconstruct the familiar estimates, that is: In higher dimensions, it is usually not possible to find such explicit formulae for the parameter estimates unless XTX happens to be a simple form. So typically we need computers to fit such models. Regression has a long history, so in the time before computers became readily available, fitting even quite simple models was a tedious time consuming task. When computing was expensive, data analysis was limited. It was designed to keep calculations to a minimum and restrict the number of plots. This mindset remained in statistical practice for some time even after computing became widely and cheaply available. Now it is a simple matter to fit a multitude of models and make more plots than one could reasonably study. The challenge for the analyst is to choose among these intelligently to extract the crucial information in the data. Estimation 17 2.6 Gauss-Markov Theorem Downloaded by [University of Toronto] at 16:20 23 May 2014 is a plausible estimator, but there are alternatives. Nonetheless, there are three goodreasons to use least squares: 1. It results from an orthogonal projection onto the model space. It makes sense geometrically. 2. If the errors are independent and identically normally distributed, it is the maximum likelihood estimator. Loosely put, the maximum likelihood estimate is the value of ! that maximizes the probability of the data that was observed. 3. The Gauss-Markov theorem states that is the best linear unbiased estimate (BLUE). To understand the Gauss-Markov theorem we first need to understand the concept of an estimable function. A linear combination of the parameters only if there exists a linear combination aTy such that: is estimable if and Estimable functions include predictions of future observations, which explains why they are well worth considering. If X is of full rank, then all linear combinations are estimable. Suppose E"=0 and var "=#2I. Suppose also that the structural part of the model, EY=X! is correct. (Clearly these are big assumptions and so we will address the implications of this later.) be an estimable function; then the Gauss-Markov theorem states that in the class of all unbiased linear estimates of has the minimum variance and is unique. We prove this theorem. Suppose aTy is some unbiased estimate of cT! so that: which means that aTX=CT. This implies that c must be in the range space of XT which in turn implies that c is also in the range space of XTX which means there exists a $, such that c=XTX$ so: Now we can show that the least squares estimator has the minimum variance—pick an arbitrary estimate aTy and compute its variance: 18 Linear Models with R Downloaded by [University of Toronto] at 16:20 23 May 2014 but so Now since variances cannot be negative, we see that: In other words, has minimum variance. It now remains to show that it is unique. 
There will be equality in the above relationship if var (aTy#$TXTy)=0 which would require that aT#$TXT=0 which means that T T if a y=c So equality occurs only so the estimator is unique. This completes the proof. The Gauss-Markov theorem shows that the least squares estimate is a good choice, but it does require that the errors are uncorrelated and have unequal variance. Even if the errors behave, but are nonnormal, then nonlinear or biased estimates may work better. So this theorem does not tell one to use least squares all the time; it just strongly suggests it unless there is some strong reason to do otherwise. Situations where estimators other than ordinary least squares should be considered are: 1. When the errors are correlated or have unequal variance, generalized least squares should be used. See Section 6.1. 2. When the error distribution is long-tailed, then robust estimates might be used. Robust estimates are typically not linear in y. See Section 6.4. 3. When the predictors are highly correlated (collinear), then biased estimators such as ridge regression might be preferable. See Chapter 9. 2.7 Goodness of Fit It is useful to have some measure of how well the model fits the data. One common choice is R2, the so-called coefficient of determination or percentage of variance explained: Estimation 19 Downloaded by [University of Toronto] at 16:20 23 May 2014 Its range is 0"R2"1—values closer to 1 indicating better fits. For simple linear regression R2=r2 where r is the correlation between x and y. An equivalent definition Figure 2.2 Variation in the response y when x is known is denoted by dotted arrows while variation in y when x is unknown is shown with the solid arrows. The graphical intuition behind R2 is seen in Figure 2.2. Suppose you want to predict y. If you do not know x, then your best prediction is but the variability in this prediction is high. If you do know x, then your prediction will be given by the regression fit. This prediction will be less variable provided there is some relationship between x and y. R2 is one minus the ratio of the sum of squares for these two predictions. Thus for perfect predictions the ratio will be zero and R2 will be one. R2 as defined here does not make any sense if you do not have an intercept in your model. This is because the denominator in the definition of R2 has a null model with an intercept in mind when the sum of squares is calculated. Alternative definitions of R2 are possible when there is no intercept, but the same graphical intuition is not available and Downloaded by [University of Toronto] at 16:20 23 May 2014 20 Linear Models with R the R2s obtained in this way should not be compared to those for models with an intercept. Beware of high R2s reported from models without an intercept. What is a good value of R2? It depends on the area of application. In the biological and social sciences, variables tend to be more weakly correlated and there is a lot of noise. We would expect lower values for R2 in these areas—a value of 0.6 might be considered good. In physics and engineering, where most data come from closely controlled experiments, we expect to get much higher R2s and a value of 0.6 would be considered low. Of course, I generalize excessively here so some experience with the particular area is necessary for you to judge your R2s well. An alternative measure of fit is This quantity is directly related to the standard errors of estimates of ! and predictions. 
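Written out, the two measures of fit discussed here are (standard definitions):

\[
R^{2} \;=\; 1-\frac{\mathrm{RSS}}{\sum_i (y_i-\bar{y})^{2}},
\qquad
\hat\sigma \;=\; \sqrt{\frac{\mathrm{RSS}}{n-p}}.
\]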
The advantage is that is measured in the units of the response and so may be directly interpreted in the context of the particular dataset. This may also be a disadvantage in that one must understand whether the practical significance of this measure whereas R2, being unitless, is easy to understand. 2.8 Example Now let’s look at an example concerning the number of species of tortoise on the various Galápagos Islands. There are 30 cases (Islands) and seven variables in the dataset. We start by reading the data into R and examining it (remember you first need to load the book data with the library (faraway) command): > data (gala) > gala Species Baltra 58 Bartolome 31 . . . Endemics 23 21 Area 25.09 1.24 Elevation Nearest 346 0.6 109 0.6 Scruz 0.6 26.3 The variables are Species—the number of species of tortoise found on the island, Endemics—the number of endemic species, Area—the area of the island (km2), Elevation—the highest elevation of the island (m), Nearest—the distance from the nearest island (km), Scruz—the distance from Santa Cruz Island (km), Adjacent—the area of the adjacent island (km2). The data were presented by Johnson and Raven (1973) and also appear in Weisberg (1985). I have filled in some missing values for simplicity (see Chapter 12 for how this can be done). Fitting a linear model in R is done using the lm ( ) command. Notice the syntax for specifying the predictors in the model. This is part of the Wilkinson-Rogers notation. In this case, since all the variables are in the gala data frame, we must use the data=argument: > mdl < - lm (Species ˜ Area + Elevation + Nearest + Scruz + Adjacent, data=gala) > summary (mdl) Downloaded by [University of Toronto] at 16:20 23 May 2014 Estimation 21 Call: lm (formula = Species ˜ Area + Elevation + Nearest + Scruz + Adjacent, data = gala) Residuals: Min 1Q Median 3Q Max !111.68 !34.90 !7.86 33.46 182.58 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.06822 19.15420 0.37 0.7154 Area !0.02394 0.02242 !1.07 0.2963 Elevation 0.31946 0.05366 5.95 3.8e–06 Nearest 0.00914 1.05414 0.01 0.9932 Scruz !0.24052 0.21540 !1.12 0.2752 Adjacent !0.07480 0.01770 !4.23 0.0003 Residual standard error: 61 on 24 degrees of freedom Multiple R-Squared: 0.766, Adjusted R-squared: 0.717 F-statistic: 15.7 on 5 and 24 DF, p-value: 6.84e–07 We can identify several useful quantities in this output. Other statistical packages tend to produce output quite similar to this. One useful feature of R is that it is possible to directly calculate quantities of interest. Of course, it is not necessary here because the lm ( ) function does the job, but it is very useful when the statistic you want is not part of the prepackaged functions. First, we make the X-matrix: > x < - model.matrix ( ˜ Area + Elevation + Nearest + Scruz + Adjacent, gala) and here is the response y: > y < - gala$Species Now let’s construct . t ( ) does transpose and %*% does matrix multiplication. !1 solve (A) computes A while solve (A, b) solves Ax=b: > xtxi < - solve (t (x) %*% x) We can get directly, using (XTX)!1XTy: > xtxi %*% t (x) %*% y [,1] 1 7.068221 Area !0.023938 Elevation 0.319465 Nearest 0.009144 Scruz !0.240524 Adjacent !0.074805 22 Linear Models with R Downloaded by [University of Toronto] at 16:20 23 May 2014 This is a very bad way to compute . It is inefficient and can be very inaccurate when the predictors are strongly correlated. Such problems are exacerbated by large datasets. 
A better, but not perfect, way is: > solve (crossprod (x, x), crossprod (x, y)) [,1] 1 7.068221 Area !0.023938 Elevation 0.319465 Nearest 0.009144 Scruz !0.240524 Adjacent !0.074805 where crossprod (x, y) computes xT y. Here we get the same result as lm ( ) because the data are well-behaved. In the long run, you are advised to use carefully programmed code such as found in lm ( ). To see the full details, consult a text such as Thisted (1988). We can extract the regression quantities we need from the model object. Commonly used are residuals ( ), fitted ( ), deviance ( ) which gives the RSS, df . residual ( ) which gives the degrees of freedom and coef ( ) which gives the . You can also extract other needed quantities by examining the model object and its summary: > names (mdl) [1] "coefficients" "residuals" [4] "rank" "fitted.values" [7] "qr" "df.residual" [10] "call" "terms" > md1s < - summary (mdl) > names (mdls) [1] "call" "terms" [4] "coefficients" "aliased" [7] "df" "r.squared" [10] "fstatistic" "cov.unscaled" "effects" "assign" "xlevels" "model" "residuals" "sigma" "adj.r.squared" We can estimate # using the formula in the text above or extract it from the summary object: > sqrt (deviance (mdl) /df.residual (mdl)) [1] 60.975 > mdls$sigma [1] 60.975 We can also extract (XTX)–1and use it to compute the standard errors for the coefficients. (diag ( ) returns the diagonal of a matrix): > xtxi < - mdls$cov.unscaled > sqrt (diag (xtxi)) *60.975 (Intercept) Area Elevation 19.154139 0.022422 0.053663 Nearest 1.054133 Scruz 0.215402 Estimation 23 Adjacent 0.017700 or get them from the summary object: Downloaded by [University of Toronto] at 16:20 23 May 2014 > mdls$coef [,2] (Intercept) Area 19.154198 0.022422 Adjacent 0.017700 Elevation Nearest 0.053663 1.054136 Scruz 0.215402 Finally, we may compute or extract R2: > 1-deviance (mdl) /sum ((y-mean (y))^2) [1] 0.76585 > mdls$r.squared [1] 0.76585 2.9 Identifiability The least squares estimate is the solution to the normal equations: where X is an n!p matrix. If XTX is singular and cannot be inverted, then there will be infinitely many solutions to the normal equations and is at least partially unidentifiable. Unidentifiability will occur when X is not of full rank—when its columns are linearly dependent. With observational data, unidentifiability is usually caused by some oversight. Here are some examples: 1. A person’s weight is measured both in pounds and kilos and both variables are entered into the model. 2. For each individual we record the number of years of preuniversity education, the number of years of university education and also the total number of years of education and put all three variables into the model. 3. We have more variables than cases, that is, p>n. When p=n, we may perhaps estimate all the parameters, but with no degrees of freedom left to estimate any standard errors or do any testing. Such a model is called saturated. When p>n, then the model is called supersaturated. Oddly enough, such models are considered in large-scale screening experiments used in product design and manufacture, but there is no hope of uniquely estimating all the parameters in such a model. Such problems can be avoided by paying attention. Identifiability is more of an issue in designed experiments. Consider a simple two-sample experiment, where the treatment observations are y1,…, yn and the controls are yn+1,…, ym+n. Suppose we try to model the response by an overall mean ! 
and group effects %1 and %2 : Downloaded by [University of Toronto] at 16:20 23 May 2014 24 Linear Models with R Now although X has three columns, it has only rank 2—(!, %1, %2) are not identifiable and the normal equations have infinitely many solutions. We can solve this problem by imposing some constraints, !=0 or %1+%2=0, for example. Statistics packages handle nonidentifiability differently. In the regression case above, some may return error messages and some may fit models because rounding error may remove the exact identifiability. In other cases, constraints may be applied but these may be different from what you expect. By default, R fits the largest identifiable model by removing variables in the reverse order of appearance in the model formula. Here is an example. Suppose we create a new variable for the Galápagos dataset—the difference in area between the island and its nearest neighbor: > gala$Adiff < - gala$Area -gala$Adjacent and add that to the model: > g < - lm (Species ˜ Area+Elevation+Nearest+Scruz+Adjacent +Adiff, gala) > summary (g) Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 7.06822 19.15420 0.37 0.7154 Area !0.02394 0.02242 !1.07 0.2963 Elevation 0.31946 0.05366 5.95 3.8e–06 Nearest 0.00914 1.05414 0.01 0.9932 Scruz !0.24052 0.21540 !1.12 0.2752 Adjacent !0.07480 0.01770 !4.23 0.0003 Adiff NA NA NA NA Residual standard error: 61 on 24 degrees of freedom Multiple R-Squared: 0.766, Adjusted R-squared: 0.717 F-statistic: 15.7 on 5 and 24 DF, p-value: 6.84e–07 We get a message about one undefined coefficient because the rank of the design matrix X is six, which is less than its seven columns. In most cases, the cause of identifiability can be revealed with some thought about the variables, but, failing that, an eigendecomposition of XTX will reveal the linear combination(s) that gave rise to the unidentifiability. Estimation 25 Lack of identifiability is obviously a problem, but it is usually easy to identify and work around. More problematic are cases where we are close to unidentifiability. To demonstrate this, suppose we add a small random perturbation to the third decimal place of Adiff by adding a random variate from U [&0.005, 0.005] where U denotes the uniform distribution: > Adiffe < - gala$Adiff+0.001*(runif(30)-0.5) Downloaded by [University of Toronto] at 16:20 23 May 2014 and now refit the model: > g < - lm (Species ˜ Area+Elevation+Nearest+Scruz +Adjacent+Adiffe, gala) > summary (g) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.14e+00 1.956+01 0.37 0.72 Area !2.38e+04 4.70e+04 !0.51 0.62 Elevation 3.12e–01 5.67e–02 5.50 1.46–05 Nearest 1.38e–01 1.10e+00 0.13 0.90 Scruz !2.50e–01 2.206–01 !1.14 0.27 Adjacent 2.386+04 4.70e+04 0.51 0.62 Adiffe 2.386+04 4.70e+04 0.51 0.62 Residual standard error: 61.9 on 23 degrees of freedom Multiple R-Squared: 0.768, Adjusted R-squared: 0.708 F-statistic: 12.7 on 6 and 23 DF, p-value: 2.58e–06 Notice that now all parameters are estimated, but the standard errors are very large because we cannot estimate them in a stable way. We deliberately caused this problem so we know the cause but in general we need to be able to identify such situations. We do this in Section 5.3. Exercises 1. The dataset teengamb concerns a study of teenage gambling in Britain. Fit a regression model with the expenditure on gambling as the response and the sex, status, income and verbal score as predictors. Present the output. 
(a) What percentage of variation in the response is explained by these predictors?
(b) Which observation has the largest (positive) residual? Give the case number.
(c) Compute the mean and median of the residuals.
(d) Compute the correlation of the residuals with the fitted values.
(e) Compute the correlation of the residuals with the income.
(f) For all other predictors held constant, what would be the difference in predicted expenditure on gambling for a male compared to a female?

2. The dataset uswages is drawn as a sample from the Current Population Survey in 1988. Fit a model with weekly wages as the response and years of education and experience as predictors. Report and give a simple interpretation of the regression coefficient for years of education. Now fit the same model but with logged weekly wages. Give an interpretation of the regression coefficient for years of education. Which interpretation is more natural?

3. In this question, we investigate the relative merits of methods for computing the coefficients. Generate some artificial data by:

> x <- 1:20
> y <- x + rnorm(20)

Fit a polynomial in x for predicting y. Compute β̂ in two ways: by lm ( ) and by the direct calculation described in the chapter. At what degree of polynomial does the direct calculation method fail? (Note the need for the I ( ) function in fitting the polynomial, that is, lm(y ~ x + I(x^2)).)

4. The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the R^2. Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model, one at a time. For each model, record the residual standard error and the R^2. Plot the trends in these two statistics.

5. Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa on lcavol and of lcavol on lpsa. Display both regression lines on the plot. At what point do the two lines intersect?

CHAPTER 3
Inference

Until now, we have not found it necessary to assume any distributional form for the errors ε. However, if we want to make any confidence intervals or perform any hypothesis tests, we will need to do this. The common assumption is that the errors are normally distributed. In practice, this is often, although not always, a reasonable assumption. We have already assumed that the errors are independent and identically distributed (i.i.d.) with mean 0 and variance σ^2, so we have ε ~ N(0, σ^2 I). Now since y = Xβ + ε, we have

y ~ N(Xβ, σ^2 I),

which is a compact description of the regression model. From this we find, using the fact that linear combinations of normally distributed values are also normal, that:

β̂ ~ N(β, (X^T X)^{-1} σ^2).
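This sampling distribution can be checked by simulation. A minimal sketch follows (the design, coefficients and error variance are invented for illustration, not taken from the text):

# Sketch: simulate many datasets from a fixed design and compare the
# empirical covariance of the estimates with sigma^2 * (X'X)^{-1}.
set.seed(2)
n <- 50
X <- cbind(1, runif(n), runif(n))
beta <- c(1, 2, -1)
sigma <- 0.5
bhat <- t(replicate(2000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  drop(solve(crossprod(X), crossprod(X, y)))  # least squares, as earlier
}))
colMeans(bhat)               # close to beta (unbiasedness)
cov(bhat)                    # close to the theoretical covariance below
sigma^2 * solve(crossprod(X))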
3.1 Hypothesis Tests to Compare Models

Given several predictors for a response, we might wonder whether all are needed. Consider a larger model, Ω, and a smaller model, ω, which consists of a subset of the predictors that are in Ω. If there is not much difference in the fit, we would prefer the smaller model on the principle that simpler explanations are preferred. On the other hand, if the fit of the larger model is appreciably better, we will prefer it. We will take ω to represent the null hypothesis and Ω to represent the alternative. A geometrical view of the problem may be seen in Figure 3.1.

Figure 3.1 Geometrical view of the comparison between the big model, Ω, and the small model, ω. The squared length of the residual vector for the big model is RSS_Ω, while that for the small model is RSS_ω. By Pythagoras' theorem, the squared length of the vector connecting the two fits is RSS_ω − RSS_Ω. A small value for this indicates that the small model fits almost as well as the large model and thus might be preferred due to its simplicity.

If RSS_ω − RSS_Ω is small, then the fit of the smaller model is almost as good as that of the larger model, and so we would prefer the smaller model on the grounds of simplicity. On the other hand, if the difference is large, then the superior fit of the larger model would be preferred. This suggests that something like

(RSS_ω − RSS_Ω) / RSS_Ω

would be a potentially good test statistic, where the denominator is used for scaling purposes.

As it happens, the same test statistic arises from the likelihood-ratio testing approach. We give an outline of the development. If L(β, σ | y) is the likelihood function, then the likelihood-ratio statistic is

max_{β,σ ∈ Ω} L(β, σ | y) / max_{β,σ ∈ ω} L(β, σ | y).

The test should reject if this ratio is too large. Working through the details, we find that the maximized likelihood for each model depends on the data only through its residual sum of squares, which after some manipulation gives us a test that rejects if

(RSS_ω − RSS_Ω) / RSS_Ω > a constant,

which is the same statistic suggested by the geometrical view. It remains for us to discover the null distribution of this statistic. Now suppose that the dimension (or number of parameters) of Ω is p and the dimension of ω is q; then

F = [(RSS_ω − RSS_Ω)/(p − q)] / [RSS_Ω/(n − p)] ~ F_{p−q, n−p}.

Details of the derivation of this statistic may be found in more theoretically oriented texts such as Sen and Srivastava (1990). Thus we would reject the null hypothesis if F > F^{(α)}_{p−q, n−p}. The degrees of freedom of a model are (usually) the number of observations minus the number of parameters, so this test statistic can also be written

F = [(RSS_ω − RSS_Ω)/(df_ω − df_Ω)] / [RSS_Ω/df_Ω],

where df_Ω = n − p and df_ω = n − q. The same test statistic applies not just when ω is a subset of Ω, but also to a subspace. This test is very widely used in regression and analysis of variance. When it is applied in different situations, the form of the test statistic may be reexpressed in various different ways. The beauty of this approach is that you only need to know the general form. In any particular case, you just need to figure out which models represent the null and alternative hypotheses, fit them and compute the test statistic. It is very versatile.

3.2 Testing Examples

Test of all the predictors

Are any of the predictors useful in predicting the response? Let the full model (Ω) be y = Xβ + ε, where X is a full-rank n × p matrix, and the reduced model (ω) be y = μ + ε. We would estimate μ by ȳ. We write the null hypothesis as

H0: β1 = … = β_{p−1} = 0.

Now RSS_Ω = (y − Xβ̂)^T (y − Xβ̂), the residual sum of squares for the full model, while RSS_ω = (y − ȳ)^T (y − ȳ) = TSS, which is sometimes known as the sum of squares corrected for the mean. So in this case

F = [(TSS − RSS)/(p − 1)] / [RSS/(n − p)].

We would now refer to F_{p−1, n−p} for a critical value or a p-value. Large values of F would indicate rejection of the null. Traditionally, the information in the above test is presented in an analysis of variance table. Most computer packages produce a variant on this. See Table 3.1 for the layout. It is not really necessary to specifically compute all the elements of the table. As the originator of the table, Fisher said in 1931 that it is "nothing but a convenient way of arranging the arithmetic." Since he had to do his calculations by hand, the table served a necessary purpose, but it is not essential now.
Source        Deg. of Freedom   Sum of Squares   Mean Square       F
Regression    p − 1             SSreg            SSreg/(p − 1)     F
Residual      n − p             RSS              RSS/(n − p)
Total         n − 1             TSS

Table 3.1 Analysis of variance table.

A failure to reject the null hypothesis is not the end of the game; you must still investigate the possibility of nonlinear transformations of the variables and of outliers, which may obscure the relationship. Even then, you may just have insufficient data to demonstrate a real effect, which is why we must be careful to say "fail to reject" the null rather than "accept" the null. It would be a mistake to conclude that no real relationship exists. This issue arises when a pharmaceutical company wishes to show that a proposed generic replacement for a brand-named drug is equivalent. It would not be enough in this instance just to fail to reject the null. A higher standard would be required.

When the null is rejected, this does not imply that the alternative model is the best model. We do not know whether all the predictors are required to predict the response or just some of them. Other predictors might also be added, or existing predictors transformed or recombined. Either way, the overall F-test is just the beginning of an analysis and not the end.

Let's illustrate this test and others using an old economic dataset on 50 different countries. These data are averages from 1960 to 1970 (to remove business cycle or other short-term fluctuations). dpi is per capita disposable income in U.S. dollars; ddpi is the percentage rate of change in per capita disposable income; sr is aggregate personal saving divided by disposable income. The percentage of population under 15 (pop15) and over 75 (pop75) is also recorded. The data come from Belsley, Kuh, and Welsch (1980). Take a look at the data:

> data(savings)
> savings
                 sr pop15 pop75     dpi ddpi
Australia     11.43 29.35  2.87 2329.68 2.87
Austria       12.07 23.32  4.41 1507.99 3.93
...cases deleted...
Malaysia       4.71 47.20  0.66  242.69 5.08

First, consider a model with all the predictors:

> g <- lm (sr ~ pop15 + pop75 + dpi + ddpi, savings)
> summary (g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.566087   7.354516    3.88  0.00033
pop15       -0.461193   0.144642   -3.19  0.00260
pop75       -1.691498   1.083599   -1.56  0.12553
dpi         -0.000337   0.000931   -0.36  0.71917
ddpi         0.409695   0.196197    2.09  0.04247

Residual standard error: 3.8 on 45 degrees of freedom
Multiple R-Squared: 0.338, Adjusted R-squared: 0.28
F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079

We can see directly the result of the test of whether any of the predictors have significance in the model, in other words, whether β1 = β2 = β3 = β4 = 0. Since the p-value, 0.00079, is so small, this null hypothesis is rejected. We can also do it directly using the F-testing formula:

> (tss <- sum((savings$sr - mean (savings$sr))^2))
[1] 983.63
> (rss <- deviance(g))
[1] 650.71
> df.residual(g)
[1] 45
> (fstat <- ((tss - rss)/4) / (rss/df.residual(g)))
[1] 5.7557
> 1 - pf (fstat, 4, df.residual (g))
[1] 0.00079038

Verify that the numbers match the regression summary above.
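Equivalently, since the overall test is just a comparison of the intercept-only model with the full model, it can be obtained as a nested-model comparison. A brief sketch using anova ( ), which also appears below:

# Sketch: the test of all the predictors as an explicit comparison of
# nested models; the F and p-value match the summary output above.
g0 <- lm (sr ~ 1, savings)   # reduced model y = mu + e
anova (g0, g)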
Testing just one predictor

Can one particular predictor be dropped from the model? The null hypothesis would be H0: βi = 0. Let RSS_Ω be the RSS for the model with all the predictors of interest, which has p parameters, and let RSS_ω be the RSS for the model with all the same predictors except predictor i. The F-statistic may be computed using the standard formula. An alternative approach is to use a t-statistic for testing the hypothesis:

t_i = β̂_i / se(β̂_i),

and check for significance using a t-distribution with n − p degrees of freedom. However, t_i^2 is exactly the F-statistic here, so the two approaches are numerically identical. The latter is less work and is presented in typical regression outputs. For example, to test the null hypothesis that β1 = 0 (that pop15 is not significant in the full model), we can simply observe that the p-value is 0.0026 from the table and conclude that the null should be rejected.

Let's do the same test using the general F-testing approach. We will need the RSS and df for the full model, which are 650.71 and 45, respectively. We now fit the model that represents the null:

> g2 <- lm (sr ~ pop75 + dpi + ddpi, savings)

and compute the RSS and the F-statistic:

> (rss2 <- deviance (g2))
[1] 797.72
> (fstat <- (deviance (g2) - deviance (g)) / (deviance (g)/df.residual(g)))
[1] 10.167

The p-value is then:

> 1 - pf (fstat, 1, df.residual(g))
[1] 0.002603

We can relate this to the t-based test and p-value by:

> sqrt (fstat)
[1] 3.1885
> (tstat <- summary(g)$coef[2, 3])
[1] -3.1885
> 2 * (1 - pt (sqrt (fstat), 45))
[1] 0.002603

A more convenient way to compare two nested models is:

> anova (g2, g)
Analysis of Variance Table
Model 1: sr ~ pop75 + dpi + ddpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        798
2     45        651  1    147    10.2 0.0026

Understand that this test of pop15 is relative to the other predictors in the model, namely, pop75, dpi and ddpi. If these other predictors were changed, the result of the test may be different. This means that it is not possible to look at the effect of pop15 in isolation. Simply stating the null hypothesis as H0: β_pop15 = 0 is insufficient; information about what other predictors are included in the null is necessary. The result of the test may be different if the predictors change.

Testing a pair of predictors

Suppose we wish to test the significance of variables Xj and Xk. We might construct a table as seen above and find that both variables have p-values greater than 0.05, thus indicating that individually each one is not significant. Does this mean that both Xj and Xk can be eliminated from the model? Not necessarily. Except in special circumstances, dropping one variable from a regression model causes the estimates of the other parameters to change, so that we might find that after dropping Xj, a test of the significance of Xk shows that it should now be included in the model. If you really want to check the joint significance of Xj and Xk, you should fit a model with and then without them and use the general F-test discussed above. Remember that even the result of this test may depend on what other predictors are in the model.

We test the hypothesis that both pop75 and ddpi may be excluded from the model:

> g3 <- lm (sr ~ pop15 + dpi, savings)
> anova (g3, g)
Analysis of Variance Table
Model 1: sr ~ pop15 + dpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     47 744.12
2     45 650.71  2     93.41 3.2299 0.04889

We see that the pair of predictors is just barely significant at the 5% level. Tests of more than two predictors may be performed in a similar way by comparing the appropriate models.
Testing a subspace

Consider this example. Suppose that y is the first-year grade point average for a student, Xj is the score on the quantitative part of a standardized test and Xk is the score on the verbal part. There might also be some other predictors. We might wonder whether we need two separate scores; perhaps they can be replaced by the total, Xj + Xk. So if the original model was

y = β0 + … + βj Xj + βk Xk + … + ε,

then the reduced model is

y = β0 + … + βl (Xj + Xk) + … + ε,

which requires that βj = βk for this reduction to be possible. So the null hypothesis is

H0: βj = βk.

This defines a linear subspace to which the general F-testing procedure applies. In our example, we might hypothesize that the effect of young and old people on the savings rate was the same, or in other words that

H0: β_pop15 = β_pop75.

In this case the null model would take the form

y = β0 + β_dep (pop15 + pop75) + β_dpi dpi + β_ddpi ddpi + ε.

We can then compare this to the full model as follows:

> g <- lm (sr ~ ., savings)
> gr <- lm (sr ~ I (pop15 + pop75) + dpi + ddpi, savings)
> anova (gr, g)
Analysis of Variance Table
Model 1: sr ~ I (pop15 + pop75) + dpi + ddpi
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1     46        674
2     45        651  1     23    1.58   0.21

The period in the first model formula is shorthand for all the other variables in the data frame. The function I ( ) ensures that the argument is evaluated rather than interpreted as part of the model formula. The p-value of 0.21 indicates that the null cannot be rejected here, meaning that there is no evidence that young and old people need to be treated separately in the context of this particular model.

Suppose we want to test whether one of the coefficients can be set to a particular value. For example:

H0: β_ddpi = 0.5.

Here the null model would take the form

y = β0 + β_pop15 pop15 + β_pop75 pop75 + β_dpi dpi + 0.5 ddpi + ε.

Notice that there is now a fixed coefficient on the ddpi term. Such a fixed term in the regression equation is called an offset. We fit this model and compare it to the full:

> gr <- lm (sr ~ pop15 + pop75 + dpi + offset(0.5*ddpi), savings)
> anova (gr, g)
Analysis of Variance Table
Model 1: sr ~ pop15 + pop75 + dpi + offset(0.5 * ddpi)
Model 2: sr ~ pop15 + pop75 + dpi + ddpi
  Res.Df RSS Df Sum of Sq    F Pr(>F)
1     46 654
2     45 651  1         3 0.21   0.65

We see that the p-value is large and the null hypothesis here is not rejected. A simpler way to test such point hypotheses is to use a t-statistic:

t = (β̂ − c) / se(β̂),

where c is the point hypothesis. So in our example the statistic and corresponding p-value are:

> (tstat <- (0.409695 - 0.5)/0.196197)
[1] -0.46028
> 2*pt (tstat, 45)
[1] 0.64753

We can see the p-value is the same as before, and if we square the t-statistic

> tstat^2
[1] 0.21186

we find we get the same F-value as above. This latter approach is preferred in practice since we do not need to fit two models, but it is important to understand that it is equivalent to the result obtained using the general F-testing approach.
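The same point-hypothesis t-test can be written directly in terms of the fitted object, avoiding the retyping of rounded numbers. A small sketch (vcov ( ) returns the estimated covariance matrix of the coefficients):

# Sketch: t-test of H0: beta_ddpi = 0.5 computed from the fitted model g;
# equivalent to the offset-based comparison above.
tstat <- (coef(g)["ddpi"] - 0.5) / sqrt(vcov(g)["ddpi", "ddpi"])
2 * pt(-abs(tstat), df.residual(g))  # about 0.65, as above
tstat^2                              # about 0.21, the F-value above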
Can we test a hypothesis such as H0: βj βk = 1 using our general theory? No. This hypothesis is not linear in the parameters, so we cannot use our general method. We would need to fit a nonlinear model, and that lies beyond the scope of this book.

3.3 Permutation Tests

We can put a different interpretation on the hypothesis tests we are making. For the Galápagos dataset, we might suppose that if the number of species had no relation to the five geographic variables, then the observed response values would be randomly distributed between the islands without relation to the predictors. The F-statistic is a good measure of the association between the predictors and the response, with larger values indicating stronger associations. We might then ask what the chance would be, under this assumption, that an F-statistic would be observed as large as or larger than the one we actually observed. We could compute this exactly by computing the F-statistic for all possible (30!) permutations of the response variable and seeing what proportion exceed the observed F-statistic. This is a permutation test. If the observed proportion is small, then we must reject the contention that the response is unrelated to the predictors. Curiously, this proportion is estimated by the p-value calculated in the usual way based on the assumption of normal errors, thus saving us from the massive task of actually computing the regression for all those permutations. See Freedman and Lane (1983) for a discussion of these matters.

Let's see how we can apply the permutation test to the savings data. I chose a model with just pop75 and dpi so as to get a p-value for the F-statistic that is not too small (and therefore less interesting):

> g <- lm (sr ~ pop75 + dpi, savings)
> summary (g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.056619   1.290435    5.47  1.7e-06
pop75        1.304965   0.777533    1.68     0.10
dpi         -0.000341   0.001013   -0.34     0.74

Residual standard error: 4.33 on 47 degrees of freedom
Multiple R-Squared: 0.102, Adjusted R-squared: 0.0642
F-statistic: 2.68 on 2 and 47 DF, p-value: 0.079

We can extract the F-statistic as:

> gs <- summary (g)
> gs$fstat
 value numdf dendf
2.6796 2.0000 47.0000

The function sample ( ) generates random permutations. We compute the F-statistic for 4000 randomly selected permutations and see what proportion exceeds the F-statistic for the original data:

> fstats <- numeric(4000)
> for (i in 1:4000) {
+   ge <- lm (sample(sr) ~ pop75 + dpi, data=savings)
+   fstats[i] <- summary(ge)$fstat[1]
+ }
> length(fstats[fstats > 2.6796])/4000
[1] 0.07425

This should take less than a minute on any relatively new computer. If you repeat this, you will get a slightly different result each time because of the random selection of the permutations. So our estimated p-value using the permutation test is 0.07425, which is close to the normal-theory-based value of 0.0791. We could reduce the variability in the estimation of the p-value simply by computing more random permutations. Since the permutation test does not depend on the assumption of normality, we might regard it as superior to the normal-theory-based value. It does take longer to compute, so we might use the normal inference, secure in the knowledge that the results can also be justified with an alternative argument.

Tests involving just one predictor also fall within the permutation test framework. We permute that predictor rather than the response. Let's test the pop75 predictor in the model.
We can extract the t-statistic as:

> summary(g)$coef[2, 3]
[1] 1.6783

Now we perform 4000 permutations of pop75 and check what fraction of the t-statistics exceeds 1.6783 in absolute value:

> tstats <- numeric(4000)
> for (i in 1:4000) {
+   ge <- lm (sr ~ sample(pop75) + dpi, savings)
+   tstats[i] <- summary(ge)$coef[2, 3]
+ }
> mean (abs (tstats) > 1.6783)
[1] 0.10475

The outcome is very similar to the observed normal-based p-value of 0.10.

3.4 Confidence Intervals for β

Confidence intervals (CIs) provide an alternative way of expressing the uncertainty in our estimates. They are linked to the tests that we have already constructed. For the CIs and regions that we consider here, the following relationship holds: for a 100(1 − α)% confidence region, any point that lies within the region represents a null hypothesis that would not be rejected at the 100α% level, while every point outside represents a null hypothesis that would be rejected. So, in one sense, the confidence region provides a lot more information than a single hypothesis test in that it tells us the outcome of a whole range of hypotheses about the parameter values. Of course, by selecting the particular level of confidence for the region, we can only make tests at that level and we cannot determine the p-value for any given test simply from the region. However, since it is dangerous to read too much into the relative size of p-values (as far as how much evidence they provide against the null), this loss is not particularly important. The confidence region tells us about plausible values for the parameters in a way that the hypothesis test cannot. This makes it more valuable.

As with testing, we must decide whether to form confidence regions for parameters individually or simultaneously. Simultaneous regions are preferable, but for more than two dimensions they are difficult to display and so there is still some value in computing the one-dimensional CIs. We can consider each parameter individually, which leads to CIs taking the general form

estimate ± critical value × SE of estimate,

or specifically in this case:

β̂_i ± t_{n−p}^{(α/2)} se(β̂_i).

Alternatively, a 100(1 − α)% confidence region for β satisfies

(β̂ − β)^T X^T X (β̂ − β) ≤ p σ̂^2 F_{p, n−p}^{(α)}.

These regions are ellipsoidally shaped. Because these ellipsoids live in higher dimensions, they cannot easily be visualized except in the two-dimensional case. It is better to consider the joint CIs when possible, especially when the β̂ are heavily correlated.

Consider the full model for the savings data:

> g <- lm (sr ~ ., savings)

where the output is displayed on page 28. We can construct individual 95% CIs for β_pop75:

> qt (0.975, 45)
[1] 2.0141
> c (-1.69 - 2.01*1.08, -1.69 + 2.01*1.08)
[1] -3.8608  0.4808

CIs have a duality with two-sided hypothesis tests, as we mentioned above. Thus the interval contains zero, which indicates that the null hypothesis H0: β_pop75 = 0 would not be rejected at the 5% level. We can see from the summary that the p-value is 12.5%, greater than 5%, confirming this point. The CI for β_ddpi is:

> c (0.41 - 2.01*0.196, 0.41 + 2.01*0.196)
[1] 0.01604 0.80396

Because zero is not in this interval, the null hypothesis H0: β_ddpi = 0 is rejected. Nevertheless, this CI is relatively wide in the sense that the upper limit is about 50 times larger than the lower limit. This means that we are not really that confident about what the exact effect of growth on savings really is, even though it is statistically significant.
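The same interval can be computed from the fitted object without retyping rounded numbers. A sketch, equivalent to the hand calculation above and to confint ( ), which is used next:

# Sketch: 95% CI for beta_ddpi taken directly from the model summary.
est <- coef(g)["ddpi"]
se  <- summary(g)$coef["ddpi", "Std. Error"]
est + c(-1, 1) * qt(0.975, df.residual(g)) * se  # roughly 0.015 to 0.805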
A convenient way to obtain all the univariate intervals is:

> confint (g)
                  2.5 %      97.5 %
(Intercept) 13.7533307  43.3788424
pop15       -0.7525175  -0.1698688
pop75       -3.8739780   0.4909826
dpi         -0.0022122   0.0015384
ddpi         0.0145336   0.8048562

Now we construct the joint 95% confidence region for β_pop15 and β_pop75. First, we load a package for drawing confidence ellipses (which is not part of base R and so may need to be downloaded):

> library(ellipse)

and now the plot:

> plot(ellipse(g, c(2,3)), type="l", xlim=c(-1,0))

We add the origin and the point of the estimates:

> points(0,0)
> points (coef (g) [2], coef (g) [3], pch=18)

Since the origin lies outside the ellipse, we reject the hypothesis that β_pop15 = β_pop75 = 0. We mark the one-way CIs on the plot for reference:

> abline (v=confint (g) [2,], lty=2)
> abline (h=confint (g) [3,], lty=2)

See the plot in Figure 3.2. Notice that these lines are not tangential to the ellipse. If the lines were moved out so that they enclosed the ellipse exactly, the CIs would be jointly correct. In some circumstances, the origin could lie within both one-way CIs, but lie outside the ellipse. In this case, both one-at-a-time tests would not reject the null whereas the joint test would. The latter test would be preferred. It is also possible for the origin to lie outside the rectangle but inside the ellipse. In this case, the joint test would not reject the null whereas both one-at-a-time tests would reject. Again we prefer the joint test result.

Figure 3.2 Confidence ellipse and regions for β_pop75 and β_pop15.

Examine the correlation of the two predictors:

> cor (savings$pop15, savings$pop75)
[1] -0.90848

However, from the plot, we see that the coefficients have a positive correlation. We can confirm this from:

> summary (g, corr=TRUE)$corr
            (Intercept)    pop15    pop75      dpi     ddpi
(Intercept)     1.00000 -0.98416 -0.80911 -0.16588 -0.18827
pop15          -0.98416  1.00000  0.76536  0.17991  0.10247
pop75          -0.80911  0.76536  1.00000 -0.36705 -0.05472
dpi            -0.16588  0.17991 -0.36705  1.00000  0.25548
ddpi           -0.18827  0.10247 -0.05472  0.25548  1.00000

where we see that the correlation is 0.765. The correlation between predictors and the correlation between the coefficients of those predictors are often different in sign. Loosely speaking, two positively correlated predictors will attempt to perform the same job of explanation. The more work one does, the less the other needs to do, and hence there is a negative correlation in the coefficients. For negatively correlated predictors, as seen here, the effect is reversed.
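Returning to the ellipse above: the joint hypothesis it depicts can also be checked with the general F-test of Section 3.1. A brief sketch (the model name g4 is chosen here to avoid clashing with the earlier fits):

# Sketch: the joint test that both pop15 and pop75 can be dropped,
# i.e. H0: beta_pop15 = beta_pop75 = 0, as a nested-model comparison.
g4 <- lm (sr ~ dpi + ddpi, savings)
anova (g4, g)   # a p-value below 0.05 agrees with the origin lying outside the ellipse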
3.5 Confidence Intervals for Predictions

Given a new set of predictors, x0, the predicted response is ŷ0 = x0^T β̂. However, we need to assess the uncertainty in this prediction. Decision makers need more than just a point estimate to make rational choices. If the prediction has a wide CI, we need to allow for outcomes far from the point estimate. For example, suppose we need to predict the high water mark of a river. We may need to construct barriers high enough to withstand floods much higher than the predicted maximum. We must also distinguish between predictions of the future mean response and predictions of future observations. To make the distinction clear, suppose we have built a regression model that predicts the selling price of homes in a given area based on predictors such as the number of bedrooms and closeness to a major highway. There are two kinds of predictions that can be made for a given x0:

1. Suppose a specific house comes on the market with characteristics x0. Its selling price will be x0^T β + ε. Since Eε = 0, the predicted price is x0^T β̂, but in assessing the variance of this prediction, we must include the variance of ε.
2. Suppose we ask the question: "What would a house with characteristics x0 sell for on average?" This selling price is x0^T β and is again predicted by x0^T β̂, but now only the variance in β̂ needs to be taken into account.

Most times, we will want the first case, which is called "prediction of a future value," while the second case, called "prediction of the mean response," is less commonly required.

A future observation is predicted to be x0^T β̂ + ε, where we do not know what the future ε will be and we can reasonably assume it to be independent of β̂. So

var(x0^T β̂ + ε) = x0^T (X^T X)^{-1} x0 σ^2 + σ^2,

and a 100(1 − α)% CI for a single future response is

ŷ0 ± t_{n−p}^{(α/2)} σ̂ sqrt(1 + x0^T (X^T X)^{-1} x0).

If, on the other hand, you want a CI for the mean response for a given x0, use

ŷ0 ± t_{n−p}^{(α/2)} σ̂ sqrt(x0^T (X^T X)^{-1} x0).

We return to the Galápagos data for this example, refitting the model and setting up a new vector of predictor values (with a leading 1 for the intercept):

> g <- lm (Species ~ Area + Elevation + Nearest + Scruz + Adjacent, gala)
> x0 <- c (1, 0.08, 93, 6.0, 12.0, 0.34)
> (y0 <- sum (x0*coef(g)))
[1] 33.92

This is the predicted number of species. If a whole number is preferred, we could round up to 34. Now if we want a 95% CI for the prediction, we must decide whether we are predicting the number of species on one new island or the mean response for all islands with the same predictors x0. Suppose that an island was not surveyed for the original dataset; the former interval would be the one we want. For this dataset, the latter interval would be more valuable for "what if" types of calculation.

First, we need the t-critical value:

> qt (0.975, 24)
[1] 2.0639

We calculate the (X^T X)^{-1} matrix:

> x <- model.matrix (g)
> xtxi <- solve (t (x) %*% x)

The half-width of the mean response CI is then:

> (bm <- sqrt (x0 %*% xtxi %*% x0) * 2.064 * 60.98)
      [,1]
[1,] 32.89

and the interval is:

> c (y0 - bm, y0 + bm)
[1]  1.0296 66.8097

Now we compute the prediction interval for the single future response:

> bm <- sqrt (1 + x0 %*% xtxi %*% x0) * 2.064 * 60.98
> c (y0 - bm, y0 + bm)
[1] -96.17 164.01

Of course, the number of species cannot be negative. In such instances, impossible values in the CI can be avoided by transforming the response, say by taking logs (explained in a later chapter), or by using a probability model more appropriate to the response. The normal distribution is supported on the whole real line, so negative values are always possible. A better choice for this example might be the Poisson distribution, which is supported on the nonnegative integers.
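The two formulas above can be wrapped in a small helper so the arithmetic does not have to be repeated by hand. This is a sketch only (pred_ci is an invented name; it simply reproduces the hand calculation):

# Sketch: confidence/prediction interval for a new x0 from a fitted lm.
pred_ci <- function(fit, x0, level = 0.95, type = c("mean", "prediction")) {
  type <- match.arg(type)
  xtxi <- summary(fit)$cov.unscaled          # (X'X)^{-1}, as used earlier
  s    <- summary(fit)$sigma
  yhat <- sum(x0 * coef(fit))
  half <- qt(1 - (1 - level)/2, df.residual(fit)) * s *
          sqrt((type == "prediction") + drop(t(x0) %*% xtxi %*% x0))
  c(fit = yhat, lwr = yhat - half, upr = yhat + half)
}
pred_ci(g, x0)                        # mean response interval, as above
pred_ci(g, x0, type = "prediction")   # single future observation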
There is a more direct method for computing the CI. The function predict ( ) requires that its second argument be a data frame with variables named in the same way as in the original dataset:

> x0 <- data.frame (Area=0.08, Elevation=93, Nearest=6.0, Scruz=12, Adjacent=0.34)
> str (predict (g, x0, se=TRUE))
List of 4
 $ fit           : num 33.9
 $ se.fit        : num 15.9
 $ df            : int 24
 $ residual.scale: num 61
> predict (g, x0, interval="confidence")
       fit    lwr    upr
[1,] 33.92 1.0338 66.806
> predict (g, x0, interval="prediction")
       fit     lwr    upr
[1,] 33.92 -96.153 163.99

Extrapolation occurs when we try to predict the response for values of the predictors which lie outside the range of the original data. There are two different types of extrapolation:

1. Quantitative extrapolation: We must check whether the new x0 is within the range of the original data. If not, the prediction may be unrealistic. CIs for predictions get wider as we move away from the data. We can compute these bands for the Galápagos model where we vary the Nearest variable while holding the other predictors fixed:

> grid <- seq (0, 100, 1)
> p <- predict (g, data.frame (Area=0.08, Elevation=93, Nearest=grid, Scruz=12, Adjacent=0.34), se=T, interval="confidence")
> matplot (grid, p$fit, lty=c(1,2,2), type="l", xlab="Nearest", ylab="Species")
> rug (gala$Nearest)

We see that the confidence bands in Figure 3.3 become wider as we move away from the range of the data. However, this widening does not reflect the possibility that the structure of the model may change as we move into new territory. The uncertainty in the parametric estimates is allowed for, but not the uncertainty about the model itself. The relationship may become nonlinear outside the range of the data, and we have no way of knowing.

Figure 3.3 Predicted Species over a range of values of Nearest, with 95% pointwise confidence bands for the mean response shown as dotted lines. A "rug" shows the location of the observed values of Nearest.

2. Qualitative extrapolation: Is the new x0 drawn from the same population from which the original sample was drawn? If the model was built in the past and is to be used for future predictions, we must make a difficult judgment as to whether conditions have remained constant enough for this to work.

3.6 Designed Experiments

In a designed experiment, the user has some control over X. For example, suppose we wish to compare several physical exercise regimes. The experimental units are the people we use for the study. We can choose some of the predictors, such as the amount of time spent exercising or the type of equipment used. Some other predictors might not be controlled, but can be measured, such as the physical characteristics of the people. Other variables, such as the temperature in the room, might be held constant. Our control over the conditions of the experiment allows us to make stronger conclusions from the analysis. Two important design features are orthogonality and randomization.

Orthogonality is a useful property because it allows us to more easily interpret the effect of one predictor without regard to another. Suppose we can partition X in two, X = [X1 X2], such that X1^T X2 = 0. So now

y = Xβ + ε = X1 β1 + X2 β2 + ε

and

X^T X = [ X1^T X1   0 ; 0   X2^T X2 ],

which means

β̂1 = (X1^T X1)^{-1} X1^T y,   β̂2 = (X2^T X2)^{-1} X2^T y.

Notice that β̂1 will be the same regardless of whether X2 is in the model or not (and vice versa), so we can interpret the effect of X1 without a concern for X2. Unfortunately, the decoupling is not perfect, for suppose we wish to test H0: β1 = 0. We have that σ̂^2 = RSS_Ω/df_Ω will be different depending on whether X2 is included in the model or not, but the difference in F is not liable to be as large as in nonorthogonal cases.
Orthogonality is a desirable property, but it will only occur when X is chosen by the experimenter. It is a feature of a good design. In observational data, we do not have direct control over X, and this is the source of many of the interpretational difficulties associated with nonexperimental data.

Here is an example of an experiment to determine the effects of column temperature, gas/liquid ratio and packing height in reducing the unpleasant odor of a chemical product that was sold for household use. Read the data in and display:

> data (odor)
> odor
   odor temp gas pack
1    66   -1  -1    0
2    39    1  -1    0
3    43   -1   1    0
4    49    1   1    0
5    58   -1   0   -1
6    17    1   0   -1
7    -5   -1   0    1
8   -40    1   0    1
9    65    0  -1   -1
10    7    0   1   -1
11   43    0  -1    1
12  -22    0   1    1
13  -31    0   0    0
14  -35    0   0    0
15  -26    0   0    0

The three predictors have been transformed from their original scale of measurement; for example, temp = (Fahrenheit − 80)/40, so the original values of the predictor were 40, 80 and 120. The data are presented in John (1971) and give an example of a central composite design. Here is the X-matrix:

> x <- as.matrix (cbind (1, odor[,-1]))

and X^T X:

> t (x) %*% x
      1 temp gas pack
1    15    0   0    0
temp  0    8   0    0
gas   0    0   8    0
pack  0    0   0    8

The matrix is diagonal. Even if temp were measured in the original Fahrenheit scale, the matrix would still be diagonal, but the entry in the matrix corresponding to temp would change. Now fit a model, while asking for the correlation of the coefficients:

> g <- lm (odor ~ temp + gas + pack, odor)
> summary (g, cor=T)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     15.2        9.3    1.63     0.13
temp           -12.1       12.7   -0.95     0.36
gas            -17.0       12.7   -1.34     0.21
pack           -21.4       12.7   -1.68     0.12

Residual standard error: 36 on 11 degrees of freedom
Multiple R-Squared: 0.334, Adjusted R-squared: 0.152
F-statistic: 1.84 on 3 and 11 DF, p-value: 0.199

Correlation of Coefficients:
     (Intercept) temp gas
temp 0.00
gas  0.00        0.00
pack 0.00        0.00 0.00

We see that, as expected, the pairwise correlations of all the coefficients are zero. Notice that the SEs for the coefficients are equal due to the balanced design. Now drop one of the variables:

> g <- lm (odor ~ gas + pack, odor)
> summary (g)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    15.20       9.26    1.64     0.13
gas           -17.00      12.68   -1.34     0.20
pack          -21.37      12.68   -1.69     0.12

Residual standard error: 35.9 on 12 degrees of freedom
Multiple R-Squared: 0.279, Adjusted R-squared: 0.159
F-statistic: 2.32 on 2 and 12 DF, p-value: 0.141

The coefficients themselves do not change, but the residual SE does change slightly, which causes small changes in the SEs of the coefficients, t-statistics and p-values, but nowhere near enough to change our qualitative conclusions. These were data from an experiment, so it was possible to control the values of the predictors to ensure orthogonality. Now consider the savings data, which are observational:

> g <- lm (sr ~ pop15 + pop75 + dpi + ddpi, savings)
> summary (g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.566087   7.354516    3.88  0.00033
pop15       -0.461193   0.144642   -3.19  0.00260
pop75       -1.691498   1.083599   -1.56  0.12553
dpi         -0.000337   0.000931   -0.36  0.71917
ddpi         0.409695   0.196197    2.09  0.04247

Residual standard error: 3.8 on 45 degrees of freedom
Multiple R-Squared: 0.338, Adjusted R-squared: 0.28
F-statistic: 5.76 on 4 and 45 DF, p-value: 0.00079

Drop pop15 from the model:
> g <- update (g, . ~ . - pop15)
> summary (g)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.487494   1.427662    3.84  0.00037
pop75       0.952857   0.763746    1.25  0.21849
dpi         0.000197   0.001003    0.20  0.84499
ddpi        0.473795   0.213727    2.22  0.03162

Residual standard error: 4.16 on 46 degrees of freedom
Multiple R-Squared: 0.189, Adjusted R-squared: 0.136
F-statistic: 3.57 on 3 and 46 DF, p-value: 0.0209

Pay particular attention to pop75. The effect has now become positive, whereas before it was negative. Granted, in both cases it is not significant, but it is not uncommon in other datasets for such sign changes to occur and to be significant.

Another important feature of designed experiments is the ability to randomly assign the experimental units to our chosen values of the predictors. For example, in our exercise regime experiment, we might choose an X with some desirable orthogonality properties and then fit a y = Xβ + ε model. However, the subjects we use for the experiment are likely to differ in ways that might affect the response, which could, for example, be resting heart rate. Some of these characteristics may be known and measurable, such as age or weight. These can be included in X. Including such known important variables is typically beneficial, as it increases the precision of our estimates of β. However, there will be other variables, Z, that we cannot measure or may not even suspect. The model we use is y = Xβ + ε, while the true model might be y = Xβ + Zγ + δ, where δ is the measurement error in the response. We can assume that Eε = 0 without any loss of generality, because if Eε = c, we could simply redefine β0 as β0 + c and the error would again have expectation zero. This is another reason why it is generally unwise to remove the intercept term from the model: it acts as a sink for the mean effect of unincluded variables. So we see that ε incorporates both measurement error and the effect of other variables. We will not observe Z, so we cannot estimate γ.

If X were orthogonal to Z, that is, X^T Z = 0, the estimate of β would not depend on the presence of Z. Of course, we cannot achieve this exactly, but because of the random assignment of the experimental units, which each carry an unknown value of Z, to the predictors X, we will have cor(X, Z) = 0. This means that the estimate of β will not be biased by Z. Another compelling argument in favor of randomization is that it justifies the use of permutation tests. We really can argue that any assignment of X to the experimental units was one of n! possible permutations.

For observational data, no randomization can be used in assigning X to the units and orthogonality will not just happen. An unmeasured and possibly unsuspected lurking variable Z may be the real cause of an observed relationship between y and X. For example, we will observe a positive correlation among the shoe sizes and reading abilities of elementary school students, but this relationship is driven by a lurking variable: the age of the child. In this example, the nature of Z is obvious, but in other cases, we may not be able to identify an important Z. The possible existence of such a lurking variable casts a shadow over any conclusions drawn from observational data.

Simply using a designed experiment does not ensure success. We may choose a bad model for the data or find that the variation in the unobserved Z overwhelms the effect of the…