Econ with R language

User Generated

mpuv1987

Economics

Unformatted Attachment Preview

Economics 217 Homework #3 Due Tuesday, February 20th Problem 1 This question will use the "Veteran" survival data, which is now available on the course website. It is a standard survival dataset originating from the Veterans Administration that measures the initial state of patient health and the effects of a chemotherapy treatment on survival from lung cancer. It contains the following variables: • ID: ID of the patient • TIME: time of start of the observation period (if Y=0, first occurrence), death (if Y=1) or censoring (if Y=0, second occurrence) • Y: 0 to indicate the initiation observation or censoring, and 1 to indicate death • trt: treatment type • celltype: histological type of the tumor • karno: Karnofsky performance score that describes the overall patients status at the beginning of the study • diagtime: Time between diagnosis and start of the study • age: age of the patient • priortherapy: indicates if the patient has received another therapy before the current one To be clear, each patient is observed twice. On the first occurrence, (TIME=0,Y=0) for all patients (the initial observation). The second observation is either censoring (TIME>0,Y=0) or death (TIME>0,Y=1). a. First, let’s clean the data. We are only interested in the basic survival questions as a function of treatment/control and a few other covariates. So, using the dataset, drop observations for which TIME=0. Please summarize and briefly discuss the variables TIME, Y, trt, and celltype. Do any of these variables suggest we should clean the dataset further? If so, please clean the dataset. (10 points) b. Please plot Kaplan Meier estimates of survival as a function of treatment and control groups. Please label your figures and briefly discuss your answer. (10 points) c. Please estimate a proportional hazards model, hazard being a function of treatment/control, cell type, age, and prior therapy. Please summarize and briefly interpret your results. (10 points) Problem 2 For this question we will use the loess and gam functions in R to study the relationship between real average hourly wages and hours worked (using the Org dataset). a. Using loess, please estimate the relationship between the log real wage, rw and hours worked. On a second plot, please use log of hours worked. Please interpret your figures. (10 points) b. Using gam, please estimate the relationship between the log real wage and log hours worked controlling for education and age. Please provide confidence intervals for your estimates on the Figure, and interpret your results. (10 points) c. Back to using loess, repeat the same exercise as in (2a), but please write a cross-validation procedure to find the optimal degree of smoothing (span, in the function). You may not use a "canned" package from the R library to run the cross-validation. Please plot your optimal figure, as well as provide results as to why you chose the degree of smoothing that you did. (10 points) Problem 3 In this question we will utilize bootstrap procedures to evaluate the relationship between real wages and educational attainment using the ORG dataset. a. To begin our study of real wages, restrict the sample to CA in 2013, and then estimate the following linear regression: log(rw) = β0 + β1 DHS + β2 DSC + β3 DC + β4 DAD + β5 age + u (1) where {HS, SC, C, AD} represent a maximum education level of high school, some college, college, and an advanced degree, respectively. Age is the age of the respondent. Please run a simple regression and interpret the coefficient on β3 . Please construct a 95% confidence interval.(10 points) b. For this question, please run 1000 bootstrap replications of this regression, with each replication being of the same size as the original dataset. Please use the data resampling technique (as opposed to residual resampling). Is the 95% confidence interval larger or smaller than part ‘a’ ? (10 points) c. As in previous work, we’d like to test for the effect of removing all education variables. For this, you should use an F-test. However, rather than running a simple F-test, I want you to use residual resampling to bootstrap the F statistic using 1000 replications. Illustrate the variation in your F-statistic using a Figure, and discuss the implications of your results.
User generated content is uploaded by users for the purposes of learning and should be used following Studypool's honor code & terms of service.

This question has not been answered.

Create a free account to get help with this and any other question!

Related Tags