Miami University Simple and Multiple Logistic Regression Models Discussion
Directions: Complete the following questions. The
questions have been separated into 4 parts of similar material. Parts
1, 2, and 3 will only use the corona_train data while Part 4 will use the corona_test data. Use the R Markdown starter file, hw7_starter.Rmd.

Part 1 - Odds

1. Using the training dataset, compute the odds that a county has reported a Coronavirus-related death. (2pts)

2. Do the odds of a Coronavirus-related death vary by Census
region? Compute the odds that a county has reported a
Coronavirus-related death for each Census region within the United
States. Compare these values to address the question. (3pts)

Part 2 - Simple Logistic Regression

3. Build a plot (or plots) to explore how the logarithm of the
population density predicts whether a county has recorded a
Coronavirus-related death. Briefly discuss the results of your plot. (2pts)

4. Build a simple logistic model to statistically determine if the
logarithm of the population density predicts the probability a county
has reported a Coronavirus-related death. Support your findings with an
appropriate hypothesis test. (3pts)

Part 3 - Multiple Logistic Regression Models

5. Fit a multiple logistic regression model with the census region,
the logarithm of population density, the cumulative coronavirus rate,
the median county age, the median income, the percent of the county that
are U.S. citizens, the percent with a college degree, the percent of
the population that are veterans of the U.S. armed services, the percent
with healthcare and the percent that voted for President Trump in the
2016 general election to predict the probability a county has reported a
Coronavirus-related death. Conduct an appropriate test to determine
whether this model significantly predicts the probability a county has
reported a Coronavirus-related death. (3pts)

6. Perform a backward selection procedure on the model from question 5. Which variable(s) has/have been removed from the model? (2pts)

7. We will now continue the backward selection procedure, but this time
using a likelihood ratio test. Using the drop1() function to determine
which predictors are significant, iteratively remove all insignificant
predictors from the model in question 6. That is, look at the drop1()
output from the model in question 6, refit the model after removing all
insignificant terms, look at the drop1() output, refit the model after
removing all insignificant terms... Continue this process until all
predictors are significant. What predictor variables remain in the
model? (4pts)

8. The starter file contains some code to help you along on this
problem. Build a table to compare the AIC, BIC and a Pseudo-R-squared
for the models fit in questions 5, 6 and 7. Which model is best with
respect to each metric? (3pts)

9. Code was supplied for a Pseudo-R-squared calculation in question
8. Explain how this value mimics that of the traditional R-squared value
used in multiple linear regression. (2pts)

10. For the model with the best BIC, of those fit in questions 5, 6,
or 7, interpret the coefficient regionWest. Be sure to explain this
coefficient in terms of odds (not log-odds, which do not provide a nice
interpretation). How does this compare to the results in question 2?
Why might they be similar/different? (3pts)

Part 4 - Prediction

11. We will use three fitted models built above to predict whether a
county in the testing dataset will have a Coronavirus-related death.
Some code is supplied in the starter file; edit and replicate it so that it
makes predictions using all three models. Briefly describe what this
code is doing. (2pts)

12. Calculate and discuss the accuracy, sensitivity and specificity
for all three models to predict if a county has reported a
Coronavirus-related death. Which model appears to be the best model at
predicting if a county has a Coronavirus-related death? Code is provided
for the confusion matrix of the first model. Replicate this code to
generate the confusion matrices for the other two models. (6pts)

13. Using the best model from the
previous question, compute the sensitivity and specificity if the
probability threshold (the 0.5 provided in the code for question 11)
were 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. Use these values
to complete the table in the starter file. Which threshold appears to be
the best choice? (5pts)

NOTE: the ideas of sensitivity and specificity are
VERY relevant in today's society as scientists develop tests for the
COVID-19 Coronavirus, both for antibodies and for detection of the disease. We
felt it prudent to introduce these topics under the current
circumstances.

Some Coding hints

We have covered a lot this semester... In an effort to help you with
some of the necessary coding, we provide the following hints, but note that
additional code is needed for everything to work:

- xtabs() can be used in questions 1, 2, and 12
- ggplot() is needed in question 3
- glm(), drop1() and/or anova() are needed in questions 4, 5, 7 and 8
- stats::step() is needed in question 6
- summary() will provide output with model coefficients; you can also use coef()

I need the Rmd and HTML files at the end.
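
To show roughly how the hinted functions fit together, here is a minimal sketch on simulated toy data. It is not the starter file's code and it does not use the corona_train or corona_test data: the data frame `toy`, its columns `region`, `log_density` and `death`, and the helper `sens_spec()` are all hypothetical names made up for this illustration. It computes odds from xtabs() counts, fits a simple logistic model with glm(), runs a likelihood ratio test with drop1(), and tabulates sensitivity and specificity across the thresholds 0.1 to 0.9.

```r
# Toy data only: every column name here is hypothetical, not from corona_train.
set.seed(1)
n <- 200
toy <- data.frame(
  region      = factor(sample(c("Midwest", "Northeast", "South", "West"),
                              n, replace = TRUE)),
  log_density = rnorm(n, mean = 4, sd = 1.5)
)
# Simulate a 0/1 outcome whose log-odds rise with log_density.
toy$death <- rbinom(n, size = 1, prob = plogis(-4 + toy$log_density))

# Odds = (count with the event) / (count without the event).
counts <- xtabs(~ death, data = toy)
overall_odds <- as.numeric(counts["1"]) / as.numeric(counts["0"])

# Odds by region: rows are regions, columns are the outcomes "0" and "1".
by_region <- xtabs(~ region + death, data = toy)
region_odds <- by_region[, "1"] / by_region[, "0"]

# Simple logistic regression, then a likelihood ratio test of the predictor.
fit <- glm(death ~ log_density, data = toy, family = binomial)
lrt <- drop1(fit, test = "Chisq")

# Sensitivity and specificity at a single probability threshold.
sens_spec <- function(prob, truth, threshold) {
  pred <- as.integer(prob >= threshold)
  c(threshold   = threshold,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

# Fitted probabilities, evaluated here on the same toy data for brevity.
prob <- predict(fit, type = "response")
thresholds <- seq(0.1, 0.9, by = 0.1)
threshold_table <- t(sapply(thresholds, sens_spec,
                            prob = prob, truth = toy$death))
```

Printing `overall_odds`, `region_odds`, `lrt` and `threshold_table` shows each piece. In the assignment itself the predictions would be made on the testing data (for example through the `newdata` argument of predict()) rather than on the data used to fit the model, and raising the threshold can only lower sensitivity and raise specificity, which is the trade-off the threshold table is meant to expose.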