DSCI 352 Assignment 5 - Predicting Breast Cancer
In this assignment you will be applying machine learning to try to predict breast cancer from a
set of biological attributes in a tissue sample. Specifically, you will be constructing, evaluating,
and comparing three different classification models.
As this is an online assignment, it is imperative that you CAREFULLY READ AND FOLLOW
ALL DIRECTIONS. Failure to do so may result in points lost unnecessarily.
This assignment is to be entirely your own work - by submitting you certify that this is true.
To begin, please download the following data file:
https://drive.google.com/file/d/1JUfmGe_3Tr_3hkM55dV-rAYCBuTizIsX/view?usp=sharing
PREREQUISITES - Getting Set Up
First, set your working directory to the folder where you downloaded the data file above.
Second, add the following two lines to your code file AS THE FIRST TWO LINES (if
you are using a code file), or else copy them into your R console BEFORE DOING
ANYTHING ELSE (if you are working directly in the command line):
    library(caret)
    set.seed(32343)
The first command loads the caret library for machine learning, and the set.seed function will
ensure that everyone gets the same answer when using randomized methods such as random
forests. YOU MUST MAKE SURE TO DO THESE TWO STEPS BEFORE PROCEEDING
OR YOUR ANSWERS WILL NOT BE CORRECT.
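For illustration only, here is a minimal sketch of why the seed matters. It uses base R, and the draw of five numbers is an arbitrary example that is not part of the assignment: resetting the seed before a random draw reproduces the same result.

```r
# Resetting the seed makes R's random number generator repeat itself.
set.seed(32343)
first_draw <- sample(1:100, 5)

set.seed(32343)
second_draw <- sample(1:100, 5)

# Same seed, same call: the two draws are identical.
same_draw <- identical(first_draw, second_draw)
print(same_draw)
```

Any random step that runs between set.seed and your model code changes the later draws, which is why the two setup lines need to come first.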
1. In this assignment we are going to build predictive models which will use cell/tissue
sample information to predict whether the sample is normal or cancerous.
To begin, read in the above data set and use head to inspect the first few rows.
Which column(s) represent the dependent variable (i.e. the one we want to predict)?
(1 point)
☐ Patient.ID
☐ Clump.Thickness
☐ Size.Uniformity
☐ Shape.Uniformity
☐ Marginal.Adhesion
☐ Epithelial.Cell.Size
☐ Bare.Nuclei
☐ Bland.Chromatin
☐ Normal.Nuclei
☐ Mitosis
☐ Diagnosis
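As a rough sketch of the read-and-inspect step, the example below writes a tiny made-up CSV to a temporary path and reads it back. The file contents, the subset of columns, and the 'Benign'/'Malignant' coding are assumptions for illustration; substitute the path of the file you downloaded.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV written to a temp path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Patient.ID,Clump.Thickness,Mitosis,Diagnosis",
  "1001,5,1,Benign",
  "1002,8,3,Malignant",
  "1003,2,1,Benign"
), csv_path)

# stringsAsFactors = TRUE stores text columns such as Diagnosis as factors.
toy <- read.csv(csv_path, stringsAsFactors = TRUE)

head(toy)  # first rows (up to 6 by default)
str(toy)   # column names and types
```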
2. Looking at the data above, which column(s) represent the independent variable(s)
(i.e. the one(s) which may conceivably have a predictive relationship with what we
are trying to predict)?
(1 point)
☐ Patient.ID
☐ Clump.Thickness
☐ Size.Uniformity
☐ Shape.Uniformity
☐ Marginal.Adhesion
☐ Epithelial.Cell.Size
☐ Bare.Nuclei
☐ Bland.Chromatin
☐ Normal.Nuclei
☐ Mitosis
☐ Diagnosis
3. How many benign cases are found in this data? (Enter only a number)
(1 point)
*Hint: Use subsetting to find this
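One way the subsetting hint might be applied is sketched below on a made-up six-row data frame. The values are illustrative, and the 'Benign'/'Malignant' labels are assumed to match the coding in the real file.

```r
# Made-up toy data; not the assignment data.
toy <- data.frame(
  Patient.ID = 1:6,
  Diagnosis  = factor(c("Benign", "Malignant", "Benign",
                        "Benign", "Malignant", "Benign"))
)

# Subset the rows for one class, then count them.
benign_rows <- toy[toy$Diagnosis == "Benign", ]
n_benign <- nrow(benign_rows)

# table() gives the count for every class at once, as a cross-check.
class_counts <- table(toy$Diagnosis)

print(n_benign)
print(class_counts)
```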
4. How many malignant cases are found in this data? (Enter only a number)
(1 point)
5. Is this data set balanced?
(1 point)
◯ Yes, it is approximately balanced
◯ No, it is slightly imbalanced (i.e. one class is no more than twice the size of the other)
◯ No, it is moderately imbalanced (one class is between 2 and 5 times the size of the other)
◯ No, it is highly imbalanced (one class is more than 5 times the size of the other)
INSTRUCTIONS: Please carry out the following steps to prepare your data for further analysis
before proceeding to the next questions:
1. Use data slicing to get rid of any columns in the data set which are neither dependent nor
independent variables as you specified above.
2. Split the data into training and test sets. The training set should contain exactly 70% of
the data and the test set should contain the remaining 30%.
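The two preparation steps can be sketched as follows on made-up data. This is a base R illustration under stated assumptions, not necessarily the method expected for grading: the Lab.Code column is hypothetical and stands in for whichever column(s) you decide to drop, and the split uses a plain random sample of row indices.

```r
set.seed(32343)

# Made-up toy data: 20 rows, one irrelevant column, two predictors, and the outcome.
toy <- data.frame(
  Lab.Code        = sprintf("L%02d", 1:20),  # hypothetical column to drop
  Clump.Thickness = sample(1:10, 20, replace = TRUE),
  Mitosis         = sample(1:10, 20, replace = TRUE),
  Diagnosis       = factor(rep(c("Benign", "Malignant"), times = c(13, 7)))
)

# Data slicing: keep every column except the irrelevant one.
model_data <- toy[, names(toy) != "Lab.Code"]

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(nrow(model_data), size = round(0.7 * nrow(model_data)))
train_set <- model_data[train_idx, ]
test_set  <- model_data[-train_idx, ]

print(dim(train_set))
print(dim(test_set))
```

The caret function createDataPartition(y, p = 0.7, list = FALSE) is an alternative that keeps the class proportions similar in both sets. The two approaches select different rows, so later results depend on which one you use.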
MODEL 1 - Logistic Regression: Construct a logistic regression model to predict the
dependent variable identified above from the independent variables you identified (in questions 1
and 2 above). DO NOT USE ANY SCALING OR OTHER PREPROCESSING. Train your
model on the training data and test/evaluate its performance using the test data. When evaluating
your results, make sure to set the positive class to 'Malignant'. You will use the results of this
model evaluation to answer the next 8 questions.
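As a hedged sketch of the train-then-evaluate workflow, the example below fits a logistic regression with base R's glm on made-up data and tabulates predictions against the true labels. The data, the two predictor columns, and the 0.5 probability cutoff are assumptions for illustration, so the counts will not match the assignment data.

```r
set.seed(32343)

# Made-up toy data with overlapping classes.
n <- 200
toy <- data.frame(
  Clump.Thickness = c(rnorm(n / 2, mean = 3, sd = 2), rnorm(n / 2, mean = 6, sd = 2)),
  Mitosis         = c(rnorm(n / 2, mean = 2, sd = 1.5), rnorm(n / 2, mean = 4, sd = 1.5)),
  Diagnosis       = factor(rep(c("Benign", "Malignant"), each = n / 2))
)

train_idx <- sample(n, size = round(0.7 * n))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# family = binomial fits logistic regression. With a two-level factor outcome,
# glm models the probability of the SECOND level ("Malignant" here).
fit <- glm(Diagnosis ~ ., data = train_set, family = binomial)

# Predicted probabilities of "Malignant" for the held-out rows.
prob_malignant <- predict(fit, newdata = test_set, type = "response")

# Convert probabilities to labels with a 0.5 cutoff (assumed default).
predicted <- factor(ifelse(prob_malignant > 0.5, "Malignant", "Benign"),
                    levels = levels(test_set$Diagnosis))

# Rows = predicted class, columns = actual class.
conf_table <- table(Predicted = predicted, Actual = test_set$Diagnosis)
print(conf_table)
```

If you evaluate with caret instead, confusionMatrix() takes a positive argument, which is where the positive class 'Malignant' would be set.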
6. What is the accuracy of this model? (Enter a number only, no rounding)
(1 point)
7. What is the precision of this model? (Enter a number only, no rounding)
(1 point)
8. What is the recall of this model? (Enter a number only, no rounding)
(1 point)
9. What is the balanced accuracy of this model? (Enter a number only, no rounding)
(1 point)
10. How many cases in the test data did this model correctly predict as benign? (Enter
only a number, no rounding)
(1 point)
11. How many cases in the test data did this model correctly predict as malignant?
(Enter only a number, no rounding)
(1 point)
12. How many cases in the test data did this model incorrectly predict as benign (i.e.
how many false negatives)? (Enter only a number, no rounding)
(1 point)
13. How many cases in the test data did this model incorrectly predict as malignant
(i.e. how many false positives)? (Enter only a number, no rounding)
(1 point)
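For reference, the metrics asked about above can be computed from the four confusion-matrix counts. The sketch below uses made-up counts, with 'Malignant' as the positive class; the numbers are not results from the assignment data.

```r
# Illustrative counts only (made up).
tp <- 40  # predicted Malignant, actually Malignant
fn <- 10  # predicted Benign,    actually Malignant
fp <- 5   # predicted Malignant, actually Benign
tn <- 95  # predicted Benign,    actually Benign

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)  # also called positive predictive value
recall      <- tp / (tp + fn)  # also called sensitivity
specificity <- tn / (tn + fp)
balanced_accuracy <- (recall + specificity) / 2

print(c(accuracy = accuracy, precision = precision, recall = recall,
        balanced_accuracy = balanced_accuracy))
```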
MODEL 2 - Naive Bayes: Construct a Naive Bayes model to predict the dependent variable
identified above from the independent variables you identified (in questions 1 and 2 above). DO
NOT USE ANY SCALING OR OTHER PREPROCESSING. Train your model on the
training data and test/evaluate its performance using the test data. When evaluating your results,
make sure to set the positive class to 'Malignant'. You will use the results of this model
evaluation to answer the next 8 questions.
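To show the idea behind Naive Bayes, here is a hand-rolled Gaussian version in base R on made-up data. It is a teaching sketch under stated assumptions (numeric predictors, a normal distribution per feature within each class), not the implementation used by caret or any other package, so its results will differ from a library model.

```r
set.seed(32343)

# Made-up toy data: two numeric predictors and a two-class outcome.
n <- 200
toy <- data.frame(
  Clump.Thickness = c(rnorm(n / 2, 3, 2), rnorm(n / 2, 6, 2)),
  Mitosis         = c(rnorm(n / 2, 2, 1.5), rnorm(n / 2, 4, 1.5)),
  Diagnosis       = factor(rep(c("Benign", "Malignant"), each = n / 2))
)
train_idx <- sample(n, size = round(0.7 * n))
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]
features  <- setdiff(names(toy), "Diagnosis")

# "Training": per class, store the prior and each feature's mean and sd.
fit_nb <- lapply(split(train_set, train_set$Diagnosis), function(rows) {
  list(
    prior = nrow(rows) / nrow(train_set),
    mean  = sapply(rows[features], mean),
    sd    = sapply(rows[features], sd)
  )
})

# Prediction: log prior plus the sum of per-feature Gaussian log densities.
# The "naive" part is treating the features as independent within a class.
log_score <- sapply(fit_nb, function(cls) {
  dens <- sapply(features, function(f) {
    dnorm(test_set[[f]], mean = cls$mean[[f]], sd = cls$sd[[f]], log = TRUE)
  })
  log(cls$prior) + rowSums(dens)
})

# Pick the class with the highest score for each test row.
predicted <- factor(colnames(log_score)[max.col(log_score, ties.method = "first")],
                    levels = levels(test_set$Diagnosis))
conf_table <- table(Predicted = predicted, Actual = test_set$Diagnosis)
print(conf_table)
```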
14. What is the accuracy of this model? (Enter a number only, no rounding)
(1 point)
15. What is the precision of this model? (Enter a number only, no rounding)
(1 point)
16. What is the recall of this model? (Enter a number only, no rounding)
(1 point)
17. What is the balanced accuracy of this model? (Enter a number only, no rounding)
(1 point)
18. How many cases in the test data did this model correctly predict as benign? (Enter
only a number, no rounding)
(1 point)
19. How many cases in the test data did this model correctly predict as malignant?
(Enter only a number, no rounding)
(1 point)
20. How many cases in the test data did this model incorrectly predict as benign (i.e.
how many false negatives)? (Enter only a number, no rounding)
(1 point)
21. How many cases in the test data did this model incorrectly predict as malignant
(i.e. how many false positives)? (Enter only a number, no rounding)
(1 point)
MODEL 3 - Random Forest: Construct a Random Forest model to predict the dependent
variable identified above from the independent variables you identified (in questions 1 and 2
above). DO NOT USE ANY SCALING OR OTHER PREPROCESSING. Train your model
on the training data and test/evaluate its performance using the test data. When evaluating your
results, make sure to set the positive class to 'Malignant'. You will use the results of this model
evaluation to answer the next 8 questions.
22. What is the accuracy of this model? (Enter a number only, no rounding)
(1 point)
23. What is the precision of this model? (Enter a number only, no rounding)
(1 point)
24. What is the recall of this model? (Enter a number only, no rounding)
(1 point)
25. What is the balanced accuracy of this model? (Enter a number only, no rounding)
(1 point)
26. How many cases in the test data did this model correctly predict as benign? (Enter
only a number, no rounding)
(1 point)
27. How many cases in the test data did this model correctly predict as malignant?
(Enter only a number, no rounding)
(1 point)
28. How many cases in the test data did this model incorrectly predict as benign (i.e.
how many false negatives)? (Enter only a number, no rounding)
(1 point)
29. How many cases in the test data did this model incorrectly predict as malignant
(i.e. how many false positives)? (Enter only a number, no rounding)
(1 point)
Model Comparison: Use the results from your three models above to answer the remaining
questions.
30. Based on the three models you constructed and their results, can we make good
predictions about whether a tissue sample is cancerous or normal?
(1 point)
◯ No, accuracy was below 60% in all models
◯ No, balanced accuracy was below 60% for all models
◯ No, accuracy, precision, recall, and balanced accuracy were below 60% for all models
◯ This depends on the model; in some cases accuracy and precision were below 60%
◯ Yes, accuracy was above 90% in all cases
◯ Yes, accuracy, precision, recall, and balanced accuracy were above 90% in all models.
31. Suppose we want to select the model with the least chance of missing a case of
breast cancer (i.e. missing a positive instance). Which metric should we use to
compare models?
(1 point)
◯ Prevalence
◯ Specificity
◯ Detection Prevalence
◯ Sensitivity
◯ Balanced Accuracy
◯ Neg Pred Value
◯ Pos Pred Value
◯ Detection Rate
◯ Kappa
32. Suppose we want to select the model in which we can have the greatest certainty
that if the predicted outcome is positive (i.e. predicted as malignant), this is in fact
correct. Which metric should we use to compare models?
(1 point)
◯ Sensitivity
◯ Prevalence
◯ Detection Rate
◯ Detection Prevalence
◯ Pos Pred Value
◯ Balanced Accuracy
◯ Kappa
◯ Specificity
◯ Neg Pred Value
33. By the criteria chosen above, which model would be the worst?
(1 point)
◯ Logistic Regression
◯ Naive Bayes
◯ Random Forest
