Files to be submitted: a Word document of the report and the R script files.
You may refer to the "Assignment1 - tasks.pdf" file in the attachments for the list of tasks and instructions; the same content is posted below.
a) Demonstrate knowledge of data exploration and selection of variables to use in the predictive models
b) Demonstrate knowledge of building different types of predictive models using R
c) Demonstrate knowledge of comparing and evaluating different predictive models
d) Relate theoretical knowledge of predictive models and best practices to application scenarios
Part A – Data Exploration and Cleaning
Use the data for breakfast cereals (Cereals.csv) to answer the following:
1. Which variables are continuous/numerical? Which are ordinal? Which are nominal?
2. Calculate the following summary statistics: mean, median, max and standard deviation for each of the continuous variables, and count for each categorical variable. Is there any evidence of extreme values? Briefly discuss.
3. Plot histograms for each of the continuous variables and create summary statistics. Based on the histograms and summary statistics, answer the following and provide brief explanations:
a. Which variables have the largest variability?
b. Which variables seem skewed?
c. Are there any values that seem extreme?
4. Which, if any, of the variables have missing values?
a. What are the methods of handling missing values?
b. Demonstrate the output (summary statistics and transformation plot) for each method in (4-a).
c. Apply the 3 methods of missing value handling discussed in the lectures. Which method of handling missing values is most suitable for this data set? Discuss briefly referring to the data set.
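As a rough illustration of task 4, the sketch below applies three candidate methods to a small made-up data frame. The brief does not name the three methods from the lectures, so listwise deletion, mean imputation and median imputation are assumptions here; the data frame `toy`, its values and the helper `impute_with` are hypothetical and are not taken from Cereals.csv.

```r
# Hypothetical toy data; NOT the Cereals.csv values.
toy <- data.frame(
  fiber  = c(1, 2, NA, 4, 10),
  sugars = c(6, NA, 3, 9, 12)
)

# Count missing values per column.
missing_per_column <- colSums(is.na(toy))

# Method 1 (assumed): listwise deletion, dropping any row that has an NA.
toy_deleted <- na.omit(toy)

# Hypothetical helper: replace the NAs in a numeric vector with a summary
# (mean, median, ...) of the observed values.
impute_with <- function(x, summary_fun) {
  x[is.na(x)] <- summary_fun(x, na.rm = TRUE)
  x
}

# Method 2 (assumed): mean imputation.
toy_mean <- as.data.frame(lapply(toy, impute_with, summary_fun = mean))

# Method 3 (assumed): median imputation.
toy_median <- as.data.frame(lapply(toy, impute_with, summary_fun = median))

# Compare the summary statistics produced by each method.
summary(toy_deleted)
summary(toy_mean)
summary(toy_median)
```

Comparing the three summaries side by side shows how deletion shrinks the sample while imputation keeps every row but pulls values toward the centre.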
Part B – Building predictive models using a real-world business case
Alpha Traders Pty Ltd., an Australian car sales company, has purchased a stock of used Toyota Corolla cars for sale. Alpha's management is in the process of finalizing the selling prices of the purchased cars. Management is very keen to trial predictive modelling for this task and has obtained a historic sales dataset of Toyota Corolla cars from a publicly available data repository.
The dataset contains 37 attributes of over 1400 sold Toyota Corolla cars. The attributes include the selling price of cars, age, kilometres driven, fuel type, horsepower, automatic or manual, number of doors, weight (in pounds), etc.
The management of Alpha Traders Pty Ltd. has outsourced the task to you to develop a reliable predictive model to predict the selling price of the cars, using the aforementioned historic dataset.
1. Data Exploration and Cleaning (15%)
a. Examine the prices of the Toyota Corolla vehicles. Explain the distribution of the prices.
b. Find out whether there are any missing values. Explain your findings.
c. Are there any categorical values that need to be transformed into numerical values? Suggest the best possible transformation. Use this method to transform the variable(s).
d. Evaluate the correlations between the variables. Which variables should be used for dimension reduction? Explain. Carry out dimensionality reduction.
e. Explore the distribution of selected variables (from step 1-d) against the target variable. Explain.
2. Regression Modelling (20%)
a. Build a regression model with the selected variables. You need to try out at least 3 regression models to identify the optimal model.
b. Evaluate the accuracy of the regression model.
3. Decision Tree Modelling (20%)
a. Build a decision tree with the selected variables. You need to try out at least 3 decision trees with different complexity parameters to obtain the optimal tree.
b. Explain the output of the selected decision tree, evaluate its accuracy, and give the reason for selecting it.
4. Model Comparison (15%)
a. Compare the accuracy of the selected (optimal) regression model and the selected (optimal) decision tree, then discuss and justify the most suitable predictive model for the business case.
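One possible way to set up the comparison in task 4 is to score both models on the same hold-out rows with a single error measure. The sketch below is only an illustration under several assumptions: the built-in `mtcars` data stands in for the Toyota Corolla dataset (which is not reproduced here), `mpg` plays the role of the price target, RMSE is an assumed choice of accuracy measure, and the `rpart` package is assumed to be installed.

```r
# mtcars (shipped with R) stands in for the Toyota Corolla data;
# mpg plays the role of the selling price. These are assumptions.
set.seed(1)
test_rows <- sample(nrow(mtcars), size = 8)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Root mean squared error on the hold-out rows.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Candidate 1: linear regression.
lm_fit  <- lm(mpg ~ wt + hp, data = train)
lm_rmse <- rmse(test$mpg, predict(lm_fit, newdata = test))

# Candidate 2: regression tree. rpart normally ships with R as a
# recommended package, but the call is guarded in case it is missing.
if (requireNamespace("rpart", quietly = TRUE)) {
  tree_fit  <- rpart::rpart(mpg ~ wt + hp, data = train, method = "anova",
                            control = rpart::rpart.control(cp = 0.01))
  tree_rmse <- rmse(test$mpg, predict(tree_fit, newdata = test))
  print(c(regression = lm_rmse, tree = tree_rmse))
}
```

Repeating the tree fit with different `cp` values, and the regression with different predictor sets, gives the "at least 3" candidates that tasks 2 and 3 ask for; the lower hold-out RMSE is one argument for a model, alongside interpretability for the business case.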
Explanation & Answer
Part A: Data Exploration and Cleaning
The numerical variables in the breakfast cereal data are Calories, Protein, Fat,
Sodium, Fiber, Complex Carbos, Tot Carbo, Sugars, Calories fr Fat, Potassium, Enriched,
Wt/serving and cups/serv. The only ordinal variable in the dataset is Fiber GR, while the
only nominal variable in the dataset is Hot/cold.
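A classification like this can be cross-checked in R once the categorical columns are stored as factors. The sketch below is a minimal illustration on a made-up data frame: the column names echo the ones above, but the values and the level order of `fiber_gr` are assumptions, not taken from Cereals.csv.

```r
# Hypothetical toy data; values and factor levels are assumptions.
toy <- data.frame(
  calories = c(70, 110, 120),
  hot_cold = factor(c("C", "H", "C")),
  fiber_gr = factor(c("Low", "High", "Medium"),
                    levels = c("Low", "Medium", "High"), ordered = TRUE)
)

# Label each column by its measurement level, based on how it is stored.
kind <- sapply(toy, function(col) {
  if (is.ordered(col)) {
    "ordinal"
  } else if (is.factor(col) || is.character(col)) {
    "nominal"
  } else if (is.numeric(col)) {
    "numerical"
  } else {
    "other"
  }
})

print(kind)
```

The check only reflects how each column is stored, so a numerically coded category would still be reported as numerical unless it is converted to a factor first.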
There are extreme values in the dataset, as indicated by the maximum and minimum
values of the numerical variables.
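Since every numerical variable has a maximum and a minimum, a more formal check is useful. One common option, which is not necessarily the method used in this report, is Tukey's rule: flag values more than 1.5 times the interquartile range beyond the quartiles. The sketch below uses a made-up vector named `sodium` and a hypothetical helper `flag_extreme`; the numbers are not the Cereals.csv values.

```r
# Hypothetical toy vector standing in for one numeric column; NOT real data.
sodium <- c(0, 130, 140, 150, 170, 180, 200, 210, 220, 420)

# Flag values outside Tukey's fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
flag_extreme <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - 1.5 * spread | x > q[2] + 1.5 * spread
}

# Values in the toy vector that fall outside the fences.
sodium[flag_extreme(sodium)]
```

Applying the same helper to each numeric column, for example with `lapply`, gives a per-variable list of candidate extreme values to discuss alongside the histograms.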
a) Based on the histograms and summary statistics, the variables with the largest
variability were sodium, whose range (max - min) was 420; potassium, whose range was 390;
and calories, whose range was 200.
b) Based on the hist...