Write a report on Building and Evaluating Predictive Models using R (should be familiar with machine learning)


Description

Files to be submitted: a Word document of the report and the R script files.

Refer to the "Assignment1 - tasks.pdf" file in the attachments for the list of tasks and instructions. The same content is posted below.

Objective:

a) Demonstrate knowledge of data exploration and selection of variables to apply for the predictive models

b) Demonstrate knowledge of building different types of predictive models using R

c) Demonstrate knowledge of comparing and evaluating different predictive models

d) Relate theoretical knowledge of predictive models and best practices to application scenarios

Part A – Data Exploration and Cleaning

Use the data for breakfast cereals (Cereals.csv) to answer the following

1. Which variables are continuous/numerical? Which are ordinal? Which are nominal?

2. Calculate the following summary statistics: mean, median, max and standard deviation for each of the continuous variables, and count for each categorical variable. Is there any evidence of extreme values? Briefly discuss.

3. Plot histograms for each of the continuous variables and create summary statistics. Based on the histogram and summary statistics answer the following and provide brief explanations:

a. Which variables have the largest variability?

b. Which variables seem skewed?

c. Are there any values that seem extreme?

4. Which, if any, of the variables have missing values?

a. What are the methods of handling missing values?

b. Demonstrate the output (summary statistics and transformation plot) for each method in (4-a).

c. Apply the 3 methods of missing value handling discussed in the lectures. Which method of handling missing values is most suitable for this data set? Discuss briefly referring to the data set.
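The Part A tasks can be sketched in base R as follows. This is a minimal illustration, not the expected solution: the data frame is a small invented stand-in for Cereals.csv, the column names are hypothetical, and the three missing-value methods shown (listwise deletion, mean imputation, median imputation) are an assumption, since the brief does not name the methods covered in the lectures.

```r
# Toy stand-in for Cereals.csv; column names and values are invented for illustration.
# With the real file this would be: cereals <- read.csv("Cereals.csv")
cereals <- data.frame(
  calories  = c(70, 120, 110, 50, 100, 110),
  sodium    = c(130, 15, 260, 140, NA, 200),
  potassium = c(280, 135, NA, 330, 95, 35),
  hot_cold  = c("C", "C", "C", "C", "H", "C")
)

num_cols <- names(cereals)[sapply(cereals, is.numeric)]

# Mean, median, max and standard deviation per numeric column, ignoring NAs.
summary_stats <- t(sapply(cereals[num_cols], function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}))
print(round(summary_stats, 2))

# Count per level of the categorical variable.
print(table(cereals$hot_cold))

# Number of missing values in each column.
missing_counts <- colSums(is.na(cereals))
print(missing_counts)

# Assumed method 1: listwise deletion (drop every row that contains an NA).
cereals_deleted <- na.omit(cereals)

# Assumed methods 2 and 3: replace NAs with the column mean or median.
impute <- function(df, cols, fun) {
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- fun(df[[col]], na.rm = TRUE)
  }
  df
}
cereals_mean   <- impute(cereals, num_cols, mean)
cereals_median <- impute(cereals, num_cols, median)

# A histogram for one variable would be drawn with, e.g., hist(cereals$sodium).
```

Re-running the summary statistics on `cereals_deleted`, `cereals_mean` and `cereals_median` gives the per-method comparison that task 4-b asks for.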

Part B – Building predictive models using a real-world business case

Business Case

Alpha Traders Pty Ltd., an Australian car sales company, has purchased a stock of used Toyota Corolla cars for sale. The management of Alpha Traders is in the process of finalizing the selling prices of the purchased cars. Management is very keen to trial predictive modelling for this task and has obtained a historic car sales dataset of Toyota Corolla cars from a publicly available data repository.

The dataset contains 37 attributes of over 1400 sold Toyota Corolla cars. The attributes include the selling price of cars, age, kilometres driven, fuel type, horsepower, automatic or manual, number of doors, weight (in pounds), etc.

The management of Alpha Traders Pty Ltd. has outsourced the task to you to develop a reliable predictive model to predict the selling price of the cars, using the aforementioned historic dataset.

1. Data Exploration and Cleaning (15%)

a. Examine the prices of the Toyota Corolla vehicles. Explain the distribution of the prices.

b. Find out whether there are any missing values. Explain your findings.

c. Are there any categorical values that need to be transformed into numerical values? Suggest the best possible transformation. Use this method to transform the variable(s).

d. Evaluate the correlations between the variables. Which variables should be used for dimension reduction? Explain. Carry out dimensionality reduction.

e. Explore the distribution of selected variables (from step 1-d) against the target variable. Explain.
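Steps 1-b to 1-d can be sketched as below. The data are simulated, and the column names (`Price`, `Age`, `KM`, `HP`, `Fuel_Type`) are assumptions standing in for the real dataset's attributes. Dummy coding via `model.matrix` is one common transformation, not necessarily the one the lectures recommend, and the 0.8 correlation cut-off is an arbitrary illustrative threshold.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the car sales data; names and coefficients are invented.
cars <- data.frame(
  Age       = round(runif(n, 1, 80)),
  KM        = round(runif(n, 1000, 200000)),
  HP        = sample(c(69, 90, 110), n, replace = TRUE),
  Fuel_Type = factor(sample(c("Petrol", "Diesel", "CNG"), n, replace = TRUE))
)
cars$Price <- 20000 - 150 * cars$Age - 0.02 * cars$KM + 40 * cars$HP +
  rnorm(n, sd = 800)

# Step 1-b: missing values per column.
print(colSums(is.na(cars)))

# Step 1-c: dummy-code the categorical variable.
# model.matrix() builds k - 1 indicator columns; [, -1] drops the intercept column.
dummies  <- model.matrix(~ Fuel_Type, data = cars)[, -1, drop = FALSE]
cars_num <- cbind(cars[, c("Price", "Age", "KM", "HP")], dummies)

# Step 1-d: correlation matrix, then predictors ranked by |correlation with Price|.
cor_mat   <- cor(cars_num)
price_cor <- sort(abs(cor_mat["Price", colnames(cor_mat) != "Price"]),
                  decreasing = TRUE)
print(round(price_cor, 3))

# Pairs of predictors with |r| > 0.8 are candidates for dropping one of the pair.
pred_cor   <- cor_mat[-1, -1]
high_pairs <- which(abs(pred_cor) > 0.8 & upper.tri(pred_cor), arr.ind = TRUE)
print(high_pairs)
```

Predictors that correlate weakly with `Price`, or strongly with another predictor, are the usual candidates for removal when reducing dimensions by correlation.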

2. Regression Modelling (20%)

a. Build a regression model with the selected variables. You need to try out at least 3 regression models to identify the optimal model.

b. Evaluate the accuracy of the regression model.
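One way to try three regression models and compare them is sketched below, assuming simulated data, hypothetical column names, a 70/30 train/validation split, and RMSE on the validation set as the accuracy measure. None of these choices is prescribed by the brief.

```r
set.seed(42)
n <- 300

# Simulated stand-in for the car sales data; names and coefficients are invented.
cars <- data.frame(
  Age = runif(n, 1, 80),
  KM  = runif(n, 1000, 200000),
  HP  = sample(c(69, 90, 110), n, replace = TRUE)
)
cars$Price <- 20000 - 150 * cars$Age - 0.02 * cars$KM + 40 * cars$HP +
  rnorm(n, sd = 800)

# 70/30 split into training and validation sets.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- cars[train_idx, ]
valid <- cars[-train_idx, ]

# Root mean squared error of predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Three candidate models of increasing size.
formulas <- list(
  age_only  = Price ~ Age,
  age_km    = Price ~ Age + KM,
  age_km_hp = Price ~ Age + KM + HP
)

# Fit each model on the training set and score it on the validation set.
results <- sapply(formulas, function(f) {
  fit <- lm(f, data = train)
  c(adj_r2     = summary(fit)$adj.r.squared,
    valid_rmse = rmse(valid$Price, predict(fit, newdata = valid)))
})
print(round(t(results), 3))
```

Scoring on held-out data rather than on the training set guards against picking a model only because it has more terms.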

3. Decision Tree Modelling (20%)

a. Build a decision tree with the selected variables. You need to try out at least 3 decision trees with different complexity parameters to obtain the optimal tree.

b. Explain the output of the selected decision tree, evaluate the accuracy and reason for it to be selected.
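A sketch of fitting regression trees at three complexity parameters follows. It assumes the `rpart` package is installed (it ships with standard R distributions), and it again uses simulated data with hypothetical column names. The cp values 0.1, 0.01 and 0.001 are illustrative picks, not recommended settings.

```r
library(rpart)

set.seed(7)
n <- 300

# Simulated stand-in for the car sales data; names and coefficients are invented.
cars <- data.frame(
  Age = runif(n, 1, 80),
  KM  = runif(n, 1000, 200000),
  HP  = sample(c(69, 90, 110), n, replace = TRUE)
)
cars$Price <- 20000 - 150 * cars$Age - 0.02 * cars$KM + 40 * cars$HP +
  rnorm(n, sd = 800)

train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- cars[train_idx, ]
valid <- cars[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Smaller cp allows more splits, so the tree becomes more complex.
cp_values <- c(0.1, 0.01, 0.001)

tree_results <- t(sapply(cp_values, function(cp) {
  fit <- rpart(Price ~ Age + KM + HP, data = train, method = "anova",
               control = rpart.control(cp = cp))
  c(cp         = cp,
    leaves     = sum(fit$frame$var == "<leaf>"),  # number of terminal nodes
    valid_rmse = rmse(valid$Price, predict(fit, newdata = valid)))
}))
print(round(tree_results, 3))
```

For a fitted tree, `printcp(fit)` lists the cross-validated error at each cp value, which is another common basis for choosing the tree size.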

4. Model Comparison (15%)

a. Compare the accuracy of the selected (optimal) regression model and (optimal) decision tree and discuss and justify the most suitable predictive model for the business case.
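The comparison can be made by scoring both selected models on the same validation set, as in the sketch below. The data are simulated, the column names are hypothetical, and RMSE and MAE are assumed as the accuracy measures. Because the simulated prices are linear in the predictors, the result here says nothing about which model suits the real dataset.

```r
library(rpart)

set.seed(123)
n <- 300

# Simulated stand-in for the car sales data; names and coefficients are invented.
cars <- data.frame(
  Age = runif(n, 1, 80),
  KM  = runif(n, 1000, 200000),
  HP  = sample(c(69, 90, 110), n, replace = TRUE)
)
cars$Price <- 20000 - 150 * cars$Age - 0.02 * cars$KM + 40 * cars$HP +
  rnorm(n, sd = 800)

train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- cars[train_idx, ]
valid <- cars[-train_idx, ]

# Stand-ins for the "selected" regression model and decision tree.
lm_fit   <- lm(Price ~ Age + KM + HP, data = train)
tree_fit <- rpart(Price ~ Age + KM + HP, data = train, method = "anova",
                  control = rpart.control(cp = 0.01))

# RMSE penalises large errors more heavily; MAE is the average absolute error.
metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(rmse = sqrt(mean(err^2)), mae = mean(abs(err)))
}

comparison <- rbind(
  regression = metrics(valid$Price, predict(lm_fit, newdata = valid)),
  tree       = metrics(valid$Price, predict(tree_fit, newdata = valid))
)
print(round(comparison, 1))
```

Both error measures are in the units of the selling price, which makes the table easy to relate to the business case.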

Unformatted Attachment Preview

BUS5PA Predictive Analytics – Semester 2, 2018
Assignment 1: Building and Evaluating Predictive Models
Release Date: 14th August 2018
Due Date: 9th September 2018, 11.55pm
Weight: 30%
Format of Submission: A report (electronic form) + electronic submission of the R project (scripts) in LMS Site

Objective: a) Revise BUS5PA material on predictive modelling; the remaining objectives are as listed above.

Note: BUS5PA lecture weeks 1-4 have focused on providing you with a foundation knowledge of data preparation for predictive analytics, predictive modelling techniques and implementing these with R. Therefore, this assignment will evaluate your knowledge on building and evaluating predictive models. The deploying of the predictive models and the interpretation of results will be included in assignment 2.

Part A (30%) – Data Exploration and Cleaning: tasks as listed above.
Part B (70%) – Building predictive models using real world business case: business case and tasks as listed above.

Explanation & Answer


Running head: Report


Report:
Name:
Institution affiliation:
Date:


Part A: Data Exploration and Cleaning
1)
The numerical variables in the breakfast cereal data are Calories, Protein, Fat,
Sodium, Fiber, Complex Carbos, Tot Carbo, Sugars, Calories fr Fat, Potassium, Enriched,
Wt/serving and cups/serv. The only ordinal variable in the dataset is Fiber GR, and the
only nominal variable is Hot/cold.
2)
The summary statistics point to extreme values in the dataset: for some numerical
variables the maximum and minimum sit far apart (sodium, for example, has a range of 420).
3)
a) Based on the histograms and summary statistics, the variables with the largest
variability were sodium, whose range (max - min) was 420; potassium, whose range was 390;
and calories, whose range was 200.
b) Based on the hist...

