New England College R Markdown Programming Project

Unformatted Attachment Preview

Question

Write a fully executed R Markdown program and submit a PDF, Word, or HTML file performing a classification task on the binary response variable from the Santander Bank case study. Make sure to try various hyperparameters of the SVM algorithm to find the best available model. You are required to clearly display and explain the models that were run for this task and their effect on the reduction of the cost function; points will be deducted if you fail to explain the output. All code assignments must be submitted as screenshots that include a slice of your desktop showing the timestamp; if the time and date are not visible, you will be graded 0. Put the screenshots in a Word document, comment the code (explain what it does), and interpret each graph where applicable (explain what it depicts). All assignments go through SafeAssign; your similarity score should be less than 30, and you are allowed only 2 attempts.

Week-3 Assignment (11/7/2021)

```r
# Libraries for plotting, resampling, model building, and evaluation
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
library(usdm)         # vifcor() for collinearity checks
## Loading required package: sp
## Loading required package: raster
library(pROC)         # ROC curves and AUC
## Type 'citation("pROC")' for a citation.
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats': cov, smooth, var
library(caret)
library(rpart)
library(DataCombine)
## Attaching package: 'DataCombine'
## The following object is masked from 'package:raster': shift
library(ROSE)         # oversampling for the imbalanced response
## Loaded ROSE 0.0-4
library(e1071)        # SVM
## Attaching package: 'e1071'
## The following object is masked from 'package:raster': interpolate
library(xgboost)

# Reading the test and train data frames
setwd("/Users/Sai/Desktop/ML Course/santander-customer-transaction-prediction")
train = read.csv('train.csv')
test  = read.csv('test.csv')

dim(train)
## [1] 200000    202
dim(test)
## [1] 200000    201

# Per-variable summary statistics (the preview scrambled this output;
# only the first few columns are recoverable)
summary(train)
##    ID_code              target            var_0             var_1
##  Length:200000      Min.   :0.0000   Min.   : 0.4084   Min.   :-15.043
##  Class :character   1st Qu.:0.0000   1st Qu.: 8.4538   1st Qu.: -4.740
##  Mode  :character   Median :0.0000   Median :10.5247   Median : -1.608
##                     Mean   :0.1005   Mean   :10.6799   Mean   : -1.628
##                     3rd Qu.:0.0000   3rd Qu.:12.7582   3rd Qu.:  1.359
##                     Max.   :1.0000   Max.   :20.3150   Max.   : 10.377
##      var_2             var_3             var_4            var_5
##  Min.   : 2.117   Min.   :-0.0402   Min.   : 5.075   Min.   :-32.5626
##  1st Qu.: 8.722   1st Qu.: 5.2541   1st Qu.: 9.883   1st Qu.:-11.2004
##  Median :10.580   Median : 6.8250   Median :11.108   Median : -4.8331
##  Mean   :10.715   Mean   : 6.7965   Mean   :11.078   Mean   : -5.0653
##  3rd Qu.:12.517   3rd Qu.: 8.3241   3rd Qu.:12.261   3rd Qu.:  0.9248
##  Max.   :19.353   Max.   :13.1883   Max.   :16.671   Max.   : 17.2516
## ... (var_6 through var_199 truncated in the preview)

# Set the ID columns aside and drop them from the modeling data
train_ID_code_orignal = train$ID_code
test_Id_code_orignal  = test$ID_code
train$ID_code = NULL
test$ID_code  = NULL

print(dim(train))
## [1] 200000    201
print(dim(test))
## [1] 200000    200

# Class distribution of the response: heavily imbalanced
table(train$target)
##
##      0      1
## 179902  20098
```
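Only about 10% of the rows are positives. The quick check below is not part of the original attachment; it is added because the later discussion hinges on why a model with 91% accuracy can still be "very poor": always predicting the majority class already yields about 90% accuracy.

```r
# Added illustration (not in the original submission): class proportions.
# A degenerate model that always predicts 0 scores ~0.90 accuracy with
# zero recall, so accuracy alone cannot rank models on this data.
prop.table(table(train$target))
##
##       0       1
## 0.89951 0.10049
```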
```r
# Report per-column missing-value percentages, sorted descending
findMissingValue = function(df) {
  missing_val = data.frame(apply(df, 2, function(x) { sum(is.na(x)) }))
  missing_val$Columns = row.names(missing_val)
  names(missing_val)[1] = "Missing_percentage"
  # note: the original divides by nrow(train) for both data sets; that is
  # harmless here only because train and test both have 200,000 rows
  missing_val$Missing_percentage = (missing_val$Missing_percentage / nrow(train)) * 100
  missing_val = missing_val[order(-missing_val$Missing_percentage), ]
  row.names(missing_val) = NULL
  missing_val = missing_val[, c(2, 1)]
  return(missing_val)
}

head(findMissingValue(train))
##   Columns Missing_percentage
## 1  target                  0
## 2   var_0                  0
## 3   var_1                  0
## 4   var_2                  0
## 5   var_3                  0
## 6   var_4                  0

head(findMissingValue(test))
##   Columns Missing_percentage
## 1   var_0                  0
## 2   var_1                  0
## 3   var_2                  0
## 4   var_3                  0
## 5   var_4                  0
## 6   var_5                  0
```

Neither data set has any missing values, so no imputation is needed at this stage.

```r
# Separate the predictors from the response and check for collinearity
independent_var = (colnames(train) != 'target')
X = train[, independent_var]
Y = train$target

cor = vifcor(X)   # variance inflation factors + pairwise correlations
print(cor)
## No variable from the 200 input variables has collinearity problem.
## The linear correlation coefficients ranges between:
## min correlation ( var_18 ~ var_11 ):  -4.534968e-07
## max correlation ( var_177 ~ var_152 ):  0.05417406
##
## ---------- VIFs of the remained variables --------
##   Variables      VIF
## 1     var_0 1.045308
## ... (VIF rows for the other 199 variables truncated in the preview)
```

The predictors are essentially uncorrelated (maximum pairwise correlation 0.054), so all 200 variables are kept.

```r
# Density plots of the predictors, drawn in panels
# (fixed: the original plot_helper read the global X rather than the data
#  frame passed in, so plot_boxplot(test) silently re-plotted X)
plot_distribution = function(df) {
  variblename = colnames(df)
  temp = 1
  for (i in seq(10, dim(df)[2], 10)) {
    plot_helper(df, temp, i, variblename)
    temp = i + 1
  }
}
plot_helper = function(df, start, stop, variblename) {
  par(mar = c(2, 2, 2, 2))
  par(mfrow = c(4, 3))
  for (i in variblename[start:stop]) {
    plot(density(df[[i]]), main = i)   # one density curve per variable
  }
}
plot_distribution(X)   # (density panels not reproduced in the preview)

# Box plots of the same variables, to flag outliers visually;
# note this redefines plot_helper, replacing the density version above
plot_boxplot = function(df) {
  variblename = colnames(df)
  temp = 1
  for (i in seq(10, dim(df)[2], 10)) {
    plot_helper(df, temp, i, variblename)
    temp = i + 1
  }
}
plot_helper = function(df, start, stop, variblename) {
  par(mar = c(2, 2, 2, 2))
  par(mfrow = c(4, 3))
  for (i in variblename[start:stop]) {
    boxplot(df[[i]], main = i)
  }
}
plot_boxplot(X)      # (box-plot panels not reproduced in the preview)
plot_boxplot(test)
```

```r
# Replace every value flagged as an outlier by boxplot.stats() with NA
fill_outlier_with_na = function(df) {
  cnames = colnames(df)
  for (i in cnames) {
    val = df[, i][df[, i] %in% boxplot.stats(df[, i])$out]
    df[, i][df[, i] %in% val] = NA
  }
  return(df)
}

X = fill_outlier_with_na(X)
print(paste0("Total na's in training data ::", sum(is.na(X))))
## [1] "Total na's in training data ::26533"

test = fill_outlier_with_na(test)
print(paste0("Total na's in testing data ::", sum(is.na(test))))
## [1] "Total na's in testing data ::27087"

# Impute those NAs with the column mean; the function body was cut off in
# the preview, and the assignment below is its obvious completion
fill_outlier_with_mean = function(df) {
  cnames = colnames(df)
  for (i in cnames) {
    df[is.na(df[, i]), i] = mean(df[, i], na.rm = TRUE)  # (reconstructed line)
  }
  return(df)
}
```
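Everything between the outlier-imputation functions and the next model summary was lost when the attachment was flattened (apparently any text between a literal "<" and ">" was stripped, which also swallowed code containing "<-"). The missing steps can be inferred from the later output: the mean imputation is applied, the data are split 70/30 (140,000 / 60,000 rows, matching the degrees of freedom and ROC counts that follow), a ROSE-oversampled copy `over` of about 251,518 rows is built, a metric helper `getmodel_accuracy()` is defined, and a baseline logistic model is fit. The sketch below is a hedged reconstruction, not the original code: every name except `over`, `over_logit`, and `getmodel_accuracy` is an assumption, and the preview's printed recall/FNR pairs are not mutually consistent with each other, so the original helper evidently differed from this standard one.

```r
# --- Hypothetical reconstruction of the truncated section ---

# Apply the mean imputation to the cleaned predictors and the test set
X    = fill_outlier_with_mean(X)
test = fill_outlier_with_mean(test)

# Re-attach the response (target becomes column 201) and split 70/30;
# `test` is overwritten with the 30% validation part, which is why the
# later code can index its labels as test[, 201]
df = cbind(X, target = Y)
set.seed(1)                                # assumed seed
idx   = sample(nrow(df), 0.7 * nrow(df))
train = df[idx, ]                          # 140,000 rows
test  = df[-idx, ]                         # 60,000 rows

# Oversample the minority class of the training split with ROSE;
# method = "over" duplicates minority rows until the classes are roughly
# balanced, consistent with the "251517 degrees of freedom" printed later
over = ovun.sample(target ~ ., data = train, method = "over", seed = 1)$data
names(over)[names(over) == "target"] = "Y" # the later call is glm(Y ~ ., data = over)

# Standard confusion-matrix metrics, printed in the preview's format
getmodel_accuracy = function(cm) {
  tn = cm[1, 1]; fp = cm[1, 2]; fn = cm[2, 1]; tp = cm[2, 2]
  precision = tp / (tp + fp)
  recall    = tp / (tp + fn)
  print(paste("accuracy",  round((tp + tn) / sum(cm), 2)))
  print(paste("precision", round(precision, 2)))
  print(paste("recall",    round(recall, 2)))
  print(paste("fpr",       round(fp / (fp + tn), 2)))
  print(paste("fnr",       round(fn / (fn + tp), 2)))
  print(paste("f1",        round(2 * precision * recall / (precision + recall), 2)))
}

# Baseline logistic regression on the un-resampled training split; note
# that the preview reuses the name over_logit for this model as well
over_logit = glm(target ~ ., data = train, family = "binomial")
summary(over_logit)
```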
The preview resumes with the summary of this first, baseline logistic model (the head of the output is also lost; the coefficient table picks up at the intercept):

```r
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)
## (Intercept) -3.1485419  0.0159619 -197.253  < 2e-16 ***
## var_0        0.1695836  0.0101810   16.657  < 2e-16 ***
## var_1        0.1733650  0.0103298   16.783  < 2e-16 ***
## var_2        0.1625150  0.0101630   15.991  < 2e-16 ***
## var_3        0.0467929  0.0103706    4.512 6.42e-06 ***
## var_4        0.0477207  0.0103403    4.615 3.93e-06 ***
## var_5        0.1051940  0.0103299   10.183  < 2e-16 ***
## var_6        0.2333818  0.0102193   22.837  < 2e-16 ***
## var_7       -0.0161019  0.0103248   -1.560 0.118870
## var_8        0.0548121  0.0103742    5.284 1.27e-07 ***
## var_9       -0.1432844  0.0102883  -13.927  < 2e-16 ***
## var_10      -0.0042218  0.0103586   -0.408 0.683596
## var_11       0.0675482  0.0103832    6.506 7.74e-11 ***
## var_12      -0.2314095  0.0101754  -22.742  < 2e-16 ***
## var_13      -0.1801244  0.0102555  -17.564  < 2e-16 ***
## var_14      -0.0078425  0.0103623   -0.757 0.449149
## var_15       0.0487737  0.0103437    4.715 2.41e-06 ***
## var_16       0.0212644  0.0103352    2.057 0.039642 *
## var_17      -0.0040274  0.0103507   -0.389 0.697206
## var_18       0.1365054  0.0102937   13.261  < 2e-16 ***
## ... (remaining coefficients truncated in the preview)
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 91304  on 139999  degrees of freedom
## Residual deviance: 64634  on 139799  degrees of freedom
## AIC: 65036
##
## Number of Fisher Scoring iterations: 6
```

```r
# Predict on the held-out split and evaluate at a 0.5 threshold
y_prob = predict(over_logit, test[-201], type = 'response')
y_pred = ifelse(y_prob > 0.5, 1, 0)
conf_matrix = table(test[, 201], y_pred)
getmodel_accuracy(conf_matrix)
## [1] "accuracy 0.91"
## [1] "precision 0.69"
## [1] "recall 0.01"
## [1] "fpr 0.01"
## [1] "fnr 0.73"
## [1] "f1 0.03"

roc = roc(test[, 201], y_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(roc)
## Call:
## roc.default(response = test[, 201], predictor = y_prob)
##
## Data: y_prob in 53966 controls (test[, 201] 0) < 6034 cases (test[, 201] 1).
## Area under the curve: 0.8558
plot(roc, main = "Logistic Regression base Roc")   # (ROC plot not reproduced)
```

The write-up reports for this baseline: accuracy 0.92, precision 0.68, recall 0.01, FPR 0.01, FNR 0.73, F1 0.03, and area under the curve 0.8585. Despite the high accuracy, the near-zero recall and F1 mean the model almost never identifies an actual positive, so this is a very poor model.

```r
# Refit the logistic regression on the ROSE-oversampled training data
over_logit = glm(formula = Y ~ ., data = over, family = 'binomial')
summary(over_logit)
## Call:
## glm(formula = Y ~ ., family = "binomial", data = over)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -3.3725  -0.7340  -0.0842   0.7295   3.1915
##
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)
## (Intercept) -0.9421312  0.0063472 -148.434  < 2e-16 ***
## var_0        0.1658809  0.0049805   33.306  < 2e-16 ***
## ... (intermediate coefficients truncated in the preview)
## var_199      0.0958636  0.0051390   18.654  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 348677  on 251517  degrees of freedom
## Residual deviance: 232840  on 251317  degrees of freedom
## AIC: 233242
##
## Number of Fisher Scoring iterations: 5

# Evaluate on the same held-out split
y_prob = predict(over_logit, test[-201], type = 'response')
y_pred = ifelse(y_prob > 0.5, 1, 0)
conf_matrix = table(test[, 201], y_pred)
getmodel_accuracy(conf_matrix)
## [1] "accuracy 0.78"
## [1] "precision 0.28"
## [1] "recall 0.22"
## [1] "fpr 0.22"
## [1] "fnr 0.23"
## [1] "f1 0.25"

roc = roc(test[, 201], y_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(roc)
## Call:
## roc.default(response = test[, 201], predictor = y_prob)
##
## Data: y_prob in 53966 controls (test[, 201] 0) < 6034 cases (test[, 201] 1).
## Area under the curve: 0.8554

# Plot the ROC curve
plot(roc, main = "Logistic Regression roc-auc oversampled")   # (plot not reproduced)
```

From the above output the oversampled model gives accuracy 0.78, precision 0.28, recall 0.22, FPR 0.22, FNR 0.23, and F1 0.25. It trades headline accuracy (0.78 vs. 0.91) for far better recall (0.22 vs. 0.01) and F1 (0.25 vs. 0.03) at essentially the same AUC, so we consider it better than the baseline logistic model.

Time Stamp: (desktop screenshot with visible date and time, as required by the prompt)
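The prompt requires trying several SVM hyperparameter settings, but no SVM code survived in this preview. Below is a minimal sketch of how that requirement could be met with e1071's `tune.svm()`; the subsample size, the cost/gamma grid, and the reuse of the oversampled `over` frame are assumptions, not the original author's choices. In the SVM objective, `cost` is the C penalty on margin violations and `gamma` is the RBF kernel width, so `summary(tuned)` directly documents each setting's effect on the cross-validated error, which is what the prompt's "effect on the reduction of the Cost Function" asks to be explained.

```r
# Hypothetical SVM tuning sketch (not part of the original submission).
# SVM training scales poorly with row count, so tune on a subsample
# rather than on all ~251k oversampled rows.
set.seed(1)
sub   = over[sample(nrow(over), 10000), ]  # assumed subsample size
sub$Y = as.factor(sub$Y)                   # svm() needs a factor response for classification

# 3x3 grid over cost (C penalty) and gamma (RBF kernel width),
# scored with tune.svm()'s default 10-fold cross-validation
tuned = tune.svm(Y ~ ., data = sub,
                 kernel = "radial",
                 cost   = c(0.1, 1, 10),
                 gamma  = c(0.001, 0.01, 0.1))

summary(tuned)               # cross-validated error for every cost/gamma pair
best_svm = tuned$best.model  # model refit with the best pair

# Evaluate on the same held-out split used for the logistic models
y_pred      = predict(best_svm, test[-201])
conf_matrix = table(test[, 201], y_pred)
getmodel_accuracy(conf_matrix)
```

For a ROC curve comparable to the logistic models, the SVM would additionally need to be fit with `probability = TRUE` so that `predict()` can return class probabilities rather than hard labels.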
