School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics
Home Work 6
By
Lecture 06 HW06 Regression
➢ HW06 Q1 Linear Regression Example
➢ HW06 Q2 Logistic Regression Example
➢ HW06 Q3 Cars
– https://www.kaggle.com/jerinvarghese/linear-regression-with-r/code
➢ HW06 Q4 Logistic Regression in R
– https://stats.idre.ucla.edu/r/dae/logit-regression/
➢ HW06 Q5 Perform a logistic regression on the following
– https://www.kaggle.com/hidede/graduate-admissions/data
– This is an ongoing Kaggle competition (create a Kaggle account)
Q1 Linear Regression Example
• The file read here has the following fields: ID, Income, Age, Education, and Gender.
• The data set holds 1,500 records, with a mean Income of 75.99, a mean Age of 43.58, and a mean Education of 14.68. Gender, the only categorical variable (coded 0/1), has a mean of 0.49.
> income_input = as.data.frame( read.csv("income.csv") )
> income_input[1:10,]
   ID Income Age Education Gender
1   1    113  69        12      1
2   2     91  52        18      0
3   3    121  65        14      0
4   4     81  58        12      0
5   5     68  31        16      1
6   6     92  51        15      1
7   7     75  53        15      0
8   8     76  56        13      0
9   9     56  42        15      1
10 10     53  33        11      1
> summary(income_input)
       ID             Income            Age          Education         Gender
 Min.   :   1.0   Min.   : 14.00   Min.   :18.00   Min.   :10.00   Min.   :0.00
 1st Qu.: 375.8   1st Qu.: 62.00   1st Qu.:30.00   1st Qu.:12.00   1st Qu.:0.00
 Median : 750.5   Median : 76.00   Median :44.00   Median :15.00   Median :0.00
 Mean   : 750.5   Mean   : 75.99   Mean   :43.58   Mean   :14.68   Mean   :0.49
 3rd Qu.:1125.2   3rd Qu.: 91.00   3rd Qu.:57.00   3rd Qu.:16.00   3rd Qu.:1.00
 Max.   :1500.0   Max.   :134.00   Max.   :70.00   Max.   :20.00   Max.   :1.00
> library(lattice)   # splom() is provided by the lattice package
> splom(~income_input[c(2:5)],
+       groups=NULL,
+       data=income_input,
+       axis.line.tck=0,
+       axis.text.alpha=0)
> results <- lm(Income ~ Age + Education + Gender, data = income_input)
> summary(results)
Call:
lm(formula = Income ~ Age + Education + Gender, data = income_input)
Residuals:
    Min      1Q  Median      3Q     Max
-37.340  -8.101   0.139   7.885  37.271
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.26299    1.95575   3.714 0.000212 ***
Age          0.99520    0.02057  48.373  < 2e-16 ***
Education    1.75788    0.11581  15.179  < 2e-16 ***
Gender      -0.93433    0.62388  -1.498 0.134443
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.07 on 1496 degrees of freedom
Multiple R-squared: 0.6364, Adjusted R-squared: 0.6357
F-statistic: 873 on 3 and 1496 DF, p-value: < 2.2e-16
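As a quick check on how the columns of the coefficient table relate to each other: the t value is the estimate divided by its standard error, and the p-value comes from a t distribution with the residual degrees of freedom. A minimal sketch using the Gender row reported above (the variable names are illustrative and not part of the original script):

```r
# Values copied from the Gender row of the coefficient table
estimate  <- -0.93433
std_error <- 0.62388
df_resid  <- 1496          # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(c(t_value = t_value, p_value = p_value), 3))
```

Because the p-value is well above 0.05, Gender is a natural candidate to drop from the model.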
> results2 <- lm(Income ~ Age + Education, data = income_input)
> summary(results2)
Call:
lm(formula = Income ~ Age + Education, data = income_input)
Residuals:
    Min      1Q  Median      3Q     Max
-36.889  -7.892   0.185   8.200  37.740
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.75822    1.92728   3.507 0.000467 ***
Age          0.99603    0.02057  48.412  < 2e-16 ***
Education    1.75860    0.11586  15.179  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.08 on 1497 degrees of freedom
Multiple R-squared: 0.6359, Adjusted R-squared: 0.6354
F-statistic: 1307 on 2 and 1497 DF, p-value: < 2.2e-16
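Dropping a non-significant predictor can also be checked with a partial F-test between the two nested models. The sketch below is only an illustration on synthetic data: income.csv is not available here, so the data frame `sim` and its coefficients are made up, and the printed numbers will not match the output above.

```r
set.seed(1)
n <- 500

# Synthetic stand-in for income.csv (hypothetical values)
sim <- data.frame(
  Age       = sample(18:70, n, replace = TRUE),
  Education = sample(10:20, n, replace = TRUE),
  Gender    = rbinom(n, 1, 0.5)
)
# By construction, Income depends on Age and Education only
sim$Income <- 7 + 1.0 * sim$Age + 1.75 * sim$Education + rnorm(n, sd = 12)

full    <- lm(Income ~ Age + Education + Gender, data = sim)
reduced <- lm(Income ~ Age + Education, data = sim)

# Partial F-test: does adding Gender significantly reduce the residual error?
cmp <- anova(reduced, full)
print(cmp)
```

When only one term is dropped, the F statistic from `anova()` equals the square of that term's t value in the full model, so the two tests agree.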
> # compute confidence intervals for the model parameters
> confint(results2, level = 0.95)
                2.5 %    97.5 %
(Intercept) 2.9777598 10.538690
Age         0.9556771  1.036392
Education   1.5313393  1.988562
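These intervals are the estimate plus or minus a t quantile times the standard error. A small sketch that reproduces the Age interval from the rounded values in the coefficient table (variable names are illustrative):

```r
# Age row of the two-predictor model
estimate  <- 0.99603
std_error <- 0.02057
df_resid  <- 1497

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

lower <- estimate - t_crit * std_error
upper <- estimate + t_crit * std_error

print(round(c(lower = lower, upper = upper), 4))
```

The result agrees with the `confint()` output to about four decimal places; the small difference comes from the rounded inputs.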
> # compute a confidence interval on the expected income of a person
> Age Education new_pt conf_int_pt conf_int_pt
       fit      lwr      upr
1 68.69884 67.83102 69.56667
> # compute a prediction interval on the income of the person
> pred_int_pt pred_int_pt
       fit      lwr      upr
1 68.69884 44.98867 92.40902
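The prediction interval is much wider than the confidence interval because it also includes the residual variance of a single observation. The rough sketch below uses only the rounded numbers printed above. The inputs Age = 41 and Education = 12 are an assumption: the original values are not shown in this text, and this pair was chosen only because it reproduces the reported fit to within rounding.

```r
# Coefficients of the two-predictor model (Income ~ Age + Education)
b0 <- 6.75822; b_age <- 0.99603; b_edu <- 1.75860

# ASSUMED inputs (hypothetical; not given in the text)
Age <- 41; Education <- 12
fit_check <- b0 + b_age * Age + b_edu * Education   # compare with 68.69884

t_crit <- qt(0.975, df = 1497)
sigma  <- 12.08          # residual standard error (rounded)

# Standard error of the fitted mean, backed out of the confidence interval
se_fit <- (69.56667 - 68.69884) / t_crit

# The prediction standard error adds the residual variance
se_pred <- sqrt(se_fit^2 + sigma^2)
pred_half_width <- t_crit * se_pred   # compare with 92.40902 - 68.69884

print(c(fit_check = fit_check, pred_half_width = pred_half_width))
```

In practice both intervals come from `predict()` on the fitted model, with `interval = "confidence"` or `interval = "prediction"`.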
> with(results2, {
+   plot(fitted.values, residuals, ylim=c(-40,40))
+   points(c(min(fitted.values), max(fitted.values)), c(0,0), type = "l")})
➢ Checking the normality assumption
➢ hist(results2$residuals, main="")
➢ Residuals are centered on zero and appear normally distributed.
> qqnorm(results2$residuals, ylab="residuals", main="")
> qqline(results2$residuals)
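The visual checks can be complemented with numeric ones. The sketch below is an illustration on synthetic data, since the fitted model from this exercise is not available here: `x`, `y`, and `fit` are made-up stand-ins. It confirms that least-squares residuals average to zero and runs a Shapiro-Wilk normality test.

```r
set.seed(42)
n <- 300

# Synthetic linear relationship with normal noise (hypothetical values)
x <- runif(n, 18, 70)
y <- 7 + 1.0 * x + rnorm(n, sd = 12)

fit <- lm(y ~ x)
res <- residuals(fit)

# With an intercept in the model, OLS residuals sum to (numerically) zero
mean_res <- mean(res)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals
sw <- shapiro.test(res)

print(mean_res)
print(sw)
```

With large samples the Shapiro-Wilk test flags even minor departures from normality, so the Q-Q plot remains the more informative check.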
Q2 Logistic Regression Example
Here we begin by reading a file that has the following fields: ID, Churned, Age, Married, Cust_years, and Churned_contacts.
The data set has 8,000 records. Of these 8,000 records, 1,743 have a Churned value of 1.
> churn_input = as.data.frame( read.csv("churn.csv") )
> head(churn_input)
  ID Churned Age Married Cust_years Churned_contacts
1  1       0  61       1          3                1
2  2       0  50       1          3                2
3  3       0  47       1          2                0
4  4       0  50       1          3                3
5  5       0  29       1          1                3
6  6       0  43       1          4                3
> sum(churn_input$Churned)
[1] 1743
> View(churn_input)
> summary(churn_input)
       ID          Churned            Age           Married
 Min.   :   1   Min.   :0.0000   Min.   :18.00   Min.   :0.0000
 1st Qu.:2001   1st Qu.:0.0000   1st Qu.:30.00   1st Qu.:0.0000
 Median :4000   Median :0.0000   Median :41.00   Median :1.0000
 Mean   :4000   Mean   :0.2179   Mean   :41.00   Mean   :0.5004
 3rd Qu.:6000   3rd Qu.:0.0000   3rd Qu.:53.00   3rd Qu.:1.0000
 Max.   :8000   Max.   :1.0000   Max.   :65.00   Max.   :1.0000
   Cust_years     Churned_contacts
 Min.   : 1.000   Min.   :0.000
 1st Qu.: 2.000   1st Qu.:1.000
 Median : 3.000   Median :2.000
 Mean   : 3.164   Mean   :1.718
 3rd Qu.: 4.000   3rd Qu.:2.000
 Max.   :10.000   Max.   :6.000
➢ Linear connection trends
glm(Churned ~ Age + Married + Cust_years + Churned_contacts,
    data = churn_input,
    family = binomial(link = "logit"))
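➢ Intercept: estimate of about 3.415
➢ Age coefficient: ~ -0.17
➢ Married coefficient: ~ 0.066
➢ Cust_years coefficient: ~ 0.018
➢ Churned_contacts coefficient: ~ 0.38
➢ Deviance residuals range from about -2 to +3. Unlike the residuals of a linear model, they are not expected to be normally distributed for a binary outcome.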
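Logistic regression coefficients are on the log-odds scale, so they need to be transformed before they can be read as probabilities or odds ratios. A minimal sketch using the approximate coefficients listed above; the customer profile in `cust` is hypothetical and serves only as an illustration:

```r
# Approximate coefficients as listed in the bullets above
b <- c(intercept = 3.415, Age = -0.17, Married = 0.066,
       Cust_years = 0.018, Churned_contacts = 0.38)

# HYPOTHETICAL customer (not taken from the data set)
cust <- c(Age = 50, Married = 1, Cust_years = 3, Churned_contacts = 2)

# Linear predictor (log-odds of churning)
log_odds <- b[["intercept"]] + sum(b[names(cust)] * cust)

# Inverse logit converts log-odds to a probability
p_churn <- plogis(log_odds)

# Odds ratio: multiplicative change in odds per additional churned contact
odds_ratio_contacts <- exp(b[["Churned_contacts"]])

print(round(c(log_odds = log_odds, p_churn = p_churn,
              odds_ratio_contacts = odds_ratio_contacts), 4))
```

Under these rounded coefficients, each additional churned contact multiplies the odds of churning by roughly 1.46, while each additional year of age lowers them.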
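> Churn_logistic1 <- glm(Churned ~ Age + Married + Cust_years + Churned_contacts,
+                        data = churn_input, family = binomial(link = "logit"))
> summary(Churn_logistic1)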
Call:
glm(formula = Churned ~ Age + Married + Cust_years + Churned_contacts,
family = binomial(link ...