Please answer the questions related to Data Science and Big Data Analytics. I have attached the questions in the PPT and the related material in the zip file.

Computer Science

ITS 836

University of the Cumberlands


School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics

Lecture 06 HW06 Regression
• HW06 Q1 Linear Regression Example
• HW06 Q2 Logistic Regression Example
• HW06 Q3 Cars
– https://www.kaggle.com/jerinvarghese/linear-regression-with-r/code
• HW06 Q4 Logistic Regression in R
– https://stats.idre.ucla.edu/r/dae/logit-regression/
• HW06 Q5 Perform a logistic regression on the following
– https://www.kaggle.com/hidede/graduate-admissions/data
– This is an ongoing Kaggle competition (create a Kaggle account)

6.1.2 Model Description
Q1 Linear Regression Example
> income_input = as.data.frame(read.csv("income.csv"))
> income_input[1:10,]
> summary(income_input)
> library(lattice)
> splom(~income_input[c(2:5)], groups=NULL, data=income_input, axis.line.tck=0, axis.text.alpha=0)
Scatterplot: examine the bottom line
• income~age: strong + trend
• income~educ: slight + trend
• income~gender: no trend

6.1.2 Model Description
Example in R
• Linear relationship trends: lm(Income~Age+Education+Gender,income_input)
• Intercept: income of $7263 for newborn female
• Age coef: ~1, year age increase -> $1k income incr
• Educ coef: ~1.76, year educ + -> $1.76k income +
• Gender coef: ~-0.93, male income decreases $930
• Residuals – uncertainty, sampling error
– assumed to be normally distributed
– vary from -37 to +37 (more information coming)
• Small p-values indicate statistical significance
– Age and Education highly significant (p < 2e-16)
> results2 <- lm(Income ~ Age + Education, income_input)
> summary(results2) # results about same as before
• Residual standard error / standard deviation
• R-squared (R2): the model explains ~64% of the variation in the data (R2 = 1 means the model explains the data perfectly)
• F-statistic: tests the entire model – p-value is small

6.1.2 Model Description
Confidence Intervals on the Parameters
• Once an acceptable linear model is developed, it is often useful to draw some inferences
– R provides confidence intervals using the confint() function
> confint(results2, level = .95)
– For example, the Education coefficient was 1.76, and the corresponding 95% confidence interval is (1.53, 1.99)
• In the income example, the regression line provides the expected income for a given Age and Education
• Using the predict() function in R, a confidence interval on the expected outcome can be obtained
> Age Education new_pt conf_int_pt conf_int_pt
– Expected income = $68699, conf interval ($67831, $69567)

6.1.3 Diagnostics
Evaluating the Residuals
• The error terms were assumed to be normally distributed with zero mean and constant variance
> with(results2,{plot(fitted.values,residuals,ylim=c(-40,40)) })

6.1.3 Diagnostics
Evaluating the Normality Assumption
• The normality assumption still has to be validated
> hist(results2$residuals)
• Residuals centered on zero and appear normally distributed
Explanation & Answer

Attached.

School of Computer & Information Sciences
ITS 836 Data Science and Big Data Analytics
Homework 6
By

Lecture 06 HW06 Regression
➢ HW06 Q1 Linear Regression Example
➢ HW06 Q2 Logistic Regression Example
➢ HW06 Q3 Cars
– https://www.kaggle.com/jerinvarghese/linear-regression-with-r/code
➢ HW06 Q4 Logistic Regression in R
– https://stats.idre.ucla.edu/r/dae/logit-regression/
➢ HW06 Q5 Perform a logistic regression on the following
– https://www.kaggle.com/hidede/graduate-admissions/data
– This is an ongoing Kaggle competition (create a Kaggle account)

Q1 Linear Regression Example


Here we read a file with the following fields: ID, Income, Age, Education, and Gender.

The data set holds 1,500 records, with a mean Income of 75.99, a mean Age of 43.58, and a mean
Education of 14.68. Gender, the only categorical variable, is coded 0/1 and has a mean of 0.49.

> income_input = as.data.frame( read.csv("income.csv") )
> income_input[1:10,]

   ID Income Age Education Gender
1   1    113  69        12      1
2   2     91  52        18      0
3   3    121  65        14      0
4   4     81  58        12      0
5   5     68  31        16      1
6   6     92  51        15      1
7   7     75  53        15      0
8   8     76  56        13      0
9   9     56  42        15      1
10 10     53  33        11      1

> summary(income_input)
       ID             Income            Age          Education         Gender
 Min.   :   1.0   Min.   : 14.00   Min.   :18.00   Min.   :10.00   Min.   :0.00
 1st Qu.: 375.8   1st Qu.: 62.00   1st Qu.:30.00   1st Qu.:12.00   1st Qu.:0.00
 Median : 750.5   Median : 76.00   Median :44.00   Median :15.00   Median :0.00
 Mean   : 750.5   Mean   : 75.99   Mean   :43.58   Mean   :14.68   Mean   :0.49
 3rd Qu.:1125.2   3rd Qu.: 91.00   3rd Qu.:57.00   3rd Qu.:16.00   3rd Qu.:1.00
 Max.   :1500.0   Max.   :134.00   Max.   :70.00   Max.   :20.00   Max.   :1.00

> library(lattice)
> splom(~income_input[c(2:5)],
+ groups=NULL,
+ data=income_input,
+ axis.line.tck=0,
+ axis.text.alpha=0)

> results <- lm(Income ~ Age + Education + Gender, income_input)
> summary(results)

Call:
lm(formula = Income ~ Age + Education + Gender, data = income_input)

Residuals:
    Min      1Q  Median      3Q     Max
-37.340  -8.101   0.139   7.885  37.271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.26299    1.95575   3.714 0.000212 ***
Age          0.99520    0.02057  48.373  < 2e-16 ***
Education    1.75788    0.11581  15.179  < 2e-16 ***
Gender      -0.93433    0.62388  -1.498 0.134443
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.07 on 1496 degrees of freedom
Multiple R-squared: 0.6364, Adjusted R-squared: 0.6357
F-statistic: 873 on 3 and 1496 DF, p-value: < 2.2e-16
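For readers without access to income.csv, the same fitting step can be tried on simulated data. This is only a sketch: the data frame `sim`, the model `fit_sim`, and the generating values (intercept 7, Age slope 1, Education slope 1.75, noise SD 12, no Gender effect) are assumptions chosen to loosely resemble the output above, not the original data.

```r
# Simulated stand-in for income.csv (hypothetical data, NOT the original file)
set.seed(836)
n <- 1500
sim <- data.frame(
  Age       = sample(18:70, n, replace = TRUE),
  Education = sample(10:20, n, replace = TRUE),
  Gender    = rbinom(n, 1, 0.5)
)

# Income in $1000s; assumed true coefficients 7, 1 and 1.75, Gender has no effect
sim$Income <- 7 + 1 * sim$Age + 1.75 * sim$Education + rnorm(n, sd = 12)

# Same model form as in the transcript above
fit_sim <- lm(Income ~ Age + Education + Gender, data = sim)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit_sim)$coefficients
print(round(coef_table, 4))
```

Each Estimate is read as the expected change in Income (in $1000s) for a one-unit increase in that variable with the others held fixed, which is how the slide bullets interpret the Age and Education coefficients.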

> results2 <- lm(Income ~ Age + Education, income_input)
> summary(results2)

Call:
lm(formula = Income ~ Age + Education, data = income_input)

Residuals:
    Min      1Q  Median      3Q     Max
-36.889  -7.892   0.185   8.200  37.740

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.75822    1.92728   3.507 0.000467 ***
Age          0.99603    0.02057  48.412  < 2e-16 ***
Education    1.75860    0.11586  15.179  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.08 on 1497 degrees of freedom
Multiple R-squared: 0.6359, Adjusted R-squared: 0.6354
F-statistic: 1307 on 2 and 1497 DF, p-value: < 2.2e-16
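The two summaries can be cross-checked by recomputing the adjusted R-squared from the printed R-squared values. The sketch below uses the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1); the helper name `adj_r2` is hypothetical, and n = 1500 is taken from the record count stated earlier.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1500                                   # records in the income data set

adj_full    <- adj_r2(0.6364, n, p = 3)     # Income ~ Age + Education + Gender
adj_reduced <- adj_r2(0.6359, n, p = 2)     # Income ~ Age + Education

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

Both results agree with the printed values (0.6357 and 0.6354) to four decimals, which supports the reading that dropping Gender costs almost nothing in explained variation.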

> # compute confidence intervals for the model parameters
> confint(results2, level = 0.95)
                2.5 %    97.5 %
(Intercept) 2.9777598 10.538690
Age         0.9556771  1.036392
Education   1.5313393  1.988562
> # compute a confidence interval on the expected income of a person
> Age Education new_pt conf_int_pt conf_int_pt
       fit      lwr      upr
1 68.69884 67.83102 69.56667
> # compute a prediction interval on the income of the person
> pred_int_pt pred_int_pt
       fit      lwr      upr
1 68.69884 44.98867 92.40902
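The two intervals share the same point estimate but differ greatly in width. A rough sketch of why, using only numbers printed above: the prediction interval adds the residual variance to the uncertainty of the fitted mean. The input values Age = 41 and Education = 12 are an assumption (the transcript does not show them); they are used here because they reproduce the printed fit from the results2 coefficients.

```r
# Coefficients of results2 as printed in the summary output
b0 <- 6.75822; b_age <- 0.99603; b_edu <- 1.75860

# ASSUMPTION: inputs not shown in the transcript; chosen because they match the fit
Age <- 41; Education <- 12
fit <- b0 + b_age * Age + b_edu * Education

# Intervals and residual standard error as printed
conf_lwr <- 67.83102; conf_upr <- 69.56667
pred_lwr <- 44.98867; pred_upr <- 92.40902
sigma    <- 12.08

# Recover the standard error of the fitted mean from the confidence interval
t_crit <- qt(0.975, df = 1497)
se_fit <- (conf_upr - conf_lwr) / 2 / t_crit

# Prediction interval half-width: fitted-mean uncertainty plus residual variance
pred_half <- t_crit * sqrt(se_fit^2 + sigma^2)

print(c(fit = fit, se_fit = se_fit, pred_half = pred_half))

# With the fitted model in memory, the corresponding calls would be along the lines of:
# new_pt <- data.frame(Age, Education)
# predict(results2, new_pt, level = 0.95, interval = "confidence")
# predict(results2, new_pt, level = 0.95, interval = "prediction")
```

The recomputed half-width comes out close to half the printed prediction interval, with small differences due to rounding in the printed values.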

> with(results2, {
+ plot(fitted.values, residuals, ylim=c(-40,40))
+ points(c(min(fitted.values), max(fitted.values)), c(0,0), type = "l")})

➢ Checking the normality assumption with a histogram of the residuals
➢ hist(results2$residuals, main="")
➢ The residuals are centered on zero and appear normally distributed.

> qqnorm(results2$residuals, ylab="residuals", main="")
> qqline(results2$residuals)

Q2 Logistic Regression Example
Here we begin by reading a file with the following fields: ID, Churned, Age, Married, Cust_years, and Churned_contacts.
The data set has 8,000 records, of which 1,743 have a Churned value of 1.
> churn_input = as.data.frame( read.csv("churn.csv") )
> head(churn_input)
  ID Churned Age Married Cust_years Churned_contacts
1  1       0  61       1          3                1
2  2       0  50       1          3                2
3  3       0  47       1          2                0
4  4       0  50       1          3                3
5  5       0  29       1          1                3
6  6       0  43       1          4                3
> sum(churn_input$Churned)
[1] 1743
> View(churn_input)
> summary(churn_input)
       ID          Churned            Age           Married         Cust_years     Churned_contacts
 Min.   :   1   Min.   :0.0000   Min.   :18.00   Min.   :0.0000   Min.   : 1.000   Min.   :0.000
 1st Qu.:2001   1st Qu.:0.0000   1st Qu.:30.00   1st Qu.:0.0000   1st Qu.: 2.000   1st Qu.:1.000
 Median :4000   Median :0.0000   Median :41.00   Median :1.0000   Median : 3.000   Median :2.000
 Mean   :4000   Mean   :0.2179   Mean   :41.00   Mean   :0.5004   Mean   : 3.164   Mean   :1.718
 3rd Qu.:6000   3rd Qu.:0.0000   3rd Qu.:53.00   3rd Qu.:1.0000   3rd Qu.: 4.000   3rd Qu.:2.000
 Max.   :8000   Max.   :1.0000   Max.   :65.00   Max.   :1.0000   Max.   :10.000   Max.   :6.000

➢ Logistic model (linear in the log-odds of churning)
glm(Churned ~ Age + Married + Cust_years + Churned_contacts,
data=churn_input,
family=binomial(link="logit"))
➢ Intercept: estimate of 3.415
➢ Age coef: ~-0.17
➢ Married coef: ~0.066
➢ Cust_years coef: ~0.018
➢ Churned_contacts coef: ~0.38
➢ Deviance residuals vary from about -2 to +3; unlike linear regression residuals, they are not assumed to be normally distributed
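To make the coefficients listed above concrete, the sketch below turns them into a churn probability and into odds ratios. It uses the rounded values from the bullets, so the numbers are approximate, and the customer profile (Age 50, married, 3 years as a customer, 2 churned contacts) is a hypothetical example, not a row from churn.csv.

```r
# Approximate coefficients as listed in the text (rounded values)
b <- c(Intercept = 3.415, Age = -0.17, Married = 0.066,
       Cust_years = 0.018, Churned_contacts = 0.38)

# Hypothetical customer, for illustration only (1 multiplies the intercept)
x <- c(Intercept = 1, Age = 50, Married = 1, Cust_years = 3, Churned_contacts = 2)

# Linear predictor: the log-odds of churning
log_odds <- sum(b * x)

# Logistic function: plogis(z) is 1 / (1 + exp(-z))
p_churn <- plogis(log_odds)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(b[-1])

print(log_odds)
print(p_churn)
print(round(odds_ratios, 3))
```

An odds ratio below 1 (Age) means the odds of churning fall as the variable increases, while a ratio above 1 (Churned_contacts) means they rise.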

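> Churn_logistic1 <- glm(Churned ~ Age + Married + Cust_years + Churned_contacts,
+ data=churn_input, family=binomial(link="logit"))
> summary(Churn_logistic1)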
Call:
glm(formula = Churned ~ Age + Married + Cust_years + Churned_contacts,
family = binomial(link ...

