Description
5-slide PowerPoint. Integrate the data analysis I have on hand (done in RStudio and Excel) into PowerPoint slides. We analyze free-throw percentage in the NBA with a few models to determine correlation, among other things. Please put everything we have into a few slides based on the instructions I will attach.
If you know how to use RStudio, this should be 10 minutes of work for you.
setwd("C:/Users/kaziz/Desktop")
library(readr)
Basketball <- read_csv("C:/Users/kaziz/Deskt
# to help generate correlation plots
install.packages("PerformanceA
library(PerformanceAnalytics)
# to help visualize correlation in color
install.packages("corrplot", repos = "http://cran.us.r-project.org")
library(corrplot)
#See some Descriptive statistics about our Basketball dataset
summary(Basketball)
attach(Basketball)
plot(Basketball$FTP,Basketball
```{r, results='hide'}
# Using the function chart.Correlation from "PerformanceAnalytics" package,
# we can create a correlation matrix easily, much easier than built in functions
# However, before that, we need to pick out the numerical variables
# because we cannot run correlation matrix with categorical data or missing data
Basketball.num = sapply(Basketball, is.numeric) # label TRUE FALSE for numerical variables
num = Basketball[,Basketball.num] # selecting only numerical variables
```
chart.Correlation(num)
correlation = cor(num, use = "complete.obs")
corrplot(correlation, type="upper")
Basketball$Post= ifelse(Basketball$Pos=="PG",1,
```{r}
# first load package "caTools"
library(caTools)
# based on probability 70% training data / 30% test data split.
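# We create an indicator variable called "indicator", where indicator = TRUE takes up 70% of the rows.
# sample.split() expects a vector with one entry per row, so we pass the outcome column, not the whole data frame.
indicator = sample.split(Basketball$FTP, SplitRatio = 0.7)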
# Extract out the data based on whether indicator variable is TRUE or FALSE
testing = Basketball[!indicator,] # getting 30% of the data as testing
training = Basketball[indicator,] # getting 70% of the data as training
# Pass the training data explicitly with data = training (safer than relying on attach())
# To build a linear regression model, give this model a name "linear":
linear = lm(FTP ~ Post + FGP + `3PP` + AST + TRB + TOV + BLK + `PS/G` + MP, data = training)
# To see the result of model:
summary(linear)
plot(linear)
# To predict FTP for the testing dataset using the linear model we built
testing$linear_prediction = predict(linear, newdata = testing)
# To see the accuracy of prediction:
accuracy = testing$linear_prediction - testing$FTP # signed prediction error
percent = accuracy/testing$FTP # error relative to the actual value
mean(accuracy, na.rm = TRUE) # average signed error, in the same units as FTP
mean(abs(percent), na.rm = TRUE) # to see how much percentage away from the actual
Explanation & Answer
1.1 Number of Games per Season
In [3]:
games = df.drop_duplicates("game_id") \
.groupby(["season", "playoffs"]).size() \
.unstack()
games.head(3)
Out[3]:
playoffs     playoffs  regular
season
2006 - 2007        79     1218
2007 - 2008        84     1227
2008 - 2009        85     1231
In [4]:
fig, ax = plt.subplots(1,2, figsize=(15,5))
plt.suptitle("Number of Games per Season", y=1.03, fontsize=20)
games.regular.plot(marker="o", rot=90, title="Regular Season", color="#41ae76", ax=ax[0])
games.playoffs.plot(marker="o", rot=90, title="Playoffs", ax=ax[1])
Out[4]:
In a full regular season there are 1231 games (30 teams playing 82 games each gives 30 × 82 / 2 = 1230 games, plus one All-Star game), except for the 2011-2012 season, which was shortened due to a lockout; hence the big drop in the diagram.
The number of games is also not exactly 1231 for every season because for
some games no data was available during the scraping process.
The playoffs are played in a best-of-seven format, which is why the number of games varies.
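The counting logic in In [3] can be tried in isolation. The following is only a minimal sketch: `df_toy` and its values are made up for illustration, and only the column names `game_id`, `season` and `playoffs` are taken from the notebook. It also checks the 1231 figure quoted above.

```python
import pandas as pd

# Toy play-by-play table: one row per free throw (values are made up).
df_toy = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2, 3],
    "season": ["2006 - 2007"] * 6,
    "playoffs": ["regular", "regular", "regular",
                 "regular", "regular", "playoffs"],
})

# Keep one row per game, then count games per season and game type.
games_toy = (
    df_toy.drop_duplicates("game_id")
          .groupby(["season", "playoffs"])
          .size()
          .unstack()
)
print(games_toy)

# Each game involves two teams, so 30 teams x 82 games give
# 30 * 82 / 2 distinct games; the text adds one All-Star game.
expected_regular_games = 30 * 82 // 2 + 1
print(expected_regular_games)
```

Dropping duplicates on `game_id` first matters because the raw table has one row per free throw, so counting rows directly would count free throws, not games.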
1.2 Average Number of Free Throws per Game by Season
In [5]:
ft_total = df.groupby(["season", "playoffs"]).size() \
.unstack()
ft_total.head(3)
Out[5]:
playoffs     playoffs  regular
season
2006 - 2007      4116    63496
2007 - 2008      4384    61116
2008 - 2009      4455    60900
In [6]:
ft_per_game = ft_total / games
ft_per_game.head(2)
Out[6]:
playoffs      playoffs    regular
season
2006 - 2007  52.101266  52.131363
2007 - 2008  52.190476  49.809291
In [7]:
ft_per_game.plot(marker="o", rot=90, figsize=(12,5))
plt.title("Average Number of Free Throws per Game", fontsize=20)
plt.arrow(5.3, 51, -0.5, -1.2, width=0.01, color="k", head_starts_at_zero=False)
plt.text(4.8, 51.2, "Change of Rules")
Out[7]:
As expected, the number of free throws per game is generally higher for playoff games than for regular season
games (although the two are practically equal in the first and last seasons of this data set). Overall, one can see that
free throws per game decline over the course of the seasons.
There is an especially steep drop from the 2010-2011 season to 2011-2012, and it runs almost in
parallel for regular season and playoff games. This suggests some kind of change
in the rules about what constitutes a foul. And sure enough, I found this article, which confirmed
my suspicion: http://www.espn.com/nba/story/_/id/7329584/nba-alters-emphasis-shooting-fouls2011-12
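The per-game averages discussed in this section can be re-derived from the totals printed in Out[3] and Out[5]. This is a small sketch using only the three seasons shown there; the names `games_shown`, `ft_shown`, `per_game` and `gap` are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

seasons = ["2006 - 2007", "2007 - 2008", "2008 - 2009"]

# Number of games per season, copied from Out[3].
games_shown = pd.DataFrame(
    {"playoffs": [79, 84, 85], "regular": [1218, 1227, 1231]},
    index=seasons,
)

# Number of free throws per season, copied from Out[5].
ft_shown = pd.DataFrame(
    {"playoffs": [4116, 4384, 4455], "regular": [63496, 61116, 60900]},
    index=seasons,
)

# Element-wise division aligns on both the season index and the column names.
per_game = ft_shown / games_shown
print(per_game.round(3))

# Positive values mean more free throws per game in the playoffs.
gap = per_game["playoffs"] - per_game["regular"]
print(gap.round(3))
```

Because pandas aligns on labels, the division works without any explicit loop as long as both frames share the same index and column names.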
1.3 Number of Free Throws per Period
In [8]:
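# Count free throws per game and period, then keep only the "count" and "mean" rows of describe().
# sortlevel() is no longer available in current pandas; sort_index() is its replacement.
periods = df.groupby(["game_id", "playoffs", "period"]).size() \
.unstack(["playoffs", "period"]) \
.describe()[:2] \
.stack().unstack(0) \
.swaplevel(0, 1, axis=1).sort_index(axis=1)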
periods
Out[8]:
            count                 mean
playoffs playoffs  regular    playoffs    regular
period
1.0         834.0  11966.0    9.328537   9.124352
2.0         836.0  12014.0   12.562201  11.962377
3.0         835.0  11995.0   12.600000  11.970571
4.0         832.0  12014.0   15.493990  14.432246
5.0          51.0    713.0    7.392157   7.099579
6.0           7.0    114.0    4.571429   7.070175
7.0           3.0     20.0    7.666667   6.350000
8.0           NaN      2.0         NaN  12.000000
There were only 7 playoff games that went into the 6th period, so I am not going to include them (or
higher periods) into the following graph.
In [9]:
periods["mean"][:5].plot(marker="o", xticks=(1,2,3,4,5), xlim=(0.8, 5.2), figsize=(8,5))
plt.title("Average Number of Free Throws", fontsize=20)
Out[9]:
Here again, playoff games have a higher average than regular season games (across all five periods shown).
And as expected, the number of free throws increases as the game gets closer to the end, with the
highest average in the fourth quarter.
There is a huge drop in the fifth period because overtime periods are only 5 minutes long. In
order to compare them with the first 4 periods (which are 12 minutes long), I am going to calculate
the average number of free throws per minute per period.
In [10]:
periods["minutes"] = [12,12,12,12,5,5,5,5]
periods["playoffs"] = periods["mean"].playoffs / periods.minutes
periods["regular"] = periods["mean"].regular / periods.minutes
periods
Out[10]:
            count                 mean             minutes  playoffs   regular
playoffs playoffs  regular    playoffs    regular
period
1.0         834.0  11966.0    9.328537   9.124352       12  0.777378  0.760363
2.0         836.0  12014.0   12.562201  11.962377       12  1.046850  0.996865
3.0         835.0  11995.0   12.600000  11.970571       12  1.050000  0.997548
4.0         832.0  12014.0   15.493990  14.432246       12  1.291166  1.202687
5.0          51.0    713.0    7.392157   7.099579        5  1.478431  1.419916
6.0           7.0    114.0    4.571429   7.070175        5  0.914286  1.414035
7.0           3.0     20.0    7.666667   6.350000        5  1.533333  1.270000
8.0           NaN      2.0         NaN  12.000000        5       NaN  2.400000
In [11]:
per_minute = periods[["playoffs", "regular"]][:5]
per_minute.columns = per_minute.columns.droplevel(1)
per_minute.plot(marker="o", xticks=(1,2,3,4,5), xlim=(0.8, 5.2), figsize=(8,5))
plt.title("Average Number of Free Throws per Minute", fontsize=20)
Out[11]:
Now the pattern is clearer: across the five periods plotted, the closer the game gets to the end, the higher the number of free
throws per minute. Let's see if that also applies to the actual playing time left.
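The per-minute normalisation can also be checked on its own. The sketch below uses the mean values for periods 1 to 5 as printed in Out[8]; the names `mean_ft`, `minutes` and `per_minute_check` are illustrative and not taken from the notebook.

```python
import pandas as pd

# Mean free throws per game and period, copied from Out[8] (periods 1-5).
mean_ft = pd.DataFrame(
    {
        "playoffs": [9.328537, 12.562201, 12.600000, 15.493990, 7.392157],
        "regular": [9.124352, 11.962377, 11.970571, 14.432246, 7.099579],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="period"),
)

# Regulation periods last 12 minutes, an overtime period lasts 5 minutes.
minutes = pd.Series([12, 12, 12, 12, 5], index=mean_ft.index)

# Divide every column by the period length, row by row.
per_minute_check = mean_ft.div(minutes, axis=0)
print(per_minute_check.round(3))

# Period with the highest per-minute rate for each game type.
print(per_minute_check.idxmax())
```

Using `div(..., axis=0)` makes the row-wise alignment explicit, which avoids depending on how a single column is selected from a frame with two-level column labels.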
1.4 Number of Free Throws: Seconds left
In [12]:
# excluding free throws that were made during overtime
df_seconds_left = df[df.period =100]
shooting.head(3)
Out[16]:
              ft_count  percentage
player
A.J. Price         282    0.748227
Aaron Brooks      1109    0.836790
Aaron Gordon       254    0.681102
In [17]:
shooting.percentage.hist(bins=50, figsize=(8,5))
plt.title("Distribution of Shooting Percentages", font...