Homework 2: Due Friday, October 22 at 11:59pm
For this assignment you will be analyzing the flights data frame included in the
nycflights13 package, which contains 19 features for 336,776 flights that departed from
New York City in 2013. The goal of this assignment is to become more familiar with data
transformations and exploratory data analysis. Provide all responses in the designated spaces in
this Word document, then save it as a pdf and upload it to Canvas.
1. [25%] Generate notched boxplots to compare distributions of airport distances between
canceled and non-canceled flights. Is the median airport distance for canceled flights
shorter, longer, or roughly the same as for non-canceled flights? Justify your answer, and
then provide a possible explanation for this finding.
Provide code below:
Provide figure below:
Provide answers to questions below:
2. [40%] Generate a bar plot with month on the 𝑥 axis and the proportion of the month’s
flights that are canceled on the 𝑦 axis. Which four months of the year had the highest
proportions of flight cancellations? Provide a possible explanation for this finding.
Note: You will need to use the function geom_col() to generate a bar plot for which you
provide the 𝑥 and 𝑦 axis features. Because geom_col() expects that the feature on the 𝑥
axis is categorical, you must also use the code as.factor(month) to convert the
month feature to a categorical variable taking 12 values (1, 2, …, 12).
Provide code below:
Provide figure below:
Provide answers to questions below:
3. [35%] Generate a scatterplot displaying the relationship between the mean and the
standard deviation of airport distance for flights on each of the 365 days of the year. Use
the method argument of the geom_smooth() function to overlay your scatterplot with a
fitted straight line with confidence intervals. Is there a relationship between the mean and
standard deviation of airport distance for flights on each day of the year? Describe the
relationship (or lack of one), and then provide a possible explanation for this finding.
Provide code below:
Provide figure below:
Provide answers to questions below:
CAP 5768: Introduction to Data Science
Data Transformations
Review of last lecture
We discussed how to import, tidy, and visualize data from R packages.
Loading a package automatically imports included data, which are
already tidied into data frames (or tibbles).
We also learned how to use the ggplot() function to visualize data.
However, we skipped data transformation, which was unnecessary for
the data (mpg) that we were analyzing.
Data transformation in R
We have already seen that the tidyverse package contains the
ggplot2 package for data visualization.
The tidyverse package also contains the dplyr package for data
transformation.
The toolset in dplyr allows us to easily filter, arrange, select, and
summarize data for easier downstream use.
Recall that we can load the tidyverse package and all of its
included packages with the command:
> library(tidyverse)
The flights data frame (tibble)
We will perform data transformation on the flights data frame
(tibble) that is included in the nycflights13 package.
First, we must install the package:
> install.packages("nycflights13")
Then, we must load the package:
> library(nycflights13)
The flights data frame contains 19 features (columns) for all
336,776 flights (rows) that departed from New York City in 2013.
A snapshot of the flights data frame
> flights
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
# … with 336,773 more rows, and 11 more variables: arr_delay, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Learning about the features of the flights data frame
> help(flights)
year, month, day                 Date of departure
dep_time, arr_time               Actual departure and arrival times (format HHMM or HMM), local tz
sched_dep_time, sched_arr_time   Scheduled departure and arrival times (format HHMM or HMM), local tz
dep_delay, arr_delay             Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier                          Two letter carrier abbreviation
flight                           Flight number
tailnum                          Plane tail number
origin, dest                     Origin and destination
air_time                         Amount of time spent in air (minutes)
distance                         Distance between airports (miles)
hour, minute                     Time of scheduled departure
time_hour                        Scheduled flight date & hour (POSIXct)
Basics of dplyr
The dplyr package contains five key functions that are useful for
data manipulation and transformation:
filter()
Pick observations by their values
arrange()
Reorder rows
select()
Pick features by their names
mutate()
Create new features with functions of existing
features
summarize()
Collapse many features down to a single
summary
These functions can also be used together with the group_by()
function, which changes the scope of each function from operating on
the entire dataset to operating on it group-by-group.
Similar structure for all six functions
These six dplyr functions all work in similar ways:
- Their first argument is a data frame.
- The subsequent arguments describe what to do with the data
frame, using the feature names.
- The result of the function is a new data frame.
These three properties make it easy to chain together multiple steps,
permitting complex manipulations of the original data.
We will now apply each of them to the flights data frame to see
how they work.
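As a small illustration of this shared structure (a sketch; the individual functions are introduced below), each call takes a data frame and returns a new data frame, so one call's result can be passed directly into another:
> # filter() keeps only the January 1st rows; select() then keeps only the date columns
> jan1_dates <- select(filter(flights, month == 1, day == 1), year, month, day)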
Filtering rows with filter()
The filter() function allows you to subset observations based on
values of their features.
The first argument is the name of the data frame.
The subsequent arguments are expressions that filter the data frame.
You may recall that we already used filter() to subset data when
plotting a smoothing trend line in the previous lecture.
Example from last lecture
> ggplot(data = mpg, mapping = aes(x = displ, y = hwy )) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"),
se = FALSE)
Extracting only flights on January 1st
> filter(flights, month == 1, day == 1)
# A tibble: 842 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
# … with 839 more rows, and 11 more variables: arr_delay, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Creating a new data frame with flights on January 1st
In the previous example, we obtained all flights on January 1st.
However, we may wish to store this new dataset so that we can further
manipulate it downstream.
To save the data frame, we can assign it to a new variable, e.g., Jan1:
> Jan1 <- filter(flights, month == 1, day == 1)
Here we are stating that we want to take observations (rows) from the
flights data frame if it is TRUE that the feature month is equal to 1
(January) AND that the feature day is equal to 1 (first of the month).
We must use designated comparison operators
If we used = instead of ==, then we would not get our intended results:
> filter(flights, month = 1, day = 1)
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
ℹ This usually means that you've used `=` instead
of `==`.
ℹ Did you mean `month == 1`?
This is because, as we saw in the last lecture, the = operator is used
by R to assign values to variables.
Therefore, rather than asking whether month is equal to 1 and day is
equal to 1, the above code is trying to assign a value of 1 to each of
these features.
The set of comparison operators in R
Comparison operators in R take the form
x OPERATOR y
to compare the value of x to the value of y, and then return a value of
TRUE or FALSE.
x == y
x is equal to y
x > y
x is greater than y
x >= y
x is greater than or equal to y
x < y
x is less than y
x <= y
x is less than or equal to y
x != y
x is not equal to y
Comparisons can be affected by finite precision arithmetic
Because computers use finite precision arithmetic, comparisons of floating-point values can give surprising results:
> sqrt(2) ^ 2 == 2
[1] FALSE
> 1 / 49 * 49 == 1
[1] FALSE
Circumventing issues with finite precision arithmetic
We can circumvent some of these issues by using the dplyr function near(), which examines whether an expression is "close" to a particular value.
Closeness is measured by a modifiable argument tol (tolerance of comparison) of the near() function, which defines how numerically different two quantities can be and still be treated as equal.
We can use near() to obtain desired results from our examples:
> near(sqrt(2) ^ 2, 2)
[1] TRUE
> near(1 / 49 * 49, 1 )
[1] TRUE
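As a small illustration of the tol argument (a toy comparison assumed for illustration, not from the original examples):
> near(1, 1.01)             # FALSE: the difference of 0.01 exceeds the default tolerance
[1] FALSE
> near(1, 1.01, tol = 0.1)  # TRUE: differences of up to 0.1 are now treated as equal
[1] TRUE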
Logical operators
In addition to comparison operators, we have also already observed
an example of a logical operator.
In particular, consider our code from earlier:
> filter(flights, month == 1, day == 1)
The comma between the second and third arguments of this code
represents an “and” operation.
This means rows are only selected if month == 1 and day == 1 .
The set of logical operators
Suppose that x and y are two sets of values.
We can apply the logical operators & (and), | (or), xor() (exclusive
or), and ! (not) to obtain different combinations of these values:
Wickham, Grolemund (2017) R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly.
Example use of logical operators in filter()
If we wanted to consider flights that occurred in either November (11)
or December (12), then we could use the code:
> filter(flights, month == 11 | month == 12)
Here, if month equals either 11 or 12, then we choose that row.
Because it can be cumbersome to list the feature month for each
comparison, we can use the following shorthand code instead:
> filter(flights, month %in% c(11, 12))
Here we are asking whether the value of month is within (%in%
operator) the set of values in a list created with c() that contains 11
and 12.
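As a small self-contained illustration with a toy vector (assumed values, not from the flights data), the two forms give the same result:
> x <- c(5, 11, 12)
> x == 11 | x == 12
[1] FALSE  TRUE  TRUE
> x %in% c(11, 12)
[1] FALSE  TRUE  TRUE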
Filtering data by intervals instead of exact values
We just discussed filtering data by exact values, e.g.,
> filter(flights, month == 11 | month == 12)
We can also filter data by intervals.
For example, we could use the following code to extract all flights that
were not delayed on arrival or departure by more than two hours:
> filter(flights, !(arr_delay > 120 | dep_delay > 120))
The code below produces the same result:
> filter(flights, arr_delay <= 120, dep_delay <= 120)
Missing values (NA)
One important complication is missing values (NA), which represent unknown values. Almost any operation involving an NA also returns NA:
> NA > 5
[1] NA
> 10 == NA
[1] NA
> NA + 10
[1] NA
> NA / 2
[1] NA
> NA == NA
[1] NA
Understanding why NA == NA evaluates to NA
Suppose mary_age represents Mary's age and john_age represents John's age, neither of which we know:
> mary_age <- NA
> john_age <- NA
> mary_age == john_age
[1] NA
We don’t know, because we don’t know the ages of Mary and John.
Determining if a value is missing
We can use the built-in R function is.na()to check whether a value
is missing (NA).
For example:
> is.na(mary_age)
[1] TRUE
> is.na(john_age)
[1] TRUE
Implications of NA values on filter()
The filter() function only returns rows for which the condition
across the provided arguments evaluates to TRUE.
It thus excludes rows for which the condition evaluates to FALSE or to
NA, and so care needs to be taken if we wish to preserve NA values.
If we want to preserve NA values with filter(), then we must ask
for them explicitly.
Example using filter() without preserving NA values
We will first create a new data frame (tibble) with three rows and one
column:
> df <- tibble(x = c(1, NA, 3))
> df
# A tibble: 3 x 1
      x
1     1
2    NA
3     3
Example using filter() without preserving NA values
Filtering values where x is greater than 1, we get:
> filter(df, x > 1)
# A tibble: 1 x 1
      x
1     3
Here, the filtered data frame has only one row, because the condition x > 1 evaluated to TRUE only for the row with x = 3.
Example using filter() while preserving NA values
Filtering values where x is NA or is greater than 1, we get:
> filter(df, is.na(x) | x > 1)
# A tibble: 2 x 1
      x
1    NA
2     3
Here, the filtered data frame has two rows, because the condition is.na(x) | x > 1 evaluated to TRUE both for the row with the NA and for the row with x = 3.
Arranging rows with arrange()
The arrange() function allows you to change the order of rows in a
data frame.
The first argument is the name of the data frame.
The subsequent arguments are column names (or a complicated
expression) that are used to arrange the rows of the data frame.
If more than one column name is provided, then additional columns
will be used to break ties in the values of the preceding columns.
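For example (a sketch of tie-breaking; the second column only matters for rows that are tied on the first):
> # Sort flights by month; within each month, order by descending departure delay
> arrange(flights, month, desc(dep_delay))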
Arranging flights by ascending year, month, and day
> arrange(flights, year, month, day)
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
# … with 336,773 more rows, and 11 more variables: arr_delay, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Arranging flights by ascending departure delay
> arrange(flights, dep_delay)
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1  2013    12     7     2040           2123       -43       40           2352
2  2013     2     3     2022           2055       -33     2240           2338
3  2013    11    10     1408           1440       -32     1549           1559
# … with 336,773 more rows, and 11 more variables: arr_delay, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Arranging flights by descending departure delay
> arrange(flights, desc(dep_delay))
# A tibble: 336,776 x 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1  2013     1     9      641            900      1301     1242           1530
2  2013     6    15     1432           1935      1137     1607           2120
3  2013     1    10     1121           1635      1126     1239           1810
# … with 336,773 more rows, and 11 more variables: arr_delay, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Missing values are always placed in the bottom rows
> df <- tibble(x = c(5, 2, NA))
> arrange(df, x)
# A tibble: 3 x 1
      x
1     2
2     5
3    NA
> arrange(df, desc(x))
# A tibble: 3 x 1
      x
1     5
2     2
3    NA
Extracting relevant columns with select()
Modern datasets are often massive, with millions of rows
(observations) and thousands of columns (features).
However, for a given problem, most features are irrelevant.
The select() function allows you to choose a set of columns that
you would like to keep.
The first argument is the name of the data frame.
The subsequent arguments are the names of columns that you want
to extract.
Extracting year, month, and day from flights
> select(flights, year, month, day)
# A tibble: 336,776 x 3
   year month   day
1  2013     1     1
2  2013     1     1
3  2013     1     1
# … with 336,773 more rows
All columns except between year and day (inclusive)
> select(flights, -(year:day))
# A tibble: 336,776 x 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
1       517            515         2      830            819        11 UA        1545
2       533            529         4      850            830        20 UA        1714
3       542            540         2      923            850        33 AA        1141
# … with 336,773 more rows, and 8 more variables: tailnum, origin,
#   dest, air_time, distance, hour, minute, time_hour
All columns except year, day, and dep_time
> select(flights, -c(year, day, dep_time))
# A tibble: 336,776 x 16
   month sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
1      1            515         2      830            819        11 UA        1545
2      1            529         4      850            830        20 UA        1714
3      1            540         2      923            850        33 AA        1141
# … with 336,773 more rows, and 8 more variables: tailnum, origin,
#   dest, air_time, distance, hour, minute, time_hour
Placing specific columns up front with everything()
> select(flights, arr_time, arr_delay, everything())
# A tibble: 336,776 x 19
   arr_time arr_delay  year month   day dep_time sched_dep_time dep_delay
1       830        11  2013     1     1      517            515         2
2       850        20  2013     1     1      533            529         4
3       923        33  2013     1     1      542            540         2
# … with 336,773 more rows, and 11 more variables: sched_arr_time, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Renaming features with rename()
> rename(flights, depTime = dep_time)
# A tibble: 336,776 x 19
   year month   day depTime sched_dep_time dep_delay arr_time sched_arr_time
1  2013     1     1     517            515         2      830            819
2  2013     1     1     533            529         4      850            830
3  2013     1     1     542            540         2      923            850
# … with 336,773 more rows, and 11 more variables: arr_delay, carrier,
#   flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
Adding new features with mutate()
Sometimes the features in a dataset are not the features of interest.
For example, a dataset may contain features for distance and time,
whereas the relevant feature is speed, or distance divided by time.
The mutate() function allows you to generate new features that are
functions of existing features (data transformation).
The first argument is the name of the data frame.
The subsequent arguments are expressions to generate new features,
which are added as the last columns of the new data frame.
Let’s first create a smaller dataset for easy visualization
> flights_sml <- select(flights, year:day, ends_with("delay"),
                        distance, air_time)
> flights_sml
# A tibble: 336,776 x 7
   year month   day dep_delay arr_delay distance air_time
1  2013     1     1         2        11     1400      227
2  2013     1     1         4        20     1416      227
3  2013     1     1         2        33     1089      160
# … with 336,773 more rows
Adding two new features to our new data frame
> mutate(flights_sml, gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
# A tibble: 336,776 x 9
   year month   day dep_delay arr_delay distance air_time  gain speed
1  2013     1     1         2        11     1400      227    -9  370.
2  2013     1     1         4        20     1416      227   -16  374.
3  2013     1     1         2        33     1089      160   -31  408.
# … with 336,773 more rows
Can also apply mutate to newly-generated features
> mutate(flights_sml, gain = dep_delay - arr_delay,
hours = air_time / 60,
gains_per_hour = gain / hours)
# A tibble: 336,776 x 10
   year month   day dep_delay arr_delay distance air_time  gain hours gains_per_hour
1  2013     1     1         2        11     1400      227    -9  3.78          -2.38
2  2013     1     1         4        20     1416      227   -16  3.78          -4.23
3  2013     1     1         2        33     1089      160   -31  2.67         -11.6
# … with 336,773 more rows
Can use transmute() to only keep new variables
> transmute(flights_sml, gain = dep_delay - arr_delay,
hours = air_time / 60,
gains_per_hour = gain / hours)
# A tibble: 336,776 x 3
    gain hours gains_per_hour
1     -9  3.78          -2.38
2    -16  3.78          -4.23
3    -31  2.67         -11.6
# … with 336,773 more rows
New features can be generated if vectorizable
There are numerous expressions that can be used to create new
features with mutate().
An important property of these expressions is that they must be
vectorizable.
That is, the expression must take a vector (list) of input values, and
then return a vector (list) with the same number of values as output.
We cannot exhaustively describe all such possible functions, but we
will introduce a few on the next slide.
Examples of vectorizable operations
> x <- c(1, 10, 100, 1000, 10000)
> (x * 2) + 1
[1]     3    21   201  2001 20001
> log10(x)
[1] 0 1 2 3 4
> cumsum(x)
[1] 1 11 111 1111 11111
> cumsum(log10(x))
[1] 0 1 3 6 10
> log10(x) / sum(log10(x))
[1] 0.0 0.1 0.2 0.3 0.4
> log10(x) > 2
[1] FALSE FALSE FALSE TRUE TRUE
Summarizing data with summarize()
The summarize() function collapses a data frame into a single row
by summarizing it based on particular features.
For example, we can summarize flights by mean delay time:
> summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 x 1
  delay
1  12.6
This is not terribly useful on its own (we can just use mean() here).
But we can use summarize() with group_by() to summarize the
dataset by groups of observations rather than by the whole dataset.
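For comparison, the same single number can be computed directly (a sketch using base R's $ operator to extract the feature):
> mean(flights$dep_delay, na.rm = TRUE)  # approximately 12.6, matching the summarize() result above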
Grouping data with group_by()
> by_day <- group_by(flights, year, month, day)
> by_day
# A tibble: 336,776 x 19
# Groups:   year, month, day [365]
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA
# … with 336,773 more rows, and 5 more variables: air_time, distance, hour, minute, time_hour
Summarizing data by groups of observations
> by_day <- group_by(flights, year, month, day)
> summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 365 x 4
# Groups:   year, month [12]
   year month   day delay
1  2013     1     1 11.5
2  2013     1     2 13.9
3  2013     1     3 11.0
# … with 362 more rows
Summarizing with multiple operations (removing “noise”)
> by_dest <- group_by(flights, dest)
> delay <- summarize(by_dest,
                     count = n(),
                     dist = mean(distance, na.rm = TRUE),
                     delay = mean(arr_delay, na.rm = TRUE))
> delay <- filter(delay, count > 20, dest != "HNL")
> delay
# A tibble: 96 x 4
   dest  count  dist delay
 1 ABQ     254 1826   4.38
 2 ACK     265  199   4.85
 3 ALB     439  143  14.4
 4 ATL   17215  757. 11.3
 5 AUS    2439 1514.  6.02
# … with 91 more rows
Flight delays only increase with distance up to ~750 miles
> ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(mapping = aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
Perhaps as flights get longer there is a greater ability to make up for
delays while in the air.
Three steps to prepare data in last example
> by_dest <- group_by(flights, dest)
> delay <- summarize(by_dest,
                     count = n(),
                     dist = mean(distance, na.rm = TRUE),
                     delay = mean(arr_delay, na.rm = TRUE))
> delay <- filter(delay, count > 20, dest != "HNL")
Step 1: Group flights by destination.
Step 2: Summarize to compute distance, average delay, and number
of flights.
Step 3: Filter to remove noisy points and Honolulu, which is close to
twice as far away as the next closest airport to New York City.
A problem with the above code is that we need to create, name, and
store a new data frame at each intermediate step, even if we
ultimately do not care about these intermediate datasets.
Using pipes %>% to efficiently tackle identical problem
> delays <- flights %>%
group_by(dest) %>%
summarize(count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)) %>%
filter(count > 20, dest != "HNL")
Here the focus is on the transformations, not what is being
transformed, which makes the code easier to read.
The above code can be read as: group, then summarize, then filter.
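The pipe itself has a simple meaning: x %>% f(y) is shorthand for f(x, y), so the left-hand side becomes the first argument of the function on the right. A minimal sketch:
> flights %>% filter(month == 1, day == 1)  # equivalent to filter(flights, month == 1, day == 1)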
Not filtering NA values leads to missing summaries
What would happen if we did not use na.rm = TRUE?
> flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
# A tibble: 365 x 4
# Groups:   year, month [12]
   year month   day  mean
1  2013     1     1    NA
2  2013     1     2    NA
3  2013     1     3    NA
# … with 362 more rows
It is a good idea to be careful and use the na.rm argument
> flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay, na.rm = TRUE))
# A tibble: 365 x 4
# Groups:   year, month [12]
   year month   day  mean
1  2013     1     1 11.5
2  2013     1     2 13.9
3  2013     1     3 11.0
# … with 362 more rows
What if NA values relate to events of interest?
In the flights dataset, missing values represent cancelled flights.
That is, for cancelled flights, features like dep_delay and
arr_delay are missing because the flight never departed or arrived.
Therefore, we could create a new dataset based only on flights that
are not cancelled (by filtering flights with NA delay), and then learn
about the mean departure delay without needing to worry about
filtering NA values.
The following slide conveys this idea.
Summarizing the set of non-cancelled flights
> not_cancelled <- flights %>%
    filter(!is.na(dep_delay), !is.na(arr_delay))
> not_cancelled %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
# A tibble: 365 x 4
# Groups:   year, month [12]
   year month   day  mean
1  2013     1     1 11.4
2  2013     1     2 13.7
3  2013     1     3 10.9
# … with 362 more rows
Including count summaries with data aggregation
Whenever data is being aggregated or summarized in some way, it is
a good idea to include either a count with the function n(), or a count
of non-missing values with the function sum(!is.na()).
This helps you ensure that you are not drawing conclusions from
small samples of the original data.
Moreover, the power of statistical tests (discussed later in the course)
depends on sample size to an extent.
Let’s look at an example of why this is important.
Some planes have mean delays of >5 hours!
> delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(delay = mean(arr_delay))
> ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
This sounds like a very interesting result!
The geom_freqpoly() function
In this example, we introduced a new geometric function for plotting
called geom_freqpoly(), which is similar to a histogram, except
that it plots a line graph, where the line goes through the top center of
each bin of the analogous histogram.
> ggplot(data = delays, mapping = aes(x = delay)) +
geom_histogram(binwidth = 10) +
geom_freqpoly(binwidth = 10, size = 2)
We can see this effect more clearly here.
Story is a little more nuanced
These figures show that there were some planes with mean delay times of over five hours, but that this was relatively rare (only a few planes).
However, both figures only provided us with summaries of the data.
We can potentially learn more if we examine how the variation in mean delay time relates to the number of flights flown by each plane.
Greater variation in mean delay with fewer flights
> delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(delay = mean(arr_delay, na.rm = TRUE),
n = n())
> ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
Characteristic trend of less variation with increasing 𝑛
The above graph displays a pattern that is commonly observed whenever a mean is plotted as a function of sample size 𝑛: variation in the summary (the mean) decreases as the sample size 𝑛 increases.
Removing small samples reduces mean delay and variation
> delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
Useful summary functions
In the last lecture, we introduced two measures of location and their
functions:
Mean (average value), mean()
Median (middle value), median()
We also introduced two measures of spread and their functions:
Standard deviation, sd()
Variance, var()
All of these functions, as well as many others used to summarize
values, have the argument na.rm to remove NA values.
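As a small illustration with a toy vector (assumed values):
> x <- c(1, 2, NA)
> sd(x)
[1] NA
> sd(x, na.rm = TRUE)
[1] 0.7071068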
Summarizing flights by standard deviation of distance
> not_cancelled %>%
group_by(dest) %>%
summarize(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
# A tibble: 104 x 2
   dest  distance_sd
 1 EGE         10.5
 2 SAN         10.4
 3 SFO         10.2
 4 HNL         10.0
 5 SEA          9.98
# … with 99 more rows
Measures of rank
In the last lecture, we also discussed min()and max(), which return
the minimum and maximum value in a list, respectively.
The quantile() function is a generalization of median(), which
returns the 50% quantile.
Examples:
> x <- 1:100
> min(x)
[1] 1
> max(x)
[1] 100
# Value of x greater than 25% of values and less than other 75%
> quantile(x, 0.25)
[1] 25.75
Finding out when first and last flight departed each day
> not_cancelled %>%
group_by(year, month, day) %>%
summarize(first = min(dep_time), last = max(dep_time))
# A tibble: 365 x 5
# Groups:   year, month [12]
   year month   day first  last
1  2013     1     1   517  2356
2  2013     1     2    42  2354
3  2013     1     3    32  2349
# … with 362 more rows
Measures of position
Useful measures of position are first(), nth(), and last().
The first() and last() functions return the first and last element
in a list, respectively (identical to min() and max() if list is sorted).
The nth() function returns the 𝑛th element of a list.
Examples:
> x <- c(10, 0, -5, 4)  # illustrative vector; first element 10, third element -5, last element 4
> first(x)
[1] 10
> last(x)
[1] 4
> nth(x, 3)
[1] -5
First and last departure of each date
> not_cancelled %>%
group_by(year, month, day) %>%
summarize(first_dep = first(dep_time),
last_dep = last(dep_time))
# A tibble: 365 x 5
# Groups:   year, month [12]
   year month   day first_dep last_dep
1  2013     1     1       517     2356
2  2013     1     2        42     2354
3  2013     1     3        32     2349
# … with 362 more rows
Counts
We introduced n(), which takes no arguments and returns the size of the current group, and sum(!is.na(x)), which takes a feature x and returns its number of non-missing values.
We can also calculate number of distinct (unique) values with
n_distinct().
Examples:
> x <- c(1, 1, 1, 2, 2, 2, 3, 3, NA, NA)  # illustrative vector: 8 non-missing values, 3 distinct non-missing values
> sum(!is.na(x))
[1] 8
# NA values count as a distinct value
> n_distinct(x)
[1] 4
Which destination has the most carriers?
> not_cancelled %>%
group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
# A tibble: 104 x 2
   dest  carriers
 1 ATL          7
 2 BOS          7
 3 CLT          7
 4 ORD          7
 5 TPA          7
 6 AUS          6
# … with 98 more rows
Built-in helper function within dplyr if only want count
> not_cancelled %>%
count(dest)
# A tibble: 104 x 2
   dest      n
 1 ABQ     254
 2 ACK     264
 3 ALB     418
 4 ANC       8
 5 ATL   16837
# … with 99 more rows
Total number of flights for each plane
> not_cancelled %>%
count(tailnum)
# A tibble: 4,037 x 2
   tailnum     n
 1 D942DN      4
 2 N0EGMQ    352
 3 N10156    145
 4 N102UW     48
 5 N103US     46
# … with 4,032 more rows
Weight argument to tally total miles a plane flew
> not_cancelled %>%
count(tailnum, wt = distance)
# A tibble: 4,037 x 2
   tailnum       n
 1 D942DN     3418
 2 N0EGMQ   239143
 3 N10156   109664
 4 N102UW    25722
 5 N103US    24619
# … with 4,032 more rows
Counts and proportions of logical values
In sum(!is.na()), sum() treats any row as a 1 if !is.na()
evaluates to TRUE, and 0 otherwise.
Therefore, we can use sum() to count the number of rows satisfying
a logical statement, and use mean() to obtain the proportion of rows.
Example:
> x <- c(1, 1, 1, 2, 2, 2, 3, 3, NA, NA)  # same illustrative vector as above
> sum(!is.na(x))
[1] 8
# The proportion of values that are not NA
> mean(!is.na(x))
[1] 0.8
How many flights left before 5am (prior day delays)?
# Recall dep_time formatted as HHMM or HMM (H = hour, M = min)
> not_cancelled %>%
group_by(year, month, day) %>%
summarize(n_early = sum(dep_time < 500))
# A tibble: 365 x 4
# Groups:   year, month [12]
   year month   day n_early
1  2013     1     1       0
2  2013     1     2       3
3  2013     1     3       4
# … with 362 more rows
What proportion of flights were delayed by more than 1 hour?
> not_cancelled %>%
group_by(year, month, day) %>%
summarize(hour_prop = mean(arr_delay > 60))
# A tibble: 365 x 4
# Groups:   year, month [12]
   year month   day hour_prop
1  2013     1     1    0.0722
2  2013     1     2    0.0851
3  2013     1     3    0.0567
# … with 362 more rows
Finding all groups larger than threshold
# Find destinations with more than one daily flight on average
> popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
> popular_dests
# A tibble: 332,577 x 19
# Groups:   dest [77]
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK
# … with 332,574 more rows, and 6 more variables: dest, air_time, distance, hour, minute, time_hour
Standardizing to compute per group metrics
# Proportion of delay time for each destination accounted for
# by a particular flight (sum to 1 for each of 77 dests).
> popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
# A tibble: 131,106 x 6
# Groups:   dest [77]
   year month   day dest  arr_delay prop_delay
1  2013     1     1 IAH          11  0.000111
2  2013     1     1 IAH          20  0.000201
3  2013     1     1 MIA          33  0.000235
# … with 131,103 more rows
Lecture summary
We discussed how to use a variety of functions in the dplyr package
for data manipulations and transformations:
filter()
Pick observations by their values
arrange()
Reorder rows
select()
Pick features by their names
mutate()
Create new features with functions of existing
features
summarize()
Collapse many features down to a single
summary
group_by()
Changes scope of above functions from operating
on all data to operating on it group-by-group
*We also discussed many other related functions and their applications.
CAP 5768: Introduction to Data Science
Exploratory Data Analysis
What is exploratory data analysis?
Exploratory data analysis is the act of visualizing, transforming, and
modeling data to explore it in a systematic way.
The goal of exploratory data analysis is to gain some preliminary
understanding of your data before moving on to statistical learning.
We often use exploratory data analysis to investigate the quality of
data and clean them as necessary for downstream analyses.
Exploratory data analysis has no rules
During initial phases of exploratory data analysis, you should feel free
to investigate every idea that occurs to you.
Some will pan out, whereas others will be dead ends.
The process is therefore an iterative cycle in which you:
1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modeling
your data.
3. Use what you have learned to refine your questions and/or
generate novel questions.
Exploratory data analysis is thus a creative process
The questions that you ask help you focus on a specific aspect of
your data and choose transformations, visualizations or models.
For this reason, it is important to ask high quality questions.
Though there are no rules, two types of questions will always be
useful for making discoveries about your data:
1) What type of variation occurs within features?
2) What type of variation occurs among features?
Examples of variation?
Visualizing distributions
We have already seen several examples of ways to visualize
distributions (e.g., bar plots, histograms, boxplots).
How you visualize the distribution of a feature will depend on whether
that feature is categorical or continuous.
Categorical features take on a finite set of values (“categories”).
Continuous features take on an infinite set of values (“quantities”).
Visualizing categorical distributions with bar plots
One way to visualize categorical features is with bar plots, which
summarize variation within a single feature.
> ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Plotted values can be computed without visualization
We can compute the plotted values without visualizing by using some
of the techniques we learned in the last lecture:
> diamonds %>%
count(cut)
# A tibble: 5 x 2
  cut           n
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551
Visualizing continuous distributions with histograms
One way to visualize continuous features is with histograms, which
summarize variation within a single feature.
> ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
(The 𝑥 axis is in units of carats.)
Is this bin width appropriate for identifying fine-scale variation in carat?
Plotted values can be computed without visualization
We can compute the plotted values without visualizing by using some
of the techniques we learned in the last lecture, and with a new
function cut_width() that behaves the same as the binwidth
argument of geom_histogram():
> diamonds %>%
count(cut_width(carat, 0.5))
# A tibble: 11 x 2
   `cut_width(carat, 0.5)`     n
 1 [-0.25,0.25]              785
 2 (0.25,0.75]             29498
 3 (0.75,1.25]             15977
 4 (1.25,1.75]              5313
 5 (1.75,2.25]              2002
 6 (2.25,2.75]               322
 7 (2.75,3.25]                32
 8 (3.25,3.75]                 5
 9 (3.75,4.25]                 4
10 (4.25,4.75]                 1
11 (4.75,5.25]                 1
Exploring sub-regions of data and altering scale
In the histogram below, we have a bin width that may be too large to
identify any fine-scale variation in carat.
Moreover, because only a few diamonds are ≥ 3 carats, it may be
helpful to filter the data and explore variation in smaller diamonds.
Extracting small diamonds & examining fine-scale patterns
> smaller <- diamonds %>%
filter(carat < 3)
> ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
Multiple modes of smaller diamonds
By zooming into the parameter space where most observations lie
(diamonds weighing < 3 carats) and choosing a finer scale binning,
we can see that there are multiple modes (peaks) in the distribution,
potentially corresponding to distinct subgroups of diamonds:
Plotting distributions of multiple continuous features
Before exploring this variation further, we will briefly discuss how to
plot distributions of multiple continuous features.
Suppose we wish to create a separate histogram for each quality of
diamond (cut), and depict each histogram with a different color.
Our first thought might be to use the geom_histogram() function
with the code:
> ggplot(data = smaller, mapping = aes(x = carat,
color = cut)) +
geom_histogram(binwidth = 0.1)
However, the result of this code may not be what we envisioned.
Using different outline colors for the histograms is not what we hoped for
> ggplot(data = smaller, mapping = aes(x = carat,
color = cut)) +
geom_histogram(binwidth = 0.1)
What about filling in the histograms instead?
> ggplot(data = smaller, mapping = aes(x = carat,
fill = cut)) +
geom_histogram(binwidth = 0.1)
Here, we have created a set of stacked histograms, making the
distributions difficult to compare across diamond quality (cut).
Using geom_freqpoly() to plot multiple histograms
> ggplot(data = smaller, mapping = aes(x = carat,
color = cut)) +
geom_freqpoly(binwidth = 0.1)
Exploiting typical values in exploratory data analysis
In bar plots and histograms, tall bars display common (typical) values
of a feature, and short bars display uncommon (atypical) values.
Regions without bars correspond to values that are not observed in
your data at all.
This information can be turned into useful questions for exploratory
data analysis:
- Which values are the most common, and why?
- Which value are rare, and why?
- Do these observations match our expectations?
- Are there any unusual patterns? What might explain them?
Consider distribution of diamond weight at a finer scale
> ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
Questions to ask about these data
- Why are there more diamonds at whole carats and common
fractions of carats?
- Why are there more diamonds slightly to the right of each peak
than there are slightly to the left of each peak?
- Why are there no diamonds larger than three carats?
Exploring clusters of similar values
Clusters of similar values suggest that subgroups exist in the data.
To understand the subgroups, we often ask:
- How are observations within each cluster similar?
- How are observations in separate clusters different?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?
Let’s look at clusters within an example dataset.
Example: The Old Faithful Geyser dataset
This dataset contains 272 observations on waiting times between
eruptions (waiting) and durations of eruptions (eruptions) for the
Old Faithful Geyser in Yellowstone National Park.
What are the features?
The data are tidied into a data frame called faithful, which is
located in the MASS package.
MASS is one of many packages that are automatically installed with
base R.
Do we need to type library(MASS)?
Viewing the top (head) of the faithful data frame
The faithful data frame has two features: eruptions and
waiting.
We can use the head() function to view the first six (or other
specified number) rows of the data frame:
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
What function could we use to look at the last six rows?
Visualizing distribution of eruption times (eruptions)
> ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
Eruption times appear to be clustered into two groups:
1) Short eruptions (mode of ~2 minutes)
2) Long eruptions (mode of ~4.5 minutes)
Comparing short vs. long eruptions
Recall that to understand subgroups, we often ask:
- How are observations within each cluster similar?
- How are observations in separate clusters different?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?
To answer these questions, we might examine other features.
In this case, there is just one other feature: waiting times between
eruptions, waiting.
Why is domain knowledge important here?
Outliers
Outliers are observations that are unusual, or that do not seem to fit
a particular pattern.
Recall that we briefly discussed outliers when we introduced boxplots
in our lecture on R and data visualization.
Sometimes outliers are data entry errors, whereas other times they
suggest important new findings.
In large datasets, outliers can be difficult to identify from a histogram.
Let’s look at an example to see why this issue exists.
Outliers are “invisible” in histogram of diamond width (y)
> ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
The only evidence here is an unusually large 𝑥 axis range.
There are so many observations in the common bins that the rare
bins are so short that you cannot see them at all (for plotting
purposes, they effectively have a 0 count on the 𝑦 axis scale).
Zooming into the 𝑦 axis with coord_cartesian()
> ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
The coord_cartesian() function allows the user to change the
plotted coordinate system, and to alter the range (limits) of the axes,
with the above code setting the 𝑦 axis range to go from 0 to 50.
The 𝑥 axis range can be similarly adjusted with the xlim argument.
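For example, to zoom in on the 𝑥 axis instead (a sketch restricting the plotted widths to 0-10 mm, a range chosen only for illustration):
> ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(xlim = c(0, 10))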
Zooming into the 𝑦 axis allows us to see outliers
> ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
This plot allows us to see three outliers of y: 0, ~30, ~60.
We can use some tools in dplyr to extract these outliers and explore
their properties.
Using dplyr to extract all outliers of diamond width (y)
> outliers <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
> outliers
# A tibble: 9 x 4
  price     x     y     z
1  5139  0      0     0
2  6381  0      0     0
3 12800  0      0     0
4 15686  0      0     0
5 18034  0      0     0
6  2130  0      0     0
7  2130  0      0     0
8  2075  5.15  31.8   5.12
9 12210  8.09  58.9   8.06
Something is wrong with these outliers
> outliers
# A tibble: 9 x 4
  price     x     y     z
1  5139  0      0     0
2  6381  0      0     0
3 12800  0      0     0
4 15686  0      0     0
5 18034  0      0     0
6  2130  0      0     0
7  2130  0      0     0
8  2075  5.15  31.8   5.12
9 12210  8.09  58.9   8.06
The feature y measures the width of a diamond in mm.
But we know that diamonds cannot have a width of 0mm!
Similarly, diamonds cannot have a length (x) or depth (z) of 0mm.
Also, widths (y) of ~32mm and ~59mm are quite large—over an inch.
Diamonds this wide would not cost less than $100,000 (Google).
Another example of the importance of domain knowledge!
What to do with outliers?
It is good practice to perform analyses with and without outliers.
If the outliers have a minimal effect on results, then it is usually okay
to remove them.
However, if the outliers have a substantial effect on results, then they
should not be removed without further investigation and justification.
In this case, you will first need to figure out what caused these
outliers (e.g., a data entry error).
If you proceed to remove these outliers, then you will also need to
disclose this in your report or paper.
Two options for removing outliers
1. Remove all rows (observations) containing outliers
> diamonds2 <- diamonds %>%
    filter(between(y, 3, 20))
Note that the code
between(x, left, right)
is a shortcut for
x >= left & x <= right
2. Replace outlier values with missing values (NA)
> diamonds2 <- diamonds %>%
    mutate(y = ifelse(y < 3 | y > 20, NA, y))
Note that we are using a new function:
ifelse(test, yes, no)
The first argument test is a logical expression; ifelse() returns yes where test is TRUE and no where test is FALSE.
This option is preferred, as it ensures that all other measurements on
the observation can be used for downstream analyses.
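As a small illustration of ifelse() with a toy vector (assumed values, not from the diamonds data):
> y <- c(1, 5, 10, 25)
> ifelse(y < 3 | y > 20, NA, y)
[1] NA  5 10 NA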
But sometimes missing values are not errors
Recall that in the flights data frame, missing values in the
dep_time feature indicate that the flight was canceled.
To compare the scheduled departure times for canceled and non-canceled flights, a new variable can be created with is.na():
> flights %>%
mutate(canceled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60)
Note that we have introduced two new operators here: %/% and %%,
which respectively yield the quotient and the remainder from a
division operation (examples on next slide).
Examples of obtaining quotients and remainders
> 45 %/% 45
[1] 1
> 45 %% 45
[1] 0
> 45 %/% 20
[1] 2
> 45 %% 20
[1] 5
> 45 %/% 50
[1] 0
> 45 %% 50
[1] 45
Departure times for canceled vs. non-canceled flights
> flights %>%
mutate(canceled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(color = canceled),
binwidth = 1/4)
Our graph is not that informative
The difference in scale for the 𝑦 axis between the two categories
makes it impossible to compare their distributions.
This is because there are many more non-canceled flights than
canceled flights (this makes sense).
We should instead plot densities, which places distributions on the
same scale because the area under each density curve sums to one.
Departure times for canceled vs. non-canceled flights
> flights %>%
mutate(canceled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60) %>%
ggplot(mapping = aes(x=sched_dep_time, y=..density..)) +
geom_freqpoly(mapping = aes(color = canceled),
binwidth = 1/4)
New plot enables comparison between categories
A quick glance shows that the shapes of distributions for cancelled
and non-cancelled flights are similar.
However, careful inspection suggests that canceled flights may be
more likely to occur later in the day.
Ultimately, this difference can be evaluated with a formal statistical
test (model), but our exploration helped us arrive at this hypothesis!
Covariation
So far we have discussed variation within a single feature, which can
be visualized with a bar plot when the feature is categorical and with
histograms (different types) when it is continuous.
Covariation describes variation among features, and is the tendency
for values of two or more features to change together systematically.
A key mechanism of identifying covariation is by visualizing the
relationship between two or more features.
Visualizations of covariation
As with a single feature, how you visualize covariation among
features depends on the types of features you are considering
(categorical vs. continuous).
We will consider covariation between two features in this lecture.
There are three combinations for two features:
1) Categorical vs. continuous
2) Categorical vs. categorical
3) Continuous vs. continuous
Covariation between categorical & continuous features
It is common to explore the distribution of a continuous feature
conditional on a categorical feature.
We have already used two approaches for visualizing this: frequency
polygons geom_freqpoly() and boxplots geom_boxplot().
In a frequency polygon, each distribution of a continuous feature is
conditional on a particular categorical feature, which is plotted with
the count (or density) on the 𝑦 axis.
Boxplots summarize the distribution of a continuous feature
conditional on a particular categorical feature.
Frequency polygons for diamond cut vs. price
> ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
Unfortunately, it is difficult to really compare distributions of diamond
price across cut categories because their counts differ a lot!
Recall distribution of counts for cut
Earlier, we used a bar plot to display the frequency of each diamond quality (cut) in our dataset:
> ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Using densities for frequency polygons of cut vs. price
> ggplot(data = diamonds, mapping = aes(x = price,
y = ..density..)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
This is another scenario where it is useful to plot frequency polygons
as densities so that the area under each distribution sums to one.
A surprising result!
This plot shows that the lowest quality (fair) diamonds have the
highest average price.
But it is possible that we are just being misled by looking at the full distributions as frequency polygons; examining summaries of the distributions would help clarify whether this is the case.
We can examine such summaries using boxplots.
Boxplots to summarize diamond cut vs. price
> ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = cut, y = price))
Indeed, the distribution of prices of lowest quality (fair) diamonds is
shifted upward, and the median appears higher than other medians.
Adding notches to assess statistical differences
Notches can be added to boxplots using the notch argument, and
represent a confidence interval around the median:
> ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = reorder(cut, price,
FUN = median), y = price),
notch = TRUE)
Non-overlapping notches suggest strong evidence that medians differ.
Why might fair diamonds have highest average price?
We will explore this question a little more when we discuss
covariation between continuous features.
As a sneak peek, we will need to identify a feature that has a strong
relationship with diamond price.
Covariation between categorical features
To visualize covariation between categorical variables, we need to
count the number of observations for each combination of categories.
One mechanism for doing this and visualizing the distribution is
through the geom_count() function.
The geom_count() function acts just like any other ggplot2
geometric layer, and requires an aesthetic be specified for a mapping
argument indicating which features are plotted on the 𝑥 and 𝑦 axes.
The plot will be a discrete two-dimensional grid, with circles at each
cell in the grid, and the circle size being proportional to the number of
values falling within the pair of categories.
Covariation will appear as a strong correlation between the features.
Using geom_count() for cut vs. color
> ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
Plotted values can be computed without visualization
We can compute the plotted values without visualizing by using some
of the techniques we learned from Lecture 3:
> diamonds %>%
count(color, cut)
# A tibble: 35 x 3
   color cut           n
 1 D     Fair        163
 2 D     Good        662
 3 D     Very Good  1513
 4 D     Premium    1603
 5 D     Ideal      2834
 6 E     Fair        224
# … with 29 more rows
Using geom_tile() to create heatmap for cut vs. color
> diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
Covariation between continuous features
We have already explored a way to examine covariation between two
continuous features, which was a scatterplot using geom_point():
> ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
Why might fair diamonds have highest average price?
The plot below suggests that diamond price increases as diamond
weight (carat) increases:
Thus, maybe the reason for the higher average price of low-quality
(fair) diamonds is that low-quality diamonds tend to be heavier.
We can explore this hypothesis by creating a boxplot to summarize
the distribution of carat as a function of cut.
Boxplot with distribution of carat as a function of cut
> ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = reorder(cut, carat,
FUN = median), y = carat),
notch = TRUE)
Supporting our hypothesis, fair diamonds have their weight distribution
shifted toward larger values.
Also, notches do not overlap between fair and other qualities (cut).
Visualizing distributions of overlapping points
The plot to the left contains over 50,000 points, making it difficult to
visualize where observations tend to be concentrated.
For such a plot, it can be beneficial to set the transparency of points
such that overlaps can be observed (right), with the code:
> ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price),
alpha = 1/100)
Using transparency can be challenging for large datasets
This is because, when there are many points, densely populated regions all
look the same (solid black).
Although we might consider increasing the transparency further, doing so
makes points invisible in regions with little data:
A better solution is to plot overlaps as a heatmap using either the
geom_bin2d() or the geom_hex() functions.
Illustrating scatterplot heatmaps
Let’s reconsider the filtered dataset smaller we generated earlier,
which is the set of diamonds weighing less than three carats:
> smaller <- diamonds %>%
    filter(carat < 3)
Suppose we create a scatterplot with transparency:
> ggplot(data = smaller) +
geom_point(mapping = aes(x = carat, y = price),
alpha = 1/100)
Scatterplot heatmap with geom_bin2d()
> ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
Scatterplot heatmap with geom_hex()
> install.packages("hexbin")
> library(hexbin)
> ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
Discretizing one continuous feature through binning
It is possible to partition continuous values into a discrete number of
“bins” (as is done with a histogram), enabling us to convert a
continuous feature into a categorical feature.
This permits us to employ techniques for visualizing combinations of
categorical and continuous features (e.g., boxplots).
To perform binning, we must choose a feature to partition, and assign
each bin of the feature as a separate group (category).
The cut_width() function partitions values evenly across bins
according to a width argument.
The cut_number() function partitions values into a specific number
of bins, each with approximately the same number of observations.
Binning with cut_width()
> ggplot(data = smaller, mapping = aes(x=carat, y=price)) +
geom_boxplot(mapping = aes(group = cut_width(carat,0.1)))
Binning with cut_number()
> ggplot(data = smaller, mapping = aes(x=carat, y=price)) +
geom_boxplot(mapping = aes(group = cut_number(carat,20)))
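To see the difference between the two binning functions directly, we can
count how many observations fall into each bin. This is a minimal sketch
(the bin column name is ours, not part of the examples above):
cut_width() yields bins of equal width but very unequal counts, while
cut_number() yields bins with roughly equal counts.
> # Bins of fixed width 0.1 carat: counts vary widely across bins
> smaller %>%
    count(bin = cut_width(carat, 0.1))
> # 20 bins, each with roughly the same number of observations
> smaller %>%
    count(bin = cut_number(carat, 20))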
Main goal of data visualization is to find patterns
Patterns in data provide clues about relationships among features
(covariation), which is often what you are seeking to identify.
If you spot a pattern, then you should ask:
- Could the pattern be due to coincidence (random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other features might affect the relationship?
- Does the relationship differ across subgroups of the data?
Recall the clustering of eruption times into two groups
> ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
Why do there appear to be two clusters of eruption times?
Is it possible that eruption time covaries with some other feature?
Scatterplot of waiting as a function of eruptions
> ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
Here we see a pattern: the length of an eruption covaries with
the length of time between eruptions.
Patterns reveal covariation
Patterns are highly useful tools because they reveal covariation.
Variation is a phenomenon that creates uncertainty.
However, covariation is a phenomenon that reduces uncertainty.
If two features covary, then you can use the values of one feature to
make better predictions about the values of the second feature.
If covariation is due to a causal relationship, then you can also use the
value of one feature to control the value of the second feature.
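As a minimal sketch of this idea using the faithful data, we can fit a simple
linear model and use an eruption's length to predict the waiting time until
the next eruption (the eruption lengths below are arbitrary illustrative values):
> # Sketch: use the covariation between eruptions and waiting to predict
> fit <- lm(waiting ~ eruptions, data = faithful)
> predict(fit, newdata = data.frame(eruptions = c(2, 4.5)))
Because the two features covary strongly, these predictions are far more
precise than simply guessing the overall mean waiting time.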
Models are tools for extracting patterns from data
Consider again diamonds, where it was difficult to understand the
relationship between diamond quality (cut) and price:
Reasoning: cut and carat are tightly related, as are carat and price.
We can use a model to remove the very strong relationship between
price and carat to further explore the subtleties that remain.
Loading the modelr package
Before moving on, we will briefly use one feature of the modelr
package, which comes installed with the tidyverse package.
The modelr package has functions that allow seamless integration of
modeling into data manipulation and visualization pipelines.
However, modelr is not automatically loaded with tidyverse.
Therefore, we must first use the command to load modelr:
> library(modelr)
Relationship between price and carat
> ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
We want to model the relationship between price and carat.
However, this relationship does not appear to be linear.
Logarithmic transformation to make data (more) linear
> ggplot(data = diamonds) +
geom_point(mapping = aes(x = log(carat), y = log(price)))
After transforming both features with a natural logarithm, the
relationship looks approximately linear, and so it is now appropriate to
model the relationship with a linear function.
Modeling diamond price as a function of weight
We can use simple linear regression (a straight line) to predict the
logarithm of diamond price from the logarithm of diamond weight.
This can be accomplished by fitting a linear model using the lm()
function, which takes as a first argument a formula (~) relating the
feature on the y axis (left) to that on the x axis (right), as well as an
argument for the data frame:
> mod <- lm(log(price) ~ log(carat), data = diamonds)
> mod

Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Coefficients:
(Intercept)   log(carat)
      8.449        1.676
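Because both features were log-transformed, the fitted line corresponds to a
power law on the original scale: log(price) ≈ 8.449 + 1.676 · log(carat), or
equivalently price ≈ e^8.449 · carat^1.676. As a quick sketch (the carat
values below are arbitrary), we can back-transform the model's predictions to
the dollar scale:
> # Predictions are on the log scale, so exponentiate to get dollars
> exp(predict(mod, newdata = data.frame(carat = c(0.5, 1, 2))))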
Plotting the output of simple linear regression
> ggplot(data = diamonds, mapping = aes(x = log(carat),
y = log(price))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red")
Adding residuals (prediction errors) to data frame
The modelr package has a function add_residuals(), which adds
a column to the end of a data frame corresponding to the model
residuals (prediction errors).
The first argument is the data frame, and the second is the model:
> add_residuals(diamonds, mod)
# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z   resid
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 -0.199
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 -0.0464
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 -0.196
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 -0.563
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 -0.672
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 -0.240
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 -0.240
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 -0.371
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 -0.0912
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 -0.163
# ... with 53,930 more rows
Converting residuals to the correct scale
However, the residuals are on a natural logarithmic scale, rather than
on the scale of our original features.
We can therefore use mutate to convert the residual column onto the
scale of the original features.
Using pipes in the same manner as before, we type:
> diamonds2 <- diamonds %>%
    add_residuals(mod) %>%
    mutate(resid = exp(resid))
The residuals column (resid) has now been replaced with one on the
correct scale, and we have stored this augmented dataset into a new
data frame called diamonds2.
Examining the new data frame
> diamonds2
# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z resid
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 0.820
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 0.955
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 0.822
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 0.569
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 0.511
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 0.787
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 0.787
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 0.690
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 0.913
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 0.850
# ... with 53,930 more rows
Residuals as a function of diamond weight (carat)
> ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
Residuals correspond to variation that is unaccounted for by our
model, which suggests that there is quite a bit of unexplained variation
in the price of lighter diamonds.
Residuals as a function of diamond quality (cut)
> ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid),
notch = TRUE)
Because the residuals correspond to the variation in price that is not
explained by carat, we can now see the expected pattern: relative to
their size, higher quality diamonds are indeed more expensive!
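To check this numerically, here is a minimal sketch that summarizes the median
residual for each cut. Because each residual was back-transformed with exp(),
it is the ratio of a diamond's actual price to the price predicted from its
weight, so a higher median indicates diamonds that are more expensive relative
to their size:
> # Sketch: median price-to-predicted-price ratio (resid) for each cut
> diamonds2 %>%
    group_by(cut) %>%
    summarize(median_resid = median(resid))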
Lecture summary
Exploratory data analysis is the act of visualizing, transforming, and
modeling data to explore it in a systematic way.
Though it has no rules and is a creative process, it requires care to
generate an initial set of quality questions about your data.
Data visualization depends on the type(s) of features in your data
(categorical/continuous).
The main goal of data visualization is to extract patterns, which can
reveal relationships (covariation) between features.
We briefly introduced linear regression as a way of modeling covariation
between features, and will next discuss its use in making predictions.
