Three Simple Data Analysis Problems With R Report

User Generated

wn1994P

Engineering

Description

Attached are three files: one contains the instructions, and the other two are supporting materials.

A complete, step-by-step solution is required for each part of each problem. All answers MUST be placed in the spaces provided in the instructions document.

Solutions MUST include:

- Code (no pictures or screenshots); the code MUST be written in R

- Graphs or figures (code output)

- Answers to all the questions




Unformatted Attachment Preview

Homework 2: Due Friday, October 22 at 11:59pm

For this assignment you will be analyzing the flights data frame included in the nycflights13 package, which contains 19 features for 336,776 flights that departed from New York City in 2013. The goal of this assignment is to become more familiar with data transformations and exploratory data analysis. Provide all responses in the designated spaces in this Word document, then save it as a PDF and upload it to Canvas.

1. [25%] Generate notched boxplots to compare distributions of airport distances between canceled and non-canceled flights. Is the median airport distance for canceled flights shorter, longer, or roughly the same as for non-canceled flights? Justify your answer, and then provide a possible explanation for this finding.

Provide code below:
Provide figure below:
Provide answers to questions below:

2. [40%] Generate a bar plot with month on the 𝑥 axis and the proportion of the month's flights that are canceled on the 𝑦 axis. Which four months of the year had the highest proportions of flight cancellations? Provide a possible explanation for this finding. Note: You will need to use the function geom_col() to generate a bar plot for which you provide the 𝑥 and 𝑦 axis features. Because geom_col() expects that the feature on the 𝑥 axis is categorical, you must also use the code as.factor(month) to convert the month feature to a categorical variable taking 12 values (1, 2, …, 12).

Provide code below:
Provide figure below:
Provide answers to questions below:

3. [35%] Generate a scatterplot displaying the relationship between the mean and the standard deviation of airport distance for flights on each of the 365 days of the year. Use the method argument of the geom_smooth() function to overlay your scatterplot with a fitted straight line with confidence intervals. Is there a relationship between the mean and standard deviation of airport distance for flights on each day of the year? Describe the relationship (or lack of one), and then provide a possible explanation for this finding.

Provide code below:
Provide figure below:
Provide answers to questions below:

CAP 5768: Introduction to Data Science
Data Transformations

Review of last lecture
We discussed how to import, tidy, and visualize data from R packages. Loading a package automatically imports included data, which are already tidied into data frames (or tibbles). We also learned how to use the ggplot() function to visualize data. However, we skipped data transformation, which was unnecessary for the data (mpg) that we were analyzing.

Data transformation in R
We have already seen that the tidyverse package contains the ggplot2 package for data visualization. The tidyverse package also contains the dplyr package for data transformation. The toolset in dplyr allows us to easily filter, arrange, select, and summarize data for easier downstream use. Recall that we can load the tidyverse package and all of its included packages with the command:
> library(tidyverse)

The flights data frame (tibble)
We will perform data transformation on the flights data frame (tibble) that is included in the nycflights13 package. First, we must install the package:
> install.packages("nycflights13")
Then, we must load the package:
> library(nycflights13)
The flights data frame contains 19 features (columns) for all 336,776 flights (rows) that departed from New York City in 2013.
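As a hedged starting point for the assignment itself (assuming both packages are already installed), a minimal setup along these lines loads the data and takes a first look; glimpse() is just one convenient way to inspect the features and is not required by the assignment:

library(tidyverse)
library(nycflights13)
glimpse(flights)   # 336,776 rows and 19 columns, one row per departing flight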
A snapshot of the flights data frame > flights # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 7 2013 1 1 555 600 -5 913 854 8 2013 1 1 557 600 -3 709 723 9 2013 1 1 557 600 -3 838 846 10 2013 1 1 558 600 -2 753 745 # … with 336,766 more rows, and 11 more variables: arr_delay , carrier , # flight , tailnum , origin , dest , air_time , # distance , hour , minute , time_hour Learning about the features of the flights data frame > help(flights) year, month, day Date of departure dep_time, arr_time carrier Actual departure and arrival times (format HHMM or HMM), local tz Scheduled departure and arrival times (format HHMM or HMM), local tz Departure and arrival delays, in minutes. Negative times represent early departures/arrivals. Two letter carrier abbreviation. flight tailnum Flight number Plane tail number origin, dest air_time Origin and destination Amount of time spent in air (minutes) distance hour, minute Distance between airports (miles) Time of schedule departure time_hour Scheduled flight date & hour (POSIXct) sched_dep_time, sched_arr_time dep_delay, arr_delay Basics of dplyr The dplyr package contains five key functions that are useful for data manipulation and transformation: filter() Pick observations by their values arrange() Reorder rows select() Pick features by their names mutate() Create new features with functions of existing features summarize() Collapse many features down to a single summary These functions can also be used together with the group_by() function, which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. Similar structure for all six functions These six dplyr functions all work in similar ways: - Their first argument is a data frame. - The subsequent arguments describe what to do with the data frame, using the feature names. - The result of the function is a new data frame. These three properties make it easy to chain together multiple steps, permitting complex manipulations of the original data. We will now apply each of them to the flights data frame to see how they work. Filtering rows with filter() The filter() function allows you to subset observations based on values of their features. The first argument is the name of the data frame. The subsequent arguments are expressions that filter the data frame. You may recall that we already used filter() to subset data when plotting a smoothing trend line in the previous lecture. 
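As a quick hedged sketch of this shared structure (the intermediate names below, such as jfk, are purely illustrative), each verb takes a data frame and returns a new one, so the verbs can be applied in sequence; the filter() example from the previous lecture then follows below:

library(nycflights13)
library(dplyr)
jfk    <- filter(flights, origin == "JFK")              # pick rows
sorted <- arrange(jfk, desc(dep_delay))                 # reorder rows
small  <- select(sorted, year:day, dep_delay)           # pick columns
longer <- mutate(small, delay_hours = dep_delay / 60)   # add a new feature
summarize(longer, mean_delay_hours = mean(delay_hours, na.rm = TRUE))  # collapse to one row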
Example from last lecture
> ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(mapping = aes(color = class)) +
    geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

Extracting only flights on January 1st
> filter(flights, month == 1, day == 1)
# A tibble: 842 x 19
(output: the first 10 rows are printed, with the same 19 columns as flights and a note that 832 more rows follow)

Creating a new data frame with flights on January 1st
In the previous example, we obtained all flights on January 1st. However, we may wish to store this new dataset so that we can further manipulate it downstream. To save the data frame, we can assign it to a new variable, e.g., Jan1:
> Jan1 <- filter(flights, month == 1, day == 1)
Here we are stating that we want to take observations (rows) from the flights data frame if it is TRUE that the feature month is equal to 1 (January) AND that the feature day is equal to 1 (first of the month).

We must use designated comparison operators
If we used = instead of ==, then we would not get our intended results:
> filter(flights, month = 1, day = 1)
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `month == 1`?
This is because, as we saw in the last lecture, the = operator is used by R to assign values to variables. Therefore, rather than asking whether month is equal to 1 and day is equal to 1, the above code is trying to assign a value of 1 to each of these features.

The set of comparison operators in R
Comparison operators in R take the form x OPERATOR y to compare the value of x to the value of y, and then return a value of TRUE or FALSE.
x == y   x is equal to y
x != y   x is not equal to y
x > y    x is greater than y
x >= y   x is greater than or equal to y
x < y    x is less than y
x <= y   x is less than or equal to y

Issues with finite precision arithmetic
Computers use finite precision arithmetic, so comparisons that are mathematically true can evaluate to FALSE:
> sqrt(2) ^ 2 == 2
[1] FALSE
> 1 / 49 * 49 == 1
[1] FALSE

Circumventing issues with finite precision arithmetic
We can circumvent some of these issues by using the built-in R function near(), which examines whether an expression is "close" to a particular value. Close is measured by a modifiable argument tol (tolerance of comparison) of the near() function, which defines how numerically different the two quantities can be and still be considered equal. We can use near() to obtain desired results from our examples:
> near(sqrt(2) ^ 2, 2)
[1] TRUE
> near(1 / 49 * 49, 1)
[1] TRUE

Logical operators
In addition to comparison operators, we have also already observed an example of a logical operator. In particular, consider our code from earlier:
> filter(flights, month == 1, day == 1)
The comma between the second and third arguments of this code represents an "and" operation. This means rows are only selected if month == 1 and day == 1.

The set of logical operators
Suppose that x and y are two sets of values. We can apply the logical operators & (and), | (or), xor() (exclusive or), and ! (not) to obtain different combinations of these values.
Wickham, Grolemund (2017) R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
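A small hedged sketch of near()'s tol argument and of the less commonly seen logical operators, before the filter() examples that follow (the specific numbers are just illustrative):

library(dplyr)               # near() is provided by dplyr
near(0.1 + 0.2, 0.3)         # TRUE, even though 0.1 + 0.2 == 0.3 is FALSE
near(100, 100.4, tol = 0.5)  # TRUE: tol sets how far apart values may be and still match
xor(TRUE, TRUE)              # FALSE: exclusive or requires exactly one TRUE
!(5 > 3)                     # FALSE: ! negates the result of the comparison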
Example use of logical operators in filter()
If we wanted to consider flights that occurred in either November (11) or December (12), then we could use the code:
> filter(flights, month == 11 | month == 12)
Here, if month equals either 11 or 12, then we choose that row. Because it can be cumbersome to list the feature month for each comparison, we can use the following shorthand code instead:
> filter(flights, month %in% c(11, 12))
Here we are asking whether the value of month is within (%in% operator) the set of values in a list created with c() that contains 11 and 12.

Filtering data by intervals instead of exact values
We just discussed filtering data by exact values, e.g.,
> filter(flights, month == 11 | month == 12)
We can also filter data by intervals. For example, we could use the following code to extract all flights that were not delayed on arrival or departure by more than two hours:
> filter(flights, !(arr_delay > 120 | dep_delay > 120))
The code below produces the same result:
> filter(flights, arr_delay <= 120, dep_delay <= 120)

Missing values (NA)
One important feature of R that can make comparisons tricky is the missing value, NA, which represents an unknown value. Almost any operation involving an unknown value will also be unknown:
> NA > 5
[1] NA
> 10 == NA
[1] NA
> NA + 10
[1] NA
> NA / 2
[1] NA
> NA == NA
[1] NA

Understanding why NA == NA evaluates to NA
Suppose mary_age represents Mary's age, which we don't know:
> mary_age <- NA
Similarly, suppose john_age represents John's age, which we also don't know:
> john_age <- NA
Is Mary the same age as John?
> mary_age == john_age
[1] NA
We don't know, because we don't know the ages of Mary and John.

Determining if a value is missing
We can use the built-in R function is.na() to check whether a value is missing (NA). For example:
> is.na(mary_age)
[1] TRUE
> is.na(john_age)
[1] TRUE

Implications of NA values on filter()
The filter() function only returns rows for which the condition across the provided arguments evaluates to TRUE. It thus excludes rows for which the condition evaluates to FALSE or to NA, and so care needs to be taken if we wish to preserve NA values. If we want to preserve NA values with filter(), then we must ask for them explicitly.

Example using filter() without preserving NA values
We will first create a new data frame (tibble) with three rows and one column:
> df <- tibble(x = c(1, NA, 3))
> df
# A tibble: 3 x 1
      x
1     1
2    NA
3     3
Filtering values where x is greater than 1, we get:
> filter(df, x > 1)
# A tibble: 1 x 1
      x
1     3
Here, the filtered data frame has only one row, because there is only one row for which x = 3 evaluated to TRUE.

Example using filter() while preserving NA values
Filtering values where x is NA or is greater than 1, we get:
> filter(df, is.na(x) | x > 1)
# A tibble: 2 x 1
      x
1    NA
2     3
Here, the filtered data frame has two rows, because both the row with the NA and the row with x = 3 evaluated to TRUE.

Arranging rows with arrange()
The arrange() function allows you to change the order of rows in a data frame. The first argument is the name of the data frame. The subsequent arguments are column names (or a complicated expression) that are used to arrange the rows of the data frame. If more than one column name is provided, then additional columns will be used to break ties in the values of the preceding columns.
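A brief hedged sketch of arrange() before the slide examples below; note that rows with NA in the sorting column are placed last, as the next slide also shows:

library(nycflights13)
library(dplyr)
arrange(flights, dep_delay, arr_delay)   # ties in dep_delay are broken by arr_delay
arrange(flights, desc(dep_time))         # rows with NA dep_time still sort to the bottom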
Arranging flights by ascending year, month, and day > arrange(flights, year, month, day) # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 7 2013 1 1 555 600 -5 913 854 8 2013 1 1 557 600 -3 709 723 9 2013 1 1 557 600 -3 838 846 10 2013 1 1 558 600 -2 753 745 # … with 336,766 more rows, and 11 more variables: arr_delay , carrier , # flight , tailnum , origin , dest , air_time , # distance , hour , minute , time_hour Arranging flights by ascending departure delay > arrange(flights, dep_delay) # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 1 2013 12 7 2040 2123 -43 40 2352 2 2013 2 3 2022 2055 -33 2240 2338 3 2013 11 10 1408 1440 -32 1549 1559 4 2013 1 11 1900 1930 -30 2233 2243 5 2013 1 29 1703 1730 -27 1947 1957 6 2013 8 9 729 755 -26 1002 955 7 2013 10 23 1907 1932 -25 2143 2143 8 2013 3 30 2030 2055 -25 2213 2250 9 2013 3 2 1431 1455 -24 1601 1631 10 2013 5 5 934 958 -24 1225 1309 # … with 336,766 more rows, and 11 more variables: arr_delay , carrier , # flight , tailnum , origin , dest , air_time , # distance , hour , minute , time_hour Arranging flights by descending departure delay > arrange(flights, desc(dep_delay)) # A tibble: 336,776 x 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 1 2013 1 9 641 900 1301 1242 1530 2 2013 6 15 1432 1935 1137 1607 2120 3 2013 1 10 1121 1635 1126 1239 1810 4 2013 9 20 1139 1845 1014 1457 2210 5 2013 7 22 845 1600 1005 1044 1815 6 2013 4 10 1100 1900 960 1342 2211 7 2013 3 17 2321 810 911 135 1020 8 2013 6 27 959 1900 899 1236 2226 9 2013 7 22 2257 759 898 121 1026 10 2013 12 5 756 1700 896 1058 2020 # … with 336,766 more rows, and 11 more variables: arr_delay , carrier , # flight , tailnum , origin , dest , air_time , # distance , hour , minute , time_hour Missing values are always placed in the bottom rows > df arrange(df, x) > arrange(df, desc(x)) # A tibble: 3 x 1 # A tibble: 3 x 1 x x 1 2 1 5 2 5 2 2 3 NA 3 NA Extracting relevant columns with select() Modern datasets are often massive, with millions of rows (observations) and thousands of columns (features). However, for a given problem, most features are irrelevant. The select() function allows you to choose a set of columns that you would like to keep. The first argument is the name of the data frame. The subsequent arguments are the names of columns that you want to extract. 
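In addition to listing column names directly (as the examples below do), select() also accepts helper functions that match columns by name; a hedged sketch:

library(nycflights13)
library(dplyr)
select(flights, starts_with("dep"))   # dep_time, dep_delay
select(flights, ends_with("delay"))   # dep_delay, arr_delay
select(flights, contains("time"))     # every column whose name contains "time"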
Extracting year, month, and day from flights > select(flights, year, month, day) # A tibble: 336,776 x 3 year month day 1 2013 1 1 2 2013 1 1 3 2013 1 1 4 2013 1 1 5 2013 1 1 6 2013 1 1 7 2013 1 1 8 2013 1 1 9 2013 1 1 10 2013 1 1 # … with 336,766 more rows All columns except between year and day (inclusive) > select(flights, -(year:day)) # A tibble: 336,776 x 16 dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight 1 517 515 2 830 819 11 UA 1545 2 533 529 4 850 830 20 UA 1714 3 542 540 2 923 850 33 AA 1141 4 544 545 -1 1004 1022 -18 B6 725 5 554 600 -6 812 837 -25 DL 461 6 554 558 -4 740 728 12 UA 1696 7 555 600 -5 913 854 19 B6 507 8 557 600 -3 709 723 -14 EV 5708 9 557 600 -3 838 846 -8 B6 79 10 558 600 -2 753 745 8 AA 301 # … with 336,766 more rows, and 8 more variables: tailnum , origin , # dest , air_time , distance , hour , minute , # time_hour All columns except year, day, and dep_time > select(flights, -c(year, day, dep_time)) # A tibble: 336,776 x 16 month sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight 1 1 515 2 830 819 11 UA 1545 2 1 529 4 850 830 20 UA 1714 3 1 540 2 923 850 33 AA 1141 4 1 545 -1 1004 1022 -18 B6 725 5 1 600 -6 812 837 -25 DL 461 6 1 558 -4 740 728 12 UA 1696 7 1 600 -5 913 854 19 B6 507 8 1 600 -3 709 723 -14 EV 5708 9 1 600 -3 838 846 -8 B6 79 10 1 600 -2 753 745 8 AA 301 # … with 336,766 more rows, and 8 more variables: tailnum , origin , # dest , air_time , distance , hour , minute , # time_hour Placing specific columns up front with everything() > select(flights, arr_time, arr_delay, everything()) # A tibble: 336,776 x 19 arr_time arr_delay year month day dep_time sched_dep_time dep_delay 1 830 11 2013 1 1 517 515 2 2 850 20 2013 1 1 533 529 4 3 923 33 2013 1 1 542 540 2 4 1004 -18 2013 1 1 544 545 -1 5 812 -25 2013 1 1 554 600 -6 6 740 12 2013 1 1 554 558 -4 7 913 19 2013 1 1 555 600 -5 8 709 -14 2013 1 1 557 600 -3 9 838 -8 2013 1 1 557 600 -3 10 753 8 2013 1 1 558 600 -2 # … with 336,766 more rows, and 11 more variables: sched_arr_time , # carrier , flight , tailnum , origin , dest , # air_time , distance , hour , minute , time_hour Renaming features with rename() > rename(flights, depTime = dep_time) # A tibble: 336,776 x 19 year month day depTime sched_dep_time dep_delay arr_time sched_arr_time 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 7 2013 1 1 555 600 -5 913 854 8 2013 1 1 557 600 -3 709 723 9 2013 1 1 557 600 -3 838 846 10 2013 1 1 558 600 -2 753 745 # … with 336,766 more rows, and 11 more variables: arr_delay , carrier , # flight , tailnum , origin , dest , air_time , # distance , hour , minute , time_hour Adding new features with mutate() Sometimes the features in a dataset are not the features of interest. For example, a dataset may contain features for distance and time, whereas the relevant feature is speed, or distance divided by time. The mutate() function allows you to generate new features that are functions of existing features (data transformation). The first argument is the name of the data frame. The subsequent arguments are expressions to generate new features, which are added as the last columns of the new data frame. 
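Before the flights examples on the next slides, a tiny hedged sketch on a toy tibble showing that mutate() returns a new data frame and leaves the input unchanged unless you assign the result:

library(tibble)
library(dplyr)
df <- tibble(distance = c(200, 1000), air_time = c(50, 150))
mutate(df, speed = distance / air_time * 60)  # new column appended at the end
df                                            # unchanged: the result above was not assigned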
Let’s first create a smaller dataset for easy visualization > flights_sml flights_sml # A tibble: 336,776 x 7 year month day dep_delay arr_delay distance air_time 1 2013 1 1 2 11 1400 227 2 2013 1 1 4 20 1416 227 3 2013 1 1 2 33 1089 160 4 2013 1 1 -1 -18 1576 183 5 2013 1 1 -6 -25 762 116 6 2013 1 1 -4 12 719 150 7 2013 1 1 -5 19 1065 158 8 2013 1 1 -3 -14 229 53 9 2013 1 1 -3 -8 944 140 10 2013 1 1 -2 8 733 138 # … with 336,766 more rows Adding two new features to our new data frame > mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60) # A tibble: 336,776 x 9 year month day dep_delay arr_delay distance air_time gain speed 1 2013 1 1 2 11 1400 227 -9 370. 2 2013 1 1 4 20 1416 227 -16 374. 3 2013 1 1 2 33 1089 160 -31 408. 4 2013 1 1 -1 -18 1576 183 17 517. 5 2013 1 1 -6 -25 762 116 19 394. 6 2013 1 1 -4 12 719 150 -16 288. 7 2013 1 1 -5 19 1065 158 -24 404. 8 2013 1 1 -3 -14 229 53 11 259. 9 2013 1 1 -3 -8 944 140 5 405. 10 2013 1 1 -2 8 733 138 -10 319. # … with 336,766 more rows Can also apply mutate to newly-generated features > mutate(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gains_per_hour = gain / hours) # A tibble: 336,776 x 10 year month day dep_delay arr_delay distance air_time gain hours gains_per_hour 1 2013 1 1 2 11 1400 227 -9 3.78 -2.38 2 2013 1 1 4 20 1416 227 -16 3.78 -4.23 3 2013 1 1 2 33 1089 160 -31 2.67 4 2013 1 1 -1 -18 1576 183 17 3.05 5.57 5 2013 1 1 -6 -25 762 116 19 1.93 9.83 6 2013 1 1 -4 12 719 150 -16 2.5 -6.4 7 2013 1 1 -5 19 1065 158 -24 2.63 -9.11 8 2013 1 1 -3 -14 229 53 9 2013 1 1 -3 -8 944 140 10 2013 1 1 -2 8 733 138 # … with 336,766 more rows 11 0.883 5 2.33 -10 2.3 -11.6 12.5 2.14 -4.35 Can use transmute() to only keep new variables > transmute(flights_sml, gain = dep_delay - arr_delay, hours = air_time / 60, gains_per_hour = gain / hours) # A tibble: 336,776 x 3 gain hours gains_per_hour 1 -9 3.78 -2.38 2 -16 3.78 -4.23 3 -31 2.67 4 17 3.05 5.57 5 19 1.93 9.83 -11.6 6 -16 2.5 -6.4 7 -24 2.63 -9.11 8 11 0.883 9 5 2.33 10 -10 2.3 12.5 2.14 -4.35 # … with 336,766 more rows New features can be generated if vectorizable There are numerous expressions that can be used to create new features with mutate(). An important property of these expressions is that they must be vectorizable. That is, the expression must take a vector (list) of input values, and then return a vector (list) with the same number of values as output. We cannot exhaustively describe all such possible functions, but we will introduce a few on the next slide. Examples of vectorizable operations > x (x * 2) + 1 [1] 3 21 201 2001 20001 > log10(x) [1] 0 1 2 3 4 > cumsum(x) [1] 1 11 111 1111 11111 > cumsum(log10(x)) [1] 0 1 3 6 10 > log10(x) / sum(log10(x)) [1] 0.0 0.1 0.2 0.3 0.4 > log10(x) > 2 [1] FALSE FALSE FALSE TRUE TRUE Summarizing data with summarize() The summarize() function collapses a data frame into a single row by summarizing it based on particular features. For example, we can summarize flights by mean delay time: > summarize(flights, delay = mean(dep_delay, na.rm = TRUE)) # A tibble: 1 x 1 delay 1 12.6 This is not terribly useful on its own (we can just use mean()here). But we can use summarize() with group_by() to summarize the dataset by groups of observations rather than by the whole dataset. 
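As a quick hedged sketch of that idea before the slides' own year/month/day example, we can group by carrier (a grouping not used in the slides) and then summarize each group:

library(nycflights13)
library(dplyr)
by_carrier <- group_by(flights, carrier)
summarize(by_carrier,
          n_flights = n(),                               # flights per carrier
          mean_dep_delay = mean(dep_delay, na.rm = TRUE))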
Grouping data with group_by() > by_day by_day # A tibble: 336,776 x 19 # Groups: year, month, day [365] year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD # … with 336,766 more rows, and 5 more variables: air_time , distance , hour , minute , time_hour Summarizing data by groups of observations > by_day summarize(by_day, delay = mean(dep_delay, na.rm = TRUE)) # A tibble: 365 x 4 # Groups: year, month [12] year month day delay 1 2013 1 1 11.5 2 2013 1 2 13.9 3 2013 1 3 11.0 4 2013 1 4 8.95 5 2013 1 5 5.73 6 2013 1 6 7.15 7 2013 1 7 5.42 8 2013 1 8 2.55 9 2013 1 9 2.28 10 2013 1 10 2.84 # … with 355 more rows Summarizing with multiple operations (removing “noise”) > by_dest delay delay 20, dest != "HNL") > delay # A tibble: 96 x 4 dest count dist delay 1 ABQ 254 1826 4.38 2 ACK 265 199 4.85 3 ALB 439 143 4 ATL 17215 5 AUS 14.4 757. 11.3 2439 1514. 6.02 6 AVL 275 584. 8.00 7 BDL 443 116 7.05 8 BGR 375 378 8.03 9 BHM 297 866. 16.9 10 BNA 6333 758. 11.8 # … with 86 more rows Flight delays only increase with distance up to ~750 miles > ggplot(data = delay, mapping = aes(x = dist, y = delay)) + geom_point(mapping = aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE) Perhaps as flights get longer there is a greater ability to make up for delays while in the air. Three steps to prepare data in last example > by_dest delay delay 20, dest != "HNL") Step 1: Group flights by destination. Step 2: Summarize to compute distance, average delay, and number of flights. Step 3: Filter to remove noisy points and Honolulu, which is close to twice as far away as the next closest airport to New York City. A problem with the above code is that we need to create, name, and store a new data frame at each intermediate step, even if we ultimately do not care about these intermediate datasets. Using pipes %>% to efficiently tackle identical problem > delays % group_by(dest) %>% summarize(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>% filter(count > 20, dest != "HNL") Here the focus is on the transformations, not what is being transformed, which makes the code easier to read. The above code can be read as: group, then summarize, then filter. Not filtering NA values leads to missing summaries What would happen if we did not use na.rm = TRUE? > flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay)) # A tibble: 365 x 4 # Groups: year, month [12] year month day mean 1 2013 1 1 NA 2 2013 1 2 NA 3 2013 1 3 NA 4 2013 1 4 NA 5 2013 1 5 NA 6 2013 1 6 NA 7 2013 1 7 NA 8 2013 1 8 NA 9 2013 1 9 NA 10 2013 1 10 NA # ... 
with 355 more rows It is a good idea to be careful and use na.rm argument > flights %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay, na.rm = TRUE)) # A tibble: 365 x 4 # Groups: year, month [12] year month day mean 1 2013 1 1 11.5 2 2013 1 2 13.9 3 2013 1 3 11.0 4 2013 1 4 8.95 5 2013 1 5 5.73 6 2013 1 6 7.15 7 2013 1 7 5.42 8 2013 1 8 2.55 9 2013 1 9 2.28 10 2013 1 10 2.84 # ... with 355 more rows What if NA values relate to events of interest? In the flights dataset, missing values represent cancelled flights. That is, for cancelled flights, features like dep_delay and arr_delay are missing because the flight never departed or arrived. Therefore, we could create a new dataset based only on flights that are not cancelled (by filtering flights with NA delay), and then learn about the mean departure delay without needing to worry about filtering NA values. The following slide conveys this idea. Summarizing the set of non-cancelled flights > not_cancelled % filter (!is.na(dep_delay), !is.na(arr_delay)) > not_cancelled %>% group_by(year, month, day) %>% summarize(mean = mean(dep_delay)) # A tibble: 365 x 4 # Groups: year, month [12] year month day mean 1 2013 1 1 11.4 2 2013 1 2 13.7 3 2013 1 3 10.9 4 2013 1 4 8.97 5 2013 1 5 5.73 6 2013 1 6 7.15 7 2013 1 7 5.42 8 2013 1 8 2.56 9 2013 1 9 2.30 10 2013 1 10 2.84 # ... with 355 more rows Including count summaries with data aggregation Whenever data is being aggregated or summarized in some way, it is a good idea to include either a count with the function n(), or a count of non-missing values with the function sum(!is.na()). This helps you ensure that you are not drawing conclusions from small samples of the original data. Moreover, the power of statistical tests (discussed later in the course) depends on sample size to an extent. Let’s look at an example of why this is important. Some flights have mean delays of >5 hours! > delays % group_by(tailnum) %>% summarize(delay = mean(arr_delay)) > ggplot(data = delays, mapping = aes(x = delay)) + geom_freqpoly(binwidth = 10) This sounds like a very interesting result! The geom_freqpoly() function In this example, we introduced a new geometric function for plotting called geom_freqpoly(), which is similar to a histogram, except that it plots a line graph, where the line goes through the top center of each bin of the analogous histogram. > ggplot(data = delays, mapping = aes(x = delay)) + geom_histogram(binwidth = 10) + geom_freqpoly(binwidth = 10, size = 2) We can see this effect more clearly here. Story is a little more nuanced These figures show that there were some flights with mean delay times of over five hours, but that it was relatively rare (few flights). However, both figures only provided us with summaries of the data. We can potentially learn more if we understood the variation in mean delay time as a function of the number of flights with that delay. Greater variation in mean delay with fewer flights > delays % group_by(tailnum) %>% summarize(delay = mean(arr_delay, na.rm = TRUE), n = n()) > ggplot(data = delays, mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10) Characteristic trend of less variation with increasing 𝑛 The above graph displays a plot that is commonly observed, whenever the mean is plotted as a function of sample size 𝑛. Here, variation in the summary (mean) decreases as sample size 𝑛 increases. That is, variation in mean is decreasing with increased sample size. 
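This is one reason the slides recommend always carrying a count when aggregating; a hedged sketch applying the same idea to destinations rather than individual planes (the column names here are illustrative):

library(nycflights13)
library(dplyr)
flights %>%
  group_by(dest) %>%
  summarize(n_flights     = n(),                       # all flights to this destination
            n_not_missing = sum(!is.na(arr_delay)),    # flights that actually arrived
            mean_delay    = mean(arr_delay, na.rm = TRUE))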
Removing small samples reduces mean delay and variation > delays %>% filter(n > 25) %>% ggplot(mapping = aes(x = n, y = delay)) + geom_point(alpha = 1/10) Useful summary functions In the last lecture, we introduced two measures of location and their functions: Mean (average value), mean() Median (middle value), median() We also introduced two measures of spread and their functions: Standard deviation, sd() Variance, var() All of these functions, as well as many others used to summarize values, have the argument na.rm to remove NA values. Summarizing flights by standard deviation of distance > not_cancelled %>% group_by(dest) %>% summarize(distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) # A tibble: 104 x 2 dest distance_sd 1 EGE 10.5 2 SAN 10.4 3 SFO 10.2 4 HNL 10.0 5 SEA 9.98 6 LAS 9.91 7 PDX 9.87 8 PHX 9.86 9 LAX 9.66 10 IND 9.46 # ... with 94 more row Measures of rank In the last lecture, we also discussed min()and max(), which return the minimum and maximum value in a list, respectively. The quantile() function is a generalization of median(), which returns the 50% quantile. Examples: > x min(x) [1] 1 > max(x) [1] 100 # Value of x greater than 25% of values and less than other 75% > quantile(x, 0.25) [1] 25.75 Finding out when first and last flight departed each day > not_cancelled %>% group_by(year, month, day) %>% summarize(first = min(dep_time), last = max(dep_time)) # A tibble: 365 x 5 # Groups: year, month [12] year month day first last 1 2013 1 1 517 2356 2 2013 1 2 42 2354 3 2013 1 3 32 2349 4 2013 1 4 25 2358 5 2013 1 5 14 2357 6 2013 1 6 16 2355 7 2013 1 7 49 2359 8 2013 1 8 454 2351 9 2013 1 9 2 2252 10 2013 1 10 3 2320 # ... with 355 more rows Measures of position Useful measures of position are first(), nth(), and last(). The first() and last() functions return the first and last element in a list, respectively (identical to min() and max() if list is sorted). The nth() function returns the 𝑛th element of a list. Examples: > x first(x) [1] 10 > last(x) [1] 4 > nth(x, 3) [1] -5 First and last departure of each date > not_cancelled %>% group_by(year, month, day) %>% summarize(first_dep = first(dep_time), last_dep = last(dep_time)) # A tibble: 365 x 5 # Groups: year, month [12] year month day first_dep last_dep 1 2013 1 1 517 2356 2 2013 1 2 42 2354 3 2013 1 3 32 2349 4 2013 1 4 25 2358 5 2013 1 5 14 2357 6 2013 1 6 16 2355 7 2013 1 7 49 2359 8 2013 1 8 454 2351 9 2013 1 9 2 2252 10 2013 1 10 3 2320 # ... with 355 more rows Counts We introduced n(), which takes no arguments and returns the size of the current group, and sum(!is.na()), which takes in the name of a group and returns the number of non-missing values. We can also calculate number of distinct (unique) values with n_distinct(). Examples: > x sum(!is.na(x)) [1] 8 # NA values count as a distinct value > n_distinct(x) [1] 4 Which destination has the most carriers? > not_cancelled %>% group_by(dest) %>% summarize(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) # A tibble: 104 x 2 dest carriers 1 ATL 7 2 BOS 7 3 CLT 7 4 ORD 7 5 TPA 7 6 AUS 6 7 DCA 6 8 DTW 6 9 IAD 6 10 MSP 6 # ... with 94 more rows Built-in helper function within dplyr if only want count > not_cancelled %>% count(dest) # A tibble: 104 x 2 dest n 1 ABQ 254 2 ACK 3 ALB 264 418 4 ANC 5 ATL 8 16837 6 AUS 7 AVL 2411 261 8 BDL 9 BGR 412 358 10 BHM 269 # ... 
with 94 more rows Total number of flights for each plane > not_cancelled %>% count(tailnum) # A tibble: 4,037 x 2 tailnum n 1 D942DN 4 2 N0EGMQ 3 N10156 352 145 4 N102UW 5 N103US 48 46 6 N104UW 7 N10575 46 269 8 N105UW 9 N107US 45 41 10 N108UW 60 # ... with 4,027 more rows Weight argument to tally total miles a plane flew > not_cancelled %>% count(tailnum, wt = distance) # A tibble: 4,037 x 2 tailnum n 1 D942DN 3418 2 N0EGMQ 3 N10156 239143 109664 4 N102UW 5 N103US 25722 24619 6 N104UW 7 N10575 24616 139903 8 N105UW 9 N107US 23618 21677 10 N108UW 32070 # ... with 4,027 more rows Counts and proportions of logical values In sum(!is.na()), sum() treats any row as a 1 if !is.na() evaluates to TRUE, and 0 otherwise. Therefore, we can use sum() to count the number of rows satisfying a logical statement, and use mean() to obtain the proportion of rows. Example: > x sum(!is.na(x)) [1] 8 # The proportion of values that are not NA > mean(!is.na(x)) [1] 0.8 How many flights left before 5am (prior day delays)? # Recall dep_time formatted as HHMM or HMM (H = hour, M = min) > not_cancelled %>% group_by(year, month, day) %>% summarize(n_early = sum(dep_time < 500)) # A tibble: 365 x 4 # Groups: year, month [12] year month day n_early 1 2013 1 1 0 2 2013 1 2 3 3 2013 1 3 4 4 2013 1 4 3 5 2013 1 5 3 6 2013 1 6 2 7 2013 1 7 2 8 2013 1 8 1 9 2013 1 9 3 10 2013 1 10 3 # ... with 355 more rows What proportion of flights were delayed more >1 hour? > not_cancelled %>% group_by(year, month, day) %>% summarize(hour_prop = mean(arr_delay > 60)) # A tibble: 365 x 4 # Groups: year, month [12] year month day hour_prop 1 2013 1 1 0.0722 2 2013 1 2 0.0851 3 2013 1 3 0.0567 4 2013 1 4 0.0396 5 2013 1 5 0.0349 6 2013 1 6 0.0470 7 2013 1 7 0.0333 8 2013 1 8 0.0213 9 2013 1 9 0.0202 10 2013 1 10 0.0183 # ... with 355 more rows Finding all groups larger than threshold # Find destinations with more than one daily flight on average > popular_dests % group_by(dest) %>% filter(n() > 365) > popular_dests # A tibble: 332,577 x 19 # Groups: dest [77] year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin < dbl> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA # ... with 332,567 more rows, and 6 more variables: dest , air_time , distance , hour , # minute , time_hour Standardizing to compute per group metrics # Proportion of delay time for each destination accounted for # by a particular flight (sum to 1 for each of 77 dests). > popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, dest, arr_delay, prop_delay) # A tibble: 131,106 x 6 # Groups: dest [77] year month day dest arr_delay prop_delay 1 2013 1 1 IAH 11 0.000111 2 2013 1 1 IAH 20 0.000201 3 2013 1 1 MIA 33 0.000235 4 2013 1 1 ORD 12 0.0000424 5 2013 1 1 FLL 19 0.0000938 6 2013 1 1 ORD 8 0.0000283 7 2013 1 1 LAX 7 0.0000344 8 2013 1 1 DFW 31 0.000282 9 2013 1 1 ATL 12 0.0000400 10 2013 1 1 DTW 16 0.000116 # ... 
with 131,096 more rows Lecture summary We discussed how to use a variety of functions in the dplyr package for data manipulations and transformations: filter() Pick observations by their values arrange() Reorder rows select() Pick features by their names mutate() Create new features with functions of existing features summarize() Collapse many features down to a single summary group_by() Changes scope of above functions from operating on all data to operating on it group-by-group *We also discussed many other related functions and their applications. CAP 5768: Introduction to Data Science Exploratory Data Analysis What is exploratory data analysis? Exploratory data analysis is the act of visualizing, transforming, and modeling data to explore it in a systematic way. The goal of exploratory data analysis is to gain some preliminary understanding of your data before moving on to statistical learning. We often use exploratory data analysis to investigate the quality of data and clean them as necessary for downstream analyses. Exploratory data analysis has no rules During initial phases of exploratory data analysis, you should feel free to investigate every idea that occurs to you. Some will pan out, whereas others will be dead ends. The process is therefore an iterative cycle in which you: 1. Generate questions about your data. 2. Search for answers by visualizing, transforming, and modeling your data. 3. Use what you have learned to refine your questions and/or generate novel questions. Exploratory data analysis is thus a creative process The questions that you ask help you focus on a specific aspect of your data and choose transformations, visualizations or models. For this reason, it is important to ask high quality questions. Though there are no rules, two types of questions will always be useful for making discoveries about your data: 1) What type of variation occurs within features? 2) What type of variation occurs among features? Exploratory data analysis is thus a creative process The questions that you ask help you focus on a specific aspect of your data and choose transformations, visualizations or models. For this reason, it is important to ask high quality questions. Though there are no rules, two types of questions will always be useful for making discoveries about your data: 1) What type of variation occurs within features? 2) What type of variation occurs among features? Examples of variation? Visualizing distributions We have already seen several examples of ways to visualize distributions (e.g., bar plots, histograms, boxplots). How you visualize the distribution of a feature will depend on whether that feature is categorical or continuous. Categorical features take on a finite set of values (“categories”). Continuous features take on an infinite set of values (“quantities”). Visualizing categorical distributions with bar plots One way to visualize categorical features is with bar plots, which summarize variation within a single feature. > ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) Plotted values can be computed without visualization We can compute the plotted values without visualizing by using some of the techniques we learned in the last lecture: > diamonds %>% count(cut) # A tibble: 5 x 2 cut n 1 Fair 1610 2 Good 4906 3 Very Good 12082 4 Premium 5 Ideal 13791 21551 Visualizing continuous distributions with histograms One way to visualize continuous features is with histograms, which summarize variation within a single feature. 
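Returning briefly to the bar plot above: the homework uses geom_col(), which differs from geom_bar() in that you supply the bar heights yourself; a hedged sketch, after which the histogram example continues:

library(tidyverse)
cut_counts <- diamonds %>% count(cut)      # the same counts computed above
ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))  # bars drawn from the supplied n values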
> ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5) Units of 𝑥 (carat) Visualizing continuous distributions with histograms One way to visualize continuous features is with histograms, which summarize variation within a single feature. > ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5) Units of 𝑥 (carat) Is this bin width appropriate for identifying fine-scale variation in carat? Plotted values can be computed without visualization We can compute the plotted values without visualizing by using some of the techniques we learned in the last lecture, and with a new function cut_width() that behaves the same as the binwidth argument of geom_histogram(): > diamonds %>% count(cut_width(carat, 0.5)) # A tibble: 11 x 2 `cut_width(carat, 0.5)` 1 [-0.25,0.25] n 785 2 (0.25,0.75] 29498 3 (0.75,1.25] 15977 4 (1.25,1.75] 5313 5 (1.75,2.25] 2002 6 (2.25,2.75] 322 7 (2.75,3.25] 32 8 (3.25,3.75] 5 9 (3.75,4.25] 4 10 (4.25,4.75] 1 11 (4.75,5.25] 1 Exploring sub-regions of data and altering scale In the histogram below, we have a bin width that may be too large to identify any fine-scale variation in carat. Moreover, because only a few diamonds are ≥ 3 carats, it may be helpful to filter the data and explore variation in smaller diamonds. Extracting small diamonds & examining fine-scale patterns > smaller % filter(carat < 3) > ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.1) Multiple modes of smaller diamonds By zooming into the parameter space where most observations lie (diamonds weighing < 3 carats) and choosing a finer scale binning, we can see that there are multiple modes (peaks) in the distribution, potentially corresponding to distinct subgroups of diamonds: Plotting distributions of multiple continuous features Before exploring this variation further, we will briefly discuss how to plot distributions of multiple continuous features. Suppose we wish to create a separate histogram for each quality of diamond (cut), and depict each histogram with a different color. Our first thought might be to use the geom_histogram() function with the code: > ggplot(data = smaller, mapping = aes(x = carat, color = cut)) + geom_histogram(binwidth = 0.1) However, the result of this code may not be what we envisioned. Using different colors as histograms not what hoped for > ggplot(data = smaller, mapping = aes(x = carat, color = cut)) + geom_histogram(binwidth = 0.1) What about filling in the histograms instead? > ggplot(data = smaller, mapping = aes(x = carat, fill = cut)) + geom_histogram(binwidth = 0.1) Here, we have created a set of stacked histograms, making the distributions difficult to compare across diamond quality (cut). Using geom_freqpoly() to plot multiple histograms > ggplot(data = smaller, mapping = aes(x = carat, color = cut)) + geom_freqpoly(binwidth = 0.1) Exploiting typical values in exploratory data analysis In bar plots and histograms, tall bars display common (typical) values of a feature, and short bars display uncommon (atypical) values. Regions without bars correspond to values that are not observed in your data at all. This information can be turned into useful questions for exploratory data analysis: - Which values are the most common, and why? - Which value are rare, and why? - Do these observations match our expectations? - Are there any unusual patterns? What might explain them? 
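One hedged way to start answering the first of these questions numerically, before looking at the finer-scale histogram below:

library(tidyverse)
diamonds %>% count(carat, sort = TRUE)   # the most common carat values, largest counts first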
Consider distribution of diamond weight at a finer scale > ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.01) Questions to ask about these data - Why are there more diamonds at whole carats and common fractions of carats? - Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak? - Why are there no diamonds larger than three carats? Exploring clusters of similar values Clusters of similar values suggest that subgroups exist in the data. To understand the subgroups, we often ask: - How are observations within each cluster similar? - How are observations in separate clusters different? - How can you explain or describe the clusters? - Why might the appearance of clusters be misleading? Let’s look at clusters within an example dataset. Example: The Old Faithful Geyser dataset This dataset contains 272 observations on waiting times between eruptions (waiting) and durations of eruptions (eruptions) for the Old Faithful Geyser in Yellowstone National Park. What are the features? The date are tidied into a data frame called faithful, which is located in the MASS package. MASS is one of many packages that are automatically installed with base R. Do we need to type library(MASS)? Viewing the top (head) of the faithful data frame The faithful data frame has two features: eruptions and waiting. We can use the head() function to view the first six (or other specified number) rows of the data frame: > head(faithful) eruptions waiting 1 2 3.600 1.800 79 54 3 4 3.333 2.283 74 62 5 6 4.533 2.883 85 55 What function could use to look at the last six rows? Visualizing distribution of eruption times (eruptions) > ggplot(data = faithful, mapping = aes(x = eruptions)) + geom_histogram(binwidth = 0.25) Eruption times appear to be clustered into two groups: 1) Short eruptions (mode of ~2 minutes) 2) Long eruptions (mode of ~4.5 minutes) Comparing short vs. long eruptions Recall that to understand subgroups, we often ask: - How are observations within each cluster similar? - How are observations in separate clusters different? - How can you explain or describe the clusters? - Why might the appearance of clusters be misleading? To answer these questions, we might examine other features. In this case, there is just one other feature: waiting times between eruptions, waiting. Why is domain knowledge important here? Outliers Outliers are observations that are unusual, or that do not seem to fit a particular pattern. Recall that we briefly discussed outliers when we introduced boxplots in our lecture on R and data visualization. Sometimes outliers are data entry errors, whereas other times they suggest important new findings. In large datasets, outliers can be difficult to identify from a histogram. Let’s look at an example to see why this issue exists. Outliers are “invisible” in histogram of diamond width (y) > ggplot(data = diamonds) + geom_histogram(mapping = aes(x = y), binwidth = 0.5) Only evidence here is an unusually large 𝑥 axis range There are so many observations in the common bins that the rare bins are so short that you cannot see them at all (for plotting purposes, they effectively have a 0 count on the 𝑦 axis scale). 
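A hedged numeric complement to the histogram for spotting such hidden outliers (the zoomed-in plot follows below):

library(tidyverse)
summary(diamonds$y)                        # the minimum of 0 and a maximum near 59 stand out
quantile(diamonds$y, c(0, 0.01, 0.99, 1))  # extreme quantiles versus the bulk of the data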
Zooming into the 𝑦 axis with coord_cartesian()
> ggplot(data = diamonds) +
    geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
    coord_cartesian(ylim = c(0, 50))
The coord_cartesian() function allows the user to change the plotted coordinate system, and to alter the range (limits) of the axes, with the above code setting the 𝑦 axis range to go from 0 to 50. The 𝑥 axis scale can be similarly adjusted with the xlim argument.

Zooming into the 𝑦 axis allows us to see outliers
This plot allows us to see three outliers of y: 0, ~30, and ~60. We can use some tools in dplyr to extract these outliers and explore their properties.

Using dplyr to extract all outliers of diamond width (y)
> outliers <- diamonds %>%
    filter(y < 3 | y > 20) %>%
    select(price, x, y, z) %>%
    arrange(y)
> outliers
# A tibble: 9 x 4
   price     x     y     z
1   5139  0     0     0
2   6381  0     0     0
3  12800  0     0     0
4  15686  0     0     0
5  18034  0     0     0
6   2130  0     0     0
7   2130  0     0     0
8   2075  5.15  31.8  5.12
9  12210  8.09  58.9  8.06

Something is wrong with these outliers
The feature y measures the width of a diamond in mm. But we know that diamonds cannot have a width of 0mm! Similarly, diamonds cannot have a length (x) or depth (z) of 0mm. Also, widths (y) of ~32mm and ~59mm are quite large (over an inch). Diamonds this wide would not cost less than $100,000 (Google). Another example of the importance of domain knowledge!

What to do with outliers?
It is good practice to perform analyses with and without outliers. If the outliers have a minimal effect on results, then it is usually okay to remove them. However, if the outliers have a substantial effect on results, then they should not be removed without further investigation and justification. In this case, you will first need to figure out what caused these outliers (e.g., a data entry error). If you proceed to remove these outliers, then you will also need to disclose this in your report or paper.

Two options for removing outliers
1. Remove all rows (observations) containing outliers:
> diamonds2 <- diamonds %>% filter(between(y, 3, 20))
Note that the code between(x, left, right) is a shortcut for x >= left & x <= right.
2. Replace the outlying values with missing values (NA):
> diamonds2 <- diamonds %>% mutate(y = ifelse(y < 3 | y > 20, NA, y))
Note that we are using a new function: ifelse(test, yes, no). The value of the first argument test is a logical expression, which evaluates to yes when TRUE, and no when FALSE. This option is preferred, as it ensures that all other measurements on the observation can be used for downstream analyses.

But sometimes missing values are not errors
Recall that in the flights data frame, missing values in the dep_time feature indicate that the flight was canceled.
To compare the scheduled departure times for canceled and noncanceled flights, a new variable can be created with is.na(): > flights %>% mutate(canceled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + sched_min / 60) Note that we have introduced two new operators here: %/% and %%, which respectively yield the quotient and the remainder from a division operation (examples on next slide). Examples of obtaining quotients and remainders > 45 %/% 45 [1] 1 > 45 %% 45 [1] 0 > 45 %/% 20 [1] 2 > 45 %% 20 [1] 5 > 45 %/% 50 [1] 0 > 45 %% 50 [1] 45 Departure times for canceled vs. non-canceled flights > flights %>% mutate(canceled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + sched_min / 60) %>% ggplot(mapping = aes(sched_dep_time)) + geom_freqpoly(mapping = aes(color = canceled), binwidth = 1/4) Our graph is not that informative The difference in scale for the 𝑦 axis between the two categories makes it impossible to compare their distributions. This is because there are many more non-canceled flights than canceled flights (this makes sense). We should instead plot densities, which places distributions on the same scale because the area under each density curve sums to one. Departure times for canceled vs. non-canceled flights > flights %>% mutate(canceled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + sched_min / 60) %>% ggplot(mapping = aes(x=sched_dep_time, y=..density..)) + geom_freqpoly(mapping = aes(color = canceled), binwidth = 1/4) New plot enables comparison between categories A quick glance shows that the shapes of distributions for cancelled and non-cancelled flights are similar. However, careful inspection suggests that canceled flights may be more likely to occur later in the day. Ultimately, this difference can be evaluated with a formal statistical test (model), but our exploration helped us arrive at this hypothesis! Covariation So far we have discussed variation within a single feature, which can be visualized with a bar plot when the feature is categorical and with histograms (different types) when it is continuous. Covariation describes variation among features, and is the tendency for values of two or more features to change together systematically. A key mechanism of identifying covariation is by visualizing the relationship between two or more features. Visualizations of covariation As with a single feature, how you visualize covariation among features depends on the types of features you are considering (categorical vs. continuous). We will consider covariation between two features in this lecture. There are three combinations for two features: 1) Categorical vs. continuous 2) Categorical vs. categorical 3) Continuous vs. continuous Covariation between categorical & continuous features It is common to explore the distribution of a continuous feature conditional on a categorical feature. We have already used two approaches for visualizing this: frequency polygons geom_freqpoly() and boxplots geom_boxplot(). In a frequency polygon, each distribution of a continuous feature is conditional on a particular categorical feature, which is plotted with the count (or density) on the 𝑦 axis. Boxplots summarize the distribution of a continuous feature conditional on a particular categorical feature. Frequency polygons for diamond cut vs. 
price > ggplot(data = diamonds, mapping = aes(x = price)) + geom_freqpoly(mapping = aes(color = cut), binwidth = 500) Unfortunately, it is difficult to really compare distributions of diamond price across cut categories because their counts differ a lot! Recall distribution of counts for cut Earlier, we used a boxplot to display the frequency of each diamond quality (cut) in our dataset: > ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) Using densities for frequency polygons of cut vs. price > ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + geom_freqpoly(mapping = aes(color = cut), binwidth = 500) This is another scenario where it is useful to plot frequency polygons as densities so that the area under each distribution sums to one. A surprising result! This plot shows that the lowest quality (fair) diamonds have the highest average price. But it is possible that we are just being misled by looking at distributions as frequency polygons, and maybe examining summaries of distributions would solidify if this were the case. We can examine such summaries using boxplots. Boxplots to summarize diamond cut vs. price > ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = cut, y = price)) Indeed, the distribution of prices of lowest quality (fair) diamonds is shifted upward, and the median appears higher than other medians. Adding notches to assess statistical differences Notches can be added to boxplots using the notch argument, and represent a confidence interval around the median: > ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = reorder(cut, price, FUN = median), y = price), notch = TRUE) Non-overlapping notches suggest strong evidence that medians differ. Why might fair diamonds have highest average price? We will explore this question a little more when we discuss covariation between continuous features. As a sneak peek, we will need to identify a feature that has a strong relationship with diamond price. Covariation between categorical features To visualize covariation between categorical variables, we need to count the number of observations for each combination of categories. One mechanism for doing this and visualizing the distribution is through the geom_count() function. The geom_count() function acts just like any other ggplot2 geometric layer, and requires an aesthetic be specified for a mapping argument indicating which features are plotted on the 𝑥 and 𝑦 axes. The plot will be a discrete two-dimensional grid, with circles at each cell in the grid, and the circle size being proportional to the number of values falling within the pair of categories. Covariation will appear as a strong correlation between the features. Using geom_count() for cut vs. color > ggplot(data = diamonds) + geom_count(mapping = aes(x = cut, y = color)) Plotted values can be computed without visualization We can compute the plotted values without visualizing by using some of the techniques we learned from Lecture 3: > diamonds %>% count(color, cut) # A tibble: 35 x 3 color cut n 1 D Fair 163 2 D Good 662 3 D Very Good 1513 4 D Premium 1603 5 D Ideal 2834 6 E Fair 224 7 E Good 933 8 E Very Good 2400 9 E Premium 2337 Ideal 3903 10 E # ... with 25 more rows Using geom_tile() to create heatmap for cut vs. 
color > diamonds %>% count(color, cut) %>% ggplot(mapping = aes(x = color, y = cut)) + geom_tile(mapping = aes(fill = n)) Covariation between continuous features We have already explored a way to examine covariation between two continuous features, which was a scatterplot using geom_point(): > ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price)) Why might fair diamonds have highest average price? The plot below suggests that diamond price increases as diamond weight (carat) increases: Thus, maybe the reason for the higher average price of low-quality (fair) diamonds is that low-quality diamonds tend to be heavier. We can explore this hypothesis by creating a boxplot to summarize the distribution of carat as a function of cut. Boxplot with distribution of carat as a function of cut > ggplot(data = diamonds) + geom_boxplot(mapping = aes(x = reorder(cut, carat, FUN = median), y = carat), notch = TRUE) Supporting our hypothesis, fair diamonds have their weight distribution shifted toward larger values. Also, notches do not overlap between fair and other qualities (cut). Visualizing distributions of overlapping points The plot to the left contains over 50,000 points, making it difficult to visualize where observations tend to be concentrated. For such a plot, it can be beneficial to set the transparency of points such that overlaps can be observed (right), with the code: > ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price), alpha = 1/100) Using transparency can be challenging for large datasets This is because all regions will look the same (black) when there are many points. Whereas we may initially consider increasing transparency, the issue is that this makes points invisible in regions with little data: A better solution is to plot overlaps as a heatmap using either the geom_bin2d() or the geom_hex() functions. Illustrating scatterplot heatmaps Let’s reconsider the filtered dataset smaller we generated earlier, which is the set of diamonds weighing less than three carats: > smaller % filter(carat < 3) Suppose we create a scatterplot with transparency: > ggplot(data = smaller) + geom_point(mapping = aes(x = carat, y = price), alpha = 1/100) Scatterplot heatmap with geom_bin2d() > ggplot(data = smaller) + geom_bin2d(mapping = aes(x = carat, y = price)) Scatterplot heatmap with geom_hex() > install.packages("hexbin") > library(hexbin) > ggplot(data = smaller) + geom_hex(mapping = aes(x = carat, y = price)) Discretizing one continuous feature through binning It is possible to partition continuous values into a discrete number of “bins” (as is done with a histogram), enabling us to convert a continuous feature into a categorical feature. This permits us to employ techniques for visualizing combinations of categorical and continuous features (e.g., boxplots) . To perform binning, we must choose a feature to partition, and assign each bin of the feature as a separate group (category). The cut_width() function partitions values evenly across bins according to a width argument. The cut_number() function partitions values into a specific number of bins, each with approximately the same number of observations. 
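The slides' own cut_width() and cut_number() examples follow; one optional refinement, sketched here under the assumption that box width should reflect bin size, is the varwidth argument of geom_boxplot():

library(tidyverse)
smaller <- diamonds %>% filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.25)), varwidth = TRUE)
  # varwidth = TRUE scales each box's width with the number of observations in its bin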
Binning with cut_width()
> ggplot(data = smaller, mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Binning with cut_number()
> ggplot(data = smaller, mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

The main goal of data visualization is to find patterns
Patterns in data provide clues about relationships among features (covariation), which is often what you are seeking to identify. If you spot a pattern, then you should ask:
- Could the pattern be due to coincidence (random chance)?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other features might affect the relationship?
- Does the relationship differ across subgroups of the data?

Recall the clustering of eruption times into two groups
> ggplot(data = faithful, mapping = aes(x = eruptions)) + geom_histogram(binwidth = 0.25)
Why do there appear to be two clusters of eruption times? Is it possible that eruption time covaries with some other feature?

Scatterplot of waiting as a function of eruptions
> ggplot(data = faithful) + geom_point(mapping = aes(x = eruptions, y = waiting))
Here we see a pattern: the length of an eruption covaries with the length of time between eruptions.

Patterns reveal covariation
Patterns are highly useful because they reveal covariation. Variation is a phenomenon that creates uncertainty, whereas covariation is a phenomenon that reduces uncertainty. If two features covary, then you can use the values of one feature to make better predictions about the values of the other. If the covariation is due to a causal relationship, then you can also use the value of one feature to control the value of the other.

Models are tools for extracting patterns from data
Consider again the diamonds data, where it was difficult to understand the relationship between diamond quality (cut) and price. The reasoning is that cut and carat, and carat and price, are tightly related. We can use a model to remove the very strong relationship between price and carat, and then explore the subtleties that remain.

Loading the modelr package
Before moving on, we will briefly use one feature of the modelr package, which is installed with the tidyverse package. The modelr package has functions that allow seamless integration of modeling into data manipulation and visualization pipelines. However, modelr is not automatically loaded with tidyverse, so we must first load it with the command:
> library(modelr)

Relationship between price and carat
> ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price))
We want to model the relationship between price and carat. However, this relationship does not appear to be linear.

Logarithmic transformation to make the data (more) linear
> ggplot(data = diamonds) + geom_point(mapping = aes(x = log(carat), y = log(price)))
After transforming both features with a natural logarithm, the relationship looks approximately linear, so it is now appropriate to model it with a linear function.
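As a quick numerical check (a sketch using base R's cor(), not part of the lecture code), we can compute the correlation between the log-transformed features; a value close to 1 supports fitting a straight line:
> cor(log(diamonds$carat), log(diamonds$price))   # correlation on the log scale; a value near 1 indicates a strong linear relationship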
Modeling diamond price as a function of weight
We can use simple linear regression (a straight line) to predict the logarithm of diamond price from the logarithm of diamond weight. This can be accomplished by fitting a linear model with the lm() function, which takes as its first argument a formula (~) relating the feature on the 𝑦 axis (left of the ~) to the feature on the 𝑥 axis (right of the ~), along with a data argument naming the data frame:
> mod <- lm(log(price) ~ log(carat), data = diamonds)
> mod

Call:
lm(formula = log(price) ~ log(carat), data = diamonds)

Coefficients:
(Intercept)   log(carat)
      8.449        1.676

Because both features are on a natural-logarithm scale, this fit corresponds to log(price) ≈ 8.449 + 1.676 × log(carat), i.e., price ≈ exp(8.449) × carat^1.676 on the original scale.

Plotting the output of simple linear regression
> ggplot(data = diamonds, mapping = aes(x = log(carat), y = log(price))) + geom_point() + geom_smooth(method = "lm", se = FALSE, color = "red")

Adding residuals (prediction errors) to the data frame
The modelr package has a function add_residuals(), which appends a column containing the model residuals (prediction errors) to a data frame. The first argument is the data frame, and the second is the model:
> add_residuals(diamonds, mod)
# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z   resid
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 -0.199
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 -0.0464
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 -0.196
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 -0.563
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 -0.672
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 -0.240
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 -0.240
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 -0.371
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 -0.0912
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 -0.163
# ... with 53,930 more rows

Converting residuals to the correct scale
However, these residuals are on the natural-logarithm scale, rather than on the scale of our original features. We can therefore use mutate() to convert the residual column back to the scale of the original features. Using pipes in the same manner as before, we type:
> diamonds2 <- diamonds %>% add_residuals(mod) %>% mutate(resid = exp(resid))
The residuals column (resid) has now been replaced with one on the correct scale, and we have stored this augmented dataset in a new data frame called diamonds2.

Examining the new data frame
> diamonds2
# A tibble: 53,940 x 11
   carat cut       color clarity depth table price     x     y     z resid
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 0.820
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 0.955
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 0.822
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 0.569
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 0.511
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 0.787
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 0.787
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 0.690
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 0.913
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 0.850
# ... with 53,930 more rows

Residuals as a function of diamond weight (carat)
> ggplot(data = diamonds2) + geom_point(mapping = aes(x = carat, y = resid))
Residuals correspond to variation that is unaccounted for by our model, and this plot suggests that there is quite a bit of unexplained variation in price for lighter diamonds.

Residuals as a function of diamond quality (cut)
> ggplot(data = diamonds2) + geom_boxplot(mapping = aes(x = cut, y = resid), notch = TRUE)
Because the residuals correspond to variation in price that carat does not explain, we can now see the expected pattern: relative to their size, higher-quality diamonds are indeed more expensive!
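To put numbers behind this boxplot (a sketch using dplyr, not shown in the lecture), we can summarize the median rescaled residual within each cut category; here resid is the ratio of the actual price to the price predicted from carat alone:
> diamonds2 %>% group_by(cut) %>% summarize(median_resid = median(resid))   # median price ratio (actual / predicted from carat) for each quality level
If the pattern in the boxplot holds, these medians should increase with diamond quality.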
Lecture summary
Exploratory data analysis is the act of visualizing, transforming, and modeling data in order to explore it in a systematic way. Though it has no fixed rules and is a creative process, it requires care to generate an initial set of quality questions about your data. How you visualize data depends on the type(s) of features involved (categorical or continuous). The main goal of data visualization is to extract patterns, which can reveal relationships (covariation) between features. We briefly introduced linear regression as a way of modeling covariation between features, and will next discuss its use in making predictions.