Tuesday, February 13, 2018

Databases: They're All Relative, Really







This week brought a few really useful insights into using the tidyverse. For one, my mentor introduced me to R Markdown, which makes everything so pretty. I'd been jealous of all the blogs I'd been reading that seemed to be formatted in the same way, but I hadn't figured out how everyone was doing it. Having used LaTeX a lot in my undergrad, I'm loving the familiarity of the “knit”.

I also started to delve into the Wrangle section of R for Data Science, which has been very successfully scratching my constant itch for organization. I will probably regret having said this years from now when I never want to tidy a dataset ever again, but I'm kind of looking forward to working with my own messy data so that I can parse, spread, gather, separate, and unite to my heart's content. Fixing up that WHO data frame about TB cases was just oh-so-satisfying.

Until I get to that point where I'm ready to jump into my own project with my own data, I think the Relational Data section will be the most useful part of what I learned this week. It just so happens that my Intro to Data Analytics course is beginning work in SQL this week as well, so this has all been quite relevant.
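As an aside connecting the two (my own sketch, not from the book, and assuming the dbplyr and RSQLite packages are installed): dplyr's join verbs translate directly into SQL, and dbplyr can show you the translation.

```r
library(dplyr)
library(dbplyr) # translates dplyr verbs into SQL; memdb_frame() needs RSQLite

# Copy two tiny tables into an in-memory SQLite database
cases_db <- memdb_frame(country = c("Afghanistan", "Brazil"),
                        cases   = c(745, 37737))
pop_db   <- memdb_frame(country = c("Afghanistan", "Brazil"),
                        pop     = c(19987071, 172006362))

# The same left_join() syntax now runs against the database...
joined <- left_join(cases_db, pop_db, by = "country")

# ...and show_query() prints the LEFT JOIN statement it generates
show_query(joined)
```

Seeing the generated SQL makes the dplyr-to-SQL correspondence concrete without ever leaving R.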

The book did a great job leading us into using relational databases with one of the exercises from the Tidy Data section. We start out with two messy data frames from the tidyverse:

table4a # This table shows the number of cases of TB in each country per year
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b # This table shows the population of each country per year
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

These tables were used to demonstrate the need for gather, like so:

tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "Year", value = "Cases")
tidy4a
## # A tibble: 6 x 3
##   country     Year   Cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "Year", value = "Population")
tidy4b
## # A tibble: 6 x 3
##   country     Year  Population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583

But then the reader is left with these two data frames, which so clearly want to be joined together, without the tools to do so…for a minute. Then we learn about joins and all is well again:

table4 <- tidy4a %>%
  left_join(tidy4b, by = c("country", "Year"))
table4
## # A tibble: 6 x 4
##   country     Year   Cases Population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583
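With cases and population finally in one table, the rate calculation the book does back in the Tidy Data chapter becomes a one-liner; here's a minimal, self-contained sketch using the same numbers:

```r
library(dplyr)
library(tibble)

# A small slice of the joined table from above
table4 <- tribble(
  ~country,      ~Year,  ~Cases, ~Population,
  "Afghanistan", "1999",    745,    19987071,
  "Afghanistan", "2000",   2666,    20595360
)

# TB cases per 10,000 people, per country-year
table4 %>%
  mutate(rate = Cases / Population * 10000)
```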

Flights Don't Exist in Isolation

The rest of my work from this week focused on the nycflights13 dataset, which has just been begging to be used with joins since I first saw it weeks ago. Our first task is to use the left_join function with the flights and airports data frames to create a map of the average delays by destination airport.

delay <- flights %>%
  group_by(dest) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay))

delay %>%
  left_join(airports, c("dest" = "faa")) %>%
  filter(lon > -140) %>% # to zoom in on the continental US
  ggplot(aes(x = lon, y = lat)) +
  borders("state") +
  geom_point(aes(size = avg_delay)) +
  coord_quickmap()

[Plot: map of average arrival delays by destination airport over the continental US]

Unfortunately this map isn't all that informative, since most of the average delays are between 0 and 20 minutes, and the airports are pretty smushed along the East Coast. But it's still cool that we can do this.
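One tweak I might try (my own experiment, assuming ggplot2's borders() can find the maps package): mapping the delay to color instead of point size, which keeps the overlapping East Coast airports a little more readable.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

delay <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay))

# Color rather than size: overplotted airports stay distinguishable
delay %>%
  left_join(airports, c("dest" = "faa")) %>%
  filter(lon > -140) %>%
  ggplot(aes(lon, lat, color = avg_delay)) +
  borders("state") +
  geom_point() +
  coord_quickmap()
```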

We're then asked to determine whether there is a link between average delay and the age of the plane. So once again we use left_join, this time relating arr_delay in flights to year (the year of manufacture) in planes:

plane_delay <- flights %>%
  group_by(tailnum) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  left_join(planes, "tailnum") %>%
  select(tailnum, avg_delay, year) %>%
  filter(!is.na(year))

plane_delay %>%
  ggplot(aes(avg_delay, year)) +
  geom_point()

[Plot: average delay versus year of manufacture for each tail number]

This plot suggests pretty strongly that there is no link between the age of the plane and its average delay. So, since we've determined that the airports and planes tables don't give us any really good information about what causes delays, of course we next investigate the weather table. First we create a new table with some of the weather data attached to each arr_delay observation:

flight_weather <- flights %>%
  left_join(weather, c("origin", "year", "month", "day", "hour")) %>%
  # there's quite a lot of missing data in the weather table as well
  filter(!is.na(arr_delay), !is.na(temp), !is.na(humid),
         !is.na(wind_speed), !is.na(precip), !is.na(wind_gust)) %>%
  select(arr_delay, temp, humid, wind_speed, precip, wind_gust)
flight_weather
## # A tibble: 326,089 x 6
##    arr_delay  temp humid wind_speed precip wind_gust
##        <dbl> <dbl> <dbl>      <dbl>  <dbl>     <dbl>
##  1    -25.0   39.9  57.3       13.8      0      15.9
##  2     19.0   39.0  59.4       10.4      0      11.9
##  3    -14.0   39.9  57.3       13.8      0      15.9
##  4    - 8.00  39.0  59.4       12.7      0      14.6
##  5      8.00  39.9  57.3       13.8      0      15.9
##  6    - 2.00  39.0  59.4       12.7      0      14.6
##  7    - 3.00  39.0  59.4       12.7      0      14.6
##  8      7.00  39.0  59.4       12.7      0      14.6
##  9    -14.0   39.0  59.4       10.4      0      11.9
## 10     31.0   39.9  57.3       13.8      0      15.9
## # ... with 326,079 more rows

Then we make a whole bunch of plots to look for some sort of relationship between the delays and the weather. I thought that precipitation would be most likely to affect the delays:

flight_weather %>%
  group_by(precip) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot(aes(precip, avg_delay)) +
  geom_point()

[Plot: average delay versus precipitation]

Since that didn't look like a really obvious correlation, I decided to just go through each measurement to look for something meaningful:

flight_weather %>%
  group_by(temp) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot(aes(temp, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

[Plot: average delay versus temperature]

flight_weather %>%
  group_by(humid) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(humid) %>%
  ggplot(aes(humid, avg_delay)) +
  geom_point(alpha = 1/3) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'

[Plot: average delay versus humidity]

flight_weather %>%
  filter(wind_speed < 100) %>% # the data has one faulty point at over 1000 mph wind speed
  group_by(wind_speed) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot(aes(wind_speed, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

[Plot: average delay versus wind speed]

flight_weather %>%
  filter(wind_gust < 100) %>%
  group_by(wind_gust) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(wind_gust) %>%
  ggplot(aes(wind_gust, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

[Plot: average delay versus wind gust]

While the temperature and humidity show some relationship to the delay (who would have guessed that average delays are longer at higher temperatures?), both of the wind metrics show a really clear positive correlation with average delay. After seeing the plots, it seemed really obvious to me that wind would affect flights, but in the absence of common sense, it's neat to be able to use these tools to figure it out.
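To put rough numbers on those visual impressions (my own addition, not one of the book's exercises), the correlation between arr_delay and each weather variable can be computed directly:

```r
library(dplyr)
library(nycflights13)

flight_weather <- flights %>%
  left_join(weather, c("origin", "year", "month", "day", "hour")) %>%
  filter(!is.na(arr_delay), !is.na(temp), !is.na(humid),
         !is.na(wind_speed), !is.na(precip), !is.na(wind_gust)) %>%
  select(arr_delay, temp, humid, wind_speed, precip, wind_gust)

# Pearson correlation of arrival delay with each weather measurement;
# the wind metrics should come out with the strongest positive values
sapply(flight_weather[-1], function(x) cor(flight_weather$arr_delay, x))
```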

The Puzzle of June 13, 2013

The final task that I completed this week was to investigate the flight situation from June 13, 2013, about which the book is very cryptic. So the first thing that I did was to create a new table with delay and weather information from that day and look at the delay pattern:

june13_weather <- flights %>%
  filter(!is.na(arr_delay)) %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
  filter(year == 2013, month == 6, day == 13) %>%
  select(origin, hour, arr_delay, temp, wind_speed, precip)

june13_weather %>%
  group_by(origin, hour) %>%
  summarise(delay = mean(arr_delay)) %>%
  ggplot(aes(hour, delay, color = origin)) +
  geom_point()

[Plot: average delay by hour for each NYC origin airport on June 13, 2013]

From this plot, we see that all three NYC airports have an increase in delays right around the same time. Since we now know (or perhaps always have known…) that wind is the biggest predictor of delay, I wanted to see the wind pattern from that day. I also threw in the precipitation metric because I had a hunch that there was a major weather event going on, which probably would involve some precipitation as well.

june13_weather %>%
  ggplot(aes(hour, wind_speed, size = precip, color = origin)) +
  geom_point()

[Plot: wind speed by hour on June 13, 2013, with point size showing precipitation]

Aha! An event! We see that there was a major increase in winds accompanied by a sudden accumulation of precipitation. A little googling confirms that the East Coast experienced a “derecho” that day, which Google defines as “a line of intense, widespread, and fast-moving windstorms and sometimes thunderstorms that moves across a great distance and is characterized by damaging winds”. Who knew? Apparently I'm learning about weather phenomena as well as the tidyverse in this independent study.
