Databases: They're All Relative, Really
This week brought a few really useful insights into using the tidyverse. For one, my mentor introduced me to R Markdown, which makes everything so pretty. I'd been jealous of all of the blogs I'd been reading that seemed to be formatted in the same way, but I hadn't figured out how everyone was doing it. Having used LaTeX a lot in my undergrad, I'm loving the familiarity of the “knit”.
I also started to delve into the Wrangle section of R for Data Science, which has been very successfully scratching my constant itch for organization. I will probably regret having said this years from now when I never want to tidy a dataset ever again, but I'm kind of looking forward to working with my own messy data so that I can parse, spread, gather, separate, and unite to my heart's content. Fixing up that WHO dataframe about TB cases was just oh-so-satisfying.
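As a quick sketch of what separate and unite look like in practice, here they are applied to table3, another of the practice tables that ships with tidyr (so this is just the book's toy data, not my own analysis):

```r
library(tidyr)
library(dplyr)

# table3 crams cases and population into one "rate" column, e.g. "745/19987071".
# separate() splits it into two columns; convert = TRUE parses them as numbers.
tidy3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
tidy3

# unite() is the inverse: paste the two columns back into one
tidy3 %>%
  unite(rate, cases, population, sep = "/")
```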
Until I get to that point where I'm ready to jump into my own project with my own data, I think the Relational Data section will be the most useful part of what I learned this week. It just so happens that my Intro to Data Analytics course is beginning work in SQL this week as well, so this has all been quite relevant.
The book did a great job leading us into using relational databases with one of the exercises from the Tidy Data section. We start out with two messy data frames from the tidyverse:
table4a # This table shows the number of cases of TB in each country per year
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b # This table shows the population of each country per year
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
These tables were used to demonstrate the need to use gather like so:
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "Year", value = "Cases")
tidy4a
## # A tibble: 6 x 3
## country Year Cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "Year", value = "Population")
tidy4b
## # A tibble: 6 x 3
## country Year Population
## <chr> <chr> <int>
## 1 Afghanistan 1999 19987071
## 2 Brazil 1999 172006362
## 3 China 1999 1272915272
## 4 Afghanistan 2000 20595360
## 5 Brazil 2000 174504898
## 6 China 2000 1280428583
But then the reader is left with these two dataframes which so clearly want to be joined together, but without the tools to do so…for a minute. Then we learn about joins and all is well again:
table4 <- tidy4a %>%
  left_join(tidy4b, by = c("country", "Year"))
table4
## # A tibble: 6 x 4
## country Year Cases Population
## <chr> <chr> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Brazil 1999 37737 172006362
## 3 China 1999 212258 1272915272
## 4 Afghanistan 2000 2666 20595360
## 5 Brazil 2000 80488 174504898
## 6 China 2000 213766 1280428583
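The payoff of having cases and population side by side is that calculations across the two become one-liners. Here's a sketch that rebuilds the whole pipeline from tidyr's built-in tables and then computes the TB rate (the per-10,000 scaling follows the book's example):

```r
library(dplyr)
library(tidyr)

# gather both tables, join them, then compute cases per 10,000 people
rates <- table4a %>%
  gather(`1999`, `2000`, key = "Year", value = "Cases") %>%
  left_join(
    table4b %>% gather(`1999`, `2000`, key = "Year", value = "Population"),
    by = c("country", "Year")
  ) %>%
  mutate(rate = Cases / Population * 10000)
rates
```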
Flights Don't Exist in Isolation
The rest of my work from this week focused on the nycflights13 dataset, which has just been begging to be used with joins since I first saw it weeks ago. Our first task is to use the left_join function with the flights and airports dataframes to create a map of the average delays by destination airport.
delay <- flights %>%
  group_by(dest) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay))
delay %>%
  left_join(airports, by = c("dest" = "faa")) %>%
  filter(lon > -140) %>% # to zoom in on the continental US
  ggplot(aes(x = lon, y = lat)) +
  borders("state") +
  geom_point(aes(size = avg_delay)) +
  coord_quickmap()
Unfortunately this map isn't all that informative, since most of the average delays are between 0 and 20 minutes, and the airports are pretty smushed along the East Coast. But it's still cool that we can do this.
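One wrinkle I noticed: some destination codes in flights have no match in airports, so those rows pick up NA coordinates in the left_join and silently fall off the map. A sketch of how anti_join() can track them down (assuming nycflights13 is loaded):

```r
library(dplyr)
library(nycflights13)

# destinations in flights with no matching "faa" code in airports --
# these are the rows that get NA lat/lon in the left_join
unmatched <- flights %>%
  distinct(dest) %>%
  anti_join(airports, by = c("dest" = "faa"))
unmatched
```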
We're then asked to determine whether there is a link between average delay and the age of the plane. So once again we use left_join to see the relationship between arr_delay in flights and year in planes:
plane_delay <- flights %>%
  group_by(tailnum) %>%
  filter(!is.na(arr_delay)) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  left_join(planes, by = "tailnum") %>%
  select(tailnum, avg_delay, year) %>%
  filter(!is.na(year))
plane_delay %>%
  ggplot(aes(avg_delay, year)) +
  geom_point()
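A variation I played with (my own tweak, not from the book): plotting against the plane's age in years rather than its build year, which reads a bit more directly as "do older planes run later?". Renaming planes' year to built in the select avoids it colliding with flights' own year column during the join:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# average delay binned by plane age (flight year minus build year)
plane_age <- flights %>%
  filter(!is.na(arr_delay)) %>%
  left_join(planes %>% select(tailnum, built = year), by = "tailnum") %>%
  filter(!is.na(built)) %>%
  mutate(age = year - built) %>%
  group_by(age) %>%
  summarise(avg_delay = mean(arr_delay))

plane_age %>%
  ggplot(aes(age, avg_delay)) +
  geom_point()
```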
This plot shows pretty definitively that there is not a link between the age of the plane and its average delay. So, since we've determined that the airports and planes tables don't give us any really good information about what causes delays, of course we next investigate the weather table. First we create a new table with some of the weather data connected to each arr_delay observation:
flight_weather <- flights %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
  filter(!is.na(arr_delay), !is.na(temp), !is.na(humid), !is.na(wind_speed), !is.na(precip), !is.na(wind_gust)) %>% # there's quite a lot of missing data in the weather table as well
  select(arr_delay, temp, humid, wind_speed, precip, wind_gust)
flight_weather
## # A tibble: 326,089 x 6
## arr_delay temp humid wind_speed precip wind_gust
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -25.0 39.9 57.3 13.8 0 15.9
## 2 19.0 39.0 59.4 10.4 0 11.9
## 3 -14.0 39.9 57.3 13.8 0 15.9
## 4 - 8.00 39.0 59.4 12.7 0 14.6
## 5 8.00 39.9 57.3 13.8 0 15.9
## 6 - 2.00 39.0 59.4 12.7 0 14.6
## 7 - 3.00 39.0 59.4 12.7 0 14.6
## 8 7.00 39.0 59.4 12.7 0 14.6
## 9 -14.0 39.0 59.4 10.4 0 11.9
## 10 31.0 39.9 57.3 13.8 0 15.9
## # ... with 326,079 more rows
Then we make a whole bunch of plots to look for some sort of relationship between the delays and the weather. I thought that precipitation would be most likely to affect the delays:
flight_weather %>%
  group_by(precip) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot(aes(precip, avg_delay)) +
  geom_point()
Since that didn't look like a really obvious correlation, I decided to just go through each measurement to look for something meaningful:
flight_weather %>%
  group_by(temp) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot(aes(temp, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
flight_weather %>%
  group_by(humid) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(humid) %>%
  ggplot(aes(humid, avg_delay)) +
  geom_point(alpha = 1/3) +
  geom_smooth()
## `geom_smooth()` using method = 'gam'
flight_weather %>%
  filter(wind_speed < 100) %>% # the data has one faulty point with a wind speed over 1000 mph
  group_by(wind_speed) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  ggplot(aes(wind_speed, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
flight_weather %>%
  filter(wind_gust < 100) %>%
  group_by(wind_gust) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(wind_gust) %>%
  ggplot(aes(wind_gust, avg_delay)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
While the temperature and humidity show some relationship to the delay (who would have guessed that average delays are longer at higher temperatures?), both of the wind metrics show a really clear positive correlation with average delay. After seeing the plots, it seemed really obvious to me that wind would affect flights, but in the absence of common sense, it's neat to be able to use these tools to figure it out.
The Puzzle of June 13, 2013
The final task that I completed this week was to investigate the flight situation from June 13, 2013, about which the book is very cryptic. So the first thing that I did was to create a new table with delay and weather information from that day and look at the delay pattern:
june13_weather <- flights %>%
  filter(!is.na(arr_delay)) %>%
  left_join(weather, by = c("origin", "year", "month", "day", "hour")) %>%
  filter(year == 2013, month == 6, day == 13) %>%
  select(origin, hour, arr_delay, temp, wind_speed, precip)
june13_weather %>%
  group_by(origin, hour) %>%
  summarise(delay = mean(arr_delay)) %>%
  ggplot(aes(hour, delay, color = origin)) +
  geom_point()
From this plot, we see that all three NYC airports have an increase in delays right around the same time. Since we now know (or perhaps always have known…) that wind is the biggest predictor of delay, I wanted to see the wind pattern from that day. I also threw in the precipitation metric as well because I had a hunch that there was a major weather event going on, which probably would involve some precipitation as well.
june13_weather %>%
  ggplot(aes(hour, wind_speed, size = precip, color = origin)) +
  geom_point()
Aha! An event! We see that there was a major increase in winds accompanied by a sudden accumulation of precipitation. A little googling confirms that the East Coast experienced a “derecho” that day, which, according to Google, is “a line of intense, widespread, and fast-moving windstorms and sometimes thunderstorms that moves across a great distance and is characterized by damaging winds”. Who knew? Apparently I'm learning about weather phenomena as well as the tidyverse in this independent study.