Tuesday, February 6, 2018

Basic Transformations and Explorations in the Tidyverse

Teenage Mutant Ninja Data

This week, I've worked through the rest of the Explore section of R for Data Science by Garrett Grolemund and Hadley Wickham, starting with learning various ways to transform data. With all the talk of pipes and mutating, I had the TMNT theme song stuck in my head on several occasions, and I want to make a point of doing some Ninja Turtle analysis during my graduate career in order to pay sufficient homage. For now, though, I'll be sticking with the data so nicely provided for me in the tidyverse package.

As I went through this section, my R skills expanded significantly with the introduction of the script editor (so clean!), assignments, and piping. Here are some examples of the types of transformations I was able to create using the nycflights13::flights data.

This dataframe shows the number of destinations that each carrier (airline) flies to and ranks the carriers by that count.


library(tidyverse)
library(nycflights13)

carrier_rank <- flights %>%
  group_by(carrier) %>%
  summarize(num_dest = n_distinct(dest)) %>%
  mutate(car_rank = min_rank(desc(num_dest))) %>%
  arrange(car_rank)

This dataframe ranks each carrier based on the percentage of its flights that arrive on time (within 15 minutes of the scheduled arrival).


carrier_ontime <- flights %>%
  group_by(carrier) %>%
  summarize(ontime = mean(arr_delay <= 15, na.rm = TRUE)) %>%
  mutate(ontime_rank = min_rank(desc(ontime)),
         percent_ontime = 100 * ontime) %>%
  arrange(ontime_rank) %>%
  select(carrier, ontime_rank, percent_ontime)

This dataframe shows the total number of delayed flights out of each airport over the year, as well as the percentage of flights out of each airport that were delayed.


airport_delay <- flights %>%
  filter(!is.na(arr_delay)) %>%
  mutate(delayed = arr_delay > 15) %>%
  group_by(origin) %>%
  summarize(num_delay = sum(delayed),
            prop_delay = num_delay / n()) %>%
  mutate(percent_delay = prop_delay * 100) %>%
  select(origin, num_delay, percent_delay)

And this dataframe shows the average, minimum, and maximum speeds of each carrier.


carrier_speed <- flights %>%
  filter(!is.na(air_time)) %>%
  mutate(speed = distance / (air_time / 60)) %>%
  group_by(carrier) %>%
  summarise(avg_speed = mean(speed), max_speed = max(speed),
            min_speed = min(speed)) %>%
  arrange(desc(avg_speed))

I've been using !is.na() and na.rm=TRUE to remove missing data, assuming these to be cancelled flights. This last dataframe, however, has some potentially questionable values (such as the US Airways flight with a speed of 76.8 mph, or the Delta flight with a speed of 703.4 mph), so I decided to take a closer look at these data points to see whether they could be explained by other variables.



speed_outliers <- flights %>%
  mutate(speed = distance / (air_time / 60)) %>%
  filter(speed < 100 | speed > 600) %>%
  select(speed, contains("dep"), contains("arr"),
         carrier, tailnum, dest) %>%
  arrange(desc(speed))

As might be expected, all of the fastest flights gained time while in the air, either decreasing or negating the flight's departure delay. And with the exception of the very fastest flight, all of these flights are on the carrier ExpressJet (EV), which, according to its Wikipedia page, flies smaller jets that can go faster than a bigger airliner (hence the carrier's name). For that very fastest flight, everything seems to be in order until you also look at the reported air_time, which was 65 minutes. Compare that to the 134 minutes that elapsed between the dep_time and the arr_time, and it's not surprising that the speed is way off--probably almost twice as fast as the plane actually traveled.
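To make that comparison concrete, here's a rough sketch of the check I have in mind (my own code, not from the book): convert the HHMM-encoded dep_time and arr_time to minutes after midnight and compare the gate-to-gate elapsed time with the reported air_time. It ignores time-zone differences and overnight arrivals, so it's only an approximation.


speed_check <- flights %>%
  mutate(speed = distance / (air_time / 60),
         # convert HHMM integers to minutes after midnight; overnight
         # arrivals and time-zone shifts are ignored in this rough check
         elapsed = ((arr_time %/% 100) * 60 + arr_time %% 100) -
                   ((dep_time %/% 100) * 60 + dep_time %% 100)) %>%
  filter(speed > 600) %>%
  select(carrier, dep_time, arr_time, elapsed, air_time, speed) %>%
  arrange(desc(speed))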

Looking at the slowest flights, it's also not unexpected that all of these flights lost time between departure and arrival, increasing delays by up to 100 minutes. This could be explained by faulty data for air_time, if a flight was recorded as airborne while it was still waiting on the tarmac. It could also be explained by flights that actually got delayed in the air, having to redirect or circle before landing. Either of these issues would increase air_time without affecting the distance, and thus would decrease the calculated speed.
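Here's a quick sketch of how one might verify that (again my own code, assuming arr_delay minus dep_delay is a fair proxy for time lost after leaving the gate):


slow_flights <- flights %>%
  mutate(speed = distance / (air_time / 60),
         # positive values mean the flight lost time after departure
         time_lost = arr_delay - dep_delay) %>%
  filter(speed < 100) %>%
  select(carrier, origin, dest, dep_delay, arr_delay,
         time_lost, air_time, distance) %>%
  arrange(desc(time_lost))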

It is interesting to note that most of the slowest flights happen to be going to Philadelphia, but a quick look at the data for all flights to PHL (sketched below) confirms that there isn't anything strange about those particular entries that might affect the calculated speed, such as a shorter recorded distance.
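By a quick look, I mean something like this (my own sketch; the summary columns are just guesses at what might matter):


phl_summary <- flights %>%
  filter(dest == "PHL") %>%
  # check whether the route's distance or typical air_time looks unusual
  summarise(n_flights = n(),
            min_distance = min(distance),
            max_distance = max(distance),
            median_air_time = median(air_time, na.rm = TRUE))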


Dora the R Explorer

Apparently I've got TV on the brain this week--maybe someday these headings will be titles of blog posts about textual analyses I've done on these shows. Again, for now it just comes to mind because the next section of the textbook is about Data Exploration. I love that this is the accepted term for this stage of analysis, by the way--it makes me feel like I'm on an adventure, doing important work on the wild frontier on the monarch's dollar...or something. Anyway, this section focuses on combining the transformation and visualization tools to really learn things about your dataset. Here are some plots that I created with the book's guidance.

This plot shows an unimpressive relationship between the summer months and increased average delays.



flights <- flights %>%
  mutate(date = as.Date(paste(year, month, day, sep = "-"), "%Y-%m-%d"))

flights %>%
  group_by(date) %>%
  filter(!is.na(dep_delay)) %>%
  summarise(avg_delay = mean(dep_delay)) %>%
  ggplot(mapping = aes(x = date, y = avg_delay)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Average Departure Delay by Date", x = "Date",
       y = "Average Delay (minutes)")

This plot shows a much more impressive relationship between time of day and delays, with average delays peaking around 7 pm.


flights %>%
  # integer-divide the HHMM-encoded time by 100 to extract the hour
  mutate(sched_dep_hour = sched_dep_time %/% 100) %>%
  filter(!is.na(dep_delay)) %>%
  group_by(sched_dep_hour) %>%
  summarise(avg_delay = mean(dep_delay)) %>%
  ggplot(mapping = aes(x = sched_dep_hour, y = avg_delay)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(title = "Average Departure Delay by Time of Day",
       x = "Scheduled Departure Hour",
       y = "Average Delay (minutes)")

The following plots are all meant to demonstrate the relationship between diamond price, cut quality, and size (in carats). I thought the book did a really good job of leading you to discover the relationship, while still allowing a lot of room to play with different ways to organize and visualize the relevant data.



diamonds %>%
  ggplot(mapping = aes(x = price, y = ..density.., color = cut)) +
  geom_freqpoly() +
  labs(title = "Price vs Cut Quality", x = "Price",
       y = "Density", color = "Cut Quality")



diamonds %>%
  ggplot(mapping = aes(x = cut, y = price)) +
  geom_boxplot() +
  labs(title = "Cut Quality vs Price", x = "Cut Quality", y = "Price")


diamonds %>%
  ggplot(mapping = aes(x = cut, y = carat)) +
  geom_boxplot() +
  labs(title = "Cut Quality vs. Diamond Size",
       x = "Cut Quality", y = "Carat")


diamonds %>%
  ggplot(mapping = aes(x = carat)) +
  geom_freqpoly(mapping = aes(color = cut_number(price, 10))) +
  labs(title = "Diamond Size by Price", x = "Carats",
       y = "Count", color = "Price")


diamonds %>%
  ggplot(mapping = aes(x = price, y = carat)) +
  geom_boxplot(mapping = aes(group = cut_width(price, 1000))) +
  labs(title = "Diamond Price by Size", x = "Price", y = "Carats")



diamonds %>%
  ggplot(mapping = aes(x = price, y = carat)) +
  geom_point(alpha = 1/3) +
  geom_smooth(mapping = aes(color = cut), se = FALSE) +
  labs(title = "Price vs Diamond Size by Cut Quality",
       x = "Price", y = "Carats", color = "Cut Quality")

The conclusion we were supposed to reach (and which is hopefully clear from the plots above) is that the apparently incongruous number of highly priced, low-quality ("fair") diamonds can be explained by the size ("carat") of the diamond. Since the largest diamonds tend to be of lower quality, the highest-quality diamonds tend to be smaller, and a diamond's size dictates its price more directly than its cut quality, we are left with a situation in which some of the most expensive diamonds are of the lowest quality.
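To put a couple of numbers behind that (my own sketch, not one of the book's exercises), grouping by cut and comparing medians should show the lower-quality cuts running larger, which is what drags their prices up:


diamonds %>%
  group_by(cut) %>%
  # larger median carat for lower-quality cuts helps explain their
  # higher median prices
  summarise(median_carat = median(carat),
            median_price = median(price))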

I'm really enjoying working through this book, but I feel like I might be getting a little sidetracked on my journey toward TidyText masterdom. Maybe this week I'll start looking through the first chapters of Text Mining with R to see if I feel ready to move on yet. We'll see!
