Friday, February 23, 2018

Working with Strings, and I'm not even tangled!

This week I continued to make my way through the Wrangle section of R For Data Science, focusing mainly on the chapter on Strings. It shouldn't be surprising to me, since the focus of this independent study and potentially my future career is working with textual data, but working with character strings has been SO FUN!

Like the rest of this course so far, working through this chapter has been a nice confirmation that I've chosen a good path and actually do enjoy doing the work that I've imagined myself doing for years now. I think anyone who is going/has gone through a career change can probably attest to how much relief comes with that feeling of confirmation. Both my brother- and sister-in-law are also in the midst of major career changes, so the subject has been a frequent topic of discussion in our family. Basically, I think the consensus is that changing careers is really hard, stressful, and scary, so any feelings of relief are quite welcome.

Regular Expressions: You Mean, AMAZING EXPRESSIONS?!?

Working within the really wonderful framework of the stringr package, all of the otherwise complicated business of manipulating character strings became pretty straightforward. That really just left the puzzle of regular expressions to learn and figure out. Luckily, I feel like my entire childhood (and, let's face it, adulthood) of working through grocery store puzzle books from cover to cover has really just been preparing me for the day when I'd get to play with regex and call it work.

When I reached the end of the first section on regex, and one of the exercises was to go to https://regexcrossword.com/ and complete the beginner puzzles, I felt like I had cheated the system. Was I really allowed to do puzzles and still pretend like I was studying? But since it was written right there in the textbook, I decided I was allowed to spend as much time doing it as I wanted…so there went the rest of the day.
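
In case anyone wants to follow along at home: everything below assumes the tidyverse (which attaches stringr) is loaded, and that words and sentences are the little practice vectors that come with stringr. A minimal setup would look something like this:

library(tidyverse)  # attaches stringr, dplyr, tibble, ggplot2, and friends

head(stringr::words)      # a vector of common English words
head(stringr::sentences)  # the "Harvard sentences"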

The book starts us out doing pretty simple operations, like finding all the words in a list that start with “y” or end with “x”:

str_subset(words, "(^y)|(x$)")
##  [1] "box"       "sex"       "six"       "tax"       "year"     
##  [6] "yes"       "yesterday" "yet"       "you"       "young"

Working up to something slightly more complicated, like finding all words in a list with three or more vowels in a row (turns out there aren't any with more than three…):

str_subset(words, "[aeiou]{3,}")
## [1] "beauty"   "obvious"  "previous" "quiet"    "serious"  "various"

Or finding all the words in a list with a repeating letter pair:

str_subset(words, "(..).*\\1")
##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"

Then things really start to pick up, like creating this tibble that counts how many vowels and consonants are in each word, and then uses those counts to calculate each word's vowel ratio (the vtoc_ratio column is vowels divided by total letters):

tibble(word=words, 
                  vowels=str_count(word, "[aeiou]"),
                  consonants=str_count(word, "[^aeiou]"),
                  vtoc_ratio=vowels/(consonants+vowels))%>%
  head(10)
## # A tibble: 10 x 4
##    word     vowels consonants vtoc_ratio
##    <chr>     <int>      <int>      <dbl>
##  1 a             1          0      1.00 
##  2 able          2          2      0.500
##  3 about         3          2      0.600
##  4 absolute      4          4      0.500
##  5 accept        2          4      0.333
##  6 account       3          4      0.429
##  7 achieve       4          3      0.571
##  8 across        2          4      0.333
##  9 act           1          2      0.333
## 10 active        3          3      0.500
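
As a little extension of my own (not one of the book's exercises), piping that same tibble into arrange() surfaces the most vowel-heavy words:

tibble(word = words,
       vowels = str_count(word, "[aeiou]"),
       consonants = str_count(word, "[^aeiou]"),
       vtoc_ratio = vowels / (consonants + vowels)) %>%
  arrange(desc(vtoc_ratio)) %>%
  head(10)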

Or pulling all sentences with colors in them:

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(" ", colors, " ", collapse = "|")
str_subset(sentences, color_match)%>%
  head(10)
##  [1] "Glue the sheet to the dark blue background."   
##  [2] "Two blue fish swam in the tank."               
##  [3] "A wisp of cloud hung in the blue air."         
##  [4] "Leaves turn brown and yellow in the fall."     
##  [5] "The spot on the blotter was made by green ink."
##  [6] "The sofa cushion is red and of light weight."  
##  [7] "A blue crane is a tall wading bird."           
##  [8] "It is hard to erase blue or red ink."          
##  [9] "The lamp shone with a steady green flame."     
## [10] "The box is held by a bright red snapper."

At one point the book asks us to find all words that end in “ing”:

ing <- str_subset(sentences, regex(" [a-z]*ing ", ignore_case=TRUE))
str_extract_all(ing, regex(" [a-z]*ing ", ignore_case=TRUE), simplify=TRUE)%>%
  head(10)
##       [,1]        
##  [1,] " winding " 
##  [2,] " king "    
##  [3,] " making "  
##  [4,] " raging "  
##  [5,] " playing " 
##  [6,] " sleeping "
##  [7,] " glaring " 
##  [8,] " dying "   
##  [9,] " lodging " 
## [10,] " filing "

Well, when I saw this list, the first thing I thought was how useless it was, since it includes both verbs and words like “king”, “thing”, and “sing”, which happen to end in “ing” but aren't actually “-ing” verb forms. So I decided to try something a little more specific to isolate verbs ending in “ing”: requiring at least one vowel somewhere before the ending. Of course, any linguist will bring up words like “herring” which would still be selected by this rule, but I think I did pretty well:

gerund_string <- " [a-z]*[aeiouy][a-z]*ing "
gerunds <- str_subset(sentences, regex(gerund_string, ignore_case=TRUE))
str_extract_all(gerunds, gerund_string, simplify=TRUE)%>%
  head(10)
##       [,1]        
##  [1,] " winding " 
##  [2,] " making "  
##  [3,] " raging "  
##  [4,] " playing " 
##  [5,] " sleeping "
##  [6,] " glaring " 
##  [7,] " dying "   
##  [8,] " lodging " 
##  [9,] " filing "  
## [10,] " making "

Similarly, the expression needed to find plurals is more complicated than just finding words that end in “s”; mine requires the letter before the final “s” to be an “e” or a non-vowel (and not another “s” or an apostrophe), which keeps out words like “gas”, “grass”, and possessives:

plural_string <- "[a-z]*(e|[^aeious'])s "
plurals <- str_subset(sentences, 
                      regex(plural_string, ignore_case=TRUE))
str_extract_all(plurals, plural_string, simplify = TRUE)%>%
  head(10)
##       [,1]         [,2]      [,3]
##  [1,] "days "      ""        ""  
##  [2,] "lemons "    "makes "  ""  
##  [3,] "hogs "      ""        ""  
##  [4,] "hours "     ""        ""  
##  [5,] "stockings " ""        ""  
##  [6,] "helps "     ""        ""  
##  [7,] "fires "     ""        ""  
##  [8,] "pants "     ""        ""  
##  [9,] "books "     ""        ""  
## [10,] "keeps "     "chicks " ""

Unfortunately, I couldn't think of a pattern that could weed out third-person verbs ending in “s” like “helps”, but even if I could, it wouldn't be able to account for words like “needs”, which could be either a verb or a plural noun. I guess I'll just have to wait to do that kind of mining once I've learned about POS tagging in the tidytext book. Perhaps I'm showing my nerd a little here, but I'm actually really excited for that day.

My last task in this chapter was to find out which individual words are used the most often within the sentences list. It was a bit of a fight to figure out how to count the words individually, even after splitting them up, but I am quite happy with the solution I came up with:

sent_word <- str_extract(sentences, boundary("word"))
sent_words <- tibble(words=sent_word)
sent_words%>%
  count(words)%>%
  arrange(desc(n))
## # A tibble: 262 x 2
##    words     n
##    <chr> <int>
##  1 The     262
##  2 A        72
##  3 He       24
##  4 We       13
##  5 It       12
##  6 They     10
##  7 She       9
##  8 There     7
##  9 Take      5
## 10 This      5
## # ... with 252 more rows
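
Looking at that table again, I realize str_extract() only pulls the first word out of each sentence, which is why capitalized sentence-starters like “The” and “A” dominate the counts. To tally every word in every sentence, something like this should work (a quick sketch, not what I originally ran):

all_words <- str_extract_all(sentences, boundary("word")) %>%
  unlist() %>%
  str_to_lower()

tibble(word = all_words) %>%
  count(word, sort = TRUE) %>%
  head(10)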

Although I do have a bit more that I want to learn out of R for Data Science before I move on to Text Mining with R, the work that I did this week has me more excited than ever to do so. Look, everyone! I'm really using my undergraduate degree after all! I know things and they are useful! Those are statements that I wasn't sure I would ever utter, and it feels pretty darn good to be able to say them without any sarcasm.

Tuesday, February 13, 2018

Databases: They're All Relative, Really

This week brought a few really useful insights into using the tidyverse. For one, my mentor introduced me to R Markdown, which makes everything so pretty. I'd been jealous of all the blogs I'd been reading that seemed to be formatted in the same way, but I hadn't figured out how everyone was doing it. Having used LaTeX a lot in my undergrad, I'm loving the familiarity of the “knit”.

I also started to delve into the Wrangle section of R for Data Science, which has been very successfully scratching my constant itch for organization. I will probably regret having said this years from now when I never want to tidy a dataset ever again, but I'm kind of looking forward to working with my own messy data so that I can parse, spread, gather, separate, and unite to my heart's content. Fixing up that WHO dataframe about TB cases was just oh-so-satisfying.

Until I get to that point where I'm ready to jump into my own project with my own data, I think the Relational Data section will be the most useful part of what I learned this week. It just so happens that my Intro to Data Analytics course is beginning work in SQL this week as well, so this has all been quite relevant.

The book did a great job leading us into using relational databases with one of the exercises from the Tidy Data section. We start out with two messy data frames from the tidyverse:

table4a #This table shows the number of cases of TB in each country per year
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b #This table shows the population of each country per year
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

These tables were used to demonstrate the need for gather, like so:

tidy4a <- table4a%>%
  gather(`1999`, `2000`, key="Year", value="Cases")
tidy4a
## # A tibble: 6 x 3
##   country     Year   Cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766
tidy4b <- table4b%>%
  gather(`1999`, `2000`, key="Year", value="Population")
tidy4b
## # A tibble: 6 x 3
##   country     Year  Population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Brazil      1999   172006362
## 3 China       1999  1272915272
## 4 Afghanistan 2000    20595360
## 5 Brazil      2000   174504898
## 6 China       2000  1280428583
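
Just to convince myself that the reshaping is reversible, spread() takes tidy4a straight back to the original wide layout (a quick aside of my own):

tidy4a %>%
  spread(key = Year, value = Cases)
# back to one row per country, with `1999` and `2000` columns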

But then the reader is left with these two dataframes which so clearly want to be joined together, but without the tools to do so…for a minute. Then we learn about joins and all is well again:

table4 <- tidy4a%>%
  left_join(tidy4b, by=c("country", "Year"))
table4
## # A tibble: 6 x 4
##   country     Year   Cases Population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Brazil      1999   37737  172006362
## 3 China       1999  212258 1272915272
## 4 Afghanistan 2000    2666   20595360
## 5 Brazil      2000   80488  174504898
## 6 China       2000  213766 1280428583
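
With cases and population sitting side by side, computing a TB rate per country-year is just one more mutate() away (my own quick extension, not part of the exercise):

table4 %>%
  mutate(rate = Cases / Population * 10000)  # cases per 10,000 people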

Flights Don't Exist in Isolation

The rest of my work from this week focused on the nycflights13 dataset, which has just been begging to be used with joins since I first saw it weeks ago. Our first task is to use the left_join function with the flights and airports dataframes to create a map of the average delays by destination airport.

delay <- flights%>%
  group_by(dest)%>%
  filter(!is.na(arr_delay))%>%
  summarise(avg_delay=mean(arr_delay))

delay %>%
  left_join(airports, c("dest" = "faa"))%>% 
  filter(lon>-140)%>% #to zoom in on the continental US
  ggplot(aes(x=lon, y=lat)) +
  borders("state") +
  geom_point(aes(size=avg_delay)) +
  coord_quickmap()

[Plot: average arrival delay by destination airport, mapped over the continental US]

Unfortunately this map isn't all that informative, since most of the average delays are between 0 and 20 minutes, and the airports are pretty smushed along the East Coast. But it's still cool that we can do this.

We're then asked to determine whether there is a link between average delay and the age of the plane. So once again we use left_join to see the relationship between arr_delay in flights and year in planes:

plane_delay <- flights%>%
  group_by(tailnum)%>%
  filter(!is.na(arr_delay))%>%
  summarise(avg_delay=mean(arr_delay))%>%
  left_join(planes, "tailnum")%>%
  select(tailnum, avg_delay, year)%>%
  filter(!is.na(year))

plane_delay%>%
  ggplot(aes(avg_delay, year))+
  geom_point()

[Plot: plane manufacture year vs. average arrival delay]

This plot shows pretty definitively that there is not a link between the age of the plane and its average delay. So, since we've determined that the airports and planes tables don't give us any really good information about what causes delays, of course we next investigate the weather table. First we create a new table with some of the weather data connected to each arr_delay observation:

flight_weather <- flights%>%
  left_join(weather, c("origin", "year", "month", "day", "hour"))%>%
  filter(!is.na(arr_delay), !is.na(temp), !is.na(humid), !is.na(wind_speed), !is.na(precip), !is.na(wind_gust))%>%     #there's quite a lot of missing data from the weather table as well
  select("arr_delay", "temp", "humid", "wind_speed", "precip", "wind_gust")
flight_weather
## # A tibble: 326,089 x 6
##    arr_delay  temp humid wind_speed precip wind_gust
##        <dbl> <dbl> <dbl>      <dbl>  <dbl>     <dbl>
##  1    -25.0   39.9  57.3       13.8      0      15.9
##  2     19.0   39.0  59.4       10.4      0      11.9
##  3    -14.0   39.9  57.3       13.8      0      15.9
##  4    - 8.00  39.0  59.4       12.7      0      14.6
##  5      8.00  39.9  57.3       13.8      0      15.9
##  6    - 2.00  39.0  59.4       12.7      0      14.6
##  7    - 3.00  39.0  59.4       12.7      0      14.6
##  8      7.00  39.0  59.4       12.7      0      14.6
##  9    -14.0   39.0  59.4       10.4      0      11.9
## 10     31.0   39.9  57.3       13.8      0      15.9
## # ... with 326,079 more rows
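
Before making a pile of plots, a rough numeric check of the correlations (my own aside, not from the book) hints at which variables might matter:

flight_weather %>%
  summarise(temp_cor = cor(arr_delay, temp),
            humid_cor = cor(arr_delay, humid),
            wind_speed_cor = cor(arr_delay, wind_speed),
            precip_cor = cor(arr_delay, precip),
            wind_gust_cor = cor(arr_delay, wind_gust))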

Then we make a whole bunch of plots to look for some sort of relationship between the delays and the weather. I thought that precipitation would be most likely to affect the delays:

flight_weather%>%
  group_by(precip)%>%
  summarise(avg_delay=mean(arr_delay))%>% 
  ggplot(aes(precip, avg_delay))+
  geom_point()

[Plot: average arrival delay vs. precipitation]

Since that didn't look like a really obvious correlation, I decided to just go through each measurement to look for something meaningful:

flight_weather%>%
  group_by(temp)%>%
  summarise(avg_delay=mean(arr_delay))%>%
  ggplot(aes(temp, avg_delay))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess'

[Plot: average arrival delay vs. temperature, with loess smooth]

flight_weather%>%
  group_by(humid)%>%
  summarise(avg_delay=mean(arr_delay))%>%
  arrange(humid)%>%
  ggplot(aes(humid, avg_delay))+
  geom_point(alpha=1/3)+
  geom_smooth()
## `geom_smooth()` using method = 'gam'

[Plot: average arrival delay vs. humidity, with smooth]

flight_weather%>%
  filter(wind_speed<100)%>% #The data has one faulty datapoint at over 1000mph wind speed
  group_by(wind_speed)%>%
  summarise(avg_delay=mean(arr_delay))%>%
  ggplot(aes(wind_speed, avg_delay))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess'

[Plot: average arrival delay vs. wind speed, with loess smooth]

flight_weather%>%
  filter(wind_gust<100)%>%
  group_by(wind_gust)%>%
  summarise(avg_delay=mean(arr_delay))%>%
  arrange(wind_gust)%>%
  ggplot(aes(wind_gust, avg_delay))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess'

[Plot: average arrival delay vs. wind gust, with loess smooth]

While the temperature and humidity show some relationship to the delay (who would have guessed that average delays are longer at higher temperatures?), both of the wind metrics show a really clear positive correlation with average delay. After seeing the plots, it seemed really obvious to me that wind would affect flights, but in the absence of common sense, it's neat to be able to use these tools to figure it out.
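
To put a rough number on the wind relationship (another aside of my own, nothing from the book), a simple linear model would do:

flight_weather %>%
  filter(wind_gust < 100) %>%  # drop the extreme values, as in the plots above
  lm(arr_delay ~ wind_gust, data = .) %>%
  summary()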

The Puzzle of June 13, 2013

The final task that I completed this week was to investigate the flight situation from June 13, 2013, about which the book is very cryptic. So the first thing that I did was to create a new table with delay and weather information from that day and look at the delay pattern:

june13_weather <- flights%>%
  filter(!is.na(arr_delay))%>%
  left_join(weather, by=c("origin", "year", "month", "day", "hour"))%>%
  filter(year==2013, month==06, day==13)%>%
  select("origin", "hour", "arr_delay", "temp", "wind_speed", "precip")

june13_weather%>%
  group_by(origin, hour)%>%
  summarise(delay=mean(arr_delay))%>%
  ggplot(aes(hour, delay, color=origin))+
  geom_point()

[Plot: average arrival delay by hour on June 13, 2013, colored by origin airport]

From this plot, we see that all three NYC airports have an increase in delays right around the same time. Since we now know (or perhaps always have known…) that wind is the biggest predictor of delay, I wanted to see the wind pattern from that day. I threw in the precipitation metric as well because I had a hunch that there was a major weather event going on, which would probably involve some precipitation.

june13_weather%>%
  ggplot(aes(hour, wind_speed, size=precip, color=origin))+
  geom_point()

[Plot: wind speed by hour on June 13, 2013, colored by origin airport, with point size showing precipitation]

Aha! An event! We see that there was a major increase in winds accompanied by a sudden accumulation of precipitation. A little googling confirms that the East Coast experienced a “derecho” that day, which, according to Google, is “a line of intense, widespread, and fast-moving windstorms and sometimes thunderstorms that moves across a great distance and is characterized by damaging winds”. Who knew? Apparently I'm learning about weather phenomena as well as the tidyverse in this independent study.
