Friday, March 9, 2018

Grimms' Fairy Tales: A TidyText Analysis






TidyText: I Have Arrived!











TidyText: I Have Arrived!


It's so exciting to be creating my very first ever text analysis this week. With the foundation of the work I've been doing in R for Data Science, working through Julia Silge and David Robinson's Text Mining in R has been really straightforward so far. I think it really speaks to the brilliance behind the package that performing normally tedious natural language process tasks like tokenizing is made so deceptively simple.


It's also been really neat to see the explanations of a lot of the techniques used in the tidytext blogs I've been reading. It's so encouraging to be reaching the point where I feel like I'm really understanding the code and the analyses that I'm reading. And once again, I feel like the authors really deserve recognition for the incredible accessibility of using this package. After just one week of working with the package, I'm proud of the analysis that I was able to put together. So, thank you to Julia Silge and David Robinson for making this all possible!


Grimms' Fairy Tales


After seeing how the authors used the gutenbergr package to pull novels from the Gutenberg Project archives, I went browsing for some titles that I would be interested in exploring. One of the titles at the top of the 100 Most Dowloaded lists was Grimms' Fairy Tales, which immediately caught my eye. Although I'm actually not familiar with a lot of the stories (more than I expected), there are quite a few standards that I thought almost anyone would recognize. I also liked the idea of being able to do a comparative analysis between some of the stories, rather than looking at individual chapters or entire books as was done in the book.


So I found the book's ID number and pulled it into R:


gutenberg_metadata%>%
  filter(title == "Grimms' Fairy Tales")
grimm <- gutenberg_download(2591)

Then I needed to clean it up a little, removing the table of contents and labelling each line of text with the story title. Unlike the examples from the book, the story titles were not preceded by a nice marker like “Chapter” or a number, but they were consistently written in all-caps. So I came up with a regex that would find all lines in all-caps. This was complicated a little by the inclusion of various punctuation in some of the titles, but since it was a pretty limited set, I just went ahead and hard-coded those specific inclusions in my regex.


I could have followed the examples in the books and used cumsum to label the stories by number, but I really wanted the full name of the story as the label instead–I didn't want to have to keep referring to the table of contents and counting to know which story I was looking at. So I used a combination of str_detect from the stringr package and na.locf from the zoo package to get everything labelled how I wanted it:


grimm_titles <- grimm[-c(1:93), , drop = FALSE]%>%
  mutate(story_title = ifelse(str_detect(text, regex("^[12\\,\\.\\[\\]A-Z \\'\\-]+$")), text, NA),
         story = na.locf(story_title))

Then I selected four of the stories that I was most familiar with, and used unnest_tokens to make them tidy for the rest of my analysis.


grimm_stories <- grimm_titles%>%
  filter(story %in% c("THE FROG-PRINCE", "RAPUNZEL", "HANSEL AND GRETEL",
                      "RUMPELSTILTSKIN"))%>%
  mutate(linenumber = row_number())%>%
  select(text, linenumber, story)

tidy_stories <- grimm_stories%>%
  unnest_tokens(word, text)

Word Frequency and Inverse Document Frequency


It seems like it's pretty much unanimous that the first step of a text analysis is to look at word frequency, which makes sense, since this is a fairly simple but informative analysis to run. So I removed stop words from the dataset and then looked at the top words from each story.


I do feel like I need to incude a disclaimer that I don't think I fully agree with the stop_words list included in the package yet. Admittedly I haven't looked through the options for splitting the list up, but I feel like it includes some words that could be important in the context of some documents. From my quick glimpse, I saw a lot of adverbs (like currently, entirely, hardly), and even some adjectives and verbs (like alone, consider, indicate) that might should be left alone.


I understand why these words were included in the list, since they don't necessarily contribute to the content of the document. But I think they might should be left in the document when doing something like a sentiment analysis, since they could contribute to the overall tone of a document. Anyway, you'll see that I don't remove stop words when I do my sentiment analysis because of this. Probably pointless, since the sentiment analysis is performed on a specific set of words as well (which I would assume doesn't include these “stop words”), but I still left them in just on principle…


Anyway, back to the analysis:


tidy_stories%>%
  anti_join(stop_words)%>%
  group_by(story)%>%
  count(word)%>%
  arrange(story, desc(n))%>%
  top_n(8)%>%
  ungroup()%>%
  mutate(word = reorder(word, n))%>%
  ggplot(aes(word, n, fill=story))+
  geom_col(show.legend = FALSE)+
  xlab(NULL)+
  coord_flip()+
  facet_wrap(~story, ncol=2, scales="free" )

plot of chunk wordfreq


These plots show us that the most used terms do a pretty good job of highlighting the most important parts of each story, with the top terms consisting mostly of character names and major plot points. Normally we see proper names here, but with the exclusion of Hansel, Gretel, and Rapunzel, most of the characters in these stories are unnamed. It's also interesting to note that Rumpelstiltskin is specifically not on this list, which makes sense in the context of the story since guessing his name is an important part of the story.


Next I wanted to look at the inverse document frequency to see if these top terms changed when compared to the term frequency among all four stories. First I needed to recalculate the word frequency within each story before running the tf_idf function to determine which words were most unique to each story.


story_words <-  tidy_stories%>%
  group_by(story)%>%
  count(story, word, sort=TRUE)%>%
  ungroup()

story_words <- story_words%>%
  bind_tf_idf(word, story, n)

story_words%>%
  arrange(desc(tf_idf))%>%
  group_by(story)%>%
  top_n(8)%>%
  ungroup()%>%
  mutate(word=reorder(word, tf_idf))%>%
  ggplot(aes(word, tf_idf, fill = story))+
  geom_col(show.legend = FALSE)+
  labs(x=NULL, y= "tf_idf")+
  facet_wrap(~story, ncol=2, scales = "free")+
  coord_flip()

plot of chunk storytfidf


It's actually interesting to see how little the top terms changed when controlling for overall document language. This shows that the language used in each story is generally fairly specific to that story. I didn't remove the stop words from this analysis since I figured the most common ones (the, and, is, etc.) would be cancelled out by the document frequency calculation. While this was true for the most part, I think it's really interesting to see that Hansel and Gretel had “we”, “they”, and “us” pop up on the list of top terms. While this may not tell us a lot about the content of the story, it does show that this is the only of the four stories in which the main characters interact as a team rather than as individuals.


Sentiment Analysis


Next I wanted to run some sentiment analyses to see how the sentiment in each story compared over the course of the story and how they compared to each other. Since the tidytext package provides us with several sentiment lexicons, I decided to play with them each a little, starting with the “Bing” set. This lexicon has each word rated in a binary fashion as either “negative” or “positive”, so I needed to manipulate the results a little in order to display the change in sentiment across each document. Luckily, the authors of the book lay this process out pretty clearly:


story_sentiment_bing <- tidy_stories%>%
  inner_join(get_sentiments("bing"))%>%
  count(story, index=linenumber%/%5, sentiment)%>%
  spread(sentiment, n, fill = 0)%>%
  mutate(sentiment = positive - negative)

story_sentiment_bing%>%
  ggplot(aes(index, sentiment, fill=story))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~story, ncol = 2, scales = "free_x")

plot of chunk bingsentiment


Since these stories are pretty short, I needed to make the binwidths (index) pretty small in order to see any progression across the story. The tradeoff of this method was that some of the bins didn't have any words in the sentiment lexicon at all, but I think the pattern of the plot is still evident.


Now let's see what it looks like when we use the “afinn” lexicon instead. Since this lexicon ranks each word's sentiment with a numeric score (negative sentiment being ranked with a negative number, and positive sentiment being ranked with a positive number), creating a plot with this lexicon is a little more straightforward.


story_sentiment_afinn <- tidy_stories%>%
  inner_join(get_sentiments("afinn"))%>%
  group_by(story, index=linenumber%/%5)%>%
  summarise(sentiment = sum(score))

story_sentiment_afinn%>%
  ggplot(aes(index, sentiment, fill=story))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~story, ncol = 2, scales = "free_x")

plot of chunk afinnsentiment


When comparing these two analyses, we can see that the overall negative/positive shape of the story is pretty similar. According to the authors of Text Mining in R it's a known effect that the “afinn” analysis will show larger absolute values than the “bing” analysis will, so that's not surprising. Also consistent with the description given in the book, the “bing” analysis has larger chunks of similar sentiment than the “afinn” analysis does.


The shapes of the analysis also map well onto the plot of each story. Hansel and Gretel appears to be the most negative, with large negative chunks in the areas of the story where the children are first abandoned by their parents and then captured by the witch in the woods. Rapunzel starts out more positive but then becomes negative once the prince dispairs at first being unable to reach her and then at being blinded by thorns. Rumpelstiltskin and The Frog Prince are both overwhelmingly positive, with negative bursts where the miller's daughter (turned queen) and princess are distraught.


The third sentiment lexicon (“nrc”) provided in the tidytext package is different in that it ranks words on several different sentiments, rather than only on a binary variable of positive or negative. Thus, I wanted to see how each story ranked overall among these different sentiments:


nrc_sentiment <- tidy_stories%>%
  inner_join(get_sentiments("nrc"))%>%
  group_by(story)%>%
  count(sentiment)%>%
  ungroup()%>%
  mutate(sentiment = reorder(sentiment, n))

nrc_sentiment%>%
  ggplot(aes(sentiment, n,  fill = story))+
  geom_col(show.legend = FALSE)+
  facet_wrap(~story, ncol = 2, scales = "free")+
  coord_flip()

plot of chunk nrcsentiment


I was surprised that three of the four stories were much more positive than they were negative. Going into this project, I was expecting overwhelmingly negative sentiment, since Grimms' has a reputation of being fairly dark (especially considering that it's written for children). However, I wonder if the stories that have become more popular in modern society are some of the more positive ones–a question for another analysis perhaps!


Because of the way that the ranking is done, there is necessarily going to be some connection between the positive/negative scores and the other categories. It should also be noted that the raw n value is not that meaninful when used in comparing between the different stories, since they do not have an equal total wordcount. However, I do find it interesting that the two more negative stories (Hansel and Gretel and Rapunzel) have a larger proportion of “anticipation” than the other two stories. I don't know exactly how the sentiment lexicon was put together, but it makes sense that the more negative storylines would have more of a sense of suspense or anticipation as the reader waits for a resolution.


Next Steps


It was really exciting to not only begin working in tidytext this week, but to get far enough that I could produce a meaningful (if basic) text analysis. As I am working, the linguist in me is constantly questioning and thinking of ways to improve the analyses, but I'm trying to hold judgment until I work through more of the book and maybe even look further into the natural language processing documentation in CRAN. I'm particularly excited to work with n-grams and look into stemming and part-of-speech tagging, since I find the use of individuals words somewhat problematic. I'm also excited to look into topic modelling, since it's an entirely new concept for me.


Overall, very pleased with the work that I'm doing and the progress that I'm making. It feels really good :-)





Friday, March 2, 2018

Fun with FUNctions!






Fun with FUNctions!

Fun with FUNctions!

And also dates happened.

This week I finished up the sections in R for Data Science that I'd wanted to cover before moving on to Text Mining in R which is a really exciting place to be. We've also started working in R in my Data Analytics course, so once again things are matching up quite well between my different studies. Although I'm definitely ahead of everyone else in terms of using R thanks to this independent study, a lot of what we're doing in class is in Base R, which I really haven't gotten into very much yet. Sometimes my reaction to Base R functions is “Cool, that's so straightforward!”, but usually it's more like “Well, this is good to know, but I will never use it if I can use X instead…” So, thanks tidyverse contributors for making things easy and pretty!

Although I worked through a lot of chapters this week, I definitely spent the most time working on various functions and date formats. I did start out with some basic work with factors, practicing cleaning up data in situations with complicated or inconsistent levels:

#Lump the rincome levels to 5 labeled levels and one "other"
gss_cat%>%
  mutate(rincome=fct_lump(rincome, n=5))%>%
  count(rincome)
## # A tibble: 6 x 2
##   rincome            n
##   <fct>          <int>
## 1 $25000 or more  7363
## 2 $20000 - 24999  1283
## 3 $15000 - 19999  1048
## 4 $10000 - 14999  1168
## 5 Not applicable  7043
## 6 Other           3578
#Collapse the partyid levels into clearer groups and plot the proportion of each party over time
gss_cat%>%
  mutate(partyid=fct_collapse(partyid,
            "other"=c("No answer","Don't know","Other party"),
            "rep"=c("Strong republican","Not str republican"),
            "dem"=c("Strong democrat","Not str democrat"),
            "ind"=c("Ind,near rep","Independent","Ind,near dem")))%>%
  group_by(year)%>%
  count(partyid)%>%
  mutate(prop=(n/sum(n)))%>%
  ggplot(aes(x=year, y=prop, color=partyid))+
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess'

plot of chunk levelcollapse

Then I started with some basic work on dates:

# List the first day of each month in the current year
months <- c(1:12)
first_days_thisyear <- ymd(str_c(year(today()), months, 1, sep="-"))
first_days_thisyear
##  [1] "2018-01-01" "2018-02-01" "2018-03-01" "2018-04-01" "2018-05-01"
##  [6] "2018-06-01" "2018-07-01" "2018-08-01" "2018-09-01" "2018-10-01"
## [11] "2018-11-01" "2018-12-01"
#Calculate my current age in years
my_age <- (ymd(19890702)%--%today())/(years(1))
my_age
## [1] 28.66575

And then I spent a really long time trying to figure out how to work with the datetime format within the nycflights13 dataset. This was also my first taste of creating a function, with the goal of converting the strangely formatted date and time fields in this dataset into an accepted datetime format.

make_datetime_100 <- function(year, month, day, time){
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights%>%
  filter(!is.na(dep_time), !is.na(arr_time))%>%
  mutate(date=make_date(year, month, day),
         dep_time=make_datetime_100(year, month, day, dep_time),
         arr_time=make_datetime_100(year, month, day, arr_time),
         sched_dep_time=make_datetime_100(year, month, day, sched_dep_time),
         sched_arr_time=make_datetime_100(year, month, day, sched_arr_time))%>%
  select(origin, date, dest, ends_with("delay"), ends_with("time"))

This all went fine (and was pretty directly guided by the textbook), but then when I tried to work with this new df, several complications arose. The task that made everything difficult was trying to compare the reported delay with a manually calculated difference between the scheduled time and the actual time of each flight. I arbitrarily chose to work with the departure times instead of arrival times, but theoretically the same issues would have occurred either way.

So first I had to figure out how to subtract two datetime vectors from each other, and express the difference in minutes. After much struggle, I got difftime to work, but in order to do so I had to convert each datetime into POSIXct first.

flights_dt <- flights_dt%>%
   mutate(calc_delay=difftime(as.POSIXct(flights_dt$dep_time, format="%FT%R"), 
                             as.POSIXct(flights_dt$sched_dep_time, format="%FT%R"), 
                             units = "mins"))%>%
  select(dep_delay, calc_delay, sched_dep_time, dep_time, arr_time, sched_arr_time)
head(flights_dt)
## # A tibble: 6 x 6
##   dep_delay calc_delay sched_dep_time      dep_time           
##       <dbl> <time>     <dttm>              <dttm>             
## 1      2.00 2          2013-01-01 05:15:00 2013-01-01 05:17:00
## 2      4.00 4          2013-01-01 05:29:00 2013-01-01 05:33:00
## 3      2.00 2          2013-01-01 05:40:00 2013-01-01 05:42:00
## 4     -1.00 -1         2013-01-01 05:45:00 2013-01-01 05:44:00
## 5     -6.00 -6         2013-01-01 06:00:00 2013-01-01 05:54:00
## 6     -4.00 -4         2013-01-01 05:58:00 2013-01-01 05:54:00
## # ... with 2 more variables: arr_time <dttm>, sched_arr_time <dttm>

Looking at these first few observations, it seems like the recorded delay and the calculated delay are already equal to each other. However, when we run the correlation, we can see that this is not the case. This is confirmed and explained when we filter for observations in which the reported and calculated delays differ:

cor(flights_dt$dep_delay, as.numeric(flights_dt$calc_delay))
## [1] 0.2261434
flights_dt%>%
  filter(dep_delay!=as.numeric(calc_delay))%>%
  select(dep_delay, calc_delay, sched_dep_time, dep_time)
## # A tibble: 1,205 x 4
##    dep_delay calc_delay sched_dep_time      dep_time           
##        <dbl> <time>     <dttm>              <dttm>             
##  1     853   -587       2013-01-01 18:35:00 2013-01-01 08:48:00
##  2      43.0 -1397      2013-01-02 23:59:00 2013-01-02 00:42:00
##  3     156   -1284      2013-01-02 22:50:00 2013-01-02 01:26:00
##  4      33.0 -1407      2013-01-03 23:59:00 2013-01-03 00:32:00
##  5     185   -1255      2013-01-03 21:45:00 2013-01-03 00:50:00
##  6     156   -1284      2013-01-03 23:59:00 2013-01-03 02:35:00
##  7      26.0 -1414      2013-01-04 23:59:00 2013-01-04 00:25:00
##  8     141   -1299      2013-01-04 22:45:00 2013-01-04 01:06:00
##  9      15.0 -1425      2013-01-05 23:59:00 2013-01-05 00:14:00
## 10     127   -1313      2013-01-05 22:30:00 2013-01-05 00:37:00
## # ... with 1,195 more rows

Here we can see that all of the mismatched values are on flights that were delayed overnight. Since the date of each flight is based on the scheduled departure date, that date was transferred onto the actual departure time, even if the actual departure occurred after midnight that night.

The textbook addresses this issue on the arrival time, since this error also causes all overnight flights to arrive before they actually depart. They fix this issue with the following code:

flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )

However, this method doesn't apply directly as written for the issue I'm seeing with the delays. Since the root of the problem is that the reported departure time is almost a full day before the scheduled departure time, I changed the dates on the reported departure time like so:

flights_dt <- flights_dt %>% 
  mutate(
    overnight = (sched_dep_time - dep_time >hours(1)),
    dep_time = dep_time + days(overnight * 1),
    calc_delay=difftime(as.POSIXct(dep_time, format="%FT%R"), 
                        as.POSIXct(sched_dep_time, format="%FT%R"), 
                        units = "mins"))

I figured it was pretty unlikely that any plane left more than 1 hour before the scheduled time, but theoretically this could still be an issue. Then, of course, I had to rerun the calc_delay variable to use the updated datetimes. And happily, the correlation shows us what we want to see!

cor(flights_dt$dep_delay, as.numeric(flights_dt$calc_delay))
## [1] 1

After all of this struggle, I can see why the authors spent a whole chapter on dates, and why my professor thought it was a good idea for me to work through it. Who knew that dates were so complicated?

Now for the FUNction part

I have to admit that this chapter was the hardest one for me yet. I think part of this was because I haven't used a lot of the built-in functions that are referenced here (even simple ones like range or length), so I had to consider everything in pieces. But also, the format of the functions are often not very straightforward. It still bothers me that there isn't any connector between the arguments of the function in parantheses and the actual function in curly brackets like this: function(x,y){here's the function}. It doesn't help that this seems to upset rstudio as well, so there has been a lot of running and re-running to try to get the whole function to run. All in all, this chapter was rather frustrating, but in some ways that made it more satisfying when things went right.

These first functions were pretty instinctual:

#Creates a function that returns a greeting depending on the current time of day
say_hi <- function(x){
  if(4<hour(x)& hour(x)<10){
    "Good Morning"
  } else if (10<hour(x)&& hour(x)<17){
      "Good Afternoon"
  } else {
      "Good Evening"
    }}
say_hi(now())
## [1] "Good Evening"
#Creates a function that returns "fizz", "buzz", or "fizzbuzz" depending on the divisibility of the input
fizzbuzz <- function(x){
  if((x%%3)==0 & (x%%5)==0){
  "fizzbuzz"
  }else if((x%%3)==0){
  "fizz"
  } else if ((x%%5)==0){
    "buzz"
  }}
fizzbuzz(3)
## [1] "fizz"
fizzbuzz(5)
## [1] "buzz"
fizzbuzz(15)
## [1] "fizzbuzz"

One issue that I ran into repeatedly was the inability to use a vector with the if function. I know that ifelse is vectorised, but nesting ifelse statements just feels so much clunkier than using if. As far as I could tell, there's basically no workaround for this.

The next puzzle that stuck me was creating a function that would return the number of matching NA from two vectors. I really wanted to use count since it's so slick, but apparently this only works on 2-dimensional objects, so inputting two vectors was not an option. I came up with the following workaround:

na1 <- c(1,2,4,NA,5,NA,5,NA)
na2 <- c(3,4,NA,NA,NA,5,3,NA)
both_na <- function(x,y){
  tibble(x,y)%>%
    count(is.na(x), is.na(y))}
both_na(na1, na2)
## # A tibble: 4 x 3
##   `is.na(x)` `is.na(y)`     n
##   <lgl>      <lgl>      <int>
## 1 F          F              3
## 2 F          T              2
## 3 T          F              1
## 4 T          T              2

This worked OK, but it gave us information about all of the intersections, not just matching NAs. I struggled with a better solution for a long time, mainly due to one mistake that I unknowingly repeated over and over. Eventually, I learned that using && instead of & in a logical statement between vectors limits the comparison to the first values of the vector. Once I figured that out, I was able to use ifelse to create my own vectorized count:

NA_count <- function(x,y){
  sum(ifelse((is.na(x)&is.na(y)), 1, 0))
}

NA_count(na1, na2)
## [1] 2

Next I worked on creating functions that would select specific subsets of vectors:

#Returns the last value of any vector
x <- c(1:10, NA, 11:13)
y <- c(-4:25, NA, NA, 15)
last_value <- function(x){
  x[[length(x)]]
}
last_value(x)
## [1] 13
#Returns only even values
even_number <- function(x){
  x[!is.na(x) & x %% 2 == 0] 
}
even_number(y)
##  [1] -4 -2  0  2  4  6  8 10 12 14 16 18 20 22 24
#Returns all values in even-numbered positions
even_position <- function(x){
  x[c(seq(2, length(x), 2))]
}
even_position(y)
##  [1] -3 -1  1  3  5  7  9 11 13 15 17 19 21 23 25 NA
even_position(x)
## [1]  2  4  6  8 10 11 13

Even though this week was frustrating at times, I feel like I'm at a good place to be transitioning into the tidytext package. And boy am I excited to get started with that.

My dad sent me a short and sweet tidytext analysis that he did on one of my brother's novels a few years ago when he was playing with the package. It was SO cool to be looking through his code and mostly understanding what he was doing (if not the specific functions he was calling). In my family, I think my siblings and I always just assumed that the science-y computer-y things that my dad did were too complicated for us to really understand. It's a weird feeling to realize that he and I are going to have this in common as I move further into this career path, but it's also really nice. An added bonus was the fact that this analysis was on my brother's words as well. Tidytext family fun! Can't wait for more :-)

English Syntax Trees and Question Creation with Flex and Bison

In the first (official) semester of my PhD program this spring, I was able to take a Computer Science class called NLP Methods in which we m...