TidyText: I Have Arrived!
It's so exciting to be creating my very first text analysis this week. With the foundation of the work I've been doing in R for Data Science, working through Julia Silge and David Robinson's Text Mining with R has been really straightforward so far. I think it really speaks to the brilliance behind the package that normally tedious natural language processing tasks like tokenizing are made so deceptively simple.
It's also been really neat to see the explanations of a lot of the techniques used in the tidytext blogs I've been reading. It's so encouraging to be reaching the point where I feel like I'm really understanding the code and the analyses that I'm reading. And once again, I feel like the authors really deserve recognition for how incredibly accessible this package is. After just one week of working with it, I'm proud of the analysis that I was able to put together. So, thank you to Julia Silge and David Robinson for making this all possible!
Grimms' Fairy Tales
After seeing how the authors used the gutenbergr package to pull novels from the Project Gutenberg archives, I went browsing for some titles that I would be interested in exploring. One of the titles at the top of the 100 Most Downloaded lists was Grimms' Fairy Tales, which immediately caught my eye. Although there are more stories in it that I'm not familiar with than I expected, there are quite a few standards that I thought almost anyone would recognize. I also liked the idea of being able to do a comparative analysis between some of the stories, rather than looking at individual chapters or entire books as was done in the book.
So I found the book's ID number and pulled it into R:
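library(dplyr)
library(gutenbergr)
# Look up the book's Project Gutenberg ID, then download the text
gutenberg_metadata %>%
  filter(title == "Grimms' Fairy Tales")
grimm <- gutenberg_download(2591)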
Then I needed to clean it up a little, removing the table of contents and labelling each line of text with the story title. Unlike the examples from the book, the story titles were not preceded by a nice marker like “Chapter” or a number, but they were consistently written in all-caps. So I came up with a regex that would find all lines in all-caps. This was complicated a little by the inclusion of various punctuation in some of the titles, but since it was a pretty limited set, I just went ahead and hard-coded those specific inclusions in my regex.
I could have followed the examples in the book and used cumsum to label the stories by number, but I really wanted the full name of the story as the label instead, so that I wouldn't have to keep referring to the table of contents and counting to know which story I was looking at. So I used a combination of str_detect from the stringr package and na.locf from the zoo package to get everything labelled how I wanted it:
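library(stringr)
library(zoo)
# Drop the front matter (first 93 lines), flag all-caps title lines, carry each title forward
grimm_titles <- grimm[-c(1:93), , drop = FALSE] %>%
  mutate(story_title = ifelse(str_detect(text, regex("^[12\\,\\.\\[\\]A-Z \\'\\-]+$")), text, NA),
         story = na.locf(story_title, na.rm = FALSE))  # na.rm = FALSE keeps the column length intact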
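To make the labelling step a little more concrete, here is a minimal sketch on a made-up five-line excerpt. The toy tibble and the simplified regex are my own illustration (not the real Gutenberg text or the full pattern above), but the idea is the same: flag the all-caps lines, then carry each title forward over the lines that follow it.

```r
library(dplyr)
library(stringr)
library(zoo)

# Hypothetical stand-in for a few lines of the downloaded text
toy <- tibble(text = c("THE FROG-PRINCE",
                       "One fine evening a young princess...",
                       "",
                       "RAPUNZEL",
                       "There were once a man and a woman..."))

toy_labelled <- toy %>%
  mutate(
    # keep the line only if it consists entirely of capitals, spaces,
    # apostrophes or hyphens; everything else becomes NA
    story_title = ifelse(str_detect(text, "^[A-Z '\\-]+$"), text, NA),
    # "last observation carried forward": fill each NA with the most recent title
    story = na.locf(story_title, na.rm = FALSE)
  )

print(toy_labelled)
```

Every row ends up tagged with the title above it, including the blank line, which is why blank lines don't need any special handling.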
Then I selected four of the stories that I was most familiar with, and used unnest_tokens to make them tidy for the rest of my analysis.
library(tidytext)
# Keep four familiar stories; lines are numbered across the selected stories combined
grimm_stories <- grimm_titles %>%
  filter(story %in% c("THE FROG-PRINCE", "RAPUNZEL", "HANSEL AND GRETEL",
                      "RUMPELSTILTSKIN")) %>%
  mutate(linenumber = row_number()) %>%
  select(text, linenumber, story)
# One row per word
tidy_stories <- grimm_stories %>%
  unnest_tokens(word, text)
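For anyone who hasn't seen unnest_tokens in action, here is a tiny sketch of what it does to a single line. The sentence is made up for illustration; the point is that each word gets its own row, and by default the words are lowercased and the punctuation is stripped.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-line "document"
toy_line <- tibble(story = "TOY", text = "The frog said, \"Hello!\"")

toy_tokens <- toy_line %>%
  unnest_tokens(word, text)  # output column is `word`, input column is `text`

print(toy_tokens)
```

The other columns (here, story) are kept alongside each word, which is what makes the later per-story counts so easy.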
Word Frequency and Inverse Document Frequency
It seems like it's pretty much unanimous that the first step of a text analysis is to look at word frequency, which makes sense, since this is a fairly simple but informative analysis to run. So I removed stop words from the dataset and then looked at the top words from each story.
I do feel like I need to include a disclaimer that I don't think I fully agree with the stop_words list included in the package yet. Admittedly I haven't looked through the options for splitting the list up, but I feel like it includes some words that could be important in the context of some documents. From my quick glimpse, I saw a lot of adverbs (like currently, entirely, hardly), and even some adjectives and verbs (like alone, consider, indicate) that should probably be left alone.
I understand why these words were included in the list, since they don't necessarily contribute to the content of the document. But I think they should probably be left in the document when doing something like a sentiment analysis, since they could contribute to the overall tone of a document. Anyway, you'll see that I don't remove stop words when I do my sentiment analysis because of this. Probably pointless, since the sentiment analysis is performed on a specific set of words as well (which I would assume doesn't include these “stop words”), but I still left them in just on principle…
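For what it's worth, splitting the list up looks fairly easy: stop_words has a lexicon column recording which source list each word came from. Here is a rough sketch of how I might restrict it to just one of those sources. The choice of "snowball" is only an example on my part, not a recommendation from the book.

```r
library(dplyr)
library(tidytext)

# How many stop words come from each source lexicon?
lexicon_sizes <- stop_words %>%
  count(lexicon)
print(lexicon_sizes)

# Keep only one source list (here, the snowball list)
snowball_stops <- stop_words %>%
  filter(lexicon == "snowball")

# Check whether some of the words I was worried about are in the smaller list
c("currently", "entirely", "hardly", "consider") %in% snowball_stops$word
```

The smaller list could then be passed to anti_join in place of the full stop_words table.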
Anyway, back to the analysis:
library(ggplot2)
# Top 8 non-stop-words per story
tidy_stories %>%
  anti_join(stop_words, by = "word") %>%
  count(story, word) %>%
  group_by(story) %>%
  top_n(8, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = story)) +
  geom_col(show.legend = FALSE) +
  xlab(NULL) +
  coord_flip() +
  facet_wrap(~story, ncol = 2, scales = "free")
These plots show us that the most used terms do a pretty good job of highlighting the most important parts of each story, with the top terms consisting mostly of character names and major plot points. Normally we would see proper names here, but with the exception of Hansel, Gretel, and Rapunzel, most of the characters in these stories are unnamed. It's also interesting to note that Rumpelstiltskin is specifically not on this list, which makes sense in the context of the story, since guessing his name is an important part of the plot.
Next I wanted to look at the inverse document frequency to see if these top terms changed when compared to the term frequency among all four stories. First I needed to recalculate the word frequency within each story (this time without removing stop words) before running the bind_tf_idf function to determine which words were most unique to each story.
# Count each word within each story, then add tf, idf and tf_idf columns
story_words <- tidy_stories %>%
  count(story, word, sort = TRUE)
story_words <- story_words %>%
  bind_tf_idf(word, story, n)
# Top 8 words per story by tf-idf
story_words %>%
  arrange(desc(tf_idf)) %>%
  group_by(story) %>%
  top_n(8, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = story)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf_idf") +
  facet_wrap(~story, ncol = 2, scales = "free") +
  coord_flip()
It's actually interesting to see how little the top terms changed when controlling for overall document language. This shows that the language used in each story is generally fairly specific to that story. I didn't remove the stop words from this analysis since I figured the most common ones (the, and, is, etc.) would be cancelled out by the document frequency calculation. While this was true for the most part, I think it's really interesting to see that Hansel and Gretel had “we”, “they”, and “us” pop up on the list of top terms. While this may not tell us a lot about the content of the story, it does show that this is the only one of the four stories in which the main characters interact as a team rather than as individuals.
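To convince myself of the "cancelling out" part, here is a small sketch with made-up counts (the stories "A" and "B" and all of the numbers are hypothetical). As I understand it, bind_tf_idf computes tf as the word's share of its document and idf as the natural log of (number of documents / number of documents containing the word), so a word that shows up in every document gets an idf of log(1) = 0.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: "the" appears in both toy stories, "witch" in only one
toy_counts <- tibble(
  story = c("A", "A", "B", "B"),
  word  = c("the", "witch", "the", "frog"),
  n     = c(10, 2, 8, 3)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, story, n)
print(toy_tf_idf)

# The same number by hand for "witch" in story A:
# tf = 2 / (10 + 2), idf = log(2 documents / 1 document containing it)
witch_by_hand <- (2 / 12) * log(2 / 1)
```

Here "the" scores exactly zero in both stories no matter how often it is used, while "witch" and "frog" get positive scores. A word like "we" only survives in the real analysis because it is missing from at least one of the four stories.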
Sentiment Analysis
Next I wanted to run some sentiment analyses to see how the sentiment in each story changed over the course of the story and how the stories compared to each other. Since the tidytext package provides us with several sentiment lexicons, I decided to play with them each a little, starting with the “Bing” set. This lexicon has each word rated in a binary fashion as either “negative” or “positive”, so I needed to manipulate the results a little in order to display the change in sentiment across each document. Luckily, the authors of the book lay this process out pretty clearly:
library(tidyr)
# Net sentiment (positive minus negative) in bins of five lines
story_sentiment_bing <- tidy_stories %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(story, index = linenumber %/% 5, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
story_sentiment_bing %>%
  ggplot(aes(index, sentiment, fill = story)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~story, ncol = 2, scales = "free_x")
Since these stories are pretty short, I needed to make the bin widths (index) pretty small in order to see any progression across the story. The tradeoff of this method was that some of the bins didn't have any words in the sentiment lexicon at all, but I think the pattern of the plot is still evident.
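Here is a small sketch of how the binning works, using a handful of made-up sentiment hits (the line numbers and labels are hypothetical). The %/% operator is integer division, so with a bin width of 5, lines 0 to 4 fall in bin 0, lines 5 to 9 in bin 1, and so on.

```r
library(dplyr)
library(tidyr)

# Hypothetical lexicon matches, each tagged with the line it came from
toy_hits <- tibble(
  linenumber = c(1, 3, 4, 12, 13),
  sentiment  = c("positive", "negative", "negative", "positive", "positive")
)

toy_bins <- toy_hits %>%
  count(index = linenumber %/% 5, sentiment) %>%  # count matches per bin and label
  spread(sentiment, n, fill = 0) %>%              # one column per label, 0 where a label is absent
  mutate(sentiment = positive - negative)         # net sentiment per bin

print(toy_bins)
```

Bin 0 nets out to -1 and bin 2 to +2, while bin 1 (lines 5 to 9) has no matches and simply doesn't appear in the result. That is the empty-bin tradeoff I mentioned above.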
Now let's see what it looks like when we use the “afinn” lexicon instead. Since this lexicon ranks each word's sentiment with a numeric score (negative sentiment being ranked with a negative number, and positive sentiment being ranked with a positive number), creating a plot with this lexicon is a little more straightforward.
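# Sum the AFINN scores in bins of five lines
# (the score column is called `score` here; newer tidytext releases name it `value`)
story_sentiment_afinn <- tidy_stories %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(story, index = linenumber %/% 5) %>%
  summarise(sentiment = sum(score))
story_sentiment_afinn %>%
  ggplot(aes(index, sentiment, fill = story)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~story, ncol = 2, scales = "free_x")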
When comparing these two analyses, we can see that the overall negative/positive shape of each story is pretty similar. According to the authors of Text Mining with R, it's a known effect that the “afinn” analysis will show larger absolute values than the “bing” analysis will, so that's not surprising. Also consistent with the description given in the book, the “bing” analysis has larger chunks of similar sentiment than the “afinn” analysis does.
The shapes of the analyses also map well onto the plot of each story. Hansel and Gretel appears to be the most negative, with large negative chunks in the areas of the story where the children are first abandoned by their parents and then captured by the witch in the woods. Rapunzel starts out more positive but then becomes negative once the prince despairs, first at being unable to reach her and then at being blinded by thorns. Rumpelstiltskin and The Frog Prince are both overwhelmingly positive, with negative bursts where the miller's daughter (turned queen) and the princess are distraught.
The third sentiment lexicon (“nrc”) provided in the tidytext package is different in that it ranks words on several different sentiments, rather than only on a binary variable of positive or negative. Thus, I wanted to see how each story ranked overall among these different sentiments:
# Count the words in each NRC sentiment category, per story
nrc_sentiment <- tidy_stories %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(story, sentiment) %>%
  mutate(sentiment = reorder(sentiment, n))
nrc_sentiment %>%
  ggplot(aes(sentiment, n, fill = story)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~story, ncol = 2, scales = "free") +
  coord_flip()
I was surprised that three of the four stories were much more positive than they were negative. Going into this project, I was expecting overwhelmingly negative sentiment, since Grimms' has a reputation for being fairly dark (especially considering that it's written for children). However, I wonder if the stories that have become more popular in modern society are some of the more positive ones. That's a question for another analysis, perhaps!
Because of the way that the ranking is done, there is necessarily going to be some connection between the positive/negative scores and the other categories. It should also be noted that the raw n value is not that meaningful when comparing between the different stories, since they do not have an equal total word count. However, I do find it interesting that the two more negative stories (Hansel and Gretel and Rapunzel) have a larger proportion of “anticipation” than the other two stories. I don't know exactly how the sentiment lexicon was put together, but it makes sense that the more negative storylines would have more of a sense of suspense or anticipation as the reader waits for a resolution.
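One way to get around the unequal word counts would be to convert the counts to proportions within each story before comparing. This is just a sketch with made-up numbers (the stories "A" and "B" and their counts are hypothetical, not my actual results), but the same two lines could be added to a table of per-story sentiment counts.

```r
library(dplyr)

# Hypothetical sentiment counts for two stories of very different length
toy_nrc <- tibble(
  story     = c("A", "A", "B", "B"),
  sentiment = c("anticipation", "joy", "anticipation", "joy"),
  n         = c(30, 70, 15, 10)
)

toy_props <- toy_nrc %>%
  group_by(story) %>%
  mutate(prop = n / sum(n)) %>%  # share of this story's matched words
  ungroup()

print(toy_props)
```

In this toy data, story A has twice as many "anticipation" words as story B in raw counts (30 vs. 15), but only half the proportion (0.3 vs. 0.6), which is exactly the kind of reversal that raw n can hide.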
Next Steps
It was really exciting to not only begin working in tidytext this week, but to get far enough that I could produce a meaningful (if basic) text analysis. As I am working, the linguist in me is constantly questioning and thinking of ways to improve the analyses, but I'm trying to withhold judgment until I work through more of the book and maybe even look further into the natural language processing documentation on CRAN. I'm particularly excited to work with n-grams and look into stemming and part-of-speech tagging, since I find the use of individual words somewhat problematic. I'm also excited to look into topic modelling, since it's an entirely new concept for me.
Overall, very pleased with the work that I'm doing and the progress that I'm making. It feels really good :-)