Sunday, April 15, 2018

Web Scraping and Text Classification Take Two

One more time with more data!

Last week I made my first venture into the world of web scraping and topic modelling, an endeavor I detail in this blog. I did a lot of manipulation to figure out how to make the LDA model turn out better, using the data-cleaning techniques I learned in the world of social science (basically trying to weed out any noise to make the cleanest-looking plots). It didn't even cross my mind that simply using more data, even “dirty” data, would be the better strategy. Of course I know that a bigger n will always be better for any statistical test, but I've always worked with finite data sets in the past, for which manual cleaning was the best option I had.

Luckily, my advisor was there to point out that I could scrape as many Wikipedia articles as I wanted, and that more data would garner better results. Thanks, John! So this week the puzzle was to create a web-scraping function that could compile a vector of 50 Wikipedia articles in a usable format. Then I decided to try out a supervised classification method (an SVM) in addition to the unsupervised LDA that I tried out last week.

The Great Wiki-Scrape of 2018

My first task was to figure out how to scrape 50 articles without having to do them all manually (like I did last week).

I used Wikipedia's category pages to pick my articles from four disparate categories. I went with “Ballet”, “Astronomy”, “Landforms”, and “Medical Professionals”, and found about a dozen articles from each category. I then created a .csv file with each article's url and associated category, which would be the input for my scraping function. So I imported this file and converted the Url column into a character vector.
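None of the library calls show up in the chunks below, so for reference, here is a minimal setup chunk based on the functions used throughout this post (assuming the packages are already installed):

library(tidyverse)   #dplyr, tidyr, purrr, stringr, ggplot2
library(rvest)       #read_html, html_nodes, html_text
library(tm)          #removeNumbers
library(tidytext)    #unnest_tokens, cast_dtm, tidy methods for LDA output
library(topicmodels) #LDA
library(RTextTools)  #create_matrix, create_container, train_model, classify_model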

wikiurl <- read.csv("~/Desktop/wikiurl.csv", header = TRUE)
wikiurl <- wikiurl%>%
  mutate(Url = as.character(Url))

Now that I had my input ready, I got to puzzling out my scraping function. The biggest challenge was figuring out how to apply the rvest commands to a vector, when they were designed to run on a single item. I found a great, clean solution to this problem using the map functions from the purrr package, but I still had a problem with the last step, when using the html_text function to extract the text out of the html.

The example that I was going off of only read a single html node, so there was only a single item to be extracted. However, I was extracting all “div” nodes, so the web-scraping functions created one item for each “div” chunk. It turned out that this was fine for the rvest functions, but it caused problems for map_chr. The normal output of all of the map functions is a list, but since I wanted the output to end up as a new vector in the dataframe, I used map_chr to force it into a character vector. That function could not handle the multiple items per article created by the scraping, so I first collapsed all of the strings into one per article using the paste command.

I finally had success with the following combination:

wikiscrape <- function(x){
  x%>%
    mutate(text=(
      map(Url, read_html)%>%                              #read each article's html
      map(html_nodes, "div")%>%                           #pull every "div" node
      map_chr(~paste(html_text(.), collapse = " "))%>%    #extract text, collapse to one string per article
      str_replace_all(pattern="\t", replacement = " ")%>% #strip tabs
      str_replace_all(pattern="\n", replacement = " ")%>% #strip newlines
      str_replace_all(pattern=" +", replacement = " ")%>% #squeeze repeated spaces
      removeNumbers()))                                   #drop all numbers (tm)
}

I did still do some tidying within the function, stripping out tabs and newlines (“\t”, “\n”), collapsing unnecessary spaces, and removing all numbers. Then I could use the beautifully simple command below to scrape all 50 articles at once, resulting in a single character string encompassing all of the text in the “div” nodes for each article.

wikitext <- wikiscrape(wikiurl)

LDA Model

Once I had this dataframe put together, it was fairly simple to run the LDA using the same steps as last week, this time indexing by category and article instead of by arbitrary groups within each article.

#number the articles and build a Category_article index
wikitext <- wikitext%>%
  mutate(article = row_number())

LDA_wikitext <- wikitext%>%
  unite(index, Category, article, sep = "_")%>%
  select(index, text)

#one word per row, minus stop words, counted per index
wikiword <- LDA_wikitext%>%
  unnest_tokens(word, text)%>%
  anti_join(stop_words)%>%
  count(index, word, sort=T)
## Joining, by = "word"
#document-term matrix and 4-topic model
wiki_dtm <- wikiword%>%
  cast_dtm(index, word, n)

wiki_lda <- LDA(wiki_dtm, k=4, control=list(seed=3333))

Now to see whether I achieved the expected improvement in results. First we'll look at the top words assigned to each topic:

group_topics <- tidy(wiki_lda, matrix="beta")

top_terms <- group_topics%>%
  group_by(topic)%>%
  top_n(5, beta)%>%
  ungroup()%>%
  arrange(topic, -beta)

top_terms
## # A tibble: 20 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 astronomy    0.0180 
##  2     1 ray          0.0133 
##  3     1 edit         0.00729
##  4     1 telescope    0.00718
##  5     1 radio        0.00690
##  6     2 sea          0.0233 
##  7     2 ocean        0.0104 
##  8     2 gulf         0.00571
##  9     2 isbn         0.00540
## 10     2 water        0.00532
## 11     3 ballet       0.0601 
## 12     3 dance        0.0178 
## 13     3 de           0.0133 
## 14     3 edit         0.00850
## 15     3 dancers      0.00753
## 16     4 nursing      0.0124 
## 17     4 health       0.0115 
## 18     4 therapy      0.0114 
## 19     4 care         0.0108 
## 20     4 occupational 0.00812

Looking pretty good! Each topic has words that mostly fit with a single category, with a few oddballs like “isbn” and “edit” thrown in there. So I continued with the analysis, looking at how the article-topic assignments were distributed relative to the original categories.

group_gamma <- tidy(wiki_lda, matrix="gamma")
group_gamma <- group_gamma%>%
  separate(document, c("category", "article"), sep="_", convert=TRUE)

group_gamma%>%
  mutate(category=reorder(category, gamma*topic))%>%
  ggplot(aes(factor(topic), gamma))+
  geom_boxplot()+
  facet_wrap(~category)

[Plot: boxplots of gamma by topic, faceted by category]

This plot confirms what we saw above. Each of the categories is discretely associated with a single topic. If we make a confusion matrix of the individual word assignments, we can further support this result:

group_classification <- group_gamma%>%
  group_by(category, article)%>%
  top_n(1, gamma)%>%
  ungroup()

wiki_topics <- group_classification%>%
  count(category, topic)%>%
  group_by(category)%>%
  top_n(1, n)%>%
  ungroup()%>%
  transmute(consensus = category, topic)

word_classification <- augment(wiki_lda, data=wiki_dtm)%>%
  separate(document, c("category", "article"), sep="_", convert=T)%>%
  inner_join(wiki_topics, by=c(".topic"="topic"))

word_classification%>%
  count(category, consensus, wt=count)%>%
  group_by(category)%>%
  mutate(percent=n/sum(n))%>%
  ggplot(aes(consensus, category, fill=percent))+
  geom_tile()+
  scale_fill_gradient2(high="red")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank())+
  labs(x = "Category words were assigned to",
       y="Category words came from",
       fill="%of assignments")

[Plot: confusion matrix of word assignments (category assigned to vs. category of origin)]

Once again, we can see that the assigned categories very closely match the expected categories. There was only a small percentage of words from the medical professional and landforms categories that were incorrectly assigned, and otherwise everything was spot on! A definite improvement on last week's analysis, all thanks to a bigger n.
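For a single summary number (just a quick extra check using the objects built above), we can also compute the fraction of articles whose dominant topic matches their category's consensus topic:

#fraction of articles whose top topic matches the category consensus
group_classification%>%
  inner_join(wiki_topics, by = "topic")%>%
  summarise(accuracy = mean(category == consensus))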

Supervised Learning with an SVM

An SVM (support vector machine) is a supervised learning model used in machine learning for classifying items in a dataset. The RTextTools package is a wrapper around the e1071 package that allows the SVM algorithm to work on textual data. Since I set this up as a binary (one-vs-rest) classification for each category, we'll run a model on each category separately and then combine the results.

The first step in doing this is to use the spread function to turn each category into its own indicator column, with a value of 1 if the article came from that category and a value of 0 if not.

svm_wikiword <- wikitext%>%
  mutate(yesno=1)%>%
  distinct%>%
  spread(Category, yesno, fill=0)

Then we'll need to split the dataset into a training set and a testing set. We'll use the training set to create the model and then use the testing set to see how accurate the model is at predicting the categories. To make sure that each set had a random sample from all four categories, I used the sample function to randomize the data before splitting into the two sets.

set.seed(2222)
svm_wikiword <- svm_wikiword[sample(nrow(svm_wikiword)),]
bound <- floor((nrow(svm_wikiword)/2))
svm_train <- svm_wikiword[1:bound,]
svm_test <- svm_wikiword[(bound+1):nrow(svm_wikiword),]%>%
  mutate(row=c(1:25))

The RTextTools package has a create_matrix function that will make a dtm out of a dataframe without having to unnest the tokens, remove stop words, or do a per-document count. I'm not sure exactly how it differs from tidytext's cast_dtm on the back end, but it sure is easy for the user! So I did this for the training set, and then created a “container” for the dtm that included the category assignments for each article. (Apparently you can use the trainSize and testSize arguments to split up the dataset at this point, but I couldn't find anything about whether those arguments randomize the split, so I decided to do it manually instead.) Once you have this container made, you can run the SVM function.
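For reference, the single-container route mentioned in that parenthetical would look roughly like the sketch below; the all_* names are just placeholders, and the rows are still shuffled beforehand (as above), since I couldn't confirm any built-in randomization.

#sketch of the single-container approach (one dtm, split via trainSize/testSize)
all_dtm <- create_matrix(svm_wikiword["text"])
all_container <- create_container(all_dtm, svm_wikiword$Astronomy,
                                  trainSize = 1:25,
                                  testSize = 26:50,
                                  virgin = FALSE)
all_model <- train_model(all_container, "SVM", kernel = "linear", cost = 1)
all_pred <- classify_model(all_container, all_model)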

Since I had to do the same thing for all four categories separately, I'll just show my code for the astronomy category to avoid too much repetition.

svm_train_dtm<-
  create_matrix(svm_train["text"])

svmastro_container <- create_container(svm_train_dtm, svm_train$Astronomy,
                                        trainSize = 1:25,
                                        virgin = FALSE)
astromodel <- train_model(svmastro_container, "SVM", kernel = "linear", cost=1)

The next step is to prepare and run the testing set through the model we just made. First I made a dtm of the testing set using the word list from the training set (since any words not in the training set would not be in the model). Then I created a container for the dtm with empty labels and used the classify_model function to make our predictions.

astropredMatrix <- create_matrix(svm_test, originalMatrix=svm_train_dtm)

predastro_container <- create_container(astropredMatrix, labels = rep(0, 25), 
                                         testSize = 1:25, virgin=FALSE)

astro_pred <- classify_model(predastro_container, astromodel)

I then cleaned up the results and joined them with the testing dataset to be able to compare the predictions with the actual assignments.

astro_pred <- astro_pred%>%
  mutate(row=c(1:25), astro_predict=SVM_LABEL, astro_prob=SVM_PROB)%>%
  select(-SVM_LABEL, -SVM_PROB)

astro_compare <- svm_test%>%
  select(row,article, Astronomy)%>%
  inner_join(astro_pred)%>%
  select(-row)
## Joining, by = "row"
astro_compare
##    article Astronomy astro_predict astro_prob
## 1       40         1             1  0.9573721
## 2        3         0             0  0.9385191
## 3        6         0             0  0.9401042
## 4       25         0             0  0.9285038
## 5       21         0             0  0.9978756
## 6       41         1             0  0.5540011
## 7       35         0             0  0.9395527
## 8       12         0             0  0.9858133
## 9       47         1             0  0.7476170
## 10       5         0             0  0.9554466
## 11      32         0             0  0.8879723
## 12      44         1             1  0.9763707
## 13       4         0             0  0.9441213
## 14      33         0             0  0.9249003
## 15      43         1             1  1.0000000
## 16      50         1             1  1.0000000
## 17      10         0             0  0.9554945
## 18      18         0             0  0.9267376
## 19      37         0             0  0.9753106
## 20      45         1             0  0.5341259
## 21      29         0             0  0.8649816
## 22       9         0             0  0.9290784
## 23      23         0             0  0.7585622
## 24      17         0             0  0.9128045
## 25      16         0             0  0.9160170

At first glance, it's not looking too bad! Next I combined it with the results for the other three categories to see how the models did overall.

svm_compare <- med_compare%>%
  inner_join(land_compare, by = "article")%>%
  inner_join(ballet_compare, by= "article")%>%
  inner_join(astro_compare, by = "article")

svm_compare_gather <- svm_compare%>%
  mutate(category= ifelse(Astronomy==1, "Astronomy",
                          ifelse(Ballet==1, "Ballet",
                                 ifelse(Landforms==1, "Landforms",
                                        ifelse(MedicalProfessional==1, "Med", 
                                        "Not Assigned")))),
         prediction= ifelse(med_predict==1, "Med",
                            ifelse(ballet_predict==1, "Ballet",
                                   ifelse(land_predict==1, "Landforms",
                                          ifelse(astro_predict==1, "Astronomy", 
                                          "Not Assigned")))))%>%
  select(article, category, prediction)

svm_compare_gather
##    article  category   prediction
## 1       40 Astronomy    Astronomy
## 2        3    Ballet       Ballet
## 3        6    Ballet       Ballet
## 4       25       Med          Med
## 5       21 Landforms    Landforms
## 6       41 Astronomy Not Assigned
## 7       35       Med          Med
## 8       12    Ballet       Ballet
## 9       47 Astronomy Not Assigned
## 10       5    Ballet       Ballet
## 11      32       Med          Med
## 12      44 Astronomy    Astronomy
## 13       4    Ballet       Ballet
## 14      33       Med          Med
## 15      43 Astronomy    Astronomy
## 16      50 Astronomy    Astronomy
## 17      10    Ballet       Ballet
## 18      18 Landforms    Landforms
## 19      37       Med          Med
## 20      45 Astronomy Not Assigned
## 21      29       Med          Med
## 22       9    Ballet Not Assigned
## 23      23 Landforms    Landforms
## 24      17 Landforms    Landforms
## 25      16 Landforms    Landforms
svm_compare_gather%>%
  count(category, prediction)%>%
  group_by(category)%>%
  mutate(percent=n/sum(n))%>%
  ggplot(aes(prediction, category, fill=percent))+
  geom_tile()+
  scale_fill_gradient2(high="red")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank())+
  labs(x = "Category articles were assigned to",
       y="Category articles came from",
       fill="Percent of assignments")

[Plot: confusion matrix of article assignments (predicted category vs. actual category)]

Looking pretty good, indeed! There were actually no incorrect assignments, just several articles that were not assigned to any category. We can see that the medical professional and landforms articles were predicted perfectly, but some of the ballet and astronomy articles were not assigned to their categories.
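Counting from the table, that's 21 of the 25 test articles with an exact match; the same number can be computed directly from the combined results:

#overall exact-match rate across the 25 test articles
svm_compare_gather%>%
  summarise(accuracy = mean(prediction == category))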

The SVM results also include the probability associated with each assignment in the model, for both the 1's and the 0's. We can use this statistic to report a probability for each article for each category (using one minus the probability for the 0's). Then we can visualize the assignments on a boxplot like we did for the LDA.

svm_box <- svm_compare%>%
  mutate(astro_p=ifelse(astro_predict==1, astro_prob, (1-astro_prob)),
         med_p=ifelse(med_predict==1, med_prob, (1-med_prob)),
         land_p=ifelse(land_predict==1, land_prob, (1-land_prob)),
         ballet_p=ifelse(ballet_predict==1, ballet_prob, (1-ballet_prob)),
         category= ifelse(Astronomy==1, "Astronomy",
                          ifelse(Ballet==1, "Ballet",
                                 ifelse(Landforms==1, "Landforms",
                                        ifelse(MedicalProfessional==1, "Med", "Not Assigned")))))%>%
  gather(cat_p, prob, astro_p, med_p, land_p, ballet_p)%>%
  select(category,article, cat_p, prob)

svm_box%>%
  ggplot(aes(factor(cat_p), prob))+
  geom_boxplot()+
  facet_wrap(~category)

[Plot: boxplots of SVM probabilities for each category model, faceted by actual category]

This plot makes me so happy. Even though it's not absolutely perfect, each group of predictions is discrete and associated with the correct category, which is exactly what I wanted to see. Success!

Using the lesson that I (re-)learned last week, I would guess that the model would get even better with a bigger dataset. Splitting the dataset in two (for training and testing) was particularly painful, since that meant that I really only had half the data to work with, but it seemed to turn out ok anyway. Feeling very accomplished this week, indeed.

Sunday, April 8, 2018

Web Scraping and LDA Topic Modelling

Scraping Wikipedia and Topic Modelling

With a “final project” for my independent study in mind, I've been doing some research about how best to go about web-scraping and categorizing text. I'm hoping to be able to stay within R for this project for sure, and maybe even remain mainly within the tidyverse depending on what the best solutions end up being.

Luckily, most of the tutorials I've found regarding web scraping in R use Hadley Wickham's rvest, so that one is pretty straightforward. And although my final project is going to require supervised modeling, doing the unsupervised LDA modelling as described in the Text Mining with R book may still be a preliminary step for the ultimate classification. Also, it seems like it's generally a good tool to have under my belt going forward, especially if I'm going to continue working with text. So here we go! Web scraping- and topic modelling-ho!

Selecting and Scraping the Data

In the TidyText book, an example LDA is run on the chapters of four separate books to see if the algorithm can correctly identify which book each chapter comes from. This example is very clean, with only two chapters being incorrectly assigned. I thought that I would try a similar exercise, but using four unrelated Wikipedia articles so that I would get some practice with web-scraping as well.

I primarily used this tutorial from Bradley Boehmke as a guide for performing the web-scraping. I decided to scrape all of the text from each of four broad but unrelated articles (“Dog”, “Number”, “Plant”, and “Entertainment”) by first using the code below to read all of the html data into R:

dog_wiki <- read_html("https://en.wikipedia.org/wiki/Dog")
number_wiki <- read_html("https://en.wikipedia.org/wiki/Number")
plant_wiki <- read_html("https://en.wikipedia.org/wiki/Plant")
entertainment_wiki <- read_html("https://en.wikipedia.org/wiki/Entertainment")

I then pulled all of the text from each “div” node, which the tutorial explains should capture most, if not all, of the text on the whole page. I stripped out the html markup, split the data into lines by “\n”, removed all of the tabs (“\t”) and empty lines, then removed all numbers (removeNumbers comes from the tm package). Below is the code for the “Dog” article.

dog_text <- dog_wiki%>%
  html_nodes("div")%>%
  html_text()%>%
  strsplit(split = "\n") %>%
  unlist() %>%
  str_replace_all(pattern="\t", replacement = "")%>%
  .[. != ""]%>%
  removeNumbers()

head(dog_text)
## [1] "Dog"                                                                                                                                  
## [2] "From Wikipedia, the free encyclopedia"                                                                                                
## [3] "Jump to:navigation, search"                                                                                                           
## [4] "This article is about the domestic dog. For related species known as \"dogs\", see Canidae. For other uses, see Dog (disambiguation)."
## [5] "\"Doggie\" redirects here. For the Danish artist, see Doggie (artist)."                                                               
## [6] "Domestic dogTemporal range: Late Pleistocene – Present (,– years BP)"

As you can see, some of the lines are very short or lack any text that would be specific to a particular article, so I decided to group every five lines into one. Especially since these lines are what the LDA will be working to classify, I wanted to give it the best chance of success by making sure every line had meaningful content.

Concatenating every five rows ended up being more difficult than I expected, but I landed on what I think is a pretty slick way to do it. First I had to turn my vector of lines into a dataframe, assign a number to every group of five rows, and then use the summarise and paste commands to combine the rows by group. I also decided to filter out any rows with fewer than 15 characters, and used this step to label the data by article as well.

dog_data <- as.data.frame(dog_text)
dog_grouped <- dog_data%>%
  mutate(group=1:nrow(dog_data)%/%5)%>%
  group_by(group)%>%
  summarise(text=paste(dog_text, collapse = " "))%>%
  filter(nchar(text)>15)%>%
  mutate(wiki = "dog")%>%
  select(wiki, group, text)

head(dog_grouped)
## # A tibble: 6 x 3
##   wiki  group text                                                        
##   <chr> <dbl> <chr>                                                       
## 1 dog    0    "Dog From Wikipedia, the free encyclopedia Jump to:navigati…
## 2 dog    1.00 "\"Doggie\" redirects here. For the Danish artist, see Dogg…
## 3 dog    2.00 Scientific classification  Kingdom: Animalia Phylum: Chorda…
## 4 dog    3.00 Class: Mammalia Order: Carnivora Family:                    
## 5 dog    4.00 Canidae Genus: Canis Species: C. lupus                      
## 6 dog    5.00 Subspecies: C. l. familiaris[] Trinomial name Canis lupus f…

Running an LDA model

So now my text is all in one place and ready to be “tidied” for analysis. The first step is to combine all four articles into one dataframe, and then to create a tidy dataframe, with one token (word) per row. I maintained the article name and group number as an index so that we'd be able to see where each word came from after the model ran. I also removed stop words and performed a count per word per group index, then created a dtm (document-term matrix) which is what is needed to run an LDA.

#combines into one df, unites index
wiki_text <- 
  rbind(dog_grouped, number_grouped, plant_grouped, entertainment_grouped)%>%
  unite(index, wiki, group, sep="_")

#splits by word and creates word count for each group index
by_group_word <- wiki_text%>%
  unnest_tokens(word, text)%>%
  anti_join(stop_words)%>%
  count(index, word, sort=T)
## Joining, by = "word"
#creates document-term matrix for lda
group_dtm <- by_group_word%>%
  cast_dtm(index, word, n)

At this point, we're ready to run the LDA and examine the results. The tidy function is really handy here, in that it pulls specific data out of the LDA results so that it's a bit more digestible. First we look at the per-topic-per-word probabilities, using the beta argument. Here I pull the top five terms for each topic by probability.

#create 4-topic lda model
wiki_lda <- LDA(group_dtm, k=4, control=list(seed=1234))

#per-topic-per-word probabilities
group_topics <- tidy(wiki_lda, matrix="beta")

#view top 5 terms for each topic
top_terms <- group_topics%>%
  group_by(topic)%>%
  top_n(5, beta)%>%
  ungroup()%>%
  arrange(topic, -beta)

top_terms
## # A tibble: 20 x 3
##    topic term             beta
##    <int> <chr>           <dbl>
##  1     1 entertainment 0.0289 
##  2     1 plants        0.00736
##  3     1 audience      0.00728
##  4     1 century       0.00667
##  5     1 forms         0.00636
##  6     2 dogs          0.0378 
##  7     2 dog           0.0343 
##  8     2 plants        0.0157 
##  9     2 plant         0.00681
## 10     2 humans        0.00568
## 11     3 real          0.00932
## 12     3 mongoose      0.00827
## 13     3 complex       0.00757
## 14     3 displaystyle  0.00722
## 15     3 seal          0.00657
## 16     4 isbn          0.0292 
## 17     4 doi           0.0148 
## 18     4 press         0.0135 
## 19     4 retrieved     0.0132 
## 20     4 university    0.0121

And, wow. There appears to be no really good pattern to the words/topics at all. The words for topic 4 are particularly troubling, since they are really unrelated to any of the four articles, probably mostly coming from the references on each page. Just in case, I decided to look at the distribution by article as well to see if there was any pattern evident. For this, I used the gamma argument to tidy which shows the proportion of words in each group assigned to each topic.

#proportion of words per group assigned to topic
group_gamma <- tidy(wiki_lda, matrix="gamma")

#separate index to plot topic assignment
group_gamma <- group_gamma%>%
  separate(document, c("wiki", "group"), sep="_", convert=TRUE)

group_gamma%>%
  mutate(wiki=reorder(wiki, gamma*topic))%>%
  ggplot(aes(factor(topic), gamma))+
  geom_boxplot()+
  facet_wrap(~wiki)

[Plot: boxplots of gamma by topic, faceted by article]

Yuck. I realized my mistake in including all the text from all four articles. I had thought that the words that were not content-specific (from the references, sidebars, etc.) would cancel each other out since they would be more or less equally present in all four articles. However, since the articles were broken into smaller groups, it makes sense that certain groups of each article would be more similar across articles than to other groups in the same article. For example, the references and sidebars would likely match onto their own topic separate from the article content.

LDA with Paragraph Text Only

I decided to redo the analysis using only paragraph text. Everything in the analysis remained the same, except that I only scraped “p” (paragraph) nodes from the articles.
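For the “Dog” article, that amounts to swapping the node selector in the earlier scraping chunk; a quick sketch (dogp_text is just a placeholder name):

#same pipeline as before, but pulling only paragraph ("p") nodes
dogp_text <- dog_wiki%>%
  html_nodes("p")%>%
  html_text()%>%
  strsplit(split = "\n") %>%
  unlist() %>%
  str_replace_all(pattern="\t", replacement = "")%>%
  .[. != ""]%>%
  removeNumbers()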

## # A tibble: 20 x 3
##    topic term             beta
##    <int> <chr>           <dbl>
##  1     1 dogs          0.0476 
##  2     1 dog           0.0376 
##  3     1 humans        0.00909
##  4     1 human         0.00838
##  5     1 pet           0.00786
##  6     2 dogs          0.0185 
##  7     2 negative      0.00766
##  8     2 theory        0.00686
##  9     2 complex       0.00639
## 10     2 century       0.00634
## 11     3 plants        0.0406 
## 12     3 plant         0.0123 
## 13     3 real          0.0116 
## 14     3 algae         0.00970
## 15     3 called        0.00748
## 16     4 entertainment 0.0365 
## 17     4 forms         0.00920
## 18     4 audience      0.00867
## 19     4 music         0.00725
## 20     4 dance         0.00664

Phew! Much better! Still not perfect (note that dog/dogs is at the top of both topics 1 and 2), but we can start to see some patterns in the topics that could match up to the different articles. When we look at the distribution across the different articles, we can see some very clear correlations.

[Plot: boxplots of gamma by topic, faceted by article (paragraph text only)]

Here we see that the articles on “Dog”, “Plant”, and “Entertainment” are each pretty clearly matched to a single topic. The article on “Number”, however, remains spread between a couple of topics. We can continue to use the gamma data to determine which topic is most commonly assigned to each group and each article, and then identify which specific groups are incorrectly assigned.

#topic most associated with each group index
pgroup_classification <- pgroup_gamma%>%
  group_by(wiki, group)%>%
  top_n(1, gamma)%>%
  ungroup()

#compare to topic most common among wiki
wiki_topics <- pgroup_classification%>%
  count(wiki, topic)%>%
  group_by(wiki)%>%
  top_n(1, n)%>%
  ungroup()%>%
  transmute(consensus = wiki, topic)

#find mismatched groups
mismatch_p <- pgroup_classification%>%
  inner_join(wiki_topics, by="topic")%>%
  filter(wiki != consensus)

mismatch_p
## # A tibble: 47 x 5
##    wiki          group topic gamma consensus
##    <chr>         <int> <int> <dbl> <chr>    
##  1 entertainment     8     1 0.807 dog      
##  2 entertainment     6     1 0.976 dog      
##  3 entertainment    12     1 0.614 dog      
##  4 number           10     1 0.999 dog      
##  5 dog               4     2 1.000 number   
##  6 dog               7     2 0.788 number   
##  7 plant             6     2 0.875 number   
##  8 dog               9     2 1.000 number   
##  9 plant            19     2 0.684 number   
## 10 plant            16     2 1.000 number   
## # ... with 37 more rows

It's apparent that the “Number” article is causing trouble. Most of the mismatched assignments either come from the “Number” article or match onto the “Number” topic. Going back to the article itself, it's actually not very surprising that this is happening. There isn't much language in the article that is specific to math or numbers and that wouldn't also be likely to appear in the other articles being analyzed. There are also long portions of the article about the history and cultural significance of numbers, which further blurs the “Number” article with the content of the others.

LDA without “Number” Article

So I decided to run another LDA, this time leaving the “Number” article out altogether. I still included only text from paragraph nodes, changing nothing but the text being analyzed and the number of topics to identify (now only three).
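The code mirrors the chunks above; here's a rough sketch reusing the nonum_* names that appear below (pwiki_text is a placeholder for the paragraph-only combined dataframe, and the seed here is arbitrary):

#placeholder name: pwiki_text is the paragraph-only version of wiki_text
#drop the "Number" groups, rebuild the dtm, and fit a 3-topic model
nonum_dtm <- pwiki_text%>%
  filter(!str_detect(index, "^number_"))%>%
  unnest_tokens(word, text)%>%
  anti_join(stop_words)%>%
  count(index, word, sort=T)%>%
  cast_dtm(index, word, n)

#seed chosen arbitrarily for this sketch
nonum_lda <- LDA(nonum_dtm, k=3, control=list(seed=1234))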

## # A tibble: 15 x 3
##    topic term             beta
##    <int> <chr>           <dbl>
##  1     1 dogs          0.0430 
##  2     1 dog           0.0255 
##  3     1 canis         0.00962
##  4     1 wolves        0.00933
##  5     1 domestic      0.00841
##  6     2 plants        0.0473 
##  7     2 plant         0.0152 
##  8     2 algae         0.0104 
##  9     2 green         0.00818
## 10     2 fungi         0.00564
## 11     3 entertainment 0.0344 
## 12     3 dogs          0.0122 
## 13     3 dog           0.0117 
## 14     3 forms         0.00815
## 15     3 audience      0.00786

It seems like the model is getting better and better, though those pesky dog/dogs are still causing some trouble. When we look at the patterns per article, we can see that the model is doing a pretty good job overall.

[Plot: boxplots of gamma by topic, faceted by article (“Number” excluded)]

## # A tibble: 18 x 5
##    wiki          group topic gamma consensus    
##    <chr>         <int> <int> <dbl> <chr>        
##  1 plant            23     1 0.681 dog          
##  2 entertainment     6     1 0.508 dog          
##  3 entertainment    14     1 0.554 dog          
##  4 entertainment     3     1 0.754 dog          
##  5 entertainment    13     1 0.795 dog          
##  6 plant             8     1 0.994 dog          
##  7 entertainment    15     2 0.612 plant        
##  8 entertainment    17     2 0.611 plant        
##  9 dog               3     2 0.528 plant        
## 10 dog              12     3 1.000 entertainment
## 11 dog              10     3 0.527 entertainment
## 12 dog              17     3 1.000 entertainment
## 13 dog              11     3 0.724 entertainment
## 14 dog              14     3 1.000 entertainment
## 15 dog              16     3 1.000 entertainment
## 16 dog               5     3 1.000 entertainment
## 17 dog               6     3 1.000 entertainment
## 18 dog              15     3 1.000 entertainment

The groupings are much cleaner in this model without the “Number” article, and the list of mismatched groups is smaller. We see that most of the errors are between the “Dog” and “Entertainment” articles, though looking at the individual groups doesn't give any really good insight into why they might be assigned incorrectly. As a dog lover, I suggest that it's likely because dogs are just so entertaining…

We can also look at how the specific words were assigned and identify which words in each group led to an incorrect assignment. And finally, we can create a confusion matrix to visualize the percent of words in each article that were assigned correctly/incorrectly.

#find mismatched words
assignments <- augment(nonum_lda, data=nonum_dtm)%>%
  separate(document, c("wiki", "group"), sep="_", convert=T)%>%
  inner_join(nonum_topics, by=c(".topic"="topic"))

missed_assignments <- assignments%>%
  filter(wiki!=consensus)

#confusion matrix of word/topic assignment
assignments%>%
  count(wiki, consensus, wt=count)%>%
  group_by(wiki)%>%
  mutate(percent=n/sum(n))%>%
  ggplot(aes(consensus, wiki, fill=percent))+
  geom_tile()+
  scale_fill_gradient2(high="red")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank())+
  labs(x = "Wiki words were assigned to",
       y="Wiki words came from",
       fill="%of assignments")

[Plot: confusion matrix of word assignments (wiki assigned to vs. wiki words came from)]

The confusion matrix makes visually clear what we had already determined from the data. The “Plant” article was very successfully identified, with only a few words incorrectly assigned to/from the article. The “Dog” and “Entertainment” articles were still successfully identified, but with more mistakes between the two. The “Dog” article was the least successfully identified, and the largest group of incorrect assignments came from the “Dog” article being assigned to the “Entertainment” topic.

Conclusion

The biggest conclusion that I came to during this exercise was that topic-modelling is not as clean as it looked in the textbook. This should probably be obvious, but I came into this project thinking that I would have results as beautifully clean as the ones in the book's “Great Library Heist”. This is actually probably really important for me to realize going forward, and I feel like I need some guidance about what constitutes a good model beyond looking nice.

Even if my model isn't perfect, I do think that I learned a lot doing this exercise. I feel pretty confident in my ability to do basic web-scraping, and I'm figuring out what does and doesn't work for running LDAs, as well as what I can do to try to improve the model. Overall, a very successful week, I think!
