Anna Marbut: Web Scraping and LDA Topic Modelling

Scraping Wikipedia and Topic Modelling

With a “final project” for my independent study in mind, I've been doing some research about how best to go about web-scraping and categorizing text. I'm hoping to be able to stay within R for this project for sure, and maybe even remain mainly within the tidyverse depending on what the best solutions end up being.

Luckily, most of the tutorials I've found regarding web scraping in R use Hadley Wickham's rvest, so that one is pretty straightforward. And although my final project is going to require supervised modeling, doing the unsupervised LDA modelling as described in the Text Mining with R book may still be a preliminary step for the ultimate classification. Also, it seems like it's generally a good tool to have under my belt going forward, especially if I'm going to continue working with text. So here we go! Web scraping- and topic modelling-ho!

Selecting and Scraping the Data

In Text Mining with R, an example LDA is run on the chapters of four separate books to see if the algorithm can correctly identify which book each chapter comes from. This example is very clean, with only two chapters being incorrectly assigned. I thought that I would try a similar exercise, but using four unrelated Wikipedia articles so that I would get some practice with web-scraping as well.

I primarily used this tutorial from Bradley Boehmke as a guide for performing the web-scraping. I decided to scrape all of the text from each of four broad but unrelated articles (“Dog”, “Number”, “Plant”, and “Entertainment”) by first using the code below to read all of the html data into R:

library(rvest)

#read the full html of each article into R
dog_wiki <- read_html("https://en.wikipedia.org/wiki/Dog")
number_wiki <- read_html("https://en.wikipedia.org/wiki/Number")
plant_wiki <- read_html("https://en.wikipedia.org/wiki/Plant")
entertainment_wiki <- read_html("https://en.wikipedia.org/wiki/Entertainment")

I then pulled all of the text from each “div” node, which the tutorial explains should be most, if not all, of the text on the whole page. I stripped the HTML markup with html_text, split the data into lines by “\n”, removed all of the tabs (“\t”) and empty lines, then removed all numbers with removeNumbers from the tm package. Below is the code for the “Dog” article.

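library(stringr)  #str_replace_all
library(tm)       #removeNumbers

dog_text <- dog_wiki%>%
  html_nodes("div")%>%
  html_text()%>%                 #drops the html tags, keeps the text
  strsplit(split = "\n") %>%     #one element per line
  unlist() %>%
  str_replace_all(pattern="\t", replacement = "")%>%
  .[. != ""]%>%                  #drops empty lines
  removeNumbers()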

head(dog_text)

## [1] "Dog"                                                                                                                                  
## [2] "From Wikipedia, the free encyclopedia"                                                                                                
## [3] "Jump to:navigation, search"                                                                                                           
## [4] "This article is about the domestic dog. For related species known as \"dogs\", see Canidae. For other uses, see Dog (disambiguation)."
## [5] "\"Doggie\" redirects here. For the Danish artist, see Doggie (artist)."                                                               
## [6] "Domestic dogTemporal range: Late Pleistocene – Present (,– years BP)"
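
To see what each cleaning step does, here is a minimal sketch on a made-up string rather than the real page. The input text is invented, and I'm using base R's gsub as a stand-in for both str_replace_all and removeNumbers, so treat it as an illustration and not as the exact pipeline above.

```r
# Toy input (invented), standing in for the text of one scraped node
raw <- "Dog\n\t\tFrom Wikipedia\n\nDomestic dog (14,200 years BP)"

txt_lines <- unlist(strsplit(raw, split = "\n"))      # one element per line
txt_lines <- gsub("\t", "", txt_lines, fixed = TRUE)  # drop tabs
txt_lines <- txt_lines[txt_lines != ""]               # drop empty lines
cleaned   <- gsub("[[:digit:]]+", "", txt_lines)      # drop digits (stand-in for removeNumbers)

cleaned
```

Stripping digits leaves the surrounding punctuation behind, which is why the last line of the real output reads "(,– years BP)".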

As you can see, some of the lines are very short or lack any text that would be specific to a particular article, so I decided to group every five lines into one. Since these lines are what the LDA will be working to classify, I wanted to give it the best chance of being successful by trying to make sure every line had meaningful content.

The problem of concatenating every five rows ended up being more difficult than I expected, but I landed on what I think is a pretty slick way to do it. First I had to turn my character vector of lines into a dataframe, assign a group number to every run of five rows, and then use the summarise and paste commands to combine the rows by group. I also decided to filter out any groups with 15 or fewer characters, and used this step to label the data by article as well.

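dog_data <- as.data.frame(dog_text)
dog_grouped <- dog_data%>%
  #group index: integer division of the row number by 5
  #(rows 1-4 land in group 0, so the first group has four lines)
  mutate(group=1:nrow(dog_data)%/%5)%>%
  group_by(group)%>%
  #paste the lines of each group into one string
  summarise(text=paste(dog_text, collapse = " "))%>%
  #drop very short groups and label by article
  filter(nchar(text)>15)%>%
  mutate(wiki = "dog")%>%
  select(wiki, group, text)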

head(dog_grouped)

## # A tibble: 6 x 3
##   wiki  group text                                                        
##   <chr> <dbl> <chr>                                                       
## 1 dog    0    "Dog From Wikipedia, the free encyclopedia Jump to:navigati…
## 2 dog    1.00 "\"Doggie\" redirects here. For the Danish artist, see Dogg…
## 3 dog    2.00 Scientific classification  Kingdom: Animalia Phylum: Chorda…
## 4 dog    3.00 Class: Mammalia Order: Carnivora Family:                    
## 5 dog    4.00 Canidae Genus: Canis Species: C. lupus                      
## 6 dog    5.00 Subspecies: C. l. familiaris[] Trinomial name Canis lupus f…
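
The grouping index is easy to check on a toy vector. The sketch below uses base R only and twelve invented "lines"; idx_even is my own variant (not what was run above) that subtracts one before dividing so that every group except possibly the last holds exactly five lines.

```r
n <- 12
txt <- paste("line", seq_len(n))      # invented stand-in for the scraped lines

idx_post <- 1:n %/% 5                 # index as written in the code above
idx_even <- (seq_len(n) - 1) %/% 5    # variant: groups of 5, 5, 2

table(idx_post)                       # group sizes with the original index
table(idx_even)                       # group sizes with the variant

# Collapse each group into a single string, as summarise + paste does
grouped <- tapply(txt, idx_even, paste, collapse = " ")
grouped[[1]]
```

With the original index the first group is one line short, which matches the output above, where group 1 already starts with the fifth line.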

Running an LDA model

So now my text is all in one place and ready to be “tidied” for analysis. The first step is to combine all four articles into one dataframe, and then to create a tidy dataframe, with one token (word) per row. I maintained the article name and group number as an index so that we'd be able to see where each word came from after the model ran. I also removed stop words and performed a count per word per group index, then created a dtm (document-term matrix) which is what is needed to run an LDA.

library(dplyr)
library(tidyr)     #unite
library(tidytext)  #unnest_tokens, stop_words, cast_dtm

#combines into one df, unites index
wiki_text <- 
  rbind(dog_grouped, number_grouped, plant_grouped, entertainment_grouped)%>%
  unite(index, wiki, group, sep="_")

#splits by word and creates word count for each group index
by_group_word <- wiki_text%>%
  unnest_tokens(word, text)%>%
  anti_join(stop_words)%>%
  count(index, word, sort=T)

## Joining, by = "word"

#creates document-term matrix for lda
group_dtm <- by_group_word%>%
  cast_dtm(index, word, n)
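
For a feel of what this tokenise, count, and cast sequence produces, here is a small sketch on two invented one-sentence "documents". The index names and sentences are made up for illustration, and cast_dtm needs the tm package installed to build the matrix.

```r
library(dplyr)
library(tidytext)

# Two invented documents, indexed the same way as wiki_text
toy <- tibble(
  index = c("dog_0", "plant_0"),
  text  = c("Dogs chase the ball", "Plants absorb the sunlight")
)

# One row per (document, word) with a count, stop words removed
toy_counts <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(index, word)

# Documents become rows, words become columns
toy_dtm <- toy_counts %>%
  cast_dtm(index, word, n)

dim(toy_dtm)
```

Each row of the matrix is one group index and each column is one remaining word, which is the shape LDA expects.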

At this point, we're ready to run the LDA and examine the results. The tidy function is really handy here, in that it pulls specific data out of the LDA results so that it's a bit more digestible. First we look at the per-topic-per-word probabilities by passing matrix="beta" to tidy. Here I pull the top five terms for each topic by probability.

library(topicmodels)  #LDA

#create 4-topic lda model
wiki_lda <- LDA(group_dtm, k=4, control=list(seed=1234))

#per-topic-per-word probabilities
group_topics <- tidy(wiki_lda, matrix="beta")

#view top 5 terms for each topic
top_terms <- group_topics%>%
  group_by(topic)%>%
  top_n(5, beta)%>%
  ungroup()%>%
  arrange(topic, -beta)

top_terms

## # A tibble: 20 x 3
##    topic term             beta
##    <int> <chr>           <dbl>
##  1     1 entertainment 0.0289 
##  2     1 plants        0.00736
##  3     1 audience      0.00728
##  4     1 century       0.00667
##  5     1 forms         0.00636
##  6     2 dogs          0.0378 
##  7     2 dog           0.0343 
##  8     2 plants        0.0157 
##  9     2 plant         0.00681
## 10     2 humans        0.00568
## 11     3 real          0.00932
## 12     3 mongoose      0.00827
## 13     3 complex       0.00757
## 14     3 displaystyle  0.00722
## 15     3 seal          0.00657
## 16     4 isbn          0.0292 
## 17     4 doi           0.0148 
## 18     4 press         0.0135 
## 19     4 retrieved     0.0132 
## 20     4 university    0.0121

And, wow. There appears to be no really good pattern to the words/topics at all. The words for topic 4 are particularly troubling, since they are unrelated to any of the four articles and probably come mostly from the references on each page. Just in case, I decided to look at the distribution by article as well to see if there was any pattern evident. For this, I passed matrix="gamma" to tidy, which gives the estimated proportion of words in each group that are assigned to each topic.

#proportion of words per group assigned to topic
group_gamma <- tidy(wiki_lda, matrix="gamma")

#separate index to plot topic assignment
group_gamma <- group_gamma%>%
  separate(document, c("wiki", "group"), sep="_", convert=TRUE)

library(ggplot2)

#boxplot of gamma per topic, one panel per article
group_gamma%>%
  mutate(wiki=reorder(wiki, gamma*topic))%>%
  ggplot(aes(factor(topic), gamma))+
  geom_boxplot()+
  facet_wrap(~wiki)

[Figure: boxplots of gamma by topic, one panel per article (chunk "LDA gamma1")]

Yuck. I realized my mistake in including all the text from all four articles. I had thought that the words that were not content-specific (from the references, sidebars, etc.) would cancel each other out since they would be more or less equally present in all four articles. However, since the articles were broken into smaller groups, it makes sense that certain groups of each article would be more similar across articles than to other groups in the same article. For example, the references and sidebars would likely match onto their own topic separate from the article content.

LDA with Paragraph Text Only

I decided to redo the analysis using only paragraph text. Everything in the analysis remained the same, except that I only scraped “p” nodes from the articles.
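
As a rough sketch of what changes when the selector switches to paragraph nodes, the snippet below parses a tiny invented HTML page instead of a live Wikipedia article. The page content and the toy_ names are made up for illustration; only the html_nodes/html_text calls mirror the scraping code earlier in the post.

```r
library(rvest)

# Invented page: one sidebar div plus a content div holding two paragraphs
toy_page <- read_html(
  "<html><body>
     <div id='sidebar'>Jump to: navigation, search</div>
     <div id='content'>
       <p>First paragraph of article text.</p>
       <p>Second paragraph of article text.</p>
     </div>
   </body></html>"
)

# Selecting paragraph nodes skips text that only lives in the sidebar div
toy_text <- toy_page %>%
  html_nodes("p") %>%
  html_text()

toy_text
```

The sidebar text never reaches toy_text, which is the effect I was after for the references and navigation clutter.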

## # A tibble: 20 x 3
##    topic term             beta
##    <int> <chr>           <dbl>
##  1     1 dogs          0.0476 
##  2     1 dog           0.0376 
##  3     1 humans        0.00909
##  4     1 human         0.00838
##  5     1 pet           0.00786
##  6     2 dogs          0.0185 
##  7     2 negative      0.00766
##  8     2 theory        0.00686
##  9     2 complex       0.00639
## 10     2 century       0.00634
## 11     3 plants        0.0406 
## 12     3 plant         0.0123 
## 13     3 real          0.0116 
## 14     3 algae         0.00970
## 15     3 called        0.00748
## 16     4 entertainment 0.0365 
## 17     4 forms         0.00920
## 18     4 audience      0.00867
## 19     4 music         0.00725
## 20     4 dance         0.00664

Phew! Much better! Still not perfect (note that dog/dogs is at the top of both topic 1 and 2), but we can start to see some patterns between the topics that could match up to the different articles. When we look at the patterns across the different articles, we can see some very clear correlations.

[Figure: gamma by topic for each article, paragraph text only (chunk "LDA gamma2")]

Here we see that the articles on “Dog”, “Plant”, and “Entertainment” are each pretty clearly associated with a single topic. The article on “Number”, however, remains spread between a couple of topics. We can continue to use the gamma data to determine which topic is most commonly assigned to each group and each article, and then identify which specific groups are incorrectly assigned.

#topic most associated with each group index
pgroup_classification <- pgroup_gamma%>%
  group_by(wiki, group)%>%
  top_n(1, gamma)%>%
  ungroup()

#compare to topic most common among wiki
wiki_topics <- pgroup_classification%>%
  count(wiki, topic)%>%
  group_by(wiki)%>%
  top_n(1, n)%>%
  ungroup()%>%
  transmute(consensus = wiki, topic)

#find mismatched groups
mismatch_p <- pgroup_classification%>%
  inner_join(wiki_topics, by="topic")%>%
  filter(wiki != consensus)

mismatch_p

## # A tibble: 47 x 5
##    wiki          group topic gamma consensus
##    <chr>         <int> <int> <dbl> <chr>    
##  1 entertainment     8     1 0.807 dog      
##  2 entertainment     6     1 0.976 dog      
##  3 entertainment    12     1 0.614 dog      
##  4 number           10     1 0.999 dog      
##  5 dog               4     2 1.000 number   
##  6 dog               7     2 0.788 number   
##  7 plant             6     2 0.875 number   
##  8 dog               9     2 1.000 number   
##  9 plant            19     2 0.684 number   
## 10 plant            16     2 1.000 number   
## # ... with 37 more rows
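
The same classify-then-compare logic can be tried on a toy gamma table. This is only a sketch: the wiki, group, and gamma values below are invented, and I use slice_max, the newer dplyr replacement for top_n, with with_ties = FALSE so each group keeps exactly one topic.

```r
library(dplyr)

# Invented gamma values: three groups per article, two topics
toy_gamma <- tibble(
  wiki  = rep(c("dog", "dog", "dog", "plant", "plant", "plant"), each = 2),
  group = rep(c(0, 1, 2, 0, 1, 2), each = 2),
  topic = rep(1:2, times = 6),
  gamma = c(0.9, 0.1,  0.8, 0.2,  0.3, 0.7,
            0.2, 0.8,  0.1, 0.9,  0.4, 0.6)
)

# Most likely topic for each group
toy_class <- toy_gamma %>%
  group_by(wiki, group) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# Topic that wins the most groups within each article
toy_consensus <- toy_class %>%
  count(wiki, topic) %>%
  group_by(wiki) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(consensus = wiki, topic)

# Groups whose topic belongs to a different article
toy_mismatch <- toy_class %>%
  inner_join(toy_consensus, by = "topic") %>%
  filter(wiki != consensus)

toy_mismatch
```

Here only the third "dog" group is flagged, because its strongest topic is the one that the "plant" groups agree on.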

It's apparent that the “Number” article is causing trouble. Most of the mismatched assignments either come from the “Number” article or match onto the “Number” topic. After going back to the article itself, it's actually not very surprising that this is happening. Little of the language in the article is specific to math or numbers; most of it is just as likely to appear in the other articles being analyzed. There are also long portions of the article about the history and cultural significance of numbers, which further blurs the “Number” article with the content of the others.

LDA without “Number” Article

Again, I decided to run another LDA, leaving the “Number” article out altogether. I still included only text from paragraph nodes, changing only the text being analyzed and the number of topics to identify (now only three).

## # A tibble: 15 x 3
##    topic term             beta
##    <int> <chr>           <dbl>
##  1     1 dogs          0.0430 
##  2     1 dog           0.0255 
##  3     1 canis         0.00962
##  4     1 wolves        0.00933
##  5     1 domestic      0.00841
##  6     2 plants        0.0473 
##  7     2 plant         0.0152 
##  8     2 algae         0.0104 
##  9     2 green         0.00818
## 10     2 fungi         0.00564
## 11     3 entertainment 0.0344 
## 12     3 dogs          0.0122 
## 13     3 dog           0.0117 
## 14     3 forms         0.00815
## 15     3 audience      0.00786

It seems like our model is getting better and better, though those pesky dog/dogs are still causing some trouble. When we look at the patterns per article, though, we can see that the model is doing a pretty good job overall.

[Figure: gamma by topic for each article, without the “Number” article (chunk "nonum gamma")]

## # A tibble: 18 x 5
##    wiki          group topic gamma consensus    
##    <chr>         <int> <int> <dbl> <chr>        
##  1 plant            23     1 0.681 dog          
##  2 entertainment     6     1 0.508 dog          
##  3 entertainment    14     1 0.554 dog          
##  4 entertainment     3     1 0.754 dog          
##  5 entertainment    13     1 0.795 dog          
##  6 plant             8     1 0.994 dog          
##  7 entertainment    15     2 0.612 plant        
##  8 entertainment    17     2 0.611 plant        
##  9 dog               3     2 0.528 plant        
## 10 dog              12     3 1.000 entertainment
## 11 dog              10     3 0.527 entertainment
## 12 dog              17     3 1.000 entertainment
## 13 dog              11     3 0.724 entertainment
## 14 dog              14     3 1.000 entertainment
## 15 dog              16     3 1.000 entertainment
## 16 dog               5     3 1.000 entertainment
## 17 dog               6     3 1.000 entertainment
## 18 dog              15     3 1.000 entertainment

The groupings are much cleaner in this model without the “Number” article, and the list of mismatched groups is smaller. We see that most of the errors are between the “Dog” and “Entertainment” articles, though looking at the specific groups doesn't give any really good insight into why these specific groups might be assigned incorrectly. As a dog lover, I suggest that it's likely because dogs are just so entertaining…

We can also look at how the specific words were assigned and identify which words in each group led to an incorrect assignment. And finally, we can create a confusion matrix to visualize the percent of words in each article that were assigned correctly/incorrectly.

#find mismatched words
assignments <- augment(nonum_lda, data=nonum_dtm)%>%
  separate(document, c("wiki", "group"), sep="_", convert=T)%>%
  inner_join(nonum_topics, by=c(".topic"="topic"))

missed_assignments <- assignments%>%
  filter(wiki!=consensus)

#confusion matrix of word/topic assignment
assignments%>%
  count(wiki, consensus, wt=count)%>%
  group_by(wiki)%>%
  mutate(percent=n/sum(n))%>%
  ggplot(aes(consensus, wiki, fill=percent))+
  geom_tile()+
  scale_fill_gradient2(high="red")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank())+
  labs(x = "Wiki words were assigned to",
       y="Wiki words came from",
       fill="% of assignments")

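[Figure: confusion matrix heatmap; x-axis "Wiki words were assigned to", y-axis "Wiki words came from" (chunk "confusion matrix")]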

The confusion matrix makes visually clear what we had already determined from the data. The “Plant” article was very successfully identified, with only a few words incorrectly assigned to/from the article. The “Dog” and “Entertainment” articles were still successfully identified, but with more mistakes between the two. The “Dog” article was the least successfully identified, and the largest group of incorrect assignments came from the “Dog” article being assigned to the “Entertainment” topic.
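
The percentages behind a confusion matrix like this can be sketched in base R. The counts below are invented purely to show the arithmetic; xtabs builds the article-by-assigned-topic table and prop.table with margin = 1 turns each row into proportions of that article's words.

```r
# Invented word counts: where words came from vs. where they were assigned
toy_counts <- data.frame(
  wiki      = c("dog", "dog", "plant", "plant"),
  consensus = c("dog", "entertainment", "plant", "dog"),
  count     = c(80, 20, 95, 5)
)

# Rows are source articles, columns are assigned articles
tab <- xtabs(count ~ wiki + consensus, data = toy_counts)

# Share of each article's words landing in each column (rows sum to 1)
pct <- prop.table(tab, margin = 1)

round(pct, 2)
```

Normalising by row is what makes the articles comparable even when one page has far more words than another.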

Conclusion

The biggest conclusion that I came to during this exercise was that topic-modelling is not as clean as it looked in the textbook. This should probably be obvious, but I came into this project thinking that I would have results as beautifully clean as the ones in the book's “Great Library Heist”. This is actually probably really important for me to realize going forward, and I feel like I need some guidance about what constitutes a good model beyond looking nice.

Even if my model isn't perfect, I do think that I learned a lot doing this exercise. I feel pretty confident in my ability to do basic web-scraping, and I'm figuring out what does and doesn't work for running LDAs, as well as what I can do to try to improve the model. Overall, a very successful week, I think!

Anna Marbut

Sunday, April 8, 2018