Scraping Wikipedia and Topic Modelling
With a “final project” for my independent study in mind, I've been doing some research about how best to go about web-scraping and categorizing text. I'm hoping to be able to stay within R for this project for sure, and maybe even remain mainly within the tidyverse depending on what the best solutions end up being.
Luckily, most of the tutorials I've found regarding web scraping in R use Hadley Wickham's rvest package, so that part is pretty straightforward. And although my final project is going to require supervised modeling, doing the unsupervised LDA modelling as described in the Text Mining with R book may still be a preliminary step for the ultimate classification. Also, it seems like it's generally a good tool to have under my belt going forward, especially if I'm going to continue working with text. So here we go! Web scraping- and topic modelling-ho!
Selecting and Scraping the Data
In the TidyText book, an example LDA is run on the chapters of four separate books to see if the algorithm can correctly identify which book each chapter comes from. This example is very clean, with only two chapters being incorrectly assigned. I thought that I would try a similar exercise, but using four unrelated Wikipedia articles so that I would get some practice with web-scraping as well.
I primarily used this tutorial from Bradley Boehmke as a guide for performing the web-scraping. I decided to scrape all of the text from each of four broad but unrelated articles (“Dog”, “Number”, “Plant”, and “Entertainment”) by first using the code below to read all of the html data into R:
library(rvest)  # read_html(), html_nodes(), html_text()
# read the raw html of each article
dog_wiki <- read_html("https://en.wikipedia.org/wiki/Dog")
number_wiki <- read_html("https://en.wikipedia.org/wiki/Number")
plant_wiki <- read_html("https://en.wikipedia.org/wiki/Plant")
entertainment_wiki <- read_html("https://en.wikipedia.org/wiki/Entertainment")
I then pulled all of the text from each “div” node, which the tutorial explains should capture most, if not all, of the text on the page. I stripped the html markup, split the text into lines on “\n”, removed all of the tabs (“\t”) and empty lines, then removed all numbers (removeNumbers comes from the tm package). Below is the code for the “Dog” article.
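library(stringr)  # str_replace_all()
library(tm)       # removeNumbers()
dog_text <- dog_wiki %>%
  html_nodes("div") %>%                                  # every div node
  html_text() %>%                                        # drop the html markup
  strsplit(split = "\n") %>%                             # one element per line
  unlist() %>%
  str_replace_all(pattern = "\t", replacement = "") %>%  # remove tabs
  .[. != ""] %>%                                         # remove empty lines
  removeNumbers()
head(dog_text)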
## [1] "Dog"
## [2] "From Wikipedia, the free encyclopedia"
## [3] "Jump to:navigation, search"
## [4] "This article is about the domestic dog. For related species known as \"dogs\", see Canidae. For other uses, see Dog (disambiguation)."
## [5] "\"Doggie\" redirects here. For the Danish artist, see Doggie (artist)."
## [6] "Domestic dogTemporal range: Late Pleistocene – Present (,– years BP)"
As you can see, some of the lines are very short or lacking any text that would be specific to any particular article, so I decided to group every five lines into one. Especially since these lines will be what the LDA was working to classify, I wanted to give it the best chance of being successful by trying to make sure every line had meaningful content.
The problem of concatenating every five rows ended up being more difficult than I expected, but I landed on what I think is a pretty slick way to do it. First I had to turn my character vector of lines into a dataframe, assign a number to every group of five rows, and then use the summarise and paste commands to combine the rows by group. I also decided to filter out any rows with 15 or fewer characters, and used this step to label the data by article as well.
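library(dplyr)
dog_data <- as.data.frame(dog_text)
# note: 1:n %/% 5 is 0 only for rows 1-4, so the first group holds four rows;
# (1:n - 1) %/% 5 would give exact groups of five
dog_grouped <- dog_data %>%
  mutate(group = 1:nrow(dog_data) %/% 5) %>%
  group_by(group) %>%
  summarise(text = paste(dog_text, collapse = " ")) %>%
  filter(nchar(text) > 15) %>%
  mutate(wiki = "dog") %>%
  select(wiki, group, text)
head(dog_grouped)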
## # A tibble: 6 x 3
## wiki group text
## <chr> <dbl> <chr>
## 1 dog 0 "Dog From Wikipedia, the free encyclopedia Jump to:navigati…
## 2 dog 1.00 "\"Doggie\" redirects here. For the Danish artist, see Dogg…
## 3 dog 2.00 Scientific classification Kingdom: Animalia Phylum: Chorda…
## 4 dog 3.00 Class: Mammalia Order: Carnivora Family:
## 5 dog 4.00 Canidae Genus: Canis Species: C. lupus
## 6 dog 5.00 Subspecies: C. l. familiaris[] Trinomial name Canis lupus f…
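Only the “Dog” code is shown above, but the same two steps have to run for the other three articles as well. Here is a minimal sketch of how they might be wrapped into one function. The name scrape_and_group(), its arguments, and the inline demo page are all hypothetical, and base gsub() stands in for tm's removeNumbers(). It also numbers rows from zero before dividing, so every group holds exactly five lines, which differs slightly from the grouping shown above.

```r
library(rvest)
library(stringr)
library(dplyr)

# Hypothetical helper: clean the text of one parsed page and group its lines.
# `page` is the result of read_html(); `name` is the label for the article.
scrape_and_group <- function(page, name, selector = "div", size = 5) {
  lines <- page %>%
    html_nodes(selector) %>%
    html_text() %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    str_replace_all(pattern = "\t", replacement = "")
  lines <- lines[lines != ""]               # drop empty lines
  lines <- gsub("[[:digit:]]+", "", lines)  # stand-in for tm::removeNumbers()

  tibble(line = lines) %>%
    mutate(group = (row_number() - 1) %/% size) %>%  # exactly `size` lines per group
    group_by(group) %>%
    summarise(text = paste(line, collapse = " ")) %>%
    filter(nchar(text) > 15) %>%
    mutate(wiki = name) %>%
    select(wiki, group, text)
}

# Small inline page (made-up content) so the sketch runs without a network connection
demo_page <- read_html(
  "<html><body><div>Dogs are domesticated animals\nThey descend from wolves 15000 years ago</div></body></html>"
)
demo_grouped <- scrape_and_group(demo_page, "dog")
demo_grouped
```

With a live connection, the same call would look like scrape_and_group(read_html("https://en.wikipedia.org/wiki/Number"), "number").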
Running an LDA model
So now my text is all in one place and ready to be “tidied” for analysis. The first step is to combine all four articles into one dataframe, and then to create a tidy dataframe, with one token (word) per row. I maintained the article name and group number as an index so that we'd be able to see where each word came from after the model ran. I also removed stop words and performed a count per word per group index, then created a dtm (document-term matrix) which is what is needed to run an LDA.
library(tidyr)     # unite()
library(tidytext)  # unnest_tokens(), stop_words, cast_dtm()
#combines into one df, unites index
wiki_text <-
  rbind(dog_grouped, number_grouped, plant_grouped, entertainment_grouped) %>%
  unite(index, wiki, group, sep = "_")
#splits by word and creates word count for each group index
by_group_word <- wiki_text %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(index, word, sort = TRUE)
## Joining, by = "word"
#creates document-term matrix for lda
group_dtm <- by_group_word %>%
  cast_dtm(index, word, n)
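To make the document-term matrix a little less abstract, here is a small sketch on made-up counts. The toy_counts table is hypothetical; it only mirrors the index/word/n shape of by_group_word. Each index becomes a row (a "document") and each distinct word becomes a column (a "term").

```r
library(dplyr)
library(tidytext)

# Toy word counts in the same shape as by_group_word (values are made up)
toy_counts <- tibble(
  index = c("dog_0", "dog_0", "plant_0", "plant_0"),
  word  = c("dog", "wolf", "plant", "algae"),
  n     = c(3, 1, 4, 2)
)

# One row per document (index), one column per term (word), counts as values
toy_dtm <- toy_counts %>%
  cast_dtm(index, word, n)

dim(toy_dtm)   # number of documents, number of terms
tidy(toy_dtm)  # back to one row per document-term pair
```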
At this point, we're ready to run the LDA and examine the results. The tidy function is really handy here, in that it pulls specific data out of the LDA results so that it's a bit more digestible. First we look at the per-topic-per-word probabilities, using the matrix = "beta" argument. Here I pull the top five terms for each topic by probability.
library(topicmodels)  # LDA()
#create 4-topic lda model
wiki_lda <- LDA(group_dtm, k = 4, control = list(seed = 1234))
#per-topic-per-word probabilities
group_topics <- tidy(wiki_lda, matrix = "beta")
#view top 5 terms for each topic
top_terms <- group_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms
## # A tibble: 20 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 entertainment 0.0289
## 2 1 plants 0.00736
## 3 1 audience 0.00728
## 4 1 century 0.00667
## 5 1 forms 0.00636
## 6 2 dogs 0.0378
## 7 2 dog 0.0343
## 8 2 plants 0.0157
## 9 2 plant 0.00681
## 10 2 humans 0.00568
## 11 3 real 0.00932
## 12 3 mongoose 0.00827
## 13 3 complex 0.00757
## 14 3 displaystyle 0.00722
## 15 3 seal 0.00657
## 16 4 isbn 0.0292
## 17 4 doi 0.0148
## 18 4 press 0.0135
## 19 4 retrieved 0.0132
## 20 4 university 0.0121
And, wow. There appears to be no really good pattern to the words/topics at all. The words for topic 4 are particularly troubling, since they are unrelated to any of the four articles and probably come mostly from the references on each page. Just in case, I decided to look at the distribution by article as well to see if there was any pattern evident. For this, I used the matrix = "gamma" argument to tidy, which shows the estimated proportion of words in each group assigned to each topic.
library(ggplot2)
#proportion of words per group assigned to topic
group_gamma <- tidy(wiki_lda, matrix = "gamma")
#separate index to plot topic assignment
group_gamma <- group_gamma %>%
  separate(document, c("wiki", "group"), sep = "_", convert = TRUE)
group_gamma %>%
  mutate(wiki = reorder(wiki, gamma * topic)) %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~wiki)
Yuck. I realized my mistake in including all the text from all four articles. I had thought that the words that were not content-specific (from the references, sidebars, etc.) would cancel each other out since they would be more or less equally present in all four articles. However, since the articles were broken into smaller groups, it makes sense that certain groups of each article would be more similar across articles than to other groups in the same article. For example, the references and sidebars would likely match onto their own topic separate from the article content.
LDA with Paragraph Text Only
I decided to redo the analysis using only paragraph text. Everything in the analysis remained the same, except that I only scraped “p” nodes from the articles.
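As a rough illustration of what that one change does, here is a sketch on a tiny inline page. The html content is made up, not taken from Wikipedia; the only difference between the two calls is the selector passed to html_nodes.

```r
library(rvest)

# Inline page (made-up content) with one paragraph and one reference-style div
demo_page <- read_html(paste0(
  "<html><body>",
  "<div id='content'><p>Dogs were domesticated from wolves.</p></div>",
  "<div class='reflist'>Retrieved from a university press. ISBN.</div>",
  "</body></html>"
))

# "div" picks up everything, including the reference-style block
div_text <- demo_page %>%
  html_nodes("div") %>%
  html_text()

# "p" keeps only the paragraph text
p_text <- demo_page %>%
  html_nodes("p") %>%
  html_text()

div_text
p_text
```

For the real articles, this would presumably mean swapping html_nodes("div") for html_nodes("p") in the scraping code and leaving the rest of the pipeline alone.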
## # A tibble: 20 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 dogs 0.0476
## 2 1 dog 0.0376
## 3 1 humans 0.00909
## 4 1 human 0.00838
## 5 1 pet 0.00786
## 6 2 dogs 0.0185
## 7 2 negative 0.00766
## 8 2 theory 0.00686
## 9 2 complex 0.00639
## 10 2 century 0.00634
## 11 3 plants 0.0406
## 12 3 plant 0.0123
## 13 3 real 0.0116
## 14 3 algae 0.00970
## 15 3 called 0.00748
## 16 4 entertainment 0.0365
## 17 4 forms 0.00920
## 18 4 audience 0.00867
## 19 4 music 0.00725
## 20 4 dance 0.00664
Phew! Much better! Still not perfect (note that dog/dogs is at the top of both topic 1 and 2), but we can start to see some patterns between the topics that could match up to the different articles. When we look at the patterns across the different articles, we can see some very clear correlations.
Here we see that the articles on “Dog”, “Plant”, and “Entertainment” are all pretty clearly identified to a single topic. The article on “Number”, however, remains spread between a couple of topics. We can continue to use the gamma data to determine which topic is most commonly assigned to each group and each article, and then identify which specific groups are incorrectly assigned.
#topic most associated with each group index
pgroup_classification <- pgroup_gamma %>%
  group_by(wiki, group) %>%
  top_n(1, gamma) %>%
  ungroup()
#compare to topic most common among wiki
wiki_topics <- pgroup_classification %>%
  count(wiki, topic) %>%
  group_by(wiki) %>%
  top_n(1, n) %>%
  ungroup() %>%
  transmute(consensus = wiki, topic)
#find mismatched groups
mismatch_p <- pgroup_classification %>%
  inner_join(wiki_topics, by = "topic") %>%
  filter(wiki != consensus)
mismatch_p
## # A tibble: 47 x 5
## wiki group topic gamma consensus
## <chr> <int> <int> <dbl> <chr>
## 1 entertainment 8 1 0.807 dog
## 2 entertainment 6 1 0.976 dog
## 3 entertainment 12 1 0.614 dog
## 4 number 10 1 0.999 dog
## 5 dog 4 2 1.000 number
## 6 dog 7 2 0.788 number
## 7 plant 6 2 0.875 number
## 8 dog 9 2 1.000 number
## 9 plant 19 2 0.684 number
## 10 plant 16 2 1.000 number
## # ... with 37 more rows
It's apparent that the “Number” article is causing trouble. Most of the mismatched assignments either come from the “Number” article or match onto the “Number” topic. After going back to the article itself, it's actually not very surprising that this is happening. There isn't a lot of language in the article that is specific to math or numbers, that isn't also likely to appear in the other articles being analyzed. There are also long portions of the article about the history and cultural significance of numbers, which further blurs the “Number” article with the content of the others.
LDA without “Number” Article
Again, I decided to run another LDA, leaving the “Number” article out altogether. I still included only text from paragraph nodes, changing only the text being analyzed and the number of topics to identify (now only three).
## # A tibble: 15 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 dogs 0.0430
## 2 1 dog 0.0255
## 3 1 canis 0.00962
## 4 1 wolves 0.00933
## 5 1 domestic 0.00841
## 6 2 plants 0.0473
## 7 2 plant 0.0152
## 8 2 algae 0.0104
## 9 2 green 0.00818
## 10 2 fungi 0.00564
## 11 3 entertainment 0.0344
## 12 3 dogs 0.0122
## 13 3 dog 0.0117
## 14 3 forms 0.00815
## 15 3 audience 0.00786
It seems like our model is getting better and better, though those pesky dog/dogs are still causing some trouble. When we look at the patterns per article, though, we can see that the model is doing a pretty good job overall.
## # A tibble: 18 x 5
## wiki group topic gamma consensus
## <chr> <int> <int> <dbl> <chr>
## 1 plant 23 1 0.681 dog
## 2 entertainment 6 1 0.508 dog
## 3 entertainment 14 1 0.554 dog
## 4 entertainment 3 1 0.754 dog
## 5 entertainment 13 1 0.795 dog
## 6 plant 8 1 0.994 dog
## 7 entertainment 15 2 0.612 plant
## 8 entertainment 17 2 0.611 plant
## 9 dog 3 2 0.528 plant
## 10 dog 12 3 1.000 entertainment
## 11 dog 10 3 0.527 entertainment
## 12 dog 17 3 1.000 entertainment
## 13 dog 11 3 0.724 entertainment
## 14 dog 14 3 1.000 entertainment
## 15 dog 16 3 1.000 entertainment
## 16 dog 5 3 1.000 entertainment
## 17 dog 6 3 1.000 entertainment
## 18 dog 15 3 1.000 entertainment
The groupings are much cleaner in this model without the “Number” article, and the list of mismatched groups is smaller. We see that most of the errors are between the “Dog” and “Entertainment” articles, though looking at the specific groups doesn't give any really good insight into why these specific groups might be assigned incorrectly. As a dog lover, I suggest that it's likely because dogs are just so entertaining…
We can also look at how the specific words were assigned and identify which words in each group led to an incorrect assignment. And finally, we can create a confusion matrix to visualize the percent of words in each article that were assigned correctly/incorrectly.
#find mismatched words
assignments <- augment(nonum_lda, data = nonum_dtm) %>%
  separate(document, c("wiki", "group"), sep = "_", convert = TRUE) %>%
  inner_join(nonum_topics, by = c(".topic" = "topic"))
missed_assignments <- assignments %>%
  filter(wiki != consensus)
#confusion matrix of word/topic assignment
assignments %>%
  count(wiki, consensus, wt = count) %>%
  group_by(wiki) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, wiki, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "Wiki words were assigned to",
       y = "Wiki words came from",
       fill = "% of assignments")
The confusion matrix makes visually clear what we had already determined from the data. The “Plant” article was very successfully identified, with only a few words incorrectly assigned to/from the article. The “Dog” and “Entertainment” articles were still successfully identified, but with more mistakes between the two. The “Dog” article was the least successfully identified, and the largest group of incorrect assignments came from the “Dog” article being assigned to the “Entertainment” topic.
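The same information can also be read as a plain table of proportions rather than a heatmap. The sketch below uses a made-up toy_assignments table (the counts are invented purely for illustration and are not my results) with the same wiki/consensus/count columns as the assignments data, and spreads the per-article proportions into one column per assigned topic.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the assignments table; all counts are made up
toy_assignments <- tibble(
  wiki      = c("dog", "dog", "plant", "entertainment", "entertainment"),
  consensus = c("dog", "entertainment", "plant", "entertainment", "dog"),
  count     = c(80, 20, 50, 90, 10)
)

# Rows: article the words came from; columns: article they were assigned to
confusion <- toy_assignments %>%
  count(wiki, consensus, wt = count) %>%
  group_by(wiki) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = consensus, values_from = percent, values_fill = 0)

confusion
```

Each row sums to 1, so the diagonal gives the share of an article's words that landed on its own topic.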
Conclusion
The biggest conclusion that I came to during this exercise was that topic-modelling is not as clean as it looked in the textbook. This should probably be obvious, but I came into this project thinking that I would have results as beautifully clean as the ones in the book's “Great Library Heist”. This is actually probably really important for me to realize going forward, and I feel like I need some guidance about what constitutes a good model beyond looking nice.
Even if my model isn't perfect, I do think that I learned a lot doing this exercise. I feel pretty confident in my ability to do basic web-scraping, and I'm figuring out what does and doesn't work for running LDAs, as well as what I can do to try to improve the model. Overall, a very successful week, I think!