Web Scraping and Text Classification Take Two:
One more time with more data!
Last week I made my first venture into the world of webscraping and topic modelling, an endeavor I detail in this blog. I did a lot of manipulation to figure out how to make the LDA model turn out better, using the data cleaning techniques I learned in the world of social science (basically trying to weed out any noise to make the cleanest-looking plots). It didn't even cross my mind that using more “dirty” data would be a better strategy for getting better results. Of course I know that a bigger n will always be better for any statistical test, but I've always worked with finite data sets in the past, for which manual cleaning was the best option I had.
Luckily, my advisor was there to point out that I could scrape as many Wikipedia articles as I wanted, and that more data will garner better results. Thanks, John! So for this week the puzzle was to create a webscraping function that could compile a vector of 50 Wikipedia articles in a usable format. Then, I decided to try out a supervised classification method (SVM) in addition to the unsupervised LDA that I tried out last week.
The Great Wiki-Scrape of 2018
My first task was to figure out how to scrape 50 articles without having to do them all manually (like I did last week).
I used Wikipedia's categories to pick my articles from four disparate categories. I went with “Ballet”, “Astronomy”, “Landforms”, and “Medical Professionals”, and found about a dozen articles from each category. I then created a .csv file with each article's url and associated category, which would be the input for my scraping function. So I imported this file and converted the Url
column into a character vector.
wikiurl <- read.csv("~/Desktop/wikiurl.csv", header = TRUE)
wikiurl <- wikiurl%>%
mutate(Url = as.character(Url))
Now that I had my input ready, I got to puzzling out my scraping function. The biggest challenge was figuring out how to apply the rvest
commands to a vector, when they were designed to run on a single item. I found a great, clean solution to this problem using the map
functions from the purrr
package, but I still had a problem with the last step when using the html_text
function to extract the text out of the html language.
The example that I was goin off of only read a single html node, so there was only a single item to be extracted. However, I was extracting all “div” nodes, so the web scraping functions created one item for each “div” chunk. It turned out that this was actually okay for the rvest
functions, but that it caused problems for map_chr
. The normal output of all of the map
functions is a list, but since I wanted the outputs to end up as a new vector in the dataframe, I used the map_chr
function to force the output into a character vector. But this function could not handle the multiple items per article created by the scraping, so I found a way to collapse all of the strings first, using the paste
command.
I finally had success with the following combination:
wikiscrape <- function(x){
x%>%
mutate(text=(
map(Url, read_html)%>%
map(html_nodes,"div")%>%
map_chr(~paste(html_text(.),collapse = " "))%>%
str_replace_all(pattern="\t", replacement = " ")%>%
str_replace_all(pattern="\n", replacement = " ")%>%
str_replace_all(pattern=" +", replacement = " ")%>%
removeNumbers()))
}
I did still do some tidying to make everything a little cleaner, just taking out some html items (“\n”, “\t”) and then taking out unnecessary spaces and all numbers. Then I could use the beautifully simple command below to scrape all 50 articles at once, resulting in a single character string that encompassed all text in the “div” nodes for each article.
wikitext <- wikiscrape(wikiurl)
LDA Model
Once I had this dataframe put together, it was fairly simple to run the LDA using the same steps that I did last week, indexing with each article and category, instead of using arbitrary groups within each article.
wikitext <- wikitext%>%
mutate(article = row_number())
LDA_wikitext <- wikitext%>%
unite(index, Category, article, sep = "_")%>%
select(index, text)
wikiword <- LDA_wikitext%>%
unnest_tokens(word, text)%>%
anti_join(stop_words)%>%
count(index, word, sort=T)
## Joining, by = "word"
wiki_dtm <- wikiword%>%
cast_dtm(index, word, n)
wiki_lda <- LDA(wiki_dtm, k=4, control=list(seed=3333))
Now to see whether I achieved the expected improvement in results. First we'll look at the top words assigned to each group:
group_topics <- tidy(wiki_lda, matrix="beta")
top_terms <- group_topics%>%
group_by(topic)%>%
top_n(5, beta)%>%
ungroup()%>%
arrange(topic, -beta)
top_terms
## # A tibble: 20 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 astronomy 0.0180
## 2 1 ray 0.0133
## 3 1 edit 0.00729
## 4 1 telescope 0.00718
## 5 1 radio 0.00690
## 6 2 sea 0.0233
## 7 2 ocean 0.0104
## 8 2 gulf 0.00571
## 9 2 isbn 0.00540
## 10 2 water 0.00532
## 11 3 ballet 0.0601
## 12 3 dance 0.0178
## 13 3 de 0.0133
## 14 3 edit 0.00850
## 15 3 dancers 0.00753
## 16 4 nursing 0.0124
## 17 4 health 0.0115
## 18 4 therapy 0.0114
## 19 4 care 0.0108
## 20 4 occupational 0.00812
Looking pretty good! Each topic has words that mostly fit with a single category, with a few oddballs like “isbn” and “edit” thrown in there. So I continued with the analysis, looking at how the spread of the article-topic assignments looked, compared to the original categories.
group_gamma <- tidy(wiki_lda, matrix="gamma")
group_gamma <- group_gamma%>%
separate(document, c("category", "article"), sep="_", convert=TRUE)
group_gamma%>%
mutate(category=reorder(category, gamma*topic))%>%
ggplot(aes(factor(topic), gamma))+
geom_boxplot()+
facet_wrap(~category)
This plot confirms what we saw above. Each of the categories is discretely associated with a single topic. If we make a confusion matrix of the individual word assignments, we can further support this result:
group_classification <- group_gamma%>%
group_by(category, article)%>%
top_n(1, gamma)%>%
ungroup()
wiki_topics <- group_classification%>%
count(category, topic)%>%
group_by(category)%>%
top_n(1, n)%>%
ungroup()%>%
transmute(consensus = category, topic)
word_classification <- augment(wiki_lda, data=wiki_dtm)%>%
separate(document, c("category", "article"), sep="_", convert=T)%>%
inner_join(wiki_topics, by=c(".topic"="topic"))
word_classification%>%
count(category, consensus, wt=count)%>%
group_by(category)%>%
mutate(percent=n/sum(n))%>%
ggplot(aes(consensus, category, fill=percent))+
geom_tile()+
scale_fill_gradient2(high="red")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank())+
labs(x = "Category words were assigned to",
y="Category words came from",
fill="%of assignments")
Once again, we can see that the assigned categories very closely match the expected categories. There was only a small percentage of words from the medical professional and landforms categories that were incorrectly assigned, and otherwise everything was spot on! A definite improvement on last week's analysis, all thanks to a bigger n.
Supervised Learning with an SVM
An SVM (support vector machine) is a supervised learning model that is used in machine learning for classifying items in a dataset. The rTextTools
package is a wrapper on the e1071
package which allows the SVM algorithm to work on textual data. Since this model only allows binary classification, we'll have to run a model on each category separately and then combine the results.
The first step in doing this is to use the spread
function to turn each category into its own logical vector, with a value of 1 if the article came from the category and a value of 0 if not.
svm_wikiword <- wikitext%>%
mutate(yesno=1)%>%
distinct%>%
spread(Category, yesno, fill=0)
Then we'll need to split the dataset into a training set and a testing set. We'll use the training set to create the model and then use the testing set to see how accurate the model is at predicting the categories. To make sure that each set had a random sample from all four categories, I used the sample
function to randomize the data before splitting into the two sets.
set.seed(2222)
svm_wikiword <- svm_wikiword[sample(nrow(svm_wikiword)),]
bound <- floor((nrow(svm_wikiword)/2))
svm_train <- svm_wikiword[1:bound,]
svm_test <- svm_wikiword[(bound+1):nrow(svm_wikiword),]%>%
mutate(row=c(1:25))
The rTextTools
package has a create_matrix
function that will make a dtm out of a dataframe without having to unnest the tokens, remove stop words, or do a per-document count. I'm not sure exactly what the differences are between this function and tidytext
's cast_dtm
on the back end, but it sure is easy for the user! So I did this for the testing set, and then created a “container” for the dtm that included the category assignments for each article (apparently you can use the trainSize
and testSize
arguments to split up the dataset at this point, but I couldn't find any literature about whether there was any randomization to those functions, so I decided to do it manually instead). Once you have this container made, you can run the SVM function.
Since I had to do the same thing for all four categories separately, I'll just show my code for the astronomy category to avoid too much repetition.
svm_train_dtm<-
create_matrix(svm_train["text"])
svmastro_container <- create_container(svm_train_dtm, svm_train$Astronomy,
trainSize = 1:25,
virgin = FALSE)
astromodel <- train_model(svmastro_container, "SVM", kernel = "linear", cost=1)
The next step is to prepare and run the testing set through the model we just made. First I made a dtm of the testing set using the word list from the training set (since any words not in the training set would not be in the model). Then I created a container for the dtm with empty labels and used the classify_model
function to make our predictions.
astropredMatrix <- create_matrix(svm_test, originalMatrix=svm_train_dtm)
predastro_container <- create_container(astropredMatrix, labels = rep(0, 25),
testSize = 1:25, virgin=FALSE)
astro_pred <- classify_model(predastro_container, astromodel)
I then cleaned up the results and joined them with the testing dataset to be able to compare the predictions with the actual assignments.
astro_pred <- astro_pred%>%
mutate(row=c(1:25), astro_predict=SVM_LABEL, astro_prob=SVM_PROB)%>%
select(-SVM_LABEL, -SVM_PROB)
astro_compare <- svm_test%>%
select(row,article, Astronomy)%>%
inner_join(astro_pred)%>%
select(-row)
## Joining, by = "row"
astro_compare
## article Astronomy astro_predict astro_prob
## 1 40 1 1 0.9573721
## 2 3 0 0 0.9385191
## 3 6 0 0 0.9401042
## 4 25 0 0 0.9285038
## 5 21 0 0 0.9978756
## 6 41 1 0 0.5540011
## 7 35 0 0 0.9395527
## 8 12 0 0 0.9858133
## 9 47 1 0 0.7476170
## 10 5 0 0 0.9554466
## 11 32 0 0 0.8879723
## 12 44 1 1 0.9763707
## 13 4 0 0 0.9441213
## 14 33 0 0 0.9249003
## 15 43 1 1 1.0000000
## 16 50 1 1 1.0000000
## 17 10 0 0 0.9554945
## 18 18 0 0 0.9267376
## 19 37 0 0 0.9753106
## 20 45 1 0 0.5341259
## 21 29 0 0 0.8649816
## 22 9 0 0 0.9290784
## 23 23 0 0 0.7585622
## 24 17 0 0 0.9128045
## 25 16 0 0 0.9160170
At first glance, it's not looking too bad! Next I combined it with all the other results to see how our model did overall.
svm_compare <- med_compare%>%
inner_join(land_compare, by = "article")%>%
inner_join(ballet_compare, by= "article")%>%
inner_join(astro_compare, by = "article")
svm_compare_gather <- svm_compare%>%
mutate(category= ifelse(Astronomy==1, "Astronomy",
ifelse(Ballet==1, "Ballet",
ifelse(Landforms==1, "Landforms",
ifelse(MedicalProfessional==1, "Med",
"Not Assigned")))),
prediction= ifelse(med_predict==1, "Med",
ifelse(ballet_predict==1, "Ballet",
ifelse(land_predict==1, "Landforms",
ifelse(astro_predict==1, "Astronomy",
"Not Assigned")))))%>%
select(article, category, prediction)
svm_compare_gather
## article category prediction
## 1 40 Astronomy Astronomy
## 2 3 Ballet Ballet
## 3 6 Ballet Ballet
## 4 25 Med Med
## 5 21 Landforms Landforms
## 6 41 Astronomy Not Assigned
## 7 35 Med Med
## 8 12 Ballet Ballet
## 9 47 Astronomy Not Assigned
## 10 5 Ballet Ballet
## 11 32 Med Med
## 12 44 Astronomy Astronomy
## 13 4 Ballet Ballet
## 14 33 Med Med
## 15 43 Astronomy Astronomy
## 16 50 Astronomy Astronomy
## 17 10 Ballet Ballet
## 18 18 Landforms Landforms
## 19 37 Med Med
## 20 45 Astronomy Not Assigned
## 21 29 Med Med
## 22 9 Ballet Not Assigned
## 23 23 Landforms Landforms
## 24 17 Landforms Landforms
## 25 16 Landforms Landforms
svm_compare_gather%>%
count(category, prediction)%>%
group_by(category)%>%
mutate(percent=n/sum(n))%>%
ggplot(aes(prediction, category, fill=percent))+
geom_tile()+
scale_fill_gradient2(high="red")+
theme_minimal()+
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank())+
labs(x = "Category articles were assigned to",
y="Category articles came from",
fill="Percent of assignments")
Looking pretty good, indeed! There were actually no incorrect assignments, but several articles that were not assigned to any topic. We can see that the medical professional and landforms articles were predicted perfectly, but some of the ballet and astronomy articles were not assigned to their categories.
The SVM results also include the probability associated with each assignment in the model, for both the 1's and the 0's. We can use this statistic to report the probability for each article for each topic (using the inverse probablity for 0's). Then we can visualize the assignments on a boxplot like we did for the LDA.
svm_box <- svm_compare%>%
mutate(astro_p=ifelse(astro_predict==1, astro_prob, (1-astro_prob)),
med_p=ifelse(med_predict==1, med_prob, (1-med_prob)),
land_p=ifelse(land_predict==1, land_prob, (1-land_prob)),
ballet_p=ifelse(ballet_predict==1, ballet_prob, (1-ballet_prob)),
category= ifelse(Astronomy==1, "Astronomy",
ifelse(Ballet==1, "Ballet",
ifelse(Landforms==1, "Landforms",
ifelse(MedicalProfessional==1, "Med", "Not Assigned")))))%>%
gather(cat_p, prob, astro_p, med_p, land_p, ballet_p)%>%
select(category,article, cat_p, prob)
svm_box%>%
ggplot(aes(factor(cat_p), prob))+
geom_boxplot()+
facet_wrap(~category)
This plot makes me so happy. Even though it's not absolutely perfect, each group of predictions is discrete and associated with the correct category, which is exactly what I wanted to see. Success!
Using the lesson that I (re-)learned last week, I would guess that the model would get even better with a bigger dataset. Splitting the dataset in two (for training and testing) was particularly painful, since that meant that I really only had half the data to work with, but it seemed to turn out ok anyway. Feeling very accomplished this week, indeed.