Saturday, June 9, 2018

Semantic Search in R: Latent Semantic Analysis

I've been doing a lot of research about search engines for a summer project that I'm working on, trying to improve a search function that is currently just a basic keyword matching query. While I've found a few examples of doing tf- (term frequency) or tf*idf- (frequency weighted for document uniqueness) weighted searches in R, I've been surprised not to find any straightforward examples of doing topic-based (semantic) search in R. So I decided to try my hand at making one myself, combining the LSA topic model methods from this video with the vector space model search method from this tutorial.

Overall I was pretty pleased with the relevance of the results that I found in this exercise, though applying it to a sample of my real (certainly less clean) data was not quite as successful. I will be interested to see how it does once I train the model on the whole dataset.

Associated Press Data

For this exercise, I'll be using the Associated Press dataset, which includes the text from 2246 articles. This data is already in a document-term matrix (DTM) which both simplifies and complicates our process. Normally, we would be creating a DTM from a table of documents (our corpus) which had (at minimum) some sort of identifier/key and the original text for each document. The DTM is set up so that there is one row for each document, and the columns consist of a “dictionary” of all the terms from the entire corpus. Then each cell corresponds to a frequency count for each word within each document.

Creating the DTM can be time-consuming, and both of the models that I'll be working with require the data to be in this (or similar) form, so in this way it is a time-saver. However, in order to interpret the relevance of our search results I would normally want to look at the article title or a snippet of the original text itself, neither of which we can get from this dataset. So instead I created a string with the top 10 most frequent words for each document.
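
For reference, if you were starting from raw text instead, a minimal sketch of building a DTM with tidytext might look something like this (the my_docs table here is made up purely for illustration):

# Hypothetical example of building a DTM from raw text with tidytext;
# 'my_docs' is invented for illustration, since the AP data is already a DTM
library(dplyr)
library(tidytext)

my_docs <- tibble(doc_id = 1:2,
                  text = c("european politics and trade",
                           "farm subsidies in the european community"))

my_dtm <- my_docs %>%
  unnest_tokens(term, text) %>%   # one row per word
  count(doc_id, term) %>%         # word frequencies per document
  cast_dtm(doc_id, term, n)       # back to a document-term matrix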

# Load the packages and the data set, then "tidy" the DTM into a dataframe with one term per row
library(topicmodels)   # provides the AssociatedPress DTM
library(tidytext)      # tidy() method for document-term matrices
library(dplyr)

data("AssociatedPress")
ap_data <- AssociatedPress
ap_tidy <- tidy(ap_data)

# Pull the top ten terms for each document and combine them into one string per document
ap_topterms <- ap_tidy %>%
  group_by(document) %>%
  top_n(10, count) %>%
  summarise(top_terms = toString(term))

head(ap_topterms)
## # A tibble: 6 x 2
##   document top_terms                                                      
##      <int> <chr>                                                          
## 1        1 boy, boys, classroom, gun, guns, jammed, marino, police, schoo~
## 2        2 bechtel, company, foreign, israel, made, memo, offer, official~
## 3        3 back, bloomberg, gunman, i, jewelry, liberace, man, mrs, museu~
## 4        4 ago, american, died, first, h, history, invasion, military, mi~
## 5        5 away, begins, bird, born, brings, charge, coming, continues, d~
## 6        6 administration, american, economic, general, i, jackson, lette~

We can see from these lists that the dataset has not been stemmed or lemmatized (which would combine boy/boys and gun/guns into single terms). There are mixed opinions about whether or not this pre-processing step helps return better results, the main complaint being that the process is language-specific and so cannot be applied to multi-lingual data. In practice, you should decide for yourself whether your data is a good fit for stemming and then try running the model both ways to determine whether it helps or hinders. For this exercise, I'll leave the data as is.
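
If you did want to try stemming, a minimal sketch using SnowballC on the tidy data might look like this (not run here, since I'm leaving the data unstemmed):

# Sketch of stemming the tidied terms with SnowballC (assumes ap_tidy from above);
# counts for terms that collapse to the same stem are summed back together
library(SnowballC)

ap_stemmed <- ap_tidy %>%
  mutate(term = wordStem(term, language = "english")) %>%
  group_by(document, term) %>%
  summarise(count = sum(count)) %>%
  ungroup()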

Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a topic-modelling technique that relies on using tf or tf*idf values and matrix math to reduce the dimensions of a dataset by grouping similar items together. The name more or less explains the goal of using this technique, which is to uncover hidden (latent) content-based (semantic) topics in a collection of text. These hidden topics can then be used to identify relationships between documents more accurately than by using term frequency-based relations alone, and these topic-based relationships are what we will be using for our search engine.

I'll be using the RSpectra package to perform the LSA, because this study found that its algorithm produced somewhat better results than the alternatives. However, there are several other packages out there that can do the same thing, and I encourage you to look into the other options and compare results based on your specific data.

The RSpectra functions require that our data be in a slightly different form, with each row representing one term and each column representing one document. They also require the data to be in a plain matrix rather than a DTM (or TDM) object. Converting to a regular matrix takes up more memory (our matrix goes from 5.2 Mb to 180 Mb), which is not a problem for a small-ish dataset like the one used in this exercise, but is likely to be an issue for larger data. There are apparently some methods of dealing with this (including keeping the matrix sparse, as sketched after the code below, or using a different package altogether), but I have not yet needed to cross that bridge. We'll use the following code to transform our DTM into the shape that we need:

library(tm)  # for as.TermDocumentMatrix()
ap_tdm <- as.TermDocumentMatrix(ap_data)
ap_matrix <- as.matrix(ap_tdm)
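
For what it's worth, here is a hedged sketch of the sparse-matrix route mentioned above, using the Matrix package. I haven't needed it for this dataset, but as I understand it, RSpectra::svds can accept a sparse dgCMatrix directly, which avoids the jump from 5.2 Mb to 180 Mb:

# Sketch (not needed here): keep the term-document matrix sparse instead of dense
library(Matrix)

ap_sparse <- sparseMatrix(i = ap_tdm$i, j = ap_tdm$j, x = ap_tdm$v,
                          dims = c(ap_tdm$nrow, ap_tdm$ncol),
                          dimnames = ap_tdm$dimnames)

# ap_sparse could then be passed to svds() below in place of ap_matrix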

Most of the power behind LSA comes from a mathematical process called singular value decomposition (SVD). The Wikipedia page on SVD has a pretty good description of the nitty-gritty math that is involved, but basically the SVD breaks your matrix into three new matrices which break out and clean up some of the noisiness in the original matrix's correlations. Then the SVD uses the values in the new matrices (corresponding to the strength of the correlations) to reduce the matrix dimensions to a user-specified number of "topics".
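
To make the decomposition a little more concrete, here is a tiny toy example using base R's svd(); the matrix is made up and has nothing to do with the AP data:

# Toy illustration of SVD: a small made-up matrix is broken into u, d (the
# singular values), and v, and u %*% diag(d) %*% t(v) reconstructs it
toy <- matrix(c(1, 0, 2,
                0, 3, 1,
                4, 0, 0), nrow = 3, byrow = TRUE)
toy_svd <- svd(toy)

round(toy_svd$u %*% diag(toy_svd$d) %*% t(toy_svd$v), 10)  # recovers 'toy'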

There seems to be some consensus on 100-300 as a good range for English documents. The model will ultimately continue to become more accurate as the number of topics increases, but there are diminishing returns on that accuracy, and you lose the benefit of the reduced dimensions as the number of topics approaches the number of terms. I'll use 200 topics in this exercise.

library(RSpectra)
ap_svd <- svds(ap_matrix, k = 200)
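
One rough way to sanity-check the choice of k is to eyeball how quickly the singular values drop off; a quick scree-style plot like the one below can help:

# Plot the singular values returned by svds() and look for where they level off
plot(ap_svd$d, type = "l", xlab = "Topic number", ylab = "Singular value")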

The RSpectra::svds function returns a list containing the left singular vectors (labelled u), the singular values (labelled d), and the reduced-dimension document matrix (labelled v). The matrix v houses the new values that we will use in our search engine, while u and d can be used to project new data into the same LSA model space.

ap_lsa <- ap_svd$v

# transpose (or flip axes) the u matrix
u_transpose <- t(ap_svd$u)

# take the reciprocals of the singular values (the d, or "sigma", values)
sigma_inverse <- 1/ap_svd$d

Our next step is to create a query and insert it into the same model space as the rest of the data. To do this, we must first create another DTM using the "dictionary" (list of terms) from our corpus. Then we use the u_transpose and sigma_inverse values to calculate the topic values for the query.

query <- "european politics"

library(RTextTools)  # for create_matrix()

#trace("create_matrix", edit=TRUE) *see note below
query_matrix <- create_matrix(query, originalMatrix = ap_data)
query_matrix <- as.matrix(query_matrix)

* The RTextTools::create_matrix function is fantastic and takes a lot of the headache out of creating a DTM, especially when using another DTM's dictionary. However, some of the code wasn't updated correctly when the package was last updated, so you'll receive an error if you use this function without fixing the code. If you run the line that is commented out above and then change "Acronym" to "acronym" in line 42 of the function, this fixes the error. The error can also be avoided by installing the dev version of the package, using devtools::install_github("timjurka/RTextTools/RTextTools")
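
If you would rather sidestep RTextTools entirely, one alternative is to build the query DTM directly with tm, using the corpus dictionary so that the query's columns line up with ap_data. This is just a sketch, assuming tm's dictionary argument behaves as documented:

# Sketch of an alternative query DTM built with tm (already loaded above),
# restricting the vocabulary to the terms in the original corpus
query_corpus <- VCorpus(VectorSource(query))
query_dtm <- DocumentTermMatrix(query_corpus,
                                control = list(dictionary = Terms(ap_data)))
query_matrix <- as.matrix(query_dtm)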

Once we project the query into the LSA space (in the code below), we'll have a query vector with one value for each of our 200 topics, which allows us to calculate the cosine similarity between our query and every document in the corpus. You can find a detailed explanation of cosine similarity in this article, but the idea is that each document (including our query) can be represented as a vector made up of the values we've computed for each "topic". The angle between these vectors reflects the topic-based similarity between documents, so vectors that point in similar directions are more likely to be related than vectors that point in very different directions. In other words, as the cosine similarity between two documents approaches 1, the documents become more closely related.

# Insert the query into the topic model space
query_lsa <- sigma_inverse * u_transpose %*% t(query_matrix)

# Calculate the similarity between the query and each document in the topic space
query_scores <- t(query_lsa) %*% t(ap_lsa)

# Create dataframe with the article number and the similarity score
query_match <- data.frame(article = 1:nrow(ap_lsa),
                          score = as.vector(query_scores))
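
One note: the matrix product above gives raw (unnormalized) scores in the topic space. If you want values scaled to the usual cosine-similarity range, a quick sketch of the normalization, using the objects we've already defined, looks like this (ranking by these scaled values can differ slightly from ranking by the raw scores):

# Optional: scale the raw scores to true cosine similarities by dividing by
# the norms of the query vector and of each document's topic vector
query_norm <- sqrt(sum(query_lsa^2))
doc_norms <- sqrt(rowSums(ap_lsa^2))
cosine_scores <- as.vector(query_scores) / (query_norm * doc_norms)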

Now we have a dataframe with a “match score” between our query and every document in our corpus. We can use these scores to rank our results by the best matches, and then use the top ten document terms that we created above to assess the relevance of the results.

query_match <- query_match %>%
  arrange(-score)

top_match <- query_match %>%
  top_n(10, score)

top_match <- top_match %>%
  left_join(ap_topterms, by = c("article" = "document"))

top_match[,3]
##  [1] "aid, castro, colony, credit, cuba, cubans, cubas, european, last, latin, million, pressure, refuge, sought, soviet, spain, spains, spanish, union, year"                                                                                                                                                                           
##  [2] "ec, european, foreign, member, membership, neutrality, opposition, percent, policy, political, sweden, swedish"                                                                                                                                                                                                                    
##  [3] "britain, currency, european, major, mrs, soviet, summit, thatcher, thursday, union"                                                                                                                                                                                                                                                
##  [4] "brussels, community, ec, major, position, round, states, subsidies, talks, trade, united, uruguay, yeutter"                                                                                                                                                                                                                        
##  [5] "alliance, briefings, contacts, countries, diplomats, hungary, nato, pact, soviet, warsaw"                                                                                                                                                                                                                                          
##  [6] "aid, assistance, bush, leaders, soviet, summit, sununu, west, western, world"                                                                                                                                                                                                                                                      
##  [7] "ec, european, foreign, luxembourg, minister, monetary, presidency, talks, trade, union"                                                                                                                                                                                                                                            
##  [8] "broadcasts, europe, free, jamming, liberty, radio, soviet, stopped, union, voa"                                                                                                                                                                                                                                                    
##  [9] "community, conservative, europe, european, first, foreign, france, friday, friendship, future, german, germans, germany, governments, hurd, interview, leaving, members, minister, month, move, mrs, night, party, policy, political, prime, ridley, secretary, take, thatcher, thatchers, told, trade, united, views, west, worst"
## [10] "agricultural, american, ec, farm, foreign, japan, market, million, percent, products, report, trade, year"

And there we are! Out of the top ten, only one result (number ten) doesn't have “Europe”, “European”, or the name of a European country in its top terms. And even this result still seems to be related to foreign policy/politics. This is certainly not the cleanest search engine in existence, but as a proof of concept I think we've gotten pretty good results.

Further Exploration

Although I couldn't find any examples of people creating this kind of topic-based search engine in R, there seem to be a lot more people doing it in Python, and as I understand it, the gensim package for Python has some really powerful topic modeling capabilities. Using Python, it should also be possible to integrate a topic-model approach with an existing open-source search engine (like Solr, ElasticSearch, or Whoosh), which is what I plan to do as the ultimate product of this summer project.

I did experiment with using LDA (latent Dirichlet allocation) in the same manner as above to create a different semantic search model, but was disappointed to find that the results were actually worse than they were for the LSA model. This surprised me, since LDA is generally considered the more powerful topic model, but my theory is that the issue lies in using such a short query. The use case I'm doing this research for will most likely involve very short 2-3 word queries, in which case LDA may not be a good option. I will play with it some more and plan to write another post about the methods if I can get better results.

I'm also just starting to look into using Google's word2vec algorithm which is very hot in the information retrieval world right now. There do appear to be some packages available in R to do both word2vec and doc2vec, so I hope to experiment with using these as well.
