Thursday, December 13, 2018

Google PageRank

Re-creating the original search ranking algorithm


TL;DR

I built a webcrawler to create a mini-internet centered around my university's homepage (www.umt.edu). Then I re-created Larry Page and Sergey Brin's original ranking algorithm to rank all of the pages in the network. Although the webcrawler got a little sidetracked and pulled in a disproportionate number of pages from Twitter, the ranking algorithm succeeded in ordering the pages in the network based on the number and rank of other pages linking to them.


Disclaimer: Most of this blog is pulled from a fairly formal final paper that I wrote for school, so the style is a little different than usual.


Everyone knows how Google works, right?

When performing a search on the internet, there are two main components to how the search program displays the results. The first is a text-based component, which may use any one of many similarity algorithms to find results that are similar in content to the user's search query (more about the text part in this blog and this other blog that I wrote this summer). The second component is a ranking algorithm that determines the order in which results are displayed. For example, when a user types the query “machine learning” in Google, 813 million potential matches are returned. However, the first results displayed are those that Google determined to be highest ranking within all of those potential matches. In part, this ranking is determined by the Google PageRank algorithm.


Google PageRank was created by Google founders Sergey Brin and Larry Page while they were still PhD candidates at Stanford (Mueller et al., 2017). Somewhat surprisingly, the algorithm is not based upon user behavior, but instead upon the number of links to and from any given page. The theory is that websites are more likely to include links to high quality pages, and so the pages with a large number of “inlinks” (links from other pages) should be ranked higher in the list of results. Moreover, an “inlink” coming from another highly ranked page should be worth more than one from a low-ranking page (Eldén, 2007). As discussed below, methods from linear algebra were applied to the problem to re-create a link-based ranking algorithm.


How does it work, then?

The Google PageRank algorithm is based almost entirely on concepts from linear algebra. In this model, each page in a network is represented as a vector of navigation probabilities based on the pages linking to/from that page. For example, consider a very simple network of six pages, A through F (Figure 1).

Figure 1



Each arrow represents a link from one page to another. Based on this network, if a user starts on page A and navigates the network only through clicking links on the pages, she has a probability of 1/3 of navigating to page B, D, or E. Calculating these probabilities for every page in the network results in the matrix seen in Figure 2.
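
Written as a formula, with \(n_j\) denoting the number of outlinks on page \(j\), the entry in row \(i\) and column \(j\) of this link matrix is \(L_{ij} = 1/n_j\) if page \(j\) links to page \(i\), and \(L_{ij} = 0\) otherwise. Each column therefore describes where a user currently on that page can go next, which is the convention used in the rest of this post (and matches the transposed array in the code further down).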


Figure 2


This matrix can be used to imagine any number of network browsers, following links from page to page. With the sample network here, however, all of the users will eventually get “stuck” on page F, since this page does not have any outlinks. But instead of getting stuck, a user would more likely navigate from that page directly to any other page in the network, giving each page in that vector an equal probability of 1/n (in this example, column F would be filled with 1/6 in every row).


Along the same lines, the probability matrix should also account for some chance that a user navigates directly to any other page at random while browsing the network. This is often termed the teleportation or damping factor (d), and it is applied so that a user has a probability of d of following the links on the pages and a probability of 1-d of navigating at random within the network. Equation 1 shows how this is applied to the network probability matrix L, where t is an n x n matrix with every entry equal to \(\frac{(1-d)}{n}\). A common value for this teleportation factor is 0.85, and Figure 3 shows matrix L after applying this factor.

\(M = dL + t\)            (Equation 1)


Figure 3



To produce a vector of ranks for every page in the network, based both on this matrix of probabilities and on the ranks of the pages themselves, consider Equation 2, in which r is the rank vector and M is the probability matrix.


\(r=Mr\)            (Equation 2)


Since r needs to be the same on both sides of the equation, this can be seen as an eigenvector problem. An eigenvector is a vector which does not change direction when a transformation matrix is applied to it, and an eigenvector with an eigenvalue of one is a vector that does not change direction or length when the transformation matrix is applied. Thus, the goal of the PageRank algorithm is to find an eigenvector with an eigenvalue of one for the “transformation matrix” that we create based on the linking probabilities determined above.
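
A brief aside on why such an eigenvector is guaranteed to exist: every column of \(M\) sums to one (it is a column-stochastic matrix), so one is always an eigenvalue of \(M\), and because the teleportation factor makes every entry strictly positive, the Perron-Frobenius theorem guarantees that the corresponding eigenvector is unique (up to scaling) and has all positive entries. In other words, the PageRank vector is the solution to

\(Mr = r, \quad \sum_i r_i = 1, \quad r_i > 0\)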


For a small network, such as the six-page network in this example, it would not be too difficult to calculate this eigenvector by hand. However, for a much larger network (the whole internet, for example), hand calculation is not practical or even possible. Returning to Equation 2, the PageRank vector r can instead be found by assigning an arbitrary starting value to r and then iteratively multiplying r by the probability matrix M until it no longer changes in value (Cooper, 2017). This method is often called Power Iteration, and thus the equation can be rewritten as seen in Equation 3.


\(r^{i+1}=Mr^i\)            (Equation 3)


A common starting rank vector assumes that all pages are equal in rank, so for this example network the starting vector r would be [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]. Iteratively multiplying this vector by the probability matrix M, it takes 8 iterations before \(r^i\) and \(r^{i+1}\) converge. This results in the rank vector [0.227, 0.162, 0.089, 0.162, 0.153, 0.208]. Comparing this to the sample network in Figure 1 and the original probability matrix in Figure 2, it makes sense that pages A and F would be the highest ranking in the network. Page A has two inlinks, one of which comes from a page with a link probability of one, and page F is the only page with three inlinks. On the other hand, page C has only one inlink, so it makes sense that it is the lowest ranking page in the network.
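
To make the procedure concrete, here is a minimal sketch in Python that builds the damped transition matrix from an adjacency list and runs the power iteration until the rank vector stops changing. The pagerank helper and the toy adjacency dictionary at the bottom are purely illustrative (the made-up links do not reproduce the exact network in Figure 1), not part of the original code.

import numpy as np

def pagerank(adjacency, d=0.85, tol=1e-5):
    """Power iteration on a damped, column-stochastic link matrix.

    adjacency maps each page to the list of pages it links to; dangling
    pages (no outlinks) are treated as linking to every page equally.
    """
    pages = sorted(adjacency)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}

    # column j holds the navigation probabilities out of page j
    L = np.zeros((n, n))
    for page, outlinks in adjacency.items():
        if outlinks:
            for target in outlinks:
                L[idx[target], idx[page]] = 1 / len(outlinks)
        else:
            L[:, idx[page]] = 1 / n  # dangling page: jump anywhere

    M = d * L + (1 - d) / n * np.ones((n, n))  # Equation 1

    r = np.ones(n) / n  # start with equal rank for every page
    while True:
        r_next = M @ r
        if np.linalg.norm(r_next - r) < tol:
            return dict(zip(pages, r_next))
        r = r_next

# a made-up six-page network (not the exact links from Figure 1)
toy = {"A": ["B", "D", "E"], "B": ["F"], "C": ["A"],
       "D": ["F"], "E": ["F"], "F": []}
print(pagerank(toy))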


Let's try it on a real(ish) network!


The first step to creating a functioning model of the PageRank algorithm is to create a model network for the algorithm to run on. This was done by creating a webcrawler using the python packages requests and lxml.html. This webcrawler is fed a starting url, for which it pulls the page's html code and collects all of the page's outlinks. The webcrawler then iterates over each of these outlink urls, pulling each of these pages' outlinks as well. For the purpose of this exercise, a limit was set for the webcrawler to stop after collecting outlinks from 502 unique urls. The complete code for the webcrawler can be found on my Github.
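
The full crawler is on Github; as a rough idea of the approach, a minimal sketch using requests and lxml.html might look like the following (the crawl function, its breadth-first ordering, and the bare-bones error handling are simplified stand-ins rather than the original implementation):

import requests
import lxml.html

def crawl(start_url, max_pages=502):
    """Collect each visited page's outlinks until max_pages unique
    urls have been crawled."""
    outlinks = {}
    queue = [start_url]
    while queue and len(outlinks) < max_pages:
        url = queue.pop(0)
        if url in outlinks:
            continue
        try:
            resp = requests.get(url, timeout=10)
            doc = lxml.html.fromstring(resp.content)
            doc.make_links_absolute(url)  # resolve relative hrefs
            links = [h for h in doc.xpath('//a/@href') if h.startswith("http")]
        except Exception:
            links = []  # skip pages that can't be fetched or parsed
        outlinks[url] = links
        queue.extend(links)
    return outlinks

outlinks = crawl("http://www.umt.edu")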


The result of running the webcrawler is a python dictionary with 502 url keys, each with a list of outlink urls as its dictionary value. Starting with www.umt.edu and pulling the outlinks for 502 pages resulted in a list of almost 5000 unique urls. Since the PageRank algorithm requires a square matrix (i.e. every crawled page needs to be on both the x and y axes of the matrix), each list of outlinks needed to be filtered down to urls that had actually been crawled. These lists were then broken into link-outlink pairs and added to a 2-column pandas.DataFrame.


import pandas as pd

#create df of source-outlink pairs, filtering for links that were crawled
link_df = pd.DataFrame(columns = ['source', 'outlink'])
for link in outlinks:
    for outlink in outlinks[link]:
        if outlink in outlinks:
            pair = pd.DataFrame([[link, outlink]], columns = ['source', 'outlink'])
            link_df = pd.concat([link_df, pair], ignore_index = True)

Next, the pandas.crosstab function was used to turn the outlink column into a dummy variable, with a value of one if the source page had an outlink to that url, and a zero if not. Because the dataframe only included positive link-outlink pairs, any pages that did not have any outlinks needed to be added separately as a row filled with zeros.


#create frequency df of dummy variables for each combination
sparse_links = pd.crosstab(index = link_df['source'], columns = link_df['outlink'])

#include pages crawled with no outlinks, or with no outlinks to other pages crawled
for link in outlinks:
    if link not in sparse_links.index:
        new_row = pd.DataFrame([[0]*502], index = [link], columns = sparse_links.columns)
        sparse_links = pd.concat([sparse_links, new_row])

To calculate the navigation probability (before adding the teleportation factor) for each row, positive outlink entries (currently with a value of one) were divided by the total number of outlinks on that page, and rows for pages without any outlinks were filled with equal probabilities for all pages (1/502).


# replace row values with link probability (ie. 1/number of outlinks from page)
for link in sparse_links.index:
    try:
        prob = 1/len([i for i in outlinks[link] if i in sparse_links.index])
        sparse_links[(sparse_links.index == link)] *= prob
    #fill empty rows with equal probabilities
    except ZeroDivisionError:
        sparse_links[(sparse_links.index == link)] = [[1/502]*502]

The final step in creating the network's transformation matrix of navigation probabilities is to include the teleportation (or damping) factor. For this, the dataframe was converted to a numpy.array, first saving the row labels (which, after reordering, also give the column order) in a dictionary for later use. Equation 1 was then used to apply a damping factor of 0.85 to the entire matrix.


import numpy as np

#reorder columns to be in the same order as rows
row_order = sparse_links.index.tolist()
sparse_links = sparse_links[row_order]

#convert dataframe to np array,
#saving the row labels (row and column order are now identical) in a dictionary
sources = {}
for idx, source in enumerate(sparse_links.index):
    sources[idx] = source

# convert to array with sources as columns (ie. all columns sum to 1)
linkArray = sparse_links.values.T

#add teleportation operation
d = 0.85
withTeleport = d * linkArray + (1-d)/502 * np.ones([502, 502])

Now that the transformation matrix is formatted correctly, it can be applied to a starting rank vector (with equal rank for all pages) using the power iteration method. This method continues to loop, iteratively multiplying the rank vector by the transformation matrix until the values converge. Because the ranks start out so small (1/502 ≈ 0.001992), the measure for convergence is a change of less than 0.00001 (about 0.5% of the starting rank).


#use power-iteration method to find pagerank eigenvector with eigenvalue of 1
# Sets up starting pagerank vector with equal rank for all pages
r = np.ones(502) / 502 
lastR = r

# calculate dot-product of transformation matrix (computed by link 
# probabilities and teleportation operation) and pagerank vector r
r = withTeleport @ r
i = 0 #count the number of iterations until convergence

#break loop once pagerank vector changes less than 0.00001
while np.linalg.norm(lastR - r) > 0.00001 :
    lastR = r
    r = withTeleport @ r
    i += 1
print(str(i) + " iterations to convergence.")


How'd we do?


With this relatively small network of 502 websites, the PageRank algorithm shown above needed to iterate 25 times before reaching convergence at a threshold of 0.00001. Since the resulting rank vector is an unlabelled and unsorted vector of 502 numbers, it needed to be matched back up with the page urls and then sorted from high to low.


# match pagerank vector back up with source url labels
rankedUrls = []
for i in sources:
    pair = (sources[i], r[i])
    rankedUrls.append(pair)

# sort urls by pagerank to find highest ranking pages
sortedRank = sorted(rankedUrls, key = lambda tup: tup[1], reverse = True)

To explore the resulting ranks further, the number of inlinks to each page was calculated. Then the page url, rank, and number of inlinks were exported to a .txt file and loaded into R for visualization.


#count inlinks
inlinks = {}
for page in outlinks:
    counter = 0
    for other_page in outlinks:
        if page in outlinks[other_page]:
            counter += 1
    inlinks[page] = counter

#save url, rank, and number of inlinks to txt file for import into R for visualization
with open("sortedPageRank.txt", "w", encoding = "utf-8") as file:
    headers = ["url", "rank", "num_in"]
    file.write("\t".join(headers)+"\n")
    for pair in sortedRank:
        url, rank = pair
        num_in = inlinks[url]
        row = [url, str(rank), str(num_in)]
        file.write("\t".join(row)+"\n")

When looking at the density plot of PageRanks, it's apparent that the distribution is strongly right-skewed, with most of the pages having very low ranks and only a few with higher scores spread out along a long right tail. Although not identical, the similarity between the density plot of PageRanks and that of inlinks is unmistakable, an expected result of the relationship between inlinks and rank in the PageRank algorithm.


[Plot: density of PageRank scores and of inlink counts across the network]


The tail in the distribution above has a few clumps of pages with similar higher scores. This might suggest that there is some insularity among these higher-ranking pages, meaning that these pages all link to each other, further boosting their ranks. Filtering for the ten highest-ranking pages confirms that this is indeed the case. All of the top ten pages are various Twitter help pages, and going back to the outlink dictionary in Python shows that the inlinks for these pages are all coming from other highly-ranked Twitter help pages. Zooming out a little further in the data, almost all of the 50 highest ranking pages are Twitter help pages.


Rank Website PageRank Inlinks
1 https://twitter.com/privacy 0.0222552 208
2 https://twittercommunity.com/ 0.0175635 214
3 https://twitter.com/tos 0.0154999 208
4 https://twitter.com/signup 0.0134464 30
5 https://blog.twitter.com/developer 0.0122242 103
6 https://support.twitter.com/articles/20170520 0.0121100 103
7 https://business.twitter.com/ 0.0116167 115
8 https://help.twitter.com/en/rules-and-policies/twitter-cookies 0.0114359 212
9 https://marketing.twitter.com/na/en/insights.html 0.0109402 185
10 https://marketing.twitter.com/na/en/success-stories.html 0.0109402 185

Removing any Twitter pages changes the density curve a bit, but the general character remains the same. The highest ranking pages are now missing, and the group of pages that were clumped together at the base of the highest peak on the left are now spread out. Once again, the similarity between the density plots for PageRank and Inlink number is obvious. Also, the data continues to show the same clumping in the tail seen above.


[Plot: density of PageRank scores and of inlink counts with Twitter pages removed]


With the Twitter pages removed, the ten highest ranking pages are now more in line with what one would expect from a network centered around the starting page http://www.umt.edu. Several variant spellings of that starting page are all ranked highly, and the other highly ranked pages (from olark.com) are all related to the university's chatbot, which is present on many of the university's main landing pages. Filtering further to remove the chatbot pages, the top ranking pages are all major pages on the university site or social media accounts associated with the university.


Rank Website PageRank Inlinks
43 http://my.umt.edu/ 0.0065134 57
44 http://www.umt.edu/ 0.0059865 13
48 https://developers.google.com/analytics/devguides/collect... 0.0056231 15
58 https://www.facebook.com/umontana/ 0.0043125 16
62 https://www.youtube.com/user/UniversityOfMontana 0.0039308 18
63 https://apps.umt.edu/search/atoz 0.0039060 60
65 https://www.instagram.com/umontana 0.0037964 15
67 http://map.umt.edu/ 0.0036861 13
68 https://map.umt.edu/ 0.0035020 60
69 https://www.umt.edu/accessibility/ 0.0035020 60

Looking at the lowest ranking pages instead, they all have only one or two inlinks and many are unique redirect urls. Following the redirects, they mostly lead to various pages for the registrar. It would be interesting to know how these redirect pages would rank if they were grouped instead with the pages that they redirected to.


Rank Website PageRank Inlinks
502 https://twitter.com/login?redirect_after_login=htt... 0.0005331 1
501 https://twitter.com/login?redirect_after_login=htt... 0.0005345 1
500 https://twitter.com/login?redirect_after_login=htt... 0.0005428 2
488 http://t.co/119f4k96qp 0.0005475 1
489 https://t.co/2OKLCC0dlF 0.0005475 1
490 https://pbs.twimg.com/profile_images/5025606486900... 0.0005475 1
491 https://pbs.twimg.com/profile_images/5025606486900... 0.0005475 1
492 https://t.co/t5nWoRDTgx 0.0005475 1
493 https://t.co/csfs4U6oZk 0.0005475 1
494 https://t.co/hf5zHbkoHL 0.0005475 1

What's next?


Although there were issues with the way that this project was designed which affected the resulting PageRanks, the results were indicative of a successful algorithm overall. The highest ranking pages all had many inlinks, mostly from other high-ranking pages, and the lowest ranking pages had only one or two inlinks. The clumping of pages on the density plots demonstrates that the rank of a page is determined not only by the number of inlinks that it has, but also by the rank of the pages which link to it. This clumping also illustrates an interesting characteristic of networks in general: a network often consists of pockets of highly-connected entities, which are less strongly connected to other pockets in the network.


Changes to the project design for future work mainly center around the webcrawler. Since the number of pages crawled was limited, the network created for this project was incomplete and biased toward a few pages that were crawled early on in the process, namely the various Twitter help pages. This could be fixed either by expanding the number of pages collected to form the network, or by creating a more methodical way to choose which pages get crawled first (as opposed to the more or less arbitrary order that is currently built into the code). Alternatively, the webcrawler could be designed to only crawl links to certain domains (i.e. only looking at umt.edu sites), though this would also give a fairly incomplete picture of the link network.


Another change would be to normalize the urls in the network before running the PageRank algorithm. As built by the webcrawler, the network has duplicate pages with minor spelling differences (i.e. http://www.umt.edu vs http://www.umt.edu/ vs https://www.umt.edu). Similarly, many of the lowest ranking pages were redirects to pages that were also in the network. If these pages were all grouped together, this might make a significant difference in the resulting PageRanks.


These changes aside, the PageRank algorithm recreated for this project worked as well as could be expected on the faulty network it was provided. The power iteration method was used to perform iterative matrix multiplication on a transformation matrix weighted by a page's number of inlinks and outlinks. This resulted in a vector of ranks for every page in the network which was based on the number and rank of other pages in the network that linked to it. When used in combination with a text-matching algorithm, this vector of ranks would drive the order of search results presented to a search user.


References


Cooper, Samuel J. “PageRank”. Introduction to Linear Algebra and to Mathematics for Machine Learning, Imperial College London, Coursera. Received 16 Dec. 2017. Course Handout.


Eldén, Lars. Matrix Methods in Data Mining and Pattern Recognition. Society for Industrial and Applied Mathematics, 2007.


Mueller, John, and Luca Massaron. Algorithms for Dummies. Wiley Publishing, Inc., 2017.





Sunday, December 9, 2018

Submitter Segmentation

Principal Component Analysis and K-Means Clustering

TL;DR

In this project I explored techniques for clustering observations based on high-dimensional, very sparse, binary behavioral data. The final results were created using a combination of sampling, dimension reduction through Principal Component Analysis, and clustering using the K-means algorithm. These methods suggested 5 or 12 clusters of observations from a 5K x 5K matrix of observations and variables. Text analysis on descriptions associated with the variables did not produce immediately obvious categories for the clusters.

Exploring our Submitters

Last Spring I finagled my way into an internship at an awesome local tech startup, which turned into part-time work while I finish up my graduate program this year. Our company provides a cloud-based submission management platform, and our clients consist of any organizations that need to accept some sort of submission from the public. We started mainly in the publishing industry, with literary journals making up the bulk of our clients, but over the last few years we've quickly expanded into other arts and media industries, the grants/foundations industry, and numerous others.

While we've spent a lot of time trying to better understand who our clients are, very little work has been done on the submitter side. For this reason, I decided I'd like to do an exploration of our submitters for my Master's Capstone (read: thesis), starting this semester with a simple (or so I thought) cluster analysis. Since we have a variety of types of organizations using our platform, I thought a straightforward first effort would be to segment the submitters based on which opportunities they'd submitted to.

Dealing with Big Sparse Data

Like with any data analysis, the first step was getting the data I needed. I needed to pull all submitter-opportunity combinations, and since I knew that I wanted to do a text analysis to assess the clusters I came up with, I wanted to pull the opportunity descriptions as well (i.e. “We're now accepting poetry submissions for our April issue of Literary Journal X”). So I went to our database in Redshift and wrote my query. When the query was taking longer than expected, I did some exploration to see what the problem might be. It turns out that we have ~130K opportunities and ~4.2M submitters in our database, with ~12M submissions between them. Pulling this all together in one query would not only time out the connection to the database, but the result would also be too big to hold in memory (because of the size and duplication of the descriptions).

My first solution was to split the query up by submission year to deal with both problems at once. While this allowed me to pull the data, I was still left with over 20GB of data. Then I tried filtering for submitters who had made at least three submissions, and opportunities that had received at least three submissions. This left me with only ~3.6M submitters and ~64K opportunities, but still over 15GB of data. So finally I stripped off the descriptions into their own dictionary which left me with a very reasonable ~350MB of data.

Finally I was able to pull all of the years' submission data together into one python object and then use pandas.get_dummies to turn the opportunities into dummy variables with a 1 if the submitter had submitted to it, or a 0 if not. Because this results in a 3.6M x 64K matrix, I needed to specify that it be stored as a sparse dataframe to still be able to hold it in memory.

import os
import pickle
import pandas as pd

#combine files from all years and convert to sparse pandas df
files = [0]*9
for idx, file in enumerate(os.listdir("cleaned")):
    with open("cleaned/"+file, "rb") as f:
        files[idx] = pickle.load(f)

combined = pd.concat(files, ignore_index = True)

#create dummy variables
sparse_try = pd.get_dummies(combined, sparse = True)

At this point I'd done a lot of research about working with high-dimensional sparse data, and I'd run across a lot of people who pointed out that trying to cluster very sparse data directly would be fraught with errors because of the way that clustering algorithms are set up. So I'd decided to use Principal Component Analysis for dimension reduction to make the data denser.

from sklearn.decomposition import PCA

pca = PCA(n_components = 1000)
pca.fit(sparse_try)

And I immediately got a MemoryError. Of course this makes sense: scikit-learn's PCA works on a dense copy of the data, and densifying a 3.6M x 64K matrix (let alone computing its covariance structure) is far more than could be held in memory. I looked around for other options and tried several, including IncrementalPCA, which performs singular value decomposition (SVD) on chunks of the data and then combines the results, garnering results almost identical to those of a full PCA. But no matter how small I made the chunks, I was still getting a MemoryError.
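
For reference, the chunked approach looks roughly like the sketch below (the chunk size and the guard on short chunks are illustrative choices, and sparse_try is the dummy-variable dataframe built above):

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_components = 1000
chunk_size = 10000  # each batch must contain at least n_components rows
ipca = IncrementalPCA(n_components = n_components)

# feed the dummy-variable matrix to the model one slice at a time
for start in range(0, len(sparse_try), chunk_size):
    chunk = sparse_try.iloc[start:start + chunk_size]
    if len(chunk) >= n_components:
        ipca.partial_fit(np.asarray(chunk, dtype = float))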

Although I'd really like to figure out a way to perform the analysis on all of the submitters, I finally had to settle for taking a sample of the data so that I could have some sort of result to share by the end of the semester. If anyone has any good ideas for how to do this, please let me know! I'd still like to be able to use all of the data for my Capstone next semester.

Take Two: Smaller data makes everyone's lives easier

Since the data I had was so sparse, I was concerned that taking a random sample might not leave me with enough data to get any meaningful results. So I decided to run the analysis on the 5000 highest-submitting submitters and the 5000 highest-receiving opportunities. Obviously this method of sampling is rife with potential bias (especially since we know that the majority of our opportunities are for literary journals), but it was the best option that I had for the purpose of this project.

An added bonus to using a smaller dataset was that I felt comfortable moving back into R without losing too much performance. So I pulled my submitter-opportunity pairs into R and used tidyr's spread function to create my dummy variables. This gave me a 4982 x 4393 matrix (suggesting that some submitters had submitted to the same opportunity multiple times, and some of the opportunities didn't have any submissions from the top 5000 submitters). With 2.1M unique submissions, the matrix was ~90% sparse, so dimension reduction was still necessary.

library(tidyverse)  # readr (read_tsv), dplyr, and tidyr (spread)

#read in tsv of user/form submission pairs for 5000 highest submitting users
#and 5000 highest receiving forms
top5000 <- read_tsv("top5000.txt")

#create an n_user x n_form matrix with ones for positive user/form submission pairs
#and zeros elsewhere
sparsedf <- top5000%>%
  mutate(dummy = 1)%>%
  distinct%>%
  spread(productid, dummy, fill = 0)

I was then able to run the PCA and visualize the cumulative variance of each component to decide how many components to include in the clustering analysis. The goal is to significantly decrease the number of variables for clustering (and make the data much denser) without losing too much detail in the variance of the data.

#perform pca on matrix to compress into less sparse form for clustering
pca <- sparsedf%>%
  select(-userid)%>%
  prcomp

#visualize cumulative variance to determine number of components to include
cum_var <- data.frame(sd = pca$sdev)
cum_var <- cum_var%>%
  mutate(eigs = sd^2,
         cum_var = cumsum(eigs/sum(eigs)),
         id = 1:n())

[Plot: cumulative variance explained by the principal components]

#find number of components closest to 80% cumulative variance
which.min(abs(cum_var$cum_var-0.80))
## [1] 959

I included the first 959 principal components (a little more than 1/5 of the original variables) to account for 80% of the variance in the original data. The PCA provides a vector of weights (the rotation) for each component, which shows how much each variable (column) affects that component. To create a new matrix for analysis using the principal components, matrix multiplication is used to calculate the weighted sum for each row (submitter) for each principal component.

#for each component, compute the dot product of each row/user and the pca rotation
num_pcs = 959
for (i in 1:num_pcs) {
  pc_name <- paste0("PC", i)
  sparsedf[, pc_name] <- as.matrix(sparsedf[,2:4393]) %*% pca$rotation[,i]
}

#select pca columns for clustering
reduced_df <- sparsedf[,-c(1:4393)]

Clustering Post-PCA

Now that I finally had a dense-ish matrix to work with, it was time to start work on the “simple” clustering I'd set out to accomplish. I decided to use the K-means algorithm, which requires the user to provide a pre-determined number of clusters. There are several techniques to help decide what the optimal number of clusters might be. The simplest is the “elbow method”, in which the total within-cluster sum of squares is calculated and graphed for each candidate number of clusters, and the user looks for a change of slope from steep to shallow (an elbow) to determine the optimal number of clusters. This method is inexact, but still potentially helpful.
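
In formula terms, the quantity being plotted for \(k\) clusters is

\(W(k) = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2\)

where \(C_j\) is the set of observations assigned to cluster \(j\) and \(\mu_j\) is that cluster's centroid. Adding clusters always lowers \(W(k)\), so the elbow marks the point where extra clusters stop buying much tighter groupings.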

As a sidenote, in doing research for this project I discovered this super helpful tutorial about doing K-means clustering in R, which introduced me to the amazing factoextra package for clustering and visualizing cluster analyses.

[Elbow plot: total within-cluster sum of squares by number of clusters]
I've yet to see an elbow plot that really clearly indicates a specific number of clusters, but I'd say this one points to somewhere between 4 and 8 clusters.

Another visualization that can help determine the optimal number of clusters is called the “silhouette method”. Although I don't understand the inner workings of this method quite as well, the idea is that any cluster has a “silhouette”, or a region in which its members lie, and the silhouette statistic measures how well the members fit into their clusters' silhouettes.
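
For completeness, the standard definition: for each observation \(i\), let \(a(i)\) be the average distance to the other members of its own cluster and \(b(i)\) the smallest average distance to the members of any other cluster. The silhouette is then

\(s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\)

Values near one mean the observation sits comfortably inside its own cluster, and the curve plotted below is the average silhouette width across all observations for each candidate number of clusters.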

[Plot: average silhouette width by number of clusters]
In reading this plot, a peak indicates a good “silhouette fit”. Although the clearest peak is at 2 clusters, my knowledge of the data suggests that there should be more clusters than that, so I'll also look at 5, 12, and 16 clusters.

In the code below, the argument iter.max is the number of times the algorithm will re-calculate the cluster centroids before giving up on reaching convergence and nstart is the number of times it will run the algorithm. Since the starting position of the cluster centroids is random, using nstart will allow R to pick the starting position that results in the tightest clusters.

#run k-means algorithm with 2, 5, 12, and 16 clusters as potential best fits based on plots
cluster2 <- kmeans(reduced_df, centers = 2, iter.max = 35, nstart = 25)
cluster5 <- kmeans(reduced_df, centers = 5, iter.max = 35, nstart = 25)
cluster12 <- kmeans(reduced_df, centers = 12, iter.max = 35, nstart = 25)
cluster16 <- kmeans(reduced_df, centers = 16, iter.max = 35, nstart = 25)

factoextra does provide a really cool way of visualizing multidimensional clustering using fviz_cluster(). This function performs a PCA on the clusters, picks the two components with the highest explained variance, and then plots the clusters on those two components. This works great up to a certain number of variables, but this analysis has so many variables (959 to be specific) that none of the principal components has very much explained variance individually, so the plots never show really distinct clusters at all.

Instead, I decided to compare the within-cluster sum of squares (a measure of how tight each cluster is) and the between-cluster sum of squares (a measure of how separated each cluster is from the others). The goal is to find a minimum within-cluster sum of squares, and a maximum between-cluster sum of squares.

[Plot: within-cluster and between-cluster sum of squares by number of clusters]

The trick to picking an optimal number of clusters is to minimize the within-cluster sum of squares and maximize the between-cluster sum of squares, but not lose the ability to apply logical categories to the clusters by creating too many (or for smaller datasets, having so many clusters that you end up with almost as many as you started with). While the plot above does suggest that a higher number of clusters may be a better fit, I decided to continue the analysis with 5 and 12 clusters, to see if I can match them to categories.

Finding meaning behind the clustering results

Normally a person would be able to go back to the matrix that the clustering algorithm was run on to look for patterns that might help to categorize the resulting clusters. However, my clustering data had already been processed through the principal component analysis, so the variables used for clustering were already abstracted from the underlying data.

Instead, I went back to the original pre-PCA dataset and made a list of all of the opportunities that the members of each group had submitted to. Then I went back to our Redshift database to pull the descriptions for those top 5000 opportunities and joined them with my cluster-opportunity lists. Code for the 5-cluster model is shown below.

#create df with cluster assignment for each submitter
cluster5_forms <- data.frame("userid" = sparsedf$userid, "cluster" = cluster5$cluster)

#join with original dataset to create cluster-opportunity combinations
cluster5_forms <- cluster5_forms%>%
  left_join(top5000, by = c("userid" = "userid"))%>%
  group_by(cluster)%>%
  count(productid)

#join with opportunity descriptions
cluster5_text <- cluster5_forms%>%
  left_join(form_desc, by = c("productid" = "productid"))

A fairly simple idea I had for categorizing the data was to see what the most frequent words were in the opportunity descriptions for that cluster. So I used my favorite text analysis package tidytext to clean up the text and calculate term frequency. The opportunity descriptions are written in a kind of pseudo-html, so I used the format = "html" argument to help clean up a lot of the formatting marks. Again, the code for the 5-cluster model is shown below.

#split by word, remove punctuation, stop words, and other formatting marks
cluster5_tidytext <- cluster5_text%>%
  unnest_tokens(word, description, format = "html", strip_punct = TRUE)%>%
  anti_join(stop_words)

#count term frequency by cluster
cluster5_tf <- cluster5_tidytext%>%
  group_by(cluster)%>%
  count(word, sort = TRUE)%>%
  ungroup()

[Plot: most frequent terms by cluster (5-cluster model)]

Unfortunately the most frequent terms are basically the same across all 5 clusters, and the same is true for the 12-cluster model. This is not particularly surprising, since all of the opportunities have the same broad goal: to get people to send in submissions. Also, the sampling method that I used was pretty likely biased towards literary journals, so words like fiction, poetry, and publication show up across the board.

A handy tool for situations like this is the tf-idf (term frequency - inverse document frequency) metric. This compares the term frequency in a “document” (in this case a cluster) to the term frequency in the whole “corpus” (all of the clusters combined) to find terms that are used disproportionately in one document. Theoretically, these words would be very specific to the content of that document.
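
In the form computed by tidytext's bind_tf_idf, the score for a word \(w\) in cluster \(c\) is

\(\text{tf-idf}(w, c) = \frac{n_{w,c}}{\sum_{w'} n_{w',c}} \times \ln\frac{N}{N_w}\)

where \(n_{w,c}\) is how often the word appears in that cluster's descriptions, \(N\) is the number of clusters, and \(N_w\) is the number of clusters in which the word appears at all, so words that appear in every cluster get a score of zero.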

# calculate tfidf for clusters
cluster5_tfidf <- cluster5_tf%>%
  bind_tf_idf(word, cluster, nn)

[Plot: highest tf-idf terms by cluster (5-cluster model)]

At least here we have different words in each cluster. The patterns aren't immediately obvious, and a lot of these tf-idf terms are names of specific opportunities or specific clients. But looking closer we can see that cluster 1 is obviously mostly in French, cluster 2 might be in the grants/foundations world (with MCAT, fiscal, deposited, and Ideascity all being related to grants), and cluster 4 could be visual arts-related (with SCAD, Halcyon, Jerwood, and NPAF all being arts-focused organizations).

[Plot: highest tf-idf terms by cluster (12-cluster model)]

As there get to be more clusters, it gets harder to find patterns, but looking at the 12-cluster tf-idf results, we see a couple really obvious categories. Cluster 4, for example, is clearly focused around some sort of social-justice effort (with jurisdiction, negligence, arbitration, coalition, enforcement, etc.), cluster 8 pops up again as our French cluster, and cluster 10 has many of the same visual-arts tf-idf terms as cluster 4 above. We might think that cluster 1 is somehow related to women (with Harlequin, motherhood, Latina, and Shebooks), cluster 3 might be focused around the UK or Europe (with Europe, Albion, and Kensington), and cluster 12 might be related to Hawaii (with Maui and Haleakala).

What's next?

This project obviously has a lot of loose ends left to figure out. The clusters that I ended up with were not easily categorized based on the opportunity descriptions, I was never really happy with the number of clusters that I was choosing, and, perhaps most importantly, I never figured out how to deal with all of the submission data that I wanted to include.

My company is in a seemingly perpetual struggle figuring out how we want to label our clients, but one solution may be to try to cluster submitters based on some of those labels (where the client label is linked with each individual opportunity which is then linked to a user's submissions). Or it might prove more meaningful to use these labels in place of the opportunity descriptions to help categorize each cluster.

I could also do a deeper exploration of different numbers of clusters to see if there might be some magical number that would give really clear categories. Or I've also considered ignoring the internet's advice and directly clustering the full-dimensional data (pre-PCA). Although I believe that there are theoretical issues with clustering sparse data, there's enough subjectivity in tweaking clustering algorithms that I'd be willing to give it a shot.

I'll also probably play around with better, less biased, sampling methods and see whether that makes a difference in the resulting clusters. And again, if anyone has any brilliant ideas about working with all of the data at once (short of buying a mega-computer), please please let me know!

Correction

After taking a few months away from this project, I revisited my code to see if any good ideas for working on the full dataset jumped out at me. Just a quick look showed me that pd.get_dummies() wasn't building the dummy-variable matrix properly, with the resulting matrix having a lot of duplicate entries. I found several other mentions of similar issues with this function, and discovered a workaround: building the pivot table directly as a sparse csr_matrix. Using this method, I was able to perform a TruncatedSVD analysis to reduce the number of opportunity variables to only 561. K-means cluster analysis then suggested 4, 7, or 12 clusters as potentially optimal for the data. Unfortunately, as was seen in the sample data discussed above, the clusters did not have obvious associations with the types of categories that we would like to identify (i.e. filmmakers, writers, etc.). Additionally, this method does not allow users to fall into more than one category, which would be necessary in the long run for this use case. As such, this method was abandoned in favor of other, more flexible techniques.
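
For anyone who hits the same wall, the workaround looks roughly like the sketch below. It assumes a two-column dataframe of submitter-opportunity pairs named combined with columns userid and productid (the exact names in my pipeline may differ), and the 561-component figure comes from the analysis described above.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# integer-encode submitters (rows) and opportunities (columns)
users = combined["userid"].astype("category")
forms = combined["productid"].astype("category")

# build the sparse pivot table directly, skipping pd.get_dummies
matrix = csr_matrix(
    (np.ones(len(combined)), (users.cat.codes, forms.cat.codes)),
    shape = (users.cat.categories.size, forms.cat.categories.size),
)
matrix.data[:] = 1  # collapse repeat submissions to a 0/1 indicator

# TruncatedSVD accepts sparse input, unlike plain PCA
svd = TruncatedSVD(n_components = 561)
reduced = svd.fit_transform(matrix)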
