Whoosh Full Text Search
This summer I've been doing an internship at a local tech company, working with their data team on various projects. My main project, however, has been to build a full-text search engine that can take input from their Redshift database and will be an improvement on the keyword-based search that they have been using.
After doing a bit of research about how search engines work and what open-source options are available, I identified Whoosh as the platform that I wanted to work with. Although Solr and ElasticSearch are both much more mature platforms, and probably ultimately have more complex capabilities, it was really the simplicity of Whoosh that drew me to it. Whoosh is actually just a Python library that houses various indexing and searching functions, which meant two things for me:
- I only needed to learn how to work in Python, which I was meaning to take on this summer anyway
- Every piece of it could be customized and optimized to my purposes
Basics of Whoosh's Index and Searcher
As I mentioned, Whoosh was originally designed to allow for search within the body text of a library of help documents. These documents were, presumably, stored as separate files on some sort of a server. So what Whoosh was built to do was to create an index that stored all the text and meta-data associated with each document. For most systems like this, the index would be likely to store:
- File path
- Document title
- Document text
- Optional: other meta-data like the created date, file size, keywords, etc.
The biggest hurdle that I ran into when figuring out how to use Whoosh was that my data was stored in a relational database, and the text that I wanted to index was stored in 20+ fields over 5 different tables. It seems like this wouldn't be a huge deal, but Whoosh's indexing functions are really designed to point to a specific location on a server and gather data about/within that location--there's no way built-in to look into a database directly, or even to index data from within a python object like a dictionary, series, or dataframe. So I made my own.
Importing from Redshift into a Pandas Dataframe
Pulling my data into Python from Redshift was not difficult. I tried a couple different packages for this, but I settled on psycopg2 because it seemed the most straightforward and the least buggy. After a lot of googling and trial and error to figure out how to get my data into the Whoosh index, I decided to use a pandas dataframe as my import data structure--ultimately the dataframe has the most robust methods for referencing specific items. So I used psycopg2 to create a connection to Redshift, and then used pandas to pull my query in as a dataframe:
conn = psycopg2.connect(dbname= '', host='', port= '', user='' ,
password='')
sql = "select desired fields from desired tables"
dataframe = pd.read_sql_query(sql, con=conn)
Building the Index
From here on I'm going to be using a sample dataframe to demonstrate my methods. For this, I decided to use the full Grimm's Fairy Tales text, broken up into individual stories. I already had this more or less put together from my past tidytext blog, so all I did was take that labelled text and combine all of the lines for each story into one text block. This left me with a dataframe with only two columns (story title and story text), but the methods used below can be used with any number of columns of any data type.
So the first thing to do is import your data and create the index schema. The schema outlines the structure of the search index, providing meta-data about each of the fields that will be indexed. In this example, both fields are of type TEXT, which means that the data will be broken up into individual word tokens, and information about both word frequency and position will be stored. There are a lot of other field types available in Whoosh, but the other two most common ones are ID (which is stored as a single string) and KEYWORD (which is broken up into word tokens, but frequency and position data is not stored). The stored argument determines which fields will show up in the search results, and the field_boost argument will weight the match score for certain fields (ie. if there is a match in the title or a keyword field, that might be worth more than a match in the body text).
grimm = pd.read_csv("~/grimm.csv")
schema = Schema(title = TEXT (stored = True,  field_boost = 2.0),
                text = TEXT)                                     
The next step is to actually create the index. First we check to see whether or not the index (as you name it) already exists on your machine. If it doesn't then we create it; if it does, then we overwrite it and structure it with our schema:
if not os.path.exists("Grimm_Index"):
        os.mkdir("Grimm_Index")                                  
ix = index.create_in("Grimm_Index", schema)                      
Next we populate the index from our dataframe. In order to do this, I used a for loop to loop through each row/story in the dataframe, and used pd.loc to point the index to each individual field. Again, this example is pretty basic with just two fields, but it's worked just as well well for me using 20+ fields. Opening the index writer inside a with statement causes it to automatically close/commit when it finishes, which can avoid a lot of headaches down the road since Whoosh locks out all attempts to write if there's another writer open (believe me, this is something to be avoided).
with ix.writer() as writer:                                      
  
   for i in grimm.index:                                         
       writer.update_document(title = str(grimm.loc[i,"story"]), 
                              text = str(grimm.loc[i, "text"]))  
Searching the Index
Next we build the searcher that will compare our search query to the text that we've indexed and find the best matches. By default, the Whoosh searcher only looks through one field at a time and uses 'AND' logic to combine tokens within the search query (ie. searching for "evil witch" is like searching for "'evil' AND 'witch'" instead of "'evil' OR 'witch' OR 'evil witch'"). So I used the OrGroup and MultifieldParser functions to alter the default for my purposes. The factory argument weights the match score for 'AND' logic matches (ie. a story that has 'evil witch' will be scored higher than one that only has 'evil' or 'witch' alone).
og = qparser.OrGroup.factory(0.9)                                  
mp = qparser.MultifieldParser([title, text], ix.schema, group = og)
And finally we use the parser that we just created to compare our search query to our index and create an ordered list of matching results. You can dictate how many results are displayed (up to and including all possible matches), but I've chosen to show only 10 since this dataset is pretty small (only 66 stories). Once again, opening the searcher inside of a with statement simplifies the opening/closing of the searcher, though the stakes are lower with a searcher than a writer, since more than one can be open at one time.
q = mp.parse(u"evil witch")
with ix.searcher() as s:
     results = s.search(q, limit = 10)
     print("Search Results: ")
     print(results[0:9])
And there we are! The results look like this:
Search Results: 
[<Hit {'title': 'THE BLUE LIGHT'}>, <Hit {'title': 'SWEETHEART ROLAND'}>, <Hit {'title': 'THE SALAD'}>, <Hit {'title': 'HANSEL AND GRETEL'}>, <Hit {'title': 'THE JUNIPER-TREE'}>, <Hit {'title': 'BRIAR ROSE'}>, <Hit {'title': 'RUMPELSTILTSKIN'}>, <Hit {'title': 'THE THREE LANGUAGES'}>, <Hit {'title': 'THE STORY OF THE YOUTH WHO WENT FORTH TO LEARN WHAT FEAR WAS'}>, <Hit {'title': 'FUNDEVOGEL'}>]
Updating the Index
Although it's not really relevant for this example, it was important for me to be able to update the index regularly so that search results reflected "documents" that were added, deleted/expired, or changed. The bones of what I came up with comes from the Whoosh documentation, but once again the form of my data complicated things and required some serious tweaking. I won't include the code here because I don't have a good public example to use with it, but feel free to DM me if you'd like to learn more about how I did it, and I'll do my best to share what I learned.
Complete Example Code
import pandas as pd
from whoosh.fields import Schema, TEXT
from whoosh import index
import os, os.path
from whoosh import index
from whoosh import qparser
#%%
# import data into pandas df and create index schema
grimm = pd.read_csv("~/grimm.csv")
schema = Schema(title = TEXT (stored = True,  field_boost = 2.0),
                text = TEXT)
#%%
# create and populate index
def populate_index(dirname, dataframe, schema):
    # Checks for existing index path and creates one if not present
    if not os.path.exists(dirname):
        os.mkdir(dirname)
    print("Creating the Index")
    ix = index.create_in(dirname, schema)
    with ix.writer() as writer:
        # Imports stories from pandas df
        print("Populating the Index")
        for i in dataframe.index:
            add_stories(i, dataframe, writer)
def add_stories(i, dataframe, writer):   
    writer.update_document(title = str(dataframe.loc[i, "story"]),
                           text = str(dataframe.loc[i, "text"]))
populate_index("Grimm_Index", grimm, schema)
#%%
# creates index searcher
def index_search(dirname, search_fields, search_query):
    ix = index.open_dir(dirname)
    schema = ix.schema
    # Create query parser that looks through designated fields in index
    og = qparser.OrGroup.factory(0.9)
    mp = qparser.MultifieldParser(search_fields, schema, group = og)
    # This is the user query
    q = mp.parse(search_query)
    # Actual searcher, prints top 10 hits
    with ix.searcher() as s:
        results = s.search(q, limit = 10)
        print("Search Results: ")
        print(results[0:10])
index_search("Grimm_Index", ['title', 'text'], u"evil witch")
 
 
Claps. This is a really impressive body of work, particularly considering you're trying to learn Python as you do all this.
ReplyDeleteAmazing work. I been trying to save the query from search index but for some reason beyond my skills i have not yet succeed
ReplyDeletewhere is the grimm.csv
ReplyDeleteThis comment has been removed by the author.
DeleteI put that dataset together using R's Gutenberg library in this blog:
Deletehttp://annamarbut.blogspot.com/2018/03/grimms-fairy-tales-tidytext-analysis.html
You can also find the raw text here:
https://www.gutenberg.org/files/52521/52521-0.txt
But you'd have to do some parsing to get it into the format needed for this project.
ReplyDeleteVery informative and creative content. This concept is a good way to enhance the knowledge.thanks for sharing
RedShift Training
AWS RedShift Training
Amazon RedShift Online Training
Thankyou Anna Marbut for the post. Can you please guide me if one has to query "evil witch" and wants the output with exact "evil witch" sentences only. Like it shouldn't find evil and witch separately and gives the output with only evils and only witches but as a whole.
ReplyDeleteThanks a bunch.
I believe that you can use quotes just as you have here to require the exact phrase "evil witch", and the query also allows other logical operators.
DeleteI'll admit it's been quite awhile since I've worked on this project, but the help documents are fantastic. Here's the page about the default query language used in Whoosh:
https://whoosh.readthedocs.io/en/latest/querylang.html
This comment has been removed by the author.
ReplyDelete