Monday, August 10, 2020

English Syntax Trees and Question Creation with Flex and Bison

In the first (official) semester of my PhD program this spring, I was able to take a Computer Science class called NLP Methods in which we mainly explored using Flex and Bison to build compilers that can parse the English language. In my last blog, I discussed the (very) basics of Flex and Bison and compiler construction, as well as the first project I completed for this class--building a phonetic transcriber for English using Flex and Bison.

For the second major class project, the assignment was to create a compiler that would ingest English prose of some sort and produce some other sort of transformed English prose. As soon as we started working with Flex and Bison, I was excited to start thinking about the "rules" of English syntax again, almost 10 years after my undergrad Syntax days. I loved those courses, and I now recognize that they exercise the same puzzling/problem-solving skills that have led me to my current career path. Since I needed to build something that would parse English eventually anyway, I decided that this was a great opportunity for me to nerd-out and revisit a lot of the Generative Syntax that I had learned and loved all those years ago.

Treez 4 Dayz

If you have the time/interest, I'd highly recommend Haegeman & Gueron's English Grammar as an in-depth introduction to the English language from a generative perspective. I knew I didn't have time to repeat my undergrad Syntax work just for this one project, so I looked to the internet for newer/quicker resources. And I quickly found this fantastic YouTube series from Prof. Caroline Heycock at the University of Edinburgh--I love the future! 

Although pretty much all of what was in these videos was review for me, it was really helpful to get back in the swing of thinking about English sentence structure, both deep and surface, and the various theories for how human brains get from one to the other. I was especially excited to get to the section on X-Bar theory, since its largely right-branching, iterative tree structure is what I had been imagining would fit well with the Bison rule structure.

The Wikipedia article and Prof. Heycock's videos about X-bar theory are much better resources, but I'll do my best to give a very quick overview of the main concepts. First, we start from the assumption that language can be represented by a tree structure, in which some items are more closely connected than others. A basic example of diagramming the sentence The dog ate the bone is shown below (from Wikipedia). We can see that the dog forms an NP (noun phrase) and ate the bone forms a VP (verb phrase) in which the bone is its own NP, and these parts all come together to form an S (sentence).



X-bar theory builds on this tree-structure view of language, asserting that each phrase (e.g. NP or VP) consists of an optional specifier (e.g. D for Determiner) and an X-bar (e.g. N-bar or V-bar), and each X-bar can consist of either another X-bar with an adjunct phrase, or an X-head (e.g. N or V) with any number of complement phrases. Whether a phrase is an adjunct or a complement can be determined by substitution tests that show how closely it is connected to its phrase-head. There are a number of motivations for this kind of structure, including this adjunct/complement differentiation and the kind of movement that is theorized in Transformational Grammar Theory.
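In rewrite-rule form (my own summary, not taken from any one source), the X-bar schema boils down to three patterns, with parentheses marking optional elements and * marking "any number of":

```
XP  ->  (Specifier) X'
X'  ->  X' Adjunct
X'  ->  X Complement*
```

Because the X' rule can apply to itself, adjuncts can stack indefinitely--exactly the kind of recursion that maps naturally onto Bison grammar rules.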

This iterative X-bar structure can be seen in the tree diagram of the sentence He studies linguistics at the university below (from Wikipedia). Here we see the intervening X-bar node after each XP node, and the adjunct (PP) and complement (NP) phrases that are both a part of the overall VP studies linguistics at the university.

Once I felt like I had reviewed what I needed to get started on this project, the next step was to figure out how to write a program that would cause Flex and Bison to mimic these rules, without having a human brain to do all of the parsing for us.

POS-Tagging: Another Imperfection

I've learned (in my infinite knowledge gained from writing 5+ Flex and Bison compilers...) that defining the tokens to be used in the Bison grammar is probably an important place to start when building a Flex and Bison program. That means the first task was to write the Flex script and, for this project, to figure out how best to tackle part-of-speech (POS) tagging.

POS categories generally fall into one of two types: either open- or closed-class. Closed-class POS include classes of words like articles/determiners (the, a, this, etc.) or auxiliary/modal verbs (is, can, do, etc.) that have fixed class-membership--it would be very uncommon for a new determiner to be adopted into a language. Open-class POS don't have fixed membership: for classes of words like nouns or content verbs, the list of class-members is continuously growing--remember when using "google" as a verb would have been total nonsense?

Some closed-class POS I could confidently fill out by myself, but for others (especially large/questionably-closed classes like prepositions) I needed to find existing lists. For open-class POS I decided to do some research on how POS-tagging programs generally work. My first and last stop was the NLTK textbook chapter on tagging. Although I wasn't surprised, I was disappointed to learn that many of the more basic POS-tagging programs are simply dictionaries that match words to a known POS, that adding POS-specific regex recognition helps but isn't very accurate or complete, and that the state-of-the-art POS-methodology requires machine learning algorithms, which is outside the scope of this project. 

So I decided to go with a combination of finite word-lists and regex for recognizing Flex tokens. I started out using the biggest tagged corpus available through Python's NLTK package (Brown Corpus), but discovered that Flex actually does have a limit for how many discrete tokens it can handle (I can't find it in Flex's documentation, but the limit was expressed in the error message I received). I also discovered that the Brown Corpus had a lot of words that I didn't necessarily need to include in my lexicon (e.g. outdated words/spellings such as "hysself" or "thou"). 

So I once again took to the internet to find a smaller, more discerning list of tagged words. I ended up using lists from scrapmaker.com and then filtering out words that would be caught by my POS regex to make the lists small enough to keep Flex from yelling at me. With [...] representing loooong lists of words or regex rules, here is my final lexicon:

[ \t.,?!"]+                       {return SP;}
[']                                ;
[\n]                              {return EOL;}

"not"                                {yylval.sVal = strdup(yytext); return NOT;}
"that"                               {yylval.sVal = strdup(yytext); return THAT;}
"to"                                 {yylval.sVal = strdup(yytext); return TO;}
"this"                               {yylval.sVal = strdup(yytext); return THIS;}
"what"|"who"|"where"|"when"|"how"    {yylval.sVal = strdup(yytext); return COMP;}
"and"|"or"|"but"                     {yylval.sVal = strdup(yytext); return CONJ;}


"about"|"if"|"of"|[...]|"following"|"until"       {yylval.sVal = strdup(yytext); return PREP;}

"am"|"are"|"is"|"was"|[...]"have"|"has"|"had"     {yylval.sVal = strdup(yytext); return MODAL;}

"us"|"him"|"she"|[...]|"them"|"whoever"           {yylval.sVal = strdup(yytext); return PRON;}

"the"|"a"|"these"|"those"|"any"                   {yylval.sVal = strdup(yytext); return DET;}

[A-Za-z]+ed                                       {yylval.sVal = strdup(yytext); return ED;}
[A-Za-z]+ing                                      {yylval.sVal = strdup(yytext); return GER;}

"vendor"|"officials"|[...]|"mates"|"tacks"        {yylval.sVal = strdup(yytext); return NOUN;}

"thought"|"struck"|[...]|"grill"|"stem"|"steer"   {yylval.sVal = strdup(yytext); return VERB;}

"too"|"my"|"your"|[...]|"wan"|"triple"            {yylval.sVal = strdup(yytext); return ADJ;}
"often"|"soon"|"after"|[...]|"sometimes"          {yylval.sVal = strdup(yytext); return ADV;}


[a-z]+ous|[a-z]+ful|[...]|[a-z]+ic|[a-z]+al       {yylval.sVal = strdup(yytext); return ADJ;}
[a-z]+ness|[a-z]+ment|[...]|[a-z]+ant             {yylval.sVal = strdup(yytext); return NOUN;}
[a-z]+ly                                          {yylval.sVal = strdup(yytext); return ADV;}
[a-z]+ize|[a-z]+ate|[...]|[a-z]+ens               {yylval.sVal = strdup(yytext); return VERBS;}

[a-z]+                                            {yylval.sVal = strdup(yytext); return NOUN;}

I won't go through every piece of this, since it was a long and iterative process to build out these lists, but some interesting parts to note:
  • The order of rules in a Flex script does matter: Flex takes the longest possible match, and when two patterns match the same text, it uses the rule listed first. This meant that, for the most part, the discrete word lists needed to come before the regex rules, so that a word belonging to a different POS category wouldn't be captured by a regex first.
  • I split out some words and word classes that have special functions or can fall into multiple word classes
    • e.g."to" can be a preposition ("I went to the park") or can be used in infinitive verb constructions ("I want to go")
    • e.g. words that end in "ing" can be used as a noun (a gerund: "I like swimming"), an adjective (a participle: "She wore a stunning dress"), or a verb ("I was dancing")
  • Humans use context to parse whether a word like "bank" is referring to a noun ("financial institution" or "edge of a body of water") or a verb ("to rely (on)" or "to take a corner"). Since Flex can only pick one POS per word, and does not have any context to consider, I needed to subjectively decide which was the most common POS for these words that can fall into multiple categories. This limitation itself meant that there will be inevitable syntax errors later on, if a word is used in a POS other than the one that I assigned it.

Let's play 20 (bajillion) Questions!

Having learned my lesson from my last Flex and Bison project, I didn't try to create an accurate model of language based on real English data. Instead, I started from my own knowledge of English and syntax to create the Bison rules that would be used to parse the English I fed it. While the syntax trees shown above are a more illustrative way of looking at sentence structure, brackets and labels are often used as a more space- and word-processor-friendly option that serves the same function. The two sentences above, diagrammed this way, would look like this:
The dog ate the bone
[S [NP [D the][N dog]][VP [V ate][NP [D the][N bone]]]]

He studies linguistics at the university
[IP [NP [N' [N he]]][I' [I -Pst][VP [V' [V' [V studies][NP [N' [N linguistics]]]][PP [P' [P at][NP [Det the][N' [N university]]]]]]]]]

The second example here especially demonstrates why this method of diagramming is not preferred, but since this initial diagramming step wasn't going to be my project's final product, I didn't want to learn how to write the C-code necessary to draw an actual tree diagram. So I stuck with the bracketing method as the output for my Bison script, but decided to use a less complicated tree structure than the full X-bar theory would recommend in order to save some headaches and a lot of lines of code.

With the format of the output decided, all that was left to do (ha!) was to write all of the grammar rules. Like with the Lexical Parser that I put together, the process of creating this Grammatical Parser was long, full of subjective judgments, and happened over many iterations. To give you an idea, the beginning of my Bison rules section looks like this:

begin: sentence EOL {
printf("%s\n\n", $1); 
} begin
| /* NULL */
;

sentence: conjquestion{
            char open[] = "[Q ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(open)+strlen($1)+2);
            strcpy($$, open);
            strcat($$, $1);
            strcat($$, close);
      }
        | conjstatement {
            char open[] = "[S ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(open)+strlen($1)+2);
            strcpy($$, open);
            strcat($$, $1);
            strcat($$, close);
      }
        

conjquestion: conjquestion CONJ SP question {
                char space[] = " ";
                $$ = (char*)malloc(strlen($1)+strlen($2)+strlen($4)+3);
                strcpy($$, $1);
                strcat($$, space);
                strcat($$, $2);
                strcat($$, space);
                strcat($$, $4);
            }
            | question
        
question: qp question {
            char space[] = " ";
            $$ = (char*)malloc(strlen($1)+strlen($2)+2);
            strcpy($$, $1);
            strcat($$, space);
            strcat($$, $2);
        }
        | qp conjstatement {
            char sopen[] = "[S ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(sopen)+strlen($1)+strlen($2)+2);
            strcpy($$, $1);
            strcat($$, sopen);
            strcat($$, $2);
            strcat($$, close);
        }
        | qp dp question {
            char dpopen[] = "[DP ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(dpopen)+strlen($1)+strlen($2)+strlen($3)+2);
            strcpy($$, $1);
            strcat($$, dpopen);
            strcat($$, $2);
            strcat($$, close);
            strcat($$, $3);
        }
        | qp conjmp {
            char mpopen[] = "[MP ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(mpopen)+strlen($1)+strlen($2)+2);
            strcpy($$, $1);
            strcat($$, mpopen);
            strcat($$, $2);
            strcat($$, close);
        }
        | MODAL SP conjstatement {
            char sopen[] = "[S ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(sopen)+strlen($1)+strlen($3)+2);
            strcpy($$, $1);
            strcat($$, sopen);
            strcat($$, $3);
            strcat($$, close);
        }

qp: QUEST SP {
        $$ = $1;
    }
    |COMP SP {
        $$ = $1;
    }
        
conjstatement: conjstatement CONJ SP statement {
            char space[] = " ";
            $$ = (char*)malloc(strlen($1)+strlen($2)+strlen($4)+3);
            strcpy($$, $1);
            strcat($$, space);
            strcat($$, $2);
            strcat($$, space);
            strcat($$, $4);
         }
         | statement

statement: conjdp conjmp {
            char dpopen[] = "[DP ";
            char mpopen[] = "[MP ";
            char space[] = " ";
            char close[] = "]";
            $$ = (char*)malloc(strlen(dpopen)+strlen(mpopen)+strlen($1)+strlen($2)+3);
            strcpy($$, dpopen);
            strcat($$, $1);
            strcat($$, close);
            strcat($$, mpopen);
            strcat($$, $2);
            strcat($$, close);
         }
         
conjdp: conjdp CONJ SP dp {
            char space[] = " ";
            $$ = (char*)malloc(strlen($1)+strlen($2)+strlen($4)+3);
            strcpy($$, $1);
            strcat($$, space);
            strcat($$, $2);
            strcat($$, space);
            strcat($$, $4);
      }
      | dp

dp: det np {
        char npopen[] = "[NP ";
        char close[] = "]";
        $$ = (char*)malloc(strlen(npopen)+strlen($1)+strlen($2)+2);
        strcpy($$, $1);
        strcat($$, npopen);
        strcat($$, $2);
        strcat($$, close);
  } 
  | np {
        char npopen[] = "[NP ";
        char close[] = "]";
        $$ = (char*)malloc(strlen(npopen)+strlen($1)+2);
        strcpy($$, npopen);
        strcat($$, $1);
        strcat($$, close);
  }
np: adjp nomp {
        char adjopen[] = "[ADJ ";
        char space[] = " ";
        char close[] = "]";
        $$ = (char*)malloc(strlen(adjopen)+strlen($1)+strlen($2)+3);
        strcpy($$, adjopen);
        strcat($$, $1);
        strcat($$, close);
        strcat($$, space);
        strcat($$, $2);
  }
  | nomp

nomp: noun
    | pronoun
    | noun cp {
        char cpopen[] = "[CP ";
        char space[] = " ";
        char close[] = "]";
        $$ = (char*)malloc(strlen(cpopen)+strlen($1)+strlen($2)+2);
        strcpy($$, $1);
        strcat($$, cpopen);
        strcat($$, $2);
        strcat($$, close);
    }
    | noun pp {
        char ppopen[] = "[PP ";
        char space[] = " ";
        char close[] = "]";
        $$ = (char*)malloc(strlen(ppopen)+strlen($1)+strlen($2)+2);
        strcpy($$, $1);
        strcat($$, ppopen);
        strcat($$, $2);
        strcat($$, close);
    }
 
These are the first 11 of the 25 rules that I built for this pretty simple grammar. Starting at the sentence-level, at each step I basically asked myself questions until I thought all possible grammatical options were covered. As an example of the thought-process that went into the above rules:
  • Is the sentence a question or a statement?
  • If it's a question, is it a compound question connected by a conjunction (e.g. "Who went to the store and why did they go?")?
  • For each question part (the whole sentence or one part of the compound question), how is it structured?
    • Does it start with a modal verb ("Did you go to the store?") or a question word?
    • Is the question word followed by a modal verb and a statement ("What did the boy buy?"), a statement ("Which boy went to the store?"),  a noun and a modal verb and a statement ("Which boy did you send to the store?"), or a verb phrase ("Who went to the store?")?
  • If it's a statement, is it a compound statement connected by a conjunction ("I danced and she ate the cake.")?
  • Each statement part (the whole statement or one part of the compound statement) consists of a subject (noun part) and a predicate (verb part)
  • Does the subject have a determiner ("The girl" vs. "Girls")?
  • Is the subject noun preceded by an adjective ("The smart girl")?
  • Is the subject noun a pronoun? Or is it followed by a complementizer phrase ("The girl that won the medal")? Or is it followed by a prepositional phrase ("The girl by the water fountain")?
Once I got through the end of my rules (and all of the back-and-forth necessary to make the rules work within the Flex/Bison framework), the Bison output looked like this:

The boy went to the store.
[S [DP The[NP boy]][MP [VP went[PP to[DP the[NP store]]]]]]

Where did the boy go?
[Q Where did[S [DP the[NP boy]][MP [VP go]]]]

She and I went to the store.
[S [DP [NP She] and [NP I]][MP [VP went[PP to[DP the[NP store]]]]]]

Did you go to the store and the bakery?
[Q Did[S [DP [NP you]][MP [VP go[PP to[DP the[NP store] and the[NP bakery]]]]]]]

So you can read English--what's interesting about that?

I was SUPER excited to get to this point. I always enjoyed building syntax trees--who doesn't love a good puzzle?! So writing code to do the sentence-diagramming was like completing the puzzle-to-end-all-puzzles! Unfortunately, the actual class assignment was meant to do more than just break the sentence into its constituent parts--that part was almost optional and could have been more or less replaced by using an established POS-tagger. I needed to think of a creative way to use what I had built to perform some kind of transformation on the input text.

The example project provided by my professor was to create a compiler that would ingest text in a script format, and make insertions and replacements to cause it to become a Seinfeld script. As you can imagine, this could be done with relatively little consideration for the grammar of any given sentence. At the very least, individual words could be replaced with more Seinfeld-specific words based on POS alone, like a computerized Mad-Lib. Or entire sentences could be replaced if they have the right POS-pattern. Or we could potentially cause the Bison/Flex combination to recognize character names or recurring themes and then replace these throughout the script more dynamically.

These or similar options would all be fun and interesting programming challenges, but, given my Linguistics background and my research interest in NLP, I really wanted to find a transformation that would be based on Linguistic concepts. I thought about trying to recreate a sentence's deep structure from its surface structure, but realized that the context-free structure of Bison's grammar rules would make it very difficult to connect the pieces of the sentence that had theoretically "moved" from their deep-structure positions. Also, I realized that it might be a lot to ask of my Computer Science professor to learn generative syntax theory just to be able to understand what my program was doing...

Finally I decided that I could use the constituent pieces that my initial English parser created to turn any statement into a question. I called the project "Boring Jeopardy", since the questions created by the transformation would only be ones that were already answered by the statement.

I'll take "Simple Sentences" for 10, Alex

Starting with the English Parser program that I had put together, I could immediately delete all of the portions about parsing questions. Then, using my knowledge about the pretty well-established ways that English sentences are formed, I started thinking about what other changes would need to be made. 

At a high-level, questions can be seen as statements in which one element is removed or replaced with a question word. For example, Where did the boy go? can be seen as a transformation of The boy went to the store in which to the store is replaced by where and moved to the front of the sentence. Another part of this transformation/movement that can be seen in this example is that the inflection (person, number, case, and tense information) marker for the verb moves with the question word. Here we can see that went is broken up into did + go and did travels to the front of the sentence with where.

Based on this, I had two tasks to complete:
  1. Identify an element in the statement that could be replaced/removed to make a question, and match it with the correct question word and question format.
  2. Find the relevant verb and identify/split out its inflection to be moved into the correct question format.
Once again beginning with the lexicon, I went through my existing Flex script and broke the POS lists down further into lists that would match to a specific question word. For example, nouns that refer to a person (e.g. professor, administrator, girl, etc.) should get matched to who, prepositions that refer to location (e.g. by, near, on, etc.) should get matched to where, and so on.

I also went through my list of verbs and picked out any that had an irregular past-tense conjugation (e.g. thought from think, drove from drive, etc.) and separated the 3rd-person-present verbs that are formed with -s from those that are formed with -es (e.g. wins vs. rushes). This would allow me to create Bison rules that split out the inflection without misspelling the remaining verb (e.g. did think, does rush, etc.).

The resulting Flex rules can be seen below, once again with [...] replacing lists of words or regex rules:

[ \t.,?!"]+                       {return SP;}
[']                                ;
[\n]                              {return EOL;}

"not"                                {yylval.sVal = strdup(yytext); return NOT;}
"that"                               {yylval.sVal = strdup(yytext); return THAT;}
"to"                                 {yylval.sVal = strdup(yytext); return TO;}
"this"                               {yylval.sVal = strdup(yytext); return THIS;}
"what"|"who"|"where"|"when"|"how"    {yylval.sVal = strdup(yytext); return COMP;}
"and"|"or"|"but"                     {yylval.sVal = strdup(yytext); return CONJ;}


"about"|"whether"|"if"|"of"                 {yylval.sVal = strdup(yytext); return PREPWHAT;}
"from"|"past"|[...]|"among"|"amid"          {yylval.sVal = strdup(yytext); return PREPWHERE;}
"with"|"like"|"as"[...]|"without"           {yylval.sVal = strdup(yytext); return PREPHOW;}
"for"|"because"                             {yylval.sVal = strdup(yytext); return PREPWHY;}
"while"|"after"|[...]|"until"               {yylval.sVal = strdup(yytext); return PREPWHEN;}

"am"|"are"|"is"[...]|"has"|"had"            {yylval.sVal = strdup(yytext); return MODAL;}

"us"|"he"|"they"|[...]|"them"|"whoever"     {yylval.sVal = strdup(yytext); return PRONWHO;}
"ours"|"yours"|"hers"|"mine"|"theirs"       {yylval.sVal = strdup(yytext); return PRONWHOSE;}
"it"|"itself"                               {yylval.sVal = strdup(yytext); return PRONWHAT;}
"here"|"there"                              {yylval.sVal = strdup(yytext); return PRONWHERE;}

"the"|"a"|"these"|"those"|"any"             {yylval.sVal = strdup(yytext); return DET;}

[A-Za-z]+ed                                 {yylval.sVal = strdup(yytext); return ED;}
[A-Za-z]+ing                                {yylval.sVal = strdup(yytext); return GER;}

"vendor"|"officials"|[...]|"advisers"       {yylval.sVal = strdup(yytext); return NOUNWHO;}
"watt"|"decoders"|[...]|"mates"|"tacks"     {yylval.sVal = strdup(yytext); return NOUNWHAT;}


"thought"|"struck"|[...]|"slid"|"spoke"     {yylval.sVal = strdup(yytext); return VERBPAST;}
"contends"|"dances"|[...]|"flares"          {yylval.sVal = strdup(yytext); return VERBS;}
"passes"|"clutches"|[...]|"attaches"        {yylval.sVal = strdup(yytext); return VERBES;}
"ensure"|"choose"|[...]|"stem"|"steer"      {yylval.sVal = strdup(yytext); return VERB;}

"too"|"my"|"your"|[...]|"triple"            {yylval.sVal = strdup(yytext); return ADJ;}
"often"|"soon"|[...]|"sometimes"            {yylval.sVal = strdup(yytext); return ADVWHEN;}


[a-z]+ous|[a-z]+ful|[...]|[a-z]+al          {yylval.sVal = strdup(yytext); return ADJ;}
[a-z]+ness|[a-z]+ment|[...]|[a-z]+ities     {yylval.sVal = strdup(yytext); return NOUNWHAT;}
[a-z]+ist|[a-z]+ants|[...]|[a-z]+ists       {yylval.sVal = strdup(yytext); return NOUNWHO;}
[a-z]+ly                                    {yylval.sVal = strdup(yytext); return ADVHOW;}
[a-z]+ize|[a-z]+ate|[a-z]+fy|[a-z]+en       {yylval.sVal = strdup(yytext); return VERB;}
[a-z]+izes|[a-z]+ates|[a-z]+ens             {yylval.sVal = strdup(yytext); return VERBS;}

[a-z]+                                      {yylval.sVal = strdup(yytext); return NOUNWHAT;}

Next came the many iterations of Bison rules. I won't take you through every step, both because it would be tedious and because I honestly don't remember how many times I completely changed course, but I will try to cover the major decisions/adjustments that I had to make.

First, I had to narrow my scope of allowable sentence structures even further than I already had to avoid fatal shift/reduce errors and verb/inflection or question formatting mismatches. For example, I had to disallow compound sentences (e.g. I went to the store and he danced.) and compound predicates (e.g. I went to the store and danced.) because I couldn't find a clean way to identify the "main" verb that would be used in the resulting question.

In order to allow pieces to flow into the final question format from further down in the grammar tree, I created an array for each sentence that would be populated by the rules below it on the Bison rule tree.
  1. The sentence array consists of [nounPhrase, modalVerb (inflection), mainVerb, questionWord]
  2. The nounPhrase is the subject of the sentence, and could generally be left intact
  3. In order to differentiate a complementizer phrase from the statement predicate, I needed to create a separate class for phrases that contain verbs but are attached to the subject (e.g. The girl who went to the store was late for school.)
  4. The other three pieces of the final array come from the predicate, and are populated based on the structure of the input statement
    • if the predicate already has a separated modal verb/inflection, these pieces are passed into the modalVerb and mainVerb slots (e.g. The boy can dance at home.; modalVerb = can, mainVerb = dance)
    • if the main verb is in the list of irregular past tense verbs, the modalVerb becomes did and the mainVerb comes from a lookup table established at the beginning of the Bison script
    • if the main verb is not irregular and does not already have a separated modal verb/inflection marker, it is split based on explicit rules
      • verb ending with -ed : modalVerb = did, mainVerb = verb with -ed removed
      • verb ending with -s: modalVerb = does, mainVerb = verb with -s removed
      • verb ending with -es: modalVerb = does, mainVerb = verb with -es removed
      • all other verbs: modalVerb = do, mainVerb = verb unchanged
    • if the predicate consists only of a modal verb, the mainVerb is left blank and the modal is passed directly into the modalVerb (e.g. The girl is smart.)
    • the question word generally comes from the complement to the main verb phrase, whether it's a noun, adverb, prepositional phrase, etc. (e.g. The man left his wallet.; questionWord = What)
    • if the main verb does not have a complement, the modalVerb comes from the main verb's inflection, the mainVerb is replaced with do, and the questionWord is What (e.g. The girl experimented.; modalVerb = did, mainVerb = do, questionWord = What; Final question = What did the girl do?)
  5. Finally, once Bison has made its way down the grammar tree and back, the pieces of the sentence array are assembled to create the final question.
Since the predicate parsing is the most complicated part of this puzzle, I'll post a sample of the Bison rules below. The entire project code, including a README and a makefile, can be found on my github here:

mp: modal vp {
        $$ = (char**)calloc(3, (sizeof($2[0])+sizeof(char)*5));
        
        $$[0] = (char*)malloc(strlen($1)+1);
        $$[1] = (char*)malloc(strlen($2[1])+1);
        $$[2] = (char*)malloc(strlen($2[2])+1);
        
        strcpy($$[0], $1);
        strcpy($$[1], $2[1]);
        strcpy($$[2], $2[2]);
  }
  | vp {
        $$ = (char**)calloc(3, (sizeof($1[0])+sizeof(char)*5));
        
        $$[0] = (char*)malloc(strlen($1[0])+1);
        $$[1] = (char*)malloc(strlen($1[1])+1);
        $$[2] = (char*)malloc(strlen($1[2])+1);
        
        strcpy($$[0], $1[0]);
        strcpy($$[1], $1[1]);
        strcpy($$[2], $1[2]);
  }

modal: MODAL SP {
        $$ = $1;
    }
    | MODAL SP NOT SP {
        char space[] = " ";
        char neg[]= "n\'t";
        $$ = (char*)malloc(strlen($1)+strlen(neg)+1);
        strcpy($$, $1);
        strcat($$, neg);
    }

  
vp: advp verb {
        char adv[] = "How";
        $$ = (char**)calloc(3, (sizeof($2[0])+sizeof(char)*5));
        
        $$[0] = (char*)malloc(strlen($2[0])+1);
        $$[1] = (char*)malloc(strlen($2[1])+1);
        $$[2] = (char*)malloc(strlen(adv)+1);
        
        strcpy($$[0], $2[0]);
        strcpy($$[1], $2[1]);
        strcpy($$[2], adv);
   }
   | verb whatp {
        char what[] = "What";
        $$ = (char**)calloc(3, (sizeof($1[1])+sizeof(char)*5));
        
        $$[0] = (char*)malloc(strlen($1[0])+1);
        $$[1] = (char*)malloc(strlen($1[1])+1);
        $$[2] = (char*)malloc(strlen(what)+1);
        
        strcpy($$[0], $1[0]);
        strcpy($$[1], $1[1]);
        strcpy($$[2], what);
  }
  [...]
  | verb  {
        char what[] = "What";
        char d[] = "do";
        $$ = (char**)calloc(3, (sizeof($1[1])+sizeof(char)*5));
        
        $$[0] = (char*)malloc(strlen($1[0])+1);
        $$[1] = (char*)malloc(strlen(d)+1);
        $$[2] = (char*)malloc(strlen(what)+1);
        
        strcpy($$[0], $1[0]);
        strcpy($$[1], d);
        strcpy($$[2], what); 
  }

whatp: THAT SP statement {
        $$ = $1;
    }
     | PREPWHAT SP conjdp {
        $$ = $1;
    }
     | PREPWHAT SP statement {
        $$ = $1;
    }
     | nounwhat
     | nconjmp 
     | THAT SP {
        $$ = $1;
    }
     | THIS SP {
        $$ = $1;
    }
     
[...]

verb: verba
    | verbb


verba: VERB SP {
        char d[] = "do";
        $$ = (char**)calloc(2, (sizeof($1)+sizeof(d)));
        
        $$[0] = (char*)malloc(strlen(d)+1);
        $$[1] = (char*)malloc(strlen($1)+1);
        
        strcpy($$[0], d);
        strcpy($$[1], $1);
    }
    | GER SP {
        char d[] = "do";
        $$ = (char**)calloc(2, (sizeof($1)+sizeof(d)));
        
        $$[0] = (char*)malloc(strlen(d)+1);
        $$[1] = (char*)malloc(strlen($1)+1);
        
        strcpy($$[0], d);
        strcpy($$[1], $1);
    }
    | ED SP {
        char did[] = "did";
        $1[strlen($1)-2] = '\0';
        $$ = (char**)calloc(2, (sizeof($1)+sizeof(did)));
        
        $$[0] = (char*)malloc(strlen(did)+1);
        $$[1] = (char*)malloc(strlen($1)+1);
        
        strcpy($$[0], did);
        strcpy($$[1], $1);
    }
    | VERBS SP {
        char does[] = "does";
        $1[strlen($1)-1] = '\0';
        $$ = (char**)calloc(2, (sizeof($1)+sizeof(does)));
        
        $$[0] = (char*)malloc(strlen(does)+1);
        $$[1] = (char*)malloc(strlen($1)+1);
        
        strcpy($$[0], does);
        strcpy($$[1], $1);
    }
    | VERBES SP {
        char does[] = "does";
        $1[strlen($1)-2] = '\0';
        $$ = (char**)calloc(2, (sizeof($1)+sizeof(does)));
        
        $$[0] = (char*)malloc(strlen(does)+1);
        $$[1] = (char*)malloc(strlen($1)+1);
        
        strcpy($$[0], does);
        strcpy($$[1], $1);
    }
    | VERBPAST SP {
        char did[] = "did";
        char *v;
        v = (char*)malloc(strlen($1)+5);
        strcpy(v, $1);  /* fallback: keep the verb unchanged if it isn't in the table */
        int i = 0;
        while(i < 113) {  /* 113 entries in the pastV/rootV lookup tables */
            if (strcmp(pastV[i], $1) == 0){
                strcpy(v, rootV[i]);
                break;
            }
            i++;
        }
        $$ = (char**)calloc(2, (sizeof(v)+sizeof(did)));
        
        $$[0] = (char*)malloc(strlen(did)+1);
        $$[1] = (char*)malloc(strlen(v)+1);
        
        strcpy($$[0], did);
        strcpy($$[1], v);
    } 

verbb: MODAL SP {
        $$ = (char**)calloc(2, (sizeof($1)));
        
        $$[0] = (char*)malloc(strlen($1)+1);
        $$[1] = (char*)malloc(strlen($1)+1);
        
        strcpy($$[0], $1);
        strcpy($$[1], "");  /* mainVerb left blank for modal-only predicates */
    }

In place of a conclusion...

I'll leave you with a list of example inputs and outputs from the final code. Although there are a lot of limitations that could be avoided with a more flexible programming framework, the structure of Flex and Bison really caused me to think about language in terms of concrete rules and to appreciate (not for the first time) the amazing power of the human brain to interpret language, even in its most convoluted and ambiguous forms.

The girl and the boy went to the store.
Where did the girl and the boy go?

The day was long.
How was the day?

The gunpowder burned.
What did the gunpowder do?

The donors who asked for receipts gave more money.
What did the donors who asked for receipts give?

The passengers sang when the bus arrived at the store.
When did the passengers sing?

It works!
What does it do?

 
