Sunday, June 21, 2020

English Phonetic Transcription with Flex and Bison

    This spring was my first semester officially enrolled as a PhD student. I'm planning a full blog post about my decision to (attempt to) join the doctoral ranks, but very quickly: my research is focused on integrating Natural Language Processing and Information Retrieval techniques to build a text- and activity-based recommender system applicable to my employer's SaaS platform.

Given the heavy Computer Science (CS) component of this research and my almost total lack of CS experience, a large portion of my coursework for this program is going to be within the CS department. I've now completed the rudimentary CS requirements (Data Structures and Design/Analysis of Algorithms) and have begun some of the higher-level, more creative coursework. This spring I was able to take a class called Projects in Natural Language Processing, which largely focused on building and understanding compilers with Flex and Bison.

Muscles and Bovines

What are Flex and Bison anyway? I'll admit that I'm still a little fuzzy on the answer to that question, but my best big picture explanation is that they are two tools that allow a user to more or less create their own programming language. In other words, using Flex and Bison, you can assign roles to an input and then use rule-based programming to give a desired output.

Breaking it down further, Flex (a spin-off of Lex) is a lexical analyzer. Within a Flex script, the programmer uses regex to assign any possible input to a token which is then fed to Bison. For example, a very simple Flex scanner (which only defines a plus sign and passes through the values of strings and integers) might look like this:

"+"           {return PLUS;} 
[0-9]+        {yylval.iVal = strdup(yytext); return INTEGER;}
[A-Za-z]+     {yylval.sVal = strdup(yytext); return STRING;}

Bison (a spin-off of Yacc) is a parser generator that takes the tokens fed to it by Flex and uses programmer-defined rules to produce the desired output from those tokens. Generally, the code within the Bison rules is written in C or C++. So, given the Flex scanner above, if we wanted our language to add two numbers, concatenate two strings, or concatenate a string and a number when separated by a plus sign, the rules portion of our Bison script might look like this (assuming the usual %union and %type declarations so that INTEGER carries an int and the string-valued rules carry a char*):

begin: input EOL {
           printf("%s\n\n", $1);
       } begin
     | /* NULL */
     ;

input: intadd {
           /* render the summed integer as a string for printing */
           int length = snprintf(NULL, 0, "%d", $1);
           $$ = (char*)malloc(length + 1);
           sprintf($$, "%d", $1);
       }
     | stringcat

intadd: intadd PLUS INTEGER {
            $$ = $1 + $3;
        }
      | INTEGER

stringcat: stringcat PLUS STRING {
               $$ = (char*)malloc(strlen($1) + strlen($3) + 1);
               strcpy($$, $1);
               strcat($$, $3);
           }
         | stringcat PLUS intadd {
               /* render the integer, then append it to the string */
               int length = snprintf(NULL, 0, "%d", $3);
               char i[length + 1];
               sprintf(i, "%d", $3);

               $$ = (char*)malloc(strlen($1) + length + 1);
               strcpy($$, $1);
               strcat($$, i);
           }
         | intadd PLUS stringcat {
               /* render the integer, then append the string to it */
               int length = snprintf(NULL, 0, "%d", $1);
               char i[length + 1];
               sprintf(i, "%d", $1);

               $$ = (char*)malloc(length + strlen($3) + 1);
               strcpy($$, i);
               strcat($$, $3);
           }
         | STRING

There's quite a bit more finicky business involved in getting the Flex and Bison scripts to play nicely with each other and do what they're supposed to do, but that wasn't the part of this project that I was most interested in, so I'm not going to go into it here. If you want to learn more specifics about how Flex and Bison work, they both have great documentation available online.


Once Flex and Bison have done their thing with the tokens and rules above, you should be able to run your .out file with the following ins and outs:
                   
          in:  5+3
          out: 8

          in:  boat+house
          out: boathouse

          in:  area+51
          out: area51 

Beep Boop Turn Right on South Brood-way Avenue

We've all laughed at the poor pronunciation of our GPS apps, and while it has certainly improved over the years, there's not a lot of groundbreaking work being done in teaching computers proper pronunciation. Within the field of linguistics, the term "pronunciation" can refer to phonology, phonetics, or sometimes even prosody. However, these areas of study almost entirely focus on spoken language completely divorced from its written representation (orthography). Similarly, most of the interesting Natural Language Processing (NLP) work being done either deals with language starting in wave form (e.g. speech recognition) or solely as written symbols of meaning (e.g. semantic analysis). 

My intuition is that the relative lack of study in text-to-speech techniques is due to two main factors:
  1. There's limited work to be done. That is, at some point we can have a one-to-one relationship between all written words and their spoken forms. Compared to the complication of matching an infinite number of wave forms (since each spoken production of a written word will vary by some amount) to a single written word, mapping a written word to a single spoken form is trivial. Furthermore, creating a translator for IPA (International Phonetic Alphabet) symbols is all you would need to produce a spoken form for any word found in a dictionary, since the pronunciation is generally included in the dictionary entry.
  2. There's limited application and need for improvement. Sure, we laugh at the GPS or the automated phone service when it mispronounces a word, but how often does a text-to-speech mispronunciation actually cause communication to break down? On the other hand, the existence of Buzzfeed articles cataloguing embarrassing speech-to-text failures demonstrates that the same cannot be said when computer translation goes the other way. The one important area I can think of that could benefit from text-to-speech improvement is accessibility for the visually impaired, but that's a fairly niche application.
While these factors make text-to-speech an "uninteresting" area of study, they could also be seen as indications that text-to-speech is a good candidate for a simple, rule-based computer program, such as a Flex/Bison compiler language--which is where this project idea began.

I before E except after C... and some other times

Normally if I wanted to represent a word's spoken form, I'd use IPA symbols, which were designed to represent any and all possible linguistic sounds. However, given that I was already learning C for this project and didn't want to complicate things any more than necessary by importing a special symbol package, I wanted to find a succinct and accurate way to represent phonetic symbols with the Latin alphabet. Luckily Carnegie Mellon already took care of this with its ARPAbet-based Pronouncing Dictionary, a sizeable dataset of English-language pronunciations, nominally for use in training machine learning models.
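
For anyone following along, reading a dataset like that into a dictionary is simple. Here's a minimal loading sketch, assuming a file with one whitespace-separated word/transcription pair per line, with the transcription already underscore-joined as in the examples shown further below (the filename is made up):

# Minimal loader sketch; the file name and layout are assumptions.
# Each line is assumed to look like: ACCOMPANIED AH_K_AH_M_P_AH_N_IY_D
IPAdict = {}
with open("cmu_pronunciations.txt") as f:
    for line in f:
        word, trans = line.split()
        IPAdict[word.upper()] = trans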

With that problem solved for me, the next step was to decide how I wanted to structure the Bison rules. I started out thinking that I could use my knowledge of phonology alone to build the rules that I needed, but every time that I came up with a rule, I also came up with a handful of exceptions to that rule. So, with guidance from my professor, I decided to try letting the computer do the work for me. 
 
Working in Python, I decided first to see if I could narrow my rules down to a relatively short list of letter combinations that occur in English. I knew that groups of vowels and groups of consonants often combine with each other and map to a single phoneme (e.g. "ea" or "th"), so I made a list of words from a few dozen classic texts and then counted the number of times each group of vowels or consonants occurred. This left me with almost 2600 "tokens" of vowel and consonant groups. Limiting the list to tokens that occurred more than 100 times shortened it to ~700 tokens--still a lot, but seemingly much more manageable.

from collections import defaultdict
import re

englishContexts = defaultdict(int)
letters = re.compile("[a-z]+")

vowelmatch = re.compile("[AEIOUY]+")
consmatch = re.compile("[B-DF-HJ-NP-TV-Z]+")

for w in words:
    # skip any words with numbers or symbols
    if letters.fullmatch(w):
        w = w.upper()
        # make lists of all vowel and consonant groups
        v = vowelmatch.findall(w)
        c = consmatch.findall(w)
        # check whether the word starts and ends with a vowel or consonant
        # group, mark those with a "^" or "$", and tally all groups in the
        # list of tokens
        if vowelmatch.match(w):
            englishContexts["^"+v[0]] += 1
            if vowelmatch.match(w[-1]):
                englishContexts[v[-1]+"$"] += 1
                for i in range(1, len(v)-1):
                    englishContexts[v[i]] += 1
                for i in range(len(c)):
                    englishContexts[c[i]] += 1
            else:
                englishContexts[c[-1]+"$"] += 1
                for i in range(1, len(v)):
                    englishContexts[v[i]] += 1
                for i in range(len(c)-1):
                    englishContexts[c[i]] += 1
        if consmatch.match(w):
            englishContexts["^"+c[0]] += 1
            if vowelmatch.match(w[-1]):
                englishContexts[v[-1]+"$"] += 1
                for i in range(len(v)-1):
                    englishContexts[v[i]] += 1
                for i in range(1, len(c)):
                    englishContexts[c[i]] += 1
            else:
                englishContexts[c[-1]+"$"] += 1
                for i in range(len(v)):
                    englishContexts[v[i]] += 1
                for i in range(1, len(c)-1):
                    englishContexts[c[i]] += 1

# only keep tokens which occur more than 100 times among all texts
topEnglishContexts = {w: count for w, count in englishContexts.items()
                      if count > 100}
 
I also knew that there were different rules for letter combinations at the beginning and end of words (e.g. "ng" can end an English word, but can't begin one), so I split my list of tokens up by where they fell in the word.

starts = set()
ends = set()
vowels = set()
consonants = set()

for w in topEnglishContexts:
    if w[0] == "^":
        starts.add(w)
    if w[-1] == "$":
        ends.add(w)
    if vowelmatch.fullmatch(w):
        vowels.add(w)
    if consmatch.fullmatch(w):
        consonants.add(w)
 
This left me with 102 possible word starts, 155 possible word ends, 55 vowels, and 374 consonants. Luckily I started working through the vowels first, because it was the shortest list, and quickly realized that this method was not going to work. For any given vowel, there could be a handful of pronunciations depending on the word it was in (e.g. "a" sounds different in each of "cat", "late", "saw", and "many"). This multiplied the ~700 rules I thought I was going to make several times over (and writing ~700 rules by hand would have been an absurd undertaking anyway...), so I decided to let the computer do even more work for me.

Using Carnegie Mellon's training data, I decided to flip the process I had started above and instead find all of the different letter combinations that mapped to each phoneme in the dataset. This was also more complicated than I had originally thought, since the training data only tells you the input and the output, and not which specific letters matched to which part of the phonetic production. An example of how the dataset is structured is shown below:

           ACCOMPANIED : 'AH_K_AH_M_P_AH_N_IY_D'
           PROPELLED   : 'P_R_AH_P_EH_L_D'
           MINDING     : 'M_AY_N_D_IH_NG'
           COMFORTABLE : 'K_AH_M_F_ER_T_AH_B_AH_L'

Already from these examples we can see that there is not a one-to-one relationship between letters and phonemes. So I once again grouped vowels and consonants in the original tokens, and incorporated some other piecemeal rules to adjust for rule exceptions or artifacts of this particular transcription method. For example, the letter y is sometimes treated as a vowel and sometimes treated as a consonant, so I used regex to describe these rules in my vowel/consonant definitions.

y behaves as a vowel in combination with other vowels when:
  • it follows a vowel at the end of the word
  • it follows a vowel and precedes a consonant
y behaves as a vowel on its own when:
  • it precedes a consonant
  • it ends a word
y behaves as a consonant when:
  • it precedes a vowel
And I created similar rules to account for word-final or "silent" e, as well as for r, which can combine with either vowels or consonants. With these rules in place, both the original tokens and the phonemic transcriptions were split into vowel and consonant groups:

            # define vowel, consonant, and letter patterns
            vowelmatch = re.compile("([AEIOU]+Y$)|([AEIOU]+Y)(?=[^AEIOU])|"
                                    "(Y)(?=[^AEIOU])|([AEIOUY]+R+[AEIOUY]*)|"
                                    "([AEIOU][AEIOU]+)|([AIOU]+)|(E)|(^E)|(Y$)")
            consmatch = re.compile("(Y)(?=[AEIOU]+)|([^AEIOUY]+R+)|(^R)|"
                                   "([^AEIOUDT]+ED$)|([^AEIOURY]+)")
            letters = re.compile("[A-Z]+")

            # keep only purely alphabetic words; store each transcription in a
            # list so the splits below can be appended alongside it
            IPAdict = {w: [trans] for w, trans in IPAdict.items()
                       if letters.fullmatch(w)}

            #split words into cons and vowels
            for w in IPAdict:
                parts = []
                word = str(w)
                while len(w) > 0:
                    v = vowelmatch.match(w)
                    if v:
                        parts.append(v.group(0))
                        w = w[len(v.group(0)):]
                    else:
                        c = consmatch.match(w)
                        parts.append(c.group(0))
                        w = w[len(c.group(0)):]
                IPAdict[word].append(parts)

            #group symbols into cons and vowels for 1-1 matching letters to symbols
            for w in IPAdict:
                trans, split = IPAdict[w]
                symbols = trans.split("_")
                sym = ""
                parts = []
                i = 0
                while i < len(symbols):
                    if i == 0:
                        sym = sym + "_" + symbols[i]
                        i += 1
                    while i < len(symbols) and ((symbols[i] in IPA['vowel']) ==
                            (symbols[i-1] in IPA['vowel']) or (len(sym) == 0)):
                        sym = sym + "_" + symbols[i]
                        i += 1
                    parts.append(sym)
                    sym = ""
                IPAdict[w].append(parts)
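
To make that concrete, here's what the two splits look like for a word where everything lines up; the values below are written out by hand from the groupings above:

            # Hand-worked illustration for 'MINDING': the letter split and the
            # phoneme-group split come out the same length, so they align 1-1.
            split  = ["M", "I", "ND", "I", "NG"]
            tsplit = ["_M", "_AY", "_N_D", "_IH", "_NG"]

            if len(split) == len(tsplit):
                pairs = list(zip(split, tsplit))
                # [('M', '_M'), ('I', '_AY'), ('ND', '_N_D'),
                #  ('I', '_IH'), ('NG', '_NG')]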

For some words, such as 'MINDING' seen above, this grouping is all that was needed to find a one-to-one match between the orthographic token and the provided phonemic transcription (i.e. "M | I | ND | I | NG" maps to "M | AY | N_D | IH | NG"). But other words, such as 'COMFORTABLE', which has an r-colored vowel ("OR" = "ER"), an extra vowel sound ("BL" = "B_AH_L"), and a silent e at the end, needed additional rules to get a computer-recognizable match between the processed orthographic token and the processed phonemic transcription. Those three rules were specifically accounted for here:

            for w in IPAdict:
                trans, split, tsplit = IPAdict[w]
                i = 1
                while i < (len(tsplit)-1): # merge r with preceding vowels
                    if ((len(tsplit) > 2) and (tsplit[i] == "_R")
                            and (tsplit[i-1].split("_")[-1] in IPA['vowel'])):
                        tsplit[i-1] = tsplit[i-1] + tsplit[i]
                        del tsplit[i]
                        if tsplit[i].split("_")[1] in IPA['vowel']:
                            tsplit[i-1] = tsplit[i-1] + tsplit[i]
                            del tsplit[i]
                    else:
                        i += 1
                if (len(re.findall("[^AEIOUYRW]LE", w)) > 0
                        and (len(tsplit) > len(split))):
                    i = 1
                    while i < len(tsplit)-1: # add additional vowel to 'LE$' tokens
                        if ((tsplit[i] == "_AH")
                                and (tsplit[i+1].split("_")[1] == "L")
                                and (tsplit[i-1].split("_")[-1] not in IPA['vowel'])
                                and (tsplit[i-1].split("_")[-1] not in ['R', 'W', 'Y'])):
                            tsplit[i-1] = tsplit[i-1] + tsplit[i] + tsplit[i+1]
                            del tsplit[i]
                            del tsplit[i]
                        else:
                            i += 1
                if ((split[-1] == "E")
                        and (len(split) > len(tsplit))): # silence final e
                    tsplit.append("0")

            #add word-start and word-end indicators
            for w in IPAdict:
                trans, split, tsplit = IPAdict[w]
                split.insert(0, "^")
                split.append("$")
                tsplit.insert(0, "^")
                tsplit.append("$")

These few rules left me with only about 10% of the words in the training dataset not having a one-to-one match (~8K out of ~80K), which felt like a reasonable number of tokens to exclude for simplicity's sake. So, filtering out these non-matches, I created a dictionary keyed on each unique combination of three letter-groups, so that the context around each letter-group could be considered in its pronunciation. I then counted how many times each letter-group matched to each phoneme-group. For example, the word mind would add a count to three entries in the dictionary, with the keys being (letters of interest, preceding context, following context) and the values being the phonetic transcription:

            ["M", "^", "I"]   : "M"
            ["I", "M", "ND"]  : "AY"
            ["ND", "I", "$"]  : "N_D" 

Then, because the English language has too many ambiguous letter-to-sound mappings, I decided to only choose the most common match for each specific context.

            #count context-specific instances of letter-to-phoneme matches
            phoneDict = defaultdict(lambda: defaultdict(int))    
            for w in IPAdict:
                trans, split, tsplit = IPAdict[w]
                if len(split) == len(tsplit):
                    for i in range(1, len(split)-1):
                        phoneDict[(split[i], split[i-1], split[i+1])][tsplit[i]]+=1

            #limit transcription to most common matching for each context            
            smallPhoneDict = {}
            for con in phoneDict:
                mCount = 0
                mPh = ""
                for ph in phoneDict[con]:
                    if phoneDict[con][ph] > mCount:
                        mCount = phoneDict[con][ph]
                        mPh = ph
                smallPhoneDict[con] = mPh

Unfortunately, this still left me with 70K contexts to potentially code into my Bison script. I hoped that I could group a lot of these contexts together, so I looked for letter-groups that only had one phoneme match among all contexts, which gave me ~2K tokens and seemed fairly incomplete. When I added letter-group / following context pairs with unique phoneme matches among all preceding contexts, I ended up with ~13K tokens to somehow code into my Bison script--a lot less than 70K, but still totally impractical. And so, it was back to the drawing board.
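
As an aside, that grouping test looks roughly like this (a sketch reusing the smallPhoneDict built above; the helper names are my own):

            # Sketch: find letter-groups that map to a single phoneme in every
            # context, then (letter-group, following-context) pairs that do.
            byToken = defaultdict(set)
            byTokenNext = defaultdict(set)
            for (tok, prev, nxt), ph in smallPhoneDict.items():
                byToken[tok].add(ph)
                byTokenNext[(tok, nxt)].add(ph)

            unambiguousTokens = {t for t, phs in byToken.items()
                                 if len(phs) == 1}        # the ~2K tokens
            unambiguousPairs = {p for p, phs in byTokenNext.items()
                                if len(phs) == 1}         # roughly the ~13K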

Pretty Good is Good Enough:

Siri, I feel your pain.


As a linguist and a person working in NLP, I'm a little embarrassed that I thought it might be relatively simple to create the rules necessary to make text-to-speech transcriptions for English. Throughout all of the steps above, I kept running into the same realization: "Whoa, this is going to be really complicated!" I think I was secretly hoping that I could somehow find the key that would unlock a magical and manageable list of English Pronunciation Rules, even though I know that this list doesn't exist and that pronunciation in any language is highly dependent on context, word origin, speaker dialect, etc.


So, I gave up a little and decided that if I could take any "grammatical" English word (i.e. a series of letters allowable according to English spelling rules; non-words like "florinant" are okay, but "flrrrrninr" is not) and produce a possible English pronunciation, that would be pretty good, even if it's not the actual pronunciation of that word. Starting with this understanding, my first step was to go through the consonants and think of all the possible phonemes that they could map to. For example, my list for n looked like this:

                            NN -> _N ("running")
                            NG -> _NG ("sing")
                            N -> _N ("run"), _NG ("think")

Vowels are always more complicated, so for them I flipped the brainstorming process and instead thought through all the possible letter combinations for a given phoneme. For example, my list for _UW (i.e. "ooh") looked like this:

                           _UW -> OU ("ghoul"), OO ("tool"), U ("flute"), EW ("flew"), UE ("due")

The next step was to decide how I wanted to handle the Flex script, or more specifically, what I wanted my tokens to be. I went back and forth on this a little, but ultimately decided that the tokens would be FIRST and LAST characters ("^" and "$") and individual letters, and any grouping that I wanted to do later (such as consonants and vowels) would be done in the Bison script. This left me with a really simple Flex script with only the following definitions:


            [ \t]           ;
            [\n]            {return EOL;}
            [\^]            {return FIRST;}
            [\$]            {return LAST;}

            [Aa]            {return A;}
            [Bb]            {return B;}
            [Cc]            {return C;}
            [Dd]            {return D;}
            [Ee]            {return E;}
            [Ff]            {return F;}
            [Gg]            {return G;}
            [Hh]            {return H;}
            [Ii]            {return I;}
            [Jj]            {return J;}
            [Kk]            {return K;}
            [Ll]            {return L;}
            [Mm]            {return M;}
            [Nn]            {return N;}
            [Oo]            {return O;}
            [Pp]            {return P;}
            [Qq]            {return Q;}
            [Rr]            {return R;}
            [Ss]            {return S;}
            [Tt]            {return T;}
            [Uu]            {return U;}
            [Vv]            {return V;}
            [Ww]            {return W;}
            [Xx]            {return X;}
            [Yy]            {return Y;}
            [Zz]            {return Z;}


And then it was time to start creating my Bison rules. Since I knew, as discussed above, that there would be different rules for what was allowed and how things were pronounced at the beginning and end of a word, I decided to have separate rules for letter tokens found at the beginning, middle, and end of a word. Also, since Bison scripts generally include some amount of recursion, I had to decide how I wanted to structure the tree built by the rules. Although it may not be the most "Bison-y" structure, I decided to start the rules with a strictly right-branching structure to account for the many context-specific pronunciation rules I was anticipating (i.e. s sounds different before a vowel than it does before a consonant like h).


So my Bison script starts like this:

            begin: start EOL {
                       printf("%s\n\n", $1);
                   } begin
                 | /* NULL */
                 ;
            start: FIRST vowel_start {
                    $$ = $2;
                 }
                 | FIRST con_start {
                    $$ = $2;
                 }
            vowel_start: o_start
                       | a_start
                       | u_start
                       | i_start
                       | e_start
            o_start: aa_ou
                   | aw_ou
                   | ow_oa 
                   | uw_oo
                   | oy
                   | aa_o 
                   | ow_o
            .
            .
            .
            con_start: simple_start
                     | complex_start
         
            simple_start_con: aff_start
                            | approx
                            | v 
                            | sibilant
            
            aff_start: C H  {
                        char str[] = "CH_";
                        $$ = (char*)malloc(strlen(str)+1);
                        strcpy($$, str);
                     }
                     | J {
                        char str[] = "JH_";
                        $$ = (char*)malloc(strlen(str)+1);
                        strcpy($$, str);
                     }
            .
            .
            .
            simple_start: simple_start_con vowel_middle {
                            $$ = (char*)malloc(strlen($1)+strlen($2)+1);
                            strcpy($$, $1);
                            strcat($$, $2);
                        }
                        | simple_start_con vowel_end {
                            $$ = (char*)malloc(strlen($1)+strlen($2)+1);
                            strcpy($$, $1);
                            strcat($$, $2);
                        }
            
            complex_start: v_stop_start
                         | stop_start
                         | c_start
                         | t_start
                         | dental_start
                         | nasal_start
                         | r_start
                         | l_start
                         | s_start
                         | f_start
                         | q_start
            

            v_stop_start: v_stop vowel_middle {
                            $$ = (char*)malloc(strlen($1)+strlen($2)+1);
                            strcpy($$, $1);
                            strcat($$, $2);
                        }    
                        | v_stop vowel_end {
                            $$ = (char*)malloc(strlen($1)+strlen($2)+1);
                            strcpy($$, $1);
                            strcat($$, $2);
                        }
                        | v_stop r_start {
                            $$ = (char*)malloc(strlen($1)+strlen($2)+1);
                            strcpy($$, $1);
                            strcat($$, $2);
                        }
                        | v_stop l_start {
                            $$ = (char*)malloc(strlen($1)+strlen($2)+1);
                            strcpy($$, $1);
                            strcat($$, $2);
                        }
            .
            .
            .


As opposed to most of the more traditional Bison scripts that I've seen which create rules that can be re-used as part of a tidy recursion (like the simple concatenation example at the beginning of this post), I ended up using the rules more to create categories of pieces that behave similarly and can be used again in other rules. I'm not sure if this is the best way to attack this type of project, but it was the most natural way for my brain to put it together.

I think you can start to see the natural development of these rules even in this short excerpt of code:
  1. We have a word and it can either start with a vowel or a consonant, so let's make a split there. 
  2. For vowel starts, there are five possible vowel letters that can occur, and we can split into each of those vowels and list the vowel-phonemes that are possible at the start of a word (note that this list is different than the vowel-phonemes that are possible in the middle or end of a word).
  3. For consonant starts, the consonant can either be simple (only one consonant phoneme) or complex (multiple consonant phonemes).
  4. For simple consonant starts, only certain consonant starts will always be simple--the example above shows the affricate phonemes _CH (i.e. "chin") and _JH (i.e. "jewel"), which cannot occur as a complex consonant at the beginning of a word.
  5. For complex consonant starts, groups of consonants follow different rules for how they are allowed to be combined to create a complex consonant. The example above shows the rules for voiced, non-dental stops (i.e. g, b), which can be followed by a vowel (like in a simple consonant start), or can be found preceding an r or an l that is also following word-start rules (i.e. "ground" or "blue"). However, unlike some other consonants, voiced stops cannot be followed by another stop (t, k, p), a nasal (n, m), or a fricative (f, v) at the beginning of a word, and so they need to be in their own category.
It's probably pretty obvious that this rule-making process was slow and required a lot of amendments along the way. One of the major changes that I had to continuously make along the way was to choose a single phoneme match for any letter-token and following-context combination. Take, for example, the words "through" and "though". English speakers magically know that the vowel in the first word is _UW and the vowel in the second word is _OW, even though there isn't any orthographic context that tells us that's the case (especially if we also consider the word "rough"). This is where my acceptance of imperfection, or producing as an output a possible English pronunciation rather than the actual English pronunciation, comes in. So each time that I came across an ambiguity like this, I had to make a command decision about which phoneme I wanted my program to choose. In this particular case I decided to go with _OW as the matched phoneme when ough appears at the end of a word, as seen in the rule below:

o_end: O W LAST {
        char str[] = "OW";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }
     | O H LAST {
        char str[] = "OW";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }
     | O R LAST {
        char str[] = "ER";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }
     | O O LAST {
        char str[] = "UW";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }
     | O Y LAST {
        char str[] = "OY";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }
     | O U G H LAST {
        char str[] = "OW";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }
     | O E LAST {
        char str[] = "OW";
        $$ = (char*)malloc(strlen(str)+1);
        strcpy($$, str);
     }

Although I personally would enjoy listing all of the rules that I came up with and the reasoning behind them, it's probably not a very productive use of my time and it seems pretty unlikely that anyone else would ever be interested enough to read about all 124 of them... So instead I'll leave you with a link to the github repository with my full Flex, Bison, and makefile scripts, as well as some examples of inputs and outputs from my final program. But please, if you have any questions about what I've done or the decisions I made, be assured that I would LOVE to talk about it :-)

            in:    ^bothersome$
            out:    B_AA_TH_EH_R_S_OW_M

            in:    ^without$
            out:    W_IH_TH_AW_T

            in:    ^something$
            out:    S_OW_M_TH_IH_NG

            in:    ^thisisaword$
            out:    TH_IH_S_IH_S_AA_W_OW_R_D

            in:    ^florinant$
            out:    F_L_OW_R_IH_N_AE_N_T


Conclusion: Guess what? Language is complicated.

I knew this. Or, I should have known this--I've already spent a lot of time learning this. But the thing about language is that human brains are so good at it, which makes it easy to forget how complicated it really is. My program did not turn out exactly how I had imagined it would, but I'm still pretty happy with the results given the framework I was working within. I am also glad to have had the opportunity to dig more deeply into the nuts and bolts of language than I have for quite a while--an endeavor which continued into another syntax-related project that I completed this spring and hope to write about in a future blog post. I expect to draw on my (however distant) familiarity with language a lot in my research, so any review I can do in the meantime feels productive.

In addition to the language-related learning/relearning that I accomplished with this project, I also learned a little about building a compiler, stretched my recursive code-writing muscles, and gained way more familiarity with strings in C than I ever wanted. Generally, I feel good about the outcome even though it's not exactly what I set out to do.
