Friday, February 23, 2018

Working with Strings, and I'm not even tangled!

This week I continued to make my way through the Wrangle section of R for Data Science, focusing mainly on the chapter on strings. It shouldn't surprise me, since the focus of this independent study (and potentially my future career) is working with textual data, but working with character strings has been SO FUN!

Like the rest of this course so far, working through this chapter has been a nice confirmation that I've chosen a good path and actually do enjoy doing the work that I've imagined myself doing for years now. I think anyone who is going/has gone through a career change can probably attest to how much relief comes with that feeling of confirmation. Both my brother- and sister-in-law are also in the midst of major career changes, so the subject has been a frequent topic of discussion in our family. Basically, I think the consensus is that changing careers is really hard, stressful, and scary, so any feelings of relief are quite welcome.

Regular Expressions: You Mean, AMAZING EXPRESSIONS?!?

Working within the really wonderful framework of the stringr package, all of the otherwise complicated business of manipulating character strings becomes pretty straightforward. That really just leaves the puzzle of regular expressions to figure out. Luckily, I feel like my entire childhood (and, let's face it, adulthood) of working through grocery store puzzle books from cover to cover has really just been preparing me for the day when I got to play with regex and call it work.
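For anyone who hasn't used stringr before, every function starts with str_ and takes the string as its first argument, which makes the whole toolkit easy to discover and to pipe through. Here's a tiny sketch with a toy vector of my own (not from the book) just to show the flavor:

library(stringr)

fruits <- c("apple", "banana", "pear")
str_detect(fruits, "an")        # which strings contain "an"? FALSE TRUE FALSE
str_replace(fruits, "a", "-")   # replace the first "a" in each string
str_length(fruits)              # number of characters: 5 6 4
str_c(fruits, collapse = ", ")  # "apple, banana, pear"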

When I reached the end of the first section on regex, and one of the exercises was to go to https://regexcrossword.com/ and complete the beginner puzzles, I felt like I had cheated the system. Was I really allowed to do puzzles and still pretend like I was studying? But since it was written right there in the textbook, I decided I was allowed to spend as much time doing it as I wanted…so there went the rest of the day.

The book starts us out doing pretty simple operations, like finding all words in a list that start with "y" or end with "x":

str_subset(words, "(^y)|(x$)")
##  [1] "box"       "sex"       "six"       "tax"       "year"     
##  [6] "yes"       "yesterday" "yet"       "you"       "young"

Working up to something slightly more complicated, like finding all words in a list with three or more vowels in a row (turns out there aren't any with more than three…):

str_subset(words, "[aeiou]{3,}")
## [1] "beauty"   "obvious"  "previous" "quiet"    "serious"  "various"

Or finding all the words in a list with a repeating letter pair:

str_subset(words, "(..).*\\1")
##  [1] "appropriate" "church"      "condition"   "decide"      "environment"
##  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
## [11] "pressure"    "remember"    "represent"   "require"     "sense"      
## [16] "therefore"   "understand"  "whether"
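That last pattern deserves a closer look: (..) captures any pair of letters, and \\1 demands that the exact same pair appear again later in the word. str_match() makes this visible by returning the captured group alongside the full match (my own illustration):

str_match("church", "(..).*\\1")
##      [,1]     [,2]
## [1,] "church" "ch"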

Then things really do start to pick up, like creating this tibble that counts how many vowels and consonants are in each word, and then uses those values to calculate what share of each word's letters are vowels:

tibble(word = words,
       vowels = str_count(word, "[aeiou]"),
       consonants = str_count(word, "[^aeiou]"),
       vtoc_ratio = vowels / (consonants + vowels)) %>%
  head(10)
## # A tibble: 10 x 4
##    word     vowels consonants vtoc_ratio
##    <chr>     <int>      <int>      <dbl>
##  1 a             1          0      1.00 
##  2 able          2          2      0.500
##  3 about         3          2      0.600
##  4 absolute      4          4      0.500
##  5 accept        2          4      0.333
##  6 account       3          4      0.429
##  7 achieve       4          3      0.571
##  8 across        2          4      0.333
##  9 act           1          2      0.333
## 10 active        3          3      0.500
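Since the result is just a regular tibble, sorting it is trivial; piping the same table into arrange() would surface the most vowel-heavy words (a usage sketch, output omitted):

tibble(word = words,
       vowels = str_count(word, "[aeiou]"),
       consonants = str_count(word, "[^aeiou]"),
       vtoc_ratio = vowels / (consonants + vowels)) %>%
  arrange(desc(vtoc_ratio)) %>%
  head(5)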

Or pulling all sentences that mention a color, padding each color name with spaces so that, for example, the "red" hiding inside "flickered" doesn't count as a match:

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(" ", colors, " ", collapse = "|")
str_subset(sentences, color_match) %>%
  head(10)
##  [1] "Glue the sheet to the dark blue background."   
##  [2] "Two blue fish swam in the tank."               
##  [3] "A wisp of cloud hung in the blue air."         
##  [4] "Leaves turn brown and yellow in the fall."     
##  [5] "The spot on the blotter was made by green ink."
##  [6] "The sofa cushion is red and of light weight."  
##  [7] "A blue crane is a tall wading bird."           
##  [8] "It is hard to erase blue or red ink."          
##  [9] "The lamp shone with a steady green flame."     
## [10] "The box is held by a bright red snapper."
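One quirk of the space-padding trick is that it misses a color at the very start of a sentence or right before punctuation (e.g., "red."). A word-boundary version of the pattern sidesteps that (my own variation, not from the book):

color_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
str_subset(sentences, color_match2) %>%
  head(10)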

At one point the book asks us to find all the words in the sentences list that end in "ing":

ing <- str_subset(sentences, regex(" [a-z]*ing ", ignore_case = TRUE))
str_extract_all(ing, regex(" [a-z]*ing ", ignore_case = TRUE), simplify = TRUE) %>%
  head(10)
##       [,1]        
##  [1,] " winding " 
##  [2,] " king "    
##  [3,] " making "  
##  [4,] " raging "  
##  [5,] " playing " 
##  [6,] " sleeping "
##  [7,] " glaring " 
##  [8,] " dying "   
##  [9,] " lodging " 
## [10,] " filing "

Well, when I saw this list, the first thing I thought was how useless it was, since it lumps gerunds together with words like "king", "thing", and "sing" that happen to end in "ing" but don't follow the same rules. So I decided to try something a little more specific to isolate verbs ending in "ing" by requiring at least one vowel before the "ing". Of course, any linguist will bring up words like "herring" that would still be selected by this rule, but I think I did pretty well:

gerund_string <- " [a-z]*[aeiouy][a-z]*ing "
gerunds <- str_subset(sentences, regex(gerund_string, ignore_case = TRUE))
str_extract_all(gerunds, gerund_string, simplify = TRUE) %>%
  head(10)
##       [,1]        
##  [1,] " winding " 
##  [2,] " making "  
##  [3,] " raging "  
##  [4,] " playing " 
##  [5,] " sleeping "
##  [6,] " glaring " 
##  [7,] " dying "   
##  [8,] " lodging " 
##  [9,] " filing "  
## [10,] " making "

Similarly, the expression needed to find plurals is more complicated than just finding words that end in "s": requiring an "e" or a consonant other than "s" right before the final "s" screens out words like "this", "gas", "less", and possessives like "it's":

plural_string <- "[a-z]*(e|[^aeious'])s "
plurals <- str_subset(sentences, 
                      regex(plural_string, ignore_case=TRUE))
str_extract_all(plurals, plural_string, simplify = TRUE) %>%
  head(10)
##       [,1]         [,2]      [,3]
##  [1,] "days "      ""        ""  
##  [2,] "lemons "    "makes "  ""  
##  [3,] "hogs "      ""        ""  
##  [4,] "hours "     ""        ""  
##  [5,] "stockings " ""        ""  
##  [6,] "helps "     ""        ""  
##  [7,] "fires "     ""        ""  
##  [8,] "pants "     ""        ""  
##  [9,] "books "     ""        ""  
## [10,] "keeps "     "chicks " ""
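The simplify = TRUE matrix, with its empty padding cells, is a bit awkward to work with; leaving the result as a list and then flattening and trimming it gives a cleaner vector (just a tidying sketch):

str_extract_all(plurals, plural_string) %>%
  unlist() %>%
  str_trim() %>%
  head(10)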

Unfortunately, I couldn't think of a pattern that could weed out third-person verbs ending in "s" like "helps", but even if I could, it wouldn't be able to account for words like "needs", which could be either a verb or a plural noun. I guess I'll just have to wait to do that kind of mining until I've learned about POS tagging in the tidytext book. Perhaps I'm showing my nerd a little here, but I'm actually really excited for that day.

My last task in this chapter was to find out which individual words are used the most often within the sentences list. It was a bit of a fight to figure out how to count the words individually, even after splitting them up, but I am quite happy with the solution I came up with:

sent_word <- str_extract(sentences, boundary("word"))
sent_words <- tibble(words = sent_word)
sent_words %>%
  count(words) %>%
  arrange(desc(n))
## # A tibble: 262 x 2
##    words     n
##    <chr> <int>
##  1 The     262
##  2 A        72
##  3 He       24
##  4 We       13
##  5 It       12
##  6 They     10
##  7 She       9
##  8 There     7
##  9 Take      5
## 10 This      5
## # ... with 252 more rows
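One wrinkle I only noticed afterwards: str_extract() returns just the first match, so the table above is really counting the first word of each sentence, which is why it's full of capitalized sentence-starters like "The" and "A". To count every word, str_extract_all() plus unlist() (and a little case-folding) should do the trick, sketched here without output:

all_words <- sentences %>%
  str_extract_all(boundary("word")) %>%  # every word in every sentence
  unlist() %>%
  str_to_lower()                         # so "The" and "the" count together

tibble(words = all_words) %>%
  count(words) %>%
  arrange(desc(n))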

Although I do have a bit more that I want to learn from R for Data Science before I move on to Text Mining with R, the work that I did this week has me more excited than ever to do so. Look, everyone! I'm really using my undergraduate degree after all! I know things and they are useful! Those are statements that I wasn't sure I would ever utter, and it feels pretty darn good to be able to say them without any sarcasm.
