Friday, March 2, 2018

Fun with FUNctions!

And also dates happened.

This week I finished up the sections in R for Data Science that I'd wanted to cover before moving on to Text Mining with R, which is a really exciting place to be. We've also started working in R in my Data Analytics course, so once again things are matching up quite well between my different studies. Although I'm definitely ahead of everyone else in terms of using R thanks to this independent study, a lot of what we're doing in class is in base R, which I really haven't gotten into very much yet. Sometimes my reaction to base R functions is “Cool, that's so straightforward!”, but usually it's more like “Well, this is good to know, but I will never use it if I can use X instead…” So, thanks, tidyverse contributors, for making things easy and pretty!

Although I worked through a lot of chapters this week, I definitely spent the most time working on various functions and date formats. I did start out with some basic work with factors, practicing cleaning up data in situations with complicated or inconsistent levels:

#Lump the rincome levels to 5 labeled levels and one "other"
gss_cat%>%
  mutate(rincome=fct_lump(rincome, n=5))%>%
  count(rincome)
## # A tibble: 6 x 2
##   rincome            n
##   <fct>          <int>
## 1 $25000 or more  7363
## 2 $20000 - 24999  1283
## 3 $15000 - 19999  1048
## 4 $10000 - 14999  1168
## 5 Not applicable  7043
## 6 Other           3578
#Collapse the partyid levels into clearer groups and plot the proportion of each party over time
gss_cat%>%
  mutate(partyid=fct_collapse(partyid,
            "other"=c("No answer","Don't know","Other party"),
            "rep"=c("Strong republican","Not str republican"),
            "dem"=c("Strong democrat","Not str democrat"),
            "ind"=c("Ind,near rep","Independent","Ind,near dem")))%>%
  group_by(year)%>%
  count(partyid)%>%
  mutate(prop=(n/sum(n)))%>%
  ggplot(aes(x=year, y=prop, color=partyid))+
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess'

(plot: proportion of each partyid group over time)

Then I started with some basic work on dates:

# List the first day of each month in the current year
months <- c(1:12)
first_days_thisyear <- ymd(str_c(year(today()), months, 1, sep="-"))
first_days_thisyear
##  [1] "2018-01-01" "2018-02-01" "2018-03-01" "2018-04-01" "2018-05-01"
##  [6] "2018-06-01" "2018-07-01" "2018-08-01" "2018-09-01" "2018-10-01"
## [11] "2018-11-01" "2018-12-01"
#Calculate my current age in years
my_age <- (ymd(19890702)%--%today())/(years(1))
my_age
## [1] 28.66575
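As a cross-check, the same age calculation can be sketched in base R without lubridate. This is only an approximation, since it divides by an average year length rather than walking the calendar:

```r
# Approximate age in years using base R date arithmetic
birth <- as.Date("1989-07-02")
age_days <- as.numeric(difftime(Sys.Date(), birth, units = "days"))
age_years <- age_days / 365.25  # average year length, so only approximate
age_years
```

The two answers drift apart slightly around leap years, which is exactly why lubridate's interval-and-period approach is nicer.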

And then I spent a really long time trying to figure out how to work with the datetime format within the nycflights13 dataset. This was also my first taste of creating a function, with the goal of converting the strangely formatted date and time fields in this dataset into an accepted datetime format.

make_datetime_100 <- function(year, month, day, time){
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights%>%
  filter(!is.na(dep_time), !is.na(arr_time))%>%
  mutate(date=make_date(year, month, day),
         dep_time=make_datetime_100(year, month, day, dep_time),
         arr_time=make_datetime_100(year, month, day, arr_time),
         sched_dep_time=make_datetime_100(year, month, day, sched_dep_time),
         sched_arr_time=make_datetime_100(year, month, day, sched_arr_time))%>%
  select(origin, date, dest, ends_with("delay"), ends_with("time"))

This all went fine (and was pretty directly guided by the textbook), but then when I tried to work with this new df, several complications arose. The task that made everything difficult was trying to compare the reported delay with a manually calculated difference between the scheduled time and the actual time of each flight. I arbitrarily chose to work with the departure times instead of arrival times, but theoretically the same issues would have occurred either way.

So first I had to figure out how to subtract two datetime vectors from each other, and express the difference in minutes. After much struggle, I got difftime to work, but in order to do so I had to convert each datetime into POSIXct first.
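(As it turns out, make_datetime() already produces POSIXct under the hood, so the conversion is effectively a no-op. Still, here's a minimal base R sanity check of how difftime behaves on two POSIXct values:)

```r
# difftime on two POSIXct values, expressed in minutes
t_sched  <- as.POSIXct("2013-01-01 05:15:00", tz = "UTC")
t_actual <- as.POSIXct("2013-01-01 05:17:00", tz = "UTC")
difftime(t_actual, t_sched, units = "mins")
## Time difference of 2 mins
```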

flights_dt <- flights_dt%>%
  mutate(calc_delay=difftime(as.POSIXct(dep_time, format="%FT%R"), 
                             as.POSIXct(sched_dep_time, format="%FT%R"), 
                             units = "mins"))%>%
  select(dep_delay, calc_delay, sched_dep_time, dep_time, arr_time, sched_arr_time)
head(flights_dt)
## # A tibble: 6 x 6
##   dep_delay calc_delay sched_dep_time      dep_time           
##       <dbl> <time>     <dttm>              <dttm>             
## 1      2.00 2          2013-01-01 05:15:00 2013-01-01 05:17:00
## 2      4.00 4          2013-01-01 05:29:00 2013-01-01 05:33:00
## 3      2.00 2          2013-01-01 05:40:00 2013-01-01 05:42:00
## 4     -1.00 -1         2013-01-01 05:45:00 2013-01-01 05:44:00
## 5     -6.00 -6         2013-01-01 06:00:00 2013-01-01 05:54:00
## 6     -4.00 -4         2013-01-01 05:58:00 2013-01-01 05:54:00
## # ... with 2 more variables: arr_time <dttm>, sched_arr_time <dttm>

Looking at these first few observations, it seems like the recorded delay and the calculated delay are already equal to each other. However, when we run the correlation, we can see that this is not the case. This is confirmed and explained when we filter for observations in which the reported and calculated delays differ:

cor(flights_dt$dep_delay, as.numeric(flights_dt$calc_delay))
## [1] 0.2261434
flights_dt%>%
  filter(dep_delay!=as.numeric(calc_delay))%>%
  select(dep_delay, calc_delay, sched_dep_time, dep_time)
## # A tibble: 1,205 x 4
##    dep_delay calc_delay sched_dep_time      dep_time           
##        <dbl> <time>     <dttm>              <dttm>             
##  1     853   -587       2013-01-01 18:35:00 2013-01-01 08:48:00
##  2      43.0 -1397      2013-01-02 23:59:00 2013-01-02 00:42:00
##  3     156   -1284      2013-01-02 22:50:00 2013-01-02 01:26:00
##  4      33.0 -1407      2013-01-03 23:59:00 2013-01-03 00:32:00
##  5     185   -1255      2013-01-03 21:45:00 2013-01-03 00:50:00
##  6     156   -1284      2013-01-03 23:59:00 2013-01-03 02:35:00
##  7      26.0 -1414      2013-01-04 23:59:00 2013-01-04 00:25:00
##  8     141   -1299      2013-01-04 22:45:00 2013-01-04 01:06:00
##  9      15.0 -1425      2013-01-05 23:59:00 2013-01-05 00:14:00
## 10     127   -1313      2013-01-05 22:30:00 2013-01-05 00:37:00
## # ... with 1,195 more rows

Here we can see that all of the mismatched values are on flights that were delayed overnight. Since the date of each flight is based on the scheduled departure date, that date was transferred onto the actual departure time, even if the actual departure occurred after midnight that night.

The textbook addresses this issue for the arrival times, since the same error makes every overnight flight appear to arrive before it departs. The authors fix it with the following code:

flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )
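The trick in days(overnight * 1) is that multiplying a logical by 1 coerces TRUE/FALSE to 1/0, so a day gets added only to the overnight flights. In miniature:

```r
# Logical-to-numeric coercion: TRUE * 1 is 1, FALSE * 1 is 0
overnight <- c(TRUE, FALSE, TRUE)
overnight * 1
## [1] 1 0 1
```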

However, this method doesn't apply directly as written for the issue I'm seeing with the delays. Since the root of the problem is that the reported departure time is almost a full day before the scheduled departure time, I changed the dates on the reported departure time like so:

flights_dt <- flights_dt %>% 
  mutate(
    overnight = (sched_dep_time - dep_time >hours(1)),
    dep_time = dep_time + days(overnight * 1),
    calc_delay=difftime(as.POSIXct(dep_time, format="%FT%R"), 
                        as.POSIXct(sched_dep_time, format="%FT%R"), 
                        units = "mins"))

I figured it was pretty unlikely that any plane left more than 1 hour before its scheduled time, but theoretically this could still be an issue. Then, of course, I had to recompute calc_delay using the updated datetimes. And happily, the correlation shows us what we want to see!

cor(flights_dt$dep_delay, as.numeric(flights_dt$calc_delay))
## [1] 1

After all of this struggle, I can see why the authors spent a whole chapter on dates, and why my professor thought it was a good idea for me to work through it. Who knew that dates were so complicated?

Now for the FUNction part

I have to admit that this chapter was the hardest one for me yet. I think part of this was because I haven't used a lot of the built-in functions that are referenced here (even simple ones like range or length), so I had to consider everything in pieces. But also, the format of the functions is often not very straightforward. It still bothers me that there isn't any connector between the arguments of the function in parentheses and the actual function in curly brackets, like this: function(x,y){here's the function}. It doesn't help that this seems to upset RStudio as well, so there has been a lot of running and re-running to try to get the whole function to run. All in all, this chapter was rather frustrating, but in some ways that made it more satisfying when things went right.

These first functions were pretty instinctual:

#Creates a function that returns a greeting depending on the current time of day
say_hi <- function(x){
  if (4 < hour(x) && hour(x) < 10) {
    "Good Morning"
  } else if (10 <= hour(x) && hour(x) < 17) {
    "Good Afternoon"
  } else {
    "Good Evening"
  }
}
say_hi(now())
## [1] "Good Evening"
#Creates a function that returns "fizz", "buzz", or "fizzbuzz" depending on the divisibility of the input
fizzbuzz <- function(x){
  if (x %% 3 == 0 && x %% 5 == 0) {
    "fizzbuzz"
  } else if (x %% 3 == 0) {
    "fizz"
  } else if (x %% 5 == 0) {
    "buzz"
  } else {
    as.character(x)  # otherwise, return the number itself
  }
}
fizzbuzz(3)
## [1] "fizz"
fizzbuzz(5)
## [1] "buzz"
fizzbuzz(15)
## [1] "fizzbuzz"

One issue that I ran into repeatedly was the inability to use a vector with the if function. I know that ifelse is vectorised, but nesting ifelse statements just feels so much clunkier than using if. As far as I could tell, the main workarounds are nested ifelse calls or dplyr's case_when(), neither of which is as clean as a plain if.
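For example, a vectorised fizzbuzz has to fall back on nested ifelse calls (a sketch, not from the textbook; the function name is mine):

```r
# Vectorised fizzbuzz: nested ifelse works on whole vectors at once
vfizzbuzz <- function(x){
  ifelse(x %% 15 == 0, "fizzbuzz",
    ifelse(x %% 3 == 0, "fizz",
      ifelse(x %% 5 == 0, "buzz", as.character(x))))
}
vfizzbuzz(c(3, 5, 7, 15))  # returns "fizz" "buzz" "7" "fizzbuzz"
```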

The next puzzle that stumped me was creating a function that would return the number of matching NAs from two vectors. I really wanted to use count since it's so slick, but apparently it only works on 2-dimensional objects, so inputting two vectors was not an option. I came up with the following workaround:

na1 <- c(1,2,4,NA,5,NA,5,NA)
na2 <- c(3,4,NA,NA,NA,5,3,NA)
both_na <- function(x,y){
  tibble(x,y)%>%
    count(is.na(x), is.na(y))}
both_na(na1, na2)
## # A tibble: 4 x 3
##   `is.na(x)` `is.na(y)`     n
##   <lgl>      <lgl>      <int>
## 1 F          F              3
## 2 F          T              2
## 3 T          F              1
## 4 T          T              2

This worked OK, but it gave us information about all of the intersections, not just matching NAs. I struggled with a better solution for a long time, mainly due to one mistake that I unknowingly repeated over and over. Eventually, I learned that using && instead of & in a logical statement between vectors limits the comparison to the first value of each vector. Once I figured that out, I was able to use ifelse to create my own vectorised count:

NA_count <- function(x,y){
  sum(ifelse((is.na(x)&is.na(y)), 1, 0))
}

NA_count(na1, na2)
## [1] 2
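The & versus && difference is easy to demonstrate (and worth knowing: as of R 4.3, calling && on vectors longer than one is an outright error rather than a silent first-element comparison):

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
a & b          # elementwise: TRUE FALSE FALSE
a[1] && b[1]   # scalar comparison of single values: TRUE
# a && b       # pre-4.3 R: silently compares only a[1] and b[1]; R >= 4.3: error
```

Incidentally, since TRUE sums as 1, sum(is.na(x) & is.na(y)) would also do the job of NA_count in one step.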

Next I worked on creating functions that would select specific subsets of vectors:

#Returns the last value of any vector
x <- c(1:10, NA, 11:13)
y <- c(-4:25, NA, NA, 15)
last_value <- function(x){
  x[[length(x)]]
}
last_value(x)
## [1] 13
#Returns only even values
even_number <- function(x){
  x[!is.na(x) & x %% 2 == 0] 
}
even_number(y)
##  [1] -4 -2  0  2  4  6  8 10 12 14 16 18 20 22 24
#Returns all values in even-numbered positions
even_position <- function(x){
  x[c(seq(2, length(x), 2))]
}
even_position(y)
##  [1] -3 -1  1  3  5  7  9 11 13 15 17 19 21 23 25 NA
even_position(x)
## [1]  2  4  6  8 10 11 13
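One caveat with last_value as written: on a zero-length vector, x[[length(x)]] throws an error. A hedged variant (my own addition) that returns NA for empty input instead:

```r
last_value_safe <- function(x){
  if (length(x) == 0) return(NA)  # guard the empty-vector case
  x[[length(x)]]
}
last_value_safe(c(1:10, NA, 11:13))  # 13
last_value_safe(integer(0))          # NA
```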

Even though this week was frustrating at times, I feel like I'm at a good place to be transitioning into the tidytext package. And boy am I excited to get started with that.

My dad sent me a short and sweet tidytext analysis that he did on one of my brother's novels a few years ago when he was playing with the package. It was SO cool to be looking through his code and mostly understanding what he was doing (if not the specific functions he was calling). In my family, I think my siblings and I always just assumed that the science-y computer-y things that my dad did were too complicated for us to really understand. It's a weird feeling to realize that he and I are going to have this in common as I move further into this career path, but it's also really nice. An added bonus was the fact that this analysis was on my brother's words as well. Tidytext family fun! Can't wait for more :-)

1 comment:

  1. Great post. Make sure to get in the habit of explicitly using the `return` statement in your functions. That's a good way to make sure you don't get some hard-to-debug errors.

