Tuesday, January 30, 2018

Creating a True Pie Chart in R with ggplot2

Ultimate Goal: TidyText Master

I'm very excited to be doing some preliminary work towards my MSBA (Master of Science in Business Analytics) degree this semester. Besides the prerequisites that I'm completing in order to start the program's core classes next fall, I also managed to convince the administration to let me do an independent study in text mining using R. With my background in Linguistics, performing analytics on textual data is one of the things I am most excited about doing with this degree. So in my opinion, the earlier I can start, the better.

Of course, as my professor and I started looking at the textbook we had planned to use for the course (Text Mining with R: A Tidy Approach by Julia Silge and David Robinson), we realized that I should probably go back and learn some of the basics about using the "tidyverse" as it's called by its nearest and dearest. So I'm currently making my way through R for Data Science by Hadley Wickham and Garrett Grolemund to get my foundation in using R more generally, before I begin my journey toward the title of TidyText Master.

Visualizations with ggplot2

The book, R for Data Science, is structured so that readers get to create visualizations first thing. So we practice making various types of plots (scatter, line, bar, etc.). Most of this chapter uses the mpg and diamonds datasets to produce the plots. Here are some examples of the plots that I made from the book's instruction:


library(ggplot2)  # provides ggplot() plus the mpg and diamonds datasets

ggplot(data = mpg) +
    geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
    labs(title = "Average Highway MPG vs. Engine Size", x = "Engine Size", y = "Avg Hwy MPG", color = "Car Type")



ggplot(data = mpg, mapping = aes(x = displ, y = cty, color = drv)) +
    geom_smooth(se = FALSE) +
    geom_point() +
    labs(title = "Average City MPG vs. Engine Size", x = "Engine Size", y = "Avg City MPG", color = "Drive Type")



ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack") +
    labs(title = "Diamond Cut Quality and Clarity", x = "Diamond Cut Quality", y = "Count", fill = "Clarity")

At the end of the chapter on visualizations, the authors introduce a few different ways to manipulate the graph coordinates to change the plot, including using polar coordinates to create a Coxcomb chart like this:


ggplot(data = diamonds) +
    geom_bar(mapping = aes(x = cut, fill = cut), width = 1, show.legend = FALSE) +
    coord_polar() +
    labs(title = "Diamond Cut Quality", x = NULL, y = NULL)

Creating a True Pie Chart

When I saw how they were creating a Coxcomb this way, I immediately started to consider how I could use these tools to create a regular pie chart. For a smooth circle, every slice needs the same radius, which in bar-chart terms means every bar would need to be the same length. However, if you make all the bars in a bar graph the same length, you lose all the meaning from the graph. So, instead of making them all the same length, I realized that I would need to create a single variable that all of the data could fit under. With position = "stack", every class then sits in one bar, so all of the segments have the same width and only their heights differ.

I cheated by looking ahead into the next chapter in order to learn how to create new variables using the mutate function. I assigned the result to a new data frame made especially for this pie chart, called mpgpie. Since every vehicle in the dataset has an engine displacement greater than 0 liters, the new variable puts all of the data under one label, "Car".

library(dplyr)  # provides mutate()
mpgpie <- mutate(mpg, cartype = ifelse(displ > 0, "Car", "Not Car"))


ggplot(data = mpgpie) +
    geom_bar(mapping = aes(x = cartype, fill = class), width = 1, position = "stack") +
    labs(title = "Car Type", x = NULL, y = NULL, fill = NULL)

So now that we've got our bar graph with all of the variables in equal width, we should be able to just use the coord_polar() function to turn it into a pretty pie chart, right?


We got our pretty circle, but coord_polar() maps the x-axis to the angle by default, so the stacked counts on the y-axis turn into concentric rings. I thought, I can fix that! I'll use the coord_flip() function first to switch the x- and y-axes, then apply coord_polar(). But it turns out that a plot can only have one coordinate system, so coord_polar() simply replaces coord_flip() (ggplot2 prints a message saying so), and you still get this lovely bullseye doing it that way.
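The two attempts described above can be reproduced with something like the following. This is a reconstruction rather than the exact code behind the plots, and the object names base, bullseye, and flipped are made up for illustration.

```r
library(ggplot2)
library(dplyr)

# Same helper column as above: every row gets the label "Car"
mpgpie <- mutate(mpg, cartype = ifelse(displ > 0, "Car", "Not Car"))

# Shared stacked bar that both attempts start from
base <- ggplot(data = mpgpie) +
    geom_bar(mapping = aes(x = cartype, fill = class), width = 1, position = "stack") +
    labs(title = "Car Type", x = NULL, y = NULL, fill = NULL)

# Attempt 1: coord_polar() defaults to theta = "x", so the stacked
# counts are drawn as concentric rings (the bullseye)
bullseye <- base + coord_polar()

# Attempt 2: coord_polar() replaces coord_flip() rather than combining
# with it, so this draws the same bullseye
flipped <- base + coord_flip() + coord_polar()

bullseye
flipped
```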

Luckily, the coord_polar() function has a theta argument that sets which variable gets mapped to the angle. So, finally, we get:


ggplot(data = mpgpie) +
    geom_bar(mapping = aes(x = cartype, fill = class), width = 1, position = "stack") + 
    labs(title = "Car Type", x = NULL, y = NULL, fill = NULL) + 
    coord_polar(theta = "y")
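As an aside, the helper column may not be strictly necessary. Mapping x to a constant should give the same single stacked bar. A minimal sketch, assuming nothing beyond ggplot2 itself (the name pie_simple is just for illustration):

```r
library(ggplot2)

# Mapping x to a constant puts every row in one bar; geom_bar()
# stacks the classes by default, and theta = "y" turns the stacked
# counts into angles
pie_simple <- ggplot(data = mpg) +
    geom_bar(mapping = aes(x = "", fill = class), width = 1) +
    coord_polar(theta = "y") +
    labs(title = "Car Type", x = NULL, y = NULL, fill = NULL)

pie_simple
```

The trade-off is readability: the mutate() version spells out what the single bar stands for, while the constant keeps everything in one call.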

Look at that pie! I haven't yet figured out how to get rid of the labeling on the chart. Or how to put more appropriate labeling for a pie chart (such as percentage) onto the plot. But this is a good start! And I even had fun doing it.
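One possible way to handle both open items is sketched below. It is an assumption about how this could be done, not something taken from the book chapter: the counts are computed up front with dplyr's count(), geom_text() places a percentage in the middle of each slice, and theme_void() strips the axis labels and grid. The names class_share, label, and pie are hypothetical.

```r
library(ggplot2)
library(dplyr)

# Count vehicles per class and turn the counts into percentage labels
class_share <- mpg %>%
    count(class) %>%
    mutate(label = paste0(round(100 * n / sum(n)), "%"))

pie <- ggplot(data = class_share, mapping = aes(x = "", y = n, fill = class)) +
    geom_col(width = 1) +  # geom_col() stacks the precomputed counts
    geom_text(mapping = aes(label = label),
              position = position_stack(vjust = 0.5)) +  # center each label in its slice
    coord_polar(theta = "y") +
    labs(title = "Car Type", fill = NULL) +
    theme_void()  # drop axes, tick labels, and grid lines

pie
```

Because the percentages are rounded to whole numbers, they may not add up to exactly 100.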

