"I'm a Republican because...", visualized with R


---
layout: single
status: publish
published: true
title: '"I''m a Republican because...", visualized with R'
author:
  display_name: jjh
  login: admin
  email:
  url:
author_login: admin
author_email:
author_url:
excerpt: Visualizing user-generated statements responding to the theme of "I'm a Republican because...", using R.
wordpress_id: 212
wordpress_url:
date: '2009-10-15 13:21:39 -0400'
date_gmt: '2009-10-15 17:21:39 -0400'
categories:
- R
tags:
- visualization
- R
permalink: /archive/:year/:month/:day/:title.html
---

The GOP recently relaunched its main web site with a new design and numerous interactive and social features, such as Facebook integration and blogs. Of particular interest is the GOP Faces section, which asks users to submit a photo and answer the question "Why are you a Republican?" Not being a Republican myself, I was curious whether there were any common themes among the submissions that might offer some insight into the people answering. Not excited about actually reading all 180 reasons, I instead used R to download, transform, analyze, and visualize the data for me.

I used two packages (XML and plyr) to fetch and extract the reasons, then tm to filter stop words and identify commonly used terms. Finally, I used ggplot2, the invaluable ggplot2 book, and a helpful post from the R-help mailing list to build the visualization.

R code


```r
# load the packages used below
library(XML)
library(plyr)
library(tm)
library(ggplot2)

# fetch & parse the HTML (the URL was elided in the original post)
doc <- htmlParse("", isURL = TRUE)
# pull the matching A elements of CSS class tipz
nodes <- getNodeSet(doc, "//a[@class='tipz']")
# extract the 'title' attribute
titles <- sapply(nodes, function(x) xmlAttrs(x)[["title"]])
# clean up the title attribute
titles <- sub("^[^:]+::", "", titles)
# create the corpus and document-term matrix
co <- Corpus(VectorSource(titles))
tdm <- DocumentTermMatrix(co, control = list(tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE))
# extract the terms at each frequency level
levels <- c(1, 2, 3, 4)
df <- ldply(levels, function(x) data.frame(freq = x, term = findFreqTerms(tdm, x, x)))
# assign random non-repeating coordinates to the terms
df$x <- sample(1:nrow(df), nrow(df), replace = FALSE)
df$y <- df$freq + rnorm(nrow(df))

# clear standard graph options (thanks mike lawrence on r-help)
clear <- opts(
         legend.position = 'none'
         , panel.grid.minor = theme_blank()
         , panel.grid.major = theme_blank()
         , panel.background = theme_blank()
         , axis.line = theme_blank()
         , axis.text.x = theme_blank()
         , axis.text.y = theme_blank()
         , axis.ticks = theme_blank()
         , axis.title.x = theme_blank()
         , axis.title.y = theme_blank()
         )

p <- ggplot(df, aes(x = x, y = y, colour = freq, label = term, size = freq)) + geom_text() + coord_polar() + clear
ggsave("because.pdf", p)
```

And the output:

I'm a Republican because...

Click for a page-sized PDF, or the raw terms and frequency counts.
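The raw term/frequency table can be reproduced from the document-term matrix itself; a minimal sketch, assuming the `tdm` object built in the code block above (the `because-terms.csv` filename is my own placeholder):

```r
library(tm)
# in a DocumentTermMatrix, documents are rows and terms are columns,
# so summing each column gives the corpus-wide count for each term
freqs <- sort(colSums(as.matrix(tdm)), decreasing = TRUE)
freq.df <- data.frame(term = names(freqs), freq = freqs, row.names = NULL)
# write out the raw terms and frequency counts
write.csv(freq.df, "because-terms.csv", row.names = FALSE)
head(freq.df)
```

Coercing to a dense matrix is fine at this scale (180 short answers); a larger corpus would call for the sparse accessors instead.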

The most common term is 'freedom', followed by 'equal' and 'pro'. After those come 'personal', 'government', 'people', 'school', 'family', and 'believe'. A more robust analysis could use term extraction (pro family, pro life, anti government) or stemming, and then feed the results into a better visualization. That would take more than the 10 minutes I've spent so far, so I'm leaving it as an exercise for somebody else.
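The stemming variant mentioned above could look something like this; a hedged sketch, assuming the corpus `co` from the code block earlier (tm's `stemDocument` relies on a Snowball stemmer being installed):

```r
library(tm)
# stem each document so variants like 'family'/'families' and
# 'believe'/'believes' collapse into a single term
co.stemmed <- tm_map(co, stemDocument)
tdm.stemmed <- DocumentTermMatrix(co.stemmed,
    control = list(tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE))
# list the stems appearing at least 3 times
findFreqTerms(tdm.stemmed, 3)
```

Stemmed output is uglier to plot ('famili', 'believ'), so a real version would map each stem back to its most frequent surface form before visualizing.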

As it is, I have the most common answer as to why visitors are Republicans: freedom. I suspect that's why anybody belongs to any political party, but without a comparable corpus from other parties we'll never know.