I'm A Republican Because... Visualized With R
The GOP recently relaunched its main web site with a new design and numerous interactive and social features like Facebook integration, blogs, etc. Of particular interest is the GOP Faces section, which asks users to submit a photo and answer the question "Why are you a Republican?" Not being a Republican, I was curious to see if there were any common themes among the submissions that would lead to insights about being a Republican and GOP.com user. Not excited about actually reading all 180 reasons, I instead used R to download, transform, analyze and visualize the data for me.
I used the XML package to fetch and extract the reasons, tm to filter stop words and identify commonly used terms, and plyr to assemble the results into a data frame. Finally, I used ggplot2, the invaluable ggplot2 book, and a helpful post from the R-help mailing list to build the visualization.
R code
library(XML)
library(plyr)
library(ggplot2)
library(tm)
# fetch & parse the HTML
doc <- htmlParse("http://gop.com/index.php/learn/republican_faces/",isURL = TRUE)
# pull the matching A elements of CSS class tipz
nodes <- getNodeSet(doc, "//a[@class='tipz']")
# extract the 'title' attribute
titles <- sapply(nodes, function(x) xmlAttrs(x)[["title"]])
# clean up the title attribute
titles <- sub("^[^:]+::","",titles)
# create the corpus and doc term matrix
co <- Corpus(VectorSource(titles))
tdm <- DocumentTermMatrix(co, control=list(tolower=TRUE, removeNumbers=TRUE, stopwords=TRUE))
# extract the terms that occur exactly 1, 2, 3 or 4 times
levels <- c(1,2,3,4)
df <- ldply(levels, function(x) data.frame(freq=x,term=findFreqTerms(tdm,x,x)))
# assign random non-repeating x coordinates to the terms
df$x <- sample(1:nrow(df), nrow(df), replace=FALSE)
df$y <- df$freq + rnorm(nrow(df))
# clear standard graph options (thanks mike lawrence on r-help)
# theme()/element_blank() replace opts()/theme_blank() in current ggplot2 releases
clear <- theme(
legend.position = 'none'
, panel.grid.minor = element_blank()
, panel.grid.major = element_blank()
, panel.background = element_blank()
, axis.line = element_blank()
, axis.text.x = element_blank()
, axis.text.y = element_blank()
, axis.ticks = element_blank()
, axis.title.x = element_blank()
, axis.title.y = element_blank()
)
# plot each term as text, sized and coloured by frequency, on polar coordinates
p <- ggplot(df, aes(x=x, y=y, colour=freq, label=term, size=freq)) + geom_text() + coord_polar() + clear
ggsave("because.png",p,dpi=72,scale=1.3)
ggsave("because.pdf", p)
And the output:
Click for a page-sized PDF, or the raw terms and frequency counts.
The most common term is 'freedom', followed by 'equal' and 'pro'. After those come 'personal', 'government', 'people', 'school', 'family', and 'believe'. A more robust analysis could use term extraction (pro family, pro life, anti government) or stemming, and then feed the results into a better visualization. That would take more than the 10 minutes I've spent so far, so I'm leaving that as an exercise for somebody else.
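For anyone who picks up that exercise, here is a minimal sketch of the term-extraction idea in base R. It counts two-word phrases rather than single words. The `answers` vector and the `bigrams` helper are made up for illustration; the real input would be the `titles` vector from the code above, and stop words are not filtered here.

```r
# Hypothetical sample answers; the real input would be the `titles` vector.
answers <- c(
  "I am pro life and pro family",
  "I believe in personal freedom",
  "Pro family and limited government"
)

# Split one answer into lowercase words and pair each word with the next one.
bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty strings left by the split
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Count every two-word phrase across all answers, most frequent first.
counts <- sort(table(unlist(lapply(answers, bigrams))), decreasing = TRUE)
print(head(counts, 5))
```

A data frame built from `counts` could then be fed to the same ggplot2 call in place of the single-word terms.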
As it is, I have the most common answer as to why GOP.com visitors are Republicans: freedom. I think that's probably why anybody belongs to any political party, but without a corpus from other parties I suppose we'll never know.