Voter targeting with R
Voter targeting for turnout is the process of scoring registered voters using demographic and electoral variables taken from voter lists and commercial databases. The scores of all voters, taken together, are used to predict overall turnout, which determines the allocation of campaign resources and directs strategy for voter contact and communication.
Targeting for turnout is a three-step process:
- A turnout table is created for a previous election similar to the target election;
- A scoring procedure is implemented with regression, clustering, or some other statistical process; and
- Every voter is scored with a likely turnout percentage.
Depending on his or her turnout percentage - high, middling, or low - a voter will be ignored, targeted for persuasion, or targeted for get-out-the-vote (GOTV) efforts by a campaign. Targeting for turnout, along with almost every other type of political targeting, is explained in detail in Political Targeting by Hal Malchow (2008).
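As a rough illustration of the final step, the sketch below sorts a handful of turnout scores into low, middling, and high bands with base R's cut function. The scores and the cut points (0.35 and 0.70) are made up for the example; they are not taken from Malchow or from the OH-01 data, and a campaign would pick its own thresholds.

```r
# Hypothetical turnout scores for six voters (probabilities between 0 and 1).
scores <- c(0.05, 0.20, 0.45, 0.60, 0.85, 0.97)

# Illustrative cut points only; a real campaign would choose its own.
bands <- cut(scores,
             breaks = c(0, 0.35, 0.70, 1),
             labels = c("low", "middling", "high"),
             include.lowest = TRUE)

# Count how many voters fall in each band.
band_counts <- table(bands)
print(band_counts)
```

Each band can then be handed to a different contact program.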
In this post, I recreate parts of the regression analysis from Chapter Nine (Targeting for Turnout) of Political Targeting (Malchow 2008), using the free R Project for Statistical Computing. R is a programming environment that excels at data manipulation and statistical analysis, making it an interesting alternative to traditional statistical tools, like SPSS or web-based voter management software. The analysis will be performed against the full voters list from Ohio's 1st congressional district, with the intention of predicting turnout for the 2010 congressional midterm elections. This analysis is similar or identical to what a candidate for Ohio's 1st district would perform throughout the election year. The R code for each step in the analysis will be provided inline so a reader can perform the same operations.
OH-01 Voter File
A voter file is a list containing electoral and demographic data on registered voters, maintained by state boards of elections, political parties, PACs, or private companies. My voter file was downloaded from the Ohio Secretary of State in late 2009 and contains: name, address, age, registration date, voting history, and party affiliation (primary voters only).
To better simulate what a political campaign would use, I've appended the following fields:
- Gender: Using birth data from the Social Security Administration, I matched each voter's first name to a probable gender. About 9% of names could not be matched and were coded as an empty string.
- Age Group (2010): Using the birth year, I calculated age as of 2010, and then assigned each voter to an age group: 18-21, 22-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, and 90+.
- Age Group (2006): Using the birth year, I calculated age as of 2006, and then assigned each voter to an age group: 18-21, 22-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, and 90+.
- Household: I grouped voters into discrete households using the full street address and zip code.
- Marriage status: Using the household variable, I performed a very simple marriage determination: people living in the same household with a difference in age < 15 years were flagged as married.
- Last4 (2006): Measures participation in the last 4 major elections prior to 2006: 2004 Primary and General, and 2002 Primary and General. Range 0-4.
- Last4 (2008): Measures participation in the last 4 major elections prior to 2008: 2006 Primary and General, and 2004 Primary and General. Range 0-4.
- Last4 (2010): Measures participation in the last 4 major elections prior to 2010: 2008 Primary and General, 2006 Primary and General. Range 0-4.
Email me here for the code used to scrub and augment the voter file.
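The scrubbing code itself is not reproduced in this post. As a hedged sketch only, here is one way a few of the appended fields could be derived in base R. The miniature data frame, its column names (birth.year, p02, g02, p04, g04), and the household-wide reading of the marriage rule are all assumptions made for illustration, not the actual OH-01 layout or the code I used.

```r
# Hypothetical miniature voter file; column names are illustrative only.
voters <- data.frame(
  birth.year = c(1988, 1975, 1940, 1984),
  address    = c("1 Elm St", "1 Elm St", "9 Oak Ave", "9 Oak Ave"),
  zip        = c("45202", "45202", "45211", "45211"),
  p02 = c("", "X", "X", ""),   # 2002 primary, "X" marks a ballot cast
  g02 = c("", "X", "X", ""),   # 2002 general
  p04 = c("", "",  "X", ""),   # 2004 primary
  g04 = c("X", "X", "X", ""),  # 2004 general
  stringsAsFactors = FALSE
)

# Age as of 2006, then the age groups used in this post.
voters$age <- 2006 - voters$birth.year
voters$age.2006 <- cut(voters$age,
  breaks = c(17, 21, 29, 39, 49, 59, 69, 79, 89, Inf),
  labels = c("18-21", "22-29", "30-39", "40-49", "50-59",
             "60-69", "70-79", "80-89", "90+"))

# Last 4: how many of the four prior elections show a ballot.
voters$last4.g2006 <- rowSums(voters[, c("p02", "g02", "p04", "g04")] == "X")

# Household key built from the full street address and zip code.
voters$household <- paste(voters$address, voters$zip, sep = "|")

# Simplified marriage flag: at least two people in the household and
# an age spread under 15 years (one possible reading of the rule).
hh.size  <- ave(voters$age, voters$household, FUN = length)
hh.range <- ave(voters$age, voters$household, FUN = function(a) max(a) - min(a))
voters$married <- hh.size >= 2 & hh.range < 15

print(voters[, c("age.2006", "last4.g2006", "married")])
```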
I am using the R environment to perform this analysis. To download and install R, go to the CRAN homepage and follow the instructions for your platform. Once R is up and running, execute the following to install the required libraries and load the voter file into memory:
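    # install plyr, ggplot2, and RColorBrewer
    # (plyr is listed explicitly because current ggplot2 releases no longer attach it)
    install.packages(c("plyr","ggplot2","RColorBrewer"))
    # load required libraries (plyr provides ddply)
    library(plyr)
    library(ggplot2)
    library(RColorBrewer)
    # load the voter file into the vfs variable
    vfs <- read.csv("voterfile.csv")
    # assumption: reg.date is stored as YYYY-MM-DD; convert it so date comparisons behave
    vfs$reg.date <- as.Date(vfs$reg.date, format="%Y-%m-%d")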
Now the dependencies are installed and the voter file is read into the vfs variable.
According to Political Targeting (Malchow 2008), the strongest indicators of participation in a future election are age and previous participation. Malchow also says participation tends to be consistent between similar elections in different years. I am looking at the 2010 General election, a congressional midterm, so I used the 2006 General election as my guide. The first step is to generate a turnout table.
Political campaigns use a tool called "last 4", which measures a voter's recent participation. A voter's last 4 score represents how many of the previous four elections he or she cast a ballot in. A standard is to use both the primary and general elections for the previous two major election years. My data set contains last 4 calculations for the 2010, 2008, and 2006 elections.
The 2006 last 4 calculation looks at elections as far back as the primary in 2002, but a percentage of voters in the list weren't eligible or registered to vote in some or all of these elections. These voters have an incomplete last 4 score, and need to be evaluated separately so their scores don't influence voters with a complete history. As such I created two turnout tables: one for voters eligible for all elections (full), and one for voters eligible for at least one of the last four elections (partial). The turnout tables below show a turnout percentage for every combination of age group and participation score for 2006 voters:
    # find voters registered before the 2002 primary
    ele.full <- which(vfs$reg.date <= '2002-05-07')
    # find voters registered after the 2002 primary but before the 2006 general
    ele.partial <- which(vfs$reg.date > '2002-05-07' & vfs$reg.date <= '2006-11-07')
    # show the turnout table for full eligible voters
    turnout.full <- ddply(vfs[ele.full,], c("age.2006","last4.g2006"),
        function(x) length(which(x$turnout.g06 == "X")) / nrow(x))
    # show the turnout table for partial voters
    turnout.partial <- ddply(vfs[ele.partial,], c("age.2006","last4.g2006"),
        function(x) sum(x$turnout.g06 == "X") / nrow(x))
Turnout percentage for turnout.full:
Turnout percentage for turnout.partial:
For each table, we see that turnout percentage increases as previous participation increases for every age group, but it is pretty difficult to compare more than two age groups at once using this table. There are also several anomalous groups with 100% turnout, indicating a small population in that group. We'll use the R library ggplot2 to create a simple visualization of each table to help interpret the turnout values:
    # turnout-full visualization
    qplot(last4.g2006, V1, color=age.2006, group=age.2006, data=turnout.full,
        geom=c("point","line"),
        main="OH-01 2006 General Turnout by Age Group, Last 4 (Full)",
        xlab="Last 4", ylab="Turnout %") +
        scale_colour_hue(name="Age Group")
    # turnout-partial visualization
    qplot(last4.g2006, V1, color=age.2006, group=age.2006, data=turnout.partial,
        geom=c("point","line"),
        main="OH-01 2006 General Turnout by Age Group, Last 4 (Partial)",
        xlab="Last 4", ylab="Turnout %") +
        scale_colour_hue(name="Age Group")
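The cells with 100% turnout are easier to judge when each rate sits next to its group size. The sketch below builds such a table with base R's aggregate and merge on a made-up toy data frame. The voted column and the cutoff of 3 voters are assumptions for illustration, not part of the OH-01 file.

```r
# Hypothetical voters: age group, last 4 score, and a voted flag.
toy <- data.frame(
  age.2006    = c("18-21", "18-21", "18-21", "30-39", "30-39", "30-39", "30-39", "90+"),
  last4.g2006 = c(0, 0, 1, 4, 4, 4, 4, 4),
  voted       = c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Turnout rate and group size per (age group, last 4) cell.
rate <- aggregate(voted ~ age.2006 + last4.g2006, data = toy, FUN = mean)
size <- aggregate(voted ~ age.2006 + last4.g2006, data = toy, FUN = length)
cells <- merge(rate, size, by = c("age.2006", "last4.g2006"),
               suffixes = c(".rate", ".n"))

# Flag cells too small to trust; the cutoff of 3 is arbitrary for this toy data.
cells$small <- cells$voted.n < 3
print(cells)
```

In this toy table every cell with a 100% rate is also flagged as small, which is the pattern described above.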
Figure 1 shows turnout for voters with a complete last 4 score, and tells us that for all age groups except 18-21, turnout increases with previous participation, until turnout reaches a maximum of 85%-90%. The rate at which turnout increases is similar between age groups, suggesting previous participation may have more predictive value than age. Figure 2 is a representation of all voters who were registered in time for the 2006 general but not for the 2002 primary. Figure 2 shows a relationship between turnout and previous participation, but there is substantially more noise than in Figure 1. Taken together, the two figures support Malchow's hypothesis that age and previous participation have a positive influence on future participation.
The 2006 turnout tables are useful, but they don't represent a formal model of turnout in the 2006 election. A formal model will measure the relationship between the predictor variables (participation and age) and the outcome (turnout) for 2006 voters. This model can then be applied to 2010 voters to project turnout.
The model can include the other voter file variables with potential predictive qualities, like gender, party affiliation, and marital status. A campaign will traditionally build a linear regression model to project turnout, but linear regression treats the yes/no turnout outcome as a continuous quantity and can produce values below 0 or above 1 that make no sense as turnout probabilities, so I won't be using that type of regression here.
Instead, I'll use a generalized linear model to perform a binomial regression with a logit link function (logistic regression). Logistic regression estimates the probability of a binary outcome given an intercept and a number of independent continuous or categorical predictor variables. R has terrific support for defining and evaluating these models using the glm function from the base stats package.
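To make the difference concrete, here is a minimal sketch on ten made-up observations. It fits the same 0/1 outcome with lm and with glm(family = binomial) and compares the range of fitted values. The toy score variable runs from 1 to 10 purely to keep the example small; it is not a real last 4 score.

```r
# Toy data: a 0/1 turnout flag against an illustrative score from 1 to 10.
toy <- data.frame(
  score = 1:10,
  voted = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)

# Ordinary linear regression treats the 0/1 outcome as a continuous value.
linear.fit  <- lm(voted ~ score, data = toy)
linear.pred <- predict(linear.fit)

# Logistic regression: binomial family, which uses the logit link by default.
logit.fit  <- glm(voted ~ score, data = toy, family = binomial)
logit.pred <- predict(logit.fit, type = "response")

print(range(linear.pred))  # extends below 0 and above 1
print(range(logit.pred))   # stays strictly between 0 and 1
```

The linear fit assigns the lowest-scoring voter a negative "probability" and the highest-scoring voter one above 100%, while the logistic fit stays inside the valid range.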
The goal is to fit a logistic regression on voter data from 2006, and then use that regression to project turnout for 2010. I actually create two regressions, one for voters with at least a 4-year voting eligibility (full model), and one for all other voters (partial model). This is identical to the segmentation used when creating the turnout tables. The output of these regressions is the probability that a voter will turn out in the given year. A campaign can use this figure to estimate total turnout in an election, and to allocate resources to different geographic and demographic segments.
The R function glm is used to create two models of 2006 turnout based on last 4 participation, age group, gender, party affiliation, and marital status.
    # create temporary variables inside the data frame for 2006 values
    vfs$last4 <- vfs$last4.g2006
    vfs$age <- vfs$age.2006
    # create a model for voters with at least 4 years of voting history
    full.lr <- glm(turnout.g06 ~ last4 + age + gender + party + married,
        data=vfs[ele.full,], family=binomial)
    # run ANOVA against the full table to test for term significance
    anova(full.lr, test="Chisq")
    # create a model for voters with less than 4 years of voting history
    partial.lr <- glm(turnout.g06 ~ last4 + age + gender + party + married,
        data=vfs[ele.partial,], family=binomial)
    # run ANOVA against the partial table to test for term significance
    anova(partial.lr, test="Chisq")
Now that I have fitted models, I'll use the predict function to capture the model output. The output is the estimated probability that a voter turned out in 2006 given his or her last4.g2006 score, age group, gender, marital status, and party affiliation. Then I can compare the predicted turnout probability with the actual turnout to gauge the effectiveness of each model. This isn't a valid statistical measure of accuracy, merely a smell test.
    # create a new column in the vfs data frame
    vfs$pred.g06 <- c(0)
    pred.g06.full <- predict(full.lr, type="response")
    pred.g06.partial <- predict(partial.lr, type="response")
    # apply the full model to voters with at least 4 years of registration
    vfs[names(pred.g06.full),]$pred.g06 <- pred.g06.full
    # apply the partial model to voters with less than 4 years of registration
    vfs[names(pred.g06.partial),]$pred.g06 <- pred.g06.partial
    # take the number of correct predictions divided by the number of voters
    full.correct <- sum((vfs[ele.full,]$pred.g06 > .5) == (vfs[ele.full,]$turnout.g06 == "X")) /
        nrow(vfs[ele.full,])
    # value is .797 = ~80% accurate for the full model
    # take the same for the partial model
    partial.correct <- sum((vfs[ele.partial,]$pred.g06 > .5) == (vfs[ele.partial,]$turnout.g06 == "X")) /
        nrow(vfs[ele.partial,])
    # value is .776 = ~77% accurate for the partial model
The prediction rates for our regressions aren't spectacular: 80% for the full model and 77% for the partial model. Given the limited information in our voter file, though, they aren't that bad. Additionally, a political campaign would have access to other data like detailed demographics, financial data, and more accurate lifestyle or ideological information. Extending the regression with these variables might increase the predictive power of the system.
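One way to put an accuracy figure in context is to cross-tabulate predictions against outcomes and compare the hit rate with a naive baseline that always guesses the more common outcome. The sketch below does this for ten invented voters; the probabilities and outcomes are illustrative and say nothing about the OH-01 models.

```r
# Hypothetical predicted probabilities and actual turnout for ten voters.
pred   <- c(0.91, 0.80, 0.75, 0.66, 0.58, 0.45, 0.40, 0.30, 0.22, 0.10)
actual <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Classify at the same 0.5 cutoff used in the smell test.
predicted <- pred > 0.5

# Cross-tabulate predictions against outcomes.
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

# Share of correct calls.
accuracy <- mean(predicted == actual)

# Baseline: always guess the more common outcome.
baseline <- max(mean(actual), 1 - mean(actual))

print(c(accuracy = accuracy, baseline = baseline))
```

A model is only earning its keep to the extent that its accuracy clears the baseline.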
Now I'll apply the regression equations to project turnout in 2010. First, I determine which regression (partial or full) to apply to current voters by their registration date:
    # find voters registered before the 2006 primary (328594 voters)
    ele.full2010 <- which(vfs$reg.date <= '2006-05-02')
    # find voters registered after the 2006 primary but before the 2008 general (68863 voters)
    ele.partial2010 <- which(vfs$reg.date > '2006-05-02' & vfs$reg.date <= '2008-11-04')
Next I prepare the data and project 2010 turnout for each model using the predict function:
    # assign the last4 and age model variables to values calculated for 2010
    vfs$age <- vfs$age.2010
    # (adjust if your file names this column last4.g2010, matching last4.g2006)
    vfs$last4 <- vfs$last4.2010
    # call predict for the full model
    pred.g10.full <- predict(full.lr, newdata=vfs[ele.full2010,], type="response")
    # predict based on the partial model
    pred.g10.partial <- predict(partial.lr, newdata=vfs[ele.partial2010,], type="response")
    # turnout % for 2010: share of voters with a predicted probability above 50%
    pred.g10.turnout.full <- sum(pred.g10.full > .5) / length(ele.full2010) # 63% predicted turnout
    pred.g10.turnout.partial <- sum(pred.g10.partial > .5) / length(ele.partial2010) # 18% predicted turnout
    # save the per-voter predictions (not the aggregate rates) into the vfs data frame
    vfs$pred.g10 <- c(0)
    vfs[names(pred.g10.full),]$pred.g10 <- pred.g10.full
    vfs[names(pred.g10.partial),]$pred.g10 <- pred.g10.partial
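The turnout percentage above counts voters whose predicted probability exceeds 50%. A different summary, shown here only as a point of comparison and not as the method used in this post, is the average of the predicted probabilities, which can be read as an expected turnout rate. The six probabilities below are invented to show that the two numbers can differ.

```r
# Hypothetical predicted turnout probabilities for six voters.
p <- c(0.90, 0.60, 0.60, 0.40, 0.20, 0.10)

# Share of voters classified as likely to vote (probability above 0.5).
share.over.half <- mean(p > 0.5)

# Expected turnout rate: the average of the individual probabilities.
expected.turnout <- mean(p)

print(c(share.over.half = share.over.half, expected.turnout = expected.turnout))
```

Here half of the voters clear the 0.5 cutoff, while the average probability is a little under 47%.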
According to pred.g10.turnout.full and pred.g10.turnout.partial, OH-01 will see 63% turnout among voters scored by the full model and 18% turnout among voters scored by the partial model. To check the plausibility of the 2010 projections, I plotted 2006 actual turnout against the 2010 projected turnout for every age group. As stated earlier, Malchow says participation tends to be consistent between similar elections, so I expect no large unexplained deviations in the chart. A separate chart is created for each participation model (full, partial):
    # which voters have a > 50% chance of turning out in 2010
    turnout.g10 <- vfs$pred.g10 > .5
    # which voters turned out in 2006
    turnout.g06 <- vfs$turnout.g06 == "X"
    # build a summary of voters who turned out in 06 or 10 based on age:
    # one row per age group, counting only voters flagged as turning out
    to <- rbind(
        ddply(vfs[intersect(ele.full, which(turnout.g06)),], "age.2006",
            function(x) data.frame(age=x$age.2006[1], n=nrow(x), series="G06"))[,c(2:4)],
        ddply(vfs[intersect(ele.full2010, which(turnout.g10)),], "age.2010",
            function(x) data.frame(age=x$age.2010[1], n=nrow(x), series="G10"))[,c(2:4)])
    # current ggplot2 releases ignore qplot's stat and position arguments,
    # so the dodged bars are built with ggplot() directly
    ggplot(to, aes(x=age, y=n, fill=series)) +
        geom_bar(stat="identity", position="dodge") +
        labs(title="OH-01 2006 Turnout vs 2010 Projected Turnout (full)", x="Age", y="Count") +
        scale_fill_brewer(name="Election", palette="Paired")
    # same summary and chart for the partial model
    to <- rbind(
        ddply(vfs[intersect(ele.partial, which(turnout.g06)),], "age.2006",
            function(x) data.frame(age=x$age.2006[1], n=nrow(x), series="G06"))[,c(2:4)],
        ddply(vfs[intersect(ele.partial2010, which(turnout.g10)),], "age.2010",
            function(x) data.frame(age=x$age.2010[1], n=nrow(x), series="G10"))[,c(2:4)])
    ggplot(to, aes(x=age, y=n, fill=series)) +
        geom_bar(stat="identity", position="dodge") +
        labs(title="OH-01 2006 Turnout vs 2010 Projected Turnout (partial)", x="Age", y="Count") +
        scale_fill_brewer(name="Election", palette="Paired")
Figure 3 suggests larger 2010 turnout for all groups as compared to 2006. The projected increase in the 22-29 age group seems unlikely, but can probably be explained by the higher turnout among younger voters in 2008. In addition to 2008 being a presidential election year, the Obama for America campaign focused on registering new voters and activating dormant voters, both of which increased turnout among younger voters. That higher 2008 turnout inflated the last 4 measure for new voters, which pushed up the projected 2010 turnout. Figure 4 exhibits a similar projected increase for the younger age groups, which is probably due to the same increase in 2008 turnout. Before using these models in an actual election, the projections would need to be scaled based on some other turnout estimate. Despite the inflated values, however, this is a strong system for turnout prediction and could be used by almost any congressional campaign.
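As a minimal sketch of what such scaling could look like, the code below rescales a set of predicted probabilities so that their average matches an outside turnout estimate. Both the probabilities and the 40% target are invented for the example, and simple proportional scaling is only one of several possible approaches; it is not something prescribed by Malchow.

```r
# Hypothetical predicted turnout probabilities (their mean is 0.50).
p <- c(0.90, 0.70, 0.50, 0.30, 0.10)

# Hypothetical external turnout estimate to scale toward.
target <- 0.40

# Proportional scaling factor, then cap at 1 so results stay valid probabilities.
scale.factor <- target / mean(p)
scaled <- pmin(p * scale.factor, 1)

print(scaled)
print(mean(scaled))
```

Scaling down preserves the ordering of voters, so targeting lists are unchanged while the projected total falls in line with the outside estimate. When scaling up, the cap at 1 can leave the average slightly below the target.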
In addition to scaling, the regression models would need to be improved in several ways before being put into production. The predictor variables are currently considered independently, which effectively discounts any interactive effects that may exist. Turnout for younger married females or unmarried Democrats may be better modeled using compound variables, for example. Also, the model makes no use of demographic or opinion survey information available to political campaigns. Finally, the projection isn't limited to two regressions; a campaign could create regressions by county or school district, or based on marriage status, or any other combination.
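In R's formula syntax, a compound variable of this kind is written with the * operator, which adds both main effects and their interaction. The sketch below uses simulated data, not the voter file, to compare an additive model with one that lets gender and marriage status interact. The variable names, the simulated effect sizes, and the seed are all assumptions for illustration.

```r
set.seed(1)
n <- 500

# Simulated voters; these columns only mimic the shape of the real predictors.
toy <- data.frame(
  last4   = sample(0:4, n, replace = TRUE),
  gender  = factor(sample(c("F", "M"), n, replace = TRUE)),
  married = factor(sample(c("N", "Y"), n, replace = TRUE))
)

# Simulated outcome: turnout rises with last4, with an extra lift for married women.
lin <- -1.5 + 0.6 * toy$last4 + 0.8 * (toy$gender == "F" & toy$married == "Y")
toy$voted <- rbinom(n, 1, plogis(lin))

# Additive model: each predictor contributes independently.
additive <- glm(voted ~ last4 + gender + married, data = toy, family = binomial)

# Interaction model: gender * married expands to gender + married + gender:married.
interact <- glm(voted ~ last4 + gender * married, data = toy, family = binomial)

print(names(coef(interact)))

# Likelihood ratio test of whether the interaction term improves the fit.
print(anova(additive, interact, test = "Chisq"))
```

The same pattern would apply to the turnout formula used earlier, for example by replacing gender + married with gender * married.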
Other Visualization Examples
While not specifically related to turnout, I produced several simple visualizations that explore the rest of the voter file. The full power of R can be applied using the same voter data from the turnout projections.
    ## precinct summary
    # summarize turnout in 2008 & registered democrats, by precinct
    pct <- ddply(vfs, "precinct.code", function(x) data.frame(
        turnout.g08=sum(x$turnout.g08 == "X") / nrow(x),
        dem.pct=sum(x$party == "D") / nrow(x),
        nvoters=nrow(x)))
    # visualize: x is the Democratic registration share, y is turnout, matching the axis labels
    qplot(dem.pct*100, turnout.g08*100, data=pct, geom="point", size=nvoters, alpha=I(0.4),
        main="OH-01 Precinct Turnout/Registration Summary",
        xlab="Registered Democrats (%)", ylab="2008 General Turnout")

    ## 2008 turnout by gender + age
    ggplot(vfs[which(vfs$turnout.g08 == "X"),], aes(x=age.2010, fill=gender)) +
        geom_bar(position="dodge") +
        labs(title="OH-01 2008 General Turnout by Gender, Age", x="Age", y="Count") +
        scale_fill_brewer(palette="Set1")

    ## voters registered after the 2006 general, up to and including the 2008 general
    new.08 <- vfs[which(vfs$reg.date > '2006-11-07' & vfs$reg.date <= '2008-11-04'),]

    ## 2008 newly registered voter counts by age
    ggplot(new.08, aes(x=age.2010)) +
        geom_bar() +
        labs(title="OH-01 2008 Newly registered voters", x="Age", y="Count")

    ## 2008 newly registered voter turnout by age
    ggplot(new.08, aes(x=age.2010, fill=turnout.g08)) +
        geom_bar(position="dodge") +
        labs(title="OH-01 2008 Turnout for newly registered voters", x="Age", y="Count") +
        scale_fill_brewer(palette="Paired")
Figure 5 plots the percentage of registered Democrats against the 2008 general turnout percentage for each precinct, with each bubble scaled to the registered voter population of the precinct it represents. Figure 6 shows 2008 general turnout by gender and age group. Figure 7 is the raw count of voters registered between the day after the 2006 general election and the day of the 2008 general election, broken down by age group. Figure 8 is a variation of Figure 7, showing 2008 turnout of voters registered after the 2006 election. None of these charts took more than 5 minutes to create from concept to output, and ggplot2 did almost all of the heavy lifting. I believe the ease with which these charts were built highlights the utility of having your data analysis tool also be your visualization tool.
In this short example I've analyzed, visualized, and modeled electoral data using R and a few add-on packages. These are standard techniques used by any congressional campaign, but they are usually performed by some combination of Excel, SPSS, or SQL. By using R, I avoided the compatibility issues usually encountered when transferring data between tools. R would have also allowed me to perform clustering, component analysis, or Bayesian inference on the same data from the same R interface. Altogether, these reasons make R a good addition to the political analysis toolbox for a campaign or campaign consultant. If you would like to discuss how advanced statistical analysis can help your Democratic campaign model turnout, increase fund-raising, or benchmark field operations, please don't hesitate to contact me.
Click here (14MB) to download the R scripts and data associated with this post. The voter file data has been scrubbed to remove the VoterID, name, and address components.