Aggregate electoral targeting with R


Introduction

Electoral targeting is the process of quantifying the partisan bias of a single voter or a subset of voters in a geographic region. Bias can be calculated from an individual's demographics and voting history, or by aggregating the results of an entire election precinct. Targeting is traditionally performed by national committees (e.g., National Committee for an Effective Congress, National Republican Congressional Committee), state political parties, interest groups (e.g., EMILY's List, National Rifle Association), or campaign consultants. Targeting data is consumed by campaign managers and analysts, who use it alongside polling data to build strategy, direct resources, and project electoral outcomes.

While aggregate electoral targeting can build a sophisticated picture of a district, the mathematics behind targeting are very simple. Targeting can be performed by anyone with previous electoral data, and the calculations can be done on 3x5 note cards, in a simple spreadsheet, or with a high-end software package like SPSS. The targeting methods discussed in this post are taken from publications on electioneering: Campaign Craft (Burton and Shea, 2006) and The Campaign Manager (Shaw, 2004).
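
To show just how simple the core arithmetic is, here is a minimal sketch in R using made-up returns for a single hypothetical precinct:

# Made-up returns for one hypothetical precinct
dem.votes <- 412
rep.votes <- 388

# Two-party Democratic vote share: the Democrat's votes divided by
# the combined major-party vote
dem.share <- dem.votes / (dem.votes + rep.votes)
dem.share   # 0.515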

Although targeting data is usually inexpensive or free, a down-ballot campaign or a primary challenger might not have the connections or the support of a PAC or party needed to obtain it. In these cases, a campaign will probably purchase one of the books listed above and perform its own analysis. Even an established campaign may run its own analysis, perhaps to test different turnout theories or to integrate additional data. This post is directed at these groups.

Together, we will assume the role of campaign consultant and perform an aggregate electoral analysis on the 13th House of Delegates seat (HOD#13) in the Commonwealth of Virginia. In HOD#13, the 18-year Republican incumbent Bob Marshall is being challenged by Democrat John Bell. This analysis will compute and visualize turnout, partisan bias, and a precinct ranking based on projected turnout and historical Democratic support.

The analysis of HOD#13 will be performed using R, an open-source computing platform. R is free, extensible, and interactive, making it an ideal platform for experimentation. The R package aggpol was created specifically for this tutorial, and it contains all the data and operations required to execute an aggregate electoral analysis. Readers can execute the provided R code to reproduce the analysis or simply follow along to learn how it was performed. Readers unfamiliar with R should read Introduction to R, which is available on the R project homepage.

The electoral and registration data used here were compiled from the Virginia State Board of Elections using several custom-written parsers and two different PDF-to-text engines. Please contact me at jjh@offensivepolitics.net for source data or more information.

Prerequisites

This section only applies to readers interested in recreating the analysis and graphics produced in this tutorial. To completely recreate this analysis you will need the following:

  1. The latest version of the R statistical computing environment. Binaries, source, and installation instructions can be downloaded from the R homepage.
  2. Additional R packages. This analysis requires several packages that provide functionality on top of the base R system: plyr, ggplot2, and RColorBrewer. To install them, execute the following in your R environment:

      install.packages(c("plyr", "RColorBrewer", "ggplot2"))
      Next you'll need to install the aggpol package for calculating aggregate political statistics. Download the latest version for Unix-style systems or for Windows systems. Installing local packages is covered in the R manual on package installation.
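
      Installing a downloaded source archive from the R prompt looks roughly like this; the file name below is hypothetical, so substitute whichever archive you downloaded:

      # install a local source package (hypothetical file name)
      install.packages("aggpol.tar.gz", repos=NULL, type="source")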

Getting Started

Now that the prerequisites are installed we can get started with our data analysis. Start up your R environment and load the required libraries by typing in the following commands:

library(plyr)
library(aggpol)
library(ggplot2)
library(RColorBrewer)

We need to attach the VAHOD data set that comes with aggpol. This data set contains precinct-level electoral returns for state and federal elections in the Commonwealth of Virginia from 2001 to 2008. Since we are focusing on HOD#13, we'll need to select just the records that have to do with that seat.

data(VAHOD)
hd013 <- VAHOD[which(VAHOD$seat == "HD-013"),]

The data set contains precinct-level electoral results for the following races: U.S. President, U.S. Senate, U.S. House of Representatives, Virginia Governor, Senate of Virginia, and Virginia House of Delegates. This breadth of electoral returns allows us to build a very detailed profile of the partisan bias of a district.
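
To get a feel for that coverage, a quick cross-tabulation of race type by year is useful; this assumes the raw records carry the same year and district_type columns that appear in the summaries below:

# count precinct-level records by year and race type (e.g. GV, HD, USS)
table(hd013$year, hd013$district_type)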

We will first determine the historical partisanship in HOD#13. Since partisanship can fluctuate over the years and different seats have different turnout expectations, we'll first need to see the major party support for every seat in each election for precincts in HOD#13. We can use the historical.election.summary function from the aggpol package to group the precinct results into district results, and then break them down by seat and year.

esum <- historical.election.summary(hd013)

esum now contains:

   year district_type total.turnout rep.turnout rep.turnout.percent dem.turnout dem.turnout.percent oth.turnout oth.turnout.percent
1  2001            GV          5527        3266              0.5909        2207              0.3993          54              0.0097
2  2001            HD          5399        3475              0.6436        1924              0.3563           0                   0
3  2001            LG          5432        3291              0.6058        2025              0.3727         116              0.0213
4  2003            HD         10299       10103              0.9809         110              0.0106          86              0.0083
13 more rows...

We now have major-party turnout for every election in our data set. To best visualize the results we'll build a bar graph comparing major-party turnout in each seat over time. First we need to reshape the election summary object (esum) from a wide summary format to a long observation format, with one row per distinct year+district+party combination. The plyr package makes this task extremely simple.

elx <- ddply(esum,c("year", "district_type"), function(x) 
  rbind(  data.frame(party="REP",turnout=x$rep.turnout.percent),
  data.frame(party="DEM",turnout=x$dem.turnout.percent)))

We will now use the powerful ggplot2 package to view the Republican and Democratic support for each election, in each seat, for our subset:

ggplot(elx,aes(year,turnout,fill=factor(party))) + 
  geom_bar(stat="identity") + 
  facet_wrap(~district_type,scales="free_x") + 
  scale_fill_brewer(palette="Set1")

Result:
HD#013 major party percentages

This graphic gives us a decent understanding of district-level electoral trends. For U.S. federal elections (figs.: PVP, USH, USS), we can see a distinct drop in Republican support approaching 2008; the results for U.S. House (USH) and U.S. Senate (USS), in particular, show a strong increase in Democratic support. This growth correlates with statewide trends that resulted in the election of two Democratic Senators representing Virginia for the first time since 1970. General Democratic gains notwithstanding, the House of Delegates (fig.: HD) results aren't as promising for a Democratic challenger. The incumbent Del. Marshall drew more than 60% support in three of the last four elections and saw no challenger at all in 2003. While the district may be trending more Democratic over time, the voters of HOD#13 are obviously big fans of Del. Marshall.

Now that we understand the historical partisanship of this district, we need to understand historical turnout, which will let us project the number of votes required to win. We will use the historical.turnout.summary function from the aggpol package to produce a turnout summary for this district.

historical.turnout.summary(hd013, district.type="HD", district.number="013", years=c(2001,2003,2005,2007))
  year total.turnout total.registration
1 2001          5399              13275
2 2003         10031              45769
3 2005         23592              62497
4 2007         26110              78028

Looking at this table one can see some data-collection problems in the 2001 HD election. In recent years each precinct belonged to only one House of Delegates seat, but in 2001 (and to a lesser extent 2003) some precincts were split across seats, some had duplicate names, and there is no information on how to allocate results from different races to those precincts. The turnout numbers are slightly affected by these problems, but aggpol attempts to correct for them by substituting alternate years, or even alternate races, where possible.

The takeaway from the previous table is that turnout for the last four House of Delegates elections has hovered around 30%. This makes political sense: Virginia holds its state elections in odd-numbered years, when there are no federal races to drive up turnout. That leaves a lot of registered voters waiting to be activated, but we need to drill down to the precinct level to find them.
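
That figure is easy to verify from the summary itself; a quick sketch, assuming the column names shown in the output above:

# turnout rate per HD election year, from the summary table above
to <- historical.turnout.summary(hd013, district.type="HD",
                                 district.number="013",
                                 years=c(2001,2003,2005,2007))
round(to$total.turnout / to$total.registration, 3)
# 0.407 0.219 0.377 0.335, averaging roughly a third of registrants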

We use the district.analyze function of aggpol to aggregate all electoral results into a summary for each precinct.

hd013s <- district.analyze(hd013)

hd013s is a data frame with one row per precinct, containing several computed values for each major party and other values for the precinct as a whole. Those statistics, illustrated with a small numeric sketch after the list, are:

  • Aggregate base partisan vote - The lowest non-zero turnout for a major party, in all electoral years.
  • Average Party Performance - The average percentage of the vote a party received in the 3 closest recent elections.
  • Swing vote - The part of the electorate not included in the aggregate base partisan vote.
  • Soft-partisan vote - The average worst a party has performed, minus the actual worst.
  • Toss-up - The portion of the electorate not included in the Aggregate base or soft-base partisan vote.
  • Partisan base - The combined aggregate-base and soft-partisan vote for each major party.
  • Partisan swing - The combined major party swing vote.
  • Projected turnout - The portion of the electorate that is projected to turn out given previous turnout and current registration data.
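
The simpler of these definitions can be sketched directly in R with made-up percentages for a single precinct. Note this is an illustration of the definitions above, not aggpol's actual implementation, which may differ in its details:

# Made-up Democratic and Republican shares across four elections
dem.pct <- c(0.47, 0.52, 0.44, 0.49)
rep.pct <- c(0.51, 0.46, 0.54, 0.48)

base.dem <- min(dem.pct[dem.pct > 0])  # aggregate base partisan vote
base.rep <- min(rep.pct[rep.pct > 0])
swing    <- 1 - base.dem - base.rep    # electorate outside both bases

# Average Party Performance: mean share in the 3 closest elections
closest <- order(abs(dem.pct - rep.pct))[1:3]
app.dem <- mean(dem.pct[closest])      # 0.493 for these numbers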

These variables can be visualized with the following graphic, adapted (along with the definitions above) from Campaign Craft (Burton and Shea).

variables

The actual columns in the data frame returned from district.analyze are:

  • proj.turnout.percent - The projected turnout percentage for a hypothetical next election.
  • proj.turnout.count - The projected number of voters who will turn out for a hypothetical next election.
  • current.reg - Current number of registered voters in a precinct.
  • partisan.base - The combined aggregate-base and soft-partisan vote for both major parties (Partisan base).
  • partisan.swing - All non-base voters (1.0 - partisan.base).
  • tossup - The portion of the electorate not in the base or soft support of either major party.
  • app.rep - The average party performance of a Republican candidate in this precinct.
  • base.rep - The aggregate base partisan vote for a Republican candidate in this precinct.
  • soft.rep - The soft partisan vote for a Republican candidate in this precinct.
  • app.dem - The average party performance of a Democratic candidate in this precinct.
  • base.dem - The aggregate base partisan vote for a Democratic candidate in this precinct.
  • soft.dem - The soft partisan vote for a Democratic candidate in this precinct.
  • partisan.rep - Combination of aggregate base and soft vote percentages for the Republican.
  • partisan.dem - Combination of aggregate base and soft vote percentages for the Democrat.

The most useful statistic above is the Average Party Performance (APP), an average of a party's vote share in the 3 closest recent elections. The APP describes supporter levels in a best-case scenario: a close election. We've already calculated the APP of each major party (app.dem, app.rep), but when a race doesn't have a third-party candidate, what we'll usually visualize is each party's share of the combined partisan performance. We'll add these variables, one for each major party, to the summary data frame generated previously.

hd013s$dem.share <- hd013s$app.dem/(hd013s$app.dem+hd013s$app.rep)
hd013s$rep.share <- hd013s$app.rep/(hd013s$app.dem+hd013s$app.rep)
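
Since the two shares are complementary by construction, a quick sanity check never hurts:

# dem.share and rep.share should sum to 1 in every precinct
stopifnot(isTRUE(all.equal(hd013s$dem.share + hd013s$rep.share,
                           rep(1, nrow(hd013s)))))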

Now that we have the APP and partisan vote share for each party, we can visualize the precinct-level terrain for the Democratic challenger Mr. Bell. This visualization should show us the Democratic support in each precinct and give us an idea which precincts could be competitive. We'll produce it using a density plot plus a 1-d rug, adapted from the seatsVotes plot in the pscl package. We'll also draw a cut line at the 50% vote mark to help find competitive precincts.

qplot(dem.share, data=hd013s, geom=c("density","rug"),
    xlab="Dem Vote Share",
    main="Democratic vote share, by precinct")  + 
  geom_vline(xintercept=.50)

dem-vote-share-by-precinct

We can see a lot of precincts are between 48% and 53% Democratic, which means those precincts could potentially go for either candidate. We need to classify these results into something more solid. Let's say precincts with less than 48% Democratic share are Safe Republican, 48-52% are Tossup, and greater than 52% are Safe Democrat. This is a simple representation but can be refined later. We'll add a seat classification to our data frame using the cut function:

hd013s$cl <- cut(hd013s$dem.share, breaks=c(0,.48,.52,1), labels=c("Safe Rep", "Tossup", "Safe Dem"))
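
Before plotting, we can tally how many precincts land in each class to confirm the breaks behave as expected:

# count precincts per rating; matches the tables later in this post
table(hd013s$cl)
# Safe Rep   Tossup Safe Dem
#        5       21        1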

Now we'll visualize how many precincts fall into each classification, this time using a histogram instead of a density curve.

ggplot(hd013s, aes(x=dem.share)) + 
  geom_histogram(aes(fill=cl), binwidth=0.01) + 
  scale_fill_brewer("Precinct Rating", palette="RdYlBu") + 
  scale_x_continuous("Democratic Vote Share") + 
  scale_y_continuous("Frequency") 

dem-vote-share-by-precinct-hist

From the histogram we see that not only does a Republican candidate enjoy more "Safe" precincts, but even the majority of the tossup precincts have less than 50% Democratic share. While the precinct breakdown looks bad, a Democratic win in this district is theoretically possible if these tossup precincts are held. A Democratic candidate will face a tough challenge, so the next step will be identifying Democratic and Democrat-leaning precincts to target.

To make this target precinct list we'll need a method to prioritize the precincts so that we can reach the most persuadable voters while spending the least resources. A popular method to identify a precinct as high-value is to sort precincts by lowest projected turnout with highest Democratic vote share. Lower turnout means there are registered voters waiting to be convinced to show up, and high Democratic vote share means more of those voters will be Democrats.

Since we measured both of these values (turnout %, Democratic vote share), it is very easy in R to order our data by projected turnout (ascending) and Democratic average party performance (descending).

hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem),c(1:2,20:21)]
precinct_name proj.turnout.percent dem.share rep.share
25 153 - 409 - SUDLEY NORTH 0.1959 0.5105 0.4894
27 153 - 411 - MULLEN 0.2218 0.5026 0.4973
4 107 - 111 - BRIAR WOODS 0.2256 0.4837 0.5162
6 107 - 212 - CLAUDE MOORE PARK 0.2279 0.5285 0.4714
26 153 - 410 - MOUNTAIN VIEW 0.2319 0.4945 0.5054
16 153 - 110 - BUCKLAND MILLS 0.2448 0.4891 0.5108
13 153 - 106 - ELLIS 0.2475 0.5038 0.4961
5 107 - 112 - FREEDOM 0.2509 0.5028 0.4971
15 153 - 108 - VICTORY 0.2645 0.5005 0.4994
24 153 - 408 - GLENKIRK 0.2837 0.4998 0.5001
1 107 - 106 - EAGLE RIDGE 0.2856 0.4992 0.5007
18 153 - 112 - CEDAR POINT 0.2876 0.4855 0.5144
14 153 - 107 - MARSTELLER 0.3067 0.4775 0.5224
3 107 - 109 - HUTCHISON 0.3168 0.4857 0.5142
2 107 - 108 - MERCER 0.3281 0.5034 0.4965
17 153 - 111 - BRISTOW RUN 0.3324 0.4822 0.5177
23 153 - 406 - ALVEY 0.3460 0.4736 0.5263
21 153 - 402 - BATTLEFIELD 0.3546 0.4323 0.5676
10 153 - 102 - BENNETT 0.3896 0.4959 0.5040
19 153 - 209 - WOODBINE 0.4014 0.4651 0.5348
7 107 - 307 - MIDDLEBURG 0.4043 0.4953 0.5046
9 153 - 101 - BRENTSVILLE 0.4180 0.4904 0.5095
22 153 - 403 - BULL RUN 0.4226 0.4860 0.5139
20 153 - 401 - EVERGREEN 0.4283 0.5006 0.4993
12 153 - 104 - NOKESVILLE 0.4537 0.4960 0.5039
11 153 - 103 - BUCKHALL 0.4636 0.4773 0.5226
8 107 - 309 - ALDIE 0.4687 0.4881 0.5118

This sorted list is our critical intelligence for finding persuadable voters, but we need a better way to visualize the output. Since we have two scalar variables (turnout %, Democratic vote share) we can use a scatter plot, with Democratic vote share on the Y axis and turnout % on the X. We'll also color each precinct by the seat classification we defined earlier (Safe Republican, Tossup, Safe Democrat):

ggplot(hd013s, aes(x=proj.turnout.percent, y=dem.share)) + 
  geom_point(aes(colour=cl)) + 
  labs(x="Projected Turnout %", y="Democratic Vote Share %", colour="Seat Type")

dem-vote-share-by-precinct-scatter-color

This chart echoes what we've seen previously: the Democratic challenger faces an uphill battle, but there is room for a win. We see a single "Safe Democrat" precinct with very low turnout, and five "Safe Republican" precincts spread across the turnout range. Given the high number of "Tossup" precincts, and the fact that they run the gamut as far as turnout is concerned, we'll need to incorporate additional information into our prioritization. If we also rank precincts by current voter registration, we can focus on the precincts where we stand to gain the most ground.

Before we continue, we need to make sure there is enough difference in precinct-to-precinct registration to have an impact. Let's look at some statistics for the current registration in this district.

mean(hd013s$current.reg)
[1] 2970.111
sd(hd013s$current.reg)
[1] 1014.072

There are on average 2,970 registered voters in each precinct, but the standard deviation is 1,014 voters. A spread that wide tells us we need to take registration into account if we want to focus on precincts with 4,000 voters rather than 1,000. A histogram of current registration will help clarify this finding:

qplot(current.reg, data=hd013s, geom="histogram", binwidth=500, xlab="Current Registration") + 
  scale_y_continuous("Frequency")

Current registration histogram

The histogram bears out the standard deviation: we see some very small precincts and some large ones, but the majority fall somewhere in the 2000-4000 range. The differences look large enough to justify including current registration in our ranking.

We need to look at the Democratic Vote Share vs Turnout % scatter plot again, but with the points scaled to the current precinct registration.

qplot(proj.turnout.percent,dem.share,size=current.reg, data=hd013s,colour=cl)+ 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")

Democratic vote share by Turnout %

This plot is almost complete and ready to be analyzed. The last job is to label the points with their precinct names. Our current precinct_name variable is actually a unique identifier containing a FIPS county code, a precinct code, and a name, which is too long for a point label. We'll shrink it down to just the name and then recreate the scatter plot with the labels:

# replace the fips code and precinct number w/ an empty string
hd013s$precinct.label <- sub("^[0-9]+ - [0-9]+ - ",'',as.character(hd013s$precinct_name))
# plot the previous graph again but this time use precinct.label as the label
ggplot(hd013s, aes(x=proj.turnout.percent, y=dem.share,label=precinct.label)) + 
  geom_point(aes(colour=cl,size=current.reg)) + 
  geom_text(size=2.5,vjust=1.5,angle=25) + 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")

dem-vote-share-by-precinct-scatter-color-size-label

From the chart we can see that a Democrat in HOD#13 will want to focus contact efforts on the precincts in the upper-left corner of the plot, targeting larger precincts before smaller ones. Integrating current registration into our previous sort command leaves us with the following sort order:

hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,hd013s$current.reg),c(1:2,4,20:22)]
precinct_name proj.turnout.percent current.reg dem.share rep.share cl
25 153 - 409 - SUDLEY NORTH 0.1959 2497 0.5105 0.4894 Tossup
27 153 - 411 - MULLEN 0.2218 3555 0.5026 0.4973 Tossup
4 107 - 111 - BRIAR WOODS 0.2256 2288 0.4837 0.5162 Tossup
6 107 - 212 - CLAUDE MOORE PARK 0.2279 3115 0.5285 0.4714 Safe Dem
26 153 - 410 - MOUNTAIN VIEW 0.2319 3749 0.4945 0.5054 Tossup
16 153 - 110 - BUCKLAND MILLS 0.2448 3646 0.4891 0.5108 Tossup
13 153 - 106 - ELLIS 0.2475 1303 0.5038 0.4961 Tossup
5 107 - 112 - FREEDOM 0.2509 3929 0.5028 0.4971 Tossup
15 153 - 108 - VICTORY 0.2645 4874 0.5005 0.4994 Tossup
24 153 - 408 - GLENKIRK 0.2837 2175 0.4998 0.5001 Tossup
1 107 - 106 - EAGLE RIDGE 0.2856 2531 0.4992 0.5007 Tossup
18 153 - 112 - CEDAR POINT 0.2876 3497 0.4855 0.5144 Tossup
14 153 - 107 - MARSTELLER 0.3067 3669 0.4775 0.5224 Safe Rep
3 107 - 109 - HUTCHISON 0.3168 3722 0.4857 0.5142 Tossup
2 107 - 108 - MERCER 0.3281 3229 0.5034 0.4965 Tossup
17 153 - 111 - BRISTOW RUN 0.3324 3031 0.4822 0.5177 Tossup
23 153 - 406 - ALVEY 0.3460 4403 0.4736 0.5263 Safe Rep
21 153 - 402 - BATTLEFIELD 0.3546 3851 0.4323 0.5676 Safe Rep
10 153 - 102 - BENNETT 0.3896 4440 0.4959 0.5040 Tossup
19 153 - 209 - WOODBINE 0.4014 2406 0.4651 0.5348 Safe Rep
7 107 - 307 - MIDDLEBURG 0.4043 1239 0.4953 0.5046 Tossup
9 153 - 101 - BRENTSVILLE 0.4180 1708 0.4904 0.5095 Tossup
22 153 - 403 - BULL RUN 0.4226 3111 0.4860 0.5139 Tossup
20 153 - 401 - EVERGREEN 0.4283 2535 0.5006 0.4993 Tossup
12 153 - 104 - NOKESVILLE 0.4537 2501 0.4960 0.5039 Tossup
11 153 - 103 - BUCKHALL 0.4636 2287 0.4773 0.5226 Safe Rep
8 107 - 309 - ALDIE 0.4687 902 0.4881 0.5118 Tossup

Now that we have our ranking, we can figure out how many votes each precinct might offer. Let's first find the number of votes required to win the seat and the number of votes we're projected to receive, given the calculated APP, previous turnout, and current registration. The district.summary function will provide us with all this information:

district.summary(hd013s)[,c(1,2,9,10,11)]
current.reg proj.turnout.count votes.to.win proj.turnout.rep proj.turnout.dem
1 80193 25401 12701.5 12499 12074

We can see that the projected turnout (proj.turnout.count) is 25,401, so only 12,702 votes are needed to win this district. Using the Democratic APP, we can project Democratic turnout at 12,074, which means we need to find 628 more votes to win. How do we find these votes?
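
Before hunting for them, the deficit arithmetic can be spelled out from the summary columns above (the output suggests votes.to.win is simply half the projected turnout plus one):

ds <- district.summary(hd013s)
votes.to.win <- ds$proj.turnout.count / 2 + 1  # 12701.5, matching the output
votes.to.win - ds$proj.turnout.dem             # 627.5, i.e. 628 votes short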

Let's go back to our sorted precinct list, take the top third, and call those rows our target.precincts.

sorted.precincts <- hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,hd013s$current.reg),]
target.precincts <- sorted.precincts[1:(nrow(sorted.precincts)/3),]

We've got our target list, and we know we need 628 votes from it to bring our total to 50% + 1. Adding a small buffer, we'll set a goal of 640 target votes and allocate them across the target precincts in proportion to each precinct's registered voters. This should set realistic goals for larger and smaller precincts alike.

target.precincts$inc <- as.integer(640 * target.precincts$current.reg/sum(target.precincts$current.reg))

target.precincts[,c(2,3,17,23,18,20:22,24)]
precinct.label proj.turnout.percent proj.turnout.count proj.turnout.dem proj.turnout.rep dem.share rep.share cl inc
SUDLEY NORTH 0.1959 489 248 238 0.5105 0.4894 Tossup 55
MULLEN 0.2218 788 391 387 0.5026 0.4973 Tossup 78
BRIAR WOODS 0.2256 516 243 259 0.4837 0.5162 Tossup 50
CLAUDE MOORE PARK 0.2279 709 366 326 0.5285 0.4714 Safe Dem 68
MOUNTAIN VIEW 0.2319 869 427 437 0.4945 0.5054 Tossup 82
BUCKLAND MILLS 0.2448 892 431 450 0.4891 0.5108 Tossup 80
ELLIS 0.2475 322 160 158 0.5038 0.4961 Tossup 28
FREEDOM 0.2509 986 492 487 0.5028 0.4971 Tossup 86
VICTORY 0.2645 1289 638 637 0.5005 0.4994 Tossup 107

The final column in the result ('inc') is the target vote increase for that precinct. With this information in hand, the campaign's field operation can devise a contact strategy to bring these voters to the polls on election day.
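
One caveat: because as.integer truncates each precinct's fractional share, the targets sum to slightly less than the 640-vote goal, which is worth checking before committing to the plan:

# truncation leaves a few votes unallocated (634 of 640 here)
sum(target.precincts$inc)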

Conclusion

Playing the role of campaign consultant, we have analyzed previous electoral outcomes in the 13th seat of the Virginia House of Delegates. We have shown how a Democratic candidate can leverage increasing Democratic support and low turnout to make this race competitive. We have also created a precinct-targeting methodology that provides a high-level blueprint for resource planning. The analysis we performed is very standard, but using R makes our methodology unique. A down-ballot or primary-challenger campaign taking advantage of this methodology will spend less money and can experiment more with its targeting, potentially leading to a win.

Are you a Democrat running for the Virginia House of Delegates who would like to see the same data for your race? Or, are you a Democratic congressional candidate preparing for the 2010 cycle? Contact me at jjh@offensivepolitics.net for robust targeting data or other analysis.

Follow Offensive Politics on twitter