Aggregate electoral targeting with R

Introduction

Electoral targeting is the process of quantifying the partisan bias of a single voter or subset of voters in a geographic region. Bias can be calculated using an individual's demographic and voting behavior or by aggregating results from an entire election precinct. Targeting is traditionally performed by national committees (e.g., National Committee for an Effective Congress, National Republican Congressional Committee), state political parties, interest groups (e.g., EMILY's List, National Rifle Association), or campaign consultants. Targeting data is consumed by campaign managers and analysts, and it is used along with polling data to build strategy, direct resources, and project electoral outcomes.

While aggregate electoral targeting can build a sophisticated picture of a district, the mathematics behind it is very simple. Targeting can be performed by anyone with previous electoral data, and the calculations can be done on 3x5 note cards, in simple spreadsheets, or in high-end software packages like SPSS. The targeting methods discussed in this post are taken from academic publications on electioneering: Campaign Craft (Burton, Shea 2006) and The Campaign Manager (Shaw, 2004).

Although targeting data is usually inexpensive or free, a down-ballot campaign or a primary challenger might not have the connections or support of a PAC or party to obtain the data. In these cases, a campaign will probably purchase one of the books listed above to perform its own analysis. Even an established campaign may run its own analysis, possibly to test different turnout theories or to integrate additional data. This post is directed towards these groups.

Together, we will assume the role of campaign consultant and perform an aggregate electoral analysis on the 13th House of Delegates seat (HOD#13) in the Commonwealth of Virginia. In HOD#13, the 18-year Republican incumbent Bob Marshall is being challenged by Democrat John Bell. This analysis will compute and visualize turnout, partisan bias, and a precinct ranking based on projected turnout and historical Democratic support.

The analysis of HOD#13 will be performed using R, an open-source computing platform. R is free, extensible, and interactive, making it an ideal platform for experimentation. The R package aggpol was created specifically for this tutorial, and it contains all the data and operations required to execute an aggregate electoral analysis. Readers can execute the provided R code to reproduce the analysis or simply follow along to learn how it was performed. Readers unfamiliar with R should read Introduction to R, which is available on the R project homepage.

The electoral and registration data used were compiled from the Virginia State Board of Elections using several custom-written parsers and two different PDF-to-text engines. Please contact me for source data or more information at: jjh@offensivepolitics.net.

Prerequisites

This section only applies to readers interested in recreating the analysis and graphics produced in this tutorial. To completely recreate this analysis you will need the following:

The latest version of the R statistical computing environment. Binaries, source, and installation instructions can be downloaded from the R homepage.
Additional R packages. This analysis requires several packages that provide additional functionality on top of the existing R system. Install the appropriate R environment for your system and run the program.

Getting Started

Now that the prerequisites are installed we can get started with our data analysis. Start up your R environment and load the required libraries by typing in the following commands:

library(plyr)
library(aggpol)
library(ggplot2)
library(RColorBrewer)

We need to attach the VAHOD data set that comes with aggpol. This data set contains precinct-level electoral returns for state and federal elections in the Commonwealth of Virginia from 2001 to 2008. Since we are focusing on HOD#13, we'll need to select just the records that have to do with that seat.

data(VAHOD)
hd013 <- vahod[which(vahod$seat == "HD-013"),]

The data set contains precinct-level electoral results for the following races: U.S. President, U.S. Senate, U.S. House of Representatives, Virginia Governor, Senate of Virginia, and Virginia House of Delegates. This breadth of electoral returns allows us to build a very detailed profile of the partisan bias of a district.

We will first determine the historical partisanship in HOD#13. Since partisanship can fluctuate over the years and different seats have different turnout expectations, we'll first need to see the major party support for every seat in each election for precincts in HOD#13. We can use the historical.election.summary function from the aggpol package to group the precinct results into district results, and then break them down by seat and year.

esum <- historical.election.summary(hd013)

esum now contains:

.	year	district_type	total.turnout	rep.turnout	rep.turnout.percent	dem.turnout	dem.turnout.percent	oth.turnout	oth.turnout.percent
1	2001	GV	5527	3266	0.5909	2207	0.3993	54	0.0097
2	2001	HD	5399	3475	0.6436	1924	0.3563	0	0
3	2001	LG	5432	3291	0.6058	2025	0.3727	116	0.0213
4	2003	HD	10299	10103	0.9809	110	0.0106	86	0.0083
13 more lines...

We now have major-party turnout for every election in our data set. To best visualize the results we'll build a bar graph comparing major-party turnout in each seat over time. We first need to transpose the election summary object (esum) from a summary format to an observation format, one line per distinct year+district+party. The plyr package makes this task extremely simple.

elx <- ddply(esum,c("year", "district_type"), function(x) 
  rbind(  data.frame(party="REP",turnout=x$rep.turnout.percent),
  data.frame(party="DEM",turnout=x$dem.turnout.percent)))
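
For readers without plyr installed, the same summary-to-observation reshape can be sketched in base R. This is only an illustration under stated assumptions: it rebuilds by hand the four esum rows printed above, and the names esum_demo and elx_demo are hypothetical.

```r
# Hand-built copy of the four esum rows printed above (illustrative only).
esum_demo <- data.frame(
  year                = c(2001, 2001, 2001, 2003),
  district_type       = c("GV", "HD", "LG", "HD"),
  rep.turnout.percent = c(0.5909, 0.6436, 0.6058, 0.9809),
  dem.turnout.percent = c(0.3993, 0.3563, 0.3727, 0.0106),
  stringsAsFactors    = FALSE
)

# Stack the two party columns: one row per year + district_type + party.
elx_demo <- rbind(
  data.frame(esum_demo[, c("year", "district_type")],
             party = "REP", turnout = esum_demo$rep.turnout.percent),
  data.frame(esum_demo[, c("year", "district_type")],
             party = "DEM", turnout = esum_demo$dem.turnout.percent)
)

# Order by year and district so each REP/DEM pair sits together.
elx_demo <- elx_demo[order(elx_demo$year, elx_demo$district_type), ]
print(elx_demo)
```

Each of the four summary rows becomes two observation rows, which is the shape ggplot2 expects when party is mapped to the fill colour.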

We will now use the powerful ggplot2 package to view the Republican and Democratic support for each election, in each seat, for our subset:

ggplot(elx,aes(year,turnout,fill=factor(party))) + 
  geom_bar(stat="identity") + 
  facet_wrap(~district_type,scales="free_x") + 
  scale_fill_brewer(palette="Set1")

Result:
HD#013 major party percentages

This graphic gives us a decent understanding of district-level electoral trends. For U.S. federal elections (figs.: PVP, USH, USS), we can see a distinct drop in Republican support moving towards 2008; the results for U.S. House (USH) and U.S. Senate (USS), in particular, show a strong increase in Democratic support. This growth correlates to statewide trends that resulted in the election of two Democratic Senators representing Virginia for the first time since 1970. General Democratic gains notwithstanding, the House of Delegates (fig.: HD) results aren't as promising for a Democratic challenger. The incumbent Del. Marshall saw more than 60% support in three of the last four elections and saw no challenger at all in 2003. While the district may be trending more Democratic over time, the voters of HOD#13 are obviously big fans of Del. Marshall.

Now that we understand the historical partisanship of this district, we need to understand historical turnout, which will allow us to project the number of votes required to win. We will use the historical.turnout.summary function from the aggpol package to produce a summary of turnout for this district.

historical.turnout.summary(hd013, district.type="HD", district.number="013", years=c(2001,2003,2005,2007))

.	year	total.turnout	total.registration
1	2001	5399	13275
2	2003	10031	45769
3	2005	23592	62497
4	2007	26110	78028

Looking at this table, one can see some data collection problems in the 2001 HD elections. In recent years, precincts belonged to only one House of Delegates seat, but in 2001, and somewhat less so in 2003, some precincts are split, some have duplicate names, and there is no information on how to allocate results from different races to precincts. The turnout numbers are slightly affected by these problems, but the aggpol package attempts to correct this by substituting alternate years or even races if possible.

The takeaway from the previous table is that turnout for the last four House of Delegates elections has hovered around 30% of registered voters, ranging from about 22% in 2003 to about 41% in 2001. This makes some political sense, because Virginia holds state elections in odd-numbered years with no federal elections to drive up turnout. This leaves a lot of registered voters to be activated, but we need to delve down to the precinct level to find them.
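
The turnout rates behind this observation can be recomputed from the table with a few lines of base R. The sketch below copies the four rows by hand; the data frame name turnout_demo and the added columns are hypothetical and not part of aggpol.

```r
# Turnout and registration copied from the table above.
turnout_demo <- data.frame(
  year               = c(2001, 2003, 2005, 2007),
  total.turnout      = c(5399, 10031, 23592, 26110),
  total.registration = c(13275, 45769, 62497, 78028)
)

# Turnout rate = ballots cast / registered voters.
turnout_demo$turnout.percent <-
  turnout_demo$total.turnout / turnout_demo$total.registration

# Registered voters who stayed home: the pool a campaign could activate.
turnout_demo$nonvoters <-
  turnout_demo$total.registration - turnout_demo$total.turnout

print(round(turnout_demo$turnout.percent, 3))
print(turnout_demo$nonvoters)
```

The nonvoters column makes the point of the paragraph concrete: in every year, registered voters who did not vote outnumber those who did.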

We use the district.analyze function of aggpol to aggregate all electoral results into a summary for each precinct.

hd013s <- district.analyze(hd013)

hd013s is a data frame with columns calculated for every precinct; several values for each major party and other values for the precinct as a whole. Those statistics are:

Aggregate base partisan vote - The lowest non-zero turnout for a major party, in all electoral years.
Average Party Performance - The average percentage of the vote a party receives in the closest 3 elections in recent years.
Swing vote - The part of the electorate not included in the aggregate base partisan vote.
Soft-partisan vote - The average worst a party has performed, minus the actual worst.
Toss-up - The portion of the electorate not included in the Aggregate base or soft-base partisan vote.
Partisan base - The combined aggregate-base and soft-partisan vote for each major party.
Partisan swing - The combined major party swing vote.
Projected turnout - The portion of the electorate that is projected to turn out given previous turnout and current registration data.
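
To make a few of these definitions concrete, here is a minimal sketch for a single made-up precinct. Everything in it is an assumption for illustration: the vote shares are invented, "closest" is taken to mean the smallest two-party margin, and the soft-partisan vote is computed as APP minus base, which is only one possible reading of the definition above. The aggpol package may calculate these values differently.

```r
# Hypothetical Democratic and Republican vote shares for one precinct
# across six past elections (made-up numbers, not from aggpol).
dem_share <- c(0.40, 0.46, 0.49, 0.52, 0.47, 0.44)
rep_share <- c(0.58, 0.52, 0.50, 0.47, 0.51, 0.55)

# Aggregate base partisan vote: the lowest non-zero share a party received.
base.dem <- min(dem_share[dem_share > 0])
base.rep <- min(rep_share[rep_share > 0])

# Average party performance: mean share in the three closest elections,
# where "closest" is ASSUMED to mean the smallest two-party margin.
closest3 <- order(abs(dem_share - rep_share))[1:3]
app.dem  <- mean(dem_share[closest3])
app.rep  <- mean(rep_share[closest3])

# Soft-partisan vote, ASSUMED here to be APP minus base.
soft.dem <- app.dem - base.dem
soft.rep <- app.rep - base.rep

# Partisan base (base + soft for both parties) and the remaining toss-up.
partisan.base <- base.dem + soft.dem + base.rep + soft.rep
tossup        <- 1 - partisan.base

print(round(c(base.dem = base.dem, app.dem = app.dem, soft.dem = soft.dem,
              base.rep = base.rep, app.rep = app.rep, soft.rep = soft.rep,
              tossup = tossup), 4))
```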

These variables can be visualized with the following graphic, adapted, along with the definitions above, from Campaign Craft (Burton, Shea).

The actual columns in the data frame returned from district.analyze are:

proj.turnout.percent - The projected turnout percent for a hypothetical next election.
proj.turnout.count - The projected number of voters who will turn out for a hypothetical next election.
current.reg - Current number of registered voters in a precinct.
partisan.base - The combined aggregate-base and soft-partisan vote for both major parties ( Partisan base ).
partisan.swing - All non-base voters (1.0 - partisan.base).
tossup - The portion of the electorate not in the base or soft support of either major party.
app.rep - The average party performance of a Republican candidate in this precinct.
base.rep - The aggregate base partisan vote for a Republican candidate in this precinct.
soft.rep - The soft partisan vote for a Republican candidate in this precinct.
app.dem - The average party performance of a Democratic candidate in this precinct.
base.dem - The aggregate base partisan vote for a Democratic candidate in this precinct.
soft.dem - The soft partisan vote for a Democratic candidate in this precinct.
partisan.rep - Combination of aggregate base and soft vote percentages for the Republican.
partisan.dem - Combination of aggregate base and soft vote percentages for the Democrat.

The most useful statistic above is the Average Party Performance (APP), which is an average of major-party turnout in the 3 closest recent elections. The APP describes supporter levels for a best-case scenario in a close election. We've already calculated the APP of each major party (app.dem, app.rep), but when a race doesn't have a third party candidate what we'll usually visualize is the share of the combined partisan performance that each party receives. We'll add these variables to our summary data frame generated previously, one for each major party.

hd013s$dem.share <- hd013s$app.dem/(hd013s$app.dem+hd013s$app.rep)
hd013s$rep.share <- hd013s$app.rep/(hd013s$app.dem+hd013s$app.rep)

Now that we have the APP and partisan vote share for each party, we can visualize the precinct-level terrain for the Democratic challenger Mr. Bell. This visualization should show us the Democratic support for each precinct and give us an idea which precincts could be competitive. We'll produce this visualization using a density plot + 1d histogram, adapted from the seatsVotes plot in the pscl package. We'll also draw a cut-line down the 50% vote mark to help find competitive precincts.

qplot(dem.share, data=hd013s, geom=c("density","rug"),
    xlab="Dem Vote Share",
    main="Democratic vote share, by precinct")  + 
  geom_vline(xintercept=.50)

We can see a lot of precincts are between 48% and 53% Democratic, which means those precincts could potentially go for either candidate. We need to classify these results into something more solid. Let's say precincts with less than 48% Democratic share are Safe Republican, 48-52% are Tossup, and greater than 52% are Safe Democrat. This is a simple representation but can be refined later. We'll add a seat classification to our data frame using the cut function:

hd013s$cl <- cut(hd013s$dem.share, breaks=c(0,.48,.52,1), labels=c("Safe Rep", "Tossup", "Safe Dem"))
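
The behaviour of cut at the boundaries is worth a quick check. The sketch below applies the same breaks and labels to three Democratic shares from HOD#13 precincts plus two boundary values; the vector names shares and cl_demo are hypothetical.

```r
# Three Democratic shares from precincts in this district,
# plus two boundary values to show how the cut points are assigned.
shares <- c(0.5105, 0.5285, 0.4775, 0.48, 0.52)

# cut() builds right-closed intervals by default:
# (0, 0.48], (0.48, 0.52], (0.52, 1].
cl_demo <- cut(shares, breaks = c(0, 0.48, 0.52, 1),
               labels = c("Safe Rep", "Tossup", "Safe Dem"))

print(data.frame(shares, cl_demo))

# A precinct at exactly 0.48 is rated "Safe Rep"; one at exactly 0.52 is "Tossup".
# A share of exactly 0 would come back as NA unless include.lowest = TRUE is set.
print(table(cl_demo))
```

Because the intervals are closed on the right, the upper edge of each band belongs to the lower category. Passing right = FALSE to cut would flip that convention if a campaign preferred it.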

Now we need to visualize how many precincts fall into which classification, using a histogram this time instead of a density curve.

ggplot(hd013s, aes(x=dem.share)) + 
  geom_histogram(aes(fill=cl),binwidth=0.01) + 
  scale_fill_brewer("Precinct Rating", palette="RdYlBu") + 
  scale_x_continuous("Democratic Vote Share") + 
  scale_y_continuous("Frequency")

From the histogram we see that not only does a Republican candidate enjoy more "Safe" precincts, but even the majority of the tossup precincts have less than 50% Democratic share. While the precinct breakdown looks bad, a Democratic win in this district is theoretically possible if these tossup precincts are held. A Democratic candidate will face a tough challenge, so the next step will be identifying Democratic and Democrat-leaning precincts to target.

To make this target precinct list we'll need a method to prioritize the precincts so that we can reach the most persuadable voters while spending the least resources. A popular method to identify a precinct as high-value is to sort precincts by lowest projected turnout with highest Democratic vote share. Lower turnout means there are registered voters waiting to be convinced to show up, and high Democratic vote share means more of those voters will be Democrats.

Since we measured both of these values (turnout %, Democratic vote share), it is very easy to order our data by turnout (ascending) and Democratic average party performance (descending) using R.

hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem),c(1:2,20:21),]

	precinct_name	proj.turnout.percent	dem.share	rep.share
25	153 - 409 - SUDLEY NORTH	0.1959	0.5105	0.4894
27	153 - 411 - MULLEN	0.2218	0.5026	0.4973
4	107 - 111 - BRIAR WOODS	0.2256	0.4837	0.5162
6	107 - 212 - CLAUDE MOORE PARK	0.2279	0.5285	0.4714
26	153 - 410 - MOUNTAIN VIEW	0.2319	0.4945	0.5054
16	153 - 110 - BUCKLAND MILLS	0.2448	0.4891	0.5108
13	153 - 106 - ELLIS	0.2475	0.5038	0.4961
5	107 - 112 - FREEDOM	0.2509	0.5028	0.4971
15	153 - 108 - VICTORY	0.2645	0.5005	0.4994
24	153 - 408 - GLENKIRK	0.2837	0.4998	0.5001
1	107 - 106 - EAGLE RIDGE	0.2856	0.4992	0.5007
18	153 - 112 - CEDAR POINT	0.2876	0.4855	0.5144
14	153 - 107 - MARSTELLER	0.3067	0.4775	0.5224
3	107 - 109 - HUTCHISON	0.3168	0.4857	0.5142
2	107 - 108 - MERCER	0.3281	0.5034	0.4965
17	153 - 111 - BRISTOW RUN	0.3324	0.4822	0.5177
23	153 - 406 - ALVEY	0.3460	0.4736	0.5263
21	153 - 402 - BATTLEFIELD	0.3546	0.4323	0.5676
10	153 - 102 - BENNETT	0.3896	0.4959	0.5040
19	153 - 209 - WOODBINE	0.4014	0.4651	0.5348
7	107 - 307 - MIDDLEBURG	0.4043	0.4953	0.5046
9	153 - 101 - BRENTSVILLE	0.4180	0.4904	0.5095
22	153 - 403 - BULL RUN	0.4226	0.4860	0.5139
20	153 - 401 - EVERGREEN	0.4283	0.5006	0.4993
12	153 - 104 - NOKESVILLE	0.4537	0.4960	0.5039
11	153 - 103 - BUCKHALL	0.4636	0.4773	0.5226
8	107 - 309 - ALDIE	0.4687	0.4881	0.5118

This sorted list is our critical intelligence for finding persuadable voters, but we need a better way to visualize the output. Since we have two continuous variables (turnout %, Democratic vote share) we can use a scatter plot with the Democratic vote share on the Y axis and Turnout % on the X. We'll also color each precinct with the seat classification we defined earlier (Safe Republican, Tossup, Safe Democrat):

ggplot(aes(x=proj.turnout.percent, y=dem.share), data=hd013s) + 
  geom_point(aes(colour=cl)) + 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type")

This chart echoes what we've seen previously: the Democratic challenger faces an uphill battle, but there is room for a win. We see a single "Safe Democrat" precinct with very low turnout, and five "Safe Republican" precincts that span much of the turnout range. Given the high number of "Tossup" precincts, and the fact that they run the gamut as far as turnout is concerned, we'll need to incorporate additional information into our prioritization. If we also rank precincts by current voter registration, we can focus on precincts where we stand to gain the most ground.

Before we continue, we need to make sure there is enough difference in precinct-to-precinct registration to have an impact. Let's look at some statistics for the current registration in this district.

mean(hd013s$current.reg)
sd(hd013s$current.reg)

.
2970.111
1014.072

There are on average 2,970 current registered voters in each precinct, but the standard deviation is 1,014 voters. A standard deviation that high tells us we need to take into account registration if we want to focus on the precincts with 4000 people and not 1000 people. A histogram of current registration will help us clarify this finding:
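
These figures can be reproduced without aggpol from the current.reg values listed in the precinct table later in this post. The sketch below is illustrative: the vector reg is copied by hand, and the coefficient of variation and the binned counts are additions of this sketch rather than part of the original analysis.

```r
# Current registration for the 27 precincts, copied from the precinct table in this post.
reg <- c(2497, 3555, 2288, 3115, 3749, 3646, 1303, 3929, 4874,
         2175, 2531, 3497, 3669, 3722, 3229, 3031, 4403, 3851,
         4440, 2406, 1239, 1708, 3111, 2535, 2501, 2287, 902)

reg_mean <- mean(reg)
reg_sd   <- sd(reg)

# Coefficient of variation: spread relative to the typical precinct size.
reg_cv <- reg_sd / reg_mean

# Count precincts in 500-voter bins, the bin width used for the histogram.
reg_bins <- table(cut(reg, breaks = seq(500, 5000, by = 500), dig.lab = 4))

print(round(c(mean = reg_mean, sd = reg_sd, cv = reg_cv), 3))
print(reg_bins)
```

A standard deviation of about a third of the mean is a reasonable rule-of-thumb signal that precinct size varies enough to matter in a ranking.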

qplot(current.reg, data=hd013s, geom="histogram",binwidth=500,xlab="Current Registration") + 
  scale_y_continuous("Frequency")

The histogram confirms what the standard deviation suggested: we see some very small precincts and some large precincts, but the majority are somewhere in the 2000-4000 range. The difference looks to be large enough to include current registration in our ranking.

We need to look at the Democratic Vote Share vs Turnout % scatter plot again, but with the points scaled to the current precinct registration.

qplot(proj.turnout.percent,dem.share,size=current.reg, data=hd013s,colour=cl)+ 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")

Result:
Democratic vote share by Turnout %

This plot is almost complete and ready to be analyzed. The last job is to label the points with their precinct names. Our current precinct_name variable is actually a unique identifier with a FIPS county code, a precinct code, and a name, and it is too long for a point label. We'll shrink it down to just the name and then we'll recreate the scatter plot with the label:

# replace the fips code and precinct number w/ an empty string
hd013s$precinct.label <- sub("^[0-9]+ - [0-9]+ - ",'',as.character(hd013s$precinct_name))
# plot the previous graph again but this time use precinct.label as the label
ggplot(hd013s, aes(x=proj.turnout.percent, y=dem.share,label=precinct.label)) + 
  geom_point(aes(colour=cl,size=current.reg)) + 
  geom_text(size=2.5,vjust=1.5,angle=25) + 
  labs(x="Projected Turnout %", y="Democratic Vote Share %",colour="Seat Type",size="Current Registration")
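
The regular expression can be tried on its own before touching the data frame. This sketch uses three identifiers from the precinct tables; the final "ABSENTEE" string is a hypothetical value, included only to show what happens when the pattern does not match.

```r
# Sample identifiers in the "county FIPS - precinct code - name" layout.
ids <- c("153 - 409 - SUDLEY NORTH",
         "107 - 212 - CLAUDE MOORE PARK",
         "107 - 309 - ALDIE")

# ^[0-9]+ - [0-9]+ -  matches the two numeric codes and their separators
# at the start of the string; sub() replaces that prefix with nothing.
precinct_labels <- sub("^[0-9]+ - [0-9]+ - ", "", ids)
print(precinct_labels)

# Strings that do not match the pattern are returned unchanged.
unmatched <- sub("^[0-9]+ - [0-9]+ - ", "", "ABSENTEE")
print(unmatched)
```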

From the chart we can see that a Democrat in HD#013 will want to focus contact efforts on the precincts in the upper-left corner of the plot and will want to target larger precincts before smaller ones. Integrating the current registration into our previous sort command leaves us with the following sort order:

hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,hd013s$current.reg),c(1:2,4,20:22),]

	precinct_name	proj.turnout.percent	current.reg	dem.share	rep.share	cl
25	153 - 409 - SUDLEY NORTH	0.1959	2497	0.5105	0.4894	Tossup
27	153 - 411 - MULLEN	0.2218	3555	0.5026	0.4973	Tossup
4	107 - 111 - BRIAR WOODS	0.2256	2288	0.4837	0.5162	Tossup
6	107 - 212 - CLAUDE MOORE PARK	0.2279	3115	0.5285	0.4714	Safe Dem
26	153 - 410 - MOUNTAIN VIEW	0.2319	3749	0.4945	0.5054	Tossup
16	153 - 110 - BUCKLAND MILLS	0.2448	3646	0.4891	0.5108	Tossup
13	153 - 106 - ELLIS	0.2475	1303	0.5038	0.4961	Tossup
5	107 - 112 - FREEDOM	0.2509	3929	0.5028	0.4971	Tossup
15	153 - 108 - VICTORY	0.2645	4874	0.5005	0.4994	Tossup
24	153 - 408 - GLENKIRK	0.2837	2175	0.4998	0.5001	Tossup
1	107 - 106 - EAGLE RIDGE	0.2856	2531	0.4992	0.5007	Tossup
18	153 - 112 - CEDAR POINT	0.2876	3497	0.4855	0.5144	Tossup
14	153 - 107 - MARSTELLER	0.3067	3669	0.4775	0.5224	Safe Rep
3	107 - 109 - HUTCHISON	0.3168	3722	0.4857	0.5142	Tossup
2	107 - 108 - MERCER	0.3281	3229	0.5034	0.4965	Tossup
17	153 - 111 - BRISTOW RUN	0.3324	3031	0.4822	0.5177	Tossup
23	153 - 406 - ALVEY	0.3460	4403	0.4736	0.5263	Safe Rep
21	153 - 402 - BATTLEFIELD	0.3546	3851	0.4323	0.5676	Safe Rep
10	153 - 102 - BENNETT	0.3896	4440	0.4959	0.5040	Tossup
19	153 - 209 - WOODBINE	0.4014	2406	0.4651	0.5348	Safe Rep
7	107 - 307 - MIDDLEBURG	0.4043	1239	0.4953	0.5046	Tossup
9	153 - 101 - BRENTSVILLE	0.4180	1708	0.4904	0.5095	Tossup
22	153 - 403 - BULL RUN	0.4226	3111	0.4860	0.5139	Tossup
20	153 - 401 - EVERGREEN	0.4283	2535	0.5006	0.4993	Tossup
12	153 - 104 - NOKESVILLE	0.4537	2501	0.4960	0.5039	Tossup
11	153 - 103 - BUCKHALL	0.4636	2287	0.4773	0.5226	Safe Rep
8	107 - 309 - ALDIE	0.4687	902	0.4881	0.5118	Tossup
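
One property of order is worth keeping in mind when reading this table: keys are applied left to right, so a later key only breaks ties in the earlier ones. Projected turnout is effectively unique per precinct here, which is why the rows appear in the same order as in the earlier two-key sort. The toy example below, with made-up precincts, shows how the third key behaves when ties do occur and how negating it would favour larger precincts.

```r
# Toy precincts (hypothetical values) with a deliberate tie on the first two keys.
toy <- data.frame(
  precinct = c("A", "B", "C", "D"),
  turnout  = c(0.25, 0.20, 0.20, 0.30),
  app.dem  = c(0.51, 0.49, 0.49, 0.55),
  reg      = c(3000, 1500, 4000, 2000),
  stringsAsFactors = FALSE
)

# Ascending registration puts the smaller tied precinct (B) first.
asc <- toy[order(toy$turnout, -toy$app.dem, toy$reg), "precinct"]

# Negating the key sorts it in descending order, so the larger precinct (C) leads.
desc <- toy[order(toy$turnout, -toy$app.dem, -toy$reg), "precinct"]

print(asc)   # B C A D
print(desc)  # C B A D
```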

Now that we have our ranking, we can figure out how much each precinct might offer. Let's first see the number of votes required to win the seat and the number of votes we're projected to receive given the calculated APP, previous turnout, and current registration. The district.summary function will provide us with all this information:

district.summary(hd013s)[,c(1,2,9,10,11)]

	current.reg	proj.turnout.count	votes.to.win	proj.turnout.rep	proj.turnout.dem
1	80193	25401	12701.5	12499	12074

We can see that projected turnout (proj.turnout.count) is 25,401, so the number of votes needed to win this district is only 12,702. Using the Democratic APP, we can project Democratic turnout at 12,074, so we need to find 628 more votes to win. How do we find these votes?
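
The arithmetic can be checked directly. In the sketch below the three inputs are copied from the district.summary output; the formula for votes.to.win (half of projected turnout plus one) is an assumption inferred from the printed value of 12701.5, not a documented description of what aggpol does.

```r
# Figures copied from the district.summary output above.
proj.turnout.count <- 25401
proj.turnout.dem   <- 12074
proj.turnout.rep   <- 12499

# ASSUMPTION: votes.to.win is half of projected turnout plus one,
# which reproduces the 12701.5 printed in the table.
votes.to.win <- proj.turnout.count / 2 + 1

# Round up to whole votes, then subtract the projected Democratic vote.
dem.gap <- ceiling(votes.to.win) - proj.turnout.dem

# Projected ballots not going to either major party.
other <- proj.turnout.count - (proj.turnout.dem + proj.turnout.rep)

print(c(votes.to.win = votes.to.win, dem.gap = dem.gap, other = other))
```

The last line shows that the two major-party projections do not sum to total projected turnout; the remainder is the vote expected to go elsewhere.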

Let's go back to our sorted precinct list, take the top third, and call those our target.precincts.

sorted.precincts <- hd013s[order(hd013s$proj.turnout.percent,-hd013s$app.dem,hd013s$current.reg),]
target.precincts <- sorted.precincts[1:(nrow(sorted.precincts)/3),]

We've got our target list, and we know we need 628 votes from them to bring our total to 50% + 1. Adding a small buffer to that number, we'll take 640 target votes and allocate them across our target precincts, proportional to the number of registered voters in the precinct. Hopefully, this will set more realistic goals for larger and smaller precincts.

target.precincts$inc <- as.integer(640 * target.precincts$current.reg/sum(target.precincts$current.reg))

target.precincts[,c(2,3,17,23,18,20:22,24)]

precinct.label	proj.turnout.percent	proj.turnout.count	proj.turnout.dem	proj.turnout.rep	dem.share	rep.share	cl	inc
SUDLEY NORTH	0.1959	489	248	238	0.5105	0.4894	Tossup	55
MULLEN	0.2218	788	391	387	0.5026	0.4973	Tossup	78
BRIAR WOODS	0.2256	516	243	259	0.4837	0.5162	Tossup	50
CLAUDE MOORE PARK	0.2279	709	366	326	0.5285	0.4714	Safe Dem	68
MOUNTAIN VIEW	0.2319	869	427	437	0.4945	0.5054	Tossup	82
BUCKLAND MILLS	0.2448	892	431	450	0.4891	0.5108	Tossup	80
ELLIS	0.2475	322	160	158	0.5038	0.4961	Tossup	28
FREEDOM	0.2509	986	492	487	0.5028	0.4971	Tossup	86
VICTORY	0.2645	1289	638	637	0.5005	0.4994	Tossup	107

The final column in the result is the target increase for that precinct (column: 'inc'). With this information in hand the campaign field operations can devise a contact strategy to bring these voters to the polls on election day.
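
One detail of the allocation is easy to miss: as.integer drops the fractional part of every quota, so the per-precinct targets add up to slightly less than the 640 requested. The sketch below reproduces the inc column from the registration figures and then shows largest-remainder rounding as a swapped-in alternative that hits the goal exactly. The alternative is not the method used in this post, and the variable names are hypothetical.

```r
# Current registration for the nine target precincts, from the tables above.
reg <- c("SUDLEY NORTH" = 2497, "MULLEN" = 3555, "BRIAR WOODS" = 2288,
         "CLAUDE MOORE PARK" = 3115, "MOUNTAIN VIEW" = 3749,
         "BUCKLAND MILLS" = 3646, "ELLIS" = 1303, "FREEDOM" = 3929,
         "VICTORY" = 4874)
goal <- 640

# Exact proportional quota for each precinct.
quota <- goal * reg / sum(reg)

# Truncation, as in the post: every fraction is dropped.
inc_trunc <- as.integer(quota)
print(sum(inc_trunc))   # 634, six short of the 640 goal

# Largest-remainder rounding (an alternative, not the post's method):
# hand the leftover votes to the precincts with the biggest fractions.
inc_lr   <- floor(quota)
leftover <- goal - sum(inc_lr)
bump     <- order(quota - inc_lr, decreasing = TRUE)[seq_len(leftover)]
inc_lr[bump] <- inc_lr[bump] + 1
print(sum(inc_lr))      # 640
```

Even with truncation the targets total 634, which still covers the 628-vote gap, so the choice between the two roundings only affects the size of the buffer.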

Conclusion

Playing the role of campaign consultant, we have analyzed previous electoral outcomes in the 13th seat of the House of Delegates in Virginia. We have shown how a Democratic candidate can leverage increasing Democratic support and low turnout to make this race competitive. We have also created a precinct targeting methodology that provides a high-level blueprint for resource planning. The analysis we performed is very standard, but using R makes our methodology unique. A down-ballot or primary-challenger campaign taking advantage of this methodology will spend less money and can experiment more on their targeting, potentially leading them to a win.

Are you a Democrat running for the Virginia House of Delegates who would like to see the same data for your race? Or, are you a Democratic congressional candidate preparing for the 2010 cycle? Contact me at jjh@offensivepolitics.net for robust targeting data or other analysis.

Jason Holt