# Donor analysis in R - Smith for Congress

In a previous post I introduced the Smith for Congress data set. The data is 49k contributions made by individuals to a congressional campaign for the 2006-2010 electoral cycles. Smith for Congress is not the name of the actual campaign.

A campaign is not required to disclose individual contributions unless the individual donates more than \$200 during a single electoral cycle. The Smith for Congress campaign has, for their own reasons, published every individual contribution. This disclosure allows us an unprecedented look into how a modern campaign raises money. I've collected and scrubbed these contributions and published them for research use. In this post I will perform a detailed donor analysis with R to better understand how the Smith for Congress campaign financed its 2010 election. Full code and graphs can be found in the simple-analysis GitHub repository for this post.

## Preparation

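```
# latest Smith for Congress data as of this writing is March 23 2011.
# assumes the contributions are already loaded into a data frame named cd
# clean up the date variable before subsetting so cd0 inherits the Date column
cd$contribution_date <- as.Date(cd$contribution_date, format="%m/%d/%Y")
# subset the data to just the 2010 cycle
cd0 <- cd[cd$cycle == 2010,]
# drop amounts < $1; a logical filter is safe even when no rows match
cd0 <- cd0[cd0$amount >= 1,]
```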
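A note on dropping rows by condition: negative indexing with `which()` returns an empty data frame when no row matches the condition, so a logical filter is the safer way to express "keep amounts of at least \$1". A minimal sketch on a made-up toy data frame (the values are hypothetical, not campaign data):

```r
# Toy data, hypothetical values: no amounts below $1
toy <- data.frame(personid = c("a", "b", "c"), amount = c(10, 25, 50))

# which() returns integer(0) here, and toy[-integer(0), ] selects nothing
dropped.with.which <- toy[-which(toy$amount < 1), ]

# A logical filter keeps every row that satisfies the condition
dropped.with.logical <- toy[toy$amount >= 1, ]

nrow(dropped.with.which)    # 0 rows: every record was lost
nrow(dropped.with.logical)  # 3 rows: nothing needed dropping
```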

Data for the 2010 electoral cycle consists of 11,717 contributions made by 6,949 individuals, totaling over \$770,000. Here is a sample:

| personid | amount | ctd_aggregate | contribution_date | cycle |
| --- | --- | --- | --- | --- |
| 9zvlnzw1qj9bvq7k1x47v486a | 10 | 20 | 2009-04-01 | 2010 |
| iy8xcopedihv9vwqpg3iwmal | 15 | 35 | 2009-04-01 | 2010 |
| 1f0lct995ckygk6y4vaxk2q44 | 20 | 20 | 2009-04-01 | 2010 |
| bf2d43vdjdg07pgfmph6ghy7o | 20 | 20 | 2009-04-01 | 2010 |
| 7sj05z74r8y10fcctvx4a38pn | 20 | 20 | 2009-04-01 | 2010 |

## Data Summary

Since the number of individual donors (6,949) is so much lower than the number of contributions (11,717), we can guess that a good portion of those donors gave multiple times. The long-form contribution data is somewhat difficult to work with when looking at multiple contributions from the same person, so we'll generate a summary data frame to help with our analysis. The following variables will be captured per individual donor:

• Date of first contribution
• The total value of all contributions by this individual
• The total number of contributions by this individual
• The amounts of the first three contributions. NA where the donor made fewer than three contributions.
• The number of days between consecutive contributions, for the first three gaps. NA where the donor did not make enough contributions to have that gap.

```
library(plyr)  # provides ddply

summarize.contributions <- function(x) {
  # sort by date so the gaps and the first three amounts are chronological
  xo <- x[order(x$contribution_date),]
  dtx <- as.integer(diff(xo$contribution_date))

  return(data.frame(
    first.contribution=xo$contribution_date[1],
    num.contributions = nrow(xo),
    dt1=dtx[1],
    dt2=dtx[2],
    dt3=dtx[3],
    am1=xo$amount[1],
    am2=xo$amount[2],
    am3=xo$amount[3],
    total.value=sum(xo$amount)
  ))
}
cd0s <- ddply(cd0, "personid", summarize.contributions)
```
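To see what this kind of summary produces, here is a small base R sketch on invented data that computes per-donor gaps and totals without plyr. The toy data frame and the helper name `summarize.toy` are hypothetical and only illustrate the idea; note the rows for donor "a" are deliberately out of date order.

```r
# Hypothetical contributions: donor "a" gave three times, donor "b" once
toy <- data.frame(
  personid = c("a", "a", "a", "b"),
  amount = c(30, 25, 10, 1000),
  contribution_date = as.Date(c("2009-10-24", "2009-08-27",
                                "2010-02-15", "2009-10-15"))
)

summarize.toy <- function(x) {
  xo <- x[order(x$contribution_date), ]          # chronological order
  gaps <- as.integer(diff(xo$contribution_date)) # days between gifts
  data.frame(
    personid = xo$personid[1],
    first.contribution = xo$contribution_date[1],
    num.contributions = nrow(xo),
    dt1 = gaps[1], dt2 = gaps[2],                # NA when the gap does not exist
    am1 = xo$amount[1], am2 = xo$amount[2],
    total.value = sum(xo$amount)
  )
}

# split by donor, summarize each piece, and stack the results
toy.summary <- do.call(rbind, lapply(split(toy, toy$personid), summarize.toy))
toy.summary
```

Indexing past the end of a vector in R returns NA, which is why single-gift donors end up with NA gaps and amounts without any special handling.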

Now the cd0s data frame holds our summary table, which looks like this:

| personid | first.contribution | num.contributions | dt1 | dt2 | dt3 | am1 | am2 | am3 | total.value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1023ryaqqbvz76kh3yq0r2ngq | 2010-10-18 | 1 | NA | NA | NA | 25 | NA | NA | 25 |
| 1036lg58hd4skceuyqrr2peb4 | 2010-03-25 | 2 | 166 | NA | NA | 35 | 25 | NA | 60 |
| 106f366ysq6xe9ci731wejh0k | 2009-12-11 | 4 | 91 | 185 | 63 | 50 | 50 | 50 | 250 |
| 1081wyujzkgninrt1srf79tbo | 2009-08-27 | 3 | 58 | 114 | NA | 25 | 30 | 10 | 65 |
| 1094yhx62fcdx3c012mlpxnex | 2009-10-15 | 1 | NA | NA | NA | 1000 | NA | NA | 1000 |

## Giving Levels

With detailed giving levels we can infer a lot of information about a campaign, and about how the fundraisers are doing their jobs. If most of the giving is in the \$15-20 range we can assume they focus on small donors and maybe online contributions. If most of the giving is in the \$100-250 range then maybe the campaign throws lots of medium-sized dinners. If most of the donations are close to the legal maximum of \$4,800 then the campaign is focused on major donors, and might be ignoring smaller donors altogether.

Plotting a histogram of total donation amount per individual will give us better insight into the giving levels.

```
library(ggplot2)  # provides qplot
qplot(total.value, data=cd0s, geom="histogram", binwidth=50)
# share of individual contributions under $250
nrow(cd0[cd0$amount<250,]) / nrow(cd0)
summary(cd0s$total.value)
```

[Figure: Giving Levels, Smith for Congress 2010]

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
| --- | --- | --- | --- | --- | --- |
| 1 | 25 | 50 | 111 | 100 | 4800 |

In 2010, at least 75% of contributors gave \$100 or less in total to the campaign, since the third quartile is \$100. The summary table shows us the median total value donated was \$50, while the overall average was \$111. The maximum was \$4,800, which is also the maximum allowed by law for 2010. We can infer that while there was certainly some major-donor solicitation, the fundraisers were focused on much smaller donors.
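The share of donors at or below a given total can also be computed directly rather than read off the quartiles. A minimal sketch on made-up totals; with the post's data the vector would be `cd0s$total.value`, and the name `share.100.or.less` is just illustrative.

```r
# Hypothetical per-donor totals, for illustration only
totals <- c(10, 25, 50, 50, 100, 250, 1000, 4800)

# mean() of a logical vector is the proportion of TRUE values
share.100.or.less <- mean(totals <= 100)
share.100.or.less  # 0.625, i.e. 5 of the 8 toy donors

# the same quartiles that summary() reports
quantile(totals, c(0.25, 0.5, 0.75))
```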

## Repeat donors

Now that we know more about giving levels, it would be helpful to better understand giving frequency. The amount of repeat giving may give us insight into how involved the fundraisers are getting, and maybe even how often they are asking for money. We'll use a histogram and a cross-tab of the total number of contributions by individuals to help us with this analysis:

```
qplot(num.contributions, data=cd0s, geom="histogram", binwidth=1)
table(cd0s$num.contributions)
```

[Figure: Giving Frequency, Smith for Congress 2010]

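| num.contributions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 13 | 14 | 18 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| donors | 4242 | 1599 | 621 | 256 | 120 | 60 | 28 | 7 | 7 | 5 | 1 | 1 | 1 | 1 |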

Our plot and table show that about 61% (4,242) of the contributors to Smith for Congress gave only one time, leaving 2,707 people who gave more than once. Most of the people who gave more than once gave twice, but there were still several hundred people who gave 3 or 4 times each.

To understand how important repeat giving might be we need more detailed information. We need to look at the total amount donated by each group of contributors; we'll also include the cumulative total, cumulative percentage, and individual percentage of total for each group.

```
# total raised and donor count for each giving frequency
gft <- ddply(cd0s, "num.contributions", function(x) { data.frame(total=sum(x$total.value), n=nrow(x)) })
gft$percent <- gft$total / sum(gft$total) * 100
gft$running.total <- cumsum(gft$total)
gft$running.percent <- gft$running.total / sum(gft$total) * 100
```

Our gft data frame looks like this:

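| num.contributions | total | n | percent | running.total | running.percent |
| --- | --- | --- | --- | --- | --- |
| 1 | 284043 | 4242 | 36.821 | 284043 | 37 |
| 2 | 212697 | 1599 | 27.572 | 496740 | 64 |
| 3 | 118998 | 621 | 15.426 | 615738 | 80 |
| 4 | 72197 | 256 | 9.359 | 687935 | 89 |
| 5 | 43513 | 120 | 5.641 | 731448 | 95 |
| 6 | 24428 | 60 | 3.167 | 755876 | 98 |
| 7 | 4825 | 28 | 0.625 | 760701 | 99 |
| 8 | 3988 | 7 | 0.517 | 764689 | 99 |
| 9 | 4340 | 7 | 0.563 | 769029 | 100 |
| 10 | 990 | 5 | 0.128 | 770019 | 100 |
| 13 | 167 | 1 | 0.022 | 770186 | 100 |
| 14 | 675 | 1 | 0.088 | 770861 | 100 |
| 18 | 360 | 1 | 0.047 | 771221 | 100 |
| 20 | 200 | 1 | 0.026 | 771421 | 100 |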

We see the campaign raised \$284,000 (36.8% of the total raised) from the 4,242 contributors who gave only once, and \$213,000 (27.6% of the total raised) from the 1,599 contributors who gave two times. We also see the campaign raised \$487,378 from 2,707 repeat donors; that is about 63% of the total value raised from individuals for the entire cycle. It is obvious the Smith for Congress campaign is good at attracting small-dollar donors, nearly 40% of whom gave more than once. This is a pretty impressive repeat donor rate.
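The repeat-donor figures can be checked directly from the published table. This sketch re-enters the `n` and `total` columns by hand; the vector and variable names are mine, not part of the original analysis.

```r
# n and total columns copied from the gft table, ordered by giving frequency
n <- c(4242, 1599, 621, 256, 120, 60, 28, 7, 7, 5, 1, 1, 1, 1)
total <- c(284043, 212697, 118998, 72197, 43513, 24428, 4825,
           3988, 4340, 990, 167, 675, 360, 200)

# everything except the first row (one-time donors) is repeat giving
repeat.donors <- sum(n[-1])
repeat.value <- sum(total[-1])

repeat.donors                               # 2707
repeat.value                                # 487378
round(100 * repeat.donors / sum(n), 1)      # share of donors who gave again
round(100 * repeat.value / sum(total), 1)   # share of dollars from repeat donors
```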

Finally I'd like to look at what kind of donations make up each level of giving. We know repeat donors gave \$487,000, but we don't know if that was mostly in \$50 donations or in \$250 donations. We can use a box and whisker plot to break down each giving level. I'm leaving off giving frequencies of 8 and above since giving was so sparse at those levels. We'll be plotting with a log transform on the y axis, since a few very large values would otherwise skew the graph and render it mostly useless. I used a trick from this Stack Overflow thread to get the formatting correct on the y axis:

```
formatBack <- function(x) paste(round(10^x, 2), "$", sep=' ')
# current ggplot2 takes the formatting function via labels= (older releases used formatter=)
qplot(factor(num.contributions), log10(total.value), data=cd0s[cd0s$num.contributions < 8,], geom="boxplot",
      ylab="Total Value (log)", xlab="Giving Frequency", main="Giving Levels by Giving Frequency, Smith for Congress 2010") +
  scale_y_continuous(labels=formatBack)
# same data, but in table format
ddply(cd0s, "num.contributions", function(x) { data.frame(total=sum(x$total.value), n=nrow(x), min=min(x$total.value),
  mean=mean(x$total.value), median=median(x$total.value), std=sd(x$total.value), max=max(x$total.value)) })
```

[Figure: Giving Levels by Giving Frequency, Smith for Congress 2010]

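| num.contributions | total | n | min | mean | median | std | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 284043 | 4242 | 1 | 67 | 35 | 149 | 2400 |
| 2 | 212697 | 1599 | 2 | 133 | 70 | 280 | 4800 |
| 3 | 118998 | 621 | 4 | 192 | 105 | 299 | 3800 |
| 4 | 72197 | 256 | 20 | 282 | 144 | 443 | 3800 |
| 5 | 43513 | 120 | 5 | 363 | 175 | 616 | 4129 |
| 6 | 24428 | 60 | 30 | 407 | 168 | 749 | 4700 |
| 7 | 4825 | 28 | 33 | 172 | 175 | 103 | 475 |
| 8 | 3988 | 7 | 80 | 570 | 160 | 1094 | 3048 |
| 9 | 4340 | 7 | 90 | 620 | 225 | 627 | 1450 |
| 10 | 990 | 5 | 100 | 198 | 200 | 72 | 280 |
| 13 | 167 | 1 | 167 | 167 | 167 | NA | 167 |
| 14 | 675 | 1 | 675 | 675 | 675 | NA | 675 |
| 18 | 360 | 1 | 360 | 360 | 360 | NA | 360 |
| 20 | 200 | 1 | 200 | 200 | 200 | NA | 200 |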

This latest plot and table are both incredibly text heavy, but this is the critical intelligence required to start a fundraising plan.
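One way to read the table: dividing each group's mean total by its giving frequency gives a rough average size per contribution. The sketch below uses the published means for frequencies 1 through 6, typed in by hand; the variable names are illustrative.

```r
# mean column from the table above, for donors who gave 1 to 6 times
freq <- 1:6
mean.total <- c(67, 133, 192, 282, 363, 407)

# rough average size of a single contribution within each group
per.contribution <- round(mean.total / freq, 1)
per.contribution
```

The per-contribution averages stay in a fairly narrow band, roughly \$64 to \$73, even as the totals grow with frequency.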

We see the average total contribution increases with giving frequency, which makes sense. The average increases in an approximately linear fashion, at least through six contributions, which suggests the individual contribution amounts are staying constant. This may be a function of some campaign fundraising tactic, like "donate \$35 now for a free T-shirt." We can also get a sense of how much success the Smith for Congress major donor program enjoys. An individual can legally donate \$2,400 each for the primary and the general election in a cycle, or \$4,800 in total. We can count how many individuals gave \$2,400 or maxed out at \$4,800 and measure how much impact the major donors have on the total amounts raised:

```
# how many individuals gave the max for one election
nrow(cd0s[cd0s$total.value == 2400,])
# how many individuals gave the max for the entire cycle
nrow(cd0s[cd0s$total.value == 4800,])
```

We see 7 individuals who gave the maximum for one election, and only 2 individuals who maxed out for the entire cycle. The two maxed-out donors account for only 1.2% of total giving, which is very low compared with the average campaign. This tells us major donors aren't the most important segment to Smith for Congress, but it could also mean that the campaign isn't able or isn't willing to ask the max amount from large donors.
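The 1.2% figure follows from simple arithmetic on numbers already shown: two donors at \$4,800 each against the \$771,421 cycle total from the giving frequency table. A quick check, with variable names of my own choosing:

```r
max.per.cycle <- 4800   # legal maximum per individual for the 2010 cycle
maxed.donors <- 2       # donors whose total.value equals 4800
cycle.total <- 771421   # final running.total in the gft table
all.donors <- 6949

# share of dollars that came from maxed-out donors, in percent
maxed.share <- round(100 * maxed.donors * max.per.cycle / cycle.total, 1)
maxed.share  # 1.2

# share of donors who maxed out, in percent
round(100 * maxed.donors / all.donors, 3)
```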

## Take Away

We can take away the following facts from our analysis:

• About 39% of individual donors gave more than once to Smith for Congress
• At least 75% of donors gave \$100 or less to the campaign
• Repeat donors gave \$487,000 total to the campaign
• Two out of 6,949 donors (0.029 percent) gave the maximum amount allowable by law, accounting for 1.2% of the total amount raised

From all this we can infer that Smith for Congress is running a very strong repeat donor program, and isn't focused on only high-dollar donors. This information could be very useful in a number of different ways. A treasurer for Smith for Congress could use this information to design a 2012 fundraising plan and campaign budget. A candidate similar to Smith, or running in a similar district, could use this same information to plan their own campaign. Or a rival campaign could use this during opposition research and financial planning. Or researchers could use this to build better generic models of US House individual fundraising. I hope this shows that detailed campaign finance analysis is pretty simple when you've got access to the relevant data, which unfortunately is very uncommon.