open-source campaign finance analysis with ruby and fechell
Introduction
fechell is a ruby library used to extract data from electronically filed Federal Election Commission reports, saving you from the hell of parsing them yourself. These reports are filed by all political action committees, presidential and House candidates; at this time Senate rules don't require electronic disclosure. Depending on the form type and filer, a report may contain summaries of receipts and disbursements, a 48-hour notice of independent spending, a statement of candidacy, and other finance or organizational notifications. In the 9 years of available data, the format of each of these reports has changed several times making it complex to uniformly parse data across multiple years and versions. The fechell library, using the field definition files provided by Watchdog.net, allows simple pragmatic access to the full library of FEC data. To demonstrate the features of the library we will use fechell to extract all itemized donations made to the Obama for America campaign in 2007 and 2008 by individuals. We will then analyze these data by importing them into a database and visualizing contribution amounts aggregated by zip code over time. Plotting this data on a map will show us clusters of financial support from individuals, as well as how those clusters grow and change over time.
Code books & formats for electronic FEC reports
The terminology detailing electronically-filed FEC reports is sometimes confusing. A report is a plain text file, usually with an extension of 'fec.' Every report contains a header line detailing the FEC format version and the software used to file the report, with values separated by a version-dependent separator character. After the header, each line corresponds to either a form (ex: F1,F3P,F24), or a schedule attached to the form (Schedule A, Schedule I). The first column of every line tells the reader what type of form or schedule is contained on that line. Every form and schedule is detailed in a workbook named FEC_Format_v6.4.xls available from the FEC as part of their free Vendor Tools package. You will want to download this package and have the FEC_Format_v6.4.xls (format workbook) available when processing reports with fechell. The format workbook contains one worksheet for each form or schedule, and each worksheet gives a column-by-column description of each value contained in the report. The description contains a standard name, as well as sample data and implementation notes. The fechell library uses this standard name to return values from a form/schedule to the calling program. The library will determine the correct separator character and field definition to use at run-time so all these details are hidden from the user.
The use and sale of data from these reports for commercial purposes, especially identifying information of individuals giving to campaigns, is regulated by the FEC. See http://fec.gov/pubrec/publicrecordsoffice.shtml#using.
Installation
the fechell library is hosted on github and is available as a gem.
You will need to add GitHub to your gems source if you haven't already done so.
sudo gem sources -a http://gems.github.com/
Now install the fechell gem.
sudo gem install offensivepolitics-fechell
Download FEC reports
Using the FEC electronic filing search tool we can find raw reports as uploaded by candidates and committees.
In this case we're looking for committee id C00431445, as this is the primary campaign committee for Obama For America. Contributions are itemized on form F3P, "Summary of Receipts and Disbursements by an authorized committee (pres / vice pres)." Using those query parameters you will see close to a hundred filings but we're only interested in 15 of them.
The combined size of all Obama For America F3P filings is about 473 megabytes, and you can download the filings individually using the following URLs:
359390.fec- period 01/01/2007-03/31/2007, filed 08/22/2008 - APR QUARTERLY
359395.fec- period 04/01/2007-06/30/2007, filed 08/22/2008 - JUL QUARTERLY
359397.fec- period 07/01/2007-09/30/2007, filed 08/22/2008 - OCT QUARTERLY
360400.fec- period 10/01/2007-12/31/2007, filed 08/30/2008 - YEAR-END
360372.fec- period 01/01/2008-01/31/2008, filed 08/29/2008 - FEB MONTHLY
360401.fec- period 02/01/2008-02/29/2008, filed 08/30/2008 - MAR MONTHLY
362085.fec- period 03/01/2008-03/31/2008, filed 09/15/2008 - APR MONTHLY
358077.fec- period 04/01/2008-04/30/2008, filed 08/18/2008 - MAY MONTHLY
358076.fec- period 05/01/2008-05/31/2008, filed 08/18/2008 - JUN MONTHLY
406266.fec- period 06/01/2008-06/30/2008, filed 03/03/2009 - JUL MONTHLY
405793.fec- period 07/01/2008-07/31/2008, filed 02/27/2009 - AUG MONTHLY
405795.fec- period 08/01/2008-08/31/2008, filed 02/27/2009 - SEP MONTHLY
405796.fec- period 09/01/2008-09/30/2008, filed 02/27/2009 - OCT MONTHLY
405794.fec- period 10/01/2008-10/15/2008, filed 02/27/2009 - PRE-GENERAL
406271.fec- period 10/16/2008-11/24/2008, filed 03/03/2009 - POST-GENERAL
Save all these files into a directory.
F3P fields
fechell will give you access to any field in a report, but you need to know specifically which fields to look for. Using the format worksheet (FEC_Format_6.4.xls), we can easily find the correct field names and their possible values. In the previous step we downloaded 15 different filings, all type F3P "Summary of receipts and disbursements by an authorized committee (pres / vice pres)." A F3P filing contains the normal FEC header line, a summary line detailing fund raising for the current period, followed by any number of schedules. Using the Specification Requirements document (FEC_Format_6.4.pdf) that comes with the 'FEC Vendor Tools' download we can quickly determine which schedules belong to which forms. In Appendix C, page 22 of Specification Requirements we see that form F3P can contain Schedules A, B, C, C1, C2 and D. Using the format workbook as a reference for each schedule, we see that Schedule A ('Sch A') contains an itemized receipt from an individual or a committee. In our analysis, we'll look for Schedule A filings and ignore everything else.
We want to visualize contribution amounts from individuals by zip code and contribution date, so we'll extract 3 fields from each Schedule A: contribution amount, contribution date, and zip code of contributor. Schedule A can contain contributions from several types of entities, each identified by the "ENTITY TYPE" field. Looking through the "Sch A" worksheet of the format workbook we see we want column #6 ("ENTITY TYPE"), column #19 ("CONTRIBUTOR ZIP"), column #20 ("CONTRIBUTION DATE"), and column #21 ("CONTRIBUTION AMOUNT"). The names of these columns are all we'll need to start extracting data.
Extract the data
Now we're ready to load the FEC data files and extract individual contribution data.
First we need to load and initialize the FEC library
require 'rubygems'
require 'fechell'
h = FECHell.new
Next we'll use fechell to load each file we downloaded. We will pass a filename to the process function, and fechell will return the schedule type for that line and all the values keyed by their name from the FEC worksheet for the corresponding schedule.
Note: You'll need to change the Dir["./obama-fec/*.fec"] path to the directory with your downloaded FEC reports from step 2.
require 'rubygems'
require 'fechell'
h = FECHell.new
Dir["./obama-fec/*.fec"].each do |filename|
h.process(filename) do |line|
schedule = line[0]
values = line[1]
end
end
We want to process Schedule A lines (SA) , and we only want contributions from the IND (individual) entity type . We first check the schedule value returned by fechell to make sure we're processing a "Schedule A" (SA) line. If we aren't, then we move to the next schedule. To check the entity type that made the contribution, we look in the values hash with the key 'ENTITY TYPE' field we identified at the beginning of this step. Referring again to the FEC worksheet we see that possible values for column #6 ("ENTITY TYPE") are "[CAN|CCM|COM|IND|ORG|PAC|PTY]". In this case we're only interested in contributions from individuals (IND) so we'll skip processing the line if that value doesn't match.
require 'rubygems'
require 'fechell'
h = FECHell.new
Dir["./obama-fec/*.fec"].each do |filename|
h.process(filename) do |line|
schedule = line[0]
values = line[1]
next if schedule != 'SA'
next if values["ENTITY TYPE"] != "IND"
end
end
Now we know we've got a Schedule A line for an individual so we can extract the actual data we care about. Again, the field names used below were taken from the FEC worksheet for Schedule A: "CONTRIBUTION AMOUNT", "CONTRIBUTOR ZIP", and "CONTRIBUTION DATE."
require 'rubygems'
require 'fechell'
h = FECHell.new
Dir["./obama-fec/*.fec"].each do |filename|
h.process(filename) do |line|
schedule = line[0]
values = line[1]
next if schedule != 'SA'
next if values["ENTITY TYPE"] != "IND"
amount = values["CONTRIBUTION AMOUNT"].to_f
fullzip = values["CONTRIBUTOR ZIP"]
date_str = values["CONTRIBUTION DATE"]
next if fullzip.nil? == true
next if date_str.nil? == true
zip5 = fullzip[0..4]
puts "On #{date_str} we received #{amount} from zipcode #{zip5}"
end
end
Now just printing out the individual contributions is a great way to verify the library is working, but it doesn't really do a lot for us as far as analysis is concerned. Instead of just printing information we'll create a CSV file of all the individual contributions, suitable for import into a database. We next do some small cleanup and data verification, as well as adding some simple error handling since FEC data is notoriously poorly formatted or incomplete. This leaves us with the final version of our individual contribution extraction code:
require 'rubygems'
require 'fastercsv'
require 'fechell'
csv = FasterCSV.open("obama-fec-SA-2007-2008.csv","w")
csv << ["contribution_date","amount","contributor_zip"]
h = FECHell.new
Dir["./obama-fec/*.fec"].each do |filename|
h.process(filename) do |line|
schedule = line[0]
values = line[1]
next if schedule != 'SA'
next if values["ENTITY TYPE"] != "IND"
amount = values["CONTRIBUTION AMOUNT"].to_f
fullzip = values["CONTRIBUTOR ZIP"]
date_str = values["CONTRIBUTION DATE"]
next if fullzip.nil? == true || fullzip == ''
next if date_str.nil? == true || date_str == ''
date_val = Date.strptime(date_str,'%Y%m%d')
next if date_val.nil? == true
date_formatted = "#{date_val.year}-#{date_val.month.to_s.rjust(2,'0')}-#{date_val.day.to_s.rjust(2,'0')}"
zip5 = fullzip[0..4]
csv << [date_formatted,amount,zip5]
end
end
csv.close
Running this code on the Obama for America FEC files produces around 2.8 million individual itemized contributions over two years. The output file is about 67 megabytes and is ripe to be imported into a database for further analysis. Which is exactly what we do in Part 2(link).