- Phase 0: Original source URLs.
We keep the master list of the URLs for the election summary data, for every county, for
every year, as a
google sheet. Data entry volunteers are assisting us in filling it out.
It includes brief notes about the kind of format for the data.
Sometimes the format changes year over year, but usually for the better.
(We also archive the results for state-level elections, in a separate directory in the repo -- see 'StateResults'.
While they follow the same overall flow of phases as the county data, they are already 'done' and
do not need volunteer attention.)
- Phase 1: Source file.
We download and keep each original data file, in a structured name format, e.g.
Counties/wayne/81-wayne_2024-11-03.sum1.ph1.pdf.
- Phase 2: This is for "intermediate" formats, which include:
- XML. PDFs are translated to a "phase 2" XML file, using UPDF.
- XLS. Converted by phase2Xls.py to a TSV format like "57-missaukee_2021-11-02.xls.ph2.tab".
- Washtenaw. Washtenaw-specific HTML, converted by phase2Washtenaw.php to a TSV format like "81-washtenaw_2020-11-03.html.ph2.tab".
- MAN. "Manual" data entry, like 77-schoolcraft_2021-11-02.man.ph2.man.
- Phase 3: First CSV (actually TSV) format. Think of it this as the "wide mouth" of the data funnel.
We send in a variety of XML and HTML formats, one JSON format ("Enhancedvoting"), and a CSV format ("Clarity"), and they
all get converted into a standard CSV format, aka "phase 3".
This is where we need multiple volunteer programmers to handle each of the varied formats.
The phase 3 CSV output format has 7 columns, separated by tabs:
yyyy-mm-dd county# title voteFor# candidateName partyName #votes
The "title" field is free-form text, describing the office and locale.
These will get plucked apart later.
It may contain information about term length, or whether this is a "partial" term
(to fill a seat that someone left, part-way through).
The formats and conversion scripts for this "wide mouth" are shown below.
The format code is the same as the "format" column in the
master spreadsheet.
| Format |
Raw type |
Intermediate type |
Script |
| sum1 | pdf | xml | phase3ParseXmlSummary1.php |
| column | pdf | xml | (Jeff's xml parser, add to repo) |
| clarity | csv | | phase3Clarity.php |
| clarity | html | | phase3ClarityHtml.php |
| enhanced | json | | phase3ParseJson.php |
| excel | xls | tab | phase2Xls.py, translator |
| washtenaw | html | tab | translator |
| state-wide | csv | | parseStateResultsTsv (generates phase 7 tsv) |
When phase 3 processing is complete, no matter what the original data format was,
we end up with our standard phase 3 ".tsv" file, e.g. 37-isabella_2020-11-03.json.ph3.tsv.
- Phase 4: Eliminate all irrelevant races, such as Federal, state-wide, state house & senate, judicial,
or proposals.
See phase4IgnoreRaces.php.
(Note: only for county data files.
We already have that data from the state-wide files, see parseStateResultsTsv.php.)
- Phase 5: Add up the # of votes columns, so that every candidate
has the relevant totals. (Who actually won is resolved later, since
some offices cross county boundaries.)
Also, convert partyName to partyLetter.
TSV format below. For votes fields, C=candidate, D=Dem, R=Repub, O=all other, T=grand total
yyyy-mm-dd county# title voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
Usage example:
php phase5ParseVotes.php <Counties/wayne/wayne-2024-11-ph3.tsv >Counties/wayne/wayne-2024-11-ph4.tsv
- Phase 6: Extract the organization and office names, and translate them to our standard codes.
Extract the region name, if any.
Extract the term length, term cycle, and "partial term" flag (if any).
Extract the subdist code, if it exists (mostly county commissioner districts, city wards).
The tab-separated format is (split across two lines here, for visibility):
yyyy-mm-dd county# region voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
org office termlen termcycle partial subdist
Usage example:
php phase6ParseOrg.php <Counties/wayne/wayne-2024-11-ph4.tsv >Counties/wayne/wayne-2024-11-ph5.tsv
- Phase 7: Translate the region names into "district" codes.
This uses a combination of "short name" fields for various jurisdictions
(city, township, village) in our database, plus some heuristic name-matching code.
Also, where possible, fix missing 'voteFor' values by using values from 'title26' table.
The tab-separated format is (split across two lines here, for visibility):
yyyy-mm-dd county# region voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
org office termlen termcycle partial subdist district incumbent
Usage example:
php phase7Districts.php <Counties/wayne/wayne-2024-11-ph5.tsv >Counties/wayne/wayne-2024-11-ph6.tsv
- Phase 8: Add any missing term lengths.
We have multiple sources of office term lengths:
- Sometimes the raw election data contains it.
- County commissioner term lengths are set by state law (but changed in 2024).
- Some offices have known term lengths across the state, and we
already stored that info the title26 table.
- We've previously fetched all school board member term lengths
from Ballotpedia.org (and stored them in the "school_district" table,
see the script getSchoolInfoFromBallotpedia).
- We are still investigating using AI agents to find any remaining
office term lengths that are still missing from our model.
If we're missing the voteFor number, we can can assume it is 1, if the maximum
number of seats for that position (as recorded in the 'title26' table) is 1.
The tab-separated format is (split across two lines here, for visibility):
yyyy-mm-dd county# region voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
org office termlen termcycle partial subdist district incumbent
Usage example:
php phase8TermLenFinder.php <Counties/wayne/wayne-2024-11-ph6.tsv >Counties/wayne/wayne-2024-11-ph7.tsv
- Phase 9: Export to database.
The database table 'elections' is created by the script 20-elections.sql.
The phase9db.php script is run with every phase 8 file, and imports that data
into this table.
So when phase 9 is complete, we have all of the election data for all
of the years, in this one table.
Errors are written to stderr.
- Phase 10: "Vote for" cleanup.
The "vote for" (number of candidates) data, as extracted from the county elections reports, is occasionally incorrect
or missing.
The phase10clean.php script attempts to analyze those cases, and (where possible) deduce a correct value, or
at least a 'better' value.
All of the work is done directly in the database.
The script takes only one argument, the election date to be processed (e.g. "2020-11-03").
- Phase 11: Merge ("deduplicate") candidate entries that cross multiple counties.
E.g. some school boards, community college boards, even a few city councils.
This gets a little tricky, given that candidate names are sometimes reported differently,
but the code does some smart "name matching" (including "levenshtein" difference calculations)
to handle those cases.
All of the vote totals for the "duplicate" cases are added up, and inserted as new rows, marked
with the 'magic' county number 99.
The original rows for those cases are deleted.
Again, all of the work is done directly in the 'elections' table, given the election date to be processed.
- Phase 12: Determine the winners!
The phase12winners.php script now finds (all of) the winners for each election, and sets winner=1 for those
rows of the 'elections' table.
As before, it takes one argument, the election date.