- Phase 0: Original sources.
We keep the master list of the URLs for the election summary data, for every county, for
every year, as a
google sheet. Volunteers are assisting us in filling it out.
It includes brief notes about the kind of format for the data.
Sometimes the format changes year over year, but usually for the better.
- Phase 1: Source file.
We download and keep each original data file, in a structured name format, e.g.
Counties/wayne/wayne-2024-11-ph1.pdf.
- Phase 2: XML. PDFs are translated to a "phase 2" XML file, using UPDF.
- Phase 3: First CSV (actually TSV) format. Think of it this as the "wide mouth" of the data funnel.
We send in a variety of XML and HTML formats, one JSON format ("Enhancedvoting"), and a CSV format ("Clarity"), and they
all get converted into a standard CSV format, aka "phase 3".
This is where we need multiple volunteer programmers to handle each of the varied formats.
The phase 3 CSV output format has 7 columns, separated by tabs:
yyyy-mm-dd county# title voteFor# candidateName partyName #votes
The "title" field is free-form text, describing the office and locale.
These will get plucked apart later.
It may contain information about term length, or whether this is a "partial" term
(to fill a seat that someone left, part-way through).
The conversion scripts for this "wide mouth" are:
- Summary1 pdf (Many counties.)
php phase3ParsePdfSummary1.php yyyy-mm-dd county# phase2filename
So, using our standard filename format, typical usage would look like:
php phase3ParsePdfSummary.php 2024-11-05 82 Counties/wayne/wayne-2024-11-ph2.xml \
>Counties/wayne/wayne-2024-11-ph3.tsv
- Washtenaw HTML
php phase3Washtenaw.php 2024-11-05 81 Counties/washtenaw/washtenaw-2024-11-ph2.html \
>Counties/washtenaw/washtenaw-2024-11-ph3.tsv
- (There will be several others, to match the different phase 2 input formats.)
- Phase 4: Add up the # of votes columns, so that every candidate
has the relevant totals. (Who actually won is resolved later, since
some offices cross county boundaries.)
Also, convert partyName to partyLetter.
TSV format below. For votes fields, C=candidate, D=Dem, R=Repub, O=all other, T=grand total
yyyy-mm-dd county# title voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
Usage example:
php phase4ParseVotes.php Counties/wayne/wayne-2024-11-ph3.tsv >Counties/wayne/wayne-2024-11-ph4.tsv
- Phase 5: Extract the organization and office names, and translate them to our standard codes.
Extract the region name, if any.
Extract the term length, term cycle, and "partial term" flag (if any).
Extract the subdist code, if it exists (mostly county commissioner districts, city wards).
The tab-separated format is (split across two lines here, for visibility):
yyyy-mm-dd county# region voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
org office termlen termcycle partial subdist
Usage example:
php phase5ParseOrg.php Counties/wayne/wayne-2024-11-ph4.tsv >Counties/wayne/wayne-2024-11-ph5.tsv
- Phase 6: Translate the region names into "district" codes.
This uses a combination of "short name" fields for various jurisdictions
(city, township, village) in our database, plus some heuristic name-matching code.
The tab-separated format is (split across two lines here, for visibility):
yyyy-mm-dd county# region voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
org office termlen termcycle partial subdist district
Usage example:
php phase6Districts.php Counties/wayne/wayne-2024-11-ph5.tsv >Counties/wayne/wayne-2024-11-ph6.tsv
- Phase 7: Add any missing term lengths, and (where possible) missing 'voteFor' values.
We have multiple sources of office term lengths:
- Sometimes the raw election data contains it.
- County commissioner term lengths are set by state law (but changed in 2024).
- Some offices have known term lengths across the state, and we
already stored that info the title26 table.
- We've previously fetched all school board member term lengths
from Ballotpedia.org (and stored them in the "school_district" table,
see the script getSchoolInfoFromBallotpedia).
- We are still investigating using AI agents to find any remaining
office term lengths that are still missing from our model.
If we're missing the voteFor number, we can can assume it is 1, if the maximum
number of seats for that position (as recorded in the 'title26' table) is 1.
The tab-separated format is (split across two lines here, for visibility):
yyyy-mm-dd county# region voteFor# candidateName partyLetter votes_C votes_D votes_R votes_O votes_T
org office termlen termcycle partial subdist district
Usage example:
php phase7TermLenFinder.php Counties/wayne/wayne-2024-11-ph6.tsv >Counties/wayne/wayne-2024-11-ph7.tsv
- Phase 8: Export to database.
We accumulate all of the data from phase 7, across all years (2020-2025) and counties,
in the "elections" table.
(See 20-election.sql.)
Usage example:
php phase8db.php Counties/wayne/wayne-2024-11-ph7.tsv
A subsequent step, not part of the county-by-county phases, merges this data into the seat26
and incumbent26 tables.