MIvoter: Building our Seats & Incumbents Database from Election data

Charles Roth
Last update: 1/17/2026

I. Introduction and Purpose

This document summarizes the technical details involved in building the mivoter.org database of election data, and the derived list of incumbent officials.  As such, it is the primary documentation for the data flow, and for the software (built by Charles Roth and other volunteers) that processes that data.

One of the fundamental goals of MIvoter is to maintain a complete database of all elected officials in Michigan.  There are multiple reasons why this is important:

  1. Voter information.  We want all Michigan voters to have an easy way to identify, and contact, their specific elected officials.
  2. Voter visits.  The more reasons voters have to visit MIvoter, and the more positive experiences that they have with the site, the better for us, especially come election time.
  3. State & County party cachet.  The more complete our data, the more respect (and publicity!) we get from the state and county Democratic parties.  (And other progressive organizations, as well.)
  4. Open Election seats.  A complete database, including term length and term "cycle" (which year a given official was elected), automatically tells us which seats are open in the next election.
  5. Election tallies.  A complete database also tells us (and the state & county parties, and future candidates) which races were won by how many votes.  A simple query can tell us, for example, which races were won or lost by (say) 5% or less.
  6. Uniqueness.  Perhaps surprisingly, there is no central database of elected officials in Michigan.  None!  The Michigan Secretary of State's office doesn't maintain one.  Even the individual counties' information is very fragmented, and very "20th century" -- they produce PDFs and printed booklets of officials. 

    Perhaps ironically, even our with-the-times SoS Jocelyn Benson claims that Michigan's decentralization is one of the strengths of our election security (and she may be right).

All of this is in support of our primary goal -- to get more voters to know about, and support, Democratically-endorsed candidates (especially in "non-partisan" elections) in election years.

II. Database history

In 2024, we built our database of incumbents from a wide variety of sources.

This was a great effort, but it is not sustainable. 

III. 2026 Database -- experimenting with the future

In preparing for 2026, we experimented with many possible sources of data.  In particular, we tried multiple "AI" engines and APIs, hoping that we could do queries like: "Who are the park commissioners for Redford Township, and what are their term lengths?"

Unfortunately, the results were a bust.  All of the AIs are notorious for giving us answers -- that are wrong.  That's actually the most disturbing outcome of our experiments: the AI agent would claim to have an answer, but still be wrong.

(There are still more experiments to try: in particular, there is a way to tell the Google "Gemini" AI API to "ground itself" by also doing a plain Google search, and then combine the results.)

The good news, however, is that the "failure" of many different experiments taught us some extremely useful lessons about where and how we could find real data about elected officials.

IV. 2026 Database -- Elections data overview

So for 2026, we're building everything from the history of elections data.  There are a few key design principles in this new "2026" database:

  1. Database tables: there are three key tables:
    1. seats -- the actual offices, including the normal term length of the office, and the term "cycle" (the year an election happens for that seat).  If there are multiple positions for that office (e.g. 2 US Senators), then we have a distinct row for each position.
    2. incumbents -- the people in each office.  Each row is a person, and it points back to the "seat" row for the office.  The incumbent row also includes other data like party, and the vote totals for their most recent election.

      (Note that we can have distinct "incumbents" tables for different years, if we want to keep the history for future reference.)

    3. candidates -- come election season, the Democratically-endorsed candidates for each office.  Much like "incumbents", the candidates rows point back to the relevant office/seat in the "seats" table.
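As a purely illustrative sketch of how the three tables relate, here is a minimal SQLite version in Python.  The column names, seat rows, and incumbent row are assumptions for illustration, not our actual schema.  Note how goal #4 (open seats) falls out of term_len and term_cycle for free:

```python
import sqlite3

# Purely illustrative schema -- these column names are assumptions,
# not necessarily the real MIvoter schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE seats (
        seat_id    INTEGER PRIMARY KEY,
        office     TEXT,      -- e.g. 'US Senator'
        region     TEXT,
        position   INTEGER,   -- distinguishes multiple positions (2 US Senators)
        term_len   INTEGER,   -- normal term length, in years
        term_cycle INTEGER    -- a year in which this seat is up for election
    );
    CREATE TABLE incumbents (
        person  TEXT,
        party   TEXT,
        votes   INTEGER,      -- vote total from most recent election
        seat_id INTEGER REFERENCES seats(seat_id)
    );
    CREATE TABLE candidates (
        person  TEXT,
        seat_id INTEGER REFERENCES seats(seat_id)
    );
""")
db.execute("INSERT INTO seats VALUES (1, 'US Senator', 'Michigan', 1, 6, 2024)")
db.execute("INSERT INTO seats VALUES (2, 'US Senator', 'Michigan', 2, 6, 2026)")
db.execute("INSERT INTO incumbents VALUES ('(hypothetical person)', 'D', 123456, 2)")

# Goal #4 falls out for free: which seats are open in a given election year?
def open_seats(year):
    return db.execute(
        "SELECT office, position FROM seats"
        " WHERE (? - term_cycle) % term_len = 0", (year,)).fetchall()
```

Keeping term_len and term_cycle on the seat (rather than on the person) is what makes the open-seat and won-by-how-much queries one-liners.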
  2. Use all election results.  We build the entire current list of elected officials by stepping through the election results -- starting in 2020, and stepping year by year through the November general elections.  This ensures that we have all officials with 6-year terms (or less).  As officials "term out", they are replaced (in the incumbents table) by the newly elected people.

    (There are a handful of 8-year offices, all of which are either the state board of education, or university trustees.  But that info is readily available.)
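The year-stepping logic can be sketched as follows.  The data shapes and names here are hypothetical; the real pipeline works against the database tables, not an in-memory dict:

```python
# Hypothetical data shapes: the real pipeline reads winners from the
# 'elections' table, not from an in-memory dict.
def build_incumbents(winners_by_year):
    """winners_by_year: {year: {seat: person}}.  Walking the years in order
    means each newly elected person replaces whoever held the seat before."""
    incumbents = {}
    for year in sorted(winners_by_year):      # 2020, 2021, ... 2025
        incumbents.update(winners_by_year[year])
    return incumbents

winners = {
    2020: {"Example Twp Clerk": "A. Person"},   # made-up seat and names
    2024: {"Example Twp Clerk": "B. Person"},   # 4-year seat comes up again
}
```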

  3. Key data elements. There are 5 things we need to know (or derive) about each official.
    1. Seat.  (Type of seat, in which region, and if multiple seats, which one.)
    2. Number of positions for that kind of seat.  (E.g. US Senator has 2, mayors are typically 1, etc.)
    3. Name.  (And party, total relevant votes, etc.)
    4. Term length.
    5. Year elected ("term cycle").
  4. Key data sources. 
    1. State data.  The SoS publishes all state-level election results, at (for example) the Nov 2024 elections page.  This includes all Federal and state-wide elections, all Michigan House and Michigan Senate elections, and (critically) all judicial elections. 

      From these links, we can also download TSV files of the raw data, so we don't have to parse the HTML pages.  (Follow the "Data" link, and then "Election Data".)

    2. County data.  Each of the 83 Michigan counties is responsible for publishing the election results inside their county.  But from there, it becomes a classic good news, bad news story:
      1. The state publishes a list of links to the individual county's election results.
      2. But the different counties use a wide variety of formats. 
      3. About 20 counties use enhancedvoting.com, which publishes the results in a standard JSON format.
      4. Another 5 or so use another vendor, which provides a standard HTML format.
      5. About 10 (including Washtenaw) publish the results in their own, individual, HTML format -- which is easy to parse, but non-standard.
      6. The remaining counties publish the results as PDF.
      7. But... (bad news, good news) they all appear to have the row and column metadata embedded in the PDF.  Which means we can use Adobe Pro to export the PDF... as HTML!
      8. In the end, there are about 5 different PDF formats in use, which means 5 more HTML layouts to parse.

      But one way or another, the data is all there, and can be parsed programmatically.  We've divided up the data flow in such a way that a handful of volunteer programmers can write the parse-the-HTML code to produce a standard CSV layout.  More on that later.

    3. School districts.  These often cross county boundaries, which means we need to add up election results from multiple counties.  The good news is that ballotpedia.org has an astonishingly consistent way of reporting school board sizes and term lengths, which we can scrape.  A tool has been built (and run) that does just that; it in turn imports that data into the school_district table.

      (See getSchoolTermLengthFromBallotpedia and other scripts in the 'Schools' directory in this repo.)

    4. Community Colleges.  These often cover a few counties, so here too, we need to add up election results from multiple counties. 

      (See the scripts in the 'ComColleges' directory in this repo.)

    5. Standard number of positions.  Many (most) offices have a standard number of positions: e.g. a village has only one mayor, and a county only one executive or prosecuting attorney.  For all standard cases, this is now built into the "title26" table.

    Adding all of this up... by carefully merging the foregoing data sources together, across all 6 election years (2020-2025)... we can produce current, correct data for the seats and incumbents tables. 

    There are a few places where we'll need to "triangulate" the data to fill it out completely. E.g. we may not know the term length for the park commissioners for Redford township (to use the example from the AI section).  But if we know how many there are, and the history of who was elected, we can (programmatically) deduce the term length.
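The triangulation idea can be sketched like this, assuming the seat's election-year history contains no partial-term elections (the helper name is hypothetical; the real deduction is part of our PHP tooling):

```python
# Hypothetical helper: deduce a seat's term length from the years in which
# it appeared on the ballot.  Assumes the history has no partial-term
# elections (those would shrink the gaps and need filtering first).
def deduce_term_length(election_years):
    years = sorted(set(election_years))
    if len(years) < 2:
        return None                   # not enough history to triangulate
    gaps = {b - a for a, b in zip(years, years[1:])}
    return min(gaps)                  # smallest gap between elections
```

For example, a seat elected in both 2020 and 2024 is (almost certainly) a 4-year seat.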

V. Data formats and PHP scripts.

This section delves into the precise details of the data formats and the processing flow.  All of the software and scripts referenced in this section live in the (private!) git repository git@github.com:/wchasroth/ImportTools.git.

Glossary
Definitions of a few terms that we'll use throughout:

Phases
We've broken up the flow of the data into numbered phases.  We keep the results of each phase, using a standard/consistent file-name format that identifies the county (where relevant), election, and phase. 

Breaking up the flow into separate phases (and scripts) makes it easier to inspect what is happening in each phase, debug any problems, and (in some cases) assign particular parts to other volunteer developers.

Most of the phases are automatically handled by the master 'translator' script (in the bin directory), which uses keywords in the file name to decide which specific program needs to be run, especially for phase 3.  But this section documents what each of the individual phases does.

Volunteer note:  The primary role of volunteer software developers in this process is to build programs that translate from a phase 1 or phase 2 file to phase 3.  Do not go beyond phase 3!  All of the processing beyond phase 3 has already been built; that's the narrow part of the data 'funnel' mentioned below.

  1. Phase 0:  Original source URLs.  We keep the master list of the URLs for the election summary data, for every county, for every year, as a Google Sheet.  Data entry volunteers are assisting us in filling it out.  It includes brief notes about the kind of format for the data.  Sometimes the format changes year over year, but usually for the better.

    (We also archive the results for state-level elections, in a separate directory in the repo -- see 'StateResults'.  While they follow the same overall flow of phases as the county data, they are already 'done' and do not need volunteer attention.)

  2. Phase 1:  Source file.  We download and keep each original data file, in a structured name format, e.g. Counties/wayne/82-wayne_2024-11-03.sum1.ph1.pdf.
  3. Phase 2:  This is for "intermediate" formats, which include:
    1. XML.  PDFs are translated to a "phase 2" XML file, using UPDF.
    2. XLS.  Converted by phase2Xls.py to a TSV format like "57-missaukee_2021-11-02.xls.ph2.tab".
    3. Washtenaw.  Washtenaw-specific HTML, converted by phase2Washtenaw.php to a TSV format like "81-washtenaw_2020-11-03.html.ph2.tab".
    4. MAN.  "Manual" data entry, like 77-schoolcraft_2021-11-02.man.ph2.man.
  4. Phase 3: First CSV (actually TSV) format.  Think of this as the "wide mouth" of the data funnel.  We send in a variety of XML and HTML formats, one JSON format ("Enhancedvoting"), and a CSV format ("Clarity"), and they all get converted into a standard CSV format, aka "phase 3".  This is where we need multiple volunteer programmers to handle each of the varied formats.

    The phase 3 output format has 7 columns, separated by tabs:

       yyyy-mm-dd   county#   title   voteFor#   candidateName   partyName   #votes
    
    The "title" field is free-form text, describing the office and locale.  These will get plucked apart later.  It may contain information about term length, or whether this is a "partial" term (to fill a seat that someone left, part-way through).

    The formats and conversion scripts for this "wide mouth" are shown below.  The format code is the same as the "format" column in the master spreadsheet.

     Format      Raw type   Intermediate type   Script
     sum1        pdf        xml                 phase3ParseXmlSummary1.php
     column      pdf        xml                 (Jeff's xml parser, add to repo)
     clarity     csv        -                   phase3Clarity.php
     clarity     html       -                   phase3ClarityHtml.php
     enhanced    json       -                   phase3ParseJson.php
     excel       xls        tab                 phase2Xls.py, translator
     washtenaw   html       tab                 translator
     state-wide  csv        -                   parseStateResultsTsv (generates phase 7 tsv)

    When phase 3 processing is complete, no matter what the original data format was, we end up with our standard phase 3 ".tsv" file, e.g. 37-isabella_2020-11-03.json.ph3.tsv.
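Whatever the input format, each phase 3 parser ends the same way: by emitting the standard 7-column tab-separated rows.  A sketch in Python (the helper and field names here are hypothetical; our actual parsers are mostly PHP):

```python
import csv
import io

# Hypothetical helper: every phase 3 parser, whatever its input, ends by
# emitting rows in the standard 7-column tab-separated layout.
PH3_COLUMNS = ["date", "county", "title", "voteFor", "candidate", "party", "votes"]

def write_ph3_rows(rows, out):
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[col] for col in PH3_COLUMNS])

buf = io.StringIO()
write_ph3_rows([{
    "date": "2020-11-03", "county": "37",
    "title": "Township Park Commissioner, 4 Year Term",   # free-form title
    "voteFor": "3", "candidate": "(hypothetical name)",
    "party": "Nonpartisan", "votes": "1234",
}], buf)
```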

  5. Phase 4: Eliminate all irrelevant races, such as Federal, state-wide, state house & senate, judicial, or proposals.  See phase4IgnoreRaces.php.

    (Note: only for county data files.  We already have that data from the state-wide files, see parseStateResultsTsv.php.)

  6. Phase 5: Add up the # of votes columns, so that every candidate has the relevant totals.  (Who actually won is resolved later, since some offices cross county boundaries.)

    Also, convert partyName to partyLetter.

    TSV format below.  For the votes fields: C=candidate, D=Dem, R=Repub, O=all other, T=grand total.

       yyyy-mm-dd  county#   title   voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
    
    Usage example:
       php phase5ParseVotes.php <Counties/wayne/wayne-2024-11-ph3.tsv >Counties/wayne/wayne-2024-11-ph4.tsv
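The vote-summing and party-letter conversion can be sketched as follows.  This is a simplified stand-in for phase5ParseVotes.php (the real script recognizes many more party-name spellings, and the names and counts below are made up):

```python
# Simplified stand-in for phase5ParseVotes.php; the real script recognizes
# many more party-name spellings than these two.
PARTY_LETTER = {"Democratic": "D", "Republican": "R"}   # everything else -> "O"

def tally(rows):
    """rows: (candidateName, partyName, votes) tuples for one race."""
    totals = {"D": 0, "R": 0, "O": 0, "T": 0}
    for _candidate, party, votes in rows:
        totals[PARTY_LETTER.get(party, "O")] += votes
        totals["T"] += votes                   # grand total across all parties
    return totals

race = [("Ms. One", "Democratic", 900),        # made-up names and counts
        ("Mr. Two", "Republican", 800),
        ("Mx. Three", "No Party Affiliation", 50)]
```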
    
  7. Phase 6:  Extract the organization and office names, and translate them to our standard codes.  Extract the region name, if any.  Extract the term length, term cycle, and "partial term" flag (if any).  Extract the subdist code, if it exists (mostly county commissioner districts, city wards).

    The tab-separated format is (split across two lines here, for visibility):

       yyyy-mm-dd  county#   region  voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
       org   office   termlen  termcycle  partial  subdist
    
    Usage example:
       php phase6ParseOrg.php <Counties/wayne/wayne-2024-11-ph4.tsv >Counties/wayne/wayne-2024-11-ph5.tsv
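The term-length and partial-term extraction from the free-form title can be sketched with a couple of regular expressions.  This is a simplified stand-in for phase6ParseOrg.php, which handles many more title patterns; the example titles are made up:

```python
import re

# Simplified stand-in for the phase 6 title parsing.  Real titles vary
# widely; phase6ParseOrg.php handles many more patterns than this.
def parse_title(title):
    m = re.search(r"(\d+)\s*[- ]\s*year\s+term", title, re.IGNORECASE)
    termlen = int(m.group(1)) if m else None
    partial = bool(re.search(r"partial\s+term", title, re.IGNORECASE))
    return termlen, partial
```

For example, a title like "Member of the Library Board, 6 Year Term" yields a 6-year term, while "Township Trustee, Partial Term Ending 11/20/2024" sets the partial flag.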
    
  8. Phase 7:  Translate the region names into "district" codes.  This uses a combination of "short name" fields for various jurisdictions (city, township, village) in our database, plus some heuristic name-matching code.

    Also, where possible, fix missing 'voteFor' values by using values from the 'title26' table.

    The tab-separated format is (split across two lines here, for visibility):

       yyyy-mm-dd  county#   region  voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
       org   office   termlen  termcycle  partial  subdist  district  incumbent
    
    Usage example:
       php phase7Districts.php <Counties/wayne/wayne-2024-11-ph5.tsv >Counties/wayne/wayne-2024-11-ph6.tsv
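The heuristic name-matching idea can be sketched like this.  The short names and district codes below are made up for illustration; phase7Districts.php consults the actual "short name" fields in our database and applies more normalization rules:

```python
from difflib import get_close_matches

# Sketch of the heuristic region matching, with made-up short names and
# district codes; the real code uses the short-name fields in our database.
SHORT_NAMES = {                      # hypothetical region -> district code
    "ann arbor":     "CityAnnArbor",
    "ypsilanti twp": "TwpYpsilanti",
}

def region_to_district(region):
    key = (region.strip().lower()
           .replace("charter township", "twp")
           .replace("township", "twp")
           .replace("city of ", ""))
    if key in SHORT_NAMES:
        return SHORT_NAMES[key]
    # Fall back to fuzzy matching for misspellings and odd formatting.
    close = get_close_matches(key, SHORT_NAMES.keys(), n=1, cutoff=0.8)
    return SHORT_NAMES[close[0]] if close else None
```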
    
  9. Phase 8:  Add any missing term lengths.  We have multiple sources of office term lengths:
    1. Sometimes the raw election data contains it.
    2. County commissioner term lengths are set by state law (but changed in 2024).
    3. Some offices have known term lengths across the state, and we already stored that info in the title26 table.
    4. We've previously fetched all school board member term lengths from Ballotpedia.org (and stored them in the "school_district" table, see the script getSchoolInfoFromBallotpedia).
    5. We are still investigating using AI agents to find any remaining office term lengths that are still missing from our model.

    If we're missing the voteFor number, we can assume it is 1 when the maximum number of seats for that position (as recorded in the 'title26' table) is 1.

    The tab-separated format is (split across two lines here, for visibility):

       yyyy-mm-dd  county#   region  voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
       org   office   termlen  termcycle  partial  subdist  district  incumbent
    
    Usage example:
       php phase8TermLenFinder.php <Counties/wayne/wayne-2024-11-ph6.tsv >Counties/wayne/wayne-2024-11-ph7.tsv
    
  10. Phase 9:  Export to database.
    The database table 'elections' is created by the script 20-elections.sql.  The phase9db.php script is run with every phase 8 file, and imports that data into this table.  So when phase 9 is complete, we have all of the election data for all of the years, in this one table.

    Errors are written to stderr.

  11. Phase 10:  "Vote for" cleanup.
    The "vote for" (number of candidates) data, as extracted from the county elections reports, is occasionally incorrect or missing.  The phase10clean.php script attempts to analyze those cases, and (where possible) deduce a correct value, or at least a 'better' value.  All of the work is done directly in the database.  The script takes only one argument, the election date to be processed (e.g. "2020-11-03").
  12. Phase 11:  Merge ("deduplicate") candidate entries that cross multiple counties.  E.g. some school boards, community college boards, even a few city councils.  This gets a little tricky, given that candidate names are sometimes reported differently, but the code does some smart "name matching" (including "levenshtein" difference calculations) to handle those cases.

    All of the vote totals for the "duplicate" cases are added up, and inserted as new rows, marked with the 'magic' county number 99.  The original rows for those cases are deleted.

    Again, all of the work is done directly in the 'elections' table, given the election date to be processed.
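The core of the name matching is a Levenshtein edit-distance comparison, sketched here in Python (phase 11 itself is PHP, and applies more normalization rules than this; the names and threshold are illustrative):

```python
# Sketch of the cross-county name matching: a standard Levenshtein edit
# distance, plus a tolerant comparison.  The max_edits threshold here is
# illustrative; the real PHP code has more rules.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def same_candidate(name1, name2, max_edits=2):
    n1, n2 = name1.strip().lower(), name2.strip().lower()
    return levenshtein(n1, n2) <= max_edits
```

This catches the common case where one county reports "Jane Q. Public" and another reports "Jane Q Public".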

  13. Phase 12:  Determine the winners!
    The phase12winners.php script finds all of the winners for each election, and sets winner=1 for those rows of the 'elections' table.

    As before, it takes one argument, the election date.
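The winner logic itself is simple: within each race, the top 'voteFor' vote-getters win.  A sketch with hypothetical row shapes (the real script works directly on the 'elections' table, per race):

```python
# Hypothetical row shape: (candidateName, votes_T).  The real script marks
# winner=1 directly in the 'elections' table, per race.
def mark_winners(rows, vote_for):
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    winners = {cand for cand, _votes in ranked[:vote_for]}
    # Ties at the cutoff would need extra care; ignored in this sketch.
    return [(cand, votes, 1 if cand in winners else 0) for cand, votes in rows]

race = [("A", 500), ("B", 900), ("C", 700)]    # made-up data
```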

Summary note:  All of the phase 1 and phase 2 files are archived and backed up at regular intervals.  All of the remaining phases (i.e. starting with phase 3 tsv file generation) can be re-run and re-generated at any time. 

This is by design; on occasion we will make manual corrections to the phase 1 or phase 2 files, and so this way any such fixes can be fully propagated through the remaining phases and into the database.  (The entire process usually takes a small number of minutes.)

VI. TBA