MIvoter: Building our Seats & Incumbents Database from Election data

Charles Roth
Last update: 12/12/2025

I. Introduction and Purpose

One of the fundamental goals of MIvoter is to maintain a complete database of all elected officials in Michigan.  There are multiple reasons why this is important:

  1. Voter information.  We want all Michigan voters to have an easy way to identify, and contact, their specific elected officials.
  2. Voter visits.  The more reasons voters have to visit MIvoter, and the more positive experiences that they have with the site, the better for us, especially come election time.
  3. State & County party cachet.  The more complete our data, the more respect (and publicity!) we get from the state and county Democratic parties.  (And other progressive organizations, as well.)
  4. Open Election seats.  A complete database, including term length and term "cycle" (which year a given official was elected), automatically tells us which seats are open in the next election.
  5. Election tallies.  A complete database also tells us (and the state & county parties, and future candidates) which races were won by how many votes.  A simple query can tell us, for example, which races were won or lost by (say) 5% or less.
  6. Uniqueness.  Perhaps surprisingly, there is no central database of elected officials in Michigan.  None!  The Michigan Secretary of State's office doesn't maintain one.  Even the individual counties' information is very fragmented, and very "20th century" -- they produce PDFs and printed booklets of officials. 

    Perhaps ironically, even our with-the-times SoS Jocelyn Benson claims that Michigan's decentralization is one of the strengths of our election security (and she may be right).

All of this is in support of our primary goal -- to get more voters to know about, and support, Democratically-endorsed candidates (especially in "non-partisan" elections) in election years.

II. Database history

In 2024, we built our database of incumbents from a wide variety of sources:

This was a great effort, but it is not sustainable. 

III. 2026 Database -- experimenting with the future

In preparing for 2026, we experimented with many possible sources of data.  In particular, we tried multiple "AI" engines and APIs, hoping that we could do queries like:

Unfortunately, the results were a bust.  All of the AI's are notorious at giving us answers -- that are wrong.  That's actually the most disturbing outcome from our experiments: the AI agent would claim to have an answer, but still be wrong.

(There are still more experiments to try: in particular, there is a way to tell the Google "Gemini" AI API to "ground itself" by also doing a plain Google search, and then combine the results.)

But the good news, however, is that the "failure" of many different experiments taught us some extremely useful lessons about where and how we could find real data about elected officials.

IV. 2026 Database -- Elections data overview

So for 2026, we're building everything from the history of elections data.  There are a few key design principles in this new "2026" database:

  1. Database tables: there are three key tables:
    1. seats -- the actual offices, including the normal term length of the office, and the term "cycle" (the year an election happens for that seat).  If there are multiple positions for that office (e.g. 2 US Senators), then we have a distinct row for each position.
    2. incumbents -- the people in each office.  Each row is a person, and it points back to the "seat" row for the office.  The incumbent row also includes other data like party, and the vote totals for their most recent election.

      (Note that we can have distinct "incumbents" tables for different years, if we want to keep the history for future reference.)

    3. candidates -- come election season, the Democratically-endorsed candidates for each office.  Much like "incumbents", the candidates rows point back to the relevant office/seat in the "seats" table.
  2. Use all election results.  We build the entire current list of elected officials, by stepping thru the election results -- starting in 2020, and stepping year by year through the November general elections.  This ensures that we have all officials with 6-year terms (or less).  As officials "term out", they are replaced (in the incumbents table) by the newly elected people.

    (There are a handful of 8-year offices, all of which are either the state board of education, or university trustees.  But that info is readily available.)

  3. Key data elements. There are 5 things we need to know (or derive) about each official.
    1. Seat.  (Type of seat, in which region, and if multiple seats, which one.)
    2. Number of positions for that kind of seat.  (E.g. US Senator has 2, mayors are typically 1, etc.)
    3. Name.  (And party, total relevant votes, etc.)
    4. Term length.
    5. Year elected ("term cycle")
  4. Key data sources. 
    1. State data.  The SoS publishes an (easily parseable) HTML page with all state-level election results.  This includes all Federal and State-wide elections, all Michigan House and Michigan Senate elections, and (critcally) all judicial elections.  See, for example, the Nov 2024 elections page.
    2. County data.  Each of the 83 Michigan counties is responsible for publishing the election results inside their county.  But from there, it becomes a classic good news, bad news story:
      1. The state publishes a list of links to the individual county's election results.
      2. But the different counties use a wide variety of formats. 
      3. About 20 counties use enhancedvoting.com, which publishes the results in a standard JSON format.
      4. Another 5 or so use another vendor, which provides a standard HTML format.
      5. About 10 (including Washtenaw) publish the results in their own, individual, HTML format -- which is easy to parse, but non-standard.
      6. The remaining counties publish the results as PDF.
      7. But... (bad news, good news) they all appear to have the row and column metadata embedded in the PDF.  Which means we can use Adobe Pro to export the PDF... as HTML!
      8. In the end, there are about 5 different PDF formats in use, which means 5 more HTML layouts to parse.

      But one way or another, the data is all there, and can be parsed programmatically.  We've divided up the data flow in such a way, that a handful of volunteer programmers can write the parse-the-HTML code to produce a standard CSV layout.  More on that later.

    3. School districts.  These often cross county boundaries, which means we need to add up election results from multiple counties.  The good news is that ballotpedia.org has an astonishingly consistent way of reporting the school board sizes and term lengths, that we can scrape.  A tool has been built (and run) that does just that.  It in turn imports that data into the school_district table.
    4. Standard number of positions.  Many (most) offices have a standard number of positions: e.g. a village only has one mayor, a county only has one executive, or prosecuting attorney.  For all standard cases, this is now built-in to the "title26" table.

    Adding all of this up... by carefully merging the foregoing data sources together, across all 6 election years (2020-2025)... we can produce current, correct, data for the seats and incumbents tables. 

    There are a few places where we'll need to "triangulate" the data to fill it out completely. E.g. we may not know the term length for the park commissioners for Redford township (to use the example from the AI section).  But if we know how many there are, and the history of who was elected, we can (programmatically) deduce the term length.

V. Data formats and PHP scripts.

This section delves into precise details of the data formats, and the processing flow.  All of the software and scripts referenced in this section, live in the (private!) git repository git@github.com:/wchasroth/ImportTools.git.

Glossary
Definitions of a few terms that we'll use through-out:

Phases
We've broken up the flow of the data into 8 numbered phases.  We keep the results of each phase, using a standard/consistent file name format, that identifies county, election, and phase. 

Breaking up the flow into separate phases (and scripts) makes it easier to inspect what is happening in each phase, debug any problems, and (in some cases) assign particular parts to other volunteer developers.

  1. Phase 0:  Original sources.  We keep the master list of the URLs for the election summary data, for every county, for every year, as a google sheet.  Volunteers are assisting us in filling it out.  It includes brief notes about the kind of format for the data.  Sometimes the format changes year over year, but usually for the better.
  2. Phase 1:  Source file.  We download and keep each original data file, in a structured name format, e.g. Counties/wayne/wayne-2024-11-ph1.pdf.
  3. Phase 2:  XML.  PDFs are translated to a "phase 2" XML file, using UPDF.
  4. Phase 3: First CSV (actually TSV) format.  Think of it this as the "wide mouth" of the data funnel.  We send in a variety of XML and HTML formats, one JSON format ("Enhancedvoting"), and a CSV format ("Clarity"), and they all get converted into a standard CSV format, aka "phase 3".  This is where we need multiple volunteer programmers to handle each of the varied formats.

    The phase 3 CSV output format has 7 columns, separated by tabs:

       yyyy-mm-dd   county#   title   voteFor#   candidateName   partyName   #votes
    
    The "title" field is free-form text, describing the office and locale.  These will get plucked apart later.  It may contain information about term length, or whether this is a "partial" term (to fill a seat that someone left, part-way through).

    The conversion scripts for this "wide mouth" are:

    1. Summary1 pdf  (Many counties.)
         php phase3ParsePdfSummary1.php yyyy-mm-dd county# phase2filename
      
      So, using our standard filename format, typical usage would look like:
         php phase3ParsePdfSummary.php 2024-11-05 82 Counties/wayne/wayne-2024-11-ph2.xml \
                                                    >Counties/wayne/wayne-2024-11-ph3.tsv
      
    2. Washtenaw HTML
         php phase3Washtenaw.php 2024-11-05 81 Counties/washtenaw/washtenaw-2024-11-ph2.html \
                                              >Counties/washtenaw/washtenaw-2024-11-ph3.tsv
      
    3. (There will be several others, to match the different phase 2 input formats.)

  5. Phase 4: Add up the # of votes columns, so that every candidate has the relevant totals.  (Who actually won is resolved later, since some offices cross county boundaries.)

    Also, convert partyName to partyLetter.

    TSV format below.  For votes fields, C=candidate, D=Dem, R=Repub, O=all other, T=grand total

       yyyy-mm-dd  county#   title   voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
    
    Usage example:
       php phase4ParseVotes.php Counties/wayne/wayne-2024-11-ph3.tsv >Counties/wayne/wayne-2024-11-ph4.tsv
    
  6. Phase 5:  Extract the organization and office names, and translate them to our standard codes.  Extract the region name, if any.  Extract the term length, term cycle, and "partial term" flag (if any).  Extract the subdist code, if it exists (mostly county commissioner districts, city wards).

    The tab-separated format is (split across two lines here, for visibility):

       yyyy-mm-dd  county#   region  voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
       org   office   termlen  termcycle  partial  subdist
    
    Usage example:
       php phase5ParseOrg.php Counties/wayne/wayne-2024-11-ph4.tsv >Counties/wayne/wayne-2024-11-ph5.tsv
    
  7. Phase 6:  Translate the region names into "district" codes.  This uses a combination of "short name" fields for various jurisdictions (city, township, village) in our database, plus some heuristic name-matching code.

    The tab-separated format is (split across two lines here, for visibility):

       yyyy-mm-dd  county#   region  voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
       org   office   termlen  termcycle  partial  subdist  district
    
    Usage example:
       php phase6Districts.php Counties/wayne/wayne-2024-11-ph5.tsv >Counties/wayne/wayne-2024-11-ph6.tsv
    
  8. Phase 7:  Add any missing term lengths, and (where possible) missing 'voteFor' values.  We have multiple sources of office term lengths:
    1. Sometimes the raw election data contains it.
    2. County commissioner term lengths are set by state law (but changed in 2024).
    3. Some offices have known term lengths across the state, and we already stored that info the title26 table.
    4. We've previously fetched all school board member term lengths from Ballotpedia.org (and stored them in the "school_district" table, see the script getSchoolInfoFromBallotpedia).
    5. We are still investigating using AI agents to find any remaining office term lengths that are still missing from our model.

    If we're missing the voteFor number, we can can assume it is 1, if the maximum number of seats for that position (as recorded in the 'title26' table) is 1.

    The tab-separated format is (split across two lines here, for visibility):

       yyyy-mm-dd  county#   region  voteFor#   candidateName   partyLetter  votes_C  votes_D  votes_R  votes_O  votes_T
       org   office   termlen  termcycle  partial  subdist  district
    
    Usage example:
       php phase7TermLenFinder.php Counties/wayne/wayne-2024-11-ph6.tsv >Counties/wayne/wayne-2024-11-ph7.tsv
    
  9. Phase 8:  Export to database. We accumulate all of the data from phase 7, across all years (2020-2025) and counties, in the "elections" table.  (See 20-election.sql.)

    Usage example:

       php phase8db.php Counties/wayne/wayne-2024-11-ph7.tsv
    

    A subsequent step, not part of the county-by-county phases, merges this data into the seat26 and incumbent26 tables.

VI. TBA