Can AI Predict the Next US President? A Comprehensive Example for Building an ML Pipeline to Solve Real-World Problems
Donald Trump assassination attempt at an open-air campaign rally near Butler, Pennsylvania, on July 13, 2024.



Data Science Applied to the 2024 US Elections: A Study of the Swing States Factor in Trump's Victory

Introduction

The 2024 United States presidential election was one of the closest in recent history, leaving even the most experienced political specialists struggling to predict the winner (1). Such uncertainty results from rapid shifts in voter sentiment influenced by countless factors. This is what sparked my curiosity and made me wonder: could AI/Machine Learning succeed where human expertise struggled?

Machine Learning is a branch of Artificial Intelligence that focuses on creating statistical algorithms capable of learning and improving automatically from experience and data. These algorithms analyze datasets to build models that can perform tasks like classifying images, predicting prices, and, in our case, predicting election outcomes (2).

This article details how I leveraged ML to tackle this real-world use case. From choosing and processing the source data, selecting the optimal predictive models and interpreting results, you will see how ML can give us insights into potential outcomes of the most complex events in modern democracy.

1. Understanding the US Presidential Election

Before diving into the implementation of the ML pipeline, it’s essential to first understand how the United States presidential election system works.

The US presidential election system relies on the Electoral College rather than a direct popular vote to determine the winner. There are 538 total electoral votes, allocated to each state roughly in proportion to its population. To win the presidency, a candidate must secure a simple majority of at least 270 electoral votes. In most states, the candidate who wins the majority of the popular vote gets all the electoral votes for that state, although a few states split their electors rather than awarding them winner-take-all (3). You can find on this map all the details about the electoral votes for each state.
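To make the winner-take-all rule concrete, here is a minimal sketch of how electoral votes are tallied. The three state tallies are illustrative (with their real 2024 electoral vote counts), not a full 50-state map, and the candidate names are placeholders.

```python
# Minimal sketch of the winner-take-all Electoral College tally.
TOTAL_ELECTORAL_VOTES = 538
MAJORITY = TOTAL_ELECTORAL_VOTES // 2 + 1  # 270 votes needed to win

def tally(state_winners, electoral_votes):
    """Sum electoral votes per candidate under winner-take-all."""
    totals = {}
    for state, winner in state_winners.items():
        totals[winner] = totals.get(winner, 0) + electoral_votes[state]
    return totals

# Illustrative subset of states with their 2024 electoral vote counts.
electoral_votes = {"Pennsylvania": 19, "Texas": 40, "California": 54}
state_winners = {"Pennsylvania": "A", "Texas": "B", "California": "A"}
print(tally(state_winners, electoral_votes))  # {'A': 73, 'B': 40}
```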

Another particularity in the US elections, is the historical voting pattern that categorizes the US states into three types:

  • Blue states: states where voters traditionally vote for the Democratic Party.
  • Red states: states where voters vote predominantly for the Republican Party.
  • Swing states: states that don't consistently vote for one party. The vote fluctuates unpredictably between the Democratic and Republican candidates, very often with narrow voting margins.

Grasping this complex electoral system is crucial for understanding some of the key architectural choices taken to design the ML pipeline.

2. ML Pipeline Architecture and Technical Stack

The ML pipeline is implemented mainly using Kedro, an open-source Python framework for creating modular data science workflows that follow software engineering best practices.

Leveraging Kedro allowed me to structure the ML pipeline into three distinct stages: Data Processing, Machine Learning and Reporting. You can think of these stages as a sequence of sub-pipelines, each composed of several interconnected nodes. Each node processes the output data of the previous one as its input, by triggering a Python function that encapsulates the logic for a particular task. Here are the details of each of the three stages:

  • Data Processing:

This stage serves as the entry point of the pipeline. It focuses on cleaning and preparing the raw data to align with our ML prediction objectives and scope. In addition, we perform feature engineering, which helps the machine learning models perform better and thus enhances the predictions.

  • Machine Learning:

This is the core stage of the pipeline, where we execute the main tasks involved in machine learning. The strategy and approach for each task will be elaborated in later sections of this article. Here is the list of these tasks:

  1. Splitting the dataset into training and testing subsets.
  2. Training ML models on the training dataset to learn patterns within the data.
  3. Evaluating the performance of the trained models using the test dataset.
  4. Selecting the optimal models based on the evaluation results.
  5. Predicting the election results using the selected models.

  • Reporting:

This final stage of the pipeline focuses on interpreting the prediction results by visualizing them geographically on a map and summarizing them with different charts.
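The node-chaining idea behind these three stages can be illustrated with plain Python. This is a simplified conceptual sketch, not actual Kedro code: each "node" is a function whose output feeds the next, and the stand-in node bodies are placeholders for the real cleaning, training, and reporting logic.

```python
# Simplified illustration of the node-chaining idea behind the pipeline.
def clean_data(raw):
    # Data Processing: drop empty records (stand-in for real cleaning).
    return [r for r in raw if r is not None]

def train_and_predict(rows):
    # Machine Learning: trivial stand-in returning an average "prediction".
    return sum(rows) / len(rows)

def report(prediction):
    # Reporting: format the result for display.
    return f"predicted vote share: {prediction:.1f}%"

def run_pipeline(raw, nodes):
    data = raw
    for node in nodes:
        data = node(data)  # each node consumes the previous node's output
    return data

print(run_pipeline([48.2, None, 49.1, 47.6],
                   [clean_data, train_and_predict, report]))
```

In the real project, Kedro wires equivalent functions together declaratively and tracks their inputs and outputs, which is what enables the data lineage diagram mentioned below.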

Another key benefit of using Kedro is its support for detailed data lineage. With the help of the Kedro-Viz package, I can provide you with an overview of the entire ML pipeline displayed as a data lineage diagram, which you can find in the appendix at the very end of this article.

Diving deeper into the pipeline's nodes, I utilized Python and its rich data science libraries to implement each node's function, including:

  • Pandas for dataset manipulation.
  • NumPy for mathematical operations.
  • scikit-learn for machine learning related functions and algorithms.
  • XGBoost and other specialized libraries for advanced ML models implementation.
  • matplotlib for data source exploration and visualizations, which I also combine with GeoPandas to plot the prediction results.

Last but not least, I used Streamlit to easily share with you the pipeline’s prediction results. You can see these results by following this link. The application is deployed on Hugging Face Spaces, where I also shared the complete source code. This allows you to easily clone the repository using Git and experiment with it to satisfy your personal curiosity or explore potential improvements.

3. Data Acquisition, Exploration and Preparation

One of the most critical steps in training any machine learning model is the choice of data. As straightforward as it may sound, a model’s performance is only as good as the quality and relevance of the data it learns from.

Numerous factors influence the outcomes of US elections, ranging from voter demographics and economic indicators to voting sentiment and trends in social and mainstream media. To consolidate these factors, I considered two main aspects when choosing the data sources to train my models:

  • Polling data, which represents voter responses to events, policies and economic condition, providing the models with a real-time snapshot of voter preferences and sentiment.
  • Past election results, to enable the models to learn previous voting patterns and election dynamics, and hence identify trends in voter behavior that inform predictions of future outcomes.

The polling data is sourced from FiveThirtyEight, a platform that provides aggregated polling information as an average poll percentage for each candidate across various states, updated almost daily during the eight months leading up to the election. This data is publicly available on their GitHub repository. For this project, I used the 2024 polling data as the prediction dataset and historical polling data from 2000 to 2020 as training and testing datasets for the machine learning models. To enhance the quality of the training data, I enriched it with actual state-by-state voting results, which serve as the target variable for the models. These voting results were sourced directly from the Federal Election Commission of the USA.

The data dictionary below provides further details about the data included in this study:

Source Data Dictionary

To spot any potential anomalies and understand the relationships between the different features (columns) in the datasets, I conducted an exploratory data analysis (EDA), and applied accordingly some preprocessing to enhance the quality of the datasets. This included:

  • Dropping Irrelevant Columns: Removing columns that did not contribute to the prediction task.
  • Renaming Columns: Standardizing column names for better clarity and consistency.
  • Normalizing Date Formats: Ensuring all date values followed a consistent format for proper time-based analysis.
  • Dropping Irrelevant Values: Filtering out data points that were not relevant to the scope of the analysis.
  • Handling Missing Values: Removing rows or entries with missing values to maintain data integrity.
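The preprocessing steps above can be sketched with pandas on a toy DataFrame. The column names here are illustrative stand-ins, not the actual FiveThirtyEight schema.

```python
import pandas as pd

# Toy raw data; column names are illustrative, not the real schema.
raw = pd.DataFrame({
    "cycle": [2020, 2020, 2020],
    "modeldate": ["11/01/2020", "11/02/2020", None],
    "candidate_name": ["A", "B", "A"],
    "pct_estimate": [49.1, 47.8, None],
    "comment": ["x", "y", "z"],  # irrelevant to the prediction task
})

df = (
    raw.drop(columns=["comment"])                    # drop irrelevant columns
       .rename(columns={"pct_estimate": "pct"})      # standardize column names
       .assign(modeldate=lambda d: pd.to_datetime(   # normalize date formats
           d["modeldate"], format="%m/%d/%Y"))
       .dropna(subset=["modeldate", "pct"])          # handle missing values
)
print(df.shape)  # rows with a missing date or poll value are removed
```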

You can find the detailed EDA by checking this Jupyter Notebook on the Hugging Face space of the pipeline's project.

4. Feature Engineering

As mentioned earlier, better data results in smarter and more accurate ML models. To enhance the quality of the source data and, consequently, boost the model's predictive performance, I derived three types of new features (new columns added to the dataset):

Opponent-based features: These features enable the model to capture the mutual influence between rival candidates. For instance, I added a lead feature that calculates a candidate's lead over the opponent, by state and on a specific date.

Temporal features: These features reflect time-based trends, such as recurring patterns and changes in voter sentiment as the election approaches, providing valuable insights into how opinions develop over time. Temporal features include:

  • Days Until Election: A new feature indicating the number of days remaining until election day.
  • Rolling Averages: Poll percentages grouped by party (DEM, REP) are smoothed using rolling averages to identify broader trends.
  • Exponential Moving Average: A weighted average that gives more significance to recent polling data, capturing short-term shifts in voter sentiment.
  • Momentum: The daily change in poll estimates for each candidate, reflecting the pace and direction of shifts in voter preferences.

Candidate and party related features: This set of features informs the ML models about a candidate's presidential history and voter attitudes toward the current administration. I created two boolean features to indicate whether a candidate is the incumbent president or the incumbent vice president. Similarly, I added a boolean feature to indicate whether the party currently in office is running for re-election, offering more general context on the political dynamics at play.

5. ML Training and Evaluation Strategy

5.1. Prediction Target

The prediction target I opted for is the vote share for each candidate. Since vote share is expressed as a percentage, i.e. a continuous numerical value, I implemented a regression-based prediction approach to estimate these outcomes.

5.2. Train/Test data split

In machine learning, we typically split the dataset into training, validation, and testing sets to evaluate model performance. The training set is used to train the model, the validation set helps fine-tune the model's hyperparameters and detect underfitting/overfitting, while the testing set is reserved for checking how well the model generalizes to unseen data. A common split ratio is 70% for training, 15% for validation, and 15% for testing. To further improve evaluation, we can also use more robust techniques like k-fold cross-validation, where the data is partitioned into k folds and models are trained iteratively, with each fold taking a turn as the held-out set.
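The conventional 70/15/15 split can be done in two steps with scikit-learn's `train_test_split`: first carve off 30% of the data, then halve that 30% into validation and test sets. The toy arrays here are just for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with a single feature.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Step 1: 70% train, 30% held out.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
# Step 2: split the held-out 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```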

However, in our use case, we have the opportunity to proceed in a different and more effective way: starting from 2004, I treated each election cycle as an unseen "test cycle" and used all preceding cycles as "training cycles." This approach ensures that predictions for a given election year are based solely on data from earlier years, simulating a real-world scenario where future events remain unknown, and allowed me to objectively evaluate the performance of each trained model.

An example makes this concrete: to predict the 2008 US election, the model was trained on the 2000 and 2004 election cycles, then tested on unseen data from the 2008 election cycle, simulating a prediction for the 2008 election.
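This cycle-based scheme is an expanding-window split, which can be enumerated in a few lines of plain Python:

```python
# Expanding-window split over election cycles: each cycle is predicted
# using only the cycles that precede it.
cycles = [2000, 2004, 2008, 2012, 2016, 2020]

splits = []
for i, test_cycle in enumerate(cycles[1:], start=1):
    train_cycles = cycles[:i]          # all cycles before the test cycle
    splits.append((train_cycles, test_cycle))

for train_cycles, test_cycle in splits:
    print(f"train on {train_cycles} -> test on {test_cycle}")
# e.g. the 2008 model trains on [2000, 2004] and is tested on 2008
```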

5.3. Time-based Weighting

Additionally, I implemented a weighting function that leverages the Days Until Election engineered feature to assign greater importance to polling data collected closer to the election date. This approach ensures the model focuses more on recent voter sentiment, which is typically more reflective of the final outcome.
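One possible weighting scheme, shown below as an assumption rather than the pipeline's exact function, is exponential decay in the `days_until_election` feature, so that polls taken closer to election day receive larger sample weights. Such weights can be passed to a scikit-learn regressor's `fit` via its `sample_weight` parameter.

```python
import numpy as np

def time_weights(days_until_election, half_life=30):
    """Exponential-decay weights: a poll's weight halves every
    `half_life` days it sits before election day. This is an
    illustrative scheme, not the article's exact function."""
    days = np.asarray(days_until_election, dtype=float)
    return 0.5 ** (days / half_life)

w = time_weights([0, 30, 60, 90])
print(w)  # weights: 1.0, 0.5, 0.25, 0.125
```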

5.4. ML Training Strategy

Furthermore, I assumed that Blue and Red states will vote predictably along their historical party lines. Therefore, I focused on predicting the election results of the Swing states: Pennsylvania, Wisconsin, Michigan, Georgia, North Carolina, Arizona, and Nevada. Historically, these states often decide the outcome of an election due to their unpredictability.

To efficiently forecast election results, I adopted a strategy of training separate ML regression models for each swing state. This approach allows each model to focus on the unique patterns, voter behavior, and factors specific to that state, leading to more accurate predictions. By isolating the data for each swing state, the models avoid cross-interference and prevent the risk of generalizing patterns that don’t apply uniformly across states, reducing the chances of getting a single model that underfits or overfits the overall political landscape in all the swing states combined.

I also used an ensemble learning strategy, stacked generalization, which combines the predictions of multiple regression models to mitigate the limitations of individual models and ultimately enhance performance.

For each of the seven Swing states, I trained a diverse set of regressors:

  • LinearRegression: A linear model that predicts the target by fitting a straight line to the data.
  • RandomForest: An ensemble learning method that combines multiple decision trees to improve accuracy and reduce overfitting.
  • SVR: A support vector model that fits the data within an error margin, often effective on small datasets.
  • KNeighbors: A non-parametric model that predicts the target by averaging the outcomes of the nearest data points.
  • XGBoost: A powerful gradient boosting algorithm that excels at handling structured data and overfitting prevention.
  • GradientBoosting: Another gradient boosting method that builds models sequentially to correct errors made by previous models.
  • MLPRegressor: A neural network-based model capable of capturing complex, non-linear relationships in the data.
  • ElasticNet: A regularized linear regression model that combines L1 and L2 penalties for better feature selection and generalization.
  • AdaBoost: Combines multiple weak learners, typically decision trees, into a single strong model by iteratively focusing on the hardest-to-predict data points.
  • StackingRegressor: An ensemble method that combines the previous regression models as base models and Ridge as meta-model to enhance predictive performance.
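A reduced sketch of the stacking setup, using a few of the base regressors listed above combined by a Ridge meta-model as the article describes, is shown below on synthetic data. The estimator choices and hyperparameters here are illustrative, not the pipeline's exact configuration.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for a state's polling features.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=80)

stack = StackingRegressor(
    estimators=[  # a subset of the base regressors listed above
        ("linear", LinearRegression()),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(),  # Ridge meta-model, as in the article
)
stack.fit(X, y)
print(stack.predict(X[:2]).shape)  # one vote-share prediction per row
```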

To sum up, now that we have a clear understanding of how data is split into training and testing sets for each combination of current and previous election cycles, we can calculate the total number of training iterations performed: 10 regressors × 7 swing states × 5 test cycles (2004 to 2020, each using the preceding cycles for training) = 350 training iterations in total.

5.5. ML Evaluation Strategy

All the trained ML models, 70 in total (10 regressor types for each of the seven swing states, each trained 5 times on the 5 train/test splits), were saved in a pickle file, which can be loaded back using Python's pickle module. From this collection, the optimal model for each state was selected, resulting in 7 models used for the final election predictions.

To select these seven optimal models, I based my evaluation on two main metrics:

  1. Number of correctly predicted past winners: The model with the highest number of correct winner predictions across the election cycles is prioritized. This ensures the selected model has consistently demonstrated its ability to identify the winner for that state.
  2. Lowest average MAE: The Mean Absolute Error (MAE) measures the average difference, in percentage points, between the predicted and actual vote shares, quantifying how much the predictions deviate from the actual outcomes on average. When multiple models achieved the same number of correct winner predictions, I selected the one with the lowest MAE. This ensured that the chosen model not only identified the correct winner but also provided more precise vote share estimates.
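The two-step selection rule reduces to a lexicographic sort: maximize correct winners first, then minimize MAE. The evaluation numbers below are made up for illustration.

```python
import numpy as np

def mae(pred, actual):
    """Mean Absolute Error in percentage points of vote share."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))

def select_model(results):
    """results: {model_name: (correct_winners, avg_mae)}.
    Pick the most correct winners; break ties with the lowest MAE."""
    return min(results, key=lambda m: (-results[m][0], results[m][1]))

# Made-up evaluation numbers for one hypothetical state.
results = {
    "LinearRegression": (4, 2.1),
    "XGBoost": (5, 1.8),
    "KNeighbors": (5, 1.4),  # same winners as XGBoost, lower MAE -> selected
}
print(select_model(results))  # KNeighbors
```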

The detailed evaluation results by swing state, regressor type and training cycle are exported in this CSV file, and the aggregated results, sorted from best to worst performing model per swing state according to the evaluation strategy detailed above, can be found in this CSV file. Based on these evaluation results, the optimal models used for the final election predictions are as follows:

{'Arizona': 'KNeighbors', 'Georgia': 'ElasticNet', 'Michigan': 'LinearRegression', 'Nevada': 'GradientBoosting', 'North Carolina': 'ElasticNet', 'Pennsylvania': 'StackingModel', 'Wisconsin': 'StackingModel'}        

6. Election Predictions Results

The prediction data passed to the selected optimal models is the 2024 polling data, and the target is the vote share for each swing state. The predicted results can be found in this CSV file, and a visual summary of the predicted outcomes is shared as a Streamlit app.

To turn these predicted vote shares into an election winner, we need to translate them into electoral votes. According to the predictions, Kamala Harris (DEM) will win the swing states of Georgia and Nevada, while Donald Trump (REP) will win Michigan, North Carolina, Wisconsin, Pennsylvania and Arizona.

Election Predictions map

If we combine the electoral votes of the swing states predicted to be won by each candidate with those of the reliably blue and red states, the results show Trump winning the election with 290 electoral votes, while Harris falls short with 248 votes. This outcome closely aligns with the actual election results in terms of identifying the winner.
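The tally above can be reproduced in a few lines. The swing-state electoral vote counts are the official 2024 allocations; the safe-state baselines (219 reliably red, 226 reliably blue) are inferred from the article's 290/248 totals.

```python
# Official 2024 electoral vote counts for the seven swing states.
swing_ev = {"Pennsylvania": 19, "Wisconsin": 10, "Michigan": 15,
            "Georgia": 16, "North Carolina": 16, "Arizona": 11, "Nevada": 6}

# The model's predicted winner per swing state.
predicted = {"Georgia": "Harris", "Nevada": "Harris",
             "Michigan": "Trump", "North Carolina": "Trump",
             "Wisconsin": "Trump", "Pennsylvania": "Trump",
             "Arizona": "Trump"}

# Baselines inferred from the article's totals: reliably red / blue states.
totals = {"Trump": 219, "Harris": 226}
for state, winner in predicted.items():
    totals[winner] += swing_ev[state]

print(totals)  # {'Trump': 290, 'Harris': 248}
```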

However, in reality, Trump won all the swing states and secured a total of 312 electoral votes (4). One way to measure the success of our ML models is their score of 5/7: correctly predicting the outcomes for 5 out of 7 swing states. That's pretty decent, isn't it?

Conclusion

Overall, the machine learning model performed well by correctly predicting the election winner. Yet, there is room for optimization. This could involve introducing additional features based on new data sources, such as economic conditions or demographic trends, to provide richer insights. However, you should be aware that using demographic data comes with the risk of introducing bias and unfairness related to sensitive personal features such as gender, race, or age. To build fair ML models, you may want to consider applying fairness techniques (e.g. anti-classification, classification parity, calibration). You can also leverage tools like the What-If Tool to analyze whether your dataset might lead to biased training, or use Python toolboxes like FairML to audit your trained models for bias and unfairness.

Another optimization approach consists of fine-tuning the models' hyperparameters, for instance by following advanced techniques outlined in resources like this Google playbook for tuning ML/DL models. Additionally, you can use MLflow to track and benchmark this hyperparameter tuning. You can also adopt other evaluation metrics and track them with MLflow as well, which integrates seamlessly with Kedro to monitor the models' performance.

This comprehensive example showcased how Machine Learning can be applied to solve real-world problems effectively. You explored the process of designing and implementing an ML strategy, the critical importance of data relevance and quality in enhancing model performance, and the steps involved in training and evaluating ML models, from splitting data into training and testing sets, to evaluating and selecting the optimal model for specific datasets. Through this guide, you have gained a foundational knowledge of how one of the most widely used branches of Artificial Intelligence is impacting various aspects of our lives today.

Appendix: Data Lineage

Data Lineage and ML pipeline overview

References

(1) The 2024 presidential election was close, not a landslide - ABC News

(2) Machine learning - Wikipedia

(3) Electoral College | USAGov

(4) Presidential election results 2024 | CNN Politics
