登录查看更多内容

Decoding the 2016 Ballot: Analyzing County Demographics & Voting Partiality

José Antonio Reyes

Computer Science student at St. Edwards

发布日期: 2025年2月27日

The 2016 presidential election captivated audiences worldwide with its dynamic candidates, political platforms, and goals for the future of American society. As many political scientists and statisticians nationwide attempt to forecast election results by investigating key influencing demographics, we attempt to do the same below. Leveraging 2016 primary election data - we investigated the broader research question: what demographic factors influence voter party tendencies on the county level??

To answer our research question and build a predictive machine learning model, we utilized national primary election result data and county demographic data. Our primary election result dataset consisted of a plethora of interesting information; however, for the sake of simplicity, we leveraged only the following information: political party affiliation, vote count, and fraction votes. For our county demographic data, we looked only to utilize weighted columns that potentially determine county voting tendencies toward Republican or Democratic partiality. Selected weighted columns that we explored via the following general categories include population density, local economic activity, and age.?

Overview

Our analysis demonstrates how cumulative votes were distributed nationally and locally among counties. County-wise,? Republicans secured a significant amount of county votes of 77.1% compared to Democrats at 22.9%. However, in terms of the overall vote shares, it was a closer call, with Democrats earning 46.9% of the total share while Republicans earned the remaining 53.1%. This discrepancy highlights a data imbalance of over ?’s regarding the number of Republic-leaning counties.??

Data Cleaning:

One systematic challenge posed in our analysis was ensuring consistency with our data quality. Approximately 35.6% of data values were missing. Furthermore, states such as Alaska, Colorado, Connecticut, Illinois, Kansas, Maine, Massachusetts, North Dakota, Rhode Island, Vermont, and Wyoming had missing information. These minor inconveniences were addressed prior to merging our datasets. As for merging our data, we employed the FIPS column in each of our datasets. FIPS stands for? Federal Information Processing Standards - essentially it is an identification number for specific geographic areas, in this case that would be our counties.?

The Predictive Model & Results

Adopting a logistic regression model, we attempt to classify counties voting tendencies as either Democratic-leaning or Republican-leaning based on demographic factors. Our confusion matrix and metrics significance are in detail below.?

Model Metrics:

77.5% accuracy: Overall correct predictions
90% precision: The model is rarely wrong when it says “Republican.”
79% recall: It captures most Republican counties.
82% AUROC: Good discerning ability between Republican-voting counties from Democrat-voting counties

Building The Model

We start by selecting numeric features from our dataset and standardizing them using a StandardScaler. Next, we create and train a Logistic Regression model with L1 (Lasso) regularization and a balanced class weight in 80% of our data (the remaining 20% is saved for testing). The Lasso penalty performs feature selection by shrinking less important coefficients to zero. The balanced class weight compensates for any imbalance in the label distribution (e.g., if there are more Republican-winning counties than Democratic ones, or vice versa).?

These are the feature coefficients, giving us insight into which factors most strongly push a county toward each party in the model’s eyes.

Democratic Feature Coefficients

For the democratic winning counties, these are the features with the highest coefficients:

Density (the first red box) is the strongest group of features for democratic counties. This analysis revealed that Democratic-leaning counties have an average population density 7 times higher than their Republican counterparts.

To confirm our analysis, we looked to an external reputable research source. (Pew Research Center, 2024) confirms our population density inkling, stating that over the past two decades, there has been robust support for the Republican party among rural counties. Nevertheless, in 2016, there was 65% of Democratic party support among urban counties.???

Economic activity (the second red box) features also highlight the concentration of Democratic voters in economically active urban counties, with 2.75 times higher total business sales compared to the Republican counties.?

Confirming our suspicion for understanding how economic activity influences voter tendencies above, we looked to (Muro, Whiton & Maxim, 2020) for the 2016 political-economic context regarding the overall final presidential results. (Muro, Whiton & Maxim, 2020) gives us reason to believe that counties with higher economic activity tend to vote Democrat, while less economically involved counties vote Republican. As noted in 2016, despite fewer Democrat-voting counties overall, they made up the majority of the national aggregate economy at 64%, with only 472 counties voting blue. Republican-leaning counties comprised 36% of the national aggregate economy, with 2584 counties voting red.???

Republican Feature Coefficients

Population (the first red box) is the strongest group of features for republican counties. However, after looking into our data, we noticed that higher population counties also vote democratic.?

This may make it seem like our model is contradicting itself, but there could be multiple reasons why this occurs.

Why is this the Case?

Data Imbalance: Because there are many small counties that vote Republican, the model can end up associating higher population with a Republican outcome.

Other Variables: Once we account for factors like race, poverty, and age groups, the partial effect of population on the model's predictions can reverse direction.

Population vs. Density: Counties with a large total population aren't necessarily urban or densely populated, so raw population alone might not capture the urban–rural divide.

Age groups (the second red box) also highlight how Democratic voters had a higher percentage of children under the age of 5, while Republican voters tended to have a higher percentage of teens under the age of 18 and seniors over the age of 65.

Feature Engineering

Variance Inflation Factor (VIF)

We wanted to fine tune the model by reducing multicollinearity in our data.This is because a high VIF could mean that a feature is redundant because it shares too much information with other features. This, in turn, can make the model unstable. We condensed multiple highly correlated features into three general ones. The correlation threshold we used was = 0.8.?

These are the three condensed columns along with what they contain:

County_Size_Index

Youth_pct_2014

Income_Index

With this cleaned data we built the model again.

Logistic Regression Model v.2

However, after comparing the new model's performance with the old one, we see that it actually is worse.

Although the features are highly correlated, they still have a significant impact individually. So dropping or merging the features hurts the model predictive power.

Concluding Thoughts:

Overall, our analysis found that Democratic-leaning counties tend to have 7 times more substantial population density than Republican-leaning counties. We also found that, on average, Democratic-leaning counties have 2.75 times more commercial activity compared to Republican-leaning counties. This evidence leads us to believe that more urbanized and wealthier countries tend to have Democratic voting tendencies and vice-versa for our Republican county counterparts. Nevertheless, we did notice that for our Republican-leaning counties, households with residents under 18 in their households and above 65 leaned Republican. At the same time, “newer” families with family members under five years old tended to vote slightly more Democratic-leaning. Lastly, we found that our predictive machine learning model tended to predict with an accuracy level of 77.5% between political party county winners. However, we firmly believe that due to the data imbalance of having more Republican-winning counties, our model tends to predict Republicans with differentiating population counts inaccurately. Last but not least, we wanted to clarify our motives for this project were solely for our academic development as we combined our data analysis skills with more advanced statistical machine-learning techniques.?

Github Repository: 2016-US-elections-prediction-ML

要查看或添加评论，请登录

José Antonio Reyes的更多文章

Understanding Graduation Trends in America 2021

2025年2月6日

Understanding Graduation Trends in America 2021

Jose Colchado , Leila Kandil , and José Antonio Reyes College can be stressful—we’re too busy drowning in assignments (…

1 条评论
The Harmony of Data Analysis and Piano Performance

2024年12月12日

The Harmony of Data Analysis and Piano Performance

Art and science have always been thought of opposite areas of humanity. You are either an arts person or a science…
The Science of Simplicity: How to Overcome Visual Overload in Data Presentation

2024年12月12日

The Science of Simplicity: How to Overcome Visual Overload in Data Presentation

During my class on Data Analysis, one of the first things we learn is how careful we have to be when making graphs and…
How to Make it Big In Los Angeles: Leveraging Insights from Airbnb

2024年12月12日

How to Make it Big In Los Angeles: Leveraging Insights from Airbnb

Airbnb is a fast-growing industry, reaching almost $10 billion in total revenue in 2023, which is an 18% increase in…
Harnessing Consumer Insights for Effective Resource Allocation

2024年11月15日

Harnessing Consumer Insights for Effective Resource Allocation

In today’s competitive market, understanding consumer trends is essential for driving growth and optimizing resource…
Making Austin 311 Service Requests More Efficient

2024年10月18日

Making Austin 311 Service Requests More Efficient

Austin 311: Who they are, Where they operate, What they do Living in a large metropolitan area, reporting issues such…

1 条评论
Cracking the Code: Key Drivers of Hotel Cancelations

2024年9月25日

Cracking the Code: Key Drivers of Hotel Cancelations

To provide context for this project set out to find why customers are canceling, and see if they can predict ahead of…

See all articles

Overview

Data Cleaning:

The Predictive Model & Results

Model Metrics:

Building The Model

Democratic Feature Coefficients

Republican Feature Coefficients

Feature Engineering

Variance Inflation Factor (VIF)

Logistic Regression Model v.2

Concluding Thoughts:

José Antonio Reyes的更多文章

Understanding Graduation Trends in America 2021

The Harmony of Data Analysis and Piano Performance

The Science of Simplicity: How to Overcome Visual Overload in Data Presentation

How to Make it Big In Los Angeles: Leveraging Insights from Airbnb

Harnessing Consumer Insights for Effective Resource Allocation

Making Austin 311 Service Requests More Efficient

Cracking the Code: Key Drivers of Hotel Cancelations