Simpson’s Paradox
In this article, I will cover a well-known statistical phenomenon that you may have heard of before, called ‘Simpson’s Paradox’. It occurs when trends present in several different groups of data disappear or reverse when these groups are combined.
As always, if you find my articles interesting, don’t forget to like and follow. These articles take time and effort to write!
What Is Simpson’s Paradox?
It’s not related to the famous TV series The Simpsons; it’s named after Edward H. Simpson, a British statistician who described it in 1951.
Simpson’s Paradox refers to the phenomenon where aggregated data show the opposite trend to the one seen when the data are split into subgroups (as shown in the image above, where the black line representing the aggregated trend runs opposite to the trends of the subgroups).
The paradox can arise when the aggregated data is influenced by a third variable (the confounder), which distorts the original relationship between the variables of interest. Let’s look at an example below to understand this better.
I will generate data for two groups, plot the individual trends, and compare them with the combined trend.
1 — Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
2 — Generate Synthetic Data with Positive and Negative Trends for Two Groups
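np.random.seed(42)  # Fix the seed so the results are reproducible
# Group 1: positive trend (slope 2) plus Gaussian noise
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
# Group 2: negative trend (slope -3, intercept 20) plus Gaussian noise
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)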
3 — Combine Data into a Single DataFrame
df = pd.DataFrame({
    'Group': ['Group 1'] * 50 + ['Group 2'] * 50,  # Label each group
    'X': np.concatenate([group_1_x, group_2_x]),   # Combine X values
    'Y': np.concatenate([group_1_y, group_2_y])    # Combine Y values
})
df
4 — Plot Regression Lines for Each Group
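# lmplot is a figure-level function: it creates its own figure (sized by
# height and aspect), so a separate plt.figure() call is not needed
sns.lmplot(data=df, x='X', y='Y', hue='Group', height=6, aspect=1.5, markers=['o', 's'])
plt.title("Regression Lines for Each Group")
plt.show()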
We can observe from the graph that Group 1 has a positive trend, while Group 2 has a negative trend.
5— Fit a Combined Regression
model = LinearRegression()
X_combined = df[['X']]
y_combined = df['Y']
model.fit(X_combined, y_combined)
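To put numbers on the comparison, here is a minimal sketch that prints the fitted slope for each group next to the slope of the combined fit. It regenerates the data from step 2 with the same seed, and it uses np.polyfit as a stand-in for LinearRegression so that the snippet runs on its own. That swap, and the names ols_slope, slope_1, slope_2 and slope_all, are assumptions made for illustration.

```python
import numpy as np

# Recreate the data from step 2 (same seed and same order of random draws)
np.random.seed(42)
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)


def ols_slope(x, y):
    """Return the slope of a least-squares straight line fitted to (x, y)."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


slope_1 = ols_slope(group_1_x, group_1_y)
slope_2 = ols_slope(group_2_x, group_2_y)
slope_all = ols_slope(
    np.concatenate([group_1_x, group_2_x]),
    np.concatenate([group_1_y, group_2_y]),
)

print(f"Group 1 slope:  {slope_1:.2f}")
print(f"Group 2 slope:  {slope_2:.2f}")
print(f"Combined slope: {slope_all:.2f}")
```

With the fitted model from step 5, model.coef_[0] and model.intercept_ hold the slope and intercept of the combined line.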
6 — Predict and Plot Combined Regression Line
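The code for this step is not shown, so what follows is only a sketch of what it could look like, not the original code. To stay self-contained it regenerates the data from step 2 and fits the pooled line with np.polyfit rather than the LinearRegression model from step 5. The names x_line and y_line, the black line colour and the plot labels are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Recreate the data from step 2 (same seed and same order of random draws)
np.random.seed(42)
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)

x_all = np.concatenate([group_1_x, group_2_x])
y_all = np.concatenate([group_1_y, group_2_y])

# Fit one straight line to the pooled data
# (np.polyfit is used here in place of LinearRegression)
slope, intercept = np.polyfit(x_all, y_all, deg=1)

# Predict Y on an evenly spaced grid covering the observed X range
x_line = np.linspace(x_all.min(), x_all.max(), 100)
y_line = slope * x_line + intercept

# Scatter both groups and overlay the combined regression line
fig, ax = plt.subplots(figsize=(12, 6))
ax.scatter(group_1_x, group_1_y, marker='o', label='Group 1')
ax.scatter(group_2_x, group_2_y, marker='s', label='Group 2')
ax.plot(x_line, y_line, color='black', linewidth=2, label='Combined regression')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Combined Regression Line')
ax.legend()
plt.show()
```

If you prefer to reuse the model fitted in step 5, calling model.predict on a one-column DataFrame built from the same grid should draw the same line.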
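We can observe that the trend in the aggregated data differs from the trends of the individual groups: a single line fitted to all the points describes neither the positive slope of Group 1 nor the negative slope of Group 2. This illustrates what is known as Simpson’s Paradox!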
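The example above has one rising group and one falling group. A stricter form of the paradox, where every subgroup trends the same way yet the aggregate reverses, needs the grouping variable to be tied to X as well as to Y. The sketch below builds such a case under purely hypothetical assumptions: both groups follow y = 2x plus an offset, but group B sits at higher X values and has a much lower offset. The group names, ranges, offsets and seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n = 50

# Hypothetical groups: both have a within-group slope of +2.
# Group B has larger X values but a much lower baseline, so group
# membership acts as the confounder.
xa = rng.uniform(1, 5, n)
ya = 2 * xa + 10 + rng.normal(0, 1, n)
xb = rng.uniform(6, 10, n)
yb = 2 * xb - 10 + rng.normal(0, 1, n)


def slope(x, y):
    """Slope of the least-squares straight line through (x, y)."""
    return np.polyfit(x, y, deg=1)[0]


slope_a = slope(xa, ya)
slope_b = slope(xb, yb)
slope_pooled = slope(np.concatenate([xa, xb]), np.concatenate([ya, yb]))

print(f"Group A slope: {slope_a:.2f}")
print(f"Group B slope: {slope_b:.2f}")
print(f"Pooled slope:  {slope_pooled:.2f}")
```

Under these assumptions both within-group slopes come out positive while the pooled slope is negative. The pooled fit mostly picks up the drop in baseline between the two groups, not the relationship inside either one.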
If there’s a specific subject you’d like me to cover, please don’t hesitate to let me know! Your input will help shape the direction of my content and ensure it remains relevant and engaging.
If you found this helpful, consider sharing it and following me, Dr. Oualid Soula, for more content like this.
Join the journey of discovery and stay ahead in the world of data science and AI! Don't miss out on the latest insights and updates: subscribe to the newsletter for free at https://lnkd.in/eNBG5dWm and become part of our growing community!