Simpson’s Paradox

Dr. Oualid S.

AI & Marketing Expert | Bridging Business and Science

Published: December 27, 2024

In this article, I will cover a well-known statistical phenomenon that you may have heard of before: Simpson’s Paradox. It occurs when trends present in several different groups of data disappear or reverse when these groups are combined.

As always, if you find my articles interesting, don’t forget to like and follow. These articles take time and effort to write!

What Is Simpson’s Paradox?

Figure: Simpson’s Paradox. Image source: skewthescript/data-projects/simpsons-paradox

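It’s not related to the famous TV series The Simpsons. It’s named after Edward H. Simpson, a British statistician who described it in a 1951 paper, although similar effects had been noted earlier by statisticians such as Karl Pearson and Udny Yule.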

Simpson’s Paradox refers to the phenomenon where aggregated data shows the opposite trend to the one seen when the data is split into subgroups (as in the image above, where the black line representing the aggregated trend runs opposite to the trends of the subgroups).

Figure: Confounding variable. Image source: thosenerdygirls/confounding-variables

The paradox can arise when the aggregated data is influenced by a third variable (the confounder), which distorts the apparent relationship between the variables of interest.
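A small numeric sketch can make the confounder mechanism concrete. The counts below are made up for illustration (they are not from any real campaign): two hypothetical campaigns, A and B, are compared across two customer segments, and the segment mix plays the role of the confounder.

```python
# Hypothetical (made-up) counts: (conversions, visitors) per campaign and segment.
counts = {
    "A": {"new": (81, 87), "returning": (192, 263)},
    "B": {"new": (234, 270), "returning": (55, 80)},
}

# Conversion rate inside each segment
segment_rates = {
    campaign: {seg: conv / vis for seg, (conv, vis) in segs.items()}
    for campaign, segs in counts.items()
}

# Conversion rate after pooling the segments together
overall_rates = {}
for campaign, segs in counts.items():
    total_conv = sum(conv for conv, _ in segs.values())
    total_vis = sum(vis for _, vis in segs.values())
    overall_rates[campaign] = total_conv / total_vis

for seg in ("new", "returning"):
    print(seg, {c: round(segment_rates[c][seg], 3) for c in counts})
print("overall", {c: round(r, 3) for c, r in overall_rates.items()})
```

In this toy data, campaign A has the higher rate inside each segment, yet campaign B has the higher pooled rate, because most of A's visitors fall in the segment that converts less often for both campaigns.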

To see this with regression lines, I will generate data for two groups, plot the individual trends, and compare them with the combined trend.

1 — Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

2 — Generate Synthetic Data with Positive and Negative Trends for Two Groups

np.random.seed(42)

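# Group 1: positive trend (slope +2) plus Gaussian noise
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
# Group 2: negative trend (slope -3, intercept 20) plus Gaussian noise
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)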

3 — Combine Data into a Single DataFrame

df = pd.DataFrame({
    'Group': ['Group 1'] * 50 + ['Group 2'] * 50,  # Label each group
    'X': np.concatenate([group_1_x, group_2_x]),  # Combine X values
    'Y': np.concatenate([group_1_y, group_2_y])   # Combine Y values
})

df

4 — Plot Regression Lines for Each Group

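# lmplot is a figure-level function and creates its own figure,
# so a separate plt.figure() call would only leave an empty extra figure
sns.lmplot(data=df, x='X', y='Y', hue='Group', height=6, aspect=1.5, markers=['o', 's'])
plt.title("Regression Lines for Each Group")
plt.show()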

Figure: Regression Lines for Each Group. Image source: Dr. Walid Soula

The graph shows that Group 1 has a positive trend, while Group 2 has a negative trend.
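The visual impression can be checked numerically. The sketch below is an addition rather than part of the original walkthrough: it regenerates the same data with the same seed and uses `np.polyfit` (a substitute for the plotted seaborn fit) to estimate each group's slope.

```python
import numpy as np

np.random.seed(42)  # same seed as in step 2

# Regenerate the two groups exactly as in step 2
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)

# np.polyfit with degree 1 returns [slope, intercept]
slope_1, intercept_1 = np.polyfit(group_1_x, group_1_y, 1)
slope_2, intercept_2 = np.polyfit(group_2_x, group_2_y, 1)

print(f"Group 1 slope: {slope_1:.2f}")
print(f"Group 2 slope: {slope_2:.2f}")
```

The estimated slopes should land close to the values used to generate the data (+2 and -3), with some deviation from the added noise.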

5 — Fit a Combined Regression

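# Ordinary least-squares fit on all 100 points, ignoring the group label
model = LinearRegression()

X_combined = df[['X']]  # 2-D feature matrix, as scikit-learn expects
y_combined = df['Y']    # 1-D target

model.fit(X_combined, y_combined)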

6 — Predict and Plot Combined Regression Line
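A minimal sketch of this step, assuming the `df` and `model` objects from steps 2 to 5 (they are rebuilt here so the block runs on its own). The prediction grid `x_grid` and the plain matplotlib styling are illustrative choices, not necessarily what produced the figure below.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Rebuild the data and the combined model from the earlier steps
np.random.seed(42)
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)

df = pd.DataFrame({
    'Group': ['Group 1'] * 50 + ['Group 2'] * 50,
    'X': np.concatenate([group_1_x, group_2_x]),
    'Y': np.concatenate([group_1_y, group_2_y]),
})

model = LinearRegression()
model.fit(df[['X']], df['Y'])

# Predict over an evenly spaced grid spanning the observed X range
x_grid = pd.DataFrame({'X': np.linspace(df['X'].min(), df['X'].max(), 100)})
y_pred = model.predict(x_grid)

# Scatter both groups and overlay the single combined regression line
fig, ax = plt.subplots(figsize=(12, 6))
for name, marker in [('Group 1', 'o'), ('Group 2', 's')]:
    subset = df[df['Group'] == name]
    ax.scatter(subset['X'], subset['Y'], marker=marker, alpha=0.7, label=name)
ax.plot(x_grid['X'], y_pred, color='black', linewidth=2, label='Combined regression')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title("Combined Regression Line")
ax.legend()
plt.show()

print(f"Combined slope: {model.coef_[0]:.2f}")
```

Predicting on a DataFrame with the same column name used for fitting (`X`) keeps scikit-learn from warning about mismatched feature names.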

Figure: Combined Regression Line. Image source: Dr. Walid Soula

The trend in the aggregated data differs from the trends observed within the individual groups, which illustrates what is known as Simpson’s Paradox!
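In the data above, the two groups trend in opposite directions. A stricter form of the paradox has every group trending the same way while the pooled trend reverses. The sketch below is one way to produce that, under assumed settings: the group names, offsets, X ranges, and seed are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# Both groups share the same positive slope (+2), but Group B sits
# further to the right on X and lower on Y than Group A.
x_a = rng.uniform(1, 5, 50)
y_a = 2 * x_a + 20 + rng.normal(0, 2, 50)
x_b = rng.uniform(6, 10, 50)
y_b = 2 * x_b + rng.normal(0, 2, 50)

# Slope within each group, then slope of the pooled data
slope_a = np.polyfit(x_a, y_a, 1)[0]
slope_b = np.polyfit(x_b, y_b, 1)[0]
slope_pooled = np.polyfit(
    np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]), 1
)[0]

print(f"Group A slope: {slope_a:.2f}")
print(f"Group B slope: {slope_b:.2f}")
print(f"Pooled slope:  {slope_pooled:.2f}")
```

Here group membership is the confounder: it shifts both X and Y, so the pooled line mostly follows the gap between the two group centers rather than the trend inside either group.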

If there’s a specific subject you’d like me to cover, please don’t hesitate to let me know! Your input will help shape the direction of my content and ensure it remains relevant and engaging.

If you found this helpful, consider sharing it and following me, Dr. Oualid Soula, for more content like this.

Join the journey of discovery and stay ahead in the world of data science and AI! Don't miss out on the latest insights and updates: subscribe to the newsletter for free at https://lnkd.in/eNBG5dWm and become part of our growing community!
