Simpson’s Paradox
Image Source: youtube/watch?v=ebEkn-BiW5k

In this article, I will cover a well-known statistical phenomenon that you may have heard of before, called ‘Simpson’s Paradox’. It occurs when a trend that is present in several separate groups of data disappears or reverses when the groups are combined.

As always, if you find my articles interesting, don’t forget to like and follow. These articles take time and effort to write!

What Is Simpson’s Paradox?

Simpson’s Paradox. Image Source: skewthescript/data-projects/simpsons-paradox

It is not related to the famous TV series The Simpsons. It is named after Edward H. Simpson, a British statistician who described it in 1951 (similar effects had been noted earlier by Karl Pearson and Udny Yule).

Simpson’s Paradox refers to the phenomenon where aggregated data shows the opposite trend to the one seen when the same data is split into subgroups. In the image above, the black line representing the aggregated trend runs opposite to the trends of the subgroups.

Confounding variable. Image Source: thosenerdygirls/confounding-variables

The paradox can arise when the aggregated data is influenced by a third variable (the confounder) that is related to both variables of interest and distorts the relationship between them. Let’s look at an example below to understand this better.
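To see how a confounder can flip a comparison, here is a minimal sketch using plain counts. The treatments, severity groups, and all numbers are hypothetical values I made up for illustration; they are not real data. Severity plays the role of the confounder: it affects the success rate, and it also determines which treatment most patients receive.

```python
# Hypothetical (successes, patients) counts per treatment and severity.
# These numbers are invented for illustration only.
data = {
    "A": {"mild": (18, 20), "severe": (56, 80)},
    "B": {"mild": (68, 80), "severe": (12, 20)},
}

def rate(successes, total):
    """Return the success rate as a fraction between 0 and 1."""
    return successes / total

group_rates = {}    # success rate within each severity group
overall_rates = {}  # success rate after pooling the severity groups

for treatment, groups in data.items():
    group_rates[treatment] = {
        severity: rate(s, n) for severity, (s, n) in groups.items()
    }
    total_successes = sum(s for s, _ in groups.values())
    total_patients = sum(n for _, n in groups.values())
    overall_rates[treatment] = rate(total_successes, total_patients)

for treatment in data:
    print(treatment, group_rates[treatment], "overall:", overall_rates[treatment])
```

In this made-up table, treatment A has the higher success rate within both the mild and the severe group, yet treatment B looks better once the groups are pooled, because A was given mostly to severe cases.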

Example. Image Source: Dr. Walid Soula

I will generate data for two groups, plot the individual trends, and compare them with the combined trend.

1 — Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression        

2 — Generate Synthetic Data with Positive and Negative Trends for Two Groups

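np.random.seed(42)  # fix the seed so the results are reproducible

# Group 1: Y increases with X (slope of 2)
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)

# Group 2: Y decreases with X (slope of -3, intercept of 20)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)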

3 — Combine Data into a Single DataFrame

df = pd.DataFrame({
    'Group': ['Group 1'] * 50 + ['Group 2'] * 50,  # Label each group
    'X': np.concatenate([group_1_x, group_2_x]),  # Combine X values
    'Y': np.concatenate([group_1_y, group_2_y])   # Combine Y values
})

df        
Output of df. Image Source: Dr. Walid Soula

4 — Plot Regression Lines for Each Group

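# lmplot is a figure-level function and creates its own figure,
# so a separate plt.figure() call would only open an empty extra figure
g = sns.lmplot(data=df, x='X', y='Y', hue='Group', height=6, aspect=1.5, markers=['o', 's'])
g.ax.set_title("Regression Lines for Each Group")
plt.show()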
Regression Lines for Each Group. Image Source: Dr. Walid Soula

We can observe from the graph that Group 1 has a positive trend, while Group 2 has a negative trend.
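The direction of each group's trend can also be checked numerically instead of by eye. Here is a minimal sketch that regenerates the same synthetic data (same seed and draw order) so that it runs on its own, then fits a straight line to each group with np.polyfit. Using np.polyfit here is my choice for brevity; it is not part of the original walkthrough.

```python
import numpy as np

# Regenerate the synthetic data with the same seed and draw order
np.random.seed(42)
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)

# A degree-1 polyfit returns the coefficients as [slope, intercept]
slope_1, intercept_1 = np.polyfit(group_1_x, group_1_y, 1)
slope_2, intercept_2 = np.polyfit(group_2_x, group_2_y, 1)

print(f"Group 1: slope = {slope_1:.2f}, intercept = {intercept_1:.2f}")
print(f"Group 2: slope = {slope_2:.2f}, intercept = {intercept_2:.2f}")
```

The fitted slopes should land close to the values used to generate the data (2 and -3), with small deviations caused by the added noise.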

5 — Fit a Combined Regression

model = LinearRegression()  

X_combined = df[['X']]  
y_combined = df['Y']    

model.fit(X_combined, y_combined)        

6 — Predict and Plot Combined Regression Line
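Here is a minimal sketch of this step. It rebuilds the same synthetic data and refits the combined model so that it runs on its own. The prediction grid, the colors and labels, and the output file name are my assumptions, so adjust them to your own setup.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Rebuild the same synthetic data (same seed and draw order)
np.random.seed(42)
group_1_x = np.random.uniform(1, 10, 50)
group_1_y = 2 * group_1_x + np.random.normal(0, 2, 50)
group_2_x = np.random.uniform(1, 10, 50)
group_2_y = -3 * group_2_x + 20 + np.random.normal(0, 2, 50)

df = pd.DataFrame({
    'Group': ['Group 1'] * 50 + ['Group 2'] * 50,
    'X': np.concatenate([group_1_x, group_2_x]),
    'Y': np.concatenate([group_1_y, group_2_y]),
})

# Fit one regression on all points, ignoring the group label
model = LinearRegression()
model.fit(df[['X']], df['Y'])

# Predict Y on an evenly spaced grid of X values to draw the fitted line
x_grid = pd.DataFrame({'X': np.linspace(df['X'].min(), df['X'].max(), 100)})
y_pred = model.predict(x_grid)

# Scatter each group, then overlay the single combined regression line
fig, ax = plt.subplots(figsize=(12, 6))
for name, part in df.groupby('Group'):
    ax.scatter(part['X'], part['Y'], label=name, alpha=0.7)
ax.plot(x_grid['X'], y_pred, color='black', linewidth=2, label='Combined regression')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Combined Regression Line')
ax.legend()
fig.savefig('combined_regression.png')  # hypothetical file name

combined_slope = model.coef_[0]
print(f"Combined slope: {combined_slope:.2f}")
```

In an interactive session or a notebook, you can call plt.show() instead of saving the figure to a file.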

Combined Regression Line. Image Source: Dr. Walid Soula

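We can observe that the trend in the aggregated data differs from the trend in each individual group: a single combined line describes neither Group 1’s positive slope nor Group 2’s negative slope. This mismatch between aggregated and grouped data is what lies behind Simpson’s Paradox!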
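In the example above, the two groups trend in opposite directions. The paradox is often presented in a stronger form, where every group shows the same trend and the aggregate reverses it. Here is a minimal sketch of that case. The group names, slopes, intercepts, and X ranges are values I chose for illustration; they are not part of the dataset used earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Both hypothetical groups have a positive slope of 2,
# but Group B sits at higher X values and lower Y values than Group A
a_x = rng.uniform(1, 5, 50)
a_y = 2 * a_x + 10 + rng.normal(0, 1, 50)
b_x = rng.uniform(6, 10, 50)
b_y = 2 * b_x - 10 + rng.normal(0, 1, 50)

# Slope within each group (a degree-1 polyfit returns [slope, intercept])
slope_a = np.polyfit(a_x, a_y, 1)[0]
slope_b = np.polyfit(b_x, b_y, 1)[0]

# Slope after pooling both groups and ignoring the group label
all_x = np.concatenate([a_x, b_x])
all_y = np.concatenate([a_y, b_y])
slope_all = np.polyfit(all_x, all_y, 1)[0]

print(f"Group A slope:  {slope_a:.2f}")
print(f"Group B slope:  {slope_b:.2f}")
print(f"Combined slope: {slope_all:.2f}")
```

Here group membership acts as the confounder: it shifts both X and Y, so the pooled line mostly follows the gap between the two groups rather than the trend inside them.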

If there’s a specific subject you’d like me to cover, please don’t hesitate to let me know! Your input will help shape the direction of my content and ensure it remains relevant and engaging.

If you found this helpful, consider sharing it and following me, Dr. Oualid Soula, for more content like this.

Join the journey of discovery and stay ahead in the world of data science and AI! Don't miss out on the latest insights and updates: subscribe to the newsletter for free at https://lnkd.in/eNBG5dWm and become part of our growing community!
