EDA of Olympic Medals using Python

Peter Nicholson

Division Systems and Data Analyst at ITW Appliance Components | YPN's Global Chair at ITW | Podcaster at ABCs of ERP & Beyond

Published: October 3, 2022

Continuing my learning journey of Python (and tools) in data analysis, I found a good dataset on Kaggle that has Olympics medals data of each participating country since 1896.

The information on the data reads:

The modern Olympic Games or Olympics are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 nations participating. The Olympic Games are normally held every four years, and since 1994, has alternated between the Summer and Winter Olympics every two years during the four-year period.

So, let's begin:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.patches as mpatches
from matplotlib.pyplot import figure
import matplotlib.mlab as mlab
import scipy.stats
import seaborn as sns

df = pd.read_csv('olympics_medals.csv')

df.info()

RangeIndex: 156 entries, 0 to 155
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   countries              156 non-null    object
 1   ioc_code               156 non-null    object
 2   summer_participations  156 non-null    int64 
 3   summer_gold            156 non-null    object
 4   summer_silver          156 non-null    int64 
 5   summer_bronze          156 non-null    int64 
 6   summer_total           156 non-null    object
 7   winter_participations  156 non-null    int64 
 8   winter_gold            156 non-null    int64 
 9   winter_silver          156 non-null    int64 
 10  winter_bronze          156 non-null    int64 
 11  winter_total           156 non-null    int64 
 12  total_participation    156 non-null    int64 
 13  total_gold             156 non-null    object
 14  total_silver           156 non-null    int64 
 15  total_bronze           156 non-null    int64 
 16  total_total            156 non-null    object
dtypes: int64(11), object(6)
memory usage: 20.8+ KB

Removing the commas from the numerical string data and changing to int64 format:

df['summer_gold'] = df['summer_gold'].str.replace(',','').astype('int64')
df['summer_total'] = df['summer_total'].str.replace(',','').astype('int64')
df['total_gold'] = df['total_gold'].str.replace(',','').astype('int64')
df['total_total'] = df['total_total'].str.replace(',','').astype('int64')
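The same clean-up can be wrapped in a small helper so the column list lives in one place. This is only a sketch on made-up toy data (not the Kaggle file), and `strip_thousands` is a hypothetical name of my own:

```python
import pandas as pd

# Toy frame standing in for the real file; the values are invented.
toy = pd.DataFrame({
    "countries": ["A", "B"],
    "summer_gold": ["1,060", "12"],
    "summer_silver": [831, 7],
})


def strip_thousands(frame, columns):
    """Return a copy with comma-formatted string columns converted to int64."""
    out = frame.copy()
    for col in columns:
        # regex=False treats the comma as a literal character
        out[col] = out[col].str.replace(",", "", regex=False).astype("int64")
    return out


cleaned = strip_thousands(toy, ["summer_gold"])
print(cleaned.dtypes)
```

Working on a copy keeps the raw strings available in case the conversion needs checking later.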

Stripping the ioc code from its brackets:

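# regex=False treats the brackets as literal characters on every pandas version
df['ioc_code'] = df['ioc_code'].str.replace('(', '', regex=False)
df['ioc_code'] = df['ioc_code'].str.replace(')', '', regex=False)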
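If the brackets only ever appear at the start and end of the code (an assumption on my part about the data), a single `str.strip` call may do the same job. A quick sketch on made-up values:

```python
import pandas as pd

# Made-up codes in the bracketed form described above.
codes = pd.Series(["(USA)", "(GBR)", "(KEN)"])

# str.strip removes any of the listed characters from both ends only.
stripped = codes.str.strip("()")
print(stripped.tolist())
```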

Checking correlation across the data points:

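plt.figure(figsize=(18, 12))
sns.set_style('ticks')

# Correlate the numeric columns only; country names and IOC codes are strings
corr = df.select_dtypes('number').corr()

sns.heatmap(data=corr, annot=True, fmt=" .2g", linewidths=2)
# font_scale=0 would shrink the text of later plots to nothing, so use 1
sns.set(font_scale=1)
plt.show()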

This correlation map shows a strong correlation between how often countries participate and how many medals they win. There's also quite a strong correlation between total summer participations and total winter participations (0.65). The correlation between winter participations and medals won at the Winter Games (0.53) is stronger than the one between summer participations and medals won at the Summer Games (0.34).

Taking this further, let's test the hypothesis that the more Summer Olympics a country participates in, the more medals it wins:

f, axs = plt.subplots(2, 2, figsize=(10, 10), gridspec_kw=dict(width_ratios=[4, 4]))
sns.set_style('ticks')
sns.scatterplot( data=df, x="summer_participations", y="summer_gold", ax=axs[0,0])
sns.scatterplot( data=df, x="summer_participations", y="summer_silver", ax=axs[0,1])
sns.scatterplot( data=df, x="summer_participations", y="summer_bronze", ax=axs[1,0])
sns.scatterplot( data=df, x="summer_participations", y="summer_total", ax=axs[1,1])
f.tight_layout()

The scatterplots above don't show any strong correlation.
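One way to put a number on what the scatterplots suggest is a correlation test from `scipy.stats`. The sketch below runs on invented toy numbers, not the real dataset, so the values it prints say nothing about the Olympics. It compares Pearson's r (linear relationship) with Spearman's rho (rank-based, less swayed by one huge medal count):

```python
import pandas as pd
from scipy import stats

# Invented numbers: one outlier country with a very large medal total.
toy = pd.DataFrame({
    "summer_participations": [5, 10, 15, 20, 25, 29],
    "summer_total": [0, 3, 2, 40, 15, 900],
})

# Pearson measures linear association; Spearman works on ranks.
r, p_value = stats.pearsonr(toy["summer_participations"], toy["summer_total"])
rho, p_rank = stats.spearmanr(toy["summer_participations"], toy["summer_total"])

print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rank:.3f})")
```

On the real data, the same two calls could be pointed at `df["summer_participations"]` and `df["summer_total"]`.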

Counting the number of medals won at the Olympics may not be the fairest system of determining the most successful countries. For example, at the 2008 Olympic Games in Beijing, the USA finished second in the gold count to China but were ahead of them in the total medal count. This created quite a bit of discussion about which ranking system should be used. The Americans were obviously quite happy with their usual system of counting total medals, but the rest of the world generally did not agree.

For this project, I am using the New York Times weighted point system (4:2:1): gold 4 points, silver 2 points, and bronze 1 point. It is an exponential points system, giving 'Medal Points', that was described in the New York Times in 2008.

df['summer_points'] = df['summer_gold'] * 4
df['summer_points'] += df['summer_silver'] * 2
df['summer_points'] += df['summer_bronze']


df['winter_points'] = df['winter_gold'] * 4
df['winter_points'] += df['winter_silver'] * 2
df['winter_points'] += df['winter_bronze']


df['total_points'] = df['summer_points'] + df['winter_points']
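The 4:2:1 weighting can also be written once and reused for each season. This is just a sketch on toy data; `WEIGHTS` and `medal_points` are hypothetical names, and it assumes the `<prefix>_gold` / `_silver` / `_bronze` column naming used above:

```python
import pandas as pd

# Weights from the 4:2:1 scheme described above.
WEIGHTS = {"gold": 4, "silver": 2, "bronze": 1}


def medal_points(frame, prefix):
    """Weighted points for columns named <prefix>_gold, _silver, _bronze."""
    return sum(frame[f"{prefix}_{medal}"] * w for medal, w in WEIGHTS.items())


# Invented medal counts for two countries.
toy = pd.DataFrame({
    "summer_gold": [2, 0],
    "summer_silver": [1, 3],
    "summer_bronze": [0, 5],
})

points = medal_points(toy, "summer")
print(points.tolist())
```

Keeping the weights in one dictionary makes it easy to try a different scheme (say 3:2:1) without touching the rest of the code.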

It might be interesting to see how well a country performs on average. A country that takes part in more Olympics will have a better chance than a country participating in fewer. Let's create some "average medals" columns to assess performance later on.

df['avg_summer_medals'] = df['summer_total'] / df['summer_participations']
df['avg_winter_medals'] = df['winter_total'] / df['winter_participations']


df['avg_total_medals'] = df['total_total'] / df['total_participation']


cols = ['avg_summer_medals', 'avg_winter_medals', 'avg_total_medals']
df[cols] = df[cols].round(1)
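One thing to watch with these averages: a country that has never been to a Winter Games has zero participations, so the division is 0 / 0. I'm assuming here that such rows exist in the data. A minimal sketch on toy numbers that makes the "never took part" case an explicit NaN:

```python
import numpy as np
import pandas as pd

# Invented rows: the second country has never been to a Winter Games.
toy = pd.DataFrame({
    "winter_total": [30, 0],
    "winter_participations": [10, 0],
})

# Swap 0 participations for NaN so the average is missing rather than 0/0.
participations = toy["winter_participations"].replace(0, np.nan)
toy["avg_winter_medals"] = (toy["winter_total"] / participations).round(1)

print(toy)
```

NaN rows are sorted to the end by `sort_values` by default, so they stay out of any top-N table.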

# Select the summer columns (note the double brackets: a list of column names)
summer = df[
    [
        "countries",
        "ioc_code",
        "summer_participations",
        "summer_gold",
        "summer_silver",
        "summer_bronze",
        "summer_total",
        "summer_points",
        "avg_summer_medals",
    ]
]

summer = summer.sort_values(by="summer_points", ascending=False)


summer.head(10)

# Select the winter columns
winter = df[
    [
        "countries",
        "ioc_code",
        "winter_participations",
        "winter_gold",
        "winter_silver",
        "winter_bronze",
        "winter_total",
        "winter_points",
        "avg_winter_medals",
    ]
]

winter = winter.sort_values(by="winter_points", ascending=False)


winter.head(10)

# Select the combined (summer + winter) columns
total = df[
    [
        "countries",
        "ioc_code",
        "total_participation",
        "total_gold",
        "total_silver",
        "total_bronze",
        "total_total",
        "total_points",
        "avg_total_medals",
    ]
]

total = total.sort_values(by="avg_total_medals", ascending=False)


total.head(10)


summer_top20 = summer.head(20)


plt.figure(figsize = (10,5))
sns.set_style('ticks')
# A single colour goes through `color`; `palette` is meant for several hues
color = '#88292f'
sns.barplot(data=summer_top20, x='countries', y='summer_points', color=color)
plt.xticks(rotation=90)
plt.ylabel('SUMMER POINTS', y=0.9)
plt.xlabel('COUNTRY', x=0.1)
plt.suptitle('Top 20 Countries in Summer Olympics', x=0.35)


plt.show()

winter_top20 = winter.head(20)


plt.figure(figsize = (10,5))
sns.set_style('ticks')
# A single colour goes through `color`; `palette` is meant for several hues
color = '#367fa9'
sns.barplot(data=winter_top20, x='countries', y='winter_points', color=color)
plt.xticks(rotation=90)
plt.ylabel('WINTER POINTS', y=0.9)
plt.xlabel('COUNTRY', x=0.1)
plt.suptitle('Top 20 Countries in Winter Olympics', x=0.35)


plt.show()

# Take the top 20 by total points so the chart matches its title
total_top20 = total.sort_values(by='total_points', ascending=False).head(20)


plt.figure(figsize = (10,5))
sns.set_style('ticks')
# A single colour goes through `color`; `palette` is meant for several hues
color = '#3f2d76'
sns.barplot(data=total_top20, x='countries', y='total_points', color=color)
plt.xticks(rotation=90)
plt.ylabel('TOTAL POINTS', y=0.9)
plt.xlabel('COUNTRY', x=0.1)
plt.suptitle('Top 20 Countries in the Olympics (combined)', x=0.35)


plt.show()

Let's get back to the average number of medals won. We're going to call this a country's 'efficiency rate' for simplicity. Remember, this is calculated by dividing the number of medals won by the number of participations.

top20_avg_medals_total = (
    df[["avg_total_medals", "countries"]]
    .groupby(["countries"])
    .first()
    .sort_values(by="avg_total_medals", ascending=False)
    .head(20)
)


plt.figure(figsize=(10, 5))
sns.set_style("ticks")
color = "#3f2d76"
sns.barplot(
    data=top20_avg_medals_total,
    x=top20_avg_medals_total.index,  # countries became the index after groupby
    y="avg_total_medals",
    color=color,
)
sns.set(font_scale=1)
plt.xticks(rotation=80)
plt.title("Top 20 Countries With A Better Participation/Medals Average")
plt.show()

It's worth noting that Germany has participated in the Games under different names in its history (West Germany, East Germany, United Team of Germany, and Germany). Similarly, Russia has competed as the Unified Team, the Soviet Union, ROC, and Olympic Athletes from Russia.

If we take this into account, we can see that the most 'efficient' countries at winning medals are Russia, USA, Germany, and China.
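For anyone wanting to do that merge in code rather than by eye, here is a rough sketch. The numbers are invented, and the `SUCCESSOR` mapping and the `entity` column are my own hypothetical names. The exact labels in the Kaggle file may differ, so the mapping would need checking against `df['countries']`:

```python
import pandas as pd

# Invented figures, purely to show the mechanics.
toy = pd.DataFrame({
    "countries": ["Germany", "West Germany", "East Germany",
                  "Soviet Union", "Russia", "Norway"],
    "total_participation": [10, 5, 5, 9, 6, 12],
    "total_total": [100, 40, 60, 90, 30, 60],
})

# Hypothetical mapping from historical team names to one modern label.
SUCCESSOR = {
    "West Germany": "Germany",
    "East Germany": "Germany",
    "Soviet Union": "Russia",
}

# Names not in the mapping are left unchanged.
toy["entity"] = toy["countries"].replace(SUCCESSOR)

combined = toy.groupby("entity")[["total_participation", "total_total"]].sum()
combined["avg_total_medals"] = (
    combined["total_total"] / combined["total_participation"]
).round(1)

print(combined.sort_values(by="avg_total_medals", ascending=False))
```

One caveat with this approach: summing participations counts a Games twice when two predecessor teams (for example East and West Germany) both competed in it, which pulls the merged average down.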

That's all for this exploratory data analysis article. Please let me know what you thought!

Until next time,

Pete
