EDA of Olympic Medals using Python

EDA of Olympic Medals using Python

Continuing my learning journey of Python (and tools) in data analysis, I found a good dataset on Kaggle that has Olympics medals data of each participating country since 1896.?

The information on the data reads:

The modern Olympic Games or Olympics are the leading international sporting events featuring summer and winter sports competitions in which thousands of athletes from around the world participate in a variety of competitions. The Olympic Games are considered the world's foremost sports competition with more than 200 nations participating. The Olympic Games are normally held every four years, and since 1994, has alternated between the Summer and Winter Olympics every two years during the four-year period.

So, let's begin:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.patches as mpatches
from matplotlib.pyplot import figure
import matplotlib.mlab as mlab
import scipy.stats
import seaborn as sns        


df = pd.read_csv('olympics_medals.csv')        


df.info()

RangeIndex: 156 entries, 0 to 15
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   countries              156 non-null    object
 1   ioc_code               156 non-null    object
 2   summer_participations  156 non-null    int64 
 3   summer_gold            156 non-null    object
 4   summer_silver          156 non-null    int64 
 5   summer_bronze          156 non-null    int64 
 6   summer_total           156 non-null    object
 7   winter_participations  156 non-null    int64 
 8   winter_gold            156 non-null    int64 
 9   winter_silver          156 non-null    int64 
 10  winter_bronze          156 non-null    int64 
 11  winter_total           156 non-null    int64 
 12  total_participation    156 non-null    int64 
 13  total_gold             156 non-null    object
 14  total_silver           156 non-null    int64 
 15  total_bronze           156 non-null    int64 
 16  total_total            156 non-null    object
dtypes: int64(11), object(6)
memory usage: 20.8+ KB5        

Removing the comma from the the numerical string data and changing to int64 format:

df['summer_gold'] = df['summer_gold'].str.replace(',','').astype('int64')
df['summer_total'] = df['summer_total'].str.replace(',','').astype('int64')
df['total_gold'] = df['total_gold'].str.replace(',','').astype('int64')
df['total_total'] = df['total_total'].str.replace(',','').astype('int64')        

Stripping the ioc code from its brackets:

df['ioc_code'] = df['ioc_code'].str.replace('(', '')
df['ioc_code'] = df['ioc_code'].str.replace(')', '')        

Checking correlation across the data points:

plt.figure (figsize = (18,12))
sns.set_style('ticks')


corr = df.corr()


sns.heatmap (data = corr, annot = True, fmt= " .2g", linewidth = 2)
sns.set (font_scale = 0)
plt.show ()        
No alt text provided for this image

This correlation map shows a strong correlation between countries that participate and countries that win the most medals. ?There's also quite a strong correlation between total summer participations and total winter participations (0.65). ?There's a stronger correlation between number of winter participations & medals won at the winter games (0.53), than there is between total summer participations & medals won at the summer games (0.34).

Taking this further, let's see if there's a test the hypothesis that the more Summer Olympics you participate in, the more you will win medals

f, axs = plt.subplots(2, 2, figsize=(10, 10), gridspec_kw=dict(width_ratios=[4, 4]))
sns.set_style('ticks')
sns.scatterplot( data=df, x="summer_participations", y="summer_gold", ax=axs[0,0])
sns.scatterplot( data=df, x="summer_participations", y="summer_silver", ax=axs[0,1])
sns.scatterplot( data=df, x="summer_participations", y="summer_bronze", ax=axs[1,0])
sns.scatterplot( data=df, x="summer_participations", y="summer_total", ax=axs[1,1])
f.tight_layout()        
No alt text provided for this image

Looking at the scatterplots above, it doesn't show any strong correlation.

Counting the number of medals won at the Olympics may not be the fairest system of determining the most successful countries. For example, at the 2008 Olympic Games in Beijing, the USA finished second in the gold count to China but were ahead of them in the total medal count. This created quite a bit of discussion about which ranking system should be used. The Americans were obviously quite happy with their usual system of counting total medals, but the rest of the world generally did not agree.

For this project, I am using the New York Times weighted point system (4:2:1) — gold 4 points, silver 2 points, and bronze 1 point. An exponential points system giving ' Medal Points' described in the New York Times in 2008.


df['summer_points'] = df['summer_gold'] * 4
df['summer_points'] += df['summer_silver'] * 2
df['summer_points'] += df['summer_bronze']


df['winter_points'] = df['winter_gold'] * 4
df['winter_points'] += df['winter_silver'] * 2
df['winter_points'] += df['winter_bronze']


df['total_points'] = df['summer_points'] + df['winter_points']        

It might be interesting to see how well a country performs on average. ?A country that takes part in more Olympics will have a better chance than a country participating in less. ?Let's create some "average medals" columns to assess performance later on.


df['avg_summer_medals'] = df['summer_total'] / df['summer_participations']
df['avg_winter_medals'] = df['winter_total'] / df['winter_participations']


df['avg_total_medals'] = df['total_total'] / df['total_participation']


cols = ['avg_summer_medals', 'avg_winter_medals', 'avg_total_medals']
df[cols] = df[cols].round(1)        


summer = df
? ? [
? ? ? ? "countries",
? ? ? ? "ioc_code",
? ? ? ? "summer_participations",
? ? ? ? "summer_gold",
? ? ? ? "summer_silver",
? ? ? ? "summer_bronze",
? ? ? ? "summer_total",
? ? ? ? "summer_points",
? ? ? ? "avg_summer_medals",
? ? ]

summer = summer.sort_values(by="summer_points", ascending=False)


summer.head(10)

        
No alt text provided for this image
winter = df
? ? [
? ? ? ? "countries",
? ? ? ? "ioc_code",
? ? ? ? "winter_participations",
? ? ? ? "winter_gold",
? ? ? ? "winter_silver",
? ? ? ? "winter_bronze",
? ? ? ? "winter_total",
? ? ? ? "winter_points",
? ? ? ? "avg_winter_medals",
? ? ]

winter = winter.sort_values(by="winter_points", ascending=False)


winter.head(10)

        
No alt text provided for this image
total = df
? ? [
? ? ? ? "countries",
? ? ? ? "ioc_code",
? ? ? ? "total_participation",
? ? ? ? "total_gold",
? ? ? ? "total_silver",
? ? ? ? "total_bronze",
? ? ? ? "total_total",
? ? ? ? "total_points",
? ? ? ? "avg_total_medals",
? ? ]

total = total.sort_values(by="avg_total_medals", ascending=False)


total.head(10)

        
No alt text provided for this image

summer_top20 = summer.head(20)


plt.figure(figsize = (10,5))
sns.set_style('ticks')
color = ['#88292f']
sns.barplot (data=summer_top20, x='countries', y='summer_points', palette=color)
plt.xticks(rotation=90)
plt.ylabel('SUMMER POINTS', y=0.9)
plt.xlabel('COUNTRY', x=0.1)
plt.suptitle('Top 20 Countries in Summer Olympics', x=0.35)


plt.show()        
No alt text provided for this image


winter_top20 = winter.head(20)


plt.figure(figsize = (10,5))
sns.set_style('ticks')
color = ['#367fa9']
sns.barplot (data=winter_top20, x='countries', y='winter_points', palette=color)
plt.xticks(rotation=90)
plt.ylabel('WINTER POINTS', y=0.9)
plt.xlabel('COUNTRY', x=0.1)
plt.suptitle('Top 20 Countries in Winter Olympics', x=0.35)


plt.show()        
No alt text provided for this image


total_top20 = total.head(20).sort_values(by='total_points', ascending=False)


plt.figure(figsize = (10,5))
sns.set_style('ticks')
color = ['#3f2d76']
sns.barplot (data=total_top20, x='countries', y='total_points', palette=color)
plt.xticks(rotation=90)
plt.ylabel('TOTAL POINTS', y=0.9)
plt.xlabel('COUNTRY', x=0.1)
plt.suptitle('Top 20 Countries in the Olympics (combined)', x=0.35)


plt.show()        
No alt text provided for this image

Let's get back to the average of medals won. ?We're going to call this a country's 'efficiency rate' for simplicity. ?Remember, this is calculated by dividing the number of participations by the number of medals won.


top20_avg_medals_total = (
? ? df[["avg_total_medals", "countries"]]
? ? .groupby(["countries"])
? ? .first()
? ? .sort_values(by="avg_total_medals", ascending=False)
? ? .head(20)
)


plt.figure(figsize=(10, 5))
sns.set_style("ticks")
color = ["#3f2d76"]
sns.barplot(
? ? data=top20_avg_medals_total,
? ? x=top10_avg_medals_total.index,
? ? y="avg_total_medals",
? ? palette=color,
)
sns.set(font_scale=1)
plt.xticks(rotation=80)
plt.title("Top 20 Countries With A Better Participation/Medals Average")
plt.show()        
No alt text provided for this image

It's worth noting that Germany has participated in the Games as different names in their history (West Germany, East Germany, United Team of Germany, and Germany). ?Similarly, Russia has competed as Unified Team, Soviet Union, ROC, and Olympic Athletes from Russia. ?

If we take this into account, we can see that the most 'efficient' countries at winning medals are Russia, USA, Germany, and China.

That's all for this exploratory data analysis article. Please let me know what you thought!

Until next time,

Pete

要查看或添加评论,请登录

Peter Nicholson的更多文章

社区洞察

其他会员也浏览了