Why Germany did not defeat Brazil in the final, or Data Science lessons from the World Cup
2018 FIFA World Cup Final

This article is based on a KDnuggets blog jointly written with Dan Clark.

The 2018 World Cup is over, with France defeating Croatia 4-2 in the final. It was a great match to end a brilliant tournament, and the French were deserved winners.

Before the tournament, KDnuggets (and many others) published predictions, which generally had Germany vs Brazil in the final.


Fig. 1: Expected World Cup 2018 brackets, with Germany vs Brazil in the final, as predicted by KDnuggets before the start of the tournament.

We correctly predicted 13 of the 16 teams in the last 16 (81.25%); only Poland, Germany and Egypt missed out, with Japan, Sweden and hosts Russia taking their places.

At the quarter-final stage, 4 of the 8 teams were correctly predicted (50%). Only one of the 4 semi-finalists (France) was correct, and we were 0 out of 2 for the final.

The other analysts also got it wrong. The FiveThirtyEight predictions had Brazil (19%), Spain (17%) and Germany (13%) all ahead of France (8%) as the most likely winners. Gracenote’s predictions had the same three sides, and even Argentina, ahead of France. Predicting the World Cup is difficult.

Lessons

So why did everyone get it so wrong? Here are some Data Science lessons:

Human aspect

Human behaviour contains a lot of randomness, so using data science to predict it is difficult and offers limited accuracy. One particular example from the World Cup is French goalkeeper Lloris's mistake, which led to Croatia's second goal in the final. Something like this is impossible to predict; the same goes for own goals and other errors, which simply come down to human behaviour.

External factors

Sport in general involves a lot of external factors that can affect the results. In football (soccer), for example, the result may be affected by an unfair referee, adverse weather conditions, the climate, the players' personal lives and much more. It is very tricky to factor in these features, as they can be difficult to measure and collect.

Individual Events

Predicting the results of the entire tournament requires predicting all the separate matches, and randomness tends to aggregate. The knockout nature of the World Cup makes it harder to predict, as a single defeat can send a team home.
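
To see how quickly per-match errors compound, here is a minimal back-of-the-envelope sketch in Python. The 70% per-match accuracy is an assumed, illustrative figure, not something measured from our predictions:

```python
# Illustrative only: assume each knockout match is predicted
# correctly with 70% probability (an assumed, generous figure).
per_match_accuracy = 0.70

# Picking the champion means being right about that team in the
# round of 16, the quarter-final, the semi-final and the final.
p = 1.0
for stage in ["Round of 16", "Quarter-final", "Semi-final", "Final"]:
    p *= per_match_accuracy
    print(f"Still correct after the {stage}: {p:.1%}")

# Ends near 24%: per-match errors aggregate multiplicatively.
```

Even a predictor that is right 70% of the time on a single match would name the champion correctly less than a quarter of the time.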

Group behaviour

Predicting sports with individual competition, like baseball or chess, is easier than predicting team events.

Data science has limited accuracy when predicting group behaviour. Because team composition in soccer is changing all the time, we cannot draw many conclusions from a team's performance 4 years ago to predict the same team's performance today (and what that team did 20 years ago has very little relevance).

Uncertainty range

Every prediction has a range of uncertainty. For example, if we toss a fair coin 1000 times, then from the binomial distribution (or the normal approximation to the binomial) we can predict that the number of heads will be between 469 and 531 with 95% confidence.
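
As a quick check on those numbers, here is a minimal sketch in Python using scipy; the exact interval comes straight from the Binomial(1000, 0.5) distribution, with the normal approximation shown alongside it:

```python
import math
from scipy.stats import binom

n, p = 1000, 0.5                 # 1000 tosses of a fair coin
mean = n * p                     # expected number of heads: 500
sd = math.sqrt(n * p * (1 - p))  # standard deviation: ~15.8

# Normal approximation: mean +/- 1.96 standard deviations
print(f"approximate 95% interval: {mean - 1.96 * sd:.0f} to {mean + 1.96 * sd:.0f}")

# Exact central 95% interval of the binomial distribution
lower, upper = binom.interval(0.95, n, p)
print(f"exact 95% interval: {lower:.0f} to {upper:.0f}")
# Both give roughly 469 to 531 heads.
```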

However, very few analysts do their predictions with sufficient rigor to determine and present confidence intervals. If you see a prediction about a very uncertain event where the range of uncertainty is not given, can you trust the prediction?

Rules

With all that in mind, here are our three golden rules for knowing when to trust predictions:

  • If there are mathematical laws (e.g. for games of chance like a fair coin or dice) or physical laws (for example in astronomy, where the positions of the planets can be predicted very precisely).
  • If there is a lot of data on the same type of entity. Note that the Brazil team of 2010 is not the same as the Brazil team of 2018.
  • If the predictions include a range of uncertainty, which usually indicates good work with solid statistical foundations. When only a single number is provided without a standard deviation, it is probably more for entertainment, and you shouldn't trust it.

Conclusion

This experience highlights how limited data science can be when predicting something governed by human behaviour. It is abundantly clear that these predictions are more for entertainment, or to give a rough estimate, than an exact science.

Mohammed LEADI

Actuarial & Quantitative Analyst

6 yrs

Data science allows us to make a decision with a confidence interval, but it doesn't guarantee 100% of future behavior.

Mohammed LEADI

Actuarial & Quantitative Analyst

6 yrs

Let's check it.

Julian Angulo Abril

Data Scientist / Machine Learning Engineer

6 yrs

Probably they analyzed the wrong data. As Albrecht Zimmeran says, chance is a defining factor in football scores, so the analysis must include data that may be related to the "chance generation".


I would add how strongly soccer is affected by chance in the first place because it is so low-scoring. In basketball, with (more or less) 100 possessions per match, most of which lead to a basket, chance is much lower. And even basketball is *strongly* affected by chance (https://ceur-ws.org/Vol-1970/paper-09.pdf, if I may plug myself). In lower-scoring games such as American football, ice hockey, and soccer, it gets progressively worse.

Professor Ravi Vadlamani

Head, Center for AI/ML (formerly Center of Excellence in Analytics), Institute for Development and Research in Banking Technology

6 yrs

Brilliant analysis of the failed predictions!
