Using machine learning to identify the true stars of the 2022 World Cup

Ivan Reznikov

PhD, Principal Data Scientist || O'Reilly Book Author || TEDx/PyCon/GITEX Speaker || University Lecturer || LangChain, Large Language Models (LLMs) and Generative AI || 30K+ followers

Published: December 18, 2022

Spoiler: if you're only interested in the results, scroll down to the last section :)

The FIFA World Cup is a highly anticipated event that brings together the best soccer players from around the globe. With so many talented players competing on the world stage, it can be challenging to determine which players truly peaked in form for the Mundial. In this article, we will use machine learning to identify the most overperforming players of the FIFA World Cup.

To be clear, identifying the best players isn't hard. One can go to a favorite football stats website and find a picture like this one, showing the best player per position.

[Image: World Cup Final Stage Best XI]

But this is boring. Our goal is to find players whose performance was much higher than expected. By analyzing player performance and comparing it to predictions made before the tournament, we can identify players who exceeded expectations and significantly impacted the pitch.

To do this, we first need to define what it means for a player to "overperform." In this context, an overperforming player is one who performs significantly better than pre-tournament predictions, which are built from season features and past performance metrics. These players may not necessarily be the biggest stars or the most highly paid. By using machine learning to analyze player data and predict performance, we can identify which players truly stood out and exceeded expectations at the FIFA World Cup.

Data Pipeline

The data pipeline we'll use will consist of four main parts:

Data collection
Data transformation
Feature design and engineering
Machine learning training

Data collection is the process of gathering and storing data from various sources. This can be done manually, through APIs, or using web scraping techniques to extract data from websites. In our case, we'll use the whoscored.com public API. Our data collection pipeline can be represented as a 3-layer cake.

First, we collect tournament-related data about all countries that participated in the last three World Cups. Second, we collect country-related data about the players and their stats for their countries during those World Cups. Last, we collect all player-related data for 2012-2022.

Once the data is collected, it is often necessary to perform data transformation: converting data to a usable format. This involves cleaning the data by removing missing or incorrect values, formatting it correctly, and combining it with other data sets.

In our case, we'll need to filter out data from years outside our range of interest, merge data across different competitions, merge data across different positions played, and separate World Cup games from non-World Cup games.
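As a rough illustration of the filtering and splitting steps, here is a minimal pandas sketch. The table and its column names (player, year, competition, apps, rating) are invented for the example and are not the real whoscored.com schema.

```python
import pandas as pd

# Hypothetical rows: one row per player, season and competition.
# Column names are assumptions made for this sketch only.
raw = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "year": [2010, 2021, 2022, 2022, 2022],
    "competition": ["League", "League", "World Cup", "League", "World Cup"],
    "apps": [30, 28, 5, 20, 3],
    "rating": [6.8, 7.1, 7.6, 6.9, 6.5],
})

# 1. Keep only the seasons of interest (2012-2022 in this article).
in_range = raw[raw["year"].between(2012, 2022)]

# 2. Separate World Cup games from all other games.
is_wc = in_range["competition"].eq("World Cup")
world_cup = in_range[is_wc].reset_index(drop=True)
club_form = in_range[~is_wc].reset_index(drop=True)

print(world_cup)
print(club_form)
```

Keeping World Cup rows in their own frame makes it harder to leak the target (World Cup rating) into the pre-tournament features by accident.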

Feature design is selecting and creating features that will be used to train a machine learning model. Features are the input variables that the model uses to make predictions or decisions. Careful feature design is crucial because it can significantly affect the model's performance.

Most of the features collected are "perGame," meaning they are already averaged over the games played. These aggregates need to be corrected before we can compare stats from different competitions in which a player appeared in a different number of games.
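One plausible way to make such a correction is to weight each per-game value by the number of appearances before merging competitions. The sketch below assumes that approach; the helper weighted_per_game and the column names are hypothetical, not the exact code behind this article.

```python
import pandas as pd

# Hypothetical per-game stats for two players across competitions.
stats = pd.DataFrame({
    "player": ["A", "A", "B"],
    "competition": ["League", "Cup", "League"],
    "apps": [30, 10, 20],
    "rating": [7.0, 8.0, 6.5],
    "passes_per_game": [40.0, 60.0, 30.0],
})

per_game_cols = ["rating", "passes_per_game"]


def weighted_per_game(df, cols, weight="apps"):
    """Appearance-weighted mean of per-game columns, one row per player."""
    # per-game value * games played = total for that competition
    weighted = df[cols].mul(df[weight], axis=0)
    weighted["player"] = df["player"]
    weighted[weight] = df[weight]
    totals = weighted.groupby("player").sum()
    # total across competitions / total games = corrected per-game value
    return totals[cols].div(totals[weight], axis=0)


merged = weighted_per_game(stats, per_game_cols)
print(merged)
```

For player A, a plain average of the two ratings would give 7.5, while the appearance-weighted value is 7.25, because the 30 league games count three times as much as the 10 cup games.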

After the data has been collected and transformed, and the features have been designed, the next step is to train a machine learning model. This involves using the data and features to fit a model. We then test the model on a separate dataset to evaluate its performance. Once the model is trained and tested, it can be used to make predictions or decisions on new data.

In our case, 20% of the data was left for testing, with 80% used for training purposes. The LazyRegressor from lazypredict was used to observe multiple model performances.

Based on the outputs, several model directions were chosen:

LinearRegression
RandomForestRegressor
GradientBoostingRegressor
LinearSVR
CatBoostRegressor

As the goal is to find an explainable linear correlation between predicted and true values, using the r2_score and mean_squared_error metrics makes perfect sense. Based on performance, the GradientBoostingRegressor, RandomForestRegressor, LinearRegression, and LinearSVR models were chosen to predict players' performance at the 2022 FIFA World Cup.
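The split-and-compare step can be sketched with scikit-learn alone. This is not the LazyRegressor run described above, only a hand-rolled stand-in on synthetic data: the features, target, and hyperparameters below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic stand-in for player features and World Cup rating.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 6.8 + 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=300)

# 80% for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linreg": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
    "svr": LinearSVR(random_state=0, max_iter=10000),
}

# Fit every model and score it on the held-out 20%.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }

for name, s in scores.items():
    print(f"{name}: r2={s['r2']:.3f} mse={s['mse']:.4f}")
```

Scoring every model on the same held-out rows keeps the comparison fair; the short names mirror the pred_linreg, pred_rf, pred_svr and pred_gbr columns used in the results.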

Tree methods allow accessing feature importance out of the box. In machine learning, feature importance refers to the relative importance of each feature in a model. It is a measure of how much each feature contributes to the model's ability to make predictions or decisions.

As seen from the plots, the most important features are pre-season rating and ranking. This makes perfect sense, as such features reflect the latest form of the player.

Another fairly important feature is the ratio of clearances made, followed by other characteristics such as age, the number of offsides won, pass accuracy, etc. The feature importance plots allow us to peek at the machine learning decision-making process. Since the importances look sensible, we can proceed to the machine learning results.
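For readers who want to reproduce this kind of ranking, here is a minimal sketch of reading importances from a fitted tree ensemble in scikit-learn. The data is synthetic and the three feature names are placeholders; only the feature_importances_ attribute is the real scikit-learn API.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the first feature drives the target by construction.
rng = np.random.default_rng(0)
feature_names = ["ratingWeighted", "age", "pass_accuracy"]  # placeholder names
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; scikit-learn normalizes them to sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor high-cardinality features, so it can be worth cross-checking them with permutation importance before drawing conclusions.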

Results

Now we can move to the section where we predict players' performance. We have their pre-World Cup form (ratingWeighted) and generate four predicted values, one from each model (pred_linreg, pred_rf, pred_svr, pred_gbr). To display and sort the values, we'll also introduce the difference column (dif) between the actual rating during the 2022 WC (rating_wc) and the mean of the predicted values (pred_mean). We'll also calculate the ratio rating_wc/pred_mean as the ratio column.
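The derived columns can be computed as in the sketch below. The column names follow the ones used in this section, while the three players and their numbers are made up for illustration.

```python
import pandas as pd

# Made-up ratings and predictions for three hypothetical players.
preds = pd.DataFrame({
    "player": ["A", "B", "C"],
    "rating_wc": [7.8, 6.4, 7.0],
    "pred_linreg": [7.0, 6.9, 7.0],
    "pred_rf": [7.1, 7.0, 6.9],
    "pred_svr": [6.9, 7.1, 7.1],
    "pred_gbr": [7.0, 7.0, 7.0],
})

pred_cols = ["pred_linreg", "pred_rf", "pred_svr", "pred_gbr"]

# Mean of the four model predictions per player.
preds["pred_mean"] = preds[pred_cols].mean(axis=1)
# Positive dif = played better than predicted (overperformed).
preds["dif"] = preds["rating_wc"] - preds["pred_mean"]
# ratio > 1 also means overperformance, on a relative scale.
preds["ratio"] = preds["rating_wc"] / preds["pred_mean"]

overperformers = preds.sort_values("dif", ascending=False)
underperformers = preds.sort_values("dif")

print(overperformers[["player", "rating_wc", "pred_mean", "dif", "ratio"]])
```

Sorting by dif in descending order gives the overperformers, and sorting in ascending order gives the underperformers.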

Before getting to the top overperformers, here is a list of the Top 10 underperformers of the 2022 FIFA World Cup.

Our Top 20 overperforming true stars are displayed in the table below:

Let's make the team from our top 11 players:

It is quite unusual to see five players from France on the list, as they were the defending champions. Maybe I'll do another analysis of team performance as well.

Last but not least, one may want to analyze why a particular prediction was made for a specific player. To answer this question, we'll use the SHAP Python package. Let's view the results for Lionel Messi:

I'm not ready to answer how this or that player improved to make the list (or not make it). I'll try to find time to dig deeper into the topic, although I can't promise it.

Conclusion

Using machine learning to identify the true stars of the 2022 World Cup has the potential to provide valuable insights and predictions for fans, analysts, and fantasy footballers. Despite not being a pure machine learning problem, data science approaches can be well placed to analyze players' performance and expectations.

We've built an end-to-end data pipeline to collect, transform, and analyze data, engineer features, and build an ML model to find the players whose performance was higher than expected. Of course, we could build a feature such as "Last WC ever for Lionel Messi," but it's great to notice his commitment to the World Cup and probably spot a few future talents.

Here is the final list of the Top 20 overperforming football players at the Qatar World Cup 2022: Bruno Fernandes, Wojciech Szczesny, Jean-Charles Castelletto, Bukayo Saka, Paik Seung-Ho, Jamal Musiala, Mohammed Kudus, Eduardo Camavinga, Lionel Messi, Jude Bellingham, Enzo Ebosse, Kylian Mbappé, Antoine Griezmann, Richarlison, Serge Gnabry, Casemiro, Ritsu Doan, Adrien Rabiot, Randal Kolo Muani, Enner Valencia

#machinelearning #datascience #linearregression #randomforest #gradientboosting #shap #artificialintelligence #fifaworldcup #fifaworldcup2022 #fifaworldcupqatar2022 #fifa2022 #fifaqatar2022 #fifawc2022 #worldcup2022 #worldcup #lionelmessi #messi #mbappe

Tolga Kurtulu?

Revenue Management Specialist | Flight Data Analyst | Instructor at Turkish Airlines

1 year ago

Hi Ivan Reznikov, thank you for such valuable work! I just read it all and noticed that the adjusted r2 value is very low in terms of the explainability of the dependent variable, where it's nominal and the decision is based only on the difference. Also, the reason 5 French players made the top 11 formation might be their previous trophy and these players' new appearance at WC2022.

Stanislav Filippov

Data Science and AI | I like fine-tuning deep learning models for fun

1 year ago

It's all cool, but where's the code?
