Using machine learning to identify the true stars of the 2022 World Cup
Ivan Reznikov
PhD, Principal Data Scientist || O'Reilly Book Author || TEDx/PyCon/GITEX Speaker || University Lecturer || LangChain, Large Language Models (LLMs) and Generative AI || 30K+ followers
Spoiler: if you're only interested in the results, scroll down to the last section :)
The FIFA World Cup is a highly anticipated event that brings together the best soccer players from around the globe. With so many talented players competing on the world stage, it can be challenging to determine which players truly peaked in form for the Mundial. In this article, we will use machine learning to identify the most overperforming players of the FIFA World Cup.
To be clear, identifying the best players isn't hard. One can go to their favorite football stats website and find a graphic showing the best player in each position.
But this is boring. Our goal is to find players whose performance was much higher than expected. By analyzing player performance and comparing it to predictions made before the tournament, we can identify players who exceeded expectations and significantly impacted the pitch.
To do this, we first need to define what it means for a player to "overperform." In this context, an overperforming player is one who performs significantly better than pre-tournament predictions, which are built from season features and past performance metrics. These players may not necessarily be the biggest stars or the most highly paid. By using machine learning to analyze player data and predict performance, we can identify which players truly stood out and exceeded expectations at the FIFA World Cup.
Data Pipeline
The data pipeline we'll use will consist of four main parts:
Data collection is the process of gathering and storing data from various sources. This can be done manually, through APIs, or using web scraping techniques to extract data from websites. In our case, we'll use the whoscored.com public API. Our data collection pipeline can be represented as a 3-layer cake.
First, we collect tournament-related data about all countries that participated in the last three World Cups. Second, we collect country-related data about the players and their stats for their national teams during those World Cups. Last, we collect all player-related data between 2012 and 2022.
Once the data is collected, it is often necessary to perform data transformation: converting data to a usable format. This involves cleaning the data by removing missing or incorrect values, formatting it correctly, and combining it with other data sets.
In our case, we'll need to filter out data from years outside our range of interest, merge data across different competitions, merge data across the different positions played, and split World Cup games from non-World Cup games.
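The transformation steps listed above can be sketched in pandas. This is a minimal illustration on toy rows, not the article's actual code: the column names (player, season, competition, position, apps, goals) and all values are assumptions, and the real whoscored.com fields may differ.

```python
import pandas as pd

# Toy rows; every column name and value is an assumption for illustration.
raw = pd.DataFrame({
    "player": ["A", "A", "A", "A", "B", "B"],
    "season": [2010, 2021, 2021, 2022, 2021, 2022],
    "competition": ["League", "League", "Cup", "World Cup", "League", "World Cup"],
    "position": ["FW", "FW", "AM", "FW", "GK", "GK"],
    "apps": [30, 20, 5, 7, 38, 3],
    "goals": [12, 9, 1, 4, 0, 0],
})

# 1. Drop seasons outside the years of interest (2012-2022).
df = raw[raw["season"].between(2012, 2022)]

# 2. Split World Cup rows from all other games.
is_wc = df["competition"].eq("World Cup")

# 3. Merge competitions and positions: one row per player and season,
#    summing the count-type columns.
keys = ["player", "season"]
counts = ["apps", "goals"]
club = df[~is_wc].groupby(keys, as_index=False)[counts].sum()
world_cup = df[is_wc].groupby(keys, as_index=False)[counts].sum()

print(club)
print(world_cup)
```

Only count-type columns are summed here; per-game columns need a weighted treatment, which belongs to the feature design step.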
Feature design is selecting and creating features that will be used to train a machine learning model. Features are the input variables that the model uses to make predictions or decisions. Careful feature design is crucial because it can significantly affect the model's performance.
Most of the collected features are "perGame" values, meaning they are already averaged over the games played. They need to be corrected before we can compare stats from different competitions in which a player appeared in a different number of games.
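One plausible way to make per-game values comparable is to weight them by games played: multiply each per-game value by appearances to recover a total, sum the totals per player, and divide by the total appearances. The sketch below assumes hypothetical columns (apps, shotsPerGame, rating). The name ratingWeighted is used later in the article, but that it was computed exactly this way is an assumption.

```python
import pandas as pd

# Toy per-competition stats; column names and values are illustrative assumptions.
stats = pd.DataFrame({
    "player": ["A", "A", "B"],
    "competition": ["League", "Cup", "League"],
    "apps": [30, 10, 20],
    "shotsPerGame": [2.0, 4.0, 1.5],
    "rating": [7.0, 7.8, 6.9],
})
per_game_cols = ["shotsPerGame", "rating"]

# Turn each per-game value back into a total by multiplying by games played.
totals = stats[per_game_cols].mul(stats["apps"], axis=0)
totals["player"] = stats["player"]
totals["apps"] = stats["apps"]

# Sum totals per player, then divide by total appearances:
# this is an appearance-weighted mean of the per-game values.
summed = totals.groupby("player").sum()
weighted = summed[per_game_cols].div(summed["apps"], axis=0).add_suffix("Weighted")

print(weighted)
```

For player A, a plain mean of the two rating rows would give 7.4, while the weighted value is 7.2, because the 30 league games count three times as much as the 10 cup games.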
After the data has been collected and transformed and the features have been designed, the next step is to train a machine learning model. This involves fitting a model to the data and features. We then test the model on a separate dataset to evaluate its performance. Once the model is trained and tested, it can be used to make predictions or decisions on new data.
In our case, 80% of the data was used for training and 20% was held out for testing. The LazyRegressor from lazypredict was used to compare the performance of multiple models.
Based on the outputs, several model directions were chosen:
As the goal is to find an explainable linear correlation between predicted and true values, using the r2_score and mean_squared_error metrics makes perfect sense. Based on performance, the GradientBoostingRegressor, RandomForestRegressor, LinearRegression, and LinearSVR models were chosen to predict players' performance at the 2022 FIFA World Cup.
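The split and the four chosen models can be sketched with scikit-learn alone. This is an illustration on synthetic data from make_regression, not the article's dataset, and the hyperparameters and random seeds are assumptions. The LazyRegressor screening step is left out so the sketch depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic stand-in for the player feature table (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four model families named in the article, with assumed settings.
models = {
    "linreg": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": LinearSVR(max_iter=10000, random_state=0),
    "gbr": GradientBoostingRegressor(random_state=0),
}

# Fit each model and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }

for name, s in scores.items():
    print(f"{name}: r2={s['r2']:.3f}, mse={s['mse']:.1f}")
```

The dictionary keys mirror the prediction column names used in the Results section (pred_linreg, pred_rf, pred_svr, pred_gbr).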
Tree methods allow accessing feature importance out of the box. In machine learning, feature importance refers to the relative importance of each feature in a model. It is a measure of how much each feature contributes to the model's ability to make predictions or decisions.
As is seen from the plots, the most important features are pre-season rating and ranking. It makes perfect sense, as such features reflect the latest form of the player.
Another fairly important feature is the ratio of clearances made, followed by other characteristics such as age, the number of offsides won, pass accuracy, etc. The feature importance plots allow us to peek at the machine learning decision-making process. As the importances look plausible, we can proceed with the machine learning results.
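For readers who want to reproduce this kind of plot, the sketch below shows how a tree ensemble exposes feature importance in scikit-learn through the feature_importances_ attribute. The data is synthetic and the target is deliberately constructed to depend mostly on the pre-tournament rating, so the ranking here illustrates the mechanism only and is not evidence about real players. Column names other than ratingWeighted are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic features; names and distributions are illustrative assumptions.
X = pd.DataFrame({
    "ratingWeighted": rng.normal(6.8, 0.4, n),
    "age": rng.integers(18, 38, n),
    "passAccuracy": rng.uniform(60, 95, n),
})

# Synthetic target driven mostly by the pre-tournament rating (by construction).
y = 0.9 * X["ratingWeighted"] + 0.002 * X["passAccuracy"] + rng.normal(0, 0.05, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Calling importance.plot.barh() would give a bar chart similar in spirit to the plots discussed above.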
Results
Now we can move to the section where we predict players' performance. We have their pre-World Cup form (ratingWeighted) and generate four predicted values, one from each model (pred_linreg, pred_rf, pred_svr, pred_gbr). To display and sort the values, we'll also introduce the difference column (dif) between the actual rating during the 2022 WC (rating_wc) and the mean of the predicted values (pred_mean). We'll also calculate the ratio rating_wc/pred_mean as the ratio column.
Before going to the top overperformers, here is a list of the top 10 underperformers of the 2022 FIFA World Cup.
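The derived columns described above can be computed in a few lines of pandas. The column names come from the article; the three players and all numbers are made up for illustration.

```python
import pandas as pd

# Toy values only; real ratings and predictions are not reproduced here.
results = pd.DataFrame({
    "player": ["A", "B", "C"],
    "rating_wc": [7.8, 6.5, 7.0],
    "pred_linreg": [7.0, 6.9, 7.0],
    "pred_rf": [7.1, 6.8, 6.9],
    "pred_svr": [6.9, 7.0, 7.1],
    "pred_gbr": [7.0, 6.9, 7.0],
})
pred_cols = ["pred_linreg", "pred_rf", "pred_svr", "pred_gbr"]

# Mean of the four model predictions for each player.
results["pred_mean"] = results[pred_cols].mean(axis=1)

# Positive dif (or ratio above 1) means the player overperformed.
results["dif"] = results["rating_wc"] - results["pred_mean"]
results["ratio"] = results["rating_wc"] / results["pred_mean"]

# Overperformers first, underperformers last.
ranked = results.sort_values("dif", ascending=False)
print(ranked[["player", "rating_wc", "pred_mean", "dif", "ratio"]])
```

Sorting by dif in ascending order instead would put the underperformers at the top.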
Our Top20 overperforming true stars are displayed in the table below:
Let's make the team from our top 11 players:
It is quite unusual to see five players from France on the list, as they were the defending champions. Maybe I'll do another analysis of team performance as well.
Last but not least, one may want to analyze why a particular prediction was made for a specific player. To answer this question, we'll use the SHAP Python package. Let's view the results for Lionel Messi:
I'm not ready to answer how this or that player improved to make the list (or not make it). I'll try to find time to dig deeper into the topic, although I can't promise it.
Conclusion
Using machine learning to identify the true stars of the 2022 World Cup has the potential to provide valuable insights and predictions for fans, analysts, and fantasy footballers. Although this is not a pure machine learning problem, data science approaches are well placed to analyze players' performance and expectations.
We've built an end-to-end data pipeline that collects and transforms the data, engineers features, and trains an ML model to find the players whose performance was higher than expected. Of course, we could build a feature such as "Last WC ever for Lionel Messi," but it's great to notice his commitment to the World Cup and probably spot a few future talents.
Here is the final list of the top 20 overperforming football players at the Qatar 2022 World Cup: Bruno Fernandes, Wojciech Szczesny, Jean-Charles Castelletto, Bukayo Saka, Paik Seung-Ho, Jamal Musiala, Mohammed Kudus, Eduardo Camavinga, Lionel Messi, Jude Bellingham, Enzo Ebosse, Kylian Mbappé, Antoine Griezmann, Richarlison, Serge Gnabry, Casemiro, Ritsu Doan, Adrien Rabiot, Randal Kolo Muani, Enner Valencia
Revenue Management Specialist | Flight Data Analyst | Instructor at Turkish Airlines
1 year ago
Hi Ivan Reznikov, thank you for such valuable work! I just read it all and noticed that the adjusted R2 value is very low in terms of the explainability of the dependent variable, where it's nominal and the decision is based only on the difference. Also, the reason five French players made the top 11 formation might be their previous trophy and these players' new appearance at WC2022.
Data Science and AI | I like fine-tuning deep learning models for fun
1 year ago
It's all cool, but where's the code?