Using machine learning to identify the true stars of the 2022 World Cup

Spoiler: if you're interested only in the results, scroll down to the last section :)

The FIFA World Cup is a highly anticipated event that brings together the best soccer players from around the globe. With so many talented players competing on the world stage, it can be challenging to determine which of them truly peaked in form at the Mundial. In this article, we will use machine learning to identify the most overperforming players of the 2022 FIFA World Cup.

To be clear, identifying the best players isn't hard. You can go to your favorite football stats website and find a similar picture showing the best player in each position.

World Cup Final Stage Best XI

But this is boring. Our goal is to find players whose performance was much higher than expected. By analyzing player performance and comparing it to predictions made before the tournament, we can identify players who exceeded expectations and made a significant impact on the pitch.

To do this, we first need to define what it means for a player to "overperform." In this context, an overperforming player is one who performs significantly better than pre-tournament predictions built from season features and past performance metrics. These players are not necessarily the biggest stars or the most highly paid. By using machine learning to analyze player data and predict performance, we can identify which players truly stood out and exceeded expectations at the FIFA World Cup.

Data Pipeline

The data pipeline we'll use will consist of four main parts:

  • Data collection
  • Data transformation
  • Feature design and engineering
  • Machine learning training

Data Pipeline

Data collection is the process of gathering and storing data from various sources. This can be done manually, through APIs, or using web scraping techniques to extract data from websites. In our case, we'll use the whoscored.com public API. Our data collection pipeline can be represented as a three-layer cake.

First, we collect all tournament-related data about the countries that participated in the last three World Cups. Second, we collect country-related data about players and their stats for their countries during those World Cups. Lastly, we collect all player-related data between 2012 and 2022.

Once the data is collected, it is often necessary to perform data transformation: converting data to a usable format. This involves cleaning the data by removing missing or incorrect values, formatting it correctly, and combining it with other data sets.

In our case, we'll need to filter out data from years outside our range of interest, merge data across different competitions, merge data across the different positions played, and separate World Cup games from non-World Cup games.
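The transformation steps above can be sketched with pandas. This is a minimal illustration on made-up rows, not the article's actual code: the column names (year, competition, position, apps, goals) and the sample values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical raw rows: one row per player, year, competition and position.
raw = pd.DataFrame(
    {
        "player": ["A", "A", "A", "B", "B", "B"],
        "year": [2010, 2021, 2022, 2021, 2022, 2022],
        "competition": ["League", "League", "World Cup", "League", "Cup", "World Cup"],
        "position": ["FW", "FW", "FW", "MF", "DF", "MF"],
        "apps": [30, 28, 5, 20, 4, 3],
        "goals": [12, 15, 2, 3, 0, 1],
    }
)

# 1. Keep only the years of interest (2012-2022, both ends included).
recent = raw[raw["year"].between(2012, 2022)]

# 2. Separate World Cup rows from all other games.
is_wc = recent["competition"] == "World Cup"
world_cup, club = recent[is_wc], recent[~is_wc]

# 3. Merge competitions and positions: one row per player and year,
#    summing the count-like columns.
club_by_year = club.groupby(["player", "year"], as_index=False)[["apps", "goals"]].sum()
```

Summing only works for count-like columns; averaged ("perGame") columns need a weighted merge instead.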

Feature design is the process of selecting and creating the features that will be used to train a machine learning model. Features are the input variables that the model uses to make predictions or decisions. Careful feature design is crucial because it can significantly affect the model's performance.

Most of the features collected are "perGame," meaning they are already averaged over the games played. They need to be corrected before stats can be compared or merged across competitions in which players took part in a different number of games.
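One plausible way to do this correction is to weight each per-game value by the number of games behind it. The sketch below assumes that approach; the helper name combine_per_game and the numbers are hypothetical, and the article does not state the exact formula used.

```python
def combine_per_game(rows):
    """Merge per-game averages from several competitions into one per-game value.

    rows: list of (games_played, per_game_value) tuples.
    """
    total_games = sum(games for games, _ in rows)
    if total_games == 0:
        return 0.0
    # Convert each average back to a total, then divide by all games played.
    total = sum(games * value for games, value in rows)
    return total / total_games


# Hypothetical player: 30 league games at 2.0 shots per game, 5 cup games at 4.0.
rows = [(30, 2.0), (5, 4.0)]
naive = sum(value for _, value in rows) / len(rows)  # plain mean of the two averages
weighted = combine_per_game(rows)                    # (30*2.0 + 5*4.0) / 35
```

The plain mean treats the five-game cup run as if it mattered as much as the thirty-game league season, while the weighted version stays close to the league figure.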

After the data has been collected and transformed, and the features have been designed, the next step is to train a machine learning model. This involves fitting a model to the data and features. The model is then tested on a separate dataset to evaluate its performance. Once the model is trained and tested, it can be used to make predictions or decisions on new data.

In our case, 80% of the data was used for training, with the remaining 20% left for testing. The LazyRegressor from lazypredict was used to compare the performance of multiple models.
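The split-and-compare step can be sketched with scikit-learn alone. This is a hand-rolled stand-in for what LazyRegressor automates across many estimators, not lazypredict's own API, and it runs on synthetic data rather than the article's player table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the player feature table (not the article's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 6.5 + 0.4 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

# Fit every candidate on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mse": mean_squared_error(y_test, pred),
    }
```

Scoring on the held-out 20% rather than on the training rows is what makes the comparison between models meaningful.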

Based on the outputs, several model directions were chosen:

  • LinearRegression
  • RandomForestRegressor
  • GradientBoostingRegressor
  • LinearSVR
  • CatBoostRegressor

As the goal is to find an explainable linear correlation between predicted and true values, using the r2_score and mean_squared_error metrics makes perfect sense. Based on performance, the GradientBoostingRegressor, RandomForestRegressor, LinearRegression, and LinearSVR models were chosen to predict players' performance at the 2022 FIFA World Cup.
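For reference, these are the standard definitions of the two metrics. The symbols are notation introduced here, not taken from the article: y_i is a player's actual rating, \hat{y}_i the predicted rating, \bar{y} the mean of the actual ratings, and n the number of players in the test set.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
```

An R^2 of 1 means the predictions match the actual ratings exactly, while 0 means the model does no better than always predicting the mean rating.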

Tree methods allow accessing feature importance out of the box. In machine learning, feature importance refers to the relative importance of each feature in a model. It is a measure of how much each feature contributes to the model's ability to make predictions or decisions.

Feature importance plots

As seen from the plots, the most important features are pre-season rating and ranking. This makes sense, as these features reflect the player's latest form.

Another fairly important feature is the ratio of clearances made, followed by other characteristics such as age, the number of offsides won, pass accuracy, etc. The feature importance plots allow us to peek at the machine learning decision-making process. Since the importances look plausible, we can proceed to the results.
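As a rough illustration of how such importances are read from a tree model, here is a sketch using scikit-learn's feature_importances_ attribute on synthetic data. The feature names are borrowed from the article for readability, but the data is constructed so that only the first column drives the target; nothing here reproduces the article's plots.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; passAccuracy is a hypothetical column name.
feature_names = ["ratingWeighted", "age", "passAccuracy"]
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only column 0 matters

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: non-negative values that sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because these importances are normalized to sum to 1, they rank features within a single model but are not directly comparable across different models.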

Results

Now we can move to the section where we predict players' performance. We have their pre-World Cup form (ratingWeighted) and generate four predicted values, one from each model (pred_linreg, pred_rf, pred_svr, pred_gbr). To display and sort the values, we'll also introduce the difference column (dif) between the actual rating during the 2022 WC (rating_wc) and the mean of the predicted values (pred_mean). We'll also calculate the ratio rating_wc/pred_mean as the ratio column.
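The derived columns can be computed in a few lines of pandas. The column names follow the article; the two players and their numbers are made up for the example.

```python
import pandas as pd

# Made-up numbers for two hypothetical players.
df = pd.DataFrame(
    {
        "player": ["Player A", "Player B"],
        "pred_linreg": [6.8, 7.2],
        "pred_rf": [6.9, 7.1],
        "pred_svr": [6.7, 7.3],
        "pred_gbr": [7.0, 7.2],
        "rating_wc": [7.6, 6.6],
    }
)

pred_cols = ["pred_linreg", "pred_rf", "pred_svr", "pred_gbr"]
df["pred_mean"] = df[pred_cols].mean(axis=1)      # average of the four model predictions
df["dif"] = df["rating_wc"] - df["pred_mean"]     # positive means overperformance
df["ratio"] = df["rating_wc"] / df["pred_mean"]   # above 1 means overperformance

# Overperformers first, underperformers last.
ranking = df.sort_values("dif", ascending=False).reset_index(drop=True)
```

Here Player A is predicted at 6.85 and rated 7.6 (dif of 0.75), while Player B is predicted at 7.2 and rated 6.6 (dif of -0.6), so A sorts to the top.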

Before going to the top overperformers, here is a list of the top 10 underperformers of the 2022 FIFA World Cup.

Top underperforming players @ FIFA WC 2022

Our top 20 overperforming true stars are displayed in the table below:

Top overperforming players @ FIFA WC 2022

Let's make a team from our top 11 players:

Team of the top 11 overperforming players

It is quite unusual to see five players from France on the list, as they were the defending champions. Maybe I'll do a separate analysis of team performance as well.

Last but not least, one may want to analyze why a particular prediction was made for a specific player. To answer this question, we'll use the SHAP Python package. Let's view the results for Lionel Messi:

SHAP explanation for Lionel Messi
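To give a feel for what such an explanation contains, here is a simplified sketch of additive feature attribution for a linear model, where each feature's contribution is its weight times the player's deviation from the dataset average. This is not the shap package itself, and the weights, baseline, and player values are all hypothetical.

```python
# Hypothetical linear model: rating = intercept + sum(weight * feature).
weights = {"ratingWeighted": 0.8, "age": -0.02, "passAccuracy": 0.01}
intercept = 1.5

# Hypothetical dataset averages (the "baseline" player) and one player's features.
baseline = {"ratingWeighted": 6.8, "age": 27.0, "passAccuracy": 80.0}
player = {"ratingWeighted": 7.9, "age": 35.0, "passAccuracy": 85.0}


def predict(features):
    """Evaluate the linear model on a dict of feature values."""
    return intercept + sum(weights[name] * features[name] for name in weights)


# Contribution of each feature: weight times deviation from the baseline.
contributions = {
    name: weights[name] * (player[name] - baseline[name]) for name in weights
}

# Additivity: the contributions sum to the gap between this player's
# prediction and the prediction for the baseline player.
gap = predict(player) - predict(baseline)
```

A SHAP plot shows the same kind of breakdown: a baseline prediction plus per-feature pushes up or down that add up to the final predicted value. For tree models the contributions are computed differently, but they obey the same additivity.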

I'm not ready to explain how each player improved enough to make the list (or failed to make it). I'll try to find time to dig deeper into the topic, although I can't promise it.

Conclusion

Using machine learning to identify the true stars of the 2022 World Cup has the potential to provide valuable insights and predictions for fans, analysts, and fantasy footballers. Although this is not a pure machine learning problem, data science approaches are well suited to analyzing players' performance against expectations.

We've built an end-to-end data pipeline that collects and transforms data, engineers features, and trains ML models to find the players whose performance was higher than expected. Of course, we could have built a feature such as "Last WC ever for Lionel Messi," but it's great to notice his commitment to the World Cup and probably spot a few future talents.

Here is the final list of the top 20 overperforming football players at the Qatar World Cup 2022: Bruno Fernandes, Wojciech Szczesny, Jean-Charles Castelletto, Bukayo Saka, Paik Seung-Ho, Jamal Musiala, Mohammed Kudus, Eduardo Camavinga, Lionel Messi, Jude Bellingham, Enzo Ebosse, Kylian Mbappé, Antoine Griezmann, Richarlison, Serge Gnabry, Casemiro, Ritsu Doan, Adrien Rabiot, Randal Kolo Muani, Enner Valencia

Tolga Kurtuluş

Revenue Management Specialist | Flight Data Analyst | Instructor at Turkish Airlines

1 year ago

Hi Ivan Reznikov, thank you for such valuable work! I just read it all and noticed that the adjusted R2 value is very low in terms of explaining the dependent variable, where it's nominal and the decision is based only on the difference. Also, the reason five French players made the top 11 formation might be their previous trophy and these players' new appearance at WC2022.

Stanislav Filippov

Data Science and AI | I like fine-tuning deep learning models for fun

1 year ago

It's all cool, but where's the code?
