Clustering Football Players by Using FIFA 19 Data
A comprehensive machine learning project to explore football skills and cluster football players based on their attributes. The source code of this project can be accessed via Kaggle.
Introduction
FIFA 19 is a football video game developed by Electronic Arts and released on September 28th, 2018 on platforms such as PC, PlayStation and Xbox [1]. To provide a realistic and immersive game experience, Electronic Arts models its digital football players as closely as possible on their real-world skills. Each player is rated, mostly between 0 and 100, on features that characterize his playing skills, such as shot power, short passing and tackling. According to an interview with Michael Mueller-Moehring, one of the Electronic Arts producers in charge of rating players in FIFA games, player abilities are determined by a network of more than 9000 data reviewers (coaches, professional scouts and football fans who actively visit stadiums to watch footballers play) together with approximately 300 Electronic Arts editors, who help Mueller-Moehring assign rating values to the players in the game by using the data reviewers’ feedback and stats from real football matches [2].
The data generated for FIFA 19, with nearly 18,000 players and millions of data points, forms a rich dataset from which data scientists can extract insights. The aim of this project is to cluster football players playing in forward positions such as strikers, forwards and wingers in FIFA 19 based on their skills and attributes. Fulfilling this aim to some extent can contribute to the following areas:
Data
The FIFA 19 dataset used in this project is publicly available at Kaggle [3].
The original dataset has been filtered in terms of features and samples in line with the scope and aim of the project. The relevant features are selected at the beginning, and players who do not play in forward positions according to FIFA’s criteria are eliminated. Some of the features are used only in the Exploratory Data Analysis (preprocessing) and Results stages to get a better understanding of the data, determine the most distinctive features for players and discuss the outcomes. Other features are used in Modelling as well as in exploration and discussion. The features are sorted below by category. In the scope of this project, the features from Name up to Position are called personal information, the features from ST up to LW are called positional skills, and the rest are called playing skills. The explanations for the features in the dataset are obtained from [4]. The attributes used in the project are:
Fig. 1. Positions in football [5]
Exploratory Data Analysis
Exploratory data analysis is conducted to extract insights from the data and to see the relationship between variables, and these findings are used in the Modelling stage.
First of all, there is no missing data in the dataset.
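The filtering and missing-value check described so far can be sketched as below. This is a minimal illustration, not the project's actual code: the small DataFrame is a made-up stand-in for the Kaggle CSV, and the names `players`, `forwards` and `FORWARD_POSITIONS` are assumptions introduced here.

```python
import pandas as pd

# Made-up stand-in for the Kaggle file; the real data would be loaded from the CSV.
players = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Position": ["ST", "GK", "LW", "CB", "CF"],
    "Finishing": [85, 12, 74, 30, 80],
    "Dribbling": [78, 15, 88, 40, 84],
})

# Forward positions as grouped in this article (striker, forward and winger groups).
FORWARD_POSITIONS = ["ST", "LS", "RS", "CF", "LF", "RF", "LW", "RW"]

# Keep only players whose Position is one of the forward positions.
forwards = players[players["Position"].isin(FORWARD_POSITIONS)].reset_index(drop=True)

# Count missing values per column; all zeros means no missing data.
missing_per_column = forwards.isna().sum()
print(forwards["Position"].tolist())
print(int(missing_per_column.sum()))
```

On the real dataset the same two steps would run on the loaded CSV instead of the toy frame.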
The count plot in Fig. 2 shows the number of players in each position: most of the forward players are strikers, followed by wingers and a smaller number of center, left and right forwards.
Fig. 2. Count plot for Position
The pair plot of positional skills such as ST, CF, etc. demonstrates that a player’s general ability in other positions is highly positively correlated with his general ability in his own position. For example, a player in the ST position with a high ST skill is also good in other forward positions such as CF and RW. The correlation within the groups (striker group: ST, LS, RS; forward group: CF, RF, LF; winger group: RW, LW) is approximately 1. The skill similarity between wingers and forwards is also high, with low variance. The largest difference is between striker skills and winger skills, which show higher variance. Besides, all positional skill ratings are close to a normal distribution.
Fig. 3. Pair plot of positional skills
The box plot below carries the analysis one step further while supporting the findings in Fig. 3. Fig. 4a shows that the median RS skill of players in the LS position is higher than that of players in the RS position. The same holds for players in the LF position. A similar relationship is seen for the CF skill in Fig. 4b. The conclusion is that a player’s playing position does not matter much; the skill ratings are more important. Positions alone cannot be used for grouping or clustering players because the result would be incomplete. For example, if a manager looks for a new striker and checks only players in the ST, RS or LS positions in this dataset, he or she will miss players in other positions, such as CF or LF, who also have high striker skills and could be successful as strikers. Clustering based on all skills prevents this situation. So, the Position variable is used only in preprocessing, whereas the positional and playing skills are used in the modelling in this project.
Fig. 4. (a) RS skill vs Position; (b) CF skill vs Position
The correlation between positional skills and playing skills is in line with the responsibilities expected from the positions as explained above. For example, strikers are expected to have high skills in finishing, shot power and ball control, and the heatmap below demonstrates that ST, LS, and RS are highly positively correlated with Finishing, ShotPower, LongShots, Positioning, Reactions and Composure. Similarly, CF, RF and LF are highly correlated with BallControl, Dribbling, ShortPassing, Positioning, Reactions and Composure. Lastly, RW and LW have high correlation with Crossing, Dribbling, ShortPassing, BallControl and Vision. This analysis allows us to determine the important skills in football in terms of positions. Again, as explained for Fig. 3, positional skills within the groups are highly correlated. Winger and forward positional skills are also very similar, with a correlation of 0.97, followed by forward and striker skills with a correlation of 0.93. So, forward players resemble both strikers and wingers, like a bridge between them in terms of both location on the pitch and skills. In addition, defensive skills such as Interception, Marking, SlidingTackle and StandingTackle are highly correlated with each other. The most negative correlations are between Height and Balance and between Balance and Stamina, which suggests that players’ height affects their Balance negatively, and that players with high balance put a lot of effort into controlling the ball while running and carrying it and get tired quickly.
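The correlation analysis behind the heatmap can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (random numbers made up here, not FIFA 19 ratings), only four columns are used, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: ST is built to depend on Finishing, LW on Crossing.
rng = np.random.default_rng(0)
n = 200
finishing = rng.normal(65, 10, n)
crossing = rng.normal(60, 10, n)
df = pd.DataFrame({
    "Finishing": finishing,
    "Crossing": crossing,
    "ST": 0.9 * finishing + rng.normal(0, 3, n),
    "LW": 0.9 * crossing + rng.normal(0, 3, n),
})

# Pearson correlation matrix of all columns (the pandas default).
corr = df.corr()

# Slice out positional skills (rows) against playing skills (columns).
cross = corr.loc[["ST", "LW"], ["Finishing", "Crossing"]]
print(cross.round(2))
```

A matrix like `corr` is what a heatmap such as Fig. 5 visualizes, for example with `seaborn.heatmap`.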
In addition to the relationship between positional skills and playing skills, the relationship between playing skills and positions is examined. For that purpose, only players with an Overall attribute higher than 65 are kept, to exclude players with relatively low skills. Then, these high-skill players are grouped by their Position and the median value of each playing skill is calculated for each position. The top 5 attributes for each position are shown in Fig. 6, Fig. 7, and Fig. 8. For example, the highest median values for the ST position are obtained in Strength, SprintSpeed, ShotPower, Jumping and Finishing. This analysis again allows us to identify the crucial skills in terms of positions. As a result, the top features for the striker group are SprintSpeed, Strength, ShotPower, Positioning, Finishing and Jumping. For the forward group, they are Agility, Balance, Acceleration, Dribbling and BallControl. For wingers, they are Acceleration, Agility, SprintSpeed, Balance and Dribbling.
Based on the two analyses above, the distinctive playing skills are ShotPower, Positioning, Finishing and Strength for the striker group; ShortPassing, BallControl, Positioning, Acceleration, SprintSpeed, Agility, Dribbling and Balance for the forward group; and Crossing, LongPassing, ShortPassing, Acceleration, SprintSpeed, Agility, Dribbling and Balance for the winger group. Again, the winger and forward group features are similar.
The same analysis is repeated to find the 5 features with the lowest median values for each position. As a result, the defensive skills Interception, Marking, SlidingTackle and StandingTackle are the features at the bottom. An example for the ST position can be seen in Fig. 9.
The relationship between discrete variables such as WeakFoot, AttackWorkRate and DefenseWorkRate and the Overall attribute shows that players who can use their weak foot very well have a high overall rating. Besides, players with high overall skills put in a high work effort during the game, both in attack and defense. Interestingly, some players with high skill ratings put in low effort, but this does not change the fact that they are talented and add value to their teams. The median Overall values for each level of these discrete variables can be seen in Fig. 10.
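The filter, group and rank steps of this analysis can be sketched with pandas as below. The ratings are invented for illustration and only six skills are included; the names `players`, `medians` and `top5` are assumptions, not taken from the project's source code.

```python
import pandas as pd

# Made-up ratings for six players; the third ST has Overall 60 and is filtered out.
players = pd.DataFrame({
    "Position":    ["ST", "ST", "ST", "LW", "LW", "ST"],
    "Overall":     [80, 72, 60, 75, 70, 68],
    "Strength":    [85, 80, 60, 60, 62, 78],
    "SprintSpeed": [80, 76, 60, 88, 84, 74],
    "ShotPower":   [82, 78, 60, 70, 68, 75],
    "Finishing":   [84, 75, 60, 69, 71, 72],
    "Jumping":     [78, 74, 60, 58, 60, 70],
    "Crossing":    [50, 45, 90, 80, 78, 48],
})
skills = ["Strength", "SprintSpeed", "ShotPower", "Finishing", "Jumping", "Crossing"]

# Keep players with Overall above 65, then take the median of each skill per position.
high_skill = players[players["Overall"] > 65]
medians = high_skill.groupby("Position")[skills].median()

# For every position, list the five skills with the highest median.
top5 = {pos: row.nlargest(5).index.tolist() for pos, row in medians.iterrows()}
print(top5["ST"])
print(top5["LW"])
```

Replacing `nlargest` with `nsmallest` gives the lowest-median skills, as in the Fig. 9 analysis.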
Fig. 5. Heatmap
Fig. 6. Top five skills for striker group (a) ST; (b) RS; (c) LS
Fig. 7. Top five skills for forward group (a) CF; (b) RF; (c) LF
Fig. 8. Top five skills for winger group (a) RW; (b) LW
Fig. 9. The lowest five skills for ST position
Fig. 10. Overall vs. (a) WeakFoot; (b) AttackWorkRate; (c) DefenseWorkRate
Table 1 shows that AttackWorkRate is positively related to the median forward and winger playing skills such as Acceleration, SprintSpeed, Balance, Dribbling and Agility, so these skills require more effort during the match in terms of energy and work. On the other hand, striker skills such as ShotPower, Finishing, Positioning, BallControl and Strength require less effort, but more intelligence and natural talent. The defensive counterpart of this analysis can be seen in Table 2. As expected, forward players with higher defensive skills put in more defensive effort on the pitch. In addition, winger and forward positional skills have a higher correlation with defensive skills, as seen in the heatmap in Fig. 5. So, we can say that winger and forward players take on more defensive responsibilities than strikers. As both analyses show, since work rate depends on a player’s physical endurance, players with high Stamina are able to sustain a higher work rate.
Table 1. AttackWorkRate and Playing Skills
Table 2. DefenseWorkRate and Playing Skills
Preferred Foot is also a distinctive attribute to be used in clustering. In some positions, the right or left side of the position is not directly related to the players’ preferred foot. For example, left-footed players in the RF position have a higher RF skill than right-footed players in the same position. Similarly, right-footed players in the LW position have a higher LW skill than left-footed players in the same position.
Fig. 12 shows the median Overall and Potential ratings versus age, and it seems that players reach their potential at around 26-27 years old.
Fig. 11. Box plot showing positions skills based on PreferredFoot (a) RF; (b) LW
Fig. 12. Overall and Potential vs. Age
Modelling
The findings obtained from Exploratory Data Analysis are used to determine which features will be inputs to the model and how. The features selected as distinctive and useful for clustering are all the features explained above except Name, Overall and Position.
K-Means clustering and hierarchical clustering with average linkage and with Ward’s method are applied to the dataset as clustering algorithms. The K-Means algorithm clusters data by trying to separate the samples into K groups of equal variance, minimizing a distance-based criterion known as the inertia (within-cluster sum of squared errors); it also requires the number of clusters to be specified [6]. Since K-Means has problems when clusters have different sizes or densities and when the dataset has outliers, hierarchical clustering methods, which are less susceptible to outliers and noise, are chosen for comparison with K-Means. Hierarchical clustering is a clustering approach that creates clusters by using either a bottom-up technique (agglomerative) or a top-down technique (divisive). Average linkage and Ward’s method are both bottom-up techniques: at the beginning, every data point is treated as a separate cluster, and then the two nearest clusters in terms of similarity are merged repeatedly until all the data points form a single cluster [7]. The similarity used in these techniques is distance-based. In average linkage, the average of the pairwise distances between the points of two clusters is used, whereas in Ward’s method the similarity of two clusters is based on the increase in squared error when the two clusters are merged. There is no need to define the number of clusters at the beginning of hierarchical clustering.
Since all methods are distance-based, normalization should be applied to the dataset to prevent high-magnitude features from dominating the distance calculations. Data standardization is a specific type of data normalization that brings the features onto a comparable scale: it removes the mean of each feature and scales the feature to unit variance [8]. Standardization works best when the values of each attribute are approximately normally distributed. However, it can easily be influenced by outliers, because both the mean and the variance are sensitive to extreme values, so its results can be misleading when the dataset contains outliers. In such cases, it is better to use an approach that is robust against outliers, such as robust standardization, which removes the median of each feature and scales it by the interquartile range, i.e., the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile) [9]. Since the FIFA 19 dataset contains outliers in some of its features, both standardization and robust standardization are applied to the dataset and compared to see the effect of the outliers.
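The difference between the two scaling methods can be illustrated on a single made-up feature column with one outlier. The sketch below writes both formulas out in NumPy. The function names are hypothetical, and the project itself presumably relied on scikit-learn, whose `StandardScaler` and `RobustScaler` apply the same transforms with their default settings.

```python
import numpy as np

# Made-up ratings for one feature; 20.0 plays the role of an outlier.
x = np.array([60.0, 62.0, 64.0, 66.0, 68.0, 20.0])

def standardize(values):
    """Remove the mean and scale to unit variance."""
    return (values - values.mean()) / values.std()

def robust_scale(values):
    """Remove the median and scale by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)

z = standardize(x)
r = robust_scale(x)
print(np.round(z, 2))
print(np.round(r, 2))
```

The outlier pulls the mean and inflates the standard deviation used by `standardize`, while the median and interquartile range used by `robust_scale` are barely affected by it.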
Fig. 13. Features with outliers (a) Finishing; (b) ShortPassing
The number of dimensions can be reduced by applying Principal Component Analysis (PCA), especially to highly correlated features. PCA derives components that explain the highest variance, that is most of the information in the data, with a smaller number of attributes. As explained in Exploratory Data Analysis, positional skills are highly correlated within each group. So, PCA is used to obtain a Striker component from ST, RS and LS, a Forward component from CF, RF and LF, and a Wing component from RW and LW. The explained variance ratio is 100% for each of these PCA applications with both standardization methods. The same method is also used for the defensive skills Marking, Interceptions, SlidingTackle and StandingTackle, because these features are highly correlated and one component obtained from them, called Defensive, is enough to represent a forward player’s defensive skills together with DefenseWorkRate. The explained variance ratio is 70.3% for standardization and 71.6% for robust standardization.
In order to determine the number of clusters, inertia is calculated for K values from 1 to 9 with the K-Means algorithm. Besides, the Silhouette Coefficient is calculated for K values from 2 to 6. The Silhouette Coefficient for a set of samples is the average of the Silhouette Coefficients of the individual samples, each computed with the formula in Fig. 14. The results for each standardization method can be seen in Fig. 15 and Table 3. According to the Elbow method and the Silhouette Coefficient values, the number of clusters is set to 4 for the K-Means algorithm, and robust standardization is chosen as the normalization method for all algorithms.
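Collapsing a group of highly correlated columns into one component can be sketched with scikit-learn’s `PCA` as below. The three columns are synthetic stand-ins for a positional skill group such as ST, LS and RS, and the names `striker_block` and `striker_component` are assumptions made for this example. In the project, PCA is applied to the scaled features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for three nearly identical positional skill columns.
rng = np.random.default_rng(0)
n = 300
base = rng.normal(65, 8, n)
striker_block = np.column_stack([
    base,
    base + rng.normal(0, 0.5, n),
    base + rng.normal(0, 0.5, n),
])

# Keep a single principal component for the whole group.
pca = PCA(n_components=1)
striker_component = pca.fit_transform(striker_block)

# Share of the group's total variance captured by that one component.
ratio = pca.explained_variance_ratio_[0]
print(striker_component.shape, round(ratio, 3))
```

The closer the columns are to being copies of each other, the closer the explained variance ratio gets to 1, which is consistent with the 100% reported for the positional groups.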
Fig. 14. Formula for (a) Inertia; (b) Silhouette Coefficient for one sample [10]
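For readers without access to the figure, the two quantities can be written out as follows. These are the standard definitions as given in the scikit-learn documentation cited in [10] and [6]. The symbols are assumed to match the figure: $x_i$ is a sample, $C$ is the set of cluster centroids $\mu_j$, $a$ is the mean distance between a sample and all other points in its own cluster, and $b$ is the mean distance between the sample and all points in the nearest other cluster.

```latex
% (a) Inertia: within-cluster sum of squared distances to the nearest centroid
\[
\text{Inertia} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^{2}
\]

% (b) Silhouette Coefficient for one sample
\[
s = \frac{b - a}{\max(a, b)}
\]
```

The coefficient lies between -1 and 1, and values close to 1 indicate that a sample is much closer to its own cluster than to the neighbouring one.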
Fig. 15. Inertia vs. K for K-Means
Table 3. Silhouette Scores and K values in K-Means with different normalization methods
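The search for the number of clusters can be sketched as below, assuming scikit-learn. The data is synthetic (four well-separated blobs generated with `make_blobs`), so the numbers it produces have nothing to do with Table 3. Only the K ranges, 1 to 9 for inertia and 2 to 6 for the Silhouette Coefficient, follow the text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data standing in for the scaled player features.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

# Elbow method: inertia for K = 1..9.
inertias = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Mean Silhouette Coefficient over all samples for K = 2..6.
silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)

# K with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Plotting `inertias` against K gives an elbow curve like Fig. 15. The silhouette scores offer a second opinion when the elbow is ambiguous.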
After selecting the number of clusters for the K-Means algorithm and the normalization method for all algorithms, the clustering methods (K-Means, hierarchical clustering with average linkage and Ward’s method) are compared using the Silhouette Coefficient again as the performance metric. The results can be seen in Table 4. According to the results, the K-Means algorithm with 4 clusters is selected as the final model.
Table 4. Silhouette Scores and K values with different algorithms (with robust standardization)
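A comparison of this kind can be sketched with scikit-learn as below. The data is again synthetic, so the scores do not reproduce Table 4, and the dictionary keys are labels chosen for this example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data standing in for the robustly scaled player features.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

# The three algorithms compared in the article, each asked for 4 clusters.
models = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=0),
    "Average linkage": AgglomerativeClustering(n_clusters=4, linkage="average"),
    "Ward": AgglomerativeClustering(n_clusters=4, linkage="ward"),
}

# Fit each model and score its labels with the mean Silhouette Coefficient.
scores = {name: silhouette_score(X, model.fit_predict(X)) for name, model in models.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On clean, well-separated data like this all three methods tend to agree. Differences of the kind reported in Table 4 show up on real data with overlapping groups and outliers.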
Results
The characteristics of each cluster are examined in this part. First of all, the clusters contain similar numbers of players. As expected, each cluster has players from every position, since players are grouped according to their abilities and attributes, not their positions. However, there are more strikers in Cluster 0 than in any other cluster, and Cluster 1 has more wingers and forwards. Fig. 16 shows the number of players in each cluster and their distribution by position (Striker=ST, RS, LS; Forward=CF, RF, LF; Wing=RW, LW).
Fig. 16. Count plot of (a) clusters; (b) clusters in terms of positions
Table 5 shows the average Overall, Age and Potential for each cluster. Fig. 17 gives more detail about these features with a pair plot. From this information, it is understood that Cluster 3 includes highly skilled and more mature players, while Cluster 2 includes low-skilled mature players as well as very young players with high potential. Cluster 0 and Cluster 1 show similar characteristics on these features, except that Cluster 1 has younger players than Cluster 0. To differentiate Cluster 0 and Cluster 1 further and to get a better understanding of the abilities of each cluster, the average distinctive playing skills are examined for each cluster in the radar plots in Fig. 18 and Fig. 19. The first radar plot shows the average distinctive playing skills for every cluster, whereas the second shows them only for Cluster 0 and Cluster 1. For nearly every skill, the highest averages, that is the most skilled players, are in Cluster 3 and the lowest are in Cluster 2. Interestingly, the average distinctive playing skills for strikers, such as ShotPower, Finishing, Positioning and Strength, are higher for Cluster 0 than for Cluster 1, while the average distinctive playing skills for wingers and forwards, such as Acceleration, Agility, Dribbling, SprintSpeed, LongPassing, ShortPassing and Crossing, are higher for Cluster 1 than for Cluster 0. The same result is obtained with the Striker, Forward and Wing features produced by PCA, as shown in Table 6. So, it can be concluded that Cluster 0 consists of players with mostly striker-related skills and Cluster 1 consists of players with mostly forward-related and winger-related skills. This differentiation is independent of the players’ positions, but it is directly related to their positional and playing skills.
Besides, the differences in the Striker and Wing features between the two clusters are larger than the difference in the Forward feature, because forward position players are like a bridge between striker and winger position players, and forward positional skills are highly correlated with both striker and winger positional skills, as can be seen in the heatmap in Fig. 5.
Table 5. Mean Overall, Age and Potential in each cluster
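A per-cluster summary such as Table 5 can be produced with a single group-by once each player carries a cluster label. The sketch below uses made-up ratings and labels, not the project's values, and the column name `Cluster` is an assumption.

```python
import pandas as pd

# Made-up players with hypothetical cluster labels from a fitted model.
clustered = pd.DataFrame({
    "Cluster":   [0, 0, 1, 1, 2, 2, 3, 3],
    "Overall":   [70, 72, 69, 71, 58, 60, 84, 86],
    "Age":       [27, 29, 23, 25, 19, 33, 28, 30],
    "Potential": [72, 74, 75, 77, 70, 62, 86, 88],
})

# Mean Overall, Age and Potential for each cluster.
profile = clustered.groupby("Cluster")[["Overall", "Age", "Potential"]].mean()
print(profile)
```

The same pattern, with more columns selected, yields the average playing skills per cluster that feed the radar plots.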
Fig. 17. Pair plot based on clusters
Fig. 18. Radar plot for each cluster
Fig. 19. Radar plot for Cluster 0 and Cluster 1
Table 6. Average Striker, Forward and Wing skills with respect to each cluster
The other average playing skills, the work rates and the Defensive feature produced by PCA can be seen for each cluster in the following tables. The differences between the clusters are in line with the previous findings. For example, the averages of skills that are more related to striker positions and highly correlated with the Striker positional skills, such as HeadingAccuracy, Reactions, Jumping and Aggression, are higher for Cluster 0. In contrast, Curve and Vision, which are more forward-related and winger-related skills, are higher for Cluster 1. DefenseWorkRate and the Defensive skill are higher for Cluster 1, which is in line with the previous results about defensive skills and positions.
Table 7. Average playing skills
Table 8. Average work rates and defensive skills
Fig. 20 gives a closer look at Cluster 3, which has highly skilled players, and Cluster 2, which has low-skilled players as well as very young players who have low ratings right now but high potential for the future. Cluster 3 includes top players such as Lionel Messi, Cristiano Ronaldo, Neymar Jr, Eden Hazard, etc. In addition to them, there are some players with a lower Overall rating, close to that of the other clusters. When these players are explored in detail, it can be seen that they are specialized, with very high ratings in some playing skills, which is why they are clustered with the top players in this model although they do not have high Overall ratings. For example, D. Braaten, who has an Overall rating of 67, has a Strength rating of 88. Similarly, J. Berget, who has an Overall rating of 69, has a Stamina rating of 93. Lastly, Nino, who has an Overall rating of 70 and is 38 years old, which can be counted as old for football, has a Composure rating of 83 and a Balance rating of 80. The model allows us to find these special players, who could be missed by traditional filtering methods or by clustering methods that take the Overall rating as input.
Fig. 20. Pair plot based on Cluster 2 and Cluster 3
Conclusion
This project presents a clustering method for football players who play in forward positions, based on their skills and attributes, using the FIFA 19 dataset, which comes from a video game but provides a realistic representation of football players. By exploring the data, the most distinctive features for players are determined for use in the modelling. Standardization and feature extraction methods are applied to the dataset to prepare it for modelling. K-Means and hierarchical clustering algorithms are compared, and the K-Means algorithm with 4 clusters is selected as the most appropriate model. Forward players are grouped as 1) highly skilled, elite players and players who are highly specialized in some skills, 2) players with mostly striker-related skills, 3) players with mostly forward-related and winger-related skills, 4) low-skilled players and high-potential young players.
References
Cover photo: Photo by JESHOOTS.COM on Unsplash
[1] C. Vaz, “FIFA 19 Review – Just Football Perfection,” wccftech.com, Oct. 2, 2018. [Online]. Available: https://wccftech.com/review/fifa-19-review-its-football-perfection/. [Accessed: Jan. 10, 2021].
[2] S. Saed, “EA explains how FIFA player ratings are calculated,” vg247.com, Sep. 27, 2016. [Online]. Available: https://www.vg247.com/2016/09/27/how-ea-calculates-fifa-17-player-ratings/. [Accessed: Jan. 8, 2021].
[3] FIFA 19 Complete Player Dataset, 2018. [Online]. Available: https://www.kaggle.com/karangadiya/fifa19
[4] FIFA Encyclopedia. [Online]. Available: https://www.fifplay.com/encyclopedia/
[5] Position. [Online]. Available: https://www.fifplay.com/encyclopedia/position/
[6] K-means. [Online]. Available: https://scikit-learn.org/stable/modules/clustering.html#k-means
[7] S. Kaushik, “An Introduction to Clustering and different methods of clustering,” analyticsvidhya.com, Nov. 3, 2016. [Online]. Available: https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/. [Accessed: Jan. 11, 2021].
[8] Compare the Effect of Different Scalers on Data with Outliers. [Online]. Available: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
[9] “StandardScaler, MinMaxScaler and RobustScaler Techniques – ML,” geeksforgeeks.org, Jul. 16, 2020. [Online]. Available: https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/. [Accessed: Jan. 15, 2021].
[10] Clustering Performance Evaluation. [Online]. Available: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation