FIFA 22 Dataset: Principal Component and Clustering Analysis

" The goal is to turn data into information, and information into insights."
- Carly Fiorina

This article explores the ways in which one can derive patterns from high-dimensional and unlabelled data (data that has only features and no target variable) in a simple and beginner-friendly manner. For this, I have considered a very interesting (and vast) dataset taken from Kaggle.

If you are a gamer or a football lover, you probably know about the FIFA football simulation video game, which releases a new installment every year.

This dataset has 16155 instances and 105 attributes. These attributes mostly describe each player's characteristics, like team, nationality, net worth, physical description, playing patterns (goalkeeping, pace, shooting, defense, etc.), player positions, and so on. Here I focused on finding insights related to all players' physical and playing attributes.

Since it is high-dimensional data (data with many attributes), I first carried out PCA (Principal Component Analysis) for dimensionality reduction. This reduces computational time, filters out noise, and makes the data easier to visualize.

Since the data is unlabelled, I used the k-means clustering algorithm to cluster players with similar physical and playing characteristics.

But before doing all this interesting stuff, let us load the data and clean it!

# importing required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import statsmodels.api as sm
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
%matplotlib inline


# Importing dataset from Google Drive

from google.colab import drive
drive.mount('/content/drive/')
input_excel = pd.read_excel("/content/drive/MyDrive/Career Mode player datasets - FIFA 15-22.xlsx")

# Converting file type from Excel to CSV for convenience

input_excel.to_csv("input_conv.csv", index=None, header=True)
df = pd.DataFrame(pd.read_csv("input_conv.csv"))

Data Preprocessing -

  1. Since PCA can mostly be carried out on continuous variables, we will filter out all 'object' datatype attributes. Then we drop all those columns which are not relevant to our analysis.


# creating a dataframe having only numeric values

df_con = df.select_dtypes([np.number])

# removing unwanted attributes

df_con.drop(['sofifa_id', 'club_contract_valid_until', 'nation_team_id', 'nation_jersey_number',
             'release_clause_eur', 'value_eur', 'wage_eur', 'club_jersey_number', 'club_team_id',
             'nationality_id', 'league_level', 'international_reputation'], axis=1, inplace=True)

2. I found that some categorical variables, such as 'work_rate', 'preferred_foot', and 'body_type', are important, so I decided to keep them by encoding them as numeric values.

 
# encoding relevant categorical attributes

work_rates = {'Low/Low': 0, 'Medium/Low': 1, 'Low/Medium': 1, 'Medium/Medium': 2, 'High/Low': 2,
              'Low/High': 2, 'High/Medium': 3, 'Medium/High': 3, 'High/High': 3}
df['work_rate'] = df['work_rate'].map(work_rates)

preferred_foot = {'Left': 0, 'Right': 1}
df['preferred_foot'] = df['preferred_foot'].map(preferred_foot)

body_types = {'Normal (170-)': 0, 'Normal (185+)': 0, 'Normal (170-185)': 0,
              'Lean (170-)': 1, 'Lean (185+)': 1, 'Lean (170-185)': 1,
              'Stocky (170-)': 3, 'Stocky (185+)': 3, 'Stocky (170-185)': 3}
df['body_type'] = df['body_type'].map(body_types)

df_ctg = df[['work_rate', 'preferred_foot', 'body_type']].copy()

3. We now have two data frames - one with the continuous attributes and the other with the encoded categorical variables. Let's merge the two and check for any null values. The null values are replaced with the median of the column.


# joining the two dataframes -
df_all = pd.concat([df_ctg, df_con], axis=1)

# checking for null values -
df_all.isnull().sum()

# filling missing values with the column median -
df_all.fillna(df_all.median(), inplace=True)

The data is now processed and contains only attributes related to physique and playing patterns, as can be seen below.

[Image: preview of the processed dataframe]

There are 50 attributes remaining out of the 105 we started with. However, that is still too many, so we need to carry out dimensionality reduction.

It's time for PCA!

Principal Component Analysis -

Before we carry out PCA, the data needs to be standardized, as different attributes have different ranges. Z-score standardization is applied so that each variable has a mean of 0 and a standard deviation of 1. Scikit-learn's StandardScaler class is used for this purpose.
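The standardization cell itself is not shown below; the later snippets use a variable called df_scaled directly. A minimal sketch of how df_scaled can be produced from the merged dataframe df_all with StandardScaler (this exact cell is my reconstruction, not from the original post):

# z-score standardization of the preprocessed dataframe -
# (df_scaled is the array used by the PCA snippets below; producing it this
#  way is an assumption, since the original post omits this cell)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_all)   # shape (16155, 50), mean 0 / std 1 per column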

NOTE: When there are too many categorical variables, we can reduce their dimension using chi-square statistics (that method is not explored in this case).

  1. Our first step in PCA is to choose the correct number of principal components (PCs ≤ number of attributes). We want the components with the highest variance, since they hold the most information. For that, we apply PCA over the full dimension (here 50) and look at how much information each component holds, using the explained_variance_ratio_ attribute, which gives the percentage of variance explained by each component.


# carrying out PCA with (no. of PCs = no. of attributes)

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df_scaled)
pca_all = pca.transform(df_scaled)
pca_all.shape

output - (16155, 50)

We have considered all 50 components. We will now look at the first few components with maximum variance.

[Image: Explained variance of each Principal Component]

We can see that the first 10 components hold the maximum information (84%) and are ideal for our analysis. This can also be seen from the graph plotted below.


# plotting the explained variance of each component -
# (pca_var holds the per-component explained variance ratios of the full 50-component fit)

pca_var = pca.explained_variance_ratio_
px.area(x=list(range(1, pca_var.shape[0] + 1)), y=pca_var,
        labels={'x': 'No. of components', 'y': 'Explained variance'})
[Image: Explained variance of each PC vs. total number of components]
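The percentages quoted here (84% for the first 10 components, 63% for the first 3) can also be checked numerically from the same fitted pca object; a minimal sketch:

# cumulative explained variance of the principal components -

cum_var = np.cumsum(pca.explained_variance_ratio_)
print("variance retained by the first 10 PCs:", round(cum_var[9], 3))
print("variance retained by the first 3 PCs :", round(cum_var[2], 3))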

However, our aim is to visualize clusters of players with similar attributes. In such cases, the dimensionality should be reduced to 2 or 3 so that the clusters can be plotted. Therefore we take 3 principal components (which retain only about 63% of the information, as seen before).


# PCA with 3 principal components

pca = PCA(n_components=3)
pca.fit(df_scaled)
pca_3 = pca.transform(df_scaled)
pca_3.shape

output - (16155, 3)
[Image: 3 principal component scores of each attribute]

NOTE: Our data is now projected onto a 3-dimensional subspace.
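The loadings table shown in the image above can be reproduced from the fitted 3-component pca object; a minimal sketch, assuming df_all still holds the 50 preprocessed columns that were standardized earlier (the column names cp1-cp3 match those used for the merged dataframe later on):

# loading (score) of each original attribute on the 3 principal components -

loadings = pd.DataFrame(pca.components_.T,
                        columns=['cp1', 'cp2', 'cp3'],
                        index=df_all.columns)
loadings.head()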

K-Means Clustering -

Now that we have reduced the dimension of our data, it is time to look for patterns in it. Since the principal components are mutually orthogonal (uncorrelated), it is easier to spot differences between groups of players.
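This can be verified directly: the correlation matrix of the three component scores is (numerically) the identity. A minimal sketch, using the pca_3 array from above:

# correlation matrix of the 3 principal component scores -
# off-diagonal entries are ~0, i.e. the components are uncorrelated

np.round(np.corrcoef(pca_3, rowvar=False), 3)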

K-means clustering is carried out in order to find groups of players with similar attributes. In k-means, one has to predefine the number of clusters, just as we had to predefine the number of principal components.

  1. WCSS (Within-Cluster Sum of Squares) is used to determine the number of clusters. It is the sum of the squared distances between each data point and the centroid of its cluster, so points that lie close to a common centroid form a cluster. We run the algorithm for several candidate numbers of clusters and compare the WCSS values.


# running the k-means algorithm for k = 1 to 10 -

w = []
for i in range(1, 11):
    kmm = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmm.fit(pca_3)
    w.append(kmm.inertia_)

2. Since we considered up to 10 clusters, we plot the WCSS values against the number of clusters and use the elbow method to determine how many clusters to keep.


plt.figure(figsize=(10, 8))
plt.plot(range(1, 11), w, marker='o', linestyle='--')
plt.xlabel("Number of Clusters")
plt.ylabel("WCSS")
plt.title("K-means on principal components")
plt.show()
[Image: WCSS vs. number of clusters]

3. With the elbow method, we look for the kink in the graph before which the slope is steep and after which it flattens out. In the graph above, the line smoothens out after the kink at 4, so we consider having 3 clusters.

4. Now that we have determined the number of clusters, we run the k-means algorithm again on the reduced data.


kmm = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmm.fit(pca_3)

It is time to combine the results in one data frame. A new data frame is created from the original features, to which the names of the players (to whom the attributes belong) are added. The corresponding PCA scores and the assigned cluster labels are also added to this table.

Each k-means cluster that we have formed is a combination of similar attributes belonging to a group of players, so we map the cluster numbers to different player types (1, 2, 3, as we have 3 clusters).


# merging player names into the dataframe in order to categorize players by cluster -

df_all = pd.merge(df_all, df["short_name"], left_index=True, right_index=True)

# merging the original dataset with the principal components -

df_seg = pd.concat([df_all.reset_index(drop=True), pd.DataFrame(pca_3)], axis=1)
df_seg.columns.values[-3:] = ['cp1', 'cp2', 'cp3']
df_seg['K-means segments'] = kmm.labels_

# mapping cluster numbers to player types -

df_seg['segment'] = df_seg['K-means segments'].map({0: 'Player type 1', 1: 'Player type 2', 2: 'Player type 3'})
[Image: a few columns and rows of the resultant dataframe]

Finally, to the fruit of our endeavor and to the most exciting part!

Visualization -

The Plotly library enables us to create interactive visualizations easily. For our data, we take the three principal components as the axes and color the data points by their cluster assignment.


# 3D interactive plot -

import plotly.express as px

x_axis = df_seg['cp1']
y_axis = df_seg['cp2']
z_axis = df_seg['cp3']
fig = px.scatter_3d(df_seg, x=x_axis, y=y_axis, z=z_axis, color='segment')
fig.show()
[Image: players segmented by their physical and playing attributes (view 1)]
[Image: players segmented by their physical and playing attributes (view 2)]

NOTE: If we had carried out k-means clustering without doing PCA first, we would not have gotten such distinct and recognizable clusters.
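One way to sanity-check this claim is to cluster the standardized 50-dimensional data directly and compare the cluster separation with a metric such as the silhouette score. This comparison is not part of the original analysis; a minimal sketch, assuming df_scaled and pca_3 from above:

# comparing cluster separation with and without PCA -
# (silhouette score: closer to 1 means better-separated clusters;
#  a subsample is used because the full pairwise computation is heavy)

from sklearn.metrics import silhouette_score

km_raw = KMeans(n_clusters=3, init='k-means++', random_state=42).fit(df_scaled)
km_pca = KMeans(n_clusters=3, init='k-means++', random_state=42).fit(pca_3)

print("silhouette on 50-D scaled data:", silhouette_score(df_scaled, km_raw.labels_, sample_size=5000, random_state=42))
print("silhouette on 3 principal components:", silhouette_score(pca_3, km_pca.labels_, sample_size=5000, random_state=42))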

Finally, we look at the names and the number of players who belong to each segment. This can be done using the pandas groupby method.


# grouping names by the k-means segments -

df_grp = df_seg.groupby('K-means segments')['short_name']

# getting names of players who belong to the type 1 cluster -
grp_1 = df_grp.get_group(0)
print("Type 1 players are listed as follows:")
grp_1

# getting names of players who belong to the type 2 cluster -
grp_2 = df_grp.get_group(1)
print("Type 2 players are listed as follows:")
grp_2

# getting names of players who belong to the type 3 cluster -
grp_3 = df_grp.get_group(2)
print("Type 3 players are listed as follows:")
grp_3
[Image: players who belong to the type 2 cluster]

# counting players in each cluster -

obj = df_seg.groupby('segment')
for grp, count in obj:
    print("cluster", grp, "has", count.shape[0], "players.")

# plotting the counts -

df_cnt = df_seg.groupby('segment')['short_name'].count().reset_index(name='count')
df_cnt.plot.bar(x="segment", y="count", color=['green', 'red', 'blue'], rot=0)
[Image: number of players in each cluster]

Insights -

  1. We have found 3 major clusters. Each cluster represents players with similar physical and playing attributes, and each is clearly distinct from the others, as can be seen in the graphs above.
  2. We found players who have similar attributes and grouped them into 3 categories.
  3. The number of players in each group - 'Players Type 1': 6166, 'Players Type 2': 8214, and 'Players Type 3': 1775.
  4. Teams can be formed by selecting players from different groups, since a team with complementary attributes will perform better than a team whose players all share the same strengths.

Conclusion -

There is huge scope for carrying out more types of analysis on this dataset. The above process was a simple approach to finding patterns in the data; the aim was to understand how PCA and k-means work on a multivariate dataset. If you have understood how to implement the above methods, try them on a different dataset, or find other insights in the FIFA 22 dataset.

Here is a link to my Kaggle code -


References -

  1. https://towardsdatascience.com/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0
  2. https://365datascience.com/tutorials/python-tutorials/k-means-clustering/

See you in the next article!

