FIFA 21 Data Cleaning and Analysis ||?Excel
Introduction and Data Background
FIFA 21, developed by Electronic Arts (EA), is one of the most popular esports titles. Players compete against each other in both regional and international teams to win the championship crown.
The FIFA 21 dataset, publicly available on Kaggle, consists of web-scraped raw messy data from sofifa.com. It includes detailed information on over 18,000 football players from around the world.
Objectives?
There are two main objectives for this project:
Business Questions
Q1: Which players have been playing at a certain club for more than 20 years?
Q2: Which players are highly valuable but still underpaid (on low averages)?
Q3: Which top 10 young players have the potential to be future superstars?
Q4: Top 10 best performing players
Q5: Top 10 most paid players
Q6: Players by Nationality
Q7(a): Average age distribution of players
(b): Average weight and height distribution
Q8: Position Distribution (No: of players to positions)
Data Cleaning & Transformation
The initial data cleaning task involves checking for NULL or missing values and duplicated records. The “Go To” feature is used to identify blank cells. To access this feature, follow these steps: Home > Find & Select > Go To Special in the ribbon. Then, select the Blanks option from the Go To Special pop-up window and click OK to apply the selection.
The result of this process shows that the dataset does not have any missing values.?
Next, the “Remove Duplicates” feature in Excel is used to check for and remove any duplicate rows from the dataset. To do this, navigate to Data > Remove Duplicates in the Data Tools group.
The result showcases that no duplicate values are found. After completing these two processes, the dataset is confirmed to be clean and curated.
Name
The raw dataset includes two columns for the players’ names: “Name” and “LongName”. Both columns contain various special characters, which make the naming inconsistent. Several steps are required to remove all unnecessary characters from these columns. Notably, the “playerUrl” column contains the full names of the players embedded in the links. Therefore, the “playerUrl” column is used to extract the full names of the players.
First, the “playerUrl” column is split into multiple columns based on each occurrence of the forward slash (‘/’) using Excel’s TEXTSPLIT() function in the following manner.
=TEXTSPLIT(relative_cell_referencing,”/”)
Next, the SUBSTITUTE() and PROPER() functions are used to remove dashes (-) from the words in the split column containing the players’ full names and to capitalize the first letter of each word. After that, the remaining columns are removed.
=PROPER(SUBSTITUTE(text,”-”,“ ”))
Contract
The “Contract” column contains three data categories: years such as ‘2004 ~ 2021’, ‘On Loan’, and ‘Free’. A new column called “Contract Condition” is created to assign data based on the “Contract” column. If the “Contract” column includes ‘2004~2021’ or a similar format, the record in Contract Condition is assigned as ‘Active’. If it includes ‘On Loan’, the assignment is ‘On Loan’. For records marked as ‘Free’, the assignment is ‘Free’. This is achieved using the following combination of formulas:
= IFS(ISNUMBER(FIND(“On Loan”,O1)),”On Loan”,ISNUMBER(FIND(“~”,O1)),“Active”,ISNUMBER(FIND(“Free”,O1)),“Free”)
Note: O1 is the relative cell addressing.
The “Contract” column is divided into two other columns: “Contract Start Date” and “Contract End Date”, using the “Text to Columns” feature. The following formula is used to replace ‘On Loan’ and ‘Free’ data categories from the Contract column with ‘NA’ in both columns, while the original data remains the same for entries categorized with years:
=IFS(ISNUMBER(FIND(“~”,BK2)),BP2,ISNUMBER(FIND(“On Loan”,BK2)),“NA”,ISNUMBER(FIND(“Free”,BK2)),“NA”)
Note: BK2 & BP2 are the relative cell addressing.
Positions
The data in the “Positions” column is disorganized, so it is sorted in ascending alphabetical order using a combination of functions such as ARRAYTOTEXT(), SORT(), and TRIM().
=ARRAYTOTEXT(SORT(TRIM(H1)))
Note: H1 is the relative cell addressing.
The ARRAYTOTEXT() function converts the array of values in the “Positions” column into a text string, SORT() arranges the text in ascending order, and TRIM() removes any irregular spacing between the text. Then, the delimiter (i.e., the comma) between each value is replaced by a blank space using the “Find and Replace” feature. Navigate to Home > Find & Select > Replace in the ribbon. In the Find what option, enter a comma (,), and in the Replace with option, enter a space. Click Replace All to apply the changes.
Weight &?Height
The “Weight” and “Height” columns contain inconsistent data types. The “Weight” column includes units of measurement in both ‘kg’ and ‘lbs’, while the “Height” column describes the players’ heights in ‘cm’, ‘ft’, and ‘inch’.
Weight
For the “Weight” column, all values are converted to the ‘kg’ unit to ensure consistency. First, the ‘lbs’ suffixes are removed from the column using the “Find and Replace” feature. Next, the records that previously had ‘lbs’ suffixes are converted to ‘kg’ by dividing the values by 2.205. This is achieved using a combination of the ISNUMBER() and SEARCH() functions within an IF formula. Afterward, the “Find and Replace” feature is applied again to remove the ‘kg’ suffixes.
=IF(ISNUMBER(SEARCH(“kg”,A2)),A2,ROUND(A2/2.205),0))
Note: A2 is the relative cell addressing.
Height
Converting all units in the “Height” column to ‘cm’ involves two steps. The first step uses an IF conditional check to perform a logical test.
The IF condition checks if the cell contains a number with the ‘cm’ unit using the ISNUMBER and SEARCH functions. If it does, the condition returns TRUE, meaning the value in that cell remains unchanged. For example, if the cell contains “170cm”, the output will also be “170cm”.
If the cell does not contain ‘cm’, the conversion process begins by multiplying the value in ‘ft’ by 12 to convert it to inches. This is achieved using the LEFT and FIND functions. The LEFT function extracts the number of characters from the left side of the text string in the cell, while the FIND function locates the ‘ft’ symbol, and subtracting 1 from this position gives the first character in that particular cell which indicates the number of feet, and that character is then multiplied by 12.
Next, the result is summed with the value that indicates the number of inches in the cell. This is done using a combination of SUBSTITUTE(), MID(), FIND(), and LEN() functions. The MID() function extracts the number of inches based on a specified number of characters from the cell. The FIND() and LEN() functions locate the ‘inch’ symbol in the cell, and adding 1 to the position gives the number of inches to be summed. The SUBSTITUTE() function replaces the symbol with a blank space, and the ABS() function ensures all negative values are converted to positive. As a result, the values with no units in the column are now expressed in inches.
=IF(ISNUMBER(SEARCH(“cm”,Z2)),Z2,LEFT(Z2,FIND(“ ‘ ”,Z2)-1)*12+ABS(SUBSTITUTE(MID(Z2,FIND(“ ‘ “,Z2)+1,LEN(Z2)),” ” ” ”,” ”)))
Note: Z2 is the relative cell addressing.
In the second step, another set of formulas is used to convert numbers with inches to ‘cm’. The ISNUMBER() and SEARCH() functions within the IF() formula check if there are cells without ‘cm’ units. If a cell does not contain ‘cm’, its value is multiplied by 2.54 to convert inches to centimeters. If a cell already contains ‘cm’, the value remains unchanged. Finally, the ‘cm’ units in the column are removed using the “Find and Replace” feature.
=IF(ISNUMBER(SEARCH(“cm”,A2)),A2,A2*2.54)
Note: A2 is the relative cell addressing.
Joined?
The “Joined” column provides the joining date details of each player in the dataset. This column is separated into three columns: Day, Month and Year. This is carried out by utilizing Excel’s “Text to Columns” feature by navigating to Data > Text to Columns (in the Data Tools Group). Select Delimited and check the boxes for Comma and Space. Click Next to proceed to the next step and then click Finish to complete the process (Refer to the screenshots below).
Value, Wage & Release?Clause
The “Value”, “Wage” and “Release Clause” columns contain unnecessary characters such as ‘a’, ‘?’ and ‘?’. These characters are removed using the “Find and Replace” feature. First, select all three columns. Go to Home > Find & Select > Replace in the ribbon (or press Ctrl+H). Then, insert the characters (a??) in Find What option under the Replace tab, and leave the Replace with option empty. After that, click Replace All to apply.
领英推荐
The data records in these columns also include suffixes to represent values in units: ‘K’ stands for thousands and ‘M’ for millions. To convert these values with abbreviations to numeric ones, the following combination of formulas is applied.
=IFERROR(LEFT(A1,LEN(A1)-1)*10^(SEARCH(RIGHT(A1),“KM”)*3),A1)
Note: A1 is the relative cell addressing.
To explain the formula, it is necessary to extract only the numeric values from cell A1. The LEFT function extracts characters from the leftmost side of the cell, and the LEN function returns the number of characters minus the last character (the suffix). For example, 138.4M is now converted to 138.4.
Next, the SEARCH function looks for a specific substring within the text “KM”. With the help of the RIGHT function used within the SEARCH function as the first argument, the position number of the suffix (‘M’) as the last character is returned, which is 2 for cell A1. Using this result and multiplying it by 10 raised to the power of 3 will output 10 raised to the power of 6 (1,000,000) for ‘M’. (Note: In the case of ‘K’, the output is 1,000.)
Finally, multiplying 138.4 by 10? will give 138,400,000. The rest of the cells are processed accordingly based on the applied formula.
W/F, SM &?IR
The “W/F”, “SM”, and “IR” columns also contain unnecessary characters as suffixes, specifically ‘a?…’?. These are removed using the SUBSTITUTE formula as shown below:
=SUBSTITUTE(A1,“a?…”, )
Note: A1 is the relative cell addressing.
Hits
“Hits” column includes some NULL (blank) values, and the column data type is incorrect. To clean this up and ensure consistency in data format, NULL values are replaced with zeros, and the column data type is converted to numerical form. Additionally, some data records also contain ‘K’ suffixes similar to the “Value”, “Wage” and “Release Clause” columns, representing thousands. The same formula is applied here again to remove these suffixes and multiplied by 1000 to numbers with ‘K’ suffixes.
=IFERROR(LEFT(CI2,LEN(CI2)-1)*10^(SEARCH(RIGHT(CI2),“K”)*3),CI2)
Note: CI2 is the relative cell addressing.
The remaining columns are in the correct data types and contain accurate values, requiring no further data cleaning or transformation. Now that the dataset has been cleaned and transformed, the next step is to perform Exploratory Data Analysis (EDA) to answer the business questions mentioned in the earlier section.
Exploratory Data Analysis (EDA)?
In this section, pivot tables are extensively used to quicky extract, summarize, and report the findings for the business questions. The analyzed data is then visualized using various Excel charts.?
Q1: Which players have been playing at a certain club for more than 20 years?
Based on the analysis result of the maximum number of years each player has played for a specific football club, the Japanese goalkeeper, Hitoshi Sogahata, played for Kahsima Antlers for 23 years. Korean and Russian players, Kwang Suk Kim and Igor Akinfeev, have the second and third longest serving records for their respective clubs with each having 21 and 20 years.
Q2: Which players are highly valuable but still underpaid (on low averages)?
To answer this question, the columns for the average overall rating, total worth, and total wage of each player are analyzed. The Average Overall Rating column is filtered to include only players with an average score of 80 and above, and the Total Value Worth column is sorted from largest to smallest.
Out of the top 10 players with the highest average overall rating scores, six play for clubs in the English Premier League: three for Liverpool, two for Manchester City, and one for Chelsea. The rest of the players are in the French Ligue 1, the German Bundesliga, and the Spanish LaLiga. Kylian Mbappe, who plays for Paris Saint-Germain in the French Ligue 1, has the highest total value worth of over 185 million euros (a figure close to his current market value according to Google search results). However, his weekly total wage is only 160 thousand euros.
The scatterplot above illustrates the difference between the total worth and total wage of each football player. The dots close to the trendline in the middle of the scatterplot indicate that there is not a significant difference between the values in both columns. In contrast, the dots far from the trendline represent the outliers.
Q3: Which top 10 young players have the potential to be future superstars?
To address this question, the value fields for the Age and Potential Overall Rating (Potential) columns are summarized into averages. The top 10 players are then filtered based on the highest average potential overall rating scores. The young players with an average age between 19 and 21 appear in the top 10 list, all with Potential scores above 90. Among them, Kylian Mbappe has the highest Potential score of 95, making him one of the most promising future superstars of the next generation.
Q4: Top 10 best performing players
The overall performance of each player in the dataset is measured using three data columns: “Overall Rating”, “Best Overall Rating”, and “Hits”. The “Hits ”column shows how many times a player’s information has been viewed on the Sofifa website, from which the dataset was scraped. By averaging the data records of the first two columns and summing up the data in the Hits column, and then filtering the player rows into the top 10 by the average of the Overall Rating column in the pivot table, the result table is obtained as shown below:
The top 10 player list includes more than 10 players because six players have the same average overall rating scores. Lionel Messi possesses the highest average overall rating score of 93 among all players. It is noteworthy that Kylian Mbappe has the highest total Sofifa views at 1.6k, surpassing Lionel Messi and all other players on the list, even though his average overall rating score is lower than Messi’s.
Q5: Top 10 most paid players
The Wage column contains data on how much each player earns weekly during their tenure. Therefore, it can be used to analyze the total wage of each player to answer this question. In the top 10 list of most paid players, Lionel Messi has the highest total weekly earnings of 560,000 euros. Neymar Da Silva Santos Jr and Raheem Sterling are tied with the same earnings of 270,000 euros each.
Q6: Players by Nationality
To answer this question, it is necessary to analyze the total number of players representing each country. This is done by summarizing the “Full Name ”column into the total number of players by “Nationality” column. A pivot table is then created and filtered to show the top 10 countries with the most players.
According to the summarized data, a significant number of football players are from European countries such as England, Germany, Spain, and France, all of which appear consecutively in the top 10 list. England has the largest number of players, with a total of 1,705. Additionally, two countries from South America, Argentina and Brazil, and one country from Asia, Japan, are included in the list.
Q7 (a): Average age distribution of players
In this analysis, the “Age ”column data is summarized into averages, and a histogram chart is used to visualize the frequency distribution of the average age among the players in the entire dataset. The average age interval (on the horizontal axis) is set to two years.
As seen in the histogram above, most football players are between the ages of 18 to 32, with the 22–24 age period showcasing the highest number of football players, totaling 2,876. The numbers start to decline after the players exceed the age of 32.?
Q7 (b): Average weight and height distribution
Similar to the previous question, the data from the “Weight” and “Height” columns are calculated into averages, and separate histogram charts are created to visualize the frequency distribution of average weight and height among the players. The periods for both average weight and height columns are specified as two units.
The analysis results indicate that the majority of the players have an average weight between 74 and 76kg.?
Regarding the results for average height distribution, the histogram chart shows that the common average height among the players falls between 179 and 181 cm.
Q8: Position Distribution (No: of players to positions)
The “Position” and “Full Name” columns are used to identify the common distribution of positions based on the number of players in the dataset. Although the “Position” column includes mixed position data, only single position data are selected here for concise analysis. The total number of players for each position is calculated.
From the analysis result, it is evident that CB (Center Back) is the most common position, with a total count of 2,441 football players, followed by GK (Goalkeeper), ST (Striker) and CM (Central Midfielder).
FIFA 21 Dashboard
The FIFA 21 dashboard is designed to summarize key insights from the dataset in an easily accessible and understandable format for viewers.
Strategic Recommendations
THANK YOU FOR READING THIS FAR! Feel free to provide any suggestions in the comments so that I can make corrections and further improvements.
References & Resources:?
Dataset Link: FIFA 21 Dataset
Check out this article by Victoria, where she presents her comprehensive data cleaning processes on the dataset in a clear and concise manner.?
Order Management Analyst | Data Analytics | Excel | SQL | R | Tableau | Data Visualization
8 个月As a soccer fan, I really enjoyed your walkthrough and insights. I want to try this one myself! Thank you for sharing, awesome work!! ??
Predictive Analytics | SQL | Microsoft Excel | Tableau
8 个月Very good read and interesting analysis. Hope you enjoyed the whole process.
Pythonistas | Data-Driven Business Alchemist | Certified Scrum Product Owner (CSPO)?
8 个月Interesting!
Engineer, Data analyst, Data science
8 个月Can I get the data set
Data & Analytics Manager · Content Creator
8 个月Well done!