Data Cleaning
Abdul-Azeez Busari
Data Analyst | Microsoft Certified | Product Analyst | Power BI | Excel | SQL | Looker Studio | Civil Engineer
Data cleaning is an important step in data analysis because inaccurate data can lead to incorrect conclusions and decisions. It involves various processes such as removing duplicates, filling missing values, correcting spelling errors, standardizing date formats and identifying outliers.
The dataset used for the challenge is FIFA 21 data and it was sourced from Kaggle. It contains the details of football players with their performance up till year 2021. The dataset has a total number of 18979 rows and 77 columns.
The objective of this challenge is to transform the MESSY data to clean and ready to use.
I started the challenge by going through the data dictionary and getting familiar with the column names. The dataset was extracted and opened in Microsoft Excel for the transformation. The dataset is a practical example of MESSY data as it contains inconsistent data type, spelling errors, special characters, wrong data format etc. Here are the transformation I made to each column;
ID – this is the unique identifier assigned to each player. The column was originally formatted in Numbers which in changed to Text.
Name & LongName – these columns contains the names of each player in short (e.g L.Messi) and long format(e.g Lionel Messi). I noticed some accent alphabets were replaced special characters which I used find & replace to correct manually.
PlayerUrl & photoUrl – these columns contain links to access each player’s details and images. These columns were removed.
Nationality & Age – these two columns had no missing values and wrong data formats. The two columns were okay as it is.
Club – this column contains the name of club for each player. The accents alphabets were substituted with special characters which I corrected using Find & Replace.
领英推荐
Contract – this column shows the kind of agreement each player has i.e. Loan, Free and duration of contract. The column has inconsistent values and wrong data type. The column has 3 categories of values in the format, '23rd July 2020 on loan’, 'Free' and '2018 ~ 2024’. I cleaned the column by firstly creating a conditional column in power query where I categorized the values into: contract, Loan and Free. After that I split the column into two using tilde(~) as the delimiter which generated two columns. I renamed the column contract_start and contract_end. I created a new column to find the duration in years of each Player’s contract (contract_end – contract_start). I replaced the null values and errors with zero as they were the players not on contract agreement.
Positions column – it contains the position each player has played in. This column was removed because there is another column with the Best Position of each player.
Height – this column has its value both in ‘cm’ and ‘feet & inches’ which makes it inconsistent.?The whole column was then converted to cm by converting ‘feet & inches’ to cm. this was done in excel by firstly splitting the column into two using apostrophe as the delimiter i.e. separating feet and inches into different columns. A new column was created and?the formula ‘=IF(LEN(M2) = 1,(( LEFT(M2,1) *12)+RIGHT(N2,2))*2.53),M2)’ was applied to get the equivalent in ‘cm’. The suffix ‘cm’ was removed from the column and the column was renamed to ‘Height(cm)’.
Weight- This column also contains inconsistent unit (i.e. lbs and kg). I converted everything into lbs using the formula ‘=IF(RIGHT(O2) = ‘kg’,(( LEFT(M2,LEN(M2) - 2) *2.204),O2)’. The suffix lbs was removed and the column name was renamed to ‘Weight(lbs)’.
Potential_rating, Best_overall_rating and Overall_rating columns had values in the range of 1-99. The columns are best expressed in percentages which was done by dividing the columns by 100 and converting the data type to percentage.
Value, Wage and release_clause has issues with its data type and suffixes like ‘K’ denoting thousand and ‘M’ denoting Million. These columns were cleaned by replacing the special characters standing as prefixes and using the formular ‘=IF(RIGHT(U2) = ‘K’, LEFT(U2, LEN(U2)-1) *1000), IF(RIGHT(U2) = ‘M’, LEFT(U2, LEN(U2)-1) *1000000), U2)’. After the calculations were done, the columns were formatted in dollars.
?Attacking, Crossing, Finishing, Heading_accuracy, Short_passing, Volleys, Skill, Dribbling, Curve, FK_accuracy, Long_passing, Ball_control, Movement, Acceleration, Spring_speed, Agility, Reactions, Balance, Power, Shot_power, Jumping, Stamina, Strength, Long_shot, Mentality, Aggression, Interception, Positioning, Vision, Penalties, Composure, Defending, Marking, Standing_tackle, Sliding_tackle, Goalkeeping, GK_handling, GK_diving, Gk_kicking, Gk_positioning, Gk_reflexes, Total_stats, Base_stats, Pace, Shooting, Passing, DRI, DEF, Physical, Best_position, Preferred_foot, Attacking_workrate, and Defensive_workrate columns were checked and little correction wer made to the column names and data types