Extracting semi-structured data from JSON Dataset, Parsing, Transforming and Loading to an Excel file with Python Pandas library
Hi,Friends
I am going to show you getting data from semi-structed (JSON) dataset, parsing, transforming and loading into an Excel file with Python Pandas library.
Steps:
1. Extracting data from JSON dataset
2. Parsing and Transforming
3. Loading into an Excel file.
Used tools:
1. Jupyter Notebook
2. Pyton (Pandas Library)
JSON dataset content:
"English Premier League" - season match stats 2017/2018
First of all
We need Jupyter notebook. We can follow the steps with different application or tools. The Jupyter notebook is one of the powerful web application for data purpose
1.??We have to import pandas.
2.??Use “read_json” function to convert JSON string to pandas object.
The output is not understandable because keys of JSON are defined as indexes instead of columns.
In this case, we have to use orient="index" and replace indexes to columns and columns to indexes.
orient = "index" – define indexes as columns
head(10) function – retriving 10 rows
So, we extracted and parsed the JSON dataset into a readable and understandable form. Now, it's time to transform data.
Let’s try to get date, time, day, month, year parts from 'date_string' column.
Used functions:
1. pd.to_datetime - converting argument/string/object to datetime data type.
2. dt.date - extracts date part from datetime.
3. dt.day -?extracts day part from datetime.
4. dt.month_name() - extracts month part (with name) from datetime.
5. dt.year - extracts year part from datetime.
6. dt.time - extracts time part from datetime.
We need to add new columns and assign extracted value with above function to it.
New columns:
df['match_date'] =
df['day'] =
df['month'] =
df['year'] =
df['time'] =?
In Python, if you want to add new column to the pandas object you have to define column name and assign value to it. If new column name is existing column name in this case the existing column name will be updated, if there is no any matched column name with new column name then it will be added as a new column.
Continue transformation.
Let’s concatenate parentheses with 'half_time_score' and 'full_time_score' values.
df['half_time_score'] = '(' + df[['half_time_score']] + ')'
df['full_time_score'] = '(' + df[['full_time_score']] + ')'
After that add new “match result“ column and concatenate 'home_team_name' and 'away_team_name' column values with the values of 'full_time_score' column.
df['match result'] = df['home_team_name'] +' '+ df['full_time_score'] +' '+ df['away_team_name']
AND we should retrieve only needed columns and rename name of the existing columns:
Retrieving only needed columns:
df = df[[ 'home_team_name', 'away_team_name','match result', 'match_date','day','month','year',?'time','half_time_score',' full_time_score' ]]
Renaming existing columns:
df.rename( columns = { 'home_team_name': 'home_team', 'away_team_name': 'away_team' }, inplace = True)
inplace function – performs changing on original DataFrame.
So, There are already half time and full time score. We need to find half-time winner and full-time winner.
领英推荐
Based on the “DataFrame”, we see that left side is home team and right side is away team.
As an example of half time score of first row is (0 : 2) and we can say half time winner is “Liverpool” , full time score is (0 : 3) and we can full time winner is also “Liverpool” but this approach is not professional approach ??
We have to extract half-time and full-time scores from 'half_time_score' and 'full_time_score' column and determine the half and full time exact winners.
I am going to add four new columns and extract each single score and assign to new columns
New columns:
df['home_team_score_half_time'] =
df['away_team_score_half_time'] =
df['home_team_score_full_time'] =
df['away_team_score_full_time'] =?
df['home_team_score_half_time'] = df['half_time_score'].str.strip().str[1]
df['away_team_score_half_time'] = df['half_time_score'].str.strip().str[-2]
df['home_team_score_full_time'] = df['full_time_score'].str.strip().str[1]
df['away_team_score_full_time'] = df['full_time_score'].str.strip().str[-2]
str.strip() function – is used to remove leading and trailing characters.
We have already extracted scores in the separate columns and we have to determine half-time and full-time winners.
As we mentioned above that we know already left side and right side teams.
Left side team is “home_team”, right side team is “away_team” and there are score extracted columns described below:
1) “home_team_score_half_time”? 2) “away_team_score_half_time”, 3)“home_team_score_full_time” 4) “away_team_score_full_time”
Now, time to add condition and determine the half-time and full-time winner.
1. If the “half time home team score” is less than “half time away team score” it means “away_team” is half-time winner then it will return “away_team” name.
2. If the “full time home team score” is greater than “full time away team score” it means “home_team” is full-time winner then it will return “home_team” name.
3. If the “half time home team score” is equal to “half time away team score” it means scores are draw then it will return “Draw”.
4. If the “full time home team score” is equal to “full time away team score” it means scores are draw then it will return “Draw”.
Condition code:
Half-time
df.loc[df['home_team_score_half_time'] >?df['away_team_score_half_time'], 'half_time_winner'] = df['home_team']?
df.loc[df['home_team_score_half_time'] <?df['away_team_score_half_time'], 'half_time_winner'] = df['away_team']?
df.loc[df['home_team_score_half_time'] == df['away_team_score_half_time'], 'half_time_winner'] = 'Draw'
Full-time
df.loc[df['home_team_score_full_time'] >?df['away_team_score_full_time'], 'full_time_winner'] = df['home_team']?
df.loc[df['home_team_score_full_time'] <?df['away_team_score_full_time'], 'full_time_winner'] = df['away_team']?
df.loc[df['home_team_score_full_time'] == df['away_team_score_full_time'], 'full_time_winner'] = 'Draw'
loc() function –helps us to retrieve data values from a dataset at an ease. Using the loc() function, we can access the data values fitted in the particular row or column based on the index value passed to the function.
Half-time and full-time winners are determined, full-time winner is the main winner of the game. Now let’s add new winner column and assign full time winner to it and retrieve only needed columns.?
1. Adding new winner column and assign full_time_winner to it.
df['winner'] = df['full_time_winner']
2. Retrieving final needed columns.
df = df[[ 'home_team', 'away_team','match result', 'half_time_winner',?'half_time_score', 'full_time_score', 'full_time_winner', 'winner','match_date', 'day','month','year','time' ]]
Thus, we have come to the end. Its time to load our work to an Excel file.
df.to_excel( 'English_Premier_League.xlsx' , sheet_name = 'English_League' )
Excel file name and sheet name can be set as desired.
to_excel() function - ?is used to write object to an Excel file.?
Output:
Conclusion
We have learnt how to extract data from JSON dataset, parse, transform it with powerful Python Pandas library and load it to an Excel file.