Extracting semi-structured data from JSON Dataset, Parsing, Transforming and Loading to an Excel file with Python Pandas library

Extracting semi-structured data from JSON Dataset, Parsing, Transforming and Loading to an Excel file with Python Pandas library

Hi,Friends

I am going to show you getting data from semi-structed (JSON) dataset, parsing, transforming and loading into an Excel file with Python Pandas library.

Steps:

1. Extracting data from JSON dataset

2. Parsing and Transforming

3. Loading into an Excel file.

Used tools:

1. Jupyter Notebook

2. Pyton (Pandas Library)


JSON dataset content:

"English Premier League" - season match stats 2017/2018

No alt text provided for this image
Pic:1

First of all

We need Jupyter notebook. We can follow the steps with different application or tools. The Jupyter notebook is one of the powerful web application for data purpose

1.??We have to import pandas.

2.??Use “read_json” function to convert JSON string to pandas object.

No alt text provided for this image
Pic:2

The output is not understandable because keys of JSON are defined as indexes instead of columns.

In this case, we have to use orient="index" and replace indexes to columns and columns to indexes.

orient = "index"define indexes as columns

head(10) function – retriving 10 rows

No alt text provided for this image
Pic:3

So, we extracted and parsed the JSON dataset into a readable and understandable form. Now, it's time to transform data.

Let’s try to get date, time, day, month, year parts from 'date_string' column.

Used functions:

1. pd.to_datetime - converting argument/string/object to datetime data type.

2. dt.date - extracts date part from datetime.

3. dt.day -?extracts day part from datetime.

4. dt.month_name() - extracts month part (with name) from datetime.

5. dt.year - extracts year part from datetime.

6. dt.time - extracts time part from datetime.

We need to add new columns and assign extracted value with above function to it.

New columns:

df['match_date'] =

df['day'] =

df['month'] =

df['year'] =

df['time'] =?

In Python, if you want to add new column to the pandas object you have to define column name and assign value to it. If new column name is existing column name in this case the existing column name will be updated, if there is no any matched column name with new column name then it will be added as a new column.

No alt text provided for this image
Pic:4

Continue transformation.

Let’s concatenate parentheses with 'half_time_score' and 'full_time_score' values.

df['half_time_score'] = '(' + df[['half_time_score']] + ')'

df['full_time_score'] = '(' + df[['full_time_score']] + ')'

After that add new “match result“ column and concatenate 'home_team_name' and 'away_team_name' column values with the values of 'full_time_score' column.

df['match result'] = df['home_team_name'] +' '+ df['full_time_score'] +' '+ df['away_team_name']

AND we should retrieve only needed columns and rename name of the existing columns:

Retrieving only needed columns:

df = df[[ 'home_team_name', 'away_team_name','match result', 'match_date','day','month','year',?'time','half_time_score',' full_time_score' ]]

Renaming existing columns:

df.rename( columns = { 'home_team_name': 'home_team', 'away_team_name': 'away_team' }, inplace = True)

inplace function – performs changing on original DataFrame.

No alt text provided for this image
Pic:5

So, There are already half time and full time score. We need to find half-time winner and full-time winner.

Based on the “DataFrame”, we see that left side is home team and right side is away team.

As an example of half time score of first row is (0 : 2) and we can say half time winner is “Liverpool” , full time score is (0 : 3) and we can full time winner is also “Liverpool” but this approach is not professional approach ??

We have to extract half-time and full-time scores from 'half_time_score' and 'full_time_score' column and determine the half and full time exact winners.

I am going to add four new columns and extract each single score and assign to new columns

New columns:

df['home_team_score_half_time'] =

df['away_team_score_half_time'] =

df['home_team_score_full_time'] =

df['away_team_score_full_time'] =?


df['home_team_score_half_time'] = df['half_time_score'].str.strip().str[1]

df['away_team_score_half_time'] = df['half_time_score'].str.strip().str[-2]

df['home_team_score_full_time'] = df['full_time_score'].str.strip().str[1]

df['away_team_score_full_time'] = df['full_time_score'].str.strip().str[-2]


str.strip() function – is used to remove leading and trailing characters.

No alt text provided for this image
Pic:6

We have already extracted scores in the separate columns and we have to determine half-time and full-time winners.

As we mentioned above that we know already left side and right side teams.

Left side team is “home_team”, right side team is “away_team” and there are score extracted columns described below:

1) “home_team_score_half_time”? 2) “away_team_score_half_time”, 3)“home_team_score_full_time” 4) “away_team_score_full_time”

Now, time to add condition and determine the half-time and full-time winner.

1. If the “half time home team score” is less than “half time away team score” it means “away_team” is half-time winner then it will return “away_team” name.

2. If the “full time home team score” is greater than “full time away team score” it means “home_team” is full-time winner then it will return “home_team” name.

3. If the “half time home team score” is equal to “half time away team score” it means scores are draw then it will return “Draw”.

4. If the “full time home team score” is equal to “full time away team score” it means scores are draw then it will return “Draw”.

Condition code:

Half-time

df.loc[df['home_team_score_half_time'] >?df['away_team_score_half_time'], 'half_time_winner'] = df['home_team']?

df.loc[df['home_team_score_half_time'] <?df['away_team_score_half_time'], 'half_time_winner'] = df['away_team']?

df.loc[df['home_team_score_half_time'] == df['away_team_score_half_time'], 'half_time_winner'] = 'Draw'

Full-time

df.loc[df['home_team_score_full_time'] >?df['away_team_score_full_time'], 'full_time_winner'] = df['home_team']?

df.loc[df['home_team_score_full_time'] <?df['away_team_score_full_time'], 'full_time_winner'] = df['away_team']?

df.loc[df['home_team_score_full_time'] == df['away_team_score_full_time'], 'full_time_winner'] = 'Draw'

loc() function –helps us to retrieve data values from a dataset at an ease. Using the loc() function, we can access the data values fitted in the particular row or column based on the index value passed to the function.

No alt text provided for this image
Pic:7

Half-time and full-time winners are determined, full-time winner is the main winner of the game. Now let’s add new winner column and assign full time winner to it and retrieve only needed columns.?

1. Adding new winner column and assign full_time_winner to it.

df['winner'] = df['full_time_winner']

2. Retrieving final needed columns.

df = df[[ 'home_team', 'away_team','match result', 'half_time_winner',?'half_time_score', 'full_time_score', 'full_time_winner', 'winner','match_date', 'day','month','year','time' ]]

No alt text provided for this image
Pic:8

Thus, we have come to the end. Its time to load our work to an Excel file.

df.to_excel( 'English_Premier_League.xlsx' , sheet_name = 'English_League' )

Excel file name and sheet name can be set as desired.

to_excel() function - ?is used to write object to an Excel file.?

No alt text provided for this image
Pic:9

Output:

No alt text provided for this image
Pic:10
No alt text provided for this image
Pic:11

Conclusion

We have learnt how to extract data from JSON dataset, parse, transform it with powerful Python Pandas library and load it to an Excel file.

要查看或添加评论,请登录

GARIB HASANOV的更多文章

  • PySpark Patitionby() function in Databricks

    PySpark Patitionby() function in Databricks

    Hello friends, In this article, I want to show how to practice the PySpark partitionBy() functionality in DataBricks by…

    1 条评论

社区洞察

其他会员也浏览了