登录查看更多内容

Extracting semi-structured data from JSON Dataset, Parsing, Transforming and Loading to an Excel file with Python Pandas library

GARIB HASANOV

Data Engineering | Data Management

发布日期: 2023年1月29日

Hi,Friends

I am going to show you getting data from semi-structed (JSON) dataset, parsing, transforming and loading into an Excel file with Python Pandas library.

Steps:

1. Extracting data from JSON dataset

2. Parsing and Transforming

3. Loading into an Excel file.

Used tools:

1. Jupyter Notebook

2. Pyton (Pandas Library)

JSON dataset content:

"English Premier League" - season match stats 2017/2018

No alt text provided for this image — Pic:1

First of all

We need Jupyter notebook. We can follow the steps with different application or tools. The Jupyter notebook is one of the powerful web application for data purpose

1.??We have to import pandas.

2.??Use “read_json” function to convert JSON string to pandas object.

The output is not understandable because keys of JSON are defined as indexes instead of columns.

In this case, we have to use orient="index" and replace indexes to columns and columns to indexes.

orient = "index" – define indexes as columns

head(10) function – retriving 10 rows

So, we extracted and parsed the JSON dataset into a readable and understandable form. Now, it's time to transform data.

Let’s try to get date, time, day, month, year parts from 'date_string' column.

Used functions:

1. pd.to_datetime - converting argument/string/object to datetime data type.

2. dt.date - extracts date part from datetime.

3. dt.day -?extracts day part from datetime.

4. dt.month_name() - extracts month part (with name) from datetime.

5. dt.year - extracts year part from datetime.

6. dt.time - extracts time part from datetime.

We need to add new columns and assign extracted value with above function to it.

New columns:

df['match_date'] =

df['day'] =

df['month'] =

df['year'] =

df['time'] =?

In Python, if you want to add new column to the pandas object you have to define column name and assign value to it. If new column name is existing column name in this case the existing column name will be updated, if there is no any matched column name with new column name then it will be added as a new column.

Continue transformation.

Let’s concatenate parentheses with 'half_time_score' and 'full_time_score' values.

df['half_time_score'] = '(' + df[['half_time_score']] + ')'

df['full_time_score'] = '(' + df[['full_time_score']] + ')'

After that add new “match result“ column and concatenate 'home_team_name' and 'away_team_name' column values with the values of 'full_time_score' column.

df['match result'] = df['home_team_name'] +' '+ df['full_time_score'] +' '+ df['away_team_name']

AND we should retrieve only needed columns and rename name of the existing columns:

Retrieving only needed columns:

df = df[[ 'home_team_name', 'away_team_name','match result', 'match_date','day','month','year',?'time','half_time_score',' full_time_score' ]]

Renaming existing columns:

df.rename( columns = { 'home_team_name': 'home_team', 'away_team_name': 'away_team' }, inplace = True)

inplace function – performs changing on original DataFrame.

So, There are already half time and full time score. We need to find half-time winner and full-time winner.

领英推荐

Polars Vs Pandas: Benchmarking performances and beyond

Machine Learning Reply GmbH 1 年前

5 Alternatives to Matplotlib That Make Data…

StrataScratch 1 个月前

Getting Started with Pandas: A Beginner's Guide to…

ITVersity, Inc. 2 个月前

Based on the “DataFrame”, we see that left side is home team and right side is away team.

As an example of half time score of first row is (0 : 2) and we can say half time winner is “Liverpool” , full time score is (0 : 3) and we can full time winner is also “Liverpool” but this approach is not professional approach ??

We have to extract half-time and full-time scores from 'half_time_score' and 'full_time_score' column and determine the half and full time exact winners.

I am going to add four new columns and extract each single score and assign to new columns

New columns:

df['home_team_score_half_time'] =

df['away_team_score_half_time'] =

df['home_team_score_full_time'] =

df['away_team_score_full_time'] =?

df['home_team_score_half_time'] = df['half_time_score'].str.strip().str[1]

df['away_team_score_half_time'] = df['half_time_score'].str.strip().str[-2]

df['home_team_score_full_time'] = df['full_time_score'].str.strip().str[1]

df['away_team_score_full_time'] = df['full_time_score'].str.strip().str[-2]

str.strip() function – is used to remove leading and trailing characters.

We have already extracted scores in the separate columns and we have to determine half-time and full-time winners.

As we mentioned above that we know already left side and right side teams.

Left side team is “home_team”, right side team is “away_team” and there are score extracted columns described below:

1) “home_team_score_half_time”? 2) “away_team_score_half_time”, 3)“home_team_score_full_time” 4) “away_team_score_full_time”

Now, time to add condition and determine the half-time and full-time winner.

1. If the “half time home team score” is less than “half time away team score” it means “away_team” is half-time winner then it will return “away_team” name.

2. If the “full time home team score” is greater than “full time away team score” it means “home_team” is full-time winner then it will return “home_team” name.

3. If the “half time home team score” is equal to “half time away team score” it means scores are draw then it will return “Draw”.

4. If the “full time home team score” is equal to “full time away team score” it means scores are draw then it will return “Draw”.

Condition code:

Half-time

df.loc[df['home_team_score_half_time'] >?df['away_team_score_half_time'], 'half_time_winner'] = df['home_team']?

df.loc[df['home_team_score_half_time'] <?df['away_team_score_half_time'], 'half_time_winner'] = df['away_team']?

df.loc[df['home_team_score_half_time'] == df['away_team_score_half_time'], 'half_time_winner'] = 'Draw'

Full-time

df.loc[df['home_team_score_full_time'] >?df['away_team_score_full_time'], 'full_time_winner'] = df['home_team']?

df.loc[df['home_team_score_full_time'] <?df['away_team_score_full_time'], 'full_time_winner'] = df['away_team']?

df.loc[df['home_team_score_full_time'] == df['away_team_score_full_time'], 'full_time_winner'] = 'Draw'

loc() function –helps us to retrieve data values from a dataset at an ease. Using the loc() function, we can access the data values fitted in the particular row or column based on the index value passed to the function.

Half-time and full-time winners are determined, full-time winner is the main winner of the game. Now let’s add new winner column and assign full time winner to it and retrieve only needed columns.?

1. Adding new winner column and assign full_time_winner to it.

df['winner'] = df['full_time_winner']

2. Retrieving final needed columns.

df = df[[ 'home_team', 'away_team','match result', 'half_time_winner',?'half_time_score', 'full_time_score', 'full_time_winner', 'winner','match_date', 'day','month','year','time' ]]

Thus, we have come to the end. Its time to load our work to an Excel file.

df.to_excel( 'English_Premier_League.xlsx' , sheet_name = 'English_League' )

Excel file name and sheet name can be set as desired.

to_excel() function - ?is used to write object to an Excel file.?

Output:

Conclusion

We have learnt how to extract data from JSON dataset, parse, transform it with powerful Python Pandas library and load it to an Excel file.

要查看或添加评论，请登录

GARIB HASANOV的更多文章

PySpark Patitionby() function in Databricks

2024年1月27日

PySpark Patitionby() function in Databricks

Hello friends, In this article, I want to show how to practice the PySpark partitionBy() functionality in DataBricks by…

1 条评论

Extracting semi-structured data from JSON Dataset, Parsing, Transforming and Loading to an Excel file with Python Pandas library

GARIB HASANOV

Data Engineering | Data Management

We need to add new columns and assign extracted value with above function to it.

Continue transformation.

领英推荐

Condition code:

GARIB HASANOV的更多文章

社区洞察

其他会员也浏览了

Introduction to Pandas: Start Your Data Journey

Accessing Data with iloc: Position-Based Indexing in Pandas

Manipulating Pandas DataFrame Columns Like a Pro: 5 Essential Techniques

Mastering Data Visualization with Matplotlib: A Comprehensive Guide to Creating Powerful Plots and Charts

Python Data Types & Data Structures

?? Top Python Libraries for Data Science ??

+30 Useful Operations in Pandas ??

Spark 3.0 : Adaptive Query Execution & Dynamic Partition Pruning

Data Analytics tool and their implementation with Python.

The application of Python for analysis and visualization of survey form data and the use of Spearman's correlation

We need to add new columns and assign extracted value with above function to it.

Continue transformation.

领英推荐

Condition code:

GARIB HASANOV的更多文章

PySpark Patitionby() function in Databricks

社区洞察

其他会员也浏览了

Introduction to Pandas: Start Your Data Journey

Accessing Data with iloc: Position-Based Indexing in Pandas

Manipulating Pandas DataFrame Columns Like a Pro: 5 Essential Techniques

Mastering Data Visualization with Matplotlib: A Comprehensive Guide to Creating Powerful Plots and Charts

Python Data Types & Data Structures

?? Top Python Libraries for Data Science ??

+30 Useful Operations in Pandas ??

Spark 3.0 : Adaptive Query Execution & Dynamic Partition Pruning

Data Analytics tool and their implementation with Python.

The application of Python for analysis and visualization of survey form data and the use of Spearman's correlation