How to automate reporting building with Python

Python is a powerful programming language for data analysis, and when combined with the pandas package and SQL it provides all the tools you need to build an automated data-mining process for your company.

In this project we will take a deep dive into a real-life case to showcase how you can use data scientists' favourite coding language to automate the process of connecting to a database, updating its tables, running a query and then performing some operations to create a report. These are all common operations that are repeated in most companies and, sometimes, can be time-consuming and prone to error. With Python, it is possible to automate this process and get it done in seconds.

Suppose you are a data analyst working for a multinational company that needs to calculate the CO2 footprint produced by its business flights in order to make the appropriate compensations. The team asks you to create an automated process whose final deliverable is a file containing the CO2 footprint per department, filtered by a user-supplied date range.


Every month you receive a file containing all the information about the flights, in a format like this:


[Image: sample of the monthly flights file, with columns employee_id, department, route_index and flight_date]
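
For illustration, a hypothetical file of this shape could be generated with pandas (the IDs, departments and route indices below are made up for this example):

import pandas as pd

# Hypothetical sample rows, purely illustrative; the real monthly file
# arrives with the same four columns
sample = pd.DataFrame({
    "employee_id": ["1042", "2077"],
    "department": ["Sales", "Engineering"],
    "route_index": ["305", "1187"],
    "flight_date": ["2023-01-12", "2023-01-19"],
})
sample.to_csv("flights_sample.csv", index=False)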


Solution design

First, we need to split our job into two different steps. We have a database file ‘flights.db’ containing three tables: ‘routes’, ‘airports’ and ‘airlines’. This database stores the information about the airport locations and the different routes. The first step is to create a connection to this database, read the csv file containing the company flights, and save it into the database. The second step is to create a script that runs a query on the database joining all the relevant information, creates the column for the CO2 footprint, filters by the dates chosen by the user and saves the result into an exportable .csv file.
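
The exact schema of these pre-existing tables is not shown here; a plausible minimal version, inferred from the columns referenced by the query in the second script (the real flights.db may hold more columns), would be:

import sqlite3

cnx = sqlite3.connect("flights.db")

# Hypothetical minimal schema, inferred from the query used later;
# the actual database shipped with the project may differ
cnx.executescript("""
CREATE TABLE IF NOT EXISTS airports (
    id        TEXT PRIMARY KEY,
    latitude  TEXT,
    longitude TEXT
);
CREATE TABLE IF NOT EXISTS routes (
    "index"   TEXT PRIMARY KEY,
    source_id TEXT,
    dest_id   TEXT
);
""")
cnx.close()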


Work sequence:

  1. Connect to the database
  2. Read the csv file
  3. Update the ‘company_flights’ table
  4. Run the query to select the relevant data
  5. Transform the data
  6. Export a new csv file


First, let’s create the .py file that will perform tasks one to three.


import sqlite3
import sys
import pandas as pd

#Read the file and save it in a pandas dataframe
file_path = sys.argv[1]
df = pd.read_csv(file_path)


#Connect to the database, or create it if it does not exist already

cnx = sqlite3.connect("flights.db")
c = cnx.cursor()


#Create the company_flights table

c.execute("CREATE TABLE IF NOT EXISTS company_flights \
 (employee_id TEXT, department TEXT, route_index TEXT, flight_date TEXT)")


#Create the function that inserts the data from the file

#into the company_flights table

def insert_data(df, cursor):
    for i in range(len(df)):
        cursor.execute(
            "INSERT INTO company_flights "
            "(employee_id, department, route_index, flight_date) VALUES (?, ?, ?, ?)",
            (str(df.iloc[i, 0]), df.iloc[i, 1], str(df.iloc[i, 2]), df.iloc[i, 3]))
    cursor.close()


#Execute the function

insert_data(df, c)


cnx.commit()
cnx.close()


When executed, this code receives a csv file and appends its rows to the company_flights table in the flights database.
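
As a side note, pandas can write the whole dataframe in a single call, which avoids the manual insert loop; a minimal sketch, assuming the same table name and connection:

import sqlite3
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1])
cnx = sqlite3.connect("flights.db")

#if_exists="append" adds the new rows to the existing table;
#index=False keeps the dataframe index out of the database
df.to_sql("company_flights", cnx, if_exists="append", index=False)

cnx.close()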

The second file starts by defining the query that joins the three tables and selects the relevant columns, and then creates a pandas dataframe out of it. Then we read the input dates using the ‘sys’ package and save them into the two variables we will use to filter the dataset (for example, running something like python report.py 2023-01-01 2023-01-31, if the script were named report.py).

Thereafter, it creates the ‘distance’ column using the haversine package and computes the respective CO2 footprint for each flight.
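
The haversine function takes two (latitude, longitude) pairs in decimal degrees and returns the great-circle distance between them, in kilometres by default. A quick illustration, with approximate coordinates chosen just for this example:

from haversine import haversine

#Approximate (latitude, longitude) pairs, for illustration only
lisbon = (38.72, -9.14)
paris = (48.86, 2.35)

#Great-circle distance in kilometres (roughly 1450 km)
print(haversine(lisbon, paris))

Back to the report script: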




import pandas as pd
import sqlite3
from haversine import haversine
import sys

# Define the query and save it in a variable as a string
query = """SELECT
c.employee_id, c.department, c.route_index, c.flight_date,
CAST(s.latitude AS float) AS source_latitude, CAST(s.longitude AS float) AS source_longitude,
CAST(d.latitude AS float) AS dest_latitude, CAST(d.longitude AS float) AS dest_longitude

FROM company_flights AS c

LEFT JOIN routes AS r ON r."index" = c.route_index
LEFT JOIN airports AS s ON r.source_id = s.id
LEFT JOIN airports AS d ON r.dest_id = d.id"""


#Use the query to create a pandas dataframe

df = pd.read_sql_query(query, con=sqlite3.connect("flights.db"))

# Filter the dataframe to select only the used columns
df = df[["flight_date", "department", "source_latitude", "source_longitude", "dest_latitude", "dest_longitude"]]


#Filter by dates

date_from = sys.argv[1]
date_to = sys.argv[2]

df = df[(df['flight_date'] >= date_from) & (df['flight_date'] <= date_to)]
df.reset_index(inplace=True)


#Create the distance and co2_footprint columns

df["distance"] = pd.Series([haversine(
    (df["source_latitude"][i], df["source_longitude"][i]),
    (df["dest_latitude"][i], df["dest_longitude"][i]))
    for i in range(len(df))])

df["co2_footprint"] = df["distance"] * 0.1


#Group by department and aggregate by the sum of CO2 footprint

df_grouped = df.groupby("department", as_index=False)["co2_footprint"].sum().round().sort_values(by="co2_footprint", ascending=False)


#Save to csv file

df_grouped.to_csv("co2_report.csv", index=False)


#Print the report

print(df_grouped)

Finally, we group by department, aggregating by the sum of the CO2 footprint. The result should look like this:

[Image: the resulting report, showing the total CO2 footprint per department]


Overall, the use of automated reports can help organizations improve their efficiency, effectiveness, and competitiveness. They are important because they help businesses and organizations make data-driven decisions. By providing timely, accurate, and comprehensive data on various aspects of the organization, automated reports help decision makers identify trends, optimize processes, and make more informed decisions. They can also save time and reduce the risk of human error, as they are generated automatically using algorithms and software.

Therefore, it is important for a data analyst to know how to build these automated processes and how to make them trustworthy and useful.
