How to automate reporting building with Python

Python is a powerful programming language for data analysis, and when combined with the pandas package and SQL it provides all the tools you need to build an automated data-mining process for your company.

In this project we will take a deep dive into a real-life case to showcase how you can use data scientists' favourite coding language to automate the process of connecting to a database, updating its tables, running a query and then performing some operations to create a report. These are all common operations that are repeated in most companies and, sometimes, can be time-consuming and prone to error. With Python, it is possible to automate this process and get it done in seconds.

Suppose you are a data analyst working for a multinational company that needs to calculate the CO2 footprint produced by its business flights in order to make the appropriate compensations. The team asks you to create an automated process whose final deliverable is a file containing the CO2 footprint per department, filtered by a user-supplied date range.


Every month you receive a file containing all the information about the flights, in a format like this:


[Image: sample of the monthly flights file, with columns employee_id, department, route_index and flight_date]
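
For illustration, a hypothetical file of this shape could be generated with pandas (the IDs, departments and route indices below are made up for this example):

import pandas as pd

# Hypothetical sample rows, purely illustrative; the real monthly file
# arrives with the same four columns
sample = pd.DataFrame({
    "employee_id": ["1042", "2077"],
    "department": ["Sales", "Engineering"],
    "route_index": ["305", "1187"],
    "flight_date": ["2023-01-12", "2023-01-19"],
})
sample.to_csv("flights_sample.csv", index=False)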


Solution design

First, we need to split our job into two different steps. We have a database file ‘flights.db’ containing three tables: ‘routes’, ‘airports’ and ‘airlines’. This database stores the information about the airport locations and the different routes. The first step is to create a connection to this database, read the csv file containing the company flights, and save it into the database. The second step is to create a script that runs a query on the database joining all the relevant information, creates the column for the CO2 footprint, filters by the dates chosen by the user and saves the result into an exportable .csv file.
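
The exact schema of these pre-existing tables is not shown here; a plausible minimal version, inferred from the columns referenced by the query in the second script (the real flights.db may hold more columns), would be:

import sqlite3

cnx = sqlite3.connect("flights.db")

# Hypothetical minimal schema, inferred from the query used later;
# the actual database shipped with the project may differ
cnx.executescript("""
CREATE TABLE IF NOT EXISTS airports (
    id        TEXT PRIMARY KEY,
    latitude  TEXT,
    longitude TEXT
);
CREATE TABLE IF NOT EXISTS routes (
    "index"   TEXT PRIMARY KEY,
    source_id TEXT,
    dest_id   TEXT
);
""")
cnx.close()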


Work sequence:

  1. Connect to the database
  2. Read the csv file
  3. Update the ‘company_flights’ table
  4. Run the query to select the relevant data
  5. Transform the data
  6. Export a new csv file


First, let’s create the .py file that will perform tasks one to three.


import sqlite3
import sys
import pandas as pd

#Read the file and save it in a pandas dataframe
file_path = sys.argv[1]
df = pd.read_csv(file_path)


#Connect to the database, or create it if it does not exist already

cnx = sqlite3.connect("flights.db")
c = cnx.cursor()


#Create the company_flights table

c.execute("CREATE TABLE IF NOT EXISTS company_flights \
 (employee_id TEXT, department TEXT, route_index TEXT, flight_date TEXT)")


#Create the function that inserts the data from the file

#into the company_flights table

def insert_data(df, cursor):
    for i in range(len(df)):
        cursor.execute(
            "INSERT INTO company_flights "
            "(employee_id, department, route_index, flight_date) VALUES (?, ?, ?, ?)",
            (str(df.iloc[i, 0]), df.iloc[i, 1], str(df.iloc[i, 2]), df.iloc[i, 3]))
    cursor.close()


#Execute the function

insert_data(df, c)


cnx.commit()
cnx.close()


When executed, this code receives a csv file and appends its rows to the company_flights table in the flights database.
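
As a side note, pandas can write the whole dataframe in a single call, which avoids the manual insert loop; a minimal sketch, assuming the same table name and connection:

import sqlite3
import sys
import pandas as pd

df = pd.read_csv(sys.argv[1])
cnx = sqlite3.connect("flights.db")

#if_exists="append" adds the new rows to the existing table;
#index=False keeps the dataframe index out of the database
df.to_sql("company_flights", cnx, if_exists="append", index=False)

cnx.close()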

The second file starts by defining the query that joins the three tables and selects the relevant columns, and then creates a pandas dataframe out of it. Then we read the input dates using the ‘sys’ package and save them into the two variables we will use to filter the dataset (for example, running something like python report.py 2023-01-01 2023-01-31, if the script were named report.py).

Thereafter, it creates the ‘distance’ column using the haversine package and computes the respective CO2 footprint for each flight.
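
The haversine function takes two (latitude, longitude) pairs in decimal degrees and returns the great-circle distance between them, in kilometres by default. A quick illustration, with approximate coordinates chosen just for this example:

from haversine import haversine

#Approximate (latitude, longitude) pairs, for illustration only
lisbon = (38.72, -9.14)
paris = (48.86, 2.35)

#Great-circle distance in kilometres (roughly 1450 km)
print(haversine(lisbon, paris))

Back to the report script: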




import pandas as pd
import sqlite3
from haversine import haversine
import sys

# Define the query and save it in a variable as a string
query = """SELECT
c.employee_id, c.department, c.route_index, c.flight_date,
CAST(s.latitude AS float) AS source_latitude, CAST(s.longitude AS float) AS source_longitude,
CAST(d.latitude AS float) AS dest_latitude, CAST(d.longitude AS float) AS dest_longitude

FROM company_flights AS c

LEFT JOIN routes AS r ON r."index" = c.route_index
LEFT JOIN airports AS s ON r.source_id = s.id
LEFT JOIN airports AS d ON r.dest_id = d.id"""


#Use the query to create a pandas dataframe

df = pd.read_sql_query(query, con=sqlite3.connect("flights.db"))

# Filter the dataframe to select only the used columns
df = df[["flight_date", "department", "source_latitude", "source_longitude", "dest_latitude", "dest_longitude"]]


#Filter by dates

date_from = sys.argv[1]
date_to = sys.argv[2]

df = df[(df['flight_date'] >= date_from) & (df['flight_date'] <= date_to)]
df.reset_index(inplace=True)


#Create the distance and co2_footprint columns

df["distance"] = pd.Series([haversine(
    (df["source_latitude"][i], df["source_longitude"][i]),
    (df["dest_latitude"][i], df["dest_longitude"][i]))
    for i in range(len(df))])

df["co2_footprint"] = df["distance"] * 0.1


#Group by department and aggregate by the sum of CO2 footprint

df_grouped = df.groupby("department", as_index=False)["co2_footprint"].sum().round().sort_values(by="co2_footprint", ascending=False)


#Save to csv file

df_grouped.to_csv("co2_report.csv", index=False)


#Print the report

print(df_grouped)

Finally, we group by department, aggregating by the sum of the CO2 footprint. The result should look like this:

[Image: the resulting report, showing the total CO2 footprint per department]


Overall, the use of automated reports can help organizations improve their efficiency, effectiveness, and competitiveness. They are important because they help businesses and organizations make data-driven decisions. By providing timely, accurate, and comprehensive data on various aspects of the organization, automated reports help decision makers identify trends, optimize processes, and make more informed decisions. They can also save time and reduce the risk of human error, as they are generated automatically using algorithms and software.

Therefore, it is important for a data analyst to know how to build these automated processes and how to make them trustworthy and useful.
