Creating a data pipeline from an API to a database: a practical case
Orlando Bussolo Neto
Business Intelligence and Data Analyst with experience in SQL modeling, Looker, and process optimization.
Overview of experiences and code from a student’s inaugural data engineering project.
Summary
With the aim of expanding my knowledge, especially in Python programming and the use of relational and non-relational databases, I decided to seek improvement through courses and practical cases. Now I want to share a bit of my experience. During this process, I developed my first data pipeline, designed to capture information from an API (API Football) and, through a custom script, store this data in a non-relational database (MongoDB). Later on, I executed a transformation step to retrieve this data and save it into a relational database (MySQL).
Concept Review
Environment
The project was developed in a Linux environment, taking advantage of the platform's strong support for Python development. Visual Studio Code was chosen as the development environment because of its flexibility and efficiency.
For data storage and management, MongoDB was used through its free cloud tier (MongoDB Atlas). This choice allowed quick and convenient access to MongoDB's resources, with the efficiency and scalability the project required.
Stages of the Project
The project was implemented as three Python scripts, publicly available on GitHub at this link: [link]. Additionally, to give a more detailed and visual understanding of the project stages, the image below was created. It highlights the different phases and processes involved in the project, making it easier to visualize the adopted workflow.
Capturing Data from the API and storing it in MongoDB
Among the various freely available data sources, I chose to work with data about football matches from a specific championship, and after some research I settled on the Football API. It is worth noting that the main goal of the work is to establish the data flow, not to perform an in-depth statistical analysis of a particular football championship.
To capture the data, it was essential to register on the API website and obtain unique access credentials. Each user receives an individual access key for authentication. This key was used in conjunction with the requests library to effectively extract the data.
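As an illustration, the sketch below shows roughly how such a request can be made with the requests library. The base URL, the get_events action, the get_matches function, and the parameter values are assumptions made for the example and should be checked against the API's documentation.

import requests

API_KEY = "your_api_key_here"                 # personal key obtained after registering
BASE_URL = "https://apiv3.apifootball.com/"   # assumed base URL; check the API docs

def get_matches(league_id, date_from, date_to):
    # Request the matches of one league within a date range (parameter names assumed)
    params = {
        "action": "get_events",
        "league_id": league_id,
        "from": date_from,
        "to": date_to,
        "APIkey": API_KEY,
    }
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # list of match dictionaries

# Example usage (league id and dates are illustrative):
# matches = get_matches(league_id=152, date_from="2023-08-01", date_to="2024-05-31")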
To establish the connection with MongoDB, the script provided on the database's website was followed. The code available there was encapsulated in the function connect_mongo presented below.
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

def connect_mongo(uri):
    # Create a new client and connect to the server
    client = MongoClient(uri, server_api=ServerApi('1'))
    # Send a ping to confirm a successful connection
    try:
        client.admin.command('ping')
        print("Pinged your deployment. You successfully connected to MongoDB!")
    except Exception as e:
        print(e)
    return client
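With the client in hand, the matches returned by the API can be stored as documents in a collection. The snippet below is a minimal sketch; the database name, the collection name, and the MONGO_URI environment variable are assumptions for illustration, not values from the original project.

import os

uri = os.environ["MONGO_URI"]      # Atlas connection string, assumed to be set as an env var
client = connect_mongo(uri)

db = client["football_db"]         # hypothetical database name
collection = db["matches"]         # hypothetical collection name

# `matches` is the list of dictionaries returned by the API request
result = collection.insert_many(matches)
print(f"{len(result.inserted_ids)} documents inserted into MongoDB.")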
Data Transformation to Save in CSV
The original database contains a large amount of information, so it was necessary to limit the scope to keep the process performant. The columns that will be made available in MySQL can be seen in the code snippet below or in the GitHub repository.
def filter_dataframe(df):
    columns = ['match_id', 'league_id', 'match_date', 'match_time',
               'match_hometeam_id', 'match_hometeam_name',
               'match_hometeam_score', 'match_awayteam_name',
               'match_awayteam_id', 'match_awayteam_score',
               'match_round', 'match_stadium',
               'match_referee', 'league_year']
    # Columns after filtering
    df_games_filtered = df[columns]
    return df_games_filtered
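For context, a minimal sketch of the surrounding transformation step, assuming the `collection` handle from the previous section and an output file name chosen here for illustration: the documents are read from MongoDB into a pandas DataFrame, filtered, and written to CSV.

import pandas as pd

documents = list(collection.find({}, {"_id": 0}))   # read documents, dropping Mongo's internal _id
df = pd.DataFrame(documents)

df_games_filtered = filter_dataframe(df)
df_games_filtered.to_csv("matches_filtered.csv", index=False)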
Saving the data to MySQL
With the CSV generated, the next step was to load it into the relational database. VS Code and its extensions were used to edit and load the data. It is important to note that the MySQL table needed the same columns produced by the filtering step described above.
def create_championship_table(cursor, db_name, tb_name):
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {db_name}.{tb_name}(
            match_id INT,
            league_id INT,
            match_date DATE,
            match_time TIME,
            match_hometeam_id INT,
            match_hometeam_name VARCHAR(255),
            match_hometeam_score INT,
            match_awayteam_name VARCHAR(255),
            match_awayteam_id INT,
            match_awayteam_score INT,
            match_round INT,
            match_stadium VARCHAR(255),
            match_referee VARCHAR(255),
            league_year VARCHAR(50),
            PRIMARY KEY (match_id)
        )""")
Conclusions and Next Steps
The project consisted of designing the data pipeline, from the API to MySQL. Following this stage, with the structured data in place, the Data Analysis or Business Intelligence team can proceed with their analyses.
The resulting dataset consists of 14 columns and 273 rows. It is important to note that the project focused on the process of obtaining and making the data available rather than on the analyses themselves. Even so, several insights could be derived from the dataset, such as which tactical formation clubs used most often, which referee officiated the most games, and which team scored the most goals.
Suggestions for the next stage include automating the scripts so that the database is fed continuously, something not addressed in this project. Additionally, connecting the database to a business intelligence tool (such as Power BI, Tableau, or Looker) would create a data pipeline very close to the ones used by many companies.
The project was developed as part of a training program at Alura, and some parts of the code were based on the course "Data Pipeline: Integrating Python with MongoDB and MySQL." ChatGPT was also used as a source of assistance in understanding the errors generated during the process.