Creating a data pipeline from the API to the database: a practical case.

An overview of the experience and code from a student's first data engineering project.

Summary

With the aim of expanding my knowledge, especially in Python programming and in the use of relational and non-relational databases, I decided to improve through courses and practical cases, and now I want to share a bit of that experience. During this process, I developed my first data pipeline, designed to capture information from an API (API Football) and, through a custom script, store this data in a non-relational database (MongoDB). Later, I executed a transformation step to retrieve this data and save it into a relational database (MySQL).

Concept Review

  • Application Programming Interface (API): It is a set of rules and protocols that enables different software to communicate with each other. It defines the methods and communication formats that programs can use to request and share data, functionalities, and resources consistently and in a standardized manner. In the project, I utilized the Football API [link], which provides access to a vast amount of football-related data.
  • Non-relational Database: also known as NoSQL (Not Only SQL), it is an alternative to traditional relational databases that stands out for its ability to handle large volumes of data and offer greater flexibility in terms of data structure. There are several categories of NoSQL databases; the project uses MongoDB, which stores data as JSON-like documents (BSON).
  • Relational Database: these are systems designed to store, manage, and manipulate structured data according to the relational data model. In these systems, data is organized into tables, where each table consists of rows (also known as records or tuples) and columns (also known as fields or attributes). In this project, MySQL is used, which is one of the most popular and widely used relational database management systems (RDBMS) in the world.
  • Visual Studio Code (VS Code): is a lightweight and highly customizable source-code editor developed by Microsoft. It is designed to be an efficient tool for software developers on a variety of platforms, including Windows, macOS, and Linux.
  • Python scripts: are programs written in the Python programming language. They are used to automate tasks, process data, create tools, or perform any kind of computational operation that can be expressed in Python code.

Environment

The project was developed in a Linux environment, leveraging the advantages and resources this platform offers for development in Python. Visual Studio Code was the chosen development environment due to its flexibility and efficiency in software development.

For data storage and management, MongoDB was used, taking advantage of the free version available in the cloud. This choice allowed quick and convenient access to MongoDB resources, ensuring the efficiency and scalability required for the project.

Stages of the Project

The project was implemented as three Python scripts, publicly available on GitHub at this link: [link]. Additionally, to provide a more detailed and visual understanding of the project stages, the image below was created. It highlights the different phases and processes involved in the project, making it easier to visualize the adopted workflow.

Project Stages

Capturing Data from the API and Storing It in MongoDB

Among the various freely available data sources, I chose to bring in data about football matches from a specific championship, and after some research I settled on the Football API. It is worth noting that the main goal of this work is to establish the data flow, not to perform an in-depth statistical analysis of a particular football championship.

To capture the data, it was essential to register on the API website and obtain unique access credentials. Each user receives an individual access key for authentication. This key was used together with the requests library to extract the data, along the lines of the sketch below.
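As a rough sketch of this capture step (the endpoint URL, query parameters, and function name below are illustrative assumptions, not necessarily the exact call used in the project), the request could look like this:

import requests

API_KEY = "your_api_key_here"                 # personal key obtained after registering
BASE_URL = "https://apiv3.apifootball.com/"   # placeholder endpoint

def fetch_matches(league_id, date_from, date_to):
    # Request the matches of a league within a date range
    params = {
        "action": "get_events",               # assumed action name for match events
        "league_id": league_id,
        "from": date_from,
        "to": date_to,
        "APIkey": API_KEY,
    }
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()               # stop early on HTTP errors
    return response.json()                    # assumed to be a list of match dictionaries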

To establish the connection with MongoDB, the script provided on the database's website was followed. The code available there was encapsulated within the function connect_mongo, presented below.

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

def connect_mongo(uri):
    # Create a new client and connect to the server
    client = MongoClient(uri, server_api=ServerApi('1'))

    # Send a ping to confirm a successful connection
    try:
        client.admin.command('ping')
        print("Pinged your deployment. You successfully connected to MongoDB!")
    except Exception as e:
        print(e)
    return client
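With the connection in place, the client returned by connect_mongo can be used to store the API response as documents. The snippet below is only an illustrative sketch: the connection string, database name, and collection name are assumptions, and fetch_matches refers to the hypothetical capture function sketched earlier.

uri = "mongodb+srv://<user>:<password>@<cluster-url>/"   # Atlas connection string (placeholder)
client = connect_mongo(uri)

db = client["football_db"]        # assumed database name
collection = db["matches"]        # assumed collection name

# Placeholder arguments; assumes the API returns a list of match dictionaries
matches = fetch_matches("<league_id>", "<date_from>", "<date_to>")
collection.insert_many(matches)   # each match dictionary becomes one document
print(collection.count_documents({}), "documents stored in MongoDB")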

Data Transformation to Save in CSV

The original data contains a large amount of information, and to ensure performance it was necessary to limit its scope. The columns that will be made available in MySQL can be seen in the code snippet below or in the GitHub repository.

def filter_dataframe(df):
    columns = ['match_id', 'league_id', 'match_date', 'match_time',
               'match_hometeam_id', 'match_hometeam_name',
               'match_hometeam_score', 'match_awayteam_name',
               'match_awayteam_id', 'match_awayteam_score',
               'match_round', 'match_stadium',
               'match_referee', 'league_year']
    # Keep only the selected columns
    df_games_filtered = df[columns]
    return df_games_filtered
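To show where filter_dataframe fits in, the sketch below reads the documents back from MongoDB into a pandas DataFrame and writes the filtered result to a CSV file. The collection name and output path are illustrative assumptions, not necessarily those used in the project.

import pandas as pd

client = connect_mongo(uri)
collection = client["football_db"]["matches"]         # assumed names, as above

documents = list(collection.find({}, {"_id": 0}))     # exclude MongoDB's internal _id field
df_games = pd.DataFrame(documents)

df_games_filtered = filter_dataframe(df_games)
df_games_filtered.to_csv("matches.csv", index=False)  # illustrative output path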
        

Saving the Data to MySQL

With the generated CSV, the next step was to add it to the relational database. VS Code and its extensions were used as the tool for editing and loading the data. It is important to note that the table had to be created with the same columns obtained in the filtering step described previously.

def create_championship_table(cursor, db_name, tb_name):    
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {db_name}.{tb_name}(
        
        match_id INT,
        league_id INT,
        match_date DATE,
        match_time TIME,
        match_hometeam_id INT,
        match_hometeam_name VARCHAR(255),
        match_hometeam_score INT,
        match_awayteam_name VARCHAR(255),
        match_awayteam_id INT,
        match_awayteam_score INT,
        match_round INT,
        match_stadium VARCHAR(255),
        match_referee VARCHAR(255),
        league_year VARCHAR(50),
                
        PRIMARY KEY (match_id)
    )""")        

Conclusions and Next Steps

The project consisted of designing the data pipeline, from the API to MySQL. Following this stage, with the structured data in place, the Data Analysis or Business Intelligence team can proceed with their analyses.

The resulting dataset consists of 273 rows and 14 columns. It is important to note that the project focused on the process of obtaining and making the data available rather than on the analyses themselves. Even so, several insights can be derived from the dataset, such as the referee who officiated the most games and the team that scored the highest number of goals.

Suggestions for the next stage include automating the scripts so that the database is fed continuously, something not addressed in this project. Additionally, connecting a business intelligence tool (such as Power BI, Tableau, or Looker) would create a data pipeline very close to those used by many companies.

The project was developed as part of a training program at Alura, and some parts of the code were used based on the course: “Data Pipeline: Integrating Python with MongoDB and MySQL.” It is also worth noting the use of ChatGPT as a source of assistance in understanding the errors generated during the process.

