Creating a data pipeline from the API to the database: a practical case.

An overview of the experience and code from a student's first data engineering project.

Summary

With the aim of expanding my knowledge, especially in Python programming and in the use of relational and non-relational databases, I decided to improve through courses and practical cases, and now I want to share a bit of that experience. During this process, I developed my first data pipeline, designed to capture information from an API (API Football) and, through a custom script, store this data in a non-relational database (MongoDB). Later, I executed a transformation step to retrieve this data and save it into a relational database (MySQL).

Concept Review

  • Application Programming Interface (API): It is a set of rules and protocols that enables different software to communicate with each other. It defines the methods and communication formats that programs can use to request and share data, functionalities, and resources consistently and in a standardized manner. In the project, I utilized the Football API [link], which provides access to a vast amount of football-related data.
  • Non-relational Database: also known as NoSQL (Not Only SQL), it is an alternative to traditional relational databases that stands out for its ability to handle large volumes of data and offer greater flexibility in terms of data structure. There are several categories of NoSQL databases; the project uses MongoDB, which stores data as JSON-like documents (BSON).
  • Relational Database: these are systems designed to store, manage, and manipulate structured data according to the relational data model. In these systems, data is organized into tables, where each table consists of rows (also known as records or tuples) and columns (also known as fields or attributes). In this project, MySQL is used, which is one of the most popular and widely used relational database management systems (RDBMS) in the world.
  • Visual Studio Code (VS Code): is a lightweight and highly customizable source-code editor developed by Microsoft. It is designed to be an efficient tool for software developers on a variety of platforms, including Windows, macOS, and Linux.
  • Python scripts: are programs written in the Python programming language. They are used to automate tasks, process data, create tools, or perform any kind of computational operation that can be expressed in Python code.

Environment

The project was developed in a Linux environment, leveraging the advantages and resources this platform offers for development in Python. Visual Studio Code was the chosen development environment due to its flexibility and efficiency in software development.

For data storage and management, MongoDB was used, taking advantage of the free version available in the cloud. This choice allowed quick and convenient access to MongoDB resources, ensuring the efficiency and scalability required for the project.

Stages of the Project

The project was implemented as three Python scripts, publicly available on GitHub at this link: [link]. Additionally, to provide a more detailed and visual understanding of the project stages, the image below was created. It highlights the different phases and processes involved in the project, making it easier to visualize the adopted workflow.

Project Stages

Capturing Data from the API and Storing It in MongoDB

Among the various freely available data sources, I chose to bring in data about football matches from a specific championship, and after some research I settled on the Football API. It is worth noting that the main goal of this work is to establish the data flow, not to perform an in-depth statistical analysis of a particular football championship.

To capture the data, it was essential to register on the API website and obtain unique access credentials. Each user receives an individual access key for authentication. This key was used together with the requests library to extract the data, along the lines of the sketch below.
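As a rough sketch of this capture step (the endpoint URL, query parameters, and function name below are illustrative assumptions, not necessarily the exact call used in the project), the request could look like this:

import requests

API_KEY = "your_api_key_here"                 # personal key obtained after registering
BASE_URL = "https://apiv3.apifootball.com/"   # placeholder endpoint

def fetch_matches(league_id, date_from, date_to):
    # Request the matches of a league within a date range
    params = {
        "action": "get_events",               # assumed action name for match events
        "league_id": league_id,
        "from": date_from,
        "to": date_to,
        "APIkey": API_KEY,
    }
    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()               # stop early on HTTP errors
    return response.json()                    # assumed to be a list of match dictionaries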

To establish the connection with MongoDB, the script provided on the database's website was followed. The code available there was encapsulated within the function connect_mongo, presented below.

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi

def connect_mongo(uri):
    # Create a new client and connect to the server
    client = MongoClient(uri, server_api=ServerApi('1'))

    # Send a ping to confirm a successful connection
    try:
        client.admin.command('ping')
        print("Pinged your deployment. You successfully connected to MongoDB!")
    except Exception as e:
        print(e)
    return client
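With the connection in place, the client returned by connect_mongo can be used to store the API response as documents. The snippet below is only an illustrative sketch: the connection string, database name, and collection name are assumptions, and fetch_matches refers to the hypothetical capture function sketched earlier.

uri = "mongodb+srv://<user>:<password>@<cluster-url>/"   # Atlas connection string (placeholder)
client = connect_mongo(uri)

db = client["football_db"]        # assumed database name
collection = db["matches"]        # assumed collection name

# Placeholder arguments; assumes the API returns a list of match dictionaries
matches = fetch_matches("<league_id>", "<date_from>", "<date_to>")
collection.insert_many(matches)   # each match dictionary becomes one document
print(collection.count_documents({}), "documents stored in MongoDB")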

Data Transformation to Save in CSV

The original data contains a large amount of information, and to ensure performance it was necessary to limit its scope. The columns that will be made available in MySQL can be seen in the code snippet below or in the GitHub repository.

def filter_dataframe(df):
    columns = ['match_id', 'league_id', 'match_date', 'match_time',
               'match_hometeam_id', 'match_hometeam_name',
               'match_hometeam_score', 'match_awayteam_name',
               'match_awayteam_id', 'match_awayteam_score',
               'match_round', 'match_stadium',
               'match_referee', 'league_year']
    # Keep only the selected columns
    df_games_filtered = df[columns]
    return df_games_filtered
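To show where filter_dataframe fits in, the sketch below reads the documents back from MongoDB into a pandas DataFrame and writes the filtered result to a CSV file. The collection name and output path are illustrative assumptions, not necessarily those used in the project.

import pandas as pd

client = connect_mongo(uri)
collection = client["football_db"]["matches"]         # assumed names, as above

documents = list(collection.find({}, {"_id": 0}))     # exclude MongoDB's internal _id field
df_games = pd.DataFrame(documents)

df_games_filtered = filter_dataframe(df_games)
df_games_filtered.to_csv("matches.csv", index=False)  # illustrative output path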
        

Saving the Data to MySQL

With the generated CSV, the next step was to add it to the relational database. VS Code and its extensions were used as the tool for editing and loading the data. It is important to note that the table had to be created with the same columns obtained in the filtering step described previously.

def create_championship_table(cursor, db_name, tb_name):    
    cursor.execute(f"""
        CREATE TABLE IF NOT EXISTS {db_name}.{tb_name}(
        
        match_id INT,
        league_id INT,
        match_date DATE,
        match_time TIME,
        match_hometeam_id INT,
        match_hometeam_name VARCHAR(255),
        match_hometeam_score INT,
        match_awayteam_name VARCHAR(255),
        match_awayteam_id INT,
        match_awayteam_score INT,
        match_round INT,
        match_stadium VARCHAR(255),
        match_referee VARCHAR(255),
        league_year VARCHAR(50),
                
        PRIMARY KEY (match_id)
    )""")        

Conclusions and Next Steps

The project consisted of designing the data pipeline, from the API to MySQL. Following this stage, with the structured data in place, the Data Analysis or Business Intelligence team can proceed with their analyses.

The resulting dataset consists of 273 rows and 14 columns. It is important to note that the project focused on the process of obtaining and making the data available rather than on the analyses themselves. Even so, several insights can be derived from the dataset, such as the referee who officiated the most games and the team that scored the highest number of goals.

Suggestions for the next stage include automating the scripts so that the database is fed continuously, something not addressed in this project. Additionally, connecting a business intelligence tool (such as Power BI, Tableau, or Looker) would create a data pipeline very close to those used by many companies.

The project was developed as part of a training program at Alura, and some parts of the code were used based on the course: “Data Pipeline: Integrating Python with MongoDB and MySQL.” It is also worth noting the use of ChatGPT as a source of assistance in understanding the errors generated during the process.

