Data Pipeline

Kishan Kumar

Senior Consultant CRD(Corporate function) at Huquo

Published: January 4, 2023

A data pipeline is a method in which raw data is ingested from various data sources and then moved to a data store, such as a data lake or data warehouse, for analysis. Before data flows into a data repository, it usually undergoes some processing. This includes data transformations, such as filtering, masking, and aggregation, which ensure appropriate data integration and standardization. This is particularly important when the destination for the dataset is a relational database. This type of data repository has a defined schema, so incoming data must align with it (i.e. matching columns and data types) before it can update existing data.
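As a rough illustration of the transformations named above, the following Python sketch filters, masks, and aggregates a handful of records and then checks the result against a destination schema. The sample rows, the column names, and the helpers `mask_email`, `transform`, and `conforms` are all hypothetical and chosen only for this example; a real pipeline would typically rely on a database or a dataframe library instead.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a source system.
RAW_ROWS = [
    {"email": "ana@example.com", "region": "EU", "amount": "19.50"},
    {"email": "bob@example.com", "region": "US", "amount": "5.00"},
    {"email": "cy@example.com", "region": "EU", "amount": "0.50"},
    {"email": "dee@example.com", "region": "EU", "amount": "-3.00"},
]

# Hypothetical destination schema: column name -> required Python type.
TARGET_SCHEMA = {"region": str, "total_amount": float, "order_count": int}


def mask_email(email):
    """Masking: hide the local part of an address, keeping only the domain."""
    _, _, domain = email.partition("@")
    return "***@" + domain


def transform(rows):
    """Return (masked detail rows, per-region summary rows)."""
    # Filtering: drop rows whose amount is not positive.
    valid = [r for r in rows if float(r["amount"]) > 0]
    # Masking: replace the sensitive column in every remaining row.
    masked = [{**r, "email": mask_email(r["email"])} for r in valid]
    # Aggregation: total amount and order count per region.
    totals = defaultdict(lambda: {"total_amount": 0.0, "order_count": 0})
    for r in masked:
        totals[r["region"]]["total_amount"] += float(r["amount"])
        totals[r["region"]]["order_count"] += 1
    summary = [{"region": region, **agg} for region, agg in sorted(totals.items())]
    return masked, summary


def conforms(row, schema):
    """Schema alignment: same columns, and each value has the required type."""
    return set(row) == set(schema) and all(
        isinstance(row[col], col_type) for col, col_type in schema.items()
    )


masked_rows, summary_rows = transform(RAW_ROWS)
all_aligned = all(conforms(row, TARGET_SCHEMA) for row in summary_rows)
```

Only the summary rows are checked against the schema here, on the assumption that they are what gets loaded into the relational table.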

Types of data pipelines

There are two main types of data pipelines: batch processing and streaming data.

Batch processing

The development of batch processing was a critical step in building data infrastructures that were reliable and scalable. In 2004, MapReduce, a batch processing programming model, was introduced by Google and subsequently implemented in open-source systems such as Hadoop, CouchDB, and MongoDB.

As the name implies, batch processing loads “batches” of data into a repository at set time intervals, which are typically scheduled during off-peak business hours. Because batch jobs tend to work with large volumes of data, which can tax the overall system, this scheduling keeps them from impacting other workloads. Batch processing is usually the optimal data pipeline when there isn’t an immediate need to analyze a specific dataset (e.g. monthly accounting), and it is most closely associated with the ETL data integration process, which stands for “extract, transform, and load.”

Batch processing jobs form a workflow of sequenced commands, where the output of one command becomes the input of the next. For example, one command may kick off data ingestion, the next may filter specific columns, and the one after that may handle aggregation. This series of commands continues until the data is completely transformed and written into the data repository.
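The chaining described above can be sketched in a few lines of Python. This is a minimal illustration, not a scheduler: the step functions `ingest`, `select_columns`, `aggregate`, and `load`, the sample rows, and the in-memory `warehouse` dictionary are hypothetical stand-ins for real jobs and a real repository.

```python
from functools import reduce

# Hypothetical stand-in for the destination data repository.
warehouse = {}


def ingest(_):
    """Step 1: ingestion. A real job would read files or source tables."""
    return [
        {"id": 1, "dept": "ops", "cost": 120, "note": "laptop"},
        {"id": 2, "dept": "hr", "cost": 50, "note": "training"},
        {"id": 3, "dept": "ops", "cost": 80, "note": "monitor"},
    ]


def select_columns(rows):
    """Step 2: keep only the columns needed downstream."""
    return [{"dept": r["dept"], "cost": r["cost"]} for r in rows]


def aggregate(rows):
    """Step 3: total cost per department."""
    totals = {}
    for r in rows:
        totals[r["dept"]] = totals.get(r["dept"], 0) + r["cost"]
    return totals


def load(totals):
    """Step 4: write the transformed data into the repository."""
    warehouse.update(totals)
    return warehouse


def run_batch(steps, initial=None):
    """Run the steps in order, feeding each output into the next step."""
    return reduce(lambda data, step: step(data), steps, initial)


result = run_batch([ingest, select_columns, aggregate, load])
```

In practice the same sequence would be triggered on a schedule, for instance during off-peak hours, by an orchestration tool rather than called directly.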

Streaming data

Unlike batch processing, streaming data is used when data must be updated continuously. For example, apps or point-of-sale systems need real-time data to update the inventory and sales history of their products; that way, sellers can tell consumers whether a product is in stock. A single action, like a product sale, is considered an “event”, and related events, such as adding an item to checkout, are typically grouped together as a “topic” or “stream.” These events are then transported via messaging systems or message brokers, such as the open-source Apache Kafka.
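To make the vocabulary of events and topics concrete, here is a small in-memory sketch in Python. It does not use Kafka or mirror its API; the `InMemoryBroker` class, the `sales` topic name, and the `on_sale` handler are hypothetical and exist only to show how an event published to a topic can update inventory right away.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy message broker: groups events by topic and notifies subscribers."""

    def __init__(self):
        self.topics = defaultdict(list)        # topic name -> ordered events
        self._subscribers = defaultdict(list)  # topic name -> handler functions

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Record the event on its topic, then deliver it to every subscriber.
        self.topics[topic].append(event)
        for handler in self._subscribers[topic]:
            handler(event)


# Hypothetical inventory kept up to date by sale events.
inventory = {"sku-1": 3}


def on_sale(event):
    """Consumer: reduce stock as soon as a sale event arrives."""
    inventory[event["sku"]] -= event["quantity"]


broker = InMemoryBroker()
broker.subscribe("sales", on_sale)
broker.publish("sales", {"sku": "sku-1", "quantity": 1})
broker.publish("sales", {"sku": "sku-1", "quantity": 2})

in_stock = inventory["sku-1"] > 0
```

After the two sale events, the stock for `sku-1` reaches zero, so a seller reading `in_stock` would report the product as unavailable.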

Since data events are processed shortly after they occur, stream processing systems have lower latency than batch systems. They aren’t considered as reliable as batch systems, however, because messages can be unintentionally dropped or spend a long time in a queue. Message brokers help to address this concern through acknowledgements, where a consumer confirms to the broker that it has processed a message so that the broker can remove it from the queue.
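The acknowledgement mechanism can be illustrated with a toy queue in Python. This is a sketch under simplifying assumptions (a single consumer, no timeouts, everything in memory); the `AckQueue` class and its method names are hypothetical and do not correspond to any particular broker's API.

```python
from collections import deque


class AckQueue:
    """Toy queue that keeps a message until the consumer acknowledges it."""

    def __init__(self):
        self._pending = deque()  # messages waiting to be delivered
        self._in_flight = {}     # delivered but not yet acknowledged
        self._next_id = 0

    def send(self, body):
        self._pending.append((self._next_id, body))
        self._next_id += 1

    def receive(self):
        """Deliver the next message and track it as in flight."""
        if not self._pending:
            return None
        msg_id, body = self._pending.popleft()
        self._in_flight[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        """Consumer confirms processing, so the message is removed for good."""
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self):
        """Put unacknowledged messages back so they can be redelivered."""
        for msg_id in sorted(self._in_flight):
            self._pending.append((msg_id, self._in_flight[msg_id]))
        self._in_flight.clear()

    def size(self):
        return len(self._pending) + len(self._in_flight)


queue = AckQueue()
queue.send("sale:sku-1")
queue.send("sale:sku-2")

first = queue.receive()
queue.ack(first[0])            # processed successfully, removed from the queue

second = queue.receive()       # consumer fails before acknowledging
queue.requeue_unacked()        # the broker makes the message available again
redelivered = queue.receive()
```

Because the second message was never acknowledged, it is not lost: it is delivered again with the same identifier.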

Data pipeline architecture

Three core steps make up the architecture of a data pipeline.

1. Data Ingestion: Data is collected from various data sources, which include various data structures (i.e. structured and unstructured data). Within streaming data, these raw data sources are typically known as producers, publishers, or senders. While businesses can choose to extract data only when they are ready to process it, it’s better practice to land the raw data within a cloud data warehouse provider first. This way, the business can reprocess historical data if it needs to adjust its data processing jobs.

2. Data Transformation: During this step, a series of jobs is executed to process data into the format required by the destination data repository. These jobs embed automation and governance for repetitive workstreams, like business reporting, ensuring that data is cleansed and transformed consistently. For example, a data stream may arrive in a nested JSON format, and the data transformation stage will aim to flatten that JSON to extract the key fields for analysis.

3. Data Storage: The transformed data is then stored within a data repository, where it can be exposed to various stakeholders. Within streaming data, the systems that receive this transformed data are typically known as consumers, subscribers, or recipients.
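The nested JSON example from the transformation step can be sketched as follows. The `flatten` helper, the dotted key naming, and the sample payload are assumptions made for illustration; the sketch handles nested objects only and leaves lists untouched.

```python
import json


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted key names."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key path along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical event payload as it might arrive from a stream.
raw = '{"order": {"id": 7, "customer": {"country": "DE"}}, "total": 42.5}'
record = flatten(json.loads(raw))
```

Each key in `record` can then be mapped to a column in the destination repository.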
