Building Modern Data Pipelines…The Fine Points
Kaustuv Majumder
Director, Data Engineering @ Morgan Stanley | Ex-Nomura, Accenture
What is a Data Pipeline?
A data pipeline is a series of data processing steps. If the data is not currently loaded into the data platform, then it is ingested at the beginning of the pipeline. Then there are a series of steps in which each step delivers an output that is the input to the next step. This continues until the pipeline is complete. In some cases, independent steps may be run in parallel.
Data pipelines consist of three key elements: a source, a processing step or steps, and a destination. In some data pipelines, the destination may be called a sink. Data pipelines enable the flow of data from an application to a data warehouse or from a data lake to an analytics database. Data pipelines also may have the same source and sink, such that the pipeline is purely about modifying the data set. Any time data is processed between point A and point B (or points B, C, and D), there is a data pipeline between those points.
The Architecture of a Modern Data Pipeline
Design Philosophies
Batch Pipeline
Data pipelines may be architected in several different ways. One common example is a batch-based data pipeline. In this scenario, an application such as a point-of-sale system generates a large number of data points that need to be pushed to a data warehouse and an analytics database, typically in scheduled bulk loads.
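A minimal Python sketch of such a batch job is shown below. The file name, fields, and warehouse loader are hypothetical placeholders rather than a prescribed implementation; the point is simply that a whole day's data is processed in one scheduled run.

```python
# Hedged sketch of a nightly batch job: read the day's point-of-sale
# export, apply a simple transformation, and hand the result to a
# warehouse loader. File name, fields, and loader are hypothetical.
import csv
from datetime import date

def extract(path: str) -> list[dict]:
    """Read the raw POS export produced during the day."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Cast amounts to numbers and tag each row with the batch date."""
    return [
        {"order_id": r["order_id"],
         "amount": float(r["amount"]),
         "batch_date": date.today().isoformat()}
        for r in rows
    ]

def load(rows: list[dict]) -> None:
    """Placeholder for a bulk insert/COPY into the warehouse."""
    print(f"Loading {len(rows)} rows into the orders_daily table")

if __name__ == "__main__":
    # Typically triggered once per day by a scheduler (cron, Airflow, ...).
    load(transform(extract("pos_export.csv")))
```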
Streaming Pipeline
In a streaming data pipeline, data from the point-of-sale system would be processed as it is generated. The stream processing engine could feed outputs from the pipeline to data stores, marketing applications, and CRMs, among other applications, as well as back to the point-of-sale system itself.
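As a hedged illustration of the same flow in streaming form, the sketch below consumes POS events from a Kafka topic using the kafka-python client. The topic name, broker address, and routing function are assumptions introduced for the example, not part of the architecture described above.

```python
# Hedged sketch of a streaming pipeline consumer: process each
# point-of-sale event as it is generated and fan it out downstream.
# Assumes the kafka-python package and a hypothetical "pos-events" topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "pos-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def route(event: dict) -> None:
    """Placeholder: forward the enriched event to data stores, CRM, etc."""
    print("routing order", event.get("order_id"))

for message in consumer:                    # blocks, yielding events as they arrive
    event = message.value
    event["amount"] = float(event.get("amount", 0))   # simple in-flight cleaning
    route(event)
```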
Lambda Pipeline
Lambda Architecture combines batch and streaming pipelines into one architecture. The Lambda Architecture is popular in big data environments because it enables developers to account for both real-time streaming use cases and historical batch analysis. One key aspect of this architecture is that it encourages storing data in raw format, so the organization can continue to run new data pipelines over that raw data to correct code errors in prior pipelines, or to create new data destinations that enable new types of queries.
The Lambda Architecture accounts for both a traditional batch data pipeline and a real-time data streaming pipeline. It also has a serving layer that responds to queries.
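A toy sketch of that serving layer is shown below, assuming a precomputed batch view and an in-memory speed-layer view (both hypothetical, with invented values); a query is answered by merging the two.

```python
# Hedged sketch of a Lambda-style serving layer: a batch view holds
# totals recomputed periodically from raw data, while a speed-layer
# view holds increments since the last batch run. Values are invented.
batch_view = {"store_1": 10_500, "store_2": 8_200}   # as of the last batch job
speed_view = {"store_1": 42, "store_3": 7}           # streamed since that run

def total_sales(store_id: str) -> int:
    """Serving layer: merge both views to answer a query."""
    return batch_view.get(store_id, 0) + speed_view.get(store_id, 0)

print(total_sales("store_1"))   # 10542
```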
Data Source
Data Source is the point of entry for data from all origins in the pipeline. A comprehensive pipeline should be able to ingest data from multiple sources pertaining to various business verticals, both within and outside the organization. Some of the most common examples of data sources are operational (OLTP) databases, SaaS applications, event streams and IoT devices, and flat files or third-party APIs.
The Data Highway
ETL vs ELT
Data processing consists of three stages: extract, transform, and load. The order in which the transform and load stages are performed distinguishes the two approaches described below.
Extract-Transform-Load (ETL)
ETL (extract, transform, load) has been the traditional approach in analytics for several decades. It was originally designed to work with relational databases which have historically dominated the market. ETL requires the transformations to happen before the loading process. Data is extracted from data sources and then deposited into a staging area. Data is then cleaned, enriched, transformed and finally loaded into the data warehouse.
Data teams using this approach have to predict all the use cases for the data before any analysis is even performed and then create transformations accordingly. This often means that data teams are left waiting, as they can’t access any information until the whole process is completed.
Because Engineering teams typically own the extract and transform process, analysts have little visibility into the logic that has been used for the transformation process. This makes it hard for analysts to understand exactly what the data represents, often leading to incorrectly drawn conclusions. And because the analysts have little visibility, they can’t help the engineering team fix any bugs with the transformation code.
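To make the ordering concrete, here is a hedged sketch of an ETL flow in Python in which cleaning happens in a staging step before anything reaches the warehouse; the records, validation rule, and loader are invented for illustration.

```python
# Hedged ETL sketch: cleaning and enrichment happen in a staging step
# BEFORE anything is loaded, so the warehouse only ever receives the
# already-transformed result. Records and rules are invented.
raw_records = [
    {"customer": " Alice ", "spend": "120.50"},
    {"customer": "Bob", "spend": "n/a"},      # fails validation in staging
]

def transform_in_staging(records: list[dict]) -> list[dict]:
    """Owned by the engineering team; analysts never see this step."""
    cleaned = []
    for r in records:
        try:
            cleaned.append({"customer": r["customer"].strip(),
                            "spend": float(r["spend"])})
        except ValueError:
            continue   # rejected rows never reach the warehouse
    return cleaned

def load_to_warehouse(records: list[dict]) -> None:
    print(f"Loaded {len(records)} pre-transformed rows")   # placeholder

load_to_warehouse(transform_in_staging(raw_records))       # Loaded 1 ...
```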
Extract-Load-Transform (ELT)
ELT is a modern variation of ETL where data cleaning, enrichment and transformation happen after the loading process. This is enabled by the fact that modern cloud data warehouses are extremely scalable and separate storage from compute resources. So essentially the main difference between ETL and ELT is the order that these steps take place.
With the transformation happening in the warehouse, it’s typically defined using SQL. This allows analysts to contribute to (or indeed entirely own) the transformation logic.
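A minimal sketch of the ELT pattern follows, assuming a generic Python DB-API connection to a cloud warehouse; the schemas, table names, and COPY syntax are placeholders that vary by warehouse.

```python
# Hedged ELT sketch: load raw data first, then transform it with SQL
# that runs inside the warehouse, where analysts can read and own the
# logic. Connection, schemas, and COPY syntax are placeholders.
def run(conn, sql: str) -> None:
    cur = conn.cursor()
    cur.execute(sql)
    cur.close()
    conn.commit()

def elt(conn) -> None:
    # 1. Load: land the raw extract as-is (exact COPY syntax varies by warehouse).
    run(conn, "COPY raw.orders FROM 's3://bucket/orders/' FORMAT CSV")
    # 2. Transform: build the analytics model from the raw table, in SQL.
    run(conn, """
        CREATE TABLE analytics.orders_clean AS
        SELECT order_id,
               CAST(amount AS DECIMAL(12, 2)) AS amount,
               LOWER(TRIM(customer_email))    AS customer_email
        FROM raw.orders
        WHERE amount IS NOT NULL
    """)
```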
The ETL process was developed to cope with the limitations of traditional legacy data warehouses, such as the high costs of computation and storage. With the rapid growth of cloud-based solutions and the decreasing cost of cloud-based storage, ELT is becoming more popular. Some of the key benefits of ELT are as follows:
- Data becomes available for analysis sooner, because loading is not blocked by upfront transformation.
- Transformations are expressed in SQL inside the warehouse, so analysts can see, contribute to, and own the logic.
- Raw data is preserved, so transformations can be corrected or re-run without re-extracting from source systems.
- Cloud warehouses separate storage from compute, so transformation workloads can scale with the data.
Big Data Pipeline
As the volume, variety, and velocity of data have dramatically grown in recent years, architects and developers have had to adapt to “big data.” The term “big data” implies that there is a huge volume to deal with. This volume of data can open opportunities for use cases such as predictive analytics, real-time reporting, and alerting, among many examples.
Like many components of data architecture, data pipelines have evolved to support big data. Big data pipelines are data pipelines built to accommodate one or more of the three traits of big data. The velocity of big data makes it appealing to build streaming pipelines, so that data can be captured and processed in real time and acted on immediately. The volume of big data requires that pipelines be scalable, since the volume can vary over time; in practice, many big data events are likely to occur simultaneously or very close together, so the pipeline must scale to process significant volumes of data concurrently. The variety of big data requires that pipelines be able to recognize and process data in many different formats: structured, unstructured, and semi-structured.
A typical example of a big data pipeline using open-source technologies.
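As one hedged illustration of such a stack, the sketch below assumes Kafka as the ingestion bus and Spark Structured Streaming as the processing engine, landing raw events in a data lake as Parquet; the broker, topic, and storage paths are placeholders.

```python
# Hedged sketch of an open-source big data pipeline: ingest events from
# Kafka with Spark Structured Streaming and land them in a data lake as
# Parquet. Requires the Spark Kafka connector package at submit time;
# broker, topic, and storage paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-data-pipeline").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "pos-events")
          .load())

# Kafka delivers values as bytes; cast to string before further parsing.
parsed = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3a://datalake/raw/pos-events/")
         .option("checkpointLocation", "s3a://datalake/checkpoints/pos-events/")
         .start())

query.awaitTermination()
```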
Data Destination
Data Warehouse
A data warehouse is the core analytics system of an organization. The data warehouse will frequently work in conjunction with an operational data store (ODS) to ‘warehouse’ data captured by the various databases used by the business. For example, suppose a company has databases supporting POS, online activity, customer data, and HR data. In that case, the data warehouse will take the data from these sources and make them available in a single location. Again, the ODS will typically handle the process of cleaning and normalizing the data, and preparing it for storage in the data warehouse.
Because a data warehouse captures transformed (i.e. cleaned) historical data, it is an ideal tool for data analysis. Because business units will leverage the warehouse data to create reports and perform data analysis, business units are frequently involved in how the data is organized. Like a relational database, it typically uses SQL to query the data, and it uses tables, indexes, keys, views, and data types for data organization and integrity.
While a database can be a pseudo-data warehouse through the implementation of views, it is considered best practice to use a data warehouse for business user interaction, leaving databases to capture transactional data. Because the chief intent is analytics, a data warehouse is used for online analytical processing (OLAP).
Data Lake
A data lake stores an organization’s raw and processed (unstructured and structured) data at both large and small scales. Unlike a data warehouse or database, a data lake captures anything the organization deems valuable for future use. This can be images, videos, PDFs, or anything. The data lake will extract data from multiple disparate data sources and process the data like a data warehouse. Also, like a data warehouse, a data lake can be used for data analytics and report creation. However, the technology used in a data lake is much more complex than in a data warehouse.
Frequently, data lakes are used in conjunction with machine learning. The output from machine learning experiments is also often stored in the data lake. Because of the level of complexity and skill required to leverage it, a data lake requires users who are experienced in programming languages and data science techniques. Lastly, unlike a data warehouse, a data lake does not leverage an ODS for data cleaning.
Data Mart
A data mart is very similar to a data warehouse. Like a data warehouse, the data mart will maintain and house clean data ready for analysis. However, unlike a data warehouse, the scope of visibility is limited.
A data mart supplies subject-oriented data necessary to support a specific business unit. For example, a data mart could be created to support reporting and analysis for the marketing department. By limiting the data to a particular business unit (for example, the marketing department), the business unit does not have to sift through irrelevant data.
Another benefit is security. Limiting the visibility of non-essential data to the department eliminates the chance of that data being used irresponsibly.
A third benefit is speed. As there will be less data in the data mart, the processing overhead is decreased. This means that queries will run faster.
Finally, because the data in the data mart is aggregated and prepared appropriately for that department, the chance of misusing the data is reduced. The potential for conflicting reporting is also reduced.
A high-level comparison of these three constructs.
Pipeline Scheduling
In the context of a data pipeline, pipeline scheduling refers to the act of automating parts or all of a data pipeline’s components at fixed times, dates, or intervals. Typical frequencies of scheduled runs or ‘jobs’ range from a few minutes to days and well beyond, depending on the purpose of the pipeline and what systems, applications, or reports it feeds.
Pipeline scheduling is not to be confused with data streaming, which involves a constant, real-time feed of data from one or more sources that passes through the processes specified in the pipeline.
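As a trivial illustration of interval-based scheduling, here is a hedged, standard-library-only sketch of a fixed-interval runner; real deployments would rely on cron or one of the orchestration tools discussed below rather than a hand-rolled loop.

```python
# Hedged, standard-library-only sketch of a fixed-interval scheduler.
# Production pipelines would use cron or an orchestrator instead of a
# hand-rolled loop; the 15-minute interval is an arbitrary example.
import time
from datetime import datetime

RUN_EVERY_SECONDS = 15 * 60   # run the pipeline every 15 minutes

def run_pipeline() -> None:
    print(f"[{datetime.now().isoformat(timespec='seconds')}] pipeline run started")
    # the extract -> transform -> load steps would be invoked here

while True:
    run_pipeline()
    time.sleep(RUN_EVERY_SECONDS)
```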
Various tools and techniques can be deployed to automate a data pipeline. The two main types of data workflow automation tools are low-code/no-code and code-based.
Low-Code / No-Code Solutions
These solutions let you automate your data workflows by dragging and dropping the desired components from a list of available options onto a graphical canvas and then configuring the connections between them.
Code-free data pipeline automation tools are easy to use and enable non-technical teams to create pipelines and data flows that capture valuable insights from multiple business operations and customer interactions. The learning curve is usually low too, so you can get started quickly.
Although many of these code-free tools are easy to use, they have disadvantages. Most of them lack customization options, so you are bound by the tool’s limitations: you have to structure your data the way the tool expects, and you are limited to whatever it offers for data cleaning.
The lack of a source or version control system is often a problem. Without version control, it is hard to maintain transparency among team members, and mistakes can take a long time to find and fix, wasting more time than necessary.
These code-free tools have limited integrations with sources and targets, e.g. Fivetran is missing the Salesforce Commerce Cloud integration.
This might not necessarily be a disadvantage for most companies. However, most of these tools cannot handle streaming or real-time data, which can make it difficult to provide the necessary information promptly when it is needed.
Examples: Fivetran, Alteryx, Panoply
Coding Solutions
Code-based solutions allow for automating data workflows by using custom scripts containing specific instructions for carrying out each step in the process. Anybody who can program in the chosen programming language can run these scripts.
The advantages of code-based solutions are clear: they make automation straightforward and make it possible to extract large amounts of data from spreadsheets or databases in a short time. Most importantly, the scripts can easily be placed under source and version control, and most of these tools are open source, so you can read the code or add an integration yourself.
There are certain disadvantages to these tools. The learning curve for a code-based tool is usually very steep, and one must have a deep understanding of the coding language to use it effectively.
If you are an organization with a large team, you can set up your own infrastructure; smaller teams can opt for a cloud-hosted platform (for a fee). There are also fewer of these tools on the market, so if you are not happy with them, the alternatives are limited.
Examples: Airflow, Dagster, Prefect
Apache Airflow
Apache Airflow is undoubtedly the most widely used open-source automation tool. It is a system to programmatically author, schedule, and monitor data processing workflows, designed to scale as data grows. Airflow can be used for ETL, data validation and quality assurance, machine learning pipelines, or any other type of task that can be broken down into a series of steps with dependencies between them.
Airflow provides the ability to author workflows as directed acyclic graphs (DAGs) of tasks. Each task is typically defined as an operator or a Python function (via the TaskFlow API), and return values can be passed between tasks through XComs. Tasks can also have side effects, such as updating a database or sending an email.
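A minimal hedged sketch of an Airflow DAG using the TaskFlow API is shown below (Airflow 2.4 or later assumed); the task bodies are placeholders rather than a real pipeline.

```python
# Hedged sketch of an Airflow DAG using the TaskFlow API (Airflow 2.4+
# assumed). The three tasks form a directed acyclic graph; return values
# are passed between them via XCom behind the scenes. Task bodies are
# placeholders, not a real pipeline.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def daily_sales_pipeline():

    @task
    def extract() -> list:
        return [{"order_id": 1, "amount": 120.0}]        # placeholder source

    @task
    def transform(rows: list) -> list:
        return [{**r, "amount_usd": r["amount"]} for r in rows]

    @task
    def load(rows: list) -> None:
        print(f"loading {len(rows)} rows")               # placeholder sink

    load(transform(extract()))

daily_sales_pipeline()
```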
Unfortunately, there are a handful of disadvantages. Local development and testing are a hassle to set up. XCom was added to allow data to move between tasks, but it is not as seamless as the native data passing that some other code-based tools offer. Lastly, Airflow’s UI is not as intuitive as one would hope and does not provide an easy interface for creating dynamic and parameterized workflows.
The Cloud Ecosystem
These days, almost all complex data infrastructures are deployed in the cloud, which provides immense advantages over on-premises data centers.
Challenges of Modern Data Pipeline
With every passing year, the volume and complexity of business data grow, and modern data pipelines must address these scenarios. Some of the most common problems faced by data engineering teams are as follows:
Designing the optimal architecture
Getting a clear view of data and putting it in the correct place is critical. Organizations must be able to connect their products to as many different data repositories and formats as feasible, including unstructured data. However, deciding what to use, as well as combining, transforming, and ingesting that data, can be difficult.
Data Storing
Storing large volumes of data has always been a challenge. The storage must be cheap, reliable, and easily scalable to accommodate future requirements. Also, multiple copies of the same data need to be maintained on different servers in order to avoid data loss. Lately, with the advent of various cloud providers, the issues related to data storage have been solved to a great extent. They provide fully managed, fault-tolerant data storage options at a much lower cost compared to on-premise servers.
Flexibility
Pipelines based on extract, transform, and load (ETL) operations frequently present enterprises with distinct issues. A flaw in one phase of an ETL process can result in hours of intervention, lowering data quality, losing consumer confidence, and complicating maintenance.
Scaling
Analysts may import data in discrete, all-or-nothing atomic chunks. This strategy, however, is ineffective due to the volume and velocity of data, as well as the demand for real-time insights. For analysts, data storage must be automatically scalable. On-premises data servers fail to address the problem of auto-scaling, since nodes need to be added manually to scale the database. However, modern clouds have largely mitigated this problem: they provide storage solutions that scale automatically as the data volume increases.
Migration
Many organizations use legacy systems to perform business. These systems are not suitable to be integrated with the data pipeline. Thus, data needs to be migrated out of these systems before the pipelines can be deployed. Migration of large, sensitive data is often an intense and time-consuming activity with a chance of data corruption.
Benefits of Modern Data Pipeline
Having a robust data infrastructure can immensely help an organization to make good use of its data and generate revenue from it. Some of the key benefits are as follows:
Access to Information
Businesses struggle to manage their data due to a lack of data centralization. Centralized data management facilitates cross-functional collaboration and data transparency across the organization. As a result, users such as business intelligence teams, marketing teams, and data analysts can all access the same data through a single management system and have a thorough understanding of how and where data is collected.
Reliability
Data duplication occurs when data is stored in several locations, and siloed data can result over time. Data consistency is improved by centralizing data and using a data pipeline to eliminate siloed databases.
Flexibility And Agility
A data pipeline gives an organization the ability to adapt quickly to a changing environment. Data formats and structures are ever-changing, and the pipeline should be easy to modify to accommodate those changes.
Case Study - Enterprise Data Lake & Pipeline
The Company
The company is a European financial institution operating in the wealth management, asset management, and investment banking domains. As of FY 2021, the organization had over $3 trillion USD under management.
The Challenge
The organization wanted to analyze the transactional patterns of its customers to understand their financial stability, which is important for making business decisions related to specific customers. The organization wanted to hedge any potential risk of customer default while scaling the business at the same time. Naturally, the need was felt for a robust and reliable data warehouse that could answer all questions related to credit risk.
The Solution
The solution was to build an enterprise data warehouse that could compute and model all financial information related to the customers. Data was collected from various sources (OLTP servers, sales & marketing teams, wealth management/portfolio management divisions, etc.) and modeled to present an aggregated view from which insights could be drawn.
Amazon Web Services (AWS) was chosen as the cloud provider. The whole infrastructure was deployed over the cloud to maintain scalability and flexibility.
Cloud Components:
Project-based Implementation:
Python was extensively used to implement the business logic. The PySpark library was used to integrate Python with Spark and leverage big data processing at scale.
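As a hedged sketch of how PySpark might be used for this kind of business logic (the paths, column names, and aggregation are illustrative assumptions, not the project's actual code):

```python
# Hedged PySpark sketch: aggregate customer transactions into features
# that a credit-risk model could consume. Paths, columns, and the
# aggregation itself are illustrative assumptions, not project code.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("credit-risk-aggregation").getOrCreate()

txns = spark.read.parquet("s3a://datalake/raw/transactions/")

exposure = (txns
            .groupBy("customer_id")
            .agg(F.sum("amount").alias("total_spend"),
                 F.count("*").alias("txn_count"),
                 F.max("amount").alias("largest_txn")))

exposure.write.mode("overwrite").parquet("s3a://datalake/curated/customer_exposure/")
```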
Since Redshift uses PostgreSQL under the hood, standard SQL and PL/pgSQL were used to interact with the database.
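A hedged sketch of querying Redshift from Python through a PostgreSQL driver such as psycopg2 follows; the cluster endpoint, credentials, schema, and query are placeholders.

```python
# Hedged sketch: because Redshift speaks the PostgreSQL protocol, a
# standard PostgreSQL driver such as psycopg2 can run SQL against it.
# The endpoint, credentials, schema, and query are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="********",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT customer_id, SUM(amount) AS total_exposure
        FROM curated.transactions
        GROUP BY customer_id
        ORDER BY total_exposure DESC
        LIMIT 10
    """)
    for customer_id, total_exposure in cur.fetchall():
        print(customer_id, total_exposure)

conn.close()
```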
Outcome:
The first phase of the project was implemented within six months. The extensive use of cloud components and modular architecture of the code helped in achieving this timeline.
The organization was able to capture over 100 TB of data in the first few months after go-live. This data proved to be extremely valuable: the business intelligence team analyzed it to produce a list of potential defaulters, and the organization was able to prevent billions of dollars from turning into bad debt.