Standardizing Data Delivery with Data as a Product

On Dwight Eisenhower’s first day in the White House as President, an assistant handed him two unopened letters. Ike responded, “Never bring me a sealed envelope.” He only wanted to consume vetted information.

Like Ike, modern enterprises want to consume only vetted information. But their data engineers struggle to deliver the timely, accurate data this requires as they craft pipelines across proliferating sources, targets, users, use cases, and platforms. They need to standardize and automate data delivery to reduce this complexity and avoid persistent performance and quality issues.

Automated pipeline and quality tools can help, but by themselves they cannot offer the necessary standardization.

Data as a Product

A new way to standardize and automate data delivery is to treat data as a product. You can view a data product as a modular package that data teams create, use, curate, and reuse with no scripting. The data product enables data engineers to become more productive. It also empowers data scientists, data analysts, developers, or even data-savvy business managers to consume data without the help of data engineers. This blog explores what that looks like in a pipeline architecture.

The data product includes transformation logic, metadata, and schema, along with views of the data itself. Let’s break this down.

  • Transformation logic prepares a data product for consumption by combining its data from multiple sources, filtering out unneeded records or values, reformatting data, and validating its accuracy. The transformation logic also might enrich a data product by correlating it with relevant third-party data. Finally, it might obfuscate sensitive data by masking those values.
  • Metadata, such as the name of a file, its characteristics, and its lineage, describes data so that users know where and how they might use it. Users might tag a data product with additional metadata, for example to describe its purpose and relevance to a given project.
  • Schema structures the data for consumption. For example, a schema defines how the rows and columns of a SQL table, or the columns of an Apache Parquet file, relate to one another. By structuring the data, the schema helps applications, tools, and users consume it.
  • Views, as you’d expect, present the data to those applications, tools, and users for consumption. These views are logical representations of underlying physical data. Multiple views might apply to a given physical data set, with each view offering a distinct combination or slice of the data.

The data product packages together transformation logic, metadata, schema, and views of the data itself.
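
To make this concrete, here is a minimal sketch of a data product expressed as a package in Python. The DataProduct class, its field names, and the masking and filtering helpers are illustrative assumptions, not any specific product’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import pandas as pd


# Hypothetical sketch of a data product as a modular package: transformation
# logic, metadata, schema, and logical views bundled together.
@dataclass
class DataProduct:
    name: str
    metadata: Dict[str, str]          # e.g., file name, lineage, user-supplied tags
    schema: Dict[str, str]            # column name -> data type
    transformations: List[Callable[[pd.DataFrame], pd.DataFrame]] = field(default_factory=list)
    views: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = field(default_factory=dict)

    def build(self, raw: pd.DataFrame) -> pd.DataFrame:
        """Apply the transformation logic, then enforce the schema."""
        for transform in self.transformations:
            raw = transform(raw)
        return raw.astype(self.schema)

    def view(self, name: str, data: pd.DataFrame) -> pd.DataFrame:
        """Return one logical view (a slice or combination) of the physical data."""
        return self.views[name](data)


# Illustrative transformation logic (masking) and one view (a slice of the data).
def mask_email(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(email=df["email"].str.replace(r".+@", "***@", regex=True))

orders = DataProduct(
    name="customer_orders",
    metadata={"owner": "data-eng", "lineage": "crm.orders", "tag": "churn-project"},
    schema={"order_id": "int64", "email": "string", "country": "string", "amount": "float64"},
    transformations=[mask_email],
    views={"us_orders": lambda df: df[df["country"] == "US"]},
)
```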

Implemented well, the data product automatically incorporates changes to ensure users maintain an accurate, consistent picture of the business at all times. These changes might include new rows in a table, a new tag from a user, or an altered schema from a source. The data product provides monitoring capabilities to keep users informed of material changes. If new rows in a table fail a validation test, that should trigger an alert. A viable data product offering also includes role-based access controls that authenticate users and authorize the actions they take on the data.
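
As a sketch of that monitoring behavior, a simple validation rule might look like the following. The validate_new_rows function, the amount check, and the alert channel are assumptions for illustration.

```python
import logging

import pandas as pd

logger = logging.getLogger("data_product_monitor")

def validate_new_rows(new_rows: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical validation test: reject incoming rows with missing or negative amounts."""
    failed = new_rows[new_rows["amount"].isna() | (new_rows["amount"] < 0)]
    if not failed.empty:
        # A real data product would surface this as an alert on a dashboard,
        # email, or chat channel rather than a log line.
        logger.warning("%d incoming customer_orders rows failed validation", len(failed))
    return new_rows.drop(failed.index)
```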

Data Pipelines

Data products should support all the major types of data pipelines. You can view this in three dimensions: the sequence of tasks (e.g., ETL vs. ELT), latency (batch vs. stream), and push vs. pull.

  • ETL and ELT. A data pipeline can extract data from a source, transform it, then load it to a target in an ETL sequence. Another popular sequence is ELT, which extracts, loads, and then transforms data on the target. Whatever the sequence, here is how those tasks apply to a data product (see the sketch after this list).
    • Pipelines extract data, metadata, and source schemas.
    • They transform the data using the transformation logic described above.
    • They load these components, including the data views, as a bundled data product to the target.
  • Batch and stream. A data pipeline can process full batches of data on a scheduled basis, say every hour or day, or continuously process incremental updates within a data stream. Stream processing consumes fewer resources and meets real-time business needs. Whatever the latency, the data product should automatically absorb incoming data, metadata, and schema changes. Users should see these changes via dashboards or alerts so they can adjust transformation logic or various settings as needed.
  • Push and pull. A data pipeline can push data to a target, meaning that it delivers a batch or stream based on an event or rule. For example, a source database might register a customer transaction, prompting a traditional ETL or ELT pipeline to push that transaction record to a target such as a cloud data platform. Alternatively, an application programming interface (API) on the target might pull data from the source based on a consumption need. For example, an API pull service might fetch a customer’s transaction history because a SaaS application user requests it. This second scenario also is known as Data as a Service.
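
Here is a rough sketch of the push-style, batch ELT sequence from the list above: extract a batch from the source, load it to the target, then transform it there. Local SQLite connections stand in for the source database and the cloud data platform, and the table and view names are assumptions.

```python
import sqlite3

import pandas as pd

# Illustrative connections: local SQLite files stand in for a source database
# and a cloud data platform.
source = sqlite3.connect("source_crm.db")
target = sqlite3.connect("cloud_warehouse.db")

# Extract the day's batch of transactions (a real tool would also extract
# metadata and the source schema).
batch = pd.read_sql("SELECT * FROM transactions WHERE load_date = DATE('now')", source)

# Load the raw batch to the target first -- the "L" before the "T" in ELT.
batch.to_sql("raw_transactions", target, if_exists="append", index=False)

# Transform on the target: build a consumable view of the data product.
target.execute("""
    CREATE VIEW IF NOT EXISTS v_daily_revenue AS
    SELECT customer_id, DATE(transaction_ts) AS day, SUM(amount) AS revenue
    FROM raw_transactions
    GROUP BY customer_id, DATE(transaction_ts)
""")
target.commit()
```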

Consumption

There are many ways to access and consume a data product from a target such as a cloud data platform or cloud database. To support analytics, a data scientist or data analyst might use an analytics tool to discover and query the data product. To support operations or embedded analytics, a software as a service (SaaS) application might consume the data product through an API pull service. These scenarios enable enterprise workloads to consume data in a standardized and automated fashion.
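
As a sketch of the API pull (Data as a Service) scenario, the endpoint below lets a SaaS application fetch a customer’s transaction history on demand. The framework choice, the route, and the v_daily_revenue view carried over from the earlier sketch are assumptions for illustration.

```python
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/data-products/customer_orders/<customer_id>/history")
def transaction_history(customer_id: str):
    """Pull one customer's slice of the data product when an application requests it."""
    con = sqlite3.connect("cloud_warehouse.db")
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT day, revenue FROM v_daily_revenue WHERE customer_id = ?",
        (customer_id,),
    ).fetchall()
    con.close()
    return jsonify([dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=8080)
```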

This is the vision. Can enterprises make it work? Check out Eckerson Group’s webinar with Nexla on January 26 as we explore this question.
