Standardizing Data Delivery with Data as a Product
On Dwight Eisenhower’s first day in the White House as President, an assistant handed him two unopened letters. Ike responded, “Never bring me a sealed envelope.” He only wanted to consume vetted information.
Like Ike, modern enterprises only want to consume vetted information. But their data engineers struggle to deliver the timely, accurate data this requires, crafting pipelines across proliferating sources, targets, users, use cases, and platforms. They need to standardize and automate data delivery so they can reduce this complexity and avoid persistent performance and quality issues.
Automated pipeline and quality tools can help, but by themselves they cannot provide the necessary standardization.
Data as a Product
A new way to standardize and automate data delivery is to treat data as a product. You can view a data product as a modular package that data teams create, use, curate, and reuse with no scripting. The data product makes data engineers more productive. It also empowers data scientists, data analysts, developers, and even data-savvy business managers to consume data without the help of data engineers. This blog explores what that looks like in a pipeline architecture.
The data product includes transformation logic, metadata, and schema, along with views of the data itself. Let’s break this down.
The data product packages together transformation logic, metadata, schema, and views of the data itself.
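To make this packaging concrete, here is a minimal sketch in Python. It is not any vendor’s implementation; the class and field names are assumptions for illustration, bundling schema, metadata, and transformation logic into one reusable unit that exposes a view of the data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataProduct:
    """Hypothetical data product: one modular, reusable package."""
    name: str
    schema: Dict[str, str]                 # column name -> data type
    metadata: Dict[str, str] = field(default_factory=dict)        # owner, tags, lineage notes
    transformations: List[Callable] = field(default_factory=list) # ordered transformation logic

    def view(self, rows: List[dict]) -> List[dict]:
        """Apply the packaged transformation logic to produce a view of the data."""
        for transform in self.transformations:
            rows = [transform(row) for row in rows]
        return rows

# Example: a small "customers" data product that normalizes email addresses.
customers = DataProduct(
    name="customers",
    schema={"customer_id": "int", "email": "string"},
    metadata={"owner": "data-eng", "domain": "sales"},
    transformations=[lambda row: {**row, "email": row["email"].lower()}],
)
print(customers.view([{"customer_id": 1, "email": "Ann@Example.COM"}]))
```

Because the transformation logic travels with the schema and metadata, any consumer of the package sees the same curated view without writing scripts of their own.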
Implemented well, the data product automatically incorporates changes to ensure users maintain an accurate, consistent picture of the business at all times. These changes might include new rows in a table, a new tag from a user, or an altered schema from the source. The data product provides monitoring capabilities to keep users informed of material changes. If new rows in a table fail a validation test, that should trigger an alert. A viable data product offering also includes role-based access controls that authenticate users and authorize the actions they take on the data.
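The sketch below illustrates those two requirements, validation alerts and role-based authorization, under the same assumptions as before. The function names, permission map, and alerting channel (a warning log) are all hypothetical stand-ins for whatever a real platform provides.

```python
# Sketch: validate incoming rows and alert on failures, then gate actions by role.
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("data_product.monitor")

def validate_new_rows(rows, required_columns):
    """Return the rows that fail validation; emit an alert if any do."""
    failures = [r for r in rows if not required_columns.issubset(r.keys())]
    if failures:
        logger.warning("Validation alert: %d of %d new rows failed checks",
                       len(failures), len(rows))
    return failures

# Role-based access control sketch: which actions each role may take on the data.
PERMISSIONS = {"analyst": {"read"}, "engineer": {"read", "write", "curate"}}

def authorize(role: str, action: str) -> bool:
    return action in PERMISSIONS.get(role, set())

failures = validate_new_rows(
    [{"customer_id": 1, "email": "a@x.com"}, {"customer_id": 2}],  # second row lacks email
    required_columns={"customer_id", "email"},
)
assert authorize("analyst", "read") and not authorize("analyst", "write")
```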
Data Pipelines
Data products should support all the major types of data pipelines. You can view these in three dimensions: the sequence of tasks (ETL vs. ELT), latency (batch vs. stream), and delivery mode (push vs. pull).
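One way to picture these three dimensions is as a declarative descriptor attached to each pipeline. The field names below are assumptions for the sketch, not any product’s actual configuration syntax.

```python
# Illustrative pipeline descriptors covering the three dimensions above.
pipelines = [
    {"name": "orders_nightly", "sequence": "ETL", "latency": "batch",  "delivery": "push"},
    {"name": "clicks_live",    "sequence": "ELT", "latency": "stream", "delivery": "push"},
    {"name": "crm_on_demand",  "sequence": "ELT", "latency": "batch",  "delivery": "pull"},
]

for p in pipelines:
    print(f"{p['name']}: {p['sequence']} / {p['latency']} / {p['delivery']}")
```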
Consumption
There are many ways to access and consume a data product from a target such as a cloud data platform or cloud database. To support analytics, a data scientist or data analyst might use an analytics tool to discover and query the data product. To support operations or embedded analytics, a software as a service (SaaS) application might consume the data product through an API pull service. These scenarios enable enterprise workloads to consume data in a standardized and automated fashion.
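As a rough illustration of the API pull scenario, a SaaS application might fetch new rows from a data product endpoint over REST. The URL, token, query parameter, and response shape below are all hypothetical.

```python
# Sketch of a SaaS application pulling a data product via a REST API.
import requests

API_URL = "https://data-platform.example.com/api/v1/products/customers/rows"  # hypothetical endpoint

def pull_data_product(since: str) -> list:
    """Pull rows added to the data product since the given timestamp."""
    response = requests.get(
        API_URL,
        params={"since": since},                       # assumed incremental-pull parameter
        headers={"Authorization": "Bearer <token>"},   # placeholder credential
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["rows"]                     # assumed response shape

# Usage (commented out because the endpoint above is fictitious):
# rows = pull_data_product(since="2023-01-01T00:00:00Z")
```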
This is the vision. Can enterprises make it work? Check out Eckerson Group’s webinar with Nexla on January 26 as we explore this question.