Standardizing Data Delivery with Data as a Product

On Dwight Eisenhower’s first day in the White House as President, an assistant handed him two unopened letters. Ike responded, “Never bring me a sealed envelope.” He only wanted to consume vetted information.

Like Ike, modern enterprises want to consume only vetted information. But their data engineers struggle to deliver the timely, accurate data this requires as they craft pipelines across proliferating sources, targets, users, use cases, and platforms. They need to standardize and automate data delivery to reduce this complexity and avoid persistent performance and quality issues.

Automated pipeline and quality tools can help, but by themselves they cannot offer the necessary standardization.

Data as a Product

A new way to standardize and automate data delivery is to treat data as a product. You can view a data product as a modular package that data teams create, use, curate, and reuse with no scripting. The data product enables data engineers to become more productive. It also empowers data scientists, data analysts, developers, or even data-savvy business managers to consume data without the help of data engineers. This blog explores what that looks like in a pipeline architecture.

The data product includes transformation logic, metadata, and schema, along with views of the data itself. Let’s break this down.

  • Transformation logic prepares a data product for consumption by combining its data from multiple sources, filtering out unneeded records or values, reformatting data, and validating its accuracy. The transformation logic also might enrich a data product by correlating it with relevant third-party data. Finally, it might obfuscate sensitive data by masking those values.
  • Metadata, such as the name of a file, its characteristics, and its lineage, describes data so that users know where and how they might use it. Users might tag a data product with additional metadata, for example to describe its purpose and relevance to a given project.
  • Schema structures the data for consumption. For example, a schema defines how the rows and columns of a SQL table, or the columns of an Apache Parquet file, relate to one another. By structuring the data, the schema helps applications, tools, and users consume it.
  • Views, as you’d expect, present the data to those applications, tools, and users for consumption. These views are logical representations of underlying physical data. Multiple views might apply to a given physical data set, with each view offering a distinct combination or slice of the data.

The data product packages together transformation logic, metadata, schema, and views of the data itself.
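
To make this concrete, here is a minimal sketch of a data product expressed as a package in Python. The DataProduct class, its field names, and the masking and filtering helpers are illustrative assumptions, not any specific product’s API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import pandas as pd


# Hypothetical sketch of a data product as a modular package: transformation
# logic, metadata, schema, and logical views bundled together.
@dataclass
class DataProduct:
    name: str
    metadata: Dict[str, str]          # e.g., file name, lineage, user-supplied tags
    schema: Dict[str, str]            # column name -> data type
    transformations: List[Callable[[pd.DataFrame], pd.DataFrame]] = field(default_factory=list)
    views: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = field(default_factory=dict)

    def build(self, raw: pd.DataFrame) -> pd.DataFrame:
        """Apply the transformation logic, then enforce the schema."""
        for transform in self.transformations:
            raw = transform(raw)
        return raw.astype(self.schema)

    def view(self, name: str, data: pd.DataFrame) -> pd.DataFrame:
        """Return one logical view (a slice or combination) of the physical data."""
        return self.views[name](data)


# Illustrative transformation logic (masking) and one view (a slice of the data).
def mask_email(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(email=df["email"].str.replace(r".+@", "***@", regex=True))

orders = DataProduct(
    name="customer_orders",
    metadata={"owner": "data-eng", "lineage": "crm.orders", "tag": "churn-project"},
    schema={"order_id": "int64", "email": "string", "country": "string", "amount": "float64"},
    transformations=[mask_email],
    views={"us_orders": lambda df: df[df["country"] == "US"]},
)
```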

Implemented well, the data product automatically incorporates changes to ensure users maintain an accurate, consistent picture of the business at all times. These changes might include new rows in a table, a new tag from a user, or an altered schema from a source. The data product provides monitoring capabilities to keep users informed of material changes. If new rows in a table fail a validation test, that should trigger an alert. A viable data product offering also includes role-based access controls that authenticate users and authorize the actions they take on the data.
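
As a sketch of that monitoring behavior, a simple validation rule might look like the following. The validate_new_rows function, the amount check, and the alert channel are assumptions for illustration.

```python
import logging

import pandas as pd

logger = logging.getLogger("data_product_monitor")

def validate_new_rows(new_rows: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical validation test: reject incoming rows with missing or negative amounts."""
    failed = new_rows[new_rows["amount"].isna() | (new_rows["amount"] < 0)]
    if not failed.empty:
        # A real data product would surface this as an alert on a dashboard,
        # email, or chat channel rather than a log line.
        logger.warning("%d incoming customer_orders rows failed validation", len(failed))
    return new_rows.drop(failed.index)
```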

Data Pipelines

Data products should support all the major types of data pipelines. You can view this in three dimensions: the sequence of tasks (e.g., ETL vs. ELT), latency (batch vs. stream), and push vs. pull.

  • ETL and ELT. A data pipeline can extract data from a source, transform it, then load it to a target in an ETL sequence. Another popular sequence is ELT, which extracts, loads, and then transforms data on the target. Whatever the sequence, here is how those tasks apply to a data product (see the sketch after this list).
    • Pipelines extract data, metadata, and source schemas.
    • They transform the data using the transformation logic described above.
    • They load these components, including the data views, as a bundled data product to the target.
  • Batch and stream. A data pipeline can process full batches of data on a scheduled basis, say every hour or day, or continuously process incremental updates within a data stream. Stream processing consumes fewer resources and meets real-time business needs. Whatever the latency, the data product should automatically absorb incoming data, metadata, and schema changes. Users should see these changes via dashboards or alerts so they can adjust transformation logic or various settings as needed.
  • Push and pull. A data pipeline can push data to a target, meaning that it delivers a batch or stream based on an event or rule. For example, a source database might register a customer transaction, prompting a traditional ETL or ELT pipeline to push that transaction record to a target such as a cloud data platform. Alternatively, an application programming interface (API) on the target might pull data from the source based on a consumption need. For example, an API pull service might fetch a customer’s transaction history because a SaaS application user requests it. This second scenario also is known as Data as a Service.
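
Here is a rough sketch of the push-style, batch ELT sequence from the list above: extract a batch from the source, load it to the target, then transform it there. Local SQLite connections stand in for the source database and the cloud data platform, and the table and view names are assumptions.

```python
import sqlite3

import pandas as pd

# Illustrative connections: local SQLite files stand in for a source database
# and a cloud data platform.
source = sqlite3.connect("source_crm.db")
target = sqlite3.connect("cloud_warehouse.db")

# Extract the day's batch of transactions (a real tool would also extract
# metadata and the source schema).
batch = pd.read_sql("SELECT * FROM transactions WHERE load_date = DATE('now')", source)

# Load the raw batch to the target first -- the "L" before the "T" in ELT.
batch.to_sql("raw_transactions", target, if_exists="append", index=False)

# Transform on the target: build a consumable view of the data product.
target.execute("""
    CREATE VIEW IF NOT EXISTS v_daily_revenue AS
    SELECT customer_id, DATE(transaction_ts) AS day, SUM(amount) AS revenue
    FROM raw_transactions
    GROUP BY customer_id, DATE(transaction_ts)
""")
target.commit()
```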

Consumption

There are many ways to access and consume a data product from a target such as a cloud data platform or cloud database. To support analytics, a data scientist or data analyst might use an analytics tool to discover and query the data product. To support operations or embedded analytics, a software as a service (SaaS) application might consume the data product through an API pull service. These scenarios enable enterprise workloads to consume data in a standardized and automated fashion.
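
As a sketch of the API pull (Data as a Service) scenario, the endpoint below lets a SaaS application fetch a customer’s transaction history on demand. The framework choice, the route, and the v_daily_revenue view carried over from the earlier sketch are assumptions for illustration.

```python
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/data-products/customer_orders/<customer_id>/history")
def transaction_history(customer_id: str):
    """Pull one customer's slice of the data product when an application requests it."""
    con = sqlite3.connect("cloud_warehouse.db")
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT day, revenue FROM v_daily_revenue WHERE customer_id = ?",
        (customer_id,),
    ).fetchall()
    con.close()
    return jsonify([dict(row) for row in rows])

if __name__ == "__main__":
    app.run(port=8080)
```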

This is the vision. Can enterprises make it work? Check out Eckerson Group’s webinar with Nexla on January 26 as we explore this question.
