What is a Data Contract?
Data Contract (Source: datamesh-manager.com)

A data contract is a formal agreement between two parties (the data product provider and the data product consumer) to use a data product. It specifies the guarantees about a provided data set, the purpose of data usage, and costs. Data contracts have a lifecycle to allow the evolution of data products, and they can be used to automate the creation and revocation of roles and permissions.

Data Contract Example

Let's start with an example to see what a data contract typically covers:

Example of a data contract

Request Flow and Life Cycle

A data contract is typically a mutual agreement between two teams, represented by their data product owners. A consumer team submits a request to access a data product of another team (the provider team) via a specific output port. The request specifies the purpose of the data access and the necessary quality requirements (service-level agreements). The provider team then decides whether to approve the request, based on criteria such as usage terms, a valid purpose, and the need-to-know principle, or whether to negotiate the usage conditions further with the consumer team. Restrictive access patterns, compliance and security requirements can also be defined, and a cost accounting model may be applied. A data contract is therefore only concluded when the data consumer and data provider reach an agreement on the terms. This requires some kind of status model, with states such as requested, approved, and rejected.
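This status model can be sketched as a small state machine. The sketch below is illustrative: only the states requested, approved, and rejected come from the description above; canceled and the exact transition rules are assumptions.

```python
from enum import Enum

class ContractStatus(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"
    CANCELED = "canceled"  # assumption: an approved contract can later be canceled

# Allowed transitions in the request flow (minimal, hypothetical model)
TRANSITIONS = {
    ContractStatus.REQUESTED: {ContractStatus.APPROVED, ContractStatus.REJECTED},
    ContractStatus.APPROVED: {ContractStatus.CANCELED},
    ContractStatus.REJECTED: set(),
    ContractStatus.CANCELED: set(),
}

def transition(current: ContractStatus, target: ContractStatus) -> ContractStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

A rejected or canceled contract is terminal here; a real implementation might also allow renegotiation back to requested.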

The syntax and semantics of the required fields can be precisely described and defined in a data contract. This can be done in the form of a technology-specific or technology-neutral schema (e.g., SQL DDL, dbt model contract, Protobuf, JSON Schema). For the provider team, the schema defines the fixed points that the output port must always adhere to, while also showing the degrees of freedom where changes can take place. Consumers, in turn, can trust that the listed fields are stable and deliver the promised data quality.

A data contract has a specific duration with a start date and can also be canceled by either party, subject to a defined notice period. This makes it possible for the provider team to evolve a data product, e.g., when a breaking change needs to be implemented and data consumers are advised to migrate to a newer version of an output port within the notice period. Data consumers can also cancel a data contract, e.g., when the costs don't match the expected business value or the data quality is not sufficient.
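Notice periods are typically given as ISO 8601 durations (the YAML example below uses P3M). A minimal sketch of computing the earliest possible end date from a cancellation date, assuming a month-based period and hand-rolled date arithmetic:

```python
from datetime import date

def add_notice_period(cancel_date: date, notice_period: str) -> date:
    """Earliest contract end: cancellation date plus a month-based
    ISO 8601 notice period such as 'P3M' (simplified parser, months only)."""
    if not (notice_period.startswith("P") and notice_period.endswith("M")):
        raise ValueError("only month-based periods like 'P3M' are supported here")
    months = int(notice_period[1:-1])
    total = cancel_date.month - 1 + months
    year = cancel_date.year + total // 12
    month = total % 12 + 1
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    days_in_month = [31, 29 if leap else 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    # clamp the day for shorter target months (e.g. Jan 31 + P1M -> Feb 28)
    return date(year, month, min(cancel_date.day, days_in_month[month - 1]))
```

For production use, a full ISO 8601 duration parser (weeks, days, mixed units) would be needed.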

It is also a good practice to conduct regular assessments. Once a year, for example, participants should discuss their active data contracts and evaluate the permissions, business value, and potential for further development.

Automation

The request flow and lifecycle processes should be fully implemented as a self-service, in line with the data mesh principles. This also includes processes and notifications for new requests, approvals, reassessments and terminations.

Data contracts can further act as the foundation for automating processes in the data platform and for computational governance: as soon as a data contract has been approved and its start date has been reached, permissions for the respective data product output port can be set up automatically in the data platform. When the contract is terminated, the permissions must be revoked again. To implement this, some technical information, such as the consumer's IAM role or service account, needs to be defined in the data contract.
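A minimal sketch of this event-driven permission handling, assuming contract lifecycle events with the field names used in the YAML example below; the event type names and the returned platform actions are illustrative, not a real platform API:

```python
def handle_contract_event(event: dict) -> str:
    """Map a data contract lifecycle event to a (hypothetical) platform action.

    In a real setup, the returned action would instead be executed against
    the data platform, e.g. by updating dataset ACLs for the IAM role.
    """
    contract = event["contract"]
    iam_role = contract["custom"]["iamRole"]
    port = contract["provider"]["outputPortId"]
    if event["type"] == "ContractApproved":
        return f"GRANT read ON {port} TO {iam_role}"
    if event["type"] == "ContractTerminated":
        return f"REVOKE read ON {port} FROM {iam_role}"
    return "no action"
```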

Data contracts can also be the basis for automated tests and quality checks in the CI/CD pipeline or in data quality monitoring tools. For example, a regular check can ensure that the output port schema conforms to the agreed data contract schema definition. A similar concept is known in software development as Consumer-Driven Contract Testing, supported by tools like Pact.
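The core of such a conformance check can be sketched as a column-by-column comparison: every column the contract names must exist in the output port with the agreed type, while extra columns in the output port are allowed (they are the provider's degrees of freedom). This is an assumption about the check's semantics, not a specific tool's behavior:

```python
def check_schema_conformance(contract_columns: dict, actual_columns: dict) -> list:
    """Compare the agreed contract schema (column name -> type) against the
    actual output port schema. Returns a list of violation messages;
    an empty list means the output port conforms to the contract."""
    violations = []
    for name, expected_type in contract_columns.items():
        actual_type = actual_columns.get(name)
        if actual_type is None:
            violations.append(f"missing column: {name}")
        elif actual_type != expected_type:
            violations.append(
                f"type changed: {name} is {actual_type}, expected {expected_type}"
            )
    return violations
```

Run in CI, a non-empty result would fail the pipeline before a breaking change reaches consumers.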

Note: A provider team may decide that requests can be auto-approved for low-classified data product output ports or in test environments. These still form valid data contracts, but without manual interaction by the provider team and without delays for the consumer team.

Data Contract Specification

For automation, a data contract must be available in a machine-readable form, such as a JSON or a YAML representation.

The example from above, encoded as YAML:

dataContractSpecification: 0.0.
info:
  id: 640864de-83d4-4619-afba-ccea8037ed3a
  status: approved
  startDate: 2023-04-01
  endDate:
  noticePeriod: P3M
  nextReassessmentDate: 2024-04-01
provider:
  teamId: 6409a881-90c9-4fbb-8c89-d629e7c45e90
  teamName: Checkout
  dataProductId: 9be77c17-cda8-4b80-b6c6-cc00062b5686
  dataProductName: Orders
  outputPortId: a2197ee5-e0e9-45f8-b111-3138b59ad350
  outputPortName: bigquery_orders_latest_pii_v1
consumer:
  teamId: 9c721368-a61f-4a0d-b729-d00e4629a425
  teamName: Marketing
  dataProductId: 20e28cca-28a8-4991-88c6-64d443cbb797
  dataProductName: Funnel Analytics
terms:
  purpose: >
        Funnel analysis to understand user behaviors throughout
        the customer journey and identify conversion problems.
  usage: >
        Max queries per minute: 10
        Max data processing per day: 1 TiB
  limitations:
  costs: $500 per month
schema:
  specification: dbt  # the format of the model specification: dbt, jsonschema, protobuf, paypal
  description: The subset of the output port's data model that we agree to use
  tables:
    - name: orders
      description: >
        One record per order. Includes cancelled and deleted orders.
      columns:
        - name: order_id
          type: string
          description: Primary key of the orders table
          tests:
            - unique
            - not_null
        - name: order_timestamp
          type: timestamptz
          description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
          tests:
            - not_null
    - name: line_items
      description: >
        The items that are part of an order
      columns:
        - name: lines_item_id
          type: string
          description: Primary key of the lines_item_id table
        - name: order_id
          type: string
          description: Foreign key to the orders table
serviceLevelAgreements:
  intervalOfChange: Continuous streaming
  latency: < 60 seconds
  completeness: All orders since 2020-01-01T00:00:00Z
  freshness: Near real time, max. 60 seconds delay
  availability: 99.9%
  performance: Query all orders of last 12 months < 30 seconds
  dataVolume: 5,000-10,000 orders per day expected, ~50 KiB / order
tags:
- business-critical
links:
  schema: https://catalog.example.com/search/search-queries
  catalog: https://catalog.example.com/search/search-queries
custom:
  iamRole: serviceAccount:[email protected]        

The example follows the Data Contract Specification, which is compatible with Data Mesh Manager's Data Contract API. Another example is PayPal's Data Contract Template.

Visualize the Data Mesh

Data contracts are also powerful for data lineage and tracing. In a data mesh architecture, data contracts represent the connections between data products, or more formally, they are the edges in the graph. Together with the data products that represent the nodes in the graph, the mesh can be visualized as a data map:
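Building this graph from a set of contracts is straightforward: a sketch, assuming contract documents shaped like the YAML example above, where only approved contracts become edges:

```python
def build_data_map(contracts: list) -> dict:
    """Adjacency list for the data map: provider data product ->
    list of consumer data products, one edge per approved contract."""
    graph = {}
    for contract in contracts:
        if contract["info"]["status"] != "approved":
            continue  # requested/rejected contracts are not active edges
        provider = contract["provider"]["dataProductName"]
        consumer = contract["consumer"]["dataProductName"]
        graph.setdefault(provider, []).append(consumer)
    return graph
```

Traversing this adjacency list then yields the lineage questions the data map answers, e.g. which downstream products are affected by a breaking change in Orders.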

Data contracts connect data products to a traceable graph

Such a data map is a way of making the use of data in the company comprehensible and traceable across teams and domains.

Data Mesh Manager

Data contracts must be managed efficiently and transparently. Many of our customers used a wiki for this purpose, but a wiki quickly reaches its limits and allows for hardly any automation.

Since there was no good tool available for managing data contracts, we developed Data Mesh Manager to manage data contracts, data products, and global policies as a web-based self-service. An event-based API enables seamless integration with any data platform, and every change is recorded in an audit trail.

Data Mesh Manager Screenshot

In addition to a data product inventory for finding and evaluating data products, Data Mesh Manager also supports a request-and-approve flow for creating data contracts, as well as an event-based API for automatically creating and revoking permissions in the data platform. The visualization as a data map makes the mesh comprehensible and the use of data products traceable.

Sign up now for free, or explore the clickable demo of Data Mesh Manager.
