What is a Data Contract?
A data contract is a formal agreement between two parties (the data product provider and the data product consumer) to use a data product. It specifies the guarantees about a provided data set, the purpose of data usage, and costs. Data contracts have a lifecycle to allow the evolution of data products, and they can be used to automate the creation and revocation of roles and permissions.
Data Contract Example
Let's start with an example to see what a data contract typically covers:
Request Flow and Life Cycle
A data contract is typically a mutual agreement between two teams, represented by their data product owners. A consumer team submits a request to access a data product of another team (the provider team) via a specific output port. The request specifies the purpose of the data access and the necessary quality requirements (service-level agreements). The provider team then decides whether to approve the request, based on criteria such as usage terms, a valid purpose, and the need-to-know principle, or whether to negotiate the usage conditions further with the consumer team. Restrictive access patterns as well as compliance and security requirements can also be defined, and a cost accounting model may be applied. A data contract is therefore only concluded when the data consumer and data provider reach an agreement on the terms. This requires some kind of status model, with states such as requested, approved, and rejected.
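The status model described above can be sketched as a small state machine. This is an illustrative sketch, not part of any specification: the state names follow the article, but the transition table and function names are assumptions.

```python
from enum import Enum

class ContractStatus(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"
    CANCELED = "canceled"

# Allowed transitions in the request flow (illustrative; a real
# implementation would also record who triggered each transition and why).
TRANSITIONS = {
    ContractStatus.REQUESTED: {ContractStatus.APPROVED, ContractStatus.REJECTED},
    ContractStatus.APPROVED: {ContractStatus.CANCELED},  # cancellation by either party
    ContractStatus.REJECTED: set(),
    ContractStatus.CANCELED: set(),
}

def transition(current: ContractStatus, target: ContractStatus) -> ContractStatus:
    """Move a contract to a new status, rejecting illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

Modeling the transitions explicitly makes it easy to drive notifications and permission changes off status changes later on.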
The syntax and semantics of the required fields can be precisely defined in a data contract. This can be done in the form of a technology-specific or technology-neutral schema (e.g., SQL DDL, a dbt model contract, Protobuf, JSON Schema). For the provider team, the schema defines the fixed points that the output port must always adhere to, while also showing the degrees of freedom where changes can take place. Consumers can trust that the listed fields are stable and of the specified quality; with a data contract, they can count on the promised data quality.
A data contract has a specific duration with a start date and can also be canceled by either party, subject to a defined notice period. This makes it possible for the provider team to evolve a data product: when a breaking change needs to be implemented, data consumers are advised to migrate to a newer version of an output port within the notice period. Data consumers can also cancel a data contract, e.g., when the costs don't meet the expected business value or the data quality is not sufficient.
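Notice periods are conveniently expressed as ISO 8601 durations (the example contract below uses P3M, i.e., three months). A minimal sketch of computing the earliest termination date from such a value, assuming whole-month periods only:

```python
from datetime import date
import re

def earliest_termination(announced: date, notice_period: str) -> date:
    """Earliest date a cancellation can take effect, given an ISO 8601
    notice period in whole months, e.g. 'P3M'. Illustrative helper; it
    does not handle day-of-month overflow (e.g. Jan 31 + one month)."""
    match = re.fullmatch(r"P(\d+)M", notice_period)
    if not match:
        raise ValueError(f"Unsupported notice period: {notice_period}")
    months = int(match.group(1))
    total = announced.month - 1 + months
    return announced.replace(year=announced.year + total // 12,
                             month=total % 12 + 1)

# A cancellation announced on 2023-04-01 with a P3M notice period
# takes effect no earlier than 2023-07-01.
```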
It is also good practice to conduct regular assessments. Once a year, for example, participants should discuss their active data contracts and evaluate the permissions, business value, and potential for further development.
Automation
The request flow and lifecycle processes should be fully implemented as a self-service, in line with the data mesh principles. This also includes processes and notifications for new requests, approvals, reassessments, and terminations.
Data contracts can further act as the foundation for automating processes in the data platform and for computational governance: as soon as a data contract has been approved and its start date has been reached, permissions for the respective data product output port can be set up automatically in the data platform. When the contract is terminated, the permissions must be revoked again. To implement this, some technical information, such as the consumer's IAM role or service account, needs to be defined in the data contract.
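The grant-and-revoke automation can be sketched as an event handler. Everything here is an assumption for illustration: the event shape, the `grant_read`/`revoke_read` methods, and the field paths mirror the example contract later in the article, but a real implementation would call your data platform's IAM API (e.g., BigQuery dataset access bindings).

```python
# Sketch: derive platform permission actions from data contract lifecycle
# events. `platform` is any object exposing grant_read/revoke_read
# (hypothetical interface for this example).

def handle_contract_event(event: dict, platform) -> None:
    contract = event["contract"]
    principal = contract["custom"]["iamRole"]           # consumer's service account
    output_port = contract["provider"]["outputPortId"]  # resource to grant access to

    if event["type"] == "ContractApproved":
        platform.grant_read(principal=principal, resource=output_port)
    elif event["type"] == "ContractTerminated":
        platform.revoke_read(principal=principal, resource=output_port)
```

Driving permissions exclusively from contract events keeps the platform state reproducible: re-playing the event log yields exactly the permissions the active contracts justify.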
Data contracts can also be the basis for automated tests and quality checks in the CI/CD pipeline or in data quality monitoring tools. For example, a regular check ensures that the output port schema conforms to the agreed data contract schema definitions. A similar concept is known in software development as consumer-driven contract testing, supported by tools like Pact.
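Such a conformance check boils down to verifying that every field the contract relies on still exists with the agreed type. A minimal sketch, with schemas simplified to nested dicts (the real check would read the contract's schema section and the output port's catalog metadata):

```python
# Sketch of a CI check: does the output port still honor the contract?
# Schemas are simplified to {table: {column: type}} for illustration.

def conformance_violations(contract_schema: dict, port_schema: dict) -> list:
    """Return human-readable violations; an empty list means the output
    port conforms to the contract. Extra columns on the port are fine:
    the contract only pins down the agreed subset."""
    violations = []
    for table, columns in contract_schema.items():
        for column, expected_type in columns.items():
            actual_type = port_schema.get(table, {}).get(column)
            if actual_type is None:
                violations.append(f"{table}.{column} is missing")
            elif actual_type != expected_type:
                violations.append(
                    f"{table}.{column}: expected {expected_type}, got {actual_type}")
    return violations
```

Running this on every deployment of the data product turns a breaking change into a failed pipeline instead of a broken consumer.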
Note: A provider team may decide that requests can be auto-approved for low-classified data product output ports or in test environments. Such requests still form a valid data contract, but without manual interaction by the provider team and without delays for the consumer team.
Data Contract Specification
For automation, a data contract must be available in a machine-readable form, such as a JSON or a YAML representation.
The example from above, encoded as YAML:
dataContractSpecification: 0.0.
info:
  id: 640864de-83d4-4619-afba-ccea8037ed3a
  status: approved
  startDate: 2023-04-01
  endDate:
  noticePeriod: P3M
  nextReassessmentDate: 2024-04-01
provider:
  teamId: 6409a881-90c9-4fbb-8c89-d629e7c45e90
  teamName: Checkout
  dataProductId: 9be77c17-cda8-4b80-b6c6-cc00062b5686
  dataProductName: Orders
  outputPortId: a2197ee5-e0e9-45f8-b111-3138b59ad350
  outputPortName: bigquery_orders_latest_pii_v1
consumer:
  teamId: 9c721368-a61f-4a0d-b729-d00e4629a425
  teamName: Marketing
  dataProductId: 20e28cca-28a8-4991-88c6-64d443cbb797
  dataProductName: Funnel Analytics
terms:
  purpose: >
    Funnel analysis to understand user behaviors throughout
    the customer journey and identify conversion problems.
  usage: >
    Max queries per minute: 10
    Max data processing per day: 1 TiB
  limitations:
  costs: $500 per month
schema:
  specification: dbt # the format of the model specification: dbt, jsonschema, protobuf, paypal
  description: The subset of the output port's data model that we agree to use
  tables:
    - name: orders
      description: >
        One record per order. Includes cancelled and deleted orders.
      columns:
        - name: order_id
          type: string
          description: Primary key of the orders table
          tests:
            - unique
            - not_null
        - name: order_timestamp
          type: timestamptz
          description: The business timestamp in UTC when the order was successfully registered in the source system and the payment was successful.
          tests:
            - not_null
    - name: line_items
      description: >
        The items that are part of an order
      columns:
        - name: lines_item_id
          type: string
          description: Primary key of the lines_item_id table
        - name: order_id
          type: string
          description: Foreign key to the orders table
serviceLevelAgreements:
  intervalOfChange: Continuous streaming
  latency: < 60 seconds
  completeness: All orders since 2020-01-01T00:00:00Z
  freshness: Near real time, max. 60 seconds delay
  availability: 99.9%
  performance: Query all orders of last 12 months < 30 seconds
  dataVolume: 5,000-10,000 orders per day expected, ~50 KiB / order
tags:
  - business-critical
links:
  schema: https://catalog.example.com/search/search-queries
  catalog: https://catalog.example.com/search/search-queries
custom:
  iamRole: serviceAccount:[email protected]
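Because the contract is machine-readable, tooling can navigate it like any nested structure once parsed (e.g., with PyYAML's `yaml.safe_load`). A minimal sketch, with a trimmed-down contract inlined as a dict and an assumed `is_active` helper:

```python
# Sketch: reading fields from a parsed data contract. Only a subset of
# the example contract is inlined here for illustration.

contract = {
    "info": {"status": "approved", "startDate": "2023-04-01", "noticePeriod": "P3M"},
    "provider": {"outputPortName": "bigquery_orders_latest_pii_v1"},
    "custom": {"iamRole": "serviceAccount:[email protected]"},
}

def is_active(contract: dict) -> bool:
    """Simplified: a real check would also compare start/end dates."""
    return contract["info"]["status"] == "approved"

if is_active(contract):
    principal = contract["custom"]["iamRole"]
    port = contract["provider"]["outputPortName"]
    print(f"grant read on {port} to {principal}")
```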
The example follows the Data Contract Specification, which is compatible with Data Mesh Manager's Data Contract API. Another example is PayPal's Data Contract Template.
Visualize the Data Mesh
Data contracts are also powerful for data lineage and tracing. In a data mesh architecture, data contracts represent the connections between data products, or more formally, they are the edges in the graph. Together with the data products that represent the nodes in the graph, the mesh can be visualized as a data map:
Such a data map is a way of making the use of data in the company comprehensible and traceable across teams and domains.
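The graph structure described above (products as nodes, contracts as edges) can be derived mechanically from the contracts. A minimal sketch, using the product names from the example contract; the function name and dict shapes are illustrative assumptions:

```python
# Sketch: build the data map graph from a list of data contracts.
# Each contract links a consumer data product to a provider data product.

contracts = [
    {"provider": {"dataProductName": "Orders"},
     "consumer": {"dataProductName": "Funnel Analytics"}},
]

def build_data_map(contracts):
    """Return (nodes, edges): data products and the contracts
    connecting them, with edges pointing in the direction of data flow."""
    nodes, edges = set(), []
    for contract in contracts:
        provider = contract["provider"]["dataProductName"]
        consumer = contract["consumer"]["dataProductName"]
        nodes.update({provider, consumer})
        edges.append((provider, consumer))
    return nodes, edges
```

Feeding this graph into any visualization tool yields the data map; because it is generated from the contracts themselves, it never drifts out of date.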
Data Mesh Manager
Data contracts must be managed efficiently and transparently. Many of our customers used a wiki for this purpose, but a wiki quickly reaches its limits and offers hardly any support for automation.
Since there was no other good tool available for managing data contracts, we developed Data Mesh Manager to manage data contracts, data products, and global policies as a web-based self-service. An event-based API enables seamless integration with any data platform, and every change is recorded in an audit trail.
In addition to a data product inventory for finding and evaluating data products, Data Mesh Manager also supports a request-and-accept flow for creating data contracts, as well as an event-based API for automatically creating and revoking permissions in the data platform. The visualization as a data map makes the mesh comprehensible and the use of the data products traceable.
Sign up now for free, or explore the clickable demo of Data Mesh Manager.