DataHub Community Updates: Oct '23

Following on the heels of a spooktacular October Town Hall, I’m here with a recap and roundup of everything DataHub has been up to this past month! The new DataHub v0.12 update supports nested domains, building on DataHub’s already strong support for data mesh via built-in data contracts and data products. Improved features for extracting column-level lineage and observing changes at the column level strengthen DataHub’s ability to manage and govern your decentralized data ecosystem. Speaking of which, a case study featuring FinTech innovator Chime showcases all of these features in action, illustrating how to effectively federate governance in a decentralized data environment.

But first, let’s talk about the thriving DataHub community, which is getting ready to celebrate both an inaugural class of “DataHub Champions” and a major membership milestone.


Giving Thanks for a Thriving Community

With more than 9,200 members, the DataHub Slack community is well on its way to surpassing the 10,000-member mark by year’s end. It’s an engaged community, too, with almost 1,000 active weekly users and 4,500 messages in October alone. This level of engagement is reflected on GitHub as well, with an average of 39 pull requests opened each week, which points to the solid substrate of support for DataHub among its contributors.

Speaking of which, this month’s dbt Coalesce conference was amazing! Not only were a surprising number of DataHub enthusiasts among the attendees, but it also played host to a fascinating presentation by the inestimable Shirshanka Das, CTO of Acryl Data, on how to use shift-left governance and data contracts with dbt. DataHub’s booth was rockin’ during dbt Labs Coalesce, too, with many existing DataHub users and many DataHub-curious attendees stopping by, and data celebrity-cognoscenti like Benn Stancil popping in as well.

We’re going to step up DataHub’s presence at many upcoming industry data events, so be on the lookout for the DataHub booth: you’ll have a chance to win prizes and DataHub acclaim!


Announcement: DataHub Champions Program!

Speaking of DataHub acclaim, we’re officially kicking off “DataHub Champions,” an initiative that aims to shine a spotlight on our remarkable community of contributors. The inaugural inductees include some of the most active and engaged contributors in the DataHub community, people we invited to be Champions based on the quality and consistency of their contributions. In future rounds of the program, we plan to open the floor to the community as a whole for nominations.

The goal is twofold. First, we want to recognize and validate the contributions of members who’ve distinguished themselves and helped markedly improve DataHub. Second, we’ll admit to a self-interested motive: we kind of expect that highlighting the achievements of our most dedicated members will promote even more involvement, nurturing the DataHub community.


What’s new in DataHub v0.12

A new version of DataHub has officially dropped! Here's what you need to know about v0.12:

Overall Progress. About 150 pull requests were merged with this release. A big shoutout to everyone involved, including not just our regular contributors but newcomers, too.

Stability. The team added some notable features. (Don't worry, your older versions won't break.) Of note: thanks to a security fix related to cookie handling, you might find yourself logged out. To fix this, just log back in. That’s pretty much it on user-impacting fixes.

DataHub v0.12 Features:

Nested Domains. One of our latest and greatest features, “Nested Domains,” is here to make it easier to organize your data products and other data assets, kind of like a set of Russian nesting dolls, or matryoshka.

At its core, think of Nested Domains as a way to structure and manage data, similar to folders and subfolders in a filesystem. Just as you'd have a main folder with subfolders in it, Nested Domains lets you create main domains with subdomains within them. This lets you break down your data ecosystem into more manageable chunks—kind of like compartmentalizing your data based on its nature or its source, making it easier to locate and manage.

It also lets you organize your data ecosystem in a way that better reflects your organization’s structure. Conway’s Law holds that the design of software and data architecture reflects the communication patterns of the groups that built it; Nested Domains gives you a way to design your data architecture so that it mirrors how different teams or departments in your organization function. So if a particular department, say 'Sales', has sub-departments like 'Regional' and 'Online', you can have a domain structure that mirrors this.

This simplifies data discovery and provides essential context. And that’s about it: Nested Domains is about ensuring your data is both organized and intuitive for anyone accessing it.
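To make the folder-and-subfolder analogy concrete, here is a minimal Python sketch of a domain tree. It is purely illustrative: the `Domain` class and its methods are hypothetical stand-ins written for this post, not part of the DataHub SDK, and the domain names simply reuse the 'Sales' example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Domain:
    """Hypothetical in-memory model of a domain (not a DataHub SDK class)."""
    name: str
    # Exclude the parent link from repr/eq so parent <-> child cycles stay harmless.
    parent: Optional["Domain"] = field(default=None, repr=False, compare=False)
    children: List["Domain"] = field(default_factory=list)

    def add_child(self, name: str) -> "Domain":
        """Create a subdomain nested under this domain."""
        child = Domain(name=name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the full path from the top-level domain down to this one."""
        parts = []
        node: Optional["Domain"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " / ".join(reversed(parts))


# Mirror the org structure from the example: Sales with two sub-departments.
sales = Domain("Sales")
regional = sales.add_child("Regional")
online = sales.add_child("Online")

print(online.path())                      # Sales / Online
print([c.name for c in sales.children])   # ['Regional', 'Online']
```

The path is what gives a data asset its context: an asset filed under "Sales / Online" tells you both which team owns it and where that team sits in the wider organization.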

DataHub v0.12 Features:

For business users, data practitioners and developers

For business users, the Chrome extension is kind of like having a mini, on-demand version of DataHub right inside your primary tools. It makes it easier to discover relevant data, reduces context switching, and ensures detailed information about data is just a click away.

The updated DataHub Chrome extension allows users to interact with DataHub directly from their browsers, eliminating the need to leave their workflows when using data platforms like Power BI, Looker, Databricks, BigQuery, and others. New editing capabilities mean that the browser extension isn’t just useful for discovering data: you can now use it to make or propose changes, add tags, or annotate a dataset directly from your BI or data visualization tool. It also supports quick data previews, so you can see details about a dataset without leaving that tool.

For example, if you’re working in Power BI and you come across a dataset or metric you're unfamiliar with, you can open the extension and it will automatically show you the relevant details. (This can be a game-changer when you need to diagnose issues with data or track and understand data lineage!) Also, if there's an issue with the data you're looking at, the extension can let you know. This helps you make informed decisions based on the data's current status.

We have goodies for developers too! Developers can now create and register Data Contracts as YAML files or using the SDK. (This feature is still fairly new, but it’s a focus of activity and will improve in forthcoming versions of DataHub. At this point, even though you can create and register Data Contracts, you can’t see them in the DataHub UI. Support for this feature is slated to roll out in November or December.) Elsewhere, DataHub now supports ingesting metadata from tools such as MLflow and DynamoDB. And if you work with large database tables, especially in platforms like Snowflake or BigQuery, you'll notice faster profiling, thanks to a new sampling-based profiling capability.
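For a feel of why sampling speeds up profiling, here is a small, generic Python sketch that estimates a column's null fraction from a random sample instead of scanning every row. This is only an illustration of the general idea: the function name, the sample size, and the synthetic column are all assumptions made for this example, and DataHub's actual sampling strategy may work quite differently.

```python
import random


def estimate_null_fraction(values, sample_size, seed=0):
    """Estimate the share of NULLs in a column from a random sample of rows."""
    if not values:
        return 0.0
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    if len(values) <= sample_size:
        sample = values
    else:
        sample = rng.sample(values, sample_size)
    return sum(v is None for v in sample) / len(sample)


# Synthetic column (made up for illustration): every 10th value is NULL.
column = [None if i % 10 == 0 else i for i in range(100_000)]

exact = sum(v is None for v in column) / len(column)        # full scan
approx = estimate_null_fraction(column, sample_size=5_000)  # 5% of the rows

print(f"exact={exact:.3f} approx={approx:.3f}")
```

The trade-off is the usual one: the sampled estimate touches far fewer rows, at the cost of a small statistical error that shrinks as the sample grows.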


Roadmap Updates

There’s a ton of exciting stuff that we’re building, both for the open-source DataHub version and for Acryl Cloud. Here’s what you can look forward to:


Product Updates: New Features and Improvements

Column-level Lineage for dbt, Redshift and other sources

Big news! DataHub now supports column-level lineage for dbt.

Originally, dbt users could get column-level lineage only with platforms like Snowflake or BigQuery. That meant dbt users on platforms like AWS Redshift or Databricks lacked a way to extract lineage at the column level. As of the upcoming 0.12.01 DataHub point release, dbt users will be able to track column-level lineage across these platforms, too. This feature works across all data sources, snapshots, and models, including both incremental and ephemeral dbt models. For ephemeral models, which aren’t physically materialized in a database, DataHub is now able to infer and display their columns and data types. This allows you to better understand the structure and transformations within your dbt models, beyond what is physically materialized in the data warehouse.

Here’s an example of what a simple dbt project looks like with column-level lineage enabled. If you’ve ever asked yourself where the email field in a far-downstream table actually came from, you can now finally answer that question!
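Under the hood, answering the "where did that email field come from?" question boils down to walking a graph of column-to-column dependencies. The Python sketch below shows that idea in miniature. The table names, column names, and the `UPSTREAMS` mapping are all made up for illustration, and this is not how DataHub stores or queries lineage; it is just the traversal concept.

```python
# Hypothetical column-level lineage: each (table, column) pair maps to the
# upstream columns it was derived from. All names are invented for this sketch.
UPSTREAMS = {
    ("marketing_emails", "email"): [("dim_customers", "email")],
    ("dim_customers", "email"): [("stg_customers", "email")],
    ("stg_customers", "email"): [("raw_customers", "email_address")],
}


def trace_to_sources(column, upstreams):
    """Return the root columns (those with no recorded upstreams) feeding `column`."""
    sources = []
    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared or cyclic dependencies
        seen.add(current)
        parents = upstreams.get(current, [])
        if not parents:
            sources.append(current)
        stack.extend(parents)
    return sources


print(trace_to_sources(("marketing_emails", "email"), UPSTREAMS))
# [('raw_customers', 'email_address')]
```

In this toy graph, the downstream `email` column traces back through two intermediate models to a raw column with a different name, which is exactly the kind of rename that makes column-level lineage worth having.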

If you were wondering which integrations support column-level lineage in DataHub, here’s a quick picture that summarizes it all.

Integration is ongoing, with a new plugin for Apache Airflow you can use to extract the column-level lineage implicit in SQL. In addition, DataHub will support column-level lineage extraction for three of the most popular BI tools (Looker, Tableau, and Power BI), with more to come. DataHub is going to continue to chip away at column-level lineage, adding platforms and tools iteratively. Today, users on sources that aren’t currently supported have a few options, including extracting lineage from database query logs (if available) or using DataHub's SDK.


Data Quality to the forefront with Acryl Observe

If you joined us for the August DataHub Town Hall, you probably heard about Acryl Observe, the data quality module in Acryl Cloud. In the October Town Hall, DataHub contributor John Joyce, a founding engineer with Acryl Data, provided an update on what the team has been cranking on, including some pretty cool features.

First, some True North principles:

  • Data customers should not be the first to encounter data issues…
  • But they should be the first to be informed.

Column Assertions and Custom SQL Assertions are two new Acryl Observe features you can use to monitor and validate your data against criteria that you define in advance. These new features build on other Acryl Observe features, like Freshness Assertions (for monitoring data timeliness) and Volume Assertions (for tracking unexpected row-count changes) to enable real-time data observability and governance.

Column Assertions are an automated mechanism you can use to check individual columns in a table to proactively identify common issues, including duplicate entries, invalid data types or values, or missing data. They can be scheduled to run at different times and frequencies and can validate on the basis of null count, unique count, max, min, stddev (standard deviation), mean, and other metrics. You can create alerts that automatically notify relevant stakeholders as soon as data quality issues or anomalies are detected.
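To illustrate what a column-level check does conceptually, here is a minimal Python sketch that computes the metrics named above for one column and flags any that fall outside allowed ranges. The function names, the sample `order_totals` data, and the `limits` format are assumptions invented for this example; they are not Acryl Observe's API or configuration format.

```python
import statistics


def column_metrics(values):
    """Compute basic profile metrics for one column (assumes >= 2 non-null numbers)."""
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        "unique_count": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": statistics.mean(non_null),
        "stddev": statistics.stdev(non_null),  # sample standard deviation
    }


def failed_checks(metrics, limits):
    """Return the names of metrics outside their inclusive (low, high) range."""
    failures = []
    for name, (low, high) in limits.items():
        if not (low <= metrics[name] <= high):
            failures.append(name)
    return failures


# Made-up column: one missing value and one negative order total.
order_totals = [20.0, 35.5, None, 18.25, 35.5, -4.0]
metrics = column_metrics(order_totals)

# Hypothetical rules: no NULLs allowed, and totals must never be negative.
limits = {"null_count": (0, 0), "min": (0.0, float("inf"))}

print(failed_checks(metrics, limits))  # ['null_count', 'min']
```

In a real setup, a non-empty list of failures is the point at which an alert would go out to the relevant stakeholders.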

Custom SQL Assertions allow you to run arbitrary SQL queries and validate them against expected results. When Acryl Observe runs a query, it compares the actual output against the user-defined expected result. If there is a mismatch, Acryl Observe triggers an alert. These allow you to automate checks to validate (a) that data complies with business rules and (b) that referential integrity is maintained across tables in an RDBMS or dataset.
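The compare-query-output-to-expected-result pattern can be sketched in a few lines of Python using the standard library's `sqlite3` module. Everything here is hypothetical: the `run_sql_assertion` helper, the `customers` and `orders` tables, and the in-memory database are stand-ins chosen for illustration, not Acryl Observe's implementation. The query is a referential integrity check that counts orders pointing at a customer that doesn't exist.

```python
import sqlite3


def run_sql_assertion(conn, query, expected):
    """Run a single-value query and compare its result with the expected value."""
    actual = conn.execute(query).fetchone()[0]
    return actual == expected, actual


# In-memory database with made-up tables; order 12 references a missing customer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers (id) VALUES (1), (2);
    INSERT INTO orders (id, customer_id) VALUES (10, 1), (11, 2), (12, 3);
""")

# Referential integrity rule: no order should lack a matching customer.
orphan_orders = """
    SELECT COUNT(*)
    FROM orders o
    LEFT JOIN customers c ON c.id = o.customer_id
    WHERE c.id IS NULL
"""

passed, actual = run_sql_assertion(conn, orphan_orders, expected=0)
print(passed, actual)  # False 1
```

Because the expected result is 0 and the query finds one orphaned order, the check fails, which is the mismatch that would trigger an alert.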


Case Study: DataHub and Federated Governance in Action

One of the highlights of the October DataHub Town Hall was a presentation by Sherin Thomas, a staff software engineer with FinTech innovator Chime, who served up a case-study-like walkthrough of how her company uses DataHub (specifically Acryl Cloud, the fully managed service based on DataHub) as a central metadata repository and data catalog for its decentralized teams.

Thomas talked about the challenge of promoting data discovery while also practicing good data governance, especially in organizations that prioritize autonomy, responsibility, and ownership for decentralized teams. She observed that challenges with data duplication, data quality, and other issues are not only much more common in decentralized structures, but also have the potential to become uncontrollable without federated governance.

The linchpin of this is Acryl Cloud and DataHub, Thomas argued: it’s a common platform (or “watercooler,” in her terminology) where producers and consumers of data can interact, share feedback, and collaborate, mitigating some of the challenges stemming from siloing between domains, groups, or function areas. She explained how Chime uses Google Protocol Buffers (Protobufs) to allow for a schema-agnostic approach to managing schema definitions across teams. Thomas talked about how DataHub’s Protobufs SDK facilitates interoperability between Acryl Cloud and Protobufs, allowing Acryl Cloud to not only ingest schema definitions from Protobufs, but also capture detailed information—like comments and documentation from engineers—as metadata. This helps downstream data producers and consumers.

Thomas shared that DataHub’s lineage capabilities are probably her favorite feature, enabling Chime to pinpoint and rapidly resolve issues that could impact business operations, as well as clarifying ownership and usage. She championed a “crowdsourcing” approach to metadata ingestion, involving stakeholders from cross-functional teams to participate and contribute expertise. Thomas talked about how Chime leverages the DataHub API to allow its teams to directly push their metadata into Acryl Cloud. She also emphasized the importance of clear ownership of data, which she said ensures accountability, along with rapid, effective problem resolution. Finally, she highlighted the critical role data contracts play in federated governance, explaining how DataHub’s and Acryl Cloud's support for data contracts allows Chime’s data leaders, data producers, and data consumers to enforce standards for schema, freshness, and quality. She noted a few additional benefits of data contracts, including a shared semantic understanding, along with a sense of accountability among data producers.


Thanks Be to YOU!

As I wrap up this month's journey through DataHub's latest, I’m reminded that behind every DataHub update, every DataHub feature, and every line added to DataHub’s clean, elegant codebase, there's a community of contributors and users who make it all possible. The season of giving thanks is officially upon us, but trust me when I say that we’re grateful to and for you, the DataHub community, the whole year ’round. Thank you for your continued support and enthusiasm. We quite literally couldn’t be doing this without you.

We only covered the highlights here, but if you want to see everything that was discussed, you can watch the entire town hall too!


