DDL Ep 02: Decoding AI and the Role of Metadata
Acryl Data
Reliable Data. Compliant AI. Simple | Driving DataHub, the #1 Open Source Metadata Platform | Discover, Govern, Observe.
In the rapidly evolving world of artificial intelligence, understanding the importance of metadata is key to unlocking success. How has the AI landscape transformed in recent years? What foundational aspects have remained constant? And which emerging trends should the AI community keep an eye on?
In the second episode of DataHub’s “Decoding Data Leadership” series, Shirshanka Das (Co-founder/CTO of Acryl Data, Founder of DataHub) engages in a thought-provoking discussion with Hema Raghavan, founder of cutting-edge AI startup Kumo.AI. Their conversation delves deep into the synergy between data and AI, exploring the vital role of metadata, the shifts and constants within the industry, and the trends poised to shape the future of AI.
Read on for the conversation and a summary of the main takeaways, and check out the full recording on YouTube.
The conversation has been edited for clarity and brevity.
Shirshanka: Hema, could you share a bit about yourself and your background in AI?
Hema: I am the co-founder of Kumo.ai, where I lead engineering. I started in AI before it was cool and in LLMs before they were a household name. My first project used a language model to disambiguate the user’s intent. For example, if they typed “Java,” did they mean coffee or the programming language?
A common pattern in my journey over the last 20+ years has been a focus on efficiency. My PhD was in a field called Active Learning, which is about finding the minimal set of training examples needed to train a machine learning model.
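As a toy illustration of the active-learning idea (not any specific system from Hema’s PhD work), uncertainty sampling picks the unlabeled examples a model is least confident about, so that each label acquired is maximally informative:

```python
def pick_most_uncertain(probs, k):
    """probs: a model's predicted P(positive) for each unlabeled example.
    Returns the indices of the k examples closest to the 0.5 decision
    boundary -- the ones whose labels would teach the model the most."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# The model is nearly certain about examples 0 and 2, so a human
# labeler's time is better spent on examples 1 and 3.
print(pick_most_uncertain([0.95, 0.48, 0.10, 0.55], 2))  # -> [1, 3]
```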
In my Yahoo days, we were looking at how to optimally rank advertisements. At LinkedIn, I worked on several recommender problems, with People You May Know (PYMK) being one of the most complex computational problems I worked on.
Shirshanka: Given the amount of time you’ve spent in AI and with different generations of AI systems, what are your thoughts on the role of metadata in the space?
Hema: Metadata has always played a critical role in ML Ops, and it will continue to do so as AI evolves.
At LinkedIn, metadata was key to the smooth functioning of ML models like “People You May Know” (PYMK). A good example is a situation we encountered after the Fourth of July weekend one year. We expected users would not be checking LinkedIn over the holiday weekend and their activity would bounce back once people returned from the break. But instead, usage did not bounce back. Were people taking longer vacations that year? We had every hypothesis on the table to understand the unexpected shift in behavior.
We traced the issue back to a seemingly minor change a front-end engineer had made in May, disabling a tracking event. This small change had a lagged, cascading effect through our data pipelines, ultimately picked up by PYMK during the Fourth of July weekend. Luckily, the team’s investment in lineage tracking made it much faster to identify and resolve such issues, even though the front-end change had happened months earlier.
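The debugging workflow Hema describes amounts to walking the lineage graph upstream from the misbehaving model to enumerate candidate root causes. A minimal sketch over a hypothetical adjacency-list graph (dataset names are illustrative, not LinkedIn’s actual pipeline):

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to its upstream sources.
LINEAGE = {
    "pymk_scores": ["member_activity_agg"],
    "member_activity_agg": ["page_view_events", "connection_events"],
    "page_view_events": ["frontend_tracking"],
    "connection_events": [],
    "frontend_tracking": [],
}

def upstream_datasets(dataset, lineage):
    """Breadth-first walk over the lineage graph, returning every
    upstream dependency of `dataset` -- the candidate set to inspect
    when a downstream model misbehaves."""
    seen, queue = set(), deque([dataset])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# A change to "frontend_tracking" in May surfaces here as an upstream
# dependency of the PYMK scores that broke in July.
print(sorted(upstream_datasets("pymk_scores", LINEAGE)))
```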
Metadata was also impactful in onboarding new teammates, helping them learn which datasets lived where, and in navigating GDPR.
Shirshanka: On the topic of GDPR, I continue to hear from the DataHub Community and Acryl customers about how important it is to know if personally identifiable information (PII) is used in training a feature or building a model. Tracking PII data across lineage not only helps with regulatory compliance but also helps companies make accurate guarantees to their customers about how their data will be used.
Hema: Absolutely.
Shirshanka: How does metadata play a role in the modern AI stack, particularly in your current work at Kumo?
Hema: Kumo.ai stems from our experience at LinkedIn, Airbnb, and Pinterest, where we’d seen the proliferation of hundreds and hundreds of offline datasets built for feature engineering. These datasets become increasingly hard to maintain over time, and include some created by data scientists as far back as 2009–10.
And as the company evolves, the ecosystem gets bigger. My cofounders and I noticed similar patterns—often, datasets are developed as part of what I call the “feature engineering” process. Someone creates a centralized derived dataset for multiple users, and then others build on that dataset, creating more derived data.
The Kumo promise is to bring the deep learning revolution to enterprise data and solve these problems.
In my background with language modeling and text, we used to do feature engineering, such as part-of-speech taggers and named entity recognizers. Now, we let neural networks handle those tasks. Kumo offers a similar promise: use your raw data (tables, tracking events, user data, account information) and let the neural network determine the intermediate elements.
For Kumo to work effectively, we need good metadata. Kumo uses graph neural networks (GNNs) and automatically infers the graph, identifying key entities like users and products. However, understanding these entities requires metadata. The first three screens of our product involve users ingesting their dataset and annotating it with metadata. This step is crucial because identifiers like “user ID” in one table might be named differently in another, making it hard to match entities without proper metadata.
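The annotation step Hema describes can be pictured as a mapping from physical columns to shared logical entities; the table and column names below are hypothetical, not Kumo’s actual schema:

```python
# Hypothetical annotations: physical (table, column) pairs mapped to a
# shared logical entity, so tables can be stitched into one graph even
# when the same identifier has different names in different tables.
ANNOTATIONS = {
    ("users", "user_id"): "user",
    ("orders", "uid"): "user",        # same entity, different column name
    ("orders", "item_id"): "product",
    ("products", "product_id"): "product",
}

def columns_for_entity(entity, annotations):
    """Return every (table, column) pair annotated as `entity` --
    the join keys a graph builder would link together."""
    return sorted(tc for tc, e in annotations.items() if e == entity)

print(columns_for_entity("user", ANNOTATIONS))
# -> [('orders', 'uid'), ('users', 'user_id')]
```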
Shirshanka: I’m curious if you’ve explored other tools or considered what would be on your wish list to avoid that first step of friction: getting people to annotate metadata before they can get to value with your product.
Hema: Ideally, if there were centralized standards, or if everyone used one or two key tools for data annotation (like if DataHub became the de facto standard), we could streamline this. Then, Kumo could directly call DataHub for annotations, potentially bypassing that initial step in the journey. Automating these steps, such as a single-click option to exclude PII from processing, would enhance efficiency and reduce friction for our users.
For example, on a SaaS platform like ours, we often don’t need to know personal details like a full name or address. We advise customers to mask such data to avoid sending PII to our platform.
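One common way to implement that advice is to replace direct identifiers with salted hashes before the data leaves the customer’s environment. A minimal sketch; the column list and salt handling are assumptions, not Kumo’s actual mechanism:

```python
import hashlib

# Hypothetical masking step, run on the customer's side before upload.
PII_COLUMNS = {"full_name", "email", "street_address"}
SALT = b"per-customer-secret"  # illustrative; real salts need key management

def mask_row(row):
    """Hash PII fields; pass everything else through unchanged.
    Hashing keeps values joinable (same input -> same token)
    without exposing the raw identifier."""
    return {
        col: hashlib.sha256(SALT + str(val).encode()).hexdigest()[:16]
        if col in PII_COLUMNS else val
        for col, val in row.items()
    }

row = {"user_id": 42, "email": "a@example.com", "plan": "pro"}
masked = mask_row(row)  # email becomes an opaque 16-char token
```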
Shirshanka: That’s interesting. We see almost three kinds of metadata in this space. One is compliance or classification metadata. The second is semantic metadata, which is more descriptive, helping to understand what “customer” really means versus the specific CUST_ID column being tagged as PII. The third is operational metadata, which tracks the functioning and performance of data systems. At LinkedIn, we saw how centralizing metadata, including operational metadata from AI systems, achieved this holy convergence of all kinds of metadata behind one platform.
I’m curious about your perspective on this now that you’re outside an environment where centralization can be mandated. At LinkedIn, we could make layering decisions as an organization to ensure this convergence. But the market has tended to resist centralization in favor of efficiency.
What are your thoughts on whether different classes of metadata should be combined or stored separately?
Hema: The value of centralizing metadata is clear, especially for efficiency and compliance, and especially for data scientists.
As regulations evolve and consumers become more privacy-conscious, we, as a collective of people working within data systems, will need to build systems that are much more centralized and that let us manage metadata.
Shirshanka: You mentioned applying GNNs for entity graph creation and solving those issues. There are similar challenges in Master Data Management (MDM), where the goal is to create a master customer record. I’m curious to hear your thoughts on whether these techniques are universally applicable.
Hema: Do you mean helping a consumer understand that the user ID in a Salesforce table here is the same as the ID column in a Snowflake table somewhere else?
Shirshanka: Exactly. Like when you create a Customer-360 on your data. First, you create a Customer-360 on your metadata to understand where all the data actually lives, and then you describe how to transform it.
Hema: Absolutely. That’s a classic graph problem, right?
In the graph machine learning literature, it falls under the entity linking paradigm.
We get a lot of asks from customers, because they often have data from different sources or from company mergers and acquisitions. They often pose it as an entity resolution problem, but it’s effectively the same thing.
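The entity-resolution framing can be sketched as a connected-components problem: records are nodes, each pairwise “same entity” signal is an edge, and every component becomes one resolved record. A toy union-find version (the record IDs and match signal are made up for illustration):

```python
def resolve(records, matches):
    """Cluster records into resolved entities given pairwise matches."""
    parent = {r: r for r in records}

    def find(x):
        # Walk to the root representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:          # union each matched pair
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:             # group records by their root
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

records = ["sf:001", "snow:A7", "snow:B2", "sf:002"]
matches = [("sf:001", "snow:A7")]  # e.g. same hashed email
print(resolve(records, matches))
```

Real systems add a blocking step so that only plausible pairs are compared, but the clustering core is the same.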
Shirshanka: Got it. What trends in AI are you excited about for 2024? What resolutions should the AI community be making or should have made?
Hema: I think LLMs will continue to grow. I do want to say that this will be the year where we also get a reality check on where LLMs can actually help. For example, we have a telecom customer who has a lot of chat logs, and LLMs do great there. But a lot of their other data (cookies, device information, etc.) is in tables, which is deeply relational, structured data.
LLMs work well on sequences and grids but won’t necessarily solve all the world’s problems.
My resolution is to have an open mind and to continue to pick the best tools for each problem. I know I have a bias towards using GNNs for arbitrary structures, and we will continue to use them when we see that pattern.
But I also recognize that in certain regulated environments like credit and interest, more classical methods might make more sense.
So, this might be a controversial take, but LLMs are not a panacea.
Shirshanka: Oh yes, the portfolio approach. Will this be the year the AI community finally agrees that features are really metrics?
Hema: Coming from Kumo, I’d say that feature engineering was not needed in the classic way it was thought to be.
But you need metrics for operational efficiency.
Shirshanka: Thanks for doing this, Hema. This was such a great chat and we covered so much: how metadata is important to AI, how the AI revolution doesn’t mean you forget old techniques, and that GNNs are useful for data and metadata.
Key Takeaways
1. The Importance of Metadata in Modern AI
Operational Efficiency: Metadata has played and will continue to play a crucial role in the operational efficiency of AI systems.
For instance, features like LinkedIn’s People You May Know (PYMK) depended on meticulous metadata management and lineage tracking. Small changes, such as turning off a front-end tracking event, can cascade through data pipelines, causing significant issues, which underscores the need for thorough lineage tracking.
Data Quality & Compliance: Metadata helps ensure data quality and compliance with regulations like GDPR. This involves managing and classifying data, including personally identifiable information (PII), which is critical for customer trust and regulatory adherence.
GNNs for AI Solutions: Comprehensive metadata is essential for products like Kumo to work effectively. Kumo uses graph neural networks (GNNs) to automatically infer relationships within the data, requiring metadata to understand and match entities across different datasets.
GNNs identify key entities (e.g., users, products) in a dataset, but accurate identification requires understanding the data context.
2. Kumo.ai’s Approach to Leveraging Metadata
Kumo.ai leverages graph neural networks (GNNs) to infer graphs and automatically identify key entities. Effective metadata is critical to understanding data relationships and matching entities across different datasets.
3. Types of Metadata
Metadata, as we see it today, can be categorized into:
a. Compliance & classification (for regulatory adherence)
b. Semantic (for descriptive understanding of data attributes)
c. Operational (for tracking data systems’ performance and functioning)
There’s merit in centralizing these metadata categories, particularly as regulations evolve and consumer privacy awareness increases.
For instance, LinkedIn's data team stored operational metadata from AI systems alongside other metadata, achieving a convergence of all types behind one platform.
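The three categories above might be modeled as a small shared vocabulary; this is an illustrative sketch, not a DataHub schema:

```python
from dataclasses import dataclass
from enum import Enum

class MetadataKind(Enum):
    COMPLIANCE = "compliance"    # e.g. PII classification tags
    SEMANTIC = "semantic"        # e.g. "CUST_ID means customer"
    OPERATIONAL = "operational"  # e.g. freshness, pipeline health

@dataclass(frozen=True)
class MetadataRecord:
    """One annotation on one field of one dataset."""
    dataset: str
    field: str
    kind: MetadataKind
    value: str

# All three kinds can coexist on the same column in one store,
# which is the centralization argument in miniature.
record = MetadataRecord("orders", "CUST_ID", MetadataKind.COMPLIANCE, "PII")
```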
4. Operational Challenges in Using Metadata
Operational challenges in using metadata for AI include integrating metadata from various systems, managing the volume and variety of metadata across diverse data sources, ensuring consistency and standardization, maintaining detailed data lineage, meeting regulatory requirements, and protecting privacy.
Centralized approaches, such as using DataHub, could streamline the process and reduce friction. For instance, automation for tasks like mapping user IDs and handling PII would enhance efficiency and reduce the burden on data scientists.
5. Future Trends and Directions in AI
While large language models (LLMs) will continue to grow, we will face a reality check in their applicability to different data types, especially relational and structured data. While GNNs are great for arbitrary structures, classical methods may be more suitable in regulated environments.
That’s why the community needs an open-minded approach; combining classical methods with advanced techniques like GNNs will be key.
TL;DR: LLMs are not a panacea and should be part of a portfolio approach.
Watch this space for more such conversations with industry leaders.
Connect with DataHub