#183 Are Lakehouses Ready for AI Guests?
Data+AI Lakehouse

In previous newsletters (#135, #142, #144), I emphasized the crucial need to transform all data sources into vector embeddings, as well as the need for a unified repository for these embeddings, much like data lakes provided for big data. Recently, data lakes have evolved into lakehouses, combining the flexibility of data lakes with the performance of data warehouses. Let's explore whether lakehouses are ready to host these AI guests.

When considering AI guests, there are two primary types: vector embeddings and models. Vector embeddings are essential because they are the only input format that large language models (LLMs) understand. As LLMs become ubiquitous, so too will embeddings.

Vector embeddings extract meaning and context from raw data, without which the intelligence of LLMs cannot be manifested.

Key Applications of Embeddings

Continued Pre-training

I do not foresee a situation where an enterprise will pre-train a model from scratch. However, in some cases, enterprises might prefer to further pre-train an existing model on an additional corpus. This unsupervised learning process involves converting domain-specific corpora into embeddings, allowing the model to learn specialized language patterns and knowledge. This step enhances the LLM's capabilities in particular areas without the need for complete retraining.
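To make this concrete, here is a minimal sketch of continued pre-training using the Hugging Face Trainer. Everything named here is a placeholder, not the author's setup: "gpt2" stands in for your base model and "domain_corpus.txt" for your domain corpus.

```python
# Minimal continued pre-training sketch with Hugging Face Transformers.
# "gpt2" and "domain_corpus.txt" are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives the causal (next-token) objective: unsupervised
    # learning on raw domain text, i.e. the "further pre-training" above.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```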

Instruction Fine-Tuning

When adapting pre-trained models to specific tasks, embeddings of both inputs and desired outputs (ground truth) are used to refine the model's parameters, enhancing performance on targeted applications.
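One common way to prepare these input/ground-truth pairs is to serialize each pair into a single training string before tokenization. The template below is a hypothetical illustration; real projects follow the prompt format their base model expects.

```python
# Hypothetical instruction-tuning template; the "### Instruction/Response"
# markers are illustrative, not a universal standard.
def format_example(instruction: str, ground_truth: str) -> str:
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{ground_truth}")

print(format_example("Summarize our Q3 sales report.",
                     "Q3 revenue grew 12% quarter over quarter..."))
```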

Prompt Processing

User queries and prompts are converted into embeddings, transforming text into a format that LLMs can process efficiently.
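As a minimal sketch, the sentence-transformers library turns a prompt into such an embedding in a couple of lines; the model name is an arbitrary example, not a recommendation.

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is an arbitrary example model emitting 384-dim vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("How do I reset my password?")
print(vec.shape)  # (384,) -- the prompt is now a point in embedding space
```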

Retrieval-Augmented Generation (RAG)

RAG systems use embeddings for two key purposes: a) Converting knowledge base content into embeddings for efficient storage. b) Performing similarity searches to retrieve relevant information based on the user's query embedding.
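Here is a toy sketch of both steps, with made-up documents and an in-memory matrix standing in for the vector store; the embedding model is again an arbitrary example.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Reset your password under Account > Security.",
        "The resort pool opens at 8am.",
        "Invoices are emailed on the 1st of each month."]

# (a) embed and store the knowledge base (here: just a matrix in memory)
doc_embs = model.encode(docs, normalize_embeddings=True)

# (b) embed the query and rank documents by cosine similarity
query_emb = model.encode("How do I change my password?",
                         normalize_embeddings=True)
scores = doc_embs @ query_emb
print(docs[int(np.argmax(scores))])   # -> the password document
```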

Context Augmentation

Retrieved information is combined with the original query, creating an augmented prompt. This combined text is then converted into a single embedding for LLM processing.
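A minimal sketch of this assembly step, reusing the toy retrieval result from above:

```python
# Hypothetical prompt assembly: retrieved context + original query, one string.
retrieved = "Reset your password under Account > Security."
query = "How do I change my password?"
augmented_prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{retrieved}\n\n"
    f"Question: {query}"
)
# This single string is then tokenized/embedded and sent to the LLM.
```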

The Lakehouse Architecture

Before we delve into whether lakehouses are suitable for hosting embeddings, let's first understand what a lakehouse is. While there is no official standards body for lakehouses, an unofficial standard known as the "open lakehouse" is emerging. Since Lakehouse is both a metaphor and a technical term, let’s continue the metaphor to understand different parts of an open lakehouse:

Imagine a lakehouse as a high-tech nature resort for your data guests. Here's what makes an Open Lakehouse resort special:

Smart Luggage Lockers (Parquet)

Our lakehouse resort offers state-of-the-art luggage lockers for all our data guests. These aren't your ordinary lockers; they're columnar storage marvels, primarily using Parquet, that compress and organize your data luggage for quick and easy access. Imagine folding your clothes so efficiently that you can fit an entire wardrobe in a carry-on! That's what Parquet does for your data.

Examples: Parquet as the foundation, JSON where needed.
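For a hands-on feel, here is a minimal pyarrow sketch; the file name is made up. Reading a single column back without touching the others is exactly the efficiently folded wardrobe at work.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny "guest" table; Parquet stores each column contiguously.
table = pa.table({"guest_id": [1, 2, 3], "country": ["US", "IN", "DE"]})
pq.write_table(table, "guests.parquet")

# Columnar layout: open one "locker" without touching the rest.
countries = pq.read_table("guests.parquet", columns=["country"])
print(countries)
```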

The Magical Luggage Ticket Service (Delta/Iceberg/Hudi)

While our Parquet lockers are excellent for storing your data luggage, managing them directly can be quite cumbersome, especially when you need to add, update, or keep track of your belongings over time. That's where our Magical Luggage Ticket Service comes in, powered by advanced systems like Delta tables or Apache Iceberg tables.

Imagine checking your luggage at our resort. Instead of dealing with individual lockers, you receive a single, magical ticket. This ticket, much like Delta tables or Iceberg tables, keeps track of all your belongings, regardless of how many times you add to or modify your luggage. Need to add a new data suitcase? Or perhaps update the contents of an existing one? No problem - your ticket automatically updates to reflect these changes. It's like having a dynamic luggage tag that always knows the current state of your belongings, their history, and even allows you to "time travel" to see what was in your luggage at any point in the past.
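The delta-rs Python bindings (the deltalake package) are one convenient way to try this ticket service locally; the table path and contents below are hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Check in some luggage, then add to it (two commits = two table versions).
write_deltalake("luggage", pd.DataFrame({"item": ["socks"], "qty": [2]}))
write_deltalake("luggage", pd.DataFrame({"item": ["shirt"], "qty": [1]}),
                mode="append")

print(DeltaTable("luggage").version())               # current version: 1
print(DeltaTable("luggage", version=0).to_pandas())  # "time travel" to check-in
```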

The Universal Luggage Retrieval Language (SQL)

At our high-tech data resort, everyone speaks SQL. It's like the universal language of our international hotel, widely understood and accepted by guests and staff alike. In the past, retrieving luggage (or data) could be slow, like waiting for an old-fashioned dumbwaiter to bring up your suitcase; this was once a major hurdle in adopting the lakehouse model. But thanks to recent breakthroughs in query engines, we've revolutionized our luggage retrieval system.
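Many engines now speak this universal language directly over lakehouse storage. As one sketch, DuckDB can query the guests.parquet file from the locker example with plain SQL and no loading step.

```python
import duckdb

# Plain SQL straight over the Parquet file written earlier.
duckdb.sql(
    "SELECT country, count(*) AS guests "
    "FROM 'guests.parquet' GROUP BY country"
).show()
```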

The All-Knowing Luggage Concierge (Unity Catalog/Polaris+Horizon)

Every great resort needs an exceptional concierge, and ours is second to none. Our Luggage Concierge Service, powered by products like Databricks' Unity Catalog, keeps track of every item in every suitcase. Need to know which suitcase contains your favorite data shirt? Or perhaps you're trying to remember when you last packed a specific data item? Our concierge has all the answers. But that's not all – we take your luggage privacy very seriously. Just like a high-end hotel restricts access to your personal belongings, Unity Catalog ensures that only authorized personnel can inquire about or access your data luggage. It's like having a personal assistant with a photographic memory for your entire data wardrobe, who also doubles as a security guard, ensuring you can always find what you need, when you need it, while keeping prying eyes away.
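On Databricks, these concierge duties are typically expressed as SQL grants. A minimal sketch, assuming a Unity Catalog-enabled workspace where a `spark` session is provided; the catalog, table, and group names are all hypothetical.

```python
# Assumes a Databricks notebook where `spark` is provided and Unity Catalog
# is enabled; catalog/schema/table and group names are hypothetical.
spark.sql("GRANT SELECT ON TABLE main.resort.luggage TO `data_guests`")
spark.sql("REVOKE SELECT ON TABLE main.resort.luggage FROM `former_guests`")
```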

The Magical Luggage-Viewing Service (Delta Sharing)

Our lakehouse resort has recently expanded its magical luggage-viewing service. With Delta Sharing, we offer seamless sharing for guests using our Delta ticket system. It's like allowing other guests or even people from other hotels to see inside your luggage without actually giving them physical access to your belongings. They can see real-time updates to your luggage contents, but can't modify anything.
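The open-source delta-sharing client shows the guest's side of this viewing service; the profile file and table path below are placeholders issued by a hypothetical provider.

```python
import delta_sharing

# "profile.share" is a credentials file issued by the data provider
# (hypothetical); table paths follow <share>.<schema>.<table>.
client = delta_sharing.SharingClient("profile.share")
print(client.list_all_tables())

df = delta_sharing.load_as_pandas("profile.share#demo_share.resort.luggage")
print(df.head())   # read-only view of someone else's luggage
```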

The All-Weather Data Resort (Streaming & Batch)

Our lakehouse resort isn't just a fair-weather destination. We pride ourselves on being open and fully operational in all data climates. Think of us as an all-season resort that caters to every type of data, no matter how it arrives.

Imagine our resort nestled beside a pristine data lake. On one side, we have a constant flow of data rivers (streaming data) flowing directly into our lake. These rivers represent real-time data streams - maybe it's the continuous flow of social media posts, IoT sensor readings, or live transaction data. Our resort is equipped with special waterways and channels that can capture, process, and store this endless stream of information without missing a drop.

But that's not all. We're also fully prepared for the data rain (batch data) that pours in periodically. Picture sudden downpours of large datasets - perhaps nightly backups, weekly reports, or monthly analytics. Whether it's a light sprinkle of updates or a heavy deluge of new information, our resort can handle it all with ease.
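In code, the river and the rain can even land in the same Delta table. A minimal sketch, assuming a Spark session with Delta Lake configured; all paths are placeholders.

```python
# One Delta table fed by both a river (stream) and rain (batch).
# Assumes `spark` is a session with Delta Lake configured.
stream = (spark.readStream
          .format("rate").load()                      # toy streaming source
          .writeStream.format("delta")
          .option("checkpointLocation", "/tmp/ckpt")
          .start("/tmp/lake/events"))                 # continuous flow

batch = spark.read.json("/tmp/raw/nightly_dump.json") # periodic downpour
batch.write.format("delta").mode("append").save("/tmp/lake/events")
```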

AI Guest Readiness: Accommodating Embeddings and Models in Lakehouses

Lakehouses are excellent for hosting data guests, but how do they fare with AI guests? When AI guests arrive, they typically ask a few common questions. Let's explore these inquiries and how lakehouses can address them effectively.

Q1: Where Will I Store My Luggage?

When I previously suggested that embeddings would dominate the AI landscape, I was only partially correct. As embeddings proliferate, so will the number of models. As discussed in our last newsletter, we'll see models of various sizes for diverse needs. If lakehouses aspire to be the premier destinations in the AI world, outcompeting alternatives, they must consider facilities for both embeddings and models. While embeddings are content with standard rooms, models demand luxurious presidential suites.

Many lakehouse "rooms" have been optimized for Parquet, creating a natural inclination to maximize their use due to cost-efficiency. Parquet excels at efficient column storage, offering compression and encoding techniques like dictionary and run-length encoding. However, it struggles with efficiently storing complex, nested structures within a row - a limitation when dealing with embeddings and models.
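Vectors can still be packed into Parquet as fixed-size lists, as the sketch below shows, but note that this answers only the storage question; Parquet offers no similarity index over those lists.

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

dim = 4                                   # toy dimension; real models use 100s
embs = np.random.rand(3, dim).astype("float32")

# One vector per row, stored as a fixed-size list of float32.
vectors = pa.FixedSizeListArray.from_arrays(pa.array(embs.ravel()), dim)
pq.write_table(pa.table({"doc_id": [0, 1, 2], "embedding": vectors}),
               "embeddings.parquet")
# Parquet happily stores this, but provides no similarity search over it.
```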

Model Storage: A Different Beast

It's crucial to recognize the difference between embedding and model storage needs. While embeddings benefit from small per-vector efficiency gains, models require efficient storage for hundreds of millions, or even billions, of parameters (weights and biases). This vast difference in scale means that despite various existing formats like ONNX, TensorFlow formats, and PMML, large language models (LLMs) often opt for custom formats.

LLMs, being the dominant force in the AI world, often dictate storage standards. Model registries (MLflow, Snowpark, SageMaker) will immediately support whatever custom formats these giants utilize. Among the formats likely to prevail are GGUF (the successor to GGML) and the Transformers format used by Hugging Face.

Model file formats will be determined by the requirements of the Llama family. The key factors will be efficiency and interoperability, as evidenced by the widespread use of the GGUF and Transformers formats.
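For a feel of the GGUF side, the llama-cpp-python bindings load such files directly; the model path below is a placeholder for any locally downloaded GGUF checkpoint.

```python
from llama_cpp import Llama

# "model.gguf" is a placeholder for a locally downloaded GGUF checkpoint.
llm = Llama(model_path="model.gguf")
out = llm("Q: What is a lakehouse? A:", max_tokens=32)
print(out["choices"][0]["text"])
```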

Q2: How Will I Find Other AI Guests?

After you have checked into a hotel and arranged your luggage, you will want to find out where your friends are.

The Evolution of Data Storage and Search

Traditionally, data has been organized into rows and columns, with SQL and indexing optimized for efficient retrieval. However, the big data era introduced new data types like time-series and streaming data, leading to the rise of columnar storage formats such as Parquet. These advancements improved compression and query performance for analytical workloads.

While these developments significantly enhanced data management for structured and semi-structured data, they still rely on traditional indexing and query methods. As we move into the realm of embeddings—high-dimensional vectors representing semantic meaning—these conventional approaches fall short. Embeddings require a fundamentally different approach to storage and search, one that can efficiently handle the unique characteristics of vector data and similarity-based queries.

Searching for Embeddings: A New Paradigm

Searching for embeddings fundamentally differs from traditional data retrieval. Instead of exact matches, we look for similarity in high-dimensional space. Two primary methods are commonly used: Euclidean distance and cosine similarity. Think of Euclidean distance as the Earth-Moon relationship; their physical proximity indicates a strong connection. Cosine similarity, on the other hand, is like aligning distant stars in zodiac constellations; it's not about absolute distance, but directional alignment from our perspective.
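Toy vectors make the contrast visible: the two metrics can rank the very same neighbours differently.

```python
import numpy as np

a = np.array([1.0, 0.0])          # our reference point
b = np.array([10.0, 0.0])         # same direction, far away (a distant star)
c = np.array([1.0, 1.0])          # nearby, different direction (the Moon)

dist = lambda u, v: float(np.linalg.norm(u - v))
cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(dist(a, b), dist(a, c))     # 9.00 vs 1.00 -> c is closer
print(cos(a, b), cos(a, c))       # 1.00 vs 0.71 -> b is better aligned
```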

However, the high dimensionality of embeddings poses significant challenges. Imagine searching for a specific star in the vast universe – the task becomes increasingly complex as the search space expands. To address this, we use approximate nearest neighbor (ANN) search techniques. ANN is like having a smart telescope that quickly scans the sky, focusing on the most promising areas rather than examining every star individually. It might occasionally miss the absolute closest star, but it finds very close matches much faster.
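As one sketch of such a smart telescope, FAISS's HNSW index finds approximate neighbours over random vectors; after L2-normalization, ranking by L2 distance is equivalent to ranking by cosine similarity.

```python
import numpy as np
import faiss

dim = 384
xb = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(xb)               # unit vectors: L2 order == cosine order

index = faiss.IndexHNSWFlat(dim, 32) # HNSW graph = the "smart telescope"
index.add(xb)

xq = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(xq)
distances, ids = index.search(xq, 5) # 5 approximate nearest neighbours
print(ids)
```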

Looking Ahead: Future Trends in Data+AI Lakehouses

It's a near certainty that customers will expect both Data and AI workloads to be served from the same architecture. In our metaphor, both data and AI guests need to be accommodated in the same lakehouse resort. This convergence is driving several emerging trends.

Catalogs as First-Class Citizens

Traditionally, databases were the central focus of data management. However, the emphasis has shifted towards catalogs, which are now becoming the backbone of data and AI asset management. This shift is evident in Databricks' significant focus on Unity Catalog, which centralizes governance, metadata management, and access control across multiple cloud environments. Unity Catalog's recent open-sourcing efforts reflect the industry's move towards more integrated and accessible data management solutions.

Evolution of Classic ML Tooling

Classic ML tools, such as MLflow in Databricks or SageMaker in AWS, are expected to undergo significant changes. While they won't become obsolete, their roles will narrow as the focus shifts towards fine-tuned models. The majority of use cases will be addressed by these refined models, with the Llama family potentially leading in the open-weights category. Experimentation as a concept will become less critical. Instead, the emphasis will be on operationalizing models, seamlessly incorporating fine-tuning into the process. For example, MosaicML's Compound AI framework is already moving in this direction. Similarly, AWS is likely to see a shift where SageMaker's importance diminishes and Bedrock takes on a larger role in operationalizing foundation models.

Dipanshu Mansingka

Principal Consultant / NITI's AIM/ATL Mentor

3 months ago

In data lakes, tools like Privacera / Apache Ranger protect data across sources such as Hive. A gateway confirms authorization and passes on the query; it also checks with AD/DC what data-access limitations the user has and returns the result set accordingly. When we store data in a vector DB, we do not have this control. There is also a risk of sharing the query output, along with the questions, with the LLM models. How do we protect data in a vector DB?

Ishu Bansal

Optimizing logistics and transportation with a passion for excellence | Building Ecosystem for Logistics Industry | Analytics-driven Logistics

3 months ago

What steps can be taken to bridge the gap between lakehouses and generative AI workloads? Excited to see the advancements in data and AI integration! #Databricks #Lakehouse #LLMs #UnityCatalog.
