#183 Are Lakehouses Ready for AI Guests?
Data+AI Lakehouse

In previous newsletters (#135, #142, #144), I emphasized the crucial need to transform all data sources into vector embeddings, as well as the need for a unified repository for these embeddings, much like data lakes provided for big data. Recently, data lakes have evolved into lakehouses, combining the flexibility of data lakes with the performance of data warehouses. Let's explore whether lakehouses are ready to host these AI guests.

When considering AI guests, there are two primary types: vector embeddings and models. Vector embeddings are essential because they are the only input format that large language models (LLMs) understand. As LLMs become ubiquitous, so too will embeddings.

Vector embeddings extract meaning and context from raw data, without which the intelligence of LLMs cannot be manifested.

Key Applications of Embeddings

Continued Pre-training

I do not foresee a situation where an enterprise will pre-train a model from scratch. However, in some cases, enterprises might prefer to further pre-train an existing model on an additional corpus. This unsupervised learning process involves converting domain-specific corpora into embeddings, allowing the model to learn specialized language patterns and knowledge. This step enhances the LLM's capabilities in particular areas without the need for complete retraining.
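To make this concrete, here is a minimal sketch of continued pre-training using the Hugging Face Trainer. Everything named here is a placeholder, not the author's setup: "gpt2" stands in for your base model and "domain_corpus.txt" for your domain corpus.

```python
# Minimal continued pre-training sketch with Hugging Face Transformers.
# "gpt2" and "domain_corpus.txt" are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives the causal (next-token) objective: unsupervised
    # learning on raw domain text, i.e. the "further pre-training" above.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```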

Instruction Fine-Tuning

When adapting pre-trained models to specific tasks, embeddings of both inputs and desired outputs (ground truth) are used to refine the model's parameters, enhancing performance on targeted applications.
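One common way to prepare these input/ground-truth pairs is to serialize each pair into a single training string before tokenization. The template below is a hypothetical illustration; real projects follow the prompt format their base model expects.

```python
# Hypothetical instruction-tuning template; the "### Instruction/Response"
# markers are illustrative, not a universal standard.
def format_example(instruction: str, ground_truth: str) -> str:
    return (f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n{ground_truth}")

print(format_example("Summarize our Q3 sales report.",
                     "Q3 revenue grew 12% quarter over quarter..."))
```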

Prompt Processing

User queries and prompts are converted into embeddings, transforming text into a format that LLMs can process efficiently.
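As a minimal sketch, the sentence-transformers library turns a prompt into such an embedding in a couple of lines; the model name is an arbitrary example, not a recommendation.

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is an arbitrary example model emitting 384-dim vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("How do I reset my password?")
print(vec.shape)  # (384,) -- the prompt is now a point in embedding space
```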

Retrieval-Augmented Generation (RAG)

RAG systems use embeddings for two key purposes: a) Converting knowledge base content into embeddings for efficient storage. b) Performing similarity searches to retrieve relevant information based on the user's query embedding.
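Here is a toy sketch of both steps, with made-up documents and an in-memory matrix standing in for the vector store; the embedding model is again an arbitrary example.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Reset your password under Account > Security.",
        "The resort pool opens at 8am.",
        "Invoices are emailed on the 1st of each month."]

# (a) embed and store the knowledge base (here: just a matrix in memory)
doc_embs = model.encode(docs, normalize_embeddings=True)

# (b) embed the query and rank documents by cosine similarity
query_emb = model.encode("How do I change my password?",
                         normalize_embeddings=True)
scores = doc_embs @ query_emb
print(docs[int(np.argmax(scores))])   # -> the password document
```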

Context Augmentation

Retrieved information is combined with the original query, creating an augmented prompt. This combined text is then converted into a single embedding for LLM processing.
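A minimal sketch of this assembly step, reusing the toy retrieval result from above:

```python
# Hypothetical prompt assembly: retrieved context + original query, one string.
retrieved = "Reset your password under Account > Security."
query = "How do I change my password?"
augmented_prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{retrieved}\n\n"
    f"Question: {query}"
)
# This single string is then tokenized/embedded and sent to the LLM.
```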

The Lakehouse Architecture

Before we delve into whether lakehouses are suitable for hosting embeddings, let's first understand what a lakehouse is. While there is no official standards body for lakehouses, an unofficial standard known as the "open lakehouse" is emerging. Since Lakehouse is both a metaphor and a technical term, let’s continue the metaphor to understand different parts of an open lakehouse:

Imagine a lakehouse as a high-tech nature resort for your data guests. Here's what makes an Open Lakehouse resort special:

Smart Luggage Lockers (Parquet)

Our lakehouse resort offers state-of-the-art luggage lockers for all our data guests. These aren't your ordinary lockers; they're columnar storage marvels, primarily using Parquet, that compress and organize your data luggage for quick and easy access. Imagine folding your clothes so efficiently that you can fit an entire wardrobe in a carry-on! That's what Parquet does for your data.

Examples: Parquet as the foundation, JSON where needed.
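For a hands-on feel, here is a minimal pyarrow sketch; the file name is made up. Reading a single column back without touching the others is exactly the efficiently folded wardrobe at work.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny "guest" table; Parquet stores each column contiguously.
table = pa.table({"guest_id": [1, 2, 3], "country": ["US", "IN", "DE"]})
pq.write_table(table, "guests.parquet")

# Columnar layout: open one "locker" without touching the rest.
countries = pq.read_table("guests.parquet", columns=["country"])
print(countries)
```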

The Magical Luggage Ticket Service (Delta/Iceberg/Hudi)

While our Parquet lockers are excellent for storing your data luggage, managing them directly can be quite cumbersome, especially when you need to add, update, or keep track of your belongings over time. That's where our Magical Luggage Ticket Service comes in, powered by advanced systems like Delta tables or Apache Iceberg tables.

Imagine checking your luggage at our resort. Instead of dealing with individual lockers, you receive a single, magical ticket. This ticket, much like Delta tables or Iceberg tables, keeps track of all your belongings, regardless of how many times you add to or modify your luggage. Need to add a new data suitcase? Or perhaps update the contents of an existing one? No problem - your ticket automatically updates to reflect these changes. It's like having a dynamic luggage tag that always knows the current state of your belongings, their history, and even allows you to "time travel" to see what was in your luggage at any point in the past.
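The delta-rs Python bindings (the deltalake package) are one convenient way to try this ticket service locally; the table path and contents below are hypothetical.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Check in some luggage, then add to it (two commits = two table versions).
write_deltalake("luggage", pd.DataFrame({"item": ["socks"], "qty": [2]}))
write_deltalake("luggage", pd.DataFrame({"item": ["shirt"], "qty": [1]}),
                mode="append")

print(DeltaTable("luggage").version())               # current version: 1
print(DeltaTable("luggage", version=0).to_pandas())  # "time travel" to check-in
```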

The Universal Luggage Retrieval Language (SQL)

At our high-tech data resort, everyone speaks SQL. It's like the universal language of our international hotel, widely understood and accepted by guests and staff alike. In the past, retrieving luggage (or data) could be slow, like waiting for an old-fashioned dumbwaiter to bring up your suitcase; this was once a major hurdle in adopting the lakehouse model. But thanks to recent breakthroughs in query engines, we've revolutionized our luggage retrieval system.
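Many engines now speak this universal language directly over lakehouse storage. As one sketch, DuckDB can query the guests.parquet file from the locker example with plain SQL and no loading step.

```python
import duckdb

# Plain SQL straight over the Parquet file written earlier.
duckdb.sql(
    "SELECT country, count(*) AS guests "
    "FROM 'guests.parquet' GROUP BY country"
).show()
```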

The All-Knowing Luggage Concierge (Unity Catalog/Polaris+Horizon)

Every great resort needs an exceptional concierge, and ours is second to none. Our Luggage Concierge Service, powered by products like Databricks' Unity Catalog, keeps track of every item in every suitcase. Need to know which suitcase contains your favorite data shirt? Or perhaps you're trying to remember when you last packed a specific data item? Our concierge has all the answers. But that's not all – we take your luggage privacy very seriously. Just like a high-end hotel restricts access to your personal belongings, Unity Catalog ensures that only authorized personnel can inquire about or access your data luggage. It's like having a personal assistant with a photographic memory for your entire data wardrobe, who also doubles as a security guard, ensuring you can always find what you need, when you need it, while keeping prying eyes away.
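On Databricks, these concierge duties are typically expressed as SQL grants. A minimal sketch, assuming a Unity Catalog-enabled workspace where a `spark` session is provided; the catalog, table, and group names are all hypothetical.

```python
# Assumes a Databricks notebook where `spark` is provided and Unity Catalog
# is enabled; catalog/schema/table and group names are hypothetical.
spark.sql("GRANT SELECT ON TABLE main.resort.luggage TO `data_guests`")
spark.sql("REVOKE SELECT ON TABLE main.resort.luggage FROM `former_guests`")
```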

The Magical Luggage-Viewing Service (Delta Sharing)

Our lakehouse resort has recently expanded its magical luggage-viewing service. With Delta Sharing, we offer seamless sharing for guests using our Delta ticket system. It's like allowing other guests or even people from other hotels to see inside your luggage without actually giving them physical access to your belongings. They can see real-time updates to your luggage contents, but can't modify anything.
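The open-source delta-sharing client shows the guest's side of this viewing service; the profile file and table path below are placeholders issued by a hypothetical provider.

```python
import delta_sharing

# "profile.share" is a credentials file issued by the data provider
# (hypothetical); table paths follow <share>.<schema>.<table>.
client = delta_sharing.SharingClient("profile.share")
print(client.list_all_tables())

df = delta_sharing.load_as_pandas("profile.share#demo_share.resort.luggage")
print(df.head())   # read-only view of someone else's luggage
```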

The All-Weather Data Resort (Streaming & Batch)

Our lakehouse resort isn't just a fair-weather destination. We pride ourselves on being open and fully operational in all data climates. Think of us as an all-season resort that caters to every type of data, no matter how it arrives.

Imagine our resort nestled beside a pristine data lake. On one side, we have a constant flow of data rivers (streaming data) flowing directly into our lake. These rivers represent real-time data streams - maybe it's the continuous flow of social media posts, IoT sensor readings, or live transaction data. Our resort is equipped with special waterways and channels that can capture, process, and store this endless stream of information without missing a drop.

But that's not all. We're also fully prepared for the data rain (batch data) that pours in periodically. Picture sudden downpours of large datasets - perhaps nightly backups, weekly reports, or monthly analytics. Whether it's a light sprinkle of updates or a heavy deluge of new information, our resort can handle it all with ease.
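In code, the river and the rain can even land in the same Delta table. A minimal sketch, assuming a Spark session with Delta Lake configured; all paths are placeholders.

```python
# One Delta table fed by both a river (stream) and rain (batch).
# Assumes `spark` is a session with Delta Lake configured.
stream = (spark.readStream
          .format("rate").load()                      # toy streaming source
          .writeStream.format("delta")
          .option("checkpointLocation", "/tmp/ckpt")
          .start("/tmp/lake/events"))                 # continuous flow

batch = spark.read.json("/tmp/raw/nightly_dump.json") # periodic downpour
batch.write.format("delta").mode("append").save("/tmp/lake/events")
```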

AI Guest Readiness: Accommodating Embeddings and Models in Lakehouses

Lakehouses are excellent for hosting data guests, but how do they fare with AI guests? When AI guests arrive, they typically ask a few common questions. Let's explore these inquiries and how lakehouses can address them effectively.

Q1: Where Will I Store My Luggage?

When I previously suggested that embeddings would dominate the AI landscape, I was only partially correct. As embeddings proliferate, so will the number of models. As discussed in our last newsletter, we'll see models of various sizes for diverse needs. If lakehouses aspire to be the premier destinations in the AI world, outcompeting alternatives, they must consider facilities for both embeddings and models. While embeddings are content with standard rooms, models demand luxurious presidential suites.

Many lakehouse "rooms" have been optimized for Parquet, creating a natural inclination to maximize their use due to cost-efficiency. Parquet excels at efficient column storage, offering compression and encoding techniques like dictionary and run-length encoding. However, it struggles with efficiently storing complex, nested structures within a row - a limitation when dealing with embeddings and models.
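Vectors can still be packed into Parquet as fixed-size lists, as the sketch below shows, but note that this answers only the storage question; Parquet offers no similarity index over those lists.

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

dim = 4                                   # toy dimension; real models use 100s
embs = np.random.rand(3, dim).astype("float32")

# One vector per row, stored as a fixed-size list of float32.
vectors = pa.FixedSizeListArray.from_arrays(pa.array(embs.ravel()), dim)
pq.write_table(pa.table({"doc_id": [0, 1, 2], "embedding": vectors}),
               "embeddings.parquet")
# Parquet happily stores this, but provides no similarity search over it.
```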

Model Storage: A Different Beast

It's crucial to recognize the difference between embedding and model storage needs. While embeddings benefit from small per-vector efficiency gains, models require efficient storage for hundreds of millions, or even billions, of parameters (weights and biases). This vast difference in scale means that despite various existing formats like ONNX, TensorFlow formats, and PMML, large language models (LLMs) often opt for custom formats.

LLMs, being the dominant force in the AI world, often dictate storage standards. Model registries (MLflow, Snowpark, SageMaker) will immediately support whatever custom formats these giants utilize. Among the formats likely to prevail are GGUF (the successor to GGML) and the Transformers format used by Hugging Face.

Model file formats will be determined by the requirements of the Llama family. The key factors will be efficiency and interoperability, as evidenced by the widespread use of the GGUF and Transformers formats.
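For a feel of the GGUF side, the llama-cpp-python bindings load such files directly; the model path below is a placeholder for any locally downloaded GGUF checkpoint.

```python
from llama_cpp import Llama

# "model.gguf" is a placeholder for a locally downloaded GGUF checkpoint.
llm = Llama(model_path="model.gguf")
out = llm("Q: What is a lakehouse? A:", max_tokens=32)
print(out["choices"][0]["text"])
```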

Q2: How Will I Find Other AI Guests?

After you have checked into a hotel and arranged your luggage, you will want to find out where your friends are.

The Evolution of Data Storage and Search

Traditionally, data has been organized into rows and columns, with SQL and indexing optimized for efficient retrieval. However, the big data era introduced new data types like time-series and streaming data, leading to the rise of columnar storage formats such as Parquet. These advancements improved compression and query performance for analytical workloads.

While these developments significantly enhanced data management for structured and semi-structured data, they still rely on traditional indexing and query methods. As we move into the realm of embeddings—high-dimensional vectors representing semantic meaning—these conventional approaches fall short. Embeddings require a fundamentally different approach to storage and search, one that can efficiently handle the unique characteristics of vector data and similarity-based queries.

Searching for Embeddings: A New Paradigm

Searching for embeddings fundamentally differs from traditional data retrieval. Instead of exact matches, we look for similarity in high-dimensional space. Two primary methods are commonly used: Euclidean distance and cosine similarity. Think of Euclidean distance as the Earth-Moon relationship; their physical proximity indicates a strong connection. Cosine similarity, on the other hand, is like aligning distant stars in zodiac constellations; it's not about absolute distance, but directional alignment from our perspective.
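Toy vectors make the contrast visible: the two metrics can rank the very same neighbours differently.

```python
import numpy as np

a = np.array([1.0, 0.0])          # our reference point
b = np.array([10.0, 0.0])         # same direction, far away (a distant star)
c = np.array([1.0, 1.0])          # nearby, different direction (the Moon)

dist = lambda u, v: float(np.linalg.norm(u - v))
cos = lambda u, v: float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(dist(a, b), dist(a, c))     # 9.00 vs 1.00 -> c is closer
print(cos(a, b), cos(a, c))       # 1.00 vs 0.71 -> b is better aligned
```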

However, the high dimensionality of embeddings poses significant challenges. Imagine searching for a specific star in the vast universe – the task becomes increasingly complex as the search space expands. To address this, we use approximate nearest neighbor (ANN) search techniques. ANN is like having a smart telescope that quickly scans the sky, focusing on the most promising areas rather than examining every star individually. It might occasionally miss the absolute closest star, but it finds very close matches much faster.
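As one sketch of such a smart telescope, FAISS's HNSW index finds approximate neighbours over random vectors; after L2-normalization, ranking by L2 distance is equivalent to ranking by cosine similarity.

```python
import numpy as np
import faiss

dim = 384
xb = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(xb)               # unit vectors: L2 order == cosine order

index = faiss.IndexHNSWFlat(dim, 32) # HNSW graph = the "smart telescope"
index.add(xb)

xq = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(xq)
distances, ids = index.search(xq, 5) # 5 approximate nearest neighbours
print(ids)
```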

Looking Ahead: Future Trends in Data+AI Lakehouses

It's a near certainty that customers will expect both Data and AI workloads to be served from the same architecture. In our metaphor, both data and AI guests need to be accommodated in the same lakehouse resort. This convergence is driving several emerging trends.

Catalogs as First-Class Citizens

Traditionally, databases were the central focus of data management. However, the emphasis has shifted towards catalogs, which are now becoming the backbone of data and AI asset management. This shift is evident in Databricks' significant focus on Unity Catalog, which centralizes governance, metadata management, and access control across multiple cloud environments. Unity Catalog's recent open-sourcing efforts reflect the industry's move towards more integrated and accessible data management solutions.

Evolution of Classic ML Tooling

Classic ML tools, such as MLflow in Databricks or SageMaker in AWS, are expected to undergo significant changes. While they won't become obsolete, their roles will narrow as the focus shifts towards fine-tuned models. The majority of use cases will be addressed by these refined models, with the Llama family potentially leading in the open-weights category. Experimentation as a concept will become less critical. Instead, the emphasis will be on operationalizing models, seamlessly incorporating fine-tuning into the process. For example, MosaicML's Compound AI framework is already moving in this direction. Similarly, AWS is likely to see a shift where SageMaker's importance diminishes and Bedrock takes on a larger role in operationalizing foundation models.

Dipanshu Mansingka

Principal Consultant / NITI's AIM/ATL Mentor

3 months ago

In data lakes, tools like Privacera / Apache Ranger protect data across sources such as Hive. A gateway confirms authorization and passes on the query; it also checks with AD/DC what data-access limitations the user has and returns the result set accordingly. When we store data in a vector DB, we do not have this control. There is also a risk of sharing the query output, along with the questions, with the LLM models. How do we protect data in a vector DB?

Ishu Bansal

Optimizing logistics and transportation with a passion for excellence | Building Ecosystem for Logistics Industry | Analytics-driven Logistics

3 months ago

What steps can be taken to bridge the gap between lakehouses and generative AI workloads? Excited to see the advancements in data and AI integration! #Databricks #Lakehouse #LLMs #UnityCatalog.
