Exploring Data Lake Houses

I’m excited to say that this week’s topic is something I previously knew nothing about: lake house architecture. How does it compare to data lakes and data warehouses? How might it benefit customers? What’s the connection to AI?

I ran these questions and more by Ori Rafael, the CEO and Co-Founder of lake house pioneer Upsolver. With nearly 20 years of experience in data management and entrepreneurship, he’s extremely well-versed in helping businesses handle complex data to drive efficiency and cost savings.

From implementation to governance, here’s what Rafael shared about the shift to lake houses on a recent bonus episode of The Business of Tech.

An introduction to the lake house

You’re already familiar with data warehouses, but what’s the idea behind the lake house?

Per Rafael, the lake house is a type of data warehouse architecture. The earliest enterprise data warehouses were the likes of Oracle and Teradata; these were followed by the cloud and, later, by the decoupled data warehouse. Although the cloud brought elasticity, you still couldn't decouple your storage from your compute – leaving many folks frustrated at being 'locked in' to Oracle, for example.

“You hated it because it cost you a lot of money, and it hated you because you needed to use Oracle for the things it wasn't really built for,” he said.

But with the lake house, you and your customer are in charge of building and maintaining your data layer – meaning your actual files and your own metadata. You can create tables, then use your data warehouse as pure compute.

“The idea of this is that you can use any number of query engines or warehouses on top of the same data that you and the customer are storing in your account, and you can share that across multiple engines, but you manage the security for that layer,” he said.

For the record, Rafael thinks Iceberg is the clear winner of the ‘open table format’ wars. So, in action:

“You can write an Apache Iceberg-based lake house, and you can read that data from pretty much every warehouse that you want. That's the idea of the lake house,” he said.
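
To make the data-layer idea concrete, here's a minimal sketch using PyIceberg, assuming an Iceberg REST catalog and an S3 bucket you control; the catalog URI, bucket, and table name are all hypothetical placeholders:

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to the catalog that tracks table metadata (all values hypothetical).
catalog = load_catalog(
    "default",
    uri="http://localhost:8181",           # e.g. an Iceberg REST catalog
    warehouse="s3://my-bucket/warehouse",   # object storage you own
)

# Create a table: its files and metadata live in your account,
# not inside any single warehouse vendor.
catalog.create_namespace("analytics")
schema = pa.schema([("event_id", pa.int64()), ("user_id", pa.string())])
events = catalog.create_table("analytics.events", schema=schema)

# Ingest directly into the lake house layer; no warehouse compute involved.
events.append(pa.table({"event_id": [1, 2], "user_id": ["a", "b"]}, schema=schema))

# Any Iceberg-aware warehouse or engine can now query these same files.
print(events.scan().to_arrow())
```

The point isn't the specific tools; it's that the files, metadata, and security sit in your account, and the warehouse becomes interchangeable compute on top.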

The competitive advantage

So what benefits does the lake house offer that previous data warehouse architectures don’t? Rafael gave me two.

First, cost reduction. On one hand, you're no longer stuck using an engine that's wrong for a given use case – you're not locked in, and you have more negotiating power. On the other hand, you don't need to go through a specific warehouse just to write Iceberg, for example.

“So all of that budget that you're currently paying to whatever warehouse vendor for ingesting data into it, in some cases also transforming data on top of it – all of that budget just goes away. For my customer base, in many cases, more than 50% of your warehouse budget would go away. There is a substantial cost reduction,” he said.

Second, the ability to use multiple engines.

“I don't want to marry a specific warehouse vendor. I want to allow and enable the AI revolution by creating an open data layer. So that would be the business strategy reason,” he said.
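
As a hedged illustration of that engine independence, the same hypothetical table from the earlier sketch could be read by an entirely different engine – here, DuckDB with its Iceberg extension. The table path is a placeholder, and S3 credential setup is omitted for brevity:

```python
import duckdb

# Point a second, independent engine at the same Iceberg files.
con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Hypothetical table location; in practice you'd also configure
# S3 credentials (omitted here) before scanning cloud storage.
rows = con.execute("""
    SELECT user_id, count(*) AS events
    FROM iceberg_scan('s3://my-bucket/warehouse/analytics/events')
    GROUP BY user_id
""").fetchall()
print(rows)
```

No marriage required: if DuckDB stops fitting a use case, Trino or Spark can scan the very same files.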

Improving data queries

One of Upsolver’s main focuses is optimizing storage to improve speed and efficiency around data queries. Why is Rafael zeroing in there?

To explain why, he shared this example:

Imagine that you want to move from a warehouse to a lake house. You’re doing a POC, you’re building your own lake house layer, you’re building your own storage, and finally, you go to run your queries. Suddenly, you find that all your queries are two times slower. Now, there are a ton of objections from within your organization to moving to the lake house because you're hurting user experience. On top of that, you’re now paying twice as much for storage.

The issue here is the management of your file system: it adds a burden that makes it harder to get value from the lake house. So, Upsolver addresses three things (a small-file compaction sketch follows this list):

1) File system layer management

2) Storage efficiency

3) Real-time compatibility
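
The classic culprit behind those slow queries is an accumulation of many small files. As a rough sketch of what the file-management piece involves (not Upsolver's actual implementation, and with illustrative paths and size targets), small Parquet files can be compacted into fewer, larger ones:

```python
import pyarrow.dataset as ds

# Treat a directory full of small Parquet files as one logical dataset.
small_files = ds.dataset("data/raw/events/", format="parquet")

# Rewrite them into fewer, larger files so query engines open far
# fewer objects per scan. The row-count targets below are illustrative.
ds.write_dataset(
    small_files.scanner(),
    "data/compacted/events/",
    format="parquet",
    min_rows_per_group=64_000,
    max_rows_per_group=1_000_000,
    max_rows_per_file=10_000_000,
)
```

Done continuously, alongside sorting and partitioning, this kind of housekeeping is what keeps lake house queries competitive with a tuned warehouse.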

Making data engineering skills… optional?

Another thing Upsolver focuses on is making data engineering skills more optional, particularly as they relate to data lakes. I wanted to know more about that transformation and what skills remain important.

Rafael explained that whether you’re a traditional warehouse/SQL person or more of a Spark person, he aims to accelerate and improve your warehouse’s efficiency. But Upsolver is especially valuable for Spark people who often have to craft DIY-style solutions – per Rafael, Upsolver saves them a 6-9 month production journey.

So, as platforms like Upsolver become more and more popular, how does he predict the role of the data engineer will evolve?

In short, they're not going anywhere. But, per Rafael, they will be released from the burden of having everything funneled through data engineering, making data teams more independent from software engineers.

“They're going to give them the platform. They're going to support it. They're going to monitor it. They're going to make sure the security is maintained. They're going to make sure users' data privacy is maintained. So data engineers are the chaperones, but they don't need to be the developers,” he said.

Emerging data management trends

Data management is an increasingly important industry. What big trends does Rafael think business owners should prepare for?

First on his list was the open table format. With the lake house, for example:

“You took the database, you ripped it apart, so now you have a catalog… You have three pieces – four pieces with the query engine. So you have four pieces of what used to be just one piece. You need to understand it. What catalog would I want to work with? Do I want to buy it from a warehouse company? Do I want to buy it from the cloud provider? So, all of that education is something that a manager in an enterprise working with data needs to do,” he said.
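
To unpack that: the monolithic database has been split into a catalog, a table format, the storage itself, and the query engine. A loose sketch of the decomposition, with every value a stand-in:

```python
# The four pieces of what used to be one database (all values hypothetical).
lakehouse_stack = {
    "catalog": "http://rest-catalog.internal:8181",  # tracks table metadata
    "table_format": "iceberg",                       # how tables are laid out
    "storage": "s3://my-bucket/warehouse",           # where the files live
    "query_engines": ["trino", "spark", "duckdb"],   # interchangeable compute
}
```

Each piece is now a separate buying decision, which is exactly the education Rafael says enterprise data managers need to do.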

Second, AI. From a data management perspective, it's time to figure out a better way to support AI, and open table formats alone are probably not enough. Vector databases are especially relevant to gen AI, so he recommends preparing there.
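
For readers new to the term: a vector database stores embeddings and retrieves the ones closest to a query, the lookup step behind much of gen AI. A toy sketch of that core operation, using random vectors as stand-ins for real embeddings:

```python
import numpy as np

# Random stand-ins for stored document embeddings; a real system would
# use an embedding model and an approximate-nearest-neighbor index.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 384))   # 1,000 docs, 384-dim embeddings
query = rng.normal(size=384)

# Cosine similarity: the operation a vector database optimizes at scale.
scores = doc_vectors @ query / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
)
top5 = np.argsort(scores)[-5:][::-1]         # indices of the 5 closest docs
print(top5)
```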

Finally, governance. How will you manage governance across multiple engines? The number of data engines in use is only going to grow, so how will consistent policies be maintained across all of them? Luckily, there's a lot of innovation in this space that Rafael's excited to watch.

The AI relationship

That led me to my final question: a lot of organizations are struggling to strategize around data governance in order to take advantage of AI. What does Rafael recommend here?

His answer was simple: policy. You already have policies for data in general, but if you’re using multiple engines, they won’t translate:

“You kind of need to choose where your catalog is going to be, because now you're going to actually have the option of having policies that will translate across multiple engines. That would be piece number one, who actually has access to the data,” he said.
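
There's no single standard API for cross-engine policy yet, but as a purely illustrative sketch (every name here is hypothetical) of the idea: defining access once, at the catalog level, means every engine sees the same answer to "who can read this table":

```python
# Hypothetical sketch: policies live with the catalog, not with any engine.
POLICIES: dict[str, set[str]] = {
    "analytics.events": {"data_team", "ml_team"},
    "finance.invoices": {"finance_team"},
}

def can_read(principal_groups: set[str], table: str) -> bool:
    """True if any of the caller's groups is allowed to read the table."""
    return bool(principal_groups & POLICIES.get(table, set()))

# The same check applies no matter which engine issues the query.
assert can_read({"ml_team"}, "analytics.events")
assert not can_read({"ml_team"}, "finance.invoices")
```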

Although it’s not his area of expertise, Rafael believes now is the time to consider who has access to your data and how AI may create vulnerabilities.

Have you used a lake house? Thinking of making the switch? As always, my inbox is open for stories, insights, questions, and more.
