Tabular: Turning Your Data Swamp into a Data Lakehouse with Apache Iceberg
By Jamin Ball and Gwen Umbach
Today we’re very excited to announce our partnership and Series B investment in Tabular, a company building a managed data platform around Apache Iceberg. Tabular is a compelling data lakehouse solution, meaning it brings data warehouse functionality (SQL semantics + ease of use) to the data lake (cost-efficient and scalable).
If you want your data platform to run like those at Netflix, Salesforce, Stripe, Airbnb and many others (i.e. object storage + Iceberg), then we believe that Tabular is the right platform for you! Iceberg is one ingredient in the fully packaged Tabular lakehouse.
If you don’t want to manage all of the infrastructure around Iceberg (or allocate headcount to do it!) but want the benefits of a compelling data lakehouse, then Tabular is the right platform for you.
If you want features in your lakehouse (on top of open source Iceberg) for ingestion, CDC, streaming (file loading, Kafka Connect, etc.), schema evolution, compaction, optimization, time travel, snapshots, auto-scaling, maintenance (no more writing Spark jobs to delete snapshots!), universal access controls (RBAC) or a REST catalog, then we feel that Tabular is the right platform for you!
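For a sense of what that manual maintenance looks like today, here is a minimal sketch of the kind of housekeeping job Tabular automates, using the standard Iceberg Spark procedures. The catalog name (`lake`) and table name are hypothetical, and it assumes a Spark session configured with the Iceberg runtime and SQL extensions.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Iceberg runtime and an Iceberg catalog
# named `lake`; the table name `db.events` is hypothetical.
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots to reclaim storage (bounds the time-travel window).
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# Compact small data files into larger ones for faster scans.
spark.sql("CALL lake.system.rewrite_data_files(table => 'db.events')")
```

With Tabular, jobs like this run automatically behind the scenes instead of living in someone’s cron-scheduled Spark code.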
Before diving into more specifics about what Tabular does, I’d like to start with a brief overview of how cloud-native data architectures have evolved over the last decade. Then I’ll weave in where Tabular and Iceberg fit, and why they’re already playing a prominent role in the future (and present) of data infrastructure. The majority of this post will be centered around the data lakehouse architecture.

Data lakes have been a staple for a long time, storing both structured and unstructured data. Over the last few years, the importance of data lakes has risen dramatically as their capabilities have evolved. The rise of foundation models and generative AI only furthers this trend. A key benefit of foundation models is their ability to add reason and logic on top of unstructured data. Data lakehouses provide the crucial piece of infrastructure for foundation models to securely and efficiently access both structured and unstructured data, with governance intelligently wrapped around it.

But this isn’t another post about AI; it’s about the future of data infrastructure. The two are deeply intertwined, though. As Frank Slootman (Snowflake CEO) said, “Enterprises are also realizing that they cannot have an AI strategy without a data strategy to base it on.” Similarly, Satya Nadella (Microsoft CEO) said, “Every AI app starts with data, and having a comprehensive data and analytics platform is more important than ever.” This post will dig into how the foundation layer of AI, data infrastructure, is evolving. Let's dive in.
Data lakes that turn into swamps
People describe data as the new oil, but I think it’s more than that. Data is oxygen; it’s necessary for survival. Data is becoming the competitive force companies differentiate on. Data creates long-term moats and ultimately helps organizations deliver differentiated products to their end customers. How you acquire customers, which features you build, and which promotions you run are all data-driven decisions that compound into long-term advantages.
One key component of a modern cloud-native data architecture is the data lake - which allows for cost-efficient storage of large amounts of data used for analytics and ML at scale. Another key component is the data warehouse, which provides a central store of (typically) structured and transformed data ready for SQL queries.
I wrote a post about the data lake / data warehouse a few years back, which you can find here. One key graphic I’ll share from that post is below. This image is an oversimplified view of the two-tier lake / warehouse architecture. The data lake is optimized for cheap and scalable storage (but not retrieval) of all kinds of structured or unstructured data. Data sitting in a data lake is then extracted, loaded and transformed into a data warehouse. The warehouse is then optimized for efficient access (typically through SQL) to that data, with a number of other properties layered in (like governance, access, security, etc.).
In this post, I’ll focus more on the data lake portion of the diagram above. Typical data lake storage solutions include AWS S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS), or Hadoop Distributed File System (HDFS).
A natural question is “why do we have two tiers? Why not have one central place where data is stored?” Well, the data lake comes with many challenges. At the end of the day it’s optimized for storage, not access (this is something I’ll refer to a lot).
There have been four big issues with data lakes: (1) you can’t perform ACID transactions, (2) you can’t write SQL against them, making it difficult to retrieve data (Hive let you, but with terrible performance), (3) there’s no real security / governance, and (4) data quality controls don’t exist. These four aren’t necessarily as absolute as I make them sound, but doing any of them on a data lake is incredibly hard. Because of all of these issues, the data lake was really more of a data swamp: hard to use and messy.
I mentioned ACID transactions, so let’s unpack that. At a high level it’s one of the core properties of any production database, and it guarantees that database transactions are processed accurately and reliably. Here’s what the acronym stands for:

Atomicity: a transaction either fully completes or has no effect at all.
Consistency: every transaction moves the database from one valid state to another.
Isolation: concurrent transactions don’t see each other’s intermediate state.
Durability: once a transaction commits, it survives crashes and failures.
In summary, ACID properties guarantee that if I ask a question of a database, I get the right answer back, no matter what. This matters in large-scale production systems where there may be thousands of concurrent requests (both reads and writes) and components can fail, yet the same question must always return the same answer.
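To make atomicity concrete, here is a toy sketch using nothing lakehouse-specific, just Python’s built-in sqlite3 module with hypothetical account names: either every statement in the transfer commits, or none of them do.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
con.commit()

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        # Enforce a consistency rule; failing here aborts the whole transaction.
        (balance,) = con.execute(
            "SELECT balance FROM accounts WHERE name = 'alice'"
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        con.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Neither update stuck: alice still has 100, bob still has 0.
print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```

That all-or-nothing behavior is exactly what a raw pile of files in object storage cannot give you on its own.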
Because of these data lake shortcomings, the cloud data warehouse was born. The data warehouse (Snowflake, AWS Redshift, Azure Synapse, Google BigQuery) provided ACID guarantees, let you write SQL to retrieve data, came with powerful query engines to optimize the retrieval of data, offered an engine to transform, manipulate and organize tables, and wrapped all of this inside tight security and governance policies.
From Data Swamp to Data Lakehouse
Over time, people have tried to merge these two worlds (the data warehouse and data lake) into what many now call a data lakehouse. Look no further than the Snowflake / Databricks debate for evidence of this! Both of these vendors want to provide a one-stop shop for all of your data needs; Snowflake starts from a warehouse perspective and Databricks from that of a data lakehouse.
In truth, innovations in the past few years are helping evolve the data lake (swamp) into a data lakehouse! I’ll dig in more on this later, but to summarize ahead of time, the four main components of a data lakehouse are:
1. Data lake storage
2. File formats
3. Table formats
4. Compute engines
On top of this (above the compute engines) we have applications that use data, like BI tools or notebooks (Looker, Tableau, Hex, Sigma, Jupyter notebooks, etc.).
One of the first popular data lake engines was Apache Hive (in the Hadoop era, designed for HDFS, not necessarily object stores like S3). As I described above, the data lake solutions (S3, ADLS, GCS, HDFS) were really just storage layers. The Hive engine gave us more efficient access patterns to data lake storage. It was file format-agnostic and could give a single answer to “what data is in this table.” However, there were MANY challenges that ultimately led to the end of the Hadoop / Hive era.
Note that Hive is a compute engine combined with a metastore that became a de facto catalog for the data lake. But Hive does not provide a true table specification.
Apache Iceberg – the Table Specification for your Lakehouse
This is where Iceberg comes in. Iceberg is an open table format developed by Ryan Blue and Dan Weeks (two of the three co-founders of Tabular) while they were at Netflix. The third Tabular founder is Jason Reid, Ryan and Dan’s internal customer as Director of Data Engineering. Netflix was facing many challenges (some described above) with their data lake, and Iceberg was created to solve a lot of the challenges of Hive (at Netflix scale).
Tabular provides a managed Iceberg and data lakehouse solution. There are still thousands of companies using Hive. If you are one of them and have any frustrations with it, I think you should chat with Tabular :)
So what is a table format, and what is Iceberg? A table format is a way to organize data files (Parquet, JSON, etc.) and present them as a single table. It’s an abstraction over file formats that can be presented to query engines (it’s difficult for query engines to point queries directly at the data lake storage layer).
Iceberg uses metadata to do the heavy lifting, as opposed to the directory listings Hive relies on. The metadata defines the table structure, partitions, and data files, so you don’t need to list a directory to answer a query (more on that architecture below). Iceberg is open source and is the leading table format, having been adopted by Snowflake, AWS, GCP, Databricks, Confluent and many others, with contributions and usage coming from some of the largest organizations like Netflix, Apple, LinkedIn, Adobe, Salesforce, Stripe, Pinterest, Airbnb, Expedia and many more. Adoption of Iceberg across the data industry is accelerating, with new Iceberg announcements from major platforms landing regularly, and it’s clear it’s well on its way to becoming an industry staple.
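To make that concrete, here is a minimal sketch of reading an Iceberg table through its metadata with pyiceberg. The catalog name and table are hypothetical, and it assumes a catalog is already configured in your environment; the point is that the scan is planned from manifests and partition statistics rather than by listing directories.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Assumes a catalog named "lake" is configured (e.g. in ~/.pyiceberg.yaml);
# the namespace and table name are hypothetical.
catalog = load_catalog("lake")
table = catalog.load_table("db.events")

# Scan planning reads table metadata (manifest lists, manifests, partition
# stats) to prune data files, so no directory listing is needed.
scan = table.scan(row_filter=EqualTo("event_date", "2024-01-01"))
print(scan.to_arrow().num_rows)
```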
How does Iceberg accomplish so much? The chart below is from the Iceberg spec.
I won’t go into the specifics of each layer (catalog, metadata, and data layers), but if you want more technical detail this post does a great job.
So what does Iceberg enable? What are the benefits of Iceberg and Tabular?
With Iceberg and its shared-nothing architecture you can do branching and tagging. When you’re running merge commands, or testing changes to them, you can do that work on a branch of the data, then test, validate, and push the branch to production. Tagging is also quite cool: you can tag the version of a table used for a quarterly report or for training an AI model so you can refer back to it later. Ultimately, every change to an Iceberg table is a snapshot.
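As a rough sketch of what branching and tagging look like in practice (the table name is hypothetical; assumes a Spark session configured with the Iceberg runtime and SQL extensions, and exact syntax varies by Iceberg and Spark version):

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg catalog named `lake` is configured on the session.
spark = SparkSession.builder.appName("branch-and-tag").getOrCreate()

# Create a branch and route writes to it (write-audit-publish style),
# leaving `main` untouched while changes are validated.
spark.sql("ALTER TABLE lake.db.orders CREATE BRANCH staging")
spark.conf.set("spark.wap.branch", "staging")
spark.sql(
    "MERGE INTO lake.db.orders t USING lake.db.orders_updates s "
    "ON t.id = s.id WHEN MATCHED THEN UPDATE SET *"
)

# After validation, fast-forward main to the audited branch.
spark.sql("CALL lake.system.fast_forward('db.orders', 'main', 'staging')")

# Tag the table state used for a quarterly report, and time travel back to it.
spark.sql("ALTER TABLE lake.db.orders CREATE TAG q4_report")
spark.sql("SELECT * FROM lake.db.orders VERSION AS OF 'q4_report'").show()
```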
When Ryan and team initially set out to create Iceberg, the goal was to bring the capabilities of a data warehouse to a data lake, thus creating a data lakehouse. What they discovered, and what Iceberg and Tabular have turned into, is an entirely new paradigm. We’ve seen the number of compute engines explode, and these engines have vastly different architectures and paradigms (JVM vs. Python, batch vs. streaming, etc.). This has led to an intricate and complex web of data platforms that large enterprises have to manage.
Databricks and Snowflake pitch a single place for everything, but the reality is Spark is great for some use cases, Snowflake for others, Flink for streaming, DuckDB or ClickHouse for faster real-time systems, and so on.
Further, each engine maintains its own security, governance and access policies!
Tabular - an Independent Data Platform from the Creators of Apache Iceberg
Ten years ago Snowflake did something very innovative - they separated storage and compute, and scaled each independently. Since then, as I described above, the number of compute engines has exploded. This has resulted in data stored in S3, then replicated to a number of other systems, all of which manage their own governance and security. What a headache!
Enter Tabular - Tabular is creating a new data lakehouse architecture that even further separates storage and compute by allowing you to use services from different vendors for storage (including table management) and compute. With Tabular, you have one source of truth for your storage: your data lakehouse (i.e. Tabular). Then, compute engines are mixed and matched on top based on your use cases.
This provides a very important data security advantage. Because Tabular provides RBAC on storage - down to the table level, and soon down to the column - you define security and governance policies at the lowest level once, and they get enforced across your compute engines automatically.
To recap: with Tabular you store once and use anywhere. You define and create a Tabular catalog once, and it acts as a single source of truth for data discovery and access. Data compaction keeps the storage efficient. You then bring separate compute and query engines (Snowflake, Spark, Flink, Trino, etc.) to this data (i.e. bringing compute to the data). You store the data once and use it anywhere, with any engine.
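As an illustration of the store-once, use-anywhere idea, here is a hedged sketch of two unrelated engines reading the same Iceberg table. The table name and warehouse path are hypothetical; it assumes a Spark session with an Iceberg catalog named `lake`, DuckDB with its iceberg extension available, and S3 credentials configured out of band.

```python
import duckdb
from pyspark.sql import SparkSession

# Engine 1: Spark, reading the table through an Iceberg catalog named `lake`
# (catalog and table names are hypothetical).
spark = SparkSession.builder.appName("spark-reader").getOrCreate()
spark.table("lake.db.events").groupBy("event_type").count().show()

# Engine 2: DuckDB, reading the very same files via the Iceberg metadata
# (requires the DuckDB iceberg extension; S3 access config omitted).
con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
rows = con.execute(
    "SELECT event_type, count(*) FROM "
    "iceberg_scan('s3://my-bucket/warehouse/db/events') GROUP BY event_type"
).fetchall()
print(rows)
```

Neither engine knows the other exists; the Iceberg metadata and catalog are the only shared contract.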
If you’re a large enterprise, you don’t have to pick between Databricks and Snowflake - you can have both, with unified security and governance! On top of this, you can completely avoid vendor lock-in to either platform, as data gravity moves away from the compute providers themselves. Splitting storage and compute across vendors truly enables a best-of-breed architecture while keeping the data architecture unified. Tabular makes the data storage layer universal: anyone doing any task can share their data without copying, syncing or worrying about permissions, and you govern the entire data layer in one place.
Avoid Compute Lock-In!
No one wants to get “Oracle-d.” However, as some of the more modern platforms push a one-stop-shop, all-you-can-eat data platform, the risk of lock-in is higher, especially at the compute layer, where customer costs (and vendor revenues) are soaring.
And while Iceberg provides an open table spec, every vendor of a managed Iceberg service that also sells a compute layer has a vested interest in giving its own engine a performance advantage. So you can still end up with a form of lock-in.
With Tabular these fears go away. You store your data once and are then free to use the best compute engine for each workload (i.e. portable compute, point #1 above), and since Tabular doesn’t have a horse in the compute engine race, it’s in their best interest to treat them all the same way.
This is the cloud-native data architecture of the future, and why we’re so excited to partner with Ryan and the Tabular team. Bringing the flexibility of the data warehouse to the scalability of the data lake is incredibly powerful. So is the ability to safely use multiple engines that know nothing about one another on the same data sets at the same time, whether it’s a Python process someone is running in Hex, a Snowflake job, or a Spark job. And most enterprises are already using all of these compute engines. We see a future where best-of-breed storage and compute vendors are further separated. Providing SQL warehouse behavior on the data lake with strong guarantees makes people and companies more productive with their data. Having a cloud-native data strategy will differentiate companies over the next decade, and I highly encourage everyone to reach out to the Tabular team to learn how their solution can truly up-level your data strategy.
With Tabular, I believe you can turn your data swamp into a data lakehouse.
Open Lakehouse & AI @ Databricks | Founding Tabular GTM Team
1 年free tier sign up: https://app.tabular.io/signup
Open Lakehouse & AI @ Databricks | Founding Tabular GTM Team
1 年https://www.businesswire.com/news/home/20230919876739/en/Tabular-Secures-26M-for-Independent-Data-Platform-based-on-Apache-Iceberg