Harnessing the Power of Iceberg

This article is, first, an appreciation of the work behind Apache Iceberg and, second, a look at our own incremental but hopefully significant contribution to the world of databases that these technologies enabled. I was not part of the team and have no window into the genesis of Iceberg (in full disclosure, as of this writing no one on my team is connected to either the Iceberg or Delta teams), but I thought I should share how we looked at this technology and how it helped us innovate with our product.

In the rest of this article I will stick to Iceberg, as that is what my team at Compute.AI has currently implemented. Delta, a roadmap item for us, also looks very powerful, but without hands-on experience I'll stay with what my team and I know. It's safe to assume that, at the level of this discussion, both technologies are well suited to our solution.

Building a Transaction Processing system is hard. Don't let Jim Gray's book by the same name fall on your toes; it is heavy stuff, lol. OK, so what does Transaction Processing have to do with Iceberg?

One way to look at a database is to break its functional components into DDL, DML, and DQL. Roughly, the first two deal with creating schemas and tables and with inserting or modifying data in those tables. The third, DQL (Data Query Language), is SQL issued as SELECT statements, in other words read-only work.
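
To make the acronyms concrete, here is a minimal sketch in Spark-flavored SQL (the table and data are made up for illustration):

    -- DDL: define schemas and tables
    CREATE TABLE sales (
        order_id   BIGINT,
        customer   STRING,
        amount     DECIMAL(10, 2),
        order_date DATE
    );

    -- DML: insert (or modify) data in those tables
    INSERT INTO sales VALUES (1, 'acme', 99.50, DATE '2023-01-15');

    -- DQL: read-only SELECTs, the only thing a query-only engine must handle
    SELECT customer, SUM(amount) AS total
    FROM sales
    WHERE order_date >= DATE '2023-01-01'
    GROUP BY customer;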

If someone told me to build a read-only system, I would not even consider such a thing for a second, because at some level it seems stupidly simple. The truth is that it is quite simple, and that is the point of this writeup. Interestingly, we did build such a system ourselves, but with other core technology incorporated (which I believe is technically very difficult) that is out of scope for this article. Simplicity, along with some of our own secret sauce, makes for a potent offering, I hope. If you are curious about that aspect of our deep tech, read this article.

Back to Iceberg...

Iceberg handles DDL & DML. This is extremely powerful. Data in the form of Parquet files "inherits" powerful metadata that Iceberg creates. We won't get into the details because volumes have been written on Iceberg and the space continues to move forward rapidly.
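
As a rough sketch of what this looks like in practice (Spark SQL with an Iceberg catalog; the catalog, database, and table names are hypothetical, and the catalog is assumed to be configured separately):

    -- DDL via Iceberg: schema, partition spec, and snapshots are tracked in
    -- Iceberg metadata files that live alongside the Parquet data files.
    CREATE TABLE my_catalog.db.sales (
        order_id   BIGINT,
        customer   STRING,
        amount     DECIMAL(10, 2),
        order_date DATE
    )
    USING iceberg
    PARTITIONED BY (days(order_date));

    -- DML via Iceberg: each commit writes new Parquet files plus a new
    -- metadata snapshot; there is no separate "load into the warehouse" step.
    INSERT INTO my_catalog.db.sales
    VALUES (1, 'acme', 99.50, DATE '2023-01-15');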

Iceberg-created metadata is a boon to a SQL query engine, especially its CBO (Cost Based Optimizer) layer. Let's put this into perspective: with a traditional database (including most cloud databases), many of us will wear out just waiting for the data to be loaded. For real-world database workloads that time can run into hours if not days, and even the incremental batches of data (traditionally called CDC) can take many minutes to be loaded/inserted into existing tables. This is a huge tax we pay every time we induct data into our "data warehouse," before a single query is run. The Transaction Processing Performance Council (TPC) recognized this decades ago and discounted load times when running benchmarks (e.g., TPC-H, TPC-DS, and others). However, this is not just the time it takes to load data; it includes a key compute component. The long wait for files to become database tables (the ETL step) involves generating stats, as we call them in the database world. These stats are not terribly different from the metadata Iceberg creates; they help the CBO.
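
To make the point concrete, Iceberg exposes its metadata as queryable tables. A minimal sketch in Spark SQL, reusing the hypothetical table from above:

    -- Per-file metadata: row counts, sizes, and per-column lower/upper bounds
    -- that a CBO can use for pruning and cardinality estimates.
    SELECT file_path, record_count, file_size_in_bytes, lower_bounds, upper_bounds
    FROM my_catalog.db.sales.files;

    -- Snapshot history: every committed change, available without an ETL step.
    SELECT snapshot_id, committed_at, operation
    FROM my_catalog.db.sales.snapshots;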

Why is the above even interesting? Let's look at what Iceberg brought to the table. My view of the technology is that it took a traditional database and chopped it into two unequal halves (unintended oxymoron, but it should help get to the point faster). The part on the left holds these beautiful Parquet/Iceberg files, and the part on the right is a relational compute engine that leverages the data (Parquet) and the metadata (Iceberg). Note that there is also some metadata within a Parquet file itself that the CBO leverages. Once a data lake is formed, open platforms like Spark and Presto/Trino can query the Parquet/Iceberg directly. You often find people referring to this as "SQL directly on Files and External Tables." At Compute.AI we believe this is a path to the future of analytics and to feeding AI/ML workloads. [Compute.AI is a value-add for the Spark and Presto/Trino ecosystems that prevents OOM-kills by providing memory overcommitment. Our goal is to make OSS database products enterprise grade, like an Oracle-class database: highly reliable, no-touch (no DevOps or skilled engineers needed for production systems), all without sacrificing performance. We have focused on removing the final barriers to entry for customers looking to embrace open standards and platforms. End of plug.]
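
Coming back to "SQL directly on Files and External Tables," here is a minimal Spark SQL sketch (the S3 path is hypothetical; the Iceberg table is the one from the earlier sketch):

    -- SQL directly on files: query raw Parquet with no load step at all.
    SELECT customer, SUM(amount) AS total
    FROM parquet.`s3://my-bucket/raw/sales/`
    GROUP BY customer;

    -- The same query against the Iceberg table, which layers transactional
    -- metadata on top of the same kind of Parquet files.
    SELECT customer, SUM(amount) AS total
    FROM my_catalog.db.sales
    GROUP BY customer;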

Let's delve into why this new Iceberg architecture is monumental...

Taking DDL/DML (in other words, Transaction Processing) out of the relational compute engine and making it a pre-step lends itself to a simpler, faster, and more robust query platform, as we were able to prove with an actual implementation. In fact, since our team had written a transactional system from scratch in the past, we wondered if we could pare it down and simplify it into a read-only platform; that is, rip out the transactions. Nope, that looked very hard to do, and we decided not to take that path. Writing a new relational platform from scratch would be faster. Besides, the key here is that we needed to code this rapidly so that we could spend a good part of the development effort on our core IP, which deals with memory overcommitment.

The Iceberg step is very natural for analytics workloads: you can take an elastic Spark cluster (say), run it with an Iceberg plugin in the backend, and still run transactionally correct queries. Yes, all the database features you love (point-in-time rollback or time travel, snapshots, checkpoints, and much more) simply work with Iceberg! This stuff is enterprise grade and production ready.
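
For example, time travel and rollback look roughly like this with recent Spark/Iceberg versions (the snapshot id and timestamp are placeholders; names reuse the hypothetical table from earlier):

    -- Time travel: query the table as of an earlier snapshot or timestamp.
    SELECT * FROM my_catalog.db.sales VERSION AS OF 1234567890123456789;
    SELECT * FROM my_catalog.db.sales TIMESTAMP AS OF '2023-06-01 00:00:00';

    -- Point-in-time rollback via an Iceberg stored procedure.
    CALL my_catalog.system.rollback_to_snapshot('db.sales', 1234567890123456789);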

However, nothing in computer science comes without tradeoffs, so let's look at what is being traded against what. The first and most important consideration is speed + scale + efficiency + cost; all four together make for one huge win for Iceberg. Did we mention that when running with Iceberg your data is not locked in some proprietary silo (a cloud data warehouse, a.k.a. a brick-walled garden)? When data is vendor locked-in, the customer is at the mercy of whatever the vendor chooses to charge. Locked data means losing control over managing one's compute costs.

You get the point. So, welcome to an open data lake or a lakehouse with Iceberg. Once there, you can handle the transactional pieces of your work with open technology that scales well and robustly. Note that you are not burning a hole in your pocket paying a proprietary vendor to load your data into a warehouse and compute your stats. Those compute costs can be big, as most CIOs, CDOs, and CFOs, among others, know.

The ability to scale with open platforms is a huge cost saver. It also benefits your SLAs. The speed and scale at which a cloud data warehouse loads data and computes its stats pale in comparison to those of a highly scalable Spark cluster processing your data and creating Parquet/Iceberg. A clear win.

Is there a use case that may not shine as much with Iceberg?

When transaction rates are high (we are talking batches or micro-batches where each batch can contain millions of transactions), the number of new Parquet/Iceberg files created starts to sprawl. Reading them at query time comes with a cost until "defrag"-style processes kick in to reduce the file count at the Parquet/Iceberg level (a background job), or until we apply a variety of techniques (insert strategies, merge-on-read, etc.). In our implementation at a Wall St. bank, with transactional updates arriving at rates of 1-10M transactions per micro-batch, we were able to reduce the update time from 15 minutes to ~1 minute using query optimizations. These numbers show much promise for viewing analytics results in near real time using Iceberg, and they are extremely competitive with a data warehouse while simplifying the pipeline and making it cost-effective.
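
For reference, the "defrag"-style maintenance mentioned above is typically done with Iceberg's built-in table maintenance procedures, along the lines of this Spark SQL sketch (catalog and table names are the hypothetical ones used earlier; the cutoff timestamp is a placeholder):

    -- Compact the small files produced by high-frequency micro-batches.
    CALL my_catalog.system.rewrite_data_files(table => 'db.sales');

    -- Expire old snapshots so the metadata itself does not sprawl.
    CALL my_catalog.system.expire_snapshots(
        table => 'db.sales',
        older_than => TIMESTAMP '2023-06-01 00:00:00'
    );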

Hope the article gives you folks something to think about. None of this is passing judgement on anything out there. There are many solutions in the market, and there happen to be good reasons for that. Weighing some of these considerations can help you with your data lake strategies. All this while acknowledging that, given the speed at which the Iceberg code bases are moving, we will see significant strides in technology and implementation going forward. At Compute.AI we are excited to see what comes next with Iceberg and feel fortunate to be able to leverage these products and create great value for customers.

The article officially ended above, but a few readers asked me for some takeaways. So here they are:

  1. Go open, break out of proprietary data warehouse silos, embrace open standards, and save hugely on your compute costs while achieving enormous flexibility as well as a dozen (if not more) second sourcing options
  2. A complete transition from a data warehouse to a data lake or lakehouse may not be feasible for a business, but it can be tackled in baby steps: (i) create a staging area using Parquet/Iceberg and then load that data into your warehouse; (ii) an open staging area offers a richer toolchain, including for AI/ML, and avoids the proprietary tooling that would otherwise be needed on data inside the warehouse; (iii) as an example, running ELT directly on the Parquet/Iceberg staging area with dbt on Spark/Presto/Trino can save enormous warehouse-based compute costs (a small sketch follows this list)
  3. Iceberg is the stepping stone to going open
  4. Any concerns that SQL directly on Files & External Tables performs worse than a cloud data warehouse are hopefully addressed in this article; open data lakes can run an order of magnitude faster than walled gardens and scale out nearly without limit
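
As a rough illustration of takeaway 2(iii), an ELT transformation can run directly against the Parquet/Iceberg staging area. A hypothetical dbt-style model, executed by Spark or Trino rather than by the warehouse:

    -- models/daily_sales.sql (hypothetical dbt model)
    -- Aggregates straight from the Iceberg staging table; no warehouse compute.
    SELECT
        order_date,
        customer,
        SUM(amount) AS daily_total
    FROM my_catalog.staging.sales
    GROUP BY order_date, customer;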

Hope this helps. I understand that there are always arguments and counterarguments. But the key here is to be aware of the advances in data lake and lakehouse technologies, stay abreast of what's coming, and factor that into our 2-3 year strategic roadmaps as we look to grow our businesses.

Related Articles

Why Was Compute.AI Founded

Compute.AI's Vertically Integrated vs Distributed Shared Memory Spark

Do We Need Another Version of Spark

An Approach to Database Fine-Grained Access Controls
