The Fourth Community over Code Performance Engineering Track (Bratislava, Slovakia, 5 June 2024)
The 4th Community over Code Performance Engineering track was held recently in Bratislava. Thanks to everyone who made it such a success, particularly the speakers Márton Balassi, Péter Váry, David Kjerrumgaard, Gabor Kaszab, and Zoltán Borók-Nagy, and the track co-chair Stefan Vodita, who was also doubling up as a volunteer.
We had talks on various performance aspects of Apache Flink, Apache Iceberg, Oxia, Apache Impala, and Apache Kafka. Thanks also to the roughly 180 attendees who made it worthwhile (and asked lots of interesting questions). See my original post.
Prior to the talks, I didn't know much about Apache Iceberg (though I had come across it recently at the Kafka Summit in India), so I increased my knowledge (particularly about specific performance aspects) by at least 10x.
Oxia looks like a viable alternative to Apache ZooKeeper, which has limited write scalability, although Apache Kafka has solved this by using a new internal metadata store and distributed systems controller (KRaft).
I gave my "cosmological Kafka" talk ("Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)") for the first time (using high-level top 10 Kafka cluster metrics and distribution data across all clusters) and concluded that Apache Kafka is highly horizontally scalable, comes in all sizes imaginable (and likely unbounded), and that clusters can be highly optimised for very specific customer workloads. I also concluded that a Kafka performance model would be useful.
Details of the talks and links:
1
Marton Balassi, Peter Vary, "Low Latency Ingestion to Large Tables via Apache Flink and Apache Iceberg":
One of the primary challenges of data ingestion is the tradeoff between the latency of data availability for the downstream systems and the extent to which data is optimised for efficient reading. When ingesting continuous incoming data streams with low latency, Apache Flink is a data processing engine that shines. Apache Iceberg is one of the most popular table formats for large tables. To get the best of both worlds, and continuously ingest data and see near real-time changes to tables queried by various engines, tight integration is needed between these two Apache projects.
Basic integration has been available in open source for a long time, but when processing high-volume data, performance becomes crucial. Near real-time reads from Iceberg tables need frequent commits, and each commit creates a new set of files. On the other hand, reading from Iceberg tables is more efficient when the number of files is smaller. There are several ongoing projects to balance these needs and keep the number of files small. Balanced writes help when the number of partitions is comparable to the parallelization level. Periodic compaction helps when write throughput is more important and additional resources can be used to rewrite the data in a more optimal format.
Development of these new features required changes in both the Apache Flink and the Apache Iceberg code bases. In our talk we discuss the planning process of coordinating two Apache communities, the implementation, and the synchronization between projects. We compare our approach with alternative solutions like Apache Hudi and Apache Paimon, highlight the pros and cons of the different solutions, and showcase the possibilities in a brief demo.
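To make the compaction idea in this abstract a bit more concrete, here is a minimal Python sketch of how small files might be grouped into rewrite batches. It is only an illustration under my own assumptions: the DataFile class, the plan_compaction function, and the target size are hypothetical, and this is not how Flink or Iceberg actually plan their rewrites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file written by one commit."""
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes):
    """Group files smaller than target_bytes into batches whose combined
    size does not exceed target_bytes (first-fit decreasing bin packing).

    Assumption: this is an illustrative heuristic only, not the planning
    logic of any real table format or engine.
    """
    small = sorted(
        (f for f in files if f.size_bytes < target_bytes),
        key=lambda f: f.size_bytes,
        reverse=True,
    )
    groups = []   # each entry is a list of files to rewrite together
    totals = []   # running byte total of the matching group
    for f in small:
        for i, total in enumerate(totals):
            if total + f.size_bytes <= target_bytes:
                groups[i].append(f)
                totals[i] += f.size_bytes
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([f])
            totals.append(f.size_bytes)
    # Rewriting a single file on its own would not reduce the file count.
    return [g for g in groups if len(g) > 1]
```

Each returned group could then be rewritten as one larger file, which trades extra write work for fewer files at read time, the same tradeoff the abstract describes.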
2
David Kjerrumgaard, "Oxia - A Horizontally Scalable Alternative to Apache Zookeeper":
For over a decade, Apache Zookeeper has played a crucial role in maintaining configuration information and providing synchronization within distributed systems. Its unique ability to provide these features made it the de facto standard for distributed systems within the Apache community.
Despite its prolific adoption, there is an emerging trend toward eliminating the dependency on Zookeeper altogether and replacing it with an alternative technology. The most notable example is the KRaft subproject within the Apache Kafka community.
While the KRaft project achieved its goal of making Kafka more self-contained by eliminating the need for an external Zookeeper ensemble, the benefits of the KRaft project are limited to the Kafka community.
This talk introduces Oxia, a subproject within the Apache Pulsar community aimed at providing a horizontally scalable alternative to the traditional Zookeeper-based consensus architecture. The goal of Oxia is two-fold:
Develop a consensus and coordination system that doesn’t suffer from Zookeeper’s horizontal scalability limitations.
Create a compelling Zookeeper replacement that can be used across the entire Apache ecosystem, not just Apache Pulsar.
This talk will discuss Apache Zookeeper’s inherent scalability issues and demonstrate how Oxia’s architecture is designed to eliminate them entirely. We will also highlight how Oxia’s Java client library makes it easy for projects across the Apache ecosystem to utilize Oxia as a Zookeeper replacement.
3
Gabor Kaszab, Zoltan Borok-Nagy, "Let’s see how fast Impala runs on Iceberg":
Apache Impala is a distributed, massively parallel query engine designed for high-performance querying of large-scale data. A long list of features supporting Apache Iceberg tables has been added recently, such as reading, writing, time travel, and so on. However, in a big data environment, it is also a must to be performant. Since Impala has been designed to be fast, it has its own way of reading Iceberg tables. Other engines might simply use the Iceberg library to perform reads, while Impala has its own C++ implementation optimized for speed.
Nowadays, even big data storage techniques have to offer the possibility not just to store data but also to alter and delete data on a row level. Apache Iceberg solves this by using delete files that live alongside the data files. It is the responsibility of the query engines to then apply the delete files on the data files when querying the data. To efficiently read the data of such tables we implemented new Iceberg-specific operators in Impala.
In this talk we will go into the implementation details and reveal the secret behind Impala’s great performance, both in general and when reading Iceberg tables with position delete files. We will also show some measurements where we compare Impala’s performance with other open-source query engines.
By the end of this talk, you should have a high-level understanding of Impala’s and Iceberg’s architecture, the performance tricks we implemented in Impala specifically for Iceberg, and you will see how Impala competes with other engines.
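As a rough illustration of what "applying delete files to data files" means in this abstract, here is a minimal Python sketch of merge-on-read with position deletes, where each delete entry names a data file and a row position within it. The function name and the in-memory data layout are my own assumptions for illustration; Impala's real operators are written in C++ and work very differently under the hood.

```python
from collections import defaultdict


def apply_position_deletes(data_files, position_deletes):
    """Yield the rows that survive after position deletes are applied.

    data_files: dict mapping a file path to its list of rows
        (hypothetical in-memory stand-in for real data files).
    position_deletes: iterable of (file_path, row_position) pairs,
        mirroring the two columns a position delete file carries.
    """
    # Build a per-file set of deleted positions for constant-time lookups.
    deleted = defaultdict(set)
    for path, pos in position_deletes:
        deleted[path].add(pos)

    for path, rows in data_files.items():
        skip = deleted.get(path, set())
        for pos, row in enumerate(rows):
            if pos not in skip:
                yield row
```

For example, with two files holding rows ["r0", "r1", "r2"] and ["s0", "s1"], deleting position 1 of the first file and position 0 of the second leaves "r0", "r2", and "s1". The data files themselves are never modified, which is why the query engine has to do this work at read time.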
4
Paul Brebner, "Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)":
Instaclustr (now part of NetApp) manages 100s of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years, I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of Performance Engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs. vertical scalability, and predicting Kafka performance using metrics, modelling and regression techniques. These insights are relevant to Kafka developers and operators.
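For readers curious how a Zipf's Law check on cluster sizes might look, here is a minimal Python sketch that estimates the Zipf exponent by fitting a straight line to log(rank) versus log(size). The function name is hypothetical, and this is not the analysis from the talk or real Instaclustr data, just one simple way to do a rank-size regression.

```python
import math


def zipf_exponent(sizes):
    """Estimate s in size ~ C / rank**s using ordinary least squares
    on the log-log rank-size plot.

    sizes: positive numbers (for example, broker counts per cluster);
        at least two values are required.
    """
    ranked = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    # Zipf's Law predicts a negative slope; the exponent is its magnitude.
    return -slope
```

With perfectly Zipfian synthetic sizes such as 1200/rank for ranks 1 to 20, the estimate comes out at 1.0 (up to floating-point error). Real cluster data would be noisier, so the fitted exponent is only a starting point for judging how Zipf-like the distribution is.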
By the end of the event, the list of technologies covered in the Performance Engineering track now includes:
Of course, we didn't have a monopoly on performance-themed talks and there were a few others I noticed:
The 5th and 6th iterations of the track are on later in the year at Community over Code Asia (July) and Community over Code NA (Denver, October), so we look forward to seeing you there, with Roger Abelenda as co-chair in Denver. Here's the Denver CFP (details of accepted talks coming soon).
After the conference, I discovered a train museum in Bratislava, but didn't see the Green Anton featured in the CFP here. I did, however, find this cute Czech Railways 310 (0-6-0) nicknamed "Kafemlejnek" - Coffee Grinder (I think these details are correct; the actual museum building was closed for a movie shoot but I was able to explore the train yards).
Ok, so a shunting engine may not be very performance-related, so how about a dynamite car?!