The 4th Community over Code Performance Engineering track took place recently in Bratislava. Thanks to everyone who made it such a success, particularly the speakers Márton Balassi, Péter Váry, David Kjerrumgaard, Gabor Kaszab, and Zoltán Borók-Nagy, and the track co-chair Stefan Vodita, who also doubled up as a volunteer.

We had talks on various performance aspects of Apache Flink, Apache Iceberg, Oxia, Apache Impala, and Apache Kafka. Thanks also to the roughly 180 attendees who made it worthwhile (and asked lots of interesting questions). See my original post.

Prior to the talks, I didn't know much about Apache Iceberg (although I had come across it recently at the Kafka Summit in India), so I increased my knowledge (particularly of specific performance aspects) at least tenfold.

Oxia looks like a viable alternative to Apache ZooKeeper, which has limited write scalability, although Apache Kafka has solved this problem by using a new internal metadata store and distributed systems controller (KRaft).

I gave my "cosmological Kafka" talk ("Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)") for the first time, using high-level top 10 Kafka cluster metrics and distribution data across all clusters. I concluded that Apache Kafka is highly horizontally scalable, that clusters come in all sizes imaginable (and are likely unbounded), and that clusters can be highly optimised for very specific customer workloads. I also concluded that a Kafka performance model would be useful.

Details of the talks and links:

1

https://eu.communityovercode.org/sessions/2024/efficient-low-latency-ingestion-to-large-tables-via-apache-flink-and-apache-iceberg/

Marton Balassi, Peter Vary, "Efficient Low Latency Ingestion to Large Tables via Apache Flink and Apache Iceberg":

One of the primary challenges of data ingestion is the tradeoff between the latency of data availability for the downstream systems and the extent to which data is optimised for efficient reading. When ingesting continuous incoming data streams with low latency, Apache Flink is a data processing engine that shines. Apache Iceberg is one of the most popular table formats for large tables. To get the best of both worlds, and continuously ingest data and see near real-time changes to tables queried by various engines, tight integration is needed between these two Apache projects.

Basic integration has been available in open source for a long time, but when processing high-volume data, performance becomes crucial. Near real-time reads from Iceberg tables need frequent commits, and each commit creates a new set of files. On the other hand, reading from Iceberg tables is more efficient when the number of files is smaller. There are several ongoing projects to balance these needs and keep the number of files small. Balanced writes help when the number of partitions is comparable to the parallelization level. Performing periodic compaction helps when write throughput is more important and additional resources can be used to rewrite the data in a more optimal format.

Development of these new features required changes in both the Apache Flink and the Apache Iceberg code bases. In our talk we discuss the planning process for coordinating two Apache communities, the implementation, and the synchronization between projects. We compare our approach with alternative solutions like Apache Hudi and Apache Paimon, highlight the pros and cons of the different solutions, and showcase the possibilities in a brief demo.
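To make the file-count trade-off in this abstract concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the talk or from the Flink or Iceberg code: the function names, the assumption that every writer emits one data file per partition it touches on each commit, and the example numbers are all hypothetical.

```python
import math


def files_written(commit_interval_s: float, duration_s: float,
                  writers: int, partitions_per_writer: int) -> int:
    """Estimate data files created over a period.

    Assumption (illustrative only): on every commit, each writer emits
    one data file for each partition it received rows for.
    """
    commits = int(duration_s // commit_interval_s)
    return commits * writers * partitions_per_writer


def files_after_compaction(total_bytes: int, target_file_bytes: int) -> int:
    """Estimate file count if the same data is rewritten at a target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


if __name__ == "__main__":
    hour = 3600
    # Unbalanced: 8 writers, each receiving rows for all 16 partitions.
    print(files_written(60, hour, writers=8, partitions_per_writer=16))
    # Balanced: rows are shuffled so each writer owns 2 of the 16 partitions.
    print(files_written(60, hour, writers=8, partitions_per_writer=2))
    # Compaction: 10 GiB rewritten into 512 MiB files.
    print(files_after_compaction(10 * 1024**3, 512 * 1024**2))
```

Under these assumptions, halving the commit interval doubles the file count, balancing the writes cuts it by the ratio of partitions touched per writer, and compaction reduces it to roughly the data volume divided by the target file size, at the cost of rewriting the data.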

2

https://eu.communityovercode.org/sessions/2024/oxia-a-horizontally-scalable-alternative-to-apache-zookeeper/

David Kjerrumgaard, "Oxia - A Horizontally Scalable Alternative to Apache Zookeeper":

For over a decade, Apache Zookeeper has played a crucial role in maintaining configuration information and providing synchronization within distributed systems. Its unique ability to provide these features made it the de facto standard for distributed systems within the Apache community.

Despite its prolific adoption, there is an emerging trend toward eliminating the dependency on Zookeeper altogether and replacing it with an alternative technology. The most notable example is the KRaft subproject within the Apache Kafka community.

While the KRaft project achieved its goal of making Kafka more self-contained by eliminating the need for an external Zookeeper ensemble, the benefits of the KRaft project are limited to the Kafka community.

This talk introduces Oxia, a subproject within the Apache Pulsar community aimed at providing a horizontally scalable alternative to the traditional Zookeeper-based consensus architecture. The goal of Oxia is two-fold:

Develop a consensus and coordination system that doesn’t suffer from Zookeeper’s horizontal scalability limitations.

Create a compelling Zookeeper replacement that can be used across the entire Apache ecosystem and not just Apache Pulsar.

This talk will discuss Apache Zookeeper’s inherent scalability issues and demonstrate how Oxia’s architecture is designed to eliminate them entirely. We will also highlight how Oxia’s Java client library makes it easy for projects across the Apache ecosystem to utilize Oxia as a Zookeeper replacement.

3

https://eu.communityovercode.org/sessions/2024/lets-see-how-fast-impala-runs-on-iceberg/

Gabor Kaszab, Zoltan Borok-Nagy, "Let’s see how fast Impala runs on Iceberg":

Apache Impala is a distributed, massively parallel query engine designed for high-performance querying of large-scale data. A long list of new features has been added recently to support Apache Iceberg tables, such as reading, writing, time travel, and so on. However, in a big data environment, it is also a must to be performant. Since Impala has been designed to be fast, it has its own way of reading Iceberg tables. Other engines might simply use the Iceberg library to perform reads, while Impala has its own C++ implementation optimized for speed.

Nowadays, even big data storage techniques have to offer the possibility not just to store data but also to alter and delete data on a row level. Apache Iceberg solves this by using delete files that live alongside the data files. It is the responsibility of the query engines to then apply the delete files on the data files when querying the data. To efficiently read the data of such tables we implemented new Iceberg-specific operators in Impala.

In this talk we will go into the implementation details and reveal the secret behind Impala’s great performance, both in general and when reading Iceberg tables with position delete files. We will also show some measurements where we compare Impala’s performance with other open-source query engines.

By the end of this talk, you should have a high-level understanding of Impala’s and Iceberg’s architecture, the performance tricks we implemented in Impala specifically for Iceberg, and you will see how Impala competes with other engines.
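The idea of applying position delete files at read time, as described in this abstract, can be illustrated with a small Python sketch. It only shows the semantics (a position delete identifies a row by data file path and row position); it is not Impala's C++ implementation or the Iceberg library API, and the function names, file names and in-memory "files" are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def index_position_deletes(
        delete_rows: Iterable[Tuple[str, int]]) -> Dict[str, Set[int]]:
    """Group (data_file_path, row_position) delete records by data file."""
    deleted: Dict[str, Set[int]] = defaultdict(set)
    for file_path, pos in delete_rows:
        deleted[file_path].add(pos)
    return dict(deleted)


def scan_data_file(file_path: str, rows: List[str],
                   deleted: Dict[str, Set[int]]) -> Iterator[str]:
    """Yield the rows of one data file, skipping positions marked as deleted."""
    skip = deleted.get(file_path, set())
    for pos, row in enumerate(rows):
        if pos not in skip:
            yield row


if __name__ == "__main__":
    # Hypothetical data files, modelled as lists of rows.
    data_files = {
        "a.parquet": ["r0", "r1", "r2"],
        "b.parquet": ["s0", "s1"],
    }
    # Hypothetical contents of a position delete file.
    deletes = index_position_deletes([("a.parquet", 1), ("b.parquet", 0)])
    for path, rows in data_files.items():
        print(path, list(scan_data_file(path, rows, deletes)))
```

The data files are never rewritten here: the deletes are merged in while scanning, which is why the read path of the query engine matters so much for tables with many delete files.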

4

https://eu.communityovercode.org/sessions/2024/why-apache-kafka-clusters-are-like-galaxies-and-other-cosmic-kafka-quandaries-explored/

Paul Brebner, "Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)":

Instaclustr (now part of NetApp) manages 100s of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years, I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of Performance Engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs. vertical scalability, and predicting Kafka performance using metrics, modelling and regression techniques. These insights are relevant to Kafka developers and operators.
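As a rough illustration of what checking a size distribution against Zipf's Law can look like, here is a minimal Python sketch. It is not the analysis from the talk: it assumes a simple least-squares fit on log(rank) versus log(size), the function name is hypothetical, and the input is synthetic data generated to follow the law exactly rather than real cluster data.

```python
import math
from typing import List, Tuple


def fit_zipf_exponent(sizes: List[float]) -> Tuple[float, float]:
    """Fit size ~ c / rank**s by least squares on log-log values.

    Returns (s, c). Sizes must be positive; rank 1 is the largest item.
    """
    if len(sizes) < 2 or min(sizes) <= 0:
        raise ValueError("need at least two positive sizes")
    ranked = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic "cluster sizes": the largest is 1000, the r-th largest is 1000 / r.
    synthetic = [1000 / rank for rank in range(1, 51)]
    exponent, scale = fit_zipf_exponent(synthetic)
    print(f"exponent={exponent:.3f} scale={scale:.1f}")
```

An exponent close to 1 on real data would suggest a classic Zipf distribution, with a few very large clusters and a long tail of small ones; a poor fit on the log-log values would suggest that another distribution describes the data better.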

By the end of the event, the list of technologies covered in the Performance Engineering track now includes:

Apache Kafka
Apache JMeter & Selenium
Kubernetes
Apache Arrow
Java Profiling
Apache Flink
Apache Spark/ML
Apache Hadoop
Apache Ozone
Apache Cassandra
Apache Camel
Apache Lucene
Apache Iceberg
Apache Impala
Oxia

Of course, we didn't have a monopoly on performance-themed talks and there were a few others I noticed:

https://eu.communityovercode.org/sessions/2024/navigating-challenges-and-enhancing-performance-of-llm-based-applications/

https://eu.communityovercode.org/sessions/2024/cassandra-vector/

https://eu.communityovercode.org/sessions/2024/apache-ratis-a-high-performance-raft-library/

The 5th and 6th iterations of the track are on later in the year at Community over Code Asia (July) and Community over Code NA (Denver, October), with Roger Abelenda as co-chair in Denver, so we look forward to seeing you there. Here's the Denver CFP (details of accepted talks coming soon).

After the conference, I discovered a train museum in Bratislava, but didn't see the Green Anton featured in the CFP here. I did, however, find this cute Czech Railways 310 (0-6-0) nicknamed "Kafemlejnek" - Coffee Grinder (I think these details are correct; the actual museum building was closed for a movie shoot, but I was able to explore the train yards).

Ok, so a shunting engine may not be very performance-related, so how about a dynamite car?!

The Fourth Community over Code Performance Engineering Track (Bratislava, Slovakia, 5 June 2024)

Paul Brebner

Open Source Technology Evangelist at Instaclustr by NetApp
