Open Source Performance Engineering: Blogs – Part 1
I recently needed to track down and summarise some of my Performance Engineering blogs (covering performance, scalability, benchmarking, analysis, latency, observability, modelling, etc.), so I started to collect them together. Here's the first attempt, mainly pre-2024 blogs, in no particular order.
But why a (fast) train picture as an introduction? It's related to the Performance Engineering track at Community over Code, which I'm involved in.
Open Source performance and scalability challenges, solutions and results – from my older blogs. Covering Apache Cassandra, Apache Kafka, Kafka Connect, Kafka Streams, Redis, Kubernetes, Observability, Debezium, PostgreSQL, ZooKeeper, Curator, MM2, etc.
Note that since these blogs, we now have OpenSearch and Valkey on the NetApp Instaclustr managed platform.
Apache Cassandra Elastic Auto-Scaling Using Instaclustr’s Dynamic Cluster Resizing, Instaclustr 3 Dec 2019.
Part of a 2-part series on the Instaclustr Provisioning API. This blog is an extensive evaluation (by benchmarking and modelling) of the performance, scalability, and resizing times taken to elastically auto-scale Cassandra clusters, using multiple approaches and settings (node vs. rack, and concurrency). Sections: dynamic resizing overview, dynamics of resizing (what's happening to the cluster), resize modelling (how a performance model can predict times and performance), auto-scaling Cassandra elasticity (when should we initiate a resize, and what is the impact?), resizing rules (by node or rack, and how we could automate resizing using metrics), and resizing results with different concurrency settings. If you search for “Cassandra elastic auto scaling” this is the top hit on Google – there are very few platforms for, or analyses of, this Cassandra use case.
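To give a feel for the resize modelling (the blog has the full model and real measurements), here's a minimal sketch. It assumes a hypothetical fixed per-node resize time and that nodes are resized in batches of a given concurrency; real resize times also depend on data volume, load, and whether resizing is by node or by rack.

```java
// Minimal resize-time sketch (assumptions: fixed perNodeMinutes, batch resizing).
public class ResizeModel {
    // Estimate total resize time when nodes are resized concurrency-at-a-time.
    static double totalResizeMinutes(int nodes, int concurrency, double perNodeMinutes) {
        int batches = (int) Math.ceil((double) nodes / concurrency);
        return batches * perNodeMinutes;
    }

    public static void main(String[] args) {
        // Hypothetical example: 9-node cluster, 20 minutes per node.
        System.out.println(totalResizeMinutes(9, 1, 20)); // one node at a time: 180.0
        System.out.println(totalResizeMinutes(9, 3, 20)); // one rack (3 nodes) at a time: 60.0
    }
}
```

The concurrency trade-off is the interesting part: resizing by rack is faster overall, but takes more capacity offline at once.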
Improving Apache Kafka Performance and Scalability With the Parallel Consumer, Parts 1 and 2, Instaclustr 4 May 2023
How can you scale Kafka consumers with higher concurrency, but fewer consumers and partitions? With the Parallel Consumer. In this blog we explore (with benchmarking and modelling) how the new Parallel Consumer can improve Kafka consumer throughput with fewer resources but more threads. I summarise other approaches I’ve taken to increase consumer throughput, and introduce the Parallel Consumer and its options. We use Little’s Law again to model the benefits of increased consumer concurrency with multiple threads.
In Part 1 of this series, we had a look at Kafka concurrency and throughput work, recapped some earlier approaches I used to improve Kafka performance, and introduced the Kafka Parallel Consumer and supported ordering options (Partition, Key, and Unordered). In this second part we continue our investigations with some example code, a trace of a “slow consumer” example, how to achieve 1 million TPS in theory, some experimental results, what else do we know about the Kafka Parallel Consumer, some alternatives, and finally, if you should use it in production. All illustrated with photos from a much older concurrent technology used to make multiple ribbons at once - a Jacquard loom!
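Little's Law (L = λ × W) is what makes the concurrency argument concrete: to sustain a target throughput λ with an average per-record processing time W, you need roughly L concurrent workers. A minimal sketch with hypothetical numbers (the actual measurements and modelling are in the blogs):

```java
// Little's Law sketch: required concurrency L = throughput (lambda) * time per record (W).
public class LittlesLaw {
    static double requiredConcurrency(double recordsPerSecond, double secondsPerRecord) {
        return recordsPerSecond * secondsPerRecord;
    }

    public static void main(String[] args) {
        // Hypothetical "slow consumer": 10 ms of processing per record.
        double w = 0.010;
        System.out.println(requiredConcurrency(1_000, w));     // 10 concurrent workers for 1,000 TPS
        System.out.println(requiredConcurrency(1_000_000, w)); // 10,000 concurrent workers for 1 million TPS
    }
}
```

This is why thread-level concurrency inside a consumer (rather than one consumer per partition) is attractive when per-record processing time dominates.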
Instaclustr Cluster Performance Insights: Cluster Size Distribution and Zipf’s Law: Why Open Source Computer Clusters Are Like Galaxies, Instaclustr 12 Dec 2023
Recently I decided to take a closer look at the cloud computing clusters that Instaclustr uses to provide managed open source Big Data services to 100s of customers, to see if I could discover any interesting resource and performance/scalability insights. I plan to examine a variety of cluster data that we have available, including the number and size of clusters and performance metrics. I will focus on the number and size of clusters and reveal some discoveries about the nature of scalability in Apache Cassandra, Apache Kafka, OpenSearch (and other) technologies. Using Zipf’s law, we make some interesting discoveries about the potential upper size of clusters, predict how many clusters and nodes there are in total just from knowing the largest cluster, understand why most clusters are smaller than average, and prove that Kafka and Cassandra are indeed highly horizontally scalable.
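To give a feel for the Zipf's law argument in this blog, here's a minimal sketch assuming cluster sizes follow a classic Zipf distribution with exponent 1 (an assumption for illustration; the blog uses the real cluster data), i.e. the k-th largest cluster has roughly largest/k nodes, so summing the series predicts total nodes from just the largest cluster size and the number of clusters.

```java
// Zipf's law sketch: size(k) ~ largest / k, so total nodes ~ largest * H(numClusters),
// where H is the harmonic number. All numbers here are hypothetical.
public class ZipfClusters {
    static double totalNodes(double largestClusterNodes, int numClusters) {
        double total = 0;
        for (int k = 1; k <= numClusters; k++) {
            total += largestClusterNodes / k; // predicted size of the k-th largest cluster
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical: largest cluster has 100 nodes, 1,000 clusters in total.
        System.out.printf("Predicted total nodes: %.0f%n", totalNodes(100, 1000));
        // Also hints at why most clusters are smaller than average:
        // the mean is dragged up by a few very large clusters.
    }
}
```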
The Kafka-specific version (C/C EU talk):
"Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandaries Explored)" C/C EU Bratislava, 5 June 2024
Instaclustr manages 100s of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years, I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of Performance Engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs. vertical scalability, and predicting Kafka performance using metrics, modelling and regression techniques. These insights are relevant to Kafka developers and operators.
This blog provides an overview of the two fundamental concepts in Apache Kafka: Topics and Partitions. While developing and scaling our Anomalia Machina application we discovered that distributed applications using Apache Kafka and Cassandra clusters require careful tuning to achieve close to linear scalability, and that critical variables include the number of Apache Kafka topics and partitions.
In this blog, we test that theory and answer questions like “What impact does increasing partitions have on throughput?” and “Is there an optimal number of partitions for a cluster to maximize write throughput?” And more! Note: this was my/Instaclustr’s most popular blog for several years; then, when the blog URLs were changed en masse a few years ago, it suddenly plummeted. It has also, to some extent, been superseded by the newer KRaft series.
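As a rough starting point (this is a commonly cited rule of thumb, not the blog's benchmark-driven answer), partitions are often sized from the target throughput and the measured per-partition producer and consumer throughput. A minimal sketch with hypothetical numbers:

```java
// Rule-of-thumb partition sizing sketch (hypothetical throughput numbers):
// partitions >= max(target / producerPerPartition, target / consumerPerPartition)
public class PartitionSizing {
    static int partitions(double targetMBps, double producerMBpsPerPartition, double consumerMBpsPerPartition) {
        double neededForProducers = targetMBps / producerMBpsPerPartition;
        double neededForConsumers = targetMBps / consumerMBpsPerPartition;
        return (int) Math.ceil(Math.max(neededForProducers, neededForConsumers));
    }

    public static void main(String[] args) {
        // Hypothetical: 100 MB/s target, 10 MB/s per partition producing, 5 MB/s consuming.
        System.out.println(partitions(100, 10, 5)); // 20 partitions
    }
}
```

The blog's benchmarking adds the other half of the story: too many partitions also costs throughput, so there is a sweet spot rather than "more is always better".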
Kafka in Production: Kafka performance and scalability
The final blogs in the Kongo series focus on deployment and running Kongo in production, and reveal some things to watch out for regarding Kafka performance and scalability.
The 1st blog introduces the problem – Kongo, a complex streaming IoT logistics demo application. The application involved a real-time simulation of 10s of warehouses with 1000s of goods moved around by 100s of trucks, while IoT devices (e.g. RFID tags and readers, and sensors) produced streams of data that had to be checked in real time for compliance with transportation rules (e.g. permitted goods-type co-location) and environmental rules (e.g. vibration, heat, etc.).
The first iteration of a loosely-coupled architecture. We add explicit event types, an event-bus (Google Guava) to add multiple topics and improve scalability, and explore the design implications.
We evolve the application into a Kafka application by adding serializers and deserializers to the event types, introduce real Kafka topics, explore the potential impact of one or many topics on performance and scalability, and consider event ordering.
[A few blogs are missing here as they covered Kafka Connect and Kafka Streams, see elsewhere]
We deploy the Kongo IoT application to a production Kafka cluster, using Instaclustr’s Managed Apache Kafka service on AWS. We explore Kafka cluster creation and how to deploy the Kongo code. Then we revisit the design choices made previously regarding how to best handle the high consumer fan out and the number of topics in the Kongo application, by running a series of benchmarks to compare the scalability of different options.
We explore how to scale the application on the Instaclustr Kafka cluster we created, and introduce some code changes for automatically scaling the producer load and optimizing the number of consumer threads.
The final goal is to scale the Kongo IoT application on production Instaclustr Kafka clusters. We compare various approaches including scale-out, scale-up, multiple clusters, response times, and server vs. client resources. There are two versions of the story. In the Blue Pill version, everything goes according to plan and scaling is easy. In the Red Pill version, we encounter and overcome some scalability challenges caused by a large number of consumers, e.g. the Kafka “Key Parking Problem” (hash collisions) and consumer group rebalancing storms.
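The "Key Parking Problem" stems from how keyed records are mapped to partitions: the default partitioner hashes the key (murmur2) modulo the number of partitions, so distinct keys can collide onto the same partition and get "parked" behind each other. A minimal sketch using the hash utilities from kafka-clients (topic size and key names are made up):

```java
import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

// Sketch of how keyed records map to partitions (mirrors the default partitioner's
// murmur2-hash-mod-partitions behaviour); distinct keys can land on the same partition.
public class KeyParking {
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 6; // hypothetical topic with 6 partitions
        for (String key : new String[]{"truck-1", "truck-2", "truck-3", "truck-4"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
        // With many keys and few partitions, collisions are unavoidable (pigeonhole principle),
        // so some partitions (and their consumers) end up with more than their fair share of keys.
    }
}
```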
A couple of blogs introduce Kafka Streams, use cases, and Kafka Streams performance
Kafka Streams is a framework for stream data processing. In this blog, we introduce Kafka Streams concepts (processors, streams, and tables), the topology of streams processing, DSL operations and order (with a very useful diagram and table showing permitted combinations), SERDES, and take a look at one of the DSL operations, Joins, in more detail.
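For a flavour of the DSL (topic names and broker address below are made up; the blog has the full examples), here's a minimal sketch of a topology that reads a KStream, groups and counts it into a KTable, and writes the result back out with explicit SERDES:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

// Minimal Kafka Streams DSL sketch: stream -> groupByKey -> count (a KTable) -> output topic.
public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events-in");        // hypothetical topic
        KTable<String, Long> countsPerKey = events.groupByKey().count();     // stream -> table
        countsPerKey.toStream().to("counts-out", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```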
A fun introduction to a simple Apache Kafka Streams example using the murder mystery game Cluedo as the problem domain. Dr. Black has been murdered in the Billiard Room with a Candlestick! Whodunnit?! We present a complete standalone streams example (using a KTable) to reveal who doesn’t have an alibi, and list a few things to watch out for in streams programming.
In this blog, we develop a more complex streams application to keep track of the weight of goods in trucks for our Kongo IoT application. During the journey, we encounter topology exceptions, and find a useful streams topology mapping tool to help fix the problems. We also employ some transactional “Magic Pixie Dust” to prevent truck weights from going negative and possibly floating away.
Massively Scalable Anomaly Detection with Apache Cassandra, Apache Kafka, Kubernetes, Prometheus, Grafana, OpenTracing, and Jaeger
[Note: The Kubernetes blogs received > 1000 views each on Medium, well above average for other blogs at the time]
“Anomalia Machina” is Latin for “Anomaly Machine”. This series was designed to build a demonstration application that could use both Cassandra and Kafka and get some “big” performance numbers. Because we wanted the application to test out the data layers in particular, we picked a simple “CUSUM” anomaly detection algorithm that processes an incoming data stream and compares each event with a fixed window of previous values for the same key. It’s designed to work with Billions of keys and Billions of anomaly checks a day. Potential applications are numerous including infrastructure and applications, financial services and fraud, IoT, web click-stream analytics, etc.
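To make the detector concrete, here's a minimal sketch of a CUSUM-style check in the spirit of the one described (window size, drift, and threshold are hypothetical; the series has the real implementation): each new value for a key is compared against the mean and standard deviation of a fixed window of previous values, and the cumulative deviations are tested against a threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal CUSUM-style anomaly check sketch: compare each new value against a fixed window
// of previous values for the same key, accumulating deviations beyond an allowance (drift).
public class CusumSketch {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double drift;      // allowance subtracted from each standardised deviation
    private final double threshold;  // alarm when the cumulative sum exceeds this
    private double sHigh = 0, sLow = 0;

    CusumSketch(int windowSize, double drift, double threshold) {
        this.windowSize = windowSize;
        this.drift = drift;
        this.threshold = threshold;
    }

    boolean check(double value) {
        if (window.size() < windowSize) { // not enough history yet
            window.addLast(value);
            return false;
        }
        double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double var = window.stream().mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
        double sd = Math.sqrt(var);
        double z = sd > 0 ? (value - mean) / sd : 0;
        sHigh = Math.max(0, sHigh + z - drift); // accumulate upward deviations
        sLow = Math.max(0, sLow - z - drift);   // accumulate downward deviations
        window.removeFirst();
        window.addLast(value);
        return sHigh > threshold || sLow > threshold;
    }

    public static void main(String[] args) {
        CusumSketch detector = new CusumSketch(50, 0.5, 5.0);
        for (int i = 0; i < 60; i++) detector.check(10 + Math.random()); // normal data
        System.out.println(detector.check(100)); // sudden spike: true
    }
}
```

The computational appeal is that the state per key is small (a bounded window plus two sums), which is what lets the check run billions of times a day.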
To ensure adequate visibility into the application end-to-end, we explored open source monitoring and tracing such as Prometheus, Grafana, OpenTracing, and Jaeger. We used Kubernetes for application scalability, and Prometheus monitoring on Kubernetes using the Prometheus Operator. After several attempts at configuring and tuning the system for scale, we finally achieved 19 Billion anomaly checks/day, with 574 CPU cores across the Cassandra, Kafka, and Kubernetes clusters.
What did we learn from this series? It’s possible to build complex massively scalable applications using multiple Big Data open source technologies that combine streaming data with historical data. Kafka also works well as a buffer (one of Kafka’s “superpowers”!), enabling a peak load of 10 times the steady-state load (190 Billion/day). Open-source monitoring and tracing were critical for debugging, tuning, and scaling. Kubernetes was effective for application scaling, and resizing the system is possible, so cost is incremental as more or less resources are needed. You can easily run a smaller system for less cost, or scale up to something even more massive. However, it’s best to tune the system incrementally as you scale it out, as there are a number of software tuning configurations that need to be optimized as it gets bigger (you need to minimize the number of Kafka consumers and Cassandra connections, while maximizing the concurrency of the anomaly detector algorithm and the Cassandra clients).
Kafka Connect Pipeline Blog Series
Building and Scaling Robust Zero-Code Streaming Data Pipelines with Open Source Technologies (Apache Kafka, Apache Kafka Connect, Apache Camel Kafka Connectors, Elasticsearch/Open Distro/OpenSearch, Kibana, PostgreSQL, Apache Superset, Prometheus, Grafana).
This multi-part series started life as a talk at ApacheCon 2020, with the latest version delivered at FOSSASIA 2022. The goal was to build a robust zero-code streaming data pipeline for NOAA tidal data using Apache Kafka, Apache Kafka Connect, Elasticsearch, and Kibana. This pipeline worked (briefly), but then encountered problems with connectors failing when attempting to process error messages, so we tried Apache Camel Kafka Connectors for more robust error handling. To scale the system we used Prometheus and Grafana to monitor relevant metrics across multiple systems. While scaling the system we encountered several performance “speed humps” and resolved them.
Then we decided to try another path for the pipeline: replacing Elasticsearch with PostgreSQL, and Kibana with Apache Superset. This involved evaluating and finally modifying Kafka PostgreSQL/JDBC sink connectors to correctly insert JSON data into PostgreSQL (as a JSONB data type). And finally, getting Superset to work with the JSONB data and trying out some different visualization types. I also did some performance benchmarking, with the results reported in parts 8 and 9.
What did we learn from this series? There are multiple ways of building a streaming JSON data pipeline. There are lots of Kafka connectors available, but they are of varying quality, robustness, functionality, and scalability, and you may end up “hacking” one (or even writing one from scratch) to get the desired results. You need to understand what errors can occur in the pipeline and how to cope with them, how Kafka Connect scalability works (tasks, and how they are balanced), and the specifics of the target sink system’s scalability. Kibana works easily to visualize Elasticsearch data, but Apache Superset may have more chart types and will work with most SQL databases (and potentially Elasticsearch).
What else did this series demonstrate? How to evaluate Kafka Connect connectors, connector error handling, how to monitor and scale Kafka Connect “systems” (i.e. source systems, Kafka Connect and Kafka clusters and connectors, and sink systems), and the REST source connector and the Elasticsearch and PostgreSQL sink connectors.
Redis performance
It’s an In-Memory Key-Value Store! It’s a Database! It’s Redis! Instaclustr, 9 Sep 2020
An introduction to Redis, why fast/low-latency caching is a useful performance goal, and how fast our managed Redis is! We did some benchmarking and discovered how to tune Redis pipelining, and the trade-offs between throughput and latency.
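For context, Redis pipelining batches many commands over one round trip, trading a little latency on individual commands for much higher throughput. A minimal Jedis sketch (host, port, and batch size are hypothetical):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

// Minimal Redis pipelining sketch with Jedis: send a batch of commands in one round trip.
public class RedisPipelineSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // hypothetical local Redis
            Pipeline pipeline = jedis.pipelined();
            for (int i = 0; i < 1000; i++) {               // hypothetical batch size
                pipeline.set("key:" + i, "value:" + i);
            }
            pipeline.sync(); // flush the batch and wait for all replies
            System.out.println(jedis.get("key:42"));
        }
    }
}
```

The trade-off explored in the blog is essentially the batch size: bigger pipelines mean higher throughput but longer waits for the first result in each batch.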
Redis Java Clients and Client-Side Caching, 24 Sep 2020
In my previous Redis blog, we discovered what Redis really is! It’s an open-source in-memory data structures server. And we discovered how fast it is! For a 6-node Instaclustr Managed Redis cluster, latencies are under 20ms and throughput is in the millions of “gets” and “sets” a second. However, in order to actually use Redis as a low-latency server in a real application, you need to use one of the many programming language-specific Redis clients. And did you know that Redis also works as a cache (server or client side), so it is possible to get even better latency results? In this blog, we try out two different Redis Java clients (Jedis and Redisson) and explore Redisson client-side caching in more detail. We do some benchmarking and also build a performance model of Redis client-side caching to better understand the trade-offs between cache hit rates and latencies, and the impact of increasing load.
Jedis and Redisson are both good open source Redis Java clients with support for Redis clusters. Redisson also offers extra functionality for some caching and cluster operations in a paid version. Jedis is faster for both “get” and “set” operations without the use of client-side caching. However, using Redisson client-side caching, faster “get” latencies can be achieved, potentially even achieving the advertised Redis goal of “sub-ms” latencies!
But how fast depends on a number of factors including client-side cache memory, the ratio of “gets” to “sets”, and the number of clients. Using the client-side cache can also reduce the load on the Redis cluster and enable higher client-side throughputs. A good rule of thumb is that latency improves most with cache hit rates > 50% and fewer “sets”, clients, and load, but more substantial throughput savings are achieved with more clients—it just depends what your particular performance/throughput trade-off goals are in practice. As with all benchmarking and modeling activities, it pays to repeat a similar exercise with your own clusters, applications, clients, and workloads, as the results may differ.
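One simple way to see the hit-rate effect (a sketch, not necessarily the exact model from the blog) is as a weighted average of local-hit and server round-trip latencies. All latencies and hit rates below are hypothetical:

```java
// Client-side caching latency model sketch: expected "get" latency is a weighted average
// of the (fast) local cache hit latency and the (slower) round trip to the Redis cluster.
public class CacheModelSketch {
    static double expectedGetLatencyMs(double hitRate, double localHitMs, double serverMs) {
        return hitRate * localHitMs + (1 - hitRate) * serverMs;
    }

    public static void main(String[] args) {
        double localHitMs = 0.05, serverMs = 2.0; // hypothetical latencies
        for (double hitRate : new double[]{0.0, 0.5, 0.9, 0.99}) {
            System.out.printf("hit rate %.2f -> %.3f ms%n",
                    hitRate, expectedGetLatencyMs(hitRate, localHitMs, serverMs));
        }
        // Sub-ms average latency needs a high hit rate; the cache also removes that
        // fraction of "gets" from the Redis cluster, freeing capacity for more clients.
    }
}
```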
Apache Kafka vs. Redis Pub/Sub & Streams
A 3-part blog series with a detailed analysis of Apache Kafka vs Redis messaging patterns (pub/sub and streams) – how do the functionality and performance compare?
Apache ZooKeeper Meets the Dining Philosophers, 9 May 2021
A ZooKeeper walks into a pub, and ends up helping some Philosophers solve their fork resource contention problem. This was a fun blog to test out our new Apache ZooKeeper managed service. I wrote a simple Dining Philosophers solution, using Apache ZooKeeper and Apache Curator, tried it out on a single Server and an Ensemble, and reported the performance results and implications for Apache Kafka ZooKeeper performance (hint: it doesn’t have great write scalability). Apache Kafka is leaving the Zoo(keeper) soon (KRaft is now the default meta-data system), but there are still lots of distributed applications in need of some coordination help.
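The essence of the solution is treating each fork as a distributed lock (a znode) via a ZooKeeper recipe, Curator's InterProcessMutex. A minimal sketch (the connection string and znode paths are hypothetical; the blog has the full philosopher logic):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.util.concurrent.TimeUnit;

// Minimal sketch: each fork is a distributed lock; a philosopher must hold both
// neighbouring forks before eating, then release them (timeouts avoid waiting forever).
public class DiningPhilosopherSketch {
    public static void main(String[] args) throws Exception {
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3))) { // hypothetical ensemble
            client.start();
            InterProcessMutex leftFork = new InterProcessMutex(client, "/forks/1");
            InterProcessMutex rightFork = new InterProcessMutex(client, "/forks/2");
            if (leftFork.acquire(5, TimeUnit.SECONDS)) {
                try {
                    if (rightFork.acquire(5, TimeUnit.SECONDS)) {
                        try {
                            System.out.println("Eating!"); // both forks held
                        } finally {
                            rightFork.release();
                        }
                    }
                } finally {
                    leftFork.release();
                }
            }
        }
    }
}
```

Every acquire and release is a ZooKeeper write, which is why the blog's results also say something about Kafka's (pre-KRaft) ZooKeeper write scalability.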
This blog was an overview of Change Data Capture (CDC) using Kafka Connect, Debezium, and PostgreSQL.
Performance summary:
One limitation of the Debezium PostgreSQL connector is that it can only run as a single task. I ran some load tests and discovered that a single task can process a maximum of 7,000 change data events/second. This also corresponds to transactions/second as long as there’s only one change event per transaction. If there are multiple events per transaction, then the transaction throughput will be less. In a previous blog (pipeline series part 9) we achieved 41,000 inserts/s into PostgreSQL, and 7,000 is only 17% of that. So, this part of the CDC pipeline is acting more like an elephant than a cheetah in practice. However, typical PostgreSQL workloads have a mix of writes and reads, so the write rate may be substantially less than this, making the Debezium PostgreSQL connector a more viable solution.
I also noticed another slightly odd behaviour that you may need to be aware of. If two (or more) tables are being watched for change events, and the load is unbalanced across the tables (e.g. if a batch of changes occurs in one table slightly before the other), then the connector processes all the changes from the first table before starting on changes for the second table. In the example I discovered, this resulted in a 10-minute delay. I’m not really sure what’s going on here, but it looks like the connector has to process all the changes for one table before moving on to other tables. For more normal, well-balanced workloads this may not be an issue, but for spikey/batch loads that heavily load a single table, it may cause problems for the timely processing of change events from other tables.
One solution may be to run multiple connectors. This appears to be possible (e.g. see this useful blog), and may also help with overcoming the 7,000 events/s maximum processing limit. However, it would probably only work if there is no overlap of tables between the connectors, and you need to have multiple replication slots for this to work (there is a connector configuration option for “slot.name”).
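For illustration, here's a minimal sketch of the kind of connector configuration involved, written as Java Properties rather than the usual JSON submitted to the Connect REST API (all hostnames, credentials, slot names, and table names are hypothetical); the key point is that each connector gets its own replication slot via slot.name and a disjoint table list.

```java
import java.util.Properties;

// Sketch of two Debezium PostgreSQL source connector configs (values are hypothetical):
// each connector has its own replication slot (slot.name) and a disjoint set of tables.
public class DebeziumConfigSketch {
    static Properties connectorConfig(String slotName, String tableList) {
        Properties props = new Properties();
        props.put("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.put("database.hostname", "postgres.example.com"); // hypothetical host
        props.put("database.port", "5432");
        props.put("database.user", "cdc_user");
        props.put("database.password", "changeme");
        props.put("database.dbname", "shipping");
        props.put("database.server.name", "shipping-" + slotName); // topic namespace
        props.put("slot.name", slotName);            // one replication slot per connector
        props.put("table.include.list", tableList);  // keep table sets disjoint between connectors
        return props;
    }

    public static void main(String[] args) {
        System.out.println(connectorConfig("slot_a", "public.orders"));
        System.out.println(connectorConfig("slot_b", "public.shipments"));
    }
}
```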
“Around the World in (approximately) 8 Datacenters”: Building a low-latency globally distributed FinTech (Stock Broker) application
This blog series explores the design and performance of a low-latency globally distributed FinTech (stock broker) application, using Cassandra multi-DC for redundancy/replication/failover across multiple AWS regions to ensure sub-100ms latency for automated stock trades. The first two parts examine latency between and within AWS regions and how Cassandra multi-DC works; in the next two parts we design and build a demo distributed stock broker application and get some speedy results (3ms avg, 60ms max), including a look at costs (due to write amplification).
Kafka MirrorMaker2 (MM2)
In this two-part blog series we turn our gaze to MirrorMaker 2 (MM2), the newest version of Apache Kafka’s cross-cluster mirroring, or replication, technology. MirrorMaker 2 is built on top of the Kafka Connect framework for increased reliability and scalability, and is suitable for more demanding geo-replication use cases including migration, backup, disaster recovery, and fail-over. In the first instalment of this series, we focus on MirrorMaker 2 theory (Kafka replication, architecture, components, and terminology) and invent some MirrorMaker 2 rules. Part Two is more practical: we try out Instaclustr’s managed MirrorMaker 2 service and test the rules with some experiments. Watch out for two tricky performance puzzles: infinite loops and infinite topic creation! But don’t panic, we explain how to prevent them.
Terra-Locus Anomalia Machina—Massively Scalable Geospatial Anomaly Detection With Apache Kafka and Cassandra (SASI, Cassandra Lucene Index Plugin, Geohashes, 3D Geohashes).
Geospatial Anomaly Detection: Apache Cassandra, meet Apache Lucene—The Cassandra Lucene Index Plugin
This 4-part blog mini-series extended the Anomalia Machina series to work with geospatial data. We added location data to the Kafka events, so the problem changes to finding a sufficient number of nearby events to satisfy the anomaly detection algorithm. This is potentially a very demanding problem for database queries, and we investigated multiple solutions including bounding boxes, Cassandra clustering columns, secondary indexes, SASIs, geohashes (including my implementation of 3D geohashes), multiple denormalized tables, and the Cassandra Lucene Index Plugin. There were large differences in performance, and we revealed the solution that performed the best.
Kafka Connect scalability
In this blog we explore how to monitor and scale Kafka connector tasks.
In this part, we are now ready to increase the load and scale the number of Kafka connector tasks and demonstrate the scalability of the stream data pipeline end-to-end. To help with this we use Little’s Law, and encounter and solve two scaling speed humps.
So what can we conclude about Kafka Connect pipeline scalability?
Throughput certainly increases with increasing tasks, but you also have to ensure that the number of distinct key values is high enough and that the number of partitions is sufficient (but not too many), and keep an eye on the cluster hardware resources, particularly the target sink system, as it’s easy to have a high-throughput pipeline on the Kafka side but overload slower target sink systems.
This is one of the catches of using Kafka Connect, as there are always two (three really) systems involved—the source, Kafka, and sink sides—so monitoring and expertise are required to ensure a smooth-flowing pipeline across multiple systems.
Of course, an important use case of Kafka is to act as a buffer, as in general it’s not always feasible to ensure comparable capacities for all systems involved in a pipeline or for them to be perfectly elastic—increasing resources takes time (and money). For some applications, being able to buffer events in Kafka while sink systems eventually catch up may be the prime benefit of having Kafka Connect in the pipeline (my Anomalia Machina blog series and ApacheCon 2019 talk cover Kafka as a buffer).
The number of Kafka connector tasks may also be higher than expected if you have a mismatch between systems (i.e. one is a lot slower than the other), as this introduces latency in the connector which requires increased concurrency to increase throughput to the target rate. There may be some “tricks”—for example, Elasticsearch supports a Bulk indexing API (but this would require support from the connector to use it, which is what the Aggregator is for).
Finally, if you care about end-to-end latency, don’t let the input rate exceed the pipeline processing capacity, as the lag will rapidly increase pushing out the total delivery latency. Even with sufficient attention to monitoring, it’s tricky to prevent this, as I noticed that restarting the Kafka sink connectors with more tasks actually results in a drop in throughput for 10s of seconds (due to consumer group rebalancing), momentarily increasing the lag.
Moreover, event-based systems typically have open (vs. closed) workloads and don’t receive input events at a strictly constant/average rate, so the arrival distribution is often Poisson (or worse). This means you can get a lot more than the average number of arrivals in any time period, so you need more processing capacity (headroom) than the average rate to prevent backlogs.
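To see why average-rate capacity isn't enough, here's a minimal sketch (with hypothetical rates) computing the probability that Poisson arrivals in one second exceed the pipeline's per-second capacity; without headroom, overload in any given second is surprisingly likely.

```java
// Poisson headroom sketch: probability that arrivals in one second exceed capacity,
// for an average arrival rate lambda (all numbers hypothetical).
public class PoissonHeadroom {
    // P(X > capacity) for X ~ Poisson(lambda), computed from the CDF.
    static double probExceeds(double lambda, int capacity) {
        double term = Math.exp(-lambda); // P(X = 0)
        double cdf = term;
        for (int k = 1; k <= capacity; k++) {
            term *= lambda / k;          // P(X = k) from P(X = k - 1)
            cdf += term;
        }
        return 1.0 - cdf;
    }

    public static void main(String[] args) {
        double lambda = 100; // average 100 events/s
        System.out.printf("capacity 100: P(overload in a second) = %.3f%n", probExceeds(lambda, 100));
        System.out.printf("capacity 120: P(overload in a second) = %.3f%n", probExceeds(lambda, 120));
        System.out.printf("capacity 150: P(overload in a second) = %.6f%n", probExceeds(lambda, 150));
    }
}
```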
We therefore recommend that you benchmark and monitor your Kafka Connect applications to determine adequate cluster sizes (number of nodes, node types, etc.) and configuration (number of Kafka connector tasks, topic partitions, Elasticsearch shards and replicas, etc.) for your particular use case, workload arrival rates and distributions, and SLAs.
Thereby achieving “Humphrey” scaling! (“What do you call a Camel with no humps?”…)
2001 "Space Odyssey" Cassandra and Spark ML blog series (August - November 2017) - ML over streaming data to predict future performance problems
This was my first blog series, a 10-episode 2001 “Space Odyssey” themed introduction to Cassandra, Spark, and MLLib (Parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10). I used the 2001 theme as I was learning Cassandra and Spark from scratch during the series, which was like trying to understand the impenetrable Monoliths in the movie, with eventual enlightenment. The goal in the series was to apply Machine Learning (using Spark MLLib) to enormous amounts of machine data (100s of hours of monitoring and 1000s of metrics) from our managed Cassandra clusters, to predict future performance issues (e.g. long JVM garbage collections, or if the SLA is likely to be violated in the next hour). Some of the sub-topics included how to model time-series data in Cassandra (buckets are your friend), regression analysis, getting and using features, training and testing different types of decision tree models, building pipelines, data cleaning, exploring data with Apache Zeppelin, and using Spark MLLib, Dataframes and streaming data. This was therefore a somewhat self-referential example of open source performance engineering - using Apache Cassandra and Apache Spark to solve performance problems for Apache Cassandra clusters!
And finally, here are some talks based on the blogs (the abstracts may do a better job at summarising them):
Improving the Observability of Cassandra, Kafka and Kubernetes applications with Prometheus and OpenTracing, ApacheCon North America · Sep 5, 2019
Abstract: As distributed applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. In this presentation we’ll explore two complementary Open Source technologies: Prometheus for monitoring application metrics; and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of a massively scalable Anomaly Detection system - an application which is built around Apache Cassandra and Apache Kafka for the data layers, and dynamically deployed and scaled on Kubernetes, a container orchestration technology. We will give an overview of Prometheus and OpenTracing/Jaeger, explain how the application is instrumented, and describe how Prometheus and OpenTracing are deployed and configured in a production environment running Kubernetes, to dynamically monitor the application at scale. We conclude by exploring the benefits of monitoring and tracing technologies for understanding, debugging and tuning complex dynamic distributed systems built on Kafka, Cassandra and Kubernetes, and introduce a new use case to enable Cassandra Elastic Autoscaling, by combining Prometheus alerts, Instaclustr’s Provisioning API for Dynamic Resizing, and the new Prometheus monitoring API.
Real-time Anomaly Detection on 19 Billion events a day, ApacheCon North America, Sep 5, 2019, and ApacheCon EU Oct 23, 2019
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps in automated application deployment and scaling of application clusters. In this presentation, we will reveal how we architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example anomaly detection application running on a Kubernetes cluster and generating and processing massive amounts of events. Anomaly detection is a method used to detect unusual events in an event stream. It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale, generating millions or billions of events, they impose significant computational, performance and scalability challenges on anomaly detection algorithms and data layer technologies. We will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from our experiments allowing the anomaly detection application to scale to 19 Billion anomaly checks per day.
Kongo: Building a Scalable Streaming IoT Application using Apache Kafka, ApacheCon Europe · Oct 23, 2019
Join with me in a journey of exploration upriver with "Kongo", a scalable streaming IoT logistics demonstration application using Apache Kafka, the popular open source distributed streaming platform. Along the way you'll discover: an example logistics IoT problem domain (involving the rapid movement of thousands of goods by trucks between warehouses, with real-time checking of complex business and safety rules from sensor data); an overview of the Apache Kafka architecture and components; lessons learned from making critical Kafka application design decisions; an example of Kafka Streams for checking truck load limits; and finish the journey by overcoming final performance challenges and shooting the rapids to scale Kongo on a production Kafka cluster.
Building a real-time data processing pipeline using Apache Kafka, Kafka Connect, Elasticsearch and Kibana, ApacheCon, Oct 1, 2020
With the rapid onset of the global Covid-19 Pandemic from the start of this year the USA Centers for Disease Control and Prevention (CDC) had to quickly implement a new Covid-19 specific pipeline to collect testing data from all of the USA’s states and territories, and carry out other critical steps including integration, cleaning, checking, enrichment, analysis, and enforcing data governance and privacy etc. The pipeline then produces multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka. In this presentation we'll build a similar (but simpler) pipeline for ingesting, integrating, indexing, searching/analysing and visualising some publicly available tidal data. We'll briefly introduce each technology and component, and walk through the steps of using Apache Kafka, Kafka Connect, Elasticsearch and Kibana to build the pipeline and visualise the results.
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra, ApacheCon Oct 2, 2020
This presentation will explore how we added location data to a scalable real-time anomaly detection application, built around Apache Kafka, and Cassandra. Kafka and Cassandra are designed for time-series data, however, it’s not so obvious how they can efficiently process spatiotemporal data (space and time). In order to find location-specific anomalies, we need ways to represent locations, to index locations, and to query locations. We explore alternative geospatial representations including: Latitude/Longitude points, Bounding Boxes, Geohashes, and go vertical with 3D representations, including 3D Geohashes. For each representation we also explore possible Cassandra implementations including: Clustering columns, Secondary indexes, Denormalized tables, and the Cassandra Lucene Index Plugin. To conclude we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
Building and Scaling a Robust Zero-code Data Pipeline with Open Source Technologies, FOSSASIA 15 March 2021 (Also Percona, 2 June 2021; and ApacheCon 12 Oct, 2021; All Things Open ATO 13 Oct 2021, and updated for FOSSASIA 10 April, 2022)
We built a demonstration pipeline for ingesting, indexing, and visualizing some publicly available tidal data using multiple open source technologies including Apache Kafka, Apache Kafka Connect, Apache Camel Kafka Connectors, Open Distro for Elasticsearch and Kibana, Prometheus and Grafana.
In this talk, we introduce each technology, the pipeline architecture, and walk through the steps, challenges and solutions to build an initial integration pipeline to consume USA National Oceanic and Atmospheric Administration (NOAA) Tidal data, map and index the data types in Elasticsearch, and add missing data with an ingest pipeline. The goal being to visualize the results with Kibana, where we’ll see the period of the “Lunar” day, and the size and location of some small and large tidal ranges.
But what can go wrong? The initial pipeline only worked briefly, failing when it encountered exceptions. To make the pipeline more robust, we investigated Apache Kafka Connect exception handling, and evaluated the benefits of using Apache Camel Kafka Connectors, and Elasticsearch schema validation.
With a sufficiently robust pipeline in place, it’s time to scale it up. The first step is to select and monitor the most relevant metrics, across multiple technologies. We configured Prometheus to collect the metrics, and Grafana to produce a dashboard. With the monitoring in place we were able to systematically increase the pipeline throughput by increasing Kafka connector tasks, while watching out for potential bottlenecks. We discovered, and fixed, two bottlenecks in the pipeline, proving the value of this approach to pipeline scaling.
97 Things Every Data Engineer Should Know (Chapter 80: "The Yin and Yang of Big Data Scalability"), O'Reilly · Jun 21, 2021
Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and Instaclustr share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.
Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.
I contributed chapter 80: The Yin and Yang of Big Data Scalability
Change Data Capture (CDC) With Kafka Connect and the Debezium PostgreSQL Source Connector, PostgresConf.CN & PGConf.Asia 15 Dec 2021
Modern event-based/streaming distributed systems embrace the idea that change is inevitable and actually desirable! Without being change-aware, systems are inflexible, can’t evolve or react, and are simply incapable of keeping up with real-time real-world data. But how can we speed up an “Elephant” (PostgreSQL) to be as fast as a “Cheetah” (Kafka)? In this talk, we'll introduce the Debezium PostgreSQL Connector, and explain how to deploy, configure and run it on a Kafka Connect cluster, explore the semantics and format of the change data events (including Schemas and Table/Topic mapping), and test the performance to see if we can really achieve "Cheetah" speed. Finally, we'll show how to stream the change data events into an example downstream system, Elasticsearch, using an open source sink connector and Single Message Transformations (SMTs).
HotCloudPerf2022 invited Keynote: “Scaling Open Source Big Data Cloud Applications is Easy/Hard”, International Conference on Performance Engineering, ICPE · Apr 9, 2022
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
APACHE ZOOKEEPER AND APACHE CURATOR MEET THE DINING PHILOSOPHERS, ApacheCon Asia July 29, 2022; and ApacheCon NA 3 Oct, 2022
A ZooKeeper walks into a pub … (actually an Outback pub), and ends up helping some Philosophers solve their fork resource contention problem. This talk is an introduction to Apache ZooKeeper and Apache Curator to solve a new variant of the classic computer science Dining Philosophers problem. We’ll introduce ZooKeeper (a mature and widely-used de facto technology for distributed systems coordination) and the Dining Philosophers problem, and explore how we used Apache Curator (a high-level Java client for ZooKeeper) to implement the solution and show how it works. We tested the application on Instaclustr’s new managed Apache ZooKeeper cloud service, so we can also reveal performance results using a single ZooKeeper server vs. an Ensemble. Finally, we take a look at the progress to remove ZooKeeper from Apache Kafka. Even though Apache Kafka may be leaving the Zoo(keeper) soon, there are still lots of distributed applications in need of some coordination help, and it’s worth learning about Apache ZooKeeper and Curator.
SCALING OPEN SOURCE BIG DATA CLOUD APPLICATIONS IS EASY/HARD, ApacheCon Asia 29 July 2022
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers.
But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of new server types and partitions, replication, consumers, connections, etc; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations.
In this presentation, I will explore some of the performance goals, challenges, solutions, and insights I discovered over the last 5 years of building multiple realistic demonstration applications. The examples include trade-offs and automation of elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, understanding and mitigating the impact of Kafka partitions and replication on cluster throughput, and building low-latency streaming data pipelines using Kafka Connect.
The Impact of Hardware and Software Version Changes on Apache Kafka Performance and Scalability, ApacheCon NA 6 Oct 2022 (1st Performance Engineering track presentation)
Apache Kafka's performance and scalability can be impacted by both hardware and software dimensions. In this presentation, we explore two recent experiences from running a managed Kafka service.
The first example recounts our experiences with running Kafka on AWS's Graviton2 (ARM) instances. We performed extensive benchmarking but didn't initially see the expected performance benefits. We developed multiple hypotheses to explain the unrealized performance improvement, but we could not experimentally determine the cause. We then profiled the Kafka application, and after identifying and confirming a likely cause, we found a workaround and obtained the hoped-for improved price/performance.
The second example explores the ability of Kafka to scale with increasing partitions. We revisit our previous benchmarking experiments with the newest version of Kafka (3.X), which has the option to replace Zookeeper with the new KRaft protocol. We test the theory that Kafka with KRaft can 'scale to millions of partitions' and also provide valuable experimental feedback on how close KRaft is to being production-ready.
Watch this space for more to come, including Kafka KRaft performance, Uber's Cadence + Kafka, TensorFlow, RisingWave, ClickHouse, etc.