Ten Interesting Things About Apache Cassandra For Developers
Apache Cassandra is a free and open-source, distributed, wide-column NoSQL database (The Parthenon, Paul Brebner, 2009)

Recently I was looking over my previous blogs about Apache Cassandra to find some relevant links for a colleague in our DevRel team, and I realised I hadn't written a summary blog of what I've learned about Cassandra over the last 7 years, so here goes! There just happen to be 10 "things" of interest.

1 Write Speed


"Explorer 1" - Clocks on satellites run faster than clocks on Earth (Public Domain)


Once Upon A Time, before starting at Instaclustr, I was the CTO of an R&D startup commercialising performance modelling and simulation. Around 10 years ago we built a cloud-hosted multi-user version of our performance modelling tool and needed a way to store the generated metrics for each simulation run for subsequent graphing and comparison. Now, simulation is way faster than real-time (like clocks on satellites - see this explanation by Dr Karl to understand why), so a few minutes of simulation was producing massive amounts of data (potentially equivalent to months of wall-clock time data), and we didn't want to slow the simulation down with a slow database. So after evaluating several databases, we settled on Apache Cassandra. It turned out to be ideal - it was really fast for writes in particular, it could keep up with the simulation, and was open source, so we could host and run it ourselves.

2 Connecting to Cassandra

Cassandra (By Evelyn De Morgan - Flickr and [1], Public Domain)

One of the first blogs I wrote for Instaclustr was about connecting to Cassandra with a Java program, "Hello Cassandra! A Java Client Example" - for some odd reason (that we could never work out) it was the most popular blog on our web site for several years. The previous blog ("Consulting Cassandra") introduced CQL, the Cassandra Query Language.

3 Wide Columns

Greek Columns (Source: Paul Brebner)


My first Instaclustr blog series (a 2001: A Space Odyssey-themed introduction to Cassandra and Spark ML) used Cassandra wide columns for ML features. Check these two blogs out for more details: https://www.instaclustr.com/blog/behind-the-scenes/ and https://www.instaclustr.com/blog/fourth-contact-monolith/

Wide columns are one feature of Cassandra's "NoSQL" approach and allow for the flexibility to have as many different columns as you like (each row can have different numbers, types and names of columns).

4 Massive Scalability

Anomaly Checks per day with increasing Cassandra nodes (Source: Paul Brebner)


In 2019 I built my second realistic demonstration with our open source technologies, this time combining Apache Kafka and Cassandra (and Kubernetes etc) for real-time anomaly detection. The series was called "Anomalia Machina". Here's the last blog in the series, which revealed the results - we could process 19 billion anomaly checks per day with 574 cores in total (Cassandra had 48 nodes in the cluster). Scaling wasn't trivial, and we learned several lessons along the way (incremental scaling with tuning is a good idea).
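To put those figures in perspective, here is a quick back-of-the-envelope calculation in Python. It uses only the two numbers quoted above; note that the 574 cores cover the whole system, not just Cassandra, so the per-core rate is an average across everything.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

checks_per_day = 19_000_000_000  # anomaly checks per day, as quoted above
total_cores = 574                # cores in total, as quoted above

# Convert the daily figure into a sustained per-second rate
checks_per_second = checks_per_day / SECONDS_PER_DAY
# Average rate per core across the whole system
checks_per_second_per_core = checks_per_second / total_cores

print(f"{checks_per_second:,.0f} checks/s overall")            # 219,907 checks/s overall
print(f"{checks_per_second_per_core:,.0f} checks/s per core")  # 383 checks/s per core
```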

5 Can You Use Cassandra for Geospatial Queries? - Yes You Can!

Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space.


My next project was extending Anomalia Machina to detect anomalies in space as well as time - "Terra Locus Anomalia Machina" was the new series title. Now, space is tricky as it's big, and queries need to find everything within a certain distance of something else. We tried some basic Cassandra tricks including "ALLOW FILTERING", bounding boxes, clustering columns, secondary indexes and SASI (SSTable Attached Secondary Indexes), with varying effectiveness. In part 2 we introduced Geohashes and tried different implementations including multiple indexes, denormalized multiple tables (which is "normal" for Cassandra), and single and multiple clustering columns - which are very useful for modelling hierarchical or nested data. So this makes another point!
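For readers who haven't met geohashes before, here is a minimal sketch of the standard geohash encoding in Python. It is a generic illustration, not the code from the blog series, and the function name geohash_encode and the sample coordinates are my own choices. The idea is to repeatedly halve the longitude and latitude ranges, interleave the resulting bits, and emit one base-32 character for every 5 bits.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string (illustrative sketch)."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid  # keep the upper half
        else:
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

short = geohash_encode(42.6, -5.6, precision=5)
longer = geohash_encode(42.6, -5.6, precision=8)
print(short)                      # ezs42
print(longer.startswith(short))   # True
```

The useful property is the prefix: a longer geohash for the same point always starts with the shorter one, so each extra character narrows the cell. That nesting is what makes geohashes a natural fit for clustering columns.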

6 Modelling Nested/Hierarchical data using Cassandra Clustering Columns


Nested hierarchies are common in nature (By Ernst Haeckel - Escaneado por L. Fdez. 2005-12-28, Public Domain)

Here's what I learned:

Clustering columns work well for modelling and efficiently querying hierarchically organized data, so geohashes are a good fit. A single clustering column is often used to retrieve data in a particular order (e.g. for time series data), but multiple clustering columns are good for nested relationships, as Cassandra stores and locates clustering column data in nested sort order. The data is stored hierarchically, and the query must traverse this hierarchy (either partially or completely). To avoid “full scans” of the partition (and to make queries more efficient), a select query must restrict the higher-level columns (in the sort order) with the equals operator. Ranges are only allowed on the last column in the query. A query does not need to include all the clustering columns, as it can omit lower-level clustering columns.
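A toy model can show why these rules exist. The Python sketch below is an in-memory illustration only, not how Cassandra is implemented: it keeps one "partition" as a list of rows sorted by three hypothetical clustering columns (gh1, gh2, reading_id), then checks whether the rows matching a query sit next to each other in that sort order.

```python
# One partition, sorted by its clustering columns (gh1, gh2, reading_id).
# Column names and data are hypothetical, purely for illustration.
rows = sorted([
    ("e", "z", 2), ("u", "4", 4), ("e", "y", 1), ("u", "z", 5), ("e", "z", 3),
])

def matching_positions(sorted_rows, predicate):
    """Positions (in sort order) of the rows that satisfy the predicate."""
    return [i for i, row in enumerate(sorted_rows) if predicate(row)]

def is_contiguous(positions):
    """True if the positions form one unbroken run (or there are none)."""
    return not positions or positions[-1] - positions[0] + 1 == len(positions)

# Equality on the higher-level column, equality on the next one
eq_eq = matching_positions(rows, lambda r: r[0] == "e" and r[1] == "z")
# Equality on the higher-level column, range on the last restricted column
eq_range = matching_positions(rows, lambda r: r[0] == "e" and "y" <= r[1] <= "z")
# Skipping the higher-level column and restricting only the second one
skipped = matching_positions(rows, lambda r: r[1] == "z")

print(eq_eq, is_contiguous(eq_eq))        # [1, 2] True
print(eq_range, is_contiguous(eq_range))  # [0, 1, 2] True
print(skipped, is_contiguous(skipped))    # [1, 2, 4] False
```

When the higher-level columns are pinned with equals, the matching rows form one contiguous slice that can be read in a single pass. Skip a higher-level column and the matches are scattered through the partition, which is the "full scan" case the rules are designed to avoid.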


In part 3 of the blog series we introduced the third spatial dimension (up and down) just to make things more realistic - 3D geohashes worked just as well as 2D.

7 Searching Cassandra with the Lucene Index

But can you use Cassandra for Search? In the final part of the geospatial blog series I took a look at the Cassandra Lucene Index, which is a plugin for Apache Cassandra. This gave us lots of possible ways of performing spatial searches over Cassandra data, and led to the conclusion that some of the approaches we tried were a lot better than others.

8 Elastic Cassandra Autoscaling

Ok, so Cassandra Geospatial and Search are a bit unusual. And so is Elastic Cassandra Autoscaling! However, it is possible with the Instaclustr dynamic resizing cluster API! We did some tests, built a performance model, and explored Cassandra Elasticity.

9 Cassandra Multi-Datacenters

Around the World in 80 days route (By Roke - Self-published work by Roke, CC BY-SA 3.0)


What's better than 1 Cassandra cluster in one Cloud region? A Cassandra cluster that magically extends over multiple Datacenters (DCs). In a new blog series ("Around the World in approximately 8 Data Centers") I wrote about building a low-latency FINTECH (StockBroker) application that worked across multiple AWS regions. My conclusions were:

In this blog we built and experimented with a prototype of the globally distributed stock broker application, focussing on testing the multi-DC Cassandra part of the system which enabled us to significantly reduce the impact of planetary scale latencies (from seconds to very low milliseconds) and ensure greater redundancy (across multiple AWS regions), for the real-time stock trading function.


Here are the blogs in the series:

Part 1

Part 2

Part 3

Part 4

10 Change Data Capture with the Debezium Cassandra Connector

Oddly enough, there is only one word permitted in Scrabble which ends in “ezium”: trapezium! Sadly, Debezium hasn’t made it into the Scrabble dictionary yet. A trapezium (in most of the world) is a quadrilateral with at least one pair of parallel sides. But note that in the U.S. this is called a trapezoid, and a “trapezium” has no parallel sides. Here’s an “Australian” Trapezium = U.S. Trapezoid:

So you've got all your data safely (and quickly!) into a Cassandra cluster. What next? Can you turn writes into events and pump them into Kafka? Yes you can! In my 2 part blog series I tried it out with some help from Kafka and Debezium, here and here.


WARNING: Some of these blogs were written a long time ago and some of the details regarding the technologies and our managed services are likely to have changed!
