Ten Interesting Things About Apache Cassandra For Developers
Recently I was looking over my previous blogs about Apache Cassandra to find some relevant links for a colleague in our DevRel team, and I realised I hadn't written a summary blog of what I've learned about Cassandra over the last 7 years, so here goes! There just happen to be 10 "things" of interest.
1 Write Speed
Once upon a time, before starting at Instaclustr, I was the CTO of an R&D startup commercialising performance modelling and simulation. Around 10 years ago we built a cloud-hosted, multi-user version of our performance modelling tool and needed a way to store the metrics generated by each simulation run for subsequent graphing and comparison. Now, simulation is way faster than real time (like clocks on satellites - see this explanation by Dr Karl to understand why), so a few minutes of simulation produced massive amounts of data (potentially equivalent to months of wall-clock time), and we didn't want a slow database to hold the simulation back. After evaluating several databases, we settled on Apache Cassandra. It turned out to be ideal: it was particularly fast for writes, it could keep up with the simulation, and it was open source, so we could host and run it ourselves.
2 Connecting to Cassandra
One of the first blogs I wrote for Instaclustr was about connecting to Cassandra from a Java program, "Hello Cassandra! A Java Client Example". For some odd reason (that we could never work out) it was the most popular blog on our website for several years. The previous blog ("Consulting Cassandra") introduced CQL, the Cassandra Query Language.
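To give a flavour of CQL, here's a minimal "hello" session of the kind that blog walks through, assuming a locally running cluster (the keyspace and table names below are illustrative, not taken from the original post):

```cql
-- Create a throwaway keyspace and table (single-node replication).
CREATE KEYSPACE IF NOT EXISTS hello
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS hello.greetings (
  id uuid PRIMARY KEY,
  message text
);

-- Write a row, then read it back.
INSERT INTO hello.greetings (id, message)
  VALUES (uuid(), 'Hello Cassandra!');

SELECT message FROM hello.greetings;
```

The SQL-like surface is deliberate, but as the later sections show, the data model underneath (partitions, clustering, denormalization) is very different from a relational database.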
3 Wide Columns
My first Instaclustr blog series (a 2001: A Space Odyssey-themed introduction to Cassandra and Spark ML) used Cassandra wide columns for ML features. Check these two blogs out for more details: https://www.instaclustr.com/blog/behind-the-scenes/ and https://www.instaclustr.com/blog/fourth-contact-monolith/
Wide columns are one feature of Cassandra's "NoSQL" approach and allow for the flexibility to have as many different columns as you like (each row can have different numbers, types and names of columns).
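In CQL terms, a wide "row" is a wide partition: many cells stored and sorted under one partition key, with a collection type available when each row needs its own variable set of named values. A rough sketch (table and column names are illustrative):

```cql
-- One wide partition per sensor; cells ordered by time within it.
CREATE TABLE IF NOT EXISTS metrics.readings (
  sensor_id    text,       -- partition key: one wide partition per sensor
  reading_time timestamp,  -- clustering column: sorts cells in the partition
  value        double,
  PRIMARY KEY ((sensor_id), reading_time)
);

-- A variable set of named values per row can be held in a map collection:
ALTER TABLE metrics.readings ADD attributes map<text, text>;
```

This is the pattern the Spark ML series leaned on: arbitrarily many feature values per key without a fixed relational schema.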
4 Massive Scalability
In 2019 I built my second realistic demonstration with our open source technologies, this time combining Apache Kafka and Cassandra (and Kubernetes, etc.) for real-time anomaly detection, in a series called "Anomalia Machina". Here's the last blog in the series, which revealed the results: we could process 19 billion anomaly checks per day with 574 cores in total (the Cassandra cluster had 48 nodes). Scaling wasn't trivial, and we learned several lessons along the way (incremental scaling with tuning is a good idea).
5 Can You Use Cassandra for Geospatial Queries? - Yes You Can!
My next project extended Anomalia Machina to detect anomalies in space as well as time; the new series was called "Terra Locus Anomalia Machina". Now, space is tricky, as it's big, and queries require the ability to find everything within a certain distance of something else. We tried some basic Cassandra tricks including "allow filtering", bounding boxes, clustering columns, secondary indexes, and SASI (SSTable Attached Secondary Indexes), with varying effectiveness. In part 2 we introduced geohashes and tried different implementations, including multiple indexes, multiple denormalized tables (which is "normal" for Cassandra), and single and multiple clustering columns - which are very useful for modelling hierarchical or nested data. So that makes another point!
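As a taste of the simplest trick above, here's a rough bounding-box filter (the schema is illustrative). Without an index on the coordinates this needs ALLOW FILTERING, which scans the data and is only viable for small data sets; a SASI index on each coordinate column is one way to support the range predicates more efficiently:

```cql
CREATE TABLE IF NOT EXISTS geo.events (
  id      uuid PRIMARY KEY,
  lat     double,
  lon     double,
  payload text
);

-- Bounding-box query: a full scan unless the coordinates are indexed.
SELECT id, payload FROM geo.events
  WHERE lat > -35.4 AND lat < -35.2
    AND lon > 149.0 AND lon < 149.2
  ALLOW FILTERING;

-- SASI supports range predicates on numeric columns:
CREATE CUSTOM INDEX IF NOT EXISTS ON geo.events (lat)
  USING 'org.apache.cassandra.index.sasi.SASIIndex';
```

As the series found, each of these approaches has different trade-offs in accuracy and efficiency; the geohash-based designs in part 2 worked better than raw filtering.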
6 Modelling Nested/Hierarchical data using Cassandra Clustering Columns
Here's what I learned:
Clustering columns work well for modelling and efficiently querying hierarchically organized data, so geohashes are a good fit. A single clustering column is often used to retrieve data in a particular order (e.g. for time series data), but multiple clustering columns are good for nested relationships, as Cassandra stores and locates clustering column data in nested sort order. Because the data is stored hierarchically, queries must traverse the hierarchy (either partially or completely). To avoid "full scans" of the partition (and to make queries more efficient), a select query must restrict the higher-level columns (in the sort order) with the equals operator; a range is only allowed on the last column in the query. A query does not need to include all the clustering columns, as it can omit the lower-level ones.
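A sketch of these rules using geohash characters as nested clustering columns (the table and column names are illustrative):

```cql
CREATE TABLE IF NOT EXISTS geo.hashed_events (
  partition_id text,
  hash1        text,     -- coarsest geohash character
  hash2        text,
  hash3        text,     -- finest level
  event_id     timeuuid,
  PRIMARY KEY ((partition_id), hash1, hash2, hash3, event_id)
);

-- OK: equality on the higher-level column, range on the next one,
-- lower-level columns (hash3, event_id) omitted.
SELECT * FROM geo.hashed_events
  WHERE partition_id = 'p1'
    AND hash1 = 'q'
    AND hash2 >= 'a' AND hash2 <= 'c';

-- Not allowed: restricting hash3 while skipping hash2 would force a
-- partition scan, so Cassandra rejects it (without ALLOW FILTERING).
```

The nesting is exactly what makes geohashes a good fit: each extra clustering column narrows the search to a smaller spatial cell.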
In part 3 of the blog series we introduced the third spatial dimension (up and down) just to make things more realistic - 3D geohashes worked just as well as 2D.
7 Searching Cassandra with the Lucene Index
But can you use Cassandra for search? In the final part of the geospatial blog series I took a look at the Cassandra Lucene Index, a plugin for Apache Cassandra. This gave us lots of possible ways of performing spatial searches over Cassandra data, and led to the conclusion that some of the approaches we tried were a lot better than others.
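To show the shape of the plugin's API, here's a geo-distance search in the style of its documentation (the table, index name, and option values are illustrative assumptions, not taken from the blog):

```cql
-- The Lucene index is created as a custom index with a JSON schema
-- mapping lat/lon columns to an indexed geo_point field.
CREATE CUSTOM INDEX events_lucene_idx ON geo.events ()
  USING 'com.stratio.cassandra.lucene.Index'
  WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
      fields: {
        place: {type: "geo_point", latitude: "lat", longitude: "lon"}
      }
    }'
  };

-- "Everything within 10 km of a point" becomes a single expr() query:
SELECT * FROM geo.events
  WHERE expr(events_lucene_idx, '{
    filter: {type: "geo_distance", field: "place",
             latitude: -35.3, longitude: 149.1, max_distance: "10km"}
  }');
```

This sidesteps the clustering-column gymnastics of the previous section, at the cost of running a Lucene engine inside each Cassandra node.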
8 Elastic Cassandra Autoscaling
OK, so Cassandra geospatial and search are a bit unusual. And so is elastic Cassandra autoscaling! However, it is possible with the Instaclustr dynamic resizing cluster API. We ran some tests, built a performance model, and explored Cassandra elasticity.
9 Cassandra Multi-Datacenters
What's better than one Cassandra cluster in one cloud region? A Cassandra cluster that magically extends over multiple datacenters (DCs). In a new blog series ("Around the World in Approximately 8 Data Centers") I wrote about building a low-latency fintech (stock broker) application that worked across multiple AWS regions. My conclusions were:
In this blog we built and experimented with a prototype of the globally distributed stock broker application, focussing on testing the multi-DC Cassandra part of the system which enabled us to significantly reduce the impact of planetary scale latencies (from seconds to very low milliseconds) and ensure greater redundancy (across multiple AWS regions), for the real-time stock trading function.
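The multi-DC part of that design rests on per-datacenter replication: with NetworkTopologyStrategy, each DC gets its own replica count, and clients read and write at LOCAL_QUORUM against their nearest DC while writes replicate to the others asynchronously. A sketch (the keyspace and AWS-region-style DC names are illustrative):

```cql
CREATE KEYSPACE IF NOT EXISTS broker
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east-1': 3,
    'eu-west-1': 3,
    'ap-southeast-2': 3
  };
```

Note that the DC names must match the datacenter names your snitch reports, otherwise the keyspace will silently have no replicas in the intended regions.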
10 Change Data Capture with the Debezium Cassandra Connector
So you've got all your data safely (and quickly!) into a Cassandra cluster - what next? Can you turn writes into events and pump them into Kafka? Yes you can: in my two-part blog series I tried it out with some help from Kafka and Debezium, here and here.
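On the Cassandra side, the main prerequisite for the Debezium connector is enabling CDC so that mutations are retained in commit-log segments for the connector to read. The table name below is illustrative:

```cql
-- Requires cdc_enabled: true in cassandra.yaml on each node; then CDC
-- is switched on per table:
ALTER TABLE shop.orders WITH cdc = true;
```

From there the Debezium Cassandra connector tails the CDC commit-log directory on each node and publishes the change events to Kafka topics.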
WARNING: Some of these blogs were written a long time ago and some of the details regarding the technologies and our managed services are likely to have changed!