Ten Interesting Things About Apache Cassandra For Developers
Recently I was looking over my previous blogs about Apache Cassandra to find some relevant links for a colleague in our DevRel team, and I realised I hadn't written a summary blog of what I've learned about Cassandra over the last 7 years, so here goes! There just happen to be 10 "things" of interest.
1 Write Speed
Once upon a time, before starting at Instaclustr, I was the CTO of an R&D startup commercialising performance modelling and simulation. Around 10 years ago we built a cloud-hosted, multi-user version of our performance modelling tool and needed a way to store the metrics generated by each simulation run for subsequent graphing and comparison. Now, simulation is way faster than real time (like clocks on satellites - see this explanation by Dr Karl to understand why), so a few minutes of simulation produced massive amounts of data (potentially equivalent to months of wall-clock time), and we didn't want a slow database to hold the simulation back. After evaluating several databases, we settled on Apache Cassandra. It turned out to be ideal: it was particularly fast for writes, it could keep up with the simulation, and it was open source, so we could host and run it ourselves.
2 Connecting to Cassandra
One of the first blogs I wrote for Instaclustr was about connecting to Cassandra from a Java program, "Hello Cassandra! A Java Client Example". For some odd reason (that we could never work out) it was the most popular blog on our website for several years. The previous blog ("Consulting Cassandra") introduced CQL, the Cassandra Query Language.
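To give a flavour of CQL, here's a minimal "hello" session of the kind that blog walks through, assuming a locally running cluster (the keyspace and table names below are illustrative, not taken from the original post):

```cql
-- Create a throwaway keyspace and table (single-node replication).
CREATE KEYSPACE IF NOT EXISTS hello
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS hello.greetings (
  id uuid PRIMARY KEY,
  message text
);

-- Write a row, then read it back.
INSERT INTO hello.greetings (id, message)
  VALUES (uuid(), 'Hello Cassandra!');

SELECT message FROM hello.greetings;
```

The SQL-like surface is deliberate, but as the later sections show, the data model underneath (partitions, clustering, denormalization) is very different from a relational database.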
3 Wide Columns
My first Instaclustr blog series (a 2001: A Space Odyssey-themed introduction to Cassandra and Spark ML) used Cassandra wide columns for ML features. Check these two blogs out for more details: https://www.instaclustr.com/blog/behind-the-scenes/ and https://www.instaclustr.com/blog/fourth-contact-monolith/
Wide columns are one feature of Cassandra's "NoSQL" approach and allow for the flexibility to have as many different columns as you like (each row can have different numbers, types and names of columns).
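In CQL terms, a wide "row" is a wide partition: many cells stored and sorted under one partition key, with a collection type available when each row needs its own variable set of named values. A rough sketch (table and column names are illustrative):

```cql
-- One wide partition per sensor; cells ordered by time within it.
CREATE TABLE IF NOT EXISTS metrics.readings (
  sensor_id    text,       -- partition key: one wide partition per sensor
  reading_time timestamp,  -- clustering column: sorts cells in the partition
  value        double,
  PRIMARY KEY ((sensor_id), reading_time)
);

-- A variable set of named values per row can be held in a map collection:
ALTER TABLE metrics.readings ADD attributes map<text, text>;
```

This is the pattern the Spark ML series leaned on: arbitrarily many feature values per key without a fixed relational schema.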
4 Massive Scalability
In 2019 I built my second realistic demonstration with our open source technologies, this time combining Apache Kafka and Cassandra (and Kubernetes, etc.) for real-time anomaly detection, in a series called "Anomalia Machina". Here's the last blog in the series, which revealed the results: we could process 19 billion anomaly checks per day with 574 cores in total (the Cassandra cluster had 48 nodes). Scaling wasn't trivial, and we learned several lessons along the way (incremental scaling with tuning is a good idea).
5 Can You Use Cassandra for Geospatial Queries? - Yes You Can!
My next project extended Anomalia Machina to detect anomalies in space as well as time; the new series was called "Terra Locus Anomalia Machina". Now, space is tricky, as it's big, and queries require the ability to find everything within a certain distance of something else. We tried some basic Cassandra tricks including "allow filtering", bounding boxes, clustering columns, secondary indexes, and SASI (SSTable Attached Secondary Indexes), with varying effectiveness. In part 2 we introduced geohashes and tried different implementations, including multiple indexes, multiple denormalized tables (which is "normal" for Cassandra), and single and multiple clustering columns - which are very useful for modelling hierarchical or nested data. So that makes another point!
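As a taste of the simplest trick above, here's a rough bounding-box filter (the schema is illustrative). Without an index on the coordinates this needs ALLOW FILTERING, which scans the data and is only viable for small data sets; a SASI index on each coordinate column is one way to support the range predicates more efficiently:

```cql
CREATE TABLE IF NOT EXISTS geo.events (
  id      uuid PRIMARY KEY,
  lat     double,
  lon     double,
  payload text
);

-- Bounding-box query: a full scan unless the coordinates are indexed.
SELECT id, payload FROM geo.events
  WHERE lat > -35.4 AND lat < -35.2
    AND lon > 149.0 AND lon < 149.2
  ALLOW FILTERING;

-- SASI supports range predicates on numeric columns:
CREATE CUSTOM INDEX IF NOT EXISTS ON geo.events (lat)
  USING 'org.apache.cassandra.index.sasi.SASIIndex';
```

As the series found, each of these approaches has different trade-offs in accuracy and efficiency; the geohash-based designs in part 2 worked better than raw filtering.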
6 Modelling Nested/Hierarchical data using Cassandra Clustering Columns
Here's what I learned:
Clustering columns work well for modelling and efficiently querying hierarchically organized data, so geohashes are a good fit. A single clustering column is often used to retrieve data in a particular order (e.g. for time series data), but multiple clustering columns are good for nested relationships, as Cassandra stores and locates clustering column data in nested sort order. Because the data is stored hierarchically, queries must traverse the hierarchy (either partially or completely). To avoid "full scans" of the partition (and to make queries more efficient), a select query must restrict the higher-level columns (in the sort order) with the equals operator; a range is only allowed on the last column in the query. A query does not need to include all the clustering columns, as it can omit the lower-level ones.
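A sketch of these rules using geohash characters as nested clustering columns (the table and column names are illustrative):

```cql
CREATE TABLE IF NOT EXISTS geo.hashed_events (
  partition_id text,
  hash1        text,     -- coarsest geohash character
  hash2        text,
  hash3        text,     -- finest level
  event_id     timeuuid,
  PRIMARY KEY ((partition_id), hash1, hash2, hash3, event_id)
);

-- OK: equality on the higher-level column, range on the next one,
-- lower-level columns (hash3, event_id) omitted.
SELECT * FROM geo.hashed_events
  WHERE partition_id = 'p1'
    AND hash1 = 'q'
    AND hash2 >= 'a' AND hash2 <= 'c';

-- Not allowed: restricting hash3 while skipping hash2 would force a
-- partition scan, so Cassandra rejects it (without ALLOW FILTERING).
```

The nesting is exactly what makes geohashes a good fit: each extra clustering column narrows the search to a smaller spatial cell.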
In part 3 of the blog series we introduced the third spatial dimension (up and down) just to make things more realistic - 3D geohashes worked just as well as 2D.
7 Searching Cassandra with the Lucene Index
But can you use Cassandra for search? In the final part of the geospatial blog series I took a look at the Cassandra Lucene Index, a plugin for Apache Cassandra. This gave us lots of possible ways of performing spatial searches over Cassandra data, and led to the conclusion that some of the approaches we tried were a lot better than others.
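To show the shape of the plugin's API, here's a geo-distance search in the style of its documentation (the table, index name, and option values are illustrative assumptions, not taken from the blog):

```cql
-- The Lucene index is created as a custom index with a JSON schema
-- mapping lat/lon columns to an indexed geo_point field.
CREATE CUSTOM INDEX events_lucene_idx ON geo.events ()
  USING 'com.stratio.cassandra.lucene.Index'
  WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
      fields: {
        place: {type: "geo_point", latitude: "lat", longitude: "lon"}
      }
    }'
  };

-- "Everything within 10 km of a point" becomes a single expr() query:
SELECT * FROM geo.events
  WHERE expr(events_lucene_idx, '{
    filter: {type: "geo_distance", field: "place",
             latitude: -35.3, longitude: 149.1, max_distance: "10km"}
  }');
```

This sidesteps the clustering-column gymnastics of the previous section, at the cost of running a Lucene engine inside each Cassandra node.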
8 Elastic Cassandra Autoscaling
OK, so Cassandra geospatial and search are a bit unusual. And so is elastic Cassandra autoscaling! However, it is possible with the Instaclustr dynamic resizing cluster API. We ran some tests, built a performance model, and explored Cassandra elasticity.
9 Cassandra Multi-Datacenters
What's better than one Cassandra cluster in one cloud region? A Cassandra cluster that magically extends over multiple datacenters (DCs). In a new blog series ("Around the World in Approximately 8 Data Centers") I wrote about building a low-latency fintech (stock broker) application that worked across multiple AWS regions. My conclusions were:
In this blog we built and experimented with a prototype of the globally distributed stock broker application, focussing on testing the multi-DC Cassandra part of the system which enabled us to significantly reduce the impact of planetary scale latencies (from seconds to very low milliseconds) and ensure greater redundancy (across multiple AWS regions), for the real-time stock trading function.
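The multi-DC part of that design rests on per-datacenter replication: with NetworkTopologyStrategy, each DC gets its own replica count, and clients read and write at LOCAL_QUORUM against their nearest DC while writes replicate to the others asynchronously. A sketch (the keyspace and AWS-region-style DC names are illustrative):

```cql
CREATE KEYSPACE IF NOT EXISTS broker
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east-1': 3,
    'eu-west-1': 3,
    'ap-southeast-2': 3
  };
```

Note that the DC names must match the datacenter names your snitch reports, otherwise the keyspace will silently have no replicas in the intended regions.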
10 Change Data Capture with the Debezium Cassandra Connector
So you've got all your data safely (and quickly!) into a Cassandra cluster - what next? Can you turn writes into events and pump them into Kafka? Yes you can: in my two-part blog series I tried it out with some help from Kafka and Debezium, here and here.
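On the Cassandra side, the main prerequisite for the Debezium connector is enabling CDC so that mutations are retained in commit-log segments for the connector to read. The table name below is illustrative:

```cql
-- Requires cdc_enabled: true in cassandra.yaml on each node; then CDC
-- is switched on per table:
ALTER TABLE shop.orders WITH cdc = true;
```

From there the Debezium Cassandra connector tails the CDC commit-log directory on each node and publishes the change events to Kafka topics.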
WARNING: Some of these blogs were written a long time ago and some of the details regarding the technologies and our managed services are likely to have changed!