Cassandra Overview: From Zero To Hero

1.1 Introduction

If you work in software, you have probably used Cassandra, or at least heard of it. It's a catchy name, isn't it? It sounds royal, and no surprise: in Greek mythology, Cassandra was a princess, the most beautiful of the daughters of Priam, the last king of Troy. The database's name may well be inspired by her, but she is not who we are talking about here. Sorry to disappoint ^^

Cassandra, in the software world, is a powerful NoSQL database designed to handle vast amounts of data. It is popular for its highly distributed and scalable architecture: data is spread across multiple servers, called nodes, that keep each other in sync. The nodes are a bit like your typical friends, gossiping all the time. No kidding! They constantly exchange information about themselves and about the other nodes they know of, and the mechanism is literally called the gossip protocol. Cassandra operates on a peer-to-peer basis, unlike many traditional SQL databases, which run in a master-slave setup.
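To make the gossiping idea concrete, here is a minimal Python sketch. It is a toy simulation, not Cassandra's actual implementation: the node names, the single integer "version" per node, and the rule that every node talks to exactly one random peer per round are all assumptions made for illustration.

```python
import random


def gossip_round(views, rng):
    """Run one round: every node syncs its view with one random peer."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        # Push-pull exchange: both sides keep the newest version they
        # have seen for every node they know about.
        merged = dict(views[node])
        for name, version in views[peer].items():
            merged[name] = max(merged.get(name, 0), version)
        views[node] = dict(merged)
        views[peer] = dict(merged)


def simulate(node_count=5, max_rounds=200, seed=42):
    """Gossip until every node knows about every other node."""
    rng = random.Random(seed)
    # Initially each (hypothetical) node only knows its own version.
    views = {f"node{i}": {f"node{i}": 1} for i in range(1, node_count + 1)}
    rounds = 0
    while rounds < max_rounds and any(len(v) < node_count for v in views.values()):
        gossip_round(views, rng)
        rounds += 1
    return rounds, views


rounds, views = simulate()
print(f"all nodes share the same view after {rounds} round(s)")
```

The point of the sketch is that no node is in charge: information spreads through repeated pairwise exchanges until every node ends up with the same picture of the cluster.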

Cassandra of Troy

1.2 Components of Cassandra

  • Node: A node is a single instance of the Cassandra database running on a physical or virtual machine. Each node stores a portion of the data, and together all nodes form a single fault-tolerant database.
  • Datacenter: A datacenter is a logical grouping of nodes, typically located in the same geographical location.
  • Cluster: A cluster is a collection of one or more datacenters and is the top-level entity in a Cassandra database system.
  • Keyspace: In Cassandra, a keyspace is a data container. It is analogous to a database in the relational world.
  • Column family: In Apache Cassandra, a column family is a data structure that consists of rows, each of which contains columns. It is essentially a logical grouping of columns, somewhat analogous to a table in a relational database management system (RDBMS), although the two are not exactly the same. Data in Cassandra is stored as key-value pairs: each column has a unique name that acts as the key, and the corresponding data is the value. Column families are schema-flexible, meaning that each row can populate a different set of columns, and columns can be added or removed without affecting other rows. Since CQL (discussed in section 1.9) and its more structured, relational-looking data model became the standard interface, around Cassandra 1.2, the term "column family" has become less common and is used interchangeably with "table".

Components of Cassandra

1.3 How Data is Stored?

Data in a Cassandra database is distributed over several machines, called nodes, that operate in conjunction. Cassandra arranges the nodes of a cluster in a ring and assigns data to them. Since data is replicated across several nodes, it stays accessible even if one of the nodes fails, which avoids a single point of failure. If a user inserts a new row, the data goes to at least one of these nodes and is replicated to a certain number of nodes (decided by the replication factor, discussed later in this section). When a user requests a row, the node that receives the request works out which nodes hold that data, fetches it, and returns it to the user. Interestingly, each node can handle requests for any data, even data it does not hold itself. All of this is managed internally by the database. That's the beauty of it. The nodes in the cluster are aware of their distributed setup, communicating constantly with each other as if they were gossiping.
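The ring idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Cassandra's code: it uses MD5 from the standard library as a stand-in hash (Cassandra's default partitioner is Murmur3), gives each node a single token (real clusters usually use many virtual nodes per machine), and places replicas on the next nodes clockwise, which is roughly what SimpleStrategy does. The names Ring, token_for, and replicas_for are hypothetical.

```python
import bisect
import hashlib


def token_for(value):
    """Map a string to a position on the ring (stand-in for a partitioner)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor):
        # Sketch-only restriction so that replicas are always distinct nodes.
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # Each node owns one token; sort them so the ring can be walked clockwise.
        self.tokens = sorted((token_for(node), node) for node in nodes)

    def replicas_for(self, row_key):
        """Return the nodes that hold a copy of the row with this key."""
        key_token = token_for(row_key)
        positions = [token for token, _ in self.tokens]
        # The first node whose token is >= the key's token owns the row;
        # the modulo wraps around the end of the ring.
        start = bisect.bisect_left(positions, key_token) % len(self.tokens)
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(self.replication_factor)
        ]


ring = Ring(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
print(ring.replicas_for("BOOK1"))  # three distinct nodes, always the same for this key
```

Because every node can compute the same placement from the key alone, any node can accept a request and forward it to the right replicas.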

Just as a database is the first point of entry in the relational world, a keyspace is the first point of entry in Cassandra. It is the outermost container for data. To create a keyspace, you need to know its basic attributes. Below is the syntax to create a keyspace:

create KEYSPACE my_first_keyspace WITH replication={'class':'SimpleStrategy', 'replication_factor':'3'};        

  • Replication factor: The number of machines in the cluster that receive copies of the same data. A replication factor of 3 means 3 nodes will contain the same data. Yes, redundancy means reliability.
  • Replica placement strategy: The strategy used to place replicas in the ring. The options are SimpleStrategy (aware of neither racks nor datacenters), the legacy OldNetworkTopologyStrategy (rack-aware), and NetworkTopologyStrategy (datacenter-aware).
  • Column families: Column families represent the structure of your data. Each keyspace has at least one, and often many, column families.
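As a small illustration of how the replication attributes end up in a statement, the Python sketch below builds CREATE KEYSPACE statements for SimpleStrategy and NetworkTopologyStrategy. The helper name create_keyspace_cql and the datacenter names dc1 and dc2 are made up for this example; with NetworkTopologyStrategy the keys must match the datacenter names of your own cluster.

```python
import re


def create_keyspace_cql(name, replication_factor=None, datacenters=None):
    """Build a CREATE KEYSPACE statement (hypothetical helper, for illustration)."""
    # Only accept plain unquoted identifiers so the name is safe to embed.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError(f"invalid keyspace name: {name!r}")
    if (replication_factor is None) == (datacenters is None):
        raise ValueError("pass either replication_factor or datacenters")
    if datacenters is not None:
        # One replication factor per datacenter.
        options = {"class": "NetworkTopologyStrategy", **datacenters}
    else:
        options = {"class": "SimpleStrategy", "replication_factor": replication_factor}
    body = ", ".join(f"'{key}': {value!r}" for key, value in options.items())
    return f"CREATE KEYSPACE {name} WITH replication = {{{body}}};"


print(create_keyspace_cql("my_first_keyspace", replication_factor=3))
print(create_keyspace_cql("catalog", datacenters={"dc1": 3, "dc2": 2}))
```

The first call reproduces the single-datacenter statement shown above; the second shows the per-datacenter form, where each datacenter gets its own replica count.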

A column family is a data structure that consists of rows, each of which contains columns. Each column, or data unit, has two major components: a name and a value. A timestamp is also associated with each data unit, recording when the data was entered or last modified. Let's say we need to create a catalog for books. For that, we need a bunch of columns with names such as id, title, author, isbn, and publisher. For each data entry, these columns hold their corresponding values. Each data entry is identified by a row key, also referred to as its primary key. So in total, Cassandra stores its data in the form of Map<RowKey, SortedMap<ColKey, ColVal>> as depicted in the diagrams below.
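The Map<RowKey, SortedMap<ColKey, ColVal>> shape can be mimicked with plain Python dictionaries. The sketch below is only a mental model, with a hypothetical ColumnFamily class and placeholder book values; it says nothing about how Cassandra lays data out on disk.

```python
import time


class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColKey, ColVal>>."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name: (value, timestamp)}

    def insert(self, row_key, columns, timestamp=None):
        """Store the given columns for a row, stamping each with a write time."""
        if timestamp is None:
            timestamp = time.time_ns() // 1000  # microseconds since the epoch
        row = self.rows.setdefault(row_key, {})
        for column, value in columns.items():
            row[column] = (value, timestamp)

    def get(self, row_key):
        """Return the row's columns, ordered by column name."""
        row = self.rows.get(row_key, {})
        return {column: row[column][0] for column in sorted(row)}


books = ColumnFamily("books")
# Rows in the same column family do not need the same set of columns.
books.insert("BOOK1", {"title": "Example Title", "author": "Example Author"})
books.insert("BOOK2", {"title": "Another Title", "isbn": "0000000000"})
print(books.get("BOOK1"))  # {'author': 'Example Author', 'title': 'Example Title'}
print(books.get("BOOK2"))  # {'isbn': '0000000000', 'title': 'Another Title'}
```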

Illustration of column-family

1.4 How Cassandra Compares With Relational Databases

Cassandra and traditional RDBMS like MySQL are quite different in their core aspects: data models, scalability, consistency models, schema flexibility, data distribution strategies, and typical use cases.

Cassandra uses a distributed, wide-column NoSQL model with horizontal scalability and tunable consistency. On the other hand, RDBMS relies on a tabular model with vertical scalability and strong consistency.

Cassandra's schema flexibility and decentralised architecture suit applications needing high availability, fault tolerance, and scalable performance, such as real-time analytics and IoT data processing. In contrast, RDBMS are better suited for transactional applications that require strong consistency, complex queries, and multi-table joins, like e-commerce systems and financial applications.

Ultimately, the choice between Cassandra and RDBMS hinges on the specific needs of the application.

  • Relational databases are designed schema first. Cassandra databases, by contrast, are designed query first: you need to know which queries will be executed before defining your schema.
  • Relational databases have a fixed schema in which every row carries every column, holding NULL when there is no value. Cassandra has a flexible schema in which a row only stores the columns that were actually written.
  • RDBMS supports foreign keys and joins, whereas there is no concept of joins in Cassandra.
  • RDBMS provides strong support for ACID, whereas Cassandra provides weak ACID support.
  • Data is stored in tabular form in a relational database. A nested map is a more accurate analogy for how data is stored in Cassandra.
  • In relational databases like MySQL and PostgreSQL, any column can be used to fetch data. In Cassandra, the where clause of a query generally has to use the primary key (at a minimum, the partition key part of it).
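The query-first approach from the list above usually means storing the same data more than once, each copy keyed for one query. Here is a minimal Python sketch of that idea, using dictionaries as stand-ins for two Cassandra tables; the function name and the two queries are assumptions picked for illustration.

```python
from collections import defaultdict


def build_query_tables(products):
    """Store the same products twice, once per query we want to serve."""
    product_by_id = {}
    products_by_brand = defaultdict(list)
    for product in products:
        # Query 1: "fetch one product by its id" -> keyed by product id.
        product_by_id[product["product_id"]] = product
        # Query 2: "list all products of a brand" -> keyed by brand.
        products_by_brand[product["brand"]].append(product)
    return product_by_id, dict(products_by_brand)


products = [
    {"product_id": "MOB1", "brand": "Samsung", "model_id": "GalaxyS6"},
    {"product_id": "COM1", "brand": "Acer", "title": "Acer One"},
]
product_by_id, products_by_brand = build_query_tables(products)
print(product_by_id["MOB1"]["brand"])                        # Samsung
print([p["product_id"] for p in products_by_brand["Acer"]])  # ['COM1']
```

Each lookup is a direct hit on a key, which mirrors why Cassandra wants the key in the where clause: without it, answering the question would mean scanning every row.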

1.5 When To Choose Cassandra

Cassandra offers several advantages that make it a top choice for building applications. If you want to build an application that is scalable, highly available, fault tolerant, and performant, Cassandra is a strong candidate. It is an especially logical choice for write-heavy applications.

Some common use cases where Cassandra excels include:

  • Real-Time Analytics: Cassandra is ideal for real-time analytics applications that require processing large volumes of data with low latency. It can handle high write throughput, making it suitable for capturing and analyzing streaming data, event logs, and sensor data.
  • Messaging Platforms: Cassandra can serve as the backend storage for messaging platforms, chat applications, and social networks. It can handle high write throughput and many concurrent users, ensuring fast and reliable message delivery and retrieval. Discord, for example, relied heavily on Cassandra before migrating to ScyllaDB (the migration is a different story altogether).
  • Time-Series Data: Cassandra's support for wide rows and efficient storage of time-series data makes it suitable for applications that store and analyze time-stamped data, such as financial transactions, log data, and monitoring metrics. Companies such as Apple, Twitter, and Spotify use Cassandra to store and analyze time-series data related to user activity, transaction logs, and system monitoring.
  • Content Management Systems (CMS): Cassandra's linear scalability and high availability make it suitable for content management systems that serve large numbers of concurrent users. It can handle the storage and retrieval of diverse content types, such as articles, images, videos, and user-generated content.
  • Recommendation Engines: Cassandra can power recommendation engines that analyze user behavior, preferences, and interactions to provide personalized recommendations. Its ability to store and query large datasets in real time enables the delivery of relevant content, products, or services to users. Netflix is a popular example in this category.
  • Internet of Things (IoT): With its ability to handle massive amounts of time-series data and support for geographically distributed deployments, Cassandra is well-suited for IoT applications. It can store sensor data, telemetry data, and device logs efficiently while providing real-time insights and analytics.
  • Inventory Management and E-commerce: Cassandra can power inventory management systems and e-commerce platforms that require handling large product catalogs, order processing, and inventory tracking. It provides high availability and scalability to support peak traffic and seasonal fluctuations.

1.6 When Not to Choose Cassandra

Cassandra's biggest advantage is also its biggest weakness. No, this statement is not taken from a typical Bollywood movie; it is, in fact, true. Due to its highly distributed nature, Cassandra offers weak ACID support.

Although Cassandra offers tunable consistency levels, it is primarily an eventually consistent database, which can lead to stale reads. For a system to be strictly consistent, all nodes must stay in sync at all times. With nodes placed far apart, sometimes across continents, that kind of constant syncing is hard to guarantee, especially when you are serving thousands of requests per second. On top of that, Cassandra offers node-level atomicity but does not roll back a write that succeeds on one node and fails on another. Depending on your configuration, the cluster can therefore be in an inconsistent state for a while, with nodes holding conflicting data. In short, if your application needs strong ACID guarantees, refrain from using Cassandra.
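The "tunable" part of tunable consistency comes down to simple arithmetic: if the number of replicas that acknowledge a write (W) plus the number consulted on a read (R) is greater than the replication factor (RF), every read set overlaps every write set, so a read reaches at least one replica that has the latest successful write. The Python sketch below only illustrates that rule; the function names are hypothetical.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                        # 2
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))  # True: quorum writes + quorum reads
print(reads_see_latest_write(rf, 1, 1))                  # False: a read may hit a stale replica
```

Writing and reading with a single replica is the fastest combination and the one most exposed to staleness; quorum on both sides trades some latency and availability for fresher reads.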

1.7 Installing Cassandra

Now that the basic terminology and use cases are clear, we can go ahead and install Cassandra on our system. To run Apache Cassandra locally, go to the official website and download the latest stable GA version. The website provides a compressed archive, which we need to extract in the location we want. It is advisable to rename the extracted folder to something simpler, say cassandra. We then need a couple of configuration steps before we are good to go. First, navigate to the cassandra folder and create two directories: a data directory, where Cassandra stores its data, and cassandraLogs, which can live anywhere as long as its location is set in the configuration file logback.xml. This configuration file is located in the directory cassandra/conf. Replace the variable ${cassandra.logdir} inside logback.xml with the location of the cassandraLogs directory; this is where the database logs will be stored. The final step is to add the following two statements to the file .bash_profile (on a Mac) or .bashrc (on Linux).

export CASSANDRA_HOME=$HOME/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin        

As with any change to .bash_profile, make sure to run source ~/.bash_profile or restart the terminal. Cassandra is now set up on our system. Use the commands below to start the Cassandra server and connect to its cluster.

cassandra -f    # start the Cassandra server in the foreground
cqlsh           # connect to the Cassandra cluster (from a second terminal)
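If cqlsh refuses to connect, it can help to check whether the server is listening at all. The Python sketch below is an optional convenience, not part of the Cassandra tooling: it assumes the server runs on the local machine and uses the default client port 9042, and the function name is made up.

```python
import socket


def cassandra_is_listening(host="127.0.0.1", port=9042, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    if cassandra_is_listening():
        print("Port 9042 is open, cqlsh should be able to connect.")
    else:
        print("Nothing is listening on port 9042 yet; is the server still starting?")
```

An open port only shows that a process is accepting connections there; cqlsh remains the real test that the cluster is usable.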

1.8 Cassandra Cluster Manager

Cassandra Cluster Manager, or CCM, is a powerful tool that makes it easy to create, destroy, and manage a Cassandra cluster on a local machine. CCM is extremely useful for testing a Cassandra-based application locally; it is not recommended for production databases. To start using ccm locally, download the zip file from the Apache Cassandra ccm GitHub repository and unzip it at your desired location.

1.8.1 Important ccm commands

# Create a cluster named test that runs Cassandra version 4.1.3
ccm create test -v 4.1.3

# Add 3 nodes to the cluster
ccm populate -n 3

# Add a loopback alias for node 2 (needed on macOS; Linux already routes all of 127.0.0.0/8 to loopback)
sudo ifconfig lo0 alias 127.0.0.2

# Add a loopback alias for node 3 (macOS only, as above)
sudo ifconfig lo0 alias 127.0.0.3

# List all the available clusters
ccm list

# Check the status of the nodes in the cluster
ccm status

# Start the cluster; you cannot start a cluster with no nodes
ccm start

# Stop the cluster
ccm stop

# Display node-specific information
ccm node1 show

# Destroy the cluster
ccm remove

1.9 Cassandra Query Language

Cassandra Query Language, or CQL, is the query language that users leverage to communicate with a Cassandra cluster. Programmers work with CQL either through cqlsh, an interactive prompt, or through the driver for their application's programming language. A client can approach any node for its read and write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.

CQL queries are structurally very similar to SQL, with some minor adjustments. This similarity is intentional, so that a person switching sides from SQL to NoSQL can adapt quickly. Even the data types are much the same as in the SQL world: text, varchar, int, bigint, float, double, decimal, set, list, map, and many more.

Let's say we own a shop containing a wide range of products and we want to manage all the inventory details using a database. Our choice in this case is Cassandra. Why, you ask? Because it is an article on Cassandra, silly (:P). Below is a sequential list of CQL queries that demonstrate the syntax. Before trying these queries yourself, make sure your Cassandra server is up and running and you are connected to your cluster.

// Create a keyspace
create KEYSPACE catalog WITH replication={'class':'SimpleStrategy', 'replication_factor':'3'};

// View all keyspaces
describe keyspaces;

// Move into your keyspace
use catalog;

// View all column families (tables) in the current keyspace
describe tables;

// Create a column family
CREATE TABLE IF NOT EXISTS product (productId VARCHAR PRIMARY KEY, title TEXT, brand VARCHAR, publisher VARCHAR, length INT, breadth INT, height INT);

// Add a column
alter columnfamily product add modelId text;

// Modify column property
alter columnfamily product with gc_grace_seconds = 86400;
	 
// Insert data into column family
insert into product(productId, brand, modelId) values('MOB1', 'Samsung', 'GalaxyS6');

// Insert a row with an explicit write timestamp (CQL reads it as microseconds since the epoch)
insert into product(productId, title, length, breadth) values('POST2', 'Led Zeppelin', 22, 36) using timestamp 1468580580000;

// TTL = Time to Live (entry gets deleted after a day)
insert into product(productId, title, breadth, length) values('POST2', 'adele', 22, 36) using TTL 86400;

// Get data using '=' operator
select * from product where productId = 'BOOK1';

// Get multiple rows using 'in' operator
select productId, title, modelId from product where productId in ('POST1', 'MOB1');

// Add a column of data type set
alter columnfamily product add keyfeatures set<text>;

// Insert data into set
insert into product(productId, title, brand, keyfeatures) values('COM1', 'Acer One', 'Acer', {'detachable keyboard', 'multitouch display'});	  

// Add a column of data type list
alter columnfamily product add service_type list<text>;

// Insert data into list
insert into product(productId, title, brand, service_type) values('SOFA1', 'Urban Living Derby Sofa', 'Urban Living', ['Needs to Call Seller', 'Service Engineer will come to the site']);

// Add a column of data type map
alter columnfamily product add camera map<text, text>;

// Insert data into map
insert into product(productId, title, brand, camera) values ('COM1', 'Acer One', 'Acer', {'front': 'VGA', 'rear':'2MP'});

// Counter stores incremental data
create columnfamily productviewcount(productId text primary key, viewcount COUNTER);

// Update or insert a counter
update productviewcount set viewcount = viewcount + 1 where productId = 'COM1';

// Modify existing field
update product set modelId = 'S6' where productId = 'MOB1';

// Modify multiple fields in one go
update product set length=25, breadth=18, height=2 where productId in ('COM1', 'SOFA1');

// Add a list item
update product set service_type = service_type + ['element'] where productId = 'SOFA1';

// Remove a list item
update product set service_type = service_type - ['element'] where productId = 'SOFA1';

// Replace a list item (the index must already exist)
update product set service_type[1] = 'element' where productId = 'SOFA1';

// Add a set item
update product set keyfeatures = keyfeatures + {'Flat 10% off'} where productId = 'COM1';

// Remove a set item
update product set keyfeatures = keyfeatures - {'Flat 10% off'} where productId = 'COM1';

// Discount gets removed after a day
update product using TTL 86400 set keyfeatures = keyfeatures + {'Flat 10% off for 1 day'} where productId = 'COM1';

// Add a map entry
update product set camera = camera + {'periscope': '10MP'} where productId = 'COM1';

// Update map value
update product set camera['periscope'] = '8MP' where productId = 'COM1';

// Remove map item
delete camera['periscope'] from product where productId = 'COM1';

// Add a role with super user access
create role appUser WITH password = 'password' and LOGIN = true and superuser = true;

// Add a role without super user access
create role dummy with password = 'password1' and LOGIN = true;

// Drop a role
drop role dummy;

// List all roles
list roles;        
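Two of the clauses used above, using timestamp and using TTL, are easier to reason about with a small model. The Python sketch below is a simplification under stated assumptions, not Cassandra's storage engine: the Row class is hypothetical, the write timestamp doubles as the clock that starts the TTL, and a write with an equal timestamp simply replaces the stored value (Cassandra has its own tie-breaking rule).

```python
class Row:
    """Toy model of a row whose cells carry a write timestamp and an optional TTL."""

    def __init__(self):
        self.cells = {}  # column name -> {"value", "timestamp", "expires_at"}

    def write(self, column, value, timestamp, ttl=None):
        """Apply a write; a cell written with a higher timestamp wins."""
        current = self.cells.get(column)
        if current is not None and current["timestamp"] > timestamp:
            return False  # a newer value is already stored, ignore this write
        # Timestamps are in microseconds, TTLs in seconds.
        expires_at = None if ttl is None else timestamp + ttl * 1_000_000
        self.cells[column] = {
            "value": value,
            "timestamp": timestamp,
            "expires_at": expires_at,
        }
        return True

    def read(self, column, now):
        """Return the value visible at time `now`, or None if absent or expired."""
        cell = self.cells.get(column)
        if cell is None:
            return None
        if cell["expires_at"] is not None and now >= cell["expires_at"]:
            return None
        return cell["value"]


row = Row()
row.write("title", "Led Zeppelin", timestamp=2_000_000)
row.write("title", "adele", timestamp=1_000_000)   # older write, ignored
print(row.read("title", now=3_000_000))            # Led Zeppelin

row.write("breadth", 22, timestamp=5_000_000, ttl=86400)
print(row.read("breadth", now=5_000_001))          # 22
print(row.read("breadth", now=86_405_000_000))     # None, one day later
```

This is why the order in which two inserts for the same key arrive matters less than the timestamps they carry: the write with the highest timestamp is the one readers end up seeing.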


