Cassandra Overview: From Zero To Hero
Saubhagya Gaurav
Software entrepreneur | Co-founder and ex-CTO @ My Analytics School | Ad Technology | Expedia
1.1 Introduction
If you come from the software business, you have probably worked with, or at least heard of, Cassandra. It's a catchy name, isn't it? It sounds like royalty, and no surprise there: in Greek mythology Cassandra was a princess, the most beautiful of the daughters of Priam, the last king of Troy. Maybe the name was inspired by her, but here we are not talking about the princess. Sorry to disappoint ^^
Cassandra, in the software world, is a powerful NoSQL database designed to handle vast amounts of data. It is popular for its highly distributed and scalable architecture: data in Cassandra is spread across multiple servers called nodes. The nodes in Cassandra are similar to your typical friends, gossiping all the time. No kidding! The nodes constantly exchange state information about themselves and about the other nodes they know of, so much so that the mechanism is called gossiping (the Gossip Protocol). Cassandra operates on a peer-to-peer basis, unlike many traditional SQL databases, which operate on a master-slave setup.
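To make the gossiping idea concrete, here is a minimal sketch in Python. It is not Cassandra's actual implementation: real gossip picks peers at random and exchanges richer state, while this toy version (with made-up names such as merge and gossip_round) lets each node talk to its next neighbour so that the outcome is repeatable.

```python
from typing import Dict, List


def merge(mine: Dict[str, int], theirs: Dict[str, int]) -> Dict[str, int]:
    """Keep the highest heartbeat version seen for every node."""
    merged = dict(mine)
    for node, version in theirs.items():
        if version > merged.get(node, -1):
            merged[node] = version
    return merged


def gossip_round(views: Dict[str, Dict[str, int]], order: List[str]) -> None:
    """Each node swaps its view with the next node in the list.

    Assumption: a fixed neighbour replaces the random peer selection
    used by real gossip protocols, purely to keep the sketch deterministic.
    """
    for i, node in enumerate(order):
        peer = order[(i + 1) % len(order)]
        if peer == node:
            continue
        combined = merge(views[node], views[peer])
        views[node] = dict(combined)
        views[peer] = dict(combined)


nodes = ["node1", "node2", "node3", "node4"]
# At the start every node only knows about itself (heartbeat version 1).
views = {name: {name: 1} for name in nodes}

rounds = 0
while any(len(view) < len(nodes) for view in views.values()):
    rounds += 1
    gossip_round(views, nodes)

print(f"every node knows the whole cluster after {rounds} round(s)")
```

The point of the sketch is that no node is special: knowledge about the cluster spreads through pairwise exchanges until every view agrees.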
1.2 Components of Cassandra
1.3 How Data is Stored?
Data in a Cassandra database is distributed over several machines called nodes that operate in conjunction. Cassandra arranges the nodes of a cluster in a ring and assigns data to them. Since data is replicated across several nodes, it stays accessible even if one of the nodes fails, which avoids a single point of failure. When a user inserts a new row, the data goes to at least one of these nodes and is replicated to a certain number of nodes (decided by the replication factor, discussed later in this section). When a user requests a row, the node that receives the request works out which nodes hold that data, fetches it from them and returns it to the user. Interestingly, each node can handle requests for any data, even if it doesn't actually hold it. All of this is managed internally by the database. That's the beauty of it. The nodes in the cluster are aware of their distributed setup, communicating constantly with each other as if they were gossiping.
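The ring idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Cassandra's real code: it gives each node a single token, uses MD5 where Cassandra's default partitioner uses Murmur3, and the Ring class and its replicas method are hypothetical names. Replicas are chosen by walking clockwise from the key's position, which is roughly what SimpleStrategy does.

```python
import bisect
import hashlib
from typing import List, Tuple


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: one token per node, sorted by token value."""

    def __init__(self, nodes: List[str]) -> None:
        self._ring: List[Tuple[int, str]] = sorted((token(n), n) for n in nodes)
        self._tokens: List[int] = [t for t, _ in self._ring]

    def replicas(self, partition_key: str, replication_factor: int) -> List[str]:
        """Return the nodes that would store the given key."""
        size = len(self._ring)
        rf = min(replication_factor, size)
        # First node whose token is >= the key's token, wrapping around at the end.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % size
        return [self._ring[(start + i) % size][1] for i in range(rf)]


ring = Ring(["node1", "node2", "node3", "node4", "node5"])
owners = ring.replicas("BOOK1", replication_factor=3)
print("BOOK1 is stored on:", owners)

# If the first replica goes down, the row is still readable from the others.
alive = [node for node in owners if node != owners[0]]
print("still available on:", alive)
```

Because every node can compute the same placement from the key alone, any node can act as the entry point for a request.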
Just as a database is the first point of entry in the relational world, in Cassandra it is the keyspace. A keyspace is the outermost container for data in Cassandra. To create a keyspace, you need to know its basic attributes: the replication strategy class and the replication factor. Below is the syntax to create a keyspace:
create KEYSPACE my_first_keyspace WITH replication={'class':'SimpleStrategy', 'replication_factor':'3'};
A column family (called a table in CQL) is a data structure that consists of rows, each of which contains columns. Each column, or data unit, has two major components: a name and a value. A timestamp is also associated with each data unit, which tells when the data was entered or last modified. Let's say we need to create a catalog for books. For that, we need a bunch of columns with names such as id, title, author, isbn and publisher. For each data entry, these columns have their corresponding values, and each data entry is identified by a row key, which is its primary key. So in total, Cassandra stores its data in the form of Map<RowKey, SortedMap<ColKey, ColVal>> as depicted in the diagrams below.
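The Map<RowKey, SortedMap<ColKey, ColVal>> shape can be mimicked with plain Python dictionaries. The sketch below is only a mental model, with a hypothetical ColumnFamily class: each cell carries its write timestamp, and a write with an older timestamp does not overwrite a newer one. Letting the later write win on equal timestamps is a simplification of how Cassandra breaks ties.

```python
from typing import Any, Dict, Tuple


class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColKey, (value, timestamp)>>."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Tuple[Any, int]]] = {}

    def write(self, row_key: str, column: str, value: Any, timestamp: int) -> None:
        """Store a cell unless a newer value is already present."""
        row = self._rows.setdefault(row_key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)

    def read(self, row_key: str) -> Dict[str, Any]:
        """Return the row's columns sorted by column name."""
        row = self._rows.get(row_key, {})
        return {name: row[name][0] for name in sorted(row)}


books = ColumnFamily()
books.write("BOOK1", "title", "Cassandra Basics", timestamp=100)
books.write("BOOK1", "author", "A. Writer", timestamp=100)
books.write("BOOK1", "title", "Cassandra Basics, 2nd Edition", timestamp=200)
# A late-arriving write with an older timestamp is ignored.
books.write("BOOK1", "title", "Stale Title", timestamp=150)

print(books.read("BOOK1"))
```

Note that rows do not have to share the same set of columns: a second row could carry only an isbn column and the structure would hold just as well.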
1.4 How Cassandra Compares With Relational Databases
Cassandra and traditional RDBMS like MySQL are quite different in their core aspects: data models, scalability, consistency models, schema flexibility, data distribution strategies, and typical use cases.
Cassandra uses a distributed, wide-column NoSQL model with horizontal scalability and tunable consistency. A typical RDBMS, on the other hand, relies on a tabular model that scales mostly vertically and offers strong consistency.
Cassandra's schema flexibility and decentralised architecture suit applications needing high availability, fault tolerance, and scalable performance, such as real-time analytics and IoT data processing. In contrast, an RDBMS is better suited for transactional applications that require strong consistency, complex queries, and multi-table joins, like e-commerce systems and financial applications.
Ultimately, the choice between Cassandra and RDBMS hinges on the specific needs of the application.
1.5 When To Choose Cassandra
Cassandra offers several advantages that make it a popular choice for building applications. If you want to build an application that is scalable, highly available, fault tolerant, and performant, Cassandra is a strong candidate. It is an especially logical choice for write-heavy applications that must stay available.
Some common use cases where Cassandra excels include:
1.6 When Not to Choose Cassandra
Cassandra's biggest advantage is also its biggest weakness. No, this statement is not taken from a typical Bollywood movie; it is, in fact, true. Due to its highly distributed nature, Cassandra offers weak ACID support.
Although Cassandra offers tunable consistency levels, it is primarily an eventually consistent database. This can lead to issues with data staleness. For a system to be strongly consistent, all nodes should remain in sync at all times. With nodes placed far apart, sometimes across continents, this sort of constant data syncing and communication isn't reliable, especially when you are getting thousands of requests per second. To add to that, a write is atomic only within a single partition on each replica, and Cassandra doesn't roll back a write that succeeds on one node and fails on another. Depending on your Cassandra configuration, your cluster can be in an inconsistent state, with nodes holding conflicting data. In short, if you want your application to have strong ACID support, refrain from using Cassandra.
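The "tunable" part comes down to simple arithmetic: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number of replicas read is greater than the replication factor. The Python sketch below checks that rule for a few consistency levels; the function names are made up for illustration, and only the levels ONE, TWO, THREE, QUORUM and ALL are covered.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge an operation at this level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (W + R > RF)."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


rf = 3
for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    overlap = reads_see_latest_write(write_level, read_level, rf)
    print(f"write={write_level:<6} read={read_level:<6} overlapping replicas guaranteed: {overlap}")
```

With a replication factor of 3, writing and reading at ONE is fast but may return stale data, while QUORUM on both sides trades some latency and availability for overlap.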
1.7 Installing Cassandra
Now that the basic terminology and use cases are clear, we can go ahead and install Cassandra on our system. To use Apache Cassandra locally, go to the official website and download the latest stable GA version. The website provides an archive, which we then need to extract in the location we want. It is advisable to rename the extracted folder to something simpler, say cassandra. Next, we need to perform a couple of configuration steps before we are good to go. Firstly, navigate to the cassandra folder and create two directories: one is the data directory, where Cassandra stores its data, and the second is cassandraLogs, which can essentially be anywhere, but its location needs to be updated in the configuration file logback.xml. This configuration file is located inside the directory cassandra/conf. Now, replace the variable ${cassandra.logdir} inside logback.xml with the location of the cassandraLogs directory. This location specifies where the database logs will be stored. The next and final step of the process is to add the following two statements to the file .bash_profile (if the system is a Mac) or .bashrc (if the system is Linux).
export CASSANDRA_HOME=$HOME/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin
As with any change made to .bash_profile or .bashrc, make sure to source the file (for example, source ~/.bash_profile) or restart the terminal. Cassandra is now set up on our system. Use the commands below to start the Cassandra server and connect to its cluster.
# start the Cassandra server in the foreground
cassandra -f
# connect to the Cassandra cluster (run this in a second terminal)
cqlsh
1.8 Cassandra Cluster Manager
Cassandra Cluster Manager, or CCM, is a powerful tool that makes it easy to create, destroy and manage a Cassandra cluster on a local box. CCM is extremely useful for testing a Cassandra-based application locally. It is not recommended for production databases. To start using ccm locally, you need to download the zip file from the apache cassandra ccm github and unzip it at your desired location.
1.8.1 Important ccm commands
# Create a cluster named test that runs Cassandra version 4.1.3
ccm create test -v 4.1.3
# Add 3 nodes to the cluster
ccm populate -n 3
# Add a loopback alias for node 2 (macOS)
sudo ifconfig lo0 alias 127.0.0.2
# Add a loopback alias for node 3 (macOS)
sudo ifconfig lo0 alias 127.0.0.3
# List all the available clusters
ccm list
# Check the status of the nodes in a cluster
ccm status
# Start a cluster; you cannot start a cluster with no nodes
ccm start
# Stop a cluster
ccm stop
# Display node-specific information
ccm node1 show
# Destroy a cluster
ccm remove
1.9 Cassandra Query Language
Cassandra Query Language, or CQL, is the query language that users leverage to communicate with the Cassandra cluster. Programmers work with CQL either through cqlsh, an interactive prompt, or through language-specific application drivers. Clients can approach any of the nodes for their read and write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data. CQL queries are structurally very similar to what we use in SQL, with some minor adjustments. This similarity is intentional, so that a person switching sides from SQL to NoSQL can adapt quickly. Even the data types are pretty much similar to what we have in the SQL world. Here too we have data types such as text, varchar, int, bigint, float, double and decimal, plus collection types such as set, list and map, and many more. Let's say we own a shop containing a wide range of products and we want to manage all the inventory details using a database. Our choice in this case is Cassandra. Why, you ask? Because it is an article on Cassandra, silly (:P). Below is a sequential list of CQL queries that demonstrate CQL syntax. Before taking a shot at these queries yourself, make sure your Cassandra server is up and running, and that you are connected to your cluster.
// Create a keyspace
create KEYSPACE catalog WITH replication={'class':'SimpleStrategy', 'replication_factor':'3'};
// View all keyspaces
describe keyspaces;
// Move into your keyspace
use catalog;
// View all column families in a keyspace
describe tables;
// Create a column family
CREATE TABLE IF NOT EXISTS product (productId VARCHAR PRIMARY KEY, title TEXT, brand VARCHAR, publisher VARCHAR, length INT, breadth INT, height INT);
// Add a column
alter columnfamily product add modelId text;
// Modify a column family (table) property
alter columnfamily product with gc_grace_seconds = 86400;
// Insert data into column family
insert into product(productId, brand, modelId) values('MOB1', 'Samsung', 'GalaxyS6');
// Set an explicit write timestamp on a row (microseconds since the epoch)
insert into product(productId, title, length, breadth) values('POST2', 'Led Zeppelin', 22, 36) using timestamp 1468580580000000;
// TTL = Time to Live (entry gets deleted after a day)
insert into product(productId, title, breadth, length) values('POST2', 'adele', 22, 36) using TTL 86400;
// Get data using '=' operator
select * from product where productId = 'BOOK1';
// Get multiple rows using 'in' operator
select productId, title, modelId from product where productId in ('POST1', 'MOB1');
// Add a column of data type set
alter columnfamily product add keyfeatures set<text>;
// Insert data into set
insert into product(productId, title, brand, keyfeatures) values('COM1', 'Acer One', 'Acer', {'detachable keyboard', 'multitouch display'});
// Add a column of data type list
alter columnfamily product add service_type list<text>;
// Insert data into list
insert into product(productId, title, brand, service_type) values('SOFA1', 'Urban Living Derby Sofa', 'Urban Living', ['Needs to Call Seller', 'Service Engineer will come to the site']);
// Add a column of data type map
alter columnfamily product add camera map<text, text>;
// Insert data into map
insert into product(productId, title, brand, camera) values ('COM1', 'Acer One', 'Acer', {'front': 'VGA', 'rear':'2MP'});
// Counter stores incremental data
create columnfamily productviewcount(productId text primary key, viewcount COUNTER);
// Update or insert a counter
update productviewcount set viewcount = viewcount + 1 where productId = 'COM1';
// Modify existing field
update product set modelId = 'S6' where productId = 'MOB1';
// Modify multiple fields in one go
update product set length=25, breadth=18, height=2 where productId in ('COM1', 'SOFA1');
// Add a list item
update product set service_type=service_type+['element'] where productId='SOFA1';
// Remove a list item
update product set service_type=service_type-['element'] where productId='SOFA1';
// Replace a list item
update product set service_type[1]='element' where productId='SOFA1';
// Add a set item
update product set keyfeatures=keyfeatures+{'Flat 10% off'} where productId='COM1';
// Remove a set item
update product set keyfeatures=keyfeatures-{'Flat 10% off'} where productId='COM1';
// Discount gets removed after a day
update product using TTL 86400 set keyfeatures=keyfeatures+{'Flat 10% off for 1 day'} where productId='COM1';
// Add a map entry
update product set camera=camera+{'periscope': '10MP'} where productId='COM1';
// Update map value
update product set camera['periscope']='8MP' where productId='COM1';
// Remove map item
delete camera['periscope'] from product where productId='COM1';
// Add a role with super user access
create role appUser WITH password = 'password' and LOGIN = true and superuser = true;
// Add a role without super user access
create role dummy with password = 'password1' and LOGIN = true;
// Drop a role
drop role dummy;
// List all roles
list roles;
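The TTL statements in the listing above are easier to reason about with a tiny model. The Python sketch below is an assumption-laden illustration, not how Cassandra stores data: a hypothetical TtlStore keeps an expiry time next to each value and simply hides the value once that time has passed. The clock is passed in explicitly so the behaviour is repeatable.

```python
from typing import Dict, Optional, Tuple


class TtlStore:
    """Toy key-value store where a value can expire after ttl seconds."""

    def __init__(self) -> None:
        self._cells: Dict[str, Tuple[str, Optional[int]]] = {}

    def write(self, key: str, value: str, now: int, ttl: Optional[int] = None) -> None:
        """Store a value; with a ttl it expires at now + ttl."""
        expires_at = None if ttl is None else now + ttl
        self._cells[key] = (value, expires_at)

    def read(self, key: str, now: int) -> Optional[str]:
        """Return the value, or None if it is missing or has expired."""
        cell = self._cells.get(key)
        if cell is None:
            return None
        value, expires_at = cell
        if expires_at is not None and now >= expires_at:
            return None
        return value


store = TtlStore()
one_day = 86400
store.write("COM1:discount", "Flat 10% off for 1 day", now=0, ttl=one_day)
store.write("COM1:title", "Acer One", now=0)

print(store.read("COM1:discount", now=one_day - 1))  # still visible
print(store.read("COM1:discount", now=one_day))      # expired
print(store.read("COM1:title", now=one_day))         # no TTL, still visible
```

The design point to take away is that the TTL belongs to the written value, not to the whole row: the title written without a TTL outlives the discount written with one.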