The Whole Ramayan of Cassandra DB

This post is for readers who have basic prior knowledge of RDBMS and NoSQL and are planning to set up Cassandra DB in a DEV, QA, or PROD env.

I love this NoSQL DB simply because it is open source, distributed, a wide-column store with a flexible schema, supports easy replication, has a simple API, is eventually consistent, and has the capacity to handle large amounts of data. I have been following it for quite some time, but only recently got the chance to work on it live. Here I am trying to share my views from working with such a great and complex DB.

A Cassandra cluster does not need a load balancer: the nodes share cluster state with each other through the gossip protocol, and client drivers discover the nodes and spread requests across them.
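To get a feel for how gossip spreads cluster state without any central component, here is a minimal toy simulation. It is only a sketch under simplifying assumptions: each node contacts one random peer per round and both keep the newest version they have seen for every node. The function names and node names are made up for illustration, and Cassandra's real gossip implementation is more involved.

```python
import random


def gossip_round(states, rng):
    """Each node contacts one random peer; both keep the newest version per node."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = dict(states[node])
        for name, version in states[peer].items():
            if version > merged.get(name, -1):
                merged[name] = version
        # Push-pull exchange: both sides end up with the merged view.
        states[node] = dict(merged)
        states[peer] = dict(merged)


def rounds_until_converged(node_names, seed=0, max_rounds=100):
    """Return the round in which every node knows about every other node."""
    rng = random.Random(seed)
    # Initially every node only knows its own heartbeat version.
    states = {n: {n: 1} for n in node_names}
    for round_no in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(len(view) == len(node_names) for view in states.values()):
            return round_no
    return None


if __name__ == "__main__":
    # Hypothetical node names, used only for this demo.
    print(rounds_until_converged(["n1", "n2", "n3", "n4", "n5"]))
```

Even in this toy version, knowledge of every node reaches the whole cluster after a handful of rounds, which is why no separate coordinator is needed for membership.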

Cassandra is a NoSQL DB recognised for fast writes. It belongs to the column-oriented database family. Its direct sibling is ScyllaDB. HBase is also a column-oriented DB, but it is known for faster reads. HBase works on the Hadoop platform.

In a nutshell, ScyllaDB is the actual counterpart of Cassandra; you may use it in place of Cassandra.

##Let's start with installation:

--Install Java 8

--Install Python 3 (check which Python version your Cassandra version needs: Cassandra 4 with Python 3, Cassandra 3 with Python 2 or Python 3)

It is better to check for cqlshlib with the command below

>find /usr/lib -name cqlshlib

Create a symlink for python if it is not present: sudo ln -s /usr/bin/python3 /usr/bin/python

There are 3 ways to install Cassandra (cluster or single node):

1. Download from the site and configure Cassandra as per your needs

2. Package installer (yum on Linux, apt on Ubuntu, etc.)

3. Docker platform

My recommendation is to install through the yum/apt installer, because the first option requires Linux/Unix admin knowledge (James Bond :)).

There are many file permission changes you have to take into consideration. I tried both options and found that option 2 works well with few changes. These installations can be on premises or in the cloud, as a single node or a cluster.

Docker installation is a good option for a beginner to learn with. It can be useful for real environments too, but in some cases it adds latency (for example with OpenTracing in microservices). So if you are setting up a PROD env, it is better to stick with physical nodes.

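##Managed service: AWS

AWS provides a platform to run Cassandra on but does not operate Apache Cassandra itself as a managed service (Amazon Keyspaces is a Cassandra-compatible service, not a managed Cassandra cluster). Perhaps AWS product management and sales don't see enough demand to justify building a team to operate Cassandra as a managed service.

So it is better to use EC2 machines :)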

##Cassandra installation commands (search online for more):

sudo wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -

sudo sh -c 'echo "deb https://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'

>sudo apt update

>sudo apt install cassandra -y

#Open the security ports below

----------------------------

7000 inter-node communication on the cluster (not used if TLS is enabled)

7001 inter-node communication (used if TLS is enabled)

9160 Thrift client

9042 CQL native transport, the port for CQL clients

7199 JMX (used by nodetool, e.g. for backup/snapshot)
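Once the ports are opened, it helps to verify from another machine that they are actually reachable. Below is a small sketch using only the Python standard library; the helper names `port_is_open` and `check_ports` are my own and not part of Cassandra, and the default host 127.0.0.1 is just a placeholder for your node's address.

```python
import socket

# Ports from the list above; the labels are descriptive only.
CASSANDRA_PORTS = {
    7000: "inter-node (no TLS)",
    7001: "inter-node (TLS)",
    9160: "Thrift client",
    9042: "CQL native transport",
    7199: "JMX",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=CASSANDRA_PORTS):
    """Map each port to True/False depending on whether it is reachable."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # Replace 127.0.0.1 with the address of the node you want to check.
    for port, is_open in check_ports("127.0.0.1").items():
        label = CASSANDRA_PORTS[port]
        print(f"{port:5d} {label:24s} {'open' if is_open else 'closed'}")
```

A closed port here can simply mean the feature is not enabled (for example 9160 when Thrift is off, or 7001 without TLS), so read the result together with your configuration.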

##Default folders after Cassandra installation:

/usr/sbin/cassandra

/etc/cassandra

/usr/share/cassandra

/var/lib/cassandra/hints

/var/lib/cassandra/commitlog

/var/lib/cassandra/saved_caches

/var/lib/cassandra/data

There are two major commands for working with Cassandra: nodetool and cqlsh. The first is for Cassandra cluster management and the second is the DB client. Check all the nodetool command options.

If you run any of these commands and get a Python error, it may help to remove the lib below:

cassandra-3.11.11/lib/six-1.7.3-py2.py3-none-any.zip


##Configuration:

Now we come to Cassandra's first config file, cassandra.yaml. This file is huge, but the major properties needed to work with a Cassandra cluster or a single node in any env are listed below. You can change the hints, commitlog, saved_caches, and data folder locations as per your requirements, but be careful with file permissions and ownership.

Check the endpoint snitch values in the yaml file or on the official website and use whichever applies to your case.

For example, for an AWS 3-node cluster: RackInferringSnitch; for a single node: SimpleSnitch. Also check whether you have a multi-region cluster installation.

For seeds, add one node per region, comma delimited, or within a single region add all nodes, depending on your requirements. Seed nodes are used for bootstrap.

Ex: seeds: 3.144.8.18,3.144.8.19,3.144.8.20
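The seed rule above can be sketched as a tiny helper that builds the comma-delimited value from a map of region to node addresses. This is only an illustration: the function `build_seed_list` and the region names are hypothetical, the second region's addresses come from the reserved documentation range, and the output is simply the string you would paste into the seeds setting.

```python
def build_seed_list(nodes_by_region, seeds_per_region=1):
    """Pick the first N nodes of each region and join them with commas."""
    seeds = []
    for region in sorted(nodes_by_region):
        seeds.extend(nodes_by_region[region][:seeds_per_region])
    return ",".join(seeds)


if __name__ == "__main__":
    # Single region: list all nodes as seeds (region name is hypothetical).
    single = {"region-a": ["3.144.8.18", "3.144.8.19", "3.144.8.20"]}
    print(build_seed_list(single, seeds_per_region=3))

    # Multi region: one seed per region (second region is a made-up example).
    multi = {
        "region-a": ["3.144.8.18", "3.144.8.19", "3.144.8.20"],
        "region-b": ["203.0.113.10", "203.0.113.11"],
    }
    print(build_seed_list(multi, seeds_per_region=1))
```

Whatever list you end up with, keep it identical on every node so that all nodes bootstrap against the same seeds.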

The other config files are cassandra-env.sh and cassandra-rackdc.properties, which are described in the official documentation or on the DataStax site.

cassandra.yaml:

------------------

cluster_name: 'Testcluster01'

data_file_directories:
    - /home/cassandra/casdb/data

commitlog_directory: /home/cassandra/casdb/commitlog

saved_caches_directory: /home/cassandra/casdb/saved_caches

commitlog_segment_size_in_mb: 128

seeds: "3.144.8.18" # nested under seed_provider > parameters in the actual file

listen_address: 3.144.8.18

start_rpc: true

rpc_address: 3.144.8.18

thrift_framed_transport_size_in_mb: 100

read_request_timeout_in_ms: 600000

range_request_timeout_in_ms: 600000

write_request_timeout_in_ms: 600000

truncate_request_timeout_in_ms: 600000

request_timeout_in_ms: 600000

endpoint_snitch: RackInferringSnitch #all nodes

batch_size_warn_threshold_in_kb: 3000

batch_size_fail_threshold_in_kb: 5000

num_tokens: 256

hints_directory: /home/cassandra/local/apps/casdb

max_hints_file_size_in_mb: 128

column_index_cache_size_in_kb: 16

key_cache_size_in_mb: 100000


Once installation is complete, try running cqlsh 3.144.8.18 and you will get the Cassandra DB console prompt:

cqlsh>

##Play with the DB

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};

use test;

DESCRIBE keyspaces;

SELECT * FROM system_schema.keyspaces;

CREATE TABLE test.emp(emp_id int PRIMARY KEY, emp_name text, emp_city text,emp_sal varint,emp_phone varint );

DESCRIBE tables;

INSERT INTO emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) values(1,'Noida','chandra',123456789, 60000);

select * from emp;
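The keyspace above uses SimpleStrategy with a replication_factor of 1. To picture what that means, here is a simplified sketch of how a partition key such as emp_id could be mapped onto a ring of nodes, with the replicas taken clockwise from the key's position. It is a toy model under stated assumptions: MD5 stands in for Cassandra's default Murmur3 partitioner, each node owns a single token instead of the num_tokens: 256 virtual nodes configured earlier, and the `Ring` class is my own naming.

```python
import bisect
import hashlib


def token_for(key):
    """Map a partition key to an integer token (MD5 here, as a stand-in)."""
    return int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # One token per node for simplicity; real clusters use many per node.
        self.entries = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.entries]

    def replicas(self, key, replication_factor=1):
        """Walk clockwise from the key's token and take the next RF nodes."""
        count = len(self.entries)
        rf = min(replication_factor, count)
        start = bisect.bisect_left(self.tokens, token_for(key)) % count
        return [self.entries[(start + i) % count][1] for i in range(rf)]


if __name__ == "__main__":
    ring = Ring(["3.144.8.18", "3.144.8.19", "3.144.8.20"])
    for emp_id in (1, 2, 3):
        print(emp_id, ring.replicas(emp_id, replication_factor=1))
```

With a replication factor of 1 each row lives on exactly one node, so losing that node means losing access to its rows; raising the factor makes the walk around the ring pick additional nodes.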

## Data repair among nodes (nightly job). This command can be run from cron on all nodes.

nodetool repair
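If repair runs from cron on every node, one option is to stagger the start times so that the nodes do not all repair at the same moment. The sketch below generates one crontab line per node; the helper name `staggered_repair_cron`, the 01:00 start, and the two-hour gap are assumptions for illustration, not recommended values.

```python
def staggered_repair_cron(nodes, start_hour=1, gap_hours=2, command="nodetool repair"):
    """Return a crontab line per node, each starting gap_hours after the previous."""
    lines = {}
    for index, node in enumerate(nodes):
        hour = (start_hour + index * gap_hours) % 24
        # Crontab fields: minute hour day-of-month month day-of-week command
        lines[node] = f"0 {hour} * * * {command}"
    return lines


if __name__ == "__main__":
    nodes = ["3.144.8.18", "3.144.8.19", "3.144.8.20"]
    for node, line in staggered_repair_cron(nodes).items():
        print(f"# crontab entry for {node}")
        print(line)
```

Each generated line goes into the crontab of the node it is listed under.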

## Test your DB performance (benchmark)

Yahoo's YCSB tool can be used to check performance. It provides different kinds of load, such as workloada, workloadb, and workloadc. By running these workloads we can benchmark our cluster and tune the yaml config file accordingly.

https://github.com/brianfrankcooper/YCSB

#Heavy workload: 50/50 mix of reads and writes

Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloada

Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloada

#Read-mostly workload: 95/5 mix of reads and writes

Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadb

Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadb

#Read-only workload: 100% reads

Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadc

Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadc
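The six commands above follow one pattern: a load phase and then a run phase per workload. A small sketch that builds the same command lines and can optionally execute them from the YCSB checkout is shown below. The function names and the `ycsb_dir` parameter are my own; only the arguments already used above are passed to ycsb.

```python
import subprocess

WORKLOADS = ("workloada", "workloadb", "workloadc")


def ycsb_commands(workload, hosts="127.0.0.1", binding="cassandra-cql"):
    """Build the load and run commands for one workload as argument lists."""
    commands = []
    for phase in ("load", "run"):
        commands.append([
            "./bin/ycsb", phase, binding,
            "-p", f"hosts={hosts}",
            "-s",
            "-P", f"workloads/{workload}",
        ])
    return commands


def run_benchmarks(ycsb_dir, hosts="127.0.0.1"):
    """Run load + run for every workload; ycsb_dir is the YCSB checkout path."""
    for workload in WORKLOADS:
        for command in ycsb_commands(workload, hosts=hosts):
            subprocess.run(command, cwd=ycsb_dir, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_benchmarks() to execute them.
    for workload in WORKLOADS:
        for command in ycsb_commands(workload):
            print(" ".join(command))
```

Printing first and running second makes it easy to review the exact commands before putting load on a cluster.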

##Backup/restore:

incremental_backups: true

>nodetool snapshot
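A snapshot is cheap because it creates hard links to the existing SSTable files in a snapshots folder instead of copying them. The sketch below imitates that idea on a throwaway directory so you can see the effect; it does not touch Cassandra, the function `snapshot_table_dir` and the file name demo-Data.db are made up, and real snapshots should always be taken with nodetool snapshot.

```python
import os
import tempfile


def snapshot_table_dir(table_dir, tag):
    """Hard-link every regular file in table_dir into table_dir/snapshots/<tag>/."""
    target = os.path.join(table_dir, "snapshots", tag)
    os.makedirs(target, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(table_dir)):
        source = os.path.join(table_dir, name)
        if os.path.isfile(source):
            os.link(source, os.path.join(target, name))
            linked.append(name)
    return target, linked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as table_dir:
        # Stand-in for an SSTable file; the name is only illustrative.
        sstable = os.path.join(table_dir, "demo-Data.db")
        with open(sstable, "w") as handle:
            handle.write("rows")
        target, linked = snapshot_table_dir(table_dir, "nightly")
        print(linked, os.path.samefile(sstable, os.path.join(target, linked[0])))
        # Removing the original leaves the snapshot's link (and data) in place.
        os.remove(sstable)
        print(os.path.exists(os.path.join(target, "demo-Data.db")))
```

Because the links share disk blocks with the original files, a snapshot takes almost no extra space at first, but it keeps old files alive after compaction removes the originals, so old snapshots need to be cleared from time to time.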

##GUI tool:

git clone https://github.com/Kindrat/cassandra-client.git

##Troubleshooting:

Below are a few errors you may see when you try to change or add a cluster name, add a node, remove a node, or carry out day-to-day tasks.

Errors:

--Unable to start Cassandra: "node already exists"

--replacing a node without bootstrap risks invalidating consistency guarantee as the..

--Cannot start node if snitch's data center (137) differs from previous data center (datacenter1)

My recommendation is to look into cassandra-env.sh and its different options (-Dcassandra.****).

Ex: Cannot start node if snitch's data center (137) differs from previous data center (datacenter1)

JVM_OPTS="$JVM_OPTS -Dcassandra.ignore_dc=true"

If it starts successfully, execute:

nodetool repair

nodetool cleanup

After a successful operation you may comment out the JVM_OPTS line above.

##Topics to explore further if you are interested##

Cassandra migration

Cassandra format

Commit log

Memtable/SSTable

Management tools: DataStax OpsCenter, SPM, nodetool utility

Partitioners in Cassandra

Gossip Protocol

Token

Snitch

Compaction

Merkle Tree

Anti-Entropy

Tombstone

Bloom Filters

Heap Memory

Exceptions Count

Key Cache Hit Rate

Read/Write Latency

Python Stress test in Cassandra

Benchmarking tools (YCSB, siege, stress tool)

Snapshots: backup that creates hard links to SSTables in the snapshots folder

Materialized Views: new in Cassandra 3

Quorum consistency, repair


I know that Cassandra is a very big and complex topic, and at the very initial stage it is hard to understand. I have just tried to give a direction and a high-level path to all enthusiasts who love to work with Cassandra.

For any query or suggestion, contact [email protected]
