The Whole Ramayan of Cassandra DB
This post is for readers who have basic knowledge of RDBMS and NoSQL and are planning to set up Cassandra in DEV, QA, or PROD environments.
I love this NoSQL DB because it is open source, distributed, a wide-column store, schema-free, easy to replicate, has a simple API, is eventually consistent, and has the capacity to handle large amounts of data. Although I have been following it for quite some time, I only recently got the chance to work on it live. Here I share my observations from working with such a great and complex DB.
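There is no need for a load balancer in front of a Cassandra cluster: the nodes share cluster state with each other through the gossip protocol, any node can act as coordinator for a request, and the client drivers spread requests across the nodes.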
Cassandra is a NoSQL DB known for fast writes. It belongs to the column-oriented (wide-column) family of databases. Its direct sibling is ScyllaDB. HBase is also a column-oriented DB, but it is known for faster reads, and it runs on the Hadoop platform.
In a nutshell, ScyllaDB is the closest counterpart to Cassandra, and you may use it in place of Cassandra.
##Let's start with installation:
--Install Java 8
--Install Python 3 (check the pairing for your version: Cassandra 4 with Python 3, Cassandra 3 with Python 2 or Python 3)
It is better to check where cqlshlib is installed with the command below:
>find /usr/lib -name cqlshlib
If the python command is not present, create a symlink for it:
>sudo ln -s /usr/bin/python3 /usr/bin/python
There are 3 ways to install Cassandra (cluster or single node):
1. Download it from the site and configure Cassandra as per your need
2. Package installer (yum, apt, etc.)
3. Docker platform
My recommendation is to install through the yum/apt installer, because the first option requires Linux/Unix admin knowledge (James Bond :)).
There are many file permission changes you have to take into consideration. I tried both options and found that option 2 works well with only a few changes. These installations can be on premises or in the cloud, with a single node or a cluster.
A Docker installation is a good option for a novice to learn with. It can be useful for real environments too, but in some cases it adds latency (for example, open tracing in microservices). So if you are setting up a prod env, it is better to stick with physical nodes.
##Managed service-AWS
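AWS provides the platform to run Cassandra, but it does not operate open-source Apache Cassandra clusters for you as a managed service (Amazon Keyspaces is its separate, Cassandra-compatible offering). Perhaps AWS product management and sales don’t see enough demand to justify building a team to operate Cassandra itself as a managed service.
So if you want to run Cassandra itself, it is better to use EC2 machines :)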
##Cassandra installation commands (search online for more):
sudo wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
sudo sh -c 'echo "deb https://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'
>sudo apt update
>sudo apt install cassandra -y
#Open the security ports below
----------------------------
7000 - inter-node communication on the cluster (not used if TLS is enabled)
7001 - inter-node communication (used if TLS is enabled)
9160 - Thrift client
9042 - CQL native transport, the port for CQL clients
7199 - JMX (backup/snapshot)
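Once the ports are opened, it helps to confirm from another machine that they are actually reachable. The snippet below is a minimal sketch using only the Python standard library. The names `CASSANDRA_PORTS`, `port_is_open`, and `check_ports` are made up for this illustration and are not part of Cassandra. It only tests whether a TCP connection succeeds, not whether Cassandra is healthy behind the port.

```python
import socket

# Ports taken from the list above; the labels are shorthand for this sketch.
CASSANDRA_PORTS = {
    7000: "inter-node",
    7001: "inter-node (TLS)",
    9160: "Thrift client",
    9042: "CQL native transport",
    7199: "JMX",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_ports(host, ports=CASSANDRA_PORTS, timeout=2.0):
    """Map each port number to True/False depending on reachability."""
    return {port: port_is_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Replace the address with one of your own nodes.
    for port, is_open in check_ports("127.0.0.1").items():
        state = "open" if is_open else "closed"
        print(f"{port} ({CASSANDRA_PORTS[port]}): {state}")
```

A port that shows as closed can mean either a firewall/security group rule or simply that the service is not listening there (for example, 9160 stays closed unless Thrift is enabled).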
##Default folders after Cassandra installation:
/usr/sbin/cassandra
/etc/cassandra
/usr/share/cassandra
/var/lib/cassandra/hints
/var/lib/cassandra/commitlog
/var/lib/cassandra/saved_caches
/var/lib/cassandra/data
There are two major commands for working with Cassandra: nodetool and cqlsh. The first is for cluster management, and the second is the DB client. Run nodetool help to see all nodetool command options.
If either command fails with a Python error, it is better to remove the library below:
cassandra-3.11.11/lib/six-1.7.3-py2.py3-none-any.zip
##Configuration:
Now we come to Cassandra's first config file, cassandra.yaml. The file is huge, but the major properties needed to run a cluster or a single node in any env are listed below. You can change the hints, commitlog, saved_caches, and data folder locations as per your requirement, but be careful with file permissions and ownership.
Check the available endpoint_snitch values in the yaml file or on the official website, and use whichever applies in your case.
For example, I used RackInferringSnitch for a 3-node AWS cluster and SimpleSnitch for a single node. Also check which snitch applies if you have a multi-region cluster installation (Cassandra also ships Ec2Snitch and Ec2MultiRegionSnitch for AWS).
For seeds, add one node per region, comma-delimited, or within a single region add all nodes, depending on your requirement. Seed nodes are the contact points a starting node uses to bootstrap into the cluster.
Ex- seeds: "3.144.8.18,3.144.8.19,3.144.8.20"
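The seed rule above (one node per region, or all nodes within a single region) can be sketched as a small helper. This is only an illustration: `build_seeds` and the region names are hypothetical, and the IP addresses are the ones from the example line.

```python
def build_seeds(nodes_by_region, per_region=1):
    """Build the comma-delimited seeds string for cassandra.yaml.

    nodes_by_region maps a region name to its list of node addresses.
    per_region is how many nodes to take from each region;
    pass None to take every node (the single-region case).
    """
    seeds = []
    for region in sorted(nodes_by_region):
        # Slicing with None as the end index keeps the whole list.
        seeds.extend(nodes_by_region[region][:per_region])
    return ",".join(seeds)


# Hypothetical two-region layout: one seed is taken from each region.
multi_region = {
    "region-a": ["3.144.8.18", "3.144.8.19"],
    "region-b": ["3.144.8.20"],
}
print(build_seeds(multi_region))

# Single region: list every node as a seed.
single_region = {"region-a": ["3.144.8.18", "3.144.8.19", "3.144.8.20"]}
print(build_seeds(single_region, per_region=None))
```

Whatever list you settle on, keep it identical on every node so they all bootstrap from the same contact points.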
The other config files are cassandra-env.sh and cassandra-rackdc.properties, which are described in the official documentation and on the DataStax site.
cassandra.yaml:
------------------
cluster_name: 'Testcluster01'
data_file_directories:
    - /home/cassandra/casdb/data
commitlog_directory: /home/cassandra/casdb/commitlog
saved_caches_directory: /home/cassandra/casdb/saved_caches
commitlog_segment_size_in_mb: 128
seeds: "3.144.8.18" #nested under seed_provider > parameters in the real file
listen_address: 3.144.8.18
start_rpc: true
rpc_address: 3.144.8.18
thrift_framed_transport_size_in_mb: 100
read_request_timeout_in_ms: 600000
range_request_timeout_in_ms: 600000
write_request_timeout_in_ms: 600000
truncate_request_timeout_in_ms: 600000
request_timeout_in_ms: 600000
endpoint_snitch: RackInferringSnitch #all nodes
batch_size_warn_threshold_in_kb: 3000
batch_size_fail_threshold_in_kb: 5000
num_tokens: 256
hints_directory: /home/cassandra/local/apps/casdb
max_hints_file_size_in_mb: 128
column_index_cache_size_in_kb: 16
key_cache_size_in_mb: 100000
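Because the data, commitlog, saved_caches, and hints folders above were moved away from their defaults, a quick ownership and permission check before starting the node can save a failed startup. The following is a rough sketch, not an official tool. `check_cassandra_dirs` is a name invented here, and it assumes you already know the numeric uid of the user that runs Cassandra.

```python
import os


def check_cassandra_dirs(paths, expected_uid):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for path in paths:
        if not os.path.isdir(path):
            problems.append(f"{path}: missing or not a directory")
            continue
        info = os.stat(path)
        if info.st_uid != expected_uid:
            problems.append(
                f"{path}: owned by uid {info.st_uid}, expected {expected_uid}"
            )
        if not info.st_mode & 0o200:  # owner write bit
            problems.append(f"{path}: owner has no write permission")
    return problems


if __name__ == "__main__":
    # Directories from the cassandra.yaml values above.
    dirs = [
        "/home/cassandra/casdb/data",
        "/home/cassandra/casdb/commitlog",
        "/home/cassandra/casdb/saved_caches",
        "/home/cassandra/local/apps/casdb",
    ]
    # Assumption: this script runs as the same user that runs Cassandra.
    for problem in check_cassandra_dirs(dirs, expected_uid=os.getuid()):
        print(problem)
```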
Once installation is complete, try to run
>cqlsh 3.144.8.18
and you will be prompted with the Cassandra DB console:
cqlsh>
##Play with the DB
CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};
use test;
DESCRIBE keyspaces;
SELECT * FROM system_schema.keyspaces;
CREATE TABLE test.emp(emp_id int PRIMARY KEY, emp_name text, emp_city text,emp_sal varint,emp_phone varint );
DESCRIBE tables;
INSERT INTO emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) values(1,'Noida','chandra',123456789, 60000);
select * from emp;
##Data repair among nodes (nightly job). This command can be run from cron on all nodes.
nodetool repair
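If the repair runs from cron on every node, one option is to stagger the start times so the nodes do not all repair at once. The staggering is my assumption, not something the command requires. The sketch below only generates crontab lines, and `staggered_repair_cron` is a hypothetical helper name.

```python
def staggered_repair_cron(nodes, start_hour=1, gap_hours=1,
                          command="nodetool repair"):
    """Return {node: crontab line}, each node starting gap_hours after the previous."""
    schedule = {}
    for index, node in enumerate(nodes):
        # Wrap around midnight if there are many nodes.
        hour = (start_hour + index * gap_hours) % 24
        # Crontab fields: minute hour day-of-month month day-of-week command
        schedule[node] = f"0 {hour} * * * {command}"
    return schedule


if __name__ == "__main__":
    nodes = ["3.144.8.18", "3.144.8.19", "3.144.8.20"]
    for node, line in staggered_repair_cron(nodes).items():
        print(f"{node}: {line}")
```

Each printed line goes into the crontab of the matching node. Pick a gap that is longer than a repair usually takes on your data size.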
## Test your DB performance (Benchmark)
Yahoo's YCSB tool can be used to check performance. It provides different kinds of load, such as workloada, workloadb, and workloadc. By running these loads we can benchmark our cluster and tune the yaml config file accordingly.
https://github.com/brianfrankcooper/YCSB
#Update-heavy workload: 50/50 mix of reads and writes
Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloada
Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloada
#Read-mostly workload: 95/5 mix of reads and writes
Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadb
Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadb
#Read-only workload: 100% reads
Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadc
Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadc
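The six commands above follow one pattern (load, then run, for each workload), so they can be generated instead of typed. Below is a minimal sketch that builds the same argument lists. `ycsb_commands` is a name made up for this post, and the script assumes it is started from the YCSB checkout directory.

```python
import subprocess

WORKLOADS = ("workloada", "workloadb", "workloadc")


def ycsb_commands(hosts="127.0.0.1", workloads=WORKLOADS):
    """Return the load/run argument lists for each workload, in order."""
    commands = []
    for workload in workloads:
        for phase in ("load", "run"):
            commands.append([
                "./bin/ycsb", phase, "cassandra-cql",
                "-p", f"hosts={hosts}",
                "-s",
                "-P", f"workloads/{workload}",
            ])
    return commands


if __name__ == "__main__":
    for argv in ycsb_commands():
        print(" ".join(argv))
        # Uncomment to actually execute from the YCSB directory:
        # subprocess.run(argv, check=True)
```

Passing a comma-separated list of node addresses as `hosts` points the benchmark at the whole cluster instead of a single local node.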
##Backup/restore:
Enable incremental backups in cassandra.yaml:
incremental_backups: true
Take a snapshot of the node:
>nodetool snapshot
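A snapshot is cheap because, as noted in the topic list at the end of this post, it creates hard links to the SSTable files rather than copying them. The toy example below shows what a hard link does, using temporary files only. It does not touch Cassandra, and the file and function names are invented for the demonstration.

```python
import os
import tempfile


def snapshot_by_hard_link(data_dir, snapshot_dir):
    """Hard-link every regular file in data_dir into snapshot_dir."""
    os.makedirs(snapshot_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(data_dir)):
        source = os.path.join(data_dir, name)
        if os.path.isfile(source):  # skips the snapshots folder itself
            os.link(source, os.path.join(snapshot_dir, name))
            linked.append(name)
    return linked


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        # Stand-in for an SSTable data file (hypothetical name).
        original = os.path.join(data_dir, "example-Data.db")
        with open(original, "w") as handle:
            handle.write("rows")

        snapshot_dir = os.path.join(data_dir, "snapshots", "demo")
        print(snapshot_by_hard_link(data_dir, snapshot_dir))

        # Both names point at the same data on disk, so no extra space is used.
        copy = os.path.join(snapshot_dir, "example-Data.db")
        print(os.path.samefile(original, copy))

        # Removing the original name leaves the snapshot readable.
        os.remove(original)
        with open(copy) as handle:
            print(handle.read())
```

This is also why old snapshots start to consume disk space over time: once compaction removes the original SSTables, the snapshot links are the only thing keeping that data on disk.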
##GUI tool:
git clone https://github.com/Kindrat/cassandra-client.git
##Troubleshoot:
Below are a few errors you may hit when you change the cluster name, add a node, remove a node, or do other day-to-day tasks.
Errors:
--Unable to start Cassandra: "node already exists"
--replacing a node without bootstrap risks invalidating consistency guarantee as the..
--Cannot start node if snitch's data center (137) differs from previous data center (datacenter1)
My recommendation is to look into cassandra-env.sh and its different startup options (-Dcassandra.****).
Ex --Cannot start node if snitch's data center (137) differs from previous data center (datacenter1)
JVM_OPTS="$JVM_OPTS -Dcassandra.ignore_dc=true"
If it starts successfully, execute:
nodetool repair
nodetool cleanup
After these complete successfully, you may comment out the JVM_OPTS line above.
##Topics to explore further if you are interested##
Cassandra migration
Cassandra format
Commit log
Memtable/SSTable
Management tools -- DataStax OpsCenter, SPM, Nodetool Utility
Partitioners in Cassandra
Gossip Protocol
Token
Snitch
Compaction
Merkle Tree
Anti-Entropy
Tombstone
Bloom Filters
Heap Memory
Exceptions Count
Key Cache Hit Rate
Read/Write Latency
Python Stress test in Cassandra
Benchmarking tools (YCSB, siege, stress tool)
Snapshots - backup that creates hard links for SSTables in the snapshots folder
Materialized Views - new in Cassandra 3
Quorum Consistency, Repair
I know that Cassandra is a very big and complex topic, and at the very beginning it is hard to understand. I have just tried to give a direction and a high-level path to all enthusiasts who love to work with Cassandra.
For any queries or suggestions, contact me at [email protected]