The Whole Ramayan of Cassandra DB

Chandrashekhar Kumar

Published: February 4, 2022

This post is for readers who have basic prior knowledge of RDBMS and NoSQL and are planning to set up Cassandra in a DEV, QA, or PROD environment.

I simply love this NoSQL database because it is open source, distributed, a wide-column store, schema-free, easy to replicate, has a simple API, is eventually consistent, and has the capacity to handle large amounts of data. I have been following it for quite some time, but only recently got the chance to work on it live. Here I share my views from working with such a great and complex database.

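A Cassandra cluster does not need a load balancer in front of it: the nodes share cluster state through the gossip protocol, any node can coordinate a request, and the client drivers spread requests across the nodes.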

Cassandra is a NoSQL database known for fast writes. It belongs to the column-oriented (wide-column) family. Its direct sibling is ScyllaDB. HBase is also a column-oriented database, but it is known for faster reads and runs on the Hadoop platform.

In a nutshell, ScyllaDB is the actual counterpart of Cassandra, and you may use it in place of Cassandra.

##Let's start with installation:

--Install Java 8

--Install Python 3 (check which Python your Cassandra version needs: Cassandra 4 with Python 3, Cassandra 3 with Python 2 or Python 3)

It is better to check for cqlshlib with the command below:

>find /usr/lib -name cqlshlib

If the python command is not present, create a symlink for it:

>sudo ln -s /usr/bin/python3 /usr/bin/python

There are 3 ways to install Cassandra (cluster or single node):

1. Download it from the site and configure Cassandra as you need

2. Package installer (yum on Linux, apt on Ubuntu, etc.)

3. Docker platform

My recommendation is to install through the yum/apt installer, because the first option requires Linux/Unix admin knowledge (James Bond :)).

There are many file permission changes to take into consideration. I tried both options and found that option 2 works well with only a few changes. These installations can be on premises or in the cloud, as a single node or a cluster.

A Docker installation is a good option for a novice who wants to learn. It can be useful for real environments too, but in some cases it adds latency (for example with OpenTracing in microservices). So if you are setting up a prod environment, it is better to stick with physical nodes.

##Managed service-AWS

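AWS provides a platform to run Cassandra on, but it does not operate Apache Cassandra clusters for you (Amazon Keyspaces is a separate, Cassandra-compatible service). Perhaps AWS product management and sales do not see enough demand to justify building a team to operate Cassandra itself as a managed service.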

So it is better to use EC2 machines :)

##Cassandra installation commands (search online for more):

sudo wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -

sudo sh -c 'echo "deb https://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'

>sudo apt update

>sudo apt install cassandra -y

#Open the security ports below

----------------------------

7000 - inter-node communication on the cluster (not used if TLS is enabled)

7001 - inter-node communication (used if TLS is enabled)

9160 - Thrift client (Cassandra 3.x; Thrift was removed in Cassandra 4.0)

9042 - CQL native transport, the port for CQL clients

7199 - JMX (used by nodetool, e.g. backup/snapshot)
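Once the ports are opened, it can help to confirm from another machine that they are actually reachable. Below is a minimal sketch in Python, assuming plain TCP reachability is a good enough check; the function names port_is_open and check_ports are my own and not part of any Cassandra tooling.

```python
import socket

# Ports taken from the list above; adjust to what your cluster really uses.
CASSANDRA_PORTS = {
    7000: "inter-node (no TLS)",
    7001: "inter-node (TLS)",
    9160: "Thrift client",
    9042: "CQL native transport",
    7199: "JMX",
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not open".
        return False


def check_ports(host, ports=CASSANDRA_PORTS, timeout=2.0):
    """Map each port number to True (reachable) or False (not reachable)."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Calling check_ports("3.144.8.18") from a client machine returns a dictionary of port numbers to True/False. A False value only says the TCP connection failed; it does not say whether the cause is a firewall rule or a service that is not running.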

##Default folders after Cassandra installation:

/usr/sbin/cassandra

/etc/cassandra

/usr/share/cassandra

/var/lib/cassandra/hints

/var/lib/cassandra/commitlog

/var/lib/cassandra/saved_caches

/var/lib/cassandra/data

There are two major commands for working with Cassandra: nodetool and cqlsh. The first is for cluster management and the second is the database client. Check all the nodetool command options.

If you run either command and get a Python error, it is better to remove the lib below:

cassandra-3.11.11/lib/six-1.7.3-py2.py3-none-any.zip

##Configuration:

Now we come to Cassandra's first config file, cassandra.yaml. The file is huge, but the major properties needed to run a cluster or a single node in any environment are listed below. You can change the hints, commitlog, saved_caches, and data folder locations as required, but be careful with file permissions and ownership.

Check the endpoint_snitch value in the yaml file against the official website and use whichever snitch applies in your case.

For example, RackInferringSnitch for a 3-node AWS cluster and SimpleSnitch for a single node. Also check whether you have a multi-region cluster installation.

For seeds, add one node per region as a comma-delimited list, or within a single region add all nodes, depending on your requirement. Seed nodes are used when a node bootstraps into the cluster.

Example - seeds: "3.144.8.18,3.144.8.19,3.144.8.20"
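The seed rule above (one node per region, or all nodes inside a single region) can be sketched as a small helper. This is only an illustration in Python: the function name build_seeds, the region names, and the 10.0.1.x addresses are made up for the example.

```python
def build_seeds(nodes_by_region, per_region=1):
    """Pick the first `per_region` nodes of every region and join them
    into the comma-delimited string used for the seeds setting."""
    if per_region < 1:
        raise ValueError("per_region must be at least 1")
    seeds = []
    for region in sorted(nodes_by_region):  # sorted for a repeatable order
        seeds.extend(nodes_by_region[region][:per_region])
    return ",".join(seeds)


# Hypothetical inventory: region names and the 10.0.1.x addresses are invented.
nodes = {
    "us-east-2": ["3.144.8.18", "3.144.8.19", "3.144.8.20"],
    "eu-west-1": ["10.0.1.5", "10.0.1.6"],
}

# Multi-region cluster: one seed per region.
print(build_seeds(nodes))

# Single region: list all three nodes as seeds.
print(build_seeds({"us-east-2": nodes["us-east-2"]}, per_region=3))
```

Whatever list you end up with, keep it identical on every node so that all nodes bootstrap against the same seeds.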

The other config files are cassandra-env.sh and cassandra-rackdc.properties, which are described in the official documentation and on the DataStax site.

cassandra.yaml:

------------------

cluster_name: 'Testcluster01'

data_file_directories: /home/cassandra/casdb/data

commitlog_directory: /home/cassandra/casdb/commitlog

saved_caches_directory: /home/cassandra/casdb/saved_caches

commitlog_segment_size_in_mb: 128

seeds: "3.144.8.18"

listen_address: 3.144.8.18

start_rpc: true

rpc_address: 3.144.8.18

thrift_framed_transport_size_in_mb: 100

read_request_timeout_in_ms: 600000

range_request_timeout_in_ms: 600000

write_request_timeout_in_ms: 600000

truncate_request_timeout_in_ms: 600000

request_timeout_in_ms: 600000

endpoint_snitch: RackInferringSnitch #all nodes

batch_size_warn_threshold_in_kb: 3000

batch_size_fail_threshold_in_kb: 5000

num_tokens: 256

hints_directory: /home/cassandra/local/apps/casdb

max_hints_file_size_in_mb: 128

column_index_cache_size_in_kb: 16

key_cache_size_in_mb: 100000

Once the installation is complete, try to run

>cqlsh 3.144.8.18

and you will be prompted with the Cassandra DB console.

cqlsh>

##Play with the DB

CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1};

use test;

DESCRIBE keyspaces;

SELECT * FROM system_schema.keyspaces;

CREATE TABLE test.emp(emp_id int PRIMARY KEY, emp_name text, emp_city text,emp_sal varint,emp_phone varint );

DESCRIBE tables;

INSERT INTO emp (emp_id, emp_city, emp_name, emp_phone, emp_sal) values(1,'Noida','chandra',123456789, 60000);

select * from emp;
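The keyspace above uses SimpleStrategy with a replication factor of 1, which fits a single node. For a cluster the replication settings change, so here is a minimal Python sketch that builds the CREATE KEYSPACE statement for either case. The helper name create_keyspace_cql and the data center names dc1/dc2 are assumptions for illustration; NetworkTopologyStrategy is the standard strategy for multi-data-center clusters.

```python
import re

# Unquoted CQL identifiers: a letter followed by letters, digits, underscores.
_IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def create_keyspace_cql(name, replication_factor=1, dc_factors=None):
    """Build a CREATE KEYSPACE statement.

    With dc_factors (data center name -> replica count) the statement uses
    NetworkTopologyStrategy; otherwise it uses SimpleStrategy.
    """
    if not _IDENT.fullmatch(name):
        raise ValueError(f"not a plain CQL identifier: {name!r}")
    if dc_factors:
        options = {"class": "NetworkTopologyStrategy", **dc_factors}
    else:
        options = {"class": "SimpleStrategy",
                   "replication_factor": replication_factor}
    # String values are single-quoted, numeric values are written as-is.
    body = ", ".join(
        f"'{key}': '{value}'" if isinstance(value, str) else f"'{key}': {value}"
        for key, value in options.items()
    )
    return f"CREATE KEYSPACE {name} WITH replication = {{{body}}};"


print(create_keyspace_cql("test"))
print(create_keyspace_cql("test", dc_factors={"dc1": 3, "dc2": 2}))
```

The first call reproduces the statement used above; the second shows the shape for a two-data-center cluster. The printed statement can be pasted into cqlsh.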

## Data repair among nodes (nightly job). This command can be run from cron on all nodes.

nodetool repair
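If the repair runs from cron on every node, you may not want all nodes to start at the same minute. The sketch below generates one crontab line per node with staggered start hours. Staggering is my own assumption, not something Cassandra requires, and the function name staggered_repair_cron is invented for this example.

```python
def staggered_repair_cron(nodes, start_hour=1, gap_hours=1,
                          command="nodetool repair"):
    """Return {node: crontab line}; each node starts gap_hours after the
    previous one, wrapping around midnight."""
    schedule = {}
    for index, node in enumerate(nodes):
        hour = (start_hour + index * gap_hours) % 24
        # crontab fields: minute hour day-of-month month day-of-week command
        schedule[node] = f"0 {hour} * * * {command}"
    return schedule


for node, line in staggered_repair_cron(
        ["3.144.8.18", "3.144.8.19", "3.144.8.20"]).items():
    print(f"{node}: {line}")
```

Each printed line goes into the crontab of the matching node. Pick the gap according to how long a repair takes on your data size.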

## Test your DB performance (Benchmark)

Yahoo's YCSB (Yahoo! Cloud Serving Benchmark) tool can be used to check performance. The tool provides different kinds of load, such as workloada, workloadb, and workloadc. By running these workloads we can benchmark our cluster and tune the yaml config file accordingly.

https://github.com/brianfrankcooper/YCSB

#Heavy workload: 50/50 mix of reads and writes

Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloada

Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloada

#Read-mostly workload: 95/5 mix of reads and writes

Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadb

Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadb

#Read-only workload: 100% reads

Ycsb>./bin/ycsb load cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadc

Ycsb>./bin/ycsb run cassandra-cql -p hosts="127.0.0.1" -s -P workloads/workloadc
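The six commands above differ only in the phase (load or run) and the workload file, so they can be generated rather than typed. A minimal sketch in Python, assuming it is run from the YCSB folder; the helper name ycsb_commands is my own, and the command layout is copied from the lines above.

```python
import shlex

# Workload files mentioned above and the mix each one exercises.
WORKLOADS = {
    "workloada": "50/50 reads and writes",
    "workloadb": "95/5 reads and writes",
    "workloadc": "100% reads",
}


def ycsb_commands(workload, hosts="127.0.0.1"):
    """Return the (load, run) command lines for one workload."""
    def build(phase):
        return shlex.join([
            "./bin/ycsb", phase, "cassandra-cql",
            "-p", f"hosts={hosts}",
            "-s", "-P", f"workloads/{workload}",
        ])
    return build("load"), build("run")


for name, mix in WORKLOADS.items():
    load_cmd, run_cmd = ycsb_commands(name)
    print(f"# {name}: {mix}")
    print(load_cmd)
    print(run_cmd)
```

To benchmark a real cluster instead of localhost, pass the node addresses as a comma-separated string, for example ycsb_commands("workloada", hosts="3.144.8.18,3.144.8.19").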

##Backup/restore:

incremental_backups: true #in cassandra.yaml

>nodetool snapshot
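A plain nodetool snapshot gets an auto-generated name, which is hard to find later. nodetool snapshot accepts -t to set a tag, so a dated tag per keyspace is one simple convention. The sketch below only builds the command line; the helper name snapshot_command and the backup_YYYYMMDD tag format are my own assumptions.

```python
import datetime
import shlex


def snapshot_command(keyspace, day=None, prefix="backup"):
    """Build a nodetool snapshot command with a dated tag for one keyspace."""
    day = day or datetime.date.today()
    tag = f"{prefix}_{day:%Y%m%d}"  # e.g. backup_20220204
    return shlex.join(["nodetool", "snapshot", "-t", tag, keyspace])


# Using the keyspace created earlier in this post and a fixed example date.
print(snapshot_command("test", datetime.date(2022, 2, 4)))
```

Leaving out the date argument uses today's date, which fits a nightly cron job.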

##GUI tool:

git clone https://github.com/Kindrat/cassandra-client.git

##Troubleshooting:

Below are a few errors you may hit when you change or add a cluster name, add a node, remove a node, or do other day-to-day tasks.

Errors:

--Unable to start Cassandra: "node already exists"

--replacing a node without bootstrap risks invalidating consistency guarantee as the..

--Cannot start node if snitch's data center (137) differs from previous data center (datacenter1)

My recommendation is to look into cassandra-env.sh and its different options (-Dcassandra.****).

Example: --Cannot start node if snitch's data center (137) differs from previous data center (datacenter1)

Add the line below to cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Dcassandra.ignore_dc=true"

If it starts successfully, execute:

nodetool repair

nodetool cleanup

After these complete successfully, you may comment out the JVM_OPTS line above.

##Topics to explore further if you are interested##

Cassandra migration

Cassandra format

Commit log

Memtable/SSTable

Management tools -- DataStax OpsCenter, SPM, nodetool utility

Partitioners in Cassandra

Gossip Protocol

Token

Snitch

Compaction

Merkle Tree

Anti-Entropy

Tombstone

Bloom Filters

Heap Memory

Exceptions Count

Key Cache Hit Rate

Read/Write Latency

Python Stress test in Cassandra

Benchmarking tools (YCSB, siege, stress tool)

Snapshots - a backup that creates hard links to SSTables in the snapshots folder

Materialized Views - new in Cassandra 3

Quorum Consistency, Repair

I know that Cassandra is a very big and complex topic, and at the very beginning it is hard to understand. I have just tried to give a direction and a high-level path to all enthusiasts who love to work with Cassandra.

For any questions or suggestions, contact [email protected]
