NOSQL + Cassandra's Architecture

NOSQL + Cassandra's Architecture

Cassandra is a peer-to-peer distributed database that runs on a cluster of homogeneous nodes. Cassandra has been architected from the ground up to handle large volumes of data while providing high availability. Cassandra provides high write and read throughput. A Cassandra cluster has no special nodes i.e. the cluster has no masters, no slaves or elected leaders. This enables Cassandra to be highly available while having no single point of failure.

Key Concepts :

Data Partitioning Stores data by dividing data evenly around its cluster of nodes. Each node is responsible for part of the data. The act of distributing data across nodes is referred to as data partitioning

Consistent Hashing Determining a node on which a specific piece of data should reside on. Minimising data movement when adding or removing nodes

Data Replication Replication of data ensures fault tolerance and reliability.

Eventual Consistency Nodes/replicas will eventually return the last updated value

Gossip Protocol discover node state for all nodes in a cluster. Nodes discover information about other nodes by exchanging state information about themselves and other nodes they know about. State information about every node propagates throughout the cluster

Bloom Filters - Fast way to test the existence of a data structure in a set. A bloom filter can tell if an item might exist in a set or definitely does not exist in the set. False positives are possible but false negatives are not. Bloom filters are a good way of avoiding expensive I/O operation.

Merkle Tree - A hash tree which provides an efficient way to find differences in data blocks. Leaves contain hashes of individual data blocks and parent nodes contain hashes of their respective children. This enables efficient way of finding differences between nodes.

SSTable - A Sorted String Table (SSTable) ordered immutable key value map. It is basically an efficient way of storing large sorted data segments in a file.

Write Back Cache - A write back cache is where the write operation is only directed to the cache and completion is immediately confirmed. This is different from Write-through cache where the write operation is directed at the cache but is only confirmed once the data is written to both the cache and the underlying storage structure.

Memtable - A memtable is a write back cache residing in memory which has not been flushed to disk yet.

Cassandra Keyspace - Keyspace is similar to a schema in the RDBMS world. A keyspace is a container for all your application data. When defining a keyspace, you need to specify a replication strategy and a replication factor i.e. the number of nodes that the data must be replicate too.

Column Family - A column family is analogous to the concept of a table in an RDBMS. But that is where the similarity ends. Instead of thinking of a column family as RDBMS table think of a column family as a map of sorted map. A row in the map provides access to a set of columns which is represented by a sorted map. Map<RowKey, SortedMap<ColumnKey, ColumnValue>> Please note in CQL (Cassandra Query Language) lingo a Column Family is referred to as a table.

Row Key - A row key is also known as the partition key and has a number of columns associated with it i.e. a sorted map as shown above. The row key is responsible for determining data distribution across a cluster.

Cassandra Cluster/Ring

Cassandra Write Path

Cassandra Read Path

Happy designing :) Abhijeet K.



要查看或添加评论,请登录

Abhijeet K的更多文章

社区洞察

其他会员也浏览了