Cassandra data modelling: Redundant data, a tough decision
The biggest challenge in building an efficient data model for Cassandra is data redundancy.
Though the basic rules for data modelling with Cassandra treat the usual RDBMS modelling goals as non-goals (refer: Basic rules for C* data modelling), they build on the assumptions that clusters run on commodity hardware, that storage is cheap, and that as data needs grow, more nodes can be added to the cluster at very low cost.
But in real life we are faced with technical as well as non-technical problems:
a. Keeping multiple column families in sync can be a major overhead if the same data is duplicated across them. What if writes to some column families succeed and others fail? How long, and how many times, do we retry? (One common mitigation is sketched just after this list.)
b. Horizontal scalability may well be real, but think of the mundane question of where to house all those heat-producing, energy-guzzling machines.
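One common way to soften problem (a), if we do end up duplicating data across column families, is a logged batch: Cassandra then guarantees that all the statements will eventually be applied, at the cost of some extra coordinator work. Here is a minimal sketch using the DataStax Python driver; the keyspace (demo_ks) and tables (users_by_id, users_by_email) are invented for illustration.

```python
# Hypothetical tables kept in sync via a logged batch (illustrative names).
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo_ks')    # assumed keyspace

insert_by_id = session.prepare(
    "INSERT INTO users_by_id (user_id, email, name) VALUES (?, ?, ?)")
insert_by_email = session.prepare(
    "INSERT INTO users_by_email (email, user_id, name) VALUES (?, ?, ?)")

def save_user(user_id, email, name):
    batch = BatchStatement()            # logged batch by default
    batch.add(insert_by_id, (user_id, email, name))
    batch.add(insert_by_email, (email, user_id, name))
    session.execute(batch)              # raises on failure; the retry policy is still ours to decide
```

This does not make the sync problem vanish (logged batches are not isolated, and they are slower than plain writes), but it removes the worst case of one table taking the write while the other never sees it.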
So how do we model, without redundancy, for a database that does not allow joins?
The simple answer is: we do not. What we do is manage redundancy.
Let us consider a case where we need to query a dataset containing 1000 attributes, and the queries involve two mutually exclusive identifying keys.
If we know that the criteria will yield just a few rows for each of the keys, we would rather build one column family keyed on the more frequently used key (say key1, even if it is used just 0.1% more often; the idea is to choose the key that is used more) and a second column family containing the mapping between the two keys, keyed on the other key. We would then choose to do an in-memory join in the application.
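As a rough sketch of that layout with the DataStax Python driver (the keyspace and the table names data_by_key1 and key1_by_key2 are invented for illustration): queries on key1 read data_by_key1 directly, while queries on key2 first resolve key1 from the small mapping table and then read the wide rows.

```python
# In-memory join: resolve key1 from the mapping table, then fetch the wide rows.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

get_by_key1 = session.prepare("SELECT * FROM data_by_key1 WHERE key1 = ?")
map_key2    = session.prepare("SELECT key1 FROM key1_by_key2 WHERE key2 = ?")

def fetch_by_key2(key2):
    results = []
    for mapping in session.execute(map_key2, (key2,)):    # only a few rows expected per key
        results.extend(session.execute(get_by_key1, (mapping.key1,)))
    return results    # the "join" happens in application memory
```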
On the other hand, if we need to query a dataset containing 100 attributes, and the queries involve two mutually exclusive keys with each key yielding 10000 rows of data, we would want to live with data redundancy. An in-memory join would be a really bad idea here.
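In that case the write path, sketched below with invented table names, simply writes the same row twice, once per query key, so that each read pattern touches a single partition and never joins anything in application memory.

```python
# Redundant writes: one copy of the row per query key (illustrative schema).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

insert_k1 = session.prepare(
    "INSERT INTO data_by_key1 (key1, row_id, payload) VALUES (?, ?, ?)")
insert_k2 = session.prepare(
    "INSERT INTO data_by_key2 (key2, row_id, payload) VALUES (?, ?, ?)")

def save_row(key1, key2, row_id, payload):
    session.execute(insert_k1, (key1, row_id, payload))
    session.execute(insert_k2, (key2, row_id, payload))
```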
Some may ask: what if the dataset contains 1000 attributes and 10000 rows of data? Keeping with the idea that our data should be well spread across the cluster, this case falls into the category of "have we really spread the data correctly?" Another factor to remember: model around your queries. Do all our queries need all those 1000 attributes? Do all of them need all those 10000 rows? Most of the time, the answer to one of those questions will be no. If it is no to the first, we create our column families with only the relevant columns. If it is no to the second, we divide the data more evenly by choosing another attribute from the data and adding it to a composite partition key.
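Here is a sketch of the composite-partition-key option, with made-up table and bucket names; the second partition column breaks one huge key1 partition into many smaller ones, each still answerable with a single query.

```python
# Composite partition key: (key1, day_bucket) instead of key1 alone (illustrative).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_key1 (
        key1       text,
        day_bucket text,      -- second partition column spreads the 10000 rows
        row_id     timeuuid,
        payload    text,
        PRIMARY KEY ((key1, day_bucket), row_id)
    )""")

# Each read now targets one (key1, day_bucket) partition rather than one huge key1 partition.
rows = session.execute(
    "SELECT payload FROM events_by_key1 WHERE key1 = %s AND day_bucket = %s",
    ('sensor-42', '2014-06-01'))
```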
We can take this further by giving the main column family a generated id as its partition key, and then building lookup column families whose partition keys are the query keys and whose values are collections of generated ids. Data is then read by first resolving the generated ids and then invoking individual queries for those ids. This reduces data redundancy, keeps the in-memory processing light, and also spreads the data much better across the cluster.
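A sketch of that surrogate-id layout follows (all names invented): a main column family keyed on a generated id, plus one small lookup column family per query key holding sets of those ids.

```python
# data_by_id(row_id uuid PRIMARY KEY, ...); ids_by_key1(key1 text PRIMARY KEY, row_ids set<uuid>)
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

ids_for_key1 = session.prepare("SELECT row_ids FROM ids_by_key1 WHERE key1 = ?")
data_for_id  = session.prepare("SELECT * FROM data_by_id WHERE row_id = ?")

def fetch_by_key1(key1):
    lookup = list(session.execute(ids_for_key1, (key1,)))
    if not lookup or not lookup[0].row_ids:
        return []
    results = []
    for rid in lookup[0].row_ids:            # one small query per generated id
        results.extend(session.execute(data_for_id, (rid,)))
    return results                           # only the id sets are duplicated, not the wide rows
```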
P.S.: Try to have the queries read one key at a time. Contrary to an RDBMS, an IN clause with multiple partition keys will do more harm than good. Rather, loop over the keys and invoke multiple queries. If needed, the queries can be invoked asynchronously.
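For example, with the Python driver the loop can be turned into parallel single-partition reads (the table and keys below are illustrative):

```python
# One asynchronous query per partition key instead of a multi-key IN clause.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace
stmt = session.prepare("SELECT * FROM data_by_key1 WHERE key1 = ?")

def fetch_many(keys):
    futures = [session.execute_async(stmt, (k,)) for k in keys]   # fire all requests in parallel
    results = []
    for future in futures:
        results.extend(future.result())   # blocks until that single-partition read completes
    return results

rows = fetch_many(['key-a', 'key-b', 'key-c'])
```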
-- Cross-posted from Blogger