Cassandra data modelling: Redundant data, a tough decision
The biggest challenge in building an efficient data model for Cassandra is data redundancy.
Though the basic rules for data modelling with Cassandra treat the usual RDBMS modelling goals as non-goals (refer: Basic rules for C* data modelling), they build on the assumptions that clusters run on commodity hardware, that storage is cheap, and that as data needs grow, more nodes can be added to the cluster at very low cost.
But in real life we are faced with technical as well as non-technical problems:
a. Keeping multiple column families in sync can be a major overhead if the same data is duplicated across them. What if writes to some column families succeed and others fail? How long, and how many times, do we retry? (One common mitigation is sketched just after this list.)
b. Horizontal scalability may well be real, but think of the mundane question of where to house all those heat-producing, energy-guzzling machines.
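One common way to soften problem (a), if we do end up duplicating data across column families, is a logged batch: Cassandra then guarantees that all the statements will eventually be applied, at the cost of some extra coordinator work. Here is a minimal sketch using the DataStax Python driver; the keyspace (demo_ks) and tables (users_by_id, users_by_email) are invented for illustration.

```python
# Hypothetical tables kept in sync via a logged batch (illustrative names).
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('demo_ks')    # assumed keyspace

insert_by_id = session.prepare(
    "INSERT INTO users_by_id (user_id, email, name) VALUES (?, ?, ?)")
insert_by_email = session.prepare(
    "INSERT INTO users_by_email (email, user_id, name) VALUES (?, ?, ?)")

def save_user(user_id, email, name):
    batch = BatchStatement()            # logged batch by default
    batch.add(insert_by_id, (user_id, email, name))
    batch.add(insert_by_email, (email, user_id, name))
    session.execute(batch)              # raises on failure; the retry policy is still ours to decide
```

This does not make the sync problem vanish (logged batches are not isolated, and they are slower than plain writes), but it removes the worst case of one table taking the write while the other never sees it.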
So how do we model, without redundancy, for a database that does not allow joins?
The simple answer is: we do not. What we do is manage redundancy.
Let us consider a case where we need to query a dataset containing 1000 attributes, and the queries involve two mutually exclusive identifying keys.
If we know that the criteria will yield just a few rows for each of the keys, we would rather build one column family keyed on the more frequently used key (say key1, even if it is used just 0.1% more often; the idea is to choose the key that is used more) and a second column family containing the mapping between the two keys, keyed on the other key. We would then choose to do an in-memory join in the application.
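As a rough sketch of that layout with the DataStax Python driver (the keyspace and the table names data_by_key1 and key1_by_key2 are invented for illustration): queries on key1 read data_by_key1 directly, while queries on key2 first resolve key1 from the small mapping table and then read the wide rows.

```python
# In-memory join: resolve key1 from the mapping table, then fetch the wide rows.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

get_by_key1 = session.prepare("SELECT * FROM data_by_key1 WHERE key1 = ?")
map_key2    = session.prepare("SELECT key1 FROM key1_by_key2 WHERE key2 = ?")

def fetch_by_key2(key2):
    results = []
    for mapping in session.execute(map_key2, (key2,)):    # only a few rows expected per key
        results.extend(session.execute(get_by_key1, (mapping.key1,)))
    return results    # the "join" happens in application memory
```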
On the other hand, if we need to query a dataset containing 100 attributes, and the queries involve two mutually exclusive keys with each key yielding 10000 rows of data, we would want to live with data redundancy. An in-memory join would be a really bad idea here.
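In that case the write path, sketched below with invented table names, simply writes the same row twice, once per query key, so that each read pattern touches a single partition and never joins anything in application memory.

```python
# Redundant writes: one copy of the row per query key (illustrative schema).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

insert_k1 = session.prepare(
    "INSERT INTO data_by_key1 (key1, row_id, payload) VALUES (?, ?, ?)")
insert_k2 = session.prepare(
    "INSERT INTO data_by_key2 (key2, row_id, payload) VALUES (?, ?, ?)")

def save_row(key1, key2, row_id, payload):
    session.execute(insert_k1, (key1, row_id, payload))
    session.execute(insert_k2, (key2, row_id, payload))
```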
Some may ask: what if the dataset contains 1000 attributes and 10000 rows of data? Keeping with the idea that our data should be well spread across the cluster, this case falls into the category of "have we really spread the data correctly?" Another factor to remember: model around your queries. Do all our queries need all those 1000 attributes? Do all of them need all those 10000 rows? Most of the time, the answer to one of those questions will be no. If it is no to the first, we create our column families with only the relevant columns. If it is no to the second, we divide the data more evenly by choosing another attribute from the data and adding it to a composite partition key.
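Here is a sketch of the composite-partition-key option, with made-up table and bucket names; the second partition column breaks one huge key1 partition into many smaller ones, each still answerable with a single query.

```python
# Composite partition key: (key1, day_bucket) instead of key1 alone (illustrative).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_key1 (
        key1       text,
        day_bucket text,      -- second partition column spreads the 10000 rows
        row_id     timeuuid,
        payload    text,
        PRIMARY KEY ((key1, day_bucket), row_id)
    )""")

# Each read now targets one (key1, day_bucket) partition rather than one huge key1 partition.
rows = session.execute(
    "SELECT payload FROM events_by_key1 WHERE key1 = %s AND day_bucket = %s",
    ('sensor-42', '2014-06-01'))
```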
We can take this further by giving the main column family a generated id as its partition key, and then building lookup column families whose partition keys are the query keys and whose values are collections of generated ids. Data is then read by first resolving the generated ids and then invoking individual queries for those ids. This reduces data redundancy, keeps the in-memory processing light, and also spreads the data much better across the cluster.
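A sketch of that surrogate-id layout follows (all names invented): a main column family keyed on a generated id, plus one small lookup column family per query key holding sets of those ids.

```python
# data_by_id(row_id uuid PRIMARY KEY, ...); ids_by_key1(key1 text PRIMARY KEY, row_ids set<uuid>)
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace

ids_for_key1 = session.prepare("SELECT row_ids FROM ids_by_key1 WHERE key1 = ?")
data_for_id  = session.prepare("SELECT * FROM data_by_id WHERE row_id = ?")

def fetch_by_key1(key1):
    lookup = list(session.execute(ids_for_key1, (key1,)))
    if not lookup or not lookup[0].row_ids:
        return []
    results = []
    for rid in lookup[0].row_ids:            # one small query per generated id
        results.extend(session.execute(data_for_id, (rid,)))
    return results                           # only the id sets are duplicated, not the wide rows
```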
P.S.: Try to have the queries read one key at a time. Contrary to an RDBMS, an IN clause with multiple partition keys will do more harm than good. Rather, loop over the keys and invoke multiple queries. If needed, the queries can be invoked asynchronously.
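For example, with the Python driver the loop can be turned into parallel single-partition reads (the table and keys below are illustrative):

```python
# One asynchronous query per partition key instead of a multi-key IN clause.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('demo_ks')   # assumed keyspace
stmt = session.prepare("SELECT * FROM data_by_key1 WHERE key1 = ?")

def fetch_many(keys):
    futures = [session.execute_async(stmt, (k,)) for k in keys]   # fire all requests in parallel
    results = []
    for future in futures:
        results.extend(future.result())   # blocks until that single-partition read completes
    return results

rows = fetch_many(['key-a', 'key-b', 'key-c'])
```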
-- Cross-posted from Blogger