Choose relational vs NoSQL database?
Relational or NoSQL?
You have this great software application in mind. Now, to implement it, what database should you use? Should it be relational or NoSQL?
Concepts
To make this decision, it is important to understand two software concepts.
CAP Theorem
With requirements of Consistency (reading the latest data), Availability (client with data to write must be able to hand over the data without worrying about retries & db should accept the data and handle reliably writing it) & Partition tolerance (tolerate nodes or partial internal network errors), it is possible to meet only two requirements in full with compromises required on the third.
3Vs
A concept in Big Data. What is the volume, variety & velocity of the incoming data?
Choosing Relational
When the data is structured well (lacks variety) & within reasonable limits of volume & velocity, a relational database is usually a good choice. Relational databases have a mature plethora of features & supporting products that NoSQL databases will struggle to match.
In a SQL database, reads (with the right indexes) are fast while writes are slow since they have to go to one location. A SQL database is designed to follow CA (Consistency {ACID-Atomicity, Consistency, Isolation, Durability} & Availability {with multiple servers}) at the cost of partition tolerance. Complex solutions can reduce the partition tolerance.
For a social network requirement, a Graph database is a good choice, which also follows CA (Consistency/Availability) at the cost of P (Partition Tolerance).
Choosing NoSQL
If even one of the 3Vs (volume, variety, velocity) is high, it is better to go with a NoSQL database.
Availability & Partition Tolerance
Cassandra is a NoSQL database server that can have multiple nodes. It will eventually need a DevOps team to manage complexity. Consistency is quick (but not immediate & guaranteed), dirty reads are a possibility & occasional bugs were reported in the past when a node goes down with dirty data & brought back up.
Amazon Dynamo: A commercial Storage as a service solution with DevOps managed by the provider.
Both Cassandra & Dynamo are column based key-value databases.
Consistency & Partition Tolerance
If consistency & partition tolerance are more important than availability, MongoDb & HBase are options.
MongoDb is a document based database with an id as a key & indexed documents. By default, it will keep 3 copies in sync & will not succeed a write unless all copies are written successfully. It will ensure the reads don't read dirty data.
HBase is a columnar database (similar to Cassandra & Dynamo) but ensures 3 copies by default (similar to MongoDb).
Availability can be improved with complex solutions on top.
NoSql database Categories
Key-Value
Memcached, Redis, Riak
Column-oriented
Cassandra
Document oriented (JSON)
MongoDb, CouchDb
Tabular
Big Table, HBase
Graph
Neo4J, ArangoDb (supports sharding with auto-rebalancing), MarkLogic
XML, Object based
eXist, BaseX, MarcLogic, ZODB
Conclusion
Depending on the requirements, there are multiple options available for someone wishing to build a software solution. The right tools have to be chosen for the right job.
Co-authored with Srinivasan Varadharajan.