SQL, NoSQL or Hadoop for 'Big Data'?
In an earlier post, we looked at how to use the 3Vs (Volume, Velocity & Variety) of data & the CAP theorem (Consistency, Availability & Partition tolerance) to decide between relational SQL & NoSQL databases.
Continuing that discussion, let us consider technologies such as Hadoop (HDFS). When should we choose these over the others?
Problem
A huge volume of data is collected on user actions & behavior. This data is analyzed for correlations that improve our understanding of users, which in turn drives marketing & selling (upsell/cross-sell) of products. Even a 1% increase in sales/revenue can mean a huge financial windfall.
How big is 'Big Data'? How do we store this enormous amount of data, analyze it & generate reports? How do we make this cost-effective?
Relational SQL databases
The desire for user telemetry & high volume/velocity of data has always existed. Relational SQL databases were used first, with OLAP data-warehousing solutions layered on top to analyze the data. As the databases ran into scaling & performance issues, database administrators would ask engineers to drop triggers, drop foreign key constraints, drop non-clustered indexes (which store the index as a B-tree & refer back to the clustered index, causing a two-hop look-up), avoid expensive JOINs in queries, denormalize & avoid locks. Data would be looked up only by the clustered index key, alongside which the row data is stored. This effectively makes the setup close to a NoSQL database.
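A toy sketch of the two-hop cost mentioned above (the table & column names here are hypothetical, not from any particular engine): a non-clustered index maps a secondary key to the clustered key, which then locates the row, while a clustered-index lookup reaches the row directly.

```python
# Clustered index: row data lives with the primary key (1 hop).
clustered = {
    101: {"user_id": 101, "email": "a@example.com"},
    102: {"user_id": 102, "email": "b@example.com"},
}

# Non-clustered index: maps a secondary key to the clustered key only.
by_email = {"a@example.com": 101, "b@example.com": 102}

def lookup_by_id(user_id):
    return clustered[user_id]       # one hop

def lookup_by_email(email):
    user_id = by_email[email]       # hop 1: secondary index -> clustered key
    return clustered[user_id]       # hop 2: clustered key -> row

assert lookup_by_email("b@example.com")["user_id"] == 102
```

Dropping the secondary index removes the first hop, which is exactly the trade the DBAs above were asking for.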
NoSQL databases
NoSQL databases take the opposite approach. They are performant from the start, with only a clustered index. The key is hashed to determine the node partition (bucketing), & a B-tree or equivalent stores the data within the node. Additional constraints & indexes can be created gradually to provide functionality, but each one chips away at performance.
Data-warehousing solutions are column-oriented. While relational SQL databases store data in rows, column-family databases such as Cassandra, DynamoDb or HBase organize data by columns, making column-based analysis faster. Secondary indexes are internally stored as new tables.
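A minimal illustration of the row vs column layouts (hypothetical data): aggregating one column in a columnar layout touches only that column's values, instead of every row.

```python
# Row layout: each record stored together (good for point lookups).
rows = [
    {"user": "a", "clicks": 3, "country": "US"},
    {"user": "b", "clicks": 7, "country": "IN"},
]

# Columnar layout: each column stored contiguously (good for analytics).
columns = {
    "user":    ["a", "b"],
    "clicks":  [3, 7],
    "country": ["US", "IN"],
}

# Summing clicks reads only the "clicks" column, not whole rows.
assert sum(columns["clicks"]) == 10
```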
When any of the 3Vs (Velocity, Variety & Volume) are high, a NoSQL database is a better choice than a relational SQL database.
For telemetry, all 3Vs can be large. So, why not just use a NoSQL database? Why even consider Hadoop?
Cost
Pricey
NoSQL databases can be expensive. As an example, Amazon's DynamoDb internally uses costly storage such as SSDs or fast spinning disks, & data is looked up by a single key via internal hashing & sharding. 1 TB can cost around $1k/month. A home-grown Cassandra solution would be similarly expensive to administer & manage.
Lower cost
In contrast, Amazon's S3 can provide petabytes of storage for the cost of a terabyte on DynamoDb, by using lower-cost storage on commodity hardware. Its cold-storage service (Glacier) is cheaper still, using even cheaper storage underneath. This is an example of Blob storage. Competing solutions include a custom Hadoop (HDFS) cluster, Google Cloud Storage or Azure Blob Storage.
When performance is not a concern, such as when analyzing data in batches with machine learning, a lower-cost Blob storage solution is more cost-effective.
The data stored across multiple nodes can be processed in parallel using Map-Reduce. Eg: to find the word count of a document, each node searches its portion of the document in parallel (Map) & the responses are aggregated (Reduce).
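The word-count example can be sketched as follows (each list entry stands in for a node's shard of the document; a real cluster would run the map step on separate machines):

```python
from collections import Counter
from functools import reduce

# Each "node" holds a portion of the document.
shards = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each node counts the words in its own shard, in parallel.
mapped = [Counter(shard.split()) for shard in shards]

# Reduce: aggregate the partial counts into a single result.
total = reduce(lambda a, b: a + b, mapped)

assert total["the"] == 3
assert total["fox"] == 2
```

The map step needs no coordination between nodes, which is what lets Hadoop scale this over commodity hardware.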
Data flow
When the data inflow is huge, as in telemetry, there is no time to edit existing data, which can be slow. Data is stored as immutable (unchanging) records; new data is simply appended. Reads then have to scan the transaction log for a given key to determine the latest value.
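A minimal sketch of this append-only pattern (real systems add compaction & indexing so reads need not scan the whole log):

```python
# Append-only log: writes never modify existing entries.
log = []

def write(key, value):
    log.append((key, value))        # append, never edit in place

def read(key):
    """Scan the log & return the latest value written for this key."""
    latest = None
    for k, v in log:
        if k == key:
            latest = v
    return latest

write("user:1", {"plan": "free"})
write("user:1", {"plan": "pro"})    # supersedes the old entry, no overwrite
assert read("user:1") == {"plan": "pro"}
```

Writes are cheap (a single append); the cost is shifted to reads, which suits write-heavy telemetry workloads.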
Queue
Software adage: Every software engineering problem can be solved by adding a new layer of abstraction (except the problem of too many abstractions).
When volume & velocity are high, we may need one more level of abstraction to accept the incoming data. Initially, there were queueing system solutions (eg: MSMQ). The message queue would reside on a single server for writes but allow multiple readers, which would drain it into a database. The single server can be scaled up with better CPU & memory, but easily becomes a bottleneck.
Distributed Queue
To overcome this bottleneck, distributed queues were implemented. These are conceptually similar to the database solutions we saw earlier, where a message is written to 3 or more nodes to improve availability & partition tolerance.
For example, Kafka has a pub-sub (publisher-subscriber) mechanism. All incoming data can be scaled out to multiple machines with redundancy. Producers (eg: HTTP front-ends) publish data to topics, & consumers read from topics & can persist the data to, say, a Hadoop HDFS system. Though this data can be queried, it is meant to be temporary. In contrast, an HDFS system is designed for long-term storage & batch processing & querying of data.
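A toy in-memory version of the pub-sub idea (Kafka-like concepts, deliberately not the Kafka client API; topic names & messages are hypothetical):

```python
from collections import defaultdict, deque

# Each topic is an ordered buffer of messages.
topics = defaultdict(deque)

def publish(topic, message):
    """Producer side: append a message to the topic."""
    topics[topic].append(message)

def consume(topic):
    """Consumer side: drain messages from the topic in order."""
    while topics[topic]:
        yield topics[topic].popleft()

publish("telemetry", {"event": "click"})
publish("telemetry", {"event": "view"})

# A consumer drains the topic, eg: to persist the batch to HDFS.
persisted = list(consume("telemetry"))
assert len(persisted) == 2
```

Kafka adds what this sketch omits: the topic is partitioned & replicated across brokers, so publishing & consuming scale out rather than funneling through one server.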
Another example is Flume, which pushes data to a memory or disk channel, from which a Flume sink persists it to storage (like a Hadoop HDFS system or a NoSQL database).
Conclusion
Depending on the requirements & budget, big data inflow can be set up with a distributed queue that eventually persists to Blob storage or a NoSQL database.
Co-authored with Srinivasan Varadharajan.