SQL, NoSQL or Hadoop for 'Big Data'?

In an earlier post, we looked at how to use the 3Vs of data (Volume, Velocity & Variety) & the CAP theorem (Consistency, Availability & Partition Tolerance) to decide between relational SQL & NoSQL databases.

In a continuation of that discussion, let us consider technologies such as Hadoop (HDFS). When should we choose these over the others?

Problem

A huge volume of data is collected on user actions & behavior. This data is correlated to infer causation. The improved understanding is used for marketing & selling (upsell/cross-sell) products. Even a 1% increase in sales/revenue can mean a huge financial windfall.

How big is 'Big Data'? How do we store this enormous amount of data, analyze it & generate reports? How do we make this cost-effective?

Relational SQL databases

The desire for user telemetry & for handling high-volume/high-velocity data has always existed. Relational SQL databases were used first: data from OLTP systems was fed into data-warehousing (OLAP) solutions for analysis. As the databases ran into scaling & performance issues, database administrators would ask engineers to drop triggers, drop foreign key constraints, drop non-clustered indexes (which store their keys in a B-tree that refers back to the clustered index, causing a two-hop look-up), avoid expensive JOINs in queries, denormalize & avoid locks. Data would be looked up only by the clustered index key, alongside which the rows themselves are stored. This effectively turns the relational database into something close to a NoSQL database.

NoSQL databases

NoSQL databases take the opposite approach: they start out performant, with only a clustered index. The key is hashed to determine the node partition (bucketing), & a B-Tree or equivalent stores the data within that node. Additional constraints & indexes can be created gradually; they add functionality but start to erode performance.
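
A minimal sketch of that idea in Python (not any particular database's implementation; the node count, key names & helpers are made up): hash the key to pick a partition, then keep keys sorted within the node as a stand-in for a B-Tree.

```python
import bisect
from hashlib import md5

NUM_NODES = 4  # hypothetical cluster size

# Each "node" keeps its keys sorted (a stand-in for a B-Tree)
# alongside a dict holding the actual values.
nodes = [{"keys": [], "data": {}} for _ in range(NUM_NODES)]

def node_for(key: str) -> int:
    # Hash the clustered key to pick the partition (bucketing).
    return int(md5(key.encode()).hexdigest(), 16) % NUM_NODES

def put(key: str, value: dict) -> None:
    node = nodes[node_for(key)]
    if key not in node["data"]:
        bisect.insort(node["keys"], key)  # keep keys ordered within the node
    node["data"][key] = value

def get(key: str) -> dict | None:
    return nodes[node_for(key)]["data"].get(key)

put("user:42", {"clicks": 7})
print(get("user:42"))  # {'clicks': 7}
```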

Data-warehousing solutions are column based. While relational SQL databases store data in rows, wide-column stores such as Cassandra or HBase organize data by column family, making column-based analysis faster (DynamoDb, by contrast, is a key-value/document store). Secondary indexes are internally stored as new tables.
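
A toy illustration of the row vs. column layout (plain Python, hypothetical data): an analytic query only needs to touch one column in the column-oriented layout.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"user": "a", "country": "US", "spend": 10.0},
    {"user": "b", "country": "IN", "spend": 25.0},
    {"user": "c", "country": "US", "spend": 5.0},
]

# Column-oriented: each column is stored contiguously.
columns = {
    "user":    ["a", "b", "c"],
    "country": ["US", "IN", "US"],
    "spend":   [10.0, 25.0, 5.0],
}

# An analytic query like "total spend" touches every field of every
# record in the row layout...
total_row = sum(r["spend"] for r in rows)
# ...but reads just the 'spend' column in the column layout.
total_col = sum(columns["spend"])

assert total_row == total_col == 40.0
```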

When any of the 3Vs (Velocity, Variety & Volume) is high, a NoSQL database is a better choice than a relational SQL database.

For telemetry, all 3Vs can be large. So, why not just use a NoSQL database? Why even consider Hadoop?

Cost

Pricey

NoSQL databases can be expensive. As an example, Amazon's DynamoDb internally uses expensive storage such as SSDs or fast spinning disks. Data is looked up by a single key, with the hashing & sharding handled internally. 1 TB can cost $1k/month. A home-grown Cassandra solution is similarly expensive to administer & manage.

Lower cost

In contrast, Amazon's S3 can provide petabytes of storage for the same cost as a terabyte on DynamoDb, by using lower-cost storage on commodity hardware. Its cold storage service (Glacier) is even cheaper, using still cheaper storage underneath. S3 is an example of Blob storage; competing solutions include a custom Hadoop (HDFS) cluster, Google Cloud Storage or Azure Blob Storage.

When performance is not a concern, such as when using machine learning to analyze data in batches, a lower-cost Blob storage solution is the more cost-effective choice.

The data stored across multiple nodes can be processed in parallel on those nodes using Map-Reduce. Eg: to find the word count of a document, each node (holding a portion of the document) counts its words in parallel (Map) & the per-node counts are aggregated (Reduce).
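
A minimal single-process sketch of that word-count idea in plain Python (not Hadoop's actual API; the chunks stand in for portions of a document on different nodes): each "node" maps its chunk to counts, & the per-chunk counts are reduced into one total.

```python
from collections import Counter
from functools import reduce

# Portions of a document, as if stored on different HDFS nodes.
chunks = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
]

def map_chunk(chunk: str) -> Counter:
    # Map step: each node counts words in its own portion.
    return Counter(chunk.split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    # Reduce step: merge per-node counts into a single result.
    return a + b

word_counts = reduce(reduce_counts, map(map_chunk, chunks), Counter())
print(word_counts["the"])  # 3
```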

Data flow

When the data inflow is huge, as with telemetry, there is no time to edit existing data, which is slow. Data is stored as immutable (unchanging) records: new data is simply appended. To figure out the latest value for a key, reads have to scan the log of entries for that key.
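
A toy sketch of the append-only pattern in Python (not any specific store; the keys & values are made up): writes only append, & a read replays the entries for a key to find the latest value.

```python
import time

# Append-only log of (timestamp, key, value) entries.
# Nothing is ever updated in place.
log: list[tuple[float, str, dict]] = []

def append(key: str, value: dict) -> None:
    log.append((time.time(), key, value))

def read_latest(key: str) -> dict | None:
    # Scan the log for this key & keep the most recent entry.
    latest = None
    for _, k, value in log:
        if k == key:
            latest = value
    return latest

append("device:7", {"temp": 21.5})
append("device:7", {"temp": 22.1})  # newer reading appended, old one untouched
print(read_latest("device:7"))      # {'temp': 22.1}
```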

Queue

Software adage: Every software engineering problem can be solved by adding a new layer of abstraction (except the problem of too many abstractions).

When volume & velocity are high, we may need one more level of abstraction to accept the incoming data. Initially, there were single-server queueing solutions (eg: MSMQ). The message queue resides on one server for writes, while multiple readers drain it into a database. The single server can be scaled up with better CPU & memory, but it easily becomes a bottleneck.
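
A minimal single-process sketch of this pattern using Python's standard library (not MSMQ; the event shape & in-memory "database" are stand-ins): one queue accepts the writes, & consumer threads drain it into the database.

```python
import queue
import threading

events = queue.Queue()      # the single message queue
database: list[dict] = []   # stand-in for the downstream database
db_lock = threading.Lock()

def consumer() -> None:
    # Multiple consumers read from the queue & persist into the database.
    while True:
        event = events.get()
        if event is None:   # sentinel: stop this consumer
            break
        with db_lock:
            database.append(event)
        events.task_done()

workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()

for i in range(5):          # producers write to the single queue
    events.put({"event_id": i, "action": "click"})

events.join()               # wait until all events are persisted
for _ in workers:
    events.put(None)
for w in workers:
    w.join()

print(len(database))        # 5
```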

Distributed Queue

To overcome this bottleneck, distributed queues were implemented. This is conceptually similar to the database solutions we saw earlier, where a message is written to 3 or more nodes to improve availability & partition tolerance.

For example, Kafka uses a pub-sub (publisher-subscriber) mechanism. Incoming data is scaled out across multiple machines with redundancy. Producers (eg: HTTP services) publish data to topics, & consumers read from those topics & can persist the data to, say, a Hadoop HDFS system. Though data in Kafka can be queried, it is meant to be temporary; in contrast, HDFS is designed for long-term storage, batch processing & querying of data.
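
A minimal sketch using the kafka-python client, assuming a broker at localhost:9092 & a hypothetical "telemetry" topic; the local file is only a stand-in for an HDFS sink.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish telemetry events to the topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("telemetry", b'{"user": "a", "action": "click"}')
producer.flush()

# Consumer: read from the topic & persist the data elsewhere.
consumer = KafkaConsumer(
    "telemetry",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
with open("telemetry.log", "ab") as sink:  # local file as a stand-in for HDFS
    for message in consumer:
        sink.write(message.value + b"\n")
```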

Another example is Flume, which pushes data from a source into a memory or file channel, after which a Flume sink persists it to storage (such as a Hadoop HDFS system or a NoSQL database).

Conclusion

Depending on the requirements & budget, a big data inflow can be set up with a distributed queue that eventually persists to Blob storage or a NoSQL database.

Co-authored with Srinivasan Varadharajan.

