SQL, NoSQL or Hadoop for 'Big Data'?
In an earlier post, we looked at how to use the 3Vs (Volume, Velocity & Variety) of data & the CAP theorem (Consistency, Availability & Partition tolerance) to decide between relational SQL & NoSQL databases.
Continuing that discussion, let us consider technologies such as Hadoop (HDFS). When should we choose these over the others?
Problem
A huge volume of data is collected on user actions & behavior. This data is analyzed for correlations that improve our understanding of users, which in turn drives marketing & selling (upsell/cross-sell) of products. Even a 1% increase in sales/revenue can mean a huge financial windfall.
How big is 'Big Data'? How do we store this enormous amount of data, analyze it & generate reports? How do we make this cost-effective?
Relational SQL databases
The desire for user telemetry & high volume/velocity of data has always existed. Relational SQL databases were used first, with OLAP data-warehousing solutions layered on top to analyze the data. As the databases ran into scaling & performance issues, database administrators would ask engineers to drop triggers, drop foreign key constraints, drop non-clustered indexes (which store the index as a B-tree & refer back to the clustered index, causing a two-hop look-up), avoid expensive JOINs in queries, denormalize & avoid locks. Data would be looked up only by the clustered index key, alongside which the row data is stored. This effectively makes the setup close to a NoSQL database.
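A toy sketch of the two-hop cost mentioned above (the table & column names here are hypothetical, not from any particular engine): a non-clustered index maps a secondary key to the clustered key, which then locates the row, while a clustered-index lookup reaches the row directly.

```python
# Clustered index: row data lives with the primary key (1 hop).
clustered = {
    101: {"user_id": 101, "email": "a@example.com"},
    102: {"user_id": 102, "email": "b@example.com"},
}

# Non-clustered index: maps a secondary key to the clustered key only.
by_email = {"a@example.com": 101, "b@example.com": 102}

def lookup_by_id(user_id):
    return clustered[user_id]       # one hop

def lookup_by_email(email):
    user_id = by_email[email]       # hop 1: secondary index -> clustered key
    return clustered[user_id]       # hop 2: clustered key -> row

assert lookup_by_email("b@example.com")["user_id"] == 102
```

Dropping the secondary index removes the first hop, which is exactly the trade the DBAs above were asking for.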
NoSQL databases
NoSQL databases take the opposite approach. They are performant from the start, with only a clustered index. The key is hashed to determine the node partition (bucketing), & a B-tree or equivalent stores the data within the node. Additional constraints & indexes can be created gradually to provide functionality, but each one chips away at performance.
Data-warehousing solutions are column-oriented. While relational SQL databases store data in rows, column-family databases such as Cassandra, DynamoDb or HBase organize data by columns, making column-based analysis faster. Secondary indexes are internally stored as new tables.
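A minimal illustration of the row vs column layouts (hypothetical data): aggregating one column in a columnar layout touches only that column's values, instead of every row.

```python
# Row layout: each record stored together (good for point lookups).
rows = [
    {"user": "a", "clicks": 3, "country": "US"},
    {"user": "b", "clicks": 7, "country": "IN"},
]

# Columnar layout: each column stored contiguously (good for analytics).
columns = {
    "user":    ["a", "b"],
    "clicks":  [3, 7],
    "country": ["US", "IN"],
}

# Summing clicks reads only the "clicks" column, not whole rows.
assert sum(columns["clicks"]) == 10
```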
When any of the 3Vs (Velocity, Variety & Volume) are high, a NoSQL database is a better choice than a relational SQL database.
For telemetry, all 3Vs can be large. So, why not just use a NoSQL database? Why even consider Hadoop?
Cost
Pricey
NoSQL databases can be expensive. As an example, Amazon's DynamoDb internally uses costly storage such as SSDs or fast spinning disks, & data is looked up by a single key via internal hashing & sharding. 1 TB can cost around $1k/month. A home-grown Cassandra solution would be similarly expensive to administer & manage.
Lower cost
In contrast, Amazon's S3 can provide petabytes of storage for the cost of a terabyte on DynamoDb, by using lower-cost storage on commodity hardware. Its cold-storage service (Glacier) is cheaper still, using even cheaper storage underneath. This is an example of Blob storage. Competing solutions include a custom Hadoop (HDFS) cluster, Google Cloud Storage or Azure Blob Storage.
When performance is not a concern, such as when analyzing data in batches with machine learning, a lower-cost Blob storage solution is more cost-effective.
The data stored across multiple nodes can be processed in parallel using Map-Reduce. Eg: to find the word count of a document, each node searches its portion of the document in parallel (Map) & the responses are aggregated (Reduce).
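The word-count example can be sketched as follows (each list entry stands in for a node's shard of the document; a real cluster would run the map step on separate machines):

```python
from collections import Counter
from functools import reduce

# Each "node" holds a portion of the document.
shards = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: each node counts the words in its own shard, in parallel.
mapped = [Counter(shard.split()) for shard in shards]

# Reduce: aggregate the partial counts into a single result.
total = reduce(lambda a, b: a + b, mapped)

assert total["the"] == 3
assert total["fox"] == 2
```

The map step needs no coordination between nodes, which is what lets Hadoop scale this over commodity hardware.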
Data flow
When the data inflow is huge, as in telemetry, there is no time to edit existing data, which can be slow. Data is stored as immutable (unchanging) records; new data is simply appended. Reads then have to scan the transaction log for a given key to determine the latest value.
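A minimal sketch of this append-only pattern (real systems add compaction & indexing so reads need not scan the whole log):

```python
# Append-only log: writes never modify existing entries.
log = []

def write(key, value):
    log.append((key, value))        # append, never edit in place

def read(key):
    """Scan the log & return the latest value written for this key."""
    latest = None
    for k, v in log:
        if k == key:
            latest = v
    return latest

write("user:1", {"plan": "free"})
write("user:1", {"plan": "pro"})    # supersedes the old entry, no overwrite
assert read("user:1") == {"plan": "pro"}
```

Writes are cheap (a single append); the cost is shifted to reads, which suits write-heavy telemetry workloads.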
Queue
Software adage: Every software engineering problem can be solved by adding a new layer of abstraction (except the problem of too many abstractions).
When volume & velocity are high, we may need one more level of abstraction to accept the incoming data. Initially, there were queueing system solutions (eg: MSMQ). The message queue would reside on a single server for writes but allow multiple readers, which would drain it into a database. The single server can be scaled up with better CPU & memory, but easily becomes a bottleneck.
Distributed Queue
To overcome this bottleneck, distributed queues were implemented. These are conceptually similar to the database solutions we saw earlier, where a message is written to 3 or more nodes to improve availability & partition tolerance.
For example, Kafka has a pub-sub (publisher-subscriber) mechanism. All incoming data can be scaled out to multiple machines with redundancy. Producers (eg: HTTP front-ends) publish data to topics, & consumers read from topics & can persist the data to, say, a Hadoop HDFS system. Though this data can be queried, it is meant to be temporary. In contrast, an HDFS system is designed for long-term storage & batch processing & querying of data.
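A toy in-memory version of the pub-sub idea (Kafka-like concepts, deliberately not the Kafka client API; topic names & messages are hypothetical):

```python
from collections import defaultdict, deque

# Each topic is an ordered buffer of messages.
topics = defaultdict(deque)

def publish(topic, message):
    """Producer side: append a message to the topic."""
    topics[topic].append(message)

def consume(topic):
    """Consumer side: drain messages from the topic in order."""
    while topics[topic]:
        yield topics[topic].popleft()

publish("telemetry", {"event": "click"})
publish("telemetry", {"event": "view"})

# A consumer drains the topic, eg: to persist the batch to HDFS.
persisted = list(consume("telemetry"))
assert len(persisted) == 2
```

Kafka adds what this sketch omits: the topic is partitioned & replicated across brokers, so publishing & consuming scale out rather than funneling through one server.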
Another example is Flume, which pushes data to a memory or disk channel, from which a Flume sink persists it to storage (like a Hadoop HDFS system or a NoSQL database).
Conclusion
Depending on the requirements & budget, big data inflow can be set up with a distributed queue that eventually persists to Blob storage or a NoSQL database.
Co-authored with Srinivasan Varadharajan.