Developer Notes: noSQL vs SQL. An Overview and Use Cases


Overview

noSQL has taken the database world by storm, offering a breathtaking alternative to the rigid row-column SQL model. With the advent of Big Data (structured and unstructured) from social media, IoT (Internet of Things), and cloud applications, computer scientists and database experts had to come up with new database models (NoSQL and NewSQL), because forcing traditional SQL to accommodate Big Data structures had become a day-to-day database hack. Analytics over historical and live-streaming massive data made this effort even more pressing. Here I use the term noSQL interchangeably with noSQL database(s), as they mean the same technological entity. The same applies to SQL and SQL database(s).

Still, by and large, developers who are already comfortable with SQL databases, and whose applications are not in the Big Data or analytics category, wonder whether shifting to noSQL is really worth the effort. I had the same skeptical feeling before delving into noSQL. To date there are more than 225 noSQL databases available, falling under various data-structure categories. The most common types are:

  • Wide Column Store / Column Families,
  • Document Store,
  • Key Value / Tuple Store,
  • Graph Databases, and
  • Multi-model Databases

Still, there are also at least 19 new SQL databases emerging into the database market. These are known as NewSQL databases, and they try to combine the benefits of NoSQL and SQL. Changing your data storage implementation to noSQL means embarking on a new learning curve: understanding the paradigm and mechanisms of noSQL, and its potential to solve certain classes of common problems where SQL has limited capability and flexibility. So where does noSQL fit in among the many application use cases?

Transactions

In a nutshell, noSQL could be used for any application's data storage except where strong, deterministic ACID transactions are needed to ensure information reliability and strong consistency at a strict atomic transaction level. If something goes wrong in any of a transaction's sub-tasks, the whole transaction is considered null and void, and the data is rolled back to the previous valid state. For less stringent transaction requirements, noSQL with eventual consistency (usually limited to a single key-value entry) can be sufficient, or even a better match than SQL, for many data storage and query scenarios. That said, some noSQL databases, especially those of the Multi-Model type, also support ACID transactions and offer SQL queries; take ArangoDB and Microsoft CosmosDB, for example. Recently, MongoDB added transaction support across multiple key-value documents, which can be regarded as ACID transaction support.

Data Structure

An SQL schema is anchored in the database's table structure and the relationships between tables. Those relationships allow us to enforce referential integrity and cascade data insertions, updates, and deletions. They let us normalize information into different interconnected tables, where references between tables' primary keys and foreign keys are all that is needed to compose join queries that retrieve specific records. By default, referential integrity, if enforced, makes the SQL database responsible for read-write data validation. In many scenarios, however, experienced SQL developers do not use referential integrity and turn off the cascading effects; I myself prefer integrity control imposed at the application level. I leave it to you to find more discussion on this. When it comes to multi-dimensional (cube) queries in data warehousing and analytics, denormalization of data is more appropriate and is employed at various levels in the dimension tables that relate to fact tables. This type of database structure is crucial for read performance, especially in reporting and analytic queries.
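As a concrete illustration of denormalization, here is a minimal Python sketch (with hypothetical customer, product, and order tables) that flattens normalized records into a single wide fact row, the shape a reporting query wants to read:

```python
# Hypothetical example: flattening normalized order data into a single
# denormalized fact row, as commonly done for reporting and analytics.

customers = {1: {"name": "Acme Corp", "region": "East"}}
products = {10: {"name": "Widget", "category": "Hardware"}}
orders = [{"order_id": 100, "customer_id": 1, "product_id": 10, "qty": 3}]

def denormalize(order):
    """Join the normalized tables in code, producing one wide row."""
    cust = customers[order["customer_id"]]
    prod = products[order["product_id"]]
    return {
        "order_id": order["order_id"],
        "customer_name": cust["name"],
        "customer_region": cust["region"],
        "product_name": prod["name"],
        "product_category": prod["category"],
        "qty": order["qty"],
    }

fact_rows = [denormalize(o) for o in orders]
```

A report can now scan `fact_rows` without any joins; the cost is that customer and product attributes are duplicated in every row.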

On the other hand, noSQL databases are so open that you can throw almost any data structure into them and dynamically change the schema of one record without affecting others. You can conceptually normalize and denormalize data based on your requirements and control integrity validation at the application level. Furthermore, noSQL sacrifices referential-integrity locks: every update and deletion actually creates a new file (previous files are immutable) with the latest timestamp and a smaller index. When you read data or a document, the noSQL engine internally searches for the files with the latest timestamp. Files are periodically merged later using a technique called compaction, which removes the duplicates left behind by deletions and updates.
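The append-only write-and-compact behavior described above can be sketched in a few lines of Python. This is a toy model, not any particular database's implementation: writes only append timestamped entries, a delete is a tombstone, reads pick the newest entry, and compaction merges everything down:

```python
# Toy sketch of the append-only storage pattern: every update or delete
# appends a new timestamped entry; reads pick the latest; compaction
# merges entries, dropping superseded versions and tombstones.

log = []  # append-only storage: (key, value, timestamp); value None = delete

def write(key, value, ts):
    log.append((key, value, ts))

def read(key):
    entries = [e for e in log if e[0] == key]
    if not entries:
        return None
    latest = max(entries, key=lambda e: e[2])
    return latest[1]  # None if the latest entry is a tombstone

def compact():
    """Keep only the newest entry per key and drop tombstones."""
    latest = {}
    for key, value, ts in log:
        if key not in latest or ts > latest[key][2]:
            latest[key] = (key, value, ts)
    log[:] = [e for e in latest.values() if e[1] is not None]

write("user:1", {"name": "Ann"}, 1)
write("user:1", {"name": "Anna"}, 2)   # update = new immutable entry
write("user:2", {"name": "Bob"}, 3)
write("user:2", None, 4)               # delete = tombstone
compact()
```

After `compact()`, only the live version of `user:1` remains; the overwritten entry and the deleted key have been merged away.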

noSQL at its core is greatly influenced by a 2006 research paper published by Google: Bigtable: A Distributed Storage System for Structured Data. Many noSQL databases such as MongoDB, CouchDB, and Cassandra employ the Log-Structured Merge-Tree (LSM-Tree) indexing technique, as opposed to the B-Tree that has long been adopted by SQL databases. Lately, new work has proposed a scalable LSM-Tree (SLSM-Tree) for low-latency analytics. Fractal-Tree indexing has also emerged as a refinement of the B-Tree algorithm; it was popularized by Tokutek (acquired by Percona).



(Figures: B-Tree, LSM-Tree, and Fractal-Tree index structures.)

High Availability

High availability is a fail-over and disaster recovery mechanism that adopts several methods of data replication (snapshot, transactional, and merge), spread and load-balanced across many servers. Both noSQL and SQL offer high availability with master-to-slave clustering and load balancing, usually with a single read-write master and several read-only slaves. There are also noSQL databases that support a master-to-master replication approach; take a look at Cassandra and a subset of MariaDB. To illustrate the above, I will walk through the high availability and redundancy strategies of MongoDB, PostgreSQL, and Cassandra:

MongoDB

MongoDB uses a group of asynchronous processes maintaining the same data set, called a replica set. It consists of a primary and one or more secondary instances. A secondary can be promoted to primary if the current primary fails. MongoDB's redundancy uses an election to vote on which secondary is suitable for promotion.



MongoDB also introduces another approach to help resolve the election that promotes a secondary instance to primary. It uses a special instance that acts as an arbiter. An arbiter does not maintain any data set and is used only to break the tie when the data-bearing instances hold an even number of votes.
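The role the arbiter plays can be illustrated with a small Python sketch of majority voting. This is a simplified model of the idea, not MongoDB's actual election protocol, and the member names are hypothetical:

```python
# Toy illustration of majority-based failover: a new primary can be
# elected only if a majority of voting members (data-bearing nodes plus
# an optional arbiter) are still reachable.

def can_elect_primary(voting_members, reachable):
    """A new primary needs strictly more than half of all votes."""
    votes = sum(1 for m in voting_members if m in reachable)
    return votes > len(voting_members) / 2

members = ["primary", "secondary1", "secondary2"]
# Primary goes down: the two secondaries still form a majority (2 of 3).
assert can_elect_primary(members, {"secondary1", "secondary2"})

# With only two data nodes, one survivor is not a majority (1 of 2)...
assert not can_elect_primary(["primary", "secondary1"], {"secondary1"})
# ...but adding an arbiter as a third voter breaks the tie (2 of 3).
assert can_elect_primary(["primary", "secondary1", "arbiter"],
                         {"secondary1", "arbiter"})
```

This is why an arbiter is useful with an even number of data-bearing members: it contributes a vote without storing any data.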

PostgreSQL

A PostgreSQL master instance streams write-ahead log (WAL) segments synchronously to one or more slaves, ensuring that no active transaction is in flight while the WAL is being written. The result is a complete copy of the master on each slave.

PostgreSQL also allows third-party providers to enhance its replication approaches. Slony-I and Pgpool-II are the most popular PostgreSQL replication add-ons.

Cassandra

Cassandra employs a master-less replication strategy in which replicas can be added dynamically under each sharded node. Cassandra's ring of nodes covers the full range of the data set, with primary key hashes used for horizontal scaling. Each node, in addition to holding its primary sharded data, can also store replicas that belong to other primaries (a different range).

When clients write data, the Cassandra coordinator forwards those writes to all replicas. The coordinator needs to receive write acknowledgements from a sufficient number of replicas, as configured by the chosen write and read consistency levels.

The consistency level determines the trade-off between accuracy and availability. For example, setting the write and read consistency level to ALL ensures the accuracy of data on all replicas, but at the lowest level of availability, since every replica must confirm its writes and reads. If one replica fails, the write or read is considered a failed operation. For more detail on Cassandra's consistency levels, please see the Cassandra documentation.
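The accuracy-versus-availability trade-off follows a simple rule of thumb: with replication factor N, write consistency W, and read consistency R, a read is guaranteed to overlap the latest write when R + W > N. A minimal sketch of that arithmetic:

```python
# Sketch of the read/write consistency trade-off: with N replicas,
# write level W, and read level R, reads are guaranteed to see the
# latest write when R + W > N (the read and write sets must overlap).

def is_strongly_consistent(n_replicas, write_level, read_level):
    return write_level + read_level > n_replicas

def quorum(n_replicas):
    """QUORUM = a strict majority of replicas."""
    return n_replicas // 2 + 1

N = 3
assert is_strongly_consistent(N, write_level=N, read_level=1)      # ALL writes
assert is_strongly_consistent(N, quorum(N), quorum(N))             # QUORUM both
assert not is_strongly_consistent(N, write_level=1, read_level=1)  # ONE / ONE
```

QUORUM reads and writes (2 of 3 here) give strong consistency while still tolerating one failed replica, which is why they are a common middle ground between ONE and ALL.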

For disaster recovery, the whole cluster of nodes with replicas can be extended across geographically dispersed data centers.

Vertical and Horizontal Scaling

Adding more power to one database server, with more RAM, disk space, and CPU, is regarded as scaling the database vertically. This is quite easy. But when it comes to scaling a database horizontally into multiple physical or virtual nodes, it is a daunting task, especially for SQL databases, to ensure the consistency of the operation as a whole. Horizontal scaling means you can distribute your database workload across multiple instances (nodes), whether virtually on a single server, across multiple physical servers, or in a cloud grid. The rise of container technology such as Docker, with its small server-process footprint, has made it possible to scale database services out to thousands of nodes in the cloud. Latency and throughput are often used to benchmark the performance of horizontal scaling.

One published benchmark compares the average time taken by clusters of Cassandra vs MySQL NDB (Network Database) for CRUD (create-read-update-delete) operations; you can find the details on the source blog.

Horizontal scaling is a big topic in itself, and I cannot compare every noSQL and SQL database here. Both noSQL and SQL providers (especially NewSQL providers) are racing to provide better configurations for distributed database service. Some databases offer elastic scaling, meaning we can scale up capacity, scale out the number of nodes, or strip down both capacity and nodes at any time without impacting connected client applications. Most noSQL and SQL databases use a technique called sharding to distribute data across a number of nodes; the data is partitioned across nodes under certain ranges of keys. To get some insight into both offerings for scalability, I have chosen MongoDB, VoltDB, and Cassandra, technologies I believe are quite revolutionary.

MongoDB

MongoDB uses a few approaches to sharding to scale out the database. MongoDB also allows certain collections (collections are like tables in SQL) not to be sharded and to live only in the primary partition. The whole non-replicated database, or a replicated one (a master and replicas) built beforehand for high availability, can be sharded into a cluster of multiple units or groups of replicated instances.

A MongoDB sharded cluster consists of one or more shard instances plus load-balanced config servers that store metadata unique to that sharded cluster. The config servers also hold authentication settings for the cluster: authentication can be RBAC (Role-Based Access Control), a shared secret key of 6 to 1024 characters, or x.509 certificates over a secure TLS 1.1+/SSL connection. Routers receive query instructions from client applications, consult the config servers' metadata, and route each query to the appropriate partitions of the cluster based on that information.

MongoDB allows two strategies for sharding:

  • Hashed Sharding - chunks of data based on the hashed shard key values
  • Ranged Sharding - partitions based on ranges of shard key values.

These shard keys can be further grouped into zones. Zones act like tags: you can group sharded partitions based on your custom zones to separate shard instances into more meaningful categories.
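The difference between the two strategies can be sketched in a few lines of Python. This is an illustration with a hypothetical shard layout, not MongoDB's internal routing code: ranged sharding keeps nearby keys together on one shard, while hashing scatters them evenly:

```python
import hashlib

# Illustration of the two sharding strategies on a hypothetical layout:
# ranged sharding keeps adjacent keys on the same shard; hashed sharding
# spreads keys evenly regardless of their ordering.

def ranged_shard(key, boundaries):
    """Route a key to the first range whose upper boundary exceeds it."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)

def hashed_shard(key, num_shards):
    """Route a key by hashing it, ignoring its natural order."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [1001, 1002, 1003]
ranged = [ranged_shard(k, [1000, 2000]) for k in keys]   # all on one shard
hashed = [hashed_shard(k, 3) for k in keys]              # scattered
```

Ranged sharding makes range queries cheap (one shard holds the whole range) but risks hot spots on monotonically increasing keys; hashed sharding balances load but turns a range query into a scatter-gather across shards.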

VoltDB

VoltDB is an in-memory, highly scalable, ACID-compliant SQL database built for throughput over latency. VoltDB provides shell commands, system procedures, and automated snapshots to enable data persistence to disk. VoltDB prefers to scale tables out into multiple partitions, with each partition deployed to one CPU core. This configuration is designed to achieve a single-threaded process per partition, making it possible to execute queries without locking, latching, or transaction logs.

One VoltDB cluster of partitions can be replicated with one or multiple physical machines per replica. This is different from MongoDB, where the replication instances are sharded into a sharded cluster. VoltDB's horizontal scaling is closely related to its transaction model: VoltDB automatically manages the routing of queries to the sharded or partitioned tables. Client applications do not need to do extra work when sending common SQL queries, while VoltDB's routing manages reads and writes into partitions in the background.

This is possible because each partition contains both data and an execution engine. As VoltDB prefers scheduling against shared memory to achieve better concurrency, the execution engine contains a queue of transaction requests. Requests are then run to completion serially (in single-threaded fashion) at each partition.
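A toy model of this run-to-completion design shows why no locks are needed for single-partition work: within one partition there is only ever one transaction executing at a time. This is a sketch of the idea, not VoltDB's actual engine:

```python
from collections import deque

# Toy model of per-partition serial execution: each partition owns its
# data and a queue, and drains the queue one transaction at a time, so
# single-partition transactions need no locks or latches.

class Partition:
    def __init__(self):
        self.data = {}
        self.queue = deque()

    def submit(self, txn):
        self.queue.append(txn)

    def run(self):
        # Single-threaded: run each transaction to completion serially.
        while self.queue:
            txn = self.queue.popleft()
            txn(self.data)

p = Partition()
p.submit(lambda d: d.update(balance=100))
p.submit(lambda d: d.update(balance=d["balance"] + 50))
p.run()
```

Because the two transactions can never interleave, the second one always sees the first one's result; the trade-off is that a cross-partition transaction must coordinate several of these serial queues.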

Cassandra

Cassandra was created at Facebook in 2008, based on Google Bigtable (2006) and Amazon Dynamo (2007). In February 2010 it graduated to a top-level Apache project. It combines the distributed features of Dynamo with the data model and storage structure of Bigtable. Cassandra is elastic by default: it automatically rehashes its partitions around the ring when nodes are added or removed.

A Cassandra cluster can be visualized as a ring because it uses a peer-to-peer architecture in which data is spread evenly across nodes (partitions) with a consistent hashing algorithm. Under this architecture, every node uses a gossip protocol to communicate its status and state and to pass queries to nearby peers. Each node stores a segment of data based on a token range (hashed key); this token determines the node's position in the ring.
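The token-ring idea can be captured in a minimal consistent-hash sketch. The node names and the tiny 0-359 token space are hypothetical simplifications; the point is that a key is owned by the first node clockwise from its hash position:

```python
import bisect
import hashlib

# Minimal consistent-hash ring sketch: each node owns a token (its
# position on the ring), and a key is stored on the first node found
# clockwise from the key's own hash position.

def token(value):
    """Hash a string onto a small circular token space (0-359)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 360

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def node_for(self, key):
        tokens = [t for t, _ in self.ring]
        # Wrap around to the first node when we pass the highest token.
        i = bisect.bisect_right(tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")  # deterministic: same key, same node
```

Because each node only owns the arc between its predecessor's token and its own, adding or removing a node remaps only that arc rather than rehashing the whole data set, which is what makes the ring elastic.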

Cassandra nodes can be extended across multiple virtual or physical machines. The system is decentralized and ensures no single point of failure: the whole cluster, and any extended cluster, are all considered masters; there are no slaves. Adding nodes increases throughput linearly. As I described before, Cassandra allows write and read consistency levels to be tuned between strong and eventual consistency.

Use Cases

Here I lay down three use cases each for the types of applications that I feel are better suited to noSQL and to SQL databases respectively. This is based on my limited personal view; feel free to comment on whether you agree, or if you want to add more use cases.

NoSQL Database Use Cases

Human Resource Management

  • When profiling information for employees, there is a high probability that new requirements will change the types of records kept in the database schema, and new columns must be created as a result; e.g. a new tax deduction applying only to certain types of employees, or a new column for family information. Flexibility is needed, as introducing new columns can impact overall system stability. Using a document-based NoSQL database like MongoDB could eliminate the need to update a substantial amount of backend code attached to the target database.

Product Cataloging System

  • Product descriptions and specifications vary across categories; take, for example, scientific apparatus with mechanical gears. It would be cumbersome for an SQL database to maintain a different product table for each category of products, and it is inefficient to store a large chunk of specifications in one column of a record: parsing that specification would be a nightmare and error-prone. The flexibility offered by a NoSQL key-value or document database can rescue this limitation of SQL.

Survey Application

  • When analyzing use cases in business, medicine, sports, or engineering, the survey forms used to collect responses from participants are frequently updated with fewer or new parameterized questions. There is also often a complex study of relationships among the survey's indicators. Graph databases and key-value NoSQL databases are better suited to adapting to such ever-changing schema updates.

SQL Database Use Cases

Billing System

  • The nature of a billing system is that once deployed to production, it is very rarely updated with new columns; if it were, a newly introduced column would affect all existing billing records. Such a rigid schema design is more appropriate to an SQL database.

Retail Banking System

  • ACID (Atomicity, Consistency, Isolation, Durability) transaction behaviour rolls back the transaction if any task bound to it fails. The atomicity of the transaction is critical to ensure that banking operations are deterministic. Even though NoSQL databases like MongoDB now support ACID transactions across multiple documents, they are no match for the maturity of SQL databases like Oracle and MSSQL for deterministic read-write operations.

Accounting System

  • The schema of an accounting system and its reporting is usually governed by standards such as IFRS (International Financial Reporting Standards). Such a rigid schema, with its decimal-precision requirements, makes an SQL database preferable. That does not mean we cannot use a NoSQL database to power an accounting system, but the facilities an SQL database provides for numeric precision, together with its simplicity and maturity in aggregating numbers, make it far more appropriate for this kind of use case.

An early decision on the database technology to be employed is critical for new or updated application development. The solution architect must consider current and projected daily data growth, high availability, scalability, the breadth of reference material on the subject, good documentation and guides from the provider, and the knowledge development of the team members. Apart from the above, there are Multi-Model, In-Memory, Real-Time, and Graph databases that are worth further investigation.

Finally, SQL giants like Oracle, Microsoft SQL Server, PostgreSQL, MariaDB, MySQL, and IBM DB2 are all offering In-Memory and NoSQL features in their products. This approach could give their existing clients huge benefits in flexibility and performance.

I hope you enjoyed reading this article. If you found it interesting, do click the "Like" button below and share it. Do not forget to click the "Thumbs Up" button if you feel this really helped you move forward. If anything in this article is inaccurate, please correct me and I will update it accordingly. Thanks.









