SQL vs. NoSQL
Maria Masood
Certified Data Scientist | Project Manager (PMI) | PowerBI | Tableau | Teradata | AI | NN | NLP
Relational databases have been around since at least the 1970s, and there are several reasons behind their staying power. The data is fairly well structured: records are organized into tables, and tables consist of rows that are identified by unique keys, or primary keys. We organize data into tables and then join or link them together, so we don't have to lump all of our data into one large structure. Another important feature is support for transactions, summed up by the acronym ACID: Atomicity, Consistency, Isolation, and Durability. Atomicity means that a multi-step operation, like transferring funds from your checking account to your savings account, either completes every step or has no effect at all. Consistency means that the database is always kept in a consistent state: it follows all of the rules and constraints that you've specified for your database. Isolation means that concurrent transactions don't interfere with each other. Finally, Durability means that once a transaction is committed, its data is stored persistently, so you don't have to worry about losing it if power is lost or your server crashes.
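To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names, and the transfer helper are assumptions made up for illustration; they are not taken from any particular banking system.

```python
import sqlite3

# Hypothetical schema and account names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts (name, balance) VALUES (?, ?)",
    [("checking", 100), ("savings", 50)],
)
conn.commit()


def transfer(conn, source, target, amount):
    """Move amount from source to target; both updates commit or neither does."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance, so the credit
        # to the target account is rolled back as well.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling transfer(conn, "checking", "savings", 500) should fail on the second update and leave both balances untouched, which is the all-or-nothing behavior that atomicity describes. The CHECK constraint also doubles as a small example of consistency, since the database refuses to store a state that breaks its rules.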
Those are some of the main features of relational databases. Normalized data models are also important: normalization means structuring the data in ways that minimize the chance of introducing mistakes, or anomalies as they're known. Furthermore, relational database management systems (RDBMSs) are widely supported by application development tools and programming languages. Their schemas, or data structures, are fixed, so we know how our schemas behave when we start structuring programs. So why should any of us even bother turning our attention to NoSQL databases? There are three main reasons. In an RDBMS, joins can be computationally costly. Only limited kinds of data structures can be stored in tables. And finally, RDBMSs are difficult to scale, which is becoming a problem as data volumes grow bigger and bigger.
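As a small illustration of normalization and joins, the sketch below again uses Python's sqlite3 module. The customers and orders tables and their contents are hypothetical sample data, assumed only for this example.

```python
import sqlite3

# Hypothetical normalized schema: each customer name is stored once,
# and orders refer to customers by key instead of repeating the name.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount INTEGER NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25), (11, 1, 40), (12, 2, 15);
    """
)


def totals_by_customer(conn):
    """Join the two tables to compute the total order amount per customer."""
    rows = conn.execute(
        "SELECT c.name, SUM(o.amount) "
        "FROM customers AS c "
        "JOIN orders AS o ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_id "
        "ORDER BY c.name"
    )
    return rows.fetchall()
```

Because a customer's name lives in exactly one row, renaming a customer is a single update and cannot leave stale copies behind. The price is the join: every report that needs names next to order amounts has to match rows across the two tables, and that matching is the work that gets expensive as tables grow.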
Fortunately, there are a few ways to work around these limits. Denormalization is a technique that allows us to avoid joins: we expand the number of columns in a single table and organize our tables so that a query only needs to read from a single table. This does improve read performance, but it also introduces the possibility of data anomalies. Sharding is another technique: a way of breaking up a database and storing pieces of it on different servers, so that a query runs against a subset of the data rather than the entire data set. This improves read and write performance, but it is complex to manage and organize. Replication is also common in practice: we copy the data stored in tables and indexes and store those copies on different servers, so that different servers can respond to different queries. This improves read performance, but it introduces the possibility of inconsistencies between the copies, so that's something we need to manage.
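The idea behind sharding can be sketched in a few lines of Python. This is a toy under stated assumptions: the ShardedStore class is hypothetical, each "server" is just an in-memory dictionary, and keys are routed by a simple hash modulo the number of shards. Production systems typically use more elaborate routing schemes.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys across several shards.

    Each shard is a plain dict standing in for a separate server.
    """

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Hash the key with a stable hash so the same key always maps
        # to the same shard, then reduce it to a shard index.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        # Only one shard is consulted, never the whole data set.
        return self.shards[self._shard_for(key)].get(key)
```

A lookup touches a single shard, which is where the read and write gains come from. The sketch also hints at the management cost: with this modulo scheme, changing the number of shards changes where most keys belong, so the data would have to be redistributed.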
So, relational databases are quite useful and have many features, but they have some disadvantages that we can sometimes work around, and those workarounds come with disadvantages of their own. That brings us to NoSQL: database systems that naturally allow us to denormalize data and that support scalability. In exchange for these benefits we accept some tradeoffs, chiefly relaxed ACID constraints. ACID is very important in many application areas, but not in all, so NoSQL is a good option for applications that don't require full ACID compliance. Many NoSQL databases also have built-in support for sharding, which makes it easier to scale and to improve read and write performance. Finally, NoSQL databases offer us new ways to query our data, which is especially important in Data Science, where we're dealing with large, complex data sets. Sometimes simple queries across tables are insufficient, but with NoSQL databases we have new ways to find patterns in documents and hierarchical structures, and even to navigate and traverse graph structures. So, NoSQL databases let us query more complex data structures than we're able to in relational databases, and that's one of the key drivers for using NoSQL in Data Science.
NoSQL databases are designed to overcome the limitations of relational databases. There are four different kinds. The simplest is the Key-Value database, which works like a dictionary: you know a word and you're able to look up its value. For example, if you have a person's ID, you could look up their first name. In general it's not that valuable in Data Science, so we won't spend too much time on Key-Value databases. The second is the Document database. What distinguishes a document database is that it stores multiple key-value pairs in a structure called a document. Documents roughly parallel rows in a table. Keys can be scalar, which just means they're simple data types like integers or strings, but the values themselves can be more complex structures, like lists or arrays. The third is the Wide Column database, which is the most similar to relational databases. We have to be careful, though: it uses terms like table and column, but it remains different. In a wide column database the data is denormalized and columns are not fixed, so we can add columns on the fly in our application, and rows in the same table can even have different columns. As in document databases, values can be complex structures such as arrays and lists. The fourth type is the Graph database. Graphs are networks with two parts: entities and the relations between entities, which are represented by links, or edges. Both entities and edges have properties that we can query on, and we can also query on the links and paths between entities.
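The four data models can be pictured with plain Python structures. This is only a sketch of the shapes of the data, not of any real database product, and every name and value below is hypothetical sample data.

```python
from collections import deque

# Key-value: one opaque value per key, looked up like a dictionary entry.
key_value = {"user:42": "Ada"}

# Document: many key-value pairs in one document; values may be
# lists or nested structures.
document = {
    "_id": 42,
    "first_name": "Ada",
    "skills": ["SQL", "Python"],
    "address": {"city": "London"},
}

# Wide column: rows in the same table may carry different columns.
wide_column = {
    "row-1": {"name": "Ada", "city": "London"},
    "row-2": {"name": "Grace", "languages": ["COBOL"]},
}

# Graph: entities with properties, plus directed edges with properties.
nodes = {
    "ada": {"role": "analyst"},
    "grace": {"role": "engineer"},
    "alan": {"role": "researcher"},
}
edges = [
    ("ada", "grace", {"relation": "knows"}),
    ("grace", "alan", {"relation": "knows"}),
]


def shortest_path(edges, start, goal):
    """Breadth-first search that follows edges from start to goal."""
    adjacency = {}
    for source, target, _props in edges:
        adjacency.setdefault(source, []).append(target)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path exists along the directed edges
```

The shortest_path helper is the kind of path query that graph databases are built for. Expressing the same question against relational tables would usually take one self-join per hop.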
This flexibility matters to data scientists, who need to adapt to changing requirements while scaling to meet the compute and storage needs of their analytic projects. Choosing the right kind of database upfront helps you save time and avoid a wild goose chase. As a data scientist, you are a commander with limited resources (i.e., time).
Next, we'll get an overview of CRUD operations with NoSQL. In upcoming chapters we'll also learn about splitting datasets, deciding on hyperparameters, and setting up cross-validation, so stay tuned.
Until Next Time!
Maria Masood