THE SEVEN DATABASE PARADIGMS

Issack Wambugu

Data & Analytics Engineer

Published: January 9, 2021

I believe that we should always use the right tool for the job. In app development, choosing the right database is one of the most important decisions you will make. I came across a very informative video on the topic and felt the need to break it all down for you. The list starts with the simplest types of database and gradually becomes more complex as we get to number seven.

1. Key-value Databases

Popular databases in this space include Redis and Memcached. The database itself is structured almost like a JavaScript object or a Python dictionary: you have a set of keys, where each key is unique and points to a value.

In Redis you can read and write data using commands:

redis> SET User:23:bio "Data Engineering"

OK

redis> GET User:23:bio

"Data Engineering"

We use the SET command followed by a key and a value to write data, and the GET command followed by the key to retrieve that data later. In the case of Memcached and Redis, the data is held in the machine's memory, whereas most other databases store data on disk. This limits the amount of data you can store, but it makes the database extremely fast because it doesn't require a round trip to the disk for every operation. In addition, it doesn't support queries, joins or anything like that, so your data modelling options are very limited. But then again, it is very fast, as in sub-millisecond fast. You wouldn't want to use a key-value database for your main app data; instead, they are often used as a cache on top of some other persistent data store to reduce latency. Apps like Twitter, GitHub and Snapchat all use Redis for real-time delivery of their data. Other use cases are message queues, pub/sub and gaming leaderboards.
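The caching pattern described above can be sketched in plain Python. This is only an illustration and does not use Redis itself: the `TTLCache` class, the `slow_lookup` function and the 60-second expiry are all hypothetical names and values chosen for the example.

```python
import time


class TTLCache:
    """A tiny in-memory key-value store with optional per-key expiry (illustrative only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry time or None)

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # the entry has expired, so drop it
            return None
        return value


def slow_lookup(user_id):
    """Hypothetical stand-in for a query against the persistent primary database."""
    slow_lookup.calls += 1
    return f"bio for user {user_id}"


slow_lookup.calls = 0


def get_bio(cache, user_id):
    """Read through the cache: only fall back to the slow store on a miss."""
    key = f"user:{user_id}:bio"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = slow_lookup(user_id)
    cache.set(key, value, ttl_seconds=60)
    return value


cache = TTLCache()
first = get_bio(cache, 23)   # miss: goes to the slow store
second = get_bio(cache, 23)  # hit: served from memory
```

The second call never touches the slow store, which is the whole point of putting a key-value cache in front of a slower database.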

2. Wide-column Databases

Popular options in this family include Cassandra and Apache HBase. A wide-column database is like a key-value database with a second dimension added to it. At the outer layer we have a keyspace, which holds one or more column families, and each column family holds a set of ordered rows. This makes it possible to group related data together, but unlike a relational database it does not impose a fixed schema, so it can handle unstructured data. This is nice for developers, since in Cassandra you get a query language called CQL that is very similar to SQL, although much more limited because you can't do joins. On the other hand, it is much easier to scale up and replicate data across multiple nodes: unlike a typical SQL database, it is decentralized and can scale horizontally. A popular use case is storing large amounts of time-series data, such as records from an IoT device or weather sensors or, in the case of Netflix, a history of the different shows you have watched. It is often used in situations where you have frequent writes but infrequent updates and reads. It is not going to be your primary app database; for that you need a more general-purpose option like a document-oriented database.
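To make the keyspace, column family and ordered-row layout concrete, here is a minimal sketch using nested Python dictionaries. It is not how Cassandra or HBase store data internally; the `write` and `scan` helpers, the `sensor_readings` column family and the `sensor-id#timestamp` row-key convention are assumptions made for the example.

```python
# keyspace: column family name -> {row key -> {column name -> value}}
keyspace = {}


def write(column_family, row_key, column, value):
    """Store one column value; rows in the same family may have different columns."""
    rows = keyspace.setdefault(column_family, {})
    rows.setdefault(row_key, {})[column] = value


def scan(column_family, prefix=""):
    """Return rows in sorted row-key order, optionally limited to a key prefix."""
    rows = keyspace.get(column_family, {})
    return [(key, rows[key]) for key in sorted(rows) if key.startswith(prefix)]


# Time-series style writes: the row key combines a sensor id and a timestamp.
write("sensor_readings", "sensor-1#2021-01-09T10:00", "temperature", 21.5)
write("sensor_readings", "sensor-1#2021-01-09T09:00", "temperature", 20.9)
write("sensor_readings", "sensor-1#2021-01-09T09:00", "humidity", 40)
write("sensor_readings", "sensor-2#2021-01-09T09:00", "battery", "low")

sensor_1_history = scan("sensor_readings", prefix="sensor-1#")
```

Because the rows are kept in key order, reading the history of one sensor is a contiguous range scan rather than a search through every row.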

3. Document-oriented Databases

Popular options in the document family include MongoDB, DynamoDB and a few others. In this paradigm we have documents, where each document is a container for key-value pairs. They are unstructured and do not require a schema. Documents can be grouped together in a collection. Fields within a collection can be indexed, and collections can be organized in a logical hierarchy, allowing you to model and retrieve relational data to a pretty significant degree. They don't support joins, so instead of normalizing data into a bunch of small tables you are encouraged to embed the data in a single document. This creates a trade-off: reads from a user application are much faster, but writing or updating data tends to be more complex. Document-oriented databases are more general purpose than the other options we have looked at so far. From a developer's perspective they are very easy to use. They are suitable for mobile games, IoT, content management and so on. If you are not sure how your data is structured at this point, a document database is probably the best place to start. Where they generally fall short is when you have a lot of disconnected but related data that is updated often, as in a popular social app that has many users, who have many friends, who write many comments, which have many likes, and you want to see the comments that your friends like. Data like this needs to be joined. Luckily, we have relational databases.
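The embedding trade-off can be illustrated with a small sketch that models a collection as a Python dictionary of documents. This does not use MongoDB or DynamoDB; the `posts` collection, the field names and the `insert`, `build_index` and `rename_user` helpers are hypothetical.

```python
posts = {}  # a collection: document id -> document


def insert(collection, document):
    collection[document["_id"]] = document


# Documents in the same collection do not need to share a shape.
insert(posts, {
    "_id": "post-1",
    "author": "ada",
    "title": "Hello",
    "comments": [{"user": "bob", "text": "Nice"}, {"user": "ada", "text": "Thanks"}],
})
insert(posts, {"_id": "post-2", "author": "bob", "tags": ["databases"]})


def build_index(collection, field):
    """Map each value of a top-level field to the ids of the documents holding it."""
    index = {}
    for doc_id, doc in collection.items():
        if field in doc:
            index.setdefault(doc[field], []).append(doc_id)
    return index


def rename_user(collection, old_name, new_name):
    """Updating embedded data means touching every copy; returns the number of changes."""
    changes = 0
    for doc in collection.values():
        if doc.get("author") == old_name:
            doc["author"] = new_name
            changes += 1
        for comment in doc.get("comments", []):
            if comment["user"] == old_name:
                comment["user"] = new_name
                changes += 1
    return changes


by_author = build_index(posts, "author")
post_with_comments = posts["post-1"]  # one read returns the post and its comments
changes = rename_user(posts, "ada", "ada_l")
```

Reading a post together with its comments is a single lookup, but renaming one user required rewriting two separate places, which is the write-side cost of embedding.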

4. Relational Databases

Popular options include MySQL and PostgreSQL. Relational databases are the most popular type of database today. They use SQL, which allows you to read and write data in the database. However, they require a schema, so if you don't know the shape of your data up front they can be a little hard to work with. They are also ACID compliant, which means that whenever there is a transaction in the database, the validity of the data is guaranteed even when there are network or hardware failures. This is essential for banks and financial institutions, but it makes this kind of database inherently harder to scale. However, there are modern databases like CockroachDB from Cockroach Labs that are specifically designed to operate at scale. In any case, relational databases remain the most popular choice today. But instead of modelling relationships in a schema, what if we create relationships as data? Enter the graph database.
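As a small illustration of a schema plus an all-or-nothing transaction, the following sketch uses Python's built-in `sqlite3` module with an in-memory database rather than MySQL or PostgreSQL. The `accounts` table, its `CHECK` constraint and the `transfer` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front, including a rule that balances cannot go negative.
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " owner TEXT NOT NULL,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 50)],
)
conn.commit()


def transfer(conn, source_id, target_id, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, target_id),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, source_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT owner, balance FROM accounts"))


failed = transfer(conn, 1, 2, 500)     # would overdraw alice, so it is rolled back
after_failed = balances(conn)
succeeded = transfer(conn, 1, 2, 30)
after_success = balances(conn)
```

In the failed transfer the credit to the target account had already been executed, yet it is undone together with the rejected debit, so the data never ends up in a half-applied state.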

5. Graph Databases

Here your data is represented as nodes, and the relationships between them as edges. Popular options include Neo4j. Imagine you want to create a many-to-many relationship. In a relational database, we do that by setting up a join table with foreign keys that define the relationship. In a graph database we do not need that middleman table; we just define an edge and connect it to the other record. We can then query this data with a statement that is more concise and readable, and we can also achieve better performance on large datasets. Graph databases can be a great alternative to SQL, especially if you are working with a lot of joins and performance is taking a hit because of that. They are often used for fraud detection in finance, for building internal knowledge graphs within companies, and to power recommendation engines like the one used by Airbnb.
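The idea of storing relationships as data can be sketched with a plain list of edges in Python. This is not Neo4j or its query language; the node names, the `FRIEND` and `LIKES` relationship labels and the helper functions are hypothetical, and a real graph database would index its edges rather than scan a list.

```python
# Each edge is (source node, relationship type, target node).
edges = [
    ("alice", "FRIEND", "bob"),
    ("alice", "FRIEND", "carol"),
    ("bob", "LIKES", "comment-1"),
    ("carol", "LIKES", "comment-1"),
    ("carol", "LIKES", "comment-2"),
    ("dave", "LIKES", "comment-3"),
]


def neighbors(node, relationship):
    """Follow every outgoing edge of the given type from a node."""
    return [target for source, rel, target in edges
            if source == node and rel == relationship]


def comments_liked_by_friends(user):
    """Two hops: user -FRIEND-> friend -LIKES-> comment."""
    liked = set()
    for friend in neighbors(user, "FRIEND"):
        liked.update(neighbors(friend, "LIKES"))
    return sorted(liked)


result = comments_liked_by_friends("alice")
```

The many-to-many relationship between users and comments needs no join table here: each like is simply one more edge.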

Now imagine something like Google: a user provides a small amount of text, and your database needs to return the most relevant results, in the proper order, from a huge amount of data. For that you want a full-text search engine.

6. Search Databases

Most of the databases in this paradigm, like Solr and Elasticsearch, are built on top of the Apache Lucene project, which has been around since 1999. In addition, we have cloud-based options like Algolia and Meilisearch. From a developer's perspective they work very similarly to a document-oriented database: you start with an index, then you add a bunch of data objects to it. The difference is that, under the hood, the search database analyzes all the text in each document and creates an index of the searchable terms. Essentially it works like the index you would find at the back of a textbook. When a user performs a search, the engine only has to scan the index as opposed to every document in the database, and that makes it fast even on large datasets. The database can also run a variety of algorithms to rank the results, filter out irrelevant hits and handle typos. Search databases can be expensive to run at scale, but they can add a ton of value to the user experience if you are building something like a typeahead search box.
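The back-of-the-book index described above is usually called an inverted index. Here is a minimal Python sketch of the idea; it is not how Lucene is implemented, the sample documents are made up, and ranking by the number of matching terms is a deliberate simplification of real relevance scoring.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Look up each query term in the index; rank documents by matching term count."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    1: "Redis is a fast key-value store",
    2: "Cassandra is a wide-column store",
    3: "Neo4j is a graph database",
}
index = build_index(documents)
hits = search(index, "fast store")
```

A query only touches the index entries for its own terms, so the cost depends on how many documents match those terms rather than on the total number of documents.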

7. Multi-model Databases

In this paradigm there are a few different options. I will focus on the Fauna database, which is quite different from the other databases covered here. If you are a front-end developer, all you care about is the data you consume on the front end. You don't want to have to think about data modelling or schemas. With Fauna you describe how you want to access your data using GraphQL. Take, for example, a user model and a post model: if you upload your GraphQL schema to Fauna, it automatically creates collections where you can store data, along with indexes for that data. In the background, it figures out how to take advantage of multiple database paradigms based on the GraphQL code you provided. You create data by adding documents to collections, just like you would with a document database, but you are not stuck with the inherent limitations when it comes to data modelling. On top of that, it is ACID compliant and extremely fast, and you never have to worry about provisioning the actual infrastructure. You just decide how you want to consume the data and let the cloud configure everything else for you.
