Database Choice for Analytics
It is not a best practice to query your data from a production database because you could reorder the data and likely slow down your app. You might also accidentally delete important info if you have analysts or engineers poking around in there.
A need of hour then is a separate kind of database for analytics. The question is which one is right? After doing some research, I found out that these are the key factors considered while selecting a DB:
Type of Data
If you have data like Excel (rows and columns), a relation database like MySQL, Postgres, BigQuery or RedShift will fit your need. These types of DBs are best when you know what kind of data you are receiving so that you can define a scheme. User traits like names, emails, and billing plans fit nicely into a table as do user events and their properties.
If your data is like dumping text into a piece of paper or text file, then you should look for a non-relation DB (NoSQL) like we have Mongo, Couch or Hadoop. These DB excel with extremely large data points for semi structured data, for example, emails, books, social media, audio, video etc. If you are planning to do text mining on large scale, language or image processing then you will have to go for non-relation data stores.
Size of Data
Next important thing to consider the amount of data you are dealing with. The data size range is classified as:
- ~2TB or less then Postgres and MySQL are good choice
- 2TB – 64TB: Amazon Aurora
- 64TB – 2PB: Amazon Redshift, Google Big Query
- All of the data: Hadoop
As the size of data increases, you have to go for the non-relation data store since it has fewer constraints to ingest data. Postgres has good price to performance ration but it slows down around 6TB. Aurora is different flavor of MySQL and can easily handle 64TB of data. Amazon store is built on top Red shift and is good choice for analytics.
Engineering team
If you have small engineering team, you will need them to focus more on product development rather than building data pipelines. In this case, your best option is to go for relation database. Advantage being you can use SQL in these DBs which is very common, they take less time to setup and require less maintenance.
Doing analytics on semi or un-structured data requires knowledge of OOP, code heavy background, good amount of time for setting up data pipelines to and from data stores and usually requires maintenance.
Fast retrieval
The speed at which you require the data is an important factor while selecting a DB. While real time analytics is all the rage for use cases like fraud detection but most analysis doesn’t require this.
If you are mostly working on after the fact analysis, you should go for analytics optimized DB like Redshift or Big Query. These are built to store large amount of data, perform fast joins and unions and loads data reasonably fast.
If you want absolute real time data, then your best option is Hadoop. You can design hadoop to fulfill load, transform and store needs.