Database Choice for Analytics

It is not a best practice to query your data from a production database because you could reorder the data and likely slow down your app. You might also accidentally delete important info if you have analysts or engineers poking around in there.

A need of hour then is a separate kind of database for analytics. The question is which one is right? After doing some research, I found out that these are the key factors considered while selecting a DB:

Type of Data

If you have data like Excel (rows and columns), a relation database like MySQL, Postgres, BigQuery or RedShift will fit your need. These types of DBs are best when you know what kind of data you are receiving so that you can define a scheme. User traits like names, emails, and billing plans fit nicely into a table as do user events and their properties.

If your data is like dumping text into a piece of paper or text file, then you should look for a non-relation DB (NoSQL) like we have Mongo, Couch or Hadoop. These DB excel with extremely large data points for semi structured data, for example, emails, books, social media, audio, video etc. If you are planning to do text mining on large scale, language or image processing then you will have to go for non-relation data stores.

Size of Data

Next important thing to consider the amount of data you are dealing with. The data size range is classified as:

-         ~2TB or less then Postgres and MySQL are good choice

-         2TB – 64TB: Amazon Aurora

-         64TB – 2PB: Amazon Redshift, Google Big Query

-         All of the data: Hadoop

As the size of data increases, you have to go for the non-relation data store since it has fewer constraints to ingest data. Postgres has good price to performance ration but it slows down around 6TB. Aurora is different flavor of MySQL and can easily handle 64TB of data. Amazon store is built on top Red shift and is good choice for analytics.

Engineering team

If you have small engineering team, you will need them to focus more on product development rather than building data pipelines. In this case, your best option is to go for relation database. Advantage being you can use SQL in these DBs which is very common, they take less time to setup and require less maintenance.

Doing analytics on semi or un-structured data requires knowledge of OOP, code heavy background, good amount of time for setting up data pipelines to and from data stores and usually requires maintenance.

Fast retrieval

The speed at which you require the data is an important factor while selecting a DB. While real time analytics is all the rage for use cases like fraud detection but most analysis doesn’t require this.

If you are mostly working on after the fact analysis, you should go for analytics optimized DB like Redshift or Big Query. These are built to store large amount of data, perform fast joins and unions and loads data reasonably fast.

If you want absolute real time data, then your best option is Hadoop. You can design hadoop to fulfill load, transform and store needs.



要查看或添加评论,请登录

Salman Ghauri的更多文章

  • Fog Computing and Spark

    Fog Computing and Spark

    While big data analytics may be getting a lot of attention, the concept that really sparks the tech community’s…

  • Key Uses of Apache Spark

    Key Uses of Apache Spark

    Here are some of the most common and important uses of Spark that I have came across. These will help you to understand…

  • Docker Compose - Build powerful an d portable applications.

    Docker Compose - Build powerful an d portable applications.

    The next thing in developing a containerized application using docker is to use its utility which is named 'docker…

  • Skeptical data scientist.

    Skeptical data scientist.

    US election results 2016 proved most of the predictions made by data scientist wrong. It open a new challenge for data…

  • Docker

    Docker

    Want to go deeper in Docker? Check out my series of docker tutorials. Any kind of feedback is highly appreciated.

  • Laravel vs. ROR

    Laravel vs. ROR

    Laravel is no doubt an up and coming PHP framework, and the support which the community has shown towards it is simply…

    2 条评论
  • Migrations in PHP

    Migrations in PHP

    Are you searching for PHP seven migration? If thus then you want to study the nineteen new options launched in PHP…

  • LAMP vs. MEAN Stack

    LAMP vs. MEAN Stack

    The Operating System The first choice in any tech stack is the operating system. While the LAMP stack locked the…

  • Tech'16

    Tech'16

    2016 is going to be big for developers, already with the launch of PHP 7 and node.js OS has surprised the world.

    2 条评论

社区洞察

其他会员也浏览了