登录查看更多内容

Database Choice for Analytics

Salman Ghauri

Data Engineer

发布日期: 2017年11月15日

It is not a best practice to query your data from a production database because you could reorder the data and likely slow down your app. You might also accidentally delete important info if you have analysts or engineers poking around in there.

A need of hour then is a separate kind of database for analytics. The question is which one is right? After doing some research, I found out that these are the key factors considered while selecting a DB:

Type of Data

If you have data like Excel (rows and columns), a relation database like MySQL, Postgres, BigQuery or RedShift will fit your need. These types of DBs are best when you know what kind of data you are receiving so that you can define a scheme. User traits like names, emails, and billing plans fit nicely into a table as do user events and their properties.

If your data is like dumping text into a piece of paper or text file, then you should look for a non-relation DB (NoSQL) like we have Mongo, Couch or Hadoop. These DB excel with extremely large data points for semi structured data, for example, emails, books, social media, audio, video etc. If you are planning to do text mining on large scale, language or image processing then you will have to go for non-relation data stores.

Size of Data

Next important thing to consider the amount of data you are dealing with. The data size range is classified as:

- ~2TB or less then Postgres and MySQL are good choice

- 2TB – 64TB: Amazon Aurora

- 64TB – 2PB: Amazon Redshift, Google Big Query

- All of the data: Hadoop

As the size of data increases, you have to go for the non-relation data store since it has fewer constraints to ingest data. Postgres has good price to performance ration but it slows down around 6TB. Aurora is different flavor of MySQL and can easily handle 64TB of data. Amazon store is built on top Red shift and is good choice for analytics.

Engineering team

If you have small engineering team, you will need them to focus more on product development rather than building data pipelines. In this case, your best option is to go for relation database. Advantage being you can use SQL in these DBs which is very common, they take less time to setup and require less maintenance.

Doing analytics on semi or un-structured data requires knowledge of OOP, code heavy background, good amount of time for setting up data pipelines to and from data stores and usually requires maintenance.

Fast retrieval

The speed at which you require the data is an important factor while selecting a DB. While real time analytics is all the rage for use cases like fraud detection but most analysis doesn’t require this.

If you are mostly working on after the fact analysis, you should go for analytics optimized DB like Redshift or Big Query. These are built to store large amount of data, perform fast joins and unions and loads data reasonably fast.

If you want absolute real time data, then your best option is Hadoop. You can design hadoop to fulfill load, transform and store needs.

要查看或添加评论，请登录

Salman Ghauri的更多文章

Fog Computing and Spark

2018年2月16日

Fog Computing and Spark

While big data analytics may be getting a lot of attention, the concept that really sparks the tech community’s…
Key Uses of Apache Spark

2018年2月16日

Key Uses of Apache Spark

Here are some of the most common and important uses of Spark that I have came across. These will help you to understand…
Docker Compose - Build powerful an d portable applications.

2016年11月14日

Docker Compose - Build powerful an d portable applications.

The next thing in developing a containerized application using docker is to use its utility which is named 'docker…
Skeptical data scientist.

2016年11月10日

Skeptical data scientist.

US election results 2016 proved most of the predictions made by data scientist wrong. It open a new challenge for data…
Docker

2016年10月29日

Docker

Want to go deeper in Docker? Check out my series of docker tutorials. Any kind of feedback is highly appreciated.
Laravel vs. ROR

2016年4月12日

Laravel vs. ROR

Laravel is no doubt an up and coming PHP framework, and the support which the community has shown towards it is simply…

2 条评论
Migrations in PHP

2016年3月31日

Migrations in PHP

Are you searching for PHP seven migration? If thus then you want to study the nineteen new options launched in PHP…
LAMP vs. MEAN Stack

2016年2月2日

LAMP vs. MEAN Stack

The Operating System The first choice in any tech stack is the operating system. While the LAMP stack locked the…
Tech'16

2016年1月5日

Tech'16

2016 is going to be big for developers, already with the launch of PHP 7 and node.js OS has surprised the world.

2 条评论

See all articles

Database Choice for Analytics

Salman Ghauri

Data Engineer

Type of Data

Size of Data

Engineering team

Fast retrieval

Salman Ghauri的更多文章

社区洞察

其他会员也浏览了

Exploring the Different Types of Tables in Apache Hive

Where is the database schema? #SQL #NoSQL

20 Free Don’t Miss Out Big Data Books In This Year 2016 !

“THE FUNDAMENTALS OF BIG DATA TOOLS: MapReduce, Spark, and Hive”

To SQL, NoSQL or NewSQL, that’s the Query!

High Data Ingestion: LeanXcale Dual Interface SQL & NoSQL

Delta Lake Format: Understanding Parquet under the hood.

DBMS for Data Science: Why Neo4j vs. your tRusty ol’ RDBMS

SQL vs. NoSQL vs. Graph vs. Key-Value: Choosing the Right Database Tool for Your Data Engineering Projects

Databricks SQL - The new Cloud Data Ware(Lake)house

Type of Data

Size of Data

Engineering team

Fast retrieval

Salman Ghauri的更多文章

Fog Computing and Spark

Key Uses of Apache Spark

Docker Compose - Build powerful an d portable applications.

Skeptical data scientist.

Docker

Laravel vs. ROR

Migrations in PHP

LAMP vs. MEAN Stack

Tech'16

社区洞察

其他会员也浏览了

Exploring the Different Types of Tables in Apache Hive

Where is the database schema? #SQL #NoSQL

20 Free Don’t Miss Out Big Data Books In This Year 2016 !

“THE FUNDAMENTALS OF BIG DATA TOOLS: MapReduce, Spark, and Hive”

To SQL, NoSQL or NewSQL, that’s the Query!

High Data Ingestion: LeanXcale Dual Interface SQL & NoSQL

Delta Lake Format: Understanding Parquet under the hood.

DBMS for Data Science: Why Neo4j vs. your tRusty ol’ RDBMS

SQL vs. NoSQL vs. Graph vs. Key-Value: Choosing the Right Database Tool for Your Data Engineering Projects

Databricks SQL - The new Cloud Data Ware(Lake)house