Druid - Fast analytics on Batch & Real-time data
Apache Druid
Druid is an analytics database built for low-latency queries with sub-second response times. It scales well to thousands of servers and supports both batch and streaming data, integrating well with Hadoop and Kafka. Druid was created at Metamarkets in 2011, open-sourced in 2012 and entered the Apache Incubator in 2018. It is used by companies such as Netflix, Airbnb, Lyft and Slack for analytics at scale, and it can be deployed both on-premises and in the cloud.
Use Cases for Druid
Druid shines when a single query needs to span both historical and real-time data. It is also well suited to interactive dashboards and analytics applications.
Anti-Patterns for Druid
Druid is not suitable where two large tables need to be joined; if both tables are at petabyte scale, Druid is the wrong tool. A timestamp must always be present in the data, because Druid always partitions on time. Druid can handle streaming appends and batch overwrites, but it cannot do streaming updates.
How Druid works
Druid architecture consists of three types of servers.
- Master
- Query
- Data
Master Server – It consists of the Coordinator and the Overlord. The Coordinator manages the distribution of segments (more on these later), while the Overlord takes care of data ingestion and task supervision.
Query Server – The Broker is part of the Query Server. It receives queries from users, tracks where the data lives on the Data Servers, routes each query to the appropriate servers, then merges the results and returns them to the user.
Data Server – The Historical and the MiddleManager are part of the Data Server. Data is partitioned and stored on the Data Servers as segments. Historicals serve batch, or non-real-time, data. Data is not persisted permanently on the Data Servers; S3, HDFS or Google Cloud Storage is used as permanent ("deep") storage instead. Once segments are created and are not needed for querying, they are sent to deep storage; whenever they are required for querying again, they are loaded back onto a Historical.
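This load/drop cycle between deep storage and a Historical can be pictured with a small sketch. Everything here is illustrative: the class and method names (`DeepStorage`, `Historical`, `load`, `drop`) are hypothetical and are not Druid APIs.

```python
# Illustrative sketch of the segment lifecycle between deep storage and a
# Historical server. Class and method names are hypothetical, not Druid APIs.

class DeepStorage:
    """Permanent segment store (stands in for S3/HDFS/Google Cloud Storage)."""
    def __init__(self):
        self._segments = {}

    def put(self, segment_id, data):
        self._segments[segment_id] = data

    def get(self, segment_id):
        return self._segments[segment_id]

    def delete(self, segment_id):
        # Deleting here is permanent: the segment is gone for good.
        del self._segments[segment_id]


class Historical:
    """Serves queries from locally loaded segments only."""
    def __init__(self, deep_storage):
        self.deep_storage = deep_storage
        self.loaded = {}

    def load(self, segment_id):
        # Pull a segment down from deep storage so it can be queried.
        self.loaded[segment_id] = self.deep_storage.get(segment_id)

    def drop(self, segment_id):
        # Dropping locally does NOT delete the copy in deep storage.
        del self.loaded[segment_id]

    def query(self, segment_id):
        if segment_id not in self.loaded:
            raise LookupError(f"segment {segment_id} not loaded")
        return self.loaded[segment_id]


storage = DeepStorage()
storage.put("2023-01-01", ["row1", "row2"])

historical = Historical(storage)
historical.load("2023-01-01")
print(historical.query("2023-01-01"))      # queryable while loaded

historical.drop("2023-01-01")              # no longer queryable here...
print("2023-01-01" in storage._segments)   # ...but still safe in deep storage
```

The key property the sketch captures: dropping a segment from a server only affects queryability on that server, while deleting from deep storage is irreversible.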
MiddleManager – It reads incoming real-time data and creates segments that are pushed down to deep storage. During real-time ingestion, the MiddleManager also serves queries on the real-time data.
A user submits a query – The query goes to the Broker. The Broker keeps track of which segments live on which Historicals and of the real-time processing happening on the MiddleManagers, so it routes the query to the appropriate servers. The partial results are collected by the Broker, merged, and sent back to the user.
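This scatter/gather flow can be sketched in a few lines of Python. The sketch is purely conceptual (a real Broker routes over the network and speaks Druid's query formats); the dictionaries and the `broker_query` function are hypothetical stand-ins.

```python
# Toy scatter/gather: a "broker" fans a query out to the servers that own
# the relevant time segments and merges their partial results.
# All names here are hypothetical; this only illustrates the flow.

# Each "data server" owns some daily segments, mapped to a row count.
historical = {"2023-01-01": 120, "2023-01-02": 95}   # batch segments
middle_manager = {"2023-01-03": 40}                   # in-flight real-time data

def broker_query(interval_days):
    """Scatter to whichever server holds each day, then gather and merge."""
    total = 0
    for day in interval_days:
        if day in historical:            # routed to a Historical
            total += historical[day]
        elif day in middle_manager:      # routed to a MiddleManager (real-time)
            total += middle_manager[day]
    return total

# A query spanning batch and real-time data is answered in one pass.
print(broker_query(["2023-01-01", "2023-01-02", "2023-01-03"]))  # 255
```

Note how the caller never needs to know which server holds which data; that bookkeeping lives entirely in the broker.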
A data source needs to be ingested by Druid – When an ingestion request is submitted to the Master Server, the Overlord instructs a MiddleManager to process the data. The MiddleManager segments and groups the data based on time. Once the segments are created, they are sent to deep storage, and Historicals load them from there as required. If a segment is dropped from a server it can no longer be queried there, but it still exists in deep storage; only when a segment is deleted from deep storage is it gone permanently.
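The "segment and group based on time" step can be illustrated with a minimal sketch that buckets events by hour, each bucket standing in for one segment. The `segment_by_hour` function and the event shape are hypothetical, not Druid's ingestion API.

```python
# Illustrative sketch of time-based segmentation: incoming events are grouped
# into hourly buckets, each of which would become one immutable segment.
# Function and field names are hypothetical, not Druid's ingestion API.
from collections import defaultdict
from datetime import datetime

events = [
    {"ts": "2023-01-01T10:05:00", "user": "a"},
    {"ts": "2023-01-01T10:45:00", "user": "b"},
    {"ts": "2023-01-01T11:10:00", "user": "a"},
]

def segment_by_hour(events):
    """Bucket events by hour; each bucket maps to one segment."""
    segments = defaultdict(list)
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0)
        segments[hour.isoformat()].append(e)
    return dict(segments)

for segment_id, rows in segment_by_hour(events).items():
    print(segment_id, len(rows))
# 2023-01-01T10:00:00 2
# 2023-01-01T11:00:00 1
```

Because every segment covers a known time interval, a later query for a time range only needs to touch the segments whose intervals overlap it.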
Segments, once created, are immutable: if a segment needs to be updated, it must be rewritten. Within a segment, data is stored in a column-oriented format, and each column is packed and compressed individually. String values are dictionary-encoded for compact storage and fast querying, and string performance is further improved by bitmap indexing. Numeric columns are also compressed according to their data. On top of this, LZ4 lossless compression is applied for better performance.
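Dictionary encoding and bitmap indexing for a string column can be shown in miniature (a simplified sketch; Druid's real encodings are considerably more involved):

```python
# Sketch of dictionary encoding plus bitmap indexing for a string column.
# Simplified for illustration; Druid's actual segment format is more involved.

column = ["us", "uk", "us", "de", "uk", "us"]

# Dictionary encoding: store each distinct string once, rows as small ints.
dictionary = sorted(set(column))                  # ['de', 'uk', 'us']
codes = {v: i for i, v in enumerate(dictionary)}
encoded = [codes[v] for v in column]              # [2, 1, 2, 0, 1, 2]

# Bitmap index: one bitset per distinct value marking the rows containing it.
bitmaps = {v: [1 if x == v else 0 for x in column] for v in dictionary}

# A filter like WHERE country = 'us' becomes a bitmap scan, not a string scan.
us_rows = [i for i, bit in enumerate(bitmaps["us"]) if bit]
print(encoded)   # [2, 1, 2, 0, 1, 2]
print(us_rows)   # [0, 2, 5]
```

The integer codes compress far better than repeated strings, and the bitmaps let filters be answered with fast bitwise operations.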
Druid also comes with a web UI for loading data, checking the number of segments, the servers running, and so on.
Druid favours denormalisation, because querying a flat table is faster than joining it with another large table. It also offers complex data types such as HyperUnique, approxHistogram and DataSketches. Data sketches are a family of algorithms that process huge amounts of data very quickly with a small, controlled loss of accuracy; Druid can use them to return fast approximate results, saving time and computing cost.
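The idea behind sketches can be illustrated with a minimal K-Minimum-Values (KMV) distinct-count estimator. This is a simplified cousin of the DataSketches algorithms mentioned above, not Druid's actual implementation, and the function name is hypothetical.

```python
# Minimal K-Minimum-Values (KMV) sketch for approximate distinct counting.
# A simplified cousin of the DataSketches algorithms; not Druid's actual code.
import hashlib

def kmv_estimate(values, k=128):
    """Estimate the number of distinct values from the k smallest hashes."""
    # Hash every value into [0, 1) with a deterministic hash.
    hashes = set()
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        hashes.add(h / 2**128)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)          # fewer than k distinct values: exact
    # The k-th smallest hash tells us how densely [0, 1) is populated.
    return int((k - 1) / smallest[-1])

# 10,000 values but only 1,000 distinct: the estimate lands near 1,000
# while the sketch keeps just 128 hashes instead of 1,000 full values.
values = [f"user-{i % 1000}" for i in range(10_000)]
print(kmv_estimate(values))
```

The trade-off is exactly the one described above: a tiny, mergeable summary in place of the full data, at the cost of a small bounded error in the answer.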
To summarise, Druid is an analytics database offering sub-second query response times. It handles both batch and real-time data, and consists of three types of servers – Master, Query and Data. Data is stored in immutable segments partitioned on time. Druid shines where real-time and batch data need to be queried together.
To learn more about Druid, check these links:
https://imply.io/druid-university/intro-to-druid-university
A video on Apache Druid from imply.io