Druid - Fast analytics on Batch & Real-time data
Apache Druid
Druid is an analytics database built for low-latency queries with sub-second response times. It scales well to thousands of servers and supports both batch and streaming data, integrating well with Hadoop and Kafka. Druid was created at Metamarkets in 2011, open-sourced in 2012 and entered the Apache Incubator in 2018. It is used by companies such as Netflix, Airbnb, Lyft and Slack for analytics at scale, and it can be deployed both on-premises and in the cloud.
Use Cases for Druid
Druid shines when a single query needs to span both historical and real-time data. It is also well suited to interactive dashboards and analytics applications.
Anti-Patterns for Druid
Druid is not suitable where two large tables need to be joined; if both tables are at petabyte scale, Druid is the wrong tool. A timestamp must always be present in the data, because Druid always partitions on time. Druid can handle streaming appends and batch overwrites, but it cannot do streaming updates.
How Druid works
Druid architecture consists of three types of servers.
- Master
- Query
- Data
Master Server – It consists of the Coordinator and the Overlord. The Coordinator manages the distribution of segments (more on these later), while the Overlord takes care of data ingestion and task supervision.
Query Server – The Broker is part of the Query Server. It receives queries from users, tracks where the data lives on the Data Servers, routes each query to the appropriate servers, then merges the results and returns them to the user.
Data Server – The Historical and the MiddleManager are part of the Data Server. Data is partitioned and stored on the Data Servers as segments. Historicals serve batch, or non-real-time, data. Data is not persisted permanently on the Data Servers; S3, HDFS or Google Cloud Storage is used as permanent ("deep") storage instead. Once segments are created and are not needed for querying, they are sent to deep storage; whenever they are required for querying again, they are loaded back onto a Historical.
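This load/drop cycle between deep storage and a Historical can be pictured with a small sketch. Everything here is illustrative: the class and method names (`DeepStorage`, `Historical`, `load`, `drop`) are hypothetical and are not Druid APIs.

```python
# Illustrative sketch of the segment lifecycle between deep storage and a
# Historical server. Class and method names are hypothetical, not Druid APIs.

class DeepStorage:
    """Permanent segment store (stands in for S3/HDFS/Google Cloud Storage)."""
    def __init__(self):
        self._segments = {}

    def put(self, segment_id, data):
        self._segments[segment_id] = data

    def get(self, segment_id):
        return self._segments[segment_id]

    def delete(self, segment_id):
        # Deleting here is permanent: the segment is gone for good.
        del self._segments[segment_id]


class Historical:
    """Serves queries from locally loaded segments only."""
    def __init__(self, deep_storage):
        self.deep_storage = deep_storage
        self.loaded = {}

    def load(self, segment_id):
        # Pull a segment down from deep storage so it can be queried.
        self.loaded[segment_id] = self.deep_storage.get(segment_id)

    def drop(self, segment_id):
        # Dropping locally does NOT delete the copy in deep storage.
        del self.loaded[segment_id]

    def query(self, segment_id):
        if segment_id not in self.loaded:
            raise LookupError(f"segment {segment_id} not loaded")
        return self.loaded[segment_id]


storage = DeepStorage()
storage.put("2023-01-01", ["row1", "row2"])

historical = Historical(storage)
historical.load("2023-01-01")
print(historical.query("2023-01-01"))      # queryable while loaded

historical.drop("2023-01-01")              # no longer queryable here...
print("2023-01-01" in storage._segments)   # ...but still safe in deep storage
```

The key property the sketch captures: dropping a segment from a server only affects queryability on that server, while deleting from deep storage is irreversible.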
MiddleManager – It reads incoming real-time data and creates segments that are pushed down to deep storage. During real-time ingestion, the MiddleManager also serves queries on the real-time data.
A user submits a query – The query goes to the Broker. The Broker keeps track of which segments live on which Historicals and of the real-time processing happening on the MiddleManagers, so it routes the query to the appropriate servers. The partial results are collected by the Broker, merged, and sent back to the user.
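This scatter/gather flow can be sketched in a few lines of Python. The sketch is purely conceptual (a real Broker routes over the network and speaks Druid's query formats); the dictionaries and the `broker_query` function are hypothetical stand-ins.

```python
# Toy scatter/gather: a "broker" fans a query out to the servers that own
# the relevant time segments and merges their partial results.
# All names here are hypothetical; this only illustrates the flow.

# Each "data server" owns some daily segments, mapped to a row count.
historical = {"2023-01-01": 120, "2023-01-02": 95}   # batch segments
middle_manager = {"2023-01-03": 40}                   # in-flight real-time data

def broker_query(interval_days):
    """Scatter to whichever server holds each day, then gather and merge."""
    total = 0
    for day in interval_days:
        if day in historical:            # routed to a Historical
            total += historical[day]
        elif day in middle_manager:      # routed to a MiddleManager (real-time)
            total += middle_manager[day]
    return total

# A query spanning batch and real-time data is answered in one pass.
print(broker_query(["2023-01-01", "2023-01-02", "2023-01-03"]))  # 255
```

Note how the caller never needs to know which server holds which data; that bookkeeping lives entirely in the broker.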
A data source needs to be ingested by Druid – When an ingestion request is submitted to the Master Server, the Overlord instructs a MiddleManager to process the data. The MiddleManager segments and groups the data based on time. Once the segments are created, they are sent to deep storage, and Historicals load them from there as required. If a segment is dropped from a server it can no longer be queried there, but it still exists in deep storage; only when a segment is deleted from deep storage is it gone permanently.
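The "segment and group based on time" step can be illustrated with a minimal sketch that buckets events by hour, each bucket standing in for one segment. The `segment_by_hour` function and the event shape are hypothetical, not Druid's ingestion API.

```python
# Illustrative sketch of time-based segmentation: incoming events are grouped
# into hourly buckets, each of which would become one immutable segment.
# Function and field names are hypothetical, not Druid's ingestion API.
from collections import defaultdict
from datetime import datetime

events = [
    {"ts": "2023-01-01T10:05:00", "user": "a"},
    {"ts": "2023-01-01T10:45:00", "user": "b"},
    {"ts": "2023-01-01T11:10:00", "user": "a"},
]

def segment_by_hour(events):
    """Bucket events by hour; each bucket maps to one segment."""
    segments = defaultdict(list)
    for e in events:
        hour = datetime.fromisoformat(e["ts"]).replace(minute=0, second=0)
        segments[hour.isoformat()].append(e)
    return dict(segments)

for segment_id, rows in segment_by_hour(events).items():
    print(segment_id, len(rows))
# 2023-01-01T10:00:00 2
# 2023-01-01T11:00:00 1
```

Because every segment covers a known time interval, a later query for a time range only needs to touch the segments whose intervals overlap it.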
Segments, once created, are immutable: if a segment needs to be updated, it must be rewritten. Within a segment, data is stored in a column-oriented format, and each column is packed and compressed individually. String values are dictionary-encoded for compact storage and fast querying, and string performance is further improved by bitmap indexing. Numeric columns are also compressed according to their data. On top of this, LZ4 lossless compression is applied for better performance.
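Dictionary encoding and bitmap indexing for a string column can be shown in miniature (a simplified sketch; Druid's real encodings are considerably more involved):

```python
# Sketch of dictionary encoding plus bitmap indexing for a string column.
# Simplified for illustration; Druid's actual segment format is more involved.

column = ["us", "uk", "us", "de", "uk", "us"]

# Dictionary encoding: store each distinct string once, rows as small ints.
dictionary = sorted(set(column))                  # ['de', 'uk', 'us']
codes = {v: i for i, v in enumerate(dictionary)}
encoded = [codes[v] for v in column]              # [2, 1, 2, 0, 1, 2]

# Bitmap index: one bitset per distinct value marking the rows containing it.
bitmaps = {v: [1 if x == v else 0 for x in column] for v in dictionary}

# A filter like WHERE country = 'us' becomes a bitmap scan, not a string scan.
us_rows = [i for i, bit in enumerate(bitmaps["us"]) if bit]
print(encoded)   # [2, 1, 2, 0, 1, 2]
print(us_rows)   # [0, 2, 5]
```

The integer codes compress far better than repeated strings, and the bitmaps let filters be answered with fast bitwise operations.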
Druid also comes with a web UI for loading data, checking the number of segments, the servers running, and so on.
Druid favours denormalisation, because querying a flat table is faster than joining it with another large table. It also offers complex data types such as HyperUnique, approxHistogram and DataSketches. Data sketches are a family of algorithms that process huge amounts of data very quickly with a small, controlled loss of accuracy; Druid can use them to return fast approximate results, saving time and computing cost.
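The idea behind sketches can be illustrated with a minimal K-Minimum-Values (KMV) distinct-count estimator. This is a simplified cousin of the DataSketches algorithms mentioned above, not Druid's actual implementation, and the function name is hypothetical.

```python
# Minimal K-Minimum-Values (KMV) sketch for approximate distinct counting.
# A simplified cousin of the DataSketches algorithms; not Druid's actual code.
import hashlib

def kmv_estimate(values, k=128):
    """Estimate the number of distinct values from the k smallest hashes."""
    # Hash every value into [0, 1) with a deterministic hash.
    hashes = set()
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        hashes.add(h / 2**128)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)          # fewer than k distinct values: exact
    # The k-th smallest hash tells us how densely [0, 1) is populated.
    return int((k - 1) / smallest[-1])

# 10,000 values but only 1,000 distinct: the estimate lands near 1,000
# while the sketch keeps just 128 hashes instead of 1,000 full values.
values = [f"user-{i % 1000}" for i in range(10_000)]
print(kmv_estimate(values))
```

The trade-off is exactly the one described above: a tiny, mergeable summary in place of the full data, at the cost of a small bounded error in the answer.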
To summarise, Druid is an analytics database offering sub-second query response times. It handles both batch and real-time data, and consists of three types of servers – Master, Query and Data. Data is stored in immutable segments partitioned on time. Druid shines where real-time and batch data need to be queried together.
To learn more about Druid, check these links:
https://imply.io/druid-university/intro-to-druid-university
A video on Apache Druid from imply.io