Open-Source Data Warehousing – Druid, Apache Airflow & Superset
These days everyone talks about open source, yet it is still not common in the Data Warehouse (DWH) field. Why is this? In my recent blog post, I researched OLAP technologies and what’s coming next; in this post, I pick some of those open-source technologies and put them together into a full data architecture for DWH use cases built on modern open-source technology. As the title already says, I went for Apache Druid as the data store, Apache Superset for querying and Apache Airflow as the orchestrator.
DRUID – THE DATA STORE
What is Druid?
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
Why Druid?
Druid has many key features, such as sub-second OLAP queries, real-time streaming ingestion, scalability, high availability, the ability to power analytic applications and cost-effectiveness.
With the comparison of modern OLAP technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. ClickHouse and Pinot are very similar to each other, and I didn’t find much information about Apache Kylin, so after reading many different blog posts I decided to go with the most promising technology. Also, just recently Microsoft announced that it is adding Druid to Azure HDInsight 4.0 (preview), as seen in this article*.
Why not Druid?
“Druid’s strong points are very compelling but there are many important problems that Druid does not attempt to solve. Most important of these is joins. Druid has some basic capabilities to join its internal data against small dimension tables loaded from external systems (this is called query-time lookup in Druid terms.) This ability aside, the Druid authors make it clear that large-scale join support is not a priority.
Second, Druid’s SQL native implementation is extremely new. Though progress is being made, SQL is a huge and complex language and it is unlikely that we will see comprehensive SQL analytical capabilities such as set operators and windowing functions any time soon. In addition, we can expect that Druid’s SQL implementation will not support heavy-duty joins.” written by Carter Shanklin on Hortonworks.com.
The Architecture of Druid
What is remarkable about Druid is that it is scalable thanks to its cluster architecture. There are three different node types: the MiddleManager node, the Historical node and the Broker. The great thing is that you can add as many nodes as you want in the area that fits your workload best. If you run many queries, you add more Brokers; if a lot of data needs to be batch-ingested, you add more MiddleManagers; and so on. A simple architecture is shown below (more here).
Druid architecture from Airbnb, posted on Medium
APACHE SUPERSET – THE UI
The easiest way to query Druid is with a lightweight, open-source BI dashboard tool called Apache Superset, which integrates well with Druid. It is easy to use and has all the common chart types like bubble charts, word clouds, heatmaps, boxplots and many more. Druid provides a REST API and, in the newest version, also a SQL query API, which makes it easy to consume from any tool, whether via standard SQL, a BI tool or a custom application.
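To give you an idea of how easy the SQL API is to consume, here is a minimal sketch of posting a query to the Broker from Python. The host, port and the wikipedia example datasource are assumptions based on a default Druid quickstart, so adjust them to your own cluster.

```python
# Minimal sketch: query Druid through its SQL API from Python.
# Host, port and the "wikipedia" datasource are assumptions based on
# a default local quickstart; adjust them to your own cluster.
import requests

DRUID_SQL_ENDPOINT = "http://localhost:8082/druid/v2/sql"  # Broker node

query = {
    "query": """
        SELECT channel, COUNT(*) AS edits
        FROM wikipedia
        GROUP BY channel
        ORDER BY edits DESC
        LIMIT 10
    """
}

response = requests.post(DRUID_SQL_ENDPOINT, json=query)
response.raise_for_status()

# The default result format is a JSON array of row objects.
for row in response.json():
    print(row["channel"], row["edits"])
```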
*Unfortunately, Microsoft is not there yet. I installed an HDInsight cluster on Azure and wanted to add a Druid cluster. Although you can see the service in the Apache Ambari web portal, you are not able to “add” the service and install it. The link is dead, and nothing happens. Let’s hope Microsoft fixes it before the official release.
APACHE AIRFLOW – THE ORCHESTRATOR
As mentioned in Orchestrators – Scheduling and monitor workflows, this is one of the most critical decisions. In traditional ETL you had your ETL tools like Microsoft SQL Server Integration Services (SSIS) and others, where your transformation, cleaning and normalisation took place. In more modern architectures, these kinds of tools are no longer enough. Moreover, code and transformation logic are much more valuable to other data-savvy people in the company as well.
Also read the highly recommended blog post by Maxime Beauchemin, Functional Data Engineering — a modern paradigm for batch data processing (also available on YouTube), which goes much deeper into how modern pipelines should be built. Also consider reading The Downfall of the Data Engineer, where Max explains the breaking of “data silos” and much more.
Why use Airflow?
Therefore, it is wise to use a more open tool or even a programming language. A very popular and, in my opinion, perfect tool for that purpose is Apache Airflow. Airflow is written in Python, and all the DAGs are written in Python as well. Instead of encapsulating your critical transformation logic somewhere inside a tool, you place it where it belongs: inside the orchestrator. Another advantage of using plain Python is that there is no need to encapsulate other dependencies or requirements: whether you are fetching data from an FTP server, copying data from A to B or writing a batch file, you do that and everything else in the same place.
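To make that concrete, here is a minimal sketch of a DAG with three Python tasks; the callables are placeholders for your own extract, transform and load logic, and the schedule and dates are assumptions.

```python
# Minimal sketch of an Airflow DAG (Airflow 1.x style): the callables are
# placeholders for your own extract, transform and load logic.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract(**context):
    # e.g. fetch a file from an FTP server or call an API
    print("extracting data for", context["ds"])


def transform(**context):
    # e.g. clean and normalise the extracted data
    print("transforming data for", context["ds"])


def load(**context):
    # e.g. hand the prepared data over to Druid for ingestion
    print("loading data for", context["ds"])


default_args = {
    "owner": "dwh",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="simple_etl",
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

t_extract = PythonOperator(task_id="extract", python_callable=extract,
                           provide_context=True, dag=dag)
t_transform = PythonOperator(task_id="transform", python_callable=transform,
                             provide_context=True, dag=dag)
t_load = PythonOperator(task_id="load", python_callable=load,
                        provide_context=True, dag=dag)

t_extract >> t_transform >> t_load
```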
Features of Airflow
Moreover, you get a full overview of all current tasks in one place, as this interactive image shows:
More features of Airflow:
- Write workflows as if you’re writing programs
- Jobs can pass parameters to other jobs downstream (see the XCom sketch after this list)
- Logic within workflows (instead of logic being hidden ‘inside’ a program)
- Work gets distributed across your cluster at the task level, not at the DAG level
- Ease of deployment of workflow changes (continuous integration)
- Job testing through Airflow itself
- The most common tasks are already implemented and usable
- Accessibility of log files and other meta-data through the web GUI
- Can (re)run only parts of the workflow and dependent tasks
- Jobs/tasks are run in a context, the scheduler passes in the necessary details
- Can verify what is running on Airflow and see the actual code
- Run external jobs like Databricks, Spark, etc. from within Airflow
- Many more
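As a small sketch of the parameter-passing mentioned in the list above: one task can push a value via XCom simply by returning it, and a downstream task pulls it by task id. The task ids and the pushed value are placeholders.

```python
# Minimal sketch of passing a value between tasks via XCom (Airflow 1.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def produce_row_count(**context):
    row_count = 42  # placeholder: in reality the result of an extract or check
    return row_count  # the return value is pushed to XCom automatically


def consume_row_count(**context):
    # pull the value the upstream task pushed
    row_count = context["ti"].xcom_pull(task_ids="produce_row_count")
    print("rows produced upstream:", row_count)


dag = DAG(
    dag_id="xcom_example",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

producer = PythonOperator(task_id="produce_row_count",
                          python_callable=produce_row_count,
                          provide_context=True, dag=dag)
consumer = PythonOperator(task_id="consume_row_count",
                          python_callable=consume_row_count,
                          provide_context=True, dag=dag)

producer >> consumer
```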
ETL WITH APACHE AIRFLOW
If you want to start with Apache Airflow as your new ETL tool, please start with the ETL best practices with Airflow shared here. It has simple ETL examples with plain SQL, with Hive, with Data Vault, Data Vault 2 and Data Vault with Big Data processes. It gives you an excellent overview of what’s possible and how you would approach it.
At the same time, there is a Docker container that you can use, meaning you don’t even have to set up any infrastructure; just pull the container from here.
For the GitHub-repo follow the link on etl-with-airflow.
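One idea those best practices keep coming back to is making each run idempotent, i.e. scoped to one execution date so it can be re-run or backfilled safely. Here is a minimal sketch of that pattern with templated SQL; the table names, the dwh connection id and the use of the PostgresOperator are my own assumptions for illustration.

```python
# Minimal sketch of an idempotent, partition-scoped load (Airflow 1.x style).
# Table names, the "dwh" connection id and the SQL are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    dag_id="idempotent_daily_load",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

# Delete-then-insert exactly one execution date ({{ ds }}), so re-running
# or backfilling a day always produces the same result.
load_orders = PostgresOperator(
    task_id="load_orders_partition",
    postgres_conn_id="dwh",
    sql=[
        "DELETE FROM fact_orders WHERE order_date = '{{ ds }}'",
        """
        INSERT INTO fact_orders (order_id, customer_id, amount, order_date)
        SELECT order_id, customer_id, amount, order_date
        FROM staging_orders
        WHERE order_date = '{{ ds }}'
        """,
    ],
    dag=dag,
)
```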
CONCLUSION
If you’re searching for an open-source data architecture, you cannot ignore the above-mentioned data store Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy-to-use dashboard tool such as Apache Superset.
My experience so far is that Druid is bloody fast and a perfect fit for replacing traditional OLAP cubes, but it still needs a more relaxed way of getting started with installing clusters, ingesting data, viewing logs, etc. If you need that, have a look at Imply, founded by the creators of Druid, which builds all the services around Druid that you need. Unfortunately, it is not open-source.
Apache Airflow and its features as an orchestrator are something I don’t yet see happening in traditional Business Intelligence environments. I believe this change comes very naturally once you start using open-source and newer technologies.
Apache Superset is an easy and fast way to get up and running and to show data from Druid. There are certainly more polished tools like Tableau, etc., but not for free. That’s why Superset fits well into the ecosystem if you’re already using the above open-source technologies. As an enterprise company, though, you might want to spend some money in that category, because the dashboard is what the users see at the end of the day.
Related Links:
- Understanding Apache Airflow’s key concepts
- How Druid enables analytics at Airbnb
- Google launches Cloud Composer, a new workflow automation tool for developers
- A fully managed workflow orchestration service built on Apache Airflow
- Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
- ETL with Apache Airflow
- What is Data Engineering and the future of Data Warehousing
- Imply – Managed Druid platform (closed-source)
- Ultra-fast OLAP Analytics with Apache Hive and Druid
Originally published at www.sspaeti.com on November 29, 2018.