Open-Source Data Warehousing – Druid, Apache Airflow & Superset
These days everyone talks about open source, yet it is still not common in the Data Warehouse (DWH) field. Why is this? In my recent blog post, I researched OLAP technologies and what’s coming next; in this post, I pick some of those open-source technologies and put them together into a full data architecture for DWH use cases built on modern open-source technology. As the title already says, I went for Apache Druid as the data store, Apache Superset for querying and Apache Airflow as the orchestrator.
DRUID – THE DATA STORE
What is Druid?
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
Why Druid?
Druid has many key features, such as sub-second OLAP queries, real-time streaming ingestion, scalability, high availability, the ability to power analytic applications and cost-effectiveness.
With the comparison of modern OLAP technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. ClickHouse and Pinot are very similar to each other, and I didn’t find much information about Apache Kylin, so after reading many different blog posts I decided to go with the most promising technology. Also, just recently Microsoft announced that it is adding Druid to Azure HDInsight 4.0 (preview), as seen in this article*.
Why not Druid?
“Druid’s strong points are very compelling but there are many important problems that Druid does not attempt to solve. Most important of these is joins. Druid has some basic capabilities to join its internal data against small dimension tables loaded from external systems (this is called query-time lookup in Druid terms.) This ability aside, the Druid authors make it clear that large-scale join support is not a priority.
Second, Druid’s SQL native implementation is extremely new. Though progress is being made, SQL is a huge and complex language and it is unlikely that we will see comprehensive SQL analytical capabilities such as set operators and windowing functions any time soon. In addition, we can expect that Druid’s SQL implementation will not support heavy-duty joins.” written by Carter Shanklin on Hortonworks.com.
The Architecture of Druid
What is remarkable about Druid is that it is scalable thanks to its cluster architecture. There are three different node types: the MiddleManager node, the Historical node and the Broker. The great thing is that you can add as many nodes as you want in the area that fits your workload best. If you run many queries, you add more Brokers; if a lot of data needs to be batch-ingested, you add more MiddleManagers; and so on. A simple architecture is shown below (more here).
Druid architecture from Airbnb, posted on Medium
APACHE SUPERSET – THE UI
The easiest way to query Druid is with a lightweight, open-source BI dashboard tool called Apache Superset, which integrates well with Druid. It is easy to use and has all the common chart types like bubble charts, word clouds, heatmaps, boxplots and many more. Druid provides a REST API and, in the newest version, also a SQL query API, which makes it easy to consume from any tool, whether via standard SQL, a BI tool or a custom application.
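To give you an idea of how easy the SQL API is to consume, here is a minimal sketch of posting a query to the Broker from Python. The host, port and the wikipedia example datasource are assumptions based on a default Druid quickstart, so adjust them to your own cluster.

```python
# Minimal sketch: query Druid through its SQL API from Python.
# Host, port and the "wikipedia" datasource are assumptions based on
# a default local quickstart; adjust them to your own cluster.
import requests

DRUID_SQL_ENDPOINT = "http://localhost:8082/druid/v2/sql"  # Broker node

query = {
    "query": """
        SELECT channel, COUNT(*) AS edits
        FROM wikipedia
        GROUP BY channel
        ORDER BY edits DESC
        LIMIT 10
    """
}

response = requests.post(DRUID_SQL_ENDPOINT, json=query)
response.raise_for_status()

# The default result format is a JSON array of row objects.
for row in response.json():
    print(row["channel"], row["edits"])
```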
*Unfortunately, Microsoft is not there yet. I installed an HDInsight cluster on Azure and wanted to add a Druid cluster. Although you can see the service in the Apache Ambari web portal, you are not able to “add” the service and install it. The link is dead, and nothing happens. Let’s hope Microsoft fixes it before the official release.
APACHE AIRFLOW – THE ORCHESTRATOR
As mentioned in Orchestrators – Scheduling and monitor workflows, this is one of the most critical decisions. In traditional ETL you had your ETL tools like Microsoft SQL Server Integration Services (SSIS) and others, where your transformation, cleaning and normalisation took place. In more modern architectures, these kinds of tools are no longer enough. Moreover, code and transformation logic are much more valuable to other data-savvy people in the company as well.
Also read the highly recommended blog post by Maxime Beauchemin, Functional Data Engineering — a modern paradigm for batch data processing (also available on YouTube), which goes much deeper into how modern pipelines should be built. Also consider reading The Downfall of the Data Engineer, where Max explains the breaking of “data silos” and much more.
Why use Airflow?
Therefore, it is wise to use a more open tool or even a programming language. A very popular and, in my opinion, perfect tool for that purpose is Apache Airflow. Airflow is written in Python, and all the DAGs are written in Python as well. Instead of encapsulating your critical transformation logic somewhere inside a tool, you place it where it belongs: inside the orchestrator. Another advantage of using plain Python is that there is no need to encapsulate other dependencies or requirements: whether you are fetching data from an FTP server, copying data from A to B or writing a batch file, you do that and everything else in the same place.
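To make that concrete, here is a minimal sketch of a DAG with three Python tasks; the callables are placeholders for your own extract, transform and load logic, and the schedule and dates are assumptions.

```python
# Minimal sketch of an Airflow DAG (Airflow 1.x style): the callables are
# placeholders for your own extract, transform and load logic.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract(**context):
    # e.g. fetch a file from an FTP server or call an API
    print("extracting data for", context["ds"])


def transform(**context):
    # e.g. clean and normalise the extracted data
    print("transforming data for", context["ds"])


def load(**context):
    # e.g. hand the prepared data over to Druid for ingestion
    print("loading data for", context["ds"])


default_args = {
    "owner": "dwh",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="simple_etl",
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

t_extract = PythonOperator(task_id="extract", python_callable=extract,
                           provide_context=True, dag=dag)
t_transform = PythonOperator(task_id="transform", python_callable=transform,
                             provide_context=True, dag=dag)
t_load = PythonOperator(task_id="load", python_callable=load,
                        provide_context=True, dag=dag)

t_extract >> t_transform >> t_load
```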
Features of Airflow
Moreover, you get a full overview of all current tasks in one place, as this interactive image shows:
More features of Airflow:
- Write workflows as if you’re writing programs
- Jobs can pass parameters to other jobs downstream (see the XCom sketch after this list)
- Logic within workflows (instead of logic being hidden ‘inside’ a program)
- Work gets distributed across your cluster at the task level, not at the DAG level
- Ease of deployment of workflow changes (continuous integration)
- Job testing through Airflow itself
- The most common tasks are already implemented and usable
- Accessibility of log files and other meta-data through the web GUI
- Can (re)run only parts of the workflow and dependent tasks
- Jobs/tasks are run in a context, the scheduler passes in the necessary details
- Can verify what is running on Airflow and see the actual code
- Run external jobs like Databricks, Spark, etc. from within Airflow
- Many more
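As a small sketch of the parameter-passing mentioned in the list above: one task can push a value via XCom simply by returning it, and a downstream task pulls it by task id. The task ids and the pushed value are placeholders.

```python
# Minimal sketch of passing a value between tasks via XCom (Airflow 1.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def produce_row_count(**context):
    row_count = 42  # placeholder: in reality the result of an extract or check
    return row_count  # the return value is pushed to XCom automatically


def consume_row_count(**context):
    # pull the value the upstream task pushed
    row_count = context["ti"].xcom_pull(task_ids="produce_row_count")
    print("rows produced upstream:", row_count)


dag = DAG(
    dag_id="xcom_example",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

producer = PythonOperator(task_id="produce_row_count",
                          python_callable=produce_row_count,
                          provide_context=True, dag=dag)
consumer = PythonOperator(task_id="consume_row_count",
                          python_callable=consume_row_count,
                          provide_context=True, dag=dag)

producer >> consumer
```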
ETL WITH APACHE AIRFLOW
If you want to start with Apache Airflow as your new ETL tool, please start with the ETL best practices with Airflow shared here. It has simple ETL examples with plain SQL, with Hive, with Data Vault, Data Vault 2 and Data Vault with Big Data processes. It gives you an excellent overview of what’s possible and how you would approach it.
At the same time, there is a Docker container that you can use, meaning you don’t even have to set up any infrastructure; just pull the container from here.
For the GitHub-repo follow the link on etl-with-airflow.
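One idea those best practices keep coming back to is making each run idempotent, i.e. scoped to one execution date so it can be re-run or backfilled safely. Here is a minimal sketch of that pattern with templated SQL; the table names, the dwh connection id and the use of the PostgresOperator are my own assumptions for illustration.

```python
# Minimal sketch of an idempotent, partition-scoped load (Airflow 1.x style).
# Table names, the "dwh" connection id and the SQL are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    dag_id="idempotent_daily_load",
    start_date=datetime(2018, 11, 1),
    schedule_interval="@daily",
)

# Delete-then-insert exactly one execution date ({{ ds }}), so re-running
# or backfilling a day always produces the same result.
load_orders = PostgresOperator(
    task_id="load_orders_partition",
    postgres_conn_id="dwh",
    sql=[
        "DELETE FROM fact_orders WHERE order_date = '{{ ds }}'",
        """
        INSERT INTO fact_orders (order_id, customer_id, amount, order_date)
        SELECT order_id, customer_id, amount, order_date
        FROM staging_orders
        WHERE order_date = '{{ ds }}'
        """,
    ],
    dag=dag,
)
```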
CONCLUSION
If you’re searching for an open-source data architecture, you cannot ignore the above-mentioned data store Druid for speedy OLAP responses, Apache Airflow as an orchestrator that keeps your data lineage and schedules in line, plus an easy-to-use dashboard tool such as Apache Superset.
My experience so far is that Druid is bloody fast and a perfect fit for replacing traditional OLAP cubes, but it still needs a more relaxed way of getting started with installing clusters, ingesting data, viewing logs, etc. If you need that, have a look at Imply, founded by the creators of Druid, which builds all the services around Druid that you need. Unfortunately, it is not open-source.
Apache Airflow and its features as an orchestrator are something I don’t yet see happening in traditional Business Intelligence environments. I believe this change comes very naturally once you start using open-source and newer technologies.
Apache Superset is an easy and fast way to get up and running and to show data from Druid. There are certainly more polished tools like Tableau, etc., but not for free. That’s why Superset fits well into the ecosystem if you’re already using the above open-source technologies. As an enterprise company, though, you might want to spend some money in that category, because the dashboard is what the users see at the end of the day.
Related Links:
- Understanding Apache Airflow’s key concepts
- How Druid enables analytics at Airbnb
- Google launches Cloud Composer, a new workflow automation tool for developers
- A fully managed workflow orchestration service built on Apache Airflow
- Integrating Apache Airflow and Databricks: Building ETL pipelines with Apache Spark
- ETL with Apache Airflow
- What is Data Engineering and the future of Data Warehousing
- Imply – Managed Druid platform (closed-source)
- Ultra-fast OLAP Analytics with Apache Hive and Druid
Originally published at www.sspaeti.com on November 29, 2018.