Apache Drill, not bad at all!
https://technology.amis.nl/data-analytics/what-is-apache-drill-and-how-to-setup-your-proof-of-concept/

A few days ago I had some exposure to Apache Drill and I found it somewhat interesting. Even if it can be considered a bit legacy and “not so cool” nowadays (there are so many MPP open-source alternatives with SQL dialects...), Drill is still a good piece of software.

We can define Drill as an open-source distributed SQL query engine, inspired by Google Dremel/BigQuery, that is able to run distributed queries over large-scale datasets. It is designed to be scalable, flexible, and efficient, allowing us to quickly and easily access and analyze data from a wide variety of sources. It can run on anything from a single node to hundreds or thousands of them, coordinated using ZooKeeper.

[Image: Drill cluster architecture overview]

When a Drill client issues a query (using JDBC, ODBC, a command line interface or the REST API), any Drillbit service in the cluster can accept it; there is no global master-slave concept. That Drillbit then parses the query, optimizes it, and generates a distributed query plan that, most of the time, is optimized for fast and efficient execution.

The Drillbit that accepts the query becomes the driving Drillbit node for just that request. It determines the appropriate nodes to execute the various query plan fragments in order to maximize data locality. When the individual nodes finish their execution, they return the data to the driving Drillbit, and from there the results are streamed back to the client.

[Image: distributed execution of query fragments across Drillbits]

Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments. It also supports accessing data in a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, SMB and local files.
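This means a single SQL statement can even combine data living in different systems. A minimal sketch of such a cross-source join (it assumes the hive and mongo storage plugins are enabled, and the schema/table names below are purely hypothetical):

-- Hypothetical join between a Hive table and a MongoDB collection.
-- Plugin, schema and table names are illustrative only.
SELECT c.customer_name,
       SUM(o.total_amount) AS total_spent
FROM hive.sales.`customers` c
JOIN mongo.shop.`orders` o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;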

One of the key advantages of Apache Drill is its flexibility. It is able to query data in a variety of formats, from structured to semi-structured and unstructured (including CSV, JSON, Parquet, and Avro). This makes it a good tool for working with data coming from a variety of sources, such as logs, sensors, and other semi-structured feeds. Another advantage of Apache Drill is its low cost: because it is open-source software, it is available for free, which can mean significant savings for organizations looking to implement big data solutions.

Another strength of Apache Drill is its performance. It uses a columnar in-memory data representation, which allows for efficient data access and query processing. It also uses a vectorized query execution engine, operating directly on columnar data without row materialization, so it can process large datasets in parallel with a low memory footprint and perform complex operations at high speed. This makes it ideal for processing large datasets quickly and efficiently. Drill also generates highly efficient custom code for every single query. The following image shows the Drill compilation/code generation process:

[Image: Drill compilation/code generation process]

At the same time, all this “complexity” is not a problem at all when starting out with Drill. The minimum configuration is just one node, so you can easily test it by creating a Docker container in a few seconds:

docker run -it --name drill -p 8047:8047 -p 31010:31010 apache/drill        

Regarding connectivity, as we already said, we can connect to Drill using JDBC or the ODBC driver, and there is also a REST API to which we can send SQL queries using HTTP POST requests.

I will now show you how to query some Parquet files with SQL from the web interface and using a POST request. After that, I will show you some of the information exposed through the execution plans of Drill SQL queries.

Let’s start by querying some sample data from the web interface. We just need to connect to port 8047 of our container and click the Query option to run a sample query:

[Images: running a sample query from the Drill web interface and its results]
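For reference, the starter query is the kind of thing the Drill documentation suggests against the sample data bundled with the distribution, reachable through the built-in cp (classpath) storage plugin; something along these lines:

-- Sample query against the employee.json dataset that ships with Drill.
SELECT employee_id, full_name, position_title
FROM cp.`employee.json`
LIMIT 10;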

Now we can run the same query from Postman and get the same result in JSON format:

[Image: the same query sent from Postman as an HTTP POST request, returning JSON]

But the sample data included is just too small to run any interesting query. We will download some “classic” Parquet sample data (the NY yellow taxi data) into our sample folder (/opt/drill/sample-data by default):

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet        

After the download is completed, we can directly query the file and get some statistical information from it:

[Images: aggregation query over the taxi Parquet file and its results]
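The query itself was a simple aggregation over the downloaded file. Assuming the default dfs storage plugin pointing at the local file system, it looked roughly like this (the column names come from the public NY yellow taxi schema):

-- Basic per-passenger-count statistics over the January 2022 file.
SELECT passenger_count,
       COUNT(*)           AS trips,
       AVG(trip_distance) AS avg_distance,
       MAX(total_amount)  AS max_amount
FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-01.parquet`
GROUP BY passenger_count
ORDER BY trips DESC;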

For these exploratory queries, it is useful to know that from the web interface it is easy to export the result to CSV using the Export button located at the top right of the result grid:

[Image: Export button at the top right of the result grid]

Apart from the result grid, the system also offers information about how long it took to plan the query (generate the execution plan), how long it was waiting in the queue, and how long it took to run. In this case all of that took 674 ms:

[Image: query profile with planning, queued and execution times]

So here in Drill we have the classic steps of almost any other system that uses SQL: parsing, creation of a logical plan, optimization, and then the final execution using a physical plan:

[Image: query planning and execution phases]
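If you want to inspect the plan without actually running the query, Drill also exposes it through EXPLAIN; a minimal example (the query itself is just illustrative):

-- Show the physical plan Drill generates for a query.
-- EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR shows the logical plan instead.
EXPLAIN PLAN FOR
SELECT passenger_count, COUNT(*) AS trips
FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-01.parquet`
GROUP BY passenger_count;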

We can also see the details of each operation, and here we can see that most of the time for this query was spent (as usual) reading the data and doing the aggregation (using a hash algorithm in this case):

[Image: per-operator timings in the query profile]

We also have a graphical representation of the plan in a DAG format (the same we have in other tools), with some “Stages” and “Tasks” inside them. Again, lots of similarities with other “modern tools”:

[Image: DAG visualization of the query plan]

Regarding the operators, we can see that the physical join/aggregator operators are very similar to the ones we have in other DB engines:

[Image: physical join/aggregation operators in the plan]

Now we can write another query, for example one using window functions, so that for each number of passengers we get the two trips with the longest trip distance:

[Images: window function query and its results]
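A sketch of that window-function query, assuming the same dfs path and the standard taxi column names:

-- Top two trips by distance for each passenger count.
SELECT passenger_count, trip_distance, total_amount
FROM (
  SELECT passenger_count, trip_distance, total_amount,
         ROW_NUMBER() OVER (PARTITION BY passenger_count
                            ORDER BY trip_distance DESC) AS rn
  FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-01.parquet`
) ranked
WHERE rn <= 2;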

In this case, some operations involve sorts, so they can spill to disk if not enough memory is available. That will slow the query down considerably (which is what happens in most database engines too). Spilling will always be a “problem” for performance, and in Drill we can also configure a specific spill directory where we can mount faster devices (faster SSDs with no storage redundancy, no geo-replication and so on) to speed up the sort & hash operations:

[Images: spill-to-disk metrics and spill directory configuration]

To continue with some testing, we can add three more Parquet files to increase the volume of the data:

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-04.parquet

In the SQL query we can use wildcards to reference them all together. Now the cost and the time to execute increase considerably, because we need to order ~12 million rows and we are spilling more than before:

[Images: query profile over all four files, showing the increased spilling]
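The only change needed is in the FROM clause, where a wildcard picks up all four monthly files; roughly:

-- Same window-function query, now over every downloaded 2022 monthly file.
SELECT passenger_count, trip_distance, total_amount
FROM (
  SELECT passenger_count, trip_distance, total_amount,
         ROW_NUMBER() OVER (PARTITION BY passenger_count
                            ORDER BY trip_distance DESC) AS rn
  FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-*.parquet`
) ranked
WHERE rn <= 2;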

In general, it is worth trying to push filters down to reduce the amount of data to be processed. Drill supports Parquet filter pushdown for the VARCHAR and DECIMAL data types, and uses the binary statistics in the Parquet file or in the Drill metadata file to push filters to the data sources. So, for example, if we filter by the maximum trip_distance value, which lives in only one file, Drill will only scan the 2.9 million rows of that file. Drill also supports partition pruning when querying file systems (including an optimized Parquet metadata cache) and Hive tables, as long as we have the proper folder partition hierarchy in place:

[Images: scan row counts with the filter pushed down to a single file]
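For example, a filter on a very large trip_distance value (the threshold below is only illustrative) lets Drill skip every file or row group whose Parquet statistics show it cannot contain matching rows:

-- Only files whose Parquet min/max statistics admit trip_distance > 300
-- need to be scanned; the rest are pruned.
SELECT *
FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-*.parquet`
WHERE trip_distance > 300;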

There are lots of parameters in Drill, so we can, for example, control multiphase aggregation using the planner.enable_multiphase_agg option, tune the broadcast threshold and factor, or enable/disable specific join algorithms if we want. Again, I see lots of similarities with other systems:

[Images: planner options exposed in the Drill web UI]
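These options can be changed per session (or system-wide) with ALTER SESSION / ALTER SYSTEM. A quick sketch using typical planner options (only planner.enable_multiphase_agg is mentioned above; the other option names are the usual Drill ones and may vary by version):

-- Disable multiphase aggregation for this session.
ALTER SESSION SET `planner.enable_multiphase_agg` = false;
-- Tune the broadcast join threshold (rows) and its cost factor.
ALTER SESSION SET `planner.broadcast_threshold` = 10000000;
ALTER SESSION SET `planner.broadcast_factor` = 2.0;
-- Enable/disable specific join algorithms.
ALTER SESSION SET `planner.enable_hashjoin` = true;
ALTER SESSION SET `planner.enable_mergejoin` = false;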

In the end, almost all MPP systems have similar options to fine-tune the queries that are not performing as expected. Every MPP system will sometimes require some “human help” in the form of plan shaping and “intelligent data redistribution”, and eventually somebody will need to dig into the execution plans and “do the magic trick” to get good query performance.

It is good from time to time to get some exposure to tools from a different stack than the one you are used to working with. Over time, you will see that the similarities are enormous, much greater than the differences, so please don't trust the marketing stuff. Its main focus is always to “differentiate” from the competition and be “the best” (but most of the time the tools are just a bit different, and only better in some specific areas).
