Apache Drill, not bad at all!
https://technology.amis.nl/data-analytics/what-is-apache-drill-and-how-to-setup-your-proof-of-concept/

A few days ago I had some exposure to Apache Drill and I found it somewhat interesting. Even if it can be considered a bit legacy and “not so cool” nowadays (there are so many MPP open-source alternatives with SQL dialects...), Drill is still a good piece of software.

We can define Drill as an open-source distributed SQL query engine, inspired by Google Dremel/BigQuery, that is able to run distributed queries over large-scale datasets. It is designed to be scalable, flexible, and efficient, allowing us to quickly and easily access and analyze data from a wide variety of sources. It can run on anything from a single node to hundreds or thousands of them, coordinated using ZooKeeper.

[Image: Drill cluster architecture overview]

When a Drill client issues a query (using JDBC, ODBC, a command line interface or the REST API), any Drillbit service in the cluster can accept it; there is no global master-slave concept. That Drillbit then parses the query, optimizes it, and generates a distributed query plan that, most of the time, is optimized for fast and efficient execution.

The Drillbit that accepts the query becomes the driving Drillbit node for just that request. It determines the appropriate nodes to execute the various query plan fragments in order to maximize data locality. When the individual nodes finish their execution, they return the data to the driving Drillbit, and from there the results are streamed back to the client.

[Image: distributed execution of query fragments across Drillbits]

Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments. It also supports accessing data in a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, SMB and local files.
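This means a single SQL statement can even combine data living in different systems. A minimal sketch of such a cross-source join (it assumes the hive and mongo storage plugins are enabled, and the schema/table names below are purely hypothetical):

-- Hypothetical join between a Hive table and a MongoDB collection.
-- Plugin, schema and table names are illustrative only.
SELECT c.customer_name,
       SUM(o.total_amount) AS total_spent
FROM hive.sales.`customers` c
JOIN mongo.shop.`orders` o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name;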

One of the key advantages of Apache Drill is its flexibility. It is able to query data in a variety of formats, from structured to semi-structured and unstructured (including CSV, JSON, Parquet, and Avro). This makes it a good tool for working with data coming from a variety of sources, such as logs, sensors, and other semi-structured feeds. Another advantage of Apache Drill is its low cost: because it is open-source software, it is available for free, which can mean significant savings for organizations looking to implement big data solutions.

Another strength of Apache Drill is its performance. It uses a columnar in-memory data representation, which allows for efficient data access and query processing. It also uses a vectorized query execution engine, operating directly on columnar data without row materialization, so it can process large datasets in parallel with a low memory footprint and perform complex operations at high speed. This makes it ideal for processing large datasets quickly and efficiently. Drill also generates highly efficient custom code for every single query. The following image shows the Drill compilation/code generation process:

[Image: Drill compilation/code generation process]

At the same time, all this “complexity” is not a problem at all when starting out with Drill. The minimum configuration is just one node, so you can easily test it by creating a Docker container in a few seconds:

docker run -it --name drill -p 8047:8047 -p 31010:31010 apache/drill        

Regarding connectivity, as we already said, we can connect to Drill using JDBC or the ODBC driver, and there is also a REST API to which we can send SQL queries using HTTP POST requests.

I will now show you how to query some Parquet files with SQL from the web interface and using a POST request. After that, I will show you some of the information exposed through the execution plans of Drill SQL queries.

Let’s start by querying some sample data from the web interface. We just need to connect to port 8047 of our container and click the Query option to run a sample query:

[Images: running a sample query from the Drill web interface and its results]
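For reference, the starter query is the kind of thing the Drill documentation suggests against the sample data bundled with the distribution, reachable through the built-in cp (classpath) storage plugin; something along these lines:

-- Sample query against the employee.json dataset that ships with Drill.
SELECT employee_id, full_name, position_title
FROM cp.`employee.json`
LIMIT 10;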

Now we can run the same query from Postman and get the same result in JSON format:

[Image: the same query sent from Postman as an HTTP POST request, returning JSON]

But the sample data included is just too small to run any interesting query. We will download some “classic” Parquet sample data (the NY yellow taxi data) into our sample folder (/opt/drill/sample-data by default):

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet        

After the download is completed, we can directly query the file and get some statistical information from it:

[Images: aggregation query over the taxi Parquet file and its results]
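The query itself was a simple aggregation over the downloaded file. Assuming the default dfs storage plugin pointing at the local file system, it looked roughly like this (the column names come from the public NY yellow taxi schema):

-- Basic per-passenger-count statistics over the January 2022 file.
SELECT passenger_count,
       COUNT(*)           AS trips,
       AVG(trip_distance) AS avg_distance,
       MAX(total_amount)  AS max_amount
FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-01.parquet`
GROUP BY passenger_count
ORDER BY trips DESC;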

For these exploratory queries, it is useful to know that from the web interface it is easy to export the result to CSV using the Export button located at the top right of the result grid:

[Image: Export button at the top right of the result grid]

Apart from the result grid, the system also offers information about how long it took to plan the query (generate the execution plan), how long it was waiting in the queue, and how long it took to run. In this case all of that took 674 ms:

[Image: query profile with planning, queued and execution times]

So here in Drill we have the classic steps of almost any other system that uses SQL: parsing, creation of a logical plan, optimization, and then the final execution using a physical plan:

[Image: query planning and execution phases]
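If you want to inspect the plan without actually running the query, Drill also exposes it through EXPLAIN; a minimal example (the query itself is just illustrative):

-- Show the physical plan Drill generates for a query.
-- EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR shows the logical plan instead.
EXPLAIN PLAN FOR
SELECT passenger_count, COUNT(*) AS trips
FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-01.parquet`
GROUP BY passenger_count;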

We can also see the details of each operation, and here we can see that most of the time for this query was spent (as usual) reading the data and doing the aggregation (using a hash algorithm in this case):

[Image: per-operator timings in the query profile]

We also have a graphical representation of the plan in a DAG format (the same we have in other tools), with some “Stages” and “Tasks” inside them. Again, lots of similarities with other “modern tools”:

[Image: DAG visualization of the query plan]

Regarding the operators, we can see that the physical join/aggregator operators are very similar to the ones we have in other DB engines:

[Image: physical join/aggregation operators in the plan]

Now we can write another query, for example one using window functions, so that for each number of passengers we get the two trips with the longest trip distance:

[Images: window function query and its results]
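A sketch of that window-function query, assuming the same dfs path and the standard taxi column names:

-- Top two trips by distance for each passenger count.
SELECT passenger_count, trip_distance, total_amount
FROM (
  SELECT passenger_count, trip_distance, total_amount,
         ROW_NUMBER() OVER (PARTITION BY passenger_count
                            ORDER BY trip_distance DESC) AS rn
  FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-01.parquet`
) ranked
WHERE rn <= 2;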

In this case, some operations involve sorts, so they can spill to disk if not enough memory is available. That will slow the query down considerably (which is what happens in most database engines too). Spilling will always be a “problem” for performance, and in Drill we can also configure a specific spill directory where we can mount faster devices (faster SSDs with no storage redundancy, no geo-replication and so on) to speed up the sort & hash operations:

[Images: spill-to-disk metrics and spill directory configuration]

To continue with some testing, we can add three more Parquet files to increase the volume of the data:

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-03.parquet

wget --no-check-certificate https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-04.parquet

In the SQL query we can use wildcards to reference them all together. Now the cost and the time to execute increase considerably, because we need to order ~12 million rows and we are spilling more than before:

[Images: query profile over all four files, showing the increased spilling]
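The only change needed is in the FROM clause, where a wildcard picks up all four monthly files; roughly:

-- Same window-function query, now over every downloaded 2022 monthly file.
SELECT passenger_count, trip_distance, total_amount
FROM (
  SELECT passenger_count, trip_distance, total_amount,
         ROW_NUMBER() OVER (PARTITION BY passenger_count
                            ORDER BY trip_distance DESC) AS rn
  FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-*.parquet`
) ranked
WHERE rn <= 2;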

In general, it is worth trying to push filters down to reduce the amount of data to be processed. Drill supports Parquet filter pushdown for the VARCHAR and DECIMAL data types, and uses the binary statistics in the Parquet file or in the Drill metadata file to push filters to the data sources. So, for example, if we filter by the maximum trip_distance value, which lives in only one file, Drill will only scan the 2.9 million rows of that file. Drill also supports partition pruning when querying file systems (including an optimized Parquet metadata cache) and Hive tables, as long as we have the proper folder partition hierarchy in place:

[Images: scan row counts with the filter pushed down to a single file]
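For example, a filter on a very large trip_distance value (the threshold below is only illustrative) lets Drill skip every file or row group whose Parquet statistics show it cannot contain matching rows:

-- Only files whose Parquet min/max statistics admit trip_distance > 300
-- need to be scanned; the rest are pruned.
SELECT *
FROM dfs.`/opt/drill/sample-data/yellow_tripdata_2022-*.parquet`
WHERE trip_distance > 300;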

There are lots of parameters in Drill, so we can, for example, control multiphase aggregation using the planner.enable_multiphase_agg option, tune the broadcast threshold and factor, or enable/disable specific join algorithms if we want. Again, I see lots of similarities with other systems:

[Images: planner options exposed in the Drill web UI]
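These options can be changed per session (or system-wide) with ALTER SESSION / ALTER SYSTEM. A quick sketch using typical planner options (only planner.enable_multiphase_agg is mentioned above; the other option names are the usual Drill ones and may vary by version):

-- Disable multiphase aggregation for this session.
ALTER SESSION SET `planner.enable_multiphase_agg` = false;
-- Tune the broadcast join threshold (rows) and its cost factor.
ALTER SESSION SET `planner.broadcast_threshold` = 10000000;
ALTER SESSION SET `planner.broadcast_factor` = 2.0;
-- Enable/disable specific join algorithms.
ALTER SESSION SET `planner.enable_hashjoin` = true;
ALTER SESSION SET `planner.enable_mergejoin` = false;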

In the end, almost all MPP systems have similar options to fine-tune the queries that are not performing as expected. Every MPP system will sometimes require some “human help” in the form of plan shaping and “intelligent data redistribution”, and eventually somebody will need to dig into the execution plans and “do the magic trick” to get good query performance.

It is good from time to time to get some exposure to tools from a different stack than the one you are used to working with. Over time, you will see that the similarities are enormous, much greater than the differences, so please don't trust the marketing stuff. Its main focus is always to “differentiate” from the competition and be “the best” (but most of the time the tools are just a bit different, and only better in some specific areas).
