What the heck is GlareDB?
Introduction
It has been a while since my last “What the heck is??” article, and I’ve recently seen some rapid growth from GlareDB and wanted to learn more. What really piqued my interest were the recent announcements of support for Apache Iceberg and a new hybrid execution model. So, what the heck is GlareDB? Let’s take a look!
Overview
GlareDB is an open-source project built on DataFusion, which is part of the Apache Arrow project. DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. It offers SQL and Dataframe APIs and built-in support for CSV, Parquet, JSON, and Avro. There are also Python bindings and extensive customization possibilities. GlareDB is adding many features on top of it, such as cloud storage and the aforementioned hybrid execution feature, providing a layer on top of various compute engines that can:
They describe how it fits in the stack in this diagram:
It supports data located on GCS or S3 of the following types:
They are quickly adding support for various engines, so this list could be incomplete when you read this.
What can I do with it?
At first blush, you look at this and think, hey, this seems a lot like Trino in that it is a federated query engine. On second glance, it seems like MotherDuck for a couple of reasons. The first is that, like DuckDB, GlareDB is a single, tight executable, although it is written in Rust instead of C++. Second, they also support this hybrid execution model (MotherDuck did it first), which I’ll cover shortly.
Given that Trino is written in Java, there is a lot of Java ecosystem you need to deal with if you want to use it. Sure, there are pre-built Docker containers around that can shorten this path, but generally, if you are “just trying to do something”, then you have a heavy lift to install and set up Trino. With GlareDB, you can either download and run a single executable or use their SaaS product, which looks like this when you first use it:
Now to Hybrid Execution. I’ll paraphrase some of what GlareDB had to say in their blog post on the topic. Say you have a CSV list of user IDs that some other tool extracted from your database. Now, you want to enrich that data with some of the users’ demographic information from your database. We’ll say our table name is user_demo and our CSV file is user_id.csv, and our query would look something like this:
SELECT
  m.user_id,
  m.first_name,
  m.last_name,
  m.birth_date
FROM
  user_demo m
  INNER JOIN '/user_id.csv' u ON m.user_id = u.id
GROUP BY
  m.user_id,
  m.first_name,
  m.last_name,
  m.birth_date;
Clearly, this is a simple example, but you could enhance it to get information out of other joined tables as well. You can also go in the other direction, where you have some local file with a key field and some data you are interested in that you can join to a table in a database where that extra data in the file doesn’t exist in the database. This has the advantage of not having to go through the process of creating a new table and loading it for this ad-hoc report, thus saving a lot of time.
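To make that second direction concrete, here is a rough sketch of the join itself. It is not GlareDB code: I'm using Python's built-in sqlite3 module as a stand-in, and since SQLite can't query a file path directly, the CSV rows get staged in a temp table first (exactly the step GlareDB lets you skip). The sample rows and the extra tier column are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the database table `user_demo`.
USER_DEMO_ROWS = [
    (1, "Ada", "Lovelace", "1815-12-10"),
    (2, "Grace", "Hopper", "1906-12-09"),
    (3, "Alan", "Turing", "1912-06-23"),
]

# Hypothetical contents of user_id.csv. The `tier` column is the extra
# data that exists only in the file, not in the database.
USER_ID_CSV = "id,tier\n1,free\n3,pro\n"


def enrich(csv_text):
    """Join the CSV rows to user_demo on the user ID and return merged rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE user_demo (user_id INTEGER, first_name TEXT, "
        "last_name TEXT, birth_date TEXT)"
    )
    con.executemany("INSERT INTO user_demo VALUES (?, ?, ?, ?)", USER_DEMO_ROWS)

    # SQLite cannot read a file path in FROM, so stage the CSV in a temp table.
    con.execute("CREATE TEMP TABLE u (id INTEGER, tier TEXT)")
    reader = csv.DictReader(io.StringIO(csv_text))
    con.executemany(
        "INSERT INTO u VALUES (?, ?)",
        [(int(r["id"]), r["tier"]) for r in reader],
    )

    # Same join shape as the article's query, plus the file-only column.
    rows = con.execute(
        "SELECT m.user_id, m.first_name, m.last_name, m.birth_date, u.tier "
        "FROM user_demo m INNER JOIN u ON m.user_id = u.id "
        "ORDER BY m.user_id"
    ).fetchall()
    con.close()
    return rows


result = enrich(USER_ID_CSV)
for row in result:
    print(row)
```

Only users 1 and 3 come back, each carrying the tier value that lives only in the file. User 2 is dropped because the inner join keeps only IDs present on both sides.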
That’s all just meant to give you a quick tickle about what GlareDB can do and where it is at currently. The docs and blogs on their site are well done, making it pretty quick to jump in.
Summary
GlareDB is very interesting, and I appreciate how quickly they are iterating and updating the software. I need to spend some more time thinking about how it plays in the Trino, StarRocks, or DuckDB space. Between the speed and the federated queries, there are some exciting possibilities. I really like the new hybrid execution, which could shortcut work in various situations. If you’d like to give it a spin, try out a free account at GlareDB.
You can read the other “What the heck” articles at these links:
What The Heck Is DuckDB? (I was pretty out front on this one.)
What the Heck Is Malloy? (I was out front on this one, too.)
What the Heck is PRQL? (slower, but also growing)
Make sure to read the piece that Mimoune Djouallah just wrote as well: https://datamonkeysite.com/2023/09/24/glaredb-storage-format/