What the heck is GlareDB?

Shawn Gordon

Data geek and developer advocate supreme

Published: September 20, 2023

Introduction

It has been a while since my last “What the heck is??” article, and I’ve recently seen some rapid growth from GlareDB and wanted to learn more. What really piqued my interest were the recent announcements of support for Apache Iceberg and a new hybrid execution model. So, what the heck is GlareDB? Let’s take a look!

Overview

GlareDB is an open-source project built on DataFusion, which is part of the Apache Arrow project. DataFusion is a fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. It offers SQL and DataFrame APIs and built-in support for CSV, Parquet, JSON, and Avro. There are also Python bindings as well as extensive customization possibilities. GlareDB adds many features on top of it, such as cloud storage and the aforementioned hybrid execution feature, providing a layer on top of various compute engines that can:

Query local and remote files
Query other databases and data sources
Store data and queries (as views)
Copy data from sources to destinations
Interop with DataFrame libraries in Python
Run one-off queries from the command line

They describe how it fits in the stack in this diagram:

It supports files located on GCS or S3, as well as the following data sources:

BigQuery
MongoDB (early release)
MySQL
Postgres
Snowflake
Preliminary Iceberg support
Redshift (coming soon)
ClickHouse (coming soon)

They are quickly adding support for various engines, so this list could be incomplete when you read this.

What can I do with it?

At first blush, you look at this and think, hey, this seems a lot like Trino in that it is a federated query engine. On second glance, it seems more like MotherDuck, for a couple of reasons. First, like DuckDB, GlareDB is a single, tight executable, although it is written in Rust instead of C++. Second, it also supports a hybrid execution model (MotherDuck did it first), which I’ll cover shortly.

Given that Trino is written in Java, there is a lot of the Java ecosystem to deal with if you want to use it. Sure, there are pre-built Docker containers around that can shorten this path, but generally, if you are “just trying to do something”, then you have a heavy lift to install and set up Trino. With GlareDB, you can download and use a single executable or use their SaaS product, which looks like this when you first use it:

Now to Hybrid Execution. I’ll paraphrase some of what GlareDB had to say in their blog post on the topic. Say you have a CSV list of user IDs that some other tool extracted from your database. Now you want to enrich that data with some of the users’ demographic information from your database. Say the table is named user_demo and the CSV file is user_id.csv; the query would look something like this:

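SELECT
  m.user_id,
  m.first_name,
  m.last_name,
  m.birth_date
FROM
  user_demo m
INNER JOIN '/user_id.csv' u ON m.user_id = u.id
-- Group by every selected column so the query is valid in engines
-- that reject non-aggregated columns missing from GROUP BY.
GROUP BY m.user_id, m.first_name, m.last_name, m.birth_date;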

Clearly, this is a simple example, but you could enhance it to get information out of other joined tables as well. You can also go in the other direction, where you have some local file with a key field and some data you are interested in that you can join to a table in a database where that extra data in the file doesn’t exist in the database. This has the advantage of not having to go through the process of creating a new table and loading it for this ad-hoc report, thus saving a lot of time.
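
To make the idea concrete, here is a minimal Python sketch of what you would otherwise do by hand to get the same result as the query above: read the IDs from the file, then look up the matching rows in the database. This is not GlareDB's API. The standard-library sqlite3 module stands in for the remote database, an in-memory string stands in for user_id.csv, and the sample rows and the enrich_user_ids helper are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the database table user_demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_demo ("
    "user_id INTEGER PRIMARY KEY, first_name TEXT, "
    "last_name TEXT, birth_date TEXT)"
)
conn.executemany(
    "INSERT INTO user_demo VALUES (?, ?, ?, ?)",
    [
        (1, "Ada", "Lovelace", "1815-12-10"),
        (2, "Alan", "Turing", "1912-06-23"),
        (3, "Grace", "Hopper", "1906-12-09"),
    ],
)

# Hypothetical stand-in for the local file user_id.csv (a single "id" column).
csv_text = "id\n3\n1\n3\n"


def enrich_user_ids(conn, csv_file):
    """Return one demographics row per distinct user ID found in the CSV."""
    # Collect the distinct IDs from the file; duplicates collapse in the set.
    ids = sorted({int(row["id"]) for row in csv.DictReader(csv_file)})
    if not ids:
        return []
    # Build a parameterized IN (...) list instead of loading a new table.
    placeholders = ", ".join("?" for _ in ids)
    query = (
        "SELECT user_id, first_name, last_name, birth_date "
        f"FROM user_demo WHERE user_id IN ({placeholders}) "
        "ORDER BY user_id"
    )
    return conn.execute(query, ids).fetchall()


rows = enrich_user_ids(conn, io.StringIO(csv_text))
for row in rows:
    print(row)
```

The point of the comparison is that the single SQL statement replaces all of this glue code: the engine reads the file and the table and performs the join for you.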

That’s all just meant to give you a quick tickle about what GlareDB can do and where it is at currently. The docs and blogs on their site are well done, making it pretty quick to jump in.

Summary

GlareDB is very interesting, and I appreciate how quickly they are iterating and updating the software. I need to spend some more time thinking about how it plays in the Trino, StarRocks, or DuckDB space. Between the speed and the federated queries, there are some exciting possibilities. I really like the new hybrid execution, which could shortcut work in various situations. If you’d like to give it a spin, try out a free account at GlareDB.

You can read the other “What the heck” articles at these links:

What The Heck Is DuckDB? (I was pretty out front on this one.)

What the Heck Is Malloy? (I was out front on this one, too.)

What the Heck is PRQL? (slower, but also growing)
