Exploring FME 2024.1: Cloud Native Formats and  DuckDB

Exploring FME 2024.1: Cloud Native Formats and DuckDB

With the release of FME 2024.1 FME comes with a DuckDB reader. DuckDB is an open-source in-memory, columnar SQL database management system designed for analytical workloads. This new integration includes a native DuckDB reader and the ability to submit queries to DuckDB via the SQLExecutor transformer.

Why is this neat? FME now allows you to efficiently read data from DuckDB databases and query cloud-native data formats, such as Geoparquet files on S3, without needing to store large datasets locally.

Hold on, what's a Cloud-native data format? Formats like Geoparquet and FlatGeoBuf are file formats that support partial and parallel reads, they allow all the metadata to be read in one read which means small chunks of the file can be read through HTTP range requests making them compatible with object storage like S3 or Azure Blob storage.

An excellent slide show overview created by CloudNativeGeo.org can be reached here: guide.cloudnativegeo.org

So why would I want to use a cloud-native format? There are some excellent open source data projects at the moment the Overture Maps foundation being one, the Source Cooperative another, datasets include global building footprints, transport routes and places. Some of the datasets hosted on these sites are huge, and updating frequently - you likely won't have the drive space, bandwidth or energy to keep a local copy. Instead with cloud native formats you can direcly return the records you need.

Example: Using data directly from the UK Power Networks Open Data Portal

Let's look at an example, this week I discovered that UK Power Networks Open Data Portal now offers data in Parquet/Geoparquet and FlatGeoBuf. I wanted to review the spatial data relating to a project we at Avineon-Tensing are supporting at UK Power Networks , the Eastern Power Network Vectorisation Project.

The data is offered in a range of formats, for this example I went with the Parquet (which as the note mentions supports the Geoparquet specification).

UKPN Open Data Portal Export Page

Step 1: Viewing Data with DBeaver

For now lets view this data in the database tool DBeaver. To create an in-memory duckdb instance refer to the following step by step guide:

DBeaver SQL IDE

Running the script below lists out the structure of the Geoparquet file:

DESCRIBE SELECT * FROM parquet_scan('https://ukpowernetworks.opendatasoft.com/api/explore/v2.1/catalog/datasets/ukpn_epn_vectorisation_delivery_plan/exports/parquet?lang=en&timezone=Europe%2FLondon');        

This returns a table with columns like geo_point_2d, geo_shape, batch_id, dno, and more.

You can then create a select statement against directly against the URL. Nothing is stored locally, it remains in memory.

INSTALL spatial;
LOAD spatial;
SET force_download=TRUE;

SELECT batch_id, dno, operl_area, status, ST_GeomFromWKB(geo_shape) AS geom
FROM parquet_scan('https://ukpowernetworks.opendatasoft.com/api/explore/v2.1/catalog/datasets/ukpn_epn_vectorisation_delivery_plan/exports/parquet?lang=en&timezone=Europe%2FLondon')
WHERE status = 'Complete';        

Only the records where the status is 'Complete' are returned, which optimises bandwidth and processing time. With the spatial extension to DuckDB, the geometry is correctly presented in DBeaver's tabular and map views.

Output from the select query presented in DBeaver

Step 2: Using These Datasets in FME

Now that we've seen the data in DBeaver, let’s take it a step further and use these datasets in FME. Since FME 2023.1, both FlatGeoBuf and Geoparquet formats have full read/write support. Lets find some interesting data...

The UKPN open dataset makes available the National Chargepoint Register (NCR) database of publicly available chargepoints for electric vehicles, this is updated daily and covers all of UKPNs network area. This is a big dataset, over 25,000 locations and growing - data download page. To add the data in FlatGeoBuf and Geoparquet, open up the respective FME readers and copy in the URLs.

Interestingly, while the CSV version of the dataset is 7MB, the Geoparquet version is only 1.5MB, demonstrating the efficiency of this new format. Maybe one day Parquet might replace CSV!

A quick view of the National Charge Point Register sourced from UKPNs open data portal and read in using FlatGeoBuf and GeoParquet

No files are being retrieved locally by FME this is all being pulled directly from the UKPN data portal.

FME 2024.1 now with DuckDB support

Lets now move things up a notch, at the start I mentioned that FME support for DuckDB was exciting. Using FME 2024.1 and above you can submit queries to DuckDB via the SQLExecutor transformer, when you need to query a huge dataset stored online in geoparquet format, you can now do so. No need to copy the whole dataset.

The Overture Maps Places dataset is an open source dataset with over 53 million point representations, kept up to date by Meta and Microsoft data. Entities include businesses, hopsitals, landmarks, pubs etc. The dataset is available through the s3 path:

s3://overturemaps-us-west-2/release/2024-07-22.0/theme=places/type=place/

Overture places data, styled by data source: purple for Meta, orange for Microsoft.

Practical Exercise: Putting Cloud Native Formats and DuckDB to the Test

Imagine you're a savvy coffee business looking to set up shop near a new car charging station, where there's little to no existing competition. For fun, let's use FME to find newly installed charge points from UKPN open data in the last seven days and search the Overture Maps Places dataset to identify the closest coffee shop.

Step 1: Finding New Charge Points

We'll start by pulling the latest data from the UKPN open data, specifically the geoparquet file. We'll filter this data to identify charge points installed in the last seven days. From our search, we found 180 new entries.

FME Form workspace, reading in a geoparquet file and filtering records to the last 7 days

Step 2: Utilising the Overture Maps Places Dataset

To use the Overture Maps Places data effectively, we'll filter by area and place type. First, we'll create a bounding box around all identified charge points and extract the dimensions. These dimensions will be passed into the SQLExecutor transformer. We're looking for coffee shops, so we'll filter by the "coffee_shop" category, as listed in the OSM Wiki.

In the SQLExecutor, we'll choose DuckDB as the database format and build a SQL SELECT statement to query the Places dataset, parsing the bounding box range and the primary category.

SQL Statement used to query the geoparquet Places data hosted by Overture Maps - a dataset of over 50 Milion rows

Step 3: Finding the nearest coffee shop using NeighborFinder

Running this query returns nearly 3,000 coffee shops. After reprojecting the geometry to the British National Grid, the NeighbourFinder transformer identifies the closest existing coffee shop to each charging point.

Results of the nearest neighbour finder process in FME, turns out Costa Coffee is everywhere

Analysing the Results

Some charge points are more than 500 metres away from the nearest coffee shop, while others are just a few steps away.

In any case, it is quite amazing how powerful it is to be using SQL against files and two doing so on huge files and datasets that aren't even stored locally.

While you are here - FME Skills Booster and other events

I hope you liked this bried intoduction into using FME with DuckDB and cloud native geospatial formats - parquet and flatgeobuf. If you did and would like to get some hands on experience we will be running an Tensing FME Skills Booster on this topic later in the year.

If you are interested in learning about FME and integrations with AI we are running an FME+AI Skills booster for free on the 4th of September - Register Here

We are also running two Tensing FME Tour events this year - we invite you to register and save your spot via: tensing.com/en/fme-uk-tour join us for an informative journey into the world of GIS and data. We’ll be visiting the Royal Armouries in Leeds on the 1st October and the M Shed in Bristol on the 8th October for events filled with inspiration, learning, and networking - the best part? It’s free!

At Tensing we offer a range of FME Training covering Intro, Advanced and Authoring. If you would like to brush up your skills get in touch.


Thom Leeffers

Service Manager & Geospatial IT Consultant

7 个月

Seems like GeoParquet and DuckDB are game changers!

回复

要查看或添加评论,请登录

Oliver Morris的更多文章

社区洞察

其他会员也浏览了