Completely In-fused!
London's Airbnb listings displayed on an H3 grid in Leaflet

With Fused.io and integrated components such as H3 and DuckDB, I was able to create a web application displaying the Airbnb listings of the world's major cities. Learn how it works below.

You may remember last week's article about Fused.io, the serverless platform for modern geospatial... Since then, I have had time to delve into it a bit more and, more importantly, had the chance to experiment with the newly released DuckDB extension, which allows you to run a DuckDB instance from inside your UDFs (User Defined Functions) and move seamlessly between Python and SQL / DuckDB magic. You can read more about the DuckDB - Fused.io integration here: https://medium.com/@fused/duckdb-fused-fly-beyond-the-serverless-horizon-886d892834aa

The data

To give fused.io a good try, I had to find something more appealing and exciting than simply displaying through fused.io something I could already map without it. I serendipitously discovered this website about Airbnb data: https://insideairbnb.com and decided to build the same kind of listings maps directly from fused.io. The first input is a simple HTML page containing the links to the different cities' datasets as csv.gz files, all named listings.csv.gz, which makes the scraping easier with a regular expression:

import re

import fused
import requests

@fused.cache
def get_city_data(city):
    # Download the page that lists every city's data files
    print('Downloading listings...')
    listings = requests.get("https://insideairbnb.com/get-the-data/")
    if listings.status_code == 200:
        html = listings.text
        # Raw string, with literal dots escaped, so the pattern is read as intended
        regexp = r"(https://data\.insideairbnb\.com/\w+\S+/listings\.csv\.gz)"
        # Scan the matches for the requested city's listings URL
        for m in re.findall(regexp, html):
            # The city is the 6th '/'-separated field of the URL
            if m.split('/')[5] == city:
                print('Data file located at ', m)
                return m
    # Request failed or no URL matched the city
    return None

So given a requested city (Boston by default), the code parses the HTML page to find a matching URL. As the HTML page isn't supposed to change frequently, the result of this lookup is cached in the fused.io infrastructure to make processing faster.
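As a quick illustration of the matching step, here is a minimal offline sketch that applies the same kind of regular expression to a hand-written HTML fragment. The sample links and dates are made up for the example (they only mimic the URL layout the function above relies on), and `find_city_url` is a hypothetical helper name, not part of the original UDF.

```python
import re

# Hand-written stand-in for the real page; these URLs and dates are illustrative only.
SAMPLE_HTML = """
<a href="https://data.insideairbnb.com/united-states/ma/boston/2024-01-01/data/listings.csv.gz">listings.csv.gz</a>
<a href="https://data.insideairbnb.com/spain/catalonia/barcelona/2024-01-01/data/listings.csv.gz">listings.csv.gz</a>
"""

LISTINGS_RE = re.compile(r"https://data\.insideairbnb\.com/\w+\S+/listings\.csv\.gz")


def find_city_url(html, city):
    """Return the first listings URL whose city path segment equals `city`, else None."""
    for url in LISTINGS_RE.findall(html):
        # 'https:', '', host, country, region, city, ... -> the city sits at index 5
        if url.split("/")[5] == city:
            return url
    return None


print(find_city_url(SAMPLE_HTML, "barcelona"))
print(find_city_url(SAMPLE_HTML, "atlantis"))  # no such city in the sample -> None
```

Note that the comparison is a plain string equality on the path segment, so in this sketch the city name has to be spelled exactly as it appears in the URL.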

The processing

Once the correct URL is obtained, we can inject it directly into a DuckDB SQL query, as the Duck engine is able to parse compressed CSV files on the fly. Instead of retrieving latitude / longitude pairs from the CSV file, we are going to aggregate their count in H3 hexagons. If you don't know about H3, it's a hierarchical geospatial indexing system made of hexagonal cells. First developed for Uber by Isaac Brodsky (now co-founder and CTO at Fused, hence its rapid integration into fused.io!), it's now an open-source standard you can learn more about here: https://h3geo.org/docs/. And if you wonder why hexagons instead of simple squares, it's because the distance from a hexagon's center to the center of each of its 6 neighbors is always the same, and the 6 neighbors together form a ring. Those two qualities make a real difference in spatial analysis.

H3 grid: a cell and its 6 neighbors, forming the first ring. (from h3geo.org)

For those who know a bit about maps, you probably remember that you shouldn't map absolute quantities onto areas: the size of each area is already a quantity on the map, so it distorts your value, which is why you use densities or ratios instead of absolute values. But with H3 cells, the space has been normalized (cells of a given resolution have roughly the same area), so you can display quantities directly as soon as you are able to aggregate your data into the cells. That is exactly what the SQL query executed by DuckDB does:

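@fused.cache
def read_data(url, resolution):
    # `con` is assumed to be a DuckDB connection created earlier in the UDF,
    # with the H3 extension loaded
    query = """
        SELECT h3_h3_to_string(h3_latlng_to_cell(latitude, longitude, $resolution)) cell_id,
            h3_cell_to_boundary_wkt(cell_id) boundary,
            count(1) cnt
        FROM read_csv($url)
        GROUP BY cell_id;
    """
    # Bind the URL and resolution as named parameters, then fetch a pandas DataFrame
    df = con.sql(query, params={'url': url, 'resolution': resolution}).df()
    return df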

The CSV file is read, and the latitude and longitude fields are used to generate an H3 cell ID and its boundary, with an associated count (cnt), for a given resolution. Like Google Maps and other hierarchical indexing systems, H3 has different zoom levels, called resolutions. From one resolution to the next, a single cell is divided into roughly 7 children, as shown below:

A parent hexagon approximately contains seven children (from h3geo.org)

The leap from one resolution to the next is thus bigger than for a square index, where each cell splits into 4.
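To put numbers on that leap, here is a small back-of-the-envelope sketch. It relies on one fact from the H3 documentation, namely that the grid has 2 + 120 × 7^r cells at resolution r, and compares the ratio-7 growth with a ratio-4 square index. The function names are mine, for illustration only.

```python
def h3_cell_count(resolution):
    """Total number of H3 cells covering the globe at a given resolution (0 to 15)."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    return 2 + 120 * 7 ** resolution


def refinement_factor(ratio, steps):
    """How many times more cells you get after `steps` resolution changes."""
    return ratio ** steps


# Compare hexagons (ratio 7) with a square index (ratio 4)
for steps in (1, 2, 3):
    print(steps, refinement_factor(7, steps), refinement_factor(4, steps))
# 1 7 4
# 2 49 16
# 3 343 64

print(h3_cell_count(0), h3_cell_count(1))  # 122 842
```

Going up two resolutions therefore splits each cell into about 49 children, against 16 for a square index.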

Once the query is executed, its result is retrieved as a pandas DataFrame, transformed into a GeoDataFrame with some GeoPandas / Shapely magic, and can be displayed directly in the fused.io workbench:
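The conversion code itself is not shown here and relies on GeoPandas / Shapely. As a dependency-free stand-in (not the author's method), the sketch below parses a simple single-ring `POLYGON ((lng lat, ...))` WKT string by hand and wraps one result row into a GeoJSON Feature. I am assuming that this is the shape of the `boundary` column; the coordinates and cell ID are placeholders, and the helper names are hypothetical.

```python
import json


def wkt_polygon_to_ring(wkt):
    """Parse a single-ring 'POLYGON ((x y, x y, ...))' string into [[x, y], ...]."""
    text = wkt.strip()
    if not text.upper().startswith("POLYGON"):
        raise ValueError("only simple POLYGON WKT is handled in this sketch")
    # Keep what lies between the outermost double parentheses
    inner = text[text.index("((") + 2 : text.rindex("))")]
    return [[float(v) for v in pair.split()] for pair in inner.split(",")]


def row_to_feature(cell_id, boundary_wkt, cnt):
    """Wrap one (cell_id, boundary, cnt) row into a GeoJSON Feature dict."""
    return {
        "type": "Feature",
        "properties": {"cell_id": cell_id, "cnt": cnt},
        "geometry": {
            "type": "Polygon",
            "coordinates": [wkt_polygon_to_ring(boundary_wkt)],
        },
    }


# Placeholder row: a made-up triangle, not a real H3 cell boundary.
feature = row_to_feature("example-cell", "POLYGON ((0 0, 1 0, 0.5 1, 0 0))", 12)
print(json.dumps(feature))
```

In practice a library such as GeoPandas does this for a whole DataFrame at once and also handles the coordinate reference system, which this sketch ignores.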

Results for New York City

Smoke it out!

As cool as it is to display that map in the workbench, any geospatial practitioner knows a map is made to be shown to others and shared. Fused.io offers several output formats and app integrations, and we are going to use the simplest one: a single HTTP request for the whole content (it could be tiled too, but our datasets are small enough to fit into a single request / GeoJSON response). The URL provided for our UDF looks like this:

https://www.fused.io/server/v1/realtime-shared/c055cb0177c82cdad444f6953b9d85f49811a336323c1709c885edda873c8439/run/file?dtype_out_vector=geojson&city=Boston&resolution=9        

With a GeoJSON output format it's now very easy to embed our content into a map. I won't explain how to do that as it is pretty straightforward, and you will see the full code in the HTML source.
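The request URL shown above is simply the shared UDF endpoint followed by three query parameters, so it can be assembled programmatically. Here is a minimal sketch using only the standard library; the base URL is copied from the link above, while `build_udf_url` and its `fmt` argument are hypothetical names of mine.

```python
from urllib.parse import urlencode

# Base URL copied from the shared UDF link shown earlier.
UDF_BASE = (
    "https://www.fused.io/server/v1/realtime-shared/"
    "c055cb0177c82cdad444f6953b9d85f49811a336323c1709c885edda873c8439/run/file"
)


def build_udf_url(city, resolution, fmt="geojson"):
    """Assemble the request URL for a given city, H3 resolution and output format."""
    params = {"dtype_out_vector": fmt, "city": city, "resolution": resolution}
    return f"{UDF_BASE}?{urlencode(params)}"


print(build_udf_url("Boston", 9))
print(build_udf_url("Boston", 9, fmt="csv"))
```

`urlencode` also takes care of escaping, which matters for city names containing spaces or accented characters.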

The map

With a single HTML page I was able to showcase the data rather easily. You can access it here: https://ec2-3-83-164-170.compute-1.amazonaws.com/?city=Amsterdam&resolution=10

There is nothing sophisticated in there, but you can play with the parameters in the URL, choosing a city you like (it has to be in the https://insideairbnb.com/get-the-data list; most major cities are) and an appropriate resolution for display. 8 or 9 is often a good fit, depending on the city size. You can get more detailed maps with 10 or 11, but that tends to flatten the result a lot, as most of the cells then only contain a few listings. Remember that the ratio between two consecutive resolutions is 7, so it can be quite a leap!

It's not a bat but London's listings with an H3 resolution of 7

Still London's listings, with a resolution of 8. More detailed but also more confusing?

London again, resolution 9, which follows the land-use discontinuities more closely (parks, rivers, canals, industrial compounds)


Each city has its own spatial signature, the result of both the topography and the concentration or scattering of the tourist accommodation on offer.

Barcelona concentrates its offer in the Old Town and towards the Gràcia and Sagrada Família neighborhoods.


Montreal shows a strong concentration in the city center and then a more scattered profile.

The gist

I must say it's been a lot of fun playing with fused.io, thanks to the early onboarding provided by Fused during this closed beta period. Discovering all the possibilities of the platform and, furthermore, being able to improve and tweak the code thanks to the examples and the documentation makes it really pleasant to dive into. The immediate execution of the code and the debugging window are a great help to stop you from spending too much time on something wrong, and the live map visualisation gives immediate feedback on what you are doing and entices you to jazz it up. As an originally more "server" than serverless person, I am quite amazed by the seamless experience of producing a new UDF once you master the basics, which happens pretty quickly. It's definitely a great playground which makes you focus more on the data and the result than on the underlying technology, and the recent introduction of the DuckDB extension already makes fused.io a great choice for ETL / geodata pipelining on the web. I mean, the possibilities are endless. For instance, if I want to count the number of Airbnb listings available in a city, I can simply do it from the command line on my laptop:

duckdb -c 'SELECT SUM(cnt) from read_csv("https://www.fused.io/server/v1/realtime-shared/c055cb0177c82cdad444f6953b9d85f49811a336323c1709c885edda873c8439/run/file?dtype_out_vector=csv&city=Boston&resolution=9");'
┌──────────┐
│ sum(cnt) │
│  int128  │
├──────────┤
│     4204 │
└──────────┘        

Isn't that electrifying?

Remember DuckDB can also connect to PostgreSQL/PostGIS, and so can fused.io. Both ways, in and out. You can even use it for data transformation and just get a log as a result. And go mapless geospatial...



Mingke Erin Li

Geospatial Data Scientist | Ph.D. in Geomatics Engineering

9 months ago

Thank you for the interesting demonstration! Seems the demo URL doesn't respond, any idea why?

Drew Breunig

Working on Data, Geo, and AI

10 months ago

Wholeheartedly agree: the future of geospatial has less maps.

Plinio Guzman

National Geographic Explorer | Map Maker | Founding Engineer @ Fused.io

10 months ago

"Mapless geospatial" ??

Plinio Guzman

National Geographic Explorer | Map Maker | Founding Engineer @ Fused.io

10 months ago

This is a GREAT example of how DuckDB, H3, and Fused can work together to build bespoke responsive applications. Congrats Guillaume Sueur! Looking forward to seeing this evolve.
