XEE: Geospatial analysis made easier - a tutorial & code

XEE: Geospatial analysis made easier - a tutorial & code

I've been working with geospatial datasets for 13+ years now (since ~2011 that I started analyzing NASA-MODIS satellite data). Since then, there have been many advances in the field, all of which have made accessing and analyzing geospatial datasets easier and more convenient.

In this article, I'll present a tutorial focusing on a XEE, new python tool that combines the power of Google Earth Engine (GEE) and Xarray. I'll share a couple of examples and I'll provide the code and extra resources for each case. I hope you find this article useful and I hope you share it with others. If you're interested to learn more about Xarray, I had shared a tutorial for climate data analysis that can be useful as well.

Objectives:

  • Present a tutorial for a new geospatial data analysis tool (XEE)
  • Provide sample code for how to analyze geospatial data and generate high quality visualizations
  • Encourage people (especially students) to learn Python and showcase how easy it is to perform geospatial analysis

1. Problem statement:

Generally, the workflow for geospatial modeling begins with downloading and pre-processing various datasets from different sources. For instance, one would acquire satellite data (e.g. land cover or vegetation) from NASA, climate data from ECMWF (e.g. ERA-5), and weather predictions from NOAA (e.g. GFS). Each of these datasets may be in a different format (e.g. hdf, netcdf, grib, etc) and in various latitude/longitude resolutions. Therefore, the downloaded raw data is usually not useful later on. If one could only access different data sources in a consistent format, load the raw data to perform pre-processing and then save the desired pre-processed data, it would save quite a bit of time and storage, and it would also reduce the barriers to entry, and hence make geospatial analysis more accessible. Well, the wait seems to be over :)

A typical geospatial modeling workflow and how XEE has improved it

GEE provides access to a huge catalog of datasets (tens of petabytes), and the consistent format is surely a desirable feature. Meanwhile, Xarray (being compatible with numpy, pandas, dask, and cartopy) has proven to be a unique tool for analyzing multidimensional / geospatial data (especially climate data). I think the combination of the two is a very useful tool.

The growing number of startups and researchers conducting geospatial analyses (link), many of which are using similar data for training their models (mostly AI and data-driven nowadays) indicates that there's a decent market for new tools for improving geospatial modeling workflows.

2. Tutorial

To be able to run the code for this part, you'll need to have an active GEE account (preferably with a high volume endpoint).

import cartopy.crs
import cartopy.feature as cfeature
import dask.distributed
import ee
import google.auth.compute_engine
import matplotlib.pyplot as plt
import xarray as xr

# Starting dask for multi-processing:
client = dask.distributed.Client(n_workers=4)
client        

I'm running the notebook on a normal machine with only 4 cpus and about 16GB of RAM, but you'll see later on that our memory usage never exceeds 8GB (just in case you're worried if you can replicate this analysis on your machine).

If you want to learn more about Dask, here's the link to a good tutorial for it.

# initialize Earth Engine:
# You need to have an active EE acount for this to work.
credentials = google.auth.compute_engine.Credentials()
ee.Initialize(credentials,
    opt_url='https://earthengine-highvolume.googleapis.com')        

Let's import ERA5-Land daily data for air temperature across the Contiguous US since 2000, change the projection to WGS84 (aka EPSG4326) and regrid the data to 0.2 degrees. The original data is at 0.25 degree spatial resolution and hourly temporal resolution, and can be found at this link. Here, we import the data from Earth Engine (EE) data catalogue. On EE, you can find the description for all the bands at this link.

ic = ee.ImageCollection('ECMWF/ERA5_LAND/DAILY_AGGR').filterDate(
    '2000-01-01', '2023-12-31').select('temperature_2m')
# Another way to do the selection is to not select any band in ic, 
# and then use xarray to select the desired variable.
region = ee.Geometry.Rectangle(-130, 25, -60, 50)
ds = xr.open_dataset(ic,
                     engine='ee',
                     crs='EPSG:4326',
                     scale=0.2,
                     geometry=region)        

As you can see, with a couple lines of code, we did all the following at once:

  • Filter the date range
  • Select the variable of interest
  • Select (and filter) the region of interest
  • Change the geographic projection (coordinate reference system)
  • Change the spatial resolution (regridding)

If we look at the ds variable above, we'll see the details as follows:

The metadata of the xarray dataset

2.1. A simple analysis and visualization

For this section, let's do a simple analysis. Let's calculate the climatology (long-term average temperature) for each month. Then, let's plot the result for January.

ds_mon = ds.groupby(ds.time.dt.month).mean('time')
ds_jan = ds_mon.sel(month=1)
ds_jan.__setitem__('temperature_2m',(ds_jan.temperature_2m - 
    273.15)* 1.8 + 32)
ds_jan.temperature_2m.plot(x='lon', y='lat', vmin=0, vmax=60,
    cmap='jet')        

The above analysis took just about 60 seconds (with less than 8GB RAM used and no raw data being downloaded)! Normally, one should use ECMWF's API (i.e. cdsapi) to download the hourly data from ECMWF, convert it to daily, and then perform the analysis. Given the wait time that a queued API request takes on Copernicus climate data store (cds), that would have taken many hours.

The following image is the output of the above code:

The default plot of January climatology (with almost no revision or configuration)

As you can see, the plot is really basic and it's not the best presentation. In section 2.4, we'll use Cartopy and we'll add a lot more features to the plot to make the figure much more interesting.

2.2. Extracting point data:

Let's leverage xarray functionalities and extract timeseries of temperature data for Dallas, TX in February.

ds_dallas = ds.sel(lat=32.802, lon=-96.782, method='nearest').where(
    ds.time.dt.month==2, drop=True)
ds_dallas.__setitem__('temperature_2m',(ds_dallas.temperature_2m -
    273.15) * 1.8 + 32)        

Now let's plot the daily values for different years (all in one plot stacked on top of each other):

for yr in range(2000,2023):
    da_sel = ds_dallas.temperature_2m.where(
        ds_dallas.time.dt.year==yr, drop=True)
    if yr == 2021:
        plt.plot(da_sel.time.dt.day, da_sel, color='b', linewidth=2)
    else:
        plt.plot(da_sel.time.dt.day, da_sel, color='g', alpha=.4)
plt.grid()
plt.xlim([1,29])
plt.xlabel('Day in February')
plt.xticks(ticks=range(1,30,2))
plt.ylabel('temperature (°F)')
plt.title(
  'Dallas temperature in February since 2000 (2021 shown in blue)')        
Daily air temperature in Dallas, TX during Februaries of 2000-2023

The plot shows the daily temperature in Dallas, TX during February since the year 2000, and it highlights how severe the 2021 cold storm was. If you're interested to learn more about that event, you can take a look at this analysis I did at the time.

2.3. Saving the outputs:

Saving data with xarray is really convenient. you can save the data as netcdf or zarr (and even compress the data):

# Save as NetCDF:
ds_dallas.to_netcdf('t2m_dallas.nc')
# You can also simply save it as zarr (and use compression too):
# compressor = zarr.Blosc(cname="zstd", clevel=3, shuffle=2)
# ds_dallas.to_zarr("foo.zarr", 
#                   encoding={"foo":{"compressor":compressor}})        

2.4. More advanced visualization (using Cartopy):

Let's look back at the first figure we generated (in section 2.1) and add more features to it and generate a high quality plot for it.

fig, axis = plt.subplots(1, 1,
    subplot_kw=dict(projection=cartopy.crs.Orthographic(-100, 34.5)),
    figsize=(5.8, 5))
plt.subplots_adjust(top=.94, bottom=0, left=.1, right=.98)
ds_jan.temperature_2m.plot(
    x='lon',
    y='lat',
    vmin=0,
    vmax=60,
    cmap='jet',
    ax=axis,
    transform=cartopy.crs.PlateCarree(),
    cbar_kwargs={
        "orientation": "horizontal",
        "shrink": 0.85, # length of colorbar
        "pad": 0.08, # the distance between colorbar and the plot
        "aspect": 30, # the width (thickness) of colorbar
        "label": "2m temperature (°F)"
    },
    robust=True)
axis.coastlines()
axis.add_feature(cfeature.STATES, zorder=1, linewidth=.5, edgecolor='k')
axis.add_feature(cfeature.OCEAN, facecolor='b', alpha=.2)
gl = axis.gridlines(crs=cartopy.crs.PlateCarree(),
                    draw_labels=True,
                    linewidth=.6,
                    color='gray',
                    alpha=0.5,
                    linestyle='-.')
gl.top_labels = False  # suppress top labels
gl.right_labels = False  # suppress right labels
axis.set_extent([-120.5, -72, 23.6, 53], crs=cartopy.crs.PlateCarree())
plt.title('January 2020 mean temperature')
plt.savefig('Fig_t2m_199001.png', format='png', dpi=200)        

The above code generates the following figure:

Sample plot for January 2020 when using Cartopy and adding more features to the figure

Much better (in my opinion)! You can choose other geographic projections or add other features too (e.g. rivers, lakes, etc.).

4. Limitations:

  • GEE API requests are still limited and you can't request a lot of data at once (e.g. you can't ask for 50 years of daily data over the CONUS at 0.1 degree resolution).
  • When using GEE, you essentially depend on GEE for any data access and that may limit your control (but do we really have much control if we get the data directly from NOAA or ECMWF?).

??Note: there's no single solution for all purposes. XEE is a useful tool and it's an excuse for me to share this tutorial :)

5. Summary:

The combination of Google Earth Engine and xarray seems to be an amazing way of accessing geospatial data: really flexible and highly customizable, and definitely much faster and more convenient than a lot of other data portals out there for moderate to large size datasets. This is in particular true for many of the satellite datasets (e.g. NASA) that are available on many many tiles (for different regions and times). Nonetheless, I think improving large-scale ML workflows for geospatial problems is still needed and worth further exploration. Better data storage and access will make large-scale (e.g. continental or global scale) geospatial ML possible for not only the giant companies with vast resources, but also to many small startups. Startups like Earthmover , Fused , and Coiled have been working on data storage and processing to some extent, but there's still opportunities for deep tech startups in this area.

Roxana F.

Hydrologist at the Water Center @TecDeMonterrey | Professor in hydrology | GIS Specialist | Environmental consulting | Electronics

8 个月

Gracias!

回复
Imran A Khan

Assistant Professor at Karachi University

9 个月

very nice

回复
Nabilla Putri

Quality Control Department at PT. JGC Indonesia

12 个月

useful code for processing long spatiotemporal data. but, i'd like to ask if this xarray could help me to extract tabular data of chlorophyll value from Sentinel-2 Level 1C data? Since the results shows best interpretation in level 1 data, it is a must to done atmospheric correction and it was only use single image (no range of dates). So, is this xarray able to allow me to extract tabular data format? Ali Ahmadalipour

回复
Alex Merose

Founding Member of the Technical Staff @ Open Athena

1 年

Thanks for writing a tutorial for Xee!

Sai Krishna Dammalapati

Civic Technology | Statistics | Data | Science

1 年

Thank you Ali Ahmadalipour for this tutorial. Could you please check this once? In "xr.open_dataset(engine='ee')" -- there is an error popping up that ee is an unrecognised engine. The xarray documentation also does not show ee as one of the available engines. https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html I've installed xee library as well Curious to know how to executed it.

回复

要查看或添加评论,请登录

Ali Ahmadalipour的更多文章

社区洞察

其他会员也浏览了