XEE: Geospatial analysis made easier - a tutorial & code
Ali Ahmadalipour
Research Scientist at Google X | Geospatial, AI, Climate Change, Sustainability
I've been working with geospatial datasets for 13+ years now (since ~2011, when I started analyzing NASA MODIS satellite data). Since then, there have been many advances in the field, all of which have made accessing and analyzing geospatial datasets easier and more convenient.
In this article, I'll present a tutorial focusing on XEE, a new Python tool that combines the power of Google Earth Engine (GEE) and Xarray. I'll share a couple of examples, and I'll provide the code and extra resources for each case. I hope you find this article useful and share it with others. If you're interested in learning more about Xarray, I previously shared a tutorial on climate data analysis that may be useful as well.
Objectives:
1. Problem statement:
Generally, the workflow for geospatial modeling begins with downloading and pre-processing various datasets from different sources. For instance, one would acquire satellite data (e.g. land cover or vegetation) from NASA, climate data from ECMWF (e.g. ERA5), and weather predictions from NOAA (e.g. GFS). Each of these datasets may come in a different format (e.g. HDF, NetCDF, GRIB) and at a different latitude/longitude resolution, so the downloaded raw data usually isn't directly usable later on. If one could instead access different data sources in a consistent format, load the raw data, perform the pre-processing, and save only the desired pre-processed output, it would save quite a bit of time and storage; it would also lower the barrier to entry and make geospatial analysis more accessible. Well, the wait seems to be over :)
GEE provides access to a huge catalog of datasets (tens of petabytes), and the consistent format is surely a desirable feature. Meanwhile, Xarray (being compatible with NumPy, pandas, Dask, and Cartopy) has proven to be a unique tool for analyzing multidimensional / geospatial data (especially climate data). I think the combination of the two is a very useful tool.
The growing number of startups and researchers conducting geospatial analyses (link), many of which use similar data to train their (mostly AI- and data-driven) models, indicates that there's a decent market for new tools that improve geospatial modeling workflows.
2. Tutorial
To be able to run the code for this part, you'll need an active GEE account (preferably with access to the high-volume endpoint). You'll also need the xee package installed (pip install xee), which registers the 'ee' engine used by xr.open_dataset below.
import cartopy.crs
import cartopy.feature as cfeature
import dask.distributed
import ee
import google.auth.compute_engine
import matplotlib.pyplot as plt
import xarray as xr
# Starting dask for multi-processing:
client = dask.distributed.Client(n_workers=4)
client
I'm running the notebook on a normal machine with only 4 CPUs and about 16 GB of RAM, but you'll see later on that memory usage never exceeds 8 GB (in case you're worried about whether you can replicate this analysis on your own machine).
If you want to learn more about Dask, here's a link to a good tutorial.
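By the way, you don't have to take my word on the memory usage: the Dask client exposes a live dashboard where you can watch task progress and per-worker memory. A quick way to find its URL (using the client created above):
# Open this URL in a browser to monitor tasks and memory usage:
print(client.dashboard_link)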
# Initialize Earth Engine:
# You need to have an active EE account for this to work.
# (google.auth.compute_engine credentials are meant for a notebook running on a
# Google Cloud VM; see the alternative below if you're on a local machine.)
credentials = google.auth.compute_engine.Credentials()
ee.Initialize(credentials,
              opt_url='https://earthengine-highvolume.googleapis.com')
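If you're not running on a Google Cloud VM, the usual interactive flow is a sketch like the following, where 'my-project' is a placeholder for your own Cloud project registered for Earth Engine:
# One-time interactive authentication in the browser:
ee.Authenticate()
# Then initialize against the high-volume endpoint:
ee.Initialize(project='my-project',
              opt_url='https://earthengine-highvolume.googleapis.com')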
Let's import ERA5-Land daily data for air temperature across the Contiguous US since 2000, change the projection to WGS84 (aka EPSG:4326), and regrid the data to 0.2 degrees. The original ERA5-Land data is at 0.1 degree spatial resolution and hourly temporal resolution, and can be found at this link. Here, we import the daily aggregates from the Earth Engine (EE) data catalog. On EE, you can find the description of all the bands at this link.
ic = ee.ImageCollection('ECMWF/ERA5_LAND/DAILY_AGGR').filterDate(
    '2000-01-01', '2023-12-31').select('temperature_2m')
# Note: the end date of filterDate is exclusive.
# Another way to do the selection is to not select any band in ic,
# and then use xarray to select the desired variable.

# Bounding box (west, south, east, north) covering the Contiguous US:
region = ee.Geometry.Rectangle(-130, 25, -60, 50)

ds = xr.open_dataset(ic,
                     engine='ee',  # the backend registered by the xee package
                     crs='EPSG:4326',
                     scale=0.2,  # in units of the CRS, i.e. 0.2 degrees
                     geometry=region)
As you can see, with a couple of lines of code, we did all of the following at once: imported the ERA5-Land daily aggregates from the EE catalog, selected the 2m temperature band, filtered the collection to 2000 through 2023, reprojected it to WGS84 (EPSG:4326), regridded it to 0.2 degrees, and clipped it to a bounding box covering the Contiguous US.
If we look at the ds variable above, we'll see that it's a lazily loaded xarray.Dataset with time, lon, and lat dimensions and a single temperature_2m data variable.
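If you want to inspect it yourself, a couple of quick checks (nothing is computed yet at this point; xee builds a lazy view over the image collection):
# Dimensions, coordinates and attributes of the lazily loaded dataset:
print(ds)
# Number of steps along each dimension (time, lon, lat):
print(ds.temperature_2m.sizes)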
2.1. A simple analysis and visualization
For this section, let's do a simple analysis. Let's calculate the climatology (long-term average temperature) for each month. Then, let's plot the result for January.
# Long-term monthly means (climatology), then select January:
ds_mon = ds.groupby(ds.time.dt.month).mean('time')
ds_jan = ds_mon.sel(month=1)

# Convert from Kelvin to Fahrenheit:
ds_jan['temperature_2m'] = (ds_jan.temperature_2m - 273.15) * 1.8 + 32

ds_jan.temperature_2m.plot(x='lon', y='lat', vmin=0, vmax=60, cmap='jet')
The above analysis took only about 60 seconds (with less than 8 GB of RAM used and no raw data files downloaded and stored locally)! Normally, one would use ECMWF's API (i.e. cdsapi) to download the hourly data, aggregate it to daily, and then perform the analysis. Given the wait time of a queued API request on the Copernicus Climate Data Store (CDS), that would have taken many hours.
The following image is the output of the above code:
As you can see, the plot is quite basic and not the best presentation. In section 2.4, we'll use Cartopy and add a lot more features to make the figure much more interesting.
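Before moving on: if you'd like to verify the roughly 60-second runtime on your own machine, one option (a minimal sketch; the plot in this section already triggers the computation implicitly) is to force the computation with .compute() and time it:
import time

t0 = time.time()
# groupby/mean is lazy until .compute() pulls the data from EE and runs the reduction:
clim = ds.groupby(ds.time.dt.month).mean('time').compute()
print(f'Monthly climatology computed in {time.time() - t0:.0f} seconds')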
2.2. Extracting point data:
Let's leverage xarray's functionality and extract a timeseries of February temperature data for Dallas, TX.
ds_dallas = ds.sel(lat=32.802, lon=-96.782, method='nearest').where(
    ds.time.dt.month == 2, drop=True)

# Convert from Kelvin to Fahrenheit:
ds_dallas['temperature_2m'] = (ds_dallas.temperature_2m - 273.15) * 1.8 + 32
Now let's plot the daily values for different years (all in one plot stacked on top of each other):
# Loop over Februaries from 2000 through 2022:
for yr in range(2000, 2023):
    da_sel = ds_dallas.temperature_2m.where(
        ds_dallas.time.dt.year == yr, drop=True)
    if yr == 2021:
        # Highlight the February 2021 cold storm in blue:
        plt.plot(da_sel.time.dt.day, da_sel, color='b', linewidth=2)
    else:
        plt.plot(da_sel.time.dt.day, da_sel, color='g', alpha=.4)

plt.grid()
plt.xlim([1, 29])
plt.xlabel('Day in February')
plt.xticks(ticks=range(1, 30, 2))
plt.ylabel('temperature (°F)')
plt.title(
    'Dallas temperature in February since 2000 (2021 shown in blue)')
The plot shows the daily temperature in Dallas, TX during February for every year since 2000, and it highlights how severe the February 2021 cold storm was. If you're interested in learning more about that event, you can take a look at this analysis I did at the time.
2.3. Saving the outputs:
Saving data with xarray is really convenient. You can save the data as NetCDF or Zarr (and even compress it):
# Save as NetCDF:
ds_dallas.to_netcdf('t2m_dallas.nc')

# You can also save it as Zarr (and use compression too); requires `import zarr`:
# compressor = zarr.Blosc(cname="zstd", clevel=3, shuffle=2)
# ds_dallas.to_zarr("foo.zarr",
#     encoding={"temperature_2m": {"compressor": compressor}})
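To double-check that the outputs were written correctly, you can read them back with xarray (the commented line assumes you ran the to_zarr call above):
# Read the NetCDF file back into memory:
ds_check = xr.open_dataset('t2m_dallas.nc')
# If you wrote the Zarr store, open it with:
# ds_check = xr.open_zarr('foo.zarr')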
2.4. More advanced visualization (using Cartopy):
Let's look back at the first figure we generated (in section 2.1), add more features to it, and turn it into a high-quality plot.
fig, axis = plt.subplots(1, 1,
subplot_kw=dict(projection=cartopy.crs.Orthographic(-100, 34.5)),
figsize=(5.8, 5))
plt.subplots_adjust(top=.94, bottom=0, left=.1, right=.98)
ds_jan.temperature_2m.plot(
x='lon',
y='lat',
vmin=0,
vmax=60,
cmap='jet',
ax=axis,
transform=cartopy.crs.PlateCarree(),
cbar_kwargs={
"orientation": "horizontal",
"shrink": 0.85, # length of colorbar
"pad": 0.08, # the distance between colorbar and the plot
"aspect": 30, # the width (thickness) of colorbar
"label": "2m temperature (°F)"
},
robust=True)
axis.coastlines()
axis.add_feature(cfeature.STATES, zorder=1, linewidth=.5, edgecolor='k')
axis.add_feature(cfeature.OCEAN, facecolor='b', alpha=.2)
gl = axis.gridlines(crs=cartopy.crs.PlateCarree(),
draw_labels=True,
linewidth=.6,
color='gray',
alpha=0.5,
linestyle='-.')
gl.top_labels = False # suppress top labels
gl.right_labels = False # suppress right labels
axis.set_extent([-120.5, -72, 23.6, 53], crs=cartopy.crs.PlateCarree())
plt.title('Mean January temperature (2000-2023)')
plt.savefig('Fig_t2m_jan_mean.png', format='png', dpi=200)
The above code generates the following figure:
Much better (in my opinion)! You can choose other geographic projections or add other features too (e.g. rivers, lakes, etc.).
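For instance, rivers and lakes are available as built-in Natural Earth features in Cartopy; a minimal sketch of adding them to the same axis would be:
# Add rivers and lakes on top of the existing map:
axis.add_feature(cfeature.RIVERS, linewidth=.5, edgecolor='tab:blue')
axis.add_feature(cfeature.LAKES, facecolor='b', alpha=.2)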
3. Limitations:
Note: there's no single solution for all purposes. XEE is a useful tool, and it's an excuse for me to share this tutorial :)
4. Summary:
The combination of Google Earth Engine and xarray seems to be an amazing way of accessing geospatial data: really flexible, highly customizable, and definitely much faster and more convenient than many other data portals out there for moderate to large datasets. This is particularly true for many of the satellite datasets (e.g. from NASA) that are distributed across many, many tiles (for different regions and times). Nonetheless, I think improving large-scale ML workflows for geospatial problems is still needed and worth further exploration. Better data storage and access will make large-scale (e.g. continental- or global-scale) geospatial ML possible not only for the giant companies with vast resources, but also for many small startups. Startups like Earthmover, Fused, and Coiled have been working on data storage and processing to some extent, but there are still opportunities for deep-tech startups in this area.