XEE: Geospatial analysis made easier - a tutorial & code
Ali Ahmadalipour
Research Scientist at Google X | Geospatial, AI, Climate Change, Sustainability
I've been working with geospatial datasets for 13+ years now (since ~2011, when I started analyzing NASA MODIS satellite data). Since then, there have been many advances in the field, all of which have made accessing and analyzing geospatial datasets easier and more convenient.
In this article, I'll present a tutorial focusing on XEE, a new Python tool that combines the power of Google Earth Engine (GEE) and Xarray. I'll share a couple of examples, and I'll provide the code and extra resources for each case. I hope you find this article useful and share it with others. If you're interested in learning more about Xarray, I previously shared a tutorial on climate data analysis that may be useful as well.
Objectives:
1. Problem statement:
Generally, the workflow for geospatial modeling begins with downloading and pre-processing various datasets from different sources. For instance, one would acquire satellite data (e.g. land cover or vegetation) from NASA, climate data from ECMWF (e.g. ERA5), and weather predictions from NOAA (e.g. GFS). Each of these datasets may come in a different format (e.g. HDF, NetCDF, GRIB) and at a different latitude/longitude resolution, so the downloaded raw data usually isn't directly usable later on. If one could instead access different data sources in a consistent format, load the raw data, perform the pre-processing, and save only the desired pre-processed output, it would save quite a bit of time and storage; it would also lower the barrier to entry and make geospatial analysis more accessible. Well, the wait seems to be over :)
GEE provides access to a huge catalog of datasets (tens of petabytes), and the consistent format is surely a desirable feature. Meanwhile, Xarray (being compatible with NumPy, pandas, Dask, and Cartopy) has proven to be a unique tool for analyzing multidimensional / geospatial data (especially climate data). I think the combination of the two is a very useful tool.
The growing number of startups and researchers conducting geospatial analyses (link), many of which use similar data to train their (mostly AI- and data-driven) models, indicates that there's a decent market for new tools that improve geospatial modeling workflows.
2. Tutorial
To be able to run the code for this part, you'll need an active GEE account (preferably with access to the high-volume endpoint). You'll also need the xee package installed (pip install xee), which registers the 'ee' engine used by xr.open_dataset below.
import cartopy.crs
import cartopy.feature as cfeature
import dask.distributed
import ee
import google.auth.compute_engine
import matplotlib.pyplot as plt
import xarray as xr
# Starting dask for multi-processing:
client = dask.distributed.Client(n_workers=4)
client
I'm running the notebook on a normal machine with only 4 CPUs and about 16 GB of RAM, but you'll see later on that memory usage never exceeds 8 GB (in case you're worried about whether you can replicate this analysis on your own machine).
If you want to learn more about Dask, here's a link to a good tutorial.
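By the way, you don't have to take my word on the memory usage: the Dask client exposes a live dashboard where you can watch task progress and per-worker memory. A quick way to find its URL (using the client created above):
# Open this URL in a browser to monitor tasks and memory usage:
print(client.dashboard_link)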
# Initialize Earth Engine:
# You need to have an active EE account for this to work.
# (google.auth.compute_engine credentials are meant for a notebook running on a
# Google Cloud VM; see the alternative below if you're on a local machine.)
credentials = google.auth.compute_engine.Credentials()
ee.Initialize(credentials,
              opt_url='https://earthengine-highvolume.googleapis.com')
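If you're not running on a Google Cloud VM, the usual interactive flow is a sketch like the following, where 'my-project' is a placeholder for your own Cloud project registered for Earth Engine:
# One-time interactive authentication in the browser:
ee.Authenticate()
# Then initialize against the high-volume endpoint:
ee.Initialize(project='my-project',
              opt_url='https://earthengine-highvolume.googleapis.com')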
Let's import ERA5-Land daily data for air temperature across the Contiguous US since 2000, change the projection to WGS84 (aka EPSG:4326), and regrid the data to 0.2 degrees. The original ERA5-Land data is at 0.1 degree spatial resolution and hourly temporal resolution, and can be found at this link. Here, we import the daily aggregates from the Earth Engine (EE) data catalog. On EE, you can find the description of all the bands at this link.
ic = ee.ImageCollection('ECMWF/ERA5_LAND/DAILY_AGGR').filterDate(
    '2000-01-01', '2023-12-31').select('temperature_2m')
# Note: the end date of filterDate is exclusive.
# Another way to do the selection is to not select any band in ic,
# and then use xarray to select the desired variable.

# Bounding box (west, south, east, north) covering the Contiguous US:
region = ee.Geometry.Rectangle(-130, 25, -60, 50)

ds = xr.open_dataset(ic,
                     engine='ee',  # the backend registered by the xee package
                     crs='EPSG:4326',
                     scale=0.2,  # in units of the CRS, i.e. 0.2 degrees
                     geometry=region)
As you can see, with a couple of lines of code, we did all of the following at once: imported the ERA5-Land daily aggregates from the EE catalog, selected the 2m temperature band, filtered the collection to 2000 through 2023, reprojected it to WGS84 (EPSG:4326), regridded it to 0.2 degrees, and clipped it to a bounding box covering the Contiguous US.
If we look at the ds variable above, we'll see that it's a lazily loaded xarray.Dataset with time, lon, and lat dimensions and a single temperature_2m data variable.
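If you want to inspect it yourself, a couple of quick checks (nothing is computed yet at this point; xee builds a lazy view over the image collection):
# Dimensions, coordinates and attributes of the lazily loaded dataset:
print(ds)
# Number of steps along each dimension (time, lon, lat):
print(ds.temperature_2m.sizes)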
2.1. A simple analysis and visualization
For this section, let's do a simple analysis. Let's calculate the climatology (long-term average temperature) for each month. Then, let's plot the result for January.
# Long-term monthly means (climatology), then select January:
ds_mon = ds.groupby(ds.time.dt.month).mean('time')
ds_jan = ds_mon.sel(month=1)

# Convert from Kelvin to Fahrenheit:
ds_jan['temperature_2m'] = (ds_jan.temperature_2m - 273.15) * 1.8 + 32

ds_jan.temperature_2m.plot(x='lon', y='lat', vmin=0, vmax=60, cmap='jet')
The above analysis took only about 60 seconds (with less than 8 GB of RAM used and no raw data files downloaded and stored locally)! Normally, one would use ECMWF's API (i.e. cdsapi) to download the hourly data, aggregate it to daily, and then perform the analysis. Given the wait time of a queued API request on the Copernicus Climate Data Store (CDS), that would have taken many hours.
The following image is the output of the above code:
As you can see, the plot is quite basic and not the best presentation. In section 2.4, we'll use Cartopy and add a lot more features to make the figure much more interesting.
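Before moving on: if you'd like to verify the roughly 60-second runtime on your own machine, one option (a minimal sketch; the plot in this section already triggers the computation implicitly) is to force the computation with .compute() and time it:
import time

t0 = time.time()
# groupby/mean is lazy until .compute() pulls the data from EE and runs the reduction:
clim = ds.groupby(ds.time.dt.month).mean('time').compute()
print(f'Monthly climatology computed in {time.time() - t0:.0f} seconds')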
2.2. Extracting point data:
Let's leverage xarray's functionality and extract a timeseries of February temperature data for Dallas, TX.
ds_dallas = ds.sel(lat=32.802, lon=-96.782, method='nearest').where(
    ds.time.dt.month == 2, drop=True)

# Convert from Kelvin to Fahrenheit:
ds_dallas['temperature_2m'] = (ds_dallas.temperature_2m - 273.15) * 1.8 + 32
Now let's plot the daily values for different years (all in one plot stacked on top of each other):
# Loop over Februaries from 2000 through 2022:
for yr in range(2000, 2023):
    da_sel = ds_dallas.temperature_2m.where(
        ds_dallas.time.dt.year == yr, drop=True)
    if yr == 2021:
        # Highlight the February 2021 cold storm in blue:
        plt.plot(da_sel.time.dt.day, da_sel, color='b', linewidth=2)
    else:
        plt.plot(da_sel.time.dt.day, da_sel, color='g', alpha=.4)

plt.grid()
plt.xlim([1, 29])
plt.xlabel('Day in February')
plt.xticks(ticks=range(1, 30, 2))
plt.ylabel('temperature (°F)')
plt.title(
    'Dallas temperature in February since 2000 (2021 shown in blue)')
The plot shows the daily temperature in Dallas, TX during February for every year since 2000, and it highlights how severe the February 2021 cold storm was. If you're interested in learning more about that event, you can take a look at this analysis I did at the time.
2.3. Saving the outputs:
Saving data with xarray is really convenient. You can save the data as NetCDF or Zarr (and even compress it):
# Save as NetCDF:
ds_dallas.to_netcdf('t2m_dallas.nc')

# You can also save it as Zarr (and use compression too); requires `import zarr`:
# compressor = zarr.Blosc(cname="zstd", clevel=3, shuffle=2)
# ds_dallas.to_zarr("foo.zarr",
#     encoding={"temperature_2m": {"compressor": compressor}})
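To double-check that the outputs were written correctly, you can read them back with xarray (the commented line assumes you ran the to_zarr call above):
# Read the NetCDF file back into memory:
ds_check = xr.open_dataset('t2m_dallas.nc')
# If you wrote the Zarr store, open it with:
# ds_check = xr.open_zarr('foo.zarr')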
2.4. More advanced visualization (using Cartopy):
Let's look back at the first figure we generated (in section 2.1), add more features to it, and turn it into a high-quality plot.
fig, axis = plt.subplots(1, 1,
subplot_kw=dict(projection=cartopy.crs.Orthographic(-100, 34.5)),
figsize=(5.8, 5))
plt.subplots_adjust(top=.94, bottom=0, left=.1, right=.98)
ds_jan.temperature_2m.plot(
x='lon',
y='lat',
vmin=0,
vmax=60,
cmap='jet',
ax=axis,
transform=cartopy.crs.PlateCarree(),
cbar_kwargs={
"orientation": "horizontal",
"shrink": 0.85, # length of colorbar
"pad": 0.08, # the distance between colorbar and the plot
"aspect": 30, # the width (thickness) of colorbar
"label": "2m temperature (°F)"
},
robust=True)
axis.coastlines()
axis.add_feature(cfeature.STATES, zorder=1, linewidth=.5, edgecolor='k')
axis.add_feature(cfeature.OCEAN, facecolor='b', alpha=.2)
gl = axis.gridlines(crs=cartopy.crs.PlateCarree(),
draw_labels=True,
linewidth=.6,
color='gray',
alpha=0.5,
linestyle='-.')
gl.top_labels = False # suppress top labels
gl.right_labels = False # suppress right labels
axis.set_extent([-120.5, -72, 23.6, 53], crs=cartopy.crs.PlateCarree())
plt.title('Mean January temperature (2000-2023)')
plt.savefig('Fig_t2m_jan_mean.png', format='png', dpi=200)
The above code generates the following figure:
Much better (in my opinion)! You can choose other geographic projections or add other features too (e.g. rivers, lakes, etc.).
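For instance, rivers and lakes are available as built-in Natural Earth features in Cartopy; a minimal sketch of adding them to the same axis would be:
# Add rivers and lakes on top of the existing map:
axis.add_feature(cfeature.RIVERS, linewidth=.5, edgecolor='tab:blue')
axis.add_feature(cfeature.LAKES, facecolor='b', alpha=.2)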
3. Limitations:
Note: there's no single solution for all purposes. XEE is a useful tool, and it's an excuse for me to share this tutorial :)
4. Summary:
The combination of Google Earth Engine and xarray seems to be an amazing way of accessing geospatial data: really flexible, highly customizable, and definitely much faster and more convenient than many other data portals out there for moderate to large datasets. This is particularly true for many of the satellite datasets (e.g. from NASA) that are distributed across many, many tiles (for different regions and times). Nonetheless, I think improving large-scale ML workflows for geospatial problems is still needed and worth further exploration. Better data storage and access will make large-scale (e.g. continental- or global-scale) geospatial ML possible not only for the giant companies with vast resources, but also for many small startups. Startups like Earthmover, Fused, and Coiled have been working on data storage and processing to some extent, but there are still opportunities for deep-tech startups in this area.