Time Series Analysis of Geospatial Data
Credit: iStock/Essentials collection

From geospatial information to a pandas dataframe for time series analysis

Time series analysis of geospatial data lets us analyze how events and attributes of a place change over time. Its use cases are wide-ranging, particularly in social, demographic, environmental and meteorological/climate studies. In environmental sciences, for example, time series analysis helps reveal how the land cover and land use of an area change over time, and what drives those changes. It is also useful in meteorological studies for understanding spatio-temporal changes in weather patterns (I will shortly demonstrate one such case study using rainfall data). The social and economic sciences benefit greatly from such analysis in understanding the dynamics of temporal and spatial phenomena such as demographic, economic and political patterns.

Case study: daily rainfall pattern in Hokkaido, Japan

Data source

For this case study I am using the spatial distribution of rainfall in Hokkaido prefecture, Japan, from 01 January to 31 December 2020, which covers all 366 days of the year. I downloaded the data from ClimateServe, an open-access spatial data platform that is a product of a joint NASA/USAID partnership.

Snapshot of some of the raster files in local directory

Setup

First, I specify the folder where the raster files are stored so I can loop through them later on.


# specify folder path for raster dataset
tsFolderPath = './data/hokkaido/'        

Next, I import a few libraries, most of which will be familiar to data scientists. To work with raster data I am using the rasterio library.


# import libraries
import os
import rasterio 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd        

Visualize data

Let’s check what the raster images look like in a plot. I’ll first load a single image using rasterio and then plot it with matplotlib.


# load in raster data
rf = rasterio.open('./data/hokkaido/20201101.tif')

fig, ax = plt.subplots(figsize=(15,5))

# plot the first band and attach a colorbar
im = ax.imshow(rf.read(1), cmap='inferno')
fig.colorbar(im, ax=ax)
ax.axis('off')
ax.set_title('Daily rainfall (mm) on 01 November 2020, Hokkaido, Japan');
Distribution of rainfall (in mm) in Hokkaido, Japan on 01 November, 2020 (source: author)

As you can see, the image is a grid of pixels, and the value of each pixel represents rainfall at that location. Brighter pixels indicate higher rainfall. In the next section I extract those values and transfer them into a pandas dataframe.

Extract data from raster files

Now for the key step: extracting pixel values from each of the 366 raster images. The process is simple: we loop through each image, read its pixel values, and store their mean in a list.

We keep track of the dates separately in another list. Where does the date information come from? If you take a closer look at the file names, you will notice that each file is named after its respective day.
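
As a minimal sketch of that idea, the snippet below turns file names such as 20201101.tif into dates and skips anything that is not a .tif file. The helper name date_from_filename and the example file list are hypothetical and only serve the illustration; the naming pattern (YYYYMMDD.tif) is taken from the file names shown in this article.

```python
from datetime import datetime
from pathlib import Path


def date_from_filename(filename):
    """Parse a 'YYYYMMDD.tif'-style file name into a datetime.date."""
    stem = Path(filename).stem  # '20201101.tif' -> '20201101'
    return datetime.strptime(stem, '%Y%m%d').date()


# hypothetical directory listing, including a stray non-raster file
files = ['20200904.tif', '20200910.tif', 'notes.txt', '20200723.tif']

# keep only the .tif files and parse their names
parsed = [date_from_filename(f) for f in files if f.lower().endswith('.tif')]
print(parsed)
```

Parsing with an explicit format has the side benefit that a badly named file raises an error instead of silently ending up in the series.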


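# create empty lists to store data
date = []
rainfall_mm = []

# loop through each raster
for file in os.listdir(tsFolderPath):

    # skip anything that is not a GeoTIFF
    if not file.endswith('.tif'):
        continue

    # read the first band into an array; the file is closed on exit
    with rasterio.open(os.path.join(tsFolderPath, file)) as rf:
        array = rf.read(1)

    # store the date (file name without '.tif') and the mean of non-negative pixels
    date.append(file[:-4])
    rainfall_mm.append(array[array>=0].mean())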

Note that looping through 366 rasters did not take long because the images have low resolution (i.e. large pixel size). For high-resolution datasets, however, it can be computationally intensive.
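
The array>=0 filter in the loop deserves a quick illustration. The sketch below assumes that negative values mark missing data (the -9999 marker and the 3x3 array are made up for the example, not taken from the actual rasters) and shows how much a plain mean would be distorted without the filter.

```python
import numpy as np

# synthetic 3x3 "raster"; -9999 is an assumed nodata marker
array = np.array([[2.0, 4.0, -9999.0],
                  [0.0, 6.0, -9999.0],
                  [8.0, 1.0, 3.0]], dtype=np.float32)

naive_mean = array.mean()              # nodata cells drag the mean far below zero
valid_mean = array[array >= 0].mean()  # mean over the 7 valid cells only

print(naive_mean, valid_mean)
```

Zero is kept by the filter, which matters here because a dry pixel (0 mm) is a valid observation and should count toward the daily mean.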

So we just created two lists: one stores the dates from the file names and the other holds the rainfall data. Here are the first five items of the two lists:


print(date[:5])
print(rainfall_mm[:5])


>> ['20200904', '20200910', '20200723', '20200509', '20200521']
>> [4.4631577, 6.95278, 3.4205956, 1.7203209, 0.45923564]        

Next, we transfer the lists into a pandas dataframe. We will then take an extra step to turn the dataframe into a time series object.

Convert to a time series dataframe

Transferring lists to a dataframe is an easy task in pandas:


# convert lists to a dataframe
df = pd.DataFrame(zip(date, rainfall_mm), columns = ['date', 'rainfall_mm']) 
df.head()        
First few rows of dataframe generated from lists

We now have a pandas dataframe, but notice that the ‘date’ column holds its values as strings; pandas does not yet know that they represent dates. So we need to tweak it a little:


# convert the 'date' column from strings (YYYYMMDD) to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df.head()
Date column now transformed into a datetime object


df['date'].info()        
This confirms that the column is a datetime object

The ‘date’ column now holds datetime values.
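
Once the dates are real datetime values, it is easy to check that no day is missing from the series. The sketch below uses a small made-up list of date strings and a hypothetical dataframe name (df_check) to show one way to do this; for the full dataset the expected range would run from 2020-01-01 to 2020-12-31.

```python
import pandas as pd

# hypothetical date strings in the same YYYYMMDD form as the file names
date = ['20200103', '20200101', '20200105', '20200102']
df_check = pd.DataFrame({'date': pd.to_datetime(date, format='%Y%m%d')})

# every calendar day the series is expected to cover
expected = pd.date_range('2020-01-01', '2020-01-05', freq='D')

# days in the expected range that have no matching record
missing = expected.difference(pd.DatetimeIndex(df_check['date']))
print(missing)

# 2020 is a leap year, so a complete daily series has 366 entries
full_year = pd.date_range('2020-01-01', '2020-12-31', freq='D')
print(len(full_year))
```

An empty result for missing means every expected day has a raster behind it.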

It is also a good idea to set the date column as the index. This makes it easier to slice and filter the data by date or date range, and simplifies plotting. We first sort the dates into chronological order and then set the column as the index.


df = df.sort_values('date')
df.set_index('date', inplace=True)        
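
To illustrate what a datetime index makes possible, here is a small sketch on synthetic data. The dataframe name demo and its values are placeholders rather than the real rainfall figures; only the column name rainfall_mm and the index name date mirror the dataframe built above.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the rainfall dataframe: one value per day of 2020
idx = pd.date_range('2020-01-01', '2020-12-31', freq='D', name='date')
demo = pd.DataFrame({'rainfall_mm': np.arange(len(idx), dtype=float)}, index=idx)

# partial-string indexing: every row in June 2020
june = demo.loc['2020-06']

# inclusive slice between two dates
first_week = demo.loc['2020-01-01':'2020-01-07']

# monthly mean, labelled by the first day of each month
monthly = demo.resample('MS').mean()

print(len(june), len(first_week), len(monthly))
```

Date slices on a sorted datetime index include both endpoints, which is why the first-week slice returns seven rows.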

Okay, all processing done. You are now ready to use this time series data however you wish. I’ll just plot the data to see how it looks.



# plot
df.plot(figsize=(12,3), grid =True);        
Time series plot of rainfall data in Hokkaido, Japan from January to December 2020
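
Daily rainfall is typically spiky, so one common follow-up is to smooth the series before plotting. This is not part of the workflow above, just a sketch of one option: a 7-day rolling mean computed on made-up values, with demo and rolling_7d as hypothetical names.

```python
import pandas as pd

# ten days of made-up daily rainfall values
idx = pd.date_range('2020-01-01', periods=10, freq='D', name='date')
demo = pd.DataFrame(
    {'rainfall_mm': [0.0, 7.0, 0.0, 0.0, 14.0, 0.0, 0.0, 0.0, 21.0, 0.0]},
    index=idx,
)

# 7-day rolling mean; the first six days lack a full window and stay NaN
demo['rolling_7d'] = demo['rainfall_mm'].rolling(window=7, min_periods=7).mean()

print(demo)
```

Calling demo.plot() on the result would draw the raw and smoothed series together, in the same way as the plot above.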

Final word

Extracting interesting and actionable insights from geospatial time series data can be very powerful because it shows data in both spatial and temporal dimensions. However, for data scientists without training in geospatial information this can be a daunting task. In this article I demonstrated with a case study how this seemingly difficult task can be done with minimal effort.
