Time Series Analysis of Geospatial Data
Gokulakkannan AK
Aspiring Data Analyst | Recent Graduate | Excel in Data Analytics | SQL | Python
From geospatial information to a pandas dataframe for time series analysis
Time series analysis of geospatial data lets us analyze and understand how the events and attributes of a place change over time. Its use cases are wide-ranging, particularly in social, demographic, environmental and meteorological/climate studies. In environmental sciences, for example, time series analysis helps reveal how the land cover and land use of an area change over time and what drives those changes. It is also useful in meteorological studies for understanding spatio-temporal changes in weather patterns (I will shortly demonstrate one such case study using rainfall data). Social and economic sciences benefit greatly from such analysis in understanding the dynamics of temporal and spatial phenomena such as demographic, economic and political patterns.
Case study: daily rainfall pattern in Hokkaido, Japan
Data source
For this case study I am using the spatial distribution of rainfall in Hokkaido prefecture, Japan, from 01 January to 31 December 2020, which covers all 366 days of that leap year. I downloaded the data from ClimateServe, an open-access spatial data platform that is a product of a joint NASA/USAID partnership.
Setup
First, I specify the folder where the raster files are stored so I can loop through them later on.
# specify folder path for raster dataset
tsFolderPath = './data/hokkaido/'
Next, I’m importing a few libraries, most of which will be familiar to data scientists. To work with raster data I’m using the rasterio library.
# import libraries
import os
import rasterio
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Visualize data
Let’s check out what the raster images look like in a plot. I’ll first load one of the images using rasterio and then plot it using matplotlib.
# load in raster data
rf = rasterio.open('./data/hokkaido/20201101.tif')
# plot the first band with a colorbar
fig, ax = plt.subplots(figsize=(15,5))
im = ax.imshow(rf.read(1), cmap = 'inferno')
fig.colorbar(im, ax=ax)
ax.axis('off')
ax.set_title('Daily rainfall on 01 November 2020, Hokkaido, Japan');
As you can see, this image is a grid of pixels, and the value of each pixel represents the rainfall at that particular location. Brighter pixels have higher rainfall values. In the next section I am going to extract those values and transfer them into a pandas dataframe.
Extract data from raster files
Now for the key step, extracting pixel values from each of the 366 raster images. The process is simple: we will loop through each image, read its pixel values and store their mean in a list.
We will separately keep track of the dates in another list. Where does the date information come from? If you take a closer look at the file names, you’ll notice that each file is named after its respective day.
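# create empty lists to store data
date = []
rainfall_mm = []
# loop through each raster
for file in os.listdir(tsFolderPath):
    # read the file; the with-block closes it afterwards
    with rasterio.open(os.path.join(tsFolderPath, file)) as rf:
        # convert the first band to an array
        array = rf.read(1)
    # store the date, taken from the file name without its 4-character extension
    date.append(file[:-4])
    # average the non-negative pixels only (negative values are presumably no-data)
    rainfall_mm.append(array[array>=0].mean())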
Note that it did not take long to loop through the 366 rasters because of their low image resolution (i.e. large pixel size). The same loop can be computationally intensive for high-resolution datasets.
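To make the masking step concrete, here is a minimal sketch on a made-up 3x3 array rather than a real raster. It assumes, as the loop above appears to, that negative values mark pixels with no data. The value -9999 and the helper name mean_valid_rainfall are illustrative assumptions and do not come from the dataset.

```python
import numpy as np


def mean_valid_rainfall(array):
    """Mean of the non-negative pixels; NaN if there are none (hypothetical helper)."""
    valid = array[array >= 0]          # boolean mask drops negative pixels
    if valid.size == 0:
        return float("nan")            # avoid taking the mean of an empty array
    return float(valid.mean())


# Made-up band: -9999 stands in for an assumed no-data value
band = np.array([[1.0, 2.0, -9999.0],
                 [3.0, -9999.0, 4.0],
                 [0.0, 5.0, -9999.0]])

daily_mean = mean_valid_rainfall(band)
print(daily_mean)  # 2.5, the mean of 1, 2, 3, 4, 0 and 5
```

Guarding against an empty selection is a design choice for this sketch: a file with no valid pixels would then show up as NaN in the list instead of triggering a runtime warning.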
So we just created two lists: one stores the dates from the file names and the other holds the rainfall data. Here are the first five items of each list:
print(date[:5])
print(rainfall_mm[:5])
>> ['20200904', '20200910', '20200723', '20200509', '20200521']
>> [4.4631577, 6.95278, 3.4205956, 1.7203209, 0.45923564]
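The order above reflects how the operating system listed the files, not the calendar. As a sketch, the date can be parsed from each name with the standard library, and the names can then be sorted chronologically. The helper date_from_filename is hypothetical, and the file list is typed in by hand from names shown in this article, with a stray non-raster name added purely for illustration.

```python
from datetime import datetime
from pathlib import Path


def date_from_filename(name):
    """Parse a name like '20200904.tif' into a datetime (hypothetical helper)."""
    return datetime.strptime(Path(name).stem, "%Y%m%d")


# Hand-typed stand-in for os.listdir(); 'notes.txt' is an invented stray file
files = ["20200904.tif", "20200910.tif", "20200723.tif", "notes.txt", "20200509.tif"]

# Keep only the rasters, then order them by the date encoded in the name
tif_files = [f for f in files if f.lower().endswith(".tif")]
ordered = sorted(tif_files, key=date_from_filename)
print(ordered)
```

Filtering by extension before parsing keeps unrelated files in the folder from breaking the loop.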
Next, we transfer the lists into a pandas dataframe. We will then take an extra step to turn the dataframe into a time series object.
Convert to a time series dataframe
Transferring lists to a dataframe is an easy task in pandas:
# convert lists to a dataframe
df = pd.DataFrame(zip(date, rainfall_mm), columns = ['date', 'rainfall_mm'])
df.head()
We now have a pandas dataframe, but notice that the ‘date’ column holds its values as strings; pandas does not yet know that they represent dates. So we need to tweak it a little bit:
# convert the 'date' column from strings to datetime
df['date'] = pd.to_datetime(df['date'])
df.head()
df['date'].info()
Now the ‘date’ column has a datetime dtype.
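When the strings follow a known pattern, the pattern can also be passed explicitly so that pandas does not have to infer it. The sketch below uses three of the dates and values printed earlier; the name sample is just for illustration.

```python
import pandas as pd

# Three rows taken from the printed lists above
sample = pd.DataFrame({
    "date": ["20200904", "20200910", "20200723"],
    "rainfall_mm": [4.4631577, 6.95278, 3.4205956],
})

# Tell pandas the strings are year-month-day with no separators
sample["date"] = pd.to_datetime(sample["date"], format="%Y%m%d")
print(sample.dtypes)
```

An explicit format also makes a malformed file name fail loudly instead of being parsed as something unexpected.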
It is also a good idea to set the date column as the index. This makes it easier to slice and filter the data by dates and date ranges, and it simplifies plotting. We will first sort the dates into chronological order and then set the column as the index.
df = df.sort_values('date')
df.set_index('date', inplace=True)
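With a datetime index in place, date-based selection and aggregation become short. The sketch below uses a synthetic series (a constant 1.0 mm per day for 2020, not the Hokkaido data) so that the results are easy to check by hand; demo is an illustrative name.

```python
import pandas as pd

# Synthetic daily data for 2020: 1.0 mm every day (made-up values)
idx = pd.date_range("2020-01-01", "2020-12-31", freq="D")
demo = pd.DataFrame({"rainfall_mm": 1.0}, index=idx)
demo.index.name = "date"

# Select one month with partial-string indexing
june = demo.loc["2020-06"]

# Select a date range; both end points are included
summer = demo.loc["2020-06-01":"2020-08-31"]

# Aggregate daily values to monthly totals, labelled by month start
monthly = demo["rainfall_mm"].resample("MS").sum()

print(len(june), len(summer))
print(monthly.head(3))
```

Because every synthetic day is 1.0 mm, each monthly total equals the number of days in that month, which makes the aggregation easy to sanity-check.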
Okay, all processing done. You are now ready to use this time series data however you wish. I’ll just plot the data to see how it looks.
# plot
df.plot(figsize=(12,3), grid =True);
Final word
Extracting interesting and actionable insights from geospatial time series data can be very powerful because it shows data in both spatial and temporal dimensions. However, for data scientists without training in geospatial information this can be a daunting task. In this article I demonstrated with a case study how this seemingly difficult task can be done with minimal effort.