Time Series Clustering with Dynamic Time Warping

by John Akwei, ECMp ERMp Data Scientist


Table of Contents

Section 1 — Problem Definition

Section 1.1 — Project Summary

Section 2 — Data Preparation

Section 2.1 — Working Directory and Required Libraries

Section 2.2 — Import Data

Section 3 — Exploratory Data Analysis

Section 3.1 — Plot of Time Series Data

Section 3.2 — Dynamic Time Warping (DTW)

Section 4 — Model Development

Section 4.1 — Creating Distance Measure Models

Section 4.2 — Time Series Distance Measure Visualizations

Section 5 — Evaluation

Section 1 — Problem Definition

Compare methods of time series clustering using Dynamic Time Warping (DTW), Euclidean distance, and Global Alignment Kernel (GAK) distance. The analysis covers the required R libraries, data import, Exploratory Data Analysis, distance measure model development, visualizations, and model evaluation with Cluster Validity Indices.


Section 1.1 — Project Summary

This project is implemented in R Markdown format, with the objective of:

1) Presenting a comparison of Dynamic Time Warping (DTW), Euclidean Distance, and Global Alignment Kernel (GAK) distance.

2) Evaluating the time series distance measures via Cluster Validity Indices (CVIs).

3) Displaying the results of the time series distance measures via meaningful, target-oriented visualizations.

4) Describing the Data Science techniques used to create the results.

Section 2 — Data Preparation

Section 2.1 — Working Directory and Required Libraries

The interactive programming for this time series distance measure evaluation was done in the RStudio Cloud Integrated Development Environment (IDE). In addition to base R, several R packages supply the functions for the distance measure algorithms.


The base R function setwd() sets the working directory to the directory that contains the source files. How long that setting persists depends on the operating system and on the state of the R IDE. The getwd() function is then used to verify that the working directory points to the right location.

This section also loads the required R packages. The library() function attaches each package to the current session so that its functions can be called. The packages loaded here cover distance measures and time series clustering.

# setwd("C:/Users/...")
# getwd()

library(tidyr)
library(dplyr)
library(knitr)
library(dtw)
library(BBmisc)
library(dtwclust)
library(TSdist)        
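
Before the library() calls run, it can help to confirm that every package is installed. The following is a minimal sketch under that assumption; find_missing_packages() is a hypothetical helper written for this illustration and is not part of any of the packages listed.

```r
# Packages loaded in this section (taken from the library() calls above).
required_packages <- c("tidyr", "dplyr", "knitr", "dtw",
                       "BBmisc", "dtwclust", "TSdist")

# find_missing_packages() is a hypothetical helper, not a package function.
# It returns the package names that cannot be found in the library paths.
# Note: requireNamespace() loads the namespace of each package it finds.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) > 0) {
  message("Install before running: ",
          paste(missing_packages, collapse = ", "))
}
```

Any names reported by the message can be passed to install.packages() before knitting the document.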

Section 2.2 — Import Dataset

The dataset imported for the distance measure evaluation is a data frame with 35040 rows of electricity load readings covering the one-year period 1/1/2014–1/1/2015. The fields in the file are separated by semicolons. After import, the date field is converted to R's Date class. Two of the customer time series are then extracted for the distance measure evaluation, because running the distance measures on the entire dataset at once would require an impractical amount of processing time.


data <- read.csv("Electrictiy_load.csv", sep = ";")
data$date <- as.Date(data$date, format="%d.%m.%Y")

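# ts() reads only the first two elements of start and end, so the day
# component is dropped here without changing the result.
# With frequency = 213 spanning 2014 to 2015, each ts object keeps the
# first 214 readings of its column.
customer_1 <- ts(data$Customer1, start=c(2014, 1),
                 end=c(2015, 1), frequency=213)
customer_2 <- ts(data$Customer2, start=c(2014, 1),
                 end=c(2015, 1), frequency=213)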
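
The effect of the start, end and frequency arguments on the length of the extracted series can be checked on synthetic data. This is a small sketch, not the author's data: the readings vector is a made-up stand-in, and the 96 readings per day is an assumption that happens to be consistent with 35040 rows over 365 days.

```r
# Synthetic stand-in for one customer column: 35040 readings.
# (365 days x 96 readings per day is an assumed sampling rate.)
readings <- seq_len(365 * 96)

# Same start year, end year and frequency as the ts() calls above.
series <- ts(readings, start = 2014, end = 2015, frequency = 213)

length(series)            # number of observations kept in the ts object
head(as.numeric(series))  # the kept values come from the start of the input
```

Because ts() derives the number of observations from start, end and frequency, the object is shorter than the input vector, which is where the 214 items clustered later in the article come from.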

Section 3 — Exploratory Data Analysis

Exploratory Data Analysis is an approach for analyzing datasets to summarize their main characteristics, in order to decide on subsequent time series clustering methods. The quality of the dataset should be examined to determine how useful the available data is. Regardless of its sophistication, a time series distance measure algorithm is limited by the accuracy of the data. If the data was collected or labeled by humans, reviewing a subset helps to estimate how many mistakes human error may have introduced.

The data should also be reviewed for missing values. Usually, missing values can be replaced with the median value of the dataset column. However, the more missing values the dataset contains, the less accurate the results of the time series clustering are expected to be. The dataset chosen for time series clustering should also be the right type of data for the insights that are needed. If your company is selling electronics in the US and is planning on expanding into Europe, you should try to gather data that can aid in time series clustering of both markets.
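
The median replacement described above can be sketched in a few lines of base R. The impute_median() helper and the toy readings are hypothetical and only illustrate the idea; they are not part of the original analysis.

```r
# impute_median() is a hypothetical helper for illustration.
# It replaces NA values in a numeric vector with the median of the
# observed (non-missing) values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy load readings with two gaps (made-up numbers).
load_example <- c(2.5, NA, 3.0, 4.5, NA, 3.5)

n_missing <- sum(is.na(load_example))  # count the gaps before filling them
load_filled <- impute_median(load_example)
```

Counting the gaps first gives a sense of how much of the column is being filled in, which matters because heavily imputed series can distort the distance measures.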

Section 3.1 — Plot of Time Series Data

In Figure 1, the two time series extracted from the electricity load dataset are plotted over the same time frame, in order to visualize the data that is then processed with time series distance measuring and time series clustering.

# Use the date column for the x-axis so the lines align with the axis labels
xrange <- range(data$date)
yrange <- range(c(data$Customer1, data$Customer2))

plot(xrange, yrange, xaxt = "n", type="n",
     xlab="time", ylab="value",
     main="Figure 1. Plot of Time Series Data")
# One tick per month instead of one per observation
ticks <- seq(xrange[1], xrange[2], by = "month")
axis.Date(1, at = ticks, format = "%b %y", cex.axis = .7)
lines(data$date, data$Customer1, col='blue', type='l')
lines(data$date, data$Customer2, col='magenta', type='l')

Section 3.2 — Dynamic Time Warping (DTW)

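In Figure 2, the dtw() function from the dtw package is used to calculate the distance between the two series, and the resulting warping path is plotted. Diagonal segments of the path represent one-to-one matching. Vertical and horizontal segments represent many-to-one matching, where one point of a series is matched to several points of the other.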

plot(dtw(customer_1, customer_2), xlab="customer_1", ylab="customer_2", main="Figure 2. DTW Matching")        
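
To make the matching idea concrete, the following is a minimal sketch of the dynamic programming recursion behind DTW. It is a teaching version under stated assumptions (absolute difference as the local cost, equally weighted steps, no window constraint); dtw_distance() is a hypothetical helper and not the implementation used by the dtw package, whose default step pattern weights steps differently, so the values need not agree.

```r
# dtw_distance() is a hypothetical teaching helper, not the dtw package code.
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # cost[i + 1, j + 1] is the cheapest alignment of x[1:i] with y[1:j]
  cost <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      local_cost <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- local_cost + min(
        cost[i, j],      # advance both series (one-to-one match)
        cost[i, j + 1],  # advance x only (y[j] is matched again)
        cost[i + 1, j]   # advance y only (x[i] is matched again)
      )
    }
  }
  cost[n + 1, m + 1]
}

# Two made-up series with the same shape, shifted by one time step.
x <- c(0, 0, 1, 2, 1, 0)
y <- c(0, 1, 2, 1, 0, 0)

pointwise <- sum(abs(x - y))  # lock-step comparison penalizes the shift
warped <- dtw_distance(x, y)  # DTW absorbs the shift by re-matching points
```

The lock-step sum is 4 for these two series while the warped distance is 0, which is the behaviour that makes DTW attractive for series that share a shape but are out of phase.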

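In Figure 3, a “threeway” variant of Figure 2 shows the warping path together with the two series drawn along the x and y axes. The “keep=TRUE” parameter of the dtw() function is required so that the input series are stored in the result for plotting.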

plot(dtw(customer_1, customer_2, keep=TRUE),
     xlab="customer_1", ylab="customer_2", type="threeway",
     main="Figure 3. Threeway DTW Matching Plot")        

Figure 4 uses type=“twoway” to plot both series in one panel, with lines connecting the points that DTW matched to each other. The blue time series is customer_1, and the magenta time series is customer_2.

plot(dtw(customer_1,customer_2,keep=TRUE), type="twoway",
     col=c('blue', 'magenta'),
     main="Figure 4. Twoway DTW Matching Plot")        

Section 4 — Model Development

In Section 4, the Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel (GAK) models for time series clustering are developed for the electricity load data. The “dtwclust” package lets tsclust() be called with “dtw”, “Euclidean”, or “gak” as the distance, which allows the three measures to be evaluated side by side. The two time series are first standardized so that both are on a similar numeric scale. Because tsclust() treats each row of a data frame as one series, the 214 rows (one per time point, each holding the two customers' values) are the items that are grouped into six clusters.

Section 4.1 — Creating Distance Measure Models

# Combine the two series and standardize each column
customer_data <- data.frame(customer_1, customer_2)
customer.data.norm <- BBmisc::normalize(customer_data,
                                        method="standardize")

# Partitional clustering into six clusters with PAM (medoid) centroids,
# fitted once per distance measure
dtw_clust <- tsclust(customer.data.norm, type="partitional",
                     k=6L, distance="dtw", centroid="pam")
euclidean_clust <- tsclust(customer.data.norm, type="partitional",
                           k=6L, distance="Euclidean",
                           centroid="pam")
gak_clust <- tsclust(customer.data.norm, type="partitional",
                     k=6L, distance="gak",
                     centroid="pam")
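
The standardization step can be illustrated in base R. This sketch assumes that method="standardize" means centering each column on its mean and dividing by its standard deviation; the toy data frame uses made-up numbers and only mirrors the column names of customer_data.

```r
# Toy data frame standing in for customer_data (made-up numbers).
toy <- data.frame(customer_1 = c(10, 12, 14, 16),
                  customer_2 = c(100, 300, 200, 400))

# Column-wise z-score: (value - column mean) / column standard deviation.
toy_norm <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

col_means <- colMeans(toy_norm)  # each column is now centred on 0
col_sds <- sapply(toy_norm, sd)  # each column now has unit standard deviation
```

After this step the two columns are on the same scale even though the raw values differ by an order of magnitude, so neither customer dominates the distance calculation.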

Section 4.2 — Time Series Distance Measure Visualizations

Figure 5 visualizes the series and centroid plot of six Dynamic Time Warping time series clusters. Figure 6 visualizes the series and centroid plot of six Euclidean Distance time series clusters. Figure 7 applies the same visualizations to the Global Alignment Kernel time series clusters. The dashed line represents the medoid time series. The electricity load data for the two customers are separated into six general trends that represent upward trends and downward trends of electricity usage.

Tables 1, 2 and 3 display the assignment of the six clusters for the 214 data sample dates in the electricity load data.

cat("Figure 5. DTW Time Series Clusters")

# Figure 5. DTW Time Series Clusters
plot(dtw_clust, type = "sc")        
cat("Figure 6. Euclidean Time Series Clusters")

# Figure 6. Euclidean Time Series Clusters
plot(euclidean_clust, type = "sc")        
cat("Figure 7. Global Alignment Kernel Time Series Clusters")

# Figure 7. Global Alignment Kernel Time Series Clusters
plot(gak_clust, type = "sc")        
kable(t(cbind(customer.data.norm[,0], cluster = dtw_clust@cluster)),
      caption = "Table 1. DTW Cluster Assignments of Time Series Dates")

Table 1. DTW Cluster Assignments of Time Series Dates

cluster:

3365556366363333333555533333312221111111444111211122111122221144412222222222222221444422113666333365553333333366366655533363632222112111444111112222111222111114441122222222222211444412213333333365556663333333335555

kable(t(cbind(customer.data.norm[,0],
              cluster = euclidean_clust@cluster)),
      caption = "Table 2. Euclidean Cluster Assignments of Time Series Dates")

Table 2. Euclidean Cluster Assignments of Time Series Dates

cluster:

2251114244242222222511522222263336666666111666366633666633336611163333333333333335111533662444422251112422222244244511122244426333663666111666663333333333666631116633333333333366111163362222222251114442222222225111

kable(t(cbind(customer.data.norm[,0], cluster = gak_clust@cluster)),
      caption = "Table 3. Global Alignment Kernel Cluster Assignments of Time Series Dates")

Table 3. Global Alignment Kernel Cluster Assignments of Time Series Dates

cluster:

2213336266262222226133126222255555555555333555555544555544445533354444444444544445333145552666622213332222222266266133362266625444554555333555555544554444555543335544444444544455333354452222622213336662666222221333
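
The assignment strings in Tables 1 to 3 are easier to compare once they are tabulated. The sketch below uses only the first 20 assignments copied from Table 1 and Table 2; split_labels() is a hypothetical helper written for this illustration.

```r
# First 20 assignments copied from Table 1 (DTW) and Table 2 (Euclidean).
dtw_labels <- "33655563663633333335"
euclidean_labels <- "22511142442422222225"

# split_labels() is a hypothetical helper: one integer label per character.
split_labels <- function(s) as.integer(strsplit(s, "")[[1]])

dtw_vec <- split_labels(dtw_labels)
euc_vec <- split_labels(euclidean_labels)

# Cluster sizes for the DTW assignments in this excerpt.
dtw_sizes <- table(dtw_vec)

# Cross-tabulation: rows are DTW clusters, columns are Euclidean clusters.
cross_tab <- table(dtw = dtw_vec, euclidean = euc_vec)
```

Cluster numbers are arbitrary labels within each model, so agreement shows up as rows of the cross-tabulation that concentrate in a single column rather than as matching numbers.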

In Figures 8–16, the first cluster of each model is plotted in three ways: a combined series and centroid plot, a series plot showing the members of the cluster, and a centroids plot showing the series chosen as the medoid of the cluster.

# Figure 8. DTW Series/Centroid Plot

plot(dtw_clust, type = "sc", clus = 1L)        
# Figure 9. DTW Series Plot

plot(dtw_clust, type = "series", clus = 1L)        
# Figure 10. DTW Centroids Plot

plot(dtw_clust, type = "centroids", clus = 1L)        
# Figure 11. Euclidean Series/Centroid Plot

plot(euclidean_clust, type = "sc", clus = 1L)        
# Figure 12. Euclidean Series Plot

plot(euclidean_clust, type = "series", clus = 1L)        
# Figure 13. Euclidean Centroids Plot

plot(euclidean_clust, type = "centroids", clus = 1L)        
# Figure 14. Global Alignment Kernel Series/Centroid Plot

plot(gak_clust, type = "sc", clus = 1L)        
# Figure 15. Global Alignment Kernel Series Plot

plot(gak_clust, type = "series", clus = 1L)        
# Figure 16. Global Alignment Kernel Centroids Plot

plot(gak_clust, type = "centroids", clus = 1L)        

Figures 17, 18 and 19 are hierarchical dendrograms of the time series dates within the six clusters, for the Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel measures.

set.seed(123)
clust.hier.dtw <- tsclust(customer.data.norm, type = "h",
                          k = 6L, distance = "dtw")
clust.hier.euclidean <- tsclust(customer.data.norm, type = "h",
                                k = 6L, distance = "euclidean")
clust.hier.gak <- tsclust(customer.data.norm, type = "h",
                                k = 6L, distance = "gak")
plot(clust.hier.dtw, main="Figure 17. DTW Dendrogram")        
plot(clust.hier.euclidean, main="Figure 18. Euclidean Distance Dendrogram")        
plot(clust.hier.gak, main="Figure 19. Global Alignment Kernel Distance Dendrogram")        
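
The idea behind the dendrograms can be reproduced with base R alone: compute pairwise distances, build the tree, and cut it at six clusters. This is a sketch on made-up data with Euclidean distance and average linkage chosen as assumptions; it is not the tsclust() implementation.

```r
set.seed(123)
# Made-up data: 30 short series (rows), each with 5 time points.
toy_series <- matrix(rnorm(30 * 5), nrow = 30, ncol = 5)

# Pairwise Euclidean distances between rows, then agglomerative clustering.
toy_dist <- dist(toy_series, method = "euclidean")
toy_tree <- hclust(toy_dist, method = "average")

# Cut the dendrogram so that exactly six clusters remain.
toy_clusters <- cutree(toy_tree, k = 6)
cluster_sizes <- table(toy_clusters)
```

Calling plot(toy_tree) draws the corresponding dendrogram, and swapping toy_dist for a DTW or GAK distance matrix changes the tree without changing the cutting step.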


Section 5 — Evaluation

For the final evaluation, Cluster Validity Indices (CVIs) are used to assess how well the six clusters produced with Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel fit the time series data. According to the dtwclust documentation, Sil, SF, CH and D are indices to be maximized, while DB, DBstar and COP are to be minimized.

kable(cvi(dtw_clust), caption = "Table 4. DTW - Cluster Validity Indices")        

Table 4. DTW — Cluster Validity Indices

Index    Value
Sil      0.4161095
SF       0.0290512
CH       111.8179800
DB       1.0677429
DBstar   1.8132669
D        0.0128444
COP      0.1154342

kable(cvi(euclidean_clust), caption = "Table 5. Euclidean Distance - Cluster Validity Indices")        

Table 5. Euclidean Distance — Cluster Validity Indices

Index    Value
Sil      0.4290294
SF       0.1217342
CH       130.4489850
DB       0.8211406
DBstar   1.8619141
D        0.0186354
COP      0.1152264

kable(cvi(gak_clust), caption = "Table 6. Global Alignment Kernel Distance - Cluster Validity Indices")        

Table 6. Global Alignment Kernel Distance — Cluster Validity Indices

Index    Value
Sil      0.5571970
SF       0.6153544
CH       1100.6889099
DB       1.0270383
DBstar   10.0607217
D        0.0006024
COP      0.0433589
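
The three tables can be placed side by side to see which measure scores best on each index. The sketch below copies the values from Tables 4, 5 and 6; the maximize column encodes the direction of each index as described in the dtwclust cvi() documentation, and the column and variable names are chosen for this illustration.

```r
# Values copied from Tables 4, 5 and 6.
cvi_table <- data.frame(
  index     = c("Sil", "SF", "CH", "DB", "DBstar", "D", "COP"),
  dtw       = c(0.4161095, 0.0290512, 111.8179800, 1.0677429,
                1.8132669, 0.0128444, 0.1154342),
  euclidean = c(0.4290294, 0.1217342, 130.4489850, 0.8211406,
                1.8619141, 0.0186354, 0.1152264),
  gak       = c(0.5571970, 0.6153544, 1100.6889099, 1.0270383,
                10.0607217, 0.0006024, 0.0433589),
  # TRUE: larger is better; FALSE: smaller is better.
  maximize  = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

measures <- c("dtw", "euclidean", "gak")
scores <- as.matrix(cvi_table[, measures])

# Flip the sign of the "minimize" rows so the largest value is always best.
signed <- scores * ifelse(cvi_table$maximize, 1, -1)
cvi_table$best <- measures[max.col(signed, ties.method = "first")]

cvi_table[, c("index", "best")]
```

On the reported values, GAK scores best on Sil, SF, CH and COP, Euclidean on DB and D, and DTW on DBstar. Each index is computed with the distance used by its own model, so a comparison across measures is indicative rather than conclusive.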
