Time Series Clustering with Dynamic Time Warping

John Akwei

Senior Data Scientist at ContextBase

Published: May 22, 2022

by John Akwei, ECMp ERMp Data Scientist

Table of Contents

Section 1 — Problem Definition

Section 1.1 — Project Summary

Section 2 — Data Preparation

Section 2.1 — Working Directory and Required Libraries

Section 2.2 — Import Dataset

Section 3 — Exploratory Data Analysis

Section 3.1 — Plot of Time Series Data

Section 3.2 — Dynamic Time Warping (DTW)

Section 4 — Model Development

Section 4.1 — Creating Distance Measure Models

Section 4.2 — Time Series Distance Measure Visualizations

Section 5 — Evaluation

Section 1 — Problem Definition

This project compares three distance measures for time series clustering: Dynamic Time Warping (DTW), Euclidean distance, and Global Alignment Kernel (GAK). The analysis covers the required R packages, data import, exploratory data analysis, distance measure model development, visualizations, and model evaluation with Cluster Validity Indices (CVIs).

Section 1.1 — Project Summary

This project is implemented in R Markdown format, with the following objectives:

1) Presenting a comparison of Dynamic Time Warping (DTW), Euclidean distance, and Global Alignment Kernel (GAK).

2) Evaluating the time series distance measures via Cluster Validity Indices (CVIs).

3) Displaying the results of the time series distance measures via meaningful, target-oriented visualizations.

4) Describing the Data Science techniques used to produce the results.

Section 2 — Data Preparation

Section 2.1 — Working Directory and Required Libraries

This time series distance measure evaluation was programmed interactively in the RStudio Cloud integrated development environment (IDE). In addition to the basic capabilities of the R programming language, several R packages of pre-programmed functions are used for the distance measure algorithms.

The base R function setwd() sets the working directory to the directory that contains the source files. How long this setting persists depends on the operating system and on the state of the R integrated development environment. The getwd() function is used to verify that the working directory has been set to the right location.

This section also loads the required R packages. The library() function attaches a package to the current session so that its functions can be called. The packages loaded here support distance measurement and time series clustering.

# setwd("C:/Users/...")
# getwd()

library(tidyr)
library(dplyr)
library(knitr)
library(dtw)
library(BBmisc)
library(dtwclust)
library(TSdist)
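
Because library() stops with an error when a package is not installed, it can help to check for missing packages first. The following is a minimal sketch, not part of the original project; the variable names required, installed, and missing_pkgs are assumptions introduced for illustration.

```r
# Packages loaded in this section
required <- c("tidyr", "dplyr", "knitr", "dtw", "BBmisc", "dtwclust", "TSdist")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```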

Section 2.2 — Import Dataset

The dataset imported for the distance measure evaluation is a data frame of 35,040 rows of electricity load data for the one-year period 1/1/2014–1/1/2015. The fields are separated by semicolons. After import, the date field is converted to the R Date class. Then two of the time series, those of Customer1 and Customer2, are extracted for the distance measure evaluation. Computing distances over the entire dataset at once would require far more processing time than is practical here.

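# Read the semicolon-separated file and convert the date column to Date
data <- read.csv("Electrictiy_load.csv", sep = ";")
data$date <- as.Date(data$date, format="%d.%m.%Y")

# Build one ts object per customer. With frequency=213 and a span of
# 2014 to 2015, ts() keeps 214 observations from each column.
customer_1 <- ts(data$Customer1, start=c(2014, 1, 1),
                 end=c(2015, 1, 1), frequency=213)
customer_2 <- ts(data$Customer2, start=c(2014, 1, 1),
                 end=c(2015, 1, 1), frequency=213)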
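
To see how the date format string and the ts() arguments behave, the following sketch applies them to synthetic values. The vector readings is a hypothetical stand-in for one customer column, not the actual electricity load data.

```r
# A date in day.month.year form, as the "%d.%m.%Y" format string expects
d <- as.Date("31.12.2014", format = "%d.%m.%Y")
format(d, "%Y-%m-%d")

# Synthetic stand-in for one customer column with 35040 readings
readings <- seq_len(35040)

# Same start, end, and frequency as in the import code
s <- ts(readings, start = c(2014, 1), end = c(2015, 1), frequency = 213)

length(s)  # number of observations that ts() keeps
tail(s, 1) # last value kept, counted from the start of the column
```

With these arguments, ts() keeps the first 214 values of the input, which corresponds to the 214 sample dates referred to in Section 4.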

Section 3 — Exploratory Data Analysis

Exploratory Data Analysis is an approach for analyzing datasets to summarize their main characteristics, in order to decide on subsequent time series clustering methods. The quality of the dataset should be examined to determine how useful the available data is. Regardless of its sophistication, a time series distance measure algorithm is limited by the accuracy of the data. If the data was collected or labeled by humans, reviewing a subset of it helps estimate how many mistakes human error may have introduced.

The data should also be reviewed for missing values. Missing values can often be replaced with the median of the dataset column. However, the more missing values the dataset contains, the less accurate the results of the time series clustering are expected to be. The dataset chosen for time series clustering should also be the right type of data for the insights that are needed. For example, if your company sells electronics in the US and plans to expand into Europe, you should try to gather data that supports time series clustering of both markets.
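
The median replacement mentioned above can be sketched in base R as follows. The vector values is hypothetical illustration data, and this step is not part of the original project code.

```r
# Hypothetical load readings with two missing values
values <- c(12.1, NA, 11.8, 13.0, NA, 12.4)

# Count the missing values before deciding whether imputation is reasonable
n_missing <- sum(is.na(values))

# Replace each missing value with the median of the observed values
filled <- values
filled[is.na(filled)] <- median(values, na.rm = TRUE)

n_missing
filled
```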

Section 3.1 — Plot of Time Series Data

In Figure 1, the two time series extracted from the electricity load dataset are plotted within the same time frame, in order to visualize the data that is then processed with time series distance measuring and time series clustering.

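# Use the dates for the x-axis range and both customers for the y-axis range
xrange <- range(data$date)
yrange <- range(c(data$Customer1,data$Customer2))
ticks <- seq(xrange[1], xrange[2], by = "month")

plot(xrange, yrange, xaxt = "n", type="n",
     xlab="time",ylab="value",
     main="Figure 1. Plot of Time Series Data")
axis(1, at = ticks, labels = format(ticks, "%b %y"), cex.axis = .7)
# Pass the dates as x values so that both series line up with the axis
lines(data$date, data$Customer1, col='blue', type='l')
lines(data$date, data$Customer2, col='magenta', type='l')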

Section 3.2 — Dynamic Time Warping (DTW)

In Figure 2, the dtw() function within the dtw package is used to calculate the distance between the two time series, and the resulting alignment is plotted. Diagonal segments of the alignment path represent one-to-one matching. Vertical and horizontal segments represent many-to-one matching.

plot(dtw(customer_1, customer_2), xlab="customer_1", ylab="customer_2", main="Figure 2. DTW Matching")
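
To illustrate the idea behind the distance that dtw() computes, the following is a minimal sketch of the basic dynamic programming recursion in base R. The function dtw_distance() is a hypothetical helper written for this illustration. It uses absolute difference as the local cost and an unweighted three-way minimum, which is an assumption and may not match the default step pattern of the dtw package.

```r
# Basic DTW recursion: each cell holds the cheapest cumulative cost of
# aligning the first i points of a with the first j points of b
dtw_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  D <- matrix(Inf, n + 1, m + 1)
  D[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- abs(a[i] - b[j])
      # Extend the cheapest of: diagonal match, or repeating a point
      # from either series
      D[i + 1, j + 1] <- cost + min(D[i, j], D[i, j + 1], D[i + 1, j])
    }
  }
  D[n + 1, m + 1]
}

# Two synthetic series with the same shape, shifted by one time step
a <- c(0, 1, 2, 1, 0, 0)
b <- c(0, 0, 1, 2, 1, 0)

pointwise <- sum(abs(a - b))  # compares values at identical time indices
warped <- dtw_distance(a, b)  # allows the time axis to stretch
```

In this example the pointwise comparison penalizes the shift, while the warped comparison aligns the matching shapes.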

In Figure 3, a “threeway” variant of Figure 2 displays the two time series along the x and y axes, next to the alignment between them. The “keep=TRUE” parameter of the dtw() function is required for this plot type.

plot(dtw(customer_1, customer_2, keep=TRUE),
     xlab="customer_1", ylab="customer_2", type="threeway",
     main="Figure 3. Threeway DTW Matching Plot")

Figure 4 uses type=“twoway” to plot both time series together, with lines connecting the points that DTW matched. The blue time series is customer_1, and the magenta time series is customer_2.

plot(dtw(customer_1,customer_2,keep=TRUE), type="twoway",
     col=c('blue', 'magenta'),
     main="Figure 4. Twoway DTW Matching Plot")

Section 4 — Model Development

In Section 4, the Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel (GAK) models for time series clustering are developed for the electricity load data. The tsclust() function in the “dtwclust” package accepts “dtw”, “Euclidean”, or “gak” as its distance argument, which allows the three measures to be evaluated side by side. The two time series are standardized so that they are compared on similar numeric scales. The 214 time series dates are then partitioned into six clusters.

Section 4.1 — Creating Distance Measure Models

# Combine the two series and standardize each column
customer_data <- data.frame(customer_1, customer_2)
customer.data.norm <- BBmisc::normalize(customer_data,
                                        method="standardize")

# Partitional clustering into six clusters with medoid (PAM) centroids,
# once per distance measure
dtw_clust <- tsclust(customer.data.norm, type="partitional",
                     k=6L, distance="dtw", centroid="pam")
euclidean_clust <- tsclust(customer.data.norm, type="partitional",
                           k=6L, distance="Euclidean",
                           centroid="pam")
gak_clust <- tsclust(customer.data.norm, type="partitional",
                     k=6L, distance="gak",
                     centroid="pam")
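
As a sketch of what standardizing a column means, the following base R example centers a hypothetical vector on its mean and divides by its standard deviation. It assumes that method="standardize" in BBmisc::normalize() performs this same center-and-scale operation per column. The vector x is illustration data only.

```r
# Hypothetical column of load values
x <- c(10, 12, 14, 16, 18)

# Center on the mean and scale by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # centered values average to zero
sd(z)    # scaled values have unit standard deviation

# The base R scale() function gives the same result
z_scale <- as.numeric(scale(x))
```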

Section 4.2 — Time Series Distance Measure Visualizations

Figure 5 visualizes the series and centroid plot of six Dynamic Time Warping time series clusters. Figure 6 visualizes the series and centroid plot of six Euclidean Distance time series clusters. Figure 7 applies the same visualizations to the Global Alignment Kernel time series clusters. The dashed line represents the medoid time series. The electricity load data for the two customers are separated into six general trends that represent upward trends and downward trends of electricity usage.
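
A medoid is the member of a cluster with the smallest total distance to the other members, so unlike a mean it is always one of the observed series. The following base R sketch illustrates the idea with hypothetical data and Euclidean distance. It is not the internal code of tsclust(), and the names series and medoid_index are assumptions.

```r
# Five hypothetical series (rows), each of length 2
series <- rbind(c(1, 1), c(2, 2), c(3, 3), c(4, 4), c(12, 12))

# Pairwise distances between all series
d <- as.matrix(dist(series))

# The medoid is the row with the smallest summed distance to the others
medoid_index <- which.min(rowSums(d))
medoid <- series[medoid_index, ]

# For comparison, the mean is not one of the observed series here
centroid_mean <- colMeans(series)
```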

Tables 1, 2 and 3 display the assignment of the six clusters for the 214 data sample dates in the electricity load data.

cat("Figure 5. DTW Time Series Clusters")

# Figure 5. DTW Time Series Clusters
plot(dtw_clust, type = "sc")

cat("Figure 6. Euclidean Time Series Clusters")

# Figure 6. Euclidean Time Series Clusters
plot(euclidean_clust, type = "sc")

cat("Figure 7. Global Alignment Kernel Time Series Clusters")

# Figure 7. Global Alignment Kernel Time Series Clusters
plot(gak_clust, type = "sc")

kable(t(cbind(customer.data.norm[,0], cluster = dtw_clust@cluster)),
      caption = "Table 1. DTW Cluster Assignments of Time Series Dates")

Table 1. DTW Cluster Assignments of Time Series Dates

cluster:

3365556366363333333555533333312221111111444111211122111122221144412222222222222221444422113666333365553333333366366655533363632222112111444111112222111222111114441122222222222211444412213333333365556663333333335555

kable(t(cbind(customer.data.norm[,0],
              cluster = euclidean_clust@cluster)),
      caption = "Table 2. Euclidean Cluster Assignments of Time Series Dates")

Table 2. Euclidean Cluster Assignments of Time Series Dates

cluster:

2251114244242222222511522222263336666666111666366633666633336611163333333333333335111533662444422251112422222244244511122244426333663666111666663333333333666631116633333333333366111163362222222251114442222222225111

kable(t(cbind(customer.data.norm[,0], cluster = gak_clust@cluster)),
      caption = "Table 3. Global Alignment Kernel Cluster Assignments of Time Series Dates")

Table 3. Global Alignment Kernel Cluster Assignments of Time Series Dates

cluster:

2213336266262222226133126222255555555555333555555544555544445533354444444444544445333145552666622213332222222266266133362266625444554555333555555544554444555543335544444444544455333354452222622213336662666222221333
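
Cluster numbers are arbitrary labels, so the same group of dates can carry a different number in each table. A cross-tabulation is one way to compare two sets of assignments. The sketch below uses only the first eight assignments from Tables 1 and 2; the variable names are assumptions introduced for illustration.

```r
# First eight assignments from Table 1 (DTW) and Table 2 (Euclidean)
dtw_labels <- c(3, 3, 6, 5, 5, 5, 6, 3)
euc_labels <- c(2, 2, 5, 1, 1, 1, 4, 2)

# Rows are DTW clusters, columns are Euclidean clusters; each cell counts
# the dates that received that pair of labels
xtab <- table(DTW = dtw_labels, Euclidean = euc_labels)
xtab
```

With the full model objects, the same comparison could be made from the cluster slots, for example table(dtw_clust@cluster, euclidean_clust@cluster).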

In Figures 8–16, the first cluster of each model is plotted three ways: a combined series and centroid plot, a series plot showing the members of the first cluster, and a centroids plot showing the series chosen as the medoid of the first cluster.

# Figure 8. DTW Series/Centroid Plot

plot(dtw_clust, type = "sc", clus = 1L)

# Figure 9. DTW Series Plot

plot(dtw_clust, type = "series", clus = 1L)

# Figure 10. DTW Centroids Plot

plot(dtw_clust, type = "centroids", clus = 1L)

# Figure 11. Euclidean Series/Centroid Plot

plot(euclidean_clust, type = "sc", clus = 1L)

# Figure 12. Euclidean Series Plot

plot(euclidean_clust, type = "series", clus = 1L)

# Figure 13. Euclidean Centroids Plot

plot(euclidean_clust, type = "centroids", clus = 1L)

# Figure 14. Global Alignment Kernel Series/Centroid Plot

plot(gak_clust, type = "sc", clus = 1L)

# Figure 15. Global Alignment Kernel Series Plot

plot(gak_clust, type = "series", clus = 1L)

# Figure 16. Global Alignment Kernel Centroids Plot

plot(gak_clust, type = "centroids", clus = 1L)

Figures 17, 18 and 19 are hierarchical dendrograms of the time series dates within the six clusters, for the Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel measures.

set.seed(123)
clust.hier.dtw <- tsclust(customer.data.norm, type = "h",
                          k = 6L, distance = "dtw")
clust.hier.euclidean <- tsclust(customer.data.norm, type = "h",
                                k = 6L, distance = "euclidean")
clust.hier.gak <- tsclust(customer.data.norm, type = "h",
                          k = 6L, distance = "gak")
plot(clust.hier.dtw, main="Figure 17. DTW Dendrogram")

plot(clust.hier.euclidean, main="Figure 18. Euclidean Distance Dendrogram")

plot(clust.hier.gak, main="Figure 19. Global Alignment Kernel Distance Dendrogram")
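
The general idea of hierarchical clustering with a cut at six clusters can be sketched in base R with hclust() and cutree(). The data below is synthetic, Euclidean distance is used, and the "average" linkage is an assumption chosen for illustration, so this is not a reproduction of the tsclust() calls above.

```r
set.seed(123)

# Synthetic stand-in: 30 short series (rows), each of length 2
x <- matrix(rnorm(60), ncol = 2)

# Build the hierarchy from pairwise distances
hc <- hclust(dist(x), method = "average")

# Cut the tree so that exactly six clusters remain
groups <- cutree(hc, k = 6)
table(groups)

# plot(hc) would draw the corresponding dendrogram
```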

Section 5 — Evaluation

For the final evaluation, Cluster Validity Indices (CVIs) are computed for the six-cluster solutions produced with Dynamic Time Warping, Euclidean Distance, and Global Alignment Kernel. According to the dtwclust documentation, the Sil, SF, CH, and D indices are to be maximized, while the DB, DBstar, and COP indices are to be minimized.
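
One of the indices reported by cvi(), Sil, is the silhouette index. As a rough sketch of what it measures, the base R function below compares each point's average distance to its own cluster with its average distance to the nearest other cluster. The function silhouette_width() and the data are hypothetical, and the implementation inside dtwclust may differ in detail.

```r
# Silhouette width per observation, from a distance object and cluster labels
silhouette_width <- function(d, cluster) {
  d <- as.matrix(d)
  vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # convention for single-member clusters
    a <- mean(d[i, own])      # average distance within the own cluster
    others <- setdiff(unique(cluster), cluster[i])
    # average distance to the nearest other cluster
    b <- min(vapply(others, function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Two well-separated hypothetical groups
x <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
cl <- c(1, 1, 1, 2, 2, 2)

sil <- silhouette_width(dist(x), cl)
mean(sil)  # values close to 1 indicate compact, well-separated clusters
```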

kable(cvi(dtw_clust), caption = "Table 4. DTW - Cluster Validity Indices")

Table 4. DTW — Cluster Validity Indices

Sil     0.4161095
SF      0.0290512
CH      111.8179800
DB      1.0677429
DBstar  1.8132669
D       0.0128444
COP     0.1154342

kable(cvi(euclidean_clust), caption = "Table 5. Euclidean Distance - Cluster Validity Indices")

Table 5. Euclidean Distance — Cluster Validity Indices

Sil     0.4290294
SF      0.1217342
CH      130.4489850
DB      0.8211406
DBstar  1.8619141
D       0.0186354
COP     0.1152264

kable(cvi(gak_clust), caption = "Table 6. Global Alignment Kernel Distance - Cluster Validity Indices")

Table 6. Global Alignment Kernel Distance — Cluster Validity Indices

Sil     0.5571970
SF      0.6153544
CH      1100.6889099
DB      1.0270383
DBstar  10.0607217
D       0.0006024
COP     0.0433589
