Symbolic time series representations

Here is some new work with PhD student Xinye Chen on a fast method for the symbolic representation of time series: https://arxiv.org/abs/2201.05697

So, what are symbolic representations of time series data, and what are they good for?

Let me try to give a brief informal introduction.

We are often faced with numerical realisations of time series, which are simply vectors of values corresponding to measurements over time. Say, the daily number of infected people in a population. But let's not overcomplicate things with "real-world data" and, for illustration, consider these four "toy" time series sampled at 8,000 data points:

Four simple time series for illustrative purposes
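If you want to play along, signals of this flavour are easy to generate with NumPy. Here is a quick sketch; the particular choices below are illustrative and not necessarily the exact curves shown in the figure.

```python
import numpy as np

n = 8000
t = np.linspace(0, 1, n)

series = {
    "sine":       np.sin(2 * np.pi * 10 * t),                 # smooth oscillation
    "square":     np.sign(np.sin(2 * np.pi * 10 * t)),        # piecewise constant
    "sawtooth":   2 * (10 * t - np.floor(10 * t + 0.5)),      # linear ramps
    "amp-mod":    (1 + 0.5 * np.sin(2 * np.pi * t)) * np.sin(2 * np.pi * 10 * t),
}
```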

A common problem would be to forecast the future behaviour of these time series. Looks easy enough, right?

Well, actually, even for these simple examples, choosing the right parameters in a model can be challenging. The red curves are the predictions that an autoregressive model of order 3, AR(3), would give you:

Forecasts from an AR(3) model

And below are iterated one-step forecasts from an AR(9) model. This model class mainly works well in the absence of noise and when the model order is chosen consistently with the frequency of oscillations (as in the top right plot):

Forecasts from an AR(9) model
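If you want to experiment with this yourself, here is a bare-bones sketch of fitting an AR(p) model by least squares and producing iterated one-step forecasts. A library such as statsmodels would do this more carefully, but the idea is the same.

```python
import numpy as np

def fit_ar(y, p):
    # Least-squares fit of y[t] ~ c + a1*y[t-1] + ... + ap*y[t-p].
    rows = [np.r_[1.0, y[t - p:t][::-1]] for t in range(p, len(y))]
    X = np.array(rows)
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs                      # [c, a1, ..., ap]

def forecast_ar(y, coeffs, steps):
    # Iterated one-step forecasts: feed each prediction back in as input.
    p = len(coeffs) - 1
    history = list(y[-p:])
    out = []
    for _ in range(steps):
        lags = history[::-1][:p]       # y[t-1], ..., y[t-p]
        pred = coeffs[0] + np.dot(coeffs[1:], lags)
        out.append(pred)
        history.append(pred)
    return np.array(out)

y = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 8000))
coeffs = fit_ar(y, p=9)
future = forecast_ar(y, coeffs, steps=1000)
```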

Below are forecasts from a more complicated model, a long short-term memory (LSTM) network with two hidden layers, each with 50 neurons, and a lag parameter of 100:

Forecasts from an LSTM model
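For reference, here is a rough Keras sketch of this kind of network (two LSTM layers with 50 units each and a window of 100 lagged values); the exact training setup used for the plot may differ.

```python
import numpy as np
import tensorflow as tf

def make_windows(y, lag):
    # Turn a 1d series into (samples, lag, 1) inputs and next-value targets.
    X = np.stack([y[i:i + lag] for i in range(len(y) - lag)])[..., None]
    return X, y[lag:]

y = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 8000)).astype("float32")
X, targets = make_windows(y, lag=100)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 1)),
    tf.keras.layers.LSTM(50, return_sequences=True),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, targets, epochs=5, batch_size=128, verbose=0)

# Iterated one-step forecasting: append each prediction to the window.
window = y[-100:].copy()
preds = []
for _ in range(200):
    nxt = model.predict(window[None, :, None], verbose=0)[0, 0]
    preds.append(nxt)
    window = np.r_[window[1:], nxt]
```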

This doesn't look good either, and it will be even more difficult to make these models work on real-world data, in particular, in the presence of noise and when the data exhibits oscillations on different time scales.

So why, as humans, would we consider the four "toy" time series from above easy to forecast? The answer is that all four of them can be described symbolically using only a small number of trends and features. For example, a human would describe a sine wave simply by its "up and down" trends, and perhaps its smoothness. In doing so, they effectively construct a low-dimensional symbolic representation that contains all the relevant features of the time series.

And this is what symbolic representations are all about: convert your numerical data into a sequence of letters from a small alphabet. Then use this representation for retrieval, forecasting, anomaly detection, or whatever else you want to do with your time series data.

The ABBA method is one example of a symbolic time series representation. Its name derives from "adaptive Brownian bridge-based aggregation" and it works as follows. First, approximate your numerical data by a polygonal chain controlled by an error tolerance. Here, we approximate the data using seven linear pieces:

Polygonal chain approximation of a time series
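Here is a deliberately simplified greedy sketch of this compression step. The actual ABBA criterion is based on a Brownian bridge error model, but the flavour is the same: each piece is extended for as long as the data stays close to the straight line connecting its endpoints, and each piece is then recorded by its length and increment.

```python
import numpy as np

def compress(y, tol):
    # Simplified greedy piecewise-linear approximation (not the exact ABBA criterion).
    # Returns each piece as a (length, increment) pair.
    pieces = []
    i = 0
    while i < len(y) - 1:
        j = i + 1
        while j + 1 < len(y):
            line = np.linspace(y[i], y[j + 1], j + 2 - i)       # straight line i -> j+1
            err = np.sum((y[i:j + 2] - line) ** 2)
            if err > tol ** 2 * (j + 1 - i):                    # tolerance exceeded: stop
                break
            j += 1
        pieces.append((j - i, y[j] - y[i]))                     # (length, increment)
        i = j
    return np.array(pieces)

y = np.sin(2 * np.pi * np.linspace(0, 1, 8000))
pieces = compress(y, tol=0.05)
print(len(pieces), "pieces")
```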

Each of the seven pieces of the polygonal chain can be represented by its length and increment, giving a point in a 2d plot:

2d plot of lengths and increments

Now run k-means on these points to identify clusters of similar time series pieces. In this case, there are three clusters labelled a, b, and c.

k-means clustering of time series pieces
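With scikit-learn, this step might look roughly as follows. (ABBA additionally scales the lengths and increments before clustering; that scaling is omitted here, and the `pieces` array is a small hard-coded stand-in for the output of the compression sketch above.)

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in (length, increment) pairs with three visible groups.
pieces = np.array([[120,  1.0], [118,  0.9], [40, -0.1],
                   [119, -1.0], [121, -0.9], [38,  0.0], [122, 1.1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pieces)
labels = kmeans.labels_            # cluster index for each piece
centers = kmeans.cluster_centers_
print(labels)
```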

Finally, we replace each piece of the polygonal chain by the letter of the cluster it belongs to. In this way, we have converted our time series data into the symbolic form a-b-b-a-c-a-b.

Final symbolic representation of a time series
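The symbolisation itself is then just a relabelling of cluster indices. ABBA assigns the letters in a particular order, but the idea is simply:

```python
# Cluster indices for the seven pieces (as produced by the clustering step above),
# mapped to letters 0 -> 'a', 1 -> 'b', 2 -> 'c'.
labels = [0, 1, 1, 0, 2, 0, 1]

alphabet = "abcdefghijklmnopqrstuvwxyz"
symbols = "".join(alphabet[k] for k in labels)
print(symbols)   # 'abbacab'
```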

Going back to the four toy examples from the beginning, all of these have the same very simple representation a-b-a-b-a-b-...

Symbolic representation of four simple time series

It is easy to imagine how even a basic language model, like a hidden Markov model, could perfectly forecast this sequence.
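As a toy illustration, a first-order Markov forecaster that counts symbol-to-symbol transitions and predicts the most frequent successor already does the job here:

```python
from collections import Counter, defaultdict

def next_symbol(sequence):
    # Count first-order transitions and predict the most frequent
    # successor of the last symbol (a minimal Markov-chain forecaster).
    transitions = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        transitions[current][nxt] += 1
    return transitions[sequence[-1]].most_common(1)[0][0]

print(next_symbol("ababababab"))   # -> 'a'
```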

Previous work with Steven Elsworth (https://arxiv.org/abs/2003.05672) has demonstrated how symbolic representations can reduce the training time and enhance the predictive performance of machine learning models such as LSTMs. You can think of that as converting a regression problem into a classification task.
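Concretely, "regression to classification" means the model is trained on windows of symbols with the next symbol as the class label, roughly like this (a hypothetical sketch, not the exact pipeline from the paper):

```python
import numpy as np

def symbol_windows(symbols, width):
    # Turn a symbol string into (window, next-symbol) classification pairs.
    alphabet = sorted(set(symbols))
    index = {s: i for i, s in enumerate(alphabet)}
    encoded = np.array([index[s] for s in symbols])
    X = np.stack([encoded[i:i + width] for i in range(len(encoded) - width)])
    y = encoded[width:]
    return X, y, alphabet

X, y, alphabet = symbol_windows("abbacababbacab", width=4)
# X rows feed an embedding/LSTM; y is the class to predict (softmax over the alphabet).
```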


The issue with ABBA has been its rather high computational complexity due to repeated runs of k-means, which were needed just to figure out the parameter k. The work with Xinye addresses this problem by getting rid of k-means altogether, achieving speedups of more than 30x without sacrificing accuracy. This finally makes ABBA applicable to very large temporal data, including sequences of colour values in 2d images.


If you want to learn about all the details, have a look at https://arxiv.org/abs/2201.05697 and the associated Python module (https://pypi.org/project/fABBA/).
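Here is a short usage example, along the lines of the package documentation (check the README for the current interface, as parameter names may change between versions):

```python
import numpy as np
from fABBA import fABBA

ts = np.sin(np.linspace(0, 20 * np.pi, 8000))            # a toy time series

fabba = fABBA(tol=0.1, alpha=0.1, sorting='2-norm', scl=1, verbose=0)
string = fabba.fit_transform(ts)                          # symbolic representation
reconstruction = fabba.inverse_transform(string, ts[0])   # back to numerical values

print(string[:20])
```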

Eamonn Keogh

Distinguished Professor at University of California, Riverside Co-Founder of FarmSense

3y

Found some of this hard to understand. Are the "symbols" in fABBA single real numbers? If not, then the comparisons of Reconstruction Performance with SAX seem unfair. A single SAX symbol takes only ceil(log2(cardinality)) bits. So, for example, if you allow eight SAX symbols, A, B, C, D, E, F, G, H, then each SAX symbol only requires 3 bits. A SAX word with 10 symbols and cardinality of eight takes only 30 bits. Surely a fair Reconstruction Performance comparison should be bit-for-bit, that is how compressions are normally compared. Perhaps you are doing that, I could not quite parse the text.

Joe McMahon

Applied Research Mathematician at U.S. Government

3y

Those impressive results from the ABBA method scream "take a chance on me".
