Symbolic time series representations

Here is some new work with PhD student Xinye Chen on a fast method for the symbolic representation of time series: https://arxiv.org/abs/2201.05697

So, what are symbolic representations of time series data, and what are they good for?

Let me try to give a brief informal introduction.

We are often faced with numerical realisations of time series, which are simply vectors of values corresponding to measurements over time. Say, the daily number of infected people in a population. But let's not overcomplicate things with "real-world data" and, for illustration, consider these four "toy" time series sampled at 8,000 data points:

Four simple time series for illustrative purposes
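If you want to play along, signals of this flavour are easy to generate with NumPy. Here is a quick sketch; the particular choices below are illustrative and not necessarily the exact curves shown in the figure.

```python
import numpy as np

n = 8000
t = np.linspace(0, 1, n)

series = {
    "sine":       np.sin(2 * np.pi * 10 * t),                 # smooth oscillation
    "square":     np.sign(np.sin(2 * np.pi * 10 * t)),        # piecewise constant
    "sawtooth":   2 * (10 * t - np.floor(10 * t + 0.5)),      # linear ramps
    "amp-mod":    (1 + 0.5 * np.sin(2 * np.pi * t)) * np.sin(2 * np.pi * 10 * t),
}
```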

A common problem would be to forecast the future behaviour of these time series. Looks easy enough, right?

Well, actually, even for these simple examples, choosing the right parameters in a model can be challenging. The red curves are the predictions that an autoregressive model of order 3, AR(3), would give you:

Forecasts from an AR(3) model

And below are iterated one-step forecasts from an AR(9) model. This model class mainly works well in the absence of noise and when the model order is chosen consistently with the frequency of oscillations (as in the top right plot):

Forecasts from an AR(9) model
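If you want to experiment with this yourself, here is a bare-bones sketch of fitting an AR(p) model by least squares and producing iterated one-step forecasts. A library such as statsmodels would do this more carefully, but the idea is the same.

```python
import numpy as np

def fit_ar(y, p):
    # Least-squares fit of y[t] ~ c + a1*y[t-1] + ... + ap*y[t-p].
    rows = [np.r_[1.0, y[t - p:t][::-1]] for t in range(p, len(y))]
    X = np.array(rows)
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs                      # [c, a1, ..., ap]

def forecast_ar(y, coeffs, steps):
    # Iterated one-step forecasts: feed each prediction back in as input.
    p = len(coeffs) - 1
    history = list(y[-p:])
    out = []
    for _ in range(steps):
        lags = history[::-1][:p]       # y[t-1], ..., y[t-p]
        pred = coeffs[0] + np.dot(coeffs[1:], lags)
        out.append(pred)
        history.append(pred)
    return np.array(out)

y = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 8000))
coeffs = fit_ar(y, p=9)
future = forecast_ar(y, coeffs, steps=1000)
```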

Below are forecasts from a more complicated model, a long short-term memory (LSTM) network with two hidden layers, each with 50 neurons, and a lag parameter of 100:

Forecasts from an LSTM model
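For reference, here is a rough Keras sketch of this kind of network (two LSTM layers with 50 units each and a window of 100 lagged values); the exact training setup used for the plot may differ.

```python
import numpy as np
import tensorflow as tf

def make_windows(y, lag):
    # Turn a 1d series into (samples, lag, 1) inputs and next-value targets.
    X = np.stack([y[i:i + lag] for i in range(len(y) - lag)])[..., None]
    return X, y[lag:]

y = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 8000)).astype("float32")
X, targets = make_windows(y, lag=100)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 1)),
    tf.keras.layers.LSTM(50, return_sequences=True),
    tf.keras.layers.LSTM(50),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, targets, epochs=5, batch_size=128, verbose=0)

# Iterated one-step forecasting: append each prediction to the window.
window = y[-100:].copy()
preds = []
for _ in range(200):
    nxt = model.predict(window[None, :, None], verbose=0)[0, 0]
    preds.append(nxt)
    window = np.r_[window[1:], nxt]
```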

This doesn't look good either, and it will be even more difficult to make these models work on real-world data, in particular, in the presence of noise and when the data exhibits oscillations on different time scales.

So why, as humans, would we consider the four "toy" time series from above easy to forecast? The answer is that all four of them can be described symbolically using only a small number of trends and features. For example, a human would describe a sine wave simply by its "up and down" trends, and perhaps its smoothness. In doing so, they effectively construct a low-dimensional symbolic representation that contains all the relevant features of the time series.

And this is what symbolic representations are all about: convert your numerical data into a sequence of letters from a small alphabet. Then use this representation for retrieval, forecasting, anomaly detection, or whatever else you want to do with your time series data.

The ABBA method is one example of a symbolic time series representation. Its name derives from "adaptive Brownian bridge-based aggregation" and it works as follows. First, approximate your numerical data by a polygonal chain controlled by an error tolerance. Here, we approximate the data using seven linear pieces:

Polygonal chain approximation of a time series
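Here is a deliberately simplified greedy sketch of this compression step. The actual ABBA criterion is based on a Brownian bridge error model, but the flavour is the same: each piece is extended for as long as the data stays close to the straight line connecting its endpoints, and each piece is then recorded by its length and increment.

```python
import numpy as np

def compress(y, tol):
    # Simplified greedy piecewise-linear approximation (not the exact ABBA criterion).
    # Returns each piece as a (length, increment) pair.
    pieces = []
    i = 0
    while i < len(y) - 1:
        j = i + 1
        while j + 1 < len(y):
            line = np.linspace(y[i], y[j + 1], j + 2 - i)       # straight line i -> j+1
            err = np.sum((y[i:j + 2] - line) ** 2)
            if err > tol ** 2 * (j + 1 - i):                    # tolerance exceeded: stop
                break
            j += 1
        pieces.append((j - i, y[j] - y[i]))                     # (length, increment)
        i = j
    return np.array(pieces)

y = np.sin(2 * np.pi * np.linspace(0, 1, 8000))
pieces = compress(y, tol=0.05)
print(len(pieces), "pieces")
```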

Each of the seven pieces of the polygonal chain can be represented by its length and increment, giving a point in a 2d plot:

2d plot of lengths and increments

Now run k-means on these points to identify clusters of similar time series pieces. In this case, there are three clusters labelled a, b, and c.

k-means clustering of time series pieces
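With scikit-learn, this step might look roughly as follows. (ABBA additionally scales the lengths and increments before clustering; that scaling is omitted here, and the `pieces` array is a small hard-coded stand-in for the output of the compression sketch above.)

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in (length, increment) pairs with three visible groups.
pieces = np.array([[120,  1.0], [118,  0.9], [40, -0.1],
                   [119, -1.0], [121, -0.9], [38,  0.0], [122, 1.1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pieces)
labels = kmeans.labels_            # cluster index for each piece
centers = kmeans.cluster_centers_
print(labels)
```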

Finally, we replace each piece of the polygonal chain by the letter of the cluster it belongs to. In this way, we have converted our time series data into the symbolic form a-b-b-a-c-a-b.

Final symbolic representation of a time series
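The symbolisation itself is then just a relabelling of cluster indices. ABBA assigns the letters in a particular order, but the idea is simply:

```python
# Cluster indices for the seven pieces (as produced by the clustering step above),
# mapped to letters 0 -> 'a', 1 -> 'b', 2 -> 'c'.
labels = [0, 1, 1, 0, 2, 0, 1]

alphabet = "abcdefghijklmnopqrstuvwxyz"
symbols = "".join(alphabet[k] for k in labels)
print(symbols)   # 'abbacab'
```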

Going back to the four toy examples from the beginning, all of these have the same very simple representation a-b-a-b-a-b-...

Symbolic representation of four simple time series

It is easy to imagine how even a basic language model, like a hidden Markov model, could perfectly forecast this sequence.
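As a toy illustration, a first-order Markov forecaster that counts symbol-to-symbol transitions and predicts the most frequent successor already does the job here:

```python
from collections import Counter, defaultdict

def next_symbol(sequence):
    # Count first-order transitions and predict the most frequent
    # successor of the last symbol (a minimal Markov-chain forecaster).
    transitions = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        transitions[current][nxt] += 1
    return transitions[sequence[-1]].most_common(1)[0][0]

print(next_symbol("ababababab"))   # -> 'a'
```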

Previous work with Steven Elsworth (https://arxiv.org/abs/2003.05672) has demonstrated how symbolic representations can reduce the training time and enhance the predictive performance of machine learning models such as LSTMs. You can think of that as converting a regression problem into a classification task.
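Concretely, "regression to classification" means the model is trained on windows of symbols with the next symbol as the class label, roughly like this (a hypothetical sketch, not the exact pipeline from the paper):

```python
import numpy as np

def symbol_windows(symbols, width):
    # Turn a symbol string into (window, next-symbol) classification pairs.
    alphabet = sorted(set(symbols))
    index = {s: i for i, s in enumerate(alphabet)}
    encoded = np.array([index[s] for s in symbols])
    X = np.stack([encoded[i:i + width] for i in range(len(encoded) - width)])
    y = encoded[width:]
    return X, y, alphabet

X, y, alphabet = symbol_windows("abbacababbacab", width=4)
# X rows feed an embedding/LSTM; y is the class to predict (softmax over the alphabet).
```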


The issue with ABBA has been its rather high computational complexity due to repeated runs of k-means, which were needed just to figure out the parameter k. The work with Xinye addresses this problem by getting rid of k-means altogether, achieving speedups of more than 30x without sacrificing accuracy. This finally makes ABBA applicable to very large temporal data, including sequences of colour values in 2d images.


If you want to learn about all the details, have a look at https://arxiv.org/abs/2201.05697 and the associated Python module (https://pypi.org/project/fABBA/).
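Here is a short usage example, along the lines of the package documentation (check the README for the current interface, as parameter names may change between versions):

```python
import numpy as np
from fABBA import fABBA

ts = np.sin(np.linspace(0, 20 * np.pi, 8000))            # a toy time series

fabba = fABBA(tol=0.1, alpha=0.1, sorting='2-norm', scl=1, verbose=0)
string = fabba.fit_transform(ts)                          # symbolic representation
reconstruction = fabba.inverse_transform(string, ts[0])   # back to numerical values

print(string[:20])
```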

Eamonn Keogh

Distinguished Professor at University of California, Riverside Co-Founder of FarmSense

3y

Found some of this hard to understand. Are the "symbols" in fABBA single real numbers? If not, then the comparisons of Reconstruction Performance with SAX seem unfair. A single SAX symbol takes only ceil(log2(cardinality)) bits. So, for example, if you allow eight SAX symbols, A, B, C, D, E, F, G, H, then each SAX symbol only requires 3 bits. A SAX word with 10 symbols and cardinality of eight takes only 30 bits. Surely a fair Reconstruction Performance comparison should be bit-for-bit, that is how compressions are normally compared. Perhaps you are doing that, I could not quite parse the text.

Joe McMahon

Applied Research Mathematician at U.S. Government

3y

Those impressive results from the ABBA method scream "take a chance on me".
