LLMs and foundational models for time series forecasting: They are not (yet) as good as you may hope

It seems clear to almost everybody that "foundational" time series models, i.e., models that are pretrained on many time series and/or built on top of an LLM, are likely the next big thing in time series forecasting.

Currently, as tends to happen in Machine Learning (ML), everybody is excited and rushing out papers and results to claim the credit. And, as has happened before, corners are cut along the way, so that not everything is as great as these papers want us to believe.

Misleading evaluations in current papers in top ML conferences

The problem

One paper to call out here is TimesNet [1]. This is a paper from ICLR 2023 (a top ML conference), and its selling point is that it can do "long-term forecasting", "short-term forecasting", imputation, classification, and anomaly detection, all with the same methodology and all with state-of-the-art (SOTA) performance. For short-term forecasting, the authors use the M4 dataset. They report (Table 3 on page 7) an OWA for their method of 0.851, which places them first among their comparison methods. Also, on their web page [7], they show a leaderboard for the SOTA of short-term forecasting (which for them means performance on the M4), where TimesNet wins and some transformer methods are 2nd and 3rd.

The problem I have with this is that whoever bothers to look at the results from the original M4 [2, Table 4] will see that the competition was actually won by the method ES-RNN, with an OWA of 0.821. Second place was FFORMA, with an OWA of 0.838, and the stated OWA of TimesNet would put it in 7th place, behind methods that used no deep learning at all and probably not even ML. This is still not a bad place (out of the 61 original participants), but certainly not the "SOTA" that Machine Learners tend to be so obsessed with.

Another interesting aspect is their treatment of NBEATS. This method got into the spotlight some time ago when it first came out, precisely by claiming to achieve "SOTA" on the M4, with an OWA of 0.795 [3, Table 1]. The TimesNet paper does report results for NBEATS, but with an OWA of 0.855.

Others were happy to quickly jump on the bandwagon. TIME-LLM [5] (likely accepted at ICLR 2024) "wins" on the M4 with an OWA of 0.859. If you ask how this can now suddenly be a winner: they report OWAs of 0.955 for TimesNet and 0.896 for NBEATS, which would place these methods, with those "updated" results, outside the top 15 in the M4, somewhere close to the Theta benchmark (NBEATS) or considerably worse (TimesNet). One thing we seem to be learning from this is that ML methods are not easy for non-experts to use (if even other top ML scientists cannot run them in a way that makes them competitive, how will an average data science graduate in a company make them work better than simple benchmarks?).

The list goes on, with ModernTCN [6] "winning" with an OWA of 0.838, Card [16] "winning" with an OWA of 0.832, TimeMixer [17] "winning" with an OWA of 0.840, and GPT4TS [8] reporting an OWA of 0.861. To be fair, the GPT4TS paper does not claim any superiority on the M4 dataset, and generally has a more measured evaluation. Also, while ModernTCN and Card are still worse than the original M4 winner and NBEATS, they do have good 2nd-place-level results on the original M4; they are just not winning as their papers suggest.

Finally, TIME-LLM adds a particular twist by reporting sMAPE results on the quarterly M3 dataset (Table 13 in the Appendix of the current version of the paper on openreview.net, accepted at ICLR 2024) of 11.171 for TIME-LLM and 10.410 for TimesNet. The interesting thing here is that, when we look at the results of the original participants in the M3 over 20 years ago (see, e.g., Table 5 in [9]), we see that it was won by Theta with an sMAPE of 8.956, and that NAIVE2 had an sMAPE of 9.951. As such, these results, likely to be published at ICLR 2024, are worse than a NAIVE2 from 20 years ago.
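For readers who want to check such numbers themselves, the OWA metric is straightforward to compute. The sketch below (NumPy, with function names of my own choosing) follows the M4 definition: the average of a method's sMAPE and MASE, each divided by the corresponding Naive2 value. Note that the official M4 OWA averages sMAPE and MASE across all 100,000 series before forming the ratios; this sketch shows the per-series building blocks.

```python
import numpy as np

def smape(y, yhat):
    # Symmetric MAPE in percent, as used in the M-competitions.
    return np.mean(200.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def mase(insample, y, yhat, m=1):
    # Scale out-of-sample errors by in-sample seasonal-naive errors
    # (seasonal period m; m=1 for non-seasonal series).
    scale = np.mean(np.abs(insample[m:] - insample[:-m]))
    return np.mean(np.abs(y - yhat)) / scale

def owa(insample, y, yhat, yhat_naive2, m=1):
    # OWA: average of sMAPE and MASE, each relative to Naive2.
    # By construction, Naive2 itself gets an OWA of exactly 1.
    return 0.5 * (smape(y, yhat) / smape(y, yhat_naive2)
                  + mase(insample, y, yhat, m) / mase(insample, y, yhat_naive2, m))
```

With this definition, an OWA below 1 means the method beats Naive2 on average, which is why M3/M4 results above the Naive2 line are so striking.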

The responses from the authors

When I asked the TimesNet authors about the discrepancies regarding NBEATS and the results of the original competitors, they stated regarding NBEATS: "As we stated in the paper, N-BEATS employs a special ensemble method, which incorporates the results predicted from different input series. Its final results are ensembled from 7 models. Thus, to ensure a fair comparison, we test all the models with one single input length." [4]. Regarding the original M4, they say: "Note that different from competition, for research, a fair comparison is essential. Without fair comparison, we cannot obtain any scientific conclusions." [4] I'm not sure what Makridakis and the M4 team would say to statements hinting that the M4 was not a fair comparison. In fact, I cannot think of a fairer comparison than a competition, and Makridakis' whole motivation for the competitions, I assume, was to create a platform for an objective, academic, fair comparison. In any case, this argument seems to have found widespread adoption in these papers; for example, the TimeMixer [17] paper says: "The original paper of N-BEATS (2019) adopts a special ensemble method to promote the performance. For fair comparisons, we remove the ensemble and only compare the pure forecasting models." In a private conversation with me, the authors of Time-LLM brought the same argument to their defence, along with arguments along the lines of only wanting to benchmark deep-learning methods.

A breaststroke competition

The problem with all of this is: if you are going to host a breaststroke competition, that may be exciting for all sorts of reasons, but you should make it clear to everybody that this is a breaststroke competition and not freestyle, as in the freestyle competition you are obviously losing. You should make this particularly clear to all the practitioners who are now going to take your methods out into the wild, wasting lots of time, energy, and compute before realising that these methods are not yet what they promise. And, knowing these top-level conferences and how they operate, I think it is safe to assume that these papers would have had a much harder time with the reviewers if these things had been made clearer. Also, the rules that these papers apply are quite arbitrary. Why can't I use an ensemble? Does that mean you don't have to win against a random forest in your competition? Note that the winning M4 method was in fact a deep-learning method, so it should qualify at least in this sense for your breaststroke competition. However, it had some ensembling elements to it.

There may be an argument that having to beat a competition winner is unfair, because by definition a competition winner is the best method on a given test set, selected out of many others. But we should keep in mind that the original participants didn't have the luxury of knowing the test set, an advantage that all subsequent methods do have. Also, I'm fully aware that just pulling numbers from papers and comparing them has its pitfalls, and I didn't do any calculations myself. Did they use the same definition of OWA? Did they maybe leave out some time series for whatever reason? Are we really comparing apples with apples here? None of the authors hinted at anything along these lines in their defence, so I assume that these results are what they claim to be, namely OWA on the full M4, comparable to the original participants. Furthermore, it should be the first responsibility of the authors of these papers to show me as a reader how their methods fare compared with the original submissions, so that I can make a direct assessment of how well they perform.

A systemic problem?

This whole situation certainly also points to the weaknesses of the review processes of top ML conferences. Reviewers are under time pressure and are not necessarily experts in the field they are reviewing in, and authors are under even more pressure, often being asked to prepare major revisions in a few days (as opposed to the weeks or months you would have at a journal), always threatened with a rejection that will quickly render the work irrelevant and non-publishable elsewhere.

While I think the current situation outlined above is somewhat embarrassing for the field of ML, and maybe in this instance for ICLR in particular, it is by no means a new phenomenon in this space. Together with colleagues, I pointed out earlier in [10] that the exchange rate forecasting dataset and associated forecasting problem used in many transformer papers is essentially a random number generator, one that makes anybody in Economics and Finance that I talk to about this laugh. By "winning" on this dataset along with all the other datasets where these papers "win", they essentially make their whole evaluation suspicious. The details can be looked up in our paper. The critique seems to have been heard by now, and most papers do not use this dataset anymore; however, some still do, most notably ModernTCN [6] and UniTime [11].

Another aspect to mention in this context is that I am often confronted, e.g. by reviewers, with statements like "XYZformer is the SOTA in forecasting, so why did you not compare with this method?" This statement makes hardly any sense to me, simply because there are so many different sub-problems in forecasting. You can have intermittent series in a retail setting (with 65% zeros, or 95% zeros, ...), hierarchical time series, a single yearly budget series, a dataset of 1000 product time series measured weekly over the last 2 years, etc. Is your new XYZformer really the solution for all of these problems? We have seen above that, for the M3 and M4 at least, it obviously isn't. It is certainly a step in the right direction that these newer papers also benchmark on something like the M4 if they want to claim "SOTA in time series forecasting".

Traditionally, most papers about new transformer architectures that claimed "SOTA" did so because they won on a set of 5-6 datasets that Lai et al. put together for a paper in 2018 [19]. These series are quite particular in the sense that they are all very long, and most of them (a notable exception being the exchange rate dataset) have quite regular patterns and not much trend, as far as I know. The task is then what these papers call "long-term forecasting", which to me seems a quite particular problem, selected mostly for the sake of academic publishing and with very little practical relevance: I find it difficult to think of an application where I need a forecast on a daily time series for one random day, say, 465 days out. In all industry settings where such forecasts may be requested, it will usually be a better idea to try and convince the stakeholders that, e.g., a monthly forecast at these large horizons is more in line with the decision they are trying to make, and that summing up my daily forecasts will give them a worse result than if I simply provide that monthly forecast directly.

Where to from here?

What does all this mean for the current state of forecasting models that are pretrained and/or use LLMs? Contenders I want to mention more positively are TimeGPT-1 [12], Lag-Llama [13], LLMTIME [18], and perhaps GPT4TS [8]. They all have credible results that are good, but not (yet) "disrupting-the-field-of-forecasting" good, and thus somewhat consistent with the results above once you take my discussion into account. As such, these models are clearly coming, but they are not there yet. A 2nd place in the M4 is remarkable, even though none of the above papers sold it that way. We can still develop dedicated models that outperform these models purely on accuracy. However, once these models really achieve the accuracies we are already being promised now, some interesting questions will arise, as follows.

Data leakage will be a major challenge going forward

Data leakage is always a problem in forecasting (we've also written about it in [10]), and global models that train across time series already face these problems more than local, per-series univariate models. For example, I noticed that models pretrained on the M4 seem to show really good performance on the M3. One has to wonder how much this has to do with the M4 dataset being put together roughly 20 years after the M3, with ample room to contain information about the future of the M3 series. So it is difficult to tell how much of this performance is relevant for real-world forecasting applications and how much is just due to some form of data leakage.

With foundational and LLM-based models, this will become a big challenge. Our evaluation methodologies currently all rely on experiments on publicly available benchmark datasets. LLM producers usually don't disclose the datasets they train on (and for good reason). So, while it is somewhat unlikely that they would train on time series data, the reality is that we don't know. Furthermore, a known weakness of LLMs is their bad performance at solving mathematical problems [14], so it wouldn't be surprising to soon see models that are pretrained on numerical data, including time series. Looking at pure time series models, Lag-Llama is clear about the data it trains on, whereas TimeGPT is less so; we have a good idea of what they likely trained on, namely all publicly available forecasting datasets they could find.
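If you do have access to a model's pretraining corpus, one crude sanity check is to probe whether your test series appears in it, up to shifting and scaling. The brute-force sketch below (the function name and z-normalised matching are my own choices, not taken from any of the cited papers) flags a test series that occurs as an affinely transformed contiguous subsequence of some corpus series; a real deduplication pipeline would use hashing or approximate nearest-neighbour search instead of this O(corpus length x series length) scan.

```python
import numpy as np

def contains_subseries(corpus_series, test_series, tol=1e-8):
    # Crude leakage probe: does any pretraining series contain the
    # test series as a contiguous subsequence, up to an affine
    # transform? Z-normalising both windows removes shift and scale.
    t = np.asarray(test_series, dtype=float)
    t_norm = (t - t.mean()) / (t.std() + 1e-12)
    n = len(t)
    for series in corpus_series:
        s = np.asarray(series, dtype=float)
        for i in range(len(s) - n + 1):
            w = s[i:i + n]
            w_norm = (w - w.mean()) / (w.std() + 1e-12)
            if np.max(np.abs(w_norm - t_norm)) < tol:
                return True
    return False
```

Of course, this only catches near-verbatim duplication; the subtler leakage discussed above (later datasets encoding the future of earlier ones) cannot be detected this way.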

But if you now have a pretrained model that has been trained on all publicly available datasets, how can you evaluate such a model? In the future, we may be forced to use non-public benchmark datasets, and to be careful to only benchmark models that we can run ourselves, as sending data to a web service like TimeGPT effectively hands the dataset over to the model developers, who will likely use it to train the next iteration of their model (the TimeGPT T&Cs have a "Use of Content to Improve Services" clause, from which one can opt out).

What about external covariates? And: How much can you learn from 14 data points?

Another problem that I'm sure will be solved at some point, but that is not addressed yet as far as I can see, is the context of a time series, be it metadata or external covariates.

If one forecasting task is wind power forecasting, with wind and temperature as covariates, and another is a retail task, where your covariates may be promotions, prices, and other factors, it is not clear how one model could handle such different covariates. Maybe this can be done with embeddings or certain forms of dimensionality reduction (PCA in the simplest case). Also, TabPFN [15] is an exciting idea that may show a way forward. Finally, I can also think of stacking solutions, where you take the output of the foundational model and feed it as an additional input into your specialised forecasting method, which also gets fed all the covariates.
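The stacking idea in the last sentence can be sketched in a few lines. In this sketch, the "specialised forecasting method" is just a linear least-squares model (NumPy), and all names and inputs are hypothetical: the foundation model's forecasts enter as one more regressor alongside the task-specific covariates.

```python
import numpy as np

def stacked_forecast(fm_forecast, covariates, y_train, fm_train, cov_train):
    # Stacking sketch: learn a correction model that maps the foundation
    # model's forecast plus task-specific covariates (e.g. promotions,
    # prices) to the actual target, then apply it to new forecasts.
    X_train = np.column_stack([fm_train, cov_train, np.ones(len(fm_train))])
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    X_new = np.column_stack([fm_forecast, covariates, np.ones(len(fm_forecast))])
    return X_new @ beta
```

In practice, the linear model would be replaced by whatever specialised method suits the task (e.g. gradient boosting), but the wiring stays the same: the foundation model does not need to know about the covariates at all.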

Finally, this could be a space where LLMs could actually shine. Let's assume you have a yearly time series with a long history, say 14 years. That gives you 14 data points in your series. With the current implementations, you would just feed in these 14 data points with no additional information. But there is a limit to how much information you can extract from 14 data points, even for an LLM-enhanced-transformer-deep-learning-super-model. To be able to predict from such small amounts of data, a model needs to make strong assumptions and have strong regularisation. As such, even the most complex model will eventually fall back to something simpler, and the big question is really whether your assumptions about the data are correct. As an example, if I know that the yearly series is sales of millions of chewing gums of a product that hasn't yet entered large markets such as China or India, it will be more reasonable to forecast a continuation of the exponential growth we see in the data than if I know that the series is sales of millions of iPhones. The big strength that LLMs promise here is that they could take in any context, in any form, that you may have about your time series, and build reasonable biases into the model that are not justified by the time series data alone, but by the context you have given and the LLM's world knowledge of what this context may mean for your time series.
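To make the 14-data-points argument concrete, the sketch below (names of my own choosing, not from any cited paper) fits about the only kind of model such a short series can support: a two-parameter log-linear trend. The strong assumption y_t = a * exp(b * t) is baked into the model rather than learned from the data, which is exactly the kind of prior that context (chewing gum vs. iPhones) would need to justify.

```python
import numpy as np

def fit_exponential_trend(y):
    # Fit log(y_t) = log(a) + b*t by least squares on the short history,
    # then return a function extrapolating h steps past the last point.
    t = np.arange(len(y), dtype=float)
    b, log_a = np.polyfit(t, np.log(y), 1)
    return lambda h: np.exp(log_a + b * (len(y) - 1 + h))
```

With 14 points, whether this extrapolation is sensible depends entirely on whether exponential growth is the right assumption, which the data alone cannot tell you.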

What about data confidentiality and the constraints of your production environment?

Finally, other interesting questions arise from a purely operational perspective of how these models will actually be used. If you need to send your time series to a web service that gives you back forecasts, you may simply not be able to do it because the data is confidential. Or you may have real-time constraints, e.g., in wind power forecasting you may need to produce a 5-minute-out forecast. If it takes you more than 20 seconds to generate your forecast, you may already be worse off than with a naive/persistence forecast produced 20 seconds later. There can be other operational constraints around reliable internet connections, etc. So an interesting question is whether there will be open-source pretrained models available that you can run on-site, in the way you want to run them.
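Operationally, the 5-minute-ahead example above suggests a simple pattern: always keep a persistence forecast as a fallback for when the model service is slow or down. A minimal sketch follows; the service wrapper `remote_forecast_fn` is a hypothetical callable, and a production system would use proper timeouts and request cancellation rather than checking the elapsed time after the fact.

```python
import time

def forecast_with_deadline(history, remote_forecast_fn, deadline_s=5.0):
    # Try the (possibly slow) model service; fall back to a persistence
    # forecast (last observed value) if the call fails or blows the budget.
    start = time.monotonic()
    try:
        forecast = remote_forecast_fn(history)
    except Exception:
        return history[-1]  # service down: persistence fallback
    if time.monotonic() - start > deadline_s:
        return history[-1]  # too late to be useful
    return forecast
```

The point is not the few lines of code but the operational stance: a forecast that arrives after the decision has been made is worth less than a naive one that arrives on time.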

Conclusions

Exciting developments are happening in the space of pretrained and/or LLM-based models for forecasting. But it is not yet time to throw overboard everything we have learned about forecasting over the last 40 years or so. The future promises to be exciting.

References

[1] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Temporal 2d-variation modeling for general time series analysis. ICLR, 2023.

[2] Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1), 54-74.

[3] Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. ICLR, 2020.

[4] https://github.com/thuml/Time-Series-Library/issues/293 (accessed 5/12/2023)

[5] Jin, M., Wang, S., Ma, L., Chu, Z., Zhang, J. Y., Shi, X., ... & Wen, Q. (2023). Time-llm: Time series forecasting by reprogramming large language models. ICLR 2024 (accepted).

[6] Zhong, S., Song, S., Li, G., Zhuo, W., Liu, Y., & Chan, S. H. G. (2023). ModernTCN: A Modern Pure Convolution Structure for General Time Series Analysis. ICLR 2024 (accepted, review is double-blind, but this is my guess for the authors based on a similar arxiv paper).

[7] https://github.com/thuml/Time-Series-Library (accessed 5/12/2023)

[8] Zhou, T., Niu, P., Wang, X., Sun, L., & Jin, R. (2023). One Fits All: Power General Time Series Analysis by Pretrained LM. NeurIPS 2023.

[9] Bergmeir, C., Hyndman, R. J., & Benítez, J. M. (2016). Bagging exponential smoothing methods using STL decomposition and Box–Cox transformation. International journal of forecasting, 32(2), 303-312.

[10] Hewamalage, H., Ackermann, K., & Bergmeir, C. (2023). Forecast evaluation for data scientists: common pitfalls and best practices. Data Mining and Knowledge Discovery, 37(2), 788-832.

[11] Liu, X., Hu, J., Li, Y., Diao, S., Liang, Y., Hooi, B., & Zimmermann, R. (2023). UniTime: A Language-Empowered Unified Model for Cross-Domain Time Series Forecasting. arXiv preprint arXiv:2310.09751.

[12] Garza, A., & Mergenthaler-Canseco, M. (2023). TimeGPT-1. arXiv preprint arXiv:2310.03589.

[13] Rasul, K., Ashok, A., Williams, A. R., Khorasani, A., Adamopoulos, G., Bhagwatkar, R., ... & Rish, I. (2023). Lag-llama: Towards foundation models for time series forecasting. arXiv preprint arXiv:2310.08278.

[14] https://www.firstpost.com/tech/news-analysis/below-average-chatgpt-is-terrible-at-maths-and-getting-worse-openai-needs-to-ask-google-for-help-13054652.html (accessed 8/12/2023)

[15] Hollmann, N., Müller, S., Eggensperger, K., & Hutter, F. (2022, October). TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In NeurIPS 2022 First Table Representation Workshop.

[16] Xue, W., Zhou, T., Wen, Q., Gao, J., Ding, B., & Jin, R. (2023). Make Transformer Great Again for Time Series Forecasting: Channel Aligned Robust Dual Transformer. arXiv preprint arXiv:2305.12095 (seems likely to be accepted at ICLR 2024, see https://openreview.net/forum?id=MJksrOhurE).

[17] TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting (2023) (seems likely to be accepted at ICLR 2024, see https://openreview.net/forum?id=7oLshfEIC2)

[18] Gruver, N., Finzi, M., Qiu, S., & Wilson, A. G. (2023). Large language models are zero-shot time series forecasters. NeurIPS 2023. arXiv preprint arXiv:2310.07820.

[19] Lai, G., Chang, W. C., Yang, Y., & Liu, H. (2018, June). Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval (pp. 95-104).

Faruk Erdoğan Buldur

Co-Founder at Harmony AI

6 months

Great points, Christoph Bergmeir. As a researcher in the field, I agree with almost all of your insights. I've also encountered numerous papers claiming SOTA performance, but when applying those methods to different datasets—especially those that involve subproblems like intermittency or trend—the results are often far from the promised performance. This gap between theoretical advancements and real-world applicability is something we definitely need to address more in the research community.

André Meyer-Baron

Dr. sc., Senior Researcher studying Developer Productivity and Well-Being

8 months

Thanks for this great discussion of the potential and drawbacks of evaluations of the early LLM-based time series forecasting models. I wonder how, in your opinion, this changed in the 8 months since you've posted your thoughts?

