Sebastian Raschka, PhD's post

Machine learning and AI researcher | author of the "Build a Large Language Model From Scratch" book (mng.bz/M96o) | research engineer at Lightning AI | ex-statistics professor at University of Wisconsin-Madison

Recently, we have seen a wave of LLMs for longer contexts: 1) the RMT paper on scaling Transformers to 1M tokens, 2) the convolutional Hyena LLM for 1M tokens, and 3) LongNet, which scales Transformers to 1 billion tokens. While there are several use cases for such long-context LLMs, for example, answering questions about long input documents, the elephant in the room is: how well do LLMs actually use these longer contexts?

New research shows that LLMs are good at retrieving information from the beginning of a document, but they do noticeably worse when the relevant information is contained in the middle. This is quite interesting ...

1) I would expect the opposite to be true for, e.g., RNN-based LLMs like RWKV: since they process information sequentially, they might rather forget early information.

2) To my knowledge, there is no specific inductive bias in transformer-based LLM architectures that explains why retrieval performance should be worse for text in the middle of a document. I suspect it comes down to the training data and how humans write: the most important information usually sits at the beginning or the end (think paper Abstracts and Conclusion sections), and that is reflected in how LLMs parameterize the attention weights during training.

#llm #ai #machinelearning
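To make the setup concrete, here is a minimal sketch (not the paper's code; the paper itself uses multi-document QA and a synthetic key-value retrieval task) of how you could probe this yourself: place a single key fact at different relative depths inside a long block of filler text and measure how often the model retrieves it. The names `query_llm`, `build_context`, and `position_sweep` are placeholders I made up, not any specific API.

```python
# Minimal position-sweep retrieval probe: insert one key fact at varying depths
# in a long filler context and check whether the model can answer a question
# about it. `query_llm` is a placeholder for whatever LLM API you use.

import random

def build_context(fact: str, filler_sentences: list[str], depth: float) -> str:
    """Insert `fact` at relative position `depth` (0.0 = start, 1.0 = end)."""
    idx = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:idx] + [fact] + filler_sentences[idx:])

def position_sweep(query_llm, question, fact, answer, filler_sentences,
                   depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    """Return retrieval accuracy as a function of where the fact is placed."""
    accuracy = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            random.shuffle(filler_sentences)          # vary the distractor text
            context = build_context(fact, filler_sentences, depth)
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            hits += int(answer.lower() in query_llm(prompt).lower())
        accuracy[depth] = hits / trials
    return accuracy
```

If the "lost in the middle" effect holds for the model you test, the accuracy over `depths` should dip for intermediate positions rather than stay flat.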

Sebastian Raschka, PhD

1 yr

Links to the papers:
1) Lost in the Middle: How Language Models Use Long Contexts: https://arxiv.org/abs/2307.03172
2) Scaling Transformer to 1M tokens and beyond with RMT: https://arxiv.org/abs/2304.11062
3) Hyena Hierarchy: Towards Larger Convolutional Language Models: https://arxiv.org/abs/2302.10866
4) LongNet: Scaling Transformers to 1,000,000,000 Tokens: https://arxiv.org/abs/2307.02486

Sebastian Raschka, PhD

1 yr

Btw, I also want to clarify that they didn't compare these recent models (Hyena, LongNet, etc.). The analysis focuses on ChatGPT (as shown in the figure) and Claude. Of course, it would be interesting to include the newer models in the future.

Pascal Biese

Daily AI highlights for 60k+ experts | AI/ML Engineer

1 yr

Either that, or machine attention isn't as different from human attention as we had thought. (It's probably the data, though.)

Haseeb R.

Machine Learning Engineer @ Lunit | Digging Autoregressive Transformers

1 yr

I wonder if this research is utilizing attention-based RNNs? Otherwise, it would be strange to find results akin to Transformers. Even if they used attention-based RNNs, I am not sure whether RNNs should remember early information rather than the latest context.

How do I train a model with more than a 4k context on an A100 GPU? While reading these papers, I feel I am missing the trick to train models with 4k+ tokens. Please help.

Luckily, we have retrieval-augmented methods.
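For anyone unfamiliar with what is meant by that, below is a rough, generic sketch of retrieval augmentation: instead of stuffing an entire long document into the context window, retrieve only the few chunks most similar to the question and condition the model on those. The `embed` (text to vector) and `query_llm` (prompt to text) callables are placeholders, not any specific library's API.

```python
# Rough sketch of retrieval augmentation: rank document chunks by cosine
# similarity to the question and put only the top-k chunks into the prompt.
# `embed` and `query_llm` are placeholders for your embedding model and LLM.

import numpy as np

def top_k_chunks(question, chunks, embed, k=3):
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    chunk_vecs = [embed(c) for c in chunks]
    scores = [float(np.dot(q, v / np.linalg.norm(v))) for v in chunk_vecs]
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

def answer_with_retrieval(question, chunks, embed, query_llm):
    """Build a prompt from only the retrieved chunks and query the LLM."""
    context = "\n\n".join(top_k_chunks(question, chunks, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return query_llm(prompt)
```

Of course, this only sidesteps the long-context question; the retrieved chunks still land at some position in the prompt.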

Ludwig Stumpp

Building the largest EU AI ecosystem in Heilbronn with @appliedAI and @IPAI | AI Engineer | Creator of mlstarterkit.com

1 yr

One reason why LLMs are good at retrieving information from the beginning of documents might be that, in an autoregressive language modeling task, tokens at the beginning of the document are seen more often during training, no? The model therefore evaluates the relevance of beginning-of-document tokens across more gradually growing contexts than tokens that only appear at the end of a document, which could lead to better performance for earlier positions in the attention heads. Just a guess, though.
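For what it's worth, the asymmetry described here is easy to see in a toy calculation (my own illustration, not from the paper): within a single training sequence of length L, the token at position i serves as context for the L - 1 - i next-token predictions that follow it, so earlier positions participate in far more loss terms.

```python
# Toy illustration of the point above: in causal language modeling, the token
# at position i conditions every prediction for positions i+1 .. L-1, so
# earlier tokens are "seen as context" more often within each sequence.

def context_participation_counts(seq_len: int) -> list[int]:
    """How many next-token predictions each position serves as context for."""
    return [seq_len - 1 - i for i in range(seq_len)]

for pos, count in enumerate(context_participation_counts(8)):
    print(f"position {pos}: context for {count} prediction steps")
```

Whether this training-time imbalance actually drives the retrieval gap is, as the comment says, just a guess.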

I've seen some research earlier (unfortunately, I can't find it again right now) about the bias introduced by positional encoding. From what I remember, simply offsetting a sequence seen during training by a few tokens (by adding padding, for example) can completely throw off a transformer with classical attention and a sinusoidal positional encoding scheme. Maybe some related effect is at play here, and in that case the problem may come from too much homogeneity in the size and structure of training documents (which would lead to important information being localized at similar absolute positions in the sequence). No matter what, this is indeed a very interesting and important problem to solve to get further in NLP-based interfaces!
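To make the "absolute position" part of this concrete, here is a small sketch of the classic sinusoidal positional encoding from "Attention Is All You Need" (my own illustration; whether this mechanism explains the lost-in-the-middle effect is speculation): because the encoding depends only on the absolute index, prepending a few padding tokens assigns every token a different positional vector.

```python
# Classic sinusoidal positional encoding (Vaswani et al., 2017). Because it is
# a function of the absolute position only, shifting a sequence by a few
# padding tokens changes the positional vector every token receives.

import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]               # shape (P, 1)
    dims = np.arange(d_model)[None, :]                          # shape (1, D)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

pe = sinusoidal_encoding(num_positions=32, d_model=16)
offset = 3  # e.g., three padding tokens prepended
# Distance between the encodings a token gets before vs. after the offset:
print(np.linalg.norm(pe[10] - pe[10 + offset]))
```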

Camaron Foster

Founder at FosterAI

1 yr

This makes me think of GANs and chain-of-reasoning simultaneously. I imagine GPT-4, Bard, and, I don't know... Mathematica (Wolfram), a code AI, and I imagine coding them to productively argue toward a shared generation. I imagine them converging on protocols that ensure the blind spots in their reasoning are inspected and augmented, and that debugging and debiasing protocols are well cultivated. Sort of like the agent constructions people are making. No?

Afaque U.

MLE at Tiger Analytics | xTCS | GenAI | MLOps

1 yr

A wild thought: with so much focus and research going into LLMs and in-context retrieval, it's possible that a few years from now people will stop reading PDFs, just like PDFs replaced books (not entirely, but to some extent). Similarly, to aid LLMs, will writers also change their writing style? Instead of writing in the way humans understand, will there be more keywords in sync with embeddings to help LLMs?
