Sebastian Raschka, PhD's post

Machine learning and AI researcher | author of the "Build a Large Language Model From Scratch" book (mng.bz/M96o) | research engineer at Lightning AI | ex-statistics professor at University of Wisconsin-Madison

Recently, we have seen a wave of LLMs for longer contexts: 1) the RMT paper on scaling Transformers to 1M tokens, 2) the convolutional Hyena LLM for 1M tokens, and 3) LongNet, which scales Transformers to 1 billion tokens. While there are several use cases for such long-context LLMs, for example, answering questions about long input documents, the elephant in the room is: how well do LLMs actually use these longer contexts?

New research shows that LLMs are good at retrieving information from the beginning of a document, but they do noticeably worse when the relevant information is contained in the middle. This is quite interesting ...

1) I would expect the opposite to be true for, e.g., RNN-based LLMs like RWKV: since they process information sequentially, they might rather forget early information.

2) To my knowledge, there is no specific inductive bias in transformer-based LLM architectures that explains why retrieval performance should be worse for text in the middle of a document. I suspect it comes down to the training data and how humans write: the most important information usually sits at the beginning or the end (think paper Abstracts and Conclusion sections), and that is reflected in how LLMs parameterize the attention weights during training.

#llm #ai #machinelearning
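To make the setup concrete, here is a minimal sketch (not the paper's code; the paper itself uses multi-document QA and a synthetic key-value retrieval task) of how you could probe this yourself: place a single key fact at different relative depths inside a long block of filler text and measure how often the model retrieves it. The names `query_llm`, `build_context`, and `position_sweep` are placeholders I made up, not any specific API.

```python
# Minimal position-sweep retrieval probe: insert one key fact at varying depths
# in a long filler context and check whether the model can answer a question
# about it. `query_llm` is a placeholder for whatever LLM API you use.

import random

def build_context(fact: str, filler_sentences: list[str], depth: float) -> str:
    """Insert `fact` at relative position `depth` (0.0 = start, 1.0 = end)."""
    idx = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:idx] + [fact] + filler_sentences[idx:])

def position_sweep(query_llm, question, fact, answer, filler_sentences,
                   depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    """Return retrieval accuracy as a function of where the fact is placed."""
    accuracy = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            random.shuffle(filler_sentences)          # vary the distractor text
            context = build_context(fact, filler_sentences, depth)
            prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
            hits += int(answer.lower() in query_llm(prompt).lower())
        accuracy[depth] = hits / trials
    return accuracy
```

If the "lost in the middle" effect holds for the model you test, the accuracy over `depths` should dip for intermediate positions rather than stay flat.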

Sebastian Raschka, PhD

1 yr

Links to the papers:
1) Lost in the Middle: How Language Models Use Long Contexts: https://arxiv.org/abs/2307.03172
2) Scaling Transformer to 1M tokens and beyond with RMT: https://arxiv.org/abs/2304.11062
3) Hyena Hierarchy: Towards Larger Convolutional Language Models: https://arxiv.org/abs/2302.10866
4) LongNet: Scaling Transformers to 1,000,000,000 Tokens: https://arxiv.org/abs/2307.02486

Sebastian Raschka, PhD

1 yr

Btw, I also want to clarify that they didn't compare these recent models (Hyena, LongNet, etc.). The analysis focuses on ChatGPT (as shown in the figure) and Claude. Of course, it would be interesting to include the newer models in the future.

Pascal Biese

Daily AI highlights for 60k+ experts | AI/ML Engineer

1 yr

Either that, or machine attention isn't as different from human attention as we had thought. (It's probably the data, though.)

Haseeb R.

Machine Learning Engineer @ Lunit | Digging Autoregressive Transformers

1 yr

I wonder if this research is utilizing attention-based RNNs? Otherwise, it would be strange to find results akin to Transformers. Even if they used attention-based RNNs, I am not sure whether RNNs should remember early information rather than the latest context.

How do I train a model with more than a 4k context on an A100 GPU? While reading these papers, I feel I am missing the trick to train models with 4k+ tokens. Please help.

Luckily, we have retrieval-augmented methods.
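For anyone unfamiliar with what is meant by that, below is a rough, generic sketch of retrieval augmentation: instead of stuffing an entire long document into the context window, retrieve only the few chunks most similar to the question and condition the model on those. The `embed` (text to vector) and `query_llm` (prompt to text) callables are placeholders, not any specific library's API.

```python
# Rough sketch of retrieval augmentation: rank document chunks by cosine
# similarity to the question and put only the top-k chunks into the prompt.
# `embed` and `query_llm` are placeholders for your embedding model and LLM.

import numpy as np

def top_k_chunks(question, chunks, embed, k=3):
    """Return the k chunks most similar to the question (cosine similarity)."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    chunk_vecs = [embed(c) for c in chunks]
    scores = [float(np.dot(q, v / np.linalg.norm(v))) for v in chunk_vecs]
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

def answer_with_retrieval(question, chunks, embed, query_llm):
    """Build a prompt from only the retrieved chunks and query the LLM."""
    context = "\n\n".join(top_k_chunks(question, chunks, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return query_llm(prompt)
```

Of course, this only sidesteps the long-context question; the retrieved chunks still land at some position in the prompt.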

Ludwig Stumpp

Building the largest EU AI ecosystem in Heilbronn with @appliedAI and @IPAI | AI Engineer | Creator of mlstarterkit.com

1 yr

One reason why LLMs are good at retrieving information from the beginning of documents might be that, in an autoregressive language modeling task, tokens at the beginning of the document are seen more often during training, no? The model therefore evaluates the relevance of beginning-of-document tokens across more gradually growing contexts than tokens that only appear at the end of a document, which could lead to better performance for earlier positions in the attention heads. Just a guess, though.
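For what it's worth, the asymmetry described here is easy to see in a toy calculation (my own illustration, not from the paper): within a single training sequence of length L, the token at position i serves as context for the L - 1 - i next-token predictions that follow it, so earlier positions participate in far more loss terms.

```python
# Toy illustration of the point above: in causal language modeling, the token
# at position i conditions every prediction for positions i+1 .. L-1, so
# earlier tokens are "seen as context" more often within each sequence.

def context_participation_counts(seq_len: int) -> list[int]:
    """How many next-token predictions each position serves as context for."""
    return [seq_len - 1 - i for i in range(seq_len)]

for pos, count in enumerate(context_participation_counts(8)):
    print(f"position {pos}: context for {count} prediction steps")
```

Whether this training-time imbalance actually drives the retrieval gap is, as the comment says, just a guess.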

I've seen some research earlier (unfortunately, I can't find it again right now) about the bias introduced by positional encoding. From what I remember, simply offsetting a sequence seen during training by a few tokens (by adding padding, for example) can completely throw off a transformer with classical attention and a sinusoidal positional encoding scheme. Maybe some related effect is at play here, and in that case the problem may come from too much homogeneity in the size and structure of training documents (which would lead to important information being localized at similar absolute positions in the sequence). No matter what, this is indeed a very interesting and important problem to solve to get further in NLP-based interfaces!
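To make the "absolute position" part of this concrete, here is a small sketch of the classic sinusoidal positional encoding from "Attention Is All You Need" (my own illustration; whether this mechanism explains the lost-in-the-middle effect is speculation): because the encoding depends only on the absolute index, prepending a few padding tokens assigns every token a different positional vector.

```python
# Classic sinusoidal positional encoding (Vaswani et al., 2017). Because it is
# a function of the absolute position only, shifting a sequence by a few
# padding tokens changes the positional vector every token receives.

import numpy as np

def sinusoidal_encoding(num_positions: int, d_model: int) -> np.ndarray:
    positions = np.arange(num_positions)[:, None]               # shape (P, 1)
    dims = np.arange(d_model)[None, :]                          # shape (1, D)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    enc = np.zeros((num_positions, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc

pe = sinusoidal_encoding(num_positions=32, d_model=16)
offset = 3  # e.g., three padding tokens prepended
# Distance between the encodings a token gets before vs. after the offset:
print(np.linalg.norm(pe[10] - pe[10 + offset]))
```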

Camaron Foster

Founder at FosterAI

1 yr

This makes me think of GANs and chain-of-reasoning simultaneously. I imagine GPT-4, Bard, and, I don't know... Mathematica (Wolfram), a code AI, and I imagine coding them to productively argue toward a shared generation. I imagine them converging on protocols that ensure the blind spots in their reasoning are inspected and augmented, and that debugging and debiasing protocols are well cultivated. Sort of like the agent constructions people are making. No?

Afaque U.

MLE at Tiger Analytics | xTCS | GenAI | MLOps

1 yr

A wild thought: with so much focus and research going into LLMs and in-context retrieval, it's possible that a few years from now people will stop reading PDFs, just like PDFs replaced books (not entirely, but to some extent). Similarly, to aid LLMs, will writers also change their writing style? Instead of writing in the way humans understand, will there be more keywords in sync with embeddings to help LLMs?
