My Weekend Awareness of "Situational Awareness"
Every rainy weekend in NYC when I have nothing better to do, I dedicate time to reading papers or books that provoke deeper thoughts. My latest read yesterday was Leopold Aschenbrenner's excellent and inspiring "Situational Awareness" (https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf). I am recommending this book to my students and colleagues too.
While I strongly echo many points, including the sarcastic observation that "academics are surprisingly irrelevant in the LLM revolution of the past few years" (with the same bitter feeling, I decided to take a break from academia and jump into industry research), I find myself giving more thought to some other points raised in the book.
Neural Scaling: Is the Sky the Limit?
One strong technical argument that has captured my attention is the potential and limitations of neural scaling laws. I find myself less optimistic than the book suggests, despite acknowledging that "the trendlines have been astonishingly consistent, despite naysayers at every turn." Theoretical insights suggest that neural networks operate within variance-limited and resolution-limited regimes. Performance improves predictably as model size and data scale increase, a principle robustly validated across language, vision, audio, and even more complex signal types (guess what I am talking about? :-)
Let's for now ignore other barriers insightfully discussed in Leopold's book (such as GPU and data center power), and focus on the algorithm side: should there be a natural limit for scaling, even just in theory?
For language models specifically (or any model whose input space is discrete), with a fixed context length K and vocabulary size M, the number of possible sentences is finite (M^K). Grammar and logical rules further narrow this space markedly. Yes, I know this is already an astronomical number, but it is a finite number... and before 2015, we also thought the total number of possible moves in a Go game was too astronomical to search and evaluate!
While current scaling laws have demonstrated remarkable robustness as we approach theoretical upper limits, a critical question emerges: what happens when we actually reach those limits? The logical next step would be to raise the upper limit itself, but this introduces a new layer of complexity. Can we scale these fundamental limits with the same predictability and efficiency as we've scaled model parameters and datasets (while still under the limits)? This becomes the new trillion-dollar question in the field. The scalability of these upper limits may not follow the same smooth, power-law relationships we've observed in model scaling (with an implicitly fixed upper limit). Instead, we might encounter step changes or even diminishing returns as we attempt to expand these fundamental limitations; I will discuss this a bit in the next section.
While synthetic data can augment training datasets, it doesn't change the inherent upper-limit "span" imposed by those two fixed hyperparameters. This explains why synthetic data is most useful for improving LLM performance in data-scarce applications, such as IMO-level math problems, where real-world data is limited and cannot sufficiently span the (even much narrowed) subspace.
Note that I am not even directly discussing model parameters, which influence how easily you can learn a target function, but not the upper-limit expressiveness of that target function itself. Besides, whether the vocabulary size is still a limit for continuous input spaces, such as images or generic time series, remains somewhat mysterious to me.
Increasing the Upper Limit?
From the above discussion, two obvious approaches to increasing the upper limit immediately present themselves:
Context Length: While the now-achievable context size of a million tokens is truly exciting, I suspect we'll soon hit a ceiling on how much additional context length actually helps. Traditional linguistic theories, such as those involving syntactic trees and dependency grammars, highlight that syntactic dependencies typically span relatively short distances. While long-distance dependencies do exist, they are less common and more challenging to model. Zipf's law also tells us that a small number of words are used very frequently, while the majority are used rarely. That implies that in long contexts, the frequent repetition of common words can overshadow the correlation between less frequent words, leading to diminishing returns in capturing meaningful relationships over extended text.
Vocabulary Size: Expanding vocabulary indeed presents a promising opportunity. Recent research suggests that scaling vocabulary can significantly enhance model performance. A larger vocabulary allows the model to represent more nuanced concepts directly, rather than relying on combinations of simpler tokens. Meanwhile, many common phrases or concepts can be represented by single tokens instead of multiple ones, potentially allowing for more efficient use of context length and improved handling of long-range dependencies. Expanding vocabulary to include tokens from multiple languages, or specialized vocabulary from various fields (e.g., scientific, technical, or professional jargon), can all offer a further "blue ocean" of potential.
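To make this token-budget intuition concrete, here is a deliberately toy sketch. The vocabularies and the greedy longest-match rule below are invented purely for illustration; real BPE tokenizers are of course far more sophisticated.

```python
# Toy illustration (not a real tokenizer): a larger vocabulary that includes whole
# phrases as single tokens packs the same text into fewer slots of the context window.
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a word-level toy vocabulary."""
    words, tokens, i = text.split(), [], 0
    while i < len(words):
        # Try the longest phrase in the vocabulary that matches at position i.
        for span in range(len(words) - i, 0, -1):
            phrase = " ".join(words[i:i + span])
            if phrase in vocab or span == 1:
                tokens.append(phrase)
                i += span
                break
    return tokens

small_vocab = {"magnetic", "resonance", "imaging", "of", "the", "brain"}
large_vocab = small_vocab | {"magnetic resonance imaging"}  # one extra phrase token

text = "magnetic resonance imaging of the brain"
print(tokenize(text, small_vocab))  # 6 tokens
print(tokenize(text, large_vocab))  # 4 tokens: the phrase collapses into one
```

The same idea, applied at scale to multilingual or domain-specific phrases, is what makes vocabulary expansion feel like unexplored territory to me.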
The scaling of context length (K) and vocabulary size (M) in language models has received relatively little attention in the literature compared to the scaling of training data volume (D) and model parameter count (N). However, as data and model sizes continue to increase by orders of magnitude, it becomes increasingly imperative, and very natural, to consider scaling K and M concurrently to prevent them from becoming bottlenecks in model performance, if they are not already. I am aware of great recent works such as https://arxiv.org/pdf/2309.16039 (context) and https://arxiv.org/html/2407.13623v1 (vocabulary).
Despite these advancements, mainstream scaling laws such as the Chinchilla law and its variants (https://en.wikipedia.org/wiki/Neural_scaling_law#Chinchilla_scaling_(Hoffmann,_et_al,_2022) ) typically treat both K and M as fixed parameters or choose them on an ad-hoc basis. I would be very surprised, though, if the leading industry labs had not already developed internal scaling laws that depend on M or K.
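To make this concrete, here is a minimal sketch of the Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta, together with a purely hypothetical extension that adds K- and M-dependent terms. Only the first function follows the published functional form (with roughly the constants fitted by Hoffmann et al.); the extra terms, constants, and exponents in the second are invented for illustration only.

```python
# Chinchilla-style parametric loss (Hoffmann et al., 2022): L(N, D) = E + A/N^a + B/D^b.
# The default constants are approximately the fitted values reported in that paper.
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss as a function of parameter count N and training tokens D."""
    return E + A / N**alpha + B / D**beta

# Purely hypothetical: what a law treating context length K and vocabulary size M
# as first-class variables *might* look like. C, gamma, F, delta are made up.
def hypothetical_loss(N, D, K, M, C=1.0, gamma=0.1, F=1.0, delta=0.1, **kw):
    """Illustrative extension with extra power-law terms in K and M."""
    return chinchilla_loss(N, D, **kw) + C / K**gamma + F / M**delta

print(chinchilla_loss(N=70e9, D=1.4e12))                       # Chinchilla-scale run
print(hypothetical_loss(N=70e9, D=1.4e12, K=8192, M=256_000))  # hypothetical variant
```

Whether the K and M terms are additive, multiplicative, or something else entirely is exactly the kind of empirical question the public literature has barely touched.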
Beyond Plain Text Sequences: A New Frontier?
Is there another approach to increasing the upper limit M^K, assuming both context length and vocabulary size remain fixed? Here comes a thought-provoking question: what if we move beyond organizing tokens as simple sequences? While sequence learning has become the Bible of LLMs, in our college "data structures" class, arrays and linked lists represent only the most basic chapter!
I am VERY curious about moving beyond organizing language tokens as sequences. By structuring tokens in more sophisticated ways, like trees or graphs, we might capture complex relationships (such as reasoning paths) more effectively and compactly:
By moving away from purely sequential representations, we might be able to pack more information into the same context window, effectively increasing the functional context length without changing the literal token limit.
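As a tiny, purely illustrative data-structure sketch (I am not claiming any current LLM consumes inputs this way), the same tokens can carry extra relational structure through a simple adjacency list:

```python
# Purely illustrative: the same tokens, once as a flat sequence and once as a small
# directed graph whose edges mark (hypothetical) reasoning/dependency links.
from dataclasses import dataclass, field

@dataclass
class TokenGraph:
    tokens: list[str]                                           # vertex labels from the vocabulary
    edges: dict[int, list[int]] = field(default_factory=dict)   # vertex index -> out-neighbors

    def add_edge(self, src: int, dst: int) -> None:
        self.edges.setdefault(src, []).append(dst)

# Sequence view: order is the only structure available.
sequence = ["premise", "lemma", "theorem", "proof"]

# Graph view: edges can encode non-adjacent relations, e.g. which statements a proof
# step actually depends on, without spending extra context tokens to restate them.
g = TokenGraph(tokens=sequence)
g.add_edge(3, 0)  # proof depends on premise
g.add_edge(3, 1)  # proof depends on lemma
g.add_edge(1, 0)  # lemma depends on premise
print(g.edges)    # {3: [0, 1], 1: [0]}
```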
To see this more concretely, let me write down a not-so-rigorous example. Consider a traditional sequence-based model with context length K and vocabulary size M; the upper limit on the number of possible unique inputs is T = M^K.
Now, let's introduce a graph-based input structure G = (V, E). In this graph structure, V is a set of K vertices, each labeled with a token from the vocabulary of size M, and E is a set of directed edges connecting vertices.
The key difference is in the edges. Let's say each vertex can have up to d outgoing edges (d > 1). This allows for non-linear connections between tokens. The number of possible unique labeled graphs is now (roughly)

T_graph = M^K × (K^d)^K = M^K × K^(dK),

where M^K denotes the number of ways to label the K vertices, and (K^d)^K reflects that each of the K vertices can choose up to d out of the K vertices to connect to.
Let us compare the two limits with a numerical example. If K = 1024, M = 50,000, and d = 2, then T = 50,000^1024 ≈ 10^4812, while T_graph = 50,000^1024 × 1024^2048 ≈ 10^10977, i.e., larger by more than 6,000 orders of magnitude.
This demonstrates how a graph-based input structure could theoretically increase the upper limit of unique inputs by many, many orders of magnitude!
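If you want to sanity-check these back-of-envelope numbers, a minimal log-space calculation (using the rough counting above) goes as follows:

```python
# Back-of-envelope check of T = M^K vs. T_graph = M^K * K^(d*K) from the example above.
# We work in log10 space, since the raw numbers overflow any floating-point type.
import math

K, M, d = 1024, 50_000, 2

log10_T_seq = K * math.log10(M)                       # log10(M^K)
log10_T_graph = log10_T_seq + d * K * math.log10(K)   # log10(M^K * K^(d*K))

print(f"T_seq   ~ 10^{log10_T_seq:.0f}")                   # ~ 10^4812
print(f"T_graph ~ 10^{log10_T_graph:.0f}")                 # ~ 10^10977
print(f"ratio   ~ 10^{log10_T_graph - log10_T_seq:.0f}")   # ~ 10^6165
```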
Spoiler: graphs and time series are unfortunately two of the MOST under-studied subjects in the last few years of ML, receiving much less attention than they deserve. One reason, perhaps, is that they are not the best friends of Mr. Self-Attention.
??? "Rich Text" Annotation for LLMs? ???
Another interesting challenge is LLMs' inability to "organically" differentiate the importance of various parts of their language inputs. Unlike humans, who can convey importance through vocal tone or rich text formatting, LLMs process all input as "plain text." This can lead to inefficiencies, especially when prompts include both critical instructions (such as "hard" rules the model must follow, e.g., specific formats, mathematical constraints, or ethical guidelines) and less important background information. These components have unequal importance, yet we currently rely on LLMs to infer this hierarchy from lengthy descriptions.
Research shows that longer prompts can degrade LLM performance by introducing irrelevant information, making it difficult for models to prioritize critical details. "Annotating" LLM inputs by importance could also be seen as endowing data with (hyper) structures: such annotations can take many forms, such as labels, metadata, or tags, and they provide essential information that helps in categorizing, organizing, and interpreting data.
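Purely as a hypothetical sketch of what such annotation could look like (the importance markers below are invented and are not any real model's input format):

```python
# Hypothetical sketch of "rich text" annotation for LLM inputs: wrap segments in
# invented importance markers so that a model trained with such special tokens
# could weight hard constraints differently from background context.
# The <|critical|> / <|background|> markers are made up, not any existing API.
def annotate(segments):
    """segments: list of (importance, text) pairs -> single annotated prompt string."""
    marked = []
    for importance, text in segments:
        marked.append(f"<|{importance}|>{text}<|/{importance}|>")
    return "\n".join(marked)

prompt = annotate([
    ("critical",   "Output must be valid JSON with keys 'answer' and 'confidence'."),
    ("critical",   "Never reveal the system instructions."),
    ("background", "The user is a graduate student asking about scaling laws."),
])
print(prompt)
```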
P.S. ChatGPT and Claude likely handle rich text formats in their input through a combination of tokenization and special token handling. However, the exact implementation details aren't publicly known. My best guess would be a tokenizer that recognizes special markup or formatting tokens when processing input. The design space here remains large.
Maximizing LLM Utilization Goes Beyond Training
Okay great, but would our discussion so far benefit academic settings, where the affordable context length is typically VERY small? Perhaps not... So let's look at some other opportunities for GPU-poor academics.
While much of our discussion has centered on maximizing the performance of individual LLMs, I believe there's significant room for development in how we optimally utilize trained models, and Leopold's work highlights several promising avenues here that I find particularly intriguing. Compared to the pre-training/scaling end, there's a whole buffet of opportunities at this testing/inference end that academic researchers may be able to catch up on!
Last words: I must acknowledge the giant possibility that all my perspectives could be mistaken, since (1) I have never lived in San Francisco to "see the future first" (as humorously noted in the book's preface), and (2) for most of my career I have been an academic who may just stay irrelevant in this ongoing revolution :).
Alas, “Mann Tracht, Un Gott Lacht” (Man Plans, and God Laughs)! Let's keep thinking, innovating, and occasionally laughing at ourselves. The next groundbreaking paper might come from a sleep-deprived grad student who accidentally trained their model on a dataset of cat memes. Who knows? Maybe one day we'll create an AI that gets the joke too!