Top Deep Learning Papers of 2024 (so far)
Diego Bonilla Salvador
Deep Learning Researcher @ Mango || AI, Deep Learning and Computer Vision Lover
The year 2024 has been tremendous for AI, and the upward trend doesn’t seem to be ending any time soon. Image generation, LLMs, Computer Vision… every model is getting bigger, more efficient, and smarter at the same time. Long gone are the one-task models and the no-free-lunch theorem. The hype is now around multimodal foundational models that can even run on phones to perform some unnecessary task that no one asked for. But, hey! It’s AI! Everybody loves AI, right? Do you want a dumb smartphone, or my cool intelligent AI device?
Either way, in this article, we’ll update ourselves on the most recent Deep Learning techniques and explore the State-of-the-Art. Hopefully, we can both learn something by exploring the Top Deep Learning papers of 2024 (so far).
WARNING
I tried to select papers that I believe hold significant value for the AI field and that use techniques demonstrated to work effectively (I’m looking at you, Kolmogorov–Arnold Networks). However, as Mark Zuckerberg said, “I was human,” and of course there is a significant amount of personal choice in this selection. Feel free to let me know in the comments if I’ve missed any pearls (or made any mistakes). Having said that, let’s get into it!
P.S: I’ve been trying not to overdo these kinds of articles. I only write them when I think they hold value for potential readers. I believe there are a lot of interesting papers to discuss here, so… take notes!
No “Zero-Shot” (https://arxiv.org/pdf/2404.04125)
Starting off strong with a paper about how well different models actually achieve Zero-Shot performance on various tasks, such as image recognition. Zero-Shot means that the model can recognize and understand new categories without having been explicitly trained on them. This is particularly useful in large-scale pretraining and computer vision because it allows the model to generalize from previously learned data to new, unseen scenarios, reducing the need for extensive labeled datasets and enabling more flexible and adaptive AI systems.
Models, like humans, need to learn a concept to understand it. Our first paper tries to answer the question, “How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets?” In simpler terms, how many “horses” does our model need to see to identify any kind of horse with high accuracy? Like me at school, if I study math for 2 hours, physics for 3 hours, and geography for 10 minutes, it would be obvious that my knowledge in geography would be far worse than the rest (even though I can name all 3 continents).
Fortunately, we can investigate exactly how often a concept needs to appear in a dataset for a model to increase its accuracy on it. The results are quite discouraging… The authors found that how frequently a concept appears in the pretraining data is a very good estimator of the trained model’s accuracy on that concept, and, even worse: “[…] model performance scales linearly as the concept frequency in pretraining data grows exponentially.” This means that to increase accuracy just a little, we need exponentially more data, which hints at an asymptote on the road to AGI.
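To make that log-linear scaling concrete, here is a tiny sketch with made-up numbers (the frequencies and accuracies below are invented for illustration, not taken from the paper): fit accuracy against the logarithm of concept frequency and see how much extra data each accuracy point costs.

```python
# Hypothetical illustration of the paper's log-linear finding: zero-shot accuracy
# on a concept grows roughly linearly with the *logarithm* of how often that
# concept appears in the pretraining data. All numbers below are made up.
import numpy as np

concept_frequency = np.array([1e2, 1e3, 1e4, 1e5, 1e6, 1e7])   # occurrences in pretraining data
zero_shot_accuracy = np.array([0.12, 0.21, 0.30, 0.41, 0.52, 0.60])  # invented accuracies

# Fit accuracy ~ slope * log10(frequency) + intercept
slope, intercept = np.polyfit(np.log10(concept_frequency), zero_shot_accuracy, deg=1)
print(f"~{slope:.2f} accuracy points gained per 10x more data on that concept")

# The sobering implication: every further step in accuracy costs roughly an
# order of magnitude more examples of the concept.
required_freq = 10 ** ((0.70 - intercept) / slope)
print(f"Reaching 70% on this concept would need ~{required_freq:.0e} occurrences")
```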
These results challenge the current approach of using huge unsupervised datasets from all over the internet with long-tailed concept distributions. It seems like the no-free-lunch theorem is back.
In my personal opinion, which no one asked for, it doesn’t make much sense and is not very scalable to have a single module (called a Neural Network) that can memorize, understand, generalize, and retrieve. This requires the model to be over-parameterized and to scale in size every time a larger dataset is created. Having separate specialized modules for each task might be more interesting. In fact, I think I have some paper about that somewhere… Ah, here it is!
Working Memory Within the Transformer (https://arxiv.org/pdf/2404.09173)
Just as (per the meme) pee is stored in the balls, long-term memory is stored in the weights (end of quote). Fortunately, LLMs have enough weights to store a good chunk of the internet’s knowledge, but they lack what is called working memory. Transformers (the core of LLMs and of Michael Bay’s bank account) work using context. Models like GPT-4 have a 128k-token context, which is roughly the length of a full novel. But at that length we are at the quantum level of LLMs: at the limit, conventional rules no longer apply.
First, the attention mechanism is not perfect. It can overlook or down-weight important information present in the context. This error is amplified in long contexts and becomes very noticeable beyond 128k tokens, where we can expect significant degradation from the model’s typical performance. Second, attention has quadratic complexity. With a >128k-token context, the memory, time, and processing power needed to compute that monstrosity grow quadratically with the context length. Lastly, everything outside this context is forgotten. For the LLM, it’s like it never existed.
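To put the quadratic part into numbers, here is a quick back-of-the-envelope sketch (my own simplification: it only counts the raw attention-score matrix in fp16, per head and per layer; kernels like FlashAttention avoid materialising it, but the O(n²) compute remains):

```python
# Back-of-the-envelope cost of vanilla self-attention at long context lengths.
def attention_score_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    # One (n, n) attention score matrix in fp16, per head and per layer.
    return n_tokens * n_tokens * bytes_per_value

for n in (4_096, 32_768, 131_072):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:8.1f} GiB per head per layer")
# Doubling the context quadruples this cost: quadratic, not linear.
```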
As we said before, memory is an important capability any model learns during training. In LLMs, the module responsible for this is the attention mechanism, and the memory itself lives in its latent representations. So, what if we could carry these latent representations over at every token inference instead of re-feeding the whole context?
Researchers at Google thought about this and created TransformerFAM. This is a Transformer model that performs a feedback loop to attend to its own latent representations (where memory is supposed to live). This enables a process similar to an RNN, where a latent representation is passed along at every inference step, allowing arbitrarily long sequences to be processed with constant, O(1), memory. TransformerFAM takes advantage of Block-Wise Attention and, thanks to this brain-inspired feedback mechanism, achieves information compression and global contextual storage without adding any new weights to the standard Transformer model.
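If you want a feel for the mechanism, here is a minimal, heavily simplified sketch in PyTorch. It is not the paper’s code: the names (FeedbackBlock, mem_len, the block size of 16) are mine, and the real TransformerFAM uses proper block-wise attention masks inside a full Transformer. The only point is to show a fixed-size memory being fed back from one block to the next.

```python
# Minimal sketch of the TransformerFAM idea: process the sequence block by
# block, let each block attend to [feedback memory ++ block], and update the
# feedback memory from the block's own latent representations.
import torch
import torch.nn as nn

class FeedbackBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, mem_len=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.mem_len = mem_len

    def forward(self, x, memory):
        # Queries: current block + memory slots; keys/values: memory ++ block.
        kv = torch.cat([memory, x], dim=1)
        q = torch.cat([x, memory], dim=1)
        out, _ = self.attn(q, kv, kv)
        out = out + self.ff(out)
        blk_len = x.size(1)
        x_out, new_memory = out[:, :blk_len], out[:, blk_len:]
        return x_out, new_memory  # new_memory feeds back into the next block

block = FeedbackBlock()
memory = torch.zeros(1, 8, 256)                          # persistent working memory
for chunk in torch.randn(1, 64, 256).split(16, dim=1):   # a stream of 16-token blocks
    chunk_out, memory = block(chunk, memory)
print(memory.shape)  # torch.Size([1, 8, 256]) -- constant regardless of sequence length
```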
I know I promised a separate memory module outside the Transformer to mitigate the problems mentioned before, but I found the idea of TransformerFAM better developed, approachable, and realistic, since it doesn’t add any new weights to the Transformer architecture. However, for the curious minds, I’ll leave here another paper (https://arxiv.org/pdf/2403.11901) with a separate memory module called Larimar, capable of reading and updating this memory using an encoder-decoder architecture and a memory matrix.
Enough of NLP!
Scalable Pre-training of Large Autoregressive Image Models (https://arxiv.org/pdf/2401.08541)
Computer Vision is once again copying from the NLP exam. Just as text is split into tokens, images can be tokenized into non-overlapping patches forming a chessboard pattern. Vision Transformers (ViTs) are essentially similar to text Transformers, differing only in input pre-processing and positional encoding. Given this similarity, techniques that work well for LLMs might also be effective for ViT-based models.
Like LLMs, Vision Models are pre-trained on vast amounts of data. Models like CLIP or DINO are shown billions of images from the internet to model everything there is in our world. Once a model has this understanding, it can be used as a general model (like ChatGPT or CLIP Zero-Shot) or fine-tuned for specific tasks. The better the pre-training, the better the model will generalize and perform.
The vision counterparts of LLMs are called Large Autoregressive Image Models (AIM), introduced in this paper. This approach replaces the pre-training task with an autoregressive objective: given the first N image patches, predict patch N+1 (in pixel space).
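Here is a rough sketch of what that objective looks like, under my own simplifications (a tiny Transformer, no positional embeddings, plain MSE on raw pixels); it is not the paper’s implementation, just the patchify-then-predict-the-next-patch idea:

```python
# Sketch of an AIM-style pre-training step: split the image into
# non-overlapping patches, run a causally-masked Transformer over them,
# and regress the raw pixels of the next patch.
import torch
import torch.nn as nn

patch, d_model = 16, 256
img = torch.rand(8, 3, 224, 224)                          # dummy batch of images

# Patchify: (B, 3, 224, 224) -> (B, 196, 3*16*16) in raster order
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(8, -1, 3 * patch * patch)

embed = nn.Linear(3 * patch * patch, d_model)             # positional embeddings omitted
head = nn.Linear(d_model, 3 * patch * patch)               # predicts raw pixels
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

n = patches.size(1)
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
hidden = encoder(embed(patches), mask=causal_mask)         # patch i only sees patches <= i

# Next-patch prediction: position i predicts patch i+1, in pixel space
pred = head(hidden[:, :-1])
loss = nn.functional.mse_loss(pred, patches[:, 1:])
print(loss.item())
```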
This simple approach benefits from the same advantages as LLMs! Performance did not saturate when scaling the model to 7 billion parameters and pre-training on 1.2 trillion image patches. And, like T5-style prefix language models, AIM uses prefix (partly bidirectional) attention during pre-training, so downstream tasks like image classification can run with fully bidirectional attention rather than a purely causal one.
Having strong image models is crucial for the advancement of multimodal models, enhanced intelligence, and a deeper understanding of the world. This progress enables better and more detailed image understanding in models like GPT-4, improved image generation, and higher-performing downstream computer vision models used for hundreds of tasks.
Imagine Flash (https://arxiv.org/pdf/2405.05224)
It’s 2024, and we can generate images, music, text, video, and more. Of course, I’m going to include a Generative AI (whatever that is) paper. Gotta go for those LinkedIn views. The paper I’ve selected highlights significant faults present in every Diffusion Model, like DALL·E or Stable Diffusion. Let’s dive in.
Simply explained, Diffusion Models work by gradually adding Gaussian noise to data until it becomes almost pure noise. A Neural Network then learns to remove this noise step by step, iteratively recovering the original data. This iterative process makes generation robust and scalable. During inference, the model is fed pure noise and generates new data, which can be conditioned on text, for example.
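For reference, here is what a standard (DDPM-style) diffusion training step looks like in a stripped-down sketch; the linear beta schedule and the tiny conv net are placeholders for the real schedule and U-Net, and timestep conditioning is omitted:

```python
# Stripped-down sketch of the usual diffusion training step: add Gaussian
# noise to clean data according to a schedule, train a network to predict it.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)              # cumulative signal fraction

denoiser = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
                         nn.Conv2d(32, 3, 3, padding=1))   # stand-in for a U-Net

x0 = torch.rand(4, 3, 32, 32)                              # clean images
t = torch.randint(0, T, (4,))                              # random timesteps
eps = torch.randn_like(x0)

# Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
ab = alpha_bar[t].view(-1, 1, 1, 1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

# The network is trained to predict the noise that was added
loss = nn.functional.mse_loss(denoiser(x_t), eps)
loss.backward()
```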
Researchers at Meta saw three problems with this process. First, removing noise from a noisy image is not the same as creating an image from pure random noise. This creates a mismatch between training, where the original image is still present to some degree, and inference, where there is no image, just noise. Second, Diffusion Models resolve broad structures in the early, noisier steps and fine details in the later steps, something a few-step model needs to account for. Third, generating from pure noise is challenging, especially for the very first step.
To address these issues, which propagate errors during image generation, they introduced several methods. Backward Distillation has the student network produce the intermediate noisy samples itself, starting from pure noise, and uses these student-generated samples as the starting point for both the student and the teacher during training. This simulates the actual conditions the student model will face during inference, ensuring it learns to generate high-quality images without relying on perfect initial conditions.
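Here is my loose reading of that idea as a training-loop sketch. Everything in it (`student`, `teacher`, `sample_with`, the step counts) is a placeholder of mine, not Meta’s code; the only point is where x_t comes from: the student’s own backward trajectory from pure noise, rather than a forward-diffused real image.

```python
# Loose sketch of Backward Distillation: the intermediate sample x_t is
# produced by running the *student* backward from pure noise (as at
# inference), so the student never sees information it won't have at test time.
import torch
import torch.nn as nn

student = nn.Conv2d(3, 3, 3, padding=1)                    # stand-ins for the real
teacher = nn.Conv2d(3, 3, 3, padding=1)                    # diffusion backbones

def sample_with(model, x, n_steps):
    """Crude stand-in for a few backward (denoising) steps of a sampler."""
    for _ in range(n_steps):
        x = x - 0.1 * model(x)
    return x

noise = torch.randn(4, 3, 32, 32)                          # inference-like start: pure noise
with torch.no_grad():
    x_t = sample_with(student, noise, n_steps=2)           # student's own trajectory
    target = sample_with(teacher, x_t, n_steps=8)          # teacher refines from that same point

pred = sample_with(student, x_t, n_steps=1)                # student must match in fewer steps
loss = nn.functional.mse_loss(pred, target)
loss.backward()
```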
Shifted Reconstruction Loss (SRL) adapts the distillation target to the noise level: at the noisier timesteps the student is pushed to match the teacher’s global structure, and at the cleaner timesteps its fine details. This way the student captures both the big picture and the fine details from the teacher, enhancing image quality.
Finally, Noise Correction addresses the challenge of starting with pure noise. It creates a special case for diffusion at the first timestep, differentiating it from the rest to ease training.
Speed Round
Final section, with papers that are very interesting but can be understood in just a couple of sentences since they are very intuitive (and I’m a good explainer):
FAREWELL.md
That is all! Eleven papers I loved and highly recommend reading. A lot of papers about LLMs, but they still have plenty of flaws and millions of dollars behind them (not every coin has to go to OpenAI!). Fewer single-purpose models, because corporations would rather you pay for their APIs than have a small company train its own… But that’s my 2024 paper experience in Deep Learning. Care to share yours?
Hope you learned something! If I missed anything let me know in the comments.
Did you like the story? Let me know in the comments and give it a clap! Share it with your friends! These things take a lot of time and effort to put together, so feedback is very much appreciated!
Follow my Medium for more!
Follow my LinkedIn for more papers or interesting personal projects!
(or don’t, I don’t care)
Very recommended indeed, Diego Bonilla Salvador (unbiased).