Ahead of AI #4: A Big Year For AI
Sebastian Raschka, PhD
Machine learning and AI researcher • author of the "Build a Large Language Model From Scratch" book (mng.bz/M96o) • research engineer at Lightning AI • ex-statistics professor at University of Wisconsin-Madison
Happy New Year! I'm thrilled to see that Ahead of AI has gained more than 20,000 subscribers after only 3 issues. This is a great motivator to keep writing, and I hope that everyone has a healthy and successful 2023 ahead!
It's been ten years since I first started exploring the field of machine learning, but this year has proven to be the most exciting and eventful one yet. Every day brings something new to the world of machine learning and AI, from the latest developments and breakthroughs to emerging trends and challenges.
To mark the start of the new year, this month's issue will feature a review of the top ten papers I've read in 2022.
PS: Thank you to those who have reached out asking how they can support Ahead of AI. While this newsletter is free and unabbreviated, there is a paid subscription option on Substack for those who would like to support it.
Articles & Trends
In January 2022, diffusion models caught my eye for the first time, and I suspected something big was coming. However, I never expected what followed in just a matter of months: DALLE-2, Imagen, Stable Diffusion, and many others.
Similarly, large language models have had a big year, with the recent ChatGPT putting the cherry on top and stealing the show. What a year!
However, since we already discussed diffusion models in issue 1 and various language models in issue 2, and you probably can't hear "ChatGPT" anymore, let me keep this section brief. Instead, let's jump to the December highlights and a summary of a state of AI report and industry survey by McKinsey before going over the noteworthy papers published this year.
December Highlights
The milestones mentioned above were released in such rapid succession that they will be hard to top. However, that doesn't mean December was dull. Keeping it brief, here are two papers that caught my eye.
What do vision transformers (ViTs) learn?
A visual exploration shows that ViTs learn inductive biases or features similar to those learned by convolutional neural networks (CNNs). For example, the early layers of ViTs capture edges and textures, while later layers learn more complex representations to capture broader concepts.
Regarding generative modeling, ViTs tend to generate higher-quality backgrounds than CNNs. This raises the question of how ViTs handle backgrounds and foregrounds in prediction tasks. It appears that ViTs are better than CNNs at predicting the target class when backgrounds are removed, and they also perform better when foregrounds are removed. This suggests that ViTs rely more selectively on whichever features are present, or are simply more robust in general.
A diffusion model for generating proteins
Diffusion models have achieved breakthrough performance when it comes to image generation, but what about generating protein structures? Researchers have developed a diffusion model called RoseTTAFold Diffusion (RFDiffusion) for de novo protein synthesis -- proteins that are created from scratch rather than derived from preexisting proteins found in nature.
It is important to distinguish de novo proteins, which are synthesized in the laboratory using amino acid sequences that have no evolutionary history, from systems such as AlphaFold and AlphaFold2, which use existing amino acid sequence data to predict protein 3D structures. However, it is worth noting that AlphaFold2 was used to validate the results of the RFDiffusion study.
Industry Trends
As researchers, we usually (try to) work on and read about the state of the art in AI. But what is actually used in industry today? According to McKinsey's recent state of AI report, it's not large language models (transformers).*
(*It is important to consider that the findings in this report may not accurately reflect the experiences of all companies due to the limitations of the sample size and representativeness.)
Let me summarize some takeaways from the graphic above that I found interesting.
(*Text understanding may include summarization. Summarization is also "generative," so I assume it largely refers to classification-like tasks here. On the flip side, categories can overlap.)
It appears that many companies have not yet adopted BERT-like language model encoders for text understanding and classification (1). Instead, they may still be using bag-of-words-based classifiers or recurrent neural networks. Similarly, it seems that GPT-like model decoders are not yet widely used for language generation, so text generation may still rely heavily on recurrent neural networks and other traditional methods.
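For reference, here is a minimal sketch of what such a traditional bag-of-words text classifier looks like in scikit-learn; the toy data is made up purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data for illustration only
texts = ["great product, works well", "terrible, broke after a day",
         "love it", "waste of money"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words (TF-IDF) features + a linear classifier: no transformers needed
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["works great"]))  # [1]
```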
Data set sizes, getting data to production, and model explainability
Additional insights I found interesting (based on the figure below):
Papers of the Year
The following are the top three papers that I read in 2022, along with a short discussion of each. Of course, there are many, many more exciting and potentially timeless and influential papers that were published this year.
Keeping it to "only" top-3 was particularly challenging this year, so there is also an extended list below featuring seven additional papers from my top-10 list.
1) ConvNeXt
The A ConvNet for the 2020s paper (https://arxiv.org/abs/2201.03545) is a highlight for me because the authors were able to design a purely convolutional architecture that outperformed popular vision transformers such as the Swin Transformer (and all convolutional neural networks that came before it, of course).
This so-called ConvNeXt architecture may well be the new default when it comes to using convolutional neural networks, not only for classification but also for object detection and instance segmentation -- it can be used as a backbone for Mask R-CNN, for example.
As the authors stated in the paper, they were inspired by modern vision transformer training regimes as well as the fact that the Swin Transformer hybrid architecture showed that convolutional layers are still relevant. That's because pure vision transformer architectures lack useful inductive biases such as translation equivariance and parameter-sharing (i.e., the "sliding window" in convolutions).
To develop ConvNeXt, the authors started with a ResNet-50 base architecture and adopted architectural modifications and training recipes from modern vision transformers. Note that these techniques were not new, even in the context of convolutional neural networks. The novelty here is that the authors used, analyzed, and combined them effectively.
Which techniques did they adopt? It's a long list, including depthwise convolutions, inverted bottleneck layer designs, AdamW, LayerNorm, and many more. You can find a summary in the figure below. In addition, the authors used modern data augmentation techniques such as Mixup, Cutmix, and others.
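To make a few of these ingredients concrete, here is a minimal sketch of a ConvNeXt-style block in PyTorch. It is simplified for illustration and omits details from the paper such as layer scale and stochastic depth:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt-style block: depthwise conv + inverted bottleneck."""
    def __init__(self, dim):
        super().__init__()
        # Depthwise 7x7 convolution: one filter per channel (the "sliding window")
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # LayerNorm over channels (channels-last)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # inverted bottleneck: expand 4x
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back down

    def forward(self, x):                        # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # -> (B, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (B, C, H, W)
        return x + residual                      # residual connection

block = ConvNeXtBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```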
2) MaxViT
Despite convolutional neural networks making quite the comeback with ConvNeXt above, vision transformers are currently getting all the attention (no pun intended).
MaxViT: Multi-axis Vision Transformer highlights how far vision transformers have come in recent years. While early vision transformers suffered from quadratic complexity, many tricks have been implemented to apply vision transformers to larger images with linear scaling complexity.
In MaxViT, this is achieved by decomposing an attention block into two parts with local-global interaction: block attention, which attends within small non-overlapping windows (local), and grid attention, which attends across a sparse, dilated grid that spans the entire feature map (global). A minimal sketch of this decomposition follows below.
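Here is a minimal PyTorch sketch of the multi-axis idea, assuming square inputs whose side length is divisible by the window size. The real model adds MBConv blocks, relative position biases, and more, so treat this purely as an illustration of the two attention axes:

```python
import torch
import torch.nn as nn

class MultiAxisAttention(nn.Module):
    """Simplified MaxViT-style block + grid attention (illustration only)."""
    def __init__(self, dim, num_heads=4, window=4):
        super().__init__()
        self.window = window
        self.block_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                         # x: (B, H, W, C)
        B, H, W, C = x.shape
        P = self.window

        # Block (local) attention: attend within non-overlapping P x P windows.
        b = x.reshape(B, H // P, P, W // P, P, C)
        b = b.permute(0, 1, 3, 2, 4, 5).reshape(-1, P * P, C)
        b, _ = self.block_attn(b, b, b)
        b = b.reshape(B, H // P, W // P, P, P, C).permute(0, 1, 3, 2, 4, 5)
        x = b.reshape(B, H, W, C)

        # Grid (global) attention: attend across a dilated P x P grid of
        # tokens spread over the whole feature map (stride H // P apart).
        g = x.reshape(B, P, H // P, P, W // P, C)
        g = g.permute(0, 2, 4, 1, 3, 5).reshape(-1, P * P, C)
        g, _ = self.grid_attn(g, g, g)
        g = g.reshape(B, H // P, W // P, P, P, C).permute(0, 3, 1, 4, 2, 5)
        return g.reshape(B, H, W, C)

x = torch.randn(2, 16, 16, 64)                    # (batch, height, width, channels)
print(MultiAxisAttention(64)(x).shape)            # torch.Size([2, 16, 16, 64])
```

Both attention steps operate over fixed-size P x P token sets, which is what gives the linear (rather than quadratic) scaling in image size.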
It's worth mentioning that MaxViT is a convolutional transformer hybrid featuring convolutional layers as well.
And it can be used for predictive modeling (incl. classification, object detection, and instance segmentation) as well as generative modeling.
As a side note, a search for "vision transformer" on Google Scholar yields over 5,000 results for 2022 alone. This high number of results, while potentially including false positives, demonstrates the widespread popularity of and interest in vision transformers.
But no worries, vision transformers won't entirely replace our beloved convolutional neural networks. Instead, as MaxViT highlights, the current trend goes towards combining vision transformers and convolutional networks into hybrid architectures.
3) Stable Diffusion
Before ChatGPT stole the show, it was not too long ago that Stable Diffusion was all over the internet and social media. Stable Diffusion is based on the paper High-Resolution Image Synthesis with Latent Diffusion Models, which was uploaded in December 2021. But since it was presented at CVPR 2022 and got the spotlight with the Stable Diffusion release in August 2022, I think it's fair to include it in this 2022 list.
Diffusion models (the topic of the first Ahead of AI issue) are probabilistic models designed to learn the distribution of a dataset by gradually denoising a normally distributed variable. This corresponds to learning the reverse process of a fixed Markov chain of length T.
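To make this concrete, here is a minimal sketch of the forward (noising) half of a DDPM-style diffusion process in PyTorch. The linear beta schedule and its values are common illustrative choices, not the paper's exact settings:

```python
import torch

# Noise schedule: beta_t grows over T steps; alpha_bar_t is the cumulative
# product of (1 - beta), which lets us sample x_t from x_0 in closed form.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

x0 = torch.randn(1, 3, 64, 64)            # a stand-in "image"
noise = torch.randn_like(x0)
x_mid = q_sample(x0, t=500, noise=noise)  # halfway through the chain
# A neural network is then trained to predict `noise` from `x_mid` and t,
# which amounts to learning the reverse (denoising) process.
```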
Unlike GANs, which are trained using a minimax game between a generator and a discriminator, diffusion models are likelihood-based models trained using maximum likelihood estimation (MLE). This can help to avoid mode collapse and other training instabilities.
Diffusion models have been around for some time (see?Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015) but were notoriously expensive to sample from during training and inference. The authors of the 2022 paper above mentioned a runtime of 5 days to sample 50k images.
The High-Resolution Image Synthesis with Latent Diffusion Models paper's novelty lies in applying diffusion in latent space using pretrained autoencoders, instead of operating directly in the full-resolution raw pixel space of the original images.
The training process can be described in two phases: First, pretrain an autoencoder to encode input images into a lower-dimensional latent space to reduce complexity. Second, train diffusion models on the latent representations of the pretrained autoencoder.
Operating in latent space reduces the computational costs and complexity of diffusion models for training and inference and can generate high-quality results.
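Below is a minimal sketch of this two-phase recipe. The encoder, decoder, and denoiser are tiny stand-ins for the paper's architectures, and the noising step is simplified for brevity:

```python
import torch
import torch.nn as nn

# Tiny stand-ins: a real latent diffusion model uses a VAE-style autoencoder
# and a U-Net denoiser; these modules only illustrate the shapes involved.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # image -> latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # latent -> image (used at sampling time)
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)         # predicts noise in latent space

# Phase 1 (assumed already done): pretrain encoder/decoder as an autoencoder.
# Phase 2: train the denoiser on *latent* representations only.
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
x = torch.randn(8, 3, 256, 256)           # batch of "images"
with torch.no_grad():
    z = encoder(x)                        # 256x256x3 -> 32x32x4: much cheaper
noise = torch.randn_like(z)
t = 0.5                                   # stand-in for a sampled timestep weight
z_noisy = (1 - t) * z + t * noise         # simplified noising step
loss = ((denoiser(z_noisy) - noise) ** 2).mean()
loss.backward()
opt.step()
```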
Another contribution of this paper is the cross-attention mechanism for general conditioning. Next to unconditional image generation, the proposed latent diffusion model is capable of inpainting, class-conditional image synthesis, super-resolution, and text-to-image synthesis -- the latter is what made DALLE-2 and Stable Diffusion so famous.
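The conditioning idea itself is simple: the latent image tokens act as queries that attend to the conditioning sequence (for example, text embeddings) as keys and values. A minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

# Cross-attention conditioning: latent image tokens (queries) attend to
# text-encoder outputs (keys/values). All dimensions below are illustrative.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
latents = torch.randn(1, 32 * 32, 64)  # flattened 32x32 latent feature map
text = torch.randn(1, 77, 64)          # e.g., prompt embeddings from a text encoder
conditioned, _ = attn(query=latents, key=text, value=text)
print(conditioned.shape)               # torch.Size([1, 1024, 64])
```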
Seven Other Paper Highlights of 2022
Here are short summaries of seven other papers that I had on my original top-10 list:
Open Source Highlights
Scikit-learn 1.2
The new version of my favorite machine learning library, scikit-learn, came out in December. My highlights are around the HistGradientBoostingClassifier, scikit-learn's histogram-based gradient boosting implementation, which is inspired by LightGBM.
The HistGradientBoostingClassifier now supports interaction constraints, among other improvements.
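For example, here is a minimal sketch of the interaction constraints added in scikit-learn 1.2, which restrict which features are allowed to interact within each tree (the toy data is for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Features 0 and 1 may interact with each other, and so may 2 and 3,
# but no split path may mix features across the two groups.
clf = HistGradientBoostingClassifier(interaction_cst=[[0, 1], [2, 3]])
clf.fit(X, y)
print(clf.score(X, y))
```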
PaLM + RLHF - Pytorch (WIP)
PaLM + RLHF - Pytorch is an implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture mentioned in the top-10 paper list above. It may be the first open-source equivalent of ChatGPT. The big caveat is that it doesn't come with pretrained weights (yet).
Echo
Remember OpenAI's Whisper model from Ahead of AI Issue #1? Whisper (also included in the top-10 paper list above) is a large speech recognition model that can generate high-quality subtitles.
I've tried many subtitle generators over the years, and I am genuinely impressed by the quality of the subtitles it generates. In fact, I was so impressed that I used it to generate the subtitles for the?Deep Learning Fundamentals?course as well! (PS: it's also one of the few models that handle non-native accents well!)
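If you want to try it yourself, a minimal sketch using the open-source whisper package looks something like this (the file name is a placeholder):

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")         # larger models: "small", "medium", ...
result = model.transcribe("my_video.mp4")  # placeholder file name
print(result["text"])                      # full transcript
for segment in result["segments"]:         # timestamped segments for subtitles
    print(segment["start"], segment["end"], segment["text"])
```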
Last month we released the Echo app, which provides a user-friendly interface around Whisper that lets you drag & drop videos directly from your computer. Like Muse, which runs Stable Diffusion, Echo is another example of putting a large model into production and scaling it to multiple users using the Lightning AI framework.
Of course, Echo is fully open source. Check out the Deploy OpenAI Whisper as a Cloud Product tutorial to learn how this was built.
Notable Quote
If you devote yourself to anything diligently for ten years, that will make you an expert. (That is the time that it would take to earn two master's degrees and a doctorate.)
-- Elizabeth Gilbert
Announcement: Deep Learning Fundamentals, Unit 3
I have been heads-down working on Deep Learning Fundamentals -- Learn Deep Learning With a Modern Open-Source Stack!
This free course teaches you deep learning from the ground up, from machine learning basics to training state-of-the-art deep neural networks on multiple GPUs in PyTorch.
I am happy to share that Unit 3 is out now. In Unit 2, we introduced PyTorch as a tensor/array library. In Unit 3, we take it a step further and talk about automatic differentiation!
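As a small taste of what automatic differentiation looks like in PyTorch:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x   # y = x^2 + 3x
y.backward()         # compute dy/dx via autograd
print(x.grad)        # tensor(7.) since dy/dx = 2x + 3 = 7 at x = 2
```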
I would love to hear your feedback! Please reach out on social media or the forum and let me know what you think.
Study & Productivity Tips
A Yearly Review
One of the things that keeps me sane is doing a weekly review (more on that in another newsletter). Since it's the end of the year, let me take this opportunity to write about my yearly review routine, which I usually do around New Year's Eve.
It's an exercise to reflect on your accomplishments (which hopefully serves as motivation) and what you want to accomplish next (to keep you focused).
Step 1: Looking Back
First, I collect stats, like reviewing my overall financials (expenses, income, savings), citations, website visits, etc., and compare them to last year's. I also look at the books I read and the courses I've taken.
Second, I go through my calendar and project folders (a topic for another newsletter, but I also wrote about it here) and write down 2-4 highlights for each month.
Third, I go through all the pictures I took this year and pick 1-2 for each month to create a little diary/album I print out (highlighting key events, trips, etc.) to keep a nice and concise summary.
Fourth, I take some time to reflect and think about everything that went well and didn't go so well. And I write down a key lesson for the new year ahead.
Step 2: Looking Forward
After the review is complete, I take some time to reflect on the main things I want to accomplish this year. This is the most challenging part of the yearly review, and I recommend taking your time with it.
Naming 2-3 big things I want to accomplish this year is an excellent exercise to stay focused. Over the year, I accumulate a lot of small but interesting projects and responsibilities, such as maintaining small hobby open-source projects. It's hard to let things go, but sometimes it's necessary: letting go is the only way to make room for something new.
In that process, I am also planning my curriculum for the year. This usually includes textbooks I want to read and courses I want to take. I am often way too optimistic in terms of what I can actually finish, but it's always good to have a plan.
Machine Learning Humor
"Move fast and break things" -- Software 2.0, 2004
"Scrape vast and generate fake things" -- Software 3.0, 2023