Ahead of AI #4: A Big Year For AI

Happy New Year! I'm thrilled to see that Ahead of AI has gained more than 20,000 subscribers after only 3 issues. This is a great motivator to keep writing, and I hope that everyone has a healthy and successful 2023 ahead!

It's been ten years since I first started exploring the field of machine learning, but this past year has proven to be the most exciting and eventful one yet. Every day brings something new to the world of machine learning and AI, from the latest breakthroughs to emerging trends and challenges.

To mark the start of the new year, this month's issue will feature a review of the top ten papers I've read in 2022.

PS: Thank you to those who have reached out asking how they can support Ahead of AI. While this newsletter is free and unabbreviated, there is a paid subscription option on Substack for those who would like to support it.


Articles & Trends

In January 2022, diffusion models caught my eye for the first time, and I suspected something big was coming. However, I never expected what followed in just a matter of months: DALLE-2, Imagen, Stable Diffusion, and many others.

Similarly, large language models have had a big year, with the recent ChatGPT putting the cherry on top and stealing the show. What a year!

However, since we already discussed those diffusion models in issue 1 and various language models in issue 2, and you probably can't hear "ChatGPT" anymore, let me keep this section brief. Instead, let's jump to the December highlights and a summary of McKinsey's state of AI report and industry survey before going over noteworthy papers published this year.


December Highlights

The milestones mentioned above arrived in rapid succession and will be hard to top. However, that doesn't mean December was dull. So, keeping it brief, here are two papers that caught my eye.

What do vision transformers (ViTs) learn?

A visual exploration shows that ViTs learn inductive biases or features similar to those learned by convolutional networks (CNNs). For example, the early layers of ViTs capture edges and textures, while later layers learn more complex representations to capture broader concepts.

The progression for visualization features of vision transformers from early layers (left) to deeper layers (right). Source: https://arxiv.org/abs/2212.06727.


Regarding generative modeling, ViTs tend to generate higher-quality backgrounds than CNNs. This raises the question of how ViTs handle backgrounds and foregrounds in prediction tasks. It appears that ViTs are better than CNNs at predicting the target class when backgrounds are removed, and they also perform better when foregrounds are removed. This suggests that ViTs rely more selectively on the features that are actually present, or that they are simply more robust in general.

Paper: What do Vision Transformers Learn? A Visual Exploration


A diffusion model for generating proteins

Diffusion models have achieved breakthrough performance in image generation, but what about generating protein structures? Researchers have developed a diffusion model called RoseTTAFold Diffusion (RFDiffusion) for de novo protein design -- proteins that are created from scratch rather than derived from preexisting proteins found in nature.

Source: https://www.biorxiv.org/content/10.1101/2022.12.09.519842v1


It is important to distinguish de novo proteins, which are synthesized in the laboratory using amino acid sequences that have no evolutionary history, from systems such as AlphaFold and AlphaFold2, which use existing amino acid sequence data to predict protein 3D structures. However, it is worth noting that AlphaFold2 was used to validate the results of the RFDiffusion study.

Paper: Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models


Industry Trends

As researchers, we usually (try to) work on and read about the state of the art in AI. But what is actually used in industry today? According to McKinsey's recent state of AI report, it's not large language models (transformers).*

(*It is important to consider that the findings in this report may not accurately reflect the experiences of all companies due to the limitations of the sample size and representativeness.)

Source: McKinsey State of AI Report 2022, https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review


Let me summarize some takeaways from the graphic above that I found interesting.

  1. Natural language processing has always been popular in industry, but it was usually substantially surpassed by computer vision applications. Now, for the first time, we see that computer vision and natural language processing are almost tied.
  2. Natural-language text understanding (which may refer to text classification*) is almost twice as popular as natural-language "generation". Note that natural-language generation typically dominates the news: GPT-3, Galactica, ChatGPT, and others.

(*Text understanding may include summarization. Summarization is also "generative," so I assume it largely refers to classification-like tasks here. On the flip side, categories can overlap.)

  3. Transformers rank at the bottom.

It appears that many companies have not yet adopted BERT-like language model encoders for text understanding and classification. Instead, they may still be using bag-of-words-based classifiers or recurrent neural networks. Similarly, it seems that GPT-like decoders are not yet widely used for language generation, so text generation may still rely heavily on recurrent neural networks and other traditional methods.
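To make that concrete, here is a minimal sketch (with made-up toy data) of the kind of classical bag-of-words pipeline that, on this reading of the survey, many companies may still rely on instead of transformer-based models:

```python
# Illustrative sketch only; the toy texts and labels are made up for this example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works as advertised",
    "terrible support, waste of money",
    "fast shipping and solid build quality",
    "broke after two days of use",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Bag-of-words (TF-IDF) features plus a linear classifier: no transformer in sight.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["the product is great"]))  # e.g., [1]
```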


Data set sizes, getting data to production, and model explainability

Additional insights I found interesting (based on the figure below):

  • It's important to be able to leverage "small data" (keyword: data-centric AI). When data is not available, the ability to generate synthetic data is useful.
  • Having the ability to integrate data into the AI model as quickly as possible is what sets high-performers apart from the competition. A good software framework and infrastructure setup may be critical for this.
  • Most high-performing companies unfortunately don't care about model interpretability (yet).

Source: McKinsey State of AI Report 2022, https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2022-and-a-half-decade-in-review


Papers of the Year

The following are the top three papers that I read in 2022, along with a short discussion of each. Of course, there are many, many more exciting and potentially timeless and influential papers that were published this year.

Keeping it to "only" top-3 was particularly challenging this year, so there is also an extended list below featuring seven additional papers from my top-10 list.


1) ConvNeXt

The A ConvNet for the 2020s paper (https://arxiv.org/abs/2201.03545) is a highlight for me because the authors were able to design a purely convolutional architecture that outperformed popular vision transformers such as the Swin Transformer (and all convolutional neural networks that came before it, of course).

Source: https://arxiv.org/abs/2201.03545


This so-called ConvNeXt architecture may well be the new default when it comes to using convolutional neural networks not only for classification but also for object detection and instance segmentation -- it can be used as a backbone for Mask R-CNN, for example.

As the authors stated in the paper, they were inspired by modern vision transformer training regimes as well as the fact that the Swin Transformer hybrid architecture showed that convolutional layers are still relevant. That's because pure vision transformer architectures lack useful inductive biases such as translation equivariance and parameter-sharing (i.e., the "sliding window" in convolutions).

To develop ConvNeXt, the authors started out with a ResNet-50 base architecture and adopted architectural modifications and training regimes borrowed from modern vision transformers. Note that these techniques were not new, even in the context of convolutional neural networks. The novelty here is that the authors used, analyzed, and combined them effectively.

Which techniques did they adopt? It's a long list, including depthwise convolutions, inverted bottleneck layer designs, AdamW, LayerNorm, and many more. You can find a summary in the figure below. In addition, the authors also used modern data augmentation techniques such as Mixup, Cutmix, and others.

Annotated version of a figure from https://arxiv.org/abs/2201.03545
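To give a sense of how these ingredients fit together, below is a simplified PyTorch sketch of a single ConvNeXt-style block (omitting details such as layer scale and stochastic depth from the official implementation):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified sketch of one ConvNeXt block combining several design choices above."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = nn.LayerNorm(dim)           # LayerNorm instead of BatchNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # inverted bottleneck: expand 4x
        self.act = nn.GELU()                    # GELU instead of ReLU
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x                     # residual connection

block = ConvNeXtBlock(96)
print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```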


2) MaxViT

Despite convolutional neural networks making quite the comeback with ConvNeXt above, vision transformers are currently getting all the attention (no pun intended).

MaxViT: Multi-axis Vision Transformer highlights how far vision transformers have come in recent years. While early vision transformers suffered from quadratic complexity in the number of image patches, many tricks have since been introduced to apply vision transformers to larger images with linear scaling complexity.

Source: https://arxiv.org/abs/2204.01697


In MaxViT, this is achieved by decomposing an attention block into two parts with local-global interaction:

  1. local attention ("block attention");
  2. global attention ("grid attention").

It's worth mentioning that MaxViT is a convolutional transformer hybrid featuring convolutional layers as well.

It can be used for predictive modeling (including classification, object detection, and instance segmentation) as well as generative modeling.

Annotated version from https://arxiv.org/abs/2204.01697
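To make the block vs. grid decomposition more concrete, here is a small sketch (not the authors' code; shapes are purely illustrative): block attention groups neighboring tokens inside non-overlapping windows, while grid attention groups tokens that are strided across the entire feature map.

```python
import torch

def window_partition(x, p):
    # x: (B, H, W, C) -> (B * num_windows, p*p, C); attention within each p x p window ("block attention")
    B, H, W, C = x.shape
    x = x.view(B, H // p, p, W // p, p, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)

def grid_partition(x, g):
    # x: (B, H, W, C) -> (B * num_cells, g*g, C); each group contains tokens strided
    # across the whole feature map ("grid attention")
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)

x = torch.randn(1, 16, 16, 64)
print(window_partition(x, 4).shape)  # (16, 16, 64): 16 local windows of 16 tokens each
print(grid_partition(x, 4).shape)    # (16, 16, 64): 16 global groups of 16 strided tokens
```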


As a side note, a search for "vision transformer" on Google Scholar yields over 5,000 results for 2022 alone. This high number of results, while potentially including false positives, demonstrates the widespread popularity and interest in vision transformers.



But no worries, vision transformers won't entirely replace our beloved convolutional neural networks. Instead, as MaxViT highlights, the current trend is toward combining vision transformers and convolutional networks into hybrid architectures.


3) Stable Diffusion

Before ChatGPT stole the show, it was not too long ago that Stable Diffusion was all over the internet and social media. Stable Diffusion is based on the paper High-Resolution Image Synthesis with Latent Diffusion Models, which was uploaded in December 2021. But since the paper was presented at CVPR 2022 and got the spotlight with the release of Stable Diffusion in August 2022, I think it's fair to include it in this 2022 list.

Diffusion models (the topic of the first Ahead of AI issue) are probabilistic models designed to learn the distribution of a dataset by gradually denoising a normally distributed variable. This corresponds to learning the reverse process of a fixed Markov chain of length T.

Illustration of a diffusion model
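For readers who prefer code over prose, here is a minimal DDPM-style sketch of the fixed forward (noising) process and the learned reverse (denoising) chain; `denoiser` is a hypothetical model that predicts the added noise, and the schedule values are illustrative.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # fixed variance schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Forward process: jump straight to step t by adding appropriately scaled Gaussian noise."""
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise, noise

@torch.no_grad()
def p_sample_loop(denoiser, shape):
    """Reverse process: start from pure noise and denoise step by step (the learned Markov chain)."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = denoiser(x, torch.full((shape[0],), t))
        ab, a = alphas_bar[t], 1.0 - betas[t]
        x = (x - (1 - a) / (1 - ab).sqrt() * eps) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```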


Unlike GANs, which are trained using a minimax game between a generator and a discriminator, diffusion models are likelihood-based models trained using maximum likelihood estimation (MLE). This can help to avoid mode collapse and other training instabilities.

Diffusion models have been around for some time (see Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015) but were notoriously expensive to sample from during training and inference. The authors of the 2022 paper above mentioned a runtime of 5 days to sample 50k images.

The High-Resolution Image Synthesis with Latent Diffusion Models paper's novelty lies in applying diffusion in a latent space obtained from pretrained autoencoders, instead of operating directly in the full-resolution pixel space of the original images.

Source: https://arxiv.org/abs/2112.10752


The training process can be described in two phases: First, pretrain an autoencoder to encode input images into a lower-dimensional latent space to reduce complexity. Second, train diffusion models on the latent representations of the pretrained autoencoder.

Operating in latent space reduces the computational costs and complexity of diffusion models for training and inference and can generate high-quality results.
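As a minimal sketch of these two phases (assuming hypothetical `vae` and `unet` modules and a simplified noise schedule, not the actual latent diffusion codebase):

```python
import math
import torch
import torch.nn.functional as F

def latent_diffusion_step(vae, unet, images, T=1000):
    """One training-step sketch: the diffusion happens in latent space, not pixel space."""
    with torch.no_grad():
        z0 = vae.encode(images)                    # phase 1: frozen, pretrained autoencoder
    t = torch.randint(0, T, (z0.size(0),))         # random diffusion timestep per example
    noise = torch.randn_like(z0)
    ab = torch.cos(t.float() / T * math.pi / 2).view(-1, 1, 1, 1) ** 2  # toy cosine schedule
    zt = ab.sqrt() * z0 + (1 - ab).sqrt() * noise  # noise the latents, not the pixels
    return F.mse_loss(unet(zt, t), noise)          # phase 2: train the denoiser in latent space

# At sampling time, the reverse chain also runs in latent space, and a single
# vae.decode(...) call maps the final latent back to a full-resolution image.
```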

Another contribution of this paper is the cross-attention mechanism for general conditioning. So, in addition to unconditional image generation, the proposed latent diffusion model is capable of inpainting, class-conditional image synthesis, super-resolution, and text-to-image synthesis -- the latter is what made DALLE-2 and Stable Diffusion so famous.
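A rough sketch of what such cross-attention conditioning looks like, with illustrative dimensions rather than the paper's exact ones: the flattened image latents act as queries, and text-encoder tokens supply the keys and values.

```python
import torch
import torch.nn as nn

latent_dim, cond_dim, n_heads = 320, 768, 8    # illustrative sizes, not the paper's

to_q = nn.Linear(latent_dim, latent_dim)
to_kv = nn.Linear(cond_dim, 2 * latent_dim)
attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)

z = torch.randn(1, 64 * 64, latent_dim)        # flattened spatial latents (queries)
text = torch.randn(1, 77, cond_dim)            # e.g., text-encoder embeddings (keys/values)

q = to_q(z)
k, v = to_kv(text).chunk(2, dim=-1)
out, _ = attn(q, k, v)                         # latents conditioned on the text tokens
print(out.shape)                               # torch.Size([1, 4096, 320])
```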


Seven Other Paper Highlights of 2022

Here are short summaries of seven other papers that I had on my original top-10 list:


Open Source Highlights

Scikit-learn 1.2

The new version of my favorite machine learning library, scikit-learn, came out in December. My highlights are around the HistGradientBoostingClassifier, scikit-learn's histogram-based gradient boosting implementation inspired by LightGBM.

The HistGradientBoostingClassifier now supports the following (a short usage sketch follows the list):

  1. interaction constraints (in trees, features that appear along a particular path are considered as "interacting");
  2. class weights;
  3. feature names for categorical features.
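
```python
# Minimal usage sketch with made-up toy data; assumes scikit-learn >= 1.2.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier

X = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62, 23],
    "income": [40, 52, 81, 60, 95, 30],
    "city":   [0, 1, 0, 2, 1, 2],       # categorical feature, already ordinal-encoded
})
y = [0, 0, 1, 1, 1, 0]

clf = HistGradientBoostingClassifier(
    interaction_cst=[{0, 1}, {2}],       # (1) "age"/"income" may interact; "city" stays on its own
    class_weight="balanced",             # (2) class weights
    categorical_features=["city"],       # (3) categorical features referenced by column name
)
clf.fit(X, y)
print(clf.predict(X))
```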

PaLM + RLHF - Pytorch (WIP)

PaLM + RLHF - Pytorch is an implementation of RLHF (Reinforcement Learning from Human Feedback) on top of the PaLM architecture mentioned in the top-10 paper list above. It may be the first open-source equivalent of ChatGPT. The big caveat is that it doesn't come with pretrained weights (yet).

Echo

Remember OpenAI's Whisper model from Ahead of AI Issue #1? Whisper (also included in the top-10 paper list above) is a large speech recognition model that generates high-quality transcripts and subtitles.

I've tried many subtitle generators over the years, and I am genuinely impressed by the quality of the subtitles it generates. In fact, I was so impressed that I used it to generate the subtitles for the Deep Learning Fundamentals course as well! (PS: it's also one of the few models that handle non-native accents well!)
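For those who want to try it themselves, here is a minimal sketch using the open-source whisper package (the file name is just a placeholder, and ffmpeg must be installed for audio decoding):

```python
import whisper

model = whisper.load_model("base")
result = model.transcribe("lecture.mp4")   # placeholder file name

# Each segment carries start/end timestamps, which is what makes subtitle
# generation straightforward.
for seg in result["segments"]:
    print(f"{seg['start']:7.2f} --> {seg['end']:7.2f}  {seg['text'].strip()}")
```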

Last month we released the Echo app, which provides a user-friendly interface around Whisper that lets you drag & drop videos directly from your computer. Like Muse, which runs Stable Diffusion, Echo is another example of putting a large model into production and scaling it to multiple users using the Lightning AI framework.



Of course, Echo is fully open-source. Check out the Deploy OpenAI Whisper as a Cloud Product tutorial to learn how this was built.


Notable Quote

If you devote yourself to?anything?diligently for ten years, that will make you an expert. (That is the time that it would take to earn two master's degrees and a doctorate.)
-- Elizabeth Gilbert


Announcement: Deep Learning Fundamentals, Unit 3

I have been heads-down working on Deep Learning Fundamentals -- Learn Deep Learning With a Modern Open-Source Stack!

This free course teaches you deep learning from the ground up, from machine learning basics to training state-of-the-art deep neural networks on multiple GPUs in PyTorch.

I am happy to share that Unit 3 is out now. In Unit 2, we introduced PyTorch as a tensor/array library. In Unit 3, we take it a step further and talk about automatic differentiation!
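As a tiny teaser of what automatic differentiation means in practice (a generic PyTorch example, not the course code):

```python
# Autograd tracks operations on tensors with requires_grad=True and computes
# gradients automatically via .backward().
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
loss = (w * x - 1.0) ** 2   # simple scalar computation graph
loss.backward()
print(w.grad)               # d(loss)/dw = 2 * (w*x - 1) * x = 30.0
```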



I would love to hear your feedback! Please reach out on social media or in the forum.


Study & Productivity Tips


A Yearly Review

One of the things that keeps me sane is doing a weekly review (more on that in another newsletter). Since it's the end of the year, let me take this opportunity to write about my yearly review routine, which I usually do around New Year's Eve.

It's an exercise to reflect on your accomplishments (which hopefully serves as motivation) and what you want to accomplish next (to keep you focused).


Step 1: Looking Back

First, I collect stats, like reviewing my overall financials (expenses, income, savings), citations, website visits, etc., and compare them to last year's. I also look at the books I read and the courses I've taken.

Second, I go through my calendar and project folders (a topic for another newsletter, but I also wrote about it here) and write down 2-4 highlights for each month.

Third, I go through all the pictures I took this year and pick 1-2 for each month to create a little diary/album I print out (highlighting key events, trips, etc.) to keep a nice and concise summary.

Fourth, I take some time to reflect and think about everything that went well and didn't go so well. And I write down a key lesson for the new year ahead.


Step 2: Looking Forward

After the review is complete, I take some time to reflect on the main things I want to accomplish in the coming year. This is the most challenging part of the yearly review, and I recommend taking time for it.

Naming 2-3 big things I want to accomplish in the coming year is an excellent exercise to stay focused. Over the year, I accumulate a lot of small but interesting projects and responsibilities, such as maintaining small hobby open-source projects. It's hard to let things go, but sometimes it's necessary, because letting go is the only way to make room for something new.

In that process, I also plan my curriculum for the year. This usually includes textbooks I want to read and courses I want to take. I am often way too optimistic about what I can actually finish, but it's always good to have a plan.


Machine Learning Humor

"Move fast and break things" -- Software 2.0, 2004

"Scrape vast and generate fake things" -- Software 3.0, 2023



Are you interested in more AI-related news, musings, and educational material but don't want to wait until the next newsletter issue? Follow me on Twitter, LinkedIn, or check out my books.
