#37 What's that new Transformer for videos?
Before we start, I want to share something hilarious. Remember the 2017 paper on attention in Transformers that we talked about? In 2021, Google researchers published a paper that directly challenged those 2017 findings, and they made sure the message was loud and clear by titling it "Attention is not all you need." A bold title, indeed.
So, basically, the researchers ran experiments analyzing the behaviour of the self-attention mechanism stripped of every other component of the Transformer, and found that on its own it is practically useless: tested without the other important parts, it simply doesn't work well. But here's the interesting part: put self-attention back together with two other key components, skip connections and the MLP, and it becomes incredibly powerful.
Transformers for videos: TimeSformer
The main paper that introduces transformers for videos is "Is Space-Time Attention All You Need for Video Understanding?", which presents the TimeSformer architecture.
We know that a video is simply a sequence of frames.
The only addition compared to an image is that we have to take time into account along with the frames. So we need an attention mechanism that captures the variation between consecutive frames.
We need new attention mechanisms that compute attention axially, scattered, or jointly between space and time. Axial attention is computed along specific axes of the video data: the spatial dimensions (the width and height of the frames) or the temporal dimension (the sequence of frames over time). Scattered attention spreads attention across various parts of the video rather than focusing on specific regions, which is useful for capturing global context or detecting motion patterns. Joint attention between space and time integrates information from both dimensions simultaneously, so the model captures both static features (space) and dynamic changes (time) in the video.
Divided Space-Time Attention
The idea is that, given a frame at instant t and one of its patches as the query, we first compute spatial attention over the whole frame and then temporal attention over the same patch position across the preceding and following frames. This scheme learns more distinct features than the other approaches, so it is better at telling apart videos from different categories.
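To make the idea concrete, here is a toy sketch (not the paper's or the repo's code) of divided attention over a token grid of shape (batch, frames, patches, dim). It uses plain dot-product attention with q = k = v and omits learned projections, heads, and the classification token; it only shows how the spatial and temporal passes are separated by reshaping.

```python
import torch
from einops import rearrange

def dot_attn(x):
    # plain softmax(QK^T)V with q = k = v = x, for illustration only
    sim = torch.einsum('b i d, b j d -> b i j', x, x) * x.shape[-1] ** -0.5
    return torch.einsum('b i j, b j d -> b i d', sim.softmax(dim=-1), x)

def divided_space_time(x):  # x: (batch, frames, patches, dim)
    b = x.shape[0]
    # spatial attention: each frame attends over its own patches
    s = rearrange(x, 'b f p d -> (b f) p d')
    s = rearrange(dot_attn(s), '(b f) p d -> b f p d', b=b)
    # temporal attention: each patch location then attends across the frames
    t = rearrange(s, 'b f p d -> (b p) f d')
    return rearrange(dot_attn(t), '(b p) f d -> b f p d', b=b)

tokens = torch.randn(2, 8, 196, 128)     # 2 clips, 8 frames, 14x14 patches, 128-dim tokens
print(divided_space_time(tokens).shape)  # torch.Size([2, 8, 196, 128])
```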
The researchers found that the higher the resolution, the better the model's accuracy, up to a point. Accuracy also increases as the number of frames increases.
GitHub:
2. Import libraries and initialize data
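The article's notebook isn't reproduced here, but a minimal setup under similar assumptions might look like the following; the shapes are illustrative, and the random tensor simply stands in for a real batch of video clips.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import einsum
from einops import rearrange

# Dummy data standing in for real clips:
# 2 videos, 8 frames each, 3 colour channels, 224x224 pixels.
video = torch.randn(2, 8, 3, 224, 224)
print(video.shape)  # torch.Size([2, 8, 3, 224, 224])
```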
3. Building TimeSformer
We create a class named PreNorm, which inherits from nn.Module. Its constructor takes two parameters: the dimensionality and a function (the sub-layer to wrap). The forward method takes an input tensor x and optional additional arguments, applies layer normalization to x, and then passes the normalized tensor to the wrapped function.
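A minimal sketch of what such a PreNorm wrapper typically looks like (based on common TimeSformer implementations; the code in the repo may differ in details), reusing the imports above:

```python
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.fn = fn                   # the wrapped sub-layer (attention or feed-forward)
        self.norm = nn.LayerNorm(dim)  # normalization applied before the sub-layer

    def forward(self, x, *args, **kwargs):
        # normalize first, then delegate to the wrapped function
        return self.fn(self.norm(x), *args, **kwargs)
```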
The GEGLU class implements a variant of the Gated Linear Unit (GLU) activation. In its forward method, the input tensor x is split into two halves along the last dimension (dim=-1): the first half (x) carries the values to be transformed, and the second half (gates) serves as the gating mechanism. A GELU activation is applied to gates, and the result is multiplied element-wise with x.
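A sketch of the GEGLU just described (again, assuming the usual implementation rather than quoting the repo verbatim):

```python
class GEGLU(nn.Module):
    def forward(self, x):
        # split the last dimension into values and gates
        x, gates = x.chunk(2, dim=-1)
        # gate the values with the GELU-activated gates
        return x * F.gelu(gates)
```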
The FeedForward class defines a feed-forward layer with a configurable architecture. It is a sequential network: a linear layer expands the input dimension to dim * mult * 2, a GEGLU activation follows (halving that back to dim * mult), then a Dropout layer with the specified dropout probability, and a final linear layer maps the intermediate dimension back to the original dimension.
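Putting the pieces together, the feed-forward block could be sketched as follows; note that GEGLU halves the expanded dimension, which is why the last linear layer starts from dim * mult:

```python
class FeedForward(nn.Module):
    def __init__(self, dim, mult=4, dropout=0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult * 2),  # expand; GEGLU halves this again
            GEGLU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mult, dim)       # project back to the model dimension
        )

    def forward(self, x):
        return self.net(x)
```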
4. Divided Space-Time Attention
This is a utility function that computes attention given queries (q), keys (k), and values (v). We first compute the attention scores by taking the dot product between the query and key vectors for every pair of tokens, then obtain the attention weights by applying a softmax along the last dimension of the similarity matrix sim. The weights are finally used to take a weighted sum of the value vectors.
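A sketch of that utility (score scaling, multi-head reshaping, and so on live elsewhere in the full code):

```python
def attn(q, k, v):
    # similarity between every (query, key) pair of tokens
    sim = einsum('b i d, b j d -> b i j', q, k)
    # attention weights along the key dimension
    weights = sim.softmax(dim=-1)
    # weighted sum of the value vectors
    return einsum('b i j, b j d -> b i d', weights, v)
```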
You can check the complete code at the Repo.
5. Execution & Training
Extract frames from videos -> Train the model
Our model uses softmax activation to produce probabilities for each class.
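Neither the frame-extraction script nor the training loop is shown in full here, so the following is only a hedged sketch: extract_frames and train_step are hypothetical helper names, OpenCV is assumed for decoding, and the model is assumed to return per-class logits (F.cross_entropy applies the softmax internally; an explicit softmax then yields the class probabilities mentioned above).

```python
import cv2
import torch
import torch.nn.functional as F

def extract_frames(path, num_frames=8):
    """Decode a video with OpenCV and keep num_frames evenly spaced RGB frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    idx = torch.linspace(0, len(frames) - 1, num_frames).long().tolist()
    clip = torch.stack([torch.from_numpy(frames[i]) for i in idx])
    return clip.permute(0, 3, 1, 2).float() / 255.0  # (frames, channels, H, W)

def train_step(model, optimizer, clips, labels):
    """One optimization step on a batch of clips with integer class labels."""
    logits = model(clips)                    # (batch, num_classes)
    loss = F.cross_entropy(logits, labels)   # softmax + negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    probs = logits.softmax(dim=-1)           # per-class probabilities
    return loss.item(), probs
```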
6. Outputs
Sources: