#37 What's that new Transformer for videos?
Before we start, I want to share something hilarious. Remember the 2017 paper on attention in Transformers that we talked about? In 2021, Google researchers published a paper that directly challenged those 2017 findings, and they made sure the message was loud and clear by titling it "Attention is not all you need." A bold title, indeed.
So, basically, the researchers ran experiments analyzing the behaviour of the self-attention mechanism stripped of every other component of the Transformer, and found that on its own it is practically useless: tested without the other important parts, it simply doesn't work well. But here's the interesting part: put self-attention back together with two other key components, skip connections and the MLP, and it becomes incredibly powerful.
Transformers for videos: TimeSformer
The main paper that introduces transformers for videos is "Is Space-Time Attention All You Need for Video Understanding?", which presents the TimeSformer architecture.
We know that a video is simply a sequence of frames.
The only addition compared to an image is that we have to take time into account along with the frames. So we need an attention mechanism that captures the variation between consecutive frames.
We need new attention mechanisms that compute attention axially, scattered, or jointly between space and time. Axial attention is computed along specific axes of the video data: the spatial dimensions (the width and height of the frames) or the temporal dimension (the sequence of frames over time). Scattered attention spreads attention across various parts of the video rather than focusing on specific regions, which is useful for capturing global context or detecting motion patterns. Joint attention between space and time integrates information from both dimensions simultaneously, so the model captures both static features (space) and dynamic changes (time) in the video.
Divided Space-Time Attention
The idea is that, given a frame at instant t and one of its patches as the query, we first compute spatial attention over the whole frame and then temporal attention over the same patch position across the preceding and following frames. This scheme learns more distinct features than the other approaches, so it is better at telling apart videos from different categories.
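To make the idea concrete, here is a toy sketch (not the paper's or the repo's code) of divided attention over a token grid of shape (batch, frames, patches, dim). It uses plain dot-product attention with q = k = v and omits learned projections, heads, and the classification token; it only shows how the spatial and temporal passes are separated by reshaping.

```python
import torch
from einops import rearrange

def dot_attn(x):
    # plain softmax(QK^T)V with q = k = v = x, for illustration only
    sim = torch.einsum('b i d, b j d -> b i j', x, x) * x.shape[-1] ** -0.5
    return torch.einsum('b i j, b j d -> b i d', sim.softmax(dim=-1), x)

def divided_space_time(x):  # x: (batch, frames, patches, dim)
    b = x.shape[0]
    # spatial attention: each frame attends over its own patches
    s = rearrange(x, 'b f p d -> (b f) p d')
    s = rearrange(dot_attn(s), '(b f) p d -> b f p d', b=b)
    # temporal attention: each patch location then attends across the frames
    t = rearrange(s, 'b f p d -> (b p) f d')
    return rearrange(dot_attn(t), '(b p) f d -> b f p d', b=b)

tokens = torch.randn(2, 8, 196, 128)     # 2 clips, 8 frames, 14x14 patches, 128-dim tokens
print(divided_space_time(tokens).shape)  # torch.Size([2, 8, 196, 128])
```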
The researchers found that the higher the resolution, the better the model's accuracy, up to a point. Accuracy also increases as the number of frames increases.
GitHub:
2. Import libraries and initialize data
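The article's notebook isn't reproduced here, but a minimal setup under similar assumptions might look like the following; the shapes are illustrative, and the random tensor simply stands in for a real batch of video clips.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import einsum
from einops import rearrange

# Dummy data standing in for real clips:
# 2 videos, 8 frames each, 3 colour channels, 224x224 pixels.
video = torch.randn(2, 8, 3, 224, 224)
print(video.shape)  # torch.Size([2, 8, 3, 224, 224])
```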
3. Building TimeSformer
We create a class named PreNorm, which inherits from nn.Module. Its constructor takes two parameters: the dimensionality and a function (the sub-layer to wrap). The forward method takes an input tensor x and optional additional arguments, applies layer normalization to x, and then passes the normalized tensor to the wrapped function.
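A minimal sketch of what such a PreNorm wrapper typically looks like (based on common TimeSformer implementations; the code in the repo may differ in details), reusing the imports above:

```python
class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.fn = fn                   # the wrapped sub-layer (attention or feed-forward)
        self.norm = nn.LayerNorm(dim)  # normalization applied before the sub-layer

    def forward(self, x, *args, **kwargs):
        # normalize first, then delegate to the wrapped function
        return self.fn(self.norm(x), *args, **kwargs)
```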
The GEGLU class implements a variant of the Gated Linear Unit (GLU) activation. In its forward method, the input tensor x is split into two halves along the last dimension (dim=-1): the first half (x) carries the values to be transformed, and the second half (gates) serves as the gating mechanism. A GELU activation is applied to gates, and the result is multiplied element-wise with x.
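A sketch of the GEGLU just described (again, assuming the usual implementation rather than quoting the repo verbatim):

```python
class GEGLU(nn.Module):
    def forward(self, x):
        # split the last dimension into values and gates
        x, gates = x.chunk(2, dim=-1)
        # gate the values with the GELU-activated gates
        return x * F.gelu(gates)
```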
The FeedForward class defines a feed-forward layer with a configurable architecture. It is a sequential network: a linear layer expands the input dimension to dim * mult * 2, a GEGLU activation follows (halving that back to dim * mult), then a Dropout layer with the specified dropout probability, and a final linear layer maps the intermediate dimension back to the original dimension.
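Putting the pieces together, the feed-forward block could be sketched as follows; note that GEGLU halves the expanded dimension, which is why the last linear layer starts from dim * mult:

```python
class FeedForward(nn.Module):
    def __init__(self, dim, mult=4, dropout=0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult * 2),  # expand; GEGLU halves this again
            GEGLU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mult, dim)       # project back to the model dimension
        )

    def forward(self, x):
        return self.net(x)
```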
4. Divided Space-Time Attention
This is a utility function that computes attention given queries (q), keys (k), and values (v). We first compute the attention scores by taking the dot product between the query and key vectors for every pair of tokens, then obtain the attention weights by applying a softmax along the last dimension of the similarity matrix sim. The weights are finally used to take a weighted sum of the value vectors.
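A sketch of that utility (score scaling, multi-head reshaping, and so on live elsewhere in the full code):

```python
def attn(q, k, v):
    # similarity between every (query, key) pair of tokens
    sim = einsum('b i d, b j d -> b i j', q, k)
    # attention weights along the key dimension
    weights = sim.softmax(dim=-1)
    # weighted sum of the value vectors
    return einsum('b i j, b j d -> b i d', weights, v)
```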
You can check the complete code at the Repo.
5. Execution & Training
Extract frames from videos -> Train the model
Our model uses softmax activation to produce probabilities for each class.
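Neither the frame-extraction script nor the training loop is shown in full here, so the following is only a hedged sketch: extract_frames and train_step are hypothetical helper names, OpenCV is assumed for decoding, and the model is assumed to return per-class logits (F.cross_entropy applies the softmax internally; an explicit softmax then yields the class probabilities mentioned above).

```python
import cv2
import torch
import torch.nn.functional as F

def extract_frames(path, num_frames=8):
    """Decode a video with OpenCV and keep num_frames evenly spaced RGB frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    idx = torch.linspace(0, len(frames) - 1, num_frames).long().tolist()
    clip = torch.stack([torch.from_numpy(frames[i]) for i in idx])
    return clip.permute(0, 3, 1, 2).float() / 255.0  # (frames, channels, H, W)

def train_step(model, optimizer, clips, labels):
    """One optimization step on a batch of clips with integer class labels."""
    logits = model(clips)                    # (batch, num_classes)
    loss = F.cross_entropy(logits, labels)   # softmax + negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    probs = logits.softmax(dim=-1)           # per-class probabilities
    return loss.item(), probs
```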
6. Outputs
Sources: