Self/cross hard/soft attention

NEW LECTURE

“Set to set” and “set to vector” mappings using self/cross hard/soft attention. We combined one (two) attention module(s) with a k=1 1D convolution to get a transformer encoder (decoder).
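For those who want to play before opening the notebook, here is a minimal sketch of such an encoder block: one self-attention module followed by a k=1 1D convolution. This is not the notebook's code; the class name and sizes are made up for illustration, and I'm using PyTorch's built-in nn.MultiheadAttention with a single head.

import torch
from torch import nn

class EncoderBlock(nn.Module):
    # one self-attention module + one k=1 1D convolution (illustrative sketch)
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=1)
        self.conv = nn.Conv1d(d, d, kernel_size=1)    # per-symbol dim. adapter

    def forward(self, x):                  # x: (t, batch, d), a *set* of t symbols
        h, _ = self.attn(x, x, x)          # query = key = value = x → self-attention
        h = self.conv(h.permute(1, 2, 0))  # (batch, d, t): mixes channels, not positions
        return h.permute(2, 0, 1)          # back to (t, batch, d): set-to-set mapping

x = torch.randn(10, 2, 64)                 # a set of t=10 symbols, batch of 2, d=64
y = EncoderBlock(64)(x)                    # same shape out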

Slides: https://github.com/Atcold/pytorch-Deep-Learning/blob/master/slides/10%20-%20Attention%20%26%20transformer.pdf

Notebook: https://github.com/Atcold/pytorch-Deep-Learning/blob/master/15-transformer.ipynb

This week's slides were quite dense, but we've been building up momentum since the beginning of class, 3 months ago.

We recalled concepts from:

• Linear Algebra (Ax as a lin. comb. of A's columns weighted by x's components, or as scalar products of A's rows against x)

• Recurrent Nets (stacking x[t] with h[t-1] and concatenating W_x and W_h)

• Autoencoders (encoder-decoder architecture)

• k=1 1D convolutions (which do not assume correlation between neighbouring features and act as a dimension adapter)

and put them into practice with PyTorch (quick check below).
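Two of these, as quick PyTorch sanity checks (toy tensors, made-up sizes, not the notebook's code): Ax as a linear combination of A's columns, and a k=1 Conv1d behaving as a per-position Linear layer, i.e. a dimension adapter that never mixes neighbouring positions.

import torch
from torch import nn

# Ax = Σ_j x[j] · A[:, j]  (lin. comb. of A's columns, weighted by x's components)
A, x = torch.randn(3, 4), torch.randn(4)
assert torch.allclose(A @ x, sum(x[j] * A[:, j] for j in range(4)), atol=1e-6)

# k=1 1D conv = the same linear map applied independently at every position
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=1)
lin = nn.Linear(8, 16)
lin.weight.data = conv.weight.data.squeeze(-1)    # share the parameters
lin.bias.data = conv.bias.data
X = torch.randn(1, 8, 5)                          # (batch, features, positions)
assert torch.allclose(conv(X), lin(X.transpose(1, 2)).transpose(1, 2), atol=1e-6)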

Notice how you can smoothly go from hard to soft attention by switching between argmax and softargmax (which most of you still call “softmax”). Hard attention implies a one-hot vector a, while soft attention gives you pseudo-probabilities.
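In code, the switch is literally one line (toy scores, just to illustrate):

import torch

scores = torch.tensor([1.0, 3.0, 0.5])   # similarity of one query to three keys

a_hard = torch.zeros_like(scores)        # hard attention: a is one-hot
a_hard[scores.argmax()] = 1.0            # → tensor([0., 1., 0.])

a_soft = scores.softmax(dim=0)           # soft attention (softargmax):
                                         # pseudo-probabilities summing to 1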

Once again, this architecture deals with *sets* of symbols!

There is no order. Therefore, computations can be massively parallelised (they are just a bunch of matrix products, after all).

Just be aware of that t × t A matrix, which could blow up if t (your set length) is large.
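Concretely (made-up sizes): for self-attention over a set of t symbols of dimension d, the score matrix A has t² entries, so its memory footprint grows quadratically with the set length.

import torch

t, d = 1_000, 64
X = torch.randn(t, d)                     # a set of t symbols
A = (X @ X.T / d**0.5).softmax(dim=-1)    # the t × t attention matrix
print(A.shape, A.numel())                 # torch.Size([1000, 1000]) 1000000
# doubling t quadruples A: be careful with long sets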

One final recap: there is *one* and *only one* _query_ (I'd like to cook a lasagna) that I'm going to check against *all* _keys_ (recipe titles) in order to retrieve *one* (if hard) or *a mixed* (if soft) _value_ (the recipe to prepare my dinner with).
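The same story in tensors (made-up sizes, not the notebook's code): one query checked against all keys gives the attention vector a; hard attention retrieves exactly one value, soft attention returns a mixture of values.

import torch

d_k, t, d_v = 8, 5, 16
q = torch.randn(d_k)                      # my one query: "lasagna"
K = torch.randn(t, d_k)                   # all recipe titles (keys)
V = torch.randn(t, d_v)                   # all recipe preparations (values)

scores = K @ q                            # how well each title matches the query
a = scores.softmax(dim=0)                 # soft: pseudo-probabilities over recipes
dinner_soft = a @ V                       # a *mixed* value (blend of recipes)
dinner_hard = V[scores.argmax()]          # hard: exactly *one* recipe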

Me, hungry, during class = decoder.

My granny, knowing all recipe names (keys) and preparations (values) = encoder.

Me, figuring out what I want = self-attention.

Me, asking granny = cross-attention.

Dinner = yay!

I'm done.


Next week: Graph Neural Nets (if it takes me less than a week to learn about them).


https://twitter.com/alfcnz/status/1252802274080022528
