Friday, March 29, 2024

Summary of Outlier's tutorial: Cross Attention

Self-attention: we'd like to compute a weighted sum of feature vectors for a sequence of inputs, whether it is a paragraph of text or a list of image segments. The feature vector for each input is obtained by applying the Value matrix. The weight comes from the dot product between the query (the Query matrix applied to the vector in focus) and the key (the Key matrix applied to each vector in the input), typically scaled and passed through a softmax.
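
A minimal NumPy sketch of this, with toy sizes (seq_len, d_model, d_k, d_v are made up for illustration and not from the tutorial):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 4, 8, 8, 8    # toy sizes (assumed)

X = rng.normal(size=(seq_len, d_model))    # input vectors (tokens or image segments)
W_q = rng.normal(size=(d_model, d_k))      # Query matrix
W_k = rng.normal(size=(d_model, d_k))      # Key matrix
W_v = rng.normal(size=(d_model, d_v))      # Value matrix

Q = X @ W_q                                # one query per input vector
K = X @ W_k                                # one key per input vector
V = X @ W_v                                # feature (value) vectors

scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
weights = softmax(scores, axis=-1)         # attention weights, each row sums to 1
output = weights @ V                       # weighted sum of value vectors
print(output.shape)                        # (seq_len, d_v)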

Cross-attention: now consider having two inputs: one image and one text. The Query matrix is applied to the image, while the Key and Value matrices are applied to the text. Compute the attention between each image fragment and each word, take these values as weights, and apply them to the text's value vectors to form the weighted sum. You can then multiply this vector by another matrix to project back to the dimension of the image input and construct a (different) image.
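
The same sketch adapted to cross-attention: queries come from image patches, keys and values from text tokens, and a final projection returns to the image dimension. All sizes and the output matrix W_out are illustrative assumptions, not details from the tutorial:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_patches, d_img = 6, 16                   # image fragments and their dimension (assumed)
n_tokens, d_txt = 5, 12                    # text tokens and their dimension (assumed)
d_k, d_v = 8, 8

img = rng.normal(size=(n_patches, d_img))  # image input
txt = rng.normal(size=(n_tokens, d_txt))   # text input

W_q = rng.normal(size=(d_img, d_k))        # Query matrix, applied to the image
W_k = rng.normal(size=(d_txt, d_k))        # Key matrix, applied to the text
W_v = rng.normal(size=(d_txt, d_v))        # Value matrix, applied to the text
W_out = rng.normal(size=(d_v, d_img))      # projects back to the image dimension

Q = img @ W_q
K = txt @ W_k
V = txt @ W_v

scores = Q @ K.T / np.sqrt(d_k)            # attention of each image patch to each word
weights = softmax(scores, axis=-1)
attended = weights @ V                     # text information, weighted per patch
new_img = attended @ W_out                 # back in the image dimension
print(new_img.shape)                       # (n_patches, d_img)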


Reference: https://www.youtube.com/watch?v=aw3H-wPuRcw