Thursday, September 14, 2023

Summary of Shusen Wang's video: Fine-tuning with Softmax Classifier

In Shusen Wang's video Few-shot learning (3/3), he introduces a simple fine-tuning idea.

1. Softmax classifier

Let x be the raw input and f(x) be its feature vector. Multiply f(x) with a matrix W and add a bias b. For example, if the classifier predicts 3 classes, then W should have 3 rows and b should have 3 entries. Then take a Softmax.

    p = Softmax(W * f(x) + b)

To initialize W, let each row of W be the average feature vector of one class to be predicted, and let b be 0. W and b are trainable (this is the fine-tuning).

For example, if there are 3 classes, and class 1 has 5 support vectors (from few-shot examples), take the average of the 5 support vectors, call it w1, and let that be row 1 of W.
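
Here is a minimal JavaScript sketch of this initialization and the forward pass. The toy 2-dimensional features and the helper names (mean, predict, etc.) are mine, not from the video.

    // Toy sketch: 3 classes, each with a few 2-dimensional support feature vectors.
    const supportFeatures = [
        [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]],   // class 1
        [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1]],   // class 2
        [[0.7, 0.7], [0.6, 0.8], [0.8, 0.6]],   // class 3
    ];

    const mean = vectors =>
        vectors[0].map((_, d) => vectors.reduce((s, v) => s + v[d], 0) / vectors.length);

    // Row k of W is the average support feature vector of class k; b starts at 0.
    // Both are then adjusted during fine-tuning.
    let W = supportFeatures.map(mean);
    let b = [0, 0, 0];

    const dot = (u, v) => u.reduce((s, x, i) => s + x * v[i], 0);
    const softmax = z => {
        const e = z.map(x => Math.exp(x - Math.max(...z)));
        const total = e.reduce((a, x) => a + x, 0);
        return e.map(x => x / total);
    };

    // p = Softmax(W * f(x) + b) for a query feature vector fx.
    const predict = fx => softmax(W.map((row, k) => dot(row, fx) + b[k]));

    console.log(predict([1.0, 0.1]));   // class 1 should get the highest probability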


2. Regularization

Since this trick may lead to overfitting, a regularization term is added during loss minimization. Because p is a probability distribution, entropy regularization can be applied to p: pushing the entropy of p lower becomes part of the loss function.

For example, a prediction p is made for 3 classes, and p = [0.33, 0.33, 0.34]. This prediction may still pick the right class, but it is nearly uniform and not confident. So compute the entropy of this distribution:

     entropy = p.map( x => - x * Math.log(x) ).reduce( (a, b) => a + b, 0 )

     That is - 0.33 * Math.log(0.33) - 0.33 * Math.log(0.33) - 0.34 * Math.log(0.34) ≈ 1.10 in this example, which is close to the maximum possible entropy ln(3) ≈ 1.10 for a uniform prediction over 3 classes.

Include that entropy as part of your loss function (multiplied by a weight) so that training discourages this kind of output.
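
A rough sketch of what the combined loss could look like; crossEntropy, the weight lambda, and the numbers below are illustrative assumptions, not the video's exact formulation.

    // Entropy of a probability vector p (same formula as above).
    const entropy = p => p.reduce((s, x) => s - x * Math.log(x), 0);

    // Cross-entropy of the prediction p against the index of the true class.
    const crossEntropy = (p, labelIndex) => -Math.log(p[labelIndex]);

    // Total loss: the usual classification loss plus a weighted entropy penalty,
    // so confident (low-entropy) predictions are encouraged.
    const lambda = 0.1;                                 // regularization weight (a hyperparameter)
    const loss = (p, labelIndex) => crossEntropy(p, labelIndex) + lambda * entropy(p);

    console.log(loss([0.33, 0.33, 0.34], 2));           // near-uniform prediction: higher loss
    console.log(loss([0.05, 0.05, 0.90], 2));           // confident, correct prediction: lower loss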


3. Replace Matrix multiplication with Cosine similarity

Say W covers 3 classes and thus has 3 rows w1, w2, w3. In W * f(x), each row is multiplied with f(x) via a dot product, for example w1 * f(x). Instead of a dot product, take the cosine similarity between w1 and f(x).

Basically, take the dot product first, then divide by the product of the norms (lengths) of w1 and f(x). (Equivalently, normalize f(x) and w1 to unit vectors before performing the dot product.)
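
A small sketch of that replacement (the helper names are mine):

    const dot = (u, v) => u.reduce((s, x, i) => s + x * v[i], 0);
    const norm = u => Math.sqrt(dot(u, u));

    // Cosine similarity: normalize both vectors to unit length, then take the dot product.
    const cosineSimilarity = (u, v) => dot(u, v) / (norm(u) * norm(v));

    // Each class score is now cos(w_k, f(x)) instead of the plain dot product w_k * f(x).
    const scores = (W, fx) => W.map(row => cosineSimilarity(row, fx));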





Friday, September 08, 2023

Key, Query, Value Matrices in Masked Self-Attention of Decoder-Only Transformers

   StatQuest uploaded a good video explaining how a Decoder-Only transformer works. Most of the content is about how the Key, Query, and Value matrices are calculated. It is quite complex, so here I am going to explain it in a more intuitive way (based on my own understanding).

A sentence is first parsed into tokens, and each token has an embedding and a position i in the sentence. The word embedding + the positional encoding make a vector for the word at position i. Nothing special so far.

Now at each token at position i, using this vector (call it word_vector_i), we'd like to encode another vector to represent the context in the sentence so far. This new vector at i should be based on the vector for this word at i and all the previous words at positions 1 through i-1. To combine these vectors, we are going to take a weighted sum. This is the overall idea.

    vector_with_context (i) =  w1 * value_vector_1 + w2 * value_vector_2 + ... + wi * value_vector_i

But wait, it is not nice to directly use the embedding + positional encoding (word_vector_i) as the value_vector_i. Instead, we will transform it with a matrix (Mv). Mv will be adjustable and learned. So,

    value_vector_i = Mv * word_vector_i

Weight w1 measures how related the 1st word is to the ith word, weight w2 measures how related the 2nd word is to the ith word, and so on. To find out how similar two words are, we apply a dot product to the vectors of the two words.

But wait, it is not nice to directly use the embedding + positional encoding (word_vector_i), so we again are going to transform word_vector_i with a matrix... Actually, two matrices - one matrix (Mq) for transforming word_vector_i and one matrix (Mk) for transforming the word_vectors at and before the ith position.

    query_vector_i = Mq * word_vector_i

    key_vector_1 = Mk * word_vector_1 

    key_vector_2 = Mk * word_vector_2

    ...

    key_vector_(i-1) = Mk * word_vector_(i-1)

    key_vector_i = Mk * word_vector_i

Mq and Mk will be adjustable and learned. The weights can be calculated as

    wj = query_vector_i * key_vector_j   

But wait, these raw weights are not nice. So we take all the weights and run a softmax to get better-scaled weights (which sum to 1). Applying these weights to the value_vectors, vector_with_context(i) is calculated. vector_with_context(i) is called Masked Self-Attention.
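
Putting the pieces above together, here is a rough JavaScript sketch of computing vector_with_context(i) for one position. The tiny 2x2 matrices and 2-dimensional word vectors are made-up placeholders; real models use much larger, learned matrices.

    // word_vectors: embedding + positional encoding for each token (toy 2-dimensional vectors).
    const wordVectors = [
        [0.2, 0.8],   // word 1
        [0.9, 0.1],   // word 2
        [0.5, 0.5],   // word 3 (the current position i)
    ];

    // Learned transformation matrices (placeholder values).
    const Mq = [[1.0, 0.0], [0.0, 1.0]];
    const Mk = [[0.5, 0.5], [0.5, -0.5]];
    const Mv = [[1.0, 0.5], [-0.5, 1.0]];

    const matVec = (M, v) => M.map(row => row.reduce((s, m, j) => s + m * v[j], 0));
    const dot = (u, v) => u.reduce((s, x, j) => s + x * v[j], 0);
    const softmax = z => {
        const e = z.map(x => Math.exp(x - Math.max(...z)));
        const total = e.reduce((a, x) => a + x, 0);
        return e.map(x => x / total);
    };

    const i = wordVectors.length - 1;                       // current position (0-based here)
    const queryI = matVec(Mq, wordVectors[i]);              // query_vector_i = Mq * word_vector_i

    // Keys and values only for positions up to i (the "masked" part: no peeking at later words).
    const keys = wordVectors.slice(0, i + 1).map(v => matVec(Mk, v));
    const values = wordVectors.slice(0, i + 1).map(v => matVec(Mv, v));

    // Raw weights are query-key dot products; softmax rescales them so they sum to 1.
    const weights = softmax(keys.map(k => dot(queryI, k)));

    // vector_with_context(i) = weighted sum of the value vectors.
    const vectorWithContext = values[0].map((_, d) =>
        weights.reduce((s, w, j) => s + w * values[j][d], 0));

    console.log(vectorWithContext);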

To predict the next word at position i+1, just pass vector_with_context(i) through a fully connected layer to get a result vector representing the probability of each word in the dictionary.

But wait, using only the masked self-attention (vector_with_context(i)) isn't nice; we'd like to sum it with the embedding and positional encoding (aka word_vector_i as described above). So the prediction of the next word really depends on 3 things. Since we are summing a later vector with an earlier vector, this becomes a residual link in the network.

(Note: since the residual link sums the masked self-attention output and the word embedding, their dimensions have to match. This also means Mv has to produce vectors of the same size as the embedding, and in this simple setup Mq and Mk do too. So the sizes of the matrices are predetermined.)

Of course, a softmax is applied to the result vector to turn it into proper probabilities.
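
Continuing the sketch above (and reusing its dot, softmax, wordVectors, i, and vectorWithContext), the prediction step could look roughly like this; the fully connected weights and the tiny 4-word dictionary are made-up placeholders.

    // Residual link: the masked self-attention output plus the original embedding + positional encoding.
    const residual = vectorWithContext.map((x, d) => x + wordVectors[i][d]);

    // Placeholder fully connected layer: one weight row per word in a tiny 4-word dictionary.
    const fcWeights = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.9], [0.1, 0.1]];   // made-up values
    const logits = fcWeights.map(row => dot(row, residual));

    // Softmax turns the result vector into a probability for each word in the dictionary.
    const nextWordProbs = softmax(logits);
    console.log(nextWordProbs);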

  - What if it predicts the next word in my prompt wrong?

      Run your optimizer to train Mk, Mq, Mv, and the fully connected layer to make it right.

 - What if I want to generate a reply?

      Repeat the process (without training) at every position in your prompt. At the end of the prompt (at the end-of-sentence token), let the transformer predict the next word. Your transformer is now generating a reply! Keep outputting the next word and appending it to the end of the sequence until it outputs the end-of-sentence token (see the rough sketch below).
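
In rough JavaScript, the generation loop might look like this; predictNextToken stands in for the whole transformer forward pass described above, and endOfSentence for the end-of-sentence token (both are placeholders).

    // Autoregressive generation: keep appending the predicted token until end-of-sentence.
    // predictNextToken(tokens) is assumed to run the transformer and return the most likely next token.
    const generateReply = (promptTokens, predictNextToken, endOfSentence, maxLength = 100) => {
        const tokens = [...promptTokens];
        while (tokens.length < maxLength) {
            const next = predictNextToken(tokens);   // run the transformer on everything so far
            if (next === endOfSentence) break;
            tokens.push(next);
        }
        return tokens.slice(promptTokens.length);    // the generated reply
    };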