Thursday, September 14, 2023

Summary of Shusen Wang's video: Fine-tuning with Softmax Classifier

In Shusen Wang's video Few-shot Learning (3/3), a simple fine-tuning idea is introduced.

1. Softmax classifier

Let x be the raw input. Given its feature vector f(x), multiply it by a matrix W and add a bias b. For example, if the classifier predicts 3 classes, then W has 3 rows and b has 3 entries. Then apply a Softmax.

    p = Softmax(W * f(x) + b)

To initialize W, let each row of W be the average feature vector of one class to be predicted. Let b be 0. Both W and b are trainable during fine-tuning.

For example, if there are 3 classes, and class 1 has 5 support vectors (from the few-shot examples), take the average of those 5 support vectors, call it w1, and let that be row 1 of W. A sketch of this setup follows.
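Here is a minimal sketch of this setup in TypeScript. The function names, the featuresByClass layout, and the helpers are my own assumptions for illustration, not from the video:

    // Hypothetical sketch: initialize W from the class means of support features, b = 0.
    // featuresByClass[k] holds the feature vectors f(x) of the support examples of class k.
    function initClassifier(featuresByClass: number[][][]): { W: number[][]; b: number[] } {
      const W = featuresByClass.map(vectors => {
        const dim = vectors[0].length;
        const mean = new Array(dim).fill(0);
        for (const v of vectors) {
          for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
        }
        return mean; // row k of W = average support feature of class k
      });
      const b = new Array(W.length).fill(0); // bias starts at 0
      return { W, b };
    }

    function softmax(logits: number[]): number[] {
      const max = Math.max(...logits); // subtract max for numerical stability
      const exps = logits.map(z => Math.exp(z - max));
      const total = exps.reduce((a, x) => a + x, 0);
      return exps.map(e => e / total);
    }

    // p = Softmax(W * f(x) + b)
    function predict(W: number[][], b: number[], fx: number[]): number[] {
      const logits = W.map((row, k) => row.reduce((a, wi, i) => a + wi * fx[i], b[k]));
      return softmax(logits);
    }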


2. Regularization

Since fine-tuning on only a few examples can lead to overfitting, a regularization term is introduced during loss minimization. Because p is a probability distribution, entropy regularization can be applied to p: the entropy of p is added to the loss, so training is pushed toward lower-entropy (more confident) predictions.

For example, suppose a prediction p is made over 3 classes and p = [0.33, 0.33, 0.34]. This prediction still picks a class, but it is nearly uniform, i.e. very uncertain. So compute the entropy of p:

     const entropy = p.map( x => - x * Math.log(x) ).reduce( (a, b) => a + b, 0 );

     That is - 0.33 * Math.log(0.33) + - 0.33 * Math.log(0.33) + - 0.34 * Math.log(0.34) ≈ 1.10 in this example, which is essentially the maximum possible entropy for 3 classes (ln(3) ≈ 1.10).

Include that entropy as part of the loss function (multiplied by a weight) so that training discourages this kind of output.
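As a rough sketch of what that combined objective could look like (the cross-entropy term, the entropy helper, and the weight lambda here are my own illustration, not taken from the video):

    // Hypothetical sketch: classification loss on a labeled example plus a weighted
    // entropy penalty on the model's own prediction p.
    function entropy(p: number[]): number {
      return p.map(x => (x > 0 ? -x * Math.log(x) : 0)).reduce((a, b) => a + b, 0);
    }

    function crossEntropy(p: number[], label: number): number {
      return -Math.log(p[label]); // standard classification loss for the true class
    }

    // lambda is the regularization weight; a larger lambda pushes predictions to be more confident.
    function loss(p: number[], label: number, lambda: number): number {
      return crossEntropy(p, label) + lambda * entropy(p);
    }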


3. Replace Matrix multiplication with Cosine similarity

Say W covers 3 classes and thus has 3 rows w1, w2, w3. In W * f(x), each row is multiplied with f(x) via a dot product, e.g. w1 · f(x). Instead of the plain dot product, take the cosine similarity between w1 and f(x).

Basically, take the dot product first, then divide by the product of the norms (lengths) of w1 and f(x). (Equivalently, normalize f(x) and w1 to unit vectors before taking the dot product.)
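A small sketch of that replacement (the function names are my own):

    // Hypothetical sketch: cosine similarity = dot product of the two vectors
    // divided by the product of their lengths.
    function dot(a: number[], b: number[]): number {
      return a.reduce((acc, ai, i) => acc + ai * b[i], 0);
    }

    function cosineSimilarity(a: number[], b: number[]): number {
      const norm = (v: number[]) => Math.sqrt(dot(v, v)); // vector length (L2 norm)
      return dot(a, b) / (norm(a) * norm(b));
    }

    // Each logit k then becomes cosineSimilarity(W[k], fx) + b[k] instead of dot(W[k], fx) + b[k].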




