Thursday, September 14, 2023

Summary of Shusen Wang's video: Fine-tuning with Softmax Classifier

In Shusen Wang's video Few-shot Learning (3/3), a simple fine-tuning idea is introduced.

1. Softmax classifier

Let x be the raw input. Given its feature vector f(x), multiply it by a matrix W and add a bias b. For example, if the classifier predicts 3 classes, then W has 3 rows and b has 3 entries. Then apply a Softmax.

    p = Softmax(W * f(x) + b)

To initialize W, let each row of W be the average feature vector of one class to be predicted. Let b be 0. Both W and b are trainable during fine-tuning.

For example, if there are 3 classes, and class 1 has 5 support vectors (from the few-shot examples), take the average of those 5 support vectors, call it w1, and let that be row 1 of W. A sketch of this setup follows.
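Here is a minimal sketch of this setup in TypeScript. The function names, the featuresByClass layout, and the helpers are my own assumptions for illustration, not from the video:

    // Hypothetical sketch: initialize W from the class means of support features, b = 0.
    // featuresByClass[k] holds the feature vectors f(x) of the support examples of class k.
    function initClassifier(featuresByClass: number[][][]): { W: number[][]; b: number[] } {
      const W = featuresByClass.map(vectors => {
        const dim = vectors[0].length;
        const mean = new Array(dim).fill(0);
        for (const v of vectors) {
          for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
        }
        return mean; // row k of W = average support feature of class k
      });
      const b = new Array(W.length).fill(0); // bias starts at 0
      return { W, b };
    }

    function softmax(logits: number[]): number[] {
      const max = Math.max(...logits); // subtract max for numerical stability
      const exps = logits.map(z => Math.exp(z - max));
      const total = exps.reduce((a, x) => a + x, 0);
      return exps.map(e => e / total);
    }

    // p = Softmax(W * f(x) + b)
    function predict(W: number[][], b: number[], fx: number[]): number[] {
      const logits = W.map((row, k) => row.reduce((a, wi, i) => a + wi * fx[i], b[k]));
      return softmax(logits);
    }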


2. Regularization

Since fine-tuning on only a few examples can lead to overfitting, a regularization term is introduced during loss minimization. Because p is a probability distribution, entropy regularization can be applied to p: the entropy of p is added to the loss, so training is pushed toward lower-entropy (more confident) predictions.

For example, suppose a prediction p is made over 3 classes and p = [0.33, 0.33, 0.34]. This prediction still picks a class, but it is nearly uniform, i.e. very uncertain. So compute the entropy of p:

     const entropy = p.map( x => - x * Math.log(x) ).reduce( (a, b) => a + b, 0 );

     That is - 0.33 * Math.log(0.33) + - 0.33 * Math.log(0.33) + - 0.34 * Math.log(0.34) ≈ 1.10 in this example, which is essentially the maximum possible entropy for 3 classes (ln(3) ≈ 1.10).

Include that entropy as part of the loss function (multiplied by a weight) so that training discourages this kind of output.
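As a rough sketch of what that combined objective could look like (the cross-entropy term, the entropy helper, and the weight lambda here are my own illustration, not taken from the video):

    // Hypothetical sketch: classification loss on a labeled example plus a weighted
    // entropy penalty on the model's own prediction p.
    function entropy(p: number[]): number {
      return p.map(x => (x > 0 ? -x * Math.log(x) : 0)).reduce((a, b) => a + b, 0);
    }

    function crossEntropy(p: number[], label: number): number {
      return -Math.log(p[label]); // standard classification loss for the true class
    }

    // lambda is the regularization weight; a larger lambda pushes predictions to be more confident.
    function loss(p: number[], label: number, lambda: number): number {
      return crossEntropy(p, label) + lambda * entropy(p);
    }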


3. Replace Matrix multiplication with Cosine similarity

Say W covers 3 classes and thus has 3 rows w1, w2, w3. In W * f(x), each row is multiplied with f(x) via a dot product, e.g. w1 · f(x). Instead of the plain dot product, take the cosine similarity between w1 and f(x).

Basically, take the dot product first, then divide by the product of the norms (lengths) of w1 and f(x). (Equivalently, normalize f(x) and w1 to unit vectors before taking the dot product.)
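A small sketch of that replacement (the function names are my own):

    // Hypothetical sketch: cosine similarity = dot product of the two vectors
    // divided by the product of their lengths.
    function dot(a: number[], b: number[]): number {
      return a.reduce((acc, ai, i) => acc + ai * b[i], 0);
    }

    function cosineSimilarity(a: number[], b: number[]): number {
      const norm = (v: number[]) => Math.sqrt(dot(v, v)); // vector length (L2 norm)
      return dot(a, b) / (norm(a) * norm(b));
    }

    // Each logit k then becomes cosineSimilarity(W[k], fx) + b[k] instead of dot(W[k], fx) + b[k].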




