In Shushen Wang's video Few-shot learning (3/3), a simple fine-tuning approach is introduced.
1. Softmax classifier
Let x be the raw input and f(x) its feature vector. Multiply f(x) by a matrix W and add a bias b. For example, if the classifier predicts 3 classes, then W has 3 rows and b has 3 entries. Then take a Softmax:
p = Softmax(W * f(x) + b)
To initialize W, let each row of W be the average feature vector of one class to be predicted, and let b be 0. Both W and b are trainable (this is the fine-tuning).
For example, if there are 3 classes and class 1 has 5 support vectors (the feature vectors of its few-shot examples), average those 5 vectors, call the result w1, and let that be row 1 of W.
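Below is a minimal JavaScript sketch of this initialization and of the classifier p = Softmax(W * f(x) + b). The names (initClassifier, supportFeatures, and so on) are only illustrative, and feature vectors are assumed to be plain arrays of numbers.

// Build W from class means of the support features, then classify a query feature vector.
function mean(vectors) {
  const dim = vectors[0].length;
  const avg = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) avg[i] += v[i] / vectors.length;
  }
  return avg;
}

function softmax(logits) {
  const max = Math.max(...logits);                 // shift for numerical stability
  const exps = logits.map(z => Math.exp(z - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}

// supportFeatures: one entry per class, each holding the f(x) vectors of that class's few-shot examples.
function initClassifier(supportFeatures) {
  const W = supportFeatures.map(mean);             // row k = average support feature of class k
  const b = new Array(W.length).fill(0);           // b starts at 0
  return { W, b };
}

// p = Softmax(W * f(x) + b): each logit is the dot product of a row of W with f(x), plus its bias.
function predict({ W, b }, fx) {
  const logits = W.map((w, k) => w.reduce((s, wi, i) => s + wi * fx[i], 0) + b[k]);
  return softmax(logits);
}

// Toy usage: 3 classes with 2-dimensional features.
const classifier = initClassifier([
  [[1, 0], [0.9, 0.1]],   // class 1 support features
  [[0, 1], [0.1, 0.9]],   // class 2 support features
  [[1, 1], [0.8, 1.2]],   // class 3 support features
]);
console.log(predict(classifier, [0.95, 0.05]));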
2. Regularization
Since this trick may lead to overfitting, a regularization term is added when minimizing the loss. Because p is a probability distribution, entropy regularization can be applied to p: the entropy of p becomes part of the loss, so training also pushes the entropy down.
For example, suppose a prediction over 3 classes is p = [0.33, 0.33, 0.34]. This prediction may pick the right class, but it is barely confident. So compute the entropy of p:
const entropy = p.map(x => -x * Math.log(x)).reduce((a, b) => a + b, 0);
That is -0.33 * Math.log(0.33) - 0.33 * Math.log(0.33) - 0.34 * Math.log(0.34) ≈ 1.10 in this example.
Include that entropy as part of the loss function (multiplied by a weight) so that training discourages this kind of near-uniform output.
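A minimal sketch of folding the entropy into the loss is shown below; the helper names and the weight lambda are illustrative, and the weight would need to be tuned.

// Entropy of a predicted distribution p (assumes no exact zeros in p).
function entropy(p) {
  return p.map(x => -x * Math.log(x)).reduce((a, b) => a + b, 0);
}

// Cross-entropy of the prediction against the true class index.
function crossEntropy(p, label) {
  return -Math.log(p[label]);
}

// Total loss for one example: low only when the prediction is both correct and confident.
function loss(p, label, lambda) {
  return crossEntropy(p, label) + lambda * entropy(p);
}

// The near-uniform prediction above is penalized even if class 3 (index 2) is correct:
console.log(loss([0.33, 0.33, 0.34], 2, 0.5));   // -ln(0.34) + 0.5 * 1.10 ≈ 1.63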
3. Replace matrix multiplication with cosine similarity
Say W covers 3 classes and thus has 3 rows w1, w2, w3. In W * f(x), each row is combined with f(x) by a dot product, for example w1 * f(x). Instead of a plain dot product, take the cosine similarity between w1 and f(x).
That is, take the dot product and then divide by the product of the norms (lengths) of w1 and f(x). (Equivalently, normalize w1 and f(x) to unit vectors before taking the dot product.)
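A small sketch of that computation (illustrative helper names, vectors as plain arrays):

function dot(a, b) {
  return a.reduce((s, ai, i) => s + ai * b[i], 0);
}

function norm(a) {
  return Math.sqrt(dot(a, a));
}

// Dot product divided by the product of the two norms; equivalent to normalizing
// w and f(x) to unit length before taking the dot product.
function cosineSimilarity(w, fx) {
  return dot(w, fx) / (norm(w) * norm(fx));
}

// Logits built with cosine similarity instead of W * f(x); Softmax is applied as before.
function cosineLogits(W, fx) {
  return W.map(w => cosineSimilarity(w, fx));
}

// Example: w1 = [2, 0] and f(x) = [1, 1] give 2 / (2 * Math.sqrt(2)) ≈ 0.71.
console.log(cosineSimilarity([2, 0], [1, 1]));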