Monday, February 19, 2024

Summary of Shusen Wang's video: Self-supervised Learning

Often we have data with features (keywords) but no classification labels. For example, videos on a website can have keywords, but we may not yet have user interaction data to tell whether a given feature leads a user to like the video. However, since many keywords are attached to the same video, it is reasonable to assume that feature vectors built from different keywords of the same video are close to each other, and more distant from feature vectors built from keywords of different videos. Learning to separate feature vectors based on this assumption is a self-supervised learning process.

Random mask: randomly mask the keywords of certain attribute fields (for example, hide an entire attribute of an item).
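
A minimal sketch of random masking, assuming each item is stored as a dict of attribute fields to keyword lists (the field names, mask_prob, and mask_token here are illustrative, not from the video):

```python
import random

def random_mask(attributes, mask_prob=0.5, mask_token="[MASK]"):
    """Randomly hide the keywords of some attribute fields of one item.

    attributes: dict mapping attribute name -> list of keywords, e.g.
    {"category": ["music"], "tags": ["rock", "live", "concert"]}.
    """
    masked = {}
    for field, keywords in attributes.items():
        if random.random() < mask_prob:
            masked[field] = [mask_token]    # this attribute is masked out
        else:
            masked[field] = list(keywords)  # kept unchanged
    return masked
```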

Dropout is another technique for generating features from the same item during training: randomly remove a percentage of the item's keywords (assuming the item has many keywords). In this way, the feature learned for an item becomes more generalized.
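
A small sketch of keyword dropout, assuming an item is just a list of keywords (the drop_rate value is illustrative):

```python
import random

def keyword_dropout(keywords, drop_rate=0.2):
    """Randomly drop a fraction of an item's keywords so that the encoded
    feature does not rely on any single keyword."""
    kept = [k for k in keywords if random.random() > drop_rate]
    # Keep at least one keyword so the item still produces a non-empty feature.
    return kept if kept else [random.choice(keywords)]
```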

Complementary features: split the keywords of one item into two sets and map each set into a feature vector, so each vector only sees 50% of the keywords at a time (hence the two are complementary). Since both come from the same item, the two vectors should have a large cosine similarity.
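
A minimal sketch of the complementary split, assuming keywords are held in a plain list (the function name is mine):

```python
import random

def complementary_split(keywords):
    """Split an item's keywords into two disjoint halves; each half is
    encoded separately and the two resulting vectors are trained to agree."""
    shuffled = list(keywords)
    random.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]
```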

Mask related features: calculate the mutual information between keywords across items, MI(U,V) = sum over (u,v) of p(u,v) * log( p(u,v) / (p(u) * p(v)) ), where p(u,v) is the probability that the two keywords appear on the same item and p(u), p(v) are the probabilities of each keyword appearing on an item. Then, for each keyword k, find the half of the other keywords that are most related to k; mask that closely related half and keep the 50% of less related keywords.
(In practice, the method of masking related features is hard to calculate and hard to maintain.)
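
A rough sketch of the mutual-information bookkeeping, treating each item as a set of keywords and estimating p(u), p(v), p(u,v) from co-occurrence counts; it returns the per-pair term from the formula above, which could then be used to rank the keywords most related to each k (function and variable names are mine):

```python
import math
from collections import Counter
from itertools import combinations

def keyword_relatedness(items):
    """items: list of keyword sets, one per item.
    Returns a score for each keyword pair (u, v): the term
    p(u,v) * log( p(u,v) / (p(u) * p(v)) ) from the MI formula above."""
    n = len(items)
    single = Counter()
    pair = Counter()
    for kws in items:
        kws = set(kws)
        single.update(kws)
        pair.update(combinations(sorted(kws), 2))
    p = {k: c / n for k, c in single.items()}
    scores = {}
    for (u, v), c in pair.items():
        p_uv = c / n
        scores[(u, v)] = p_uv * math.log(p_uv / (p[u] * p[v]))
    return scores
```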

How to train: convert each record into vectors by applying the different mask techniques and passing each masked view through the encoder (a matrix multiplication) to produce a vector. Pair the vectors and compute their cosine similarities, then feed the similarities into a softmax layer. The expected cosine similarity of vectors from the same item is 1 and of vectors from different items is 0; take the cross-entropy loss between the computed and expected values.
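
A compact sketch of this training step in PyTorch, assuming a generic encoder and a batch where row i of both views comes from the same item (this is my reading of the procedure, not code from the video):

```python
import torch
import torch.nn.functional as F

def self_supervised_loss(encoder, view_a, view_b):
    """view_a / view_b: two differently masked feature tensors for the same
    batch of items, so the i-th rows describe the same item."""
    a = F.normalize(encoder(view_a), dim=1)   # (batch, dim) unit vectors
    b = F.normalize(encoder(view_b), dim=1)
    logits = a @ b.T                          # cosine similarity of every pair
    targets = torch.arange(a.size(0), device=a.device)  # matches on the diagonal
    # Softmax + cross-entropy pushes same-item similarity toward 1
    # and different-item similarity toward 0.
    return F.cross_entropy(logits, targets)
```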

It is also possible to combine the self-supervised learning loss with the contrastive learning loss of the main task by taking a simple summation.

Reference: https://www.youtube.com/watch?v=Ra3MVhneR9E (voiced in Chinese)
