Thursday, February 29, 2024

Shusen Wang's video tutorial: LastN feature, DIN model, SIM model

LastN features come from the last N items that a user interacted with. The process is simple: calculate the embedding vector of each item. Having N vectors, take their average; this becomes a LastN feature of the user. (An alternative to averaging is attention.)
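
A minimal sketch of the average-pooling version, assuming the item embedding table is already trained (`item_embeddings` and `last_n_ids` are illustrative names):

    import numpy as np

    def last_n_feature(item_embeddings: np.ndarray, last_n_ids: list[int]) -> np.ndarray:
        """Average the embeddings of the last N items the user interacted with."""
        vectors = item_embeddings[last_n_ids]   # shape: (N, k)
        return vectors.mean(axis=0)             # shape: (k,)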

The DIN model is an alternative to the plain LastN feature: take a weighted average of the feature vectors of the LastN items (this is the same as attention). How to get the weights? Take the 500 items most relevant to the user and average them into a vector. Use that vector to calculate a similarity with each vector in the LastN features. Apply the similarity scores as weights and take the weighted sum to get a new vector.

How to use the DIN model: given an item recommended to the user, calculate its similarity to each of the LastN items, then take a sum of the LastN vectors weighted by similarity. Use the resulting vector as a user feature.
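
A minimal sketch of this weighting, using cosine similarity between the candidate item and each LastN item as the attention weight (names are illustrative):

    import numpy as np

    def din_user_feature(candidate: np.ndarray, last_n: np.ndarray) -> np.ndarray:
        """Weight each of the LastN item vectors by its similarity to the candidate item."""
        norms = np.linalg.norm(last_n, axis=1) * np.linalg.norm(candidate)
        sims = last_n @ candidate / np.maximum(norms, 1e-8)  # (N,) cosine similarities
        return sims @ last_n                                 # weighted sum, shape (k,)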

The DIN model needs to calculate attention N times, so N cannot be very large. It is therefore limited to a user's short-term interest.

SIM model: an improvement to the DIN model that quickly removes irrelevant items from LastN to avoid calculating all attentions. This allows N to be a few thousand. Here is how: for each candidate recommended item, find the k most similar items among the N items (call it TopK), then calculate attention only over those.

   Two steps of the SIM model: 1. search, 2. attention. Hard search: find items by category and keyword. Soft search: take the items' embeddings as vectors and do a nearest-neighbor search. Also measure how long ago each item was interacted with, convert that time delta into a separate feature embedding, and concatenate it with the item feature. The SIM model needs the timestamp feature because it models a user's long-term behavior (while the DIN model covers short-term behavior).

During the last step, items are picked from the list using relevance as the reward, minus the similarity to the items already picked (Maximal Marginal Relevance):

MarginalRelevance(i) = a * reward(i) - (1 - a) * max[ sim(i, j) for each already picked item j ]

When the picked set is big, max[ sim(i, j) ] gets close to 1, causing the algorithm to fail. The solution is not to compare item i against all items in the picked set, but only against the last few inside a sliding window.
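
A minimal sketch of this selection loop with the sliding window, assuming a relevance array `reward` and a pairwise similarity matrix `sim` are given (all names and default weights are illustrative):

    import numpy as np

    def mmr_select(reward: np.ndarray, sim: np.ndarray, n_pick: int,
                   a: float = 0.7, window: int = 10) -> list[int]:
        """Greedily pick items by reward, penalized by similarity to recent picks."""
        picked: list[int] = []
        remaining = set(range(len(reward)))
        while remaining and len(picked) < n_pick:
            recent = picked[-window:]  # only compare with the sliding window
            def score(i: int) -> float:
                penalty = max((sim[i][j] for j in recent), default=0.0)
                return a * reward[i] - (1 - a) * penalty
            best = max(remaining, key=score)
            picked.append(best)
            remaining.remove(best)
        return picked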


References:

https://www.youtube.com/watch?v=Stbc9goPKXQ

https://www.youtube.com/watch?v=0hPep80Oy6k

https://www.youtube.com/watch?v=_4J9aF5KR84

Monday, February 26, 2024

Summary on Shusen Wang's video of: SENet & Bilinear Cross

SENet was originally used in computer vision, but was later applied to recommendation systems.

Consider the discrete attributes of a product. Convert them into embeddings to get features: m features, each expressed as a k-dimensional vector. Take average pooling over each feature's embedding to squeeze the m * k matrix into an m * 1 vector. (Why does this help? This is the "squeeze" step: each field is summarized by a single scalar, which the following layers use to compute per-field weights.)

From the m * 1 vector, apply a fully connected layer and a ReLU layer. Output an (m / r) * 1 vector (r is a compression ratio).

From the (m / r) * 1 vector, apply a fully connected layer and a sigmoid layer. Output an m * 1 vector.

Now use this m * 1 vector from the last step as weights: apply it to the initial m * k matrix with row-wise multiplication to get another m * k matrix.

(The intuitive meaning of the operations above is to suppress irrelevant fields in the matrix.)

The core of SENet is to apply a weight to each field. Note that the fields can have features of different dimensions.
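
A minimal numpy sketch of the squeeze / excite / reweight steps above, assuming all m fields share dimension k and the two fully connected layers' weights are given (all names are illustrative):

    import numpy as np

    def senet_reweight(E: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
        """E: (m, k) field embeddings; W1: (m//r, m); W2: (m, m//r)."""
        z = E.mean(axis=1)                     # squeeze: (m,) one scalar per field
        h = np.maximum(W1 @ z, 0.0)            # FC + ReLU: (m//r,)
        w = 1.0 / (1.0 + np.exp(-(W2 @ h)))    # FC + sigmoid: (m,) field weights
        return E * w[:, None]                  # row-wise reweighting: (m, k)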


Cross two vectors between field i and field j. (All fields are in m-dimension.)

- by inner product:  Xi.transpose * Xj = scalar

- Hadamard product of two vectors (element-wise multiplication) = vector. 


Bilinear Cross:

- by inner product: Xi.transpose * Wij * Xj (a matrix Wij is inserted in the middle of the inner product)

    Note that there are too many pairs of fields, so only a selective set of field pairs is used to perform the Bilinear Cross.

- Hadamard product of two vectors: Hadamard( Xi , Wij * Xj )

   Likewise, in practice, a selective set of field pairs is used to perform the Bilinear Cross.
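
A minimal sketch of both bilinear forms for one field pair (the learned matrix `W_ij` is an illustrative per-pair parameter):

    import numpy as np

    def bilinear_inner(x_i: np.ndarray, x_j: np.ndarray, W_ij: np.ndarray) -> float:
        """Inner-product form: Xi.transpose * Wij * Xj -> scalar."""
        return float(x_i @ W_ij @ x_j)

    def bilinear_hadamard(x_i: np.ndarray, x_j: np.ndarray, W_ij: np.ndarray) -> np.ndarray:
        """Hadamard form: Hadamard(Xi, Wij * Xj) -> vector."""
        return x_i * (W_ij @ x_j)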


FiBiNet:

- Combine SENet and Bilinear Cross.

- Feed the embeddings of each field down 3 paths: a bilinear network, SENet followed by a bilinear network, and the original embeddings concatenated. Concatenate all 3 resulting vectors together and apply another network.



Reference: https://www.youtube.com/watch?v=nF37qtNvw1E


Saturday, February 24, 2024

Summary on Shusen Wang's video of: LHUC Network (Learning Hidden Unit Contributions)

LHUC is good for the sorting at the last step of a recommendation system. The network originally came from research in speech recognition.

Consider the case of speech recognition, taking the voice signal as one input vector and the speaker's features as another input. Let the two inputs pass through two separate networks (which could be fully connected networks), and take the Hadamard product of the two outputs (obviously the output dimensions need to match). Wire the combined result into another fully connected layer.

Then take the speaker features as input to another fully connected layer and take the Hadamard product with the result again. This process can repeat many times (adding a fully connected layer on the main path and Hadamard-multiplying in the transformed speaker input each time).

Note that the network transforming the speaker features can be several fully connected layers ending with a sigmoid function multiplied by 2. The scale of 2 allows features to be amplified, adapting to the distinct characteristics of a speaker.

When applying the LHUC network to a recommendation system, the voice signal is replaced with item features, while the speaker features are replaced with user features.
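
A minimal numpy sketch of one LHUC block as described above (weights and shapes are illustrative):

    import numpy as np

    def sigmoid(x: np.ndarray) -> np.ndarray:
        return 1.0 / (1.0 + np.exp(-x))

    def lhuc_block(item_h: np.ndarray, user_x: np.ndarray,
                   W_main: np.ndarray, W_gate: np.ndarray) -> np.ndarray:
        """One LHUC step: main-path FC, gated by 2 * sigmoid(FC(user features))."""
        h = np.maximum(W_main @ item_h, 0.0)   # main path: FC + ReLU
        gate = 2.0 * sigmoid(W_gate @ user_x)  # scale 2 lets the gate amplify features
        return h * gate                        # Hadamard product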


Reference: https://www.youtube.com/watch?v=TxIedW94hu0


Summary on Shusen Wang's video of: DCN (Deep & Cross Network)

DCN (Deep & Cross Network) is used to replace fully connected layers.

Cross Layer:

- Take an intermediate output vector xi and pass it through a fully connected layer to get y, the same size as the input x0. Take the Hadamard product of x0 and y and call it z. Let xi+1 = z + xi be the output.

xi+1 = x0 ∘ (W * xi + b) + xi, where ∘ is the Hadamard (element-wise) product

Cross Network:

- Apply the cross layer multiple times, always with the same x0 (sketched below).
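
A minimal numpy sketch of the cross layer and its repeated application (weights are illustrative):

    import numpy as np

    def cross_layer(x0: np.ndarray, x_i: np.ndarray,
                    W: np.ndarray, b: np.ndarray) -> np.ndarray:
        """xi+1 = x0 ∘ (W * xi + b) + xi."""
        return x0 * (W @ x_i + b) + x_i

    def cross_network(x0: np.ndarray, layers: list[tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
        """Apply the cross layer repeatedly, always reusing the same x0."""
        x = x0
        for W, b in layers:
            x = cross_layer(x0, x, W, b)
        return x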

Deep & Cross Network:

- Take all features as one vector and feed it into a fully connected network and, on a second path, the Cross Network. Concatenate the two results and feed them into another fully connected layer.

- Widely used in industry for recommendation, to sort the items or as the shared bottom.

Reference: https://www.youtube.com/watch?v=yNeRO5m63JQ


Friday, February 23, 2024

Summary on Shusen Wang's video of: Converting attributes to Features

Here are a few ways to deal with attributes to make them into features.

  • Discrete values: apply one-hot encoding, then make embeddings from the one-hot input.
  • Continuous values: divide the value range into buckets, then apply one-hot encoding per bucket.
  • Large continuous values: apply log(1+x), then treat the result as a regular continuous value (see the sketch below).
    • An alternative is to convert large values into ratios, for example the number of clicks into a click rate.
The thumbnail image of a record can also be converted into a vector to assist classification / prediction tasks.
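
A minimal sketch of the bucketing and log transforms above (the bucket boundaries are illustrative):

    import numpy as np

    def bucketize_one_hot(value: float, boundaries: list[float]) -> np.ndarray:
        """Continuous value -> bucket index -> one-hot vector."""
        idx = int(np.searchsorted(boundaries, value))
        one_hot = np.zeros(len(boundaries) + 1)
        one_hot[idx] = 1.0
        return one_hot

    def squash_large(value: float) -> float:
        """Large continuous value -> log(1 + x)."""
        return float(np.log1p(value))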

User attributes and item attributes are relatively static and change little, so they are perfect caching candidates. But user stats and item stats may change on every click / user interaction, so it is better to store the stats separately from the basic attributes.

Feature crosses are non-linear features derived from existing features with non-linear operations. A weighted sum of a vector (x1, x2, ... xd) is a linear operation, while multiplication of features creates "crosses". As an example of second-order feature crossing, a weighted sum of the terms xi * xj introduces non-linearity. To visualize why this helps, consider a vector with quantity and price per unit as features: the product of the two predicts the total price effectively, while no linear combination of the two features can describe it accurately.

n features create on the order of n ** 2 second-order feature crosses. This could become a lot of new dimensions. To reduce the dimensions, view the cross weights as an n * n matrix and approximate it as the product of a low-rank matrix and its transpose.

Call the low-rank matrix V, with row i written as Vi. Thus at the (i, j) position,

weight(i, j) = Vi.transpose * Vj, and the cross term is weight(i, j) * xi * xj

Using matrix V makes the model a Factorization Machine (FM), which is simply a linear model plus second-order feature crossing.
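
A minimal sketch of the resulting second-order term, using the standard O(n * k) simplification from the FM literature (V is an illustrative parameter):

    import numpy as np

    def fm_second_order(x: np.ndarray, V: np.ndarray) -> float:
        """Sum over i<j of (Vi . Vj) * xi * xj, computed in O(n*k).

        x: (n,) feature vector; V: (n, k) low-rank factors, row i is Vi.
        """
        s = V.T @ x                    # (k,) = sum_i Vi * xi per factor dimension
        s2 = (V ** 2).T @ (x ** 2)     # (k,) = sum_i Vi^2 * xi^2 per factor dimension
        return 0.5 * float(np.sum(s * s - s2))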

Note: FM has become less popular nowadays in recommendation systems.


Collaborative Filtering features based on userIds and itemIds can be more accurate than tag matching. However, for a new item that hasn't had any user interaction, it is impossible to estimate Collaborative Filtering features accurately. Thus during the cold start of a new item, tag matching is used to take the top K similar items and average their Collaborative Filtering features; use that average as the initial guess for the new item's feature. Keep using the guessed value until enough users have visited the new item.

Use k-means clustering to divide the items into clusters; there could be thousands of clusters. Use each cluster's centroid as the index for searching. When a new item is created, it is matched to clusters, for example the top 1000 clusters. Similarly, when a user comes, take her last N recent items and match them to clusters. Once clusters are found, pick items from them.
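
A minimal sketch of the centroid-index lookup (names are illustrative; a production system would use an approximate nearest-neighbor index):

    import numpy as np

    def top_clusters(query: np.ndarray, centroids: np.ndarray, top_k: int) -> np.ndarray:
        """Rank cluster centroids by cosine similarity to the query vector."""
        sims = centroids @ query / (
            np.linalg.norm(centroids, axis=1) * np.linalg.norm(query) + 1e-8)
        return np.argsort(-sims)[:top_k]   # indices of the best-matching clusters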



Wednesday, February 21, 2024

Summary on Shusen Wang's video of: Ways to combine click rate, like rate, favorite rate, etc. into 1 score

In multi-target prediction, the click rate, like rate, favorite rate, etc. are predicted together for an item recommended to a user. They need to be combined into one score to make ordering easy. Here are a few ways to combine the predictions.

Simple weighted sum:

   click_rate + w1 * like_rate + w2 * favorite_rate + ...

Weighted sum with click_rate as a precondition (since a like can only happen after a click):

  click_rate * (1 + w1 * like_rate + w2 * favorite_rate + ... )

Weighted product instead of weighted sum:

   ((1 + w1 * click_rate) ** a1) * ((1 + w2 * like_rate) ** a2) * ...

Weighted sum based on ranks, where each rank comes from sorting the list by one dimension:

   w1 / (click_rank ** a1 + b1) + w2 / (like_rank ** a2 + b2) + ...

For online purchases, the user has to go through every step of the funnel for a transaction to happen. Thus the probabilities are multiplied to make one score:

   (click_rate ** a1) * (cart_rate ** a2) * (pay_rate ** a3) * (price ** a4)
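
A minimal sketch of the purchase-funnel fusion (the weights and exponents are illustrative hyperparameters):

    def purchase_score(click_rate: float, cart_rate: float, pay_rate: float,
                       price: float, a1: float = 1.0, a2: float = 1.0,
                       a3: float = 1.0, a4: float = 0.5) -> float:
        """Multiply funnel-step probabilities (and price) with tunable exponents."""
        return (click_rate ** a1) * (cart_rate ** a2) * (pay_rate ** a3) * (price ** a4)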


Reference: https://www.youtube.com/watch?v=D2iqM2puJ2I

Summary on Shusen Wang's video of: Multi-gate Mixture-of-Experts (MMoE)

The idea is to train multiple expert neural networks and apply a weighted sum of their results. The weights are decided by another neural network that takes the same input, connected to a softmax layer to generate the weights. The vector from the weighted sum is then fed into another neural network layer to predict a regression task.

When multiple predictions (targets) are made, they share the same expert neural networks, but each target has its own weight-generating network and its own succeeding network to process the vector from the weighted sum.

When one weight is close to 1 and the other weights are close to 0, only one expert network performs work, skipping all the other expert networks. This polarization is an undesirable setup that wastes resources. To avoid polarization of the weights, a dropout operation is performed on the softmax output during training, giving each expert's result a 10% chance of being dropped. As a result, the solution of putting all logic into one expert network leads to poor performance, which in turn encourages multiple expert networks to be trained at the same time.
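
A minimal numpy sketch of one target's MMoE forward pass as described above (`experts` is a list of callables; names are illustrative):

    import numpy as np

    def softmax(z: np.ndarray) -> np.ndarray:
        e = np.exp(z - z.max())
        return e / e.sum()

    def mmoe_forward(x: np.ndarray, experts: list, gate_W: np.ndarray,
                     rng: np.random.Generator, train: bool = True) -> np.ndarray:
        """Weighted sum of expert outputs; gate weights get 10% dropout in training."""
        outs = np.stack([expert(x) for expert in experts])  # (n_experts, d)
        w = softmax(gate_W @ x)                             # (n_experts,) gate weights
        if train:
            w = w * (rng.random(len(w)) > 0.1)              # drop each expert with p=0.1
        return w @ outs                                     # fused vector, shape (d,)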

Reference: https://www.youtube.com/watch?v=JIEwaPARjfk

Summary on Shusen Wang's video of: Reduce negative samples and Adjust prediction Rate

Classifiers are often trained on unequal amounts of positive and negative data. It is a common practice to downsample the negative data (keeping negatives at a rate a relative to the positive samples). Downsampling the negatives makes the predicted value larger than the actual occurrence rate. However, the value can be adjusted:

Let

  P_true = n_positive / (n_positive + n_negative)

  P_pred = n_positive / (n_positive + n_negative * a)

And the adjusted expectation can be derived:

   P_true = a * P_pred / ((1-P_pred)  + a * P_pred )


So the predicted expectation can be adjusted back to the scale it would have had if the model were trained on the full set of samples.
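
A minimal sketch of that adjustment, where a is the rate at which negatives were kept:

    def calibrate(p_pred: float, a: float) -> float:
        """Map a prediction trained on downsampled negatives back to the true rate."""
        return a * p_pred / ((1.0 - p_pred) + a * p_pred)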


Reference: https://youtu.be/kY4W46MQqsg?si=U33AhcwpvCvQ3G_2&t=526

Tuesday, February 20, 2024

Summary on Shusen Wang's video of: Deep Retrieval

Deep Retrieval represents an item's features as paths. It is an online method for retrieving matching items for a user.

Keep a lookup from path to the list of items on it, and a lookup from item to its list of paths. Have a prediction model that matches paths to a user. During online recommendation, given a user, match a set of paths and follow the paths to get the recommended items.

How an item is expressed as a path: each path is a sequential traverse of ordered layers. Each layer has k possible nodes per step; let the number of layers be 3 for an example. We call this depth = 3 and width = k. A path is then described by 3 nodes.

Given a path [a, b, c] (from which multiple items can be retrieved), the user's interest is predicted layer by layer:

  • At layer 1, given a user feature vector x, the prediction model for node a is P1(a | x)
  • At layer 2, given user x and node a, the prediction model for node b is P2(b | a; x)
  • At layer 3, given user x and nodes a, b, the prediction model for node c is P3(c | a, b; x)
  • The interest of user x in path [a, b, c] is:
    • P(a, b, c | x) = P1(a | x) * P2(b | a; x) * P3(c | a, b; x)
The input of the prediction model is x, and its output is a k-dimensional vector. Take a softmax of this vector to direct the path walking. The next step could be taking the node with the largest value, or performing beam search.

The next layer has a different prediction model whose input has dim(x) + k dimensions (the input is x concatenated with the k-dimensional output of the first layer). Likewise apply the prediction model and a softmax function.

Likewise for the succeeding layers.
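
A minimal sketch of the greedy layer-by-layer walk under one reading of the above, where each layer's input is x concatenated with the one-hot encodings of the nodes chosen so far (`layer_models` are illustrative callables returning a softmax over k nodes):

    import numpy as np

    def walk_path_greedy(x: np.ndarray, layer_models: list) -> list[int]:
        """Pick one node per layer; each layer's input is x plus the nodes chosen so far."""
        path: list[int] = []
        inp = x
        for model in layer_models:
            probs = model(inp)                    # (k,) softmax over this layer's nodes
            node = int(np.argmax(probs))          # greedy; beam search keeps top-B paths
            path.append(node)
            one_hot = np.zeros_like(probs)
            one_hot[node] = 1.0
            inp = np.concatenate([inp, one_hot])  # grow input by k dims for the next layer
        return path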

So each path consists of a series of nodes, and P(a, b, c | x) gives a user's interest in the path. For an item, multiply P by 1 for each user x who clicked on the item and by 0 for each user who didn't, then sum these P values over all users. The summation is the score of the given path for the item.

Item feature:

Take the summation of an item's scores over a few paths; the negation of this summation is the loss function for the item. We'd like to minimize this negation so that the model gives large scores to the item's paths.

Note that there could be a popular path that gives large scores to many items. To keep a path from becoming too popular, add a penalty on paths with many items (how? by subtracting the penalty from the score function of an item given a path).

Let reg(path) be the number of items represented by a path. Then to find the path that represents an item with the smallest loss:

   path = argmin over paths of [ loss(item) + a * reg(path) ]


During training, first update the model that predicts paths given a user. Then update the item-to-path mapping using the trained model.


Reference: https://www.youtube.com/watch?v=BYtzZ48hRFM (voiced in Chinese)


Monday, February 19, 2024

Summary on Shusen Wang's video of: Self-supervised Learning

Often we have data with features (keywords) but no classification information. For example, videos on a website can have keywords, but we do not yet have user information to tell whether a certain feature leads a user to like the video. However, since there are many keywords on the same video, it is reasonable to believe that feature vectors based on different keywords of the same video are close to each other, and more distant from feature vectors based on keywords of different videos. Separating the feature vectors based on this information is a self-supervised learning process.

Random mask: randomly mask the keywords of certain attributes.

Dropout is another technique for deriving features from the same item during training: randomly take away a percentage of the keywords (assuming there are many keywords). In this way, the feature of an item becomes more generalized.

Complementary features: split the keywords of one item into two sets and map each set into a feature vector (thus the two are complementary features), each seeing only 50% of the keywords at a time. Since they come from the same item, the two vectors should have a large cosine similarity.

Mask related features: calculate the mutual information between pairs of keywords over the items: MI(U, V) = sum over (u, v) of p(u, v) * log( p(u, v) / (p(u) * p(v)) ), where p(u, v) is the probability that the two keywords appear on the same item. Then, for each keyword k, find the half of the other keywords most related to k, mask those, and keep the 50% that are less related.
(In practice, the method of masking related features is hard to calculate and hard to maintain.)

How to train: convert all records into vectors by applying multiple masking techniques and multiplying by a matrix to map each record into a vector. Pair the vectors and calculate their cosine similarities, feeding the results into a softmax layer. Let the expected similarity be 1 for vectors from the same object and 0 for vectors from different objects. Take the cross-entropy loss between the calculated and expected values.

It is also possible to combine the loss function of self-supervised learning with the loss function of contrastive learning by taking a simple summation.

Reference: https://www.youtube.com/watch?v=Ra3MVhneR9E (voiced in Chinese)

Saturday, February 17, 2024

Quick Note: Contrast Learning in DSSM

Contrastive learning is a way to set up the loss function. With a point-wise loss function, negative samples and positive samples are considered separately. In contrastive learning, the loss function (aka pair-wise loss) is constructed from 3 items: 2 positively related items and 1 unrelated item. The cosine similarity of the 2 positively related items should be large, and the similarity with the unrelated item should be small. The difference of the two similarity values is used as the loss function; the further apart the two values, the better. Since the optimization step reduces the loss, training fits both samples.

To separate the positive sample and the negative sample better, a desired margin m is usually defined. If the difference is larger than m, there is no loss. If the difference is less than m, the shortfall is the loss.

So the loss function is written as 

    Loss(a, b_positive, b_negative) = max{0, cos(a, b_negative) - cos(a, b_positive) + m}

It is also possible to write the function in logistic loss:

    Loss(a, b_positive, b_negative) = log(1 + exp[ sigma * (cos(a, b_negative) - cos(a, b_positive)) ] )


List-wise loss is a similar idea to the pair-wise loss function, but considers more samples in the same loss function. Each training record consists of 1 positive pair (a, b) and multiple negative samples (b_neg_1, b_neg_2, etc.). Take the cosine similarity between the pairs: cos(a, b), cos(a, b_neg_1), cos(a, b_neg_2), etc. Put all results into a softmax function and let the expected result be (1, 0, 0, ...). Train with the cross-entropy loss between the result and the expected result.
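
A minimal numpy sketch of the list-wise loss for one training record (a temperature / scale factor is omitted; names are illustrative):

    import numpy as np

    def cos(u: np.ndarray, v: np.ndarray) -> float:
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    def listwise_loss(a: np.ndarray, b_pos: np.ndarray, b_negs: list) -> float:
        """Softmax over [cos(a,b_pos), cos(a,b_neg_1), ...]; the target is class 0."""
        sims = np.array([cos(a, b_pos)] + [cos(a, b) for b in b_negs])
        log_probs = sims - np.log(np.sum(np.exp(sims)))   # log-softmax
        return -log_probs[0]                              # cross entropy with (1, 0, 0, ...)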


Note: when using a point-wise loss function (training on negative and positive samples separately), the ratio of positive samples to negative samples should be between 1:2 and 1:3.


Reference:

https://www.youtube.com/watch?v=2Mc10LZ-DB0 (voiced in Chinese)

Friday, February 16, 2024

Quick note: Collaborative Filtering vs. Swing

You have probably heard of Collaborative Filtering, which finds the similarity of two items by counting the number of users interested in both. The items are matched via cosine similarity (in other words, a dot product of normalized vectors).

There is a problem: when two users are in the same small interest group, the items they share may not actually be that similar.

Swing takes one more step: assign a weight to each pair of shared users. If the two users have shared a lot of items, make the weight of the user pair smaller: 1 / (a + overlap(u1, u2)), where a is a positive term to avoid dividing by zero.
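
A minimal sketch of Swing-style similarity between two items under that weighting (`item_users` maps an item to the set of users who interacted with it, `user_items` is the reverse; both names are illustrative):

    from itertools import combinations

    def swing_similarity(item_users: dict, user_items: dict,
                         i1: str, i2: str, a: float = 1.0) -> float:
        """Sum 1 / (a + overlap(u1, u2)) over user pairs that share both items."""
        shared = item_users[i1] & item_users[i2]
        score = 0.0
        for u1, u2 in combinations(sorted(shared), 2):
            overlap = len(user_items[u1] & user_items[u2])
            score += 1.0 / (a + overlap)
        return score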


Collaborative Filtering can also be applied to the similarity of users, with recommendations drawn from similar users' lists of items. To keep popular items from appearing in every recommendation, an item's weight is reduced by its popularity: 1 / log(1 + num_users_like_the_item).


Reference:

https://www.youtube.com/watch?v=DUUMNTDuJ3Q

https://www.youtube.com/watch?v=7O9zFMNdrZ8


Tuesday, February 06, 2024

How to get Active Contour

A summary of the video tutorial Active Contour | Boundary Detection.

1. Draw a circle on the image, roughly around the object. This is the initial contour.

2. Calculate the gradient of the image.

3. Blur the gradient values in the image so that they spread to other parts of the image, creating a slope that guides a point toward positions with large gradient values.

4. Apply greedy search from the contour, moving each control point toward points with large gradient values.

5. It is almost done, but the contour is not smooth. A modification to this solution is to adjust the gradient value with the smoothness and elasticity of the contour, defined via the first and second derivatives along the contour.
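
A minimal numpy sketch of one greedy update step under those definitions (`grad_mag` is the blurred gradient magnitude image; the weights alpha and beta are illustrative):

    import numpy as np

    def greedy_step(contour: np.ndarray, grad_mag: np.ndarray,
                    alpha: float = 0.5, beta: float = 0.5) -> np.ndarray:
        """Move each control point to the neighbor with the best energy trade-off."""
        n = len(contour)
        new = contour.copy()
        for i in range(n):
            prev_pt, next_pt = contour[i - 1], contour[(i + 1) % n]
            best, best_e = contour[i], -np.inf
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    p = contour[i] + np.array([dy, dx])
                    # wrap indices as a crude way to stay inside the image
                    y = int(p[0]) % grad_mag.shape[0]
                    x = int(p[1]) % grad_mag.shape[1]
                    elastic = np.sum((p - prev_pt) ** 2)               # first derivative
                    smooth = np.sum((prev_pt - 2 * p + next_pt) ** 2)  # second derivative
                    e = grad_mag[y, x] - alpha * elastic - beta * smooth
                    if e > best_e:
                        best_e, best = e, p
            new[i] = best
        return new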