Tuesday, February 20, 2024

Summary of Shusen Wang's video: Deep Retrieval

Deep Retrieval represents an item's features as a path rather than as a vector. It is an online method for finding the items that match a user.

Maintain two lookup tables: one from a path to the list of items on that path, and one from an item to the list of paths that represent it. Also have a prediction model that matches paths to a user. During online recommendation, given a user, predict a set of paths, then follow each path's lookup entry to get the recommended items.
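As a rough sketch (the names and data structures here are my own, not from the video), the two lookup tables and the online flow could look like this, with `predict_paths` standing in for the trained path-prediction model described below:

```python
from collections import defaultdict

# Offline index: path -> items on that path, and item -> paths representing it.
path_to_items = defaultdict(list)   # {(a, b, c): [item_id, ...]}
item_to_paths = defaultdict(list)   # {item_id: [(a, b, c), ...]}

def index_item(item_id, paths):
    """Register an item under each of the paths chosen for it."""
    for path in paths:
        path_to_items[path].append(item_id)
        item_to_paths[item_id].append(path)

def retrieve(user_features, predict_paths, num_paths=10):
    """Online: predict paths for the user, then look up the items on them."""
    candidates = set()
    for path in predict_paths(user_features, num_paths):   # hypothetical model call
        candidates.update(path_to_items.get(path, []))
    return candidates
```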

How an item is expressed as a path: a path is a sequential traversal of ordered layers. Each layer offers k possible nodes to step to; for an example, let the number of layers be 3. We call this depth = 3 and width = k, which gives k^3 possible paths. A path is then described by 3 nodes, one per layer.

Given a path [a, b, c], multiple items can be retrieved. The interest of user x in this path is computed layer by layer:

  • At layer 1, given a user feature vector x, the prediction model gives node a the probability P1(a | x)
  • At layer 2, given user x and node a, the prediction model gives node b the probability P2(b | a; x)
  • At layer 3, given user x and nodes a, b, the prediction model gives node c the probability P3(c | a, b; x)
  • The interest of user x in path [a, b, c] is:
    •  P(a, b, c | x) = P1(a | x) * P2(b | a; x) * P3(c | a, b; x)
The input of the first-layer prediction model is x, and its output is a k-dimensional vector. Take a softmax over this vector to direct the path walk: the next step could pick the node with the largest value, or keep several candidates with beam search.
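A minimal sketch of one layer's step selection, assuming the layer model is just a linear map followed by a softmax (the actual architecture may differ); note that a real beam search would carry whole partial paths across layers rather than a single layer's candidates:

```python
import numpy as np

def layer_step(x, W, b, beam_size=1):
    """Score the k nodes for input x, apply softmax, and pick the next node(s).

    beam_size=1 is the greedy choice (largest probability); beam_size>1 keeps
    several candidate nodes, as a beam search would at this layer.
    """
    logits = W @ x + b                    # shape (k,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over the k nodes
    top_nodes = np.argsort(-probs)[:beam_size]
    return [(int(n), float(probs[n])) for n in top_nodes]
```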

The next layer has a different prediction model whose input has dim(x) + k dimensions: x concatenated with the k-dimensional output of the previous layer. It likewise produces a k-dimensional output and applies a softmax.

Likewise for the succeeding layers.
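Putting the layers together, here is a sketch of the depth-3 path probability under the description above (each layer taken as a linear map plus softmax; the first layer's weights have shape (k, dim x), the later layers' have shape (k, dim x + k)):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def path_probability(x, path, layer_params):
    """P(a, b, c | x) = P1(a | x) * P2(b | a; x) * P3(c | a, b; x).

    `layer_params` is a list of (W, b) pairs, one per layer. Each layer after
    the first sees x concatenated with the previous layer's k-dim output.
    """
    prob = 1.0
    inp = x
    for (W, b), node in zip(layer_params, path):
        out = softmax(W @ inp + b)          # distribution over the k nodes
        prob *= out[node]                   # pick the node that lies on this path
        inp = np.concatenate([x, out])      # input for the next layer
    return prob
```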

So each path is a series of nodes, and a user's interest in it is P(a, b, c | x). For an item, weight this P by 1 for each user x who clicked on the item and by 0 for each user who did not, then sum over all users. This summation is the score of the given path for the item:

   score(item, path) = sum over users x of click(x, item) * P(a, b, c | x)
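A short sketch of that score, assuming a helper `clicked(x, item)` that returns 1 if user x clicked the item and 0 otherwise, and reusing the `path_probability` sketch above:

```python
def path_score(item, path, users, clicked, layer_params):
    """score(item, path) = sum over users x of click(x, item) * P(path | x)."""
    return sum(
        clicked(x, item) * path_probability(x, path, layer_params)
        for x in users
    )
```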

Item feature (the paths chosen to represent an item):

Take the summation of the item's scores over a few chosen paths; the negation of this summation is the loss function for the item. We want to minimize this negation so that the model gives large scores to the paths representing the item.

   loss(item) = - sum over the item's chosen paths of score(item, path)

Note that there could be a popular path that gives large scores to many items. To keep such a path from absorbing too many items, add a penalty on paths that hold many items (how? by subtracting the penalty from the score of an item given a path).

Let reg(path) be the number of items represented by a path. Then, to find a path that represents an item with the smallest penalized loss:

   path = argmin over path of ( loss(item) + a * reg(path) )
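A sketch of that selection, with the simplifying assumption that loss(item) is just the negated score of the candidate path being evaluated, and with reg(path) counting the items currently assigned to the path:

```python
def select_path(item, candidate_paths, users, clicked, layer_params,
                path_to_items, a=1.0):
    """Pick the path minimizing loss(item) + a * reg(path)."""
    def objective(path):
        loss = -path_score(item, path, users, clicked, layer_params)
        reg = len(path_to_items.get(path, []))   # items already on this path
        return loss + a * reg

    return min(candidate_paths, key=objective)
```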


During training, first update the model that predicts a path given a user. Then update the item features (the item-to-path assignments) using the trained model.
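A high-level sketch of that procedure, with the assumption that the two steps are repeated for several rounds; `update_model` is a placeholder for whatever optimizer trains the path-prediction model on the current assignments:

```python
def train(num_rounds, users, items, clicked, layer_params,
          item_to_paths, path_to_items, candidate_paths, a=1.0):
    """Alternate the two updates described above (placeholder helpers)."""
    for _ in range(num_rounds):
        # 1) Update the path-prediction model, using the current
        #    item -> path assignments as its targets (placeholder call).
        update_model(layer_params, users, clicked, item_to_paths)

        # 2) With the model fixed, reassign each item to its best path.
        path_to_items.clear()
        for item in items:
            best = select_path(item, candidate_paths, users, clicked,
                               layer_params, path_to_items, a)
            item_to_paths[item] = [best]
            path_to_items.setdefault(best, []).append(item)
```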


Reference: https://www.youtube.com/watch?v=BYtzZ48hRFM (in Chinese)

