Tuesday, April 16, 2024

Quick Tutorial: Tiny Llama Finetuning

TinyLlama is an LLM small enough to run on your laptop, and it generates reasonably good replies. People are fine-tuning it for their own purposes. I am going to show you how to play with it.

How to play with a model on Hugging Face?

Go to the Hugging Face page; it already includes instructions on how to get started: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

How to run it locally?

Download the files to your disk from the "Files" tab:

https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main

You will have to click the download icon to download each file; copying and pasting text leads to corrupted files. Put them all in one folder and run a Python script like the sketch below.
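A minimal sketch of such a script, assuming the files were saved to a local folder named ./TinyLlama-1.1B-Chat-v1.0 (substitute your own folder path):

    import torch
    from transformers import pipeline

    # Point the pipeline at the local folder holding the downloaded model files.
    pipe = pipeline(
        "text-generation",
        model="./TinyLlama-1.1B-Chat-v1.0",
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # The tokenizer renders the messages into the tagged chat format.
    messages = [{"role": "user", "content": "What is fine-tuning?"}]
    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
    print(outputs[0]["generated_text"])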


Which one to use?

There are a few models. Use the chat model: it gives better results for implementing a question-and-answer chatbot.

All models are good at continuing the text that you provide. The chat model additionally separates questions from answers by marking user input and assistant output. Surround user input with the <|user|> tag, ending it with </s>. Surround the desired reply with <|assistant|>, also ending it with </s>.
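For example, a single exchange in this format looks roughly like the following (the content is made up):

    <|user|>
    How do I boil an egg?</s>
    <|assistant|>
    Boil water, lower the egg in, and wait six to eight minutes.</s>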

You can also use it without tags, but expect the output to look more like creative writing than a conversation.

In addition to "text-generation", the Transformers pipeline has a "conversational" mode, which allows the input to be a list of maps, where each map is a message in the conversation. (It converts the list of maps into tagged text for you.)
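As a sketch, the list-of-maps input looks like this (the contents are placeholders):

    messages = [
        {"role": "user", "content": "Name a color."},
        {"role": "assistant", "content": "Blue."},
        {"role": "user", "content": "Name another one."},
    ]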


How to fine tune with peft?

Use this script. It picks a small percentage of certain neural network components to fine-tune.

https://gist.github.com/yuhanz/740ab396f237e8d2f41f07a10ccd8de9

You want to initialize TrainingArguments with push_to_hub=False, so that it doesn't require Hugging Face credentials. Replace the model name with the relative path to the folder on your disk containing the model files.
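A minimal sketch of that configuration (the output directory and hyperparameters below are placeholders, not the gist's exact values):

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./tinyllama-finetuned",  # placeholder path
        push_to_hub=False,                   # avoids needing Hugging Face credentials
        num_train_epochs=1,
        per_device_train_batch_size=1,
    )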

Prepare the training data as conversations separated by tags. Use the tokenizer to convert the structured data into tagged text:

text = tokenizer.apply_chat_template(messages, tokenize=False)

Note that the max_seq_length configuration decides the size of the input and the generated text. The right value may depend on the model's tokenizer.

References:

 https://www.kaggle.com/code/tommyadams/fine-tuning-tinyllama

https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms

https://huggingface.co/docs/transformers/main/chat_templating


Thursday, April 04, 2024

Handling AWS API Gateway Error: Invalid HTTP endpoint encoding detected for URI

I encountered this error at work when deploying AWS API Gateway:

 Invalid HTTP endpoint encoding detected for URI (Service: ApiGateway, Status Code: 400)

That was because the URI contained encoded %xx characters that AWS API Gateway doesn't accept (even though they are valid and work well with curl). Replacing those characters fixed the issue. Here is the list of unwelcome characters that I found from my experiments:

%21 → !

%24 → $

%2C → ,

%3A → :

%0A → newline

Friday, March 29, 2024

Summary on Outlier's tutorial: Cross Attention

Self-attention: we'd like to find a weighted sum of feature vectors for a series of inputs, whether it is a paragraph of text or a list of image segments. The feature vector for each input vector is calculated by the Value matrix. The weight is calculated by the Query matrix (for the vector in focus) multiplied by the Key matrix (for each vector in the input).

Cross-attention: now consider having two inputs, one image and one text. The Query matrix applies to the image, while the Key and Value matrices apply to the text. Find the attention between each image fragment and each word. Taking these values as weights, apply them to the Value outputs to create the weighted sum. From this vector, you can then multiply by another matrix to transform back to the dimension of the input image and construct a (different) image.
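A minimal single-head sketch of those shapes (all dimensions are made up for illustration):

    import numpy as np

    d, n_img, n_txt = 64, 16, 8
    img = np.random.randn(n_img, d)    # image fragment embeddings
    txt = np.random.randn(n_txt, d)    # text token embeddings

    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

    Q = img @ Wq                       # queries come from the image
    K = txt @ Wk                       # keys come from the text
    V = txt @ Wv                       # values come from the text

    scores = Q @ K.T / np.sqrt(d)      # (n_img, n_txt) attention scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    out = weights @ V                  # weighted sum per image fragment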


 Reference: https://www.youtube.com/watch?v=aw3H-wPuRcw

Thursday, March 28, 2024

Summary on Shusen Wang's tutorial: Item-to-Item (I2I) model

 The most common usage is User-Item-Item: a user has a list of favorite items. Use the favorite items to find similar items (a.k.a. Item-to-Item), and recommend those items to the user.

How to measure similarity: two items are similar if the users who like one tend to like the other (similar to Item Collaborative Filtering).

Another way to measure similarity: compare items' feature vectors.

User-User-Item: User 1 is similar to User 2. User 2 likes item A. So recommend item A to User 1.

User-Author-Item: User 1 likes Author 2, and Author 2 has item A. So, recommend item A to user 1.

User-Author-Author-Item: User 1 likes Author 2. Author 2 and Author 3 are similar. So recommend Author 3's items to User 1.

Using multiple models, each retrieving a separate share of the records, achieves better results than using just one model to retrieve all items.


Summary on Shusen Wang's tutorial: Boost New Items

New items cannot compete with older items. Thus it is necessary to boost new items in search when they are created, to guarantee them some impressions.

It is difficult to assign the boost score, and it often leads to giving too many impressions to new items. Thus limiting impressions is necessary. 

One way is to vary boost scores according to the impressions that have already been served. (For example, create a table and reduce the boost value depending on how close the item is to its deadline and how many impressions have been served.)

Another way is to vary the number of impressions based on the predicted quality of the item: give good items more impressions.



Summary of Shusen Wang's tutorial: DPP to improve variety in Recommendation

 Determinantal Point Processes (DPP)

Let the vectors span a volume. The larger the volume, the more variety; when the volume is small, there are items similar to each other.

Let V be a d x k matrix, collecting k vectors of d dimensions. The squared volume is then measured by the determinant det(V.transpose * V).

DPP math formula:

   argmax [ log det(V.transpose * V) ]

Hulu applied DPP in recommendation system:

   argmax [ a * total reward + (1- a) * log det(V.transpose * V) ]

  To solve this, greedy search is applied: find one item at a time that has a good reward while not being too close to the already-picked items. Hulu proposed a more efficient way to compute the similarity part, using Cholesky decomposition.

Cholesky decomposition factors a matrix into a lower-triangular matrix multiplied by its own transpose. In Hulu's proposal, the Cholesky decomposition at each step is computed from the decomposition of the previous step to avoid repeated calculation.
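A naive greedy sketch of this objective, without Hulu's Cholesky speedup (reward, V, and a are hypothetical; the rows of V are assumed unit-normalized):

    import numpy as np

    def greedy_dpp(reward, V, k, a=0.7):
        """Greedily pick k items, trading reward against log-det diversity.
        reward: (n,) numpy array of predicted rewards; V: (n, d) embeddings."""
        picked = []
        for _ in range(k):
            best, best_score = None, -np.inf
            for i in range(len(reward)):
                if i in picked:
                    continue
                S = V[picked + [i]]                     # candidate selection
                _, logdet = np.linalg.slogdet(S @ S.T)  # log squared volume of the set
                score = a * reward[picked + [i]].sum() + (1 - a) * logdet
                if score > best_score:
                    best, best_score = i, score
            picked.append(best)
        return picked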


References:

https://www.youtube.com/watch?v=HjpJeUSekKs

https://www.youtube.com/watch?v=wi8xVHiZZr4

https://lsrs2017.files.wordpress.com/2017/08/lsrs_2017_lamingchen.pdf



Tuesday, March 12, 2024

How To Search AWS CloudWatch Log by Date?


Knowing a timestamp at which an activity happened, go to your CloudWatch log group and search log streams by prefix using the date:

Look for streams whose last event time is later than your timestamp. There will be many candidate logs.



Open each candidate log and search for the keyword "INIT_START" to make sure the log started before your timestamp. If it matches, you can then search by other keywords.


Repeat on all candidate log streams.

To locate a particular event in a log stream at a timestamp, search by the keyword "END"; it will give you the start and the end of each request. Knowing the time range of the request, you can then perform further searches.




Tuesday, March 05, 2024

Summary on Shusen Wang's tutorial: Look-alike Prediction Recommendation

 Start with seed users, whose backgrounds we know: age group, education level, income, etc. Then find users who didn't provide much information through similarity, for example via User Collaborative Filtering.

Focus on the users who interacted with a new item and treat them as seed users. Take an average over the seed users' vectors to get one vector. Use this vector to spread to look-alike users (by looking it up in a vector DB). Note that this feature vector needs to be updated as more users interact with the item.
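A tiny sketch of that averaging step (the embeddings and the vector-DB call are hypothetical):

    import numpy as np

    seed_vectors = np.random.randn(100, 64)   # hypothetical seed-user embeddings
    item_vector = seed_vectors.mean(axis=0)   # the item's look-alike vector

    # A real system would now query a vector DB (FAISS, Milvus, etc.) with
    # item_vector to retrieve the nearest look-alike users.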


Reference: https://www.youtube.com/watch?v=pjmRo8Uzzqg

Thursday, February 29, 2024

Shusen Wang's video tutorial: LastN feature, DIN model, SIM model

LastN features come from the last N items that a user interacted with. The process is as simple as calculating the embedding vector of each item; having N vectors, take their average, and this becomes a LastN feature of the user. (An alternative to averaging is attention.)

The DIN model is an alternative to the plain LastN feature: take a weighted average of the feature vectors of the LastN items (this is the same as attention). How to get the weights? Take the 500 items most relevant to the user and average them to get a vector. Use this vector to calculate the similarity with each vector in the LastN features, then apply the similarity scores as weights and take the weighted sum to get a new vector.

How to use the DIN model: given a candidate item recommended to the user, calculate its similarity to each of the LastN items, then take a similarity-weighted sum of the LastN vectors. Use this vector as a user feature.
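A tiny sketch of that similarity-weighted sum (the embeddings are hypothetical):

    import numpy as np

    def din_user_feature(last_n, candidate):
        """last_n: (N, d) embeddings of the user's last N items;
        candidate: (d,) embedding of the item being scored."""
        sims = last_n @ candidate                    # similarity of each item to the candidate
        weights = np.exp(sims) / np.exp(sims).sum()  # softmax over the N items
        return weights @ last_n                      # (d,) weighted sum used as a user feature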

The DIN model needs to calculate attention N times, so N cannot be very large; it is limited to the short-term interest of a user.

SIM model: improves the DIN model by quickly removing irrelevant items from LastN to avoid calculating all attentions. This allows N to be a few thousand. Here is how: for a candidate recommended item, find the k most similar items among the N items (call it TopK), then calculate attention only over those.

   Two steps of the SIM model: 1. search, 2. attention. Hard search: find items by category and keyword. Soft search: take items' embeddings as vectors and do a nearest-neighbor search. Also measure how long ago the item was interacted with, convert that age into a separate feature embedding, and concatenate it with the item feature. The SIM model needs the timestamp feature because it measures a user's long-term behavior (while the DIN model captures short-term behavior).

During the last step, SIM picks items from the list using relevance as the reward, minus the similarity with the already-picked items.

MarginalRelevance(i) = a * reward(i) - (1-a) * max[ sim(i,j) for each already-picked item j ]

When the picked set is big, max[ sim(i,j) ] is close to 1, causing the algorithm to fail. The solution is not to compare similarity against all items in the picked set, but only against the last few within a sliding window.
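A greedy sketch of that selection with the sliding window (reward, sim, and the constants are hypothetical):

    import numpy as np

    def mmr_pick(reward, sim, k, a=0.7, window=10):
        """reward: (n,) relevance scores; sim: (n, n) pairwise similarities."""
        picked = [int(np.argmax(reward))]
        while len(picked) < k:
            best, best_score = None, -np.inf
            for i in range(len(reward)):
                if i in picked:
                    continue
                recent = picked[-window:]      # only compare with the last few picks
                penalty = max(sim[i][j] for j in recent)
                score = a * reward[i] - (1 - a) * penalty
                if score > best_score:
                    best, best_score = i, score
            picked.append(best)
        return picked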


References:

https://www.youtube.com/watch?v=Stbc9goPKXQ

https://www.youtube.com/watch?v=0hPep80Oy6k

https://www.youtube.com/watch?v=_4J9aF5KR84

Monday, February 26, 2024

Summary on Shusen Wang's video of: SENet & Bilinear Cross

SENet was originally used in computer vision, but was later applied to recommendation systems.

Consider discrete attributes on a product and convert them into embeddings to get features: m fields, each expressed as a k-dimensional vector. Take average pooling over each field to make an m * 1 vector. (Why would that help?)

From the m * 1 vector, apply a fully connected layer and a ReLU layer, outputting an (m / r) * 1 vector (r is a compression ratio).

From the (m / r) * 1 vector, apply another fully connected layer and a sigmoid layer, outputting an m * 1 vector.

Now use this m * 1 vector from the last step as weights: apply them to the initial m * k matrix with row-wise multiplication to get another m * k matrix.

(The intuitive meaning of the operations above is to take away irrelevant features from the matrix.)

The core of SENet is to apply a weight to each field. Note that the fields can have features of different dimensions.
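A sketch of the field-weighting path in PyTorch (assuming, for simplicity, that all m fields share the same dimension k; r = 4 is illustrative):

    import torch
    import torch.nn as nn

    class SENetFieldWeight(nn.Module):
        """Squeeze-and-excitation over m embedding fields of dimension k."""
        def __init__(self, m, r=4):
            super().__init__()
            self.excite = nn.Sequential(
                nn.Linear(m, m // r), nn.ReLU(),     # compress to (m / r)
                nn.Linear(m // r, m), nn.Sigmoid(),  # expand back to m weights
            )

        def forward(self, x):             # x: (batch, m, k)
            z = x.mean(dim=2)             # squeeze: average pooling -> (batch, m)
            w = self.excite(z)            # one weight per field
            return x * w.unsqueeze(2)     # row-wise reweighting of each field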


Crossing two vectors between field i and field j (all fields are m-dimensional):

- by inner product:  Xi.transpose * Xj = scalar

- Hadamard product of two vectors (element-wise multiplication) = vector. 


Bilinear Cross:

- by inner product: Xi.transpose * Wij * Xj (a matrix inserted inside the inner product)

    Note that there are too many pairs of fields, so we only take a selective set of field pairs to perform the Bilinear Cross.

- Hadamard product of two vectors: Hadamard( Xi , Wij * Xj )

   Likewise, in practice, a selective set of field pairs is used to perform the Bilinear Cross.
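Both variants in code (the field dimension and the learned matrix are illustrative):

    import torch

    m = 16
    x_i, x_j = torch.randn(m), torch.randn(m)
    W_ij = torch.randn(m, m)              # learned matrix for this field pair

    inner = x_i @ (W_ij @ x_j)            # bilinear inner product -> scalar
    hadamard = x_i * (W_ij @ x_j)         # bilinear Hadamard product -> m-dim vector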


FiBiNet:

- Combine SENet and Bilinear Cross.

- Feed the embeddings of each field through 3 paths: a bilinear network, SENet + bilinear network, and the original embeddings concatenated. Concatenate all 3 vectors together and apply another network.



Reference: https://www.youtube.com/watch?v=nF37qtNvw1E


Saturday, February 24, 2024

Summary on Shusen Wang's video of: LHUC Network (Learning Hidden Unit Contributions)

LHUC is good for sorting at the last step of a recommendation system. The network was originally from research in speech recognition.

Consider the case of speech recognition: take the voice signals as one input vector and the features of the speaker as another input. Pass the two inputs through two separate networks (which could be fully connected networks), and take the Hadamard product of the two outcomes (obviously the output dimensions need to match). Wire the combined result into another fully connected layer.

Take the speaker features as input again, through another fully connected layer, and take the Hadamard product with the main path's result. This process can repeat many times (adding a fully connected layer on the main path and Hadamard-multiplying the transformed speaker input).

Note that the network that transforms the speaker features can be multiple fully connected layers ending with a sigmoid function multiplied by 2. The scale of 2 allows features to be amplified, to adapt to the distinct characteristics of a speaker.

When applying the LHUC network to a recommendation system, the voice signal is replaced with item features, while the speaker features are replaced with user features.
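A sketch of one such gated step in PyTorch (the dimensions are illustrative):

    import torch
    import torch.nn as nn

    class LHUCBlock(nn.Module):
        """Scale the item path by a user-conditioned gate (sigmoid * 2)."""
        def __init__(self, item_dim, user_dim, hidden=64):
            super().__init__()
            self.item_fc = nn.Linear(item_dim, hidden)
            self.gate = nn.Sequential(
                nn.Linear(user_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.Sigmoid(),
            )

        def forward(self, item_x, user_x):
            h = torch.relu(self.item_fc(item_x))
            g = 2.0 * self.gate(user_x)   # scaled sigmoid allows amplification
            return h * g                  # Hadamard product of the two paths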


Reference: https://www.youtube.com/watch?v=TxIedW94hu0


Summary on Shusen Wang's video of: DCN (Deep & Cross Network)

 DCN (Deep & Cross Network) is used to replace fully connected layers.

Cross Layer:

- Take an intermediate output vector xi and pass it through a fully connected layer to get y, of the same size as the input x0. Take the Hadamard product of x0 and y and call it z. Let xi+1 = z + xi be the output:

xi+1 = Hadamard( x0 , W * xi + b ) + xi
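The cross layer in code (a sketch):

    import torch
    import torch.nn as nn

    class CrossLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.fc = nn.Linear(dim, dim)        # computes W * xi + b

        def forward(self, x0, xi):
            return x0 * self.fc(xi) + xi         # Hadamard with x0, plus skip connection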

Cross Network:

- Apply cross layer multiple times with the same x0

Deep & Cross Network:

- Take all features as one vector and feed it through a fully connected layer in one path and a Cross Network in a second path. Concatenate the two results and feed them into another fully connected layer.

- Widely used in industry for recommendation: to sort the items, or used in a shared bottom.

Reference: https://www.youtube.com/watch?v=yNeRO5m63JQ


Friday, February 23, 2024

Summary on Shusen Wang's video of: Converting attributes to Features

Here are a few ways to deal with attribute to make them into features.

  • Discrete values: apply one-hot encoding. Then make embeddings from one-hot encoding input.
  • Continuous values: divide the value range into buckets, and then apply one-hot encoding per bucket.
  • Large continuous values: apply log(1+x), and then treat it as a regular Continuous value.
    • An alternative is to convert large values into ratios. For example, number of clicks into click rate.
The thumbnail image of a record can also be converted into a vector to assist classification / prediction tasks.

User attributes and item attributes are relatively static and change little; they are perfect caching candidates. But user stats and item stats may change on every click / user interaction. It is better to separate the stats from the basic attributes.

Feature crosses are non-linear features derived from existing features with non-linear operations. A weighted sum of a vector (x1, x2, ..., xd) is a linear operation, while multiplication of features creates "crosses". For an example of second-order feature crossing, a weighted sum over the terms xi * xj introduces non-linearity. To visualize why this helps, consider a vector with quantity and price per unit as features: the multiplication of the two predicts the total price effectively, while no linear combination of the two features can describe it accurately.

n features create n ** 2 second-order feature crossings, which could add a lot of new dimensions. To reduce the dimensions, consider the crossing weights as an n * n matrix, and factor it into the multiplication of a low-rank matrix and its transpose.

Call the low-rank matrix V. Then the weight at the (i,j) position is

   weight(i,j) = Vi.transpose * Vj

and the second-order term becomes weight(i,j) * xi * xj.

Using matrix V makes the model a Factorization Machine (FM), which is simply a linear model plus second-order feature crossing.
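A sketch of the factorized second-order term, using the standard identity that computes it without enumerating all pairs (names are illustrative):

    import torch

    def fm_second_order(x, V):
        """x: (n,) features; V: (n, r) low-rank factors.
        Computes sum over i < j of (V_i . V_j) * x_i * x_j."""
        xv = x @ V                           # (r,)
        x2v2 = (x ** 2) @ (V ** 2)           # (r,) diagonal correction
        return 0.5 * (xv ** 2 - x2v2).sum()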

Note: FM has become less popular nowadays in recommendation system.


Collaborative Filtering features based on userIds and itemIds can be more accurate than tag matching. However, for a new item that hasn't had any user interaction, it is impossible to estimate Collaborative Filtering features accurately. Thus, during the cold start of a new item, tag matching is used to take the top K similar items and average their Collaborative Filtering features; use this average as the initial guess for the feature. Keep using the guessed value until enough users have visited the new item.

Use K-means clustering to divide the items into clusters; there could be thousands of clusters. Use the centroid of each cluster as the index for searching. When a new item is created, match it to clusters, for example the top 1000 clusters. Similarly, when a user arrives, take her last N recent items and match clusters; once clusters are found, pick items from them.


Wednesday, February 21, 2024

Summary on Shusen Wang's video of: Ways to combine click rate, like rate, favorite rate, etc. into 1 score

 In multi-target prediction, click probability, like rate, favorite rate, etc. are generated together for an item recommended to a user. They need to be combined into one score to make ordering easy. Here are a few ways to combine the predictions.

Simple weighted sum:

   click_rate + w1 * like_rate + w2 * favorite_rate + ...

Weighted sum with click_rate as a precondition (since a like can only happen after a click):

  click_rate * (1 + w1 * like_rate + w2 * favorite_rate + ... )

Weighted product instead of weighted sum:

   [(1 + w1 * click_rate) ** a1] * [(1 + w2 * like_rate) ** a2] * ...

Weighted sum based on each item's rank when the list is sorted along a single dimension. For example:

   w1 / (click_rank ** a1 + b1) + w2 / (like_rank ** a2 + b2) + ...

For an online purchase, all steps have to happen for a transaction to complete. Thus the probabilities are multiplied to make one score:

   ( click_rate ** a1 ) *    ( cart_rate ** a2 ) *  ( pay_rate ** a3 ) * ( price ** a4 )


Reference: https://www.youtube.com/watch?v=D2iqM2puJ2I

Summary on Shusen Wang's video of: Multi-gate Mixture-of-Experts (MMoE)

The idea is to train multiple expert neural networks and apply a weighted sum to their results. The weights are decided by another neural network that takes the same input and ends with a softmax layer to generate the weights. The vector from the weighted sum is then fed into another neural network layer to predict a regression target.

When multiple predictions (targets) are made, they share the same expert networks, but each target has its own neural network for weight generation, and its own succeeding network to process the vector from the weighted sum.
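A compact sketch of this structure (sizes are illustrative):

    import torch
    import torch.nn as nn

    class MMoE(nn.Module):
        """Shared experts; one softmax gate and one head per target."""
        def __init__(self, in_dim, hidden, n_experts, n_targets):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
                 for _ in range(n_experts)]
            )
            self.gates = nn.ModuleList(
                [nn.Linear(in_dim, n_experts) for _ in range(n_targets)]
            )
            self.heads = nn.ModuleList(
                [nn.Linear(hidden, 1) for _ in range(n_targets)]
            )

        def forward(self, x):
            e = torch.stack([exp(x) for exp in self.experts], dim=1)  # (batch, n_experts, hidden)
            outs = []
            for gate, head in zip(self.gates, self.heads):
                w = torch.softmax(gate(x), dim=-1)                    # per-target expert weights
                mixed = (w.unsqueeze(-1) * e).sum(dim=1)              # weighted sum of experts
                outs.append(head(mixed))
            return outs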

When one weight is close to 1 and the other weights are close to 0, only one expert network performs work, skipping all the others. This polarization is an unwanted configuration that wastes resources. To avoid polarization in the weights, dropout is applied to the softmax output during training, giving a 10% chance to drop the result from each expert network. As a result, putting all the logic into one expert network leads to poor performance, which in turn encourages multiple expert networks to be trained at the same time.

Reference: https://www.youtube.com/watch?v=JIEwaPARjfk

Summary on Shusen Wang's video of: Reduce negative samples and Adjust prediction Rate

Often classifiers are trained on unequal amounts of positive and negative data. It is common practice to down-sample the negative data (keeping a fraction a relative to the positive samples). Reducing the negative samples makes the predicted value larger than the actual occurrence rate; however, the value can be adjusted as follows:

Let

  P_true = n_positive / (n_positive + n_negative)

  P_pred = n_positive / (n_positive + n_negative * a)

And the adjusted expectation can be derived:

   P_true = a * P_pred / ((1-P_pred)  + a * P_pred )


So the predicted expectation can be adjusted back to the scale it would have if trained on the original, unsampled data.
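As a one-liner (a direct transcription of the formula above):

    def adjust_prediction(p_pred, a):
        """Map a prediction trained with negatives down-sampled at rate a
        back to the true-rate scale."""
        return a * p_pred / ((1 - p_pred) + a * p_pred)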


Reference: https://youtu.be/kY4W46MQqsg?si=U33AhcwpvCvQ3G_2&t=526

Tuesday, February 20, 2024

Summary on Shusen Wang's video of: Deep Retrieval

 Deep Retrieval represents the feature vector of an item as a path. It is an online method for matching items to users.

Maintain a lookup from path to a list of items, and a lookup from item to a list of paths. Have a prediction model that matches a path to a user. During online recommendation, given a user, match a set of paths and follow the paths to get the recommended items.

How an item is expressed as a path: each path is a sequential traverse of ordered layers. Each layer has k possible nodes per step; let the number of layers be 3 as an example. We call this depth = 3 and width = k. A path is then described by 3 nodes.

Given a path [a,b,c], multiple items will be retrieved:

  • At layer 1, given a user feature vector x, the prediction model for node a is P1(a | x)
  • At layer 2, given user x and node a, prediction model for node  b is P2(b | a ; x)
  • At layer 3, given user x and node a, b, prediction model for node  c is P3(c | a,b ; x)
  • The interest of user x in path [a,b,c] is:
    •  P(a,b,c | x) = P1(a | x) * P2(b | a ; x) * P3(c | a,b ; x)
The input of the prediction model is x, and its output is a k-dimensional vector. Take a softmax over this output to direct the path walking. The next step could take the largest value, or perform beam search.

The next layer has a different prediction model taking dim(x) + k input dimensions (the input is x concatenated with the k-dimensional output of the first layer), and likewise applies a prediction model and a softmax function.

Likewise for the succeeding layers.
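A sketch of scoring one path under these assumptions (each layer's model is a hypothetical callable returning a softmax distribution over its k nodes):

    import numpy as np

    def path_probability(x, layer_models, path):
        """P(a,b,c | x) = P1(a|x) * P2(b|a;x) * P3(c|a,b;x)."""
        prob, inp = 1.0, np.asarray(x, dtype=float)
        for model, node in zip(layer_models, path):
            dist = model(inp)                  # (k,) softmax over this layer's nodes
            prob *= dist[node]
            inp = np.concatenate([inp, dist])  # next model sees x plus this output
        return prob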

So each path consists of a series of nodes and has a probability P(a,b,c | x) for a user. For an item, multiply P by 1 for each user x who clicked on the item and by 0 for each user who didn't, then sum the P values over all users. This summation is the score of the given path for the item.

Item feature:

Take the sum of the item's scores over a few paths; the negation of this sum is the loss function for the item. We'd like to minimize this negation so that the model gives large scores to the item's paths.

Note that there could be a popular path that gives large scores to many items. To prevent such a path from dominating, add a penalty on paths that represent many items (by subtracting the penalty from the item-given-path score function).

Let reg(path) be the number of items represented by a path. Then, to find the path that represents an item with the smallest loss:

   path = argmin over paths [ loss(item) + a * reg(path) ]


During training, first update the model that predicts a path given a user. Then update the item's paths using the trained model.


Reference: https://www.youtube.com/watch?v=BYtzZ48hRFM (voiced in Chinese)


Monday, February 19, 2024

Summary on Shusen Wang's video of: Self-supervised Learning

Often we have data that has features (keywords) but no classification information. For example, videos on a website can have keywords, but we may not yet have the user information to know whether a certain feature leads a user to like the video. However, since there are many keywords on the same video, it is logical to believe that feature vectors based on different keywords of the same video should be close to each other, and more distant from feature vectors based on keywords of different videos. Separating the feature vectors using this information is a self-supervised learning process.

Random mask: randomly mask the keywords of certain attributes.

Dropout is another technique for collecting features from the same item during training: randomly take away a percentage of the keywords (assuming there are many). In this way, the feature of an item is more generalized.

Complementary features: split the keywords of one item into two sets, and map each set into a feature vector (thus the two are complementary features), each seeing only 50% of the keywords. Since they are from the same item, the two vectors should have large cosine similarity.

Mask related features: calculate the mutual information between pairs of keywords over the items, MI(U,V) = sum over keyword pairs (u,v) of p(u,v) * log( p(u,v) / (p(u) * p(v)) ), where p(u,v) is the probability that the two keywords appear on the same item. Then, for each keyword k, find the half of the other keywords most related to k; mask those closely related keywords and keep the 50% that are less related.
(In practice, the method of masking related features is hard to calculate and hard to maintain.)

How to train: convert all records into vectors by applying multiple mask techniques and multiplying by a matrix to map each into a vector. Pair the vectors and calculate similarities, then feed the similarities into a softmax layer. Let the expected cosine similarity of vectors from the same object be 1, and of vectors from different objects be 0. Take the cross-entropy loss between the calculated and expected values.

It is also possible to combine the loss function of self-supervised learning with the loss function of contrastive learning by taking a simple summation.

Reference: https://www.youtube.com/watch?v=Ra3MVhneR9E (voiced in Chinese)

Saturday, February 17, 2024

Quick Note: Contrast Learning in DSSM

Contrastive learning is a way to set up the loss function. With a point-wise loss function, negative samples and positive samples are considered separately. In contrastive learning, the loss function (a.k.a. pair-wise loss) is constructed by taking 3 items: 2 positively related items and 1 unrelated item. The cosine similarity of the 2 positively related items should be large, and the similarity of the unrelated pair should be small. The difference of the two similarity values is used as the loss: the further apart the two values, the better. Since the optimization step reduces the loss, training fits both samples.

To separate the positive sample and the negative sample better, a desired margin m is usually defined. If the difference is larger than m, there is no loss; if the difference is less than m, the loss is the difference.

So the loss function is written as 

    Loss(a, b_positive, b_negative) = max{0, cos(a, b_negative) - cos(a, b_positive) + m}

It is also possible to write the function in logistic loss:

    Loss(a, b_positive, b_negative) = log(1 + exp[ sigma * (cos(a, b_negative) - cos(a, b_positive)) ] )


List-wise loss follows a similar idea to the pair-wise loss but considers more samples in the same loss function. Each training record consists of 1 positive pair (a, b) and multiple negative samples (b_neg_1, b_neg_2, etc.). Take the cosine similarity between the pairs: cos(a,b), cos(a, b_neg_1), cos(a, b_neg_2), etc. Put all results into a softmax function, let the expected result be (1, 0, 0, 0, ...), and train with the cross-entropy loss between the result and the expectation.
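Both losses as sketches in PyTorch (the margin value and shapes are illustrative; inputs are (batch, d) embeddings):

    import torch
    import torch.nn.functional as F

    def pairwise_hinge_loss(a, b_pos, b_neg, m=0.5):
        """Pairwise (triplet) loss with margin m on cosine similarities."""
        return torch.clamp(
            F.cosine_similarity(a, b_neg, dim=-1)
            - F.cosine_similarity(a, b_pos, dim=-1) + m,
            min=0.0,
        ).mean()

    def listwise_loss(a, b_pos, b_negs):
        """Softmax over [pos, neg_1, ..., neg_k] similarities, cross-entropy
        against target class 0 (the positive pair)."""
        sims = torch.stack(
            [F.cosine_similarity(a, b, dim=-1) for b in [b_pos] + list(b_negs)],
            dim=-1,
        )
        target = torch.zeros(sims.shape[0], dtype=torch.long)
        return F.cross_entropy(sims, target)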


Note: when using a point-wise loss function (that is, training negative and positive samples separately), the ratio of positive samples to negative samples should be between 1:2 and 1:3.


Reference:

https://www.youtube.com/watch?v=2Mc10LZ-DB0 (voiced in Chinese)

Friday, February 16, 2024

Quick note: Collaborative Filtering vs. Swing

You have probably heard of Collaborative Filtering, which finds the similarity of two items by counting the number of shared interested users, matched via cosine similarity (essentially a normalized dot product).

There is a problem: when two users belong to the same interest group, the items they share may not actually be that similar.

Swing takes one more step: assign a weight to each pair of shared users. If the two users have shared a lot of items, make the weight of the pair smaller: 1 / (a + overlap(u1, u2)), where a is a positive term to avoid dividing by zero.
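A small sketch of that weighting (the data structures are hypothetical):

    from itertools import combinations

    def swing_similarity(users_i1, users_i2, user_items, a=1.0):
        """users_i1 / users_i2: sets of users who liked each item;
        user_items: map from user to the set of items the user liked."""
        shared = users_i1 & users_i2
        score = 0.0
        for u1, u2 in combinations(shared, 2):
            overlap = len(user_items[u1] & user_items[u2])
            score += 1.0 / (a + overlap)   # pairs that co-like many items count less
        return score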


Collaborative Filtering can also be applied to the similarity of users: recommendations are made from similar users' lists of items. To prevent popular items from appearing in every recommendation, the weight of an item is reduced by its popularity with 1 / log(1 + num_users_like_the_item).


Reference:

https://www.youtube.com/watch?v=DUUMNTDuJ3Q

https://www.youtube.com/watch?v=7O9zFMNdrZ8


Tuesday, February 06, 2024

How to get Active Contour

A summary of the video tutorial Active Contour | Boundary Detection.

1. Draw a circle around the image (roughly around the object). This is the initial contour.

2. Calculate the gradient of the image.

3. Blur the gradient values in the image so that the values spread to other parts of the image, creating a slope that guides a point toward positions with large gradient values.

4. Apply greedy search from the contour to move control points to the positions with large gradient values.

5. It is almost done, but the contour is not smooth. A modification to this solution is to adjust the gradient value with the smoothness and elasticity of the contour, defined as the first and second derivatives along the contour.


Saturday, January 20, 2024

Gaussian Mixture Model in Object Tracking

Summary on this Object Tracking lesson: https://www.youtube.com/watch?v=0nz8JMyFF14

The overall idea is to look for large supporting evidence relative to a small standard deviation: if this value is large, the pixel is background; otherwise, foreground.

The evidence is probably the value of the point under the Gaussian model, or simply the distance from the mean of the Gaussian. The lesson didn't say clearly.

For each pixel there is a Gaussian Mixture Model. Compute a pixel color histogram over the first N frames of a video, normalize the histogram, and model it as a mixture of 3 to 5 Gaussians.

During detection, a value X with | X - gaussian_center | < 2.5 * standard_deviation is considered part of that Gaussian distribution.

To make the model adapt to new data, update Histogram H using the new pixel intensity. If new histogram differs from the old histogram a lot, refit the GMM.

A GMM can be fit using Expectation Maximization: start by randomly picking means and standard deviations, cluster data points using these Gaussians, then refine the means and standard deviations; repeat until the Gaussians no longer change (convergence). Or just use Python's sklearn.mixture.GaussianMixture.
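A sketch of the sklearn route for a single pixel (the sample data is a placeholder; real code fits one model per pixel):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    samples = np.random.rand(500, 1)    # intensities of one pixel over N frames
    gmm = GaussianMixture(n_components=3).fit(samples)

    def is_background(x, gmm, k=2.5):
        """A value belongs to a component if it is within k standard
        deviations of that component's mean."""
        means = gmm.means_.ravel()
        stds = np.sqrt(gmm.covariances_.ravel())
        return bool(np.any(np.abs(x - means) < k * stds))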

Tuesday, January 16, 2024

Reinforcement Learning: Actor Critic

Why Reinforcement Learning?

Reinforcement learning is typically used to generate a sequence of actions to achieve a goal. Note that it differs from a classification problem, which answers whether a given feature vector should have a yes/no (or categorical) answer. However, each step in the sequence of actions may be decided like a classification problem: given a feature vector for the current environment state, what is the next action to take toward the goal state? Main differences:

  • a sequence of actions (trajectory) vs. a single decision
  • environment may be a black box.

The goal state is assigned a reward to guide the actions. There are 2 approaches: find a policy function that, given a state S, returns the probability of picking each action a; or find a value function that predicts the reward of a state, and use it to greedily pick the next action a that gives the most reward (the observed reward at the next state plus the estimated future reward from there).

Note that with a policy function, actions are picked randomly by their probabilities. The policy function can be used to sample a few trajectories to test out the rewards, and the average represents how good a state is. In reality there could be many trajectories; we just need a few samples for a Monte Carlo estimate.

Describe RL in a more human-understandable fashion

It is a maze game: you are asked to find a path from point A to point B.

But you don't know what the maze looks like. However, with a BBQ chicken placed at point B, you can follow the smell from point A to reach point B. The BBQ chicken here is the reward, and the smell is the expected future reward (call it the return).

But the smell isn't there, either. To implement the smell, we need to assign every step toward the food a greater imagined smell value. This is the discounted return.

But we don't know where the food is, either. So we take random actions until reaching the BBQ, and then assign our steps imagined smell values.

If we step on poop during the random exploration, assign those steps negative smell values so that they are visited less often. Then try walking again, following the imagined smell.

How to assign an imagined smell value to a step? Use a neural network, or any model.

Applications:

  • Chatbot - outputs a sequence of words, to provide answer to a question
  • AlphaGo - the game of Go; use a sequence of moves to reach the best state
  • Control - plan a trajectory to control robot / game to reach a certain state.

Q Function

The Q function is the quality of an action in a state S. It is usually estimated from an observed trajectory, as the sum of the observed rewards from an intermediate step i to the end state. This sum is also called the discounted return (discounted because it is a weighted sum: a reward at a further step has less weight).

Actor Critic

In Actor-Critic, two networks are trained. The Critic network gives a score of how good an action is in an environment state (so it takes 2 variables: action and state); it is trained on rewards from the system. The Actor network chooses a sequence of actions, one per environment state, so that the average score from the Critic network is higher (so it takes 1 variable: state); it is trained on the scores from the Critic network.

Proximal Policy Optimization

PPO is an improvement on Actor-Critic where the update to the actor is capped to avoid taking too large a step. This can be done in 2 ways: limit the loss (if it is too large, clip it to the max allowed value); or clip the ratio between the probability of an action under the new policy and under the old policy.
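A sketch of the clipped-ratio variant (eps = 0.2 is a common convention, not from the video):

    import torch

    def ppo_clipped_objective(ratio, advantage, eps=0.2):
        """ratio = pi_new(a|s) / pi_old(a|s); advantage estimates how much
        better the action was than average. Clipping caps the update size."""
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
        return torch.min(unclipped, clipped).mean()   # maximize this objective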

Trust Region Policy Optimization

In Trust Region Policy Optimization, a function L is used to approximate the target function whose parameters we want to find. L and the target function are only similar within a small range (the trust region). Within this range, find the parameters that maximize L, and take them as the parameters of the target function. Then build the approximation L again around the new parameters. The process alternates between approximation and maximization.

The Monte Carlo method can be used to make the approximation L, by taking sample trajectories.

To make sure the new parameters after maximization stay within a small region around the old parameters, KL divergence is used to compare the probability distributions of the policy functions. Alternatively, the distance between the parameters can be measured directly.