Tuesday, April 16, 2024

Quick Tutorial: TinyLlama Fine-Tuning

TinyLlama is an LLM small enough to run on your laptop. It generates reasonably good replies, and people are fine-tuning it for their own purposes. I am going to show you how to play with it.

How to play with a model on Hugging Face?

Go to the Hugging Face page; it already includes instructions on how to start the app: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

How to run it locally?

Download the files to your disk from the "Files" tab:

https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main

You will have to click on the download icon to download each file; copying and pasting the text leads to corrupted files. Put them all in one folder and run a Python script that loads the model from that folder.
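
A minimal sketch, assuming the files went into a local folder named ./TinyLlama-1.1B-Chat-v1.0 (the folder name is just an example; point it at wherever you saved the files):

import torch
from transformers import pipeline

# load the model from the local folder instead of the Hugging Face hub
pipe = pipeline(
    "text-generation",
    model="./TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "What is TinyLlama?"}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])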


Which one to use?

There are a few models. Use the chat model; it gives better results for implementing a question/answer chatbot.

All models are good at continuing the text that you provide. The chat model separates questions and answers well by distinguishing user input from assistant output: surround user input with the <|user|> tag, ending it with </s>, and surround the desired reply with <|assistant|>, also ending it with </s>.
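
For example, a single question/answer turn in raw text form looks roughly like this (the exact whitespace comes from the model's chat template):

<|user|>
What is the capital of France?</s>
<|assistant|>
Paris is the capital of France.</s>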

You can also use it without tags, but expect the output to look more like creative writing than a conversation.

In addition to "text-generation", the Transformers pipeline has a "conversational" mode, which allows the input to be a list of maps, where each map is a message in the conversation. (It converts the list of maps into tagged text for you.)
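
A sketch of the conversational mode, assuming a transformers version that still ships the Conversation helper (newer releases deprecate it in favor of passing the list of maps straight to the "text-generation" pipeline):

from transformers import pipeline, Conversation

chatbot = pipeline("conversational", model="./TinyLlama-1.1B-Chat-v1.0")
# each message is a map with a "role" and a "content" key
conversation = Conversation("What is the capital of France?")
result = chatbot(conversation, max_new_tokens=64)
print(result.generated_responses[-1])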


How to fine-tune with PEFT?

Use this script. It picks a small percentage of certain neural network components to fine-tune:

https://gist.github.com/yuhanz/740ab396f237e8d2f41f07a10ccd8de9

You want to initialize TrainingArguments with push_to_hub=False so that it doesn't require Hugging Face credentials. Replace the model name with the relative path to the folder on your disk that contains the model files.
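
For example (output_dir and the hyperparameters below are placeholders, not values from the gist):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./tinyllama-lora",     # any local folder
    per_device_train_batch_size=1,
    num_train_epochs=1,
    learning_rate=2e-4,
    push_to_hub=False,                 # no Hugging Face credentials required
)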

Prepare the training data as conversations separated by tags. Use the tokenizer to convert the structured data into tagged text:

text = tokenizer.apply_chat_template(messages, tokenize=False)
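
A sketch of what one training example might look like, assuming the raw data is already structured as role/content maps:

messages = [
    {"role": "user", "content": "What is TinyLlama?"},
    {"role": "assistant", "content": "TinyLlama is a 1.1B-parameter Llama-style chat model."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
# "text" is now one string with <|user|> ... </s> <|assistant|> ... </s> tags,
# ready to be used as a training row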

Note that the max_seq_length configuration decides the size of the input and the generated text. This may depend on the model's tokenizer.

References:

 https://www.kaggle.com/code/tommyadams/fine-tuning-tinyllama

https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms

https://huggingface.co/docs/transformers/main/chat_templating


Thursday, April 04, 2024

Handling AWS API Gateway Error: Invalid HTTP endpoint encoding detected for URI

I encountered this error at work when deploying AWS API Gateway:

 Invalid HTTP endpoint encoding detected for URI (Service: ApiGateway, Status Code: 400

That was because the URI contained encoded %xx characters that AWS API Gateway doesn't like (even though they are valid and work well with curl). Replacing those characters fixed the issue. Here is a list of unwelcome characters that I found from my experiments:

%21

%24

%2C

%3A

%0A

They correspond to the characters !, $, comma (,), colon (:), and newline.
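
A small sketch of the workaround, with a hypothetical sanitize helper that turns the rejected escapes back into literal characters (the newline is simply dropped):

OFFENDING = {"%21": "!", "%24": "$", "%2C": ",", "%3A": ":", "%0A": ""}

def sanitize(uri: str) -> str:
    # replace the percent-encodings that API Gateway rejected in my tests
    for escape, literal in OFFENDING.items():
        uri = uri.replace(escape, literal)
    return uri

print(sanitize("https://example.com/items%2C100%3A200"))  # -> https://example.com/items,100:200
print(sanitize("https://example.com/a%3Fb"))               # unrelated escapes stay untouched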

Friday, March 29, 2024

Summary on Outlier's tutorial: Cross Attention

Self-attention: we'd like to find a weighted sum of feature vectors for a series of inputs, whether that is a paragraph of text or a list of image segments. The feature vector for each input vector is calculated with the Value matrix. The weight is calculated from the Query matrix (applied to the vector in focus) multiplied by the Key matrix (applied to each vector in the input).

Cross-attention: now consider two inputs, one image and one text. The Query matrix applies to the image, while the Key and Value matrices apply to the text. Compute the attention between each image fragment and each word. Taking these values as weights, apply them to the Value vectors to create a weighted sum. You can then multiply this vector by another matrix to transform it back to the dimension of the input image and construct a (different) image.
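
A numpy sketch of one cross-attention step (dimensions and matrices are made up for illustration):

import numpy as np

n_img, n_txt = 49, 12         # image fragments and text tokens
d_img, d_txt, d = 32, 16, 64  # input dims and shared attention dim

image = np.random.randn(n_img, d_img)   # image fragment features
text = np.random.randn(n_txt, d_txt)    # word embeddings

Wq = np.random.randn(d_img, d)   # Query matrix: applied to the image
Wk = np.random.randn(d_txt, d)   # Key matrix: applied to the text
Wv = np.random.randn(d_txt, d)   # Value matrix: applied to the text
Wo = np.random.randn(d, d_img)   # projects the result back to the image dimension

Q, K, V = image @ Wq, text @ Wk, text @ Wv
scores = Q @ K.T / np.sqrt(d)                                   # each image fragment vs. each word
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                  # softmax over the words
new_image = (weights @ V) @ Wo                                  # weighted sum of values, back to image space
print(new_image.shape)                                          # (49, 32)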


 Reference: https://www.youtube.com/watch?v=aw3H-wPuRcw

Thursday, March 28, 2024

Summary on Shusen Wang's tutorial: Item-to-Item (I2I) model

The most common usage is User-Item-Item: a user has a list of favorite items. Use the favorite items to find similar items (a.k.a. Item-to-Item), and recommend those items to the user.

How to measure similarity: if a user likes two items, that contributes to the similarity between the two items (as in Item Collaborative Filtering).

Another way to measure similarity: compare items' feature vectors.

User-User-Item: User 1 is similar to User 2. User 2 likes item A. So recommend item A to User 1.

User-Author-Item: User 1 likes Author 2, and Author 2 has item A. So, recommend item A to user 1.

User-Author-Author-Item: User 1 likes Author 2. Author 2 and Author 3 are similar. So recommend Author 3's items to User 1.

Using multiple models, each with its own share of the retrieval quota, achieves better results than using just one model to retrieve all items.


Summary on Shusen Wang's tutorial: Boost New Items

New items cannot compete with older items. Thus it is necessary to give new items a boost in search when they are created, to guarantee some impressions.

It is difficult to assign the boost score, and it often leads to giving too many impressions to new items. Thus limiting impressions is necessary. 

One way is to vary the boost score according to the impressions that have already been served. (Ex: create a table that reduces the boost value depending on how close the item is to its boost deadline and how many impressions have been served.)

Another way is to vary the number of impressions based on the predicted quality of the item: give good items more impressions.



Summary of Shusen Wang's tutorial: DPP to improve variety in Recommendation

 Determinantal Point Processes (DPP)

Let the item vectors span a volume. The larger the volume, the more variety; when the volume is small, the items are similar to each other.

Let V be a d x k matrix collecting k vectors of d dimensions. The (squared) volume is calculated as the determinant of the Gram matrix V.transpose * V.

DPP math formula:

   argmax [ log det(V.transpose * V) ]

Hulu applied DPP in recommendation system:

   argmax [ a * total reward + (1- a) * log det(V.transpose * V) ]

  To solve this, greedy search is applied: pick one item at a time that has a good reward while not being too close to the already picked items. Hulu provided a more efficient way to compute the similarity part, using Cholesky decomposition.

Cholesky decomposition factors a matrix into a triangular matrix multiplied by its own transpose. In Hulu's proposal, the Cholesky decomposition at each step is computed from the decomposition of the previous step to avoid repeated calculation.
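
A naive sketch of the greedy step, recomputing log det from scratch each time (Hulu's incremental Cholesky update avoids exactly this repeated work):

import numpy as np

def greedy_dpp(V, reward, a=0.7, k=5):
    # V: (d, n) matrix of unit-norm item vectors, reward: (n,) relevance scores
    n = V.shape[1]
    selected, candidates = [], set(range(n))
    for _ in range(min(k, n)):
        def score(i):
            S = V[:, selected + [i]]
            _, logdet = np.linalg.slogdet(S.T @ S)   # log of the squared volume
            return a * reward[i] + (1 - a) * logdet
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

V = np.random.randn(8, 50)
V /= np.linalg.norm(V, axis=0, keepdims=True)
print(greedy_dpp(V, np.random.rand(50)))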


References:

https://www.youtube.com/watch?v=HjpJeUSekKs

https://www.youtube.com/watch?v=wi8xVHiZZr4

https://lsrs2017.files.wordpress.com/2017/08/lsrs_2017_lamingchen.pdf



Tuesday, March 12, 2024

How To Search AWS CloudWatch Log by Date?

 

Knowing a timestamp at which an activity happened, go to your CloudWatch log group and search the log streams by prefix using the date:

Locate the streams whose last event time is later than your timestamp. There will be many candidate log streams.



Open each candidate log stream and search for the keyword "INIT_START" to make sure the log started before your timestamp. If it matches, you can then perform other keyword searches.


Repeat on all candidate log streams.

To locate a particular event in a log stream at a timestamp, search for the keyword "END"; it will give you the start and end of each request. Knowing the time range of the request, you can then search further within it.
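
The same flow can be scripted with boto3; a sketch, assuming a Lambda log group whose stream names start with the date:

import boto3

logs = boto3.client("logs")
log_group = "/aws/lambda/my-function"   # hypothetical log group
prefix = "2024/03/12"                   # Lambda streams are prefixed with the date
target_ms = 1710201600000               # the timestamp of interest, in epoch milliseconds

streams = logs.describe_log_streams(
    logGroupName=log_group,
    logStreamNamePrefix=prefix,
)["logStreams"]

# keep candidate streams whose last event is later than the timestamp
candidates = [s for s in streams if s.get("lastEventTimestamp", 0) >= target_ms]

for stream in candidates:
    events = logs.filter_log_events(
        logGroupName=log_group,
        logStreamNames=[stream["logStreamName"]],
        filterPattern="INIT_START",
    )["events"]
    if events and events[0]["timestamp"] <= target_ms:
        print("started before the timestamp:", stream["logStreamName"])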




Tuesday, March 05, 2024

Summary on Shusen Wang's tutorial: Look-alike Prediction Recommendation

Start with seed users, whose background we know: age group, education level, income, etc. Then find users who didn't provide much information by similarity, for example with User Collaborative Filtering.

Focus on the users who interacted with a new item and treat them as seed users. Average the seed users' vectors to get a single vector, and use it to spread the item to look-alike users (by looking it up in a vector DB). Note that this feature vector needs to be updated as more users interact with the item.
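
A small sketch of the averaging-and-lookup idea (the random embeddings and the brute-force cosine search stand in for a real vector DB):

import numpy as np

seed_user_vectors = np.random.randn(100, 64)      # users who interacted with the new item
all_user_vectors = np.random.randn(10000, 64)     # the wider user population

item_vector = seed_user_vectors.mean(axis=0)      # average of the seed users
item_vector /= np.linalg.norm(item_vector)

normed = all_user_vectors / np.linalg.norm(all_user_vectors, axis=1, keepdims=True)
scores = normed @ item_vector                     # cosine similarity to every user
look_alike_users = np.argsort(-scores)[:50]       # spread the item to the closest users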


Reference: https://www.youtube.com/watch?v=pjmRo8Uzzqg

Thursday, February 29, 2024

Shusen Wang's video tutorial: LastN feature, DIN model, SIM model

The LastN feature comes from the last N items that a user interacted with. The process is as simple as calculating an embedding vector for each item; having N vectors, take their average, and this becomes a LastN feature of the user. (An alternative to the average is attention.)

The DIN model is an alternative to the plain LastN feature: take a weighted average of the feature vectors of the LastN items. (This is the same as attention.) How to get the weights? Take the 500 items most relevant to the user and average them to get a vector. Use this vector to calculate the similarity with each vector in the LastN features, then apply the similarity scores as weights and take the weighted sum to get a new vector.

How to use the DIN model: given an item recommended to the user, calculate its similarity to each of the LastN items, and then take a similarity-weighted sum of the LastN vectors. Use this vector as a user feature.
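
A sketch of that weighted sum (the softmax weighting is an assumption; the tutorial only says the similarity scores are used as weights):

import numpy as np

def din_user_feature(last_n, candidate):
    # last_n: (N, d) embeddings of the user's LastN items; candidate: (d,) item embedding
    scores = last_n @ candidate                  # similarity of the candidate to each LastN item
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # normalize (softmax)
    return weights @ last_n                      # similarity-weighted sum -> user feature

print(din_user_feature(np.random.randn(50, 16), np.random.randn(16)).shape)  # (16,)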

The DIN model needs to calculate attention N times, so N cannot be very large. It is limited to the short-term interest of a user.

SIM model: to improve the DIN model, quickly remove irrelevant items from LastN to avoid calculating all the attention scores. This allows N to be a few thousand. Here is how: for a candidate recommended item, find the k most similar items among the N items (call them the TopK), then calculate attention.

   The SIM model has two steps: 1. search, 2. attention. Hard search: find items by category and keyword. Soft search: take the items' embeddings as vectors and search. Measure how long ago the item was interacted with relative to now, convert that into a separate feature embedding, and concatenate it with the item feature. The SIM model needs the timestamp feature because it models a user's long-term behavior (while the DIN model covers short-term behavior).

During the last step, items are picked from the list using relevance as the reward, minus the similarity to the already picked items.

   MarginalRelevance(i) = a * reward(i) - (1 - a) * max[ sim(i, j) for each already picked item j ]

When the picked set is large, max[ sim(i, j) ] gets close to 1, causing the algorithm to fail. The solution is not to compare similarity against all items in the picked set, but only against the last few items in a sliding window.
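
A sketch of the re-ranking loop with the sliding window:

import numpy as np

def mmr_rerank(reward, sim, k=10, a=0.7, window=5):
    # reward: (n,) relevance scores; sim: (n, n) pairwise item similarities
    n = len(reward)
    picked, candidates = [], set(range(n))
    for _ in range(min(k, n)):
        recent = picked[-window:]                # only compare with the sliding window
        def score(i):
            penalty = max(sim[i][j] for j in recent) if recent else 0.0
            return a * reward[i] - (1 - a) * penalty
        best = max(candidates, key=score)
        picked.append(best)
        candidates.remove(best)
    return picked

S = np.random.rand(30, 30)
print(mmr_rerank(np.random.rand(30), (S + S.T) / 2))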


References:

https://www.youtube.com/watch?v=Stbc9goPKXQ

https://www.youtube.com/watch?v=0hPep80Oy6k

https://www.youtube.com/watch?v=_4J9aF5KR84