Thursday, August 17, 2023

A short explanation of PEFT: Parameter-Efficient Fine-Tuning

Many pretrained large language models are out there for us to use. However, they may not be accurate for our particular purpose, so the model needs fine-tuning.

Since the model is large, the idea is to make a copy of the existing model, select a small percentage of its parameters as trainable, and then train that copy with your own data.
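
For example, with Hugging Face's peft library the setup looks roughly like the sketch below (the base model name and the LoRA hyperparameter values are illustrative assumptions, not requirements):

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

# Load a pretrained seq2seq model (the model name here is just an example).
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# The task_type must match the kind of model being wrapped.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,               # rank of the low-rank update matrices
    lora_alpha=32,     # scaling factor for the update
    lora_dropout=0.1,
)

# Wrap the base model; only the small LoRA matrices become trainable.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()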




Note that the library does not work with any arbitrary model you created: the task_type=TaskType.SEQ_2_SEQ_LM parameter in LoraConfig sets an expectation about the model's architecture.

LoRA adjusts the weights by summing the existing weight matrices with the product of low-rank matrices, a trick that produces a large matrix update while adding only a small number of parameters (I explained this trick in the August 10 note below). Since only a small percentage of the parameters are trainable, the training is relatively fast.
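
As a toy sketch of that summation (the sizes, rank, and scaling below are made-up values; this is not the peft implementation itself):

import torch

d_out, d_in, r = 512, 512, 8        # example sizes and rank (made up)
W = torch.randn(d_out, d_in)        # pretrained weight matrix, kept frozen
A = torch.randn(r, d_in) * 0.01     # small trainable matrix (random init)
B = torch.zeros(d_out, r)           # small trainable matrix (zero init, so B @ A starts at 0)
alpha = 32                          # scaling factor (example value)

x = torch.randn(d_in)               # one input vector
# Forward pass: the frozen weight plus the scaled low-rank update.
y = W @ x + (alpha / r) * (B @ (A @ x))

# Only A and B are trained: 2 * 512 * 8 = 8,192 parameters
# instead of the 512 * 512 = 262,144 in the full matrix.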


This video explains how the LoRA training works internally: 
https://www.coursera.org/learn/generative-ai-with-llms/lecture/NZOVw/peft-techniques-1-lora




Thursday, August 10, 2023

PyTorch: How to clear GPU memory

import gc
import torch

# Drop references to large objects first, e.g.:
# del optimizer
# del model
gc.collect()               # let Python free the now-unreferenced objects
torch.cuda.empty_cache()   # release cached GPU memory back to the driver

Quick Note: Training with Low-rank Matrices

When training a large matrix M with W x H parameters is expensive, decompose it into the product of two smaller matrices instead. For example, let matrix A have size W x 3 and matrix B have size 3 x H, so that A * B = M gives back a matrix of W x H dimensions. Since W * 3 + 3 * H < W * H (for reasonably large W and H), far fewer parameters are required.
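
A quick numeric sketch of the savings (the sizes below are made up):

import torch

W, H, r = 1000, 800, 3            # made-up sizes; r = 3 is the rank
A = torch.randn(W, r)             # W x 3
B = torch.randn(r, H)             # 3 x H
M = A @ B                         # W x H matrix built from the two small ones

full_params = W * H               # 800,000
low_rank_params = W * r + r * H   # 3,000 + 2,400 = 5,400
print(M.shape, full_params, low_rank_params)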

This technique is mentioned in both of the following videos:

 https://www.coursera.org/learn/generative-ai-with-llms/lecture/NZOVw/peft-techniques-1-lora

https://youtu.be/exVPXVFPMDk?t=205

Wednesday, August 02, 2023

Details in Positional Encoding for Transformer

The "Attention Is All You Need" paper mentions positional encoding but leaves out some details. I am going to write down my understanding of those details.

The formula is the following:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

The paper says that i is the dimension index and d_model is the dimension of the embedding. If that were literally the case, then for the last index i = d_model - 1, the dimension 2i would be out of bounds, so that explanation cannot be correct as stated.

2i and 2i+1 here refer to the even and odd dimension indices: apply the sine function at the even dimension indices and the cosine function at the odd ones. So i ranges over [0, d_model/2), and each i generates two dimensions.

Once the PE (Positional Encoding) value for a position is computed, it is added to the embedding of the input, as shown in the diagram on page 3:

new_embedding[pos, 2i]   = embedding[pos, 2i]   + PE(pos, 2i)
new_embedding[pos, 2i+1] = embedding[pos, 2i+1] + PE(pos, 2i+1)

The embedding variable here holds the embedding of each word in a sentence, and pos is the position of the word within that sentence. (The index runs over the sentence, not the whole dictionary.)
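
A small sketch of this interpretation (the sentence length and embedding size below are made-up values):

import math
import torch

seq_len, d_model = 10, 16                      # made-up sentence length and embedding size
embedding = torch.randn(seq_len, d_model)      # one embedding vector per word in the sentence

pe = torch.zeros(seq_len, d_model)
for pos in range(seq_len):
    for i in range(d_model // 2):              # i ranges over [0, d_model / 2)
        angle = pos / (10000 ** (2 * i / d_model))
        pe[pos, 2 * i] = math.sin(angle)       # sine at the even dimension index
        pe[pos, 2 * i + 1] = math.cos(angle)   # cosine at the odd dimension index

new_embedding = embedding + pe                 # added to the input embeddings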

This part of the StatQuest video clearly explains how the embedding is calculated.