Thursday, December 28, 2023

Study Note: Stochastic Gradient Descent, Dueling network, Neural Architecture Search

Mini-batch Stochastic Gradient Descent:

  • Stochastic Gradient Descent with pytorch:
    • keep the gradient in general (with  parameter momentum > 0)
    • work with large data by choosing a mini-batch - using DataLoader to randomly choose datapoints to update.

Dueling network:

  • Write the Q* function to be the sum of two neural networks:
    • Q(s, a) = V(s) + A(s,a) - max A(s,a)
    • Note that since Q depends on the sum of two neural networks,  under the same Q value, the two Neural networks could be unstable by shifting the center from network output 1 and adding the shifted portion to the network output 2.  Fortunately, the max A(s,a) term prevents this from happening, as changing shifting value would the  max A(s,a) term to change, resulted as varying Q. The max A(s,a) term in practice can be replaced with mean A(s,a) for better performance.

  

Neural Architecture Search

  • You have laid out a set of neural network layers for the same purpose, or maybe the esame neural network layer but with different choices of number of parameter - should I use 20 neurons or 30 neurons? or should ResNet works better one part of the layer than just FullyConnected? etc. The goal of the neural architecture search is to find out a better configuration.
    • Naive approach: give every configuration a try. If there are 3 candidates neural network layer, train on every one configuration. This becomes out of hand quickly when there are layers connected in series - the numbers of choices are multiplied.
    • Differentiable Neural Architecture Search:
      • Construct a super net by connecting all candidate layers that you want to try in 1 single network in parallel, and sum the results from each layer by a weighted sum using a Softmax layer. The weights are learnable parameters. Then train the super net. All layers will be trained, and their importance will be listed in the learned weight. The one with the largest weight is the winner. Keep that layer and throw away those layers with smaller weights.
      • It was explained in this video (Chinese). It is also possible to add running time to the  weight so that the choice from this process considers both accuracy and performance.


 

Study note: Bipartite Graph

14-1Bipartite Graphs 

  • Bipartite graph is one with two sets of nodes. Edges can exist between the sets, but not among a set.
  • Algorithm to identify if a Graph is Bipartite: given 2 colors, traverse nodes color them color them and color the neighbor nodes with different colors. If a node has a conflicted color with earlier color decision, the graph is not a Bipartite graph.
  • Define Matching: pairing nodes from two sets in a Bipartite graph, without using repeating elements. (two nodes cannot share the same partner node)
  • Max-cardinality Bipartite Matching - include as many pairs as possible.
    • Greedy algorithm: iterate through nodes in one side, and find one partner node. (Skip the node if no partner can be found)
    • Convert Matching problem into Network Flow
      • By connecting a source node to 1 set and connecting a sink to the other set. Then find network that could flow through from source to sink
        • assign each edge with 1 weight.
      • Solve the network flow problem by Ford-Fulkerson algorithm, and the result paths gave the optimized matching.

  • Let edges between 2 sets of nodes to be a matrix. Each edge has a weight. The goal is to find the matching with the max total weights.
    • Hungarian Algorithm can solve this by finding the minimum weight.
      • prerequisite: the cardinality of  of the two sets must be the same. That means the number of nodes must be the same.
      • Time complexity: O(n**3)
      • Negating the weights converts the algorithm from finding the minimum total weight to maximum total weight.

Tuesday, November 14, 2023

Write to DeltaTable using Python WITHOUT pandas (using pyarrow)

 I am trying to write data in DeltaTable format from an AWS Lambda Function, but AWS Lambda Function limits to 250MB. The deltalake library takes 247MB, which exceeds the limit along with pandas. Since the deltalake library included pyarrow, I need to find a way to write a data frame without including pandas.

Here is how:

Thursday, October 05, 2023

Quick Tutorial: Kalman Filter

I assume you already heard of Kalman Filter. In general, you are controlling a moving vehicle and would like to know where it is at the next moment.

You made your guess of the vehicle's next moment based on your current speed, and at the next moment you also measured the vehicle's next location. Neither of them were accurate. Thus, both of them were in Normal distributions, with some room for errors / noise. The best estimation is to take a product of the two Normal distributions.



Here is how to take the product of two Normal distributions, based on this video (voiced in Chinese):

 Taking a product of two Gaussian Distribution X ~ ( µ1δ1) , Y, ~ ( µ2δ2), the result Gaussian Distribution is:

   k = δ/ (δ1 + δ2)

    µ = µ1 + k * ( µ- µ)

   δ ** 2 =  δ1** 2 + k * δ2 ** 2


How do I guess my next location? Use middle school math: knowing the current location p and speed v, we could guess the location of the vehicle in the next moment to be: 

pk = pk-1 + vk-1 * t

vk = vk-1

(Since there is no external force, the next speed u is constant.)

Let state X = (p, v) be the current position and speed of the vehicle, rewrite the two formulas into a matrix form. That is:

pk = 1* pk-1 + t * vk-1

vk = 0* pk-1  + 1 * vk-1  

Then write (pk, vk) as Xk and ((pk-1, vk-1) as Xk-1 and rewrite the expressions above as matrix multiplication

Xk =  [1 , t ]  * Xk-1

          [0, 1]

Call the matrix as F in front of Xk-1 . The formula becomes Xk =  F * Xk-1 . We will use the matrix F later.

If there is an external acceleration a, then next p needs to add a * t**2 / 2. and next u needs to add a * t, middle school math again. See this part of the video . The external acceleration is your control. It is how much force to apply to make the vehicle faster.

Xk =  [1 , t ]  * Xk-1  +  [  t**2 / 2   ] * a

          [0, 1]               +  [ t               ]

    (Note: some articles use a different variable name for acceleration a as u. )

Let P be the covariance matrix of X along p and u axis. P represents the noise / errors in the prediction. It is a 2x2 matrix. This covariance matrix defines the oval shape and rotation of the Gaussian distribution in the (p, v) space.

Given the covariance matrix P at the k-1 moment, we'd like to calculate the covariance matrix at the next moment k. So given the covariance cov(Xk-1) = Pk-1 , we'd like to find out cov(Xk), which is

    cov(Xk) =  cov(F * Xk-1

Co-variance function has the following property: cov(A*X ) = A * cov(X) * AT . Use this property, thus:

    cov(Xk) =  cov(F * Xk-1)  = F * cov(Xk-1) * FT 

So, given the covariance matrix at the moment k-1, we can calculate the covariance matrix at moment k. (So, the Gaussian distribution of the next moment can be calculated based on Gaussian distribution of the current moment.)


How do I know my measurement?

In your measurement device, create a few samples and compare them with known distances. Calculate the standard deviation in the data collected from the measurement device. 


Take the product

Now we have two Normal distributions of the next position: one by the linear equation, and another by measurement device. Use the equation µ = µ1 + k * ( µ- µ) to take a product of two Normal distributions, as shown in the first section. Since X is in (p, u) space. So the equation should be in vector form, instead of in scalar form. The value of k and mean value are calculated the same, exception k is now calculated based on covariance. Covariances are combined in this way:

K' = ∑1 / ( ∑1  + ∑)

∑'  = ∑1 + K' * ∑1

The result looks crazy after plugging in the guessed values based on matrix F, but really they are just the result of the product of two Gaussian distributions.

After Xk is calculated, it will be used as the position to predict in the next round (at k+1 moment). And the calculated ∑' will be used as the covariance matrix for the next round.

Reference:

- https://youtu.be/2-lu3GNbXM8?si=4Xbeiq-LBgTMxbGh&t=750 (voiced in Chinese)

https://www.youtube.com/watch?v=KD0cH4fTFFU (voiced in Chinese)




Thursday, September 14, 2023

Summary on Shusen Wang's video of: Fine-tuning with Softmax Classifier

 In Shushen Wang's video Few-shot learning (3/3), there introduced a simple fine-tuning idea.

1. Softmax classifier

Let x be the raw input, given your feature vector f(x), multiply it with a matrix W plus a bias b. For example, if your classifier generates 3 classes, then the matrix W and b should also have 3 rows. Then take a Softmax.

    p = Softmax(W* f(x) + b)

To initialize matrix W, let each row of the matrix W be the average vector of 1 class to be predicted. Let b to be 0.  W and b are trainable (fine-tuning).

For example, if there are 3 classes, and class 1 has 5 support vectors (from few shot examples), take an average of the 5 support vectors, call it w1, and let that be row 1 in W.


2. Regularization

Since this trick may lead to overfitting, a regularization term is introduced during loss minimization. Since p is a probability, Entropy regularization can be taken on p and seeking smaller entropy becomes part of the loss function.

For example, a prediction p is made for 3 classes, and p = [0.33, 0.33, 0.34]. This prediction may work, but it is pretty bad. So take an entropy at this number:

     entropy = sum( p.map( x => - x * Math.log(x) ) )

     That is - 0.33 * Math.log(0.33) + - 0.33 * Math.log(0.33) +  - 0.34 * Math.log(0.34) = 1.09 in this example.

Include that as part of your loss function (multiply it with a weight) so that training will discourage this kind of the output.


3. Replace Matrix multiplication with Cosine similarity

Say W has 3 classes and thus 3 rows w1, w2, w3. In W* f(x), each row is multiplying f(x) with a dot product. For example w1 * f(x). Instead of a dot product, take a cosine similarity between w1 and f(x).

Basically a dot product first, and divide the determinant of w1 and f(x). (Basically making f(x) and w1 unit vectors before performing the dot product.)





Friday, September 08, 2023

Key, Query, Value Matrices in Masked Self-Attention of Decoder-Only Transformers

   StatQuest uploaded a good video at explaining how a Decorder-Only transformer works. Most of the content talked about how Key, Query, Value Matrices are calculated. It is quite complex. So here I am going to explain it in a more intuitive way (based on my own understanding).

A sentence is first parsed to tokens, and each token has an embedding and its position i in the sentence. The word embedding + the position encoding make a vector for the word at the position i. Nothing special so far.

Now at each token at position i, using this vector (call it word_vector_i), we'd like to encode another vector to represent the context in the sentence so far. This new vector at i should be based on the vector for this word at i and all the previous words from [0, i -1]. To combine these vectors, we are going to take a weighted sum. This is the overall idea.

    vector_with_context (i) =  w1 * value_vector_1 + w2 * value_vector_2 + ... + wi * value_vector_i

But wait, it is not nice to directly use the embedding + positional encoding (word_vector_i) as the value_vector_i. Instead, we will transform it with a matrix (Mv). Mv will be adjustable and learned. So,

    value_vector_i = Mv * word_vector_i

Weight w1 is how similar the 1st word is related to the ith word. Weight w2 is how similar the 2nd word is related to the ith word, etc. To find out how similar the two words are, we are going to apply a dot product on the vector for two words.

But wait, it is not nice to directly use the embedding + positional encoding (word_vector_i), so we again are going to transform word_vector_i with a matrix... Actually, two matrices - one matrix (Mq) for transforming word_vector_i and one matrix (Mk) for transforming word_vector before ith position. 

    query_vector_i = Mq * word_vector_i

    key_vector_1 = Mk * word_vector_1 

    key_vector_2 = Mk * word_vector_2

    ...

    key_vector_(i-1) = Mk * word_vector_(i-1)

Mq and Mwill be adjustable and learned. The weight can be calculated 

    wj = query_vector_i * key_vector_j   

But wait, these weights are not nice. So, we are going to take all the weights and run a softmax to get a better scaled weights (which sum to 1). Applying these weights and value_vector's, vector_with_context(i) is calculated. vector_with_context(i) is called Masked Self-Attention.

To predict the next word at i+1 position, just apply vector_with_context(i) to a fully connected layer to a  result vector representing probability at each word in the dictionary.

But wait, using only masked self-attention (vector_with_context(i) ) isn't nice, we'd like to sum it with embedding and positional encoding (aka word_vector_i as described above). So the prediction of next word is really depending on 3 things. Since we are summing a later vector with an earlier vector, this becomes a residual link in the network.

(Note: since the residual link will sum the masked self-attention and the word embedding, that means their dimensions have to match. This also means Mk, MqMhave to produce the same size. So the size of the Matrix is predetermined.)

Of course, the result vector will apply a softmax to scale probability better.

  - What if it predicted the next word wrong, in my prompt?

      Run your optimizer to train Mk, MqMv and the fully connected layer to make it right.

 - What if I want to generate a reply?

      Repeat the process (without training) to run at every position in your prompt. At the end of the prompt, (at the end-of-sentence token), let the transformer predict the next word. Your transformer is now generating a reply! Keep output the next word and add to the end of the sentence until it outputs the end-of-sentence token.



Thursday, August 17, 2023

Short explanation on PEFT: Parameter Efficient Fine Tuning

Many pretrained large language models are out there for us to use. However, they may not be accurate for our purpose. Thus, the model needs fine tuning. 

Since the model is large, the idea is to: make a copy of the existing model, and select a small percentage of trainable features to retrain. With the new copy of the model, train the new copy with your data.




Note that the library does not work with any random model that you created, as the parameter in LoraConfig task_type=TaskType.SEQ_2_SEQ_LM sets an expectation of the model.

LoRa applies the summation with the existing matrices with Low-Rank Matrices to adjust the weights, which is a trick to create a large matrix by adding small amount of parameters
(I explained it earlier in this post.  ) Since only a small percentage of the features are trainable, the training is relatively fast.


This video explains how the LoRA training works internally: 
https://www.coursera.org/learn/generative-ai-with-llms/lecture/NZOVw/peft-techniques-1-lora




Thursday, August 10, 2023

Pytorch: How to clear GPU memory

import gc

# del optimizer
# del model
gc.collect()
torch.cuda.empty_cache()

Quick Note: Training with Low-rank Matrices

When training a large matrix M with size WxH parameters is expensive, instead take the matrix into the multiplication of 2 smaller matrices. For example: matrix A is in size of (Wx3) and matrix B is in size of (3xH). And let A * B = M to give back a matrix of WxH dimensions. Since W * 3 + 3 * H < W *H, less amount of parameters are required.

This technique is mentioned in both of the following videos:

 https://www.coursera.org/learn/generative-ai-with-llms/lecture/NZOVw/peft-techniques-1-lora

https://youtu.be/exVPXVFPMDk?t=205

Wednesday, August 02, 2023

Details in Positional Encoding for Transformer

The Attention is all you need paper mentioned positional encoding without lacking some details. I am going to write my understanding at those details

The formula is the following:

PE(pos,2i=sin(pos/100002i/dmodel)

PE(pos,2i+1) =cos(pos/100002i/dmodel

The paper mentioned that the i is the dimension index of and dmodel is dimension of the embedding. If so, given the last i = dmodel -1,  2i will be out of the bound. So, that is not the correct explanation.

2i and 2i+ 1 here suggest even and odd dimension indices. At the even dimension indices, apply sine function; at the odd dimension indices, apply cosine function. So i is ranged from [0, to dmodel/2) and for each i, it generates 2 dimensions.

Once having the PE (Positional Encoding) value for a position, by the diagram in page 3, it is added to the embedding of the input.

new_embedding[pos, 2i] = embedding[pos, 2i] + PE(pos, 2i) new_embedding[pos, 2i+1] = embedding[pos, 2i+1] + PE(pos, 2i+1)

The embedding variable here is the embedding for each word in a sentence, and pos is the position of the  sentence. (It is a sentence - not the whole dictionary.)

This part of the StatQuest video clearly explained how embedding is calculated.


Sunday, July 23, 2023

Key, Query, Value Matrices in Self Attention

The Attention is all you need paper mentioned about an attention function with construction of three matrices Q, K, V without much explanation. Fortunately, this Youtube tutorial on attention explained well (voiced in Chinese). Here is a note that I took from the video.

In self attention, there is only one input, as a list of tokens, each of which is a word expressed as a vector embedding. Call this input X of m elements. Each Xi is the embedding of the ith word. The task to guess the ith word to output, by looking at all words and the i-th word in the input.

Wk, Wq, Wv are the parameter matrices to be learned. Each of them multiplies X to get Q, K, V.

1. It needs to look at all words, which is the Wk matrix multiply X. This matrix is called key matrix as it looks at all keys (words). K = Wk * X

2. It needs to look at the ith word Xi, which is transformed by Q matrix, aka query matrix. qi = Wq * Xi

3. Take the result of K from step 1 and multiply the qi in step 2 and take a softmax. Call this result Ai = Softmax(K.transpose * qi)

4. The context vector at ith location Ai and multiply it with V. Call this result Ci = V * Ai. Since Ai came from step 3 with a softmax , Ci is essentially a weighted sum of V, based on the weight Ai.

5. Take Ci into a Softmax Classifier to get an output word.


Also, Self Attention is a special case of Attention. For self attention, qi is calculated by the ith word in the input Xi. For attention, qi is calculated by looking at the previous output of ith word ( For the very first position, <start> token is considered as the previous output.)