Tuesday, September 10, 2024

Key ideas from the book "Reinforcement Learning: An Introduction" by Sutton & Barto (Part 3)

Planning

In the previous chapter of the book, Monte Carlo and Temporal Difference methods were introduced, where real experience is used to learn a prediction model for the return. When the training of a value function / policy function is based directly on state transitions and rewards from the real world, it is called Direct Reinforcement Learning (or Model-free Learning). An alternative is to add an environment model that emulates the environment's state transitions and rewards. The emulated transitions can then also be used to train the prediction model for the return. This is called Planning, or Model-based Learning.


Train on Randomly Sampled Past Observations

Since we have a model for state transitions, does that mean the prediction model for the return is trained solely on emulated transitions? The answer is no. Training usually happens on a state transition in the real world, followed by n additional updates on emulated state transitions. The n starting points are picked at random from previously observed states.
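A minimal sketch of this loop, in the spirit of the book's Dyna-Q algorithm (the env object with its actions() and step() methods is an assumption for illustration; Q can be a collections.defaultdict(float), model a dict, observed a list):

import random

def dyna_q_step(env, Q, model, observed, state, alpha=0.1, gamma=0.95, epsilon=0.1, n=10):
    # One real step: epsilon-greedy action selection in the real environment
    actions = env.actions(state)
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = env.step(state, action)

    # Direct RL: update from the real transition (Q-learning style)
    best_next = max(Q[(next_state, a)] for a in env.actions(next_state))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    # Update the learned model and remember this (state, action) as observed
    model[(state, action)] = (next_state, reward)
    observed.append((state, action))

    # Planning: n updates on emulated transitions from randomly picked past observations
    for _ in range(n):
        s, a = random.choice(observed)
        s2, r = model[(s, a)]
        best = max(Q[(s2, b)] for b in env.actions(s2))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
    return next_state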

Prioritized Sweeping

Note that most observed state transitions have 0 reward and change the estimated return very little. It is inefficient to train on these states, so picking state transitions uniformly at random does not scale to a large state space. An improvement is to order the observed (state, action) pairs with a priority queue, putting the pairs with the largest update errors at the front.
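A rough sketch of that queue using Python's heapq (a min-heap, so priorities are stored negated; computing the error itself is assumed to happen elsewhere):

import heapq

pqueue = []  # entries are (-|error|, (state, action))

def push_if_significant(state, action, error, theta=1e-3):
    # Only enqueue transitions whose estimated update is worth the work
    if abs(error) > theta:
        heapq.heappush(pqueue, (-abs(error), (state, action)))

def next_to_update():
    # Pop the (state, action) pair with the largest error first
    _, pair = heapq.heappop(pqueue)
    return pair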



Sunday, August 18, 2024

Key ideas from the book "Reinforcement Learning: An Introduction" by Sutton & Barto (Part 2)

Temporal Difference vs. Monte Carlo method

In the Monte Carlo method, a trajectory (episode) has to end in a terminal state. The final return is calculated by aggregating all rewards along the path. Then, working backward from the final step, each earlier step is assigned that return discounted by a rate.

Problems with Monte Carlo method:

  • It has to reach a terminal state, so the sampling paths are longer.
  • The rewards in the middle are not considered individually, so the estimates for intermediate states are less accurate.
  • With negative rewards the final return can come out to 0, in which case the trajectory is not useful for learning.

In Temporal Difference learning, a complete episode is not required; a sub-sequence of steps can be used. At each step i, the estimated return is the actual reward at step i plus the discounted estimated return of the next step (i+1).

The main difference from Monte Carlo method:

  • The sampling path does not have to end at a terminal state.
  • The rewards in the middle are considered at their exact steps, each separated out from the future return.
  • When there is only one reward, the two methods are the same.

The equation expressing the value of a state s as the immediate reward (at state s) plus the discounted future return from the next state, V(s) = E[r + γ·V(s')], is called the Bellman equation.

TD converges faster and reaches better performance for the same amount of training, compared to the Monte Carlo method.
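A minimal tabular TD(0) update for state values, as a sketch (V can be a collections.defaultdict(float); the transition values come from interacting with the environment):

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.95):
    # Bellman-style target: immediate reward plus discounted estimate of the next state
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])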

Batch Update

Instead of updating the neural network at every step, play several episodes and then update the neural network (optimizer.step() in PyTorch).

In the Monte Carlo method, we collect all state-to-return pairs, group them by state, average the return values, and then train the neural network on those averages.

This is a problem for a state that rarely occurs. Consider this example: state A leads to state B but gets no reward in its single episode, while state B on average yields a return of 2 across the other episodes. Averaging over A's episodes falsely gives A a return of 0. (This is wrong, because A can only lead to B, which has a return of 2.)

A better approach, often adopted by TD, is to consider the transitions between adjacent states. In the example above, state A's value is tied to the return of its future states, so B's return contributes to the estimate for state A.

https://www.youtube.com/watch?v=OOlQj8UEJnI

https://www.youtube.com/watch?v=L64E_NTZJ_0


n-step Bootstrapping

In Temporal Difference learning we look at the reward at step i and the estimated return of the next state (i+1), while in the Monte Carlo method we look at the rewards of all steps. The middle ground is to look at the rewards over several steps and the estimated return of a state further ahead. This is called n-step bootstrapping.
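A sketch of the n-step return, assuming a list of rewards collected from step i onward and a value table V used for the bootstrapped tail:

def n_step_return(rewards, V, state_after_n, n, gamma=0.95):
    # Sum the first n discounted rewards, then bootstrap with the value of the state n steps ahead
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    if len(rewards) >= n:
        g += (gamma ** n) * V[state_after_n]
    return g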

Monday, August 12, 2024

Importance Sampling: intuitive explanation

We'd like to use the Monte Carlo method to estimate an integral by randomly sampling from a distribution and taking the average. But for a large dataset with many dimensions, that would require too many samples to reach a good standard deviation. Importance sampling is a way to reduce the number of samples needed while still achieving a good standard deviation.

The overall idea is to introduce a new normal distribution q(x) centered at the region of x where the product of the probability p(x) and the value f(x) is large. Then randomly sample from this new distribution, use f(x) * p(x) / q(x) as the value for each sample, and take the average.

Example:

  We have 10 years of weather data with the temperature and rainfall amount for each day. The goal is to estimate the overall rainfall in a year. Let x be the temperature: p(x) is the probability of a temperature occurring on a day, and f(x) is the rainfall amount given a temperature, so the overall rainfall is ∫ p(x) * f(x) dx. We can iterate through all temperatures and take an average, or use the Monte Carlo method - randomly sample some temperatures (thus following the distribution p(x)) and take an average. Or, with Monte Carlo importance sampling, sample from a new distribution q(x) centered:

- where p(x) is large (where a temperature is common)

- where f(x) is large (where the rainfall is large)

- and hopefully where the product p(x) * f(x) is large, a.k.a. the important values.

Take a Gaussian distribution q(x) centered on that region of temperature values, so that it covers the important values. Sample from q(x) and take the average of f(x) * p(x) / q(x). (You can take a smaller sample - that's the whole point.)

Intuitively, at temperatures x where q(x) is small (under-sampled), 1/q(x) applies a higher weight. Where q(x) is large (over-sampled), 1/q(x) applies a lower weight.
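A small numeric sketch of this estimator; the particular p(x), f(x), and the center of q(x) below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)

p = lambda x: np.exp(-0.5 * ((x - 10) / 8) ** 2) / (8 * np.sqrt(2 * np.pi))   # density of temperatures
f = lambda x: np.maximum(0, 30 - np.abs(x - 20))                              # rainfall given a temperature

# Proposal q(x): Gaussian centered where p(x) * f(x) is large (around x = 15 here)
mu, sigma = 15.0, 5.0
q = lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

xs = rng.normal(mu, sigma, size=2000)        # sample from q instead of p
estimate = np.mean(f(xs) * p(xs) / q(xs))    # importance-weighted average, approximates the integral of p(x) * f(x)
print(estimate)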

Questions from the example:

- What happens with negative temperatures? Answer: we use p(x), the probability of a temperature - not the temperature value itself - so the quantity stays positive.

- How to pick the distribution q(x)? Answer: it is arbitrary and a manual step. 


Reference:

https://www.youtube.com/watch?v=C3p2wI4RAi8

https://www.youtube.com/watch?v=V8kruQjqpuw

https://www.youtube.com/watch?v=3Mw6ivkDVZc


Saturday, July 06, 2024

Key ideas from the book "Reinforcement Learning: An Introduction" by Sutton & Barto (Part 1)

This is a summary of the key ideas from Sutton and Barto's book on Reinforcement Learning. (Here is the book http://www.incompleteideas.net/book/the-book-2nd.html)

Why do I need Reinforcement Learning when I already have the complete game tree for Tic-tac-toe?

That's because you also want to model the other player, not just the game itself. The goal is to find and exploit flaws in the adversary's play.

If the opponent is smart, adapt so that you do not lose the game. If the opponent is dumb, find a dumb way to beat them - you are not looking for the game-theoretically optimal solution.


How to do Reinforcement Learning?

At each step, you have a list of actions that can be taken, and each action has a score. When playing seriously, you take the action with the highest score. (Action values are not adjusted in this mode.) When you are not playing seriously and are exploring, pick the action probabilistically, weighting each action by its score.

Not every step produces a result (reward), so you collect the steps as a path.

Once a step reaches a result (reward), go back over the steps of the whole path and adjust the action value at each step to be a little higher (thus lowering the chance of the other actions).
  • Imagine walking in a maze. Once you randomly walk from point A to point B, each action along that path is valuable. You want to adjust the action scores with increasing magnitude as the steps get closer to the goal, so that when following the path later you can look for the action with the higher score.

K-armed Bandit problem

Given a slot machine with k arms (buttons), where the jackpot (reward) is normally distributed around an arm position between 1 and k, and this location may vary over the 24 hours of a day (but is fixed within a given hour), RL can learn to pick an arm at each hour to maximize the cumulative reward.

Note that in this problem an action (choice of arm) at one hour has no effect on actions at other hours, so there is no path involved. There is no parameter other than the timestamp.

If there is a parameter that indicates the state of the slot machine (for example, the color of the screen varies depending on the center of the reward distribution), it is called the "Contextual Bandit problem", for having an extra parameter to assist the guess. (Note that there is still no effect between actions.)
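As a sketch, a simple epsilon-greedy agent for one such hour (the arm count, the stand-in reward distribution, and epsilon are assumptions; estimates are updated with an incremental average):

import random

k = 10
Q = [0.0] * k          # estimated value per arm (start above 0, e.g. 5.0, for optimistic initialization)
N = [0] * k            # pull counts
epsilon = 0.1

def pull(arm):
    # Stand-in for the real machine: reward centered on arm 7
    return random.gauss(10 - abs(arm - 7), 1.0)

for t in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(k)                   # explore
    else:
        arm = max(range(k), key=lambda a: Q[a])     # exploit
    r = pull(arm)
    N[arm] += 1
    Q[arm] += (r - Q[arm]) / N[arm]                 # incremental average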

Tricks to improve RL: 
  • Assigning an initial value greater than 0 helps RL explore in the early stage, since starting from 0 often favors the first path found in the solution space too much. This is called the Optimistic Initial Value trick.
  • Upper Confidence Bound action selection: (I skipped)
  • Gradient bandit algorithms: (I skipped)

Markov Decision Process


Unlike the K-armed Bandit problem, when an action changes the state and thereby affects future actions, the problem becomes a Markov Decision Process.

Return:

  Given a timestamp t, the Return is the sum of all future rewards along the path after time t.

Discounted Return:

  Similar to the Return, but since we don't know whether a reward will actually happen in the future, the weight of a reward k steps ahead is reduced by a fraction per step (i.e., multiplied by the k-th power of the discount rate): G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...

This makes the nearer rewards more attractive.


Policy Function:

It is a function that takes the current state S and a possible action a, and returns the probability of taking that action.

Note that it could also output a raw score per action and apply softmax to choose an action by weight.
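A sketch of that softmax choice over raw action scores (the scores in the usage line are made up):

import numpy as np

def choose_action(scores, temperature=1.0):
    # Softmax turns arbitrary scores into a probability distribution over actions
    z = np.array(scores) / temperature
    z -= z.max()                          # for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(scores), p=probs)

action = choose_action([2.0, 1.0, 0.5])   # hypothetical scores for 3 actions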


Value Function:

A value function takes only the current state S and evaluates the future return / discounted return under a policy.

Note that a value function has to be defined under a policy, since the future return depends on the path that the policy follows. It is possible to have both a value function and a policy function in one system - for example, the actor-critic algorithm.


Optimal Value Function:

At each state, consider all possible policies and take the one that gives the maximum Return compared to the others. Such a value function is called an optimal value function.

The values from the optimal value function can be used to guide an optimal policy, which chooses the action with the largest value under the optimal value function.


Episode:

Since not all tasks have an end state, a task that goes on indefinitely is usually divided into episodes, and each episode defines its own goal.


Monte Carlo Method

When the value function cannot be modeled, we focus on finding a policy function - choosing an action based on the current state. A good (and in the limit optimal) way to find it is to sample randomly and average the Returns observed at each (state, action) pair. (This process is the so-called Monte Carlo Method.) Once trained, we can use this mapping to choose the action that gives the largest Return at a given state.

The random sampling needs to cover all actions. To help with this, we let episodes start from all initial states and all initial actions.

Each training pass takes in 1 episode, and each episode is a list of (state, action) pairs. Iterate the (state, action) pairs backward from the goal state to the initial state: the return at each state is the reward at the current state plus the discounted return of the future states. Note that at the goal state the discounted future return is 0. (So iterate backward, multiplying the accumulated return by the discount rate at each step.)

During this "training", just average the return at the (state, action) pair. (That means your sample episodes likely need to cover the station action more than once). Avoid "training" the same (state, action) pair twice in the same episode, by skipping to train only at the first occurrence toward the start of the episode.

Note the "training" here does nothing more than taking an average.


Avoid random initialization

Sometimes it is not practical to construct all possible random starts. Instead, we make the stepping stochastic: with a small probability, pick a random action to explore, while most of the time follow the greedy choice of the action with the largest return.


On-Policy Training vs. Off-Policy Training

So far only one policy function has been described, which is known as on-policy training. It is usually faster to train, but it is only a special case of off-policy training. In off-policy training, two policy functions are involved: a target policy for acting and a behavior policy for exploration. On-policy training is just the case where the target policy is the same as the behavior policy used for exploration. By separating the two policies, we do not force a single policy to be greedy and willing to try random actions at the same time.

To make this work, the book requires coverage: every action that the target policy might take must have non-zero probability under the behavior policy.

There is a problem with plain Monte Carlo sampling in the tail of a distribution, where it usually requires a lot of samples. Importance sampling is a method that applies a second distribution over that tail so that the estimate requires fewer samples.

Off-policy methods usually use importance sampling to reweight the samples collected under the behavior policy, so that fewer samples are needed to improve the target policy.

On-policy methods can also use importance sampling to learn from previous experience (collected under a slightly different, older policy, since the policy keeps being updated), so that past experience can be replayed in on-policy training. This is demonstrated in the PPO algorithm: https://www.youtube.com/watch?v=IScp-mZ7iS0




 

Tuesday, April 16, 2024

Quick Tutorial: Tiny Llama Finetuning

TinyLlama is an LLM small enough to run on your laptop. It can generate reasonably good replies, and people are fine-tuning it for their own purposes. I am going to show you how to play with it.

How to play with a model on Hugging Face?

Go to the Hugging Face page; the instructions for starting the app are already included there: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

How to run it locally?

Download the files to your disk from the "Files" tab:

https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main

You will have to click on the download icon to download each file. Copying and pasting the text leads to corrupted files. Put them all in one folder and run a Python script like the one below:
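A minimal sketch of such a script, assuming the standard transformers text-generation pipeline and that the files were saved to a local folder named ./TinyLlama-1.1B-Chat-v1.0:

import torch
from transformers import pipeline

# Point the pipeline at the local folder holding the downloaded model files
pipe = pipeline("text-generation", model="./TinyLlama-1.1B-Chat-v1.0",
                torch_dtype=torch.bfloat16, device_map="auto")

prompt = "<|user|>\nWhat is the capital of France?</s>\n<|assistant|>\n"
output = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])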


Which one to use?

There are a few models. Use the chat model; it gives better results when implementing a question / answer chatbot.

All models are good at continuing the text that you provide. The chat model additionally separates questions and answers well by distinguishing user input from assistant output. Surround user input with the <|user|> tag, ending it with </s>. Surround the desired reply with <|assistant|>, also ending it with </s>.

You can also use it without tags, but expect the output to look more like creative writing than a conversation.

In addition to "text-generation", the Transformers pipeline has a "conversational" mode, which allows the input to be a list of maps, where each map is a message in the conversation. (It converts the list of maps into tagged text for you.)


How to fine-tune with peft?

Use this script. It picks a small percentage of certain neural network components to fine-tune.

https://gist.github.com/yuhanz/740ab396f237e8d2f41f07a10ccd8de9

You want to initialize TrainingArguments with push_to_hub=False so that it doesn't require Hugging Face credentials. Replace the model name with the relative path to the folder on your disk containing the model files.

Prepare the training data as conversations separated by tags. Use the tokenizer to convert the structured data into tagged text:

text = tokenizer.apply_chat_template(messages, tokenize=False)
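For example, messages is a list of role/content maps (the content below is made up):

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
# text now contains the <|user|> ... </s> <|assistant|> ... </s> tagged string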

Note that the configuration max_seq_length determines the size of the input and the generated text. The right value may depend on the tokenizer for the model.

References:

 https://www.kaggle.com/code/tommyadams/fine-tuning-tinyllama

https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms

https://huggingface.co/docs/transformers/main/chat_templating


Thursday, April 04, 2024

Handling AWS API Gateway Error: Invalid HTTP endpoint encoding detected for URI

I encountered this error at work when deploying AWS API Gateway:

 Invalid HTTP endpoint encoding detected for URI (Service: ApiGateway, Status Code: 400

That was because the URI contained encoded %xx characters that AWS API Gateway doesn't like (even though they are valid and work well with curl). Replacing those characters fixed the issue. Here is the list of unwelcome characters that I found in my experiments:

%21

%24

%2C

%3A

%0A

They correspond to the characters ! $ , : and the newline.
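As a sketch, one way to strip these encodings from a URI before handing it to API Gateway (whether to decode them back to literal characters or drop them depends on what the backend accepts; the mapping below is an assumption):

problem_encodings = {"%21": "!", "%24": "$", "%2C": ",", "%3A": ":", "%0A": ""}

def sanitize_uri(uri):
    # Replace the percent-encodings that API Gateway rejected; drop encoded newlines entirely
    for enc, plain in problem_encodings.items():
        uri = uri.replace(enc, plain)
    return uri

print(sanitize_uri("https://example.com/path%3A1?q=a%2Cb%21"))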

Friday, March 29, 2024

Summary on Outlier's tutorial: Cross Attention

Self-attention: we'd like to find a weighted sum of feature vectors over a sequence of inputs, whether it is a paragraph of text or a list of image segments. The feature vector for each input vector is computed by the Value matrix. The weight is computed by the Query matrix (for the vector in focus) multiplied by the Key matrix (applied to each vector in the input).

Cross-attention: now consider two inputs, one image and one text. The Query matrix applies to the image, while the Key and Value matrices apply to the text. Compute the attention between each image fragment and each word. Using these values as weights, apply them to the Value outputs to create the weighted sum. From this vector you can then multiply by another matrix to transform back to the dimension of the input image and construct a (different) image.
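A rough single-head cross-attention sketch in PyTorch (the dimensions and random weights are made up; real implementations use multiple heads, but the scaling and matrix roles are the same):

import torch

n_img, n_txt, d_img, d_txt, d = 64, 16, 256, 512, 128

image_tokens = torch.randn(n_img, d_img)   # e.g. flattened image patches
text_tokens = torch.randn(n_txt, d_txt)    # e.g. word embeddings

W_q = torch.randn(d_img, d)    # Query matrix applies to the image
W_k = torch.randn(d_txt, d)    # Key matrix applies to the text
W_v = torch.randn(d_txt, d)    # Value matrix applies to the text
W_out = torch.randn(d, d_img)  # projects back to the image dimension

Q = image_tokens @ W_q                                   # (n_img, d)
K = text_tokens @ W_k                                    # (n_txt, d)
V = text_tokens @ W_v                                    # (n_txt, d)

attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)         # (n_img, n_txt): image patch vs. word
out = attn @ V                                           # weighted sum of text values per image patch
reconstructed = out @ W_out                              # back to the image dimension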


 Reference: https://www.youtube.com/watch?v=aw3H-wPuRcw