Sunday, August 18, 2024

Key ideas from the book "Reinforcement Learning: An Introduction" by Sutton & Barto (Part 2)

Temporal Difference vs. Monte Carlo method

In the Monte Carlo method, a trajectory (episode) has to end in a terminal state. The final return is calculated by aggregating all rewards along the path. Then, working backward from the final step, each previous step is assigned the return discounted by a rate.
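
A minimal sketch of that backward pass (the function name, the reward list, and the discount rate gamma are illustrative assumptions, not from the book):

  # Monte Carlo: compute the discounted return for every step of one finished episode.
  def monte_carlo_returns(rewards, gamma=0.99):
      returns = [0.0] * len(rewards)
      g = 0.0
      # Walk backward from the terminal step, discounting the accumulated return.
      for i in reversed(range(len(rewards))):
          g = rewards[i] + gamma * g
          returns[i] = g
      return returns

  # A 3-step episode with a single terminal reward of 1.
  print(monte_carlo_returns([0.0, 0.0, 1.0]))  # ≈ [0.9801, 0.99, 1.0]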

Problems with Monte Carlo method:

  • It has to reach a terminal state. That means the sampling path is longer.
  • The rewards in the middle are not considered individually. That means the estimates for the intermediate steps are not accurate.
  • With negative rewards, the final return can cancel out to 0, in which case the trajectory is not useful for learning.

In Temporal Difference, a complete episode is not required; a partial trajectory can be used. At each step i, the estimated return is the actual reward at step i plus the discounted estimated return of the next step (i+1).
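
A sketch of that one-step (TD(0)) update, assuming a tabular value estimate stored in a dict; the step size alpha and discount gamma are illustrative choices:

  # TD(0): update the value of state s from a single observed transition (s, r, s_next).
  def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
      # Target = actual reward at this step + discounted estimate of the next state's return
      # (no bootstrap term if s_next is terminal).
      target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
      V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
      return V

  V = {}
  td_update(V, s="A", r=0.0, s_next="B")  # no need to wait for the episode to end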

The main difference from Monte Carlo method:

  • The sampled path does not have to end at a terminal state.
  • The rewards in the middle are credited at their exact steps, each treated separately from the (estimated) future return.
  • When there's only 1 reward, the two methods are the same.

The equation expressing the value of a state s as the sum of the immediate reward (at state s) and the discounted future return from the next state is called the Bellman equation.
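
Written out in the same plain notation as the rest of these notes, with γ as the discount rate and s' as the next state:

  V(s) = E[ r + γ * V(s') ]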

TD converges faster and reaches better performance for the same amount of training, compared to the Monte Carlo method.

Batch Update

Instead of updating the neural network at every step, play several episodes and then update the neural network (optimizer.step() in PyTorch).
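
A minimal PyTorch sketch of this; the network shape, the collected batch, and the targets are placeholders rather than anything specific to the book:

  import torch

  value_net = torch.nn.Linear(4, 1)    # assume a 4-dimensional state and a scalar value
  optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

  # Stand-ins for the states and return targets collected over several episodes.
  states = torch.randn(64, 4)
  targets = torch.randn(64, 1)

  optimizer.zero_grad()
  loss = torch.nn.functional.mse_loss(value_net(states), targets)
  loss.backward()
  optimizer.step()                     # one update for the whole batch of episodes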

In the Monte Carlo method, we collect all state-to-return pairs, group them by state, take the average of the return values for each state, and train the neural network on those averages.

This is a problem for a state that rarely occurs. Consider this example: state A leads to state B but fails to get a reward in its single episode, while state B on average yields a return of 2 in the other episodes. The aggregation for A will falsely result in a return of 0. (This is misleading, because A can only lead to B, which has a return of 2.)

A better approach, adopted by TD, is to consider the transitions between adjacent states. In the above example, state A's value is tied to the return of the states that follow it, so B's return contributes to the estimate for state A.
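
A small sketch of the A → B example with made-up episode data (the numbers are chosen only to match the description above):

  from collections import defaultdict

  # One episode goes A -> B with no reward; in the other episodes B alone earns 2.
  episodes = [[("A", 0.0), ("B", 0.0)], [("B", 2.0)], [("B", 2.0)]]

  # Batch Monte Carlo: average the (undiscounted) return observed from each state.
  returns = defaultdict(list)
  for ep in episodes:
      g = 0.0
      for s, r in reversed(ep):
          g += r
          returns[s].append(g)
  V_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
  print(V_mc["A"])   # 0.0 -- A is falsely valued at 0

  # Batch TD instead uses the A -> B transition: V(A) = 0 + V(B),
  # so A inherits B's (nonzero) estimated return.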

https://www.youtube.com/watch?v=OOlQj8UEJnI

https://www.youtube.com/watch?v=L64E_NTZJ_0


n-step Bootstrapping

In Temporal Difference, we look at the reward at step i and the estimated future return of the next state (i+1), while in the Monte Carlo method we look at the rewards of all remaining steps. The middle ground is to look at the actual rewards for several steps and the estimated future return of a state further ahead. This is called n-step bootstrapping.
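
A sketch of the n-step return under these definitions (the variable names and the handling of episode boundaries are simplified assumptions):

  # n-step return from step i: sum n actual (discounted) rewards, then bootstrap
  # with the estimated value of the state reached n steps ahead.
  # states[j] is the state at step j; rewards[j] is the reward received at step j.
  def n_step_return(rewards, states, i, n, V, gamma=0.99):
      g = sum((gamma ** k) * rewards[i + k] for k in range(n))
      return g + (gamma ** n) * V.get(states[i + n], 0.0)

  # With n = 1 this is the TD(0) target; as n grows toward the remaining episode
  # length it approaches the Monte Carlo return.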

Monday, August 12, 2024

Importance Sampling: intuitive explanation

We'd like to use the Monte Carlo method to estimate an integral by randomly sampling from a distribution and taking the average. But for a problem with many dimensions, that would require too many samples to achieve a good standard deviation. Importance sampling is a way to reduce the number of samples needed while still achieving a good standard deviation.

The overall idea is to introduce a new (e.g. normal) distribution q(x) centered around the region of x where the product of the probability p(x) and the value f(x) is large. Then you can randomly sample from the new distribution, let the value function be f(x) * p(x) / q(x), and take the average.
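
In symbols, the target quantity and its importance-sampling estimate are:

  ∫ p(x) * f(x) dx = ∫ q(x) * [ f(x) * p(x) / q(x) ] dx ≈ (1/N) * Σ f(xᵢ) * p(xᵢ) / q(xᵢ),  with xᵢ sampled from q(x)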

Example:

  We have 10 years of weather data with daily temperatures and rainfall amounts. The goal is to estimate the overall rainfall in a year. Let x be the temperature. p(x) is the probability of a temperature on a given day. f(x) is the rainfall amount given a temperature. So the overall rainfall is ∫ p(x) * f(x) dx. We can iterate through all temperatures and take an average, or use the Monte Carlo method: randomly sample some temperatures (thus following the distribution p(x)) and take an average. Or use Monte Carlo with importance sampling: sample from a new distribution q(x) centered at:

- where p(x) is large (where a temperature is common)

- where f(x) is large (where the rainfall is large)

- and hopefully we pick where the product p(x) * f(x) is large, i.e. the important values.

Take a Gaussian distribution q(x) around that region of temperature values x, centered on the important values. Sample from the q(x) distribution and take the average of f(x) * p(x) / q(x). (You can take a smaller sample; that's the whole point.)

Intuitively, at a temperature x where q(x) is small (under-sampled), the factor 1/q(x) applies a higher weight. Where q(x) is large (over-sampled), 1/q(x) applies a lower weight.
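
A small numpy/scipy sketch of the rainfall example; the temperature distribution p(x), the rainfall function f(x), and the proposal q(x) are all made up for illustration:

  import numpy as np
  from scipy import stats

  p = stats.norm(loc=15, scale=8)           # p(x): temperature distribution (made up)

  def f(x):                                 # f(x): rainfall at temperature x (made up)
      return np.maximum(0.0, 30.0 - x)

  q = stats.norm(loc=5, scale=6)            # q(x): proposal centered where p(x) * f(x) is large

  rng = np.random.default_rng(0)
  x = q.rvs(size=2000, random_state=rng)    # sample from q(x), not from p(x)
  estimate = np.mean(f(x) * p.pdf(x) / q.pdf(x))   # reweight by p/q, then average
  print(estimate)                           # ≈ ∫ p(x) * f(x) dx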

Questions from the example:

- What happens with negative temperatures? Answer: we use p(x), the probability of a temperature, not the temperature value itself, so the value is positive.

- How do we pick the distribution q(x)? Answer: it is arbitrary and a manual step.


Reference:

https://www.youtube.com/watch?v=C3p2wI4RAi8

https://www.youtube.com/watch?v=V8kruQjqpuw

https://www.youtube.com/watch?v=3Mw6ivkDVZc