Sunday, August 18, 2024

Key ideas from the book "Reinforcement Learning: An Introduction" by Sutton & Barto (Part 2)

Temporal Difference vs. Monte Carlo method

In the Monte Carlo method, a trajectory (episode) has to end at a terminal state. The final return is calculated by aggregating all rewards along the path. Then, working backward from the final step, each earlier step is assigned the return from that point on, discounted by a rate.
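As a rough sketch of that backward pass (plain Python; the rewards list and the discount rate gamma are hypothetical):

    # Monte Carlo returns: walk backward from the terminal step,
    # discounting the running return and assigning it to each step.
    def monte_carlo_returns(rewards, gamma=0.99):
        returns = [0.0] * len(rewards)
        g = 0.0
        for i in reversed(range(len(rewards))):
            g = rewards[i] + gamma * g    # return at step i
            returns[i] = g
        return returns

    # Example: three-step episode with a reward only at the end.
    print(monte_carlo_returns([0.0, 0.0, 1.0], gamma=0.9))   # -> roughly [0.81, 0.9, 1.0]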

Problems with Monte Carlo method:

  • It has to reach a terminal state, which makes the sampling paths longer.
  • The rewards in the middle are not considered individually, so the estimates for intermediate steps are not accurate.
  • Consider negative rewards: if they cancel the positive ones and the final return ends up at 0, the trajectory is not useful for learning.

In Temporal Difference learning, a complete episode is not required; a partial trajectory can be used. At each step i, the estimated return is the actual reward at step i plus the estimated return of the next step (i+1), with that future return discounted.
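A minimal sketch of the tabular TD(0) update this describes (the value table V, the step size alpha, and the discount gamma are assumed names, not from the post):

    # TD(0): move V(s) toward the one-step target r + gamma * V(s_next).
    def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        target = r + gamma * V[s_next]           # reward at step i plus discounted estimate at step i+1
        V[s] = V[s] + alpha * (target - V[s])    # nudge the current estimate toward the target
        return V[s]

    # e.g. V = {"A": 0.0, "B": 2.0}; td_update(V, "A", 0.0, "B") moves V["A"] toward 2.0.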

The main difference from Monte Carlo method:

  • The sampling path does not have to end at a terminal state.
  • The rewards in the middle are considered at their exact steps, separately from the discounted future return.
  • When there is only one reward, the two methods are the same.

The equation that writes the value of a state s as the immediate reward at s plus the discounted future return from the next state is the Bellman equation.
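Written out in standard notation (my rendering, not quoted from the book), with r the immediate reward at state s, gamma the discount rate, and s' the next state:

    V(s) = \mathbb{E}\left[\, r + \gamma \, V(s') \,\right]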

TD converges faster and reaches better performance than the Monte Carlo method for the same amount of training.

Batch Update

Instead of updating the neural network at every step, play several episodes and then update the neural network (optimizer.step() in PyTorch).
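A hedged sketch of that pattern in PyTorch; value_net, the 4-dimensional state, and the squared-error loss are all placeholder choices, and the only point is that optimizer.step() runs once per batch of episodes:

    import torch

    # Hypothetical setup: a small value network mapping a 4-dim state to a scalar value.
    value_net = torch.nn.Linear(4, 1)
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

    def batch_update(batch):
        # batch: list of (state_tensor, target_return) pairs collected over several episodes
        optimizer.zero_grad()
        loss = torch.tensor(0.0)
        for state, target in batch:
            pred = value_net(state).squeeze()
            loss = loss + (pred - target) ** 2
        loss = loss / len(batch)
        loss.backward()
        optimizer.step()   # one optimizer step per batch of episodes, not per environment step
        return loss.item()

    # e.g. batch_update([(torch.randn(4), torch.tensor(2.0)), (torch.randn(4), torch.tensor(0.0))])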

In the Monte Carlo method, we collect all state-to-return pairs, average the return values grouped by state, and then train the neural network on those averages.
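A small sketch of that aggregation (plain Python; the (state, return) pairs are assumed to come from a pass like the Monte Carlo return computation above):

    from collections import defaultdict

    # Average the sampled returns per state to get Monte Carlo value targets.
    def average_returns(state_return_pairs):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for state, g in state_return_pairs:
            totals[state] += g
            counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}

    # e.g. average_returns([("A", 0.0), ("B", 2.0), ("B", 2.0)]) -> {"A": 0.0, "B": 2.0}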

This averaging is a problem for a state that rarely occurs. Consider this example: state A leads to state B but gets no reward in its single episode, while state B on average yields a return of 2 across the other episodes. Aggregating over A's episodes falsely gives A a return of 0. (This is wrong, because A can only lead to B, which has a return of 2.)

A better approach, often adopted by TD, is to consider state transitions between adjacent states. In the example above, state A's value is tied to the return of its successor state, so B's return contributes to the estimate for state A.
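A tiny sketch of the contrast, with hypothetical episodes whose rewards are chosen so that B's average return is 2, matching the example above (undiscounted for simplicity):

    # Observed transitions: (state, reward, next_state); next_state None means terminal.
    episodes = [
        [("A", 0.0, "B"), ("B", 0.0, None)],   # A's only episode: total return 0
        [("B", 3.0, None)],
        [("B", 3.0, None)],
        [("B", 2.0, None)],                    # B's observed returns average out to 2
    ]

    # Batch Monte Carlo estimate for A: average of A's whole-episode returns.
    mc_A = sum(r for (s, r, nxt) in episodes[0])            # 0.0

    # Batch TD estimate for A: reward on the A -> B transition plus B's estimated value.
    b_returns = [r for ep in episodes for (s, r, nxt) in ep if s == "B"]
    v_B = sum(b_returns) / len(b_returns)                   # 2.0
    td_A = 0.0 + v_B                                        # 2.0: A inherits B's value

    print(mc_A, td_A)   # 0.0 vs 2.0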

https://www.youtube.com/watch?v=OOlQj8UEJnI

https://www.youtube.com/watch?v=L64E_NTZJ_0


n-step Bootstrapping

In Temporal Difference learning, we look at the reward at step i and the future return of the next state (i+1), while in the Monte Carlo method we look at the rewards of all remaining steps. The middle ground is to look at the actual rewards over multiple steps and the future return of a state further ahead. This is called n-step bootstrapping.
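A rough sketch of the n-step return under this description (plain Python; the value table V and the states/rewards layout are assumptions):

    # n-step return from step i: sum n actual (discounted) rewards, then bootstrap
    # from the value estimate of the state n steps ahead (if the episode has not ended).
    def n_step_return(rewards, states, V, i, n, gamma=0.99):
        # rewards[k] is the reward received after leaving states[k];
        # len(states) == len(rewards) + 1, the last state being terminal.
        g = 0.0
        last = min(i + n, len(rewards))
        for k in range(i, last):
            g += (gamma ** (k - i)) * rewards[k]
        if last < len(rewards):                  # a non-terminal state remains: bootstrap from V
            g += (gamma ** (last - i)) * V[states[last]]
        return g

With n = 1 this reduces to the TD target above; with n large enough to reach the terminal state it becomes the full Monte Carlo return.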
