Temporal Difference vs. Monte Carlo method
In the Monte Carlo method, a trajectory (episode) has to end at a terminal state. The return is computed by aggregating all rewards along the path: walking backward from the final step, each earlier step is assigned the reward it received plus the discounted return of the step after it.
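For concreteness, here is a minimal sketch of that backward pass (hypothetical names; gamma is the discount rate):

```python
# Minimal sketch: assign each step of a finished episode its discounted return.
# rewards[i] is the reward received at step i; gamma is the discount rate.
def monte_carlo_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    g = 0.0
    for i in reversed(range(len(rewards))):  # walk backward from the terminal step
        g = rewards[i] + gamma * g
        returns[i] = g
    return returns

print(monte_carlo_returns([0.0, 0.0, 1.0]))  # [0.9801, 0.99, 1.0]
```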
Problems with the Monte Carlo method:
- It has to reach a terminal state, so the sampled paths are longer.
- The rewards along the way are not considered individually, so the value estimates for the intermediate steps are less accurate.
- If there are negative rewards and the final return cancels out to 0, the whole trajectory is not useful for learning.
In Temporal Difference (TD), instead of requiring a complete episode, a partial trajectory can be used. At each step i, the estimated return is the actual reward received at step i plus the estimated return of the next step (i+1), with that future part discounted.
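A minimal sketch of the one-step (TD(0)) target, assuming a tabular value estimate V and a learning rate alpha (both hypothetical names):

```python
# Minimal sketch: one-step TD update for a single transition.
# V is a dict mapping state -> estimated return.
def td0_update(V, state, reward, next_state, done, gamma=0.99, alpha=0.1):
    # Target = reward actually received at this step
    #          + discounted estimated return of the next step.
    bootstrap = 0.0 if done else gamma * V.get(next_state, 0.0)
    target = reward + bootstrap
    V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
    return V
```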
The main differences from the Monte Carlo method:
- The sampled path does not have to end at a terminal state.
- The rewards along the way are credited at their exact steps, separately from the estimated future return, instead of being folded into one final return.
- When there is only one reward (a single-step episode), the two methods give the same result.
The relation saying that the value of a state s is the immediate reward at s plus the discounted future return from the next state is called the Bellman equation.
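Written out in its standard form (γ is the discount rate, s' the next state, r the immediate reward):

V(s) = E[ r + γ · V(s') ]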
TD converges faster and reaches better performance for the same amount of training, compared to the Monte Carlo method.
Batch Update
Instead of updating the neural network at every step, play several episodes and then update the neural network (optimizer.step() in PyTorch).
In the Monte Carlo method, we collect all state-to-return pairs, average the return values grouped by state, and train the neural network on those averages.
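A minimal sketch of that aggregation, assuming the collected data is a list of (state, return) pairs (hypothetical layout):

```python
from collections import defaultdict

# Minimal sketch: batch Monte Carlo targets.
# pairs is a list of (state, return) tuples collected over several episodes.
def batch_mc_targets(pairs):
    sums, counts = defaultdict(float), defaultdict(int)
    for state, ret in pairs:
        sums[state] += ret
        counts[state] += 1
    # The training target for each state is its average observed return.
    return {s: sums[s] / counts[s] for s in sums}
```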
This is a problem for states that rarely occur. Consider this example: state A leads to state B but gets no reward in its single episode, while state B on average yields a return of 2 in the other episodes. Averaging over A's one episode falsely gives A a return of 0. (This estimate is misleading, because A can only lead to B, which has a return of 2.)
A better way, adopted by TD, is to consider the transitions between adjacent states. In the example above, state A's value is tied to the return of the future states, so B's return contributes to state A's estimate.
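A minimal sketch of that transition-based (batch TD) view on the A/B example, with made-up numbers chosen to match the text (gamma = 1; one episode where A leads to B with no reward, and other episodes where B's return averages 2):

```python
from collections import defaultdict

# Episodes stored as (state, reward, next_state) transitions; None marks terminal.
episodes = [
    [("A", 0.0, "B"), ("B", 0.0, None)],  # A leads to B but gets no reward here
    [("B", 3.0, None)],                   # elsewhere, B averages a return of 2
    [("B", 1.0, None)],
]

# Batch TD: sweep the stored transitions repeatedly, bootstrapping from V[next_state].
def batch_td(episodes, gamma=1.0, alpha=0.05, sweeps=2000):
    V = defaultdict(float)
    for _ in range(sweeps):
        for episode in episodes:
            for s, r, s_next in episode:
                target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                V[s] += alpha * (target - V[s])
    return dict(V)

# Batch Monte Carlo would give V(A) = 0 (its only episode returned 0).
# Batch TD instead lets B's value flow into A through the A -> B transition,
# so V(A) converges toward V(B) (about 1.33 on this toy data).
print(batch_td(episodes))
```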
https://www.youtube.com/watch?v=OOlQj8UEJnI
https://www.youtube.com/watch?v=L64E_NTZJ_0