This is a summary of the key ideas from Sutton and Barto's book on Reinforcement Learning (the book is available at http://www.incompleteideas.net/book/the-book-2nd.html).
Why do I need Reinforcement Learning when I already have the complete game tree for Tic-tac-toe?
- Imagine walking in a maze. Once you have randomly walked from point A to point B, each action along that path is valuable. You want to adjust the score of the action at each step with increasing magnitude as the step gets closer to the goal, so that when following the path later, you can look for the action with the higher score.
K-armed Bandit problem
- Assigning an initial value greater than 0 helps RL explore in the early stage, since starting from 0 often over-favors the first path found in the solution space. This is called an optimistic initial value. (See the sketch after this list.)
- Upper Confidence Bound action selection: (I skipped)
- Gradient bandit algorithms: (I skipped)
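A minimal sketch of optimistic initial values on a toy k-armed bandit, assuming a purely greedy agent; the arm means, the initial estimate of 5.0, and the step count are made up for illustration:

```python
import numpy as np

def greedy_bandit(true_means, initial_q=5.0, steps=1000):
    """Greedy action selection with optimistic initial action-value estimates."""
    k = len(true_means)
    q = np.full(k, initial_q)   # optimistic estimates: every arm looks great at first
    n = np.zeros(k)             # number of times each arm has been pulled
    for _ in range(steps):
        a = int(np.argmax(q))                          # purely greedy choice
        reward = np.random.normal(true_means[a], 1.0)  # noisy reward from the chosen arm
        n[a] += 1
        q[a] += (reward - q[a]) / n[a]                 # incremental sample average
    return q

# The optimistic start forces the agent to try every arm before settling.
print(greedy_bandit(true_means=[0.1, 0.5, 0.9]))
```

Each disappointing reward pulls the inflated estimate back down, so the greedy choice keeps switching arms early on instead of locking onto the first one it tries.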
Markov Decision Process
Return:
Given a time step t, the Return is the sum of all future rewards along the path after time t.
Discounted Return:
Similar to Return, but since we don't know whether a reward will actually happen in the future, the weight of a reward in a future step is reduced by a fraction per step (so a reward k steps ahead is weighted by that fraction raised to the power k).
This makes near rewards more attractive.
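In the book's notation, with discount rate $\gamma$ between 0 and 1, the discounted return after time $t$ is:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$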
Policy Function:
It is a function that takes the current state S and a possible action a, and returns the probability of that action being taken.
Note that it could also output a preference value per action and apply softmax to choose an action by weight.
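A minimal sketch of the softmax variant, assuming the policy is represented as a vector of per-action preference scores (the numbers here are made up):

```python
import numpy as np

def softmax_policy(preferences):
    """Turn per-action preference scores into a probability distribution."""
    z = np.exp(preferences - np.max(preferences))  # subtract the max for numerical stability
    return z / z.sum()

# Higher preference -> higher probability, but every action keeps some chance.
probs = softmax_policy(np.array([1.0, 2.0, 0.5]))
action = np.random.choice(len(probs), p=probs)     # sample an action by weight
```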
Value Function:
A value function takes only the current state S and evaluates the expected future return / discounted return under a policy.
Note that a value function has to be defined under a policy, since the future return depends on the path that the policy follows. It is possible to have both a value function and a policy function in a system, for example the actor-critic algorithm.
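In the book's notation, the value of state $s$ under a policy $\pi$ is the expected discounted return when starting in $s$ and following $\pi$ afterwards:

$$ v_\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right] $$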
Optimal Value Function:
At each state, iterate over all possible policies and take the one policy that gives the maximum Return compared to the other policies. Such a value function is called an optimal value function.
The values from the optimal value function can be used to guide an optimal policy, which chooses the action with the largest value under the optimal value function.
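In the book's notation, the optimal state-value function and the greedy policy it induces are:

$$ v_*(s) = \max_\pi v_\pi(s), \qquad \pi_*(s) = \arg\max_a q_*(s, a) $$

where $q_*(s, a)$ is the corresponding optimal action-value function.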
Episode:
Since not all tasks have an end state, when dealing with a task that goes on indefinitely, the process is usually divided into episodes. Each episode defines its own goal.
Monte Carlo Method
When the value function cannot be modeled, we focus on finding a policy function - choosing an action based on the current state. A usually good (near-optimal) way to find it is to randomly sample episodes and average the Returns observed at each (state, action) pair (this process is the so-called Monte Carlo Method). Once trained, we can use this mapping to choose the action that gives the largest Return at a given state.
The random sampling needs to cover all actions. To help with this, we need to let it try all initial states and all initial actions (exploring starts).
Each training pass takes in 1 episode. Each episode is a list of (state, action) pairs. Iterate the (state, action) pairs backwards from the goal state to the initial state: the return at each state is the reward at the current state + the discounted return from the future states. Note that at the goal state the discounted return is 0. (So, iterate backward and multiply the accumulated return by the discount rate.)
During this "training", just average the return at the (state, action) pair. (That means your sample episodes likely need to cover each (state, action) pair more than once.) Avoid "training" the same (state, action) pair twice in the same episode by updating only at its first occurrence, counted from the start of the episode.
Note the "training" here does nothing more than taking an average.
Avoid random initialization
Sometimes it might not be practical to construct all random starting actions. Instead, we make the stepping stochastic: with a small probability it randomly picks an action to explore, while most of the time it follows a greedy rule and picks the action with the largest return (this is the epsilon-greedy approach).
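A minimal sketch of this epsilon-greedy stepping, assuming the averaged returns are kept in a dict keyed by (state, action) as in the sketch above:

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                            # explore
    return max(actions, key=lambda a: q.get((state, a), 0.0))    # exploit
```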
On-Policy Training vs. Off-Policy Training
So far, what has been described uses only 1 policy function, which is known as on-policy training. It is usually faster to train, but it is only a special case of off-policy training. In off-policy training, two policy functions are involved: a target policy that is being learned and a behavior policy used for exploration. On-policy training is just the special case where the target policy is the same as the behavior policy. By separating the two policies, we are not asking the same policy to be greedy and to be willing to try random actions at the same time.
For this to work, the behavior policy must have a non-zero probability for every action that the target policy might take, so that the target policy's actions can actually show up in the behavior policy's samples (the coverage assumption).
There is a problem with Monte Carlo sampling at the tail of a distribution, where it usually requires a lot of samples. Importance sampling is a method that draws samples from a second distribution and reweights them by the ratio of the two distributions, so that the estimate requires fewer samples.
Off-policy methods usually utilize importance sampling to reweight the returns collected under the behavior policy, so that fewer samples are needed to improve the target policy.
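A minimal sketch of ordinary importance sampling over one episode, assuming both policies are given as functions mapping (state, action) to a probability and the episode is a list of (state, action, reward) triples in time order:

```python
def importance_weighted_return(episode, target_policy, behavior_policy, gamma=0.9):
    """One sample of rho * G for an episode generated by the behavior policy."""
    g, rho, discount = 0.0, 1.0, 1.0
    for s, a, r in episode:
        rho *= target_policy(s, a) / behavior_policy(s, a)  # importance ratio per step
        g += discount * r                                   # discounted return of the episode
        discount *= gamma
    return rho * g

# Averaging these weighted returns over many behavior-policy episodes that start
# in the same state estimates that state's value under the target policy.
```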
On-policy methods can also use importance sampling to learn from previous experience (collected from a different, older version of the policy, since the policy keeps being updated), so that past experience can be replayed in on-policy training. This is demonstrated in the PPO algorithm: https://www.youtube.com/watch?v=IScp-mZ7iS0