This is a summary of the key ideas from Sutton and Barto's book on Reinforcement Learning. (Here is the book http://www.incompleteideas.net/book/the-book-2nd.html)
Why do I need Reinforcement Learning when I already have the complete game tree for Tic-tac-toe?
That's because you also want to model the other player, not just the game itself. The goal is to find and exploit weaknesses in the adversary's play.
If the opponent plays well, adapt so that you at least don't lose the game. If the opponent plays poorly, find a simple way to beat them - you are not looking for the game-theoretically optimal solution.
How to do Reinforcement Learning?
At each step, you have a list of actions that can be taken, and each action has a score. When playing seriously, you take the action with the highest score. (Action values are not adjusted in this mode.) When you are not serious and are exploring, you pick an action at random, with probabilities weighted by the action scores.
Not every step produces a result (reward), so you collect the steps along the way as a path.
Once a step reaches a result (reward), recall the steps in the whole path and adjust the action value at each step to be a little higher (thus lowering the relative chance of the other actions).
- Imagine walking in a maze. Once you have randomly walked from point A to point B, every action along that path is valuable. You adjust the action score at each step, with a larger increase the closer the step is to the goal, so that when following the path later you can look for the actions with the higher scores.
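A minimal sketch of this loop in Python, assuming a hypothetical environment object with legal_actions(state) and step(state, action) methods (those names, and the 10% exploration rate, are illustrative assumptions, not from the book):

```python
import random
from collections import defaultdict

action_score = defaultdict(float)   # score for each (state, action) pair
EXPLORE_PROB = 0.1                  # small chance of not playing "seriously"

def choose_action(env, state):
    actions = env.legal_actions(state)
    if random.random() < EXPLORE_PROB:
        # exploring: pick an action weighted by its (shifted) score
        weights = [max(action_score[(state, a)], 0.0) + 1e-3 for a in actions]
        return random.choices(actions, weights=weights)[0]
    # playing seriously: take the highest-scoring action
    return max(actions, key=lambda a: action_score[(state, a)])

def run_episode(env, state):
    path = []
    reward = 0.0
    done = False
    while not done:
        action = choose_action(env, state)
        path.append((state, action))
        state, reward, done = env.step(state, action)
    # reward reached: nudge every step on the path up, more strongly near the goal
    for i, (s, a) in enumerate(path):
        closeness = (i + 1) / len(path)          # later steps are closer to the goal
        action_score[(s, a)] += reward * closeness * 0.1
```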
K-armed Bandit problem
Given a slot machine with k arms (buttons), the jackpot (reward) is normally distributed around a value between 1 and k, and this location may vary over 24 hours (but is fixed within any given hour). RL can learn to pick an arm at each hour so as to maximize the cumulative reward.
Note that in this problem an action (choice of arm) at one hour has no effect on actions at other hours, so there is no path involved. There is no parameter other than the timestamp.
If there is a parameter that indicates the state of the slot machine (for example, the screen color varies depending on the center of the reward distribution), it is called the "Contextual Bandit problem", because the extra parameter assists the guessing. (Note that actions still have no effect on one another.)
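A sketch of a sample-average learner for this setup, keeping one value estimate per (hour, arm). The caller plays the chosen arm on the real machine and feeds the observed reward into update(); k and EPSILON are illustrative assumptions:

```python
import random
from collections import defaultdict

k = 10
q = defaultdict(float)      # estimated value of each (hour, arm)
n = defaultdict(int)        # how many times each (hour, arm) was tried
EPSILON = 0.1

def choose_arm(hour):
    if random.random() < EPSILON:
        return random.randrange(k)                       # explore a random arm
    return max(range(k), key=lambda a: q[(hour, a)])     # exploit the best-looking arm

def update(hour, arm, reward):
    n[(hour, arm)] += 1
    # incremental sample average: Q <- Q + (R - Q) / N
    q[(hour, arm)] += (reward - q[(hour, arm)]) / n[(hour, arm)]
```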
Tricks to improve RL:
- Assigning an initial value greater than 0 helps RL explore in the early stage, as starting from 0 often favors the first path found in the solution space too much. This is called Optimistic Initial Values. (See the sketch after this list.)
- Upper Confidence Bound action selection: (I skipped)
- Gradient bandit algorithms: (I skipped)
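A minimal sketch of optimistic initialization, reusing the value table from the bandit sketch above (the starting value of 5.0 is an arbitrary assumption):

```python
from collections import defaultdict

# Optimistic initial values: start every estimate well above any realistic reward
# (5.0 is an arbitrary assumption), so each arm looks disappointing after being
# tried and the learner keeps exploring the untried arms early on.
OPTIMISTIC_START = 5.0
q = defaultdict(lambda: OPTIMISTIC_START)   # replaces the zero-initialized table above
```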
Markov Decision Process
Unlike the K-armed Bandit problem, when an action changes the state and thereby affects future actions, the problem becomes a Markov Decision Process.
Return:
Given a time step t, the Return is the sum of all future rewards in the path after time t.
Discounted Return:
Similar to the Return, but since we are not certain that a future reward will actually arrive, the weight of a reward is reduced by a fraction (the discount rate) per step, so a reward k steps in the future is weighted by the k-th power of that fraction.
This makes near rewards more attractive than distant ones.
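A small sketch of computing a discounted return from a list of future rewards (gamma = 0.9 is just an example value):

```python
def discounted_return(rewards, gamma=0.9):
    # rewards[0] is the reward right after time t, rewards[1] one step later, ...
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r        # a reward k steps ahead is weighted by gamma^k
    return g

# With gamma = 0.9: [1, 1, 1] -> 1 + 0.9 + 0.81 = 2.71,
# so the same reward is worth less the further away it is.
```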
Policy Function:
It is a function that takes in the current state s and a possible action a and returns the probability of taking that action.
Note that the function could also output a preference value, with a softmax applied over the preferences to choose an action by weight.
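A sketch of the softmax variant, assuming the preferences are stored in a hypothetical (state, action) table:

```python
import math
import random

def softmax_policy(preferences, state, actions):
    prefs = [preferences[(state, a)] for a in actions]
    m = max(prefs)                                   # subtract max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(actions, weights=probs)[0]
```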
Value Function:
A value function takes in only the current state s and evaluates the expected future return / discounted return under a policy.
Note that a value function must be defined under a policy, since the future return depends on the path that the policy follows. It is possible to have both a value function and a policy function in one system, for example in the actor-critic algorithm.
Optimal Value Function:
At each state, consider all possible policies and take the one that gives the maximum Return compared to the others. The resulting value function is called the optimal value function.
The values from the optimal value function can be used to derive an optimal policy, which chooses the action with the largest value according to the optimal value function.
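As a sketch, if the optimal values are stored per (state, action) in a table (a hypothetical q_star dictionary here), the optimal policy is just an argmax over the actions:

```python
# Greedy policy derived from an optimal (state, action) value table `q_star`,
# which is an assumed data structure for this sketch.
def optimal_policy(q_star, state, actions):
    return max(actions, key=lambda a: q_star[(state, a)])
```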
Episode:
Not every task has an end state. When dealing with a task that goes on indefinitely, the process is usually divided into episodes, and each episode defines its own goal.
Monte Carlo Method
When the value function cannot be modeled directly, we focus on finding a policy - choosing an action based on the current state. A usually good approach is to sample episodes at random and average the Returns observed at each (state, action) pair (this process is the so-called Monte Carlo method). Once trained, we can use this mapping to choose the action that gives the largest average Return at a given state.
The random sampling needs to cover all actions. To help with this, we let the episodes start from all possible initial states and initial actions (exploring starts).
Each training pass takes in one episode. Each episode is a list of (state, action) pairs with their rewards. Iterate the pairs backwards from the goal state to the initial state: the return at each state is the reward at the current step plus the discounted return of the future steps. Note that at the goal state the future discounted return is 0. (So, iterate backward and multiply the accumulated return by the discount rate at each step.)
During this "training", just average the returns observed at each (state, action) pair. (That means your sample episodes likely need to cover the same (state, action) pair more than once.) Avoid "training" the same (state, action) pair twice in the same episode by updating only its first occurrence, the one closest to the start of the episode (first-visit Monte Carlo).
Note that the "training" here does nothing more than take an average.
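A sketch of one first-visit Monte Carlo pass over a single episode, assuming the episode is given as a list of (state, action, reward) tuples and GAMMA is the discount rate:

```python
from collections import defaultdict

GAMMA = 0.9
returns_sum = defaultdict(float)
returns_count = defaultdict(int)
q = defaultdict(float)                      # average return per (state, action)

def train_on_episode(episode):
    # remember where each (state, action) first appears, to update only the first visit
    first_visit = {}
    for i, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), i)

    g = 0.0
    for i in reversed(range(len(episode))):          # iterate backward from the goal
        s, a, r = episode[i]
        g = r + GAMMA * g                            # reward now + discounted future return
        if first_visit[(s, a)] == i:                 # first-visit check
            returns_sum[(s, a)] += g
            returns_count[(s, a)] += 1
            q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
```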
Avoid random initialization
Sometimes it is not practical to construct all of those random starting states and actions. Instead, we make the stepping stochastic: with a small probability the agent randomly picks an action to explore, while most of the time it greedily picks the action with the largest estimated return.
On-Policy Training vs. Off-Policy Training
So far, everything described here uses only one policy function, which is known as on-policy training. It is usually faster to train, but it is only a special case of off-policy training. In off-policy training, two policy functions are involved: a target policy that is being learned and a behavior policy used for exploration. On-policy training is just the special case where the target policy is the same as the behavior policy. By separating the two policies, we avoid asking a single policy to be greedy and willing to try random actions at the same time.
For this to work, the behavior policy must have a non-zero probability of taking every action that the target policy might take (the coverage requirement).
There is a problem with Monte Carlo sampling at the tail of a distribution, where it usually requires a lot of samples. Importance sampling is a method that draws samples from a second distribution and re-weights them by the ratio of the two probabilities, so that estimation under the first distribution requires fewer samples.
Off-policy methods usually use importance sampling to re-weight the returns collected under the behavior policy, so that fewer samples are needed to improve the target policy.
On-policy methods can also use importance sampling to learn from previous experience (which effectively comes from a different policy, since the policy keeps being updated), so that past experience can be replayed in on-policy training. This is demonstrated in the PPO algorithm: https://www.youtube.com/watch?v=IScp-mZ7iS0
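A sketch of the importance-sampling weight for a return collected under the behavior policy; target_prob and behavior_prob are assumed functions giving each policy's probability of taking a given action in a given state:

```python
def importance_ratio(episode, target_prob, behavior_prob):
    # episode: list of (state, action) pairs actually taken by the behavior policy
    rho = 1.0
    for s, a in episode:
        rho *= target_prob(s, a) / behavior_prob(s, a)   # behavior_prob must be > 0 (coverage)
    return rho

# The return observed under the behavior policy is then scaled by this ratio before
# being averaged into the target policy's value estimate.
```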