Why Reinforcement Learning?
Reinforcement learning is typically used to generate a sequence of actions to achieve a goal. Note that it is different from a classification problem, which answers a single question: should a given feature vector get a yes/no (or categorical) label? However, each step of the sequence may be decided much like a classification problem: given a feature vector for the current environment state, what is the next action to take toward the goal state? The main differences are:
- a sequence of actions (trajectory) vs. a single decision
- environment may be a black box.
The goal state is assigned a reward, which guides the choice of actions. There are two approaches: find a policy function that, given a state S, returns the probability of picking each action a; or find a value function that predicts the reward obtainable from a state, and use it to greedily pick the next action a that gives the most reward (the reward observed when moving to the next state plus the value function's guess of the reward from there on).
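As a rough illustration of the difference (a toy sketch; the action probabilities, rewards, and value estimates below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Policy-based: the policy gives a probability for each action in the current state,
# and the action is sampled according to those probabilities.
action_probs = np.array([0.1, 0.7, 0.2])        # assumed output of a policy function for state S
action = rng.choice(len(action_probs), p=action_probs)

# Value-based: estimate "observed reward for the move + guessed reward from the next state"
# for each candidate action, and greedily pick the best one.
immediate_reward = np.array([0.0, 1.0, 0.0])    # made-up observed rewards per action
next_state_value = np.array([0.5, 0.2, 0.9])    # made-up value-function guesses per next state
greedy_action = np.argmax(immediate_reward + next_state_value)

print(action, greedy_action)
```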
Note that with a policy function, actions are picked randomly according to their probabilities. The policy function can therefore be used to randomly sample a few trajectories, observe their rewards, and take the average as an estimate of how good a state is. In reality there may be an enormous number of possible trajectories; a few samples are enough for a Monte Carlo estimate.
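A minimal sketch of that Monte Carlo estimate, with a hypothetical toy environment (`step`) and a hand-written stand-in for the policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state):
    # Stand-in for a learned policy: probabilities over 2 actions.
    return np.array([0.5, 0.5])

def step(state, action):
    # Hypothetical toy dynamics: action 1 moves toward a rewarding terminal state at +3.
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state >= 3 else 0.0
    done = next_state >= 3 or next_state <= -3
    return next_state, reward, done

def sample_return(start_state, gamma=0.99, max_steps=50):
    # Roll out one trajectory by sampling actions from the policy; return its discounted reward.
    state, total, discount = start_state, 0.0, 1.0
    for _ in range(max_steps):
        action = rng.choice(2, p=policy(state))
        state, reward, done = step(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

# Average over a few sampled trajectories: a Monte Carlo estimate of how good state 0 is.
returns = [sample_return(0) for _ in range(100)]
print(np.mean(returns))
```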
Describe RL in a more human-understandable fashion
Applications:
- Chatbot - outputs a sequence of words to provide an answer to a question
- AlphaGo - the game of Go; uses a sequence of moves to reach the most valuable board state
- Control - plan a trajectory to steer a robot or game agent to a certain state.
Q Function
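The Q function Q(s, a) is the value function written over state-action pairs: the expected future reward for taking action a in state s and following the policy from then on. A minimal tabular sketch of the standard Q-learning update (the state/action counts, learning rate, discount factor, and example transition are hypothetical):

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99                 # hypothetical learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the observed reward plus the best estimated value of the next state.
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

# Made-up transition: taking action 1 in state 0 gave reward 1.0 and led to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q[0])
```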
Actor Critic
In Actor-Critic, two networks are trained. The Critic network gives a score for how good an action is in a given environment state (so it takes two inputs: the action and the state); it is trained on the reward from the environment. The Actor network chooses the action to take in each environment state so that the average score from the Critic network is higher (so it takes one input: the state); it is trained on the scores from the Critic network.
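A minimal one-step sketch of this idea in PyTorch. Note this sketch uses the common variant where the Critic scores the state and the Actor is trained on the advantage (how much better the action turned out than the Critic expected); the network sizes, learning rates, and example transition are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and hyperparameters.
state_dim, n_actions = 4, 2
gamma = 0.99

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))  # input: state
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))         # input: state
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, done):
    # One actor-critic step: the critic learns from the environment reward,
    # the actor learns from the critic's score (the advantage).
    state = torch.as_tensor(state)
    next_state = torch.as_tensor(next_state)

    # Critic: move its score for this state toward the observed reward
    # plus the discounted score of the next state.
    with torch.no_grad():
        target = reward + gamma * (0.0 if done else critic(next_state).item())
    value = critic(state).squeeze()
    critic_loss = (value - target) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise the probability of actions that scored better than the critic expected.
    advantage = target - value.detach()
    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -advantage * log_prob
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with made-up numbers:
update(state=[0.1, 0.0, -0.2, 0.3], action=1, reward=1.0,
       next_state=[0.0, 0.1, -0.1, 0.2], done=False)
```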
Proximal Policy Optimization
PPO is an improvement on Actor-Critic where the update to the actor is capped to avoid taking too large a step. This can be done in two ways: limit the loss, clipping it to the maximum allowed value when it gets too large; or constrain the ratio between the probability of an action under the new policy and under the old policy.
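A sketch of the clipped-ratio version of that cap (the clipping range `epsilon` and the example numbers are placeholders):

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    # Ratio between the probability of the action under the new policy and the old policy.
    ratio = torch.exp(new_log_prob - old_log_prob)
    # Clip the ratio so a single update cannot move the policy too far,
    # then take the more pessimistic of the clipped and unclipped objectives.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

# Example with made-up numbers for a batch of 3 sampled actions:
loss = ppo_clip_loss(new_log_prob=torch.tensor([-0.5, -1.2, -0.3]),
                     old_log_prob=torch.tensor([-0.6, -1.0, -0.4]),
                     advantage=torch.tensor([1.0, -0.5, 2.0]))
print(loss)
```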
Trust Region Policy Optimization
In Trust Region Policy Optimization, a function L is used to approximate the target function whose parameters we want to find. L and the target function are only similar within a small region. Within that region, find the parameters that maximize L, and take them as the new parameters of the target function. Then build the approximating function L again around the new parameters. The process repeats, alternating between approximation and maximization.
The Monte Carlo method can be used to make the approximation L, by taking sample trajectories.
To make sure the new parameters after maximization stay within a small region around the old parameters, the KL divergence between the action probability distributions of the old and new policy functions is used. Alternatively, the distance between the parameters themselves can be measured directly.
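A small sketch of the two quantities involved, with made-up discrete action probabilities: the sampled approximation L, and the KL divergence used to check that the new policy stays close to the old one.

```python
import numpy as np

def surrogate_L(new_probs, old_probs, advantages):
    # Monte Carlo approximation L: importance-weighted advantages of actions
    # that were sampled from the old policy.
    return np.mean(new_probs / old_probs * advantages)

def kl_divergence(old_dist, new_dist):
    # KL divergence between the old and new action distributions for one state.
    return np.sum(old_dist * np.log(old_dist / new_dist))

# Made-up numbers: probabilities of the sampled actions under each policy, and their advantages.
L = surrogate_L(new_probs=np.array([0.35, 0.60]),
                old_probs=np.array([0.30, 0.70]),
                advantages=np.array([1.0, -0.5]))

# Full distributions over 3 actions for one state, before and after the update.
kl = kl_divergence(old_dist=np.array([0.3, 0.5, 0.2]),
                   new_dist=np.array([0.35, 0.45, 0.2]))

print(L, kl)   # maximize L only while kl stays below a small threshold (the trust region)
```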