Why Reinforcement Learning?
Reinforcement learning is typically used to generate a sequence of actions to achieve a goal. Note that it is different from a classification problem, which answers a single question: should a given feature vector get a yes/no (or categorical) label? However, each step of the sequence may be decided much like a classification problem: given a feature vector for the current environment state, what is the next action to take toward the goal state? The main differences are:
- a sequence of actions (trajectory) vs. a single decision
- environment may be a black box.
The goal state is assigned a reward, which guides the choice of actions. There are two approaches: find a policy function that, given a state S, returns the probability of picking each action a; or find a value function that predicts the reward obtainable from a state, and use it to greedily pick the next action a that gives the most reward (the reward observed when moving to the next state plus the value function's guess of the reward from there on).
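As a rough illustration of the difference (a toy sketch; the action probabilities, rewards, and value estimates below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Policy-based: the policy gives a probability for each action in the current state,
# and the action is sampled according to those probabilities.
action_probs = np.array([0.1, 0.7, 0.2])        # assumed output of a policy function for state S
action = rng.choice(len(action_probs), p=action_probs)

# Value-based: estimate "observed reward for the move + guessed reward from the next state"
# for each candidate action, and greedily pick the best one.
immediate_reward = np.array([0.0, 1.0, 0.0])    # made-up observed rewards per action
next_state_value = np.array([0.5, 0.2, 0.9])    # made-up value-function guesses per next state
greedy_action = np.argmax(immediate_reward + next_state_value)

print(action, greedy_action)
```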
Note that with a policy function, actions are picked randomly according to their probabilities. The policy function can therefore be used to randomly sample a few trajectories, observe their rewards, and take the average as an estimate of how good a state is. In reality there may be an enormous number of possible trajectories; a few samples are enough for a Monte Carlo estimate.
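A minimal sketch of that Monte Carlo estimate, with a hypothetical toy environment (`step`) and a hand-written stand-in for the policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state):
    # Stand-in for a learned policy: probabilities over 2 actions.
    return np.array([0.5, 0.5])

def step(state, action):
    # Hypothetical toy dynamics: action 1 moves toward a rewarding terminal state at +3.
    next_state = state + (1 if action == 1 else -1)
    reward = 1.0 if next_state >= 3 else 0.0
    done = next_state >= 3 or next_state <= -3
    return next_state, reward, done

def sample_return(start_state, gamma=0.99, max_steps=50):
    # Roll out one trajectory by sampling actions from the policy; return its discounted reward.
    state, total, discount = start_state, 0.0, 1.0
    for _ in range(max_steps):
        action = rng.choice(2, p=policy(state))
        state, reward, done = step(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

# Average over a few sampled trajectories: a Monte Carlo estimate of how good state 0 is.
returns = [sample_return(0) for _ in range(100)]
print(np.mean(returns))
```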
Describe RL in a more human-understandable fashion
Applications:
- Chatbot - outputs a sequence of words to provide an answer to a question
- AlphaGo - the game of Go; uses a sequence of moves to reach the most valuable board state
- Control - plan a trajectory to steer a robot or game agent to a certain state.
Q Function
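The Q function Q(s, a) is the value function written over state-action pairs: the expected future reward for taking action a in state s and following the policy from then on. A minimal tabular sketch of the standard Q-learning update (the state/action counts, learning rate, discount factor, and example transition are hypothetical):

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99                 # hypothetical learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the observed reward plus the best estimated value of the next state.
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])

# Made-up transition: taking action 1 in state 0 gave reward 1.0 and led to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q[0])
```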
Actor Critic
In Actor-Critic, two networks are trained. The Critic network gives a score for how good an action is in a given environment state (so it takes two inputs: the action and the state); it is trained on the reward from the environment. The Actor network chooses the action to take in each environment state so that the average score from the Critic network is higher (so it takes one input: the state); it is trained on the scores from the Critic network.
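A minimal one-step sketch of this idea in PyTorch. Note this sketch uses the common variant where the Critic scores the state and the Actor is trained on the advantage (how much better the action turned out than the Critic expected); the network sizes, learning rates, and example transition are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and hyperparameters.
state_dim, n_actions = 4, 2
gamma = 0.99

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))  # input: state
critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))         # input: state
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, done):
    # One actor-critic step: the critic learns from the environment reward,
    # the actor learns from the critic's score (the advantage).
    state = torch.as_tensor(state)
    next_state = torch.as_tensor(next_state)

    # Critic: move its score for this state toward the observed reward
    # plus the discounted score of the next state.
    with torch.no_grad():
        target = reward + gamma * (0.0 if done else critic(next_state).item())
    value = critic(state).squeeze()
    critic_loss = (value - target) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: raise the probability of actions that scored better than the critic expected.
    advantage = target - value.detach()
    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    actor_loss = -advantage * log_prob
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call with made-up numbers:
update(state=[0.1, 0.0, -0.2, 0.3], action=1, reward=1.0,
       next_state=[0.0, 0.1, -0.1, 0.2], done=False)
```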
Proximal Policy Optimization
PPO is an improvement on Actor-Critic where the update to the actor is capped to avoid taking too large a step. This can be done in two ways: limit the loss, clipping it to the maximum allowed value when it gets too large; or constrain the ratio between the probability of an action under the new policy and under the old policy.
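A sketch of the clipped-ratio version of that cap (the clipping range `epsilon` and the example numbers are placeholders):

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    # Ratio between the probability of the action under the new policy and the old policy.
    ratio = torch.exp(new_log_prob - old_log_prob)
    # Clip the ratio so a single update cannot move the policy too far,
    # then take the more pessimistic of the clipped and unclipped objectives.
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return -torch.min(ratio * advantage, clipped * advantage).mean()

# Example with made-up numbers for a batch of 3 sampled actions:
loss = ppo_clip_loss(new_log_prob=torch.tensor([-0.5, -1.2, -0.3]),
                     old_log_prob=torch.tensor([-0.6, -1.0, -0.4]),
                     advantage=torch.tensor([1.0, -0.5, 2.0]))
print(loss)
```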
Trust Region Policy Optimization
In Trust Region Policy Optimization, a function L is used to approximate the target function whose parameters we want to find. L and the target function are only similar within a small region. Within that region, find the parameters that maximize L, and take them as the new parameters of the target function. Then build the approximating function L again around the new parameters. The process repeats, alternating between approximation and maximization.
The Monte Carlo method can be used to make the approximation L, by taking sample trajectories.
To make sure the new parameters after maximization stay within a small region around the old parameters, the KL divergence between the action probability distributions of the old and new policy functions is used. Alternatively, the distance between the parameters themselves can be measured directly.
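A small sketch of the two quantities involved, with made-up discrete action probabilities: the sampled approximation L, and the KL divergence used to check that the new policy stays close to the old one.

```python
import numpy as np

def surrogate_L(new_probs, old_probs, advantages):
    # Monte Carlo approximation L: importance-weighted advantages of actions
    # that were sampled from the old policy.
    return np.mean(new_probs / old_probs * advantages)

def kl_divergence(old_dist, new_dist):
    # KL divergence between the old and new action distributions for one state.
    return np.sum(old_dist * np.log(old_dist / new_dist))

# Made-up numbers: probabilities of the sampled actions under each policy, and their advantages.
L = surrogate_L(new_probs=np.array([0.35, 0.60]),
                old_probs=np.array([0.30, 0.70]),
                advantages=np.array([1.0, -0.5]))

# Full distributions over 3 actions for one state, before and after the update.
kl = kl_divergence(old_dist=np.array([0.3, 0.5, 0.2]),
                   new_dist=np.array([0.35, 0.45, 0.2]))

print(L, kl)   # maximize L only while kl stays below a small threshold (the trust region)
```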