Reinforcement Learning algorithms summarized
Everything you wanted to learn about RL, but were too afraid to ask!
These are notes that I made for myself. Cross-posting here so that others could benefit too!
The basic task of RL
The basic task of reinforcement learning is this: given the state we’re in and the probabilities of the different actions we can take, how do we increase or decrease those probabilities so that average future rewards go up? The fundamental tradeoff here is between bias and variance (as detailed below).
The intuitive thing to do is to increase the probability of actions that are “better” (in the sense of leading to a higher-rewarding trajectory in the future)
Notice that no matter which algorithm we end up using, we need to know how “rewarding” an action is. This reward can be calculated in two ways:
Monte Carlo approach: wait for full roll-outs and sum the actual rewards we get in the trajectory
This is low bias because you’re working with actual rewards from the environment (not estimates), but high variance because the returns of different trajectories vary a lot
Bootstrap an estimator of rewards: learn an estimator of rewards (say via a neural network) and calculate how rewarding an action is as the immediate reward + the estimated value of the next state
This is high bias because the target comes from a learned function (which, when randomly initialised, carries an arbitrary bias), but low variance because the estimator gives similar values for similar states
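For concreteness, here is a minimal sketch in Python of how each target could be computed for a single trajectory (the helper names and the use of a discount factor are my own illustration, not a fixed convention):
```python
import numpy as np

def monte_carlo_return(rewards, gamma=0.99):
    """Sum of (discounted) actual rewards from each step until the end of the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns  # low bias (real rewards), high variance (trajectories differ a lot)

def bootstrap_target(reward, value_of_next_state, gamma=0.99):
    """One-step target: immediate reward plus the estimated value of the next state."""
    return reward + gamma * value_of_next_state  # low variance, bias comes from the estimator
```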
The Monte Carlo approach is not applicable in many scenarios
Episodes may never “end”, or may be very long.
In this case, we can’t wait until infinity; at some point we need to estimate the expected rewards from there onwards
Learning with Monte Carlo requires the episode to end
If episodes are very long, you may want to start learning immediately instead of waiting
The Generalized Advantage Estimator (GAE) gives a tradeoff between these two: its lambda parameter lets you choose pure Monte Carlo (lambda = 1), pure bootstrapping (lambda = 0), or a mix of both
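A rough sketch of GAE, assuming per-step rewards and value estimates are already available; lam = 1 recovers the Monte Carlo flavour, lam = 0 recovers the one-step bootstrap:
```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    rewards: r_0 ... r_{T-1}
    values:  V(s_0) ... V(s_T)  (one extra entry for the state after the last reward)
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better this step went than the value estimate expected
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # lam = 0 keeps only delta (pure bootstrap); lam = 1 sums all future deltas (Monte Carlo-like)
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```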
REINFORCE algorithm
The REINFORCE algorithm is surprisingly simple
Initialize a neural network that takes a state and outputs probability of different actions
You can use softmax at the end to convert to probabilities
Loop through environment, and at each step:
Stochastically sample an action for the current state from the policy, and take it in the environment
Finish the trajectory
Loop through each state/action pair,
Calculate sum of all rewards from current state until the end
and accumulate loss values for each action, state pair
-log(probability(action|state))*sum_of_rewards
Do gradient descent with respect to the accumulated loss value
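Putting these steps together, here is a minimal REINFORCE sketch in PyTorch; the environment (CartPole via gymnasium), network size, and learning rate are arbitrary assumptions, not part of the algorithm itself:
```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64), nn.Tanh(),
    nn.Linear(64, env.action_space.n), nn.Softmax(dim=-1),  # softmax converts outputs to probabilities
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(1000):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                              # stochastic sampling (exploration)
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Sum of rewards from each step until the end of the episode
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + running
        returns.insert(0, running)

    # Accumulate -log pi(a|s) * sum_of_rewards over the trajectory and take one gradient step
    loss = -torch.stack(log_probs) @ torch.as_tensor(returns, dtype=torch.float32)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```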
Understanding the loss function and what it does
Since the term is negated, decreasing -log(x) means increasing log(x)
Increasing log(probability(action|state))*sum_of_rewards means (over many loops/in the limit):
If a trajectory led to higher than average rewards, the probability of the actions taken in that trajectory increases (relative to other actions)
And if a trajectory led to lower than average rewards, the probability of the actions taken in that trajectory decreases (relative to other actions)
Notice that the loss only involves the probability of the action that was actually chosen (the other available actions are not directly touched)
This is why exploration becomes very important. If we deterministically picked just one action, its probability would keep on rising (at least when rewards are never negative, because we always multiply its log-probability by a positive accumulated reward).
This is why stochastic sampling becomes important
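A tiny illustration of the difference, assuming a policy that has produced probabilities for three actions:
```python
import torch

probs = torch.tensor([0.5, 0.3, 0.2])            # policy output for some state
dist = torch.distributions.Categorical(probs)

greedy_action = torch.argmax(probs)              # deterministic: always picks action 0
sampled_action = dist.sample()                   # stochastic: actions 1 and 2 still get explored
```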
When to use this algorithm and when not to:
The algorithm suffers from high variance since the expected reward from a given (action, state) pair depends on future rewards, which are themselves the result of stochastically sampled actions
Since we’re nudging the probabilities of all actions in a trajectory in direct proportion to the accumulated reward, a bad action that happened to sit in a high-rewarding trajectory gets reinforced, while a good action that happened to sit in a low-rewarding trajectory gets suppressed
So once the algorithm increases the probability of a bad action, future trajectories are expected to be worse
This high variance makes the algorithm very data hungry (hence it often converges slowly)
It works well when you are in an infinite data regime (i.e. you can collect lots of data)
So the key problem is the high variance of the expected future reward. If by chance we sample a high-rewarding trajectory, then, since its return is high, we end up reinforcing even the bad actions in it.
Actor-Critic
How do we counter this?
This is where advantage (in the actor-critic framework) comes in
Key intuition: instead of REINFORCING an action in proportion to its expected future rewards, REINFORCE it in proportion to how much better the action is v/s average action (as measured by difference in expected future rewards from the state)
How do we know expected future reward for any action given a state?
We configure another network: the value network that outputs the expected future value of a state: V(s)
V(s) = expectation of the Q value over actions drawn from the policy
This value function of the state, V(s), is also called the critic, while the policy is called the actor
Together they constitute the actor-critic algorithm
Actor-critic algorithm is exactly like REINFORCE algorithm, with one difference: instead of multiplying by sum_of_rewards, we multiply by advantage
Advantage_t = sum_of_rewards_t - value(state_t)
Loss -> -log(probability(action|state))*advantage_t
Another way to calculate advantage
Advantage = (reward + discount_factor * value(next_state) - value(current_state))
This has lower variance (but higher bias) than the Monte Carlo version above
How is value network trained
The loss is the mean squared error between V(s) and the observed future rewards (the return) from that state
Main training loop
Initialize policy network and value network
Sample episodes from policy
For each (state, action) pair in the trajectory
Calculate the advantage
REINFORCE actions in proportion to their advantage
Train the value network against the observed return from that state
Note that both actor and critic networks can share parameters in the same neural network
They can be different heads of the same network
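A rough actor-critic sketch with a shared torso and two heads; it uses the TD form of the advantage from above and updates online at every step for brevity, and the environment, layer sizes, and learning rate are again illustrative assumptions:
```python
import torch
import torch.nn as nn
import gymnasium as gym

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())  # shared parameters
        self.actor = nn.Linear(64, n_actions)   # policy head (logits)
        self.critic = nn.Linear(64, 1)          # value head V(s)

    def forward(self, x):
        h = self.torso(x)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

env = gym.make("CartPole-v1")
net = ActorCritic(env.observation_space.shape[0], env.action_space.n)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        dist, value = net(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # TD advantage: reward + gamma * V(next_state) - V(current_state)
        with torch.no_grad():
            _, next_value = net(torch.as_tensor(next_state, dtype=torch.float32))
            target = reward + gamma * next_value * (1.0 - float(done))
        advantage = target - value

        actor_loss = -dist.log_prob(action) * advantage.detach()   # REINFORCE, scaled by advantage
        critic_loss = advantage.pow(2)                              # MSE of V(s) against the target
        loss = actor_loss + critic_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
```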
Actor-critic works well in practice, but a single (possibly misestimated) advantage can still cause a large change in the network.
PPO - Proximal Policy Optimization
PPO helps prevent such a large change by doing lots of small changes instead.
Why large changes are detrimental
Large changes in policy would be okay if we had a perfect advantage value, but we only ever have an estimate of it (via the value network)
This means that if we end up miscalculating the advantage of an action, we can end up REINFORCING a bad action
And because future actions are sampled via updated policy, it could take a long time to undo this change
How PPO does lots of small changes v/s actor-critic
For each episode, run multiple epochs of updating policy (v/s often 1 update for actor-critic)
Use a smaller learning rate v/s actor critic
Limit large policy updates in one go
PPO requires keeping two networks: old policy and new one
It does so by calculating probability ratio = probability of action for new policy / probability of action for old policy
It clips the probability ratio to the range (1 - ε, 1 + ε)
The advantage is scaled by this probability ratio, and the objective takes the minimum of the clipped and unclipped terms (see the sketch below)
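A sketch of that clipped objective; the min over the unclipped and clipped terms is what actually limits the update (the epsilon default of 0.2 is just a commonly used value, an assumption here):
```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO surrogate loss for a batch of (state, action) pairs."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Take the more pessimistic of the two scaled advantages, then average over the batch
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```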
What happens in PPO loop
Actions that initially have a high advantage get reinforced quickly and increase in probability wrt the old policy
But clipping prevents this probability from increasing too much v/s the old policy (since one batch of episodes only contains a finite amount of information)
This applies to both positive and negative advantages - actions with negative advantage also can't be suppressed too aggressively
Imagine the counter-factual without clipping
If we did multiple epochs of policy updates on the same batch with a large learning rate, we could end up reinforcing actions that got a positive advantage purely by chance
PPO prevents that by ensuring new policy (as measured by probability of action given state) doesn’t drift too far from old policy
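To make the “lots of small changes” concrete, here is a sketch of the outer update loop; it assumes the ActorCritic net, optimizer, and ppo_policy_loss defined above, and that one batch of rollouts has already been collected with the old, frozen policy:
```python
# Assumed batch tensors from the rollout (shapes are illustrative):
#   states [N, obs_dim], actions [N], old_log_probs [N], advantages [N], returns [N]
for epoch in range(4):                                   # several small updates on the same batch
    dist, values = net(states)                           # probabilities under the *current* policy
    new_log_probs = dist.log_prob(actions)
    policy_loss = ppo_policy_loss(new_log_probs, old_log_probs.detach(), advantages)
    value_loss = (returns - values).pow(2).mean()        # critic is still trained with MSE
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```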