Reinforcement Learning algorithms summarized
Everything you wanted to learn about RL, but were too afraid to ask!
These are notes that I made for myself. Cross-posting here so that others could benefit too!
The basic task of RL
The basic task of reinforcement learning is this: given the state we’re in and the probabilities of the different actions we can take, how do we increase or decrease those probabilities so that average future rewards go up? The fundamental tradeoff here is between bias and variance (as detailed below).
The intuitive thing to do is to increase the probability of actions that are “better” (in the sense of leading to a higher-rewarding trajectory in the future)
Notice that no matter which algorithm we end up using, we need to know how “rewarding” an action is. This reward can be calculated in two ways:
Monte Carlo approach: wait for full roll-outs and sum the actual rewards we get in the trajectory
This is low bias because you’re working with actual rewards from the environment (not estimates), but high variance because the returns of different trajectories vary a lot
Bootstrap an estimator of rewards: learn an estimator of rewards (say via a neural network) and calculate how rewarding an action is as the immediate reward + the estimated value of the next state
This is high bias because the target comes from a learned function (which, when randomly initialised, carries an arbitrary bias), but low variance because the estimator gives similar values for similar states
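For concreteness, here is a minimal sketch in Python of how each target could be computed for a single trajectory (the helper names and the use of a discount factor are my own illustration, not a fixed convention):
```python
import numpy as np

def monte_carlo_return(rewards, gamma=0.99):
    """Sum of (discounted) actual rewards from each step until the end of the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns  # low bias (real rewards), high variance (trajectories differ a lot)

def bootstrap_target(reward, value_of_next_state, gamma=0.99):
    """One-step target: immediate reward plus the estimated value of the next state."""
    return reward + gamma * value_of_next_state  # low variance, bias comes from the estimator
```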
The Monte Carlo approach is not applicable in many scenarios
Episodes may never “end”, or may be very long.
In this case, we can’t wait until infinity; at some point we need to estimate the expected rewards from there onwards
Learning with Monte Carlo requires the episode to end
If episodes are very long, you may want to start learning immediately instead of waiting
The Generalized Advantage Estimator (GAE) gives a tradeoff between these two: its lambda parameter lets you choose pure Monte Carlo (lambda = 1), pure bootstrapping (lambda = 0), or a mix of both
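A rough sketch of GAE, assuming per-step rewards and value estimates are already available; lam = 1 recovers the Monte Carlo flavour, lam = 0 recovers the one-step bootstrap:
```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    rewards: r_0 ... r_{T-1}
    values:  V(s_0) ... V(s_T)  (one extra entry for the state after the last reward)
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better this step went than the value estimate expected
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # lam = 0 keeps only delta (pure bootstrap); lam = 1 sums all future deltas (Monte Carlo-like)
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```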
REINFORCE algorithm
The REINFORCE algorithm is surprisingly simple
Initialize a neural network that takes a state and outputs probability of different actions
You can use softmax at the end to convert to probabilities
Loop through environment, and at each step:
Stochastically sample an action for the current state from the policy, and take it in the environment
Finish the trajectory
Loop through each state/action pair,
Calculate sum of all rewards from current state until the end
and accumulate loss values for each action, state pair
-log(probability(action|state))*sum_of_rewards
Do gradient descent with respect to the accumulated loss value
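Putting these steps together, here is a minimal REINFORCE sketch in PyTorch; the environment (CartPole via gymnasium), network size, and learning rate are arbitrary assumptions, not part of the algorithm itself:
```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64), nn.Tanh(),
    nn.Linear(64, env.action_space.n), nn.Softmax(dim=-1),  # softmax converts outputs to probabilities
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for episode in range(1000):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()                              # stochastic sampling (exploration)
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Sum of rewards from each step until the end of the episode
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + running
        returns.insert(0, running)

    # Accumulate -log pi(a|s) * sum_of_rewards over the trajectory and take one gradient step
    loss = -torch.stack(log_probs) @ torch.as_tensor(returns, dtype=torch.float32)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```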
Understanding the loss function and what it does
Since the term is negated, decreasing -log(x) means increasing log(x)
Increasing log(probability(action|state))*sum_of_rewards means (over many loops/in the limit):
If a trajectory led to higher than average rewards, the probability of the actions taken in that trajectory increases (relative to other actions)
And if a trajectory led to lower than average rewards, the probability of the actions taken in that trajectory decreases (relative to other actions)
Notice that the loss only involves the probability of the action that was actually chosen (the other available actions are not directly touched)
This is why exploration becomes very important. If we deterministically picked just one action, its probability would keep on rising (at least when rewards are never negative, because we always multiply its log-probability by a positive accumulated reward).
This is why stochastic sampling becomes important
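A tiny illustration of the difference, assuming a policy that has produced probabilities for three actions:
```python
import torch

probs = torch.tensor([0.5, 0.3, 0.2])            # policy output for some state
dist = torch.distributions.Categorical(probs)

greedy_action = torch.argmax(probs)              # deterministic: always picks action 0
sampled_action = dist.sample()                   # stochastic: actions 1 and 2 still get explored
```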
When to use this algorithm and when not to:
The algorithm suffers from high variance since the expected reward from a given (action, state) pair depends on future rewards, which are themselves the result of stochastically sampled actions
Since we’re nudging the probabilities of all actions in a trajectory in direct proportion to the accumulated reward, a bad action that happened to sit in a high-rewarding trajectory gets reinforced, while a good action that happened to sit in a low-rewarding trajectory gets suppressed
So once the algorithm increases the probability of a bad action, future trajectories are expected to be worse
This high variance makes the algorithm very data hungry (hence it often converges slowly)
It works well when you are in an infinite data regime (i.e. you can collect lots of data)
So the key problem is the high variance of the expected future reward. If by chance we sample a high-rewarding trajectory, then, since its return is high, we end up reinforcing even the bad actions in it.
Actor-Critic
How do we counter this?
This is where advantage (in the actor-critic framework) comes in
Key intuition: instead of REINFORCING an action in proportion to its expected future rewards, REINFORCE it in proportion to how much better the action is v/s average action (as measured by difference in expected future rewards from the state)
How do we know expected future reward for any action given a state?
We configure another network: the value network that outputs the expected future value of a state: V(s)
V(s) = expectation of the Q value over actions drawn from the policy
This value function of the state, V(s), is also called the critic, while the policy is called the actor
Together they constitute the actor-critic algorithm
Actor-critic algorithm is exactly like REINFORCE algorithm, with one difference: instead of multiplying by sum_of_rewards, we multiply by advantage
Advantage_t = sum_of_rewards_t - value(state_t)
Loss -> -log(probability(action|state))*advantage_t
Another way to calculate advantage
Advantage = (reward + discount_factor * value(next_state) - value(current_state))
This has lower variance (but higher bias) than the Monte Carlo version above
How is value network trained
The loss is the mean squared error between V(s) and the observed future rewards (the return) from that state
Main training loop
Initialize policy network and value network
Sample episodes from policy
For each (state, action) pair in the trajectory
Calculate the advantage
REINFORCE actions in proportion to their advantage
Train the value network against the observed return from that state
Note that both actor and critic networks can share parameters in the same neural network
They can be different heads of the same network
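A rough actor-critic sketch with a shared torso and two heads; it uses the TD form of the advantage from above and updates online at every step for brevity, and the environment, layer sizes, and learning rate are again illustrative assumptions:
```python
import torch
import torch.nn as nn
import gymnasium as gym

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())  # shared parameters
        self.actor = nn.Linear(64, n_actions)   # policy head (logits)
        self.critic = nn.Linear(64, 1)          # value head V(s)

    def forward(self, x):
        h = self.torso(x)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h).squeeze(-1)

env = gym.make("CartPole-v1")
net = ActorCritic(env.observation_space.shape[0], env.action_space.n)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        dist, value = net(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        # TD advantage: reward + gamma * V(next_state) - V(current_state)
        with torch.no_grad():
            _, next_value = net(torch.as_tensor(next_state, dtype=torch.float32))
            target = reward + gamma * next_value * (1.0 - float(done))
        advantage = target - value

        actor_loss = -dist.log_prob(action) * advantage.detach()   # REINFORCE, scaled by advantage
        critic_loss = advantage.pow(2)                              # MSE of V(s) against the target
        loss = actor_loss + critic_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
```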
Actor-critic works well in practice, but a single (possibly misestimated) advantage can still cause a large change in the network.
PPO - Proximal Policy Optimization
PPO helps prevent such a large change by doing lots of small changes instead.
Why large changes are detrimental
Large changes in policy would be okay if we had a perfect advantage value, but we only ever have an estimate of it (via the value network)
This means that if we end up miscalculating the advantage of an action, we can end up REINFORCING a bad action
And because future actions are sampled via updated policy, it could take a long time to undo this change
How PPO does lots of small changes v/s actor-critic
For each episode, run multiple epochs of updating policy (v/s often 1 update for actor-critic)
Use a smaller learning rate v/s actor critic
Limit large policy updates in one go
PPO requires keeping two networks: old policy and new one
It does so by calculating probability ratio = probability of action for new policy / probability of action for old policy
It clips the probability ratio to the range (1 - ε, 1 + ε)
The advantage is scaled by this probability ratio, and the objective takes the minimum of the clipped and unclipped terms (see the sketch below)
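A sketch of that clipped objective; the min over the unclipped and clipped terms is what actually limits the update (the epsilon default of 0.2 is just a commonly used value, an assumption here):
```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO surrogate loss for a batch of (state, action) pairs."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Take the more pessimistic of the two scaled advantages, then average over the batch
    return -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
```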
What happens in PPO loop
Actions that initially have a high advantage get reinforced quickly and increase in probability wrt the old policy
But clipping prevents this probability from increasing too much v/s the old policy (since one batch of episodes only contains a finite amount of information)
This applies to both positive and negative advantages - actions with negative advantage also can't be suppressed too aggressively
Imagine the counter-factual without clipping
If we did multiple epochs of policy updates on the same batch with a large learning rate, we could end up reinforcing actions that got a positive advantage purely by chance
PPO prevents that by ensuring new policy (as measured by probability of action given state) doesn’t drift too far from old policy
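To make the “lots of small changes” concrete, here is a sketch of the outer update loop; it assumes the ActorCritic net, optimizer, and ppo_policy_loss defined above, and that one batch of rollouts has already been collected with the old, frozen policy:
```python
# Assumed batch tensors from the rollout (shapes are illustrative):
#   states [N, obs_dim], actions [N], old_log_probs [N], advantages [N], returns [N]
for epoch in range(4):                                   # several small updates on the same batch
    dist, values = net(states)                           # probabilities under the *current* policy
    new_log_probs = dist.log_prob(actions)
    policy_loss = ppo_policy_loss(new_log_probs, old_log_probs.detach(), advantages)
    value_loss = (returns - values).pow(2).mean()        # critic is still trained with MSE
    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```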