Feb 27

Policy Gradient Methods

MT
Mindli Team

AI-Generated Content

In reinforcement learning, most algorithms begin by learning a value function that estimates the long-term reward from each state or state-action pair. Policy gradient methods take a radically different, more direct approach: they bypass the value function and instead use gradient ascent to directly optimize a parameterized policy—the function that maps states to actions. This approach excels in high-dimensional or continuous action spaces, such as robotic control and complex video games, where enumerating all possible actions for a value function is impossible. By tweaking the policy parameters to increase the probability of high-reward trajectories, these methods provide a powerful and flexible framework for solving sequential decision-making problems.

The Policy Gradient Theorem and REINFORCE

The core of policy gradient methods is the Policy Gradient Theorem, which provides an elegant, computable form for the gradient of the expected return with respect to the policy parameters $\theta$. The objective is to maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, where $\tau$ is a trajectory (a sequence of states and actions) and $R(\tau)$ is its total reward.

Directly differentiating this expectation is challenging because the reward depends on the distribution of trajectories, which itself changes with $\theta$. The theorem uses the likelihood ratio trick (or REINFORCE trick) to move the derivative inside the expectation. The key result is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$$

This states that the gradient of the performance is the expected value of the gradient of the log-probability of the taken actions, weighted by the total reward of the entire trajectory. To increase $J(\theta)$, we adjust the parameters to increase the log-probability of actions that led to high total reward.

The REINFORCE algorithm, also known as the Monte Carlo policy gradient, is the simplest instantiation of this theorem. It works by:

  1. Using the current policy $\pi_\theta$ to sample a complete trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$.
  2. Computing the total reward $R(\tau) = \sum_t r_t$.
  3. Estimating the gradient with a single sample: $\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$.
  4. Updating the parameters via gradient ascent: $\theta \leftarrow \theta + \alpha \hat{g}$.

While foundational, REINFORCE suffers from high variance: every action in a trajectory is reinforced by the same total reward, regardless of its individual contribution, so the single-sample gradient estimate is a noisy signal for credit assignment. This leads to slow and unstable learning.
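The four numbered steps above can be sketched end to end with a tabular softmax policy on a toy two-armed bandit. The bandit, its rewards, and the learning rate are illustrative assumptions made for this sketch, not part of any standard benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy setting: a single-state, two-armed bandit in which
# action 0 always pays reward 1.0 and action 1 always pays 0.0.
theta = np.zeros(2)   # policy parameters: logits over the two actions
alpha = 0.5           # learning rate

def policy(theta):
    """Softmax policy pi_theta(a)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """grad_theta log pi_theta(a) for a softmax policy: one_hot(a) - pi."""
    pi = policy(theta)
    one_hot = np.zeros_like(pi)
    one_hot[a] = 1.0
    return one_hot - pi

for _ in range(200):
    pi = policy(theta)
    a = rng.choice(2, p=pi)             # 1. sample a (one-step) trajectory
    R = 1.0 if a == 0 else 0.0          # 2. total reward R(tau)
    g_hat = grad_log_pi(theta, a) * R   # 3. single-sample gradient estimate
    theta += alpha * g_hat              # 4. gradient ascent step

print(policy(theta))  # probability mass concentrates on the rewarding action
```

Even in this one-step setting the single-sample estimate is noisy; averaging the estimate over a batch of trajectories, and subtracting a baseline, are the standard first remedies.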

Actor-Critic Methods and Advantage Estimation

Actor-critic methods address the high-variance problem of REINFORCE by incorporating a learned value function, called the critic. The critic reduces variance by providing a more informed baseline for evaluating actions. In this architecture, the actor (the policy $\pi_\theta$) decides which actions to take, and the critic (a value function $V(s)$ or $Q(s, a)$) evaluates how good those actions are.

The core improvement is to replace the total trajectory reward $R(\tau)$ in the policy gradient with a per-timestep advantage function $A(s_t, a_t)$. The advantage measures how much better a specific action is compared to the average action in that state, formally $A(s, a) = Q(s, a) - V(s)$. The policy gradient then becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$$

Since the advantage has lower variance than the raw return, this leads to much more stable updates. A common and effective way to estimate the advantage is Generalized Advantage Estimation (GAE), which exposes a tunable bias-variance trade-off through an exponentially weighted combination of k-step advantage estimators. GAE provides a smooth estimate, controlled by a parameter $\lambda \in [0, 1]$, that interpolates between the high-bias, low-variance one-step estimate from the critic and the high-variance, low-bias estimate from Monte Carlo returns.
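As a concrete sketch, GAE can be computed backward over a finished trajectory from one-step TD errors. This is a minimal NumPy version; the function name and array conventions are our own:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: r_0 .. r_{T-1}; values: V(s_0) .. V(s_T), length T + 1, with the
    last entry bootstrapping the value of the final state. lam=0 recovers the
    one-step TD advantage (low variance, higher bias); lam=1 recovers the
    Monte Carlo return minus the baseline (high variance, low bias).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae   # exponentially weighted sum of deltas
        adv[t] = gae
    return adv

# Sanity check: with gamma=1, lam=1, and a zero critic, GAE is just the
# return-to-go at each step.
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
# -> [3. 2. 1.]
```

The backward recursion is the standard way to evaluate the weighted sum in linear time rather than summing k-step estimators explicitly.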

Proximal Policy Optimization and Trust Region Methods

A major challenge in policy optimization is ensuring that updates are stable; a single large, bad policy update can collapse performance, often forcing a restart from an earlier checkpoint. Trust region methods, like Trust Region Policy Optimization (TRPO), tackle this by constraining how much the new policy can diverge from the old policy during an update. TRPO uses a relatively complex second-order optimization method to maximize performance subject to a constraint on the Kullback-Leibler (KL) divergence between the old and new policies.

Proximal Policy Optimization (PPO) achieves much of the same stability benefit as TRPO but with a much simpler, first-order implementation that is easier to apply. Instead of a hard constraint, PPO uses a clipped objective function that removes the incentive to move the new policy too far from the old one. The core PPO objective is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$$

Here, $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policies. The clip term modifies the objective to ignore updates that would make $r_t(\theta)$ fall outside the interval $[1 - \epsilon, 1 + \epsilon]$, thereby limiting the policy change. This elegant mechanism prevents destructively large policy updates and has made PPO one of the most popular and robust policy gradient algorithms in practice.
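A minimal NumPy sketch of the clipped surrogate for a single sample follows. The function name is illustrative; real implementations evaluate this over minibatches inside an autodiff framework and ascend its mean:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized).

    logp_old must be recorded from the policy *before* the update and treated
    as a fixed constant; no gradient should flow through it.
    """
    ratio = np.exp(logp_new - logp_old)            # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # The min makes the objective pessimistic: once the ratio leaves the
    # [1 - eps, 1 + eps] interval in the profitable direction, further
    # movement no longer improves the objective.
    return np.minimum(ratio * advantage, clipped * advantage)

# A ratio of 1.5 with a positive advantage is clipped at 1 + eps = 1.2.
print(ppo_clip_objective(np.log(1.5), 0.0, 1.0))
```

Computing the ratio as `exp(logp_new - logp_old)` rather than dividing raw probabilities is the numerically stable convention.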

Application to Continuous Action Spaces

Policy gradient methods are particularly well suited to problems with continuous action spaces, where actions are real-valued vectors (e.g., torque applied to a robot joint). Unlike value-based methods, which require finding the maximum over a Q-function for continuous actions (a complex optimization problem at every step), policy gradients naturally output parameters of a probability distribution over the continuous action space.

For example, the policy network's output for a robotic arm could be the mean $\mu(s)$ and standard deviation $\sigma(s)$ of a Gaussian distribution. The agent then samples actions $a \sim \mathcal{N}(\mu(s), \sigma(s)^2)$. During training, the log-probability $\log \pi_\theta(a \mid s)$ is simply the log of the Gaussian probability density, which is easily differentiable. This allows for smooth, gradient-based optimization of complex behaviors like locomotion or dexterous manipulation. In game playing, such as training agents for physics-based sports simulations, this enables learning nuanced, analog control strategies that are beyond the scope of discrete-action methods.
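A minimal sketch of sampling and scoring a Gaussian action vector. The mean, standard deviation, and helper name are illustrative stand-ins for a policy network's outputs:

```python
import numpy as np

def gaussian_log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2), the differentiable quantity in the policy gradient."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (a - mu)**2 / (2.0 * sigma**2)

rng = np.random.default_rng(0)

# Pretend network outputs for one state of a 2-joint arm (illustrative values).
mu = np.array([0.1, -0.3])      # mean torque per joint
sigma = np.array([0.5, 0.5])    # per-joint exploration noise

a = rng.normal(mu, sigma)                          # sample a continuous action
log_prob = gaussian_log_prob(a, mu, sigma).sum()   # independent dims: log-probs add
```

In a full implementation the same expression is usually supplied by the framework's distribution utilities, and its gradient with respect to $\mu$ and $\sigma$ drives the policy update.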

Common Pitfalls

  1. Ignoring Variance in Basic REINFORCE: Implementing a vanilla REINFORCE algorithm without a baseline or advantage function will almost certainly fail on non-trivial problems due to impractically high variance. The solution is to always use at least a simple state-dependent baseline (like a learned value function) to center the rewards.
  2. Poor Advantage Estimation: Using a high-variance Monte Carlo return to estimate the advantage (e.g., $A(s_t, a_t) \approx G_t - V(s_t)$, where $G_t$ is the full return) can still lead to unstable learning. Employing GAE or carefully tuned n-step returns is crucial for stabilizing actor-critic training.
  3. Incorrect Probability Ratio Handling (PPO): A subtle bug occurs when you don't properly stop the gradient through the probability ratio for the old policy parameters $\theta_{\text{old}}$. The old probabilities must be treated as fixed constants, recorded from the policy before the update. Backpropagating through them breaks the theoretical assumptions of the algorithm.
  4. Forgetting Exploration in Deterministic Policies: When using a policy that outputs a deterministic action (common in continuous control), the agent can fail to explore. The standard remedy is to add noise to the actions during training (e.g., via the policy's output distribution, or external noise as in the Deep Deterministic Policy Gradient algorithm).

Summary

  • Policy gradient methods directly optimize a parameterized policy using gradient ascent on the expected return, making them ideal for continuous and high-dimensional action spaces in domains like robotics and complex game AI.
  • The Policy Gradient Theorem provides the foundational gradient estimate, which the REINFORCE algorithm implements as a high-variance Monte Carlo method.
  • Actor-critic architectures dramatically improve stability by using a critic (value function) to estimate a low-variance advantage function, which tells the actor how much better an action was than expected.
  • Modern algorithms like Proximal Policy Optimization (PPO) ensure training stability by limiting how much the policy can change per update, either via a clipped objective or a trust region constraint, preventing catastrophic performance collapses.
  • Success requires careful attention to variance reduction (using baselines/advantages) and update stability (using clipping or trust regions), as well as correct handling of probability distributions for continuous actions.
