Policy Gradient and PPO Algorithms
Training an agent to perform complex tasks, like controlling a robot arm or playing a game, often requires decisions that fall on a continuous spectrum—how much force to apply, what angle to turn. Value-based methods, which learn the expected reward of actions, struggle in these continuous action spaces because they cannot feasibly evaluate every possible action. This is where policy gradient methods shine: they optimize the decision-making policy directly, enabling elegant solutions to continuous control problems. By learning a parameterized probability distribution over actions, these methods can sample finely-tuned actions like applying 2.7 Newtons of force. From the foundational REINFORCE algorithm to the state-of-the-art Proximal Policy Optimization (PPO), this evolution represents the quest for stable, sample-efficient, and robust learning, making PPO a default choice for modern reinforcement learning applications from robotics to language model alignment.
The Policy Gradient Foundation: REINFORCE
At its core, a policy gradient method aims to maximize the expected cumulative reward by directly adjusting the parameters $\theta$ of a policy $\pi_\theta(a \mid s)$, which is a probability distribution over actions given a state. Instead of estimating the value of actions, it ascends the gradient of performance. The REINFORCE algorithm, also known as the Monte Carlo policy gradient, provides the simplest form of this idea.
The key is the policy gradient theorem, which provides an unbiased estimate of the gradient. For a single trajectory (a sequence of states, actions, and rewards), the gradient of the expected return is estimated as:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Here, $G_t$ is the return from time step $t$—the sum of all future rewards from that point onward. The term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the score function, which points the update in the direction that makes a taken action more probable. Multiplying by $G_t$ means actions that led to high total reward are reinforced more strongly.
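As a concrete illustration, this estimator can be sketched in NumPy for a linear-softmax policy over discrete actions. The function names and the linear parameterization are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def returns_to_go(rewards, gamma=0.99):
    """Compute G_t = sum over k >= t of gamma^(k-t) * r_k for each step t."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """REINFORCE gradient for a linear-softmax policy pi(a|s) ~ exp(s @ theta)."""
    G = returns_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, G):
        logits = s @ theta                 # shape (n_actions,)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # score function: grad of log pi(a|s) wrt theta for this policy class
        score = -np.outer(s, probs)
        score[:, a] += s
        grad += g * score                  # weight the score by the return G_t
    return grad
```

In practice the same quantity is obtained with automatic differentiation by minimizing $-\sum_t G_t \log \pi_\theta(a_t \mid s_t)$; the explicit form above just makes the estimator's structure visible.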
However, REINFORCE suffers from high variance. Because it uses the full Monte Carlo return $G_t$, which can vary wildly between trajectories, the gradient updates are noisy. This leads to unstable training, slow convergence, and the need for many samples. Furthermore, it is an on-policy algorithm, meaning it can only learn from experience collected using the current policy, making it sample inefficient. Despite its simplicity and clear theoretical foundation, these practical limitations necessitated major advances.
The Actor-Critic Framework: Reducing Variance
To combat the high variance of REINFORCE, the actor-critic architecture introduces a second neural network, the critic. This hybrid approach combines the strengths of both policy-based and value-based methods. The actor is the policy $\pi_\theta(a \mid s)$, responsible for selecting actions. The critic is a value function $V_\phi(s)$, which estimates the expected return from a given state.
Instead of weighting policy updates by the noisy full return $G_t$, the actor-critic uses the advantage function $A(s_t, a_t)$. The advantage measures how much better a specific action is compared to the average action in that state: $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$. A positive advantage means the action was better than average. The policy gradient then becomes:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} A(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

By using the advantage, updates are centered (reducing variance), and the agent learns to increase the probability of only those actions that were genuinely better than the policy's baseline performance. The critic learns to provide better baseline estimates, and the actor learns to take better actions, creating a stable feedback loop for simultaneous value and policy learning.
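A minimal, tabular sketch of one online actor-critic update, using the TD error as the advantage estimate. The tabular parameterization, learning rates, and function names are illustrative assumptions:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.01, lr_critic=0.1):
    """One online actor-critic update with the TD error as the advantage.

    theta: softmax-policy logits, shape (n_states, n_actions)
    w:     state-value table, shape (n_states,)
    """
    # Critic: TD error delta = r + gamma * V(s') - V(s) estimates the advantage
    target = r + (0.0 if done else gamma * w[s_next])
    delta = target - w[s]
    w[s] += lr_critic * delta

    # Actor: push up log pi(a|s), weighted by the (estimated) advantage
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log = -probs
    grad_log[a] += 1.0        # grad of log pi(a|s) wrt the logits of state s
    theta[s] += lr_actor * delta * grad_log
    return delta
```

When the TD error is positive, the taken action's logit rises and the others fall; the critic simultaneously moves its value estimate toward the bootstrapped target, illustrating the feedback loop described above.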
Advantage Estimation with Generalized Advantage Estimation (GAE)
A critical question remains: how do we accurately estimate the advantage $A(s_t, a_t)$? The simplest method is the temporal-difference (TD) error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. This one-step TD error is a low-variance but biased estimate of the advantage, since it leans on the critic's possibly inaccurate value estimates.
Generalized Advantage Estimation (GAE) provides an elegant solution that balances bias and variance by combining multiple steps of TD errors. It introduces a parameter $\lambda$ (between 0 and 1). The GAE estimate is a discounted sum of TD errors:

$$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$$

When $\lambda = 0$, this reduces to the simple one-step TD error (high bias, low variance). When $\lambda = 1$, it becomes a Monte Carlo estimate using the full return (low bias, high variance). Tuning $\lambda$ allows practitioners to find a sweet spot for their specific problem. GAE has become the standard technique for advantage estimation in policy gradient algorithms because its advantage estimates strike a bias-variance balance that leads to stable and efficient learning.
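Because the sum satisfies the recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$, GAE can be computed in a single backward pass over a trajectory segment. A sketch (argument names are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via a backward pass over TD errors.

    rewards:    r_0 .. r_{T-1} for one trajectory segment
    values:     V(s_0) .. V(s_{T-1}) from the critic
    last_value: V(s_T), the bootstrap value for the state after the segment
    """
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting `lam=0.0` returns the raw TD errors, and `lam=1.0` with `gamma=1.0` returns the Monte Carlo returns minus the value estimates, matching the two limiting cases described above.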
Proximal Policy Optimization (PPO): Enabling Stable Training
Even with actor-critic methods and GAE, policy gradient updates can be destructive. A single, overly large update can collapse the policy's performance, requiring many more samples to recover—a phenomenon known as "falling off the cliff." Proximal Policy Optimization (PPO) directly addresses this instability with a simple yet highly effective mechanism.
PPO's core innovation is its clipped surrogate objective. Instead of directly maximizing the likelihood of advantageous actions, it constrains how much the new policy can deviate from the old one. The objective function compares the probability ratio of actions under the new and old policies: $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$. The naive objective would be $L^{\mathrm{CPI}}(\theta) = \hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t \right]$ (where CPI stands for Conservative Policy Iteration).
PPO modifies this by clipping the probability ratio:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon) \hat{A}_t \right) \right]$$

Here, $\epsilon$ is a small hyperparameter (e.g., 0.1 or 0.2). The $\mathrm{clip}$ function prevents $r_t(\theta)$ from moving outside the interval $[1 - \epsilon, 1 + \epsilon]$. The $\min$ operator ensures the update is based on the worse (more conservative) estimate between the clipped and unclipped objectives. This clipping mechanism acts like a trust region, preventing destructively large policy updates while still allowing for vigorous learning. PPO is typically implemented with multiple epochs of minibatch updates on a fixed set of collected trajectories, making excellent use of sample data and further contributing to its status as the default algorithm for modern RL applications.
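The clipped objective itself is only a few lines. A NumPy sketch, assuming per-sample log-probabilities and advantage estimates have already been computed (argument names are illustrative):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, averaged over a batch.

    new_logp / old_logp: log pi_theta(a|s) and log pi_theta_old(a|s) per sample
    """
    ratio = np.exp(new_logp - old_logp)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # take the more pessimistic of the two estimates, then average
    return np.mean(np.minimum(unclipped, clipped))
```

In a training loop this quantity is maximized (or its negative minimized) with a gradient-based optimizer over several minibatch epochs; once the ratio leaves the clip interval in the direction the advantage favors, the objective flattens and the gradient contribution of that sample vanishes.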
Common Pitfalls
- Ignoring Advantage Normalization: Failing to normalize advantage estimates across a batch is a common oversight. Since advantages can have varying scales, large absolute values can lead to unstable, explosive gradient updates. Always standardize the advantages (subtract the mean, divide by the standard deviation) for each minibatch to ensure stable, well-conditioned optimization.
- Misconfiguring the Clipping Range ($\epsilon$): The clipping parameter $\epsilon$ is not set-and-forget. A value too large (e.g., 0.5) defeats the purpose of clipping, allowing destructive updates. A value too small (e.g., 0.05) can strangle learning progress, causing the policy to update at a glacial pace. Start with common defaults (0.1-0.2) but be prepared to tune it based on the observed volatility of your training rewards.
- Poor Hyperparameter Tuning for GAE ($\lambda$ and $\gamma$): Treating $\lambda$ and the discount factor $\gamma$ as mere afterthoughts can cripple performance. $\gamma$ controls the horizon of value estimation—too high can introduce noise from the far future, too low makes the agent myopic. $\lambda$ controls the bias-variance trade-off in advantage estimation. These require systematic tuning or informed setting based on the problem's time horizon and noise characteristics.
- Inadequate Exploration: While PPO is robust, it can still get stuck in local optima if the policy's initial distribution or exploration mechanism is poor. This is especially critical in continuous spaces. Ensure your policy network outputs parameters for a distribution (like a Gaussian) with an initial standard deviation that encourages exploration, and consider if the problem requires additional exploration techniques like entropy bonuses.
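The normalization step from the first pitfall is worth showing explicitly; it is a per-minibatch one-liner. A sketch (the small epsilon guard is a common convention, not required by the math):

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Standardize advantages per minibatch: zero mean, unit standard deviation.

    The eps term guards against division by zero for near-constant batches.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```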
Summary
- Policy gradient methods, like REINFORCE, optimize an agent's policy directly by ascending the gradient of expected reward, making them uniquely suited for problems with continuous action spaces.
- The actor-critic architecture stabilizes learning by using a value network (the critic) to estimate the advantage function, which reduces the variance of policy updates by measuring how much better an action is than the average.
- Generalized Advantage Estimation (GAE) provides a robust method for calculating advantages, offering a tunable compromise between the high bias of one-step TD estimates and the high variance of Monte Carlo returns.
- Proximal Policy Optimization (PPO) ensures stable training by employing a clipped surrogate objective that prevents the new policy from deviating too drastically from the old policy, mitigating the risk of catastrophic performance collapses and enabling efficient sample reuse.
- Together, these components form a powerful and practical toolkit, establishing PPO as a versatile and reliable default algorithm for a wide range of modern reinforcement learning challenges.