Machine Learning: Reinforcement Learning
Reinforcement learning (RL) is a branch of machine learning focused on learning through interaction. Instead of learning from labeled examples, an RL agent learns by taking actions in an environment, observing the outcomes, and receiving feedback in the form of rewards. Over time, it improves its behavior to maximize long-term reward. This framing maps naturally to sequential decision-making problems such as robotics, game playing, recommendation strategies, inventory control, and resource allocation.
What makes reinforcement learning distinct is that actions affect not only immediate reward but also the future situations the agent will face. The agent must balance short-term gains with long-term consequences, often under uncertainty and with incomplete information.
The Reinforcement Learning Setup
At the core of reinforcement learning is a loop:
- The agent observes the current state of the environment.
- The agent chooses an action.
- The environment transitions to a new state and emits a reward.
- The agent updates its strategy based on the experience.
The objective is usually expressed as maximizing expected cumulative reward over time. A common formalization uses a discounted return:
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where $R_t$ is the reward received at time $t$ and $\gamma \in [0, 1]$ is a discount factor that trades off immediate versus future rewards. When $\gamma$ is close to 1, the agent values long-term outcomes more heavily.
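As a quick illustration, the sketch below computes a discounted return from a finite reward sequence; the reward values and discount factor are placeholders chosen for the example.

```python
# Minimal sketch: computing the discounted return G_t from a reward sequence.
# The reward trace and gamma below are illustrative placeholders.

def discounted_return(rewards, gamma=0.99):
    """Accumulate backwards using G_t = r_{t+1} + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # hypothetical reward trace
print(discounted_return(rewards, gamma=0.9))
```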
Markov Decision Processes (MDPs)
Most foundational RL methods are built on the Markov Decision Process (MDP) framework. An MDP models a controlled stochastic process with the following components:
- States ($\mathcal{S}$): representations of the current situation.
- Actions ($\mathcal{A}$): choices available to the agent.
- Transition dynamics ($P(s' \mid s, a)$): probabilities of moving to state $s'$ after taking action $a$ in state $s$.
- Reward function ($R(s, a)$ or $R(s, a, s')$): immediate feedback after a transition.
- Discount factor ($\gamma$): preference for near-term versus long-term reward.
The “Markov” property means the next state depends only on the current state and action, not the full past history. In practice, many real environments are only approximately Markov. RL still works in many such settings, but state representation becomes critical. For example, a robot may need a short history of sensor readings to infer velocity, or a trading system may rely on recent price windows to summarize market conditions.
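To make the components listed above concrete, here is one way a tiny MDP might be written down explicitly; the two-state layout, action names, and numbers are invented purely for illustration.

```python
# A toy two-state MDP spelled out as plain dictionaries.
# All state names, action names, probabilities, and rewards are illustrative.

states = ["low_battery", "high_battery"]
actions = ["search", "recharge"]

# P[(s, a)] lists (next_state, probability) pairs for each state-action pair.
P = {
    ("high_battery", "search"):   [("high_battery", 0.7), ("low_battery", 0.3)],
    ("high_battery", "recharge"): [("high_battery", 1.0)],
    ("low_battery", "search"):    [("low_battery", 0.6), ("high_battery", 0.4)],
    ("low_battery", "recharge"):  [("high_battery", 1.0)],
}

# R[(s, a)] is the expected immediate reward for taking action a in state s.
R = {
    ("high_battery", "search"):   2.0,
    ("high_battery", "recharge"): 0.0,
    ("low_battery", "search"):    1.0,
    ("low_battery", "recharge"): -0.5,
}

gamma = 0.95  # discount factor
```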
Value Functions and Policies
A policy $\pi$ defines how the agent acts, either deterministically (choose a single action) or stochastically (choose actions with probabilities). RL commonly revolves around two types of value functions:
- State-value function $V^{\pi}(s)$: expected return starting from state $s$ and following policy $\pi$.
- Action-value function $Q^{\pi}(s, a)$: expected return starting from state $s$, taking action $a$, then following $\pi$.
The relationship between these functions and the MDP’s structure is captured by Bellman equations, which express a value in terms of immediate reward plus the value of successor states. Many algorithms can be understood as ways to estimate or optimize these value functions efficiently.
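One way to see a Bellman equation at work is iterative policy evaluation, which repeatedly applies the backup $V(s) \leftarrow \sum_a \pi(a \mid s) \big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \big]$ until the values stop changing. The sketch below assumes the toy MDP dictionaries from the previous example and evaluates a uniform-random policy; the stopping threshold is an arbitrary choice.

```python
# Sketch: iterative policy evaluation via the Bellman expectation backup.
# Assumes the `states`, `actions`, `P`, `R`, `gamma` structures defined above
# and evaluates a uniform-random policy.

def evaluate_uniform_policy(states, actions, P, R, gamma, tol=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = 0.0
            for a in actions:
                expected_next = sum(p * V[s2] for s2, p in P[(s, a)])
                backup += (1.0 / len(actions)) * (R[(s, a)] + gamma * expected_next)
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

print(evaluate_uniform_policy(states, actions, P, R, gamma))
```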
Q-Learning: Learning Action Values from Experience
Q-learning is a classic and widely taught RL algorithm. It learns an estimate of the optimal action-value function $Q^*(s, a)$ without requiring knowledge of the transition dynamics. The agent uses experience tuples $(s, a, r, s')$ and updates its estimate using a temporal-difference rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where $\alpha$ is the learning rate. The term in brackets is the TD error, measuring how surprising the observed transition is compared to the current estimate.
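The update itself fits in a few lines. The sketch below keeps $Q$ in a dictionary keyed by (state, action); the learning rate, discount factor, and zero initialization are illustrative choices.

```python
# One tabular Q-learning update for an experience tuple (s, a, r, s_next).
# Q is a dict keyed by (state, action); alpha and gamma are illustrative values.
from collections import defaultdict

Q = defaultdict(float)      # unseen (state, action) pairs default to 0.0
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, action_space):
    best_next = max(Q[(s_next, a2)] for a2 in action_space)
    td_error = r + gamma * best_next - Q[(s, a)]   # how surprising the transition was
    Q[(s, a)] += alpha * td_error
```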
Two practical insights define when Q-learning works well:
- Discrete, manageable state-action spaces: A tabular representation of $Q(s, a)$ is feasible when the number of states and actions is small.
- Sufficient exploration: The agent must try different actions to learn accurate values.
In real applications with high-dimensional inputs (images, continuous sensor streams), tabular Q-learning becomes impractical. This is where function approximation, including deep learning, enters.
Policy Gradient Methods: Optimizing the Policy Directly
While Q-learning learns values and derives a policy from them, policy gradient methods directly optimize policy parameters. If a policy $\pi_\theta$ is parameterized by $\theta$ (for example, the weights of a neural network), the goal is to maximize expected return:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ G_0 \right] = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \right]$$
Policy gradient algorithms estimate $\nabla_\theta J(\theta)$ from sampled trajectories and adjust $\theta$ via gradient ascent. A key advantage is that policy gradients naturally handle continuous action spaces, such as controlling steering angles or motor torques, where maximizing $Q(s, a)$ over actions is not straightforward.
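A bare-bones REINFORCE-style estimator makes the idea concrete. The sketch below assumes a softmax policy over a small discrete action set, with a table of logits standing in for network weights; the setup and return weighting are illustrative, not tied to any particular system.

```python
# Sketch: REINFORCE-style gradient estimate for a softmax policy over a
# small discrete action set. theta is a (n_states, n_actions) table of logits.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """trajectory: list of (state_index, action_index, reward) tuples."""
    grad = np.zeros_like(theta)
    # Return-to-go from each step, computed backwards.
    g, returns = 0.0, []
    for _, _, r in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    for (s, a, _), G in zip(trajectory, returns):
        probs = softmax(theta[s])
        dlog = -probs          # grad of log pi(a|s) for softmax: one-hot(a) - probs
        dlog[a] += 1.0
        grad[s] += G * dlog    # weight the log-prob gradient by the return
    return grad                # ascend: theta += learning_rate * grad
```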
In practice, naive policy gradient methods can suffer from high variance in gradient estimates. Common improvements include:
- Baselines: subtracting a value estimate to reduce variance without changing the expected gradient (see the sketch after this list).
- Actor-critic methods: combining a policy (actor) with a value function estimator (critic). The critic provides feedback that stabilizes learning and improves sample efficiency.
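As a small sketch of the baseline idea, the helper below replaces raw returns with return-minus-baseline values before they weight the log-probability gradient; the `values` dictionary is assumed to come from a critic or other value estimator trained alongside the policy.

```python
# Sketch: advantage-style weights (return minus a value baseline) for use in
# place of the raw returns G in the gradient accumulation above.

def advantage_weights(trajectory, values, gamma=0.99):
    """trajectory: list of (state, action, reward); values: dict state -> V(s)."""
    g, advantages = 0.0, []
    for s, _, r in reversed(trajectory):
        g = r + gamma * g
        advantages.append(g - values.get(s, 0.0))  # subtract the baseline
    advantages.reverse()
    return advantages
```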
Policy optimization is central in many modern RL systems because it scales well and aligns directly with the control objective.
Deep Reinforcement Learning: Function Approximation at Scale
Deep reinforcement learning (deep RL) combines RL objectives with deep neural networks as function approximators. Instead of a Q-table or linear features, a network maps states (and sometimes actions) to values or action probabilities.
Deep RL made a major practical difference because it can learn directly from raw, high-dimensional observations. For example:
- From pixels to actions in simulated game environments
- From complex sensor arrays to control decisions in robotics
- From large state representations to allocation choices in operations settings
Deep RL also introduces stability challenges. When you combine bootstrapped targets (as in TD learning) with non-linear function approximation and correlated data, training can become unstable. Many successful systems address this with techniques such as experience replay (to decorrelate samples) and target networks (to stabilize bootstrapped targets). The general lesson is that deep RL is powerful, but it is not “plug and play”; practical performance depends heavily on training design, reward formulation, and evaluation discipline.
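The sketch below shows the shape of these two stabilizers with a tiny linear Q-function so the example stays self-contained; the feature size, buffer capacity, and hyperparameters are illustrative, and a real deep RL system would use a neural network in place of the weight matrix.

```python
# Sketch: experience replay plus a periodically synced target network,
# using a small linear Q-function. All sizes and hyperparameters are illustrative.
import random
import numpy as np
from collections import deque

N_FEATURES, N_ACTIONS = 8, 4
weights = np.zeros((N_ACTIONS, N_FEATURES))   # online Q-function parameters
target_weights = weights.copy()               # frozen copy used for bootstrapped targets
replay = deque(maxlen=50_000)                 # stores (s, a, r, s_next, done) tuples

def q_values(w, state):
    return w @ state                          # one estimated value per action

def train_step(batch_size=32, gamma=0.99, lr=0.01):
    if len(replay) < batch_size:
        return
    for s, a, r, s_next, done in random.sample(replay, batch_size):  # decorrelated minibatch
        bootstrap = 0.0 if done else gamma * q_values(target_weights, s_next).max()
        td_error = (r + bootstrap) - q_values(weights, s)[a]          # target uses the frozen copy
        weights[a] += lr * td_error * s                               # semi-gradient update

def sync_target():
    global target_weights
    target_weights = weights.copy()           # refresh every N steps to keep targets stable
```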
Exploration vs. Exploitation
One of RL’s defining difficulties is the exploration-exploitation tradeoff:
- Exploitation: choose actions believed to yield high reward based on current knowledge.
- Exploration: try actions that may be worse in the short term but could reveal better strategies.
Without exploration, an agent can get stuck in suboptimal behavior. With too much exploration, it wastes time and fails to converge on a strong policy.
Common exploration strategies include:
- $\epsilon$-greedy: with probability $\epsilon$ choose a random action; otherwise choose the best known action (see the sketch after this list).
- Stochastic policies: policies that inherently sample actions, useful in policy gradient methods.
- Optimism and uncertainty-aware methods: encourage actions with high uncertainty because they might be better than currently estimated.
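A minimal $\epsilon$-greedy selector looks like the following; it assumes a dictionary-style $Q$ like the tabular example earlier, and the decay schedule numbers are illustrative.

```python
# Minimal epsilon-greedy action selection with a simple linear decay schedule.
# Assumes a dict-like Q keyed by (state, action); all constants are illustrative.
import random

def epsilon_greedy(Q, state, action_space, epsilon):
    if random.random() < epsilon:
        return random.choice(action_space)                      # explore
    return max(action_space, key=lambda a: Q[(state, a)])       # exploit

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    frac = min(step / decay_steps, 1.0)                         # anneal exploration over time
    return start + frac * (end - start)
```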
Exploration is not a purely technical detail. In applied RL, it is tied to safety, cost, and ethics. For instance, exploring random actions may be unacceptable for a physical robot or a healthcare decision system. Practical deployments often constrain exploration, use simulators, or rely on careful offline evaluation before any real-world interaction.
Practical Considerations and Common Pitfalls
Reinforcement learning succeeds when the problem is framed correctly and the training setup reflects reality. Several issues regularly determine outcomes:
- Reward design: Poorly specified rewards can lead to unintended behavior. If the reward captures a proxy rather than the true goal, the agent may optimize the proxy in surprising ways.
- Credit assignment: In long-horizon tasks, rewards may arrive far from the actions that caused them. Discounting, value estimation, and well-shaped intermediate rewards can help, but they must be used carefully.
- Sample efficiency: Many RL methods require substantial interaction data. Simulators, parallel training, and efficient algorithms can reduce cost.
- Generalization and robustness: Policies can overfit to a training environment. Testing across varied conditions matters, especially for real-world control.
Where Reinforcement Learning Fits
Reinforcement learning is most appropriate when decisions are sequential, outcomes depend on earlier choices, and explicit supervision is unavailable or too expensive. MDPs provide the conceptual backbone, Q-learning offers an accessible entry point for value-based learning, policy gradient methods open the door to continuous control, and deep RL scales these ideas to complex inputs. Across all of these, the exploration-exploitation dilemma remains central, shaping both algorithm design and practical deployment.
Used thoughtfully, reinforcement learning offers a disciplined way to build systems that improve through experience, turning interaction into a learning signal and long-term consequences into a first-class objective.