Feb 27

Reinforcement Learning Fundamentals

Mindli Team

AI-Generated Content


Reinforcement Learning (RL) is the branch of machine learning concerned with how agents learn to make optimal decisions by interacting with an environment. Unlike supervised learning, which learns from a static dataset, RL agents learn from trial and error, receiving feedback in the form of rewards. This framework powers advancements from game-playing AI to robotic control and recommendation systems, making it essential for creating adaptive, intelligent systems.

The Markov Decision Process Framework

Every RL problem is formally framed as a Markov Decision Process (MDP), which provides the mathematical foundation for modeling sequential decision-making. An MDP is defined by five key components: a set of states S, a set of actions A, a transition function P(s'|s,a), a reward function R(s,a), and a discount factor γ (where 0 ≤ γ ≤ 1).

The "Markov" property is crucial: it states that the future state and reward depend only on the current state and action, not on the entire history. In other words, the present fully encapsulates all relevant information for the future. The discount factor γ determines the present value of future rewards; a γ close to 1 makes the agent far-sighted, while a γ close to 0 makes it short-sighted, prioritizing immediate gain.
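To make the five components concrete, here is a minimal sketch of how an MDP could be written down in code. The two states, two actions, transition probabilities, and rewards below are invented purely for illustration, not drawn from any particular problem:

```python
# A tiny, made-up two-state MDP represented with plain dictionaries.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each possible next state s' to its probability P(s'|s,a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

gamma = 0.9  # discount factor: future reward is worth 90% per step
```

Note that each transition distribution must sum to 1, and γ must lie in [0, 1) for infinite-horizon returns to stay finite.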

Policies, Value Functions, and the Bellman Equations

The agent's behavior is defined by its policy, denoted π(a|s). A policy is a mapping from states to probabilities of selecting each possible action. The goal is not just to take good actions, but to find an optimal policy π* that maximizes cumulative reward over time.

To evaluate the quality of a policy, we use value functions. The state-value function V^π(s) estimates the expected cumulative reward starting from state s and following policy π thereafter. The action-value function Q^π(s,a) (or Q-function) estimates the expected cumulative reward starting from state s, taking action a, and then following π.

These value functions are not arbitrary; they must satisfy self-consistency conditions known as the Bellman equations. For a given policy π, the Bellman expectation equation for the state-value function is:

V^π(s) = Σ_a π(a|s) [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ]

This equation expresses a recursive relationship: the value of a state is the expected immediate reward plus the discounted value of the next state, averaged over all actions weighted by the policy. The corresponding Bellman equation for the action-value function is:

Q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) Σ_{a'} π(a'|s') Q^π(s',a')

The ultimate objective is to find the optimal value functions V* and Q*, which correspond to the optimal policy π*. These satisfy the Bellman optimality equations:

V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s') ]

Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q*(s',a')

These are non-linear equations because of the max operator, but their solution defines the best possible performance in the MDP.
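The Bellman optimality backup for V* translates almost line-for-line into code. The sketch below assumes the MDP is stored in plain dictionaries (a hypothetical `P[(s, a)]` mapping next states to probabilities and `R[(s, a)]` giving the expected reward, as in the earlier example), not any standard library API:

```python
def bellman_optimality_backup(s, V, P, R, actions, gamma):
    """One application of the Bellman optimality equation at state s:
    the max over actions of R(s,a) + gamma * E[V(s')]."""
    return max(
        R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
        for a in actions
    )
```

Applying this backup to every state, over and over, is exactly what value iteration (described below) does.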

Dynamic Programming: Policy and Value Iteration

Dynamic Programming (DP) refers to a collection of algorithms that can compute optimal policies given a perfect model of the environment (i.e., known P and R functions). They operate by turning the Bellman equations into iterative update rules.

Policy Iteration is a two-step process that alternates until convergence:

  1. Policy Evaluation: Given a policy π, iteratively apply the Bellman expectation equation until its value function V^π converges.
  2. Policy Improvement: Using the computed value function, create a new, greedier policy that selects the action maximizing the expected reward-plus-next-value in each state: π'(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s') ].

This cycle of evaluation and improvement is guaranteed to converge to the optimal policy π*.
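The evaluation–improvement cycle can be sketched as follows, again assuming the illustrative dictionary representation (`P[(s, a)]` → {next_state: prob}, `R[(s, a)]` → reward) rather than any particular library:

```python
def policy_evaluation(policy, states, P, R, gamma, tol=1e-8):
    """Iterate the Bellman expectation equation for a deterministic
    policy (a dict mapping state -> action) until V stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_improvement(V, states, actions, P, R, gamma):
    """Return the policy that is greedy with respect to V."""
    return {
        s: max(actions, key=lambda a: R[(s, a)]
               + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }

def policy_iteration(states, actions, P, R, gamma):
    """Alternate evaluation and improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}
    while True:
        V = policy_evaluation(policy, states, P, R, gamma)
        new_policy = policy_improvement(V, states, actions, P, R, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```

The loop terminates because each improvement step either strictly improves the policy or leaves it unchanged, and a finite MDP has only finitely many deterministic policies.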

Value Iteration combines these steps more efficiently. It directly iterates the Bellman optimality equation as an update rule:

V(s) ← max_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) V(s') ]

It does not require a full policy evaluation between improvements. You simply repeatedly update the value function towards the optimal one, and the optimal policy can be extracted once the values have converged. Both methods are fundamental planning algorithms, though they require a known model and become computationally expensive for very large state spaces.
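A minimal sketch of value iteration, using the same assumed dictionary-based MDP representation as above:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Sweep the Bellman optimality update over all states until the
    values stabilize, then extract the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # Policy extraction happens only once, after the values have converged.
    policy = {
        s: max(actions, key=lambda a: R[(s, a)]
               + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
    return policy, V
```

Note that the policy is computed once at the end; as discussed in the pitfalls below, intermediate values of V do not correspond to any particular policy.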

The Exploration vs. Exploitation Tradeoff

A defining challenge in RL, especially when the model is unknown, is the exploration versus exploitation tradeoff. The agent must exploit its current knowledge to choose actions that yield high reward, but it must also explore unfamiliar actions to discover potentially better long-term strategies. A purely exploitative agent may get stuck in a suboptimal policy, while a purely exploratory one will never capitalize on what it has learned.

Simple strategies include ε-greedy, where the agent selects the greedy (best-known) action with probability 1 − ε and a random action with probability ε. More sophisticated methods, like Upper Confidence Bound (UCB) or Thompson sampling, quantify the uncertainty of value estimates to guide exploration more intelligently. Balancing this tradeoff is critical for efficient learning.
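ε-greedy selection is only a few lines. The sketch below assumes Q estimates are stored in a dictionary keyed by (state, action) pairs; the names are illustrative, not a fixed API:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon, explore a uniformly random action;
    otherwise exploit the action with the highest Q estimate."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Passing an explicit `rng` (e.g. `random.Random(seed)`) makes experiments reproducible.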

This challenge is directly linked to the reward hypothesis, which states that all goals and purposes of an agent can be thought of as the maximization of the expected cumulative reward. The reward signal is the sole feedback for the agent, making its design paramount. A poorly specified reward (e.g., with unintended loopholes or misaligned incentives) will lead the agent to learn optimal but undesirable behaviors, highlighting that the true objective is defined solely by the reward function you provide.

Common Pitfalls

  1. Ignoring the Discount Factor's Role: Treating γ as merely a technicality is a mistake. Its value fundamentally shapes the agent's optimal behavior. A medical treatment planning agent with γ near 1 is long-term patient-focused, while one with γ near 0 might prioritize immediate symptom relief, potentially ignoring harmful side effects later. Always choose γ deliberately based on the problem horizon.
  2. Confusing Model-Based and Model-Free Learning: DP algorithms like policy and value iteration are model-based; they require perfect knowledge of the transition and reward dynamics. A common error is trying to apply them directly to environments where this model is unknown. In such cases, you must use model-free methods like Q-learning or policy gradients, which learn from experience without a model.
  3. Misunderstanding the Policy in Value Iteration: During value iteration, the intermediate value function V is not the value function of any particular policy until convergence. Therefore, you cannot reliably extract a stable policy from an intermediate V. The greedy policy is only guaranteed to be optimal after the value function has (approximately) converged to V*.
  4. Poor Handling of Exploration: Setting a fixed, high exploration rate ε throughout training prevents the agent from ever converging to a stable, exploitative policy. Effective strategies typically decay exploration over time (e.g., annealing ε from 1.0 to 0.01), allowing for initial discovery followed by refined exploitation of the best-found strategy.
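A decaying exploration rate can be implemented as a simple schedule. This sketch linearly anneals ε from 1.0 to 0.01, matching the example above; the function name and the choice of a linear (rather than exponential) schedule are illustrative:

```python
def linear_epsilon(step, total_steps, start=1.0, end=0.01):
    """Linearly anneal epsilon from `start` to `end` over `total_steps`
    environment steps, then hold it at `end`."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)
```

The agent would call this each step (e.g. `epsilon_greedy(Q, s, actions, linear_epsilon(t, 10_000))`) so that early training explores broadly and late training mostly exploits.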

Summary

  • Reinforcement Learning formalizes learning from interaction through the framework of Markov Decision Processes (MDPs), defined by states, actions, transitions, rewards, and a discount factor.
  • An agent's strategy is its policy, evaluated by state-value and action-value functions. These functions must satisfy recursive Bellman equations, which provide the foundation for all RL algorithms.
  • Dynamic Programming methods like policy iteration and value iteration can compute optimal policies when a perfect model of the environment is available, by iteratively applying Bellman equations.
  • The core challenge in learning is the exploration versus exploitation tradeoff, where an agent must balance gathering new information and using known information to maximize cumulative reward, as directed by the reward hypothesis.
