Deep Reinforcement Learning

Reinforcement learning (RL) is the area of machine learning concerned with how agents take sequences of actions in an environment to maximize cumulative reward. Unlike supervised learning, the agent is not told which actions to take; instead it discovers which actions yield the most reward by exploring the environment.

The Reinforcement Learning Paradigm

The three main components are:

  • Agent — the learner and decision-maker
  • Environment — what the agent interacts with
  • Reward signal — the goal of the agent

At each time step t, the agent:

  1. Receives observation s_t from the environment
  2. Selects action a_t
  3. Transitions to new state s_{t+1}
  4. Receives scalar reward r_{t+1}

The agent’s goal is to learn a policy π — a mapping from states to actions, written π(s) when deterministic or π(a|s) when stochastic — that maximizes expected cumulative reward.
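This interaction loop is easy to sketch in code. The example below uses a toy two-state environment; CoinFlipEnv and its reset/step interface are illustrative stand-ins (loosely modeled on the common Gym-style API), not part of any particular library:

```python
class CoinFlipEnv:
    """Toy environment: the state is 0 or 1; reaching state 1 yields reward 1 and ends."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == "flip":
            self.state = 1 - self.state
        reward = 1 if self.state == 1 else 0
        done = self.state == 1
        return self.state, reward, done

def run_episode(env, policy):
    s = env.reset()               # 1. receive initial observation
    total_reward, done = 0, False
    while not done:
        a = policy(s)             # 2. select action a_t
        s, r, done = env.step(a)  # 3-4. transition to s_{t+1}, receive r_{t+1}
        total_reward += r
    return total_reward

total = run_episode(CoinFlipEnv(), lambda s: "flip")  # always-flip policy
```

The policy here is a plain function of the state; everything the agent learns must ultimately be expressible through this loop.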

Markov Decision Processes (MDP)

Most RL problems are formulated as MDPs, defined by the tuple (S, A, P, R, γ):

  • S — finite set of states
  • A — finite set of actions
  • P(s'|s, a) — state transition probability
  • R(s, a) — expected immediate reward (also written R(s, a, s') when the reward depends on the next state)
  • γ ∈ [0, 1] — discount factor

The Markov property states that the future depends only on the current state, not on the history:

P[s_{t+1} | s_1, ..., s_t] = P[s_{t+1} | s_t]
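As a concrete (hypothetical) instance, a small two-state MDP can be written out as explicit tables; the nested-dict layout P[s][a][s'] is just one convenient encoding, not a standard API:

```python
# A two-state MDP (S = {0, 1}, A = {0, 1}) as explicit tables.
# P[s][a] maps next-state -> probability; R[s][a] is the expected immediate reward.
S = [0, 1]
A = [0, 1]
P = {
    0: {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}},
    1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
gamma = 0.9  # discount factor

# Sanity check: every transition distribution must sum to 1.
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

Any algorithm that consumes these tables directly (rather than sampling transitions) is by definition model-based.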

Value Functions

The state-value function V^π(s) is the expected return when starting in state s and following policy π:

V^π(s) = E_π[G_t | s_t = s]
        = E_π[Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s]

The action-value function Q^π(s, a) (Q-function) gives the expected return after taking action a in state s:

Q^π(s, a) = E_π[G_t | s_t = s, a_t = a]
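The return G_t inside both expectations is just a discounted sum, which can be computed right-to-left in a single pass:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * r_{t+k+1}, accumulated from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1, 0, 2], gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Value functions are simply averages of this quantity over the randomness in the policy and the environment.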

The Bellman Equation

The Bellman equation decomposes the value function into two parts: immediate reward plus the discounted future value:

V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V^π(s')]

For the optimal value function V*:

V*(s) = max_a Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V*(s')]

The optimal policy is:

π*(s) = argmax_a Σ_{s'} P(s'|s, a) [R(s, a, s') + γ V*(s')]
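Turning the Bellman optimality equation into an assignment and iterating it to a fixed point is value iteration. The sketch below is a minimal version; the deterministic two-state chain used as a test bed is made up for illustration, and R is taken in the R(s, a) form:

```python
def value_iteration(S, A, P, R, gamma, theta=1e-8):
    """Sweep V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a) + gamma V(s')] until the
    largest change falls below theta, then extract the greedy (optimal) policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v = max(sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a].items())
                    for a in A)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    pi = {s: max(A, key=lambda a: sum(p * (R[s][a] + gamma * V[s2])
                                      for s2, p in P[s][a].items()))
          for s in S}
    return V, pi

# Hypothetical deterministic chain: action 1 moves to state 1, which pays reward 1.
S, A = [0, 1], [0, 1]
P = {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
V, pi = value_iteration(S, A, P, R, gamma=0.9)
```

With γ = 0.9, staying in state 1 forever gives V*(1) = 1/(1 − 0.9) = 10, and V*(0) = 0.9 · 10 = 9, which the iteration recovers.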

Model-Based vs Model-Free RL

Model-Based methods learn a model of the environment (transition dynamics and reward function) and use it for planning:

  • Dynamic Programming (DP)
  • Dyna-Q

Model-Free methods learn directly from interaction without building an explicit model:

  • Monte Carlo (MC)
  • Temporal Difference (TD)

Temporal Difference Methods

TD(0) updates estimates based on a single step of experience:

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]

The TD error δ_t = r_{t+1} + γ V(s_{t+1}) - V(s_t) measures how far the current estimate is from the target.
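A minimal TD(0) evaluator, written here against a hypothetical log of (s, r, s_next, done) transitions rather than a live environment:

```python
def td0_evaluate(episodes, alpha=0.1, gamma=0.9):
    """TD(0) policy evaluation: nudge V(s) toward r + gamma * V(s') by the TD error."""
    V = {}
    for episode in episodes:
        for s, r, s_next, done in episode:
            target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
            delta = target - V.get(s, 0.0)   # the TD error delta_t
            V[s] = V.get(s, 0.0) + alpha * delta
    return V

# A fixed two-step episode (0 -> 1 -> terminal, reward 1 on the last step), repeated.
episodes = [[(0, 0.0, 1, False), (1, 1.0, None, True)]] * 1000
V = td0_evaluate(episodes)
```

On this deterministic data the estimates converge to V(1) = 1 and V(0) = γ · V(1) = 0.9.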

SARSA (On-Policy TD Control)

SARSA updates Q-values using the action actually taken by the current policy:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]

The name comes from the tuple (S, A, R, S', A') used in each update.
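A compact SARSA sketch; the env_step function and the three-state chain are invented for illustration, with ε-greedy as the behaviour policy:

```python
import random

def sarsa(env_step, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """On-policy TD control: the target bootstraps from the action a' actually chosen."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s, a, done = 0, eps_greedy(0), False
        while not done:
            s2, r, done = env_step(s, a)
            a2 = eps_greedy(s2) if not done else 0  # a2 unused when done
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q

# Hypothetical 3-state chain: action 1 moves right; reaching state 2 pays 1 and ends.
def chain_step(s, a):
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 2 else 0.0), s2 == 2
```

Because the update uses a2 sampled from the current ε-greedy policy, SARSA evaluates the policy it is actually executing, exploration included.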

Q-Learning (Off-Policy TD Control)

Q-learning updates toward the maximum Q-value regardless of the action taken:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

Q-learning directly learns the optimal Q-function, independent of the policy being followed.
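A sketch of the same tabular setup with the Q-learning target; the only change is that the update bootstraps from max_a Q(s', a) rather than the action actually taken (the three-state chain is again invented for illustration):

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Off-policy TD control: greedy target, epsilon-greedy behaviour."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < eps:                 # behaviour: explore
                a = random.randrange(n_actions)
            else:                                     # behaviour: exploit
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * max(Q[s2]))  # greedy target
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# Hypothetical 3-state chain: action 1 moves right; reaching state 2 pays 1 and ends.
def chain_step(s, a):
    s2 = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 2 else 0.0), s2 == 2
```

On this chain the learned values approach the optimal Q*(1, 1) = 1 and Q*(0, 1) = γ = 0.9 regardless of how much the behaviour policy explores.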

Exploration vs Exploitation

A fundamental challenge in RL is balancing:

  • Exploration — trying new actions to discover better rewards
  • Exploitation — using known information to maximize reward

Common strategies:

  • ε-greedy: with probability ε choose a random action, otherwise act greedily
  • Softmax / Boltzmann: sample actions from a probability distribution proportional to Q-values
  • UCB (Upper Confidence Bound): prefer actions with high uncertainty
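The first two strategies are each a few lines; these helper functions are illustrative sketches, not taken from any library:

```python
import math
import random

def epsilon_greedy(q_values, eps):
    """With probability eps explore uniformly; otherwise exploit the best-known action."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_action(q_values, temperature=1.0):
    """Boltzmann exploration: sample actions with probability proportional to exp(Q/T)."""
    prefs = [math.exp(q / temperature) for q in q_values]
    total = sum(prefs)
    return random.choices(range(len(q_values)), weights=[p / total for p in prefs])[0]
```

The temperature plays the role ε does for ε-greedy: high temperature flattens the distribution toward uniform exploration, low temperature approaches pure greedy action selection.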

Deep Q-Networks (DQN)

When the state space is large or continuous, we approximate Q-functions with neural networks. Key innovations in DQN (Mnih et al., 2015):

  1. Experience Replay — store past transitions (s, a, r, s') in a replay buffer and sample mini-batches to break correlations
  2. Target Network — use a separate, periodically updated network for computing TD targets to stabilize training

# Simplified DQN update (pseudocode)
# sample a transition (s, a, r, s_next, done) from the replay buffer
target = r if done else r + gamma * max(target_network.predict(s_next))
loss = MSE(q_network.predict(s)[a], target)
q_network.optimize(loss)
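Experience replay itself is simple to implement independently of the network; a minimal buffer sketch (the class name and interface here are made up for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions.
    Uniform sampling breaks the temporal correlation of consecutive experience."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the left

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # without replacement

    def __len__(self):
        return len(self.buffer)
```

In a full DQN loop, each environment step pushes one transition, each training step samples a mini-batch, and the target network's weights are copied from the online network every fixed number of steps.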

Monte Carlo vs Temporal Difference

                    Monte Carlo       Temporal Difference
  Updates           End of episode    Every time step
  Variance          High              Low
  Bias              Zero              Some
  Convergence       Slower            Faster
  Online learning   No                Yes

TD methods are generally preferred for online, continuing tasks while MC can be practical when episodes are short and well-defined.