A hallmark of intelligence is the ability to adapt behaviour to reflect external
feedback. In reinforcement learning, this feedback is provided as a real-valued
quantity called the reward. Stubbing one’s toe on the dining table or forgetting
soup on the stove are situations associated with negative reward, while (for
some of us) the first cup of coffee of the day is associated with positive reward.
We are interested in agents that seek to maximise their cumulative reward or
return obtained from interactions with an environment. An agent maximises
its return by making decisions that either have immediate positive consequences,
or steer it into a desirable state. A particular assignment of reward to states and
decisions determines the agent’s objective. For example, in the game of Go
the objective is represented by a positive reward for winning. Meanwhile, the
objective of keeping a helicopter in flight is represented by a per-step negative
reward (typically expressed as a cost) proportional to how much the aircraft
deviates from a desired flight path. In this case, the agent’s return is the total
cost accrued over the duration of the flight.
Often, a decision will have uncertain consequences. Travellers know that it is
almost impossible to guarantee that a trip will go as planned, even though a
three-hour layover is usually more than enough to catch a connecting flight.
Nor are all decisions equal: transiting through Chicago O’Hare may be a riskier
choice than transiting through Toronto Pearson. To model this uncertainty,
reinforcement learning introduces an element of chance to the rewards and to
the effects of the agent’s decisions on its environment. Because the return is the
sum of rewards received along the way, it too is random.
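To fix notation informally (the formal definitions are given in Chapter 2), write $R_t$ for the reward received at step $t$ of the interaction and $\gamma \in [0, 1)$ for a discount factor; the return can then be expressed as

\[ G = \sum_{t=0}^{\infty} \gamma^t R_t . \]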
Historically, most of the field’s efforts have gone towards modelling the mean of
the random return. Doing so is useful, as it allows us to make the right decisions:
when we talk of “maximising the cumulative reward”, we typically mean
“maximising the expected return”. The idea has deep roots in probability, the
law of large numbers, and subjective utility theory. In fact, most reinforcement
learning textbooks axiomatise the maximisation of expectation. Quoting Richard
Bellman, for example:
The general idea, and this is fairly unanimously accepted, is to use some average of
the possible outcomes as a measure of the value of a policy.
This book takes the perspective that modelling the expected return alone fails to
account for many complex, interesting phenomena that arise from interactions
with one’s environment. This is evident in many of the decisions that we make:
the first rule of investment states that expected profits should be weighed
against volatility. Similarly, lottery tickets offer negative expected returns but
attract buyers with the promise of a high payoff. During a snowstorm, relying
on the average frequency at which buses arrive at a stop is likely to lead to
disappointment. More generally, hazards big and small result in a wide range
of possible returns, each with its own probability of occurrence. These returns
and their probabilities can be collectively described by a return distribution, our
main object of study.
1.1 Why Distributional Reinforcement Learning?
Just as a colour photograph conveys more information about a scene than a
black and white photograph, the return distribution contains more information
about the consequences of the agent’s decisions than the expected return. The
expected return is a scalar, while the return distribution is infinite-dimensional;
it is possible to compute the expectation of the return from its distribution, but
not the other way around (to continue the analogy, one cannot recover hue from a black-and-white photograph).
By considering the return distribution, rather than just the expected return,
we gain a fresh perspective on the fundamental problems of reinforcement
learning. This includes understanding of how optimal decisions should be
made, methods for creating effective representations of an agent’s state, and
the consequences of interacting with other learning agents. In fact, many of the
tools we develop here are useful beyond reinforcement learning and decision
making. We call the process of computing return distributions distributional
dynamic programming. Incremental algorithms for learning return distributions,
such as quantile temporal-difference learning (Chapter 6), are likely to find uses
wherever probabilities need to be estimated.
Throughout this book, we will encounter many examples in which we use the
tools of distributional reinforcement learning to characterise the consequences
of the agent’s choices. This is a modelling decision, rather than a reflection of
Figure 1.1
(a) In Kuhn poker, each player is dealt one card and then bets on whether they hold the
highest card. The diagram depicts one particular play through; the house’s card (bottom)
is hidden until betting is over. (b) A game tree with all possible states shaded according
to their frequency of occurrence in our example. Leaf nodes depict immediate gains and
losses, which we equate with different values of the received reward.
some underlying truth about these examples. From a theoretical perspective,
justifying our use of the distributional model requires us to make a number of
probabilistic assumptions. These include the notion that the random nature of
the interactions is intrinsically irreducible (what is sometimes called aleatoric
uncertainty) and unchanging. As we encounter these examples, the reader is
invited to reflect on these assumptions, and their effect on the learning process.
Furthermore, there are many situations in which such assumptions do not
completely hold but where distributional reinforcement learning still provides
a rich picture of how the environment operates. For example, an environment
may appear random because some parts of it are not described to the agent (it
is said to be partially observable), or because the environment changes over
the course of the agent’s interactions with it (multi-agent learning, the topic
of Section 11.1, is an example of this). Changes in the agent itself, such as a
change in behaviour, also introduce non-stationarity in the observed data. In
practice, we have found that the return distributions are a valuable reflection
of the underlying phenomena, even when there is no aleatoric uncertainty at
play. Put another way, the usefulness of distributional reinforcement learning
methods does not end with the theorems that characterise them.
1.2 An Example: Kuhn Poker
Kuhn poker is a simplified variant of the well-known card game. It is played
with three cards of a single suit (jack, queen, and king) and over a single round
of betting, as depicted in Figure 1.1a. Each player is dealt one card and must
first bet a fixed ante, which for the purpose of this example we will take to be
Figure 1.2
Distribution over winnings for the player after playing T rounds. For T = 1, this corre-
sponds to the distribution of immediate gains and losses. For T = 5, we see a single mode
appear roughly centred on the expected winnings. For larger T, two additional modes
appear, one in which the agent goes bankrupt and one where the player has successfully
doubled their stake. As T → ∞, only these two outcomes have non-zero probability.
$1. After looking at their card, the first player decides whether to raise, which
doubles their bet, or check. In response to a raise, the second player can call
and match the new bet or fold and lose their $1 ante. If the first player chooses
to check instead (keep the bet as-is), the option to raise is given to the second
player, symmetrically. If neither player folded, the player with the higher card
wins the pot ($1 or $2, depending on whether the ante was raised). Figure 1.1b
visualises a single play of the game as a 55-state game tree.
Consider a player who begins with $10 and plays a total of up to T hands of
Kuhn poker, stopping early if they go bankrupt or double their initial stake.
To keep things simple, we assume that this player always goes first and that
their opponent, the house, makes decisions uniformly at random. The player’s
strategy depends on the card they are dealt, and also incorporates an element of
randomness. There are two situations in which a choice must be made: whether
to raise or check at first, and whether to call or fold when the other player raises.
The following table of probabilities describes a concrete strategy as a function
of the player’s dealt card:
Holding a ...              Jack    Queen    King
Probability of raising     1/3     0        1
Probability of calling     0       2/3      1
If we associate a positive and negative reward with each round’s gains or losses,
then the agent’s random return corresponds to their total winnings at the end of
the T rounds, and ranges from -10 to 10.
How likely is the player to go bankrupt? How long does it take before the player
is more likely to be ahead than not? What is the mean and variance of the
player’s winnings after T = 15 hands have been played? These three questions
(and more) can be answered by using distributional dynamic programming to
determine the distribution of returns obtained after T rounds. Figure 1.2 shows
the distribution of winnings (change in money held by the player) as a function
of T. After the first round (T = 1) the most likely outcome is to have lost $1, but
the expected reward is positive. Consequently, over time the player is likely to
be able to achieve their objective. By the fifteenth round the player is much more
likely to have doubled their money than to have gone broke, with a bell-shaped
distribution of values in-between. If the game is allowed to continue until the
end, the player has either gone bankrupt or doubled their stake. In our example,
the probability that the player comes out a winner is approximately 85%.
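For readers who prefer to see such numbers come out of a computation, the sketch below estimates the winnings distribution by straightforward Monte Carlo simulation rather than by distributional dynamic programming. It encodes the betting rules and the strategy table above; the function names (play_hand, winnings_distribution), the uniform randomisation of the house at each decision point, and the clamping of the bankroll to the $0–$20 range (so that winnings stay between -10 and 10) are modelling choices made only for this illustration.

import random
from collections import Counter

CARD_RANK = {"J": 0, "Q": 1, "K": 2}

# The player's strategy from the table above, as a function of their dealt card.
P_RAISE = {"J": 1/3, "Q": 0.0, "K": 1.0}
P_CALL = {"J": 0.0, "Q": 2/3, "K": 1.0}   # probability of calling a raise after checking

def play_hand(rng):
    """Play one hand of Kuhn poker; return the player's net gain in dollars."""
    player, house = rng.sample(list(CARD_RANK), 2)
    player_wins = CARD_RANK[player] > CARD_RANK[house]
    if rng.random() < P_RAISE[player]:           # the player raises to $2
        if rng.random() < 0.5:                   # the house calls ...
            return 2 if player_wins else -2
        return 1                                 # ... or folds, losing its ante
    if rng.random() < 0.5:                       # the player checked; the house raises
        if rng.random() < P_CALL[player]:        # the player calls ...
            return 2 if player_wins else -2
        return -1                                # ... or folds, losing their ante
    return 1 if player_wins else -1              # both checked: showdown for the antes

def winnings_distribution(T, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the player's winnings after up to T hands."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        bankroll = 10
        for _ in range(T):
            if bankroll <= 0 or bankroll >= 20:  # bankrupt, or stake doubled
                break
            bankroll = max(0, min(20, bankroll + play_hand(rng)))
        counts[bankroll - 10] += 1
    return {g: c / n_samples for g, c in sorted(counts.items())}

for T in (1, 5, 15):
    dist = winnings_distribution(T)
    mean = sum(g * p for g, p in dist.items())
    print(f"T={T:2d}  mean={mean:+.2f}  P(bankrupt)={dist.get(-10, 0.0):.3f}  "
          f"P(doubled)={dist.get(10, 0.0):.3f}")

Running this for T = 1, 5, and 15 should qualitatively reproduce the shapes described above, under the stated assumptions.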
1.3 How Is Distributional Reinforcement Learning Different?
In reinforcement learning, the value function describes the expected return
that one would counterfactually obtain from beginning in any given state. It is
reasonable to say that its fundamental object of interest, the expected return,
is a scalar, and that algorithms that operate on value functions operate on
collections of scalars (one per state). On the other hand, the fundamental object
of distributional reinforcement learning is a probability distribution over returns:
the return distribution. The return distribution characterises the probability of
different returns that can be obtained as an agent interacts with its environment
from a given state. Distributional reinforcement learning algorithms operate on
collections of probability distributions that we call return-distribution functions
(or simply return functions).
More than a simple type substitution, going from scalars to probability distributions results in changes across the spectrum of reinforcement learning.
In distributional reinforcement learning, equations relating scalars become
equations relating random variables. For example, the Bellman equation states
that the expected return at a state $x$, denoted $V^\pi(x)$, equals the expectation of the
immediate reward $R$, plus the discounted expected return at the next state $X'$:

\[ V^\pi(x) = \mathbb{E}_\pi\left[ R + \gamma V^\pi(X') \mid X = x \right], \]

where $\pi$ is the agent's policy, a description of how it chooses actions in different
states. By contrast, the distributional Bellman equation states that the random
return at a state $x$, denoted $G^\pi(x)$, is itself related to the random immediate
reward and the random next-state return according to a distributional equation:

\[ G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X'), \qquad X = x . \]

In this case, $G^\pi(x)$, $R$, $X'$, and $G^\pi(X')$ are random variables, and the superscript
$D$ indicates equality between their distributions. Correctly interpreting the distri-
butional Bellman equation requires identifying the dependency between random
variables, in particular between $R$ and $X'$. It also requires understanding how
discounting affects the probability distribution of $G^\pi(x)$, and how to manipulate
the collection of random variables $G^\pi$ implied by the definition.

1. Later we will consider a form that equates probability distributions directly.
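To make the meaning of equality in distribution concrete, the following sketch checks the relation empirically on a small, hypothetical Markov reward process (two states, "a" and "b", invented only for illustration). Samples of the return obtained by full rollouts from "a" should follow the same distribution as samples of R + γ times the return from the next state, obtained by taking one step and then rolling out.

import random

GAMMA = 0.9
rng = random.Random(1)

# A toy Markov reward process used only for illustration: from state "a" the
# reward is +1 and the next state is "b" with probability 0.5 (otherwise "a");
# from state "b" the reward is -1 and the episode then terminates.
def step(state, rng):
    if state == "a":
        return (1.0, "b" if rng.random() < 0.5 else "a")
    return (-1.0, None)

def sample_return(state, rng):
    """One realisation of the random return G(state), via a full rollout."""
    g, discount = 0.0, 1.0
    while state is not None:
        reward, state = step(state, rng)
        g += discount * reward
        discount *= GAMMA
    return g

def sample_bellman_rhs(rng):
    """One realisation of R + gamma * G(X'), starting from state "a"."""
    reward, next_state = step("a", rng)
    g_next = 0.0 if next_state is None else sample_return(next_state, rng)
    return reward + GAMMA * g_next

lhs = [sample_return("a", rng) for _ in range(100_000)]
rhs = [sample_bellman_rhs(rng) for _ in range(100_000)]
print(sum(lhs) / len(lhs), sum(rhs) / len(rhs))   # the two means should agree closely

Comparing the two empirical means (or their histograms) gives an informal check of the distributional Bellman equation; the precise statement is developed in Chapter 2.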
Another change concerns how we quantify the behaviour of learning algorithms,
and how we measure the quality of an agent’s predictions. Because value func-
tions are real-valued vectors, the distance between a value function estimate and
the desired expected return is measured as the absolute difference between those
two quantities. On the other hand, when analysing a distributional reinforcement
learning algorithm, we must instead measure the distance between probability
distributions using a probability metric. As we will see, some probability met-
rics are better suited to distributional reinforcement learning than others, but
no single metric can be identified as the “natural” metric for comparing return distributions.
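As an informal illustration of what such a metric looks like, the sketch below computes two common choices, the 1-Wasserstein distance and the Cramér distance, between two made-up return distributions supported on a shared, evenly spaced grid; both can be written in terms of the gap between cumulative distribution functions. The particular distributions and grid are arbitrary.

import numpy as np

# Two made-up return distributions on a common, evenly spaced support.
support = np.linspace(-2.0, 2.0, 9)
p = np.array([0.0, 0.1, 0.2, 0.2, 0.2, 0.2, 0.1, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.1, 0.1, 0.3, 0.3, 0.1, 0.1, 0.0])

def wasserstein1(p, q, support):
    """1-Wasserstein distance: integral of |F_p - F_q| over the support."""
    gap = np.abs(np.cumsum(p) - np.cumsum(q))
    return float(np.sum(gap[:-1] * np.diff(support)))

def cramer(p, q, support):
    """Cramer distance: square root of the integral of (F_p - F_q)^2."""
    gap = (np.cumsum(p) - np.cumsum(q)) ** 2
    return float(np.sqrt(np.sum(gap[:-1] * np.diff(support))))

print(wasserstein1(p, q, support), cramer(p, q, support))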
Implementing distributional reinforcement learning algorithms also poses some
concrete computational challenges. In general, the return distribution is sup-
ported on a range of possible returns and its shape can be quite complex. To
approximate this distribution with a finite number of parameters, some approx-
imation is necessary; the practitioner is faced with a variety of choices and
trade-offs. One approach is to discretise the support of the distribution uniformly
and assign a variable probability to each interval; this is what we call the categorical
representation. Another is to represent the distribution using a finite number
of uniformly-weighted particles whose locations are parameterised, called the
quantile representation. In practice and in theory, we find that the choice of dis-
tribution representation impacts the quality of the return function approximation
and also the ease with which it can be computed.
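The sketch below contrasts the two parameterisations on a batch of sampled returns standing in for a return distribution. The support range, the number of parameters m, and the use of midpoint quantile levels are illustrative choices, not prescriptions.

import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=2.0, size=10_000)   # stand-in return samples

m = 11

# Categorical representation: m fixed, evenly spaced support points; the learned
# parameters are the probabilities assigned to each point (here, bin frequencies).
support = np.linspace(-5.0, 7.0, m)
midpoints = (support[:-1] + support[1:]) / 2
probs = np.bincount(np.digitize(returns, midpoints), minlength=m) / len(returns)

# Quantile representation: m uniformly weighted particles (mass 1/m each); the
# learned parameters are their locations (here, empirical quantiles at midpoint levels).
taus = (np.arange(m) + 0.5) / m
locations = np.quantile(returns, taus)

print(np.round(probs, 3))
print(np.round(locations, 2))

In the first case the support is fixed and the probabilities vary; in the second the probabilities are fixed at 1/m and the locations vary.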
Learning return distributions from sampled experience is also more challenging
than learning to predict expected returns. The issue is particularly acute when
learning proceeds by bootstrapping, i.e. when the return function estimate at
one state is learned on the basis of the estimate at successor states. When the
return function estimates are defined by a deep neural network, as is common in
practice, one must also take care in choosing a loss function that is compatible
with a stochastic gradient descent scheme.
For an agent that only knows about expected returns, it is natural (almost
necessary) to define optimal behaviour in terms of maximising this quantity.
The Q-learning algorithm, which performs credit assignment by maximising
over state-action values, learns a policy with exactly this objective in mind.
Knowledge of the return function, however, allows us to define behaviours
that depend on the full distributions of returns, in what is called risk-sensitive
reinforcement learning. For example, it may be desirable to act so as to avoid
states that carry a high probability of failure, or penalise decisions that have
high variance. In many circumstances, distributional reinforcement learning
enables behaviour that is more robust to variations and, perhaps, better suited to
real-world applications.
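As a toy illustration of risk-sensitive action selection, the sketch below scores two actions, one safe and one that occasionally fails badly, under three criteria: the mean, a mean-minus-standard-deviation penalty, and the conditional value-at-risk (CVaR). The return distributions, the failure probability, and the particular criteria are all invented for the purpose of the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical return samples for two actions: action 0 is safe; action 1 has a
# higher mean but fails badly (return -10) five percent of the time.
samples = {
    0: rng.normal(1.0, 0.5, size=10_000),
    1: np.where(rng.random(10_000) < 0.05, -10.0, rng.normal(2.0, 0.5, size=10_000)),
}

def mean(g):
    return g.mean()

def mean_minus_std(g, lam=1.0):
    return g.mean() - lam * g.std()

def cvar(g, alpha=0.1):
    """Average of the worst alpha-fraction of returns."""
    cutoff = np.quantile(g, alpha)
    return g[g <= cutoff].mean()

for criterion in (mean, mean_minus_std, cvar):
    best = max(samples, key=lambda a: criterion(samples[a]))
    print(f"{criterion.__name__:15s} prefers action {best}")

With these numbers, the mean favours the risky action, while the two risk-sensitive criteria favour the safe one.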
1.4 Intended Audience and Organisation
This book is intended for advanced undergraduates, graduate students, and
researchers who have some exposure to reinforcement learning and are inter-
ested in understanding its distributional counterpart. We present core ideas from
classical reinforcement learning as they are needed to contextualise distribu-
tional topics, but often omit longer discussions and a presentation of specialised
methods in order to keep the exposition concise. The reader wishing a more
in-depth review of classical reinforcement learning is invited to consult one
of the literature’s many excellent books on the topic, including Bertsekas and
Tsitsiklis [1996], Szepesvári [2010], Bertsekas [2012], Puterman [2014], Sutton
and Barto [2018], Meyn [2022].
Already, an exhaustive treatment of distributional reinforcement learning would
require a substantially larger book. Instead, here we emphasise key concepts and
challenges of working with return distributions, in a mathematical language that
aims to be both technically correct and easily applied. Our choice of topics
is driven by practical considerations (such as scalability in terms of available
computational resources), a topic’s relative maturity, and our own domains of
expertise. In particular, this book contains only one chapter about what is com-
monly called the control problem, and focuses on dynamic programming and
temporal-difference algorithms over Monte Carlo methods. Where appropriate,
in the bibliographical remarks we provide references on these omitted topics.
In general, we chose to include proofs when they pertain to major results in the
chapter, or are instructive in their own right. We defer the proof of a number of
smaller results to exercises.
Each chapter of this book is structured like a hiking trail.
The first sections
(the “foothills”) introduce a concept from classical reinforcement learning and
extend it to the distributional setting. Here, a knowledge of undergraduate-level
probability theory and computer science usually suffices. Later sections (the
“incline”) dive into more technical points, for example a proof of convergence
or more complex algorithms. These may be skipped without affecting the
reader’s understanding of the fundamentals of distributional reinforcement
learning. Finally, most chapters end on a few additional results or remarks that
are interesting yet easily omitted (the “side trail”). These are indicated by an
asterisk (*). For the latter part of the chapter’s journey, the reader may wish to
come equipped with tools from advanced probability theory; our own references
are Billingsley [2012] and Williams [1991].
The book is divided into three parts. The first part introduces the building
blocks of distributional reinforcement learning. We begin by introducing our
fundamental objects of study, the return distribution and the distributional
Bellman equation (Chapter 2). Chapter 3 then introduces categorical temporal-
difference learning, a simple algorithm for learning return distributions. By
the end of Chapter 3, the reader should understand the basic principles of
distributional reinforcement learning, and should be able to use it in simple
practical settings.
The second part develops the theory of distributional reinforcement learning.
Chapter 4 introduces a language for measuring distances between return dis-
tributions, and operators for transforming these distributions. Chapter 5
introduces the notion of a probability representation, needed to implement
distributional reinforcement learning; it subsequently considers the problem of
computing and approximating return distributions using such representations,
introducing the framework of distributional dynamic programming. Chapter 6
studies how return distributions can be learned from samples and in an incre-
mental fashion, giving a formal construction of categorical temporal-difference
learning as well as other algorithms such as quantile temporal-difference learn-
ing. Chapter 7 extends these ideas to the setting of optimal decision making
(also called the control setting). Finally, Chapter 8 introduces a different perspec-
tive on distributional reinforcement learning based on the notion of statistical
functionals. By the end of the second part, the reader should understand the
challenges that arise when designing distributional reinforcement learning
algorithms, and the available tools to address these challenges.
2. Based on one of the authors’ experience hiking around Banff, Canada.
The third and final part develops distributional reinforcement learning for
practical scenarios. Chapter 9 reviews the principles of linear value function
approximation and extends these ideas to the distributional setting. Chapter 10
discusses how to combine distributional methods with deep neural networks
to obtain algorithms for deep reinforcement learning. Chapter 11 discusses the
emerging use of distributional reinforcement learning in two further domains of
research (multi-agent learning and neuroscience) and concludes.
Code for examples and exercises, as well as standard implementations of some
of the algorithms presented here can be found at http://distributional-rl.org.
1.5 Bibliographical Remarks
1.0. The quote is due to Bellman [1957a].
1.2. Kuhn poker is due to Kuhn [1950], who gave an exhaustive characterisation
of the game’s Nash equilibria. The player’s strategy used in the main text forms
part of such a Nash equilibrium.