3
Learning the Return Distribution
Reinforcement learning provides a computational model of what and how an
animal, or more generally an agent, learns to predict on the basis of received
experience. Often, the prediction of interest is about the expected return under
the initial state distribution or, counterfactually, from a query state x
2X
. In the
latter case, making good predictions is equivalent to having a good estimate
of the value function V
. From these predictions, reinforcement learning also
seeks to explain how the agent might best control its environment, for example
to maximise its expected return.
By learning from experience, we generally mean from data rather than from the
Markov decision process description of the environment (its transition kernel,
reward distribution, and so on). In many settings, this data is taken from sample
interactions with the environment. In their simplest forms, these interactions are
realisations of the random trajectory (X
t
, A
t
, R
t
)
t0
. The record of a particular
game of chess and the fluctuations of an investment account throughout the
month are two examples of sample trajectories. Learning from experience is
a powerful paradigm, because it frees us from needing a complete description
of the environment, an often impractical if not infeasible requirement. It also
enables incremental algorithms whose run time does not depend on the size of
the environment and that can easily be implemented and parallelised.
The aim of this chapter is to introduce a concrete algorithm for learning return
distributions from experience, called categorical temporal-difference learning.
In doing so, we will provide an overview of the design choices that must be made
when creating distributional reinforcement learning algorithms. While there are
well-established, and in some cases definitive methods for learning to predict
the expected return in classical reinforcement learning, in the distributional
setting choosing the right algorithm requires balancing multiple, sometimes
conflicting considerations. In great part, this is because of the unique and
53
54 Chapter 3
exciting challenges that arise when one wishes to estimate sometimes intricate
probability distributions, rather than scalar expectations.
3.1 The Monte Carlo Method
Birds such as the pileated woodpecker follow a feeding routine that regularly
takes them back to the same foraging grounds. The success of this routine can be
measured in terms of the total amount of food obtained during a fixed period of
time, say a single day. As part of a field study, it may be desirable to predict the
success of a particular bird’s routine on the basis of a limited set of observations;
for example, to assess its survival chances at the beginning of winter based
on feeding observations from the summer months. In reinforcement learning
terms, we view this as the problem of learning to predict the expected return
(total food per day) of a given policy
(the feeding routine). Here, variations in
weather, human activity, and other foraging animals are but a few of the factors
that affect the amount of food obtained on any particular day.
In our example, the problem of learning to predict is abstractly a problem of
statistical estimation. To this end, let us model the woodpecker’s feeding routine
as a Markov decision process.
17
We associate each day with a sample trajectory
or episode, corresponding to measurements made at regular intervals about the
bird’s location x, behaviour a, and per-period food intake r. Suppose that we
have observed a set of K sample trajectories,
(x
k,t
, a
k,t
, r
k,t
)
T
k
–1
t=0
K
k=1
, (3.1)
where we use k to index the trajectory and t to index time, and where T
k
denotes
the number of measurements taken each day. In this example, it is most sensible
to assume a fixed number of measurements T
k
= T, but in the general setting T
k
may be random and possibly dependent on the trajectory, often corresponding to
the time when a terminal state is first reached. For now, let us also assume that
there is a unique starting state x
0
, such that x
k,0
= x
0
for all k. We are interested
in the problem of estimating the expected return
E
T–1
t=0
t
R
t
= V
(x
0
),
corresponding to the expected per-day food intake of our bird.
18
17.
Although this may be a reasonable approximation of reality, it is useful to remember that
concepts such as the Markov property are in this case modelling assumptions, rather than actual
facts.
18.
In this particular example, a discount factor of
= 1 is reasonable given that T is fixed and we
should have no particular preference for mornings over evenings.
Learning the Return Distribution 55
Monte Carlo methods estimate the expected return by averaging the outcomes of
observed trajectories. Let us denote by g
k
the sample return for the k
th
trajectory:
g
k
=
T–1
t=0
t
r
k,t
. (3.2)
The sample-mean Monte Carlo estimate is the average of these K sample
returns:
ˆ
V
(x
0
)=
1
K
K
k=1
g
k
. (3.3)
This a sensible procedure that also has the benefit of being simple to implement
and broadly applicable. In our example above, the Monte Carlo method corre-
sponds to estimating the expected per-day food intake by simply averaging the
total intake measured over different days.
Example 3.1.
Monte Carlo tree search is a family of approaches that have
proven effective for planning in stochastic environments and have been used
to design state-of-the-art computer programs that play the game of Go. Monte
Carlo tree search combines a search procedure with so-called Monte Carlo
rollouts (simulations). In a typical implementation, a computer Go algorithm
will maintain a partial search tree whose nodes are Go positions, rooted in the
current board configuration. Nodes at the fringe of this tree are evaluated by
performing a large number of rollouts of a fixed rollout policy (often uniformly
random) from the nodes’ position to the game’s end.
By defining the return of one rollout to be 1 if the game is won and 0 if the game
is lost, the expected undiscounted return from a leaf node corresponds to the
probability of winning the game (under the rollout policy) from that position.
By estimating the expected return from complete rollouts, the Monte Carlo
method avoids the challenges associated with devising a heuristic to determine
the value of a given position. 4
We can use the Monte Carlo method to estimate the value function V
rather
than only the expected return from the initial state. Suppose now that the sample
trajectories (Equation 3.1) have different starting states (that is, x
k,0
varies across
trajectories). The Monte Carlo method constructs the value function estimate
ˆ
V
(x)=
1
N(x)
K
k=1
{x
k,0
=x}
g
k
,
56 Chapter 3
where
N(x)=
K
k=1
{x
k,0
= x}
is the number of trajectories whose starting state is x. For simplicity of
exposition, we assume here that N(x) > 0 for all x 2X.
Learning the value function is useful because it provides finer-grained detail
about the environment and chosen policy compared to only learning about the
expected return from the initial state. Often, estimating the full value function
can be done at little additional cost, because sample trajectories contain infor-
mation about the expected return from multiple states. For example, given a
collection of K sample trajectories drawn independently from the generative
equations, the first-visit Monte Carlo estimate is
ˆ
V
FV
(x)=
1
N
FV
(x)
K
k=1
T
k
–1
t=0
{x
k,t
= x, x
k,0
, ..., x
k,t–1
6= x}

c
k,t
T
k
–1
s=t
st
r
k,s
N
FV
(x)=
K
k=1
T
k
–1
t=0
c
k,t
.
The first-visit Monte Carlo estimate treats each time step as the beginning of a
new trajectory; this is justified by the Markov property and time-homogeneity
of the random trajectory ( X
t
, A
t
, R
t
)
t0
(Section 2.6). The restriction to the first
occurrence of x in a particular trajectory guarantees that
ˆ
V
FV
behaves like the
sample-mean estimate.
3.2 Incremental Learning
Both in practice and in theory, it is useful to consider a learning model under
which sample trajectories are processed sequentially, rather than all at once.
Algorithms that operate in this fashion are called incremental algorithms, as
they maintain a running value function estimate V
2R
X
which they improve
with each sample.
19
Under this model, we now consider an infinite sequence of
sample trajectories
(x
k,t
, a
k,t
, r
k,t
)
T
k
–1
t=0
k0
, (3.4)
presented one at a time to the learning algorithm. In addition, we consider the
more general setting in which the initial states (x
k,0
)
k0
may be different; we
19.
Incremental methods are also sometimes called stochastic or sample-based. We prefer the term
“incremental” as these methods can be applied even in the absence of randomness, and because
other estimation methods including the sample-mean method of the preceding section are also
based on samples.
Learning the Return Distribution 57
call these states the source states, as with the sample transition model (Section
2.6). As in the previous section, a minimum requirement for learning V
is that
every state x 2X should be the source state of some trajectories.
Algorithm 3.1: Online incremental first-visit Monte Carlo
Algorithm parameters: step size 2(0, 1],
policy : X!P(A),
initial estimate V
0
2R
X
V(x) V
0
(x) for all x 2X
Loop for each episode:
Observe initial state x
0
T 0
Loop for t = 0, 1,
Draw a
t
from ( · | x
t
)
Take action a
t
, observe r
t
, x
t+1
T T +1
until x
t+1
is terminal
g 0
for t = T 1, ,0do
g r
t
+ g
if x
t
is not in {x
0
, , x
t–1
} then
V(x
t
) (1 )V(x
t
)+g
end for
end
The incremental Monte Carlo algorithm begins by initialising its value function
estimate:
V(x) V
0
(x), for all x 2X.
Given the k
th
trajectory, the algorithm computes the sample return g
k
(Equation
3.2), called the Monte Carlo target in this context. It then adjusts its value
function estimate for the initial state x
k,0
towards this target, according to the
update rule
V(x
k,0
) (1
k
)V(x
k,0
)+
k
g
k
.
The parameter
k
2
[0, 1) is a time-varying step size that controls the impact
of a single sample trajectory on the value estimate V. Because the incremental
Monte Carlo update rule only depends on the starting state and the sample
return, we can more generally consider learning V
on the basis of the sequence
58 Chapter 3
of state-return pairs
x
k
, g
k
)
k0
. (3.5)
Under this simplified model, the sample return g
k
is assumed drawn from the
return distribution
(x
k
) (rather than constructed from the sample trajectory; of
course, these two descriptions are equivalent). The initial state x
k,0
is substituted
for x
k
to yield:
V(x
k
) (1
k
)V(x
k
)+
k
g
k
. (3.6)
By choosing a step size that is inversely proportional to the number of times
N
k
(x
k
) that the value of state x
k
has been previously encountered, we recover
the sample-mean estimate of the previous section (see Exercise 3.1). In practice,
it is also common to use a constant step size
to avoid the need to track N
k
.We
will study the effect of step sizes in greater detail in Chapter 6.
As given, the Monte Carlo update rule is agnostic to the mechanism by which
the sample trajectories are generated, or when learning occurs in relation to the
agent observing these trajectories. A frequent, and important setting in reinforce-
ment learning is when each trajectory is consumed by the learning algorithm
immediately after being experienced, which we call the online setting.
20
Com-
plementing the abstract presentation given in this section, Algorithm 3.1 gives
pseudo-code for an incremental implementation of the first-visit Monte Carlo
algorithm in the online, episodic
21
setting.
In Equation 3.6, the notation A
B indicates that the value B (which may be a
scalar, a vector, or a distribution) should be stored in the variable A. In the case
of Equation 3.6, V is a collection of scalars one per state and the update
rule describes the process of modifying the value of one of these variables “in
place”. This provides a succinct way of highlighting the incremental nature of
the process. On the other hand, it is often useful to consider the value of the
variable after a given number of iterations. For k > 0, we express this with the
notation
V
k+1
(x
k
) = (1
k
)V
k
(x
k
)+
k
g
k
V
k+1
(x)=V
k
(x) for x 6= x
k
, (3.7)
where the second line reflects the fact that only the variable associated to state
x
k
is modified at time k.
20.
The online setting is sometimes defined in terms of individual transitions, which are to be
consumed immediately after being experienced. Our use of the term is broader and related to the
streaming setting considered in other fields of research [see e.g. Cormode and Muthukrishnan,
2005].
21. A environment is said to be episodic when all trajectories eventually reach a terminal state.
Learning the Return Distribution 59
3.3 Temporal-Difference Learning
Algorithm 3.2: Online temporal-difference learning
Algorithm parameters: step size 2(0, 1],
policy : X!P(A)
initial estimate V
0
2R
X
V(x) V
0
(x) for all x 2X
Loop for each episode:
Observe initial state x
0
Loop for t = 0, 1,
Draw a
t
from ( · | x
t
)
Take action a
t
, observe r
t
, x
t+1
if x
t+1
is terminal then
V(x
t
) (1 )V(x
t
)+r
t
else
V(x
t
) (1 )V(x
t
)+
r
t
+ V(x
t+1
)
end if
until x
t+1
is terminal
end
Incremental algorithms are useful because they do not need to maintain the
entire set of sample trajectories in memory. In addition, they are often simpler
to implement and enable distributed and approximate computation. Temporal-
difference learning (TD learning, or simply TD) is particularly well-suited to
the incremental setting, because it can learn from sample transitions, rather than
entire trajectories. It does so by leveraging the relationship between the value
function at successive states effectively, it takes advantage of the Bellman
equation.
To begin, let us abstract away the sequential nature of the trajectory and consider
the sample transition model (X, A, R, X
0
) defined by Equation 2.12. Here, the
distribution
of the source state X may correspond, for example, to the relative
frequency at which states are visited over the random trajectory. We consider a
sequence of sample transitions drawn independently according to this model,
denoted
(x
k
, a
k
, r
k
, x
0
k
)
k0
. (3.8)
60 Chapter 3
As with the incremental Monte Carlo algorithm, the update rule of temporal-
difference learning is parametrised by a time-varying step size
k
. For the k
th
sample transition, this update rule is given by
V(x
k
) (1
k
)V(x
k
)+
k
r
k
+ V(x
0
k
)
. (3.9)
Algorithm 3.2 instantiates this update rule in the online, episodic setting. We
call the term r
k
+
V(x
0
k
) in Equation 3.9 the temporal-difference target. If we
again write the Bellman equation
V
(x)=E
R + V
(X
0
)|X = x
,
we see that the temporal-difference target can be thought of as a realisation
of the random variable whose expectation forms the right-hand side of the
Bellman equation, with the exception that the TD target uses the estimated
value function V in place of the true value function V
. This highlights the fact
that temporal-difference learning adjusts its value function estimate to be more
like the right-hand side of the Bellman equation; we will study this relationship
more formally in Chapter 6.
By rearranging terms, we can also express the temporal-difference update rule
in terms of the temporal-difference error r
k
+ V(x
0
k
)–V(x
k
):
V(x
k
) V(x
k
)+
k
r
k
+ V(x
0
k
)–V(x
k
)
;
this form highlights the direction of change of the value function estimate
(positive if the target is greater than our estimate, negative if it is not) and is
needed to express certain reinforcement learning algorithms, as we will see in
Chapter 9.
The incremental Monte Carlo algorithm updates its value function estimate
towards a fixed (but noisy) target. By contrast, Equation 3.9 describes a recur-
sive process, without such a fixed target. Temporal-difference learning instead
depends on the value function at the next state V(x
0
k
) being approximately
correct. As such, it is said to bootstrap from its own value function estimate.
Because of this recursive dependency, the dynamics of temporal-difference
learning are usually different from those of Monte Carlo methods, and are more
challenging to analyse (Chapter 6).
On the other hand, temporal-difference learning offers some important advan-
tages over Monte Carlo methods. One is the way in which value estimates
are naturally shared between states, so that once a value has been estimated
accurately at one state, this can often be used to improve the value estimates at
other states. In many situations, the estimates produced by temporal-difference
learning are consequently more accurate than their Monte Carlo counterparts.
Learning the Return Distribution 61
A full treatment of the statistical properties of temporal-difference learning is
beyond the scope of this book, but we provide references on the topic in the
bibliographical remarks.
3.4 From Values To Probabilities
In distributional reinforcement learning, we are interested in understanding
the random return as it arises from interactions with the environment. In the
context of this chapter, we are specifically interested in how we can learn the
return-distribution function
.
As a light introduction, consider the scenario in which rewards are binary
(R
t
2
{0, 1}) and where we are interested in learning the distribution of the
undiscounted finite-horizon return function
G
(x)=
H–1
t=0
R
t
, X
0
= x. (3.10)
Here, H
2N
+
denotes the horizon of the learner how far it predicts into
the future. By enumeration, we see that G
(x) takes on one of H + 1 possible
values, integers ranging from 0 to H. These form the support of the probability
distribution
(x). To learn
, we will construct an incremental algorithm that
maintains a return-distribution function estimate
, the distributional analogue
of V from the previous sections. This estimate assigns a probability p
i
(x) to
each possible return i 2{0, ..., H}:
(x)=
H
i=0
p
i
(x)
i
, (3.11)
where p
i
(x)
0 and
H
i=0
p
i
(x) = 1. We call Equation 3.11 a categorical rep-
resentation, since each possible return can now be thought of as one of H +1
categories. Under this perspective, we can think of the problem of learning the
return distribution for a given state x as a classification problem assigning
probabilities to each of the possible categories. We may then view the problem
of learning the return function
as a collection of classification problems (one
per state).
As before, let us consider a sequence of state-return pairs (x
k
, g
k
)
k0
, where
each g
k
is drawn from the distribution
(x
k
). As in Section 3.2, the sample
return g
k
provides a target for an update rule except that now we want to adjust
the probability of observing g
k
rather than an estimate of the expected return.
For a step size
k
2(0, 1], the categorical update rule is
p
g
k
(x
k
) (1
k
)p
g
k
(x
k
)+
k
62 Chapter 3
p
i
(x
k
) (1
k
)p
i
(x
k
), i 6= g
k
. (3.12)
The adjustment of the probabilities for returns other than g
k
ensures that
the return distribution estimate at x
k
continues to sum to 1 after the update.
Expressed as an operation over probability distributions, this update rule corre-
sponds to changing
(x
k
) to be a mixture between itself and a distribution that
puts all of its mass on g
k
:
(x
k
) (1
k
)(x
k
)+
k
g
k
. (3.13)
We call this the undiscounted finite-horizon categorical Monte Carlo algorithm.
It is instantiated in the online, episodic setting in Algorithm 3.3. Similar to the
other incremental algorithms presented thus far, it is possible to demonstrate
that under the right conditions, this algorithm learns a good approximation to
the distribution of the binary-reward, undiscounted, finite-horizon return. In the
next section we will see how this idea can be carried over to the more general
setting.
3.5 The Projection Step
The finite, undiscounted algorithm of the previous section is a sensible approach
when the random return takes only a few distinct values. In the undiscounted
setting, we already saw that the number of possible returns is N
G
= H + 1 when
there are two possible rewards. However, this small number is the exception,
rather than the rule. If there are N
R
possible rewards then N
G
can be as large as
N
R
+H–1
H
, already a potentially quite large number for N
R
> 2 (see Exercise 3.5).
Worse, when a discount factor
is introduced, the number of possible returns
depends exponentially on H. To see why, recall the single-state, single-action
Markov decision process of Example 2.10 with reward distribution
P
(R
t
= 0) = P
(R
t
= 1) =
1
/2 .
With a discount factor
=
1
/
2
, we argued that the set of possible returns for this
MDP corresponds to the binary expansion of all numbers in the [0, 2] interval,
from which we concluded that the random return is uniformly distributed on
that interval. By the same argument, we can show that the H-horizon return
G =
H–1
t=0
t
R
t
has support on all numbers in the [0, 2] interval that are described by H binary
digits; there are 2
H
such numbers. Of course, if we take H to be infinite there
may be uncountably many possible returns.
Learning the Return Distribution 63
Algorithm 3.3: Undiscounted finite-horizon categorical Monte
Carlo
Algorithm parameters: step size 2(0, 1],
horizon H,
policy : X!P(A),
initial probabilities
(p
0
i
(x))
H
i=0
: x 2X
p
i
(x) p
0
i
(x) for all 0 i H, x 2X
Loop for each episode:
Observe initial state x
0
T 0
Loop for t = 0, 1, ...
Draw a
t
from ( · | x
t
)
Take action a
t
, observe r
t
, x
t+1
T T +1
until x
t+1
is terminal
g 0
for t = T 1, ,0do
g g + r
t
if t < T H then
g g r
t+H
if x
t
is not in {x
0
, , x
t–1
} then
p
g
(x
t
) (1 )p
g
(x
t
)+
p
i
(x
t
) (1 )p
i
(x
t
), for i 6= g
end for
end
To deal with the issue of a large (or even infinite) set of possible returns, we
insert a projection step prior to the mixture update in Equation 3.13.
22
The
purpose of the projection step is to transform an arbitrary target return g into
a modified target taking one of m values, for m reasonably small. From a
classification perspective, we can think of the projection step as assigning a
label (from a small set of categories) to each possible return.
22.
When there are relatively few sample trajectories, a sensible alternative is to construct a
non-parametric estimate of the return distribution that simply puts equal probability mass on all
observed returns. The return distributions described in Section 1.2 and Example 2.9, in particular,
were estimated in this manner. See Remark 3.1 for further details.
64 Chapter 3
We will consider return distributions that assign probability mass to m
2
evenly-spaced values or locations. In increasing order, we denote these locations
by
1
2
···
m
. We write
&
m
for the gap between consecutive locations,
so that for i = 1, ..., m 1, we have
&
m
=
i+1
i
.
A common design takes
1
= V
MIN
and
m
= V
MAX
, in which case
&
m
=
V
MAX
V
MIN
m –1
.
However, other choices are possible and sometimes desirable. Note that the
undiscounted algorithm of the previous section corresponds to m = H + 1 and
i
= i 1.
We express the corresponding return distribution as a weighted sum of Dirac
deltas at these locations:
(x)=
m
i=1
p
i
(x)
i
;
For mathematical convenience, let us write
0
=–
1
and
m+1
=
1
. Consider
a sample return g, and denote by i
(g) the index of the largest element of the
support (extended with 1) that is no greater than g:
i
= arg max
i2{0,...,m}
{
i
:
i
g}.
For this sample return, we write
(g)=
i
+
(g)=
i
+1
to denote the corresponding element of the support and its immediate successor;
when
1
,
...
,
m
are consecutive integers and g
2
[
1
,
m
] these are the floor and
ceiling of g, respectively.
The projection step begins by computing
(g)=
g
(g)
+
(g)–
(g)
,
with the convention that (g) = 1 if
(g)=–1 and (g) = 0 if
+
(g)=1. The
(g) term corresponds to the distance of g to the two closest elements of the
support, scaled to lie in the interval [0, 1] effectively, the fraction of the
distance between
(g) and
+
(g) at which g lies. The stochastic projection of
g is
±
(g)=
(g) with probability 1 (g)
+
(g) with probability (g).
(3.14)
Learning the Return Distribution 65
We use this projection to construct the update rule
(x) (1 )(x)+↵
±
g
,
where as before x is a source state and g a sample return.
23
Similar to Equation
3.12, this update rule is implemented by adjusting the probabilities according to
p
i
±
(x) (1 )p
i
±
(x)+
p
i
(x) (1 )p
i
(x), i 6= i
±
.
where i
±
is the index of location
±
g.
We can improve on the stochastic projection by constructing a modified target
that contains information about both
(g) and
+
(g). In classification terms,
this corresponds to using soft labels meaning that the target is a probability
distribution over labels, rather than a single label. This deterministic projection
of g results in the update rule
(x) (1 )(x)+
(1 (g))
(g)
+ (g)
+
(g)
. (3.15)
We denote the deterministic projection by
C
. Statistically speaking, the deter-
ministic projection produces return distribution estimates that are on average
the same as those produced by the stochastic projection, but are comparatively
more concentrated around their mean. Going forward, we will see that it is
conceptually simpler to apply this projection to probability distributions, rather
than to sample returns. Rather than
C
(g), we therefore write
C
g
= (1 (g))
(g)
+ (g)
+
(g)
.
We call this method the categorical Monte Carlo algorithm. This algorithm can
be used to learn to predict infinite-horizon, discounted returns, and is applicable
even when there are a large number of possible rewards.
Example 3.2.
Montezuma’s Revenge is a 1984 platform game designed by
then-16-year-old Robert Jaeger. As part of the Arcade Learning Environment
[Bellemare et al., 2013a], the Atari 2600 version of the game poses a challenging
reinforcement learning problem due to the rare occurrence of positive rewards.
The very first task in Montezuma’s Revenge consists in collecting a key, an act
which rewards the agent with 100 points. Let us consider the integer support
i
= i 1, for i = 1, 2,
...
, 101. For a discount factor
= 0.99, the discounted
23.
From here onwards we omit the iteration index k from the notation as it is not needed in the
definition of the update rule. The algorithm proper should still be understood as applying this update
rule to a sequence of state-return pairs (x
k
, g
k
)
k0
, possibly with a time-varying step size
k
.
66 Chapter 3
Figure 3.1
Illustration of the projection for the 5-location grid used in Example 3.3. The middle
panel depicts the proportion of probability mass assigned to the location
3
= 1 in terms
of the sample return g. The left and right panels illustrate this probability assignment at
the boundary locations
1
= 0 and
m
= 2.
H-horizon return from the initial game state is
G =
H–1
t=0
t
R
t
=
{ < H}
0.99
100,
where
denotes the time at which the key is obtained. For a fixed
2R
, write
g = 0.99
100 = g bgc.
For
< H, the deterministic projection of the return 0.99
puts probability mass
1–
on
b
g
c
and
on
d
g
e
. For example, if
= 60, then 100
0.99
54.72 and
the deterministic projection is
C
g
= 0.28
54
+ 0.72
55
. 4
By introducing a projection step we typically lose some information about
the target return. This is by necessity: we are asking the learning algorithm to
approximate complex distributions using a small, finite set of possible returns.
Under the right conditions, Equation 3.15 gives rise to a convergent algorithm.
The point of convergence of this algorithm is the return function
ˆ
(x) for
which, for all x 2X, we have
ˆ
(x)=E[
C
G
(x)
]=E

1–(G
(x))
(G
(x))
+ (G
(x))
+
(G
(x))
.
(3.16)
In fact, we may write
E[
C
G
(x)
]=
C
(x),
where
C
(x) denotes distribution supported on {
1
,
,
m
} produced by
projecting each possible outcome under the distribution
(x); we will discuss
this definition in further detail in Chapter 5.
Learning the Return Distribution 67
Example 3.3.
Recall the single-state Markov decision process of Example 2.10,
whose return is uniformly distributed on the interval [0, 2]. Suppose that we take
our support to be the uniform grid {0, 0.5, 1, 1.5, 2}. Let us write
ˆ
p
0
,
ˆ
p
0.5
,
...
,
ˆ
p
2
for the probabilities assigned to these locations by the projected distribution
ˆ
(x)=
C
(x), where x is the unique state of the MDP. The probability density
of the return on the interval [0, 2] is 0.5. We thus have
ˆ
p
0
(a)
= 0.5
g2[0,2]
{
(g) = 0}
(1 (g)) +
{
+
(g) = 0}
(g)
dg
= 0.5
g2[0,0.5]
(1 (g))dg
= 0.5
g2[0,0.5]
(1 2g)dg
= 0.125.
In (a), we re-expressed Equation 3.16 in terms of the probability assigned to
ˆ
p
0
. A similar computation shows that
ˆ
p
0.5
=
ˆ
p
1
=
ˆ
p
1.5
= 0.25, while
ˆ
p
2
= 0.125.
Figure 3.1 illustrates the process of assigning probability mass to different
locations. Intuitively, the solution makes sense: there is less probability mass
near the boundaries of the interval [0, 2]. 4
3.6 Categorical Temporal-Difference Learning
With the use of a projection step, the categorical Monte Carlo method allows us
to approximate the return-distribution function of a given policy from sample
trajectories and using a fixed amount of memory. Like the Monte Carlo method
for value functions, however, it ignores the relationship between successive
states in the trajectory. To leverage this relationship, we turn to categorical
temporal-difference learning (CTD).categorical temporal-difference learning
(CTD)
Consider now a sample transition (x, a, r, x
0
). Like the categorical Monte Carlo
algorithm, CTD maintains a return function estimate
(x) supported on the
evenly-spaced locations {
1
,
...
,
m
}. Like temporal-difference learning, it
learns by bootstrapping from its current return function estimate. In this
case, however, the update target is a probability distribution supported on
{
1
,
...
,
m
}. The algorithm first constructs an intermediate target by scaling the
return distribution
(x
0
) at the next state by the discount factor
, then shifting
it by the sample reward r. That is, if we write
(x
0
)=
m
i=1
p
i
(x
0
)
i
,
68 Chapter 3
then the intermediate target is
˜(x)=
m
i=1
p
i
(x
0
)
r+✓
i
,
which can also be expressed in terms of a pushforward distribution:
˜(x) = (b
r,
)
#
(x
0
) . (3.17)
Observe that the shifting and scaling operations are applied to each particle
in isolation. After shifting and scaling, however, these particles in general no
longer lie in the support of the original distribution. This motivates the use of the
projection step described in the previous section. Let us denote the intermediate
particles by
˜
i
=b
r,
(
i
)=r + ✓
i
.
The CTD target is formed by individually projecting each of these particles
back onto the support, and combining their probabilities. That is,
C
(b
r,
)
#
(x
0
)=
m
j=1
p
j
(x
0
)
C
r+✓
j
.
More explicitly, this is
C
(b
r,
)
#
(x
0
)=
m
j=1
p
j
(x
0
)
(1 (
˜
j
))
(
˜
j
)
+ (
˜
j
)
+
(
˜
j
)
=
m
i=1
i
m
j=1
p
j
(x
0
)
i,j
(r)
, (3.18)
where
i,j
(r)=
(1 (
˜
j
))
{
(
˜
j
)=
i
}
+ (
˜
j
)
{
+
(
˜
j
)=
i
}
.
In Equation 3.18, the second line highlights that the CTD target is a probability
distribution supported on {
1
,
...
,
m
}. The probabilities assigned to specific
locations are given by the inner sum; as shown here, this assignment is obtained
by weighting the next-state probabilities p
j
(x
0
) by the coefficients
i,j
(r).
The CTD update adjusts the return function estimate (x) towards this target:
(x) (1 )(x)+
C
(b
r,
)
#
(x
0
)
. (3.19)
Expressed in terms of the probabilities of the distributions
(x) and
(x
0
), this is
(for i = 1, ..., m)
p
i
(x) (1 )p
i
(x)+
m
j=1
i,j
(r)p
j
(x
0
) . (3.20)
Learning the Return Distribution 69
With this form, we see that the CTD update rule adjusts each probability p
i
(x)
of the return distribution at state x towards a mixture of the probabilities of the
return distribution at the next state (see Algorithm 3.4).
Algorithm 3.4: Online categorical temporal-difference learning
Algorithm parameters: step size 2(0, 1],
policy : X!P(A),
evenly-spaced locations
1
, ,
m
2R,
initial probabilities
(p
0
i
(x))
H
i=0
: x 2X
(p
i
(x))
H
i=0
(p
0
i
(x))
H
i=0
for all x 2X
Loop for each episode:
Observe initial state x
0
Loop for t = 0, 1, ...
Draw a
t
from ( · | x
t
)
Take action a
t
, observe r
t
, x
t+1
ˆ
p
i
0 for i = 1, , m
for j = 1, , m do
if x
t+1
is terminal then
g r
t
else
g r
t
+ ✓
j
if g
1
then
ˆ
p
1
ˆ
p
1
+ p
j
(x
t+1
)
else if g
m
then
ˆ
p
m
ˆ
p
m
+ p
j
(x
t+1
)
else
i
largest i s.t.
i
g
g
i
i
+1
i
ˆ
p
i
ˆ
p
i
+ (1 )p
j
(x
t+1
)
ˆ
p
i
+1
ˆ
p
i
+1
+ p
j
(x
t+1
)
end for
for i = 1, , m do
p
i
(x
t
) (1 )p
i
(x
t
)+
ˆ
p
i
end for
until x
t+1
is terminal
end
70 Chapter 3
1.0 0.5 0.0 0.5 1.0
0.00
0.25
0.50
0.75
1.00
Probability
Return
Probability
Return
Probability
Return
Probability
Return
1.0 0.5 0.0 0.5 1.0
0.00
0.05
0.10
0.15
Probability
Return
1.0 0.5 0.0 0.5 1.0
0.00
0.05
0.10
0.15
Probability
Return
1.0 0.5 0.0 0.5 1.0
0.00
0.05
0.10
0.15
Probability
Return
1.0 0.5 0.0 0.5 1.0
0.00
0.05
0.10
0.15
Probability
Return
1.0 0.5 0.0 0.5 1.0
0.0
0.1
0.2
0.3
0.4
0.5
1.0 0.5 0.0 0.5 1.0
0.00
0.05
0.10
0.15
Probability
Return
1.0 0.5 0.0 0.5 1.0
0.00
0.05
0.10
0.15
Probability
Return
Figure 3.2
The return distribution at the initial state in the Cliffs domain, as predicted by categorical
temporal-difference learning over the course of learning.
Top panels.
The predictions
after e
2
{0, 1000, 2000, 10000} episodes (
= 0.05; see Algorithm 3.4). Here, the return
distributions are initialised by assigning equal probability to all locations.
Bottom panels.
Corresponding predictions when the return distributions initially put all probability mass
on the zero return.
(a) (b) (c)
1.0 0.5 0.0 0.5 1.0
Return
0
2
4
6
8
10
Density
1.0 0.5 0.0 0.5 1.0
Return
0.0
0.1
0.2
0.3
0.4
0.5
Probability
1.0 0.5 0.0 0.5 1.0
Return
0.00
0.05
0.10
0.15
0.20
0.25
Probability
Figure 3.3
(a)
The return distribution at the initial state in the Cliffs domain, visualised using
kernel density estimation (see Figure 2.2).
(b)
The return distribution estimate learned
by the categorical Monte Carlo algorithm with m = 31.
(c)
The estimate learned by the
categorical temporal-difference learning algorithm.
The definition of the intermediate target in pushforward terms (Equation
3.17) illustrates that categorical temporal-difference learning relates to the
distributional Bellman equation
(x)=E
[(b
R,
)
#
(X
0
)|X = x],
analogous to the relationship between TD learning and the classical Bellman
equation. We will continue to explore this relationship in later chapters. How-
ever, the two algorithms usually learn different solutions, due to the introduction
of approximation error from the bootstrapping process.
Learning the Return Distribution 71
Example 3.4.
We can study the behaviour of the categorical temporal-
difference learning algorithm by visualising how its predictions vary over the
course of learning. We apply CTD to approximate the return function of the safe
policy in the Cliffs domain (Example 2.9). Learning takes place online (as per
Algorithm 3.4), using a constant step size of
= 0.05, and return distributions
are approximated with m = 31 locations evenly spaced between –1 and 1.
Figure 3.2 illustrates that the initial return function plays a substantial role in
the speed at which CTD learns a good approximation. Informally, this occurs
because the uniform distribution is closer to the final approximation than the
distribution that puts all of its probability mass on zero (what “close” means in
this context will be the topic of Chapter 4). In addition, we see that categorical
temporal-difference learning learns a different approximation to the true return
function, compared to the categorical Monte Carlo algorithm (Figure 3.3). This
is due to a phenomenon we call diffusion which arises from the combination of
the bootstrapping step and projection; we will study diffusion in Chapter 5.
4
3.7 Learning To Control
A large part of this book considers the problem of learning to predict the
distribution of an agent’s returns. In Chapter 7, we will discuss how one might
instead learn to maximise or control these returns, and the role that distributional
reinforcement learning plays in this endeavour. By learning to control, we
classically mean obtaining (from experience) a policy
that maximises the
expected return:
E
1
t=0
t
R
t
E
1
t=0
t
R
t
, for all .
Such a policy is called an optimal policy. From Section 2.5, recall that the
state-action value function Q
is given by
Q
(x, a)=E
1
t=0
t
R
t
X
0
= x, A
0
= a
.
Any optimal policy
has the property that its state-action value function also
satisfies the Bellman optimality equation:
Q
(x, a)=E
R + max
a
0
2A
Q
(X
0
, a
0
)|X = x, A = a
.
Similar in spirit to temporal-difference learning, Q-learning is an incremental
algorithm that finds an optimal policy. Q-learning maintains a state-action value
function estimate, Q, which it updates according to
Q(x, a) (1 )Q(x, a)+
r + max
a
0
2A
Q(x
0
, a
0
)
. (3.21)
72 Chapter 3
The use of the maximum in the Q-learning update rule results in different
behaviour than TD learning, as the selected action depends on the current
value estimate. We can think of this difference as constructing one target for
each action a
0
, and updating the value estimate towards the largest of such
targets. It can be shown that under the right conditions, Q-learning converges
to the optimal state-action value function Q
, corresponding to the expected
return under any optimal policy. We extract an optimal policy from Q
by
acting greedily with respect to Q
, that is choosing a policy
that selects
maximally-valued actions according to Q
.
The simplest way to extend Q-learning to the distributional setting is to express
the maximal action in Equation 3.21 as a greedy policy. Denote by
a return
function estimate over state-action pairs, such that
(x, a) is the return distri-
bution associated with the state-action pair (x, a)
2X A
. Define the greedy
action
a
(x) = arg max
a2A
E
Z(x,a)
[Z],
breaking ties arbitrarily. The categorical Q-learning update rule is
(x, a) (1 )(x, a)+
C
(b
r,
)
#
x
0
,a
(x
0
)
.
It can be shown that, under the same conditions as Q-learning, the mean of the
return function estimates also converges to Q
. The behaviour of the distribu-
tions themselves, however, may be surprisingly complicated. We can also put
the learned distributions to good use and make decisions on the basis of their
full characterisation, rather than from their mean alone. This forms the topic
of risk-sensitive reinforcement learning. We return to both of these points in
Chapter 7.
3.8 Further Considerations
Categorical temporal-difference learning learns to predict return distributions
from sample experience. As we will see in subsequent chapters, the choices that
we made in designing CTD are not unique, and the algorithm is best thought of
as a jumping-off point into a broad space of methods. For example, an important
question in distributional reinforcement learning asks how we should represent
probability distributions, given a finite memory budget. One issue with the
categorical representation is that it relies on a fixed grid of locations to cover
the range [
1
,
m
], which lacks flexibility and is in many situations inefficient.
We will take a closer look at this issue in Chapter 5. In many practical situations
we also need to deal with a few additional considerations, including the use of
function approximation to deal with very large state spaces (Chapters 9 and 10).
Learning the Return Distribution 73
3.9 Technical Remarks
Remark 3.1
(Non-parametric distributional Monte Carlo algorithm)
.
In
Section 3.4, we saw that the (unprojected) finite-horizon categorical Monte
Carlo algorithm can in theory learn finite-horizon return-distribution functions
when there are only a small number of possible returns. It is possible to extend
these ideas to obtain a straightforward, general-purpose algorithm that can be
sometimes be used to learn an accurate approximation to the return distribution.
Like the sample-mean Monte Carlo method, the non-parametric distributional
Monte Carlo algorithm takes as input K finite-length trajectories with a com-
mon source state x
0
. After computing the sample returns (g
k
)
K
k=1
from these
trajectories, it constructs the estimate
ˆ
(x
0
)=
1
K
K
k=1
g
k
(3.22)
of the return distribution
(x
0
). Here, non-parametric refers to the fact that
the approximating distribution in Equation 3.22 is not described by a finite
collection of parameters; in fact, the memory required to represent this object
may grow linearly with K. Although this is not an issue when K is relatively
small, this can be undesirable when working with large amounts of data, and
moreover precludes the use of function approximation (see Chapters 9 and 10).
However, unlike the categorical Monte Carlo and temporal-difference learning
algorithms presented in this chapter, the accuracy of this estimate is only limited
by the number of trajectories K; we describe various ways to quantify this
accuracy in Remark 4.3. As such, it provides a useful baseline for measuring the
quality of other distributional algorithms. In particular, we used this algorithm
to generate the ground truth return distribution estimates in Example 2.9 and in
Figure 3.3. 4
3.10 Bibliographical Remarks
The development of a distributional algorithm in this chapter follows our own
development of the distributional perspective, beginning with our work on using
compression algorithms in reinforcement learning [Veness et al., 2015].
3.1.
The first-visit Monte Carlo estimate is studied by Singh and Sutton [1996],
where it is used to characterise the properties of replacing eligibility traces [see
also Sutton and Barto, 2018]. Statistical properties of model-based estimates
(which solve for the Markov decision process’s parameters as an intermediate
step) are analysed by Mannor et al. [2007]. Grünewälder and Obermayer [2011]
74 Chapter 3
argue that model-based methods must incur statistical bias, an argument that
also extends to temporal-difference algorithms. Their work also introduces
a refined sample-mean Monte Carlo method that yields a minimum-variance
unbiased estimator (MVUE) of the value function. See Browne et al. [2012]
for a survey of Monte Carlo tree search methods, and Liu [2001], Robert and
Casella [2004], Owen [2013] for further background on Monte Carlo methods
more generally.
3.2.
Incremental algorithms are a staple of reinforcement learning and have
roots in stochastic approximation [Robbins and Monro, 1951, Widrow and Hoff,
1960, Kushner and Yin, 2003] and psychology [Rescorla and Wagner, 1972].
In the control setting, these are also called optimistic policy iteration methods,
and exhibit fairly complex behaviour [Sutton, 1999, Tsitsiklis, 2002].
3.3.
Temporal-difference learning was introduced by Sutton [1984, 1988]. The
sample transition model presented here differs from the standard algorithmic
presentation, but allows us to separate concerns of behaviour (data collection)
from learning. A similar model is used in Bertsekas and Tsitsiklis [1996] to
prove convergence of a broad class of TD methods, and by Azar et al. [2012] to
provide sample efficiency bounds for model-based control.
3.4.
What is effectively the undiscounted finite-horizon categorical Monte
Carlo algorithm was proposed by Veness et al. [2015]. There, the authors the
demonstrate that by means of Bayes’ rule, one can learn the return distribution
by first learning the joint distribution over returns and states (in our notation,
P
(X, G)) by means of a compression algorithm, and subsequently using Bayes
rule to extract P
(G | X). The method proved surprisingly effective at learning
to play Atari 2600 games. Toussaint and Storkey [2006] consider the problem
of control as probabilistic inference, where the reward and trajectory length are
viewed as random variables to be optimised over, again obviating the need to
deal with a potentially large support for the return.
3.5–3.6.
Categorical temporal-difference learning for both prediction and
control was introduced by Bellemare et al. [2017a], in part to address the
shortcomings of the undiscounted algorithm. Its original form contains both the
projection step and the categorical representation as given here. The mixture
update that we study in this chapter is due to Rowland et al. [2018].
3.7.
The Q-learning algorithm is due to Watkins [1989]; see also Watkins and
Dayan [1992]. The explicit construction of a greedy policy is commonly found
in more complex reinforcement learning algorithms, including modified policy
iteration [Puterman and Shin, 1978],
-policy iteration [Bertsekas and Ioffe,
Learning the Return Distribution 75
1996], and non-stationary policy iteration [Scherrer and Lesner, 2012, Scherrer,
2014].
3.9.
The algorithm introduced in Remark 3.1 is essentially an application of the
standard Monte Carlo method to the return distribution, and is a special case of
the framework set out by Chandak et al. [2021], who also analyse the statistical
properties of the approach.
3.11 Exercises
Exercise 3.1.
Suppose that we begin with the initial value function estimate
V(x) = 0 for all x 2X.
(i)
Consider first the setting in which we are given sample returns for a single
state x . Show that in this case, the incremental Monte Carlo algorithm
(Equation 3.6), instantiated with
k
=
1
k+1
, is equivalent to computing the
sample-mean Monte Carlo estimate for x. That is, after processing the
sample returns g
1
, ..., g
K
, we have
V
K
(x)=
1
K
K
i=1
g
i
.
(ii)
Now consider the case where source states are drawn from a distribution
, with
(x) > 0 for all x, and let N
k
(x
k
) be the number of times x
k
has been
updated up to but excluding time k. Show that the appropriate step size to
match the sample-mean Monte Carlo estimate is
k
=
1
N
k
(x
k
)+1
. 4
Exercise 3.2. Recall from Exercise 2.8 the n-step Bellman equation:
V
(x)=E
n–1
t=0
t
R
t
+
n
V
(X
n
)|X
0
= x
.
Explain what a sensible n-step temporal-difference learning update rule might
look like. 4
Exercise 3.3.
The Cart-Pole domain is a small, two-dimensional reinforcement
learning problem [Barto et al., 1983]. In this problem, the learning agent must
balance a swinging pole that is a attached to a moving cart. Using an open-
source implementation of Cart-Pole, implement the undiscounted finite-horizon
categorical Monte Carlo algorithm and evaluate its behaviour. Construct a finite
state space by discretising the four-dimensional state space using a uniform grid
of size 10
10
10
10. Plot the learned return distribution for a fixed initial
state and other states of your choice when
76 Chapter 3
(i) The policy chooses actions uniformly at random;
(ii)
The policy moves the cart in the direction that the pole is leaning towards.
4
You may want to pick the horizon H to be the maximum length of an episode.
Exercise 3.4.
Implement the categorical Monte Carlo (CMC), non-parametric
categorical Monte Carlo (Remark 3.1), and categorical temporal-difference
learning (CTD) algorithms. For a discount factor
= 0.99, compare the return
distributions learned by these three algorithms on the Cart-Pole domain of the
previous exercise. For CMC and CTD, vary the number of particles m and the
range of the support [
1
,
m
]. How do the approximations vary as a function of
these parameters? 4
Exercise 3.5.
Suppose that the rewards R
t
take on one of N
R
values. Consider
the undiscounted finite-horizon return
G =
H–1
t=0
R
t
.
Denote by N
G
the number of possible realisations of G.
(i) Show that N
G
can be as large as
N
R
+H–1
H
.
(ii) Derive a better bound when R
t
2{0, 1, ..., N
R
1}.
(iii)
Explain, in words, why the bound is better in this case. Are there other sets
of rewards for which N
G
is smaller than the worst-case from (i)? 4
Exercise 3.6.
Recall from Section 3.5 that the categorical Monte Carlo
algorithm aims to find the approximation
ˆ
m
(x)=
C
(x), x 2X ,
where we use the subscript m to more explicitly indicate that the quality of this
approximation depends on the number of locations.
Consider again the problem setting of Example 3.3. For m
2, suppose that we
take
1
= 0 and
m
= 2, such that
i
=2
i –1
m –1
i = 1, ...m .
Show that in this case
ˆ
m
(x) converges to the uniform distribution on the interval
[0, 2], in the sense that
lim
m!1
ˆ
m
(x)
[a, b]
!
b a
2
Learning the Return Distribution 77
for all a < b; recall that
([a, b]) denotes the probability assigned to the interval
[a, b] by a probability distribution . 4
Exercise 3.7.
The purpose of this exercise is to demonstrate that the categorical
Monte Carlo algorithm is what we call mean-preserving. Consider a sequence
of state-return pairs (x
k
, g
k
)
K
k=1
, and an evenly-spaced grid {
1
,
...
,
m
}, m
2.
Suppose that rewards are bounded in [R
MIN
, R
MAX
].
(i)
Suppose that V
MIN
1
and V
MAX
m
. For a given g
2
[V
MIN
, V
MAX
], show
that the distribution =
C
g
satisfies
E
Z
[Z]=g .
(ii)
Based on this, argue that if
(x) is a distribution with mean V, then after
applying the update
(x) (1 )(x)+
C
g
the mean of (x) is (1 )V + g.
(iii)
By comparing with the incremental Monte Carlo update rule, explain why
categorical Monte Carlo can be said to be mean-preserving.
(iv)
Now suppose that [V
MIN
, V
MAX
]
6✓
[
1
,
m
]. How are the preceding results
affected?
4
Exercise 3.8.
Following the notation of Section 3.5, consider the nearest
neighbour projection method
NN
g =
(g) if (g) 0.5
+
(g) otherwise.
Show that this projection is not mean-preserving in the sense of the preced-
ing exercise. Implement and evaluate on the Cart-Pole domain. What do you
observe? 4