5
Distributional Dynamic Programming
Markov decision processes model the dynamics of an agent exerting control over its environment. Once the agent's policy is selected, a Markov decision process gives rise to a sequential system whose behaviour we would like to characterise. In particular, policy evaluation describes the process of determining the returns obtained from following a policy $\pi$. Algorithmically, this translates into the problem of computing the value or return-distribution function given the parameters of the Markov decision process and the agent's policy.

Computing the return-distribution function requires being able to describe the output of the algorithm in terms of atomic objects (depending on the programming language, these may be bits, floating point numbers, vectors, or even functions). This is challenging because in general, return distributions take on a continuum of values, i.e. they are infinite-dimensional objects. By contrast, the expected return from a state $x$ is described by a single real number. Defining an algorithm that computes return-distribution functions first requires us to decide how we represent probability distributions in memory, knowing that some approximation error must be incurred if we want to keep things finite.

This chapter takes a look at different representations of probability distributions as they relate to the problem of computing return-distribution functions. We will see that, unlike the relatively straightforward problem of computing value functions, there is no obviously-best representation for return-distribution functions, and that different finite-memory representations offer different advantages. We will also see that making effective use of different representations requires different algorithms.
5.1 Computational Model
As before, we assume that the environment is described as a finite-state, finite-action Markov decision process. We write $N_{\mathcal{X}}$ and $N_{\mathcal{A}}$ for the size of the state and action spaces $\mathcal{X}$ and $\mathcal{A}$. When describing algorithms in this chapter, we will further assume that the reward distributions $P_{\mathcal{R}}(\cdot \mid x, a)$ are supported on a finite set $\mathcal{R}$ of size $N_{\mathcal{R}}$; we discuss a way of lifting this assumption in Remark 5.1. Of note, having finitely many rewards guarantees the existence of an interval $[V_{\text{MIN}}, V_{\text{MAX}}]$ within which the returns lie.^{35} We measure the complexity of a particular algorithm in terms of the number of atomic instructions or memory words it requires, assuming that these can reasonably be implemented in a physical computer, as described by the random-access machine (RAM) model of computation [Cormen et al., 2001].

35. We can always take $R_{\text{MIN}}$ and $R_{\text{MAX}}$ to be the smallest and largest possible rewards, respectively.
In classical reinforcement learning, linear algebra provides a simple algorithm for computing the value function of a policy $\pi$. In vector notation, the Bellman equation is
$$V^\pi = r^\pi + \gamma P^\pi V^\pi , \qquad (5.1)$$
where the transition function $P^\pi$ is represented as an $N_{\mathcal{X}}$-dimensional square stochastic matrix, and $r^\pi$ is an $N_{\mathcal{X}}$-dimensional vector. With some matrix algebra we deduce that
$$V^\pi = r^\pi + \gamma P^\pi V^\pi \iff (I - \gamma P^\pi) V^\pi = r^\pi \iff V^\pi = (I - \gamma P^\pi)^{-1} r^\pi . \qquad (5.2)$$
The computational cost of determining $V^\pi$ is dominated by the matrix inversion, requiring $O(N_{\mathcal{X}}^3)$ operations. The result is exact. The matrix $P^\pi$ and the vector $r^\pi$ are constructed entry-wise by writing expectations as sums:
$$P^\pi(x' \mid x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) P_{\mathcal{X}}(x' \mid x, a)$$
$$r^\pi(x) = \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \pi(a \mid x) P_{\mathcal{R}}(r \mid x, a)\, r .$$
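As a concrete illustration, a minimal NumPy sketch of Equation 5.2, assuming hypothetical arrays `P_pi` and `r_pi` constructed entry-wise as above:

```python
import numpy as np

def evaluate_policy_exact(P_pi: np.ndarray, r_pi: np.ndarray, gamma: float) -> np.ndarray:
    """Solve V = r + gamma * P V exactly, per Equation 5.2.

    P_pi: (N_X, N_X) state-to-state transition matrix under the policy.
    r_pi: (N_X,) expected reward vector under the policy.
    """
    n = P_pi.shape[0]
    # Solving the linear system is preferable to forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```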
When the matrix inversion is undesirable, the value function can instead be found by dynamic programming. Dynamic programming describes a wide variety of computational methods that find the solution to a given problem by caching intermediate results. In reinforcement learning, the dynamic programming approach for finding the value function $V^\pi$, also called iterative policy evaluation, begins with an initial estimate $V_0 \in \mathbb{R}^{\mathcal{X}}$ and successively computes
$$V_{k+1} = T^\pi V_k$$
for $k = 1, 2, \dots$, until some desired number of iterations $K$ have been performed or some stopping criterion is reached. This is possible because, when $\mathcal{X}$, $\mathcal{A}$, and $\mathcal{R}$ are all finite, the Bellman operator can be written in terms of sums:
$$(T^\pi V_k)(x) = \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} \underbrace{P_\pi(A = a, R = r, X' = x' \mid X = x)}_{\pi(a \mid x) P_{\mathcal{X}}(x' \mid x, a) P_{\mathcal{R}}(r \mid x, a)} \big( r + \gamma V_k(x') \big) . \qquad (5.3)$$
A naive implementation expresses these sums as nested FOR loops. Since the new value function must be computed at all states, this naive implementation requires on the order of $N = N_{\mathcal{X}}^2 N_{\mathcal{A}} N_{\mathcal{R}}$ operations. We can do better by implementing it in terms of vector operations:
$$V_{k+1} = r^\pi + \gamma P^\pi V_k , \qquad (5.4)$$
where $V_k$ is stored in memory as an $N_{\mathcal{X}}$-dimensional vector. A single application of the Bellman operator with vectors and matrices requires $O(N_{\mathcal{X}}^2 N_{\mathcal{A}} + N_{\mathcal{X}} N_{\mathcal{A}} N_{\mathcal{R}})$ operations for computing $r^\pi$ and $P^\pi$, and $O(N_{\mathcal{X}}^2)$ operations for the matrix-vector multiplication. As $r^\pi$ and $P^\pi$ do not need to be recomputed between iterations, the dominant cost of this process comes from the successive matrix multiplications, requiring $O(K N_{\mathcal{X}}^2)$ operations.^{36}

36. Assuming that the number of states $N_{\mathcal{X}}$ is large compared to the number of actions $N_{\mathcal{A}}$ and rewards $N_{\mathcal{R}}$. For transition functions with special structure (sparsity, low rank, etc.) one can hope to do even better.
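A minimal sketch of the vectorised update in Equation 5.4, reusing the hypothetical `P_pi` and `r_pi` arrays from the previous sketch:

```python
import numpy as np

def iterative_policy_evaluation(P_pi: np.ndarray, r_pi: np.ndarray,
                                gamma: float, num_iterations: int) -> np.ndarray:
    """Apply the Bellman operator K times, per Equation 5.4."""
    V = np.zeros(P_pi.shape[0])          # initial estimate V_0 = 0
    for _ in range(num_iterations):
        V = r_pi + gamma * P_pi @ V      # V_{k+1} = r^pi + gamma * P^pi V_k
    return V
```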
In general, the iterates $(V_k)_{k \ge 0}$ will not reach $V^\pi$ after any finite number of iterations. However, the contractive nature of the Bellman operator allows us to bound the distance from any iterate to the fixed point $V^\pi$.

Proposition 5.1. Let $V_0 \in \mathbb{R}^{\mathcal{X}}$ be an initial value function and consider the iterates
$$V_{k+1} = T^\pi V_k . \qquad (5.5)$$
For any $\varepsilon > 0$, if we take
$$K_\varepsilon \ge \frac{\log \frac{1}{\varepsilon} + \log \lVert V_0 - V^\pi \rVert_\infty}{\log \frac{1}{\gamma}} ,$$
then for all $k \ge K_\varepsilon$ we have that
$$\lVert V_k - V^\pi \rVert_\infty \le \varepsilon .$$
For $V_0 = 0$, the dependency on $V^\pi$ can be simplified by noting that
$$\log\big( \lVert V_0 - V^\pi \rVert_\infty \big) = \log\big( \lVert V^\pi \rVert_\infty \big) \le \log\big( \max(|V_{\text{MIN}}|, |V_{\text{MAX}}|) \big) . \qquad\triangle$$

Proof. Since $T^\pi$ is a contraction mapping with respect to the $L_\infty$ metric with contraction modulus $\gamma$ (Proposition 4.4), and $V^\pi$ is its fixed point (Proposition 2.12), we have for any $k \ge 1$
$$\lVert V_k - V^\pi \rVert_\infty = \lVert T^\pi V_{k-1} - T^\pi V^\pi \rVert_\infty \le \gamma \lVert V_{k-1} - V^\pi \rVert_\infty ,$$
and so by induction we have
$$\lVert V_k - V^\pi \rVert_\infty \le \gamma^k \lVert V_0 - V^\pi \rVert_\infty .$$
Setting the right-hand side to be less than or equal to $\varepsilon$ and rearranging gives the required inequality for $K_\varepsilon$.

From Proposition 5.1, we conclude that we can obtain an $\varepsilon$-approximation to $V^\pi$ in $O(K_\varepsilon N_{\mathcal{X}}^2)$ operations, by applying the Bellman operator $K_\varepsilon$ times to an initial value function $V_0 = 0$. Since the iterate $V_k$ can be represented as an $N_{\mathcal{X}}$-dimensional vector and is the only object that the algorithm needs to store in memory (other than the description of the MDP itself), this shows that iterative policy evaluation can approximate $V^\pi$ efficiently.
5.2 Representing Return-Distribution Functions
Now, let us consider what happens in distributional reinforcement learning.
As with any computational problem, we first must decide on a data structure
that our algorithms operate on. The heart of our data structure is a scheme for
representing return-distribution functions in memory. We call such a scheme a
probability distribution representation.
Definition 5.2. A probability distribution representation $\mathcal{F}$, or simply representation, is a collection of probability distributions indexed by a parameter from some set of allowed parameters $\Theta$:
$$\mathcal{F} = \big\{ P_\theta \in \mathscr{P}(\mathbb{R}) : \theta \in \Theta \big\} . \qquad\triangle$$

Example 5.3. The Bernoulli representation is the set of all Bernoulli distributions:
$$\mathcal{F}_B = \big\{ (1 - p)\delta_0 + p\,\delta_1 : p \in [0, 1] \big\} . \qquad\triangle$$

Example 5.4. The uniform representation is the set of all uniform distributions on finite-length intervals:
$$\mathcal{F}_U = \big\{ U([a, b]) : a, b \in \mathbb{R},\ a < b \big\} . \qquad\triangle$$

We represent return functions using a table of probability distributions, each associated with a given state and described in our chosen representation. For example, a uniform return function is described in memory by a table of $2N_{\mathcal{X}}$ numbers, corresponding to the upper and lower ends of the distribution at each state. By extension, we call such a table a representation of return-distribution functions. Formally, for a representation $\mathcal{F}$, the space of representable return functions is $\mathcal{F}^{\mathcal{X}}$.
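For instance, a minimal sketch of such a table for the uniform representation, with a hypothetical number of states:

```python
import numpy as np

# A uniform return-distribution function over N_X states, stored as a table of
# 2 * N_X numbers: row x holds the interval endpoints (a, b) of eta(x).
num_states = 4                       # hypothetical N_X
uniform_return_fn = np.zeros((num_states, 2))
uniform_return_fn[2] = (0.0, 1.5)    # eta(x_2) = U([0, 1.5])
```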
With this data structure in mind, let us consider the procedure (introduced by Equation 4.10) that approximates the return function $\eta^\pi$ by repeatedly applying the distributional Bellman operator:
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k . \qquad (5.6)$$
Because an operator is an abstract object, Equation 5.6 describes a mathematical procedure, rather than a computer program. To obtain the latter, we begin by expressing the distributional Bellman operator as a sum, analogous to Equation 5.3. Recall that the distributional operator is defined by an expectation over the random variables $R$ and $X'$:
$$(\mathcal{T}^\pi \eta)(x) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\#\, \eta(X') \mid X = x \big] . \qquad (5.7)$$

Here, the expectation describes a mixture of pushforward distributions. By writing this expectation in full, we find that this mixture is given by
$$(\mathcal{T}^\pi \eta)(x) = \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_\pi\big( A = a, R = r, X' = x' \mid X = x \big)\, (b_{r,\gamma})_\#\, \eta(x') . \qquad (5.8)$$
The pushforward operation scales and then shifts (by $\gamma$ and $r$, respectively) the support of the probability distribution $\eta(x')$, as depicted in Figure 2.5. Implementing the distributional Bellman operator therefore requires being able to efficiently perform the shift-and-scale operation and compute mixtures of probability distributions; we caught a glimpse of what that might entail when we derived categorical temporal-difference learning in Chapter 3.

Cumulative distribution functions allow us to rewrite Equations 5.7–5.8 in terms of vector-like objects, providing a nice parallel with the usual vector notation for the expected-value setting. Let us write $F_\eta(x, z) = F_{\eta(x)}(z)$ to denote the cumulative distribution function of $\eta(x)$. We can equally express Equation 5.7 as^{37}
$$(\mathcal{T}^\pi F_\eta)(x, z) = \mathbb{E}_\pi\Big[ F_\eta\Big( X', \frac{z - R}{\gamma} \Big) \,\Big|\, X = x \Big] . \qquad (5.9)$$

37. With Chapter 4 in mind, note that we are overloading operator notation when we apply $\mathcal{T}^\pi$ to collections of cumulative distribution functions, rather than return-distribution functions.

As a weighted sum of cumulative distribution functions, Equation 5.9 is
$$(\mathcal{T}^\pi F_\eta)(x, z) = \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_\pi\big( A = a, R = r, X' = x' \mid X = x \big)\, F_\eta\Big( x', \frac{z - r}{\gamma} \Big) . \qquad (5.10)$$
Similar to Equation 5.1, we can set $F_\eta = F_{\eta^\pi}$ on both sides of Equation 5.10 to obtain the linear system
$$F_{\eta^\pi}(x, z) = \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_\pi\big( A = a, R = r, X' = x' \mid X = x \big)\, F_{\eta^\pi}\Big( x', \frac{z - r}{\gamma} \Big) .$$

However, this particular set of equations is infinite-dimensional. This is because cumulative distribution functions are themselves infinite-dimensional objects, more specifically elements of the space of monotonically increasing functions mapping $\mathbb{R}$ to $[0, 1]$. Because of this we cannot describe them in physical memory, at least not on a modern-day computer. This gives a concrete argument as to why we cannot simply “store” a probability distribution, but must instead use a probability distribution representation as our data structure. For the same reason, a direct algebraic solution to Equation 5.10 is not possible, in contrast to the expected-value setting (Equation 5.2). This justifies the need for a dynamic programming method to approximate $\eta^\pi$.

Creating an algorithm for computing the return-distribution function requires us to implement the distributional Bellman operator in terms of our chosen representation. Conversely, we should choose a representation that supports an efficient implementation. Unlike the value function setting, however, there is no single best representation: making this choice requires balancing available memory, accuracy, and the downstream uses of the return-distribution function. The rest of this chapter is dedicated to studying these trade-offs and developing a theory of what makes for a good representation. Like Goldilocks faced with her choices, we first consider the situation where memory and computation are plentiful, then the use of normal distributions to construct a minimally-viable return-distribution function, before finally introducing fixed-size empirical representations as a sensible and practical middle ground.
5.3 The Empirical Representation
Simple representations like the Bernoulli representation are ill-suited to describe, say, the different outcomes in blackjack (Example 2.7) or the variations in the return distributions from different policies (Example 2.9). Although there are scenarios in which a representation with few parameters gives a reasonable approximation, the most general-purpose algorithms for computing return distributions should be based on representations that are sufficiently expressive.

To understand what “sufficiently expressive” might mean, let us consider what it means to implement the iterative procedure
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k . \qquad (5.11)$$
The most direct implementation is a FOR loop over the iteration number $k = 0, 1, \dots$, interleaving
(a) Determining $(\mathcal{T}^\pi \eta_k)(x)$ for each $x \in \mathcal{X}$, and
(b) Expressing the outcome as a return-distribution function $\eta_{k+1} \in \mathcal{F}^{\mathcal{X}}$.

The first step of this procedure is an algorithm that emulates the operator $\mathcal{T}^\pi$, which we may call the operator-algorithm. When used as part of the FOR loop, the output of this operator-algorithm at iteration $k$ becomes the input at iteration $k + 1$. As such, it is desirable for the inputs and outputs of the operator-algorithm to have the same type: given as input a return function represented by $\mathcal{F}$, the operator-algorithm should produce a new return function that is also represented by $\mathcal{F}$. A prerequisite is that the representation $\mathcal{F}$ be closed under the operator $\mathcal{T}^\pi$, in the sense that
$$\eta \in \mathcal{F}^{\mathcal{X}} \implies \mathcal{T}^\pi \eta \in \mathcal{F}^{\mathcal{X}} . \qquad (5.12)$$
The empirical representation satisfies this desideratum.

Definition 5.5. The empirical representation is the set $\mathcal{F}_E$ of empirical distributions
$$\mathcal{F}_E = \Big\{ \sum_{i=1}^m p_i\, \delta_{\theta_i} : m \in \mathbb{N}^+,\ \theta_i \in \mathbb{R},\ p_i \ge 0,\ \sum_{i=1}^m p_i = 1 \Big\} . \qquad\triangle$$

An empirical distribution in $\mathcal{F}_E$ can be stored in memory as a finite list of pairs $(\theta_i, p_i)_{i=1}^m$. We call individual elements of such a distribution particles, each consisting of a probability and a location. Notationally, we extend the empirical representation to return distributions by writing
$$\eta(x) = \sum_{i=1}^{m(x)} p_i(x)\, \delta_{\theta_i(x)} \qquad (5.13)$$
for the return distribution corresponding to state $x$.
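A minimal sketch of this storage scheme, with hypothetical type names:

```python
from typing import Dict, List, Tuple

# Particle = (location theta_i, probability p_i); a return-distribution function
# in the empirical representation is a table mapping each state to its particles.
Particle = Tuple[float, float]
ReturnFunction = Dict[int, List[Particle]]

eta: ReturnFunction = {
    0: [(0.0, 0.5), (1.0, 0.5)],   # eta(x_0) = 0.5 * delta_0 + 0.5 * delta_1
    1: [(2.0, 1.0)],               # eta(x_1) = delta_2
}
```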
The application of the distributional Bellman operator to empirical probability distributions has a particularly convenient form, which we formalise with the following lemma and proposition.
Lemma 5.6. Let $\nu \in \mathcal{F}_E$ be an empirical distribution with parameters $m$ and $(\theta_i, p_i)_{i=1}^m$. For $r \in \mathbb{R}$ and $\gamma \in \mathbb{R}$, we have
$$(b_{r,\gamma})_\#\, \nu = \sum_{i=1}^m p_i\, \delta_{r + \gamma\theta_i} . \qquad\triangle$$

In words, the application of the bootstrap function to an empirical distribution shifts and scales the locations of that distribution (see Exercise 5.7). This property was implicit in our description of the pushforward operation in Chapter 2, and we made use of it (also implicitly) to derive the categorical temporal-difference learning algorithm in Chapter 3.

Proposition 5.7. Provided that the set of possible rewards $\mathcal{R}$ is finite, the empirical representation $\mathcal{F}_E$ is closed under $\mathcal{T}^\pi$. In particular, if $\eta \in \mathcal{F}_E^{\mathcal{X}}$ is a return-distribution function with parameters
$$\big\{ \big( p_i(x), \theta_i(x) \big)_{i=1}^{m(x)} : x \in \mathcal{X} \big\} ,$$
then
$$(\mathcal{T}^\pi \eta)(x) = \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_\pi(A = a, R = r, X' = x' \mid X = x) \sum_{i=1}^{m(x')} p_i(x')\, \delta_{r + \gamma\theta_i(x')} . \qquad (5.14)\quad\triangle$$
Proof. Pick a state $x \in \mathcal{X}$. For a triple $(a, r, x') \in \mathcal{A} \times \mathcal{R} \times \mathcal{X}$, write $P_{a,r,x'} = P_\pi\big( A = a, R = r, X' = x' \mid X = x \big)$. Then
$$(\mathcal{T}^\pi \eta)(x) \overset{(a)}{=} \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_{a,r,x'}\, (b_{r,\gamma})_\#\, \eta(x')$$
$$= \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_{a,r,x'}\, (b_{r,\gamma})_\# \Big( \sum_{i=1}^{m(x')} p_i(x')\, \delta_{\theta_i(x')} \Big)$$
$$\overset{(b)}{=} \sum_{a \in \mathcal{A}} \sum_{r \in \mathcal{R}} \sum_{x' \in \mathcal{X}} P_{a,r,x'} \sum_{i=1}^{m(x')} p_i(x')\, \delta_{r + \gamma\theta_i(x')}$$
$$= \sum_{j=1}^{m'} p'_j\, \delta_{\theta'_j}$$
for some collection $(\theta'_j, p'_j)_{j=1}^{m'}$. Line (a) is Equation 5.8 and (b) follows from Lemma 5.6. We conclude that $(\mathcal{T}^\pi \eta)(x) \in \mathcal{F}_E$, and hence $\mathcal{F}_E$ is closed under $\mathcal{T}^\pi$.
Algorithm 5.1 uses Proposition 5.7 to compute the application of the distributional Bellman operator to any $\eta$ represented by $\mathcal{F}_E$. It implements Equation 5.14 almost verbatim, with two simplifications. First, it uses the fact that the particle locations for a distribution $(\mathcal{T}^\pi \eta)(x)$ only depend on $r$ and $x'$, but not on $a$. This allows us to produce a single particle for each reward–next-state pair. Second, it also encodes the fact that the return is 0 from the terminal state, making dynamic programming more effective from this state (Exercise 5.3 asks you to justify this claim). Since the output of Algorithm 5.1 is also a return function from $\mathcal{F}_E$, we can use it to produce the iterates $\eta_1, \eta_2, \dots$ from an initial return function $\eta_0 \in \mathcal{F}_E^{\mathcal{X}}$.
Algorithm 5.1: Empirical representation distributional Bellman operator

Algorithm parameters: $\eta$, expressed as $\big\{ (\theta_i(x), p_i(x))_{i=1}^{m(x)} : x \in \mathcal{X} \big\}$

foreach $x \in \mathcal{X}$ do
  $\eta'(x) \leftarrow$ EMPTY LIST
  foreach $x' \in \mathcal{X}$ do
    foreach $r \in \mathcal{R}$ do
      $P_{r,x'} \leftarrow \sum_{a \in \mathcal{A}} \pi(a \mid x) P_{\mathcal{R}}(r \mid x, a) P_{\mathcal{X}}(x' \mid x, a)$
      if $x'$ is terminal then
        APPEND $\big( r,\ P_{r,x'} \big)$ to $\eta'(x)$
      else
        for $i = 1, \dots, m(x')$ do
          APPEND $\big( r + \gamma\theta_i(x'),\ P_{r,x'}\, p_i(x') \big)$ to $\eta'(x)$
        end for
      end if
    end foreach
  end foreach
end foreach
return $\eta'$
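The following Python sketch mirrors Algorithm 5.1 under the assumption that states are indexed $0, \dots, N_{\mathcal{X}} - 1$ and that the Markov decision process is given by hypothetical arrays `policy[x, a]`, `P_X[x, a, x2]`, and `P_R[x, a, r_idx]` over the reward values `rewards[r_idx]`:

```python
import numpy as np

def empirical_bellman_operator(eta, policy, P_X, P_R, rewards, gamma, terminal):
    """One application of the distributional Bellman operator (Algorithm 5.1).

    eta: dict mapping state index -> list of (location, probability) particles.
    policy: (N_X, N_A) action probabilities; P_X: (N_X, N_A, N_X) transitions;
    P_R: (N_X, N_A, N_R) reward probabilities over the values in `rewards`.
    terminal: boolean array marking terminal states.
    """
    num_states = policy.shape[0]
    new_eta = {}
    for x in range(num_states):
        particles = []
        for x2 in range(num_states):
            for r_idx, r in enumerate(rewards):
                # P(R = r, X' = x2 | X = x), marginalising over actions.
                w = float(np.sum(policy[x] * P_R[x, :, r_idx] * P_X[x, :, x2]))
                if w == 0.0:
                    continue
                if terminal[x2]:
                    particles.append((r, w))              # return is 0 from x2
                else:
                    for theta, p in eta[x2]:
                        particles.append((r + gamma * theta, w * p))
        new_eta[x] = particles
    return new_eta
```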
Because $\mathcal{F}_E$ is closed under $\mathcal{T}^\pi$, we can analyse the behaviour of this procedure using the theory of contraction mappings (Chapter 4). Here we bound the number of iterations needed to obtain an $\varepsilon$-approximation to the return-distribution function $\eta^\pi$, as measured by a supremum $p$-Wasserstein metric.

Proposition 5.8. Consider an initial return function $\eta_0(x) = \delta_0$ for all $x \in \mathcal{X}$ and the dynamic programming approach that iteratively computes
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$
by means of Algorithm 5.1. Let $\varepsilon > 0$ and let
$$K_\varepsilon \ge \frac{\log \frac{1}{\varepsilon} + \log\big( \max(|V_{\text{MIN}}|, |V_{\text{MAX}}|) \big)}{\log \frac{1}{\gamma}} .$$
Then for all $k \ge K_\varepsilon$, we have that
$$\overline{w}_p(\eta_k, \eta^\pi) \le \varepsilon \qquad \forall p \in [1, \infty] ,$$
where $\overline{w}_p$ is the supremum $p$-Wasserstein distance. $\triangle$
Proof. Similar to the proof of Proposition 5.1, since $\mathcal{T}^\pi$ is a contraction mapping with respect to the $\overline{w}_p$ metric with contraction modulus $\gamma$ (Proposition 4.15), and $\eta^\pi$ is its fixed point (Propositions 2.17 and 4.9), we can deduce
$$\overline{w}_p(\eta_k, \eta^\pi) \le \gamma^k\, \overline{w}_p(\eta_0, \eta^\pi) .$$
Since $\eta_0 = \delta_0$ and $\eta^\pi$ is supported on $[V_{\text{MIN}}, V_{\text{MAX}}]$, we can upper-bound $\overline{w}_p(\eta_0, \eta^\pi)$ by $\max(|V_{\text{MIN}}|, |V_{\text{MAX}}|)$. Setting the right-hand side to be less than or equal to $\varepsilon$ and rearranging gives the required inequality for $K_\varepsilon$.
Although we state Proposition 5.8 in terms of the $p$-Wasserstein distance for concreteness, we can also obtain similar results more generally for probability metrics under which the distributional Bellman operator is contractive.

The analysis of this section shows that the empirical representation is sufficiently expressive to support an iterative procedure for approximating the return function $\eta^\pi$ to an arbitrary precision, due to being closed under the distributional Bellman operator. This result is perhaps somewhat surprising: even if $\eta^\pi \notin \mathcal{F}_E^{\mathcal{X}}$, we are able to obtain an arbitrarily accurate approximation within $\mathcal{F}_E^{\mathcal{X}}$.

Example 5.9. Consider the single-state Markov decision process of Example 2.10, with Bernoulli reward distribution $U(\{0, 1\})$ and discount factor $\gamma = 1/2$. Beginning with $\eta_0(x) = \delta_0$, the return distributions of the iterates $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ are a collection of uniformly-weighted, evenly-spaced Dirac deltas:
$$\eta_k(x) = \frac{1}{2^k} \sum_{i=0}^{2^k - 1} \delta_{\theta_i} , \qquad \theta_i = \frac{i}{2^{k-1}} . \qquad (5.15)$$
As suggested by Figure 2.3, the sequence of distributions $(\eta_k(x))_{k \ge 0}$ converges to the uniform distribution $U([0, 2])$ in the $p$-Wasserstein distances, for all $p \in [1, \infty]$. However, this limit is not itself an empirical distribution. $\triangle$
The downside is that the algorithm is typically intractable for anything but a small number of iterations $K$. This is because the lists that describe $\eta_k$ may grow exponentially in length with $K$, as shown in the example above. Even when $\eta_0$ is initialised to be $\delta_0$ at all states (as in Proposition 5.8), representing the $k$th iterate requires $O\big( N_{\mathcal{X}}^k N_{\mathcal{R}}^k \big)$ particles per state, corresponding to all achievable discounted returns of length $k$.^{38} This is somehow unavoidable: in a certain sense, the problem of computing return functions is NP-hard (see Remark 5.2). This motivates the need for a more complex procedure that forgoes closedness in favour of tractability.

38. A smarter implementation only requires $O\big( N_{\mathcal{R}}^k \big)$ particles per state. See Exercise 5.8.
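To make the growth concrete, a small sketch (reusing the `empirical_bellman_operator` sketch above) that counts particles per iterate for the single-state Markov decision process of Example 5.9:

```python
import numpy as np

# Single-state MDP of Example 5.9: one action, rewards {0, 1} w.p. 1/2, gamma = 1/2.
policy = np.ones((1, 1))
P_X = np.ones((1, 1, 1))                  # always transition back to the same state
P_R = np.full((1, 1, 2), 0.5)
rewards = np.array([0.0, 1.0])
terminal = np.array([False])

eta = {0: [(0.0, 1.0)]}                   # eta_0(x) = delta_0
for k in range(1, 6):
    eta = empirical_bellman_operator(eta, policy, P_X, P_R, rewards, 0.5, terminal)
    print(k, len(eta[0]))                 # particle count doubles: 2, 4, 8, 16, 32
```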
5.4 The Normal Representation
To avoid the computational costs associated with unbounded memory usage, we may restrict ourselves to probability distributions described by a fixed number of parameters. A simple choice is to model the return with a normal distribution, which requires only two parameters per state: a mean $\mu$ and variance $\sigma^2$.

Definition 5.10. The normal representation is the set of normal distributions
$$\mathcal{F}_N = \big\{ N(\mu, \sigma^2) : \mu \in \mathbb{R},\ \sigma^2 \ge 0 \big\} . \qquad\triangle$$
With this representation, a return function is described by a total of $2N_{\mathcal{X}}$ parameters.
More often than not, however, the random returns are not normally distributed. This may be because the rewards themselves are not normally distributed, or because the transition kernel is stochastic. Figure 5.1 illustrates the effect of the distributional Bellman operator on a normal return-distribution function $\eta$: mixing the return distributions at successor states results in a mixture of normal distributions, which is not normally distributed except in trivial situations. In other words, the normal representation $\mathcal{F}_N$ is generally not closed under the distributional Bellman operator.

Figure 5.1: Applying the distributional Bellman operator to a return function $\eta$ described by the normal representation generally produces return distributions that are not normally distributed. Here, a uniform mixture of two normal distributions (shown as probability densities) results in a bimodal distribution (in light grey). The best normal approximation $\hat{\eta}$ to that distribution is depicted by the solid curve.
Rather than represent the return distribution with high accuracy, as with the empirical distribution, let us consider the more modest goal of determining the best normal approximation to the return-distribution function $\eta^\pi$. We define “best normal approximation” as
$$\hat{\eta}(x) = N\big( V^\pi(x),\ \operatorname{Var}\big( G^\pi(x) \big) \big) . \qquad (5.16)$$
Given that a normal distribution is parametrised by its mean and variance, this is an obviously sensible choice. In some cases, this choice can also be justified by arguing that $\hat{\eta}(x)$ is the normal distribution closest to $\eta^\pi(x)$ in terms of what is called the Kullback–Leibler divergence.^{39} As we now show, this choice also leads to a particularly efficient algorithm for computing $\hat{\eta}$.

39. Technically, this is only true when the return distribution $\eta^\pi(x)$ has a probability density function. When $\eta^\pi(x)$ does not have a density, a similar argument can be made in terms of the cross-entropy loss; see Exercises 5.5 and 5.6.
We will construct an iterative procedure that operates on return functions from a normal representation and converges to $\hat{\eta}$. We start with the random variable operator
$$(\mathcal{T}^\pi G)(x) \overset{D}{=} R + \gamma G(X'), \qquad X = x , \qquad (5.17)$$
and take expectations on both sides of the equation:^{40}
$$\mathbb{E}\big[ (\mathcal{T}^\pi G)(x) \big] = \mathbb{E}_\pi\big[ R + \gamma\, \mathbb{E}[G(X') \mid X'] \,\big|\, X = x \big]$$
$$= \mathbb{E}_\pi[R \mid X = x] + \gamma\, \mathbb{E}_\pi\big[ \mathbb{E}[G(X') \mid X'] \,\big|\, X = x \big]$$
$$= \mathbb{E}_\pi[R \mid X = x] + \gamma \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x)\, \mathbb{E}[G(x')] , \qquad (5.18)$$
where the last step follows from the assumption that the random variables $G$ are independent of the next state $X'$. When applied to the return-variable function $G^\pi$, Equation 5.18 is none other than the classical Bellman equation.

40. In Equation 5.18, some of the expectations are taken solely with respect to the sample transition model, which we denote by $\mathbb{E}_\pi$ as usual. The expectation with respect to the random variable $G$, on the other hand, is not part of the model; we use the unsubscripted $\mathbb{E}$ to emphasise this point.
The same technique allows us to relate the variance of $(\mathcal{T}^\pi G)(x)$ to the next-state variances. For a random variable $Z$, recall that
$$\operatorname{Var}(Z) = \mathbb{E}\big[ (Z - \mathbb{E}[Z])^2 \big] .$$
The random reward $R$ and next state $X'$ are by definition conditionally independent given $X$ and $A$. However, to simplify the exposition, we assume that they are also conditionally independent given only $X$ (in Chapter 8 we will see how this assumption can be avoided).

Let us use the notation $\operatorname{Var}_\pi$ to make explicit the dependency of certain random variables on the sample transition model, analogous to our use of $\mathbb{E}_\pi$. Denote the value function corresponding to $G$ by $V(x) = \mathbb{E}[G(x)]$. We have
$$\operatorname{Var}\big( (\mathcal{T}^\pi G)(x) \big) = \operatorname{Var}_\pi\big( R + \gamma G(X') \mid X = x \big)$$
$$= \operatorname{Var}_\pi\big( R \mid X = x \big) + \operatorname{Var}_\pi\big( \gamma G(X') \mid X = x \big)$$
$$= \operatorname{Var}_\pi\big( R \mid X = x \big) + \gamma^2 \operatorname{Var}_\pi\big( G(X') \mid X = x \big)$$
$$= \operatorname{Var}_\pi\big( R \mid X = x \big) + \gamma^2 \operatorname{Var}_\pi\big( V(X') \mid X = x \big) + \gamma^2 \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x)\, \operatorname{Var}\big( G(x') \big) , \qquad (5.19)$$
where the last line follows by the law of total variance:
$$\operatorname{Var}(A) = \operatorname{Var}\big( \mathbb{E}[A \mid B] \big) + \mathbb{E}\big[ \operatorname{Var}(A \mid B) \big] .$$

Equation 5.19 shows how to compute the variances of the return function $\mathcal{T}^\pi G$. Specifically, these variances depend on the reward variance, the variance in next-state values, and the expected next-state variances. When $G = G^\pi$, this is the Bellman equation for the variance of the random return. Writing $\sigma_\pi^2(x) = \operatorname{Var}\big( G^\pi(x) \big)$, we obtain
$$\sigma_\pi^2(x) = \operatorname{Var}_\pi(R \mid X = x) + \gamma^2 \operatorname{Var}_\pi\big( V^\pi(X') \mid X = x \big) + \gamma^2 \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x)\, \sigma_\pi^2(x') . \qquad (5.20)$$
We now combine the results above to obtain a dynamic programming procedure for finding $\hat{\eta}$. For each state $x \in \mathcal{X}$, let us denote by $\mu_0(x)$ and $\sigma_0^2(x)$ the parameters of a return-distribution function $\eta_0$ with
$$\eta_0(x) = N\big( \mu_0(x), \sigma_0^2(x) \big) .$$
For all $x$, we simultaneously compute
$$\mu_{k+1}(x) = \mathbb{E}_\pi\big[ R + \gamma \mu_k(X') \mid X = x \big] \qquad (5.21)$$
$$\sigma_{k+1}^2(x) = \operatorname{Var}_\pi\big( R \mid X = x \big) + \gamma^2 \operatorname{Var}_\pi\big( \mu_k(X') \mid X = x \big) + \gamma^2\, \mathbb{E}_\pi\big[ \sigma_k^2(X') \mid X = x \big] . \qquad (5.22)$$
We can view these two updates as the implementation of a bona fide operator over the space of normal return-distribution functions. Indeed, we can associate to each iteration the return function
$$\eta_k(x) = N\big( \mu_k(x), \sigma_k^2(x) \big) \in \mathcal{F}_N .$$
Analysing the behaviour of the sequence $(\eta_k)_{k \ge 0}$ requires some care, but we shall see in Chapter 8 that the iterates converge to the best approximation $\hat{\eta}$ (Equation 5.16), in the sense that for all $x \in \mathcal{X}$,
$$\mu_k(x) \to V^\pi(x) \qquad \sigma_k^2(x) \to \operatorname{Var}\big( G^\pi(x) \big) .$$
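A sketch of one joint application of Equations 5.21 and 5.22, assuming the reward model is summarised by hypothetical per-state arrays `r_mean` and `r_var` and a policy-conditioned transition matrix `P_pi`:

```python
import numpy as np

def normal_ddp_step(mu, sigma2, r_mean, r_var, P_pi, gamma):
    """One joint application of Equations 5.21 and 5.22."""
    next_mean = P_pi @ mu                                   # E[mu_k(X') | X = x]
    next_var_of_mean = P_pi @ (mu ** 2) - next_mean ** 2    # Var(mu_k(X') | X = x)
    next_expected_var = P_pi @ sigma2                       # E[sigma2_k(X') | X = x]
    new_mu = r_mean + gamma * next_mean
    new_sigma2 = r_var + gamma ** 2 * (next_var_of_mean + next_expected_var)
    return new_mu, new_sigma2
```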
This derivation shows that the normal representation can be used to create a
tractable algorithm for approximating the return distribution. However, in our
work we have found that the normal representation rarely gives a satisfying
depiction of the agent’s interactions with its environment; it is not sufficiently
expressive. In many problems, outcomes are discrete in nature: success or
failure, food or hunger, forward motion or fall. This arises in video games in
which the game ends once the player’s last life is spent. Timing is also critical:
to catch the diamond, key, or mushroom, the agent must press “jump” at just the
right moment, again leading to discrete outcomes. Even in relatively continuous
systems, such as the application of reinforcement learning to stratospheric
balloon flight, the return distributions tend to be skewed or multimodal. In short,
normal distributions are a poor fit for the wide gamut of problems found in
reinforcement learning.
5.5 Fixed-Size Empirical Representations
The empirical representation is expressive because it can use more particles to
describe more complex probability distributions. This “blank cheque” approach
to memory and computation, however, results in an intractable algorithm. On
the other hand, the simple normal distribution is rarely sufficient to give a
good approximation of the return distribution. A good middle ground is to
preserve the form of the empirical representation while imposing a limit on its
expressivity. Our approach is to fix the number and type of particles used to
represent probability distributions.
Definition 5.11. The m-quantile representation parametrises the location of $m$ equally-weighted particles. That is,
$$\mathcal{F}_{Q,m} = \Big\{ \frac{1}{m} \sum_{i=1}^m \delta_{\theta_i} : \theta_i \in \mathbb{R} \Big\} . \qquad\triangle$$

Definition 5.12. Given a collection of $m$ evenly-spaced locations $\theta_1 < \cdots < \theta_m$, the m-categorical representation parametrises the probability of $m$ particles at these fixed locations:
$$\mathcal{F}_{C,m} = \Big\{ \sum_{i=1}^m p_i\, \delta_{\theta_i} : p_i \ge 0,\ \sum_{i=1}^m p_i = 1 \Big\} .$$
We denote the stride between successive particles by $\varsigma_m = \frac{\theta_m - \theta_1}{m - 1}$. $\triangle$
This definition corresponds to the categorical representation used in Chapter 3. Note that because of the constraint that probabilities should sum to one, an m-categorical distribution is described by $m - 1$ parameters. In addition, although the representation depends on the choice of locations $\theta_1, \dots, \theta_m$, we omit this dependence in the notation $\mathcal{F}_{C,m}$ to keep things concise.
In our definition of the m-categorical representation, we assume that the locations $(\theta_i)_{i=1}^m$ are given a priori and not part of the description of a particular probability distribution. This is sensible when we consider that algorithms such as categorical temporal-difference learning use the same set of locations to describe distributions at different states, and keep these locations fixed across the learning process. For example, a common choice is $\theta_1 = V_{\text{MIN}}$ and $\theta_m = V_{\text{MAX}}$. When it is desirable to adjust both the locations and probabilities of different particles, we instead make use of the m-particle representation.

Definition 5.13. The m-particle representation parametrises both the probability and location of $m$ particles:
$$\mathcal{F}_{E,m} = \Big\{ \sum_{i=1}^m p_i\, \delta_{\theta_i} : \theta_i \in \mathbb{R},\ p_i \ge 0,\ \sum_{i=1}^m p_i = 1 \Big\} .$$
The m-particle representation contains both the m-quantile representation and the m-categorical representation; a distribution from $\mathcal{F}_{E,m}$ is defined by $2m - 1$ parameters. $\triangle$
Mathematically, all three representations described above are subsets of the empirical representation $\mathcal{F}_E$; accordingly, we call them fixed-size empirical representations. Fixed-size empirical representations are flexible and can approximate both continuous and discrete outcome distributions (Figure 5.2). The categorical representation is so called because it models the probability of a set of fixed outcomes. This is somewhat of a misnomer: the “categories” are not arbitrary, but instead correspond to specific real values. The quantile representation is named for its relationship to the quantiles of the return distribution. Although it might seem like the m-particle representation is a strictly superior choice, we will see that committing to either fixed locations or fixed probabilities simplifies algorithmic design. For an equal number of parameters, it is also not clear whether one should prefer $m$ fully-parametrised particles or, say, $2m$ uniformly-weighted particles.

Figure 5.2: A distribution (in light grey), as approximated with an m-categorical, m-quantile, or m-particle representation, for $m = 5$.
Like the normal representation, fixed-size empirical representations are not closed under the distributional Bellman operator $\mathcal{T}^\pi$. As discussed in Section 5.3, the consequence is that we cannot implement the iterative procedure $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ with such a representation. To get around this issue, let us now introduce the notion of a projection operator: a mapping from the space of probability distributions (or a subset thereof) to a desired representation $\mathcal{F}$.^{41} We denote such an operator by
$$\Pi_{\mathcal{F}} : \mathscr{P}(\mathbb{R}) \to \mathcal{F} .$$
Definitionally, we require that projection operators satisfy, for any $\nu \in \mathcal{F}$,
$$\Pi_{\mathcal{F}}\, \nu = \nu .$$
We extend the notation $\Pi_{\mathcal{F}}$ to the space of return-distribution functions:
$$(\Pi_{\mathcal{F}}\, \eta)(x) = \Pi_{\mathcal{F}}\big( \eta(x) \big) .$$

41. In Chapter 4.1 we defined an operator as mapping elements from a space to itself. The term “projection operator” here is reasonable given that $\mathcal{F} \subseteq \mathscr{P}(\mathbb{R})$.
Figure 5.3: Left: The categorical projection assigns probability mass to each location according to a triangular kernel (central locations) and half-triangular kernels (boundary locations). Here, $m = 5$. Right: The $m = 5$ categorical projection of a given distribution, shown in grey. Each Dirac delta is coloured to match its kernel in the left panel.

The categorical projection $\Pi_C$, first encountered in Chapter 3, is one such operator; we will study it in greater detail in the remainder of this chapter. We also made implicit use of a projection step in deriving an algorithm for the normal representation: at each iteration, we kept track of the mean and variance of the process but discarded the rest of the distribution, so that the return function iterates could be described with the normal representation.
Algorithmically, we introduce a projection step following the application of $\mathcal{T}^\pi$, leading to a projected distributional Bellman operator $\Pi_{\mathcal{F}}\, \mathcal{T}^\pi$. By definition, this operator maps $\mathcal{F}^{\mathcal{X}}$ to itself, allowing for the design of distributional algorithms that represent each iterate of the sequence
$$\eta_{k+1} = \Pi_{\mathcal{F}}\, \mathcal{T}^\pi \eta_k$$
using a bounded amount of memory. We will discuss such algorithmic considerations in Section 5.7, after describing particular projection operators for the categorical and quantile representations. Combined with numerical integration, the use of a projection step also makes it possible to perform dynamic programming with continuous reward distributions (see Exercise 5.9).
5.6 The Projection Step
We now describe projection operators for the categorical and quantile representations, correspondingly called categorical projection and quantile projection. In both cases, these operators can be seen as finding the best approximation to a given probability distribution, as measured according to a specific probability metric.
To begin, recall that for a probability metric $d$, $\mathscr{P}_d(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ is the set of probability distributions with finite mean and finite distance from the reference distribution $\delta_0$ (Equation 4.26). For a representation $\mathcal{F} \subseteq \mathscr{P}_d(\mathbb{R})$, a d-projection of $\nu \in \mathscr{P}_d(\mathbb{R})$ onto $\mathcal{F}$ is a function $\Pi_{\mathcal{F},d} : \mathscr{P}_d(\mathbb{R}) \to \mathcal{F}$ that finds a distribution $\hat{\nu} \in \mathcal{F}$ that is d-closest to $\nu$:
$$\Pi_{\mathcal{F},d}\, \nu \in \arg\min_{\hat{\nu} \in \mathcal{F}} d(\nu, \hat{\nu}) . \qquad (5.23)$$
Although both the categorical and quantile projections that we present here satisfy this definition, it is worth noting that in the most general setting neither the existence nor the uniqueness of a d-projection $\Pi_{\mathcal{F},d}$ is actually guaranteed (see Remark 5.3). We lift the notion of a d-projection to return-distribution functions in our usual manner; the d-projection of $\eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ onto $\mathcal{F}^{\mathcal{X}}$ is
$$\big( \Pi_{\mathcal{F}^{\mathcal{X}},d}\, \eta \big)(x) = \Pi_{\mathcal{F},d}\big( \eta(x) \big) .$$
When unambiguous, we overload notation and write $\Pi_{\mathcal{F},d}$ to denote the projection onto $\mathcal{F}^{\mathcal{X}}$.

It is natural to think of the d-projection of the return-distribution function $\eta$ onto $\mathcal{F}^{\mathcal{X}}$ as the best achievable approximation within this representation, measured in terms of $d$. We thus call $\Pi_{\mathcal{F},d}\, \nu$ and $\Pi_{\mathcal{F}^{\mathcal{X}},d}\, \eta$ the $(d, \mathcal{F})$-optimal approximations to $\nu \in \mathscr{P}_d(\mathbb{R})$ and $\eta$, respectively.
Categorical projection. In Chapter 3, we defined the categorical projection $\Pi_C$ as assigning the probability mass $q$ of a particle located at $z$ to the two locations nearest to $z$ in the fixed support $\{\theta_1, \dots, \theta_m\}$. Specifically, the categorical projection assigns this mass $q$ in (inverse) proportion to the distance to these two neighbours. We now extend this idea to the case where we wish to more generally project a distribution $\nu \in \mathscr{P}_1(\mathbb{R})$ onto the m-categorical representation.

Given a probability distribution $\nu \in \mathscr{P}_1(\mathbb{R})$, its categorical projection $\Pi_C\, \nu = \sum_{i=1}^m p_i\, \delta_{\theta_i}$ has parameters
$$p_i = \mathbb{E}_{Z \sim \nu}\Big[ h_i\big( \varsigma_m^{-1} (Z - \theta_i) \big) \Big] , \qquad i = 1, \dots, m , \qquad (5.24)$$
for a set of functions $h_i : \mathbb{R} \to [0, 1]$ that we will define below. Here we write $p_i$ in terms of an expectation rather than a sum, with the idea that this expectation can be efficiently computed (this is the case when $\nu$ is itself an m-categorical distribution).
When $i = 2, \dots, m - 1$, the function $h_i$ is the triangular kernel
$$h_i(z) = h(z) = \max(0, 1 - |z|) .$$
We use this notation to describe the proportional assignment of probability mass for the inner locations $\theta_2, \dots, \theta_{m-1}$. One can verify that the triangular kernel assigns probability mass from $\nu$ to the location $\theta_i$ in proportion to the distance to its neighbours (Figure 5.3).
The parameters of the extreme locations are computed somewhat differently, as these also capture the probability mass associated with values greater than $\theta_m$ and smaller than $\theta_1$. For these, we use the half-triangular kernels
$$h_1(z) = \begin{cases} 1 & z \le 0 \\ \max(0, 1 - |z|) & z > 0 \end{cases} \qquad\quad h_m(z) = \begin{cases} \max(0, 1 - |z|) & z \le 0 \\ 1 & z > 0 . \end{cases}$$
Exercise 5.10 asks you to prove that the projection described here matches the deterministic projection of Section 3.5.
Our derivation gives a mathematical formalisation of the idea of assigning proportional probability mass to the locations nearest to a given particle, described in Chapter 3. In fact, it also describes the projection of $\nu$ in the Cramér distance ($\ell_2$) onto the m-categorical representation. This is stated formally as follows, and proven in Remark 5.4.

Proposition 5.14. Let $\nu \in \mathscr{P}_1(\mathbb{R})$. The m-categorical probability distribution whose parameters are given by Equation 5.24 is the (unique) $\ell_2$-projection of $\nu$ onto $\mathcal{F}_{C,m}$. $\triangle$
Quantile projection. We call the quantile projection of a probability distribution $\nu \in \mathscr{P}(\mathbb{R})$ a specific projection of $\nu$ in the 1-Wasserstein distance ($w_1$) onto the m-quantile representation $\mathcal{F}_{Q,m}$. With this choice of distance, this projection can be expressed in closed form and is easily implemented. In addition, we will see in Section 5.9 that it leads to a well-behaved dynamic programming algorithm. As with the categorical projection, we introduce the shorthand $\Pi_Q$ for the projection operator $\Pi_{\mathcal{F}_{Q,m}, w_1}$.

Consider a probability distribution $\nu \in \mathscr{P}_1(\mathbb{R})$. We are interested in a probability distribution $\Pi_Q\, \nu \in \mathcal{F}_{Q,m}$ that minimises the 1-Wasserstein distance from $\nu$:
$$\text{minimise } w_1(\nu, \nu') \quad \text{subject to } \nu' \in \mathcal{F}_{Q,m} .$$
By definition, such a solution must take the form
$$\Pi_Q\, \nu = \frac{1}{m} \sum_{i=1}^m \delta_{\theta_i} .$$
The following establishes that choosing $(\theta_i)_{i=1}^m$ to be a particular set of quantiles of $\nu$ yields a valid $w_1$-projection of $\nu$.

Proposition 5.15. Let $\nu \in \mathscr{P}_1(\mathbb{R})$. The m-quantile probability distribution whose parameters are given by
$$\theta_i = F_\nu^{-1}\Big( \frac{2i - 1}{2m} \Big) , \qquad i = 1, \dots, m , \qquad (5.25)$$
is a $w_1$-projection of $\nu$ onto $\mathcal{F}_{Q,m}$. $\triangle$
Equation 5.25 arises because the $i$th particle of an m-quantile distribution is “responsible” for the portion of the 1-Wasserstein distance measured on the interval $[\frac{i-1}{m}, \frac{i}{m}]$ (Figure 5.4). As formalised by the following lemma, the choice of the midpoint quantile $\frac{2i-1}{2m}$ minimises the 1-Wasserstein distance to $\nu$ on this interval.

Lemma 5.16. Let $\nu \in \mathscr{P}_1(\mathbb{R})$ with cumulative distribution function $F_\nu$. Let $0 \le a < b \le 1$. Then a solution to
$$\min_{\theta \in \mathbb{R}} \int_a^b \big| F_\nu^{-1}(\tau) - \theta \big|\, d\tau \qquad (5.26)$$
is given by the quantile midpoint
$$\theta = F_\nu^{-1}\Big( \frac{a + b}{2} \Big) . \qquad\triangle$$
The proof is given as Remark 5.5.
Proof of Proposition 5.15. Let $\nu' = \frac{1}{m} \sum_{i=1}^m \delta_{\theta_i}$ be an m-quantile distribution. Assume that its locations are sorted, i.e. $\theta_1 \le \theta_2 \le \cdots \le \theta_m$. For $\tau \in (0, 1)$, its inverse cumulative distribution function is
$$F_{\nu'}^{-1}(\tau) = \theta_{\lceil \tau m \rceil} .$$
This function is constant on the intervals $(0, \frac{1}{m}), [\frac{1}{m}, \frac{2}{m}), \dots, [\frac{m-1}{m}, 1)$. The 1-Wasserstein distance between $\nu$ and $\nu'$ therefore decomposes into a sum of $m$ terms:
$$w_1(\nu, \nu') = \int_0^1 \big| F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau) \big|\, d\tau = \sum_{i=1}^m \int_{\frac{i-1}{m}}^{\frac{i}{m}} \big| F_\nu^{-1}(\tau) - \theta_i \big|\, d\tau .$$
By Lemma 5.16, the $i$th term of the sum is minimised by the quantile midpoint $F_\nu^{-1}(\tau_i)$, where
$$\tau_i = \frac{1}{2}\Big( \frac{i - 1}{m} + \frac{i}{m} \Big) = \frac{2i - 1}{2m} .$$

Figure 5.4: Left: The quantile projection finds the quantiles of the distribution $\nu$ (the dashed line depicts its cumulative distribution function) for $\tau_i = \frac{2i-1}{2m}$, $i = 1, \dots, m$. The shaded area corresponds to the 1-Wasserstein distance between $\nu$ and its quantile projection $\Pi_Q\, \nu$ (solid line, $m = 5$). Right: The optimal $(w_1, \mathcal{F}_{Q,m})$-approximation to the distribution $\nu$, shown in grey.
Unlike the categorical-Cramér case, in general there is not a unique m-quantile distribution $\nu' \in \mathcal{F}_{Q,m}$ that is closest in $w_1$ to a given distribution $\nu \in \mathscr{P}_1(\mathbb{R})$. The following example illustrates how the issue might take shape.
Example 5.17. Consider the set of Dirac distributions $\mathcal{F}_{Q,1} = \{ \delta_\theta : \theta \in \mathbb{R} \}$. Let $\nu = \frac{1}{2}\delta_0 + \frac{1}{2}\delta_1$ be the Bernoulli($1/2$) distribution. For any $\theta \in [0, 1]$, $\delta_\theta \in \mathcal{F}_{Q,1}$ is an optimal $(w_1, \mathcal{F}_{Q,1})$-approximation to $\nu$:
$$w_1(\nu, \delta_\theta) = \min_{\nu' \in \mathcal{F}_{Q,1}} w_1(\nu, \nu') .$$
Perhaps surprisingly, this shows that the distribution $\delta_{1/2}$, half-way between the two possible outcomes and an intuitive one-particle approximation to $\nu$, is no better a choice than $\delta_0$ when measured in terms of the 1-Wasserstein distance. $\triangle$
5.7 Distributional Dynamic Programming
We embed the projected Bellman operator in a FOR loop to obtain an algorithmic template for approximating the return function (Algorithm 5.2). We call this template distributional dynamic programming (DDP),^{42} as it computes
$$\eta_{k+1} = \Pi_{\mathcal{F}}\, \mathcal{T}^\pi \eta_k \qquad (5.27)$$
by iteratively applying a projected distributional Bellman operator. A special case is when the representation is closed under $\mathcal{T}^\pi$, in which case no projection is needed. However, by contrast with Equation 5.6, the use of a projection allows us to consider algorithms for a greater variety of representations. Summarising the results of the previous sections, instantiating this template involves three parts:

Choice of representation. We first need a probability distribution representation $\mathcal{F}$. Provided that this representation uses finitely many parameters, this enables us to store return functions in memory, using the implied mapping from parameters to probability distributions.

Update step. We then need a subroutine for computing a single application of the distributional Bellman operator to a return function represented by $\mathcal{F}$ (Equation 5.8).

Projection step. We finally need a subroutine that maps the outputs of the update step to probability distributions in $\mathcal{F}$. In particular, when $\Pi_{\mathcal{F}}$ is a d-projection, this involves finding an optimal $(d, \mathcal{F})$-approximation at each iteration.
For empirical representations, including the categorical and quantile representations, the update step can be implemented by Algorithm 5.1. That is, when there are finitely many rewards the output of $\mathcal{T}^\pi$ applied to any m-particle representation is a collection of empirical distributions:
$$\eta \in \mathcal{F}_{E,m}^{\mathcal{X}} \implies \mathcal{T}^\pi \eta \in \mathcal{F}_E^{\mathcal{X}} .$$
As a consequence, it is sufficient to have an efficient subroutine for projecting empirical distributions back to $\mathcal{F}_{C,m}$, $\mathcal{F}_{Q,m}$, or $\mathcal{F}_{E,m}$.

42. More precisely, this is distributional dynamic programming applied to the problem of policy evaluation. A sensible, but less memorable alternative is “iterative distributional policy evaluation”.
Algorithm 5.2: Distributional dynamic programming

Algorithm parameters: representation $\mathcal{F}$, projection $\Pi_{\mathcal{F}}$,
  desired number of iterations $K$,
  initial return function $\eta_0 \in \mathcal{F}^{\mathcal{X}}$

Initialise $\eta \leftarrow \eta_0$
for $k = 1, \dots, K$ do
  $\eta' \leftarrow \mathcal{T}^\pi \eta$   ▷ Algorithm 5.1
  foreach state $x \in \mathcal{X}$ do
    $\eta(x) \leftarrow (\Pi_{\mathcal{F}}\, \eta')(x)$
  end foreach
end for
return $\eta$
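A compact Python sketch of this template, reusing the `empirical_bellman_operator` sketch above and taking the projection as a function argument:

```python
def distributional_dynamic_programming(eta0, project, num_iterations,
                                        policy, P_X, P_R, rewards, gamma, terminal):
    """Iterate eta_{k+1} = Proj(T^pi eta_k), per Equation 5.27.

    project: maps a list of (location, probability) particles for one state to a
    projected list of particles in the chosen representation.
    """
    eta = dict(eta0)
    for _ in range(num_iterations):
        pre_projection = empirical_bellman_operator(
            eta, policy, P_X, P_R, rewards, gamma, terminal)
        eta = {x: project(particles) for x, particles in pre_projection.items()}
    return eta
```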
For example, consider the projection of the empirical distribution
$$\nu = \sum_{j=1}^n q_j\, \delta_{z_j}$$
onto the m-categorical representation (Definition 5.12). For $i = 1, \dots, m$, Equation 5.24 becomes
$$p_i = \sum_{j=1}^n q_j\, h_i\big( \varsigma_m^{-1} (z_j - \theta_i) \big) ,$$
which can be implemented in linear time with two FOR loops (as was done in Algorithm 3.4).
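A sketch of that computation with two FOR loops, following Equation 5.24 (the function and argument names are illustrative):

```python
import numpy as np

def categorical_projection(particles, theta):
    """Project an empirical distribution (a list of (location, probability) pairs)
    onto the m-categorical representation with locations theta (Equation 5.24)."""
    m = len(theta)
    stride = (theta[-1] - theta[0]) / (m - 1)
    p = np.zeros(m)
    for z_j, q_j in particles:               # outer FOR loop over particles
        for i in range(m):                   # inner FOR loop over locations
            u = (z_j - theta[i]) / stride
            if (i == 0 and u <= 0) or (i == m - 1 and u >= 0):
                k = 1.0                      # half-triangular kernels h_1 and h_m
            else:
                k = max(0.0, 1.0 - abs(u))   # triangular kernel
            p[i] += q_j * k
    return [(theta[i], p[i]) for i in range(m)]
```

Passing this function (with fixed locations, e.g. via `lambda particles: categorical_projection(particles, theta)`) as the `project` argument of the `distributional_dynamic_programming` sketch above yields a version of categorical dynamic programming (Algorithm 5.3).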
Similarly, the projection of $\nu$ onto the m-quantile representation (Definition 5.11) is achieved by sorting the locations $z_j$ to construct the cumulative distribution function from their associated probabilities $q_j$:
$$F_\nu(z) = \sum_{j=1}^n q_j\, \mathbb{1}\{ z_j \le z \} ,$$
from which quantile midpoints can be extracted.
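A sketch of this quantile projection for an empirical distribution, using the midpoint quantiles of Equation 5.25 (names are illustrative):

```python
import numpy as np

def quantile_projection(particles, m):
    """Project an empirical distribution (a list of (location, probability) pairs)
    onto the m-quantile representation via the midpoints of Equation 5.25."""
    particles = sorted(particles)                      # sort by location
    locs = np.array([z for z, _ in particles])
    cdf = np.cumsum([q for _, q in particles])
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)     # (2i - 1) / (2m)
    # F^{-1}(tau): smallest location whose cumulative probability is >= tau.
    idx = np.minimum(np.searchsorted(cdf, taus), len(locs) - 1)
    return [(float(locs[i]), 1.0 / m) for i in idx]
```

Used in the same way (e.g. `lambda particles: quantile_projection(particles, m)`), this provides the projection step of quantile dynamic programming (Algorithm 5.4).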
Because the categorical-Cramér and quantile-$w_1$ pairs recur so often throughout this book, it is convenient to give a name to the algorithms that iteratively apply their respective projected operators. We call these algorithms categorical and quantile dynamic programming, respectively (CDP and QDP; Algorithms 5.3 and 5.4).^{43}

Algorithm 5.3: Categorical dynamic programming

Algorithm parameters: representation parameters $\theta_1, \dots, \theta_m$, $m$,
  initial probabilities $\big\{ (p_i(x))_{i=1}^m : x \in \mathcal{X} \big\}$,
  desired number of iterations $K$

for $k = 1, \dots, K$ do
  $\eta(x) = \sum_{i=1}^m p_i(x)\, \delta_{\theta_i}$ for $x \in \mathcal{X}$
  $\big\{ (z'_j(x), p'_j(x))_{j=1}^{m(x)} : x \in \mathcal{X} \big\} \leftarrow \mathcal{T}^\pi \eta$   ▷ Algorithm 5.1
  foreach state $x \in \mathcal{X}$ do
    for $i = 1, \dots, m$ do
      $p_i(x) \leftarrow \sum_{j=1}^{m(x)} p'_j(x)\, h_i\big( \varsigma_m^{-1} (z'_j(x) - \theta_i) \big)$
    end for
  end foreach
end for
return $\big\{ (p_i(x))_{i=1}^m : x \in \mathcal{X} \big\}$
For both categorical and quantile dynamic programming, the computational cost is dominated by the number of particles produced by the distributional Bellman operator, prior to projection. Since the number of particles in the representation is a constant $m$, we have that per state, there may be up to $N = m N_{\mathcal{R}} N_{\mathcal{X}}$ particles in this intermediate step. Thus the cost of $K$ iterations with the m-categorical representation is $O(K m N_{\mathcal{R}} N_{\mathcal{X}}^2)$, not too dissimilar to the cost of performing iterative policy evaluation with value functions. Due to the sorting operation, the cost of $K$ iterations with the m-quantile representation is larger, at $O\big( K m N_{\mathcal{R}} N_{\mathcal{X}}^2 \log(m N_{\mathcal{R}} N_{\mathcal{X}}) \big)$. In Chapter 6 we describe an incremental algorithm that avoids the explicit sorting operation and is in some cases more computationally efficient.

43. To be fully accurate, we should call these the categorical-Cramér and quantile-$w_1$ dynamic programming algorithms, given that they combine particular choices of probability representation and projection. However, brevity has its virtues.
Algorithm 5.4: Quantile dynamic programming

Algorithm parameters: initial locations $\big\{ (\theta_i(x))_{i=1}^m : x \in \mathcal{X} \big\}$,
  desired number of iterations $K$

for $k = 1, \dots, K$ do
  $\eta(x) = \sum_{i=1}^m \frac{1}{m}\, \delta_{\theta_i(x)}$ for $x \in \mathcal{X}$
  $\big\{ (z'_j(x), p'_j(x))_{j=1}^{m(x)} : x \in \mathcal{X} \big\} \leftarrow \mathcal{T}^\pi \eta$   ▷ Algorithm 5.1
  foreach state $x \in \mathcal{X}$ do
    $\big( z'_j(x), p'_j(x) \big)_{j=1}^{m(x)} \leftarrow$ SORT$\big( (z'_j(x), p'_j(x))_{j=1}^{m(x)} \big)$ in increasing order of location
    for $j = 1, \dots, m(x)$ do
      $P_j(x) \leftarrow \sum_{i=1}^j p'_i(x)$
    end for
    for $i = 1, \dots, m$ do
      $j \leftarrow \min\{ l : P_l(x) \ge \tau_i \}$, where $\tau_i = \frac{2i-1}{2m}$ (Equation 5.25)
      $\theta_i(x) \leftarrow z'_j(x)$
    end for
  end foreach
end for
return $\big\{ (\theta_i(x))_{i=1}^m : x \in \mathcal{X} \big\}$
In designing distributional dynamic programming algorithms, there is a good deal of flexibility in the choice of representation, and in the projection step once a representation has been selected. There are basic properties we would like from the sequence defined by Equation 5.27, such as a guarantee of convergence, and further a limit which does not depend on our choice of initialisation. Certain combinations of representation and projection will ensure these properties hold, as we explore in Section 5.9, while others may lead to very poorly-behaved algorithms (see Exercise 5.19). In addition, using a representation and projection also necessarily incurs some approximation error relative to the true return function. It is often possible to obtain quantitative bounds on this approximation error, as Section 5.10 describes, but often judgement must be used as to what qualitative types of approximation are the most acceptable for the task at hand; we return to this point in Section 5.11.
Figure 5.5: A diffusion-free projection operator $\Pi_{\mathcal{F}}$ yields a distributional dynamic programming procedure that is equivalent to first computing an exact return function and then projecting it.
5.8 Error Due To Diffusion
In Section 5.4 we showed that the distributional algorithm for the normal representation finds the best fit to the return function $\eta^\pi$, as measured in Kullback–Leibler divergence. Implicit in our derivation was the fact that we could interleave projection and update steps to obtain the same solution as if we had first determined $\eta^\pi$ without approximation, and then found its best fit in $\mathcal{F}_N$. We call a projection operator with this property diffusion-free (Figure 5.5).

Definition 5.18. Consider a representation $\mathcal{F}$ and a projection operator $\Pi_{\mathcal{F}}$ for that representation. The projection operator $\Pi_{\mathcal{F}}$ is said to be diffusion-free if, for any return function $\eta \in \mathcal{F}^{\mathcal{X}}$, we have
$$\Pi_{\mathcal{F}}\, \mathcal{T}^\pi \Pi_{\mathcal{F}}\, \eta = \Pi_{\mathcal{F}}\, \mathcal{T}^\pi \eta .$$
As a consequence, for any $k \ge 0$ and any $\eta \in \mathcal{F}^{\mathcal{X}}$, a diffusion-free projection operator satisfies
$$(\Pi_{\mathcal{F}}\, \mathcal{T}^\pi)^k \eta = \Pi_{\mathcal{F}}\, (\mathcal{T}^\pi)^k \eta . \qquad\triangle$$

Algorithms that implement diffusion-free projection operators are quite appealing, because they behave as if no approximation had been made until the final iteration. Unfortunately, such algorithms are the exception, rather than the rule. By contrast, without this guarantee an algorithm may accumulate excess error from iteration to iteration; we say that the iterates $\eta_0, \eta_1, \eta_2, \dots$ undergo diffusion. Known projection algorithms for m-particle representations suffer from this issue, as the following example illustrates.
Example 5.19. Consider a chain of $n$ states with a deterministic left-to-right transition function (Figure 5.6). The last state of this chain, $x_n$, is terminal and produces a reward of 1; all other states yield no reward. For $t \ge 0$, the discounted return at state $x_{n-t}$ is deterministic and has value $\gamma^t$; let us denote its distribution by $\eta^\pi(x_{n-t})$. If we approximate this distribution with $m = 11$ particles uniformly spaced from 0 to 1, the best categorical approximation assigns probability to the two particles closest to $\gamma^t$.

If we instead use categorical dynamic programming, we find a different solution. Because each iteration of the projected Bellman operator must produce a categorical distribution, the iterates undergo diffusion. Far from the terminal state, the return distribution found by Algorithm 5.2 is distributed on a much larger support than the best categorical approximation of $\eta^\pi(x_i)$. $\triangle$

Figure 5.6: (a) An n-state chain with a single nonzero reward at the end. (b) The $(\ell_2, \mathcal{F}_{C,m})$-optimal approximation to the return function, for $n = 10$, $m = 11$, and $\theta_1 = 0, \dots, \theta_{11} = 1$. The probabilities assigned to each location are indicated in greyscale (white = 0, black = 1). (c) The approximation found by categorical dynamic programming with the same representation.
The diffusion in Example 5.19 can be explained analytically. Let us associate each particle with an integer $j = 0, \dots, 10$, corresponding to the location $\frac{j}{10}$. For concreteness, let $\gamma = 1 - \varepsilon$ for some $0 < \varepsilon \le 0.1$, and consider the return-distribution function obtained after $n$ iterations of categorical dynamic programming:
$$\hat{\eta} = (\Pi_C\, \mathcal{T}^\pi)^n \eta_0 .$$
Because there are no cycles, one can show that further iterations leave the approximation unchanged.
If we interpret the $m$ particle locations as states in a Markov chain, then we can view the return distribution $\hat{\eta}(x_{n-t})$ as the probability distribution of this Markov chain after $t$ time steps ($t \in \{0, \dots, n - 1\}$). The transition function for states $j = 1, \dots, 10$ is
$$P(\lfloor \gamma j \rfloor \mid j) = \lceil \gamma j \rceil - \gamma j \qquad\quad P(\lceil \gamma j \rceil \mid j) = 1 - \lceil \gamma j \rceil + \gamma j .$$
For $j = 0$, we simply have $P(0 \mid 0) = 1$, i.e. state 0 is terminal. When $\varepsilon$ is sufficiently small compared to the gap $\varsigma_m = 0.1$ between neighbouring particle locations, the process can be approximated with a binomial distribution:
$$\hat{G}(x_{n-t}) \approx 1 - 10^{-1}\, \text{BINOM}\big( 10(1 - \gamma),\ t \big) ,$$
where $\hat{G}$ is the return-variable function associated with $\hat{\eta}$. Figure 5.6c gives a rough illustration of this point, with a bell-shaped distribution emerging in the return distribution at state $x_1$.
5.9 Convergence of Distributional Dynamic Programming
We now use the theory of operators developed in Chapter 4 to characterise the behaviour of distributional dynamic programming in the policy evaluation setting: its convergence rate, its point of convergence, and also the approximation error incurred. Specifically, this theory allows us to measure how these quantities are impacted by different choices of representation and projection. Although the algorithmic discussion in previous sections has focused on implementations of distributional dynamic programming in the case of finitely-supported reward distributions, the results presented here apply without this assumption (see Exercise 5.9 for indications of how distributional dynamic programming algorithms may be implemented in such cases). As we will see in Chapter 6, the theory developed here also informs incremental algorithms for learning the return-distribution function.
Let us consider a probability metric
d
, possibly different from the metric
under which the projection is performed (when applicable), and which we call
the analysis metric. We use the analysis metric to characterise instances of
Algorithm 5.2 in terms of the Lipschitz constant of the projected operator.
Definition 5.20. Let $(M, d)$ be a metric space, and let $O : M \to M$ be a function on this space. The Lipschitz constant of $O$ under the metric $d$ is
$$\lVert O \rVert_d = \sup_{\substack{U, U' \in M \\ U \ne U'}} \frac{d(O U, O U')}{d(U, U')} . \qquad\triangle$$

When $O$ is a contraction mapping, its Lipschitz constant is simply its contraction modulus. That is, under the conditions of Theorem 4.25 applied to a c-homogeneous $d$, we have
$$\lVert \mathcal{T}^\pi \rVert_d \le \gamma^c .$$
Definition 5.20 extends the notion of a contraction modulus to operators, such as projections, that are not contraction mappings.

Lemma 5.21. Let $(M, d)$ be a metric space, and let $O_1, O_2 : M \to M$ be functions on this space. Write $O_1 O_2$ for the composition of these mappings. Then
$$\lVert O_1 O_2 \rVert_d \le \lVert O_1 \rVert_d\, \lVert O_2 \rVert_d . \qquad\triangle$$
Proof. By applying the definition of the Lipschitz constant twice, we have
$$d(O_1 O_2 U, O_1 O_2 U') \le \lVert O_1 \rVert_d\, d(O_2 U, O_2 U') \le \lVert O_1 \rVert_d\, \lVert O_2 \rVert_d\, d(U, U') ,$$
as required.
Lemma 5.21 gives a template for validating and understanding different instantiations of Algorithm 5.2. Here, we consider the metric space defined by the finite domain of our analysis metric, $\big( \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}, \overline{d}\, \big)$, and use Lemma 5.21 to characterise $\Pi_{\mathcal{F}}\, \mathcal{T}^\pi$ in terms of the Lipschitz constants of its parts ($\Pi_{\mathcal{F}}$ and $\mathcal{T}^\pi$). By analogy with the conditions of Proposition 4.27, where needed we will assume that the environment (more specifically, its random quantities) are reasonably behaved under $d$.

Assumption 5.22(d). The finite domain $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ is closed under both the projection $\Pi_{\mathcal{F}}$ and the Bellman operator $\mathcal{T}^\pi$, in the sense that
$$\eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}} \implies \mathcal{T}^\pi \eta,\ \Pi_{\mathcal{F}}\, \eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}} . \qquad\triangle$$

Assumption 5.22(d) guarantees that distributions produced by the distributional Bellman operator and the projection have finite distances from one another. For many choices of $d$ of interest, Assumption 5.22 can be easily shown to hold when the reward distributions are bounded; contrast with Proposition 4.16.

The Lipschitz constant of projection operators must be at least 1, since for any $\nu \in \mathcal{F}$,
$$\Pi_{\mathcal{F}}\, \nu = \nu .$$
In the case of the Cramér distance $\ell_2$, we can show that it behaves much like a Euclidean metric, from which we obtain the following result on $\ell_2$-projections.

Lemma 5.23. Consider a representation $\mathcal{F} \subseteq \mathscr{P}_{\ell_2}(\mathbb{R})$. If $\mathcal{F}$ is complete with respect to $\ell_2$ and convex, then the $\ell_2$-projection $\Pi_{\mathcal{F},\ell_2} : \mathscr{P}_{\ell_2}(\mathbb{R}) \to \mathcal{F}$ is a non-expansion in that metric:
$$\lVert \Pi_{\mathcal{F},\ell_2} \rVert_{\ell_2} = 1 .$$
Furthermore, the result extends to return functions and the supremum extension of $\ell_2$:
$$\lVert \Pi_{\mathcal{F},\ell_2} \rVert_{\overline{\ell}_2} = 1 . \qquad\triangle$$
The proof is given in Remark 5.6. Exercise 5.15 asks you to show that the m-categorical representation is convex and complete with respect to the Cramér distance. From this, we immediately derive the following.
Lemma 5.24. The projected distributional Bellman operator instantiated with $\mathcal{F} = \mathcal{F}_{C,m}$ and $d = \ell_2$ has contraction modulus $\gamma^{1/2}$ with respect to $\overline{\ell}_2$ on the space $\mathscr{P}_{\ell_2}(\mathbb{R})^{\mathcal{X}}$. That is,
$$\lVert \Pi_C\, \mathcal{T}^\pi \rVert_{\overline{\ell}_2} \le \gamma^{1/2} . \qquad\triangle$$
Lemma 5.24 gives a formal reason for why the use of the m-categorical representation, coupled with a projection based on the triangular kernel, produces a convergent distributional dynamic programming algorithm.
The m-quantile representation, however, is not convex. This precludes a direct appeal to an argument based on a quantile analogue to Lemma 5.23. Nor is the issue simply analytical: the $w_1$ Lipschitz constant of the projected Bellman operator $\Pi_Q\, \mathcal{T}^\pi$ is greater than 1 (see Exercise 5.16). Instead, we need to use the $\infty$-Wasserstein distance as our analysis metric. As the $w_1$-projection onto $\mathcal{F}_{Q,m}$ is a non-expansion in $w_\infty$, we obtain the following result.
Lemma 5.25. Under Assumption 5.22($w_\infty$), the projected distributional Bellman operator $\Pi_Q\, \mathcal{T}^\pi : \mathscr{P}_\infty(\mathbb{R})^{\mathcal{X}} \to \mathscr{P}_\infty(\mathbb{R})^{\mathcal{X}}$ has contraction modulus $\gamma$ in $\overline{w}_\infty$. That is,
$$\lVert \Pi_Q\, \mathcal{T}^\pi \rVert_{\overline{w}_\infty} \le \gamma . \qquad\triangle$$
Proof. Since $\mathcal{T}^\pi$ is a $\gamma$-contraction in $\overline{w}_\infty$ by Proposition 4.15, it is sufficient to prove that $\Pi_Q : \mathscr{P}_\infty(\mathbb{R}) \to \mathscr{P}_\infty(\mathbb{R})$ is a non-expansion in $w_\infty$; the result then follows from Lemma 5.21. Given two distributions $\nu_1, \nu_2 \in \mathscr{P}_\infty(\mathbb{R})$, we have
$$w_\infty(\nu_1, \nu_2) = \sup_{\tau \in (0,1)} \big| F_{\nu_1}^{-1}(\tau) - F_{\nu_2}^{-1}(\tau) \big| .$$
Now note that
$$\Pi_Q\, \nu_1 = \frac{1}{m} \sum_{i=1}^m \delta_{F_{\nu_1}^{-1}(\tau_i)} , \qquad \Pi_Q\, \nu_2 = \frac{1}{m} \sum_{i=1}^m \delta_{F_{\nu_2}^{-1}(\tau_i)} ,$$
where $\tau_i = \frac{2i-1}{2m}$. We have
$$w_\infty(\Pi_Q\, \nu_1, \Pi_Q\, \nu_2) = \max_{i=1,\dots,m} \big| F_{\nu_1}^{-1}(\tau_i) - F_{\nu_2}^{-1}(\tau_i) \big| \le \sup_{\tau \in (0,1)} \big| F_{\nu_1}^{-1}(\tau) - F_{\nu_2}^{-1}(\tau) \big| = w_\infty(\nu_1, \nu_2) ,$$
as required.
The derivation of a contraction modulus for a projected Bellman operator provides us with two results. By Proposition 4.7, it establishes that if $\Pi_{\mathcal{F}}\, \mathcal{T}^\pi$ has a fixed point $\hat{\eta}$, then Algorithm 5.2 converges to this fixed point. Second, it also establishes the existence of such a fixed point when the representation is complete with respect to the analysis metric $d$, based on Banach's fixed point theorem; a proof is provided in Remark 5.7.

Theorem 5.26 (Banach's fixed point theorem). Let $(M, d)$ be a complete metric space and let $O$ be a contraction mapping on $M$, with respect to $d$. Then $O$ has a unique fixed point $U^* \in M$. $\triangle$
Proposition 5.27. Let $d$ be a c-homogeneous, regular probability metric that is p-convex for some $p \in [1, \infty)$ and let $\mathcal{F}$ be a representation complete with respect to $d$. Consider Algorithm 5.2 instantiated with a projection step described by the operator $\Pi_{\mathcal{F}}$, and suppose that Assumption 5.22(d) holds. If $\Pi_{\mathcal{F}}$ is a non-expansion in $d$, then the corresponding projected Bellman operator $\Pi_{\mathcal{F}}\, \mathcal{T}^\pi$ has a unique fixed point $\hat{\eta}$ in $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ satisfying
$$\hat{\eta} = \Pi_{\mathcal{F}}\, \mathcal{T}^\pi \hat{\eta} .$$
Additionally, for any $\varepsilon > 0$, if the number of iterations $K$ is such that
$$K \ge \frac{\log \frac{1}{\varepsilon} + \log \overline{d}(\eta_0, \hat{\eta})}{c \log \frac{1}{\gamma}}$$
with $\eta_0(x) \in \mathscr{P}_d(\mathbb{R})$ for all $x$, then the output $\eta_K$ of Algorithm 5.2 satisfies
$$\overline{d}(\eta_K, \hat{\eta}) \le \varepsilon . \qquad\triangle$$
Proof. By the assumptions in the statement, we have that T^π is a γ^c-contraction on 𝒫_d(ℝ)^𝒳 by Theorem 4.25, and by Lemma 5.21, Π_ℱ T^π is a γ^c-contraction on 𝒫_d(ℝ)^𝒳. By Banach's fixed point theorem, there is a unique fixed point η̂ for Π_ℱ T^π in 𝒫_d(ℝ)^𝒳. Now note

d̄(η_K, η̂) = d̄(Π_ℱ T^π η_{K−1}, Π_ℱ T^π η̂) ≤ γ^c d̄(η_{K−1}, η̂),

so by induction

d̄(η_K, η̂) ≤ γ^{cK} d̄(η₀, η̂).

Setting the right-hand side to less than ε and rearranging yields the result.
Although Proposition 5.27 does not allow us to conclude that a particular
algorithm will fail, it cautions us against projections in certain probability
metrics. For example, because the distributional Bellman operator is only a non-
expansion in the supremum total variation distance, we cannot guarantee a good
approximation with respect to this metric after any finite number of iterations.
Because we would like the projection step to be computationally efficient, this
argument also gives us a criterion with which to choose a representation. For
example, although the m-particle representation is clearly more flexible than
either the categorical or quantile representations, it is currently not known how
to efficiently and usefully project onto it.
5.10 Quality of the Distributional Approximation
Having identified conditions under which the iterates produced by distributional dynamic programming converge, we now ask: to what do they converge? We answer this question by measuring how close the fixed point η̂ of the projected Bellman operator (computed by Algorithm 5.2 in the limit of the number of iterations) is to the true return function η^π. The quality of this approximation depends on a number of factors: the choice and size of the representation ℱ, which determines the optimal approximation of η^π within ℱ, as well as properties of the projection step.
Proposition 5.28. Let d be a c-homogeneous, regular probability metric that is p-convex for some p ∈ [1, ∞), and let ℱ ⊆ 𝒫_d(ℝ) be a representation complete with respect to d. Let Π_ℱ be a projection operator that is a non-expansion in d, and suppose Assumption 5.22(d) holds. Consider the projected Bellman operator Π_ℱ T^π with fixed point η̂ ∈ 𝒫_d(ℝ)^𝒳. Then

d̄(η̂, η^π) ≤ d̄(Π_ℱ η^π, η^π) / (1 − γ^c) .

When Π_ℱ is the d-projection onto ℱ, Π_ℱ η^π is an optimal (ℱ, d)-approximation of the return function η^π. △
Proof. We have

d̄(η^π, η̂) ≤ d̄(η^π, Π_ℱ η^π) + d̄(Π_ℱ η^π, η̂)
= d̄(η^π, Π_ℱ η^π) + d̄(Π_ℱ T^π η^π, Π_ℱ T^π η̂)
≤ d̄(η^π, Π_ℱ η^π) + γ^c d̄(η^π, η̂).

Rearranging then gives the result.
From this, we immediately derive a result regarding the fixed point η̂_C of the projected operator Π_C T^π.
Corollary 5.29. The fixed point η̂_C of the categorical-projected Bellman operator Π_C T^π : 𝒫₁(ℝ)^𝒳 → 𝒫₁(ℝ)^𝒳 satisfies

ℓ̄₂(η̂_C, η^π) ≤ ℓ̄₂(Π_C η^π, η^π) / (1 − γ^{1/2}) .

If the return distributions (η^π(x) : x ∈ 𝒳) are supported on [θ₁, θ_m], we further have that, for each x ∈ 𝒳,

min_{ν∈ℱ_{C,m}} ℓ₂²(ν, η^π(x)) ≤ ς_m Σ_{i=1}^m ( F_{η^π(x)}(θ₁ + iς_m) − F_{η^π(x)}(θ₁ + (i−1)ς_m) )² ≤ ς_m ,    (5.28)

and hence

ℓ̄₂(η̂_C, η^π) ≤ (1 / (1 − γ^{1/2})) √( (θ_m − θ₁) / (m − 1) ) . △
Proposition 5.28 suggests that the excess error may be greater when the projection is performed in a metric for which c is small. In the particular case of the Cramér distance, the constant in Corollary 5.29 can in fact be strengthened to 1/√(1 − γ) under certain conditions, by arguing about the geometry of the space of probability distributions under ℓ₂; see Remark 5.6 for a discussion of this geometry and Rowland et al. [2018] for details on the strengthened bound.
Varying the number of particles in the categorical representation allows the user to control both the complexity of the associated dynamic programming algorithm and the error of the fixed point η̂_C (Figure 5.7). As discussed in the first part of this chapter, both the memory and time complexity of computing with this representation increase with m. On the other hand, Corollary 5.29 establishes that increasing m also reduces the approximation error incurred from dynamic programming.
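The trade-off can be read directly off the bound of Corollary 5.29. The short sketch below (with illustrative values for γ and the support, not taken from the book) evaluates the right-hand side of the corollary for increasing numbers of particles m.

```python
import numpy as np

def categorical_fixed_point_bound(m, gamma, theta_1, theta_m):
    """Right-hand side of Corollary 5.29:
    (1 / (1 - sqrt(gamma))) * sqrt((theta_m - theta_1) / (m - 1))."""
    return np.sqrt((theta_m - theta_1) / (m - 1)) / (1.0 - np.sqrt(gamma))

gamma, theta_1, theta_m = 0.9, -10.0, 10.0
for m in [11, 101, 1001, 10001]:
    print(m, categorical_fixed_point_bound(m, gamma, theta_1, theta_m))
```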
Figure 5.7
The return-distribution function estimates obtained by applying categorical dynamic programming to the Cliffs domain (Example 2.9; here, with the safe policy). Each panel corresponds to a different number of particles.

Proposition 5.28 can be similarly used to analyse quantile dynamic programming. This result, under the conditions of Lemma 5.25, shows that the fixed point η̂_Q of the projected operator Π_Q T^π for the m-quantile representation satisfies

w̄_∞(η̂_Q, η^π) ≤ w̄_∞(Π_Q η^π, η^π) / (1 − γ) .
Unfortunately, the distance w̄_∞(Π_Q η^π, η^π) does not necessarily vanish as the number of particles m in the quantile representation increases. This is because w_∞ is in some sense a very strict distance (Exercise 5.18 makes this point precise). Nevertheless, the dynamic programming algorithm associated with the quantile representation enjoys a similar trade-off between complexity and accuracy as established for the categorical algorithm above. This can be shown by instead analysing the algorithm via the more lenient w₁ distance; Exercise 5.20 provides a guide to a proof of this result.
Lemma 5.30. Suppose that for each (x, a) ∈ 𝒳 × 𝒜, the reward distribution P_R(· | x, a) is supported on the interval [R_MIN, R_MAX]. Then the m-quantile fixed point η̂_Q satisfies

w̄₁(η̂_Q, η^π) ≤ 3(V_MAX − V_MIN) / (2m(1 − γ)) .

In addition, consider an initial return function η₀ with distributions with support bounded in [V_MIN, V_MAX], and the iterates η_{k+1} = Π_Q T^π η_k produced by quantile dynamic programming. Then we have, for all k ≥ 0,

w̄₁(η_k, η^π) ≤ γ^k (V_MAX − V_MIN) + 3(V_MAX − V_MIN) / (2m(1 − γ)) . △
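To see how the bounds of Lemma 5.30 translate into concrete choices, the following sketch (illustrative values only) computes the number of particles guaranteeing a given fixed-point error, together with the bound on the error of the iterates.

```python
import math

def particles_needed(eps, gamma, v_min, v_max):
    """Smallest m for which 3 (V_MAX - V_MIN) / (2 m (1 - gamma)) <= eps (Lemma 5.30)."""
    return math.ceil(3 * (v_max - v_min) / (2 * eps * (1 - gamma)))

def iterate_error_bound(k, m, gamma, v_min, v_max):
    """Right-hand side of the second bound in Lemma 5.30."""
    span = v_max - v_min
    return gamma ** k * span + 3 * span / (2 * m * (1 - gamma))

m = particles_needed(eps=0.1, gamma=0.9, v_min=0.0, v_max=10.0)
print(m)                                                     # particles for fixed-point error <= 0.1
print(iterate_error_bound(k=50, m=m, gamma=0.9, v_min=0.0, v_max=10.0))
```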
5.11 Designing Distributional Dynamic Programming Algorithms
Although both categorical and quantile dynamic programming algorithms arise from natural choices, there is no reason to believe that they lead to the best approximation of the return-distribution function for a fixed number of parameters. For example, can the error be reduced by using an m/2-particle representation instead? Are there measurable characteristics of the environment that could guide the choice of distribution representation?
As a guide to further investigations, Table 5.1 summarises the desirable properties of representations and projections that were studied in this chapter, as they pertain to the design of distributional dynamic programming algorithms. Because these properties arise from the combination of a representation with a particular projection, every such combination is likely to exhibit a different set of properties. This highlights how new distributional dynamic programming algorithms with desirable characteristics may be produced by simply trying out different combinations; for example, one can imagine a new algorithm based on the quantile representation whose projection is a non-expansion in the 1- or 2-Wasserstein distances.
Included in this table is whether a representation is mean-preserving. We say that a representation ℱ and its associated projection operator Π_ℱ are mean-preserving if for any distribution ν from a suitable space, the mean of Π_ℱ ν is the same as that of ν. The m-categorical algorithm presented in this chapter is mean-preserving provided the range of considered returns does not exceed the boundaries of its support; the m-quantile algorithm is not. In addition to the mean, we can also consider what happens to other aspects of the approximated distributions. For example, the m-categorical algorithm produces probability distributions that in general have more variance than those of the true return function.
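The mean-preservation property is easy to check numerically. The sketch below (illustrative code, not the book's; it assumes evenly spaced atoms and a distribution whose support lies inside [θ₁, θ_m]) projects the same finite distribution with the triangular-kernel categorical projection and with the m-quantile projection, and compares the resulting means.

```python
import numpy as np

def categorical_projection(values, probs, thetas):
    """Triangular-kernel projection: each particle's mass is split between its
    two neighbouring atoms in proportion to proximity (Section 5.6)."""
    stride = thetas[1] - thetas[0]
    pos = (np.clip(values, thetas[0], thetas[-1]) - thetas[0]) / stride
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(thetas) - 1)
    p = np.zeros(len(thetas))
    np.add.at(p, lo, probs * (1 - (pos - lo)))
    np.add.at(p, hi, probs * (pos - lo))
    return p

def quantile_projection(values, probs, m):
    """m-quantile projection: equal mass 1/m at the quantile midpoints."""
    order = np.argsort(values)
    v, cum = np.asarray(values)[order], np.cumsum(np.asarray(probs)[order])
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)
    return v[np.minimum(np.searchsorted(cum, taus), len(v) - 1)]

rng = np.random.default_rng(0)
values = rng.uniform(-5.0, 5.0, size=100)       # support inside [theta_1, theta_m]
probs = rng.dirichlet(np.ones(100))
thetas = np.linspace(-10.0, 10.0, 21)

print(np.dot(values, probs))                                           # true mean
print(np.dot(thetas, categorical_projection(values, probs, thetas)))   # identical (up to rounding)
print(np.mean(quantile_projection(values, probs, m=21)))               # generally differs
```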
The choice of representation also has consequences beyond distributional
dynamic programming. In the next chapter, for example, we will consider
the design of incremental algorithms for learning return-distribution functions
from samples. There, we will see that the projected operator derived from the
categorical representation directly translates into an incremental algorithm,
while the operator derived from the quantile representation does not. In Chapter
9 we will also see how the choice of representation interplays with the use of
parametrised models such as neural networks to represent the return function
compactly. Another consideration, not listed here, concerns the sensitivity of the algorithm to its parameters: for example, for small values of m, the m-quantile representation tends to be a better choice than the m-categorical representation, which suffers from large gaps between its particles' locations (as an extreme example, take m = 2).
                   T^π-closed   Tractable   Expressive   Diffusion-free   Mean-preserving
Empirical DP           ✓                        ✓             ✓                 ✓
Normal DP                           ✓                         ✓                 ✓
Categorical DP                      ✓           ✓                               ✓*
Quantile DP                         ✓           ✓             ✓

Table 5.1
Desirable characteristics of distributional dynamic programming algorithms. “Empirical DP” and “Normal DP” refer to distributional dynamic programming with the empirical and normal distributions, respectively (Sections 5.3 and 5.4). While no known representation-projection pair satisfies all of these, the categorical-Cramér and quantile-w₁ choices offer a good compromise. * The categorical representation is mean-preserving provided its support spans the range of possible returns.
The design and study of representations remains an active topic in distributional
reinforcement learning. The representations we presented here are by no means
an exhaustive portrait of the field. For example, Barth-Maron et al. [2018]
considered using mixtures of m normal distributions in domains with vector-
valued action spaces. Analysing representations like these is more challenging
because known projection methods suffer from local minima, which in turn
implies that dynamic programming may give different (and possibly suboptimal)
solutions for different initial conditions.
5.12 Technical Remarks
Remark 5.1 (Finiteness of ℛ). In the algorithms presented in this chapter, we assumed that rewards are distributed on a finite set ℛ. This is not actually needed for most of our analysis, but makes it possible to compute expectations and convolutions in finite time and hence devise concrete dynamic programming algorithms. However, there are many problems in which the rewards are better modelled using continuous or unbounded distributions. Rewards generated from observations of a physical process are often well-modelled by a normal distribution to account for sensor noise. Rewards derived from a queuing process, such as the number of customers that make a purchase at an ice cream shop in a given time interval, can be modelled by a Poisson distribution.
There are a few ways to extend distributional dynamic programming to con-
tinuous reward distributions. One approach is to avoid explicitly enumerating
possible rewards and instead draw samples from the relevant reward distribution;
we will see in Chapter 6 how to derive incremental algorithms that approximate
the return-distribution function from samples. Another approach is to represent
reward distributions symbolically and implement the operations of distributional
dynamic programming as symbol manipulation. In most modern programming
languages, this can be done using object-oriented or functional programming,
for example using callables in Python or function pointers in C++. Yet another
approach is to approximate the distributional Bellman operator by numerical
integration; this idea is developed in Exercise 5.9. △
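As a rough sketch of the symbolic approach (illustrative only; the function names below are not from the book), a distribution can be represented by its CDF as a Python callable, and the shift, scale, and mixture operations needed by a distributional Bellman backup then return new callables rather than finite tables. For simplicity, the example treats the reward on each transition as deterministic.

```python
import math

def normal_cdf(mu, sigma):
    """CDF of a normal distribution, as a callable."""
    return lambda z: 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

def pushforward(cdf, reward, gamma):
    """CDF of reward + gamma * Z, when Z has the given CDF (gamma > 0)."""
    return lambda z: cdf((z - reward) / gamma)

def mixture(cdfs, weights):
    """CDF of a finite mixture."""
    return lambda z: sum(w * c(z) for c, w in zip(cdfs, weights))

# One Bellman backup for a state with two equally likely successor states,
# reward 1 on both transitions, and gamma = 0.9.
successor_cdfs = [normal_cdf(0.0, 1.0), normal_cdf(3.0, 0.5)]
backed_up = mixture([pushforward(c, reward=1.0, gamma=0.9) for c in successor_cdfs],
                    [0.5, 0.5])
print(backed_up(2.0))   # the resulting CDF can be evaluated wherever it is needed
```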
Remark 5.2 (NP-hardness of computing return distributions). Recall that a problem is said to be NP-hard if its solution can also be used to solve all problems in the class NP, by means of a polynomial-time reduction [Cormen et al., 2001]. This remark illustrates how the problem of computing certain aspects of the return-distribution function for a given Markov decision process is NP-hard, by reduction from one of Karp's original NP-complete problems, the Hamiltonian cycle problem. Reductions from the Hamiltonian cycle problem have previously been used to prove the NP-hardness of a variety of problems relating to constrained Markov decision processes [Feinberg, 2000].
Let G = (V, E) be a graph, with V = {1, …, n} for some n ∈ ℕ⁺. The Hamiltonian cycle problem asks whether there is a permutation of vertices σ : V → V such that {σ(i), σ(i + 1)} ∈ E for each i ∈ [n − 1], and {σ(n), σ(1)} ∈ E. Now consider an MDP whose state space is the set of integers from 1 to n, denoted 𝒳 = {1, …, n}, and with a singleton action set 𝒜 = {a}, a transition kernel that encodes a random walk over the graph G, and a uniform initial state distribution. Further, specify reward distributions as P_R(· | x, a) = δ_x for each x ∈ 𝒳, and set γ < 1/(n + 2). For such a discount factor, there is a one-to-one mapping between trajectories and returns. As the action set is a singleton, there is a single policy π. It can be shown that there is a Hamiltonian cycle in G if and only if the support of the return distribution has non-empty intersection with the set

⋃_{σ∈S_n} [ Σ_{t=0}^{n−1} γ^t σ(t + 1) + γ^n σ(1) ,  Σ_{t=0}^{n−1} γ^t σ(t + 1) + γ^n (σ(1) + 1) ] .    (5.29)
Since the MDP description is clearly computable in polynomial time from the graph specifying the Hamiltonian cycle problem, it must therefore be the case that ascertaining whether the set in Expression 5.29 has non-empty intersection with the return distribution support is NP-hard. The full proof is left as Exercise 5.21. △
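For concreteness, the following sketch constructs the MDP used in the reduction from a given graph (hypothetical helper code, assuming every vertex has at least one incident edge); the hard part, of course, is deciding anything about the resulting return distribution.

```python
import numpy as np

def reduction_mdp(edges, n):
    """MDP of Remark 5.2 for a graph on vertices {1, ..., n}: a single action,
    a random-walk transition kernel, deterministic reward x in state x,
    uniform initial distribution, and a discount factor below 1 / (n + 2)."""
    adjacency = np.zeros((n, n))
    for i, j in edges:                                    # vertices are 1-indexed
        adjacency[i - 1, j - 1] = adjacency[j - 1, i - 1] = 1.0
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    rewards = np.arange(1, n + 1, dtype=float)            # P_R(. | x) = delta_x
    gamma = 1.0 / (n + 3)                                 # any value below 1/(n + 2)
    initial = np.full(n, 1.0 / n)
    return transition, rewards, gamma, initial

P, r, gamma, xi0 = reduction_mdp([(1, 2), (2, 3), (3, 4), (4, 1)], n=4)
print(P, gamma)
```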
Remark 5.3 (Existence of d-projections). As an example of a setting where d-projections do not exist, let θ₁, …, θ_m ∈ ℝ and consider the softmax categorical representation

ℱ = { Σ_{i=1}^m ( e^{φ_i} / Σ_{j=1}^m e^{φ_j} ) δ_{θ_i} : φ₁, …, φ_m ∈ ℝ } .

For the distribution δ_{θ₁} and metric d = w₂, there is no optimal (ℱ, d)-approximation to δ_{θ₁}, since for any distribution in ℱ, there is another distribution in ℱ which lies closer to δ_{θ₁} with respect to w₂. The issue is that there are elements of the representation which are arbitrarily close to the distribution to be projected, but there is no distribution in the representation that dominates all others as in Equation 5.23. An example of a sufficient condition for the existence of an optimal (ℱ, d)-approximation to ν̂ ∈ 𝒫_d(ℝ) is that the metric space (ℱ, d) has the Bolzano-Weierstrass property, also known as sequential compactness: for any sequence (ν_k)_{k≥0} in ℱ, there is a subsequence that converges to a point in ℱ with respect to d. When this assumption holds, we may take the sequence (ν_k)_{k≥0} in ℱ to satisfy

d(ν_k, ν̂) ≤ inf_{ν∈ℱ} d(ν, ν̂) + 1/(k + 1) .

Using the Bolzano-Weierstrass property, we may pass to a subsequence (ν_{k_n})_{n≥0} converging to some ν* ∈ ℱ. We then observe

d(ν*, ν̂) ≤ d(ν*, ν_{k_n}) + d(ν_{k_n}, ν̂) → inf_{ν∈ℱ} d(ν, ν̂) ,

showing that ν* is an optimal (ℱ, d)-approximation to ν̂. The softmax representation (under the w₂ metric) fails to have the Bolzano-Weierstrass property. As an example, the sequence of distributions ( (k/(k+1)) δ_{θ₁} + (1/(k+1)) δ_{θ₂} )_{k≥0} in ℱ has no subsequence that converges to a point in ℱ. △
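A small numerical illustration of the failure (illustrative code only): against the Dirac δ_{θ₁}, the w₂ distance of a softmax categorical distribution can be made arbitrarily small by growing the first logit, but it is never zero, so the infimum is not attained.

```python
import numpy as np

def softmax_probs(logits):
    w = np.exp(logits - np.max(logits))
    return w / w.sum()

def w2_to_dirac(probs, thetas, target):
    """2-Wasserstein distance to delta_target: every atom must be moved to target."""
    return np.sqrt(np.sum(probs * (np.asarray(thetas) - target) ** 2))

thetas = np.array([0.0, 1.0])
for scale in [1.0, 5.0, 20.0, 100.0]:
    probs = softmax_probs(np.array([scale, 0.0]))
    print(scale, w2_to_dirac(probs, thetas, target=thetas[0]))
# The distances tend to 0, yet every member of the representation puts strictly
# positive mass on theta_2, so no member attains the infimum.
```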
Remark 5.4 (Proof of Proposition 5.14). We will demonstrate the following Pythagorean identity. For any ν ∈ 𝒫_{ℓ₂}(ℝ) and ν′ ∈ ℱ_{C,m}, we have

ℓ₂²(ν, ν′) = ℓ₂²(ν, Π_C ν) + ℓ₂²(Π_C ν, ν′) .

From this, it follows that ν′ = Π_C ν is the unique ℓ₂-projection of ν onto ℱ_{C,m}, since this choice of ν′ uniquely minimises the right-hand side. To show this identity, we first establish an interpretation of the cumulative distribution function (CDF) of Π_C ν as averaging the CDF values of ν on each interval (θ_i, θ_{i+1}), for i = 1, …, m − 1. First note that for z ∈ (θ_i, θ_{i+1}], we have

h_i(ς_m^{-1}(z − θ_i)) = 1 − ς_m^{-1}|z − θ_i| = (θ_{i+1} − θ_i + θ_i − z) / (θ_{i+1} − θ_i) = (θ_{i+1} − z) / (θ_{i+1} − θ_i) .

Now, for i = 1, …, m − 1, we have

F_{Π_C ν}(θ_i) = Σ_{j=1}^i E_{Z∼ν}[ h_j(ς_m^{-1}(Z − θ_j)) ]
= E_{Z∼ν}[ 𝟙{Z ≤ θ_i} + 𝟙{Z ∈ (θ_i, θ_{i+1}]} (θ_{i+1} − Z)/(θ_{i+1} − θ_i) ]
= (1 / (θ_{i+1} − θ_i)) ∫_{θ_i}^{θ_{i+1}} F_ν(z) dz .
Now note

ℓ₂²(ν, ν′) = ∫_{−∞}^{∞} (F_ν(z) − F_{ν′}(z))² dz
(a) = ∫_{−∞}^{∞} (F_ν(z) − F_{Π_C ν}(z))² dz + ∫_{−∞}^{∞} (F_{Π_C ν}(z) − F_{ν′}(z))² dz
      + 2 ∫_{−∞}^{∞} (F_ν(z) − F_{Π_C ν}(z))(F_{Π_C ν}(z) − F_{ν′}(z)) dz
= ℓ₂²(ν, Π_C ν) + ℓ₂²(Π_C ν, ν′)
      + 2 ( ∫_{−∞}^{θ₁} + Σ_{i=1}^{m−1} ∫_{θ_i}^{θ_{i+1}} + ∫_{θ_m}^{∞} ) (F_ν(z) − F_{Π_C ν}(z))(F_{Π_C ν}(z) − F_{ν′}(z)) dz
(b) = ℓ₂²(ν, Π_C ν) + ℓ₂²(Π_C ν, ν′) ,

establishing the identity as required. Here, (a) follows by adding and subtracting F_{Π_C ν} inside the parentheses and expanding. (b) follows by noting that on (−∞, θ₁) and (θ_m, ∞), F_{Π_C ν} = F_{ν′}, and on each interval (θ_i, θ_{i+1}), F_{Π_C ν} − F_{ν′} is constant and F_{Π_C ν} is constant and equals the average of F_ν on the interval, meaning that

∫_{θ_i}^{θ_{i+1}} (F_ν(z) − F_{Π_C ν}(z))(F_{Π_C ν}(z) − F_{ν′}(z)) dz
= (F_{Π_C ν}(θ_i) − F_{ν′}(θ_i)) ∫_{θ_i}^{θ_{i+1}} (F_ν(z) − F_{Π_C ν}(z)) dz = 0 ,

as required. △
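The identity can also be checked numerically. The sketch below (illustrative; it uses the two-neighbour form of the categorical projection and approximates the Cramér distance by a Riemann sum on a fine grid) compares the two sides of the Pythagorean identity for a random ν and a random ν′ ∈ ℱ_{C,m}.

```python
import numpy as np

def cdf_on_grid(values, probs, grid):
    """CDF of a finitely-supported distribution evaluated at the grid points."""
    order = np.argsort(values)
    v, cum = np.asarray(values)[order], np.cumsum(np.asarray(probs)[order])
    idx = np.searchsorted(v, grid, side="right")
    return np.concatenate([[0.0], cum])[idx]

def cramer_sq(values1, probs1, values2, probs2, grid):
    """Riemann-sum approximation of the squared Cramer distance."""
    diff = cdf_on_grid(values1, probs1, grid) - cdf_on_grid(values2, probs2, grid)
    return np.sum(diff ** 2) * (grid[1] - grid[0])

def categorical_projection(values, probs, thetas):
    """Triangular-kernel projection onto evenly spaced atoms."""
    stride = thetas[1] - thetas[0]
    pos = (np.clip(values, thetas[0], thetas[-1]) - thetas[0]) / stride
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(thetas) - 1)
    p = np.zeros(len(thetas))
    np.add.at(p, lo, probs * (1 - (pos - lo)))
    np.add.at(p, hi, probs * (pos - lo))
    return p

rng = np.random.default_rng(1)
thetas = np.linspace(-5.0, 5.0, 11)
grid = np.linspace(-6.0, 6.0, 20001)
nu_v, nu_p = rng.uniform(-5.0, 5.0, size=50), rng.dirichlet(np.ones(50))
proj_p = categorical_projection(nu_v, nu_p, thetas)     # Pi_C nu
other_p = rng.dirichlet(np.ones(len(thetas)))           # an arbitrary nu' in F_{C,m}

lhs = cramer_sq(nu_v, nu_p, thetas, other_p, grid)
rhs = (cramer_sq(nu_v, nu_p, thetas, proj_p, grid)
       + cramer_sq(thetas, proj_p, thetas, other_p, grid))
print(lhs, rhs)   # agree up to the numerical integration error
```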
Remark 5.5 (Proof of Lemma 5.16). Assume that F_ν^{-1} is continuous at τ = (a + b)/2; this is not necessary but simplifies the proof. For any τ,

θ ↦ |F_ν^{-1}(τ) − θ|    (5.30)

is a convex function, and hence so is Equation 5.26. A subgradient⁴⁴ for Equation 5.30 is

g_τ(θ) = −1 if θ < F_ν^{-1}(τ),   +1 if θ > F_ν^{-1}(τ),   0 if θ = F_ν^{-1}(τ).

A subgradient for Equation 5.26 is therefore

g_{[a,b]}(θ) = ∫_{τ=a}^{b} g_τ(θ) dτ = ∫_{τ=a}^{F_ν(θ)} 1 dτ + ∫_{τ=F_ν(θ)}^{b} (−1) dτ = (F_ν(θ) − a) − (b − F_ν(θ)).

Setting the subgradient to zero and solving for θ,

0 = 2F_ν(θ) − (a + b)  ⟹  F_ν(θ) = (a + b)/2  ⟹  θ = F_ν^{-1}((a + b)/2). △

44. For a convex function f : ℝ → ℝ, we say that g : ℝ → ℝ is a subgradient for f if for all z₁, z₂ ∈ ℝ, we have f(z₂) ≥ f(z₁) + g(z₁)(z₂ − z₁).
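A quick numerical check of Lemma 5.16 (illustrative code; it uses the standard library's NormalDist for the inverse CDF and a crude grid search): over a quantile band [a, b], the minimiser of the integrated absolute error against the quantile function is the quantile at the band's midpoint.

```python
import numpy as np
from statistics import NormalDist

F_inv = np.vectorize(NormalDist().inv_cdf)      # quantile function of a standard normal
a, b = 0.3, 0.8
taus = np.linspace(a, b, 2001)[1:-1]            # integration grid strictly inside (a, b)

def objective(theta):
    # approximates the integral of |F^{-1}(tau) - theta| over [a, b]
    return np.mean(np.abs(F_inv(taus) - theta)) * (b - a)

candidates = np.linspace(-1.0, 2.0, 3001)
best = candidates[np.argmin([objective(t) for t in candidates])]
print(best, F_inv((a + b) / 2))                 # the two values agree up to the grid resolution
```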
Remark 5.6 (Proof of Lemma 5.23). The argument follows the standard proof that finite-dimensional Euclidean projections onto closed convex sets are non-expansions. Throughout, we write Π for Π_{ℱ,ℓ₂} for conciseness. We begin with the observation that for any ν ∈ 𝒫_{ℓ₂}(ℝ) and ν′ ∈ ℱ, we have

∫_ℝ (F_ν(z) − F_{Πν}(z))(F_{ν′}(z) − F_{Πν}(z)) dz ≤ 0 ,    (5.31)

since if not, we have (1 − ε)Πν + εν′ ∈ ℱ for all ε ∈ (0, 1) by convexity, and

ℓ₂²(ν, (1 − ε)Πν + εν′) = ∫_ℝ (F_ν(z) − (1 − ε)F_{Πν}(z) − εF_{ν′}(z))² dz
= ℓ₂²(ν, Πν) − 2ε ∫_ℝ (F_ν(z) − F_{Πν}(z))(F_{ν′}(z) − F_{Πν}(z)) dz + O(ε²).

This final line must be at least as great as ℓ₂²(ν, Πν) for all ε ∈ (0, 1), by definition of Π. It must therefore be the case that Inequality 5.31 holds, since if not, we could select ε > 0 sufficiently small to make ℓ₂²(ν, (1 − ε)Πν + εν′) smaller than ℓ₂²(ν, Πν).
Now take ν₁, ν₂ ∈ 𝒫_{ℓ₂}(ℝ). Applying the above inequality twice yields

∫_ℝ (F_{ν₁}(z) − F_{Πν₁}(z))(F_{Πν₂}(z) − F_{Πν₁}(z)) dz ≤ 0,
∫_ℝ (F_{ν₂}(z) − F_{Πν₂}(z))(F_{Πν₁}(z) − F_{Πν₂}(z)) dz ≤ 0.

Adding these inequalities then yields

∫_ℝ (F_{ν₁}(z) − F_{ν₂}(z) + F_{Πν₂}(z) − F_{Πν₁}(z))(F_{Πν₂}(z) − F_{Πν₁}(z)) dz ≤ 0
⟹ ℓ₂²(Πν₁, Πν₂) + ∫_ℝ (F_{ν₁}(z) − F_{ν₂}(z))(F_{Πν₂}(z) − F_{Πν₁}(z)) dz ≤ 0
⟹ ℓ₂²(Πν₁, Πν₂) ≤ ∫_ℝ (F_{ν₁}(z) − F_{ν₂}(z))(F_{Πν₁}(z) − F_{Πν₂}(z)) dz .

Applying the Cauchy-Schwarz inequality to the remaining integral then yields

ℓ₂²(Πν₁, Πν₂) ≤ ℓ₂(ν₁, ν₂) ℓ₂(Πν₁, Πν₂),

from which the result follows by rearranging. △
Remark 5.7 (Proof of Banach's fixed point theorem (Theorem 5.26)). The map O has at most one fixed point by Proposition 4.5. It therefore suffices to exhibit a fixed point in M. Let β ∈ [0, 1) be the contraction modulus of O, and let U₀ ∈ M. Consider the sequence (U_k)_{k≥0} defined by U_{k+1} = O U_k for k ≥ 0. For any l > k, we have

d(U_l, U_k) ≤ β^k d(U_{l−k}, U₀)
≤ β^k Σ_{j=0}^{l−k−1} d(U_j, U_{j+1})
≤ β^k Σ_{j=0}^{l−k−1} β^j d(U₁, U₀)
≤ (β^k / (1 − β)) d(U₁, U₀).

Therefore, as k → ∞, we have d(U_l, U_k) → 0, so (U_k)_{k≥0} is a Cauchy sequence. By completeness of (M, d), (U_k)_{k≥0} has a limit U* ∈ M. Finally, for any k > 0 we have

d(U*, OU*) ≤ d(U*, U_k) + d(U_k, OU*) ≤ d(U*, U_k) + β d(U_{k−1}, U*),

and as d(U*, U_k) → 0 we deduce that d(U*, OU*) = 0. Hence U* is the unique fixed point of O. △
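The geometric estimate in the proof is easy to observe numerically. Below is a minimal sketch (an affine contraction on the real line, chosen purely for illustration) in which the distance of U_k to the fixed point is compared with the Cauchy bound β^k d(U₁, U₀)/(1 − β) obtained above by letting l → ∞.

```python
beta, c = 0.8, 1.0
O = lambda u: beta * u + c                  # contraction with fixed point c / (1 - beta) = 5
u, iterates = 20.0, []
for _ in range(30):
    iterates.append(u)
    u = O(u)

fixed_point = c / (1 - beta)
for k in [0, 5, 10, 20]:
    cauchy_bound = beta ** k / (1 - beta) * abs(iterates[1] - iterates[0])
    print(k, abs(iterates[k] - fixed_point), "<=", cauchy_bound)
```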
5.13 Bibliographical Remarks
5.1.
The term “dynamic programming” and the Bellman equation are due to
Bellman [1957a]. The relationship between the Bellman equation, value func-
tions, and linear systems of equations is studied at length by Puterman [2014]
and Bertsekas [2012]. Bertsekas [2011] provides a treatment of iterative policy
evaluation generalised to matrices other than discounted stochastic matrices.
The advantages of the iterative process are well documented in the field, and it plays a central role in the work of Sutton and Barto [2018], which is our source for the term “iterative policy evaluation”.
5.2.
Our notion of a probability distribution representation reflects the common
principle in machine learning of modelling distributions with simple parametric
families of distributions; see e.g. the books by MacKay [2003], Bishop [2006],
Wainwright and Jordan [2008], Murphy [2012]. The Bernoulli representation
was introduced in the context of distributional reinforcement learning, mostly
as a curio, by Bellemare et al. [2017a]. Gaussian approximations have been
extensively used in reinforcement learning, often in Bayesian settings [Dearden
et al., 1998, Engel et al., 2003, Morimura et al., 2010b, Lee et al., 2013]. Similar
to Equation 5.10, Morimura et al. [2010a] presents a distributional Bellman
operator in terms of cumulative distribution functions; see also Chung and Sobel
[1987].
5.3.
Li et al. [2021] consider what is effectively distributional dynamic program-
ming with the empirical representation; they show that in the undiscounted,
finite-horizon setting with reward distributions supported on finitely-many
integers, exact computation is tractable. Our notion of the empirical represen-
tation has roots in particle filtering [Doucet et al., 2001, Robert and Casella,
2004, Brooks et al., 2011, Doucet and Johansen, 2011] and is also a repre-
sentation in modern variational inference algorithms [Liu and Wang, 2016].
The NP-hardness result (properly discussed in Remark 5.2) is new as given
here, but Mannor and Tsitsiklis [2011] give a related result in the context of
mean-variance optimisation.
5.4.
Sobel [1982] is usually noted as the source of the Bellman equation for the
variance and its associated operator. The equation plays an important role in the-
oretical exploration [Lattimore and Hutter, 2012, Azar et al., 2013]. Tamar et al.
[2016] studies the variance equation in the context of function approximation.
5.5.
The m-categorical representation was used in a distributional setting by
Bellemare et al. [2017a], inspired by the success of categorical representations
in generative modelling [Van den Oord et al., 2016]. Dabney et al. [2018b]
introduced the quantile representation to avoid the inefficiencies in using a fixed
set of evenly-spaced locations, as well as deriving an algorithm more closely
grounded in the Wasserstein distances.
Morimura et al. [2010a] used the m-particle representation to design a risk-
sensitive distributional reinforcement learning algorithm. In a similar vein,
Maddison et al. [2017] used the same representation in the context of exponen-
tial utility reinforcement learning. Both approaches are closely related to particle
filtering and sequential Monte Carlo methods [Gordon et al., 1993, Doucet et al.,
2001, Särkkä, 2013, Naesseth et al., 2019, Chopin and Papaspiliopoulos, 2020],
which rely on stochastic sampling and resampling procedures, by contrast to
the deterministic dynamic programming methods of this chapter.
5.6, 5.8.
The categorical projection was originally proposed as an ad-hoc solu-
tion to address the need to map the output of the distributional Bellman operator
back onto the support of the distribution. Its description as the expectation of a
triangular kernel was shown by Rowland et al. [2018], justifying its use from
a theoretical perspective and providing the proof of Proposition 5.14. Lemma
5.16 is due to Dabney et al. [2018b].
5.7, 5.9–5.10. The language and analysis of projected operators is inherited from the theoretical analysis of linear function approximation in reinforcement learning; a canonical exposition may be found in Tsitsiklis and Van Roy [1997] and Lagoudakis and Parr [2003]. Because the space of probability distributions is not a vector space, the analysis is somewhat different, and among other things requires more technical care (as discussed in Chapter 4). A version of Proposition 5.28 in the special case of CDP appears in Rowland et al. [2018]. Of note, in the linear function approximation setting the main technical argument revolves around the noncontractive nature of the stochastic matrix P^π in a weighted L² norm, whereas here it is due to the c-homogeneity of the analysis metric (and does not involve P^π). See Chapter 9.
5.11.
A discussion of the mean-preserving property is given in Lyle et al. [2019].
5.14 Exercises
Exercise 5.1. Suppose that a Markov decision process is acyclic, in the sense that for any policy π and non-terminal state x ∈ 𝒳,

P^π(X_t = x | X₀ = x) = 0 for all t > 0.

Consider applying iterative policy evaluation to this MDP, beginning with the initial condition V(x) = 0 for all terminal states x ∈ 𝒳. Show that it converges to V^π in a finite number of iterations K, and give a bound on K. △
Exercise 5.2. Show that the normal representation (Definition 5.10) is closed under the distributional Bellman operator T^π under the following conditions:
(i) The policy is deterministic: π(a | x) ∈ {0, 1};
(ii) The transition function is deterministic: P_X(x′ | x, a) ∈ {0, 1}; and
(iii) The rewards are normally distributed, P_R(· | x, a) = N(μ_{x,a}, σ²_{x,a}). △
Exercise 5.3. In Algorithm 5.1 we made use of the fact that the return from a terminal state is 0, arguing that this is a more effective procedure.
(i) Explain how this change affects the output of categorical and quantile dynamic programming, compared to the algorithm that explicitly maintains and computes a return-distribution estimate for the terminal state.
(ii) Explain how this change affects the analysis given in Section 5.9 onwards. △
Exercise 5.4. Provide counterexamples showing that if any of the conditions of the previous exercise does not hold, then the normal representation may not be closed under T^π. △
Exercise 5.5. Consider a probability distribution ν ∈ 𝒫₂(ℝ) with probability density f_ν. For a normal distribution ν′ ∈ ℱ_N with probability density f_{ν′}, define the Kullback-Leibler divergence

KL(ν ‖ ν′) = ∫_ℝ f_ν(z) log ( f_ν(z) / f_{ν′}(z) ) dz .

Show that the normal distribution ν̂ = N(μ, σ²) minimising KL(ν ‖ ν̂) has parameters given by

μ = E_{Z∼ν}[Z],    σ² = E_{Z∼ν}[(Z − μ)²] .    (5.32) △
Exercise 5.6. Consider again a probability distribution ν ∈ 𝒫₂(ℝ) with instantiation Z, and for a normal distribution ν′ ∈ ℱ_N define the cross-entropy loss

CE(ν, ν′) = E_{Z∼ν}[ −log f_{ν′}(Z) ] .

Show that the normal distribution ν′ that minimises this cross-entropy loss has parameters given by Equation 5.32. Contrasting with the preceding exercise, explain why this result applies even when ν does not have a probability density. △
Exercise 5.7. Prove Lemma 5.6 from the definition of the pushforward distribution. △
Exercise 5.8. The naive implementation of Algorithm 5.1 requires O(N_𝒳^{k+1} N_ℛ^k) memory to perform the k-th iteration. Describe an implementation that reduces this cost to O(N_𝒳 N_ℛ^k). △
Exercise 5.9. Consider a Markov decision process in which the rewards are normally distributed:

P_R(· | x, a) = N(μ(x, a), σ²(x, a)),  for x ∈ 𝒳, a ∈ 𝒜.

Suppose that we represent our distributions with the m-categorical representation, with projection in Cramér distance Π_C:

η(x) = Σ_{i=1}^m p_i(x) δ_{θ_i} .

(i) Show that

(Π_C T^π η)(x) = Σ_{a∈𝒜} Σ_{x′∈𝒳} π(a | x) P_X(x′ | x, a) Σ_{i=1}^m p_i(x′) E[ Π_C δ_{R+γθ_i} | X = x, A = a ] .

(ii) Construct a numerical integration scheme that approximates the terms

E[ Π_C δ_{R+γθ_i} | X = x, A = a ]

to ε precision on individual probabilities, for any x, a.
(iii) Use this scheme to construct a distributional dynamic programming algorithm for Markov decision processes with normally distributed rewards. What is its per-iteration computational cost?
(iv) Suppose that the support θ₁, …, θ_m is evenly spaced on [θ_MIN, θ_MAX], that σ²(x, a) = 1 for all x, a, and additionally that μ(x, a) ∈ [(1 − γ)θ_MIN, (1 − γ)θ_MAX]. Analogous to the results of Section 5.10, bound the approximation error resulting from this algorithm, as a function of m, ε, and the number of iterations K. △
Exercise 5.10. Prove that the deterministic projection of Section 3.5, defined in terms of a map to two neighbouring particles, is equivalent to the triangular kernel formulation presented in Section 5.6. △
Exercise 5.11. Show that each of the representations of Definitions 5.11–5.13 is complete with respect to w_p, for p ∈ [1, ∞]. That is, for ℱ ∈ {ℱ_{Q,m}, ℱ_{C,m}, ℱ_{E,m}}, show that if ν ∈ 𝒫_p(ℝ) is the limit of a sequence of probability distributions (ν_k)_{k≥0} in ℱ, then ν ∈ ℱ. △
Exercise 5.12. Show that the empirical representation ℱ_E (Definition 5.5) is not complete with respect to the 1-Wasserstein distance. Explain, concretely, why this causes difficulties in defining a distance-based projection onto ℱ_E. △
Exercise 5.13. Explain what happens if the inverse cumulative distribution function F_ν^{-1} is not continuous in Lemma 5.16. When does that arise? What are the implications for the w₁-projection onto the m-quantile representation? △
Exercise 5.14. Implement distributional dynamic programming with the empirical representation in the programming language of your choice. Apply it to the deterministic Markov decision process depicted in Figure 5.6, for a finite number of iterations K, beginning with η₀(x) = δ₀. Plot the supremum 1-Wasserstein distance between the iterates η_k and η^π as a function of k. △
Exercise 5.15. Prove that the m-categorical representation ℱ_{C,m} is convex and complete with respect to the Cramér distance ℓ₂, for all m ∈ ℕ⁺. △
Exercise 5.16. Show that the projected Bellman operator, instantiated with the m-quantile representation ℱ_{Q,m} and the w₁-projection, is not a contraction in w̄₁. △
Exercise 5.17. Consider the metric space given by the open interval (0, 1) equipped with the Euclidean metric on the real line, d(x, y) = |x − y|.
(i) Show that this metric space is not complete.
(ii) Construct a simple contraction map on this metric space that has no fixed point, illustrating the necessity of the condition of completeness in Banach's fixed point theorem. △
Exercise 5.18. Let p = √2/2, and consider the probability distribution

ν = (1 − p)δ₀ + pδ₁ .

Show that, for all m ∈ ℕ⁺, the projection of ν onto the m-quantile representation, written Π_Q ν, is such that

w_∞(Π_Q ν, ν) = 1.

Comment on the utility of the supremum-w_∞ metric in analysing the convergence and the quality of the fixed point of quantile dynamic programming. △
Exercise 5.19. Consider the Bernoulli representation:

{ p δ₀ + (1 − p) δ₁ : p ∈ [0, 1] } ;

this is also the categorical representation with m = 2, θ₁ = 0, θ₂ = 1. Consider the distributional dynamic programming algorithm obtained by combining this representation with a w₁-projection.
(i) Describe the projection operator mathematically.
(ii) Consider a two-state, single-action MDP with states x, y, with transition dynamics such that each state transitions immediately to the other, and rewards that are deterministically 0. Show that with γ > 1/2, the distributional dynamic programming operator defined above is not a contraction mapping, and in particular has multiple fixed points. △
Exercise 5.20. The purpose of this exercise is to develop a guarantee on the approximation error incurred by the fixed point η̂_Q of the projected distributional Bellman operator Π_Q T^π and the m-quantile representation; a version of this analysis originally appeared in Rowland et al. [2019]. Here, we write 𝒫_B(ℝ) for the space of probability distributions bounded on [V_MIN, V_MAX].

(i) For a distribution ν ∈ 𝒫_B(ℝ) and the quantile projection Π_Q : 𝒫₁(ℝ) → ℱ_{Q,m}, show that we have

w₁(Π_Q ν, ν) ≤ (V_MAX − V_MIN) / (2m) .

(ii) Hence, using the triangle inequality, show that for any ν, ν′ ∈ 𝒫_B(ℝ), we have

w₁(Π_Q ν, Π_Q ν′) ≤ w₁(ν, ν′) + (V_MAX − V_MIN) / m .

Thus, while Π_Q is not a non-expansion under w₁, per Exercise 5.16, it is in some sense ‘not far’ from satisfying this condition.

(iii) Hence, using the triangle inequality with the return functions η̂_Q, Π_Q η^π, and η^π, show that if η^π ∈ 𝒫_B(ℝ)^𝒳, then we have

w̄₁(η̂_Q, η^π) ≤ 3(V_MAX − V_MIN) / (2m(1 − γ)) .

(iv) Finally, show that w₁(ν, ν′) ≤ w_∞(ν, ν′) for any ν, ν′ ∈ 𝒫_B(ℝ). Thus, starting from the observation that

w̄₁(η_k, η^π) ≤ w̄₁(η_k, η̂_Q) + w̄₁(η̂_Q, η^π),

deduce that

w̄₁(η_k, η^π) ≤ γ^k (V_MAX − V_MIN) + 3(V_MAX − V_MIN) / (2m(1 − γ)) . △
Exercise 5.21. The aim of this exercise is to fill out the details in the reduction described in Remark 5.2.

(i) Consider an MDP where r(x, a) > 0 is the (deterministic) reward received from choosing action a in state x (that is, the distribution P_R(· | x, a) is a Dirac delta at r(x, a)). Deduce that if the values r(x, a) are distinct across state-action pairs, and the discount factor satisfies

γ < R_MIN / (R_MIN + R_MAX),

then there is an injective mapping from trajectories to returns.

(ii) Hence show that for the class of MDPs described in Remark 5.2, the trajectories whose returns lie in the set

⋃_{σ∈S_n} [ Σ_{t=0}^{n−1} γ^t σ(t + 1) + γ^n σ(1) ,  Σ_{t=0}^{n−1} γ^t σ(t + 1) + γ^n (σ(1) + 1) ]

are precisely the trajectories whose initial n + 1 states correspond to a Hamiltonian cycle. △