4
Operators and Metrics
Anyone who has learned to play a musical instrument knows that practice makes
perfect. Along the way, however, one’s ability at playing a difﬁcult passage
usually varies according to a number of factors. On occasion, something that
could be played easily the day before now seems insurmountable. The adage
expresses an abstract notion, that practice improves performance on average
or over a long period of time, rather than a concrete statement about
instantaneous ability.
In the same way, reinforcement learning algorithms deployed in real situations
behave differently from moment to moment. Variations arise due to different
initial conditions, speciﬁc choices of parameters, hardware non-determinism, or
simply because of randomness in the agent’s interactions with its environment.
These factors make it hard to make precise predictions, for example about the
magnitude of the value function estimate learned by TD learning at a particular
state x and step k, other than by extensive simulations. Nevertheless, the large-
scale behaviour of TD learning is relatively predictable, sufﬁciently so that
convergence can be established under certain conditions, and convergence rates
can be derived.
This chapter introduces the language of operators as an effective abstraction
with which to study such long-term behaviour, characterise the asymptotic
properties of reinforcement learning algorithms, and eventually explain what
makes an effective algorithm. In addition to being useful in the study of existing
algorithms, operators also serve as a kind of blueprint when designing new
algorithms, from which incremental methods such as categorical temporal-
difference learning can then be derived.
4.1 The Bellman Operator
The value function $V^\pi$ characterises the expected return obtained by following a policy $\pi$, beginning in a given state $x$:
$$V^\pi(x) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x\Big].$$
The Bellman equation establishes a relationship between the expected return from one state and from its successors:
$$V^\pi(x) = \mathbb{E}_\pi\big[R + \gamma V^\pi(X') \,\big|\, X = x\big].$$
Let us now consider a state-indexed collection of real variables, written $V \in \mathbb{R}^{\mathcal{X}}$, which we call a value function estimate. By substituting $V$ for $V^\pi$ in the original Bellman equation, we obtain the system of equations
$$V(x) = \mathbb{E}_\pi\big[R + \gamma V(X') \,\big|\, X = x\big], \quad \text{for all } x \in \mathcal{X}. \qquad (4.1)$$
From Chapter 2, we know that $V^\pi$ is one solution to the above.
Are there other solutions to Equation 4.1? In this chapter we answer this question (negatively) by interpreting the right-hand side of the equation as applying a transformation to the estimate $V$. For a given realisation $(x, a, r, x')$ of the random transition, this transformation indexes $V$ by $x'$, multiplies it by the discount factor, and adds the immediate reward (this yields $r + \gamma V(x')$). The actual transformation returns the value that is obtained by following these steps, in expectation. Functions that map elements of a space onto itself, such as this one (from estimates to estimates), are called operators.
Definition 4.1. The Bellman operator is the mapping $T^\pi : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by
$$(T^\pi V)(x) = \mathbb{E}_\pi\big[R + \gamma V(X') \,\big|\, X = x\big]. \qquad (4.2)$$
Here, the notation $T^\pi V$ should be understood as "$T^\pi$, applied to $V$".
The Bellman operator gives a particularly concise way of expressing the transformation implied in Equation 4.1:
$$V = T^\pi V.$$
As we will see in later chapters, it also serves as the springboard for the design and analysis of algorithms for learning $V^\pi$.
When working with the Bellman operator, it is often useful to treat $V$ as a finite-dimensional vector in $\mathbb{R}^{\mathcal{X}}$ and to express the Bellman operator in terms of vector operations. That is, we write
$$T^\pi V = r_\pi + \gamma P_\pi V, \qquad (4.3)$$
where $r_\pi(x) = \mathbb{E}_\pi[R \mid X = x]$ and $P_\pi$ is the transition operator^{24} defined by
$$(P_\pi V)(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a)\, V(x').$$
Equation 4.3 follows from these definitions and the linearity of expectations.
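Equation 4.3 also gives a direct recipe for computing $T^\pi V$ when the model is known. The following sketch (our own illustration, with invented numbers) applies the operator in vector form to a hypothetical two-state chain, with the policy already folded into $r_\pi$ and $P_\pi$:

```python
import numpy as np

# Hypothetical two-state example: expected rewards r_pi and state-to-state
# transition matrix P_pi, with the policy pi already folded in.
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.1, 0.9]])
gamma = 0.9

def bellman_operator(V):
    """Vector form of Equation 4.3: T^pi V = r_pi + gamma * P_pi V."""
    return r_pi + gamma * P_pi @ V

# The true value function solves V = r_pi + gamma * P_pi V, so it can be
# computed in closed form as V = (I - gamma * P_pi)^{-1} r_pi ...
V_star = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
# ... and, as expected, it is left unchanged by the operator.
assert np.allclose(bellman_operator(V_star), V_star)
```

Solving the linear system here is only possible because the example is tiny and fully known; the operator view is what carries over to the learning setting.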
A vector $\tilde{V} \in \mathbb{R}^{\mathcal{X}}$ is a solution to Equation 4.1 if it remains unchanged by the transformation corresponding to the Bellman operator $T^\pi$; that is, if it is a fixed point of $T^\pi$. This means that the value function $V^\pi$ is a fixed point of $T^\pi$:
$$V^\pi = T^\pi V^\pi.$$
To demonstrate that $V^\pi$ is the only fixed point of the Bellman operator, we will appeal to the notion of contraction mappings.
4.2 Contraction Mappings
When we apply the operator $T^\pi$ to a value function estimate $V \in \mathbb{R}^{\mathcal{X}}$, we obtain a new estimate $T^\pi V \in \mathbb{R}^{\mathcal{X}}$. A characteristic property of the Bellman operator is that this new estimate is guaranteed to be closer to $V^\pi$ than $V$ (unless $V = V^\pi$, of course). In fact, as we will see in this section, applying the operator to any two estimates must bring them closer together.
To formalise what we mean by "closer", we need a way of measuring distances between value function estimates. Because these estimates can be viewed as finite-dimensional vectors, there are many well-established ways of doing so: the reader may have come across the Euclidean ($L^2$) distance, the Manhattan ($L^1$) distance, and curios such as the British Rail distance. We use the term metric to describe distances that satisfy the following standard definition.
Definition 4.2. Given a set $M$, a metric $d : M \times M \to \mathbb{R}$ is a function that satisfies, for all $U, V, W \in M$,
(a) $d(U, V) \ge 0$,
(b) $d(U, V) = 0$ if and only if $U = V$,
(c) $d(U, V) \le d(U, W) + d(W, V)$,
(d) $d(U, V) = d(V, U)$.
We call the pair $(M, d)$ a metric space.

24. It is also possible to express $P_\pi$ as a stochastic matrix, in which case $P_\pi V$ describes a matrix-vector product.
In our setting, $M$ is the space of value function estimates, $\mathbb{R}^{\mathcal{X}}$. Because we assume that there are finitely many states, this space can be equivalently thought of as the space of real-valued vectors with $N_{\mathcal{X}}$ entries, where $N_{\mathcal{X}}$ is the number of states. On this space, we measure distances in terms of the $L^\infty$ metric, defined by
$$\|V - V'\|_\infty = \max_{x \in \mathcal{X}} |V(x) - V'(x)|, \qquad V, V' \in \mathbb{R}^{\mathcal{X}}. \qquad (4.4)$$
A key result is that the Bellman operator $T^\pi$ is a contraction mapping with respect to this metric. Informally, this means that its application to different value function estimates brings them closer by at least a constant multiplicative factor, called its contraction modulus.
Definition 4.3. Let $(M, d)$ be a metric space. A function $O : M \to M$ is a contraction mapping with respect to $d$, with contraction modulus $\beta \in [0, 1)$, if for all $U, U' \in M$,
$$d(OU, OU') \le \beta\, d(U, U').$$
Proposition 4.4. The operator $T^\pi : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ is a contraction mapping with respect to the $L^\infty$ metric on $\mathbb{R}^{\mathcal{X}}$, with contraction modulus given by the discount factor $\gamma$. That is, for any two value functions $V, V' \in \mathbb{R}^{\mathcal{X}}$,
$$\|T^\pi V - T^\pi V'\|_\infty \le \gamma \|V - V'\|_\infty.$$
Proof. The proof is most easily stated in vector notation. Here, we make use of two properties of the operator $P_\pi$. First, $P_\pi$ is linear, in the sense that for any $V, V'$,
$$P_\pi V + P_\pi V' = P_\pi(V + V').$$
Second, because $(P_\pi V)(x)$ is a convex combination of elements from $V$, it must be that
$$\|P_\pi V\|_\infty \le \|V\|_\infty.$$
From here, we have
$$\begin{aligned}
\|T^\pi V - T^\pi V'\|_\infty &= \|(r_\pi + \gamma P_\pi V) - (r_\pi + \gamma P_\pi V')\|_\infty \\
&= \gamma \|P_\pi V - P_\pi V'\|_\infty \\
&= \gamma \|P_\pi (V - V')\|_\infty \\
&\le \gamma \|V - V'\|_\infty,
\end{aligned}$$
as desired.
The fact that $T^\pi$ is a contraction mapping guarantees the uniqueness of $V^\pi$ as a solution to the equation $V = T^\pi V$. As made formal by the following proposition, because the operator $T^\pi$ brings any two value functions closer together, it cannot keep more than one value function fixed.
Proposition 4.5. Let $(M, d)$ be a metric space and $O : M \to M$ be a contraction mapping. Then $O$ has at most one fixed point in $M$.
Proof. Let $\beta \in [0, 1)$ be the contraction modulus of $O$, and suppose $U, U' \in M$ are distinct fixed points of $O$, so that $d(U, U') > 0$ (following Definition 4.2). Then we have
$$d(U, U') = d(OU, OU') \le \beta\, d(U, U') < d(U, U'),$$
a contradiction. Hence no two distinct fixed points can exist.
Because we know that $V^\pi$ is a fixed point of the Bellman operator $T^\pi$, following Proposition 4.5 we deduce that there are no other such fixed points and hence no other solutions to the Bellman equation. As the phrasing of Proposition 4.5 suggests, in some metric spaces it is possible for $O$ to be a contraction mapping yet to not possess a fixed point. This can matter when dealing with return functions, as we will see in the second half of this chapter and in Chapter 5.
Example 4.6. Consider the no-loop operator
$$(T^\pi_{\mathrm{NL}} V)(x) = \mathbb{E}_\pi\big[R + \gamma V(X')\,\mathbb{1}\{X' \neq x\} \,\big|\, X = x\big],$$
where the name denotes the fact that we omit the next-state value whenever a transition from $x$ to itself occurs. By inspection, we can determine that the fixed point of this operator is
$$V^\pi_{\mathrm{NL}}(x) = \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} \gamma^t R_t \,\Big|\, X_0 = x\Big],$$
where $T$ denotes the (random) first time at which $X_T = X_{T-1}$. In words, this fixed point describes the discounted sum of rewards obtained until the first time that an action leaves the state unchanged.^{25}
Exercise 4.1 asks you to show that $T^\pi_{\mathrm{NL}}$ is a contraction mapping with modulus
$$\beta = \gamma \max_{x \in \mathcal{X}} P_\pi\big(X' \neq x \mid X = x\big).$$
Following Proposition 4.5, we deduce that this is the unique fixed point of $T^\pi_{\mathrm{NL}}$.
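To make the example concrete, here is a small numerical sketch of $T^\pi_{\mathrm{NL}}$ (our own construction, with hypothetical environment numbers). Since the example's rewards are non-negative, truncating the return at the first self-transition can only reduce the value at every state:

```python
import numpy as np

# Hypothetical two-state chain: rewards, transition matrix, discount factor.
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.1, 0.9]])
gamma = 0.9

def no_loop_operator(V):
    """(T_NL V)(x) = r(x) + gamma * sum over x' != x of P(x'|x) V(x'):
    the next-state value is dropped whenever the state transitions to itself."""
    P_off_diag = P_pi * (1.0 - np.eye(len(V)))
    return r_pi + gamma * P_off_diag @ V

V_nl = np.zeros(2)
for _ in range(1000):            # iterate to (numerical) convergence
    V_nl = no_loop_operator(V_nl)

# Compare against the ordinary fixed point V^pi = (I - gamma P_pi)^{-1} r_pi.
V_star = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
# With non-negative rewards, the no-loop return is dominated by the usual one.
assert np.all(V_nl <= V_star + 1e-9)
```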
When an operator $O$ is contractive, we can also straightforwardly construct a mathematical approximation to its fixed point.^{26} This approximation is given by the sequence $(U_k)_{k \ge 0}$, defined by an initial value $U_0 \in M$ and the recursive relationship
$$U_{k+1} = O U_k.$$
By contractivity, successive iterates of this sequence must come progressively
closer to the operator’s ﬁxed point. This is formalised by the following.
Proposition 4.7. Let $(M, d)$ be a metric space, and let $O : M \to M$ be a contraction mapping with contraction modulus $\beta \in [0, 1)$ and fixed point $U^* \in M$. Then for any initial point $U_0 \in M$, the sequence $(U_k)_{k \ge 0}$ defined by $U_{k+1} = O U_k$ is such that
$$d(U_k, U^*) \le \beta^k\, d(U_0, U^*), \qquad (4.5)$$
and in particular $d(U_k, U^*) \to 0$ as $k \to \infty$.
Proof. We will prove Equation 4.5 by induction, from which we obtain convergence of $(U_k)_{k \ge 0}$ in $d$. For $k = 0$, Equation 4.5 trivially holds. Now suppose that for some $k \ge 0$, we have
$$d(U_k, U^*) \le \beta^k\, d(U_0, U^*).$$
Then note that
$$d(U_{k+1}, U^*) \overset{(a)}{=} d(OU_k, OU^*) \overset{(b)}{\le} \beta\, d(U_k, U^*) \overset{(c)}{\le} \beta^{k+1}\, d(U_0, U^*),$$
where (a) follows from the definition of the sequence $(U_k)_{k \ge 0}$ and the fact that $U^*$ is fixed by $O$, (b) follows from the contractivity of $O$, and (c) follows from the inductive hypothesis. By induction, we conclude that Equation 4.5 holds for all $k \in \mathbb{N}$.

25. The reader is invited to consider the kind of environments in which policies that maximise the no-loop return are substantially different from those that maximise the usual expected return.
26. We use the term mathematical approximation to distinguish it from an approximation that can be computed. That is, there may or may not exist an algorithm that can determine the elements of the sequence $(U_k)_{k \ge 0}$ given the initial estimate $U_0$.
In the case of the Bellman operator $T^\pi$, Proposition 4.7 means that repeated application of $T^\pi$ to any initial value function estimate $V_0 \in \mathbb{R}^{\mathcal{X}}$ produces a sequence of estimates $(V_k)_{k \ge 0}$ that are progressively closer to $V^\pi$. This observation serves as the starting point for a number of computational approaches that approximate $V^\pi$, including dynamic programming (Chapter 5) and temporal-difference learning (Chapter 6).
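The geometric convergence promised by Proposition 4.7 is easy to observe numerically. The sketch below (our own, with invented numbers) repeatedly applies $T^\pi$ to $V_0 = 0$ and checks the bound $d(V_k, V^\pi) \le \gamma^k\, d(V_0, V^\pi)$ in the sup-norm:

```python
import numpy as np

# Hypothetical two-state chain, as in the earlier sketches.
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.1, 0.9]])
gamma = 0.9

# Fixed point of T^pi, computed in closed form for reference.
V_star = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Repeated application of T^pi, recording sup-norm distances to V^pi.
V = np.zeros(2)
distances = []
for _ in range(50):
    distances.append(np.max(np.abs(V - V_star)))
    V = r_pi + gamma * P_pi @ V

# Proposition 4.7 with O = T^pi and beta = gamma:
# d(V_k, V*) <= gamma^k d(V_0, V*).
for k, d in enumerate(distances):
    assert d <= gamma ** k * distances[0] + 1e-12
```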
4.3 The Distributional Bellman Operator
Designing distributional reinforcement learning algorithms such as categorical temporal-difference learning involves a few choices (such as how to represent probability distributions in a computer's memory) that do not have an equivalent in classical reinforcement learning. Throughout this book, we will make use of the distributional Bellman operator to understand and characterise many of these choices. To begin, recall the random-variable Bellman equation:
$$G^\pi(x) \overset{\mathcal{D}}{=} R + \gamma G^\pi(X'), \quad X = x. \qquad (4.6)$$
As in the expected-value setting, we construct a random-variable operator by viewing the right-hand side of Equation 4.6 as a transformation of $G^\pi$. In this case, we break down the transformation of $G^\pi$ into three operations, each of which produces a new random variable (Figure 4.1):
(a) $G^\pi(X')$: the indexing of the collection of random variables $G^\pi$ by $X'$;
(b) $\gamma G^\pi(X')$: the multiplication of the random variable $G^\pi(X')$ by the scalar $\gamma$;
(c) $R + \gamma G^\pi(X')$: the addition of two random variables ($R$ and $\gamma G^\pi(X')$).
More generally, we may apply these operations to any state-indexed collection of random variables $G = (G(x) : x \in \mathcal{X})$, taken to be independent of the random transitions used to define the transformation. With some mathematical caveats discussed below, let us introduce the random-variable Bellman operator
$$(T^\pi G)(x) \overset{\mathcal{D}}{=} R + \gamma G(X'), \quad X = x. \qquad (4.7)$$
Equation 4.7 states that the application of the Bellman operator to $G$ (evaluated at $x$; the left-hand side) produces a random variable that is equal in distribution to the random variable constructed on the right-hand side. Because this holds
Figure 4.1
The random-variable Bellman operator is composed of three operations: (a) indexing into a collection of random variables, (b) multiplication by the discount factor $\gamma$, and (c) addition of two random variables. Here, we assume that $R$ and $X'$ take on a single value for clarity.
for all $x$, we think of $T^\pi$ as mapping $G$ to a new collection of random variables $T^\pi G$.
The random-variable operator is appealing because it is concise and easily understood. In many circumstances, this makes it the tool of choice for reasoning about distributional reinforcement learning problems. One issue, however, is that its definition above is mathematically incomplete. This is because it specifies the probability distribution of $(T^\pi G)(x)$, but not its identity as a mapping from some sample space to the real numbers. As discussed in Section 2.7, without care we may produce random variables that exhibit undesirable behaviour, for example because rewards at different points in time are improperly correlated. More immediately, the theory of contraction mappings needs a clear definition of the space on which an operator is defined; in the case of the random-variable operator, this requires us to specify a space of random variables to operate in. Properly defining such a space is possible but requires some technical subtlety and measure-theoretic considerations; we refer the interested reader to Section 4.9.
A more direct solution is to consider the distributional Bellman operator as a mapping on the space of return-distribution functions. Starting with the distributional Bellman equation
$$\eta^\pi(x) = \mathbb{E}_\pi\big[(b_{R,\gamma})_\# \,\eta^\pi(X') \,\big|\, X = x\big],$$
we again view the right-hand side as the result of applying a series of transformations, in this case to probability distributions.
Definition 4.8. The distributional Bellman operator $T^\pi : \mathcal{P}(\mathbb{R})^{\mathcal{X}} \to \mathcal{P}(\mathbb{R})^{\mathcal{X}}$ is the mapping defined by
$$(T^\pi \eta)(x) = \mathbb{E}_\pi\big[(b_{R,\gamma})_\# \,\eta(X') \,\big|\, X = x\big]. \qquad (4.8)$$
Here, the operations on probability distributions are expressed (rather compactly) by the expectation in Equation 4.8 and the use of the pushforward distribution derived from the bootstrap function $b_{r,\gamma}$; these are the operations of mixing, scaling, and translation previously described in Section 2.8.
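When every distribution involved has finite support, these operations can be carried out exactly on lists of (atom, probability) pairs. The sketch below is our own construction, with hypothetical deterministic rewards and the policy folded into the transition matrix:

```python
import numpy as np

# Hypothetical two-state chain; each eta[x] is a list of (atom, probability)
# pairs representing a finite-support return distribution.
P_pi = np.array([[0.5, 0.5],
                 [0.1, 0.9]])
rewards = [1.0, 0.0]      # deterministic reward received in each state
gamma = 0.9

def distributional_bellman(eta):
    """(T^pi eta)(x): mixture over next states x' of the pushforward
    (b_{r,gamma})# eta(x'), i.e. each atom z is mapped to r + gamma * z."""
    new_eta = []
    for x in range(len(eta)):
        atoms = []
        for x_next in range(len(eta)):
            for z, p in eta[x_next]:
                atoms.append((rewards[x] + gamma * z, P_pi[x, x_next] * p))
        new_eta.append(atoms)
    return new_eta

eta = [[(0.0, 1.0)], [(0.0, 1.0)]]                        # eta_0: mass at zero
eta = distributional_bellman(distributional_bellman(eta))  # eta_2

# Each eta_2(x) is still a probability distribution, and its mean matches
# two applications of the ordinary Bellman operator to V = 0.
assert all(abs(sum(p for _, p in dist) - 1.0) < 1e-12 for dist in eta)
mean_0 = sum(z * p for z, p in eta[0])
assert abs(mean_0 - 1.45) < 1e-12
```

Note how the number of atoms grows with each application; Chapter 5's projections address exactly this.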
We can gain additional insight into how the operator transforms a return function by considering the situation in which the random reward $R$ and the return distributions $\eta(x)$ admit respective probability densities $p_R(r \mid x, a)$ and $p_x(z)$. In this case, the probability density of $(T^\pi \eta)(x)$, denoted $p'_x$, is
$$p'_x(z) = \gamma^{-1} \sum_{a \in \mathcal{A}} \pi(a \mid x) \int_{\mathbb{R}} p_R(r \mid x, a) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a)\, p_{x'}\Big(\frac{z - r}{\gamma}\Big)\, dr. \qquad (4.9)$$
Expressed in terms of probability densities, the indexing of a collection of random variables becomes a mixture of densities, while their addition becomes a convolution; this is in fact what is depicted in Figure 4.1. In terms of cumulative distribution functions, we have
$$F_{(T^\pi \eta)(x)}(z) = \mathbb{E}_\pi\Big[F_{\eta(X')}\Big(\frac{z - R}{\gamma}\Big) \,\Big|\, X = x\Big].$$
However, we prefer the operator that deals directly with probability distributions (Equation 4.8), as it can be used to concisely express more complex operations on distributions. One such operation is the projection of a probability distribution onto a finitely-parametrised set, which we will use in Chapter 5 to construct algorithms for approximating $\eta^\pi$.
Using Definition 4.8, we can formally express the fact that the return-distribution function $\eta^\pi$ is the only solution to the equation $\eta = T^\pi \eta$. The proof is relatively technical and will be given in Section 4.8.
Proposition 4.9. The return-distribution function $\eta^\pi$ satisfies $\eta^\pi = T^\pi \eta^\pi$ and is the unique fixed point of the distributional Bellman operator $T^\pi$.
When working with the distributional Bellman operator, one should be mindful that the random reward $R$ and next state $X'$ are generally not independent, because they both depend on the chosen action $A$ (we briefly mentioned this concern in Section 2.8). In Equation 4.9, this is explicitly handled by the outer sum over actions. Analogously, we can make explicit the dependency on $A$ by introducing a second expectation in Equation 4.8:
$$(T^\pi \eta)(x) = \mathbb{E}_\pi\Big[\mathbb{E}\big[(b_{R,\gamma})_\# \,\eta(X') \,\big|\, X = x, A\big] \,\Big|\, X = x\Big].$$
By conditioning the inner expectation on the action $A$, we make the random variables $R$ and $\gamma G(X')$ conditionally independent in the inner expectation. We will make use of this technique in proving Theorem 4.25, the main theoretical result of this chapter.
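The importance of keeping $R$ and $X'$ correctly paired is easy to demonstrate numerically. In the hypothetical example below (our own construction), each of two equiprobable actions determines both the reward and the next state; pairing $R$ with the wrong next-state distribution leaves the mean unchanged but alters the return distribution:

```python
import itertools

gamma = 0.9
pi = {0: 0.5, 1: 0.5}                  # uniform policy over two actions
reward = {0: 0.0, 1: 1.0}              # R is determined by the action...
next_state = {0: 0, 1: 1}              # ...and so is X'
eta = {0: {0.0: 1.0}, 1: {10.0: 1.0}}  # Dirac next-state return distributions

def correct_backup():
    """Mixes over actions, so R and eta(X') stay correctly paired."""
    out = {}
    for a, p_a in pi.items():
        for z, p_z in eta[next_state[a]].items():
            atom = reward[a] + gamma * z
            out[atom] = out.get(atom, 0.0) + p_a * p_z
    return out

def naive_backup():
    """WRONG: draws R and X' from independent actions, discarding their
    correlation through A."""
    out = {}
    for (a1, p1), (a2, p2) in itertools.product(pi.items(), pi.items()):
        for z, p_z in eta[next_state[a2]].items():
            atom = reward[a1] + gamma * z
            out[atom] = out.get(atom, 0.0) + p1 * p2 * p_z
    return out

correct, naive = correct_backup(), naive_backup()
mean = lambda d: sum(z * p for z, p in d.items())
assert abs(mean(correct) - mean(naive)) < 1e-12   # the means agree...
assert correct != naive                           # ...the distributions do not
```

This is exactly why expected-value reinforcement learning can ignore the correlation while distributional methods cannot.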
In some circumstances, it is useful to translate between operations on probability distributions and those on random variables. We do this by means of a representative set of random variables called an instantiation.
Definition 4.10. Given a probability distribution $\nu \in \mathcal{P}(\mathbb{R})$, we say that a random variable $Z$ is an instantiation of $\nu$ if its distribution is $\nu$, written $Z \sim \nu$. Similarly, we say that a collection of random variables $G = (G(x) : x \in \mathcal{X})$ is an instantiation of a return-distribution function $\eta \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$ if for every $x \in \mathcal{X}$, we have $G(x) \sim \eta(x)$.
Given a return-distribution function $\eta \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$, the new return-distribution function $T^\pi \eta$ can be obtained by constructing an instantiation $G$ of $\eta$, performing the transformation on the collection of random variables $G$ as described at the beginning of this section, and then extracting the distributions of the resulting random variables. This is made formal as follows.
Proposition 4.11. Let $\eta \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$, and let $G = (G(x) : x \in \mathcal{X})$ be an instantiation of $\eta$. For each $x \in \mathcal{X}$, let $(X = x, A, R, X')$ be a sample transition independent of $G$. Then $R + \gamma G(X')$ has the distribution $(T^\pi \eta)(x)$:
$$\mathcal{D}_\pi\big(R + \gamma G(X') \,\big|\, X = x\big) = (T^\pi \eta)(x).$$
Proof. The result follows immediately from the definition of the distributional Bellman operator. For clarity, we step through the argument again, mirroring the transformations set out at the beginning of the section. First, the indexing transformation gives
$$\mathcal{D}_\pi\big(G(X') \,\big|\, X = x\big) = \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x)\, \eta(x') = \mathbb{E}_\pi\big[\eta(X') \,\big|\, X = x\big].$$
Next, scaling by $\gamma$ yields
$$\mathcal{D}_\pi\big(\gamma G(X') \,\big|\, X = x\big) = \mathbb{E}_\pi\big[(b_{0,\gamma})_\# \,\eta(X') \,\big|\, X = x\big],$$
and finally adding the immediate reward $R$ gives the result
$$\mathcal{D}_\pi\big(R + \gamma G(X') \,\big|\, X = x\big) = \mathbb{E}_\pi\big[(b_{R,\gamma})_\# \,\eta(X') \,\big|\, X = x\big].$$

Figure 4.2
A commutative diagram illustrating two perspectives on the application of the distributional Bellman operator. The top horizontal line represents the direct application to the return-distribution function $\eta$, yielding $\eta'$. The alternative path first instantiates the return-distribution function $\eta$ as a collection of random variables $G = (G(x) : x \in \mathcal{X})$, transforms $G$ to obtain another collection of random variables $G'$, and then extracts the distributions of these random variables to obtain $\eta'$.
Proposition 4.11 is an instance of a recurring principle in distributional reinforcement learning: operations on probability distributions can equivalently be carried out on instantiations of those distributions, with the resulting distributions then extracted. Throughout this book, we will illustrate this point as it arises with a commutative diagram; the particular case under consideration is depicted in Figure 4.2.
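The two paths of the diagram can be compared numerically by Monte Carlo. In the sketch below (our own, with hypothetical environment numbers), the bottom path samples $R + \gamma G(X')$ from an instantiation, and the top path computes the mean of $(T^\pi \eta)(x)$ exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state chain with deterministic rewards; eta gives the
# next-state return distributions as (atoms, probabilities).
P_pi = np.array([[0.5, 0.5],
                 [0.1, 0.9]])
rewards = np.array([1.0, 0.0])
gamma = 0.9
eta = [([0.0, 2.0], [0.5, 0.5]),   # eta(0)
       ([1.0],      [1.0])]        # eta(1)

x, n = 0, 200_000
# Bottom path: instantiate G as independent draws from eta, sample the
# transition, and form R + gamma * G(X') empirically.
x_next = rng.choice(2, size=n, p=P_pi[x])
z = np.where(x_next == 0,
             rng.choice(eta[0][0], size=n, p=eta[0][1]),
             rng.choice(eta[1][0], size=n, p=eta[1][1]))
samples = rewards[x] + gamma * z

# Top path: the mean of the exact mixture (T^pi eta)(x).
exact_mean = rewards[x] + gamma * sum(
    P_pi[x, j] * np.dot(eta[j][0], eta[j][1]) for j in range(2))

# The empirical bottom path agrees with the exact top path.
assert abs(samples.mean() - exact_mean) < 0.01
```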
4.4 Wasserstein Distances for Return Functions
Many desirable properties of reinforcement learning algorithms (for example, the fact that they produce a good approximation of the value function) are due to the contractive nature of the Bellman operator $T^\pi$. In this section, we will establish that the distributional Bellman operator $T^\pi$, too, is a contraction mapping: analogous to the value-based operator, the application of $T^\pi$ brings return functions closer together.
One difference between expected-value and distributional reinforcement learning is that the space of return-distribution functions $\mathcal{P}(\mathbb{R})^{\mathcal{X}}$ is substantially different from the space of value functions. To measure distances between value functions, we can simply treat them as finite-dimensional vectors, taking the absolute difference of value estimates at individual states. By contrast, it is somewhat less intuitive to see what "close" means when comparing probability distributions. Throughout this chapter we will consider a number of probability metrics that measure distances between distributions, each presenting different mathematical and computational properties. We begin with the family of Wasserstein distances.
Definition 4.12. Let $\nu \in \mathcal{P}(\mathbb{R})$ be a probability distribution with cumulative distribution function $F_\nu$. Let $Z$ be an instantiation of $\nu$ (in particular, $F_Z = F_\nu$). The generalised inverse $F_\nu^{-1}$ is given by
$$F_\nu^{-1}(\tau) = \inf_{z \in \mathbb{R}} \{z : F_\nu(z) \ge \tau\}.$$
We also write $F_Z^{-1} = F_\nu^{-1}$.
Definition 4.13. Let $p \in [1, \infty)$. The $p$-Wasserstein distance is a function $w_p : \mathcal{P}(\mathbb{R}) \times \mathcal{P}(\mathbb{R}) \to [0, \infty]$ given by
$$w_p(\nu, \nu') = \Big(\int_0^1 \big|F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau)\big|^p\, d\tau\Big)^{1/p}.$$
The $\infty$-Wasserstein distance $w_\infty : \mathcal{P}(\mathbb{R}) \times \mathcal{P}(\mathbb{R}) \to [0, \infty]$ is
$$w_\infty(\nu, \nu') = \sup_{\tau \in (0,1)} \big|F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau)\big|.$$
Graphically, the Wasserstein distances between two probability distributions measure the area between their cumulative distribution functions, with values along the abscissa taken to the $p$th power; see Figure 4.3. When $p = \infty$, this becomes the largest horizontal difference between the inverse cumulative distribution functions. The $p$-Wasserstein distances satisfy the definition of a metric, except that they may not be finite for arbitrary pairs of distributions in $\mathcal{P}(\mathbb{R})$; see Exercise 4.6. Properly speaking, they are said to be extended metrics, since they may take values on the real line extended to include infinity. Most probability metrics that we will consider are extended metrics rather than metrics in the sense of Definition 4.2. We measure distances between return-distribution functions in terms of the largest Wasserstein distance between probability distributions at individual states.
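For distributions given by finitely many equally weighted samples, the integral in Definition 4.13 reduces to sorting: the sorted samples are exactly the inverse CDF evaluated on a uniform grid of levels $\tau$. A sketch (our own, with arbitrary Gaussian samples):

```python
import numpy as np

def wasserstein_p(a, b, p):
    """w_p between two empirical distributions with equally many, equally
    weighted atoms: sort both (the inverse CDFs) and average |diff|^p."""
    qa, qb = np.sort(a), np.sort(b)
    return np.mean(np.abs(qa - qb) ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
nu = rng.normal(1.0, 1.0, size=10_000)
nu_shifted = nu + 2.0   # translate every outcome by 2

# Translating a distribution by c moves it by exactly |c| in every w_p:
# the Wasserstein distances are sensitive to differences in outcomes.
for p in (1, 2, 5):
    assert abs(wasserstein_p(nu, nu_shifted, p) - 2.0) < 1e-9
```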
Figure 4.3
Left. Illustration of the $p$-Wasserstein distance between a normal distribution $\nu = N(1, 1)$ and a mixture of two normal distributions $\nu' = \frac{1}{2}N(-1, 0.5) + \frac{1}{2}N(3, 0.5)$. Right. Illustration of the $\ell_p$ metric for the same distributions (see Section 4.5). In both cases, the shading indicates the axis along which the differences are taken to the $p$th exponent.
Definition 4.14. Let $p \in [1, \infty]$. The supremum $p$-Wasserstein distance $\overline{w}_p$ between two return-distribution functions $\eta, \eta' \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$ is defined by^{27}
$$\overline{w}_p(\eta, \eta') = \sup_{x \in \mathcal{X}} w_p\big(\eta(x), \eta'(x)\big).$$
The supremum $p$-Wasserstein distances fulfil all requirements of an extended metric on the space of return-distribution functions $\mathcal{P}(\mathbb{R})^{\mathcal{X}}$; see Exercise 4.7. Based on these distances, we give our first contractivity result regarding the distributional Bellman operator; its proof is given at the end of the section.
Proposition 4.15. The distributional Bellman operator is a contraction mapping on $\mathcal{P}(\mathbb{R})^{\mathcal{X}}$ in the supremum $p$-Wasserstein distance, for all $p \in [1, \infty]$. More precisely,
$$\overline{w}_p(T^\pi \eta, T^\pi \eta') \le \gamma\, \overline{w}_p(\eta, \eta'),$$
for all $\eta, \eta' \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$.
Proposition 4.15 is significant in that it establishes a close parallel between the expected-value and distributional operators. Following the line of reasoning given in Section 4.2, it provides the mathematical justification for the development and analysis of computational approaches for finding the return function $\eta^\pi$. More immediately, it also enables us to characterise the convergence of the sequence
$$\eta_{k+1} = T^\pi \eta_k \qquad (4.10)$$
to the return function $\eta^\pi$.

27. Because we assume that there are finitely many states, we can equivalently write $\max_{x \in \mathcal{X}}$ in the definition of the supremum distance. However, we prefer the more generally applicable $\sup_{x \in \mathcal{X}}$.
Proposition 4.16. Suppose that for each $(x, a) \in \mathcal{X} \times \mathcal{A}$, the reward distribution $P_R(\cdot \mid x, a)$ is supported on the interval $[R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$. Then for any initial return function $\eta_0$ whose distributions are bounded on the interval $\big[\tfrac{R_{\mathrm{MIN}}}{1-\gamma}, \tfrac{R_{\mathrm{MAX}}}{1-\gamma}\big]$, the sequence
$$\eta_{k+1} = T^\pi \eta_k$$
converges to $\eta^\pi$ in the supremum $p$-Wasserstein distance (for all $p \in [1, \infty]$).
The restriction to bounded rewards in Proposition 4.16 is necessary to make use of the tools developed in Section 4.2, at least without further qualification. This is because Proposition 4.7 requires all distances to be finite, which is not guaranteed under our definition of a probability metric. If, for example, the initial condition $\eta_0$ is such that
$$\overline{w}_p(\eta_0, \eta^\pi) = \infty,$$
then Proposition 4.15 is not of much use. A less restrictive, but more technically elaborate set of assumptions will be presented later in the chapter. For now, we provide the proof of the two preceding results. First, we obtain a reasonably simple proof of Proposition 4.15 by considering an alternative formulation of the $p$-Wasserstein distances in terms of couplings.
Definition 4.17. Let $\nu, \nu' \in \mathcal{P}(\mathbb{R})$ be two probability distributions. A coupling between $\nu$ and $\nu'$ is a joint distribution $\upsilon \in \mathcal{P}(\mathbb{R}^2)$ such that if $(Z, Z')$ is an instantiation of $\upsilon$, then $Z$ has distribution $\nu$ and $Z'$ has distribution $\nu'$. We write $\Gamma(\nu, \nu') \subseteq \mathcal{P}(\mathbb{R}^2)$ for the set of all couplings of $\nu$ and $\nu'$.
Proposition 4.18 (see Villani for a proof). Let $p \in [1, \infty)$. Expressed in terms of an optimal coupling, the $p$-Wasserstein distance between two distributions $\nu, \nu' \in \mathcal{P}(\mathbb{R})$ is
$$w_p(\nu, \nu') = \min_{\upsilon \in \Gamma(\nu, \nu')} \mathbb{E}_{(Z, Z') \sim \upsilon}\big[|Z - Z'|^p\big]^{1/p}.$$
The $\infty$-Wasserstein distance between $\nu$ and $\nu'$ can be written as
$$w_\infty(\nu, \nu') = \min_{\upsilon \in \Gamma(\nu, \nu')} \inf\big\{z \in \mathbb{R} : P_{(Z, Z') \sim \upsilon}\big(|Z - Z'| > z\big) = 0\big\}.$$
Informally, the optimal coupling finds an arrangement of the two probability distributions that maximises "agreement": it produces outcomes that are as close as possible. In Proposition 4.18, the optimal coupling takes on a very simple form given by inverse cumulative distribution functions. For $\nu, \nu' \in \mathcal{P}(\mathbb{R})$, an optimal coupling is the probability distribution of the random variable
$$\big(F_\nu^{-1}(\tau),\, F_{\nu'}^{-1}(\tau)\big), \qquad \tau \sim U([0, 1]). \qquad (4.11)$$
This can be understood by noting how the 1-Wasserstein distance between $\nu$ and $\nu'$ is obtained by measuring the horizontal distance between the two cumulative distribution functions, at each level $\tau \in [0, 1]$ (Figure 4.3).
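On samples, the coupling of Equation 4.11 amounts to sorting both collections before pairing them. The sketch below (our own, with hypothetical Gaussian choices) compares this quantile pairing against an arbitrary alternative coupling, here one that pairs independent samples:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
# Samples from two hypothetical distributions, nu = N(0, 1) and nu' = N(1, 2).
z = rng.normal(0.0, 1.0, size=n)
z_prime = rng.normal(1.0, 2.0, size=n)

# The coupling of Equation 4.11 pairs equal quantile levels: sort both.
comonotone_cost = np.mean(np.abs(np.sort(z) - np.sort(z_prime)))
# Pairing independent samples is also a valid coupling, just not an optimal one.
independent_cost = np.mean(np.abs(z - z_prime))

# Proposition 4.18: the quantile coupling attains the minimum expected cost,
# so (up to sampling noise) no other coupling can do better.
assert comonotone_cost <= independent_cost
```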
Proof of Proposition 4.15. Let $p \in [1, \infty)$ be fixed. For each $x \in \mathcal{X}$, consider the optimal coupling between $\eta(x)$ and $\eta'(x)$, and instantiate it as the pair of random variables $\big(G(x), G'(x)\big)$. Next, denote by $(x, A, R, X')$ the random transition beginning in $x \in \mathcal{X}$, constructed to be independent from $G(y)$ and $G'(y)$, for all $y \in \mathcal{X}$. With these variables, write
$$\tilde{G}(x) = R + \gamma G(X'), \qquad \tilde{G}'(x) = R + \gamma G'(X').$$
By Proposition 4.11, $\tilde{G}(x)$ has distribution $(T^\pi \eta)(x)$ and $\tilde{G}'(x)$ has distribution $(T^\pi \eta')(x)$. The pair $\big(\tilde{G}(x), \tilde{G}'(x)\big)$ therefore forms a valid coupling of these distributions. Now
$$\begin{aligned}
w_p^p\big((T^\pi \eta)(x), (T^\pi \eta')(x)\big)
&\overset{(a)}{\le} \mathbb{E}\Big[\big|\big(R + \gamma G(X')\big) - \big(R + \gamma G'(X')\big)\big|^p \,\Big|\, X = x\Big] \\
&\overset{(b)}{=} \gamma^p\, \mathbb{E}\big[\big|G(X') - G'(X')\big|^p \,\big|\, X = x\big] \\
&\overset{(c)}{=} \gamma^p \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x)\, \mathbb{E}\big[\big|G(x') - G'(x')\big|^p\big] \\
&\overset{(d)}{\le} \gamma^p \sup_{x' \in \mathcal{X}} \mathbb{E}\big[\big|G(x') - G'(x')\big|^p\big] \\
&\overset{(e)}{=} \gamma^p\, \overline{w}_p^p(\eta, \eta').
\end{aligned}$$
Taking a supremum over $x \in \mathcal{X}$ on the left-hand side and the $p$th root of both sides yields the result. Here, (a) follows since the Wasserstein distance is defined as a minimum over couplings, (b) follows from algebraic manipulation of the expectation, (c) follows from independence of the sample transition $(X = x, A, R, X')$ and the random variables $(G(x), G'(x) : x \in \mathcal{X})$, (d) because the maximum of non-negative quantities is at least as great as their weighted average, and (e) follows since $(G(x'), G'(x'))$ was defined as an optimal coupling of $\eta(x')$ and $\eta'(x')$. The proof for $p = \infty$ is similar (see Exercise 4.8).
Proof of Proposition 4.16. Let us denote by $\mathcal{P}_B(\mathbb{R})$ the space of distributions bounded on $[V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$, where as usual
$$V_{\mathrm{MIN}} = \frac{R_{\mathrm{MIN}}}{1 - \gamma}, \qquad V_{\mathrm{MAX}} = \frac{R_{\mathrm{MAX}}}{1 - \gamma}.$$
We will show that under the assumption of rewards bounded on $[R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$,
(a) the return function $\eta^\pi$ is in $\mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$, and
(b) the distributional Bellman operator maps $\mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$ to itself.
Consequently, we can invoke Proposition 4.7 with $O = T^\pi$ and $M = \mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$ to conclude that for any initial $\eta_0 \in \mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$, the sequence of iterates $(\eta_k)_{k \ge 0}$ converges to $\eta^\pi$ with respect to $d = \overline{w}_p$, for any $p \in [1, \infty]$.
To prove (a), note that for any state $x \in \mathcal{X}$,
$$G^\pi(x) = \sum_{t=0}^{\infty} \gamma^t R_t, \quad X_0 = x,$$
and since $R_t \in [R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$ for all $t$, then also $G^\pi(x) \in [V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$. For (b), let $\eta \in \mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$ and denote by $G$ an instantiation of this return-distribution function. For any $x \in \mathcal{X}$,
$$\begin{aligned}
P\big(R + \gamma G(X') \le V_{\mathrm{MAX}} \,\big|\, X = x\big)
&= P\Big(G(X') \le \tfrac{V_{\mathrm{MAX}} - R}{\gamma} \,\Big|\, X = x\Big) \\
&\ge P\Big(G(X') \le \tfrac{V_{\mathrm{MAX}} - R_{\mathrm{MAX}}}{\gamma} \,\Big|\, X = x\Big) \\
&= P\big(G(X') \le V_{\mathrm{MAX}} \,\big|\, X = x\big) = 1.
\end{aligned}$$
By the same reasoning,
$$P\big(R + \gamma G(X') \ge V_{\mathrm{MIN}} \,\big|\, X = x\big) = 1.$$
Since $R + \gamma G(X')$, $X = x$, is an instantiation of $(T^\pi \eta)(x)$ for each $x$, we conclude that if $\eta \in \mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$, then also $T^\pi \eta \in \mathcal{P}_B(\mathbb{R})^{\mathcal{X}}$.
4.5 $\ell_p$ Probability Metrics and the Cramér Distance
The previous section established that the distributional Bellman operator is well-behaved with respect to the family of Wasserstein distances. However, these are but a few among many standard probability metrics. We will see in Chapter 5 that theoretical analysis sometimes requires us to study the behaviour of the distributional operator with respect to other metrics. In addition, many practical algorithms directly optimise a metric (typically expressed as a loss function) as part of their operation (see Chapter 10). The Cramér distance, a member of the broader family of $\ell_p$ metrics, is of particular interest to us.
Definition 4.19. Let $p \in [1, \infty)$. The distance $\ell_p : \mathcal{P}(\mathbb{R}) \times \mathcal{P}(\mathbb{R}) \to [0, \infty]$ is a probability metric defined by
$$\ell_p(\nu, \nu') = \Big(\int_{\mathbb{R}} |F_\nu(z) - F_{\nu'}(z)|^p\, dz\Big)^{1/p}. \qquad (4.12)$$
For $p = 2$, this is the Cramér distance.^{28} The $\ell_\infty$ or Kolmogorov-Smirnov distance is given by
$$\ell_\infty(\nu, \nu') = \sup_{z \in \mathbb{R}} \big|F_\nu(z) - F_{\nu'}(z)\big|.$$
The respective supremum $\ell_p$ distances are given by ($\eta, \eta' \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$)
$$\overline{\ell}_p(\eta, \eta') = \sup_{x \in \mathcal{X}} \ell_p\big(\eta(x), \eta'(x)\big).$$
These are extended metrics on $\mathcal{P}(\mathbb{R})^{\mathcal{X}}$.
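For finite-support distributions, the integral in Equation 4.12 is easy to approximate on a grid. The sketch below (our own helper and numbers) computes $\ell_2$ for pairs of Diracs and, by shrinking the gap between the atoms by $\gamma$, previews the $\gamma^{1/p}$ contraction modulus established next:

```python
import numpy as np

def ell_p(atoms_a, probs_a, atoms_b, probs_b, p, grid):
    """ell_p of Equation 4.12 between finite-support distributions,
    approximated by integrating |F_a - F_b|^p on a uniform grid."""
    def cdf(atoms, probs):
        return np.sum(np.asarray(probs)[None, :]
                      * (np.asarray(atoms)[None, :] <= grid[:, None]), axis=1)
    dz = grid[1] - grid[0]
    gap = np.abs(cdf(atoms_a, probs_a) - cdf(atoms_b, probs_b))
    return (np.sum(gap ** p) * dz) ** (1.0 / p)

grid = np.linspace(-5.0, 5.0, 100_001)
# Two Diracs, delta_0 and delta_1: the CDFs differ by exactly 1 on [0, 1),
# so ell_p = 1 for every p.
d = ell_p([0.0], [1.0], [1.0], [1.0], 2, grid)
assert abs(d - 1.0) < 1e-3
# Scaling the gap between the atoms by gamma scales ell_2 by gamma^{1/2}.
gamma = 0.9
d_shrunk = ell_p([0.0], [1.0], [gamma], [1.0], 2, grid)
assert abs(d_shrunk - gamma ** 0.5) < 1e-3
```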
Where the $p$-Wasserstein distances measure differences in outcomes, the $\ell_p$ distances measure differences in the probabilities associated with these outcomes. This is because the exponent $p$ is applied to cumulative probabilities (this is illustrated in Figure 4.3). The distributional Bellman operator is also a contraction mapping under the $\ell_p$ distances for $p \in [1, \infty)$, albeit with a larger contraction modulus.^{29}
Proposition 4.20. For $p \in [1, \infty)$, the distributional Bellman operator $T^\pi$ is a contraction mapping on $\mathcal{P}(\mathbb{R})^{\mathcal{X}}$ with respect to $\overline{\ell}_p$, with contraction modulus $\gamma^{1/p}$. That is,
$$\overline{\ell}_p(T^\pi \eta, T^\pi \eta') \le \gamma^{1/p}\, \overline{\ell}_p(\eta, \eta')$$
for all $\eta, \eta' \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$.
The proof of Proposition 4.20 will follow as a corollary of a more general result given in Section 4.6. One way to relate it to our earlier result is to consider the behaviour of the sequence defined by
$$\eta_{k+1} = T^\pi \eta_k. \qquad (4.13)$$
As measured in the $p$-Wasserstein distance, the sequence $(\eta_k)_{k \ge 0}$ approaches $\eta^\pi$ at a rate of $\gamma$; but if we instead measure distances using the $\overline{\ell}_p$ metric, this rate is slower: only $\gamma^{1/p}$. Measured in terms of $\ell_\infty$ (the Kolmogorov-Smirnov distance), the sequence of iterates may in fact not seem to approach $\eta^\pi$ at all. To see this, it suffices to consider a single-state process with zero reward (that is, $P_{\mathcal{X}}(X' = x \mid X = x) = 1$ and $R = 0$) and a discount factor $\gamma = 0.9$. In this case, $\eta^\pi(x) = \delta_0$. For the initial condition $\eta_0(x) = \delta_1$, we obtain
$$\eta_1(x) = (T^\pi \eta_0)(x) = \delta_\gamma.$$
Now, the (supremum) $\ell_\infty$ distance between $\eta^\pi$ and $\eta_0$ is 1, because for any $z \in (0, 1)$,
$$F_{\eta_0(x)}(z) = 0 \neq F_{\eta^\pi(x)}(z) = 1.$$
However, the $\ell_\infty$ distance between $\eta^\pi$ and $\eta_1$ is also 1, by the same argument (but now restricted to $z \in (0, \gamma)$). Hence there is no $\beta \in [0, 1)$ for which
$$\overline{\ell}_\infty(\eta_1, \eta^\pi) \le \beta\, \overline{\ell}_\infty(\eta_0, \eta^\pi).$$

28. Historically, the Cramér distance has been defined as the square of $\ell_2$. In our context, it seems unambiguous to use the word for $\ell_2$ itself.
29. For $p = 1$, the distributional Bellman operator has contraction modulus $\gamma$; this is sensible given that $\ell_1 = w_1$.
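The failure can be reproduced directly. The sketch below (our own helper and grid) encodes the Dirac distributions of the example and evaluates the Kolmogorov-Smirnov distance numerically:

```python
import numpy as np

def ks_distance(atoms_a, probs_a, atoms_b, probs_b, grid):
    """ell_inf: the largest vertical gap between two CDFs, probed on a grid."""
    def cdf(atoms, probs):
        return np.array([sum(p for z0, p in zip(atoms, probs) if z0 <= z)
                         for z in grid])
    return np.max(np.abs(cdf(atoms_a, probs_a) - cdf(atoms_b, probs_b)))

gamma = 0.9
grid = np.linspace(-0.5, 1.5, 2001)
# Single state, zero reward: eta^pi(x) = delta_0; start from eta_0(x) = delta_1.
target = ([0.0], [1.0])
eta_0 = ([1.0], [1.0])
eta_1 = ([gamma], [1.0])   # (T^pi eta_0)(x): the atom is merely scaled by gamma

d0 = ks_distance(*eta_0, *target, grid)
d1 = ks_distance(*eta_1, *target, grid)
# The Kolmogorov-Smirnov distance records no progress at all.
assert d0 == 1.0 and d1 == 1.0
```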
Exercise 4.16 asks you to prove a similar result for a probability metric called
The more general point is that different probability metrics are sensitive to different characteristics of probability distributions, and to varying degrees. At one extreme, the $\infty$-Wasserstein distance is effectively insensitive to the probability associated with different outcomes, while at the other extreme, the Kolmogorov-Smirnov distance is insensitive to the scale of the difference between outcomes. In Section 4.6, we will show that a metric's sensitivity to differences in outcomes determines the contraction modulus of the distributional Bellman operator under that metric; informally speaking, this explains the "nice" behaviour of the distributional Bellman operator under the Wasserstein distances.
Before moving on, let us summarise the results established thus far. By combining the theory of contraction mappings with suitable probability metrics, we were able to characterise the behaviour of the iterates
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k. \tag{4.14}$$
In the following chapters we will use this as the basis for the design of implementable algorithms that approximate the return distribution $\eta^\pi$, and will appeal to contraction mapping theory to provide theoretical guarantees for these algorithms. In particular, in Chapter 6 we will analyse categorical temporal-difference learning under the lens of the Cramér distance. While the results presented until now suffice for most practical purposes, the following sections deal with some of the more technical considerations that arise from studying Equation 4.14 under general conditions, in particular the issue of infinite distances between distributions.
Figure 4.4
The distributional Bellman operator is not a contraction mapping in either the supremum form of (a, b) the total variation distance ($d_{TV}$, shaded in the top panels; see Exercise 4.16 for a definition) or (c, d) the Kolmogorov-Smirnov distance $\ell_\infty$ (vertical distance in the bottom panels). The left panels show the density (a) and cumulative distribution function (c) of two distributions $\eta(x)$ (blue) and $\eta'(x)$ (red). The right panels show the same after applying the distributional Bellman operator (b, d), specifically considering the transformation induced by the discount factor $\gamma$. The lack of contractivity can be explained by the fact that neither $d_{TV}$ nor $\ell_\infty$ is a homogeneous probability metric (Section 4.6).
4.6 Sufficient Conditions for Contractivity

In the remainder of this chapter, we characterise in greater generality the behaviour of the sequence of return function estimates described by Equation 4.14, viewed under the lens of different probability metrics. We begin with a formal definition of what it means for a function d to be a probability metric.

Definition 4.21. A probability metric is an extended metric on the space of probability distributions, written
$$d : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty].$$
Its supremum extension is the function $\bar{d} : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \times \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to [0, \infty]$ defined as
$$\bar{d}(\eta, \eta') = \sup_{x \in \mathcal{X}} d(\eta(x), \eta'(x)).$$
We refer to $\bar{d}$ as a return-function metric; it is an extended metric on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
Our analysis is based on three properties that a probability metric should possess in order to guarantee contractivity. These three properties relate closely to the three fundamental operations that make up the distributional Bellman operator: scaling, convolution, and mixture of distributions (equivalently: scaling, addition, and indexing of random variables). In this analysis, we will find that some properties are more easily stated in terms of random variables, others in terms of probability distributions. Accordingly, given two probability distributions $\nu, \nu'$ with instantiations $Z, Z'$, let us overload notation and write
$$d(Z, Z') = d(\nu, \nu').$$
Definition 4.22. Let $c > 0$. The probability metric d is c-homogeneous if for any scalar $\gamma \in [0, 1)$ and any two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$, we have
$$d(\gamma Z, \gamma Z') = \gamma^c\, d(Z, Z').$$
In terms of probability distributions, this is equivalently given by the condition
$$d\big((b_{0,\gamma})_\# \nu, (b_{0,\gamma})_\# \nu'\big) = \gamma^c\, d(\nu, \nu').$$
If no such c exists, we say that d is not homogeneous.
Definition 4.23. The probability metric d is regular if for any two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$, and an independent random variable W, we have
$$d(W + Z, W + Z') \le d(Z, Z'). \tag{4.15}$$
In terms of distributions, this is
$$d\big(\mathbb{E}_W[(b_{W,1})_\# \nu],\ \mathbb{E}_W[(b_{W,1})_\# \nu']\big) \le d(\nu, \nu').$$
Definition 4.24. Given $p \in [1, \infty)$, the probability metric d is p-convex³⁰ if for any $\alpha \in (0, 1)$ and distributions $\nu_1, \nu_2, \nu'_1, \nu'_2 \in \mathscr{P}(\mathbb{R})$, we have
$$d^p\big(\alpha \nu_1 + (1 - \alpha)\nu_2,\ \alpha \nu'_1 + (1 - \alpha)\nu'_2\big) \le \alpha\, d^p(\nu_1, \nu'_1) + (1 - \alpha)\, d^p(\nu_2, \nu'_2).$$

30. This matches the usual definition of convexity (for $d^p$) if one treats the pair $(\nu_1, \nu_2)$ as a single argument from $\mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R})$.

Note that while the convexity property is stated here in terms of a mixture of two distributions, it implies an analogous property for mixtures of finitely-many distributions.
Theorem 4.25. Consider a probability metric d. Suppose that d is regular, c-homogeneous for some $c > 0$, and that there exists $p \in [1, \infty)$ such that d is p-convex. Then for all return-distribution functions $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, we have
$$\bar{d}(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta') \le \gamma^c\, \bar{d}(\eta, \eta').$$
Proof. Fix a state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. For this state, consider the sample transition $(X = x, A = a, R, X')$ (Equation 2.12), and recall that R and X' are independent given X and A, since
$$R \sim P_R(\cdot \mid X, A), \qquad X' \sim P_{\mathcal{X}}(\cdot \mid X, A).$$
Let G and G' be instantiations of $\eta$ and $\eta'$, respectively.³¹ We introduce a state-action variant of the distributional Bellman operator, $\mathcal{T} : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to \mathscr{P}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, given by
$$(\mathcal{T}\eta)(x, a) = \mathbb{E}\big[(b_{R,\gamma})_\# \eta(X') \mid X = x, A = a\big].$$
Note that this operator is defined independently of the policy $\pi$, since the action a is specified as an argument. We then calculate directly, indicating where each of the hypotheses of the result is used:
$$\begin{aligned}
d^p\big((\mathcal{T}\eta)(x, a), (\mathcal{T}\eta')(x, a)\big) &= d^p\big(R + \gamma G(X'),\ R + \gamma G'(X')\big) \\
&\overset{(a)}{\le} d^p\big(\gamma G(X'), \gamma G'(X')\big) \\
&\overset{(b)}{=} \gamma^{pc}\, d^p\big(G(X'), G'(X')\big) \\
&\overset{(c)}{\le} \gamma^{pc} \sum_{x' \in \mathcal{X}} P(X' = x' \mid X = x, A = a)\, d^p\big(G(x'), G'(x')\big) \\
&\le \gamma^{pc} \sup_{x' \in \mathcal{X}} d^p\big(G(x'), G'(x')\big) \\
&= \gamma^{pc}\, \bar{d}^p(\eta, \eta').
\end{aligned}$$
Here, (a) follows from regularity of d, (b) follows from the c-homogeneous property of d, and (c) follows from p-convexity, where the mixture is over the values taken by the random variable X'. We also note that
$$(\mathcal{T}^\pi \eta)(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x)\, (\mathcal{T}\eta)(x, a),$$
and hence by p-convexity of d, we have
$$d^p\big((\mathcal{T}^\pi \eta)(x), (\mathcal{T}^\pi \eta')(x)\big) \le \sum_{a \in \mathcal{A}} \pi(a \mid x)\, d^p\big((\mathcal{T}\eta)(x, a), (\mathcal{T}\eta')(x, a)\big) \le \gamma^{pc}\, \bar{d}^p(\eta, \eta').$$
Taking the supremum over $x \in \mathcal{X}$ on the left-hand side and taking pth roots then yields the result.

31. Note that the proof doesn't require us to assume that G(x) and G(y) are independent for $x \ne y$. See Exercise 4.12.
Theorem 4.25 illustrates how the contractivity of the distributional Bellman operator in the probability metric d (specifically, in its supremum extension) follows from natural properties of d. We see that the contraction modulus is closely tied to the homogeneity of d, which informally characterises the extent to which scaling random variables by a factor $\gamma$ brings them "closer together". The theorem provides an alternative to our earlier result regarding Wasserstein distances and enables us to establish contractivity under the $\ell_p$ distances, but also under other probability metrics. Exercise 4.19, in particular, explores contractivity under the so-called maximum mean discrepancy (MMD) distances.
Proof of Proposition 4.20 (contractivity in $\ell_p$ distances). We will apply Theorem 4.25 to the probability metric $\ell_p$, for $p \in [1, \infty)$. It is therefore sufficient to demonstrate that $\ell_p$ is 1/p-homogeneous, regular, and p-convex.

1/p-homogeneity. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$. We make use of the fact that for $\gamma \in [0, 1)$,
$$F_{\gamma Z}(z) = F_Z(z / \gamma).$$
Writing $\ell_p^p$ for the pth power of $\ell_p$, we have
$$\begin{aligned}
\ell_p^p(\gamma Z, \gamma Z') &= \int_{\mathbb{R}} \big|F_{\gamma Z}(z) - F_{\gamma Z'}(z)\big|^p\, dz \\
&= \int_{\mathbb{R}} \big|F_Z(z/\gamma) - F_{Z'}(z/\gamma)\big|^p\, dz \\
&\overset{(a)}{=} \gamma \int_{\mathbb{R}} \big|F_Z(z) - F_{Z'}(z)\big|^p\, dz \\
&= \gamma\, \ell_p^p(Z, Z'),
\end{aligned} \tag{4.16}$$
where (a) follows from a change of variables $z/\gamma \mapsto z$ in the integral. Therefore, we deduce that $\ell_p(\gamma Z, \gamma Z') = \gamma^{1/p}\, \ell_p(Z, Z')$.
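The 1/p-homogeneity property can be sanity-checked on the simplest possible case, two point masses, where $\ell_p$ has a closed form. The sketch below is illustrative only: since the CDFs of $\delta_a$ and $\delta_b$ differ by exactly 1 on the interval between a and b, we have $\ell_p(\delta_a, \delta_b) = |a - b|^{1/p}$.

```python
# Check ell_p(gamma Z, gamma Z') = gamma^{1/p} ell_p(Z, Z') for point masses.
def ell_p_point_masses(a, b, p):
    # |F_{delta_a} - F_{delta_b}| = 1 exactly on the interval between a and b,
    # so ell_p^p(delta_a, delta_b) = |a - b|.
    return abs(a - b) ** (1.0 / p)

gamma, a, b, p = 0.9, -1.0, 2.0, 3.0

lhs = ell_p_point_masses(gamma * a, gamma * b, p)       # ell_p(gamma Z, gamma Z')
rhs = gamma ** (1.0 / p) * ell_p_point_masses(a, b, p)  # gamma^{1/p} ell_p(Z, Z')
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)
```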
Regularity. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$, with $Z, Z'$ independent instantiations of $\nu$ and $\nu'$ respectively, and let W be a random variable independent of $Z, Z'$. Then
$$\begin{aligned}
\ell_p^p(W + Z, W + Z') &= \int_{\mathbb{R}} \big|F_{W+Z}(z) - F_{W+Z'}(z)\big|^p\, dz \\
&= \int_{\mathbb{R}} \big|\mathbb{E}_W[F_Z(z - W)] - \mathbb{E}_W[F_{Z'}(z - W)]\big|^p\, dz \\
&\overset{(a)}{\le} \int_{\mathbb{R}} \mathbb{E}_W\big[|F_Z(z - W) - F_{Z'}(z - W)|^p\big]\, dz \\
&\overset{(b)}{=} \mathbb{E}_W\Big[\int_{\mathbb{R}} |F_Z(z - W) - F_{Z'}(z - W)|^p\, dz\Big] \\
&= \ell_p^p(Z, Z'),
\end{aligned}$$
where (a) follows from Jensen's inequality, and (b) follows by swapping the integral and expectation (more formally justified by Tonelli's theorem).
p-convexity. Let $\alpha \in (0, 1)$, and $\nu_1, \nu'_1, \nu_2, \nu'_2 \in \mathscr{P}(\mathbb{R})$. Note that
$$F_{\alpha \nu_1 + (1-\alpha)\nu_2}(z) = \alpha F_{\nu_1}(z) + (1 - \alpha) F_{\nu_2}(z),$$
and similarly for the primed distributions $\nu'_1, \nu'_2$. By convexity of the function $z \mapsto |z|^p$ on the real numbers and Jensen's inequality,
$$\big|F_{\alpha\nu_1 + (1-\alpha)\nu_2}(z) - F_{\alpha\nu'_1 + (1-\alpha)\nu'_2}(z)\big|^p \le \alpha\big|F_{\nu_1}(z) - F_{\nu'_1}(z)\big|^p + (1 - \alpha)\big|F_{\nu_2}(z) - F_{\nu'_2}(z)\big|^p,$$
for all $z \in \mathbb{R}$. Hence,
$$\ell_p^p\big(\alpha\nu_1 + (1-\alpha)\nu_2,\ \alpha\nu'_1 + (1-\alpha)\nu'_2\big) \le \alpha\, \ell_p^p(\nu_1, \nu'_1) + (1 - \alpha)\, \ell_p^p(\nu_2, \nu'_2),$$
and $\ell_p$ is p-convex.
4.7 A Matter of Domain

Suppose that we have demonstrated, by means of Theorem 4.25, that the distributional Bellman operator is a contraction mapping in the supremum extension of some probability metric d. Is this sufficient to guarantee that the sequence
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$
converges to the return function $\eta^\pi$, by means of Proposition 4.7? In general, no, because d may assign infinite distances to certain pairs of distributions. To invoke Proposition 4.7, we identify a subset of probability distributions $\mathscr{P}_d(\mathbb{R})$ that are all within finite d-distance of each other and then ensure that the distributional Bellman operator is well-behaved on this subset. Specifically, we identify a set of conditions under which

(a) the distributional Bellman operator $\mathcal{T}^\pi$ maps $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ to itself, and
(b) the return function $\eta^\pi$ (the fixed point of $\mathcal{T}^\pi$) lies in $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.

For most common probability metrics and natural problem settings, these requirements are easily verified. In Proposition 4.16, for example, we demonstrated that under the assumption that the reward distributions are bounded, Proposition 4.7 can be applied with the Wasserstein distances. The aim of this section is to extend the analysis to a broader set of probability metrics, but also to a greater number of problem settings, including those where the reward distributions are not bounded.
Definition 4.26. Let d be a probability metric. Its finite domain $\mathscr{P}_d(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ is the set of probability distributions with finite first moment and finite d-distance to the distribution that puts all of its mass on zero:
$$\mathscr{P}_d(\mathbb{R}) = \big\{\nu \in \mathscr{P}(\mathbb{R}) : d(\nu, \delta_0) < \infty,\ \mathbb{E}_{Z \sim \nu}[|Z|] < \infty\big\}. \tag{4.17}$$
By the triangle inequality, for any two distributions $\nu, \nu' \in \mathscr{P}_d(\mathbb{R})$, we are guaranteed $d(\nu, \nu') < \infty$. Although the choice of $\delta_0$ as the reference point is somewhat arbitrary, it is sensible given that many reinforcement learning problems include the possibility of receiving no reward at all (e.g., $G^\pi(x) = 0$). The finite first moment assumption is made in light of Assumption 2.5, which guarantees that return distributions have well-defined expectations.
Proposition 4.27. Let d be a probability metric satisfying the conditions of Theorem 4.25, with finite domain $\mathscr{P}_d(\mathbb{R})$. Let $\mathcal{T}^\pi$ be the distributional Bellman operator corresponding to a given Markov decision process $(\mathcal{X}, \mathcal{A}, \xi_0, P_{\mathcal{X}}, P_R)$. Suppose that

(a) $\eta^\pi \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, and
(b) $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ is closed under $\mathcal{T}^\pi$: for any $\eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, we have that $\mathcal{T}^\pi \eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.

Then for any initial condition $\eta_0 \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, the sequence of iterates defined by $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ converges to $\eta^\pi$ with respect to $\bar{d}$.
Proposition 4.27 is a specialisation of Proposition 4.7 to the distributional setting, and generalises our earlier result regarding bounded reward distributions. Effectively, it allows us to prove the convergence of the sequence $(\eta_k)_{k \ge 0}$ for a family of Markov decision processes satisfying the two conditions above. The condition $\eta^\pi \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, while seemingly benign, does not automatically hold; Exercise 4.20 illustrates the issue using a modified p-Wasserstein distance.
Example 4.28 ($\ell_p$ metrics). For a given $p \in [1, \infty)$, the finite domain of the probability metric $\ell_p$ is
$$\mathscr{P}_{\ell_p}(\mathbb{R}) = \big\{\nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|] < \infty\big\},$$
the set of distributions with finite first moment. This follows because
$$\begin{aligned}
\ell_p^p(\nu, \delta_0) &= \int_{-\infty}^0 F_\nu(z)^p\, dz + \int_0^\infty \big(1 - F_\nu(z)\big)^p\, dz \\
&\le \int_{-\infty}^0 F_\nu(z)\, dz + \int_0^\infty \big(1 - F_\nu(z)\big)\, dz \\
&\overset{(a)}{=} \mathbb{E}[\max(0, -Z)] + \mathbb{E}[\max(0, Z)] \\
&= \mathbb{E}[|Z|],
\end{aligned}$$
where the inequality holds since cumulative probabilities lie in $[0, 1]$ and $p \ge 1$, and (a) follows from expressing $F_\nu(z)$ and $1 - F_\nu(z)$ as integrals and using Tonelli's theorem:
$$\int_0^\infty \big(1 - F_\nu(z)\big)\, dz = \int_0^\infty \mathbb{E}_{Z \sim \nu}\big[\mathbb{1}\{Z > z\}\big]\, dz = \mathbb{E}_{Z \sim \nu}\Big[\int_0^\infty \mathbb{1}\{Z > z\}\, dz\Big] = \mathbb{E}_{Z \sim \nu}[\max(0, Z)],$$
and similarly for the integral from $-\infty$ to 0.
The conditions of Proposition 4.27 are guaranteed by Assumption 2.5 (rewards have finite first moments). From Chapter 2, we know that under this assumption the random return has finite expected value, and hence
$$\eta^\pi(x) \in \mathscr{P}_{\ell_p}(\mathbb{R}), \quad \text{for all } x \in \mathcal{X}.$$
Similarly, we can show from elementary operations that if R and G(x) satisfy
$$\mathbb{E}\big[|R| \mid X = x\big] < \infty, \quad \mathbb{E}\big[|G(x)|\big] < \infty, \quad \text{for all } x \in \mathcal{X},$$
then also
$$\mathbb{E}\big[|R + \gamma G(X')| \mid X = x\big] < \infty, \quad \text{for all } x \in \mathcal{X}.$$
Following Proposition 4.27, provided that $\eta_0 \in \mathscr{P}_{\ell_p}(\mathbb{R})^{\mathcal{X}}$, the sequence $(\eta_k)_{k \ge 0}$ converges to $\eta^\pi$ with respect to $\bar{\ell}_p$.
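The bound $\ell_p^p(\nu, \delta_0) \le \mathbb{E}[|Z|]$ above can be verified numerically for a discrete distribution, whose CDF is a step function. The following Python sketch (an illustration with arbitrarily chosen atoms, not from the formal development) integrates the step CDF piecewise:

```python
# Check ell_p^p(nu, delta_0) <= E|Z| for a small discrete distribution.
def ell_p_p_to_delta0(atoms, probs, p):
    # integral of F^p over (-inf, 0) plus integral of (1 - F)^p over (0, inf);
    # for a step CDF both are sums of rectangle areas between breakpoints
    # (the CDF is 0 below the smallest atom and 1 above the largest).
    pts = sorted(set(atoms + [0.0]))
    total = 0.0
    for left, right in zip(pts[:-1], pts[1:]):
        F = sum(q for z, q in zip(atoms, probs) if z <= left)  # CDF on (left, right)
        height = F ** p if right <= 0 else (1.0 - F) ** p
        total += height * (right - left)
    return total

atoms, probs, p = [-2.0, 1.0, 3.0], [0.2, 0.5, 0.3], 2
lpp = ell_p_p_to_delta0(atoms, probs, p)
mean_abs = sum(q * abs(z) for z, q in zip(atoms, probs))
print(lpp, mean_abs)  # the first value is bounded by the second
assert lpp <= mean_abs
```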
When d is the p-Wasserstein distance ($p \in [1, \infty]$), the finite domain takes on a particularly useful form, which we denote by $\mathscr{P}_p(\mathbb{R})$:
$$\mathscr{P}_p(\mathbb{R}) = \big\{\nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|^p] < \infty\big\}, \quad p \in [1, \infty),$$
$$\mathscr{P}_\infty(\mathbb{R}) = \big\{\nu \in \mathscr{P}(\mathbb{R}) : \exists\, C > 0 \text{ s.t. } \nu([-C, C]) = 1\big\}.$$
For $p < \infty$, this is the set of distributions with bounded pth moments; for $p = \infty$, the set of distributions with bounded support. In particular, observe that the finite domains of $\ell_p$ and $w_1$ coincide: $\mathscr{P}_{\ell_p}(\mathbb{R}) = \mathscr{P}_1(\mathbb{R})$.
As with the $\ell_p$ distances, we can satisfy the conditions of Proposition 4.27 for $d = w_p$ by introducing an assumption on the reward distributions. In this case, we simply require that these be in $\mathscr{P}_p(\mathbb{R})$. As this assumption will recur throughout the book, we state it here in full; Exercise 4.11 goes through the steps of the corresponding result.

Assumption 4.29(p). For each state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$, the reward distribution $P_R(\cdot \mid x, a)$ is in $\mathscr{P}_p(\mathbb{R})$.
Proposition 4.30. Let $p \in [1, \infty]$. Under Assumption 4.29(p), the return function $\eta^\pi$ has finite pth moments ($p = \infty$: is bounded). In addition, for any initial return function $\eta_0 \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$, the sequence defined by
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$
converges to $\eta^\pi$ with respect to the supremum p-Wasserstein metric.
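To make this kind of convergence guarantee concrete, here is a small illustrative simulation (a hypothetical example, not from the book): a deterministic two-state chain on which every iterate remains a point mass, so the supremum 1-Wasserstein distance is simply the largest gap between point locations and contracts by exactly $\gamma$ at each application of $\mathcal{T}^\pi$.

```python
# Deterministic two-state chain: rewards 1 and 0, transitions 0 -> 1 -> 0.
gamma = 0.9
reward = {0: 1.0, 1: 0.0}
succ = {0: 1, 1: 0}

# true returns: G(0) = 1/(1 - gamma^2), G(1) = gamma/(1 - gamma^2)
g_true = {0: 1.0 / (1 - gamma**2), 1: gamma / (1 - gamma**2)}

eta = {0: 0.0, 1: 0.0}  # eta_0(x) = delta_0, represented by its location
dists = []
for k in range(60):
    # (T^pi eta)(x) is the point mass at reward(x) + gamma * location at succ(x)
    eta = {x: reward[x] + gamma * eta[succ[x]] for x in eta}
    dists.append(max(abs(eta[x] - g_true[x]) for x in eta))

# on this chain the sup-w1 distance shrinks by exactly gamma at every step
assert all(abs(d2 - gamma * d1) < 1e-9 for d1, d2 in zip(dists, dists[1:]))
assert dists[-1] < 1e-2
print(dists[-1])
```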
4.8 Weak Convergence of Return Functions*

Proposition 4.27 implies that if each distribution $\eta^\pi(x)$ lies in the finite domain $\mathscr{P}_d(\mathbb{R})$ of a given probability metric d that is regular, c-homogeneous, and p-convex, then $\eta^\pi$ is the unique solution to the equation
$$\eta = \mathcal{T}^\pi \eta \tag{4.18}$$
in the space $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$. It does not, however, rule out the existence of solutions outside this space. This concern can be addressed by showing that for any $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, the sequence of probability distributions $(\eta_k(x))_{k \ge 0}$ defined by
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$
converges weakly to the return distribution $\eta^\pi(x)$, for each state $x \in \mathcal{X}$. In addition to giving an alternative perspective on the quantitative convergence results for these iterates, the uniqueness of $\eta^\pi$ as a solution to Equation 4.18 (stated as Proposition 4.9) follows immediately from Proposition 4.34 below.
Definition 4.31. Let $(\nu_k)_{k \ge 0}$ be a sequence of distributions in $\mathscr{P}(\mathbb{R})$, and let $\nu \in \mathscr{P}(\mathbb{R})$ be another probability distribution. We say that $(\nu_k)_{k \ge 0}$ converges weakly to $\nu$ if for every $z \in \mathbb{R}$ at which $F_\nu$ is continuous, we have $F_{\nu_k}(z) \to F_\nu(z)$.
We will show that for each $x \in \mathcal{X}$, the sequence $(\eta_k(x))_{k \ge 0}$ converges weakly to $\eta^\pi(x)$. A simple approach is to consider the relationships between well-chosen instantiations of $\eta_k$ (for each $k \in \mathbb{N}$) and $\eta^\pi$, by means of the following classical result [see e.g. Billingsley, 2012]. Recall that a sequence of random variables $(Z_k)_{k \ge 0}$ converges to Z with probability 1 if
$$\mathbb{P}\big(\lim_{k \to \infty} Z_k = Z\big) = 1.$$
Lemma 4.32. Let $(\nu_k)_{k \ge 0}$ be a sequence in $\mathscr{P}(\mathbb{R})$, and $\nu \in \mathscr{P}(\mathbb{R})$ be another probability distribution. Let $(Z_k)_{k \ge 0}$ and Z be instantiations of these distributions, all defined on the same probability space. If $Z_k \to Z$ with probability 1, then $\nu_k \to \nu$ weakly.

Lemma 4.32 is not a uniformly-applicable approach to demonstrating weak convergence; such instantiations always exist by Skorokhod's representation theorem [Billingsley, 2012, Section 25], but finding them is not always straightforward. In our case, however, there is a very natural set of instantiations that works, constructed by the following result.
Lemma 4.33. Let $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and let G be an instantiation of $\eta$. For $x \in \mathcal{X}$, if $(X_t, A_t, R_t)_{t \ge 0}$ is a random trajectory with initial state $X_0 = x$, generated by following $\pi$ and independent of G, then
$$\sum_{t=0}^{k-1} \gamma^t R_t + \gamma^k G(X_k)$$
is an instantiation of $((\mathcal{T}^\pi)^k \eta)(x)$.

Proof. This follows by inductively applying Proposition 4.11.
Proposition 4.34. Let $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and for $k \ge 0$ define
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k.$$
Then we have $\eta_k(x) \to \eta^\pi(x)$ weakly for each $x \in \mathcal{X}$, and consequently $\eta^\pi$ is the unique fixed point of $\mathcal{T}^\pi$ in $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
Proof. Fix $x \in \mathcal{X}$, let $G_0$ be an instantiation of $\eta_0$, and on the same probability space, let $(X_t, A_t, R_t)_{t \ge 0}$ be a trajectory generated by $\pi$ with initial state $X_0 = x$, independent of $G_0$ (the existence of such a probability space is guaranteed by the Ionescu-Tulcea theorem, as described in Remark 2.1). Then
$$G_k(x) = \sum_{t=0}^{k-1} \gamma^t R_t + \gamma^k G_0(X_k)$$
is an instantiation of $\eta_k(x)$ by Lemma 4.33. Furthermore, $\sum_{t=0}^{\infty} \gamma^t R_t$ is an instantiation of $\eta^\pi(x)$. We have
$$\Big|G_k(x) - \sum_{t=0}^{\infty} \gamma^t R_t\Big| = \Big|\gamma^k G_0(X_k) - \sum_{t=k}^{\infty} \gamma^t R_t\Big| \le \big|\gamma^k G_0(X_k)\big| + \Big|\sum_{t=k}^{\infty} \gamma^t R_t\Big| \to 0,$$
with probability 1. The convergence of the first term follows since $|G_0(X_k)| \le \max_{x \in \mathcal{X}} |G_0(x)|$ is bounded w.p. 1 and $\gamma^k \to 0$, and the convergence of the second term follows from convergence of $\sum_{t=0}^{k-1} \gamma^t R_t$ to $\sum_{t=0}^{\infty} \gamma^t R_t$ w.p. 1. Therefore we have $G_k(x) \to \sum_{t=0}^{\infty} \gamma^t R_t$ w.p. 1, and so $\eta_k(x) \to \eta^\pi(x)$ weakly. Finally, this implies that there can be no other fixed point of $\mathcal{T}^\pi$ in $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$; if $\eta'$ were such a fixed point, then taking $\eta_0 = \eta'$ we would have $\eta_k = \eta'$ for all $k \ge 0$, and would simultaneously have $\eta_k(x) \to \eta^\pi(x)$ weakly for all $x \in \mathcal{X}$, a contradiction unless $\eta' = \eta^\pi$.
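The construction in this proof can be visualised by simulation. The following sketch (a hypothetical single-state example, not from the book) takes $R_t$ i.i.d. Bernoulli(1/2) and $\gamma = 1/2$, in which case the random return $\sum_t \gamma^t R_t$ has the Uniform[0, 2] distribution (its base-2 digits are fair coin flips); the empirical CDF of the truncated returns $G_k$, with $G_0 \equiv 0$, is then close to the Uniform[0, 2] CDF, illustrating weak convergence.

```python
import random

random.seed(0)
gamma, k, n = 0.5, 30, 20000

def truncated_return():
    # G_k for the single-state chain, with G_0 identically zero
    return sum(gamma**t * random.randint(0, 1) for t in range(k))

samples = sorted(truncated_return() for _ in range(n))
# Kolmogorov-Smirnov distance between the empirical CDF and F(z) = z / 2,
# evaluated just before and just after each sample point
ks = max(max(abs((i + 1) / n - z / 2.0), abs(i / n - z / 2.0))
         for i, z in enumerate(samples))
print(ks)  # small: the empirical CDF is close to the Uniform[0, 2] CDF
assert ks < 0.05
```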
As the name indicates, the notion of weak convergence is not as strong as many other notions of convergence of probability distributions. In general, it does not even guarantee convergence of the means of the sequence of distributions; see Exercise 4.21. In addition, we lose the notion of a convergence rate that was provided by the contractive nature of the distributional operator under specific metrics. The need for stronger guarantees, such as those offered by Proposition 4.27, motivates the contraction mapping theory developed in this chapter.
4.9 Random Variable Bellman Operators*

In this chapter, we defined the distributional Bellman operator $\mathcal{T}^\pi$ as a mapping on the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$. We also saw that the action of the operator on a return function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ can be understood either through direct manipulation of the probability distributions or through manipulation of a collection of random variables instantiating these distributions.

Viewing the operator through its effect on the distribution of a collection of representative random variables is a useful tool for understanding distributional reinforcement learning, and may prompt the reader to ask whether it is possible to avoid referring to probability distributions at all, working instead directly with random variables. We describe one approach to this below using the tools of probability theory, and then discuss some of its shortcomings.
Let $G_0 = (G_0(x) : x \in \mathcal{X})$ be an initial collection of real-valued random variables, indexed by state, supported on a probability space $(\Omega_0, \mathcal{F}_0, \mathbb{P}_0)$. For each $k \in \mathbb{N}^+$, let $(\Omega_k, \mathcal{F}_k, \mathbb{P}_k)$ be another probability space, supporting a collection of random variables $((A_k(x), R_k(x, a), X'_k(x, a)) : x \in \mathcal{X}, a \in \mathcal{A})$, with $A_k(x) \sim \pi(\cdot \mid x)$, and independently $R_k(x, a) \sim P_R(\cdot \mid x, a)$, $X'_k(x, a) \sim P_{\mathcal{X}}(\cdot \mid x, a)$. We then consider the product probability space on $\Omega = \prod_{k \in \mathbb{N}} \Omega_k$. All random variables defined above can naturally be viewed as functions on this joint probability space that depend on $\omega = (\omega_0, \omega_1, \omega_2, \dots) \in \Omega$ only through the coordinate $\omega_k$ that matches the index k on the random variable. Note that under this construction, all random variables with distinct indices are independent.
Now define $\mathcal{X}_{\mathbb{N}}$ as the set of real-valued random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ (where $\mathcal{F}$ is the product $\sigma$-algebra) that depend on only finitely-many coordinates of $\omega \in \Omega$. We can define a Bellman operator $\mathcal{T}^\pi : \mathcal{X}_{\mathbb{N}}^{\mathcal{X}} \to \mathcal{X}_{\mathbb{N}}^{\mathcal{X}}$ as follows. Given $G = (G(x) : x \in \mathcal{X}) \in \mathcal{X}_{\mathbb{N}}^{\mathcal{X}}$, let $K \in \mathbb{N}$ be the smallest integer such that the random variables $(G(x) : x \in \mathcal{X})$ depend on $\omega = (\omega_0, \omega_1, \omega_2, \dots) \in \Omega$ only through $\omega_0, \dots, \omega_{K-1}$; such an integer exists due to the definition of $\mathcal{X}_{\mathbb{N}}$ and the finiteness of $\mathcal{X}$. We then define $\mathcal{T}^\pi G \in \mathcal{X}_{\mathbb{N}}^{\mathcal{X}}$ by
$$(\mathcal{T}^\pi G)(x) = R_K(x, A_K(x)) + \gamma\, G\big(X'_K(x, A_K(x))\big).$$
With this definition, we can obtain a sequence of collections of random variables $(G_k)_{k \ge 0}$, defined iteratively by $G_{k+1} = \mathcal{T}^\pi G_k$, for $k \ge 0$.³² We have therefore formalised an operator entirely within the realm of random variables, without reference to the distributions of the iterates $(G_k)_{k \ge 0}$. By construction, the distributions of the random variables $(G_k)_{k \ge 0}$ match the sequence of distributions that would be obtained by working directly on the space of probability distributions with the usual distributional Bellman operator. More concretely, if $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is such that $\eta_0(x)$ is the distribution of $G_0(x)$ for each $x \in \mathcal{X}$, then $\eta_k$, defined by $\eta_k = (\mathcal{T}^\pi)^k \eta_0$, is such that $\eta_k(x)$ is the distribution of $G_k(x)$, for each $x \in \mathcal{X}$. Thus, the random variable Bellman operator constructed above is consistent with the distributional Bellman operator that is the main focus of this chapter.
One difficulty with this random variable operator is that it does not have a fixed point: while the distribution of the random variables $G_k(\cdot)$ converges to that of $G^\pi(\cdot)$, the random variables themselves, as functions on the probability space, do not converge.³³ Thus, while this is one way to view distributional reinforcement learning purely in terms of random variables, it is much less natural to analyse algorithms from this perspective than through probability distributions as described in this chapter.
4.10 Technical Remarks

Remark 4.1. Our exposition in this chapter has focused on the standard distributional Bellman equation
$$G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X'), \quad X = x.$$
A similar development is possible with the alternative notions of random return mentioned in Section 2.9, including the 'probability of continuing' interpretation of the discount factor $\gamma$. In general, the distributional operators that arise from these alternative notions of the random return are still contractions, although the metrics and contraction moduli involved in these statements differ. For example, while the standard distributional Bellman operator is not a contraction in total variation distance, the distributional Bellman operator associated with the 'probability of continuing' random return is; Exercise 4.17 explores this point in greater detail.
32. This is real equality between random variables, rather than equality in distribution.
33. This is a fairly subtle point; the reader is invited to consider what happens to $G_k(x)$ and $G^\pi(x)$ as mappings from $\Omega$ to $\mathbb{R}$.
Remark 4.2. The ideas developed in this chapter, and indeed the other chapters of the book, can also be applied to learning other properties related to the return. Achab explores several methods for learning the distribution of the random variables $R + \gamma V^\pi(X')$, under the different possible initial conditions $X = x$; these objects interpolate between the expected return and the full distribution of the return. Exercise 4.22 explores the development of contractive operators for the distributions of these objects.
Remark 4.3. As well as allowing us to quantify the contractivity of the distributional Bellman operator, the probability metrics described in this chapter can be used to measure the accuracy of algorithms that aim to approximate the return distribution. In particular, they can be used to give quantitative guarantees on the accuracy of the non-parametric distributional Monte Carlo algorithm described in Remark 3.1. These guarantees can then be used to determine how many sample trajectories are required to approximate the true return distribution at a required level of accuracy, for example when evaluating the performance of the algorithms in Chapters 5 and 6. We assume that K returns $(G_k)_{k=1}^{K}$ beginning at state $x_0$ have been generated independently via the policy $\pi$, and consider the accuracy of the non-parametric distributional Monte Carlo estimator
$$\hat{\eta}^\pi(x_0) = \frac{1}{K} \sum_{k=1}^{K} \delta_{G_k}.$$
Different realisations of the sampled returns will lead to different estimates; the following result provides a guarantee in the Kolmogorov-Smirnov distance $\ell_\infty$ that holds with high probability over the possible realisations of the returns. For any $\varepsilon > 0$, we have
$$\ell_\infty\big(\hat{\eta}^\pi(x_0), \eta^\pi(x_0)\big) \le \varepsilon, \quad \text{with probability at least } 1 - 2\exp(-2K\varepsilon^2).$$
Thus, for a desired level of accuracy $\varepsilon$ and a desired confidence $1 - \delta$, K can be selected so that $2\exp(-2K\varepsilon^2) < \delta$, which then yields the guarantee that with probability at least $1 - \delta$, we have $\ell_\infty(\hat{\eta}^\pi(x_0), \eta^\pi(x_0)) \le \varepsilon$. This is in fact a key result in empirical process theory, known as the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality [Dvoretzky et al., 1956, Massart, 1990]. Its uses specifically in reinforcement learning include the works of Keramati et al. and Chandak et al. Similar concentration bounds are possible under other probability metrics (such as the Wasserstein distances; see e.g. Weed and Bach, Bobkov and Ledoux), though typically some form of a priori information about the return distribution, such as bounds on minimum/maximum returns, is required to establish such bounds.
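For instance, the required number of samples for a given $\varepsilon$ and $\delta$ follows directly from the bound. A short sketch (`dkw_sample_size` is a hypothetical helper name, not from the book):

```python
import math

# We want ell_inf(hat eta(x0), eta(x0)) <= eps with probability >= 1 - delta,
# for which it suffices that 2 * exp(-2 * K * eps**2) < delta.
def dkw_sample_size(eps, delta):
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))

K = dkw_sample_size(0.05, 0.01)
print(K)  # number of independent return samples needed
assert 2.0 * math.exp(-2.0 * K * 0.05**2) < 0.01
```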
4.11 Bibliographical Remarks

4.1–4.2. The use of operator theory to understand reinforcement learning algorithms is standard to most textbooks in the field [Bertsekas and Tsitsiklis, 1996, Szepesvári, 2010, Puterman, 2014, Sutton and Barto, 2018], and is commonly used in theoretical reinforcement learning [Munos, 2003, Bertsekas, 2011, Scherrer, 2014]. Puterman provides a thorough introduction to vector notation for Markov decision processes. Improved convergence results can be obtained by studying the eigenspectrum of the transition kernel, as shown by e.g. Morton, Bertsekas [1994, 2012]. The elementary contraction mapping theory described in Section 4.2 goes back to Banach. Our reference on metric spaces is the definitive textbook by Rudin.

Not all reinforcement learning algorithms are readily analysed using the theory of contraction mappings. This is the case for policy-gradient algorithms [Sutton et al., 2000, but see Ghosh et al., 2020 and Bhandari and Russo, 2021], but also value-based algorithms such as advantage learning [Baird, 1999, Bellemare et al., 2016].
4.3. The random-variable Bellman operator presented here is from our earlier work [Bellemare et al., 2017a], which provided an analysis in the p-Wasserstein distances. There, the technical issues discussed in Section 4.9 were obviated by declaring R, X' and the collection $(G(x) : x \in \mathcal{X})$ to be independent. Those issues were raised in a later paper [Rowland et al., 2018], which also introduced the distributional Bellman equation in terms of probability measures and the pushforward notation. The probability density equation (Equation 4.9) can be found in Morimura et al. [2010b]. Earlier instances of distributional operators are given by Chung and Sobel, Morimura et al. [2010a], who provide an operator on cumulative distribution functions, and Jaquette, who provides an operator on Laplace transforms.
4.4. The Wasserstein distance can be traced back to Leonid Kantorovich [Kantorovich, 1942] and has been rediscovered (and renamed) multiple times in its history. The name we use here is common, but a misnomer, as Leonid Vaserstein (after whom the distance is named) did not himself do any substantial work on the topic. Among other names, we note the Mallows metric [Bickel and Freedman, 1981] and the Earth-Mover distance [Rubner et al., 1998]. Much earlier, Monge was the first to study the problem of optimal transportation from a transport theory perspective. See Vershik and Panaretos and Zemel for further historical comments, and Villani for a survey of theoretical properties. A version of the contraction analysis in p-Wasserstein distances was given by Bellemare et al. [2017a]; we owe the proof of Proposition 4.15 in terms of optimal couplings to Philip Amortila.

The use of contraction mapping theory to analyse stochastic fixed-point equations was introduced by Rösler, who analyses the Quicksort algorithm by characterising the distributional fixed points of contraction mappings in 2-Wasserstein distance. Applications and generalisations of this technique include the analysis of further recursive algorithms, models in stochastic geometry, and branching processes [Rösler, 1992, Rachev and Rüschendorf, 1995, Neininger, 1999, Rösler and Rüschendorf, 2001, Rösler, 2001, Neininger, 2001, Neininger and Rüschendorf, 2004, Rüschendorf, 2006, Rüschendorf and Neininger, 2006, Alsmeyer, 2012]. Although the random-variable Bellman equation (Equation 2.15) can be viewed as a system of recursive distributional equations, the emphasis on a collection of effectively independent random variables $(G^\pi)$ differs from the usual treatment of such equations.
4.5. The family of $\ell_p$ distances described in this chapter is covered at length in the work of Rachev et al., which studies an impressive variety of probability metrics. A version of the contraction analysis in Cramér distance was originally given by Rowland et al. In two and more dimensions, the Cramér distance is generalised by the energy distance [Székely, 2002, Székely and Rizzo, 2013, Rizzo and Székely, 2016], itself a member of the maximum mean discrepancy (MMD) family [Gretton et al., 2012]; contraction analysis in terms of MMD metrics was undertaken by Nguyen et al. (see Exercise 4.19 for further details). Another special case of the $\ell_p$ metrics considered in this chapter is the Kolmogorov-Smirnov distance ($\ell_\infty$), which features in results in empirical process theory, such as the Glivenko-Cantelli theorem. Many of these metrics are integral probability metrics [Müller, 1997], which allows for a dual formulation with appealing algorithmic consequences. Chung and Sobel provide a non-expansion result in total variation distance (without naming it as such; the proof uses an integral probability metric formulation).
4.6. The properties of regularity, convexity, and c-homogeneity were introduced by Zolotarev in a slightly more general setting. Our earlier work presented these in a modern context [Bellemare et al., 2017b], albeit with only a mention of their potential use in reinforcement learning. Although that work proposed the term "sum-invariant" as mnemonically simpler, this is only technically correct when Equation 4.15 holds with equality; we have thus chosen to keep the original name. Theorem 4.25 is new to this book.
The characterisation of the Wasserstein distance as an optimal transport problem
in Proposition 4.18 is the standard presentation of the Wasserstein distance in
more abstract settings, which allows it to be applied to probability distributions
over reasonably general metric spaces [Villani, 2003, 2008, Ambrosio et al.,
2005, Santambrogio, 2015]. Optimal transport has also increasingly found
application within machine learning in recent years, particularly in generative
modelling [Arjovsky et al., 2017]. Optimal transport and couplings also arise in
the study of bisimulation metrics for Markov decision processes [Ferns et al.,
2004, Ferns and Precup, 2014, Amortila et al., 2019] as well as analytical tools
for sample-based algorithms [Amortila et al., 2020]. Peyré et al. provide
an overview of algorithms, analysis, and applications associated with optimal
transport in machine learning and related disciplines.
4.7–4.8.
Villani gives further discussion on the domain of Wasserstein distances, and on their relationship to weak convergence.
4.9.
The usefulness of a random-variable operator has been a source of intense debate between the authors of this book. The form we present here is inspired by the “stack of rewards” model from Lattimore and Szepesvári.
4.12 Exercises
Exercise 4.1.
Show that the no-loop operator defined in Example 4.6 is a contraction mapping with modulus

    max_{x∈X} P_π(X′ ≠ x | X = x). △
Exercise 4.2. For p ∈ [1, ∞), let ‖·‖_p be the L^p norm over R^X, defined as

    ‖V‖_p = ( Σ_{x∈X} |V(x)|^p )^{1/p}.

Show that T^π is not a contraction mapping in the metric induced by the L^p norm unless p = ∞ (see Equation 4.4 for a definition of the L^∞ metric). Hint. A two-state example suffices. △
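A numerical sketch of the kind of two-state counterexample the hint suggests (our own construction; the chain, rewards, and discount below are illustrative assumptions, not taken from the text):

```python
import math

gamma = 0.9

# Two-state Markov reward process: both states transition deterministically to
# state 0, and all rewards are zero, so (T V)(x) = gamma * V[0] for every x.
def bellman(V):
    return [gamma * V[0], gamma * V[0]]

def l2(U, W):
    return math.sqrt(sum((u - w) ** 2 for u, w in zip(U, W)))

def linf(U, W):
    return max(abs(u - w) for u, w in zip(U, W))

V1, V2 = [1.0, 0.0], [0.0, 0.0]
before, after = l2(V1, V2), l2(bellman(V1), bellman(V2))
# L2 distance: before = 1, after = gamma * sqrt(2), an expansion for gamma > 1/sqrt(2).
# The L-infinity distance, by contrast, contracts with modulus gamma.
```

Since γ√2 > 1 whenever γ > 1/√2, no modulus below one can hold in the L² norm; duplicating more states that all map to state 0 adapts the same construction to any finite p.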
Exercise 4.3.
In this exercise, you will use the ideas of Section 4.1 to study several operators associated with expected-value reinforcement learning.

(i) For n ∈ N⁺, consider the n-step evaluation operator T_n : R^X → R^X defined by

    (T_n V)(x) = E_π[ Σ_{t=0}^{n−1} γ^t R_t + γ^n V(X_n) | X_0 = x ].

Show that T_n has V^π as a fixed point, and is a contraction mapping with respect to the L^∞ metric with contraction modulus γ^n. Hence, deduce that repeated application of T_n to any initial value function estimate converges to V^π. Show that in fact, T_n = (T^π)^n.

(ii) Consider the λ-return operator T^λ : R^X → R^X defined by

    (T^λ V)(x) = (1 − λ) Σ_{n=1}^∞ λ^{n−1} E_π[ Σ_{t=0}^{n−1} γ^t R_t + γ^n V(X_n) | X_0 = x ].

Show that T^λ has V^π as a fixed point, and is a contraction mapping with respect to the L^∞ metric with contraction modulus

    γ(1 − λ) / (1 − λγ).

Hence, deduce that repeated application of T^λ to any initial value function estimate converges to V^π. △
Exercise 4.4.
Consider the random-variable operator (Equation 4.7). Given a return-variable function G_0, write G_0(x, ω) = G_0(x)(ω) for the realisation of the random return corresponding to ω ∈ Ω. For each x ∈ X, let (A(x), R(x), X′(x)) be an independent random transition defined on the same probability space, and write (A(x, ω), R(x, ω), X′(x, ω)) to denote the dependence of these random variables on ω ∈ Ω. Suppose that for each k ≥ 0, we define the return-variable function G_{k+1} as

    G_{k+1}(x, ω) = R(x, ω) + γ G_k(X′(x, ω), ω).

For a given x, characterise the function

    G_∞(x, ω) = lim_{k→∞} G_k(x, ω). △
Exercise 4.5.
Suppose that you are given a description of a Markov decision process along with a policy π, with the property that the policy π reaches a terminal state (i.e., one for which the return is zero) in at most T ∈ N steps. Describe a recursive procedure that takes in a scalar ω ∈ [0, 1] and outputs a return z ∈ R such that P_π(G^π(X_0) ≤ z) = ω. Hint. You may want to use the fact that sampling a random variable Z can be emulated by drawing u uniformly from [0, 1], and returning F_Z^{−1}(u). △
Exercise 4.6. Let p ∈ [1, ∞].

(i) Show that for any ν, ν′ ∈ P(R) with finite p-th moments, we have w_p(ν, ν′) < ∞. Hint. Use the triangle inequality with intermediate distribution δ_0. Hence, prove that the p-Wasserstein metric is indeed a metric on P_p(R), for p ∈ [1, ∞].

(ii) Show that on the space P(R), the p-Wasserstein metric satisfies all requirements of a metric except finiteness. For each p ∈ [1, ∞], exhibit a pair of distributions ν, ν′ ∈ P(R) such that w_p(ν, ν′) = ∞.

(iii) Show that if 1 ≤ q < p < ∞, we have E[|Z|^p] < ∞ ⟹ E[|Z|^q] < ∞ for any random variable Z. Deduce that P_p(R) ⊆ P_q(R). Hint. Consider applying Jensen’s inequality with the function z ↦ |z|^{p/q}. △
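The hint in part (iii) can be unpacked as follows (our own working, offered as a sketch rather than the book's solution): apply Jensen's inequality with the convex function φ(w) = w^{p/q} (convex because p/q > 1) to the random variable W = |Z|^q:

```latex
\mathbb{E}\big[|Z|^q\big]^{p/q}
  = \varphi\big(\mathbb{E}[W]\big)
  \le \mathbb{E}\big[\varphi(W)\big]
  = \mathbb{E}\big[|Z|^p\big] < \infty ,
```

so E[|Z|^q] ≤ E[|Z|^p]^{q/p} < ∞, and any distribution with a finite p-th moment also has a finite q-th moment.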
Exercise 4.7.
Let d : P(R) × P(R) → [0, ∞] be a probability metric with finite domain P_d(R).

(i) Prove that the supremum distance d̄ is an extended metric on P(R)^X.
(ii) Prove that it is a metric on P_d(R)^X. △
Exercise 4.8. Prove Proposition 4.15 for p = ∞. △
Exercise 4.9.
Consider two normally-distributed random variables with means μ and μ′ and common variance σ². Derive an expression for the ∞-Wasserstein distance between these two distributions. Conclude that Assumption 4.29(∞) is sufficient, but not necessary, for two distributions to have finite ∞-Wasserstein distance. How does the situation change if the two normal distributions in question have unequal variances? △
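The quantile coupling behind this exercise can be checked numerically. With equal variances the two Gaussian quantile curves are parallel: their gap is exactly μ′ − μ at every level τ, so every w_p distance, including w_∞, equals |μ′ − μ|. A standard-library sketch of ours (the particular means and variances are illustrative):

```python
from statistics import NormalDist

mu, mu_prime, sigma = 0.0, 1.5, 2.0
f = NormalDist(mu, sigma)
g = NormalDist(mu_prime, sigma)

taus = [0.001, 0.25, 0.5, 0.75, 0.999]
gaps = [g.inv_cdf(t) - f.inv_cdf(t) for t in taus]
# every gap equals mu_prime - mu = 1.5, so w_inf = 1.5 despite unbounded support

# With unequal variances the gap is (mu' - mu) + (sigma' - sigma) * z_tau,
# which diverges as tau -> 0 or 1, so w_inf is infinite.
h = NormalDist(mu_prime, 3.0)
tail_gaps = [h.inv_cdf(t) - f.inv_cdf(t) for t in (0.9, 0.99, 0.999)]
# tail_gaps grows without bound as tau -> 1
```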
Exercise 4.10.
Explain, in words, why Assumption 4.29(p) is needed for Proposition 4.30. By considering the definition of the p-Wasserstein distance, explain why for p > 1 and 1 ≤ q < p, Assumption 4.29(q) is not sufficient to guarantee convergence in the p-Wasserstein distance. △
Exercise 4.11.
This exercise guides you through the proof of Proposition 4.30.

(i) First, show that under Assumption 4.29(p), the return distributions η^π(x) have finite p-th moments, for all x ∈ X. You may find it useful to deal with p = ∞ separately, and in the case p ∈ [1, ∞), you may find it useful to rewrite

    E[ |Σ_{t=0}^∞ γ^t R_t|^p | X = x ] = (1 − γ)^{−p} E[ |Σ_{t=0}^∞ (1 − γ) γ^t R_t|^p | X = x ],

and use Jensen’s inequality on the function z ↦ |z|^p.

(ii) Let η ∈ P_p(R)^X, and let G be an instantiation of η. First letting p ∈ [1, ∞), use the inequality |z_1 + z_2|^p ≤ 2^{p−1}(|z_1|^p + |z_2|^p) to argue that if (X = x, A, R, X′) is a sample transition independent of G, then

    E[ |R + γG(X′)|^p | X = x ] < ∞.

Hence, argue that under Assumption 4.29(p), P_p(R)^X is closed under T^π. Argue separately that this holds for p = ∞ too.

Hence argue that Proposition 4.27 applies, and hence conclude that Proposition 4.30 holds. △
Exercise 4.12.
In the proof of Theorem 4.25, we did not need to assume that for the two distinct states x, y ∈ X, their associated returns G(x) and G(y) are independent. Explain why. △
Exercise 4.13. The (1,1)-Pareto distribution ν_PAR has cumulative distribution function

    F_{ν_PAR}(z) = 0 if z < 1,   1 − 1/z if z ≥ 1.

Justify the necessity of including Assumption 2.5 (finite-mean rewards) in Definition 4.26 by demonstrating that ℓ_2(ν_PAR, δ_0) < ∞ yet E_{Z∼ν_PAR}[Z] = ∞. △
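Both quantities in this exercise can be evaluated in closed form, and checked numerically with a crude midpoint rule (a sketch of ours, not code from the text):

```python
import math

def F_pareto(z):
    """CDF of the (1,1)-Pareto distribution."""
    return 0.0 if z < 1.0 else 1.0 - 1.0 / z

def midpoint(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Squared Cramér (ell_2) distance to the Dirac delta at 0:
# the CDFs differ by 1 on [0, 1) and by 1/z on [1, inf),
# so ell_2 squared = 1 + integral_1^inf z^{-2} dz = 2 (finite).
tail = midpoint(lambda z: (1.0 - F_pareto(z)) ** 2, 1.0, 10_000.0)
l2_squared = 1.0 + tail  # approximately 2

# The mean, by contrast, diverges: integral_1^B z dF(z) = log B -> infinity.
partial_means = [midpoint(lambda z: 1.0 / z, 1.0, B) for B in (10.0, 1e3, 1e6)]
# partial_means tracks log B, growing without bound
```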
Exercise 4.14.
The purpose of this exercise is to contrast the p-Wasserstein and ℓ_p distances. For each of the following, find a pair of probability distributions ν, ν′ ∈ P(R) such that, for a given ε > 0,

(i) w_2(ν, ν′) is smaller than ℓ_2(ν, ν′);
(ii) w_2(ν, ν′) is larger than ℓ_2(ν, ν′);
(iii) w_∞(ν, ν′) = ε and ℓ_∞(ν, ν′) = 1;
(iv) w_∞(ν, ν′) = 1 and ℓ_∞(ν, ν′) = ε. △
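Part (iii) can be seen with two Dirac deltas (a construction of ours, offered as a sketch): shifting a unit mass by ε moves it a w_∞-distance of only ε, while the two CDFs still disagree by 1 everywhere between the atoms.

```python
eps = 0.1

def cdf_dirac(atom):
    """CDF of the Dirac delta at `atom`."""
    return lambda z: 1.0 if z >= atom else 0.0

F, G = cdf_dirac(0.0), cdf_dirac(eps)

# w_inf: both quantile functions are constant (0 and eps respectively),
# so the optimal coupling moves the unit mass a distance of exactly eps.
w_inf = abs(eps - 0.0)

# ell_inf (Kolmogorov-Smirnov): the CDFs differ by 1 anywhere in [0, eps).
ks = max(abs(F(z) - G(z)) for z in [-1.0, 0.0, eps / 2, eps, 1.0])
# ks == 1.0

# For part (iv), reverse the roles: nu = delta_0 and
# nu' = (1 - eps) delta_0 + eps delta_1 give ell_inf = eps but w_inf = 1.
```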
Exercise 4.15.
Show that the dependence on γ^{1/p} is tight in Proposition 4.20. △
Exercise 4.16.
The total variation distance d_TV : P(R) × P(R) → R is defined by

    d_TV(ν, ν′) = sup_{U⊆R} |ν(U) − ν′(U)|,    (4.19)

for all ν, ν′ ∈ P(R).³⁴ Show, by means of a counterexample, that the distributional Bellman operator is not a contraction mapping in the supremum extension of this distance. △
Exercise 4.17.
Consider the alternative notion of the return introduced in Section 2.9, in which γ is treated as the probability of continuing. Write down the distributional Bellman operator that corresponds to this notion of the return. For which metrics considered in this chapter is this distributional Bellman operator a contraction mapping? Show in particular that this distributional Bellman operator is a contraction with respect to the supremum version of the total variation distance over return-distribution functions, introduced in Exercise 4.16. What is its contraction modulus? △
Exercise 4.18.
Remark 2.3 describes some differences between Markov decision processes with finite state spaces (as we consider throughout the book) and generalisations with infinite state spaces. The contraction mapping theory in this chapter is one case where stronger assumptions are required when moving to larger state spaces. Using the example described in Remark 2.3, show that Assumption 4.29(w_1) is insufficient to make P_1(R) closed under the distributional Bellman operator T^π when the state space is countably infinite. How could this assumption be strengthened to guarantee closedness? △
Exercise 4.19 (*).
The goal of this exercise is to explore the contractivity of the distributional Bellman operator with respect to a class of metrics known as maximum mean discrepancies; this analysis was undertaken by Nguyen et al. A kernel on R is a function K : R × R → R with the property that for any finite set {z_1, …, z_n} ⊆ R, the n × n matrix with (i, j)-th entry K(z_i, z_j) is positive semi-definite. A function K : R × R → R satisfying the weaker condition that

    Σ_{i,j=1}^n c_i c_j K(z_i, z_j) ≥ 0

whenever Σ_{i=1}^n c_i = 0 is called a conditionally positive-definite kernel. Conditionally positive-definite kernels form a measure of similarity between pairs of points in R, and can also be used to define notions of distance over probability distributions. The maximum mean discrepancy (MMD) probability metric associated with the conditionally positive-definite kernel K is defined by

    MMD_K(ν, ν′) = ( E_{X,X′∼ν}[K(X, X′)] + E_{Y,Y′∼ν′}[K(Y, Y′)] − 2 E_{X∼ν,Y∼ν′}[K(X, Y)] )^{1/2},

where each pair of random variables in the expectations above are taken to be independent.

(i) Consider the function K_α(z_1, z_2) = −|z_1 − z_2|^α, with α ∈ (0, 2). Székely and Rizzo [2013, Proposition 2] show that this defines a conditionally positive-definite kernel. Show that MMD_{K_α} is regular, c-homogeneous (for some c > 0), and p-convex (for some p ∈ [1, ∞)). Hence use Theorem 4.25 to establish that the distributional Bellman operator is a contraction with respect to MMD_{K_α}, under suitable assumptions.

(ii) The Gaussian kernel, or squared exponential kernel, with variance σ² > 0 and length scale λ > 0 is defined by K(z_1, z_2) = σ² exp(−(z_1 − z_2)²/(2λ²)). Show, through the use of a counterexample, that the MMD corresponding to the Gaussian kernel is not c-homogeneous for any c > 0, and so Theorem 4.25 cannot be applied. Further, find an MDP and policy π that serve as a counterexample to the contractivity of the distributional Bellman operator with respect to this MMD metric. △

34. For the reader with a measure-theoretic background, the supremum here is over measurable subsets of R.
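As an empirical companion to part (i), the MMD under the kernel K_α can be computed exactly for finite discrete distributions (our own sketch, not code from the text). Doubling the scale of the atoms scales the distance by 2^{α/2}, consistent with c-homogeneity for c = α/2:

```python
import itertools

def mmd_energy(atoms1, probs1, atoms2, probs2, alpha=1.0):
    """MMD under K_alpha(z1, z2) = -|z1 - z2|**alpha, for finite discrete distributions."""
    def cross(a, pa, b, pb):
        # E|X - Y|^alpha for X ~ (a, pa) and Y ~ (b, pb), independent
        return sum(p * q * abs(x - y) ** alpha
                   for (x, p), (y, q) in itertools.product(zip(a, pa), zip(b, pb)))
    sq = (2 * cross(atoms1, probs1, atoms2, probs2)
          - cross(atoms1, probs1, atoms1, probs1)
          - cross(atoms2, probs2, atoms2, probs2))
    return max(sq, 0.0) ** 0.5  # clamp tiny negative rounding error

alpha = 1.0
d1 = mmd_energy([0.0], [1.0], [1.0], [1.0], alpha)  # delta_0 vs delta_1: sqrt(2)
d2 = mmd_energy([0.0], [1.0], [2.0], [1.0], alpha)  # delta_0 vs delta_2: sqrt(4) = 2
# d2 / d1 == 2 ** (alpha / 2): scaling inputs by 2 scales the metric by 2^(alpha/2)
```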
Exercise 4.20.
Let p ∈ [1, ∞], and consider a modification of the p-Wasserstein distance, w̃_p, such that w̃_p(ν, ν′) = w_p(ν, ν′) if both ν, ν′ ∈ P(R) are expressible as finite mixtures of Dirac deltas, and w̃_p(ν, ν′) = ∞ otherwise.

(i) Show that P_{w̃_p}(R) is the set of distributions expressible as finite mixtures of Dirac deltas.

(ii) Exhibit a Markov decision process and policy π for which all conditions of Proposition 4.27 hold except for the condition η^π ∈ P_{w̃_p}(R)^X.

(iii) Show that the sequence of iterates (η_k)_{k≥0} does not converge to η^π under w̃_p in this case. △
Exercise 4.21. Consider the sequence of distributions (ν_k)_{k≥1} defined by

    ν_k = ((k − 1)/k) δ_0 + (1/k) δ_k.

Show that this sequence converges weakly to another distribution ν. From this, deduce that weak convergence does not imply convergence of expectations. △
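A direct calculation (ours) makes the failure concrete: the mass at the far atom vanishes, so ν_k converges weakly to δ_0, yet that atom sits at distance k, which keeps the mean pinned at 1 rather than converging to 0.

```python
ks = [10, 100, 10_000]

# Mean of nu_k: ((k-1)/k) * 0 + (1/k) * k = 1 for every k.
means = [((k - 1) / k) * 0.0 + (1.0 / k) * k for k in ks]

# CDF of nu_k at any fixed z in (0, k) is (k-1)/k -> 1,
# matching the CDF of delta_0, the weak limit.
cdf_at_half = [(k - 1) / k for k in ks]
```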
Exercise 4.22. Achab considers the random variables

    G̃^π(x) = R + γV^π(X′),   X = x.

What does the distribution of this random variable capture about the underlying MDP and policy π? Using the tools of this chapter, derive an operator over P(R)^X that has the collection of distributions corresponding to this random variable under each of the initial conditions in X as a fixed point, and analyse the properties of this operator. △