4 Operators and Metrics
Anyone who has learned to play a musical instrument knows that practice makes perfect. Along the way, however, one's ability to play a difficult passage usually varies according to a number of factors. On occasion, something that could be played easily the day before now seems insurmountable. The adage expresses an abstract notion, that practice improves performance on average or over a long period of time, rather than a concrete statement about instantaneous ability.
In the same way, reinforcement learning algorithms deployed in real situations
behave differently from moment to moment. Variations arise due to different
initial conditions, specific choices of parameters, hardware non-determinism, or
simply because of randomness in the agent’s interactions with its environment.
These factors make it hard to make precise predictions, for example about the
magnitude of the value function estimate learned by TD learning at a particular
state x and step k, other than by extensive simulations. Nevertheless, the large-
scale behaviour of TD learning is relatively predictable, sufficiently so that
convergence can be established under certain conditions, and convergence rates
can be derived.
This chapter introduces the language of operators as an effective abstraction
with which to study such long-term behaviour, characterise the asymptotic
properties of reinforcement learning algorithms, and eventually explain what
makes an effective algorithm. In addition to being useful in the study of existing
algorithms, operators also serve as a kind of blueprint when designing new
algorithms, from which incremental methods such as categorical temporal-
difference learning can then be derived.
4.1 The Bellman Operator
The value function $V^\pi$ characterises the expected return obtained by following a policy $\pi$, beginning in a given state $x$:
$$V^\pi(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \;\Big|\; X_0 = x \Big] .$$
The Bellman equation establishes a relationship between the expected return from one state and from its successors:
$$V^\pi(x) = \mathbb{E}_\pi\big[ R + \gamma V^\pi(X') \;\big|\; X = x \big] .$$
Let us now consider a state-indexed collection of real variables, written $V \in \mathbb{R}^{\mathcal{X}}$, which we call a value function estimate. By substituting $V$ for $V^\pi$ in the original Bellman equation, we obtain the system of equations
$$V(x) = \mathbb{E}_\pi\big[ R + \gamma V(X') \;\big|\; X = x \big] , \quad \text{for all } x \in \mathcal{X} . \qquad (4.1)$$
From Chapter 2, we know that $V^\pi$ is one solution to the above.
Are there other solutions to Equation 4.1? In this chapter we answer this question (negatively) by interpreting the right-hand side of the equation as applying a transformation to the estimate $V$. For a given realisation $(x, a, r, x')$ of the random transition, this transformation indexes $V$ by $x'$, multiplies it by the discount factor, and adds it to the immediate reward (this yields $r + \gamma V(x')$). The actual transformation returns the value that is obtained by following these steps, in expectation. Functions that map elements of a space onto itself, such as this one (from estimates to estimates), are called operators.
Definition 4.1. The Bellman operator is the mapping $T^\pi : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by
$$(T^\pi V)(x) = \mathbb{E}_\pi\big[ R + \gamma V(X') \;\big|\; X = x \big] . \qquad (4.2)$$
Here, the notation $T^\pi V$ should be understood as "$T^\pi$, applied to $V$".
The Bellman operator gives a particularly concise way of expressing the transformations implied in Equation 4.1:
$$V = T^\pi V .$$
As we will see in later chapters, it also serves as the springboard for the design and analysis of algorithms for learning $V^\pi$.
When working with the Bellman operator, it is often useful to treat $V$ as a finite-dimensional vector in $\mathbb{R}^{\mathcal{X}}$ and to express the Bellman operator in terms of vector operations. That is, we write
$$T^\pi V = r^\pi + \gamma P^\pi V , \qquad (4.3)$$
where $r^\pi(x) = \mathbb{E}_\pi[R \mid X = x]$ and $P^\pi$ is the transition operator^{24} defined as
$$(P^\pi V)(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a) \, V(x') .$$
Equation 4.3 follows from these definitions and the linearity of expectations.
24. It is also possible to express $P^\pi$ as a stochastic matrix, in which case $P^\pi V$ describes a matrix-vector multiplication. We will return to this point in Chapter 5.
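To make the vector form of Equation 4.3 concrete, here is a minimal sketch in Python; the small Markov decision process, its transition probabilities, rewards, and policy are made-up placeholders rather than anything specified in the text.

```python
import numpy as np

# A hypothetical 3-state, 2-action MDP, used purely for illustration.
num_states, num_actions = 3, 2
rng = np.random.default_rng(0)

P_X = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))  # P_X[a, x, x']
r = rng.standard_normal((num_states, num_actions))                        # expected reward r(x, a)
pi = rng.dirichlet(np.ones(num_actions), size=num_states)                 # policy pi(a | x)
gamma = 0.9

# State-to-state transition operator P^pi and expected reward vector r^pi.
P_pi = np.einsum('xa,axy->xy', pi, P_X)
r_pi = np.einsum('xa,xa->x', pi, r)

def bellman_operator(V):
    """One application of T^pi in vector form: T^pi V = r^pi + gamma * P^pi V."""
    return r_pi + gamma * P_pi @ V

V = np.zeros(num_states)
print(bellman_operator(V))  # equals r_pi when V = 0
```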
A vector $\tilde{V} \in \mathbb{R}^{\mathcal{X}}$ is a solution to Equation 4.1 if it remains unchanged by the transformation corresponding to the Bellman operator $T^\pi$; that is, if it is a fixed point of $T^\pi$. This means that the value function $V^\pi$ is a fixed point of $T^\pi$:
$$V^\pi = T^\pi V^\pi .$$
To demonstrate that $V^\pi$ is the only fixed point of the Bellman operator, we will appeal to the notion of contraction mappings.
4.2 Contraction Mappings
When we apply the operator $T^\pi$ to a value function estimate $V \in \mathbb{R}^{\mathcal{X}}$, we obtain a new estimate $T^\pi V \in \mathbb{R}^{\mathcal{X}}$. A characteristic property of the Bellman operator is that this new estimate is guaranteed to be closer to $V^\pi$ than $V$ (unless $V = V^\pi$, of course). In fact, as we will see in this section, applying the operator to any two estimates must bring them closer together.
To formalise what we mean by "closer", we need a way of measuring distances between value function estimates. Because these estimates can be viewed as finite-dimensional vectors, there are many well-established ways of doing so: the reader may have come across the Euclidean ($L^2$) distance, the Manhattan ($L^1$) distance, and curios such as the British Rail distance. We use the term metric to describe distances that satisfy the following standard definition.
Definition 4.2. Given a set $M$, a metric $d : M \times M \to \mathbb{R}$ is a function that satisfies, for all $U, V, W \in M$,
(a) $d(U, V) \geq 0$,
(b) $d(U, V) = 0$ if and only if $U = V$,
(c) $d(U, V) \leq d(U, W) + d(W, V)$,
(d) $d(U, V) = d(V, U)$.
We call the pair $(M, d)$ a metric space.
In our setting, $M$ is the space of value function estimates, $\mathbb{R}^{\mathcal{X}}$. Because we assume that there are finitely many states, this space can be equivalently thought of as the space of real-valued vectors with $N_{\mathcal{X}}$ entries, where $N_{\mathcal{X}}$ is the number of states. On this space, we measure distances in terms of the $L^\infty$ metric, defined by
$$\| V - V' \|_\infty = \max_{x \in \mathcal{X}} | V(x) - V'(x) | , \qquad V, V' \in \mathbb{R}^{\mathcal{X}} . \qquad (4.4)$$
A key result is that the Bellman operator $T^\pi$ is a contraction mapping with respect to this metric. Informally, this means that its application to different value function estimates brings them closer by at least a constant multiplicative factor, called its contraction modulus.
Definition 4.3. Let $(M, d)$ be a metric space. A function $O : M \to M$ is a contraction mapping with respect to $d$, with contraction modulus $\beta \in [0, 1)$, if for all $U, U' \in M$,
$$d(O U, O U') \leq \beta \, d(U, U') .$$
Proposition 4.4. The operator $T^\pi : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ is a contraction mapping with respect to the $L^\infty$ metric on $\mathbb{R}^{\mathcal{X}}$, with contraction modulus given by the discount factor $\gamma$. That is, for any two value functions $V, V' \in \mathbb{R}^{\mathcal{X}}$,
$$\| T^\pi V - T^\pi V' \|_\infty \leq \gamma \, \| V - V' \|_\infty .$$
Proof. The proof is most easily stated in vector notation. Here, we make use of two properties of the operator $P^\pi$. First, $P^\pi$ is linear, in the sense that for any $V, V'$,
$$P^\pi V + P^\pi V' = P^\pi (V + V') .$$
Second, because $(P^\pi V)(x)$ is a convex combination of elements from $V$, it must be that
$$\| P^\pi V \|_\infty \leq \| V \|_\infty .$$
From here, we have
$$\begin{aligned}
\| T^\pi V - T^\pi V' \|_\infty &= \| (r^\pi + \gamma P^\pi V) - (r^\pi + \gamma P^\pi V') \|_\infty \\
&= \gamma \, \| P^\pi V - P^\pi V' \|_\infty \\
&= \gamma \, \| P^\pi (V - V') \|_\infty \\
&\leq \gamma \, \| V - V' \|_\infty ,
\end{aligned}$$
as desired.
The fact that $T^\pi$ is a contraction mapping guarantees the uniqueness of $V^\pi$ as a solution to the equation $V = T^\pi V$. As made formal by the following proposition, because the operator $T^\pi$ brings any two value functions closer together, it cannot keep more than one value function fixed.
Proposition 4.5. Let $(M, d)$ be a metric space and $O : M \to M$ be a contraction mapping. Then $O$ has at most one fixed point in $M$.
Proof. Let $\beta \in [0, 1)$ be the contraction modulus of $O$, and suppose $U, U' \in M$ are distinct fixed points of $O$, so that $d(U, U') > 0$ (following Definition 4.2). Then we have
$$d(U, U') = d(O U, O U') \leq \beta \, d(U, U') ,$$
which is a contradiction.
Because we know that $V^\pi$ is a fixed point of the Bellman operator $T^\pi$, following Proposition 4.5 we deduce that there are no other such fixed points and hence no other solutions to the Bellman equation. As the phrasing of Proposition 4.5 suggests, in some metric spaces it is possible for $O$ to be a contraction mapping, yet to not possess a fixed point. This can matter when dealing with return functions, as we will see in the second half of this chapter and in Chapter 5.
Example 4.6. Consider the no-loop operator
$$(T^\pi_{\mathrm{NL}} V)(x) = \mathbb{E}_\pi\big[ R + \gamma V(X') \, \mathbb{1}\{X' \neq x\} \;\big|\; X = x \big] ,$$
where the name denotes the fact that we omit the next-state value whenever a transition from $x$ to itself occurs. By inspection, we can determine that the fixed point of this operator is
$$V^\pi_{\mathrm{NL}}(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{T-1} \gamma^t R_t \;\Big|\; X_0 = x \Big] ,$$
where $T$ denotes the (random) first time at which $X_T = X_{T-1}$. In words, this fixed point describes the discounted sum of rewards obtained until the first time that an action leaves the state unchanged.^{25}
Exercise 4.1 asks you to show that $T^\pi_{\mathrm{NL}}$ is a contraction mapping with modulus
$$\gamma \, \max_{x \in \mathcal{X}} \mathbb{P}_\pi\big( X' \neq x \mid X = x \big) .$$
Following Proposition 4.5, we deduce that this is the unique fixed point of $T^\pi_{\mathrm{NL}}$.
25. The reader is invited to consider the kind of environments in which policies that maximise the no-loop return are substantially different from those that maximise the usual expected return.
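As an illustration, the following sketch computes the standard and no-loop fixed points for a small, made-up Markov reward process (with the policy already folded into the transition matrix); the numbers are arbitrary choices for the example.

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],   # state 0 transitions to itself half the time
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, -0.5])
gamma = 0.9

# Standard fixed point: V = r + gamma P V.
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# No-loop fixed point: self-transitions contribute no next-state value,
# i.e. V_NL = r + gamma (P - diag(P)) V_NL.
P_no_loop = P - np.diag(np.diag(P))
V_NL = np.linalg.solve(np.eye(3) - gamma * P_no_loop, r)

print("V    =", V)
print("V_NL =", V_NL)
```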
When an operator $O$ is contractive, we can also straightforwardly construct a mathematical approximation to its fixed point.^{26} This approximation is given by the sequence $(U_k)_{k \geq 0}$, defined by an initial value $U_0 \in M$ and the recursive relationship
$$U_{k+1} = O U_k .$$
By contractivity, successive iterates of this sequence must come progressively closer to the operator's fixed point. This is formalised by the following.
26. We use the term mathematical approximation to distinguish it from an approximation that can be computed. That is, there may or may not exist an algorithm that can determine the elements of the sequence $(U_k)_{k \geq 0}$ given the initial estimate $U_0$.
Proposition 4.7. Let $(M, d)$ be a metric space, and let $O : M \to M$ be a contraction mapping with contraction modulus $\beta \in [0, 1)$ and fixed point $U^* \in M$. Then for any initial point $U_0 \in M$, the sequence $(U_k)_{k \geq 0}$ defined by $U_{k+1} = O U_k$ is such that
$$d(U_k, U^*) \leq \beta^k \, d(U_0, U^*) , \qquad (4.5)$$
and in particular $d(U_k, U^*) \to 0$ as $k \to \infty$.
Proof. We will prove Equation 4.5 by induction, from which we obtain convergence of $(U_k)_{k \geq 0}$ in $d$. For $k = 0$, Equation 4.5 trivially holds. Now suppose that for some $k \geq 0$, we have
$$d(U_k, U^*) \leq \beta^k \, d(U_0, U^*) .$$
Then note that
$$d(U_{k+1}, U^*) \overset{(a)}{=} d(O U_k, O U^*) \overset{(b)}{\leq} \beta \, d(U_k, U^*) \overset{(c)}{\leq} \beta^{k+1} \, d(U_0, U^*) ,$$
where (a) follows from the definition of the sequence $(U_k)_{k \geq 0}$ and the fact that $U^*$ is fixed by $O$, (b) follows from the contractivity of $O$, and (c) follows from the inductive hypothesis. By induction, we conclude that Equation 4.5 holds for all $k \in \mathbb{N}$.
In the case of the Bellman operator $T^\pi$, Proposition 4.7 means that repeated application of $T^\pi$ to any initial value function estimate $V_0 \in \mathbb{R}^{\mathcal{X}}$ produces a sequence of estimates $(V_k)_{k \geq 0}$ that are progressively closer to $V^\pi$. This observation serves as the starting point for a number of computational approaches that approximate $V^\pi$, including dynamic programming (Chapter 5) and temporal-difference learning (Chapter 6).
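The following sketch illustrates this geometric convergence for a small, hypothetical Markov reward process (again with the policy folded into the transition matrix and rewards); it is an illustrative check of Proposition 4.7 rather than an algorithm from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 5, 0.9

P = rng.dirichlet(np.ones(n), size=n)   # row-stochastic transition matrix
r = rng.standard_normal(n)

# Exact fixed point V^pi = (I - gamma P)^{-1} r, for reference.
V_pi = np.linalg.solve(np.eye(n) - gamma * P, r)

V = np.zeros(n)
for k in range(10):
    error = np.max(np.abs(V - V_pi))    # L-infinity distance to V^pi
    print(f"k={k:2d}  error={error:.6f}")
    V = r + gamma * P @ V               # V_{k+1} = T^pi V_k

# Each printed error is at most gamma times the previous one.
```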
4.3 The Distributional Bellman Operator
Designing distributional reinforcement learning algorithms such as categorical temporal-difference learning involves a few choices, such as how to represent probability distributions in a computer's memory, that do not have an equivalent in classical reinforcement learning. Throughout this book, we will make use of the distributional Bellman operator to understand and characterise many of these choices. To begin, recall the random-variable Bellman equation:
$$G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X') , \qquad X = x . \qquad (4.6)$$
As in the expected-value setting, we construct a random-variable operator by viewing the right-hand side of Equation 4.6 as a transformation of $G^\pi$. In this case, we break down the transformation of $G^\pi$ into three operations, each of which produces a new random variable (Figure 4.1):
(a) $G^\pi(X')$: the indexing of the collection of random variables $G^\pi$ by $X'$;
(b) $\gamma G^\pi(X')$: the multiplication of the random variable $G^\pi(X')$ by the scalar $\gamma$;
(c) $R + \gamma G^\pi(X')$: the addition of two random variables ($R$ and $\gamma G^\pi(X')$).
More generally, we may apply these operations to any state-indexed collection of random variables $G = (G(x) : x \in \mathcal{X})$, taken to be independent of the random transitions used to define the transformation. With some mathematical caveats discussed below, let us introduce the random-variable Bellman operator
$$(\mathcal{T}^\pi G)(x) \overset{D}{=} R + \gamma G(X') , \qquad X = x . \qquad (4.7)$$
Equation 4.7 states that the application of the Bellman operator to $G$ (evaluated at $x$; the left-hand side) produces a random variable that is equal in distribution to the random variable constructed on the right-hand side. Because this holds for all $x$, we think of $\mathcal{T}^\pi$ as mapping $G$ to a new collection of random variables $\mathcal{T}^\pi G$.

Figure 4.1: The random-variable Bellman operator is composed of three operations: (a) indexing into a collection of random variables, (b) multiplication by the discount factor, and (c) addition of two random variables. Here, we assume that $R$ and $X'$ take on a single value, for clarity.
The random-variable operator is appealing because it is concise and easily understood. In many circumstances, this makes it the tool of choice for reasoning about distributional reinforcement learning problems. One issue, however, is that its definition above is mathematically incomplete. This is because it specifies the probability distribution of $(\mathcal{T}^\pi G)(x)$, but not its identity as a mapping from some sample space to the real numbers. As discussed in Section 2.7, without care we may produce random variables that exhibit undesirable behaviour, for example because rewards at different points in time are improperly correlated. More immediately, the theory of contraction mappings needs a clear definition of the space on which an operator is defined; in the case of the random-variable operator, this requires us to specify a space of random variables to operate in. Properly defining such a space is possible but requires some technical subtlety and measure-theoretic considerations; we refer the interested reader to Section 4.9.
A more direct solution is to consider the distributional Bellman operator as a mapping on the space of return-distribution functions. Starting with the distributional Bellman equation
$$\eta^\pi(x) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \, \eta^\pi(X') \;\big|\; X = x \big] ,$$
we again view the right-hand side as the result of applying a series of transformations, in this case to probability distributions.
Definition 4.8. The distributional Bellman operator $\mathcal{T}^\pi : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is the mapping defined by
$$(\mathcal{T}^\pi \eta)(x) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \, \eta(X') \;\big|\; X = x \big] . \qquad (4.8)$$
Here, the operations on probability distributions are expressed (rather compactly) by the expectation in Equation 4.8 and the use of the pushforward distribution derived from the bootstrap function $b_{R,\gamma}$; these are the operations of mixing, scaling, and translation previously described in Section 2.8.
We can gain additional insight into how the operator transforms a return function by considering the situation in which the random reward $R$ and the return distributions $\eta^\pi(x)$ admit respective probability densities $p_R(r \mid x, a)$ and $p_x(z)$. In this case, the probability density of $(\mathcal{T}^\pi \eta^\pi)(x)$, denoted $p'_x$, is
$$p'_x(z) = \gamma^{-1} \sum_{a \in \mathcal{A}} \pi(a \mid x) \int_{\mathbb{R}} p_R(r \mid x, a) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a) \, p_{x'}\Big( \frac{z - r}{\gamma} \Big) \mathrm{d}r . \qquad (4.9)$$
Expressed in terms of probability densities, the indexing of a collection of random variables becomes a mixture of densities, while their addition becomes a convolution; this is in fact what is depicted in Figure 4.1. In terms of cumulative distribution functions, we have
$$F_{(\mathcal{T}^\pi \eta^\pi)(x)}(z) = \mathbb{E}_\pi\Big[ F_{\eta^\pi(X')}\Big( \frac{z - R}{\gamma} \Big) \;\Big|\; X = x \Big] .$$
However, we prefer the operator that deals directly with probability distributions (Equation 4.8), as it can be used to concisely express more complex operations on distributions. One such operation is the projection of a probability distribution onto a finitely-parametrised set, which we will use in Chapter 5 to construct algorithms for approximating $\eta^\pi$.
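The mixing, scaling, and translation operations are easy to carry out explicitly when every distribution has finite support. The sketch below applies the distributional Bellman operator to such distributions, represented as arrays of atoms and probabilities; the tiny two-state MDP and the representation are illustrative assumptions rather than constructions from the text.

```python
import numpy as np

gamma = 0.9

# Hypothetical two-state, one-action MDP: transition probabilities P_X(x' | x)
# and a finitely-supported reward distribution (values, probabilities) per state.
P_X = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}
reward_dist = {0: ([0.0, 1.0], [0.5, 0.5]), 1: ([-1.0], [1.0])}

def bellman_backup(eta, x):
    """Return (T^pi eta)(x) as arrays of atoms and probabilities:
    a mixture over (r, x') of the pushforwards z -> r + gamma * z."""
    atoms, probs = [], []
    for r, p_r in zip(*reward_dist[x]):
        for x_next, p_next in P_X[x].items():
            z, q = eta[x_next]
            atoms.extend(r + gamma * np.asarray(z))     # scale and translate
            probs.extend(p_r * p_next * np.asarray(q))  # mixture weights
    return np.asarray(atoms), np.asarray(probs)

# Initial return-distribution function: all mass at zero in both states.
eta = {x: (np.array([0.0]), np.array([1.0])) for x in (0, 1)}
eta = {x: bellman_backup(eta, x) for x in eta}
print(eta[0])  # atoms and probabilities of (T^pi eta_0)(0)
```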
Using Definition 4.8, we can formally express the fact that the return-distribution function $\eta^\pi$ is the only solution to the equation $\eta = \mathcal{T}^\pi \eta$. The proof is relatively technical and will be given in Section 4.8.
Proposition 4.9. The return-distribution function $\eta^\pi$ satisfies $\eta^\pi = \mathcal{T}^\pi \eta^\pi$ and is the unique fixed point of the distributional Bellman operator $\mathcal{T}^\pi$.
When working with the distributional Bellman operator, one should be mindful that the random reward $R$ and next state $X'$ are generally not independent, because they both depend on the chosen action $A$ (we briefly mentioned this concern in Section 2.8). In Equation 4.9, this is explicitly handled by the outer sum over actions. Analogously, we can make explicit the dependency on $A$ by introducing a second expectation in Equation 4.8:
$$(\mathcal{T}^\pi \eta)(x) = \mathbb{E}_\pi\Big[ \mathbb{E}\big[ (b_{R,\gamma})_\# \, \eta(X') \;\big|\; X = x, A \big] \;\Big|\; X = x \Big] .$$
By conditioning the inner expectation on the action $A$, we make the random variables $R$ and $\gamma G(X')$ conditionally independent in the inner expectation. We will make use of this technique in proving Theorem 4.25, the main theoretical result of this chapter.
In some circumstances, it is useful to translate between operations on probability distributions and those on random variables. We do this by means of a representative set of random variables called an instantiation.
Definition 4.10. Given a probability distribution $\nu \in \mathscr{P}(\mathbb{R})$, we say that a random variable $Z$ is an instantiation of $\nu$ if its distribution is $\nu$, written $Z \sim \nu$. Similarly, we say that a collection of random variables $G = (G(x) : x \in \mathcal{X})$ is an instantiation of a return-distribution function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ if for every $x \in \mathcal{X}$, we have $G(x) \sim \eta(x)$.
Given a return-distribution function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, the new return-distribution function $\mathcal{T}^\pi \eta$ can be obtained by constructing an instantiation $G$ of $\eta$, performing the transformation on the collection of random variables $G$ as described at the beginning of this section, and then extracting the distributions of the resulting random variables. This is made formal as follows.
Proposition 4.11. Let $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and let $G = (G(x) : x \in \mathcal{X})$ be an instantiation of $\eta$. For each $x \in \mathcal{X}$, let $(X = x, A, R, X')$ be a sample transition independent of $G$. Then $R + \gamma G(X')$ has the distribution $(\mathcal{T}^\pi \eta)(x)$:
$$\mathcal{D}\big( R + \gamma G(X') \;\big|\; X = x \big) = (\mathcal{T}^\pi \eta)(x) .$$
Proof. The result follows immediately from the definition of the distributional Bellman operator. For clarity, we step through the argument again, mirroring the transformations set out at the beginning of the section. First, the indexing transformation gives
$$\mathcal{D}\big( G(X') \mid X = x \big) = \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x) \, \eta(x') = \mathbb{E}_\pi\big[ \eta(X') \mid X = x \big] .$$
Next, scaling by $\gamma$ yields
$$\mathcal{D}\big( \gamma G(X') \mid X = x \big) = \mathbb{E}_\pi\big[ (b_{0,\gamma})_\# \, \eta(X') \mid X = x \big] ,$$
and finally adding the immediate reward $R$ gives the result
$$\mathcal{D}\big( R + \gamma G(X') \mid X = x \big) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \, \eta(X') \mid X = x \big] .$$

Figure 4.2: A commutative diagram illustrating two perspectives on the application of the distributional Bellman operator. The top horizontal line represents the direct application to the return-distribution function $\eta$, yielding $\eta'$. The alternative path first instantiates the return-distribution function $\eta$ as a collection of random variables $G = (G(x) : x \in \mathcal{X})$, transforms $G$ to obtain another collection of random variables $G'$, and then extracts the distributions of these random variables to obtain $\eta'$.
Proposition 4.11 is an instance of a recurring principle in distributional reinforcement learning, that "different routes lead to the same answer". Throughout this book, we will illustrate this point as it arises with a commutative diagram; the particular case under consideration is depicted in Figure 4.2.
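The two routes of the commutative diagram can also be compared numerically: sampling $R + \gamma G(X')$ with $G(x') \sim \eta(x')$ produces (approximately) the distribution obtained by applying the operator to $\eta$ directly. The sketch below does this for a single hypothetical state; the transition model and distributions are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9

# Hypothetical model for one state x: two successor states, a Gaussian reward,
# and samplers for G(x') under the current return-distribution function.
next_state_probs = np.array([0.3, 0.7])
sample_reward = lambda: rng.normal(1.0, 0.5)
sample_return = [lambda: rng.normal(0.0, 1.0),    # G(x'_1) ~ eta(x'_1)
                 lambda: rng.normal(2.0, 1.0)]    # G(x'_2) ~ eta(x'_2)

# Random-variable route (Proposition 4.11): sample R + gamma * G(X').
samples = []
for _ in range(100_000):
    x_next = rng.choice(2, p=next_state_probs)
    samples.append(sample_reward() + gamma * sample_return[x_next]())
samples = np.asarray(samples)

# Sanity check: the sample mean matches the expected-value Bellman backup
# r(x) + gamma * sum_x' P(x' | x) E[G(x')].
print(samples.mean(), 1.0 + gamma * (0.3 * 0.0 + 0.7 * 2.0))
```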
4.4 Wasserstein Distances for Return Functions
Many desirable properties of reinforcement learning algorithms (for example, the fact that they produce a good approximation of the value function) are due to the contractive nature of the Bellman operator $T^\pi$. In this section, we will establish that the distributional Bellman operator $\mathcal{T}^\pi$, too, is a contraction mapping: analogous to the value-based operator, the application of $\mathcal{T}^\pi$ brings return functions closer together.
One difference between expected-value and distributional reinforcement learning is that the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is substantially different from the space of value functions. To measure distances between value functions, we can simply treat them as finite-dimensional vectors, taking the absolute difference of value estimates at individual states. By contrast, it is somewhat less intuitive to see what "close" means when comparing probability distributions. Throughout this chapter we will consider a number of probability metrics that measure distances between distributions, each presenting different mathematical and computational properties. We begin with the family of Wasserstein distances.
Definition 4.12. Let $\nu \in \mathscr{P}(\mathbb{R})$ be a probability distribution with cumulative distribution function $F_\nu$. Let $Z$ be an instantiation of $\nu$ (in particular, $F_Z = F_\nu$). The generalised inverse $F_\nu^{-1}$ is given by
$$F_\nu^{-1}(\tau) = \inf_{z \in \mathbb{R}} \{ z : F_\nu(z) \geq \tau \} .$$
We additionally write $F_Z^{-1} = F_\nu^{-1}$.
Definition 4.13. Let $p \in [1, \infty)$. The $p$-Wasserstein distance is a function $w_p : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ given by
$$w_p(\nu, \nu') = \Big( \int_0^1 \big| F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau) \big|^p \, \mathrm{d}\tau \Big)^{1/p} .$$
The $\infty$-Wasserstein distance $w_\infty : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ is
$$w_\infty(\nu, \nu') = \sup_{\tau \in (0, 1)} \big| F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau) \big| .$$
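For empirical distributions with equally many atoms, the inverse-CDF formula reduces to matching sorted samples, which also realises the optimal coupling of Equation 4.11 below. The sketch is an illustrative recipe (the two sample sets roughly mirror the distributions of Figure 4.3), not a fragment of any particular library.

```python
import numpy as np

def wasserstein_p(samples_a, samples_b, p=1.0):
    """p-Wasserstein distance between two empirical distributions with the
    same number of atoms, via the inverse-CDF (sorted-sample) coupling."""
    qa = np.sort(np.asarray(samples_a))
    qb = np.sort(np.asarray(samples_b))
    return np.mean(np.abs(qa - qb) ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
a = rng.normal(1.0, 1.0, size=10_000)
mix = rng.random(10_000) < 0.5
b = np.where(mix, rng.normal(-1.0, 0.5, 10_000), rng.normal(3.0, 0.5, 10_000))
print(wasserstein_p(a, b, p=1), wasserstein_p(a, b, p=2))
```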
Graphically, the Wasserstein distances between two probability distributions measure the area between their cumulative distribution functions, with values along the abscissa taken to the $p$th power; see Figure 4.3. When $p = \infty$, this becomes the largest horizontal difference between the inverse cumulative distribution functions. The $p$-Wasserstein distances satisfy the definition of a metric, except that they may not be finite for arbitrary pairs of distributions in $\mathscr{P}(\mathbb{R})$; see Exercise 4.6. Properly speaking, they are said to be extended metrics, since they may take values on the real line extended to include infinity. Most probability metrics that we will consider are extended metrics rather than metrics in the sense of Definition 4.2. We measure distances between return-distribution functions in terms of the largest Wasserstein distance between probability distributions at individual states.
Figure 4.3: Left: illustration of the $p$-Wasserstein distance between a normal distribution $\nu = \mathcal{N}(1, 1)$ and a mixture of two normal distributions $\nu' = \tfrac{1}{2}\mathcal{N}(-1, 0.5) + \tfrac{1}{2}\mathcal{N}(3, 0.5)$. Right: illustration of the $\ell_p$ metric for the same distributions (see Section 4.5). In both cases, the shading indicates the axis along which the differences are taken to the $p$th exponent.
Definition 4.14. Let $p \in [1, \infty]$. The supremum $p$-Wasserstein distance $\overline{w}_p$ between two return-distribution functions $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is defined by^{27}
$$\overline{w}_p(\eta, \eta') = \sup_{x \in \mathcal{X}} w_p\big( \eta(x), \eta'(x) \big) .$$
27. Because we assume that there are finitely many states, we can equivalently write $\max_{x \in \mathcal{X}}$ in the definition of the supremum distance. However, we prefer the more generally applicable $\sup_{x \in \mathcal{X}}$.
The supremum $p$-Wasserstein distances fulfil all requirements of an extended metric on the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$; see Exercise 4.7. Based on these distances, we give our first contractivity result regarding the distributional Bellman operator; its proof is given at the end of the section.
Proposition 4.15. The distributional Bellman operator is a contraction mapping on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ in the supremum $p$-Wasserstein distance, for all $p \in [1, \infty]$. More precisely,
$$\overline{w}_p(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta') \leq \gamma \, \overline{w}_p(\eta, \eta')$$
for all $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
Proposition 4.15 is significant in that it establishes a close parallel between the expected-value and distributional operators. Following the line of reasoning given in Section 4.2, it provides the mathematical justification for the development and analysis of computational approaches for finding the return function $\eta^\pi$. More immediately, it also enables us to characterise the convergence of the sequence
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k \qquad (4.10)$$
to the return function $\eta^\pi$.
Proposition 4.16. Suppose that for each $(x, a) \in \mathcal{X} \times \mathcal{A}$, the reward distribution $P_R(\cdot \mid x, a)$ is supported on the interval $[R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$. Then for any initial return function $\eta_0$ whose distributions are bounded on the interval $\big[ \tfrac{R_{\mathrm{MIN}}}{1 - \gamma}, \tfrac{R_{\mathrm{MAX}}}{1 - \gamma} \big]$, the sequence $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ converges to $\eta^\pi$ in the supremum $p$-Wasserstein distance (for all $p \in [1, \infty]$).
The restriction to bounded rewards in Proposition 4.16 is necessary to make use of the tools developed in Section 4.2, at least without further qualification. This is because Proposition 4.7 requires all distances to be finite, which is not guaranteed under our definition of a probability metric. If, for example, the initial condition $\eta_0$ is such that
$$\overline{w}_p(\eta_0, \eta^\pi) = \infty ,$$
then Proposition 4.15 is not of much use. A less restrictive, but more technically elaborate, set of assumptions will be presented later in the chapter. For now, we provide the proof of the two preceding results. First, we obtain a reasonably simple proof of Proposition 4.15 by considering an alternative formulation of the $p$-Wasserstein distances in terms of couplings.
Definition 4.17. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ be two probability distributions. A coupling between $\nu$ and $\nu'$ is a joint distribution $\upsilon \in \mathscr{P}(\mathbb{R}^2)$ such that if $(Z, Z')$ is an instantiation of $\upsilon$, then $Z$ has distribution $\nu$ and $Z'$ has distribution $\nu'$. We write $\Gamma(\nu, \nu') \subseteq \mathscr{P}(\mathbb{R}^2)$ for the set of all couplings of $\nu$ and $\nu'$.
Proposition 4.18 (see Villani [2008] for a proof). Let $p \in [1, \infty)$. Expressed in terms of an optimal coupling, the $p$-Wasserstein distance between two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ is
$$w_p(\nu, \nu') = \min_{\upsilon \in \Gamma(\nu, \nu')} \mathbb{E}_{(Z, Z') \sim \upsilon}\big[ |Z - Z'|^p \big]^{1/p} .$$
The $\infty$-Wasserstein distance between $\nu$ and $\nu'$ can be written as
$$w_\infty(\nu, \nu') = \min_{\upsilon \in \Gamma(\nu, \nu')} \inf\big\{ z \in \mathbb{R} : \mathbb{P}_{(Z, Z') \sim \upsilon}\big( |Z - Z'| > z \big) = 0 \big\} .$$
Operators and Metrics 93
Informally, the optimal coupling finds an arrangement of the two probability
distributions that maximises “agreement”: it produces outcomes that are as close
as possible. In Proposition 4.18, the optimal coupling takes on a very simple
form given by inverse cumulative distribution functions. For
,
0
2P
(
R
), an
optimal coupling is the probability distribution of the random variable
F
–1
(), F
–1
0
()
, U
[0, 1]
. (4.11)
This can be understood by noting how the 1-Wasserstein distance between
and
0
is obtained by measuring the horizontal distance between the two cumulative
distribution functions, at each level 2[0, 1] (Figure 4.3).
Proof of Proposition 4.15. Let $p \in [1, \infty)$ be fixed. For each $x \in \mathcal{X}$, consider the optimal coupling between $\eta(x)$ and $\eta'(x)$ and instantiate it as the pair of random variables $\big( G(x), G'(x) \big)$. Next, denote by $(x, A, R, X')$ the random transition beginning in $x \in \mathcal{X}$, constructed to be independent from $G(y)$ and $G'(y)$, for all $y \in \mathcal{X}$. With these variables, write
$$\tilde{G}(x) = R + \gamma G(X') , \qquad \tilde{G}'(x) = R + \gamma G'(X') .$$
By Proposition 4.11, $\tilde{G}(x)$ has distribution $(\mathcal{T}^\pi \eta)(x)$ and $\tilde{G}'(x)$ has distribution $(\mathcal{T}^\pi \eta')(x)$. The pair $\big( \tilde{G}(x), \tilde{G}'(x) \big)$ therefore forms a valid coupling of these distributions. Now
$$\begin{aligned}
w_p^p\big( (\mathcal{T}^\pi \eta)(x), (\mathcal{T}^\pi \eta')(x) \big)
&\overset{(a)}{\leq} \mathbb{E}\Big[ \big| \big( R + \gamma G(X') \big) - \big( R + \gamma G'(X') \big) \big|^p \;\Big|\; X = x \Big] \\
&\overset{(b)}{=} \gamma^p \, \mathbb{E}\Big[ \big| G(X') - G'(X') \big|^p \;\Big|\; X = x \Big] \\
&\overset{(c)}{=} \gamma^p \sum_{x' \in \mathcal{X}} P_\pi(X' = x' \mid X = x) \, \mathbb{E}\Big[ \big| G(x') - G'(x') \big|^p \Big] \\
&\overset{(d)}{\leq} \gamma^p \sup_{x' \in \mathcal{X}} \mathbb{E}\Big[ \big| G(x') - G'(x') \big|^p \Big] \\
&\overset{(e)}{=} \gamma^p \, \overline{w}_p^p(\eta, \eta') .
\end{aligned}$$
Taking a supremum over $x \in \mathcal{X}$ on the left-hand side and the $p$th root of both sides yields the result. Here, (a) follows since the Wasserstein distance is defined as a minimum over couplings, (b) follows from algebraic manipulation of the expectation, (c) follows from independence of the sample transition $(X = x, A, R, X')$ and the random variables $(G(x), G'(x) : x \in \mathcal{X})$, (d) because the maximum of non-negative quantities is at least as great as their weighted average, and (e) follows since $(G(x'), G'(x'))$ was defined as an optimal coupling of $\eta(x')$ and $\eta'(x')$. The proof for $p = \infty$ is similar (see Exercise 4.8).
Proof of Proposition 4.16. Let us denote by $\mathscr{P}_B(\mathbb{R})$ the space of distributions bounded on $[V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$, where as usual
$$V_{\mathrm{MIN}} = \frac{R_{\mathrm{MIN}}}{1 - \gamma} , \qquad V_{\mathrm{MAX}} = \frac{R_{\mathrm{MAX}}}{1 - \gamma} .$$
We will show that under the assumption of rewards bounded on $[R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$,
(a) the return function $\eta^\pi$ is in $\mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$, and
(b) the distributional Bellman operator maps $\mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$ to itself.
Consequently, we can invoke Proposition 4.7 with $O = \mathcal{T}^\pi$ and $M = \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$ to conclude that for any initial $\eta_0 \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$, the sequence of iterates $(\eta_k)_{k \geq 0}$ converges to $\eta^\pi$ with respect to $d = \overline{w}_p$, for any $p \in [1, \infty]$.
To prove (a), note that for any state $x \in \mathcal{X}$,
$$G^\pi(x) = \sum_{t=0}^{\infty} \gamma^t R_t , \qquad X_0 = x ,$$
and since $R_t \in [R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$ for all $t$, then also $G^\pi(x) \in [V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$. For (b), let $\eta \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$ and denote by $G$ an instantiation of this return-distribution function. For any $x \in \mathcal{X}$,
$$\begin{aligned}
\mathbb{P}\big( R + \gamma G(X') \leq V_{\mathrm{MAX}} \mid X = x \big)
&= \mathbb{P}\big( \gamma G(X') \leq V_{\mathrm{MAX}} - R \mid X = x \big) \\
&\geq \mathbb{P}\big( \gamma G(X') \leq V_{\mathrm{MAX}} - R_{\mathrm{MAX}} \mid X = x \big) \\
&= \mathbb{P}\big( G(X') \leq V_{\mathrm{MAX}} \mid X = x \big) = 1 ,
\end{aligned}$$
where the last line uses $V_{\mathrm{MAX}} - R_{\mathrm{MAX}} = \gamma V_{\mathrm{MAX}}$. By the same reasoning,
$$\mathbb{P}\big( R + \gamma G(X') \geq V_{\mathrm{MIN}} \mid X = x \big) = 1 .$$
Since $R + \gamma G(X')$, $X = x$, is an instantiation of $(\mathcal{T}^\pi \eta)(x)$ for each $x$, we conclude that if $\eta \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$, then also $\mathcal{T}^\pi \eta \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$.
4.5 $\ell_p$ Probability Metrics and the Cramér Distance
The previous section established that the distributional Bellman operator is well-behaved with respect to the family of Wasserstein distances. However, these are but a few among many standard probability metrics. We will see in Chapter 5 that theoretical analysis sometimes requires us to study the behaviour of the distributional operator with respect to other metrics. In addition, many practical algorithms directly optimise a metric (typically expressed as a loss function) as part of their operation (see Chapter 10). The Cramér distance, a member of the broader family of $\ell_p$ metrics, is of particular interest to us.
Definition 4.19. Let $p \in [1, \infty)$. The distance $\ell_p : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ is a probability metric defined by
$$\ell_p(\nu, \nu') = \Big( \int_{\mathbb{R}} \big| F_\nu(z) - F_{\nu'}(z) \big|^p \, \mathrm{d}z \Big)^{1/p} . \qquad (4.12)$$
For $p = 2$, this is the Cramér distance.^{28} The $\ell_\infty$ or Kolmogorov–Smirnov distance is given by
$$\ell_\infty(\nu, \nu') = \sup_{z \in \mathbb{R}} \big| F_\nu(z) - F_{\nu'}(z) \big| .$$
The respective supremum $\ell_p$ distances are given by (for $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$)
$$\overline{\ell}_p(\eta, \eta') = \sup_{x \in \mathcal{X}} \ell_p\big( \eta(x), \eta'(x) \big) .$$
These are extended metrics on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
28. Historically, the Cramér distance has been defined as the square of $\ell_2$. In our context, it seems unambiguous to use the word for $\ell_2$ itself.
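For distributions with finite support, $\ell_p$ can be computed exactly by comparing cumulative distribution functions on the merged set of atoms, as in the following sketch; the two example distributions are arbitrary choices, and atoms are assumed sorted.

```python
import numpy as np

def cdf_on(atoms, probs, grid):
    """CDF of a finitely-supported distribution (sorted atoms), on given points."""
    idx = np.searchsorted(np.asarray(atoms), grid, side='right')
    cum = np.concatenate([[0.0], np.cumsum(probs)])
    return cum[idx]

def ell_p(atoms_a, probs_a, atoms_b, probs_b, p=2.0):
    """ell_p distance: integrate |F_a(z) - F_b(z)|^p over the real line.
    Both CDFs are piecewise constant, so the integral is a finite sum."""
    grid = np.union1d(atoms_a, atoms_b)              # all jump locations
    diff = np.abs(cdf_on(atoms_a, probs_a, grid) - cdf_on(atoms_b, probs_b, grid))
    widths = np.diff(grid)                           # lengths between jumps
    return np.sum(diff[:-1] ** p * widths) ** (1.0 / p)

# p = 2 gives the Cramer distance between two arbitrary example distributions.
print(ell_p([0.0, 1.0], [0.5, 0.5], [0.5, 2.0], [0.25, 0.75], p=2.0))
```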
Where the $p$-Wasserstein distances measure differences in outcomes, the $\ell_p$ distances measure differences in the probabilities associated with these outcomes. This is because the exponent $p$ is applied to cumulative probabilities (this is illustrated in Figure 4.3). The distributional Bellman operator is also a contraction mapping under the $\ell_p$ distances for $p \in [1, \infty)$, albeit with a larger contraction modulus.^{29}
29. For $p = 1$, the distributional Bellman operator has contraction modulus $\gamma$; this is sensible given that $\ell_1 = w_1$.
Proposition 4.20. For $p \in [1, \infty)$, the distributional Bellman operator $\mathcal{T}^\pi$ is a contraction mapping on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ with respect to $\overline{\ell}_p$, with contraction modulus $\gamma^{1/p}$. That is,
$$\overline{\ell}_p(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta') \leq \gamma^{1/p} \, \overline{\ell}_p(\eta, \eta')$$
for all $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
The proof of Proposition 4.20 will follow as a corollary of a more general result given in Section 4.6. One way to relate it to our earlier result is to consider the behaviour of the sequence defined by
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k . \qquad (4.13)$$
As measured in the $p$-Wasserstein distance, the sequence $(\eta_k)_{k \geq 0}$ approaches $\eta^\pi$ at a rate of $\gamma$; but if we instead measure distances using the $\ell_p$ metric, this rate is slower, only $\gamma^{1/p}$. Measured in terms of $\ell_\infty$ (the Kolmogorov–Smirnov distance), the sequence of iterates may in fact not seem to approach $\eta^\pi$ at all.
To see this, it suffices to consider a single-state process with zero reward (that is, $P_{\mathcal{X}}(X' = x \mid X = x) = 1$ and $R = 0$) and a discount factor $\gamma = 0.9$. In this case, $\eta^\pi(x) = \delta_0$. For the initial condition $\eta_0(x) = \delta_1$, we obtain
$$\eta_1(x) = (\mathcal{T}^\pi \eta_0)(x) = \delta_\gamma .$$
Now, the (supremum) $\ell_\infty$ distance between $\eta^\pi$ and $\eta_0$ is 1, because for any $z \in (0, 1)$,
$$F_{\eta_0(x)}(z) = 0 \neq F_{\eta^\pi(x)}(z) = 1 .$$
However, the $\ell_\infty$ distance between $\eta^\pi$ and $\eta_1$ is also 1, by the same argument (but now restricted to $z \in (0, \gamma)$). Hence there is no $\beta \in [0, 1)$ for which
$$\overline{\ell}_\infty(\eta_1, \eta^\pi) \leq \beta \, \overline{\ell}_\infty(\eta_0, \eta^\pi) .$$
Exercise 4.16 asks you to prove a similar result for a probability metric called
the total variation distance (see also Figure 4.4).
The more general point is that different probability metrics are sensitive to different characteristics of probability distributions, and to varying degrees. At one extreme, the $\infty$-Wasserstein distance is effectively insensitive to the probability associated with different outcomes, while at the other extreme, the Kolmogorov–Smirnov distance is insensitive to the scale of the difference between outcomes. In Section 4.6, we will show that a metric's sensitivity to differences in outcomes determines the contraction modulus of the distributional Bellman operator under that metric; informally speaking, this explains the "nice" behaviour of the distributional Bellman operator under the Wasserstein distances.
Before moving on, let us summarise the results established thus far. By combining the theory of contraction mappings with suitable probability metrics, we were able to characterise the behaviour of the iterates
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k . \qquad (4.14)$$
In the following chapters we will use this as the basis for the design of implementable algorithms that approximate the return distribution $\eta^\pi$, and will appeal to contraction mapping theory to provide theoretical guarantees for these algorithms. In particular, in Chapter 6 we will analyse categorical temporal-difference learning under the lens of the Cramér distance. While the results presented until now suffice for most practical purposes, the following sections deal with some of the more technical considerations that arise from studying Equation 4.14 under general conditions, in particular the issue of infinite distances between distributions.
Figure 4.4: The distributional Bellman operator is not a contraction mapping in either the supremum form of (a, b) the total variation distance ($d_{\mathrm{TV}}$, shaded in the top panels; see Exercise 4.16 for a definition) or (c, d) the Kolmogorov–Smirnov distance $\ell_\infty$ (vertical distance in the bottom panels). The left panels show the density (a) and cumulative distribution function (c) of two distributions $\eta(x)$ (blue) and $\eta'(x)$ (red). The right panels show the same after applying the distributional Bellman operator (b, d), specifically considering the transformation induced by the discount factor $\gamma$. The lack of contractivity can be explained by the fact that neither $d_{\mathrm{TV}}$ nor $\ell_\infty$ is a homogeneous probability metric (Section 4.6).
4.6 Sufficient Conditions for Contractivity
In the remainder of this chapter, we characterise in greater generality the behaviour of the sequence of return function estimates described by Equation 4.14, viewed under the lens of different probability metrics. We begin with a formal definition of what it means for a function $d$ to be a probability metric.
Definition 4.21. A probability metric is an extended metric on the space of probability distributions, written
$$d : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty] .$$
Its supremum extension is the function $\overline{d} : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \times \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to [0, \infty]$ defined as
$$\overline{d}(\eta, \eta') = \sup_{x \in \mathcal{X}} d\big( \eta(x), \eta'(x) \big) .$$
We refer to $\overline{d}$ as a return-function metric; it is an extended metric on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
Our analysis is based on three properties that a probability metric should possess in order to guarantee contractivity. These three properties relate closely to the three fundamental operations that make up the distributional Bellman operator: scaling, convolution, and mixture of distributions (equivalently: scaling, addition, and indexing of random variables). In this analysis, we will find that some properties are more easily stated in terms of random variables, others in terms of probability distributions. Accordingly, given two probability distributions $\nu, \nu'$ with instantiations $Z, Z'$, let us overload notation and write
$$d(Z, Z') = d(\nu, \nu') .$$
Definition 4.22. Let $c > 0$. The probability metric $d$ is $c$-homogeneous if for any scalar $\gamma \in [0, 1)$ and any two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$, we have
$$d(\gamma Z, \gamma Z') = \gamma^c \, d(Z, Z') .$$
In terms of probability distributions, this is equivalently given by the condition
$$d\big( (b_{0,\gamma})_\# \nu, (b_{0,\gamma})_\# \nu' \big) = \gamma^c \, d(\nu, \nu') .$$
If no such $c$ exists, we say that $d$ is not homogeneous.
Definition 4.23. The probability metric $d$ is regular if for any two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$, and an independent random variable $W$, we have
$$d(W + Z, W + Z') \leq d(Z, Z') . \qquad (4.15)$$
In terms of distributions, this is
$$d\big( \mathbb{E}_W[(b_{W,1})_\# \nu], \, \mathbb{E}_W[(b_{W,1})_\# \nu'] \big) \leq d(\nu, \nu') .$$
Definition 4.24. Given $p \in [1, \infty)$, the probability metric $d$ is $p$-convex^{30} if for any $\alpha \in (0, 1)$ and distributions $\nu_1, \nu_2, \nu'_1, \nu'_2 \in \mathscr{P}(\mathbb{R})$, we have
$$d^p\big( \alpha \nu_1 + (1 - \alpha) \nu_2, \, \alpha \nu'_1 + (1 - \alpha) \nu'_2 \big) \leq \alpha \, d^p(\nu_1, \nu'_1) + (1 - \alpha) \, d^p(\nu_2, \nu'_2) .$$
30. This matches the usual definition of convexity (for $d^p$) if one treats the pair $(\nu_1, \nu_2)$ as a single argument from $\mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R})$.
Note that while the convexity property is stated here in terms of a mixture of two distributions, it implies an analogous property for mixtures of finitely-many distributions.
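The homogeneity exponent is what drives the contraction modulus in the theorem below, and it can be checked empirically. The sketch scales two sample-based distributions by $\gamma$ and observes that their $w_1$ distance shrinks by a factor of roughly $\gamma$ (1-homogeneity) while their Cramér ($\ell_2$) distance shrinks by roughly $\gamma^{1/2}$ ($\tfrac{1}{2}$-homogeneity); the sample-based estimators are illustrative stand-ins for the exact distances.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 0.9

def w1(a, b):
    """Sample-based 1-Wasserstein distance (sorted-sample coupling)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def cramer(a, b, grid):
    """Sample-based Cramer (ell_2) distance via empirical CDFs on a grid."""
    F_a = np.searchsorted(np.sort(a), grid, side='right') / len(a)
    F_b = np.searchsorted(np.sort(b), grid, side='right') / len(b)
    return np.sqrt(np.sum((F_a - F_b) ** 2) * (grid[1] - grid[0]))

z_a = rng.normal(0.0, 1.0, size=200_000)
z_b = rng.normal(2.0, 1.5, size=200_000)
grid = np.linspace(-10.0, 10.0, 4001)

print(w1(gamma * z_a, gamma * z_b) / w1(z_a, z_b))                      # ~ gamma
print(cramer(gamma * z_a, gamma * z_b, grid) / cramer(z_a, z_b, grid))  # ~ sqrt(gamma)
```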
Theorem 4.25. Consider a probability metric $d$. Suppose that $d$ is regular, $c$-homogeneous for some $c > 0$, and that there exists $p \in [1, \infty)$ such that $d$ is $p$-convex. Then for all return-distribution functions $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, we have
$$\overline{d}(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta') \leq \gamma^c \, \overline{d}(\eta, \eta') .$$
Proof. Fix a state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. For this state, consider the sample transition $(X = x, A = a, R, X')$ (Equation 2.12), and recall that $R$ and $X'$ are independent given $X$ and $A$, since
$$R \sim P_R(\cdot \mid X, A) , \qquad X' \sim P_{\mathcal{X}}(\cdot \mid X, A) .$$
Let $G$ and $G'$ be instantiations of $\eta$ and $\eta'$, respectively.^{31} We introduce a state-action variant of the distributional Bellman operator, $\mathcal{T} : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to \mathscr{P}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, given by
$$(\mathcal{T} \eta)(x, a) = \mathbb{E}\big[ (b_{R,\gamma})_\# \, \eta(X') \;\big|\; X = x, A = a \big] .$$
Note that this operator is defined independently of the policy $\pi$, since the action $a$ is specified as an argument. We then calculate directly, indicating where each of the hypotheses of the result are used:
$$\begin{aligned}
d^p\big( (\mathcal{T} \eta)(x, a), (\mathcal{T} \eta')(x, a) \big)
&= d^p\big( R + \gamma G(X'), \, R + \gamma G'(X') \big) \\
&\overset{(a)}{\leq} d^p\big( \gamma G(X'), \, \gamma G'(X') \big) \\
&\overset{(b)}{=} \gamma^{pc} \, d^p\big( G(X'), G'(X') \big) \\
&\overset{(c)}{\leq} \gamma^{pc} \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(X' = x' \mid X = x, A = a) \, d^p\big( G(x'), G'(x') \big) \\
&\leq \gamma^{pc} \sup_{x' \in \mathcal{X}} d^p\big( G(x'), G'(x') \big) \\
&= \gamma^{pc} \, \overline{d}^{\,p}(\eta, \eta') .
\end{aligned}$$
Here, (a) follows from regularity of $d$, (b) follows from the $c$-homogeneity of $d$, and (c) follows from $p$-convexity, where the mixture is over the values taken by the random variable $X'$. We also note that
$$(\mathcal{T}^\pi \eta)(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \, (\mathcal{T} \eta)(x, a) ,$$
and hence by $p$-convexity of $d$, we have
$$d^p\big( (\mathcal{T}^\pi \eta)(x), (\mathcal{T}^\pi \eta')(x) \big) \leq \sum_{a \in \mathcal{A}} \pi(a \mid x) \, d^p\big( (\mathcal{T} \eta)(x, a), (\mathcal{T} \eta')(x, a) \big) \leq \gamma^{pc} \, \overline{d}^{\,p}(\eta, \eta') .$$
Taking the supremum over $x \in \mathcal{X}$ on the left-hand side and taking $p$th roots then yields the result.
31. Note that the proof doesn't require us to assume that $G(x)$ and $G(y)$ are independent for $x \neq y$. See Exercise 4.12.
Theorem 4.25 illustrates how the contractivity of the distributional Bellman operator in the probability metric $d$ (specifically, in its supremum extension) follows from natural properties of $d$. We see that the contraction modulus is closely tied to the homogeneity of $d$, which informally characterises the extent to which scaling random variables by a factor $\gamma$ brings them "closer together". The theorem provides an alternative to our earlier result regarding Wasserstein distances and enables us to establish contractivity under the $\ell_p$ distances, but also under other probability metrics. Exercise 4.19, in particular, explores contractivity under the so-called maximum mean discrepancy (MMD) distances.
Proof of Proposition 4.20 (contractivity in $\ell_p$ distances). We will apply Theorem 4.25 to the probability metric $\ell_p$, for $p \in [1, \infty)$. It is therefore sufficient to demonstrate that $\ell_p$ is $\tfrac{1}{p}$-homogeneous, regular, and $p$-convex.

$\tfrac{1}{p}$-homogeneity. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$. We make use of the fact that for $\gamma \in [0, 1)$,
$$F_{\gamma Z}(z) = F_Z\big( \tfrac{z}{\gamma} \big) .$$
Writing $\ell_p^p$ for the $p$th power of $\ell_p$, we have
$$\begin{aligned}
\ell_p^p(\gamma Z, \gamma Z') &= \int_{\mathbb{R}} \big| F_{\gamma Z}(z) - F_{\gamma Z'}(z) \big|^p \, \mathrm{d}z \\
&= \int_{\mathbb{R}} \big| F_Z\big( \tfrac{z}{\gamma} \big) - F_{Z'}\big( \tfrac{z}{\gamma} \big) \big|^p \, \mathrm{d}z \\
&\overset{(a)}{=} \gamma \int_{\mathbb{R}} \big| F_Z(z) - F_{Z'}(z) \big|^p \, \mathrm{d}z \\
&= \gamma \, \ell_p^p(Z, Z') , \qquad (4.16)
\end{aligned}$$
where (a) follows from a change of variables $z / \gamma \mapsto z$ in the integral. Therefore, we deduce that $\ell_p(\gamma Z, \gamma Z') = \gamma^{1/p} \, \ell_p(Z, Z')$.

Regularity. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$, with $Z, Z'$ independent instantiations of $\nu$ and $\nu'$ respectively, and let $W$ be a random variable independent of $Z, Z'$. Then
$$\begin{aligned}
\ell_p^p(W + Z, W + Z') &= \int_{\mathbb{R}} \big| F_{W+Z}(z) - F_{W+Z'}(z) \big|^p \, \mathrm{d}z \\
&= \int_{\mathbb{R}} \big| \mathbb{E}_W[F_Z(z - W)] - \mathbb{E}_W[F_{Z'}(z - W)] \big|^p \, \mathrm{d}z \\
&\overset{(a)}{\leq} \int_{\mathbb{R}} \mathbb{E}_W\big[ \big| F_Z(z - W) - F_{Z'}(z - W) \big|^p \big] \, \mathrm{d}z \\
&\overset{(b)}{=} \mathbb{E}_W\Big[ \int_{\mathbb{R}} \big| F_Z(z - W) - F_{Z'}(z - W) \big|^p \, \mathrm{d}z \Big] \\
&= \ell_p^p(Z, Z') ,
\end{aligned}$$
where (a) follows from Jensen's inequality, and (b) follows by swapping the integral and expectation (more formally justified by Tonelli's theorem).

$p$-convexity. Let $\alpha \in (0, 1)$, and $\nu_1, \nu'_1, \nu_2, \nu'_2 \in \mathscr{P}(\mathbb{R})$. Note that
$$F_{\alpha \nu_1 + (1 - \alpha) \nu_2}(z) = \alpha F_{\nu_1}(z) + (1 - \alpha) F_{\nu_2}(z) ,$$
and similarly for the primed distributions $\nu'_1, \nu'_2$. By convexity of the function $z \mapsto |z|^p$ on the real numbers and Jensen's inequality,
$$\big| F_{\alpha \nu_1 + (1 - \alpha) \nu_2}(z) - F_{\alpha \nu'_1 + (1 - \alpha) \nu'_2}(z) \big|^p \leq \alpha \big| F_{\nu_1}(z) - F_{\nu'_1}(z) \big|^p + (1 - \alpha) \big| F_{\nu_2}(z) - F_{\nu'_2}(z) \big|^p ,$$
for all $z \in \mathbb{R}$. Hence,
$$\ell_p^p\big( \alpha \nu_1 + (1 - \alpha) \nu_2, \, \alpha \nu'_1 + (1 - \alpha) \nu'_2 \big) \leq \alpha \, \ell_p^p(\nu_1, \nu'_1) + (1 - \alpha) \, \ell_p^p(\nu_2, \nu'_2) ,$$
and $\ell_p$ is $p$-convex.
4.7 A Matter of Domain
Suppose that we have demonstrated, by means of Theorem 4.25, that the distributional Bellman operator is a contraction mapping in the supremum extension of some probability metric $d$. Is this sufficient to guarantee that the sequence
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$
converges to the return function $\eta^\pi$, by means of Proposition 4.7? In general, no, because $d$ may assign infinite distances to certain pairs of distributions. To invoke Proposition 4.7, we identify a subset of probability distributions $\mathscr{P}_d(\mathbb{R})$ that are all within finite $d$-distance of each other and then ensure that the distributional Bellman operator is well-behaved on this subset. Specifically, we identify a set of conditions under which
(a) the distributional Bellman operator $\mathcal{T}^\pi$ maps $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ to itself, and
(b) the return function $\eta^\pi$ (the fixed point of $\mathcal{T}^\pi$) lies in $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.
For most common probability metrics and natural problem settings, these requirements are easily verified. In Proposition 4.16, for example, we demonstrated that under the assumption that the reward distributions are bounded, Proposition 4.7 can be applied with the Wasserstein distances. The aim of this section is to extend the analysis to a broader set of probability metrics, but also to a greater number of problem settings, including those where the reward distributions are not bounded.
Definition 4.26. Let $d$ be a probability metric. Its finite domain $\mathscr{P}_d(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ is the set of probability distributions with finite first moment and finite $d$-distance to the distribution $\delta_0$ that puts all of its mass on zero:
$$\mathscr{P}_d(\mathbb{R}) = \big\{ \nu \in \mathscr{P}(\mathbb{R}) : d(\nu, \delta_0) < \infty , \; \mathbb{E}_{Z \sim \nu}[|Z|] < \infty \big\} . \qquad (4.17)$$
By the triangle inequality, for any two distributions $\nu, \nu' \in \mathscr{P}_d(\mathbb{R})$, we are guaranteed $d(\nu, \nu') < \infty$. Although the choice of $\delta_0$ as the reference point is somewhat arbitrary, it is sensible given that many reinforcement learning problems include the possibility of receiving no reward at all (e.g., $G^\pi(x) = 0$). The finite first moment assumption is made in light of Assumption 2.5, which guarantees that return distributions have well-defined expectations.
Proposition 4.27. Let $d$ be a probability metric satisfying the conditions of Theorem 4.25, with finite domain $\mathscr{P}_d(\mathbb{R})$. Let $\mathcal{T}^\pi$ be the distributional Bellman operator corresponding to a given Markov decision process $(\mathcal{X}, \mathcal{A}, \xi_0, P_{\mathcal{X}}, P_R)$. Suppose that
(a) $\eta^\pi \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, and
(b) $\mathscr{P}_d(\mathbb{R})$ is closed under $\mathcal{T}^\pi$: for any $\eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, we have that $\mathcal{T}^\pi \eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.
Then for any initial condition $\eta_0 \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, the sequence of iterates defined by $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ converges to $\eta^\pi$ with respect to $\overline{d}$.
Proposition 4.27 is a specialisation of Proposition 4.7 to the distributional setting, and generalises our earlier result regarding bounded reward distributions. Effectively, it allows us to prove the convergence of the sequence $(\eta_k)_{k \geq 0}$ for a family of Markov decision processes satisfying the two conditions above. The condition $\eta^\pi \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, while seemingly benign, does not automatically hold; Exercise 4.20 illustrates the issue using a modified $p$-Wasserstein distance.
Example 4.28 ($\ell_p$ metrics). For a given $p \in [1, \infty)$, the finite domain of the probability metric $\ell_p$ is
$$\mathscr{P}_{\ell_p}(\mathbb{R}) = \{ \nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|] < \infty \} ,$$
the set of distributions with finite first moment. This follows because
$$\begin{aligned}
\ell_p^p(\nu, \delta_0) &= \int_{-\infty}^{0} F_\nu(z)^p \, \mathrm{d}z + \int_0^{\infty} \big( 1 - F_\nu(z) \big)^p \, \mathrm{d}z \\
&\leq \int_{-\infty}^{0} F_\nu(z) \, \mathrm{d}z + \int_0^{\infty} \big( 1 - F_\nu(z) \big) \, \mathrm{d}z \\
&\overset{(a)}{=} \mathbb{E}[\max(0, -Z)] + \mathbb{E}[\max(0, Z)] \\
&= \mathbb{E}[|Z|] ,
\end{aligned}$$
where (a) follows from expressing $F_\nu(z)$ and $1 - F_\nu(z)$ as integrals and using Tonelli's theorem:
$$\int_0^{\infty} \big( 1 - F_\nu(z) \big) \, \mathrm{d}z = \int_0^{\infty} \mathbb{E}_{Z \sim \nu}\big[ \mathbb{1}\{Z > z\} \big] \, \mathrm{d}z = \mathbb{E}_{Z \sim \nu}\Big[ \int_0^{\infty} \mathbb{1}\{Z > z\} \, \mathrm{d}z \Big] = \mathbb{E}_{Z \sim \nu}[\max(0, Z)] ,$$
and similarly for the integral from $-\infty$ to 0.
The conditions of Proposition 4.27 are guaranteed by Assumption 2.5 (rewards have finite first moments). From Chapter 2, we know that under this assumption the random return has finite expected value and hence
$$\eta^\pi(x) \in \mathscr{P}_{\ell_p}(\mathbb{R}) , \quad \text{for all } x \in \mathcal{X} .$$
Similarly, we can show from elementary operations that if $R$ and $G(x)$ satisfy
$$\mathbb{E}\big[ |R| \mid X = x \big] < \infty , \quad \mathbb{E}\big[ |G(x)| \big] < \infty , \quad \text{for all } x \in \mathcal{X} ,$$
then also
$$\mathbb{E}\big[ |R + \gamma G(X')| \mid X = x \big] < \infty , \quad \text{for all } x \in \mathcal{X} .$$
Following Proposition 4.27, provided that $\eta_0 \in \mathscr{P}_{\ell_p}(\mathbb{R})^{\mathcal{X}}$, the sequence $(\eta_k)_{k \geq 0}$ converges to $\eta^\pi$ with respect to $\overline{\ell}_p$.
When $d$ is the $p$-Wasserstein distance ($p \in [1, \infty]$), the finite domain takes on a particularly useful form, which we denote by $\mathscr{P}_p(\mathbb{R})$:
$$\mathscr{P}_p(\mathbb{R}) = \big\{ \nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|^p] < \infty \big\} , \quad p \in [1, \infty) ,$$
$$\mathscr{P}_\infty(\mathbb{R}) = \big\{ \nu \in \mathscr{P}(\mathbb{R}) : \exists \, C > 0 \text{ such that } \nu([-C, C]) = 1 \big\} .$$
For $p < \infty$, this is the set of distributions with bounded $p$th moments; for $p = \infty$, the set of distributions with bounded support. In particular, observe that the finite domains of $\ell_p$ and $w_1$ coincide: $\mathscr{P}_{\ell_p}(\mathbb{R}) = \mathscr{P}_1(\mathbb{R})$.
As with the $\ell_p$ distances, we can satisfy the conditions of Proposition 4.27 for $d = w_p$ by introducing an assumption on the reward distributions. In this case, we simply require that these be in $\mathscr{P}_p(\mathbb{R})$. As this assumption will recur throughout the book, we state it here in full; Exercise 4.11 goes through the steps of the corresponding result.
Assumption 4.29($p$). For each state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$, the reward distribution $P_R(\cdot \mid x, a)$ is in $\mathscr{P}_p(\mathbb{R})$.
Proposition 4.30. Let $p \in [1, \infty]$. Under Assumption 4.29($p$), the return function $\eta^\pi$ has finite $p$th moments (for $p = \infty$: is bounded). In addition, for any initial return function $\eta_0 \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$, the sequence defined by $\eta_{k+1} = \mathcal{T}^\pi \eta_k$ converges to $\eta^\pi$ with respect to the supremum $p$-Wasserstein metric.
4.8 Weak Convergence of Return Functions*
Proposition 4.27 implies that if each distribution $\eta^\pi(x)$ lies in the finite domain $\mathscr{P}_d(\mathbb{R})$ of a given probability metric $d$ that is regular, $c$-homogeneous, and $p$-convex, then $\eta^\pi$ is the unique solution to the equation
$$\eta = \mathcal{T}^\pi \eta \qquad (4.18)$$
in the space $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$. It does not, however, rule out the existence of solutions outside this space. This concern can be addressed by showing that for any $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, the sequence of probability distributions $(\eta_k(x))_{k \geq 0}$ defined by
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$
converges weakly to the return distribution $\eta^\pi(x)$, for each state $x \in \mathcal{X}$. In addition to giving an alternative perspective on the quantitative convergence results of these iterates, the uniqueness of $\eta^\pi$ as a solution to Equation 4.18 (stated as Proposition 4.9) follows immediately from Proposition 4.34 below.
Definition 4.31. Let $(\nu_k)_{k \geq 0}$ be a sequence of distributions in $\mathscr{P}(\mathbb{R})$, and let $\nu \in \mathscr{P}(\mathbb{R})$ be another probability distribution. We say that $(\nu_k)_{k \geq 0}$ converges weakly to $\nu$ if for every $z \in \mathbb{R}$ at which $F_\nu$ is continuous, we have $F_{\nu_k}(z) \to F_\nu(z)$.
We will show that for each $x \in \mathcal{X}$, the sequence $(\eta_k(x))_{k \geq 0}$ converges weakly to $\eta^\pi(x)$. A simple approach is to consider the relationships between well-chosen instantiations of $\eta_k$ (for each $k \in \mathbb{N}$) and $\eta^\pi$, by means of the following classical result [see e.g. Billingsley, 2012]. Recall that a sequence of random variables $(Z_k)_{k \geq 0}$ converges to $Z$ with probability 1 if
$$\mathbb{P}\big( \lim_{k \to \infty} Z_k = Z \big) = 1 .$$
Lemma 4.32. Let $(\nu_k)_{k \geq 0}$ be a sequence in $\mathscr{P}(\mathbb{R})$, and $\nu \in \mathscr{P}(\mathbb{R})$ be another probability distribution. Let $(Z_k)_{k \geq 0}$ and $Z$ be instantiations of these distributions, all defined on the same probability space. If $Z_k \to Z$ with probability 1, then $\nu_k \to \nu$ weakly.
Lemma 4.32 is not a uniformly-applicable approach to demonstrating weak convergence; such instantiations always exist by Skorokhod's representation theorem [Billingsley, 2012, Section 25], but finding them is not always straightforward. However, in our case, there is a very natural set of instantiations that works, constructed by the following result.
Lemma 4.33. Let $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and let $G$ be an instantiation of $\eta$. For $x \in \mathcal{X}$, if $(X_t, A_t, R_t)_{t \geq 0}$ is a random trajectory with initial state $X_0 = x$, generated by following $\pi$ and independent of $G$, then
$$\sum_{t=0}^{k-1} \gamma^t R_t + \gamma^k G(X_k)$$
is an instantiation of $((\mathcal{T}^\pi)^k \eta)(x)$.
Proof. This follows by inductively applying Proposition 4.11.
Proposition 4.34. Let $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and for $k \geq 0$ define $\eta_{k+1} = \mathcal{T}^\pi \eta_k$. Then we have $\eta_k(x) \to \eta^\pi(x)$ weakly for each $x \in \mathcal{X}$, and consequently $\eta^\pi$ is the unique fixed point of $\mathcal{T}^\pi$ in $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
Proof. Fix $x \in \mathcal{X}$, let $G_0$ be an instantiation of $\eta_0$, and on the same probability space, let $(X_t, A_t, R_t)_{t \geq 0}$ be a trajectory generated by $\pi$ with initial state $X_0 = x$, independent of $G_0$ (the existence of such a probability space is guaranteed by the Ionescu-Tulcea theorem, as described in Remark 2.1). Then
$$G_k(x) = \sum_{t=0}^{k-1} \gamma^t R_t + \gamma^k G_0(X_k)$$
is an instantiation of $\eta_k(x)$ by Lemma 4.33. Furthermore, $\sum_{t=0}^{\infty} \gamma^t R_t$ is an instantiation of $\eta^\pi(x)$. We have
$$\Big| G_k(x) - \sum_{t=0}^{\infty} \gamma^t R_t \Big| = \Big| \gamma^k G_0(X_k) - \sum_{t=k}^{\infty} \gamma^t R_t \Big| \leq \gamma^k \big| G_0(X_k) \big| + \Big| \sum_{t=k}^{\infty} \gamma^t R_t \Big| \to 0 ,$$
with probability 1. The convergence of the first term follows since $|G_0(X_k)| \leq \max_{x \in \mathcal{X}} |G_0(x)|$ is bounded w.p. 1 and $\gamma^k \to 0$, and the convergence of the second term follows from the convergence of $\sum_{t=0}^{k-1} \gamma^t R_t$ to $\sum_{t=0}^{\infty} \gamma^t R_t$ w.p. 1. Therefore we have $G_k(x) \to \sum_{t=0}^{\infty} \gamma^t R_t$ w.p. 1, and so $\eta_k(x) \to \eta^\pi(x)$ weakly. Finally, this implies that there can be no other fixed point of $\mathcal{T}^\pi$ in $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$; if $\eta_0$ were such a fixed point, then we would have $\eta_k = \eta_0$ for all $k \geq 0$, and would simultaneously have $\eta_k(x) \to \eta^\pi(x)$ weakly for all $x \in \mathcal{X}$, a contradiction unless $\eta_0 = \eta^\pi$.
As the name indicates, the notion of weak convergence is not as strong as many other notions of convergence of probability distributions. In general, it does not even guarantee convergence of the mean of the sequence of distributions; see Exercise 4.21. In addition, we lose any notion of convergence rate, which was provided by the contractive nature of the distributional operator under specific metrics. The need for stronger guarantees, such as those offered by Proposition 4.27, motivates the contraction mapping theory developed in this chapter.
4.9 Random Variable Bellman Operators*
In this chapter, we defined the distributional Bellman operator $\mathcal{T}^\pi$ as a mapping on the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$. We also saw that the action of the operator on a return function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ can be understood both through direct manipulation of the probability distributions and through manipulation of a collection of random variables instantiating these distributions.
Viewing the operator through its effect on the distribution of a collection of representative random variables is a useful tool for understanding distributional reinforcement learning, and may prompt the reader to ask whether it is possible to avoid referring to probability distributions at all, working instead directly with random variables. We describe one approach to this below using the tools of probability theory, and then discuss some of its shortcomings.
Let $G_0 = (G_0(x) : x \in \mathcal{X})$ be an initial collection of real-valued random variables, indexed by state, supported on a probability space $(\Omega_0, \mathcal{F}_0, \mathbb{P}_0)$. For each $k \in \mathbb{N}_+$, let $(\Omega_k, \mathcal{F}_k, \mathbb{P}_k)$ be another probability space, supporting a collection of random variables $\big( (A_k(x), R_k(x, a), X'_k(x, a)) : x \in \mathcal{X}, a \in \mathcal{A} \big)$, with $A_k(x) \sim \pi(\cdot \mid x)$, and independently $R_k(x, a) \sim P_R(\cdot \mid x, a)$ and $X'_k(x, a) \sim P_{\mathcal{X}}(\cdot \mid x, a)$. We then consider the product probability space on $\Omega = \prod_{k \in \mathbb{N}} \Omega_k$. All random variables defined above can naturally be viewed as functions on this joint probability space that depend on $\omega = (\omega_0, \omega_1, \omega_2, \ldots) \in \Omega$ only through the coordinate $\omega_k$ that matches the index $k$ on the random variable. Note that under this construction, all random variables with distinct indices are independent.
Now define $\mathcal{X}_{\mathbb{N}}$ as the set of real-valued random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ (where $\mathcal{F}$ is the product $\sigma$-algebra) that depend on only finitely-many coordinates of $\omega \in \Omega$. We can define a Bellman operator $\mathcal{T}^\pi : \mathcal{X}_{\mathbb{N}}^{\mathcal{X}} \to \mathcal{X}_{\mathbb{N}}^{\mathcal{X}}$ as follows. Given $G = (G(x) : x \in \mathcal{X}) \in \mathcal{X}_{\mathbb{N}}^{\mathcal{X}}$, let $K \in \mathbb{N}$ be the smallest integer such that the random variables $(G(x) : x \in \mathcal{X})$ depend on $\omega = (\omega_0, \omega_1, \omega_2, \ldots) \in \Omega$ only through $\omega_0, \ldots, \omega_{K-1}$; such an integer exists due to the definition of $\mathcal{X}_{\mathbb{N}}$ and the finiteness of $\mathcal{X}$. We then define $\mathcal{T}^\pi G \in \mathcal{X}_{\mathbb{N}}^{\mathcal{X}}$ by
$$(\mathcal{T}^\pi G)(x) = R_K(x, A_K(x)) + \gamma \, G\big( X'_K(x, A_K(x)) \big) .$$
With this definition, we can obtain a sequence of collections of random variables $(G_k)_{k \geq 0}$, defined iteratively by $G_{k+1} = \mathcal{T}^\pi G_k$, for $k \geq 0$.^{32} We have therefore formalised an operator entirely within the realm of random variables, without reference to the distributions of the iterates $(G_k)_{k \geq 0}$. By construction, the distribution of the random variables $(G_k)_{k \geq 0}$ matches the sequence of distributions that would be obtained by working directly on the space of probability distributions with the usual distributional Bellman operator. More concretely, if $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is such that $\eta_0(x)$ is the distribution of $G_0(x)$ for each $x \in \mathcal{X}$, then we have that $\eta_k$, defined by $\eta_k = (\mathcal{T}^\pi)^k \eta_0$, is such that $\eta_k(x)$ is the distribution of $G_k(x)$, for each $x \in \mathcal{X}$. Thus, the random-variable Bellman operator constructed above is consistent with the distributional Bellman operator that is the main focus of this chapter.
32. This is real equality between random variables, rather than equality in distribution.
One difficulty with this random-variable operator is that it does not have a fixed point; while the distribution of the random variables $G_k(\cdot)$ converges to that of $G^\pi(\cdot)$, the random variables themselves, as functions on the probability space, do not converge.^{33} Thus, while this is one way to view distributional reinforcement learning purely in terms of random variables, it is much less natural to analyse algorithms from this perspective than through probability distributions as described in this chapter.
33. This is a fairly subtle point; the reader is invited to consider what happens to $G_k(x)$ and $G^\pi(x)$ as mappings from $\Omega$ to $\mathbb{R}$.
4.10 Technical Remarks
Remark 4.1. Our exposition in this chapter has focused on the standard distributional Bellman equation
$$G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X') , \qquad X = x .$$
A similar development is possible with the alternative notions of the random return mentioned in Section 2.9, including the 'probability of continuing' interpretation of the discount factor $\gamma$. In general, the distributional operators that arise from these alternative notions of the random return are still contractions, although the metrics and contraction moduli involved in these statements differ. For example, while the standard distributional Bellman operator is not a contraction in total variation distance, the distributional Bellman operator associated with the 'probability of continuing' random return is; Exercise 4.17 explores this point in greater detail.
Remark 4.2. The ideas developed in this chapter, and indeed the other chapters of the book, can also be applied to learning other properties related to the return. Achab [2020] explores several methods for learning the distribution of the random variables $R + \gamma V^\pi(X')$, under the different possible initial conditions $X = x$; these objects interpolate between the expected return and the full distribution of the return. Exercise 4.22 explores the development of contractive operators for the distributions of these objects.
Remark 4.3. As well as allowing us to quantify the contractivity of the distributional Bellman operator, the probability metrics described in this chapter can be used to measure the accuracy of algorithms that aim to approximate the return distribution. In particular, they can be used to give quantitative guarantees on the accuracy of the non-parametric distributional Monte Carlo algorithm described in Remark 3.1. These guarantees can then be used to determine how many sample trajectories are required to approximate the true return distribution at a required level of accuracy, for example when evaluating the performance of the algorithms in Chapters 5 and 6. We assume that $K$ returns $(G_k)_{k=1}^{K}$ beginning at state $x_0$ have been generated independently via the policy $\pi$, and consider the accuracy of the non-parametric distributional Monte Carlo estimator
$$\hat\eta(x_0) = \frac{1}{K} \sum_{k=1}^{K} \delta_{G_k} \, .$$
Different realisations of the sampled returns will lead to different estimates; the following result provides a guarantee in the Kolmogorov-Smirnov distance $\ell_\infty$ that holds with high probability over the possible realisations of the returns. For any $\varepsilon > 0$, we have
$$\ell_\infty\big( \hat\eta(x_0), \eta^\pi(x_0) \big) \le \varepsilon \, , \quad \text{with probability at least } 1 - 2\exp(-2K\varepsilon^2) \, .$$
Thus, for a desired level of accuracy $\varepsilon$ and a desired confidence $1 - \delta$, $K$ can be selected so that $2\exp(-2K\varepsilon^2) < \delta$, which then yields the guarantee that with probability at least $1 - \delta$, we have $\ell_\infty(\hat\eta(x_0), \eta^\pi(x_0)) \le \varepsilon$. This is in fact a key result in empirical process theory known as the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality [Dvoretzky et al., 1956, Massart, 1990]. Its uses specifically in reinforcement learning include the works of Keramati et al. [2020] and Chandak et al. [2021]. Similar concentration bounds are possible under other probability metrics (such as the Wasserstein distances; see e.g. Weed and Bach [2019], Bobkov and Ledoux [2019]), though typically some form of a priori information about the return distribution, such as bounds on minimum/maximum returns, is required to establish such bounds.
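As a rough illustration of how the DKW inequality is used, the sketch below computes the number of independent returns $K$ needed so that the empirical return distribution is within a target Kolmogorov-Smirnov accuracy $\varepsilon$ of the true one with confidence $1 - \delta$; the particular values of $\varepsilon$ and $\delta$ are arbitrary.

```python
import math

def dkw_sample_size(eps: float, delta: float) -> int:
    """Number of returns K required by the DKW bound: an integer K large
    enough that 2 * exp(-2 * K * eps**2) <= delta, so the empirical CDF of the
    return lies within eps of the true CDF (in Kolmogorov-Smirnov distance)
    with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# Accuracy 0.05 at 99% confidence requires on the order of a thousand returns.
print(dkw_sample_size(eps=0.05, delta=0.01))   # -> 1060
```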
4.11 Bibliographical Remarks
4.1–4.2.
The use of operator theory to understand reinforcement learning algo-
rithms is standard to most textbooks in the field [Bertsekas and Tsitsiklis, 1996,
Szepesvári, 2010, Puterman, 2014, Sutton and Barto, 2018], and is commonly
used in theoretical reinforcement learning [Munos, 2003, Bertsekas, 2011,
Scherrer, 2014]. Puterman [2014] provides a thorough introduction to vector
notation for Markov decision processes. Improved convergence results can be
obtained by studying the eigenspectrum of the transition kernel, as shown by e.g.
Morton [1971], Bertsekas [1994, 2012]. The elementary contraction mapping
theory described in Section 4.2 goes back to Banach [1922]. Our reference on
metric spaces is the definitive textbook by Rudin [1976].
Not all reinforcement learning algorithms are readily analysed using the theory
of contraction mappings. This is the case for policy-gradient algorithms [Sutton
et al., 2000, but see Ghosh et al., 2020 and Bhandari and Russo, 2021], but also
value-based algorithms such as advantage learning [Baird, 1999, Bellemare
et al., 2016].
4.3.
The random-variable Bellman operator presented here is from our earlier
work [Bellemare et al., 2017a], which provided an analysis in the p-Wasserstein
distances. There, the technical issues discussed in Section 4.9 were obviated
by declaring $R$, $X'$ and the collection $(G(x) : x \in \mathcal{X})$ to be independent. Those
issues were raised in a later paper [Rowland et al., 2018], which also introduced
issues were raised in a later paper [Rowland et al., 2018], which also introduced
the distributional Bellman equation in terms of probability measures and the
pushforward notation. The probability density equation (Equation 4.9) can be
found in Morimura et al. [2010b]. Earlier instances of distributional operators
are given by Chung and Sobel [1987], Morimura et al. [2010a], who provide
an operator on cumulative distribution functions, and Jaquette [1976], who
provides an operator on Laplace transforms.
4.4.
The Wasserstein distance can be traced back to Leonid Kantorovich [Kan-
torovich, 1942] and has been rediscovered (and renamed) multiple times in its
history. The name we use here is common, but a misnomer as Leonid Vaserstein
(after whom the distance is named) did not himself do any substantial work
on the topic. Among other names, we note the Mallows metric [Bickel and
Freedman, 1981] and the Earth-Mover distance [Rubner et al., 1998]. Much
earlier, Monge [1781] was the first to study the problem of optimal transporta-
tion from a transport theory perspective. See Vershik [2013] and Panaretos
and Zemel [2020] for further historical comments, and Villani [2008] for a
Operators and Metrics 111
survey of theoretical properties. A version of the contraction analysis in p-
Wasserstein distances was given by Bellemare et al. [2017a]; we owe the proof
of Proposition 4.15 in terms of optimal couplings to Philip Amortila.
The use of contraction mapping theory to analyse stochastic fixed-point equa-
tions was introduced by Rösler [1991], who analyses the Quicksort algorithm
by characterising the distributional fixed points of contraction mappings in 2-
Wasserstein distance. Applications and generalisations of this technique include
the analysis of further recursive algorithms, models in stochastic geometry, and
branching processes [Rösler, 1992, Rachev and Rüschendorf, 1995, Neininger,
1999, Rösler and Rüschendorf, 2001, Rösler, 2001, Neininger, 2001, Neininger
and Rüschendorf, 2004, Rüschendorf, 2006, Rüschendorf and Neininger, 2006,
Alsmeyer, 2012]. Although the random-variable Bellman equation (Equation
2.15) can be viewed as a system of recursive distributional equations, the empha-
sis on a collection of effectively independent random variables $(G^\pi(x) : x \in \mathcal{X})$ differs
from the usual treatment of such equations.
4.5.
The family of $\ell_p$ distances described in this chapter is covered at length
in the work of Rachev et al. [2013], which studies an impressive variety of
probability metrics. A version of the contraction analysis in Cramér distance
was originally given by Rowland et al. [2018]. In two and more dimensions,
the Cramér distance is generalised by the energy distance [Székely, 2002,
Székely and Rizzo, 2013, Rizzo and Székely, 2016], itself a member of the
maximum mean discrepancy (MMD) family [Gretton et al., 2012]; contraction
analysis in terms of MMD metrics was undertaken by Nguyen et al. [2021]
(see Exercise 4.19 for further details). Another special case of the $\ell_p$ metrics considered in this chapter is the Kolmogorov-Smirnov distance ($\ell_\infty$), which
features in results in empirical process theory, such as the Glivenko-Cantelli
theorem. Many of these metrics are integral probability metrics [Müller, 1997],
which allows for a dual formulation with appealing algorithmic consequences.
Chung and Sobel [1987] provide a non-expansion result in total variation
distance (without naming it as such; the proof uses an integral probability
metric formulation).
4.6.
The properties of regularity, convexity, and c-homogeneity were introduced
by Zolotarev [1976] in a slightly more general setting. Our earlier work pre-
sented these in a modern context [Bellemare et al., 2017b], albeit with only a
mention of their potential use in reinforcement learning. Although that work
proposed the term “sum-invariant” as mnemonically simpler, this is only techni-
cally correct when Equation 4.15 holds with equality; we have thus chosen to
keep the original name. Theorem 4.25 is new to this book.
The characterisation of the Wasserstein distance as an optimal transport problem
in Proposition 4.18 is the standard presentation of the Wasserstein distance in
more abstract settings, which allows it to be applied to probability distributions
over reasonably general metric spaces [Villani, 2003, 2008, Ambrosio et al.,
2005, Santambrogio, 2015]. Optimal transport has also increasingly found
application within machine learning in recent years, particularly in generative
modelling [Arjovsky et al., 2017]. Optimal transport and couplings also arise in
the study of bisimulation metrics for Markov decision processes [Ferns et al.,
2004, Ferns and Precup, 2014, Amortila et al., 2019] as well as analytical tools
for sample-based algorithms [Amortila et al., 2020]. Peyré et al. [2019] provide
an overview of algorithms, analysis, and applications associated with optimal
transport in machine learning and related disciplines.
4.7–4.8.
Villani [2008] gives further discussion on the domain of Wasserstein
distances, and on their relationship to weak convergence.
4.9.
The usefulness of a random-variable operator has been a source of intense
debate between the authors of this book. The form we present here is inspired
by the “stack of rewards” model from Lattimore and Szepesvári [2020].
4.12 Exercises
Exercise 4.1. Show that the no-loop operator defined in Example 4.6 is a contraction mapping with modulus
$$\gamma \max_{x \in \mathcal{X}} P\big(X' \neq x \mid X = x\big) \, .$$
Exercise 4.2. For $p \in [1, \infty)$, let $\|\cdot\|_p$ be the $L_p$ norm over $\mathbb{R}^{\mathcal{X}}$, defined as
$$\|V\|_p = \Big( \sum_{x \in \mathcal{X}} |V(x)|^p \Big)^{1/p} .$$
Show that $T^\pi$ is not a contraction mapping in the metric induced by the $L_p$ norm unless $p = \infty$ (see Equation 4.4 for a definition of the $L_\infty$ metric). Hint. A two-state example suffices.
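Before attempting the proof, the phenomenon can be observed numerically. In the sketch below (an arbitrary two-state construction, not the unique answer), both states transition deterministically to state 0; for finite $p$, choosing $\gamma > 2^{-1/p}$ makes $T^\pi$ expand the $L_p$ distance between two value function estimates.

```python
import numpy as np

def expansion_ratio(p: float, gamma: float) -> float:
    """||T V1 - T V2||_p / ||V1 - V2||_p for a two-state chain in which both
    states move deterministically to state 0. Rewards cancel in the difference,
    so only gamma * P matters."""
    P = np.array([[1.0, 0.0],
                  [1.0, 0.0]])
    diff = np.array([1.0, 0.0])          # V1 - V2
    t_diff = gamma * (P @ diff)          # (T V1) - (T V2)
    return np.linalg.norm(t_diff, ord=p) / np.linalg.norm(diff, ord=p)

for p in (1, 2, 4):
    gamma = 0.5 * (2 ** (-1 / p) + 1.0)  # any gamma in (2**(-1/p), 1) works
    print(p, round(gamma, 3), expansion_ratio(p, gamma))  # ratios exceed 1
```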
Exercise 4.3. In this exercise, you will use the ideas of Section 4.1 to study several operators associated with expected-value reinforcement learning.
(i) For $n \in \mathbb{N}^+$, consider the $n$-step evaluation operator $T^\pi_n : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by
$$(T^\pi_n V)(x) = \mathbb{E}_\pi \Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V(X_n) \;\Big|\; X_0 = x \Big] \, .$$
Show that $T^\pi_n$ has $V^\pi$ as a fixed point, and is a contraction mapping with respect to the $L_\infty$ metric with contraction modulus $\gamma^n$. Hence, deduce that repeated application of $T^\pi_n$ to any initial value function estimate converges to $V^\pi$. Show that in fact, $T^\pi_n = (T^\pi)^n$.
(ii) Consider the $\lambda$-return operator $T^\pi_\lambda : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by
$$(T^\pi_\lambda V)(x) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, \mathbb{E}_\pi \Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V(X_n) \;\Big|\; X_0 = x \Big] \, .$$
Show that $T^\pi_\lambda$ has $V^\pi$ as a fixed point, and is a contraction mapping with respect to the $L_\infty$ metric with contraction modulus
$$\frac{\gamma(1 - \lambda)}{1 - \lambda\gamma} \, .$$
Hence, deduce that repeated application of $T^\pi_\lambda$ to any initial value function estimate converges to $V^\pi$.
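Part (i) can be checked numerically before proving it. The sketch below applies the $n$-step operator in exact matrix form (assuming, for simplicity, state-dependent expected rewards) to a randomly generated chain and reports the $L_\infty$ contraction factor observed on a pair of value function estimates, which should be at most $\gamma^n$; the chain and the estimates are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, gamma, n = 4, 0.9, 3

P = rng.dirichlet(np.ones(n_states), size=n_states)  # random transition matrix
r = rng.normal(size=n_states)                         # expected reward per state

def T_n(V):
    """Exact n-step evaluation operator in matrix form:
    T_n V = sum_{t<n} (gamma P)^t r + (gamma P)^n V."""
    out, M = np.zeros(n_states), np.eye(n_states)
    for _ in range(n):
        out = out + M @ r
        M = gamma * (P @ M)
    return out + M @ V

V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
ratio = np.max(np.abs(T_n(V1) - T_n(V2))) / np.max(np.abs(V1 - V2))
print(ratio, "<=", gamma ** n)   # observed contraction factor vs. gamma^n
```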
Exercise 4.4. Consider the random-variable operator (Equation 4.7). Given a return-variable function $G_0$, write $G_0(x, \omega) = G_0(x)(\omega)$ for the realisation of the random return corresponding to $\omega \in \Omega$. Additionally, for each $x \in \mathcal{X}$, let $(x, A(x), R(x), X'(x))$ be an independent random transition defined on the same probability space, and write $(A(x, \omega), R(x, \omega), X'(x, \omega))$ to denote the dependence of these random variables on $\omega \in \Omega$. Suppose that for each $k \ge 0$, we define the return-variable function $G_{k+1}$ as
$$G_{k+1}(x, \omega) = R(x, \omega) + \gamma\, G_k\big(X'(x, \omega), \omega\big) \, .$$
For a given $x$, characterise the function
$$G_\infty(x, \omega) = \lim_{k \to \infty} G_k(x, \omega) \, .$$
Exercise 4.5. Suppose that you are given a description of a Markov decision process along with a policy $\pi$, with the property that the policy $\pi$ reaches a terminal state (i.e., one for which the return is zero) in at most $T \in \mathbb{N}$ steps. Describe a recursive procedure that takes in a scalar $\omega$ on $[0, 1]$ and outputs a return $z \in \mathbb{R}$ such that
$$P\big( G^\pi(X_0) \le z \big) = \omega \, .$$
Hint. You may want to use the fact that sampling a random variable $Z$ can be emulated by drawing $\omega$ uniformly from $[0, 1]$, and returning $F^{-1}_Z(\omega)$.
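The fact referenced in the hint is the inverse transform method: if $\omega$ is drawn uniformly from $[0, 1]$, then $F^{-1}_Z(\omega)$ is distributed as $Z$. A minimal sketch for a discrete distribution follows; the support and probabilities are arbitrary.

```python
import numpy as np

def inverse_cdf(support, probs, omega):
    """Generalised inverse F^{-1}(omega) = inf{z : F(z) >= omega} for a
    discrete distribution given by sorted support points and probabilities."""
    cdf = np.cumsum(probs)
    return support[np.searchsorted(cdf, omega)]

support = np.array([-1.0, 0.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])

rng = np.random.default_rng(0)
samples = [inverse_cdf(support, probs, rng.uniform()) for _ in range(10_000)]
print(np.mean(samples), support @ probs)   # empirical mean vs. exact mean
```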
Exercise 4.6. Let $p \in [1, \infty]$.
(i) Show that for any $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with finite $p$th moments, we have $w_p(\nu, \nu') < \infty$. Hint. Use the triangle inequality with intermediate distribution $\delta_0$. Hence,
prove that the p-Wasserstein metric is indeed a metric on $\mathscr{P}_p(\mathbb{R})$, for $p \in [1, \infty]$.
(ii) Show that on the space $\mathscr{P}(\mathbb{R})$, the p-Wasserstein metric satisfies all requirements of a metric except finiteness. For each $p \in [1, \infty]$, exhibit a pair of distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ such that $w_p(\nu, \nu') = \infty$.
(iii) Show that if $1 \le q < p < \infty$, we have $\mathbb{E}[|Z|^p] < \infty \implies \mathbb{E}[|Z|^q] < \infty$ for any random variable $Z$. Deduce that $\mathscr{P}_p(\mathbb{R}) \subseteq \mathscr{P}_q(\mathbb{R})$. Hint. Consider applying Jensen's inequality with the function $z \mapsto |z|^{p/q}$.
Exercise 4.7. Let $d : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ be a probability metric with finite domain $\mathscr{P}_d(\mathbb{R})$.
(i) Prove that the supremum distance $\overline{d}$ is an extended metric on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
(ii) Prove that it is a metric on $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.
Exercise 4.8. Prove Proposition 4.15 for $p = \infty$.
Exercise 4.9. Consider two normally-distributed random variables with means $\mu$ and $\mu'$ and common variance $\sigma^2$. Derive an expression for the $\infty$-Wasserstein distance between these two distributions. Conclude that Assumption 4.29($\infty$) is sufficient, but not necessary, for two distributions to have finite $\infty$-Wasserstein distance. How does the situation change if the two normal distributions in question have unequal variances?
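On the real line, the Wasserstein distances can be computed from quantile functions, and the $\infty$-Wasserstein distance is the largest gap between the two quantile functions. The sketch below evaluates this gap numerically for two Gaussians with equal variance (arbitrary parameter choices); for equal variances the gap is constant in the quantile level and equals $|\mu - \mu'|$.

```python
import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 1.5, 2.0
taus = np.linspace(1e-6, 1 - 1e-6, 100_001)   # quantile levels

# w_inf is the supremum over quantile levels of the quantile-function gap.
gaps = np.abs(norm.ppf(taus, loc=mu1, scale=sigma)
              - norm.ppf(taus, loc=mu2, scale=sigma))
print(gaps.max(), abs(mu1 - mu2))   # both equal |mu1 - mu2|
```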
Exercise 4.10. Explain, in words, why Assumption 4.29($p$) is needed for Proposition 4.30. By considering the definition of the p-Wasserstein distance, explain why for $p > 1$ and $1 \le q < p$, Assumption 4.29($q$) is not sufficient to guarantee convergence in the p-Wasserstein distance.
Exercise 4.11. This exercise guides you through the proof of Proposition 4.30.
(i) First, show that under Assumption 4.29($p$), the return distributions $\eta^\pi(x)$ have finite $p$th moments, for all $x \in \mathcal{X}$. You may find it useful to deal with $p = \infty$ separately, and in the case $p \in [1, \infty)$, you may find it useful to rewrite
$$\mathbb{E}\Big[ \Big| \sum_{t=0}^{\infty} \gamma^t R_t \Big|^p \;\Big|\; X = x \Big] = (1 - \gamma)^{-p}\, \mathbb{E}\Big[ \Big| \sum_{t=0}^{\infty} (1 - \gamma)\gamma^t R_t \Big|^p \;\Big|\; X = x \Big] \, ,$$
and use Jensen's inequality on the function $z \mapsto |z|^p$.
(ii) Let $\eta \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$, and let $G$ be an instantiation of $\eta$. First letting $p \in [1, \infty)$, use the inequality $|z_1 + z_2|^p \le 2^{p-1}(|z_1|^p + |z_2|^p)$ to argue that if $(X = x, A, R, X')$ is a sample transition independent of $G$, then
$$\mathbb{E}\big[ |R + \gamma G(X')|^p \mid X = x \big] < \infty \, .$$
Hence, argue that under Assumption 4.29($p$), $\mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$ is closed under $\mathcal{T}^\pi$. Argue separately that this holds for $p = \infty$ too.
Hence argue that Proposition 4.27 applies, and hence conclude that Proposition 4.30 holds.
Exercise 4.12. In the proof of Theorem 4.25, we did not need to assume that for the two distinct states $x, y \in \mathcal{X}$, their associated returns $G(x)$ and $G(y)$ are independent. Explain why.
Exercise 4.13. The (1,1)-Pareto distribution $\nu_{\mathrm{PAR}}$ has cumulative distribution function
$$F_{\nu_{\mathrm{PAR}}}(z) = \begin{cases} 0 & \text{if } z < 1, \\ 1 - \tfrac{1}{z} & \text{if } z \ge 1. \end{cases}$$
Justify the necessity of including Assumption 2.5 (finite-mean rewards) in Definition 4.26 by demonstrating that $\ell_2(\nu_{\mathrm{PAR}}, \delta_0) < \infty$ yet $\mathbb{E}_{Z \sim \nu_{\mathrm{PAR}}}[Z] = \infty$.
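The claim can be checked numerically. The sketch below evaluates $\ell_2(\nu_{\mathrm{PAR}}, \delta_0)$ as the square root of the integral of the squared difference of cumulative distribution functions (truncating and discretising the integral, so the value is approximate), and shows that sample means of the Pareto distribution keep drifting upward, as expected for an infinite mean.

```python
import numpy as np

def F_pareto(z):
    """CDF of the (1,1)-Pareto distribution."""
    return np.where(z < 1.0, 0.0, 1.0 - 1.0 / np.maximum(z, 1.0))

def F_dirac0(z):
    """CDF of the point mass at zero."""
    return (z >= 0.0).astype(float)

# ell_2 distance: sqrt of the integral of the squared CDF difference.
z = np.linspace(-1.0, 10_000.0, 2_000_001)
dz = z[1] - z[0]
ell2 = np.sqrt(np.sum((F_pareto(z) - F_dirac0(z)) ** 2) * dz)
print("ell_2 ~", ell2)            # close to sqrt(2), in particular finite

# Sample means of the Pareto distribution do not stabilise: the mean is infinite.
rng = np.random.default_rng(0)
for n in (10**3, 10**5, 10**7):
    print(n, np.mean(1.0 / rng.uniform(size=n)))   # 1/U has the (1,1)-Pareto law
```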
Exercise 4.14. The purpose of this exercise is to contrast the p-Wasserstein and $\ell_p$ distances. For each of the following, find a pair of probability distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ such that, for a given $\varepsilon > 0$,
(i) $w_2(\nu, \nu')$ is smaller than $\ell_2(\nu, \nu')$;
(ii) $w_2(\nu, \nu')$ is larger than $\ell_2(\nu, \nu')$;
(iii) $w_\infty(\nu, \nu') = \varepsilon$ and $\ell_\infty(\nu, \nu') = 1$;
(iv) $w_\infty(\nu, \nu') = 1$ and $\ell_\infty(\nu, \nu') = \varepsilon$.
Exercise 4.15. Show that the dependence on $\gamma^{1/p}$ is tight in Proposition 4.20.
Exercise 4.16. The total variation distance $d_{\mathrm{TV}} : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to \mathbb{R}$ is defined by
$$d_{\mathrm{TV}}(\nu, \nu') = \sup_{U \subseteq \mathbb{R}} |\nu(U) - \nu'(U)| \, , \qquad (4.19)$$
for all $\nu, \nu' \in \mathscr{P}(\mathbb{R})$.$^{34}$ Show, by means of a counterexample, that the distributional Bellman operator is not a contraction mapping in the supremum extension of this distance.
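For distributions supported on a common finite set of points, the supremum in Equation 4.19 is attained by the set of outcomes where one distribution places more mass than the other, so the total variation distance reduces to half the sum of absolute differences of the probability vectors. A small sketch with arbitrary probability vectors follows.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite
    support: sup_U |p(U) - q(U)| = 0.5 * sum_i |p_i - q_i|."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

print(tv_distance([0.5, 0.3, 0.2], [0.1, 0.3, 0.6]))   # -> 0.4
```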
Exercise 4.17. Consider the alternative notion of the return introduced in Section 2.9 in which $\gamma$ is treated as the probability of continuing. Write down the distributional Bellman operator that corresponds to this notion of the return. For which metrics considered in this chapter is this distributional Bellman operator a contraction mapping? Show in particular that this distributional Bellman operator is a contraction with respect to the supremum version of the total variation distance over return-distribution functions, introduced in Exercise 4.16. What is its contraction modulus?
Exercise 4.18. Remark 2.3 describes some differences between Markov decision processes with finite state spaces (as we consider throughout the book) and generalisations with infinite state spaces. The contraction mapping theory in this chapter is one case where stronger assumptions are required when moving to larger state spaces. Using the example described in Remark 2.3, show that Assumption 4.29($w_1$) is insufficient to make $\mathscr{P}_1(\mathbb{R})$ closed under the distributional Bellman operator $\mathcal{T}^\pi$ when the state space is countably infinite. How could this assumption be strengthened to guarantee closedness?
Exercise 4.19 (*). The goal of this exercise is to explore the contractivity of the distributional Bellman operator with respect to a class of metrics known as maximum mean discrepancies; this analysis was undertaken by Nguyen et al. [2021]. A kernel on $\mathbb{R}$ is a function $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with the property that for any finite set $\{z_1, \dots, z_n\} \subseteq \mathbb{R}$, the $n \times n$ matrix with $(i, j)$th entry $K(z_i, z_j)$ is positive semi-definite. A function $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfying the weaker condition that
$$\sum_{i,j=1}^{n} c_i c_j K(z_i, z_j) \ge 0$$
whenever $\sum_{i=1}^{n} c_i = 0$ is called a conditionally positive-definite kernel. Conditionally positive-definite kernels form a measure of similarity between pairs of points in $\mathbb{R}$, and can also be used to define notions of distance over probability distributions.
34. For the reader with a measure-theoretic background, the supremum here is over measurable subsets of $\mathbb{R}$.
The maximum mean discrepancy (MMD) probability metric associated with the conditionally positive-definite kernel $K$ is defined by
$$\mathrm{MMD}_K(\nu, \nu') = \Big( \mathbb{E}_{X, X' \sim \nu}\big[K(X, X')\big] + \mathbb{E}_{Y, Y' \sim \nu'}\big[K(Y, Y')\big] - 2\, \mathbb{E}_{X \sim \nu,\, Y \sim \nu'}\big[K(X, Y)\big] \Big)^{1/2} \, ,$$
where each pair of random variables in the expectations above are taken to be independent.
(i) Consider the function $K_\alpha(z_1, z_2) = -|z_1 - z_2|^\alpha$, with $\alpha \in (0, 2)$. Székely and Rizzo [2013, Proposition 2] show that this defines a conditionally positive-definite kernel. Show that $\mathrm{MMD}_{K_\alpha}$ is regular, $c$-homogeneous (for some $c > 0$), and $p$-convex (for some $p \in [1, \infty)$). Hence use Theorem 4.25 to establish that the distributional Bellman operator is a contraction with respect to $\overline{\mathrm{MMD}}_{K_\alpha}$, under suitable assumptions.
(ii) The Gaussian kernel, or squared exponential kernel, with variance $\sigma^2 > 0$ and length scale $\lambda > 0$ is defined by $K(z_1, z_2) = \sigma^2 \exp\big(-(z_1 - z_2)^2 / (2\lambda^2)\big)$. Show, through the use of a counterexample, that the MMD corresponding to the Gaussian kernel is not $c$-homogeneous for any $c > 0$, and so Theorem 4.25 cannot be applied. Further, find an MDP and policy $\pi$ that serve as a counterexample to the contractivity of the distributional Bellman operator with respect to this MMD metric.
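To get a feel for the quantity involved, the sketch below estimates $\mathrm{MMD}_{K_\alpha}$ from samples using the kernel $K_\alpha(z_1, z_2) = -|z_1 - z_2|^\alpha$, plugging the empirical distributions into the definition; the distributions compared, the sample sizes, and $\alpha$ are arbitrary choices.

```python
import numpy as np

def mmd_energy(x, y, alpha=1.0):
    """Empirical MMD with kernel K(z1, z2) = -|z1 - z2|**alpha: the square root
    of E K(X, X') + E K(Y, Y') - 2 E K(X, Y), with expectations taken under the
    empirical distributions of the samples x and y."""
    k = lambda a, b: -np.abs(a[:, None] - b[None, :]) ** alpha
    val = k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2_000)
y = rng.normal(0.5, 1.0, size=2_000)
print(mmd_energy(x, y))            # positive: the two distributions differ
print(mmd_energy(x, x.copy()))     # zero: identical empirical distributions
```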
Exercise 4.20. Let $p \in [1, \infty]$, and consider a modification of the p-Wasserstein distance, $\tilde{w}_p$, such that $\tilde{w}_p(\nu, \nu') = w_p(\nu, \nu')$ if both $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ are expressible as finite mixtures of Dirac deltas, and $\tilde{w}_p(\nu, \nu') = \infty$ otherwise.
(i) Show that $\mathscr{P}_{\tilde{w}_p}(\mathbb{R})$ is the set of distributions expressible as finite mixtures of Dirac deltas.
(ii) Exhibit a Markov decision process and policy $\pi$ for which all conditions of Proposition 4.27 hold except for the condition $\eta^\pi \in \mathscr{P}_{\tilde{w}_p}(\mathbb{R})^{\mathcal{X}}$.
(iii) Show that the sequence of iterates $(\eta_k)_{k \ge 0}$ does not converge to $\eta^\pi$ under $\tilde{w}_p$ in this case.
Exercise 4.21. Consider the sequence of distributions $(\nu_k)_{k=1}^{\infty}$ defined by
$$\nu_k = \frac{k-1}{k}\, \delta_0 + \frac{1}{k}\, \delta_k \, .$$
Show that this sequence converges weakly to another distribution. From this, deduce that weak convergence does not imply convergence of expectations.
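A quick numerical illustration of the phenomenon in this exercise: as $k$ grows, almost all of the mass of $\nu_k$ sits at zero, so the sequence concentrates on the point mass at zero, yet the expectation under $\nu_k$ equals $\frac{k-1}{k} \cdot 0 + \frac{1}{k} \cdot k = 1$ for every $k$.

```python
for k in (2, 10, 100, 10_000):
    mass_at_zero = (k - 1) / k            # tends to 1
    mean = (k - 1) / k * 0 + (1 / k) * k  # equals 1 for every k
    print(k, mass_at_zero, mean)
```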
Exercise 4.22. Achab [2020] considers the random variables
$$\tilde{G}^\pi(x) = R + \gamma\, V^\pi(X') \, , \qquad X = x \, .$$
What does the distribution of this random variable capture about the underlying MDP and policy $\pi$? Using the tools of this chapter, derive an operator over $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ that has the collection of distributions corresponding to this random variable under each of the initial conditions in $\mathcal{X}$ as a fixed point, and analyse the properties of this operator.