Operators and Metrics

Anyone who has learned to play a musical instrument knows that practice makes

perfect. Along the way, however, one’s ability at playing a diﬃcult passage

usually varies according to a number of factors. On occasion, something that

could be played easily the day before now seems insurmountable. The adage

expresses an abstract notion – that practice improves performance, on average or

over a long period of time – rather than a concrete statement about instantaneous

ability.

In the same way, reinforcement learning algorithms deployed in real sit-

uations behave diﬀerently from moment to moment. Variations arise due to

diﬀerent initial conditions, speciﬁc choices of parameters, hardware nonde-

terminism, or simply because of randomness in the agent’s interactions with

its environment. These factors make it hard to make precise predictions, for example, about the magnitude of the value function estimate learned by TD learning at a particular state $x$ and step $k$, other than by extensive simulations.

Nevertheless, the large-scale behavior of TD learning is relatively predictable,

suﬃciently so that convergence can be established under certain conditions, and

convergence rates can be derived.

This chapter introduces the language of operators as an eﬀective abstrac-

tion with which to study such long-term behavior, characterize the asymptotic

properties of reinforcement learning algorithms, and eventually explain what

makes an eﬀective algorithm. In addition to being useful in the study of existing

algorithms, operators also serve as a kind of blueprint when designing new algo-

rithms, from which incremental methods such as categorical temporal-diﬀerence

learning can then be derived. In parallel, we will also explore probability metrics

– essentially distance functions between probability distributions. These metrics

play an immediate role in our analysis of the distributional Bellman operator,

and will recur in later chapters as we design algorithms for approximating return

distributions.

Draft version. 77

78 Chapter 4

4.1 The Bellman Operator

The value function $V^\pi$ characterizes the expected return obtained by following a policy $\pi$, beginning in a given state $x$:

$$V^\pi(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x \Big] .$$

The Bellman equation establishes a relationship between the expected return from one state and from its successors:

$$V^\pi(x) = \mathbb{E}_\pi\big[ R + \gamma V^\pi(X') \mid X = x \big] .$$

Let us now consider a state-indexed collection of real variables, written $V \in \mathbb{R}^{\mathcal{X}}$, which we call a value function estimate. By substituting $V$ for $V^\pi$ in the original Bellman equation, we obtain the system of equations

$$V(x) = \mathbb{E}_\pi\big[ R + \gamma V(X') \mid X = x \big] , \quad \text{for all } x \in \mathcal{X} . \tag{4.1}$$

From Chapter 2, we know that $V^\pi$ is one solution to the above.

Are there other solutions to Equation 4.1? In this chapter, we answer this question (negatively) by interpreting the right-hand side of the equation as applying a transformation on the estimate $V$. For a given realization $(x, a, r, x')$ of the random transition, this transformation indexes $V(x')$, multiplies it by the discount factor, and adds it to the immediate reward (this yields $r + \gamma V(x')$). The actual transformation returns the value that is obtained by following these steps, in expectation. Functions that map elements of a space onto itself, such as this one (from estimates to estimates), are called operators.

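Definition 4.1. The Bellman operator is the mapping $T^\pi : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by

$$(T^\pi V)(x) = \mathbb{E}_\pi\big[ R + \gamma V(X') \mid X = x \big] . \tag{4.2}$$

Here, the notation $T^\pi V$ should be understood as “$T^\pi$, applied to $V$.”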

The Bellman operator gives a particularly concise way of expressing the transformations implied in Equation 4.1:

$$V = T^\pi V .$$

As we will see in later chapters, it also serves as the springboard for the design and analysis of algorithms for learning $V^\pi$.

When working with the Bellman operator, it is often useful to treat $V$ as a finite-dimensional vector in $\mathbb{R}^{\mathcal{X}}$ and to express the Bellman operator in terms of vector operations. That is, we write

$$T^\pi V = r_\pi + \gamma P_\pi V , \tag{4.3}$$


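where $r_\pi(x) = \mathbb{E}_\pi[R \mid X = x]$ and $P_\pi$ is the transition operator defined as

$$(P_\pi V)(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a) V(x') .$$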

Equation 4.3 follows from these definitions and the linearity of expectations.

A vector $V \in \mathbb{R}^{\mathcal{X}}$ is a solution to Equation 4.1 if it remains unchanged by the transformation corresponding to the Bellman operator $T^\pi$; that is, if it is a fixed point of $T^\pi$. This means that the value function $V^\pi$ is a fixed point of $T^\pi$:

$$V^\pi = T^\pi V^\pi .$$

To demonstrate that $V^\pi$ is the only fixed point of the Bellman operator, we will appeal to the notion of contraction mappings.
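The vector form of Equation 4.3 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the three-state, two-action model, the array names `P`, `r`, `pi`, and all numbers are assumptions made for the example, not part of the text.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; every number here is an illustrative assumption.
gamma = 0.9
# P[a, x, y]: probability of moving from state x to state y under action a.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3], [0.0, 0.5, 0.5]],
    [[0.5, 0.0, 0.5], [0.0, 0.9, 0.1], [0.3, 0.3, 0.4]],
])
# r[a, x]: expected immediate reward for taking action a in state x.
r = np.array([[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]])
# pi[x, a]: probability that the policy selects action a in state x.
pi = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])

# Policy-averaged reward vector r_pi and transition matrix P_pi.
r_pi = np.einsum("xa,ax->x", pi, r)
P_pi = np.einsum("xa,axy->xy", pi, P)


def bellman_operator(V):
    """Apply T^pi in vector form: r_pi + gamma * P_pi V (Equation 4.3)."""
    return r_pi + gamma * P_pi @ V


# Solve the linear system V = r_pi + gamma * P_pi V to obtain the fixed point.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
```

Under these assumptions, `bellman_operator(V_pi)` agrees with `V_pi` up to floating-point error, which is the fixed-point property $V^\pi = T^\pi V^\pi$ in numerical form.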

4.2 Contraction Mappings

When we apply the operator $T^\pi$ to a value function estimate $V \in \mathbb{R}^{\mathcal{X}}$, we obtain a new estimate $T^\pi V \in \mathbb{R}^{\mathcal{X}}$. A characteristic property of the Bellman operator is that this new estimate is guaranteed to be closer to $V^\pi$ than $V$ (unless $V = V^\pi$, of course). In fact, as we will see in this section, applying the operator to any two estimates must bring them closer together.

To formalize what we mean by “closer,” we need a way of measuring distances between value function estimates. Because these estimates can be viewed as finite-dimensional vectors, there are many well-established ways of doing so: the reader may have come across the Euclidean ($L^2$) distance, the Manhattan ($L^1$) distance, and curios such as the British Rail distance. We use the term metric to describe distances that satisfy the following standard definition.

Definition 4.2. Given a set $M$, a metric $d : M \times M \to \mathbb{R}$ is a function that satisfies, for all $U, V, W \in M$,

(a) $d(U, V) \ge 0$,
(b) $d(U, V) = 0$ if and only if $U = V$,
(c) $d(U, V) \le d(U, W) + d(W, V)$,
(d) $d(U, V) = d(V, U)$.

We call the pair $(M, d)$ a metric space.

In our setting, $M$ is the space of value function estimates, $\mathbb{R}^{\mathcal{X}}$. Because we assume that there are finitely many states, this space can be equivalently thought of as the space of real-valued vectors with $N_{\mathcal{X}}$ entries, where $N_{\mathcal{X}}$ is the number of states. On this space, we measure distances in terms of the $L^\infty$ metric, defined by

$$\|V - V'\|_\infty = \max_{x \in \mathcal{X}} |V(x) - V'(x)| , \quad V, V' \in \mathbb{R}^{\mathcal{X}} . \tag{4.4}$$

24. It is also possible to express $P_\pi$ as a stochastic matrix, in which case $P_\pi V$ describes a matrix–vector multiplication. We will return to this point in Chapter 5.

A key result is that the Bellman operator $T^\pi$ is a contraction mapping with respect to this metric. Informally, this means that its application to different value function estimates brings them closer by at least a constant multiplicative factor, called its contraction modulus.

Definition 4.3. Let $(M, d)$ be a metric space. A function $\mathcal{O} : M \to M$ is a contraction mapping with respect to $d$ and with contraction modulus $\beta \in [0, 1)$, if for all $U, U' \in M$,

$$d(\mathcal{O}U, \mathcal{O}U') \le \beta\, d(U, U') .$$

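Proposition 4.4. The operator $T^\pi : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ is a contraction mapping with respect to the $L^\infty$ metric on $\mathbb{R}^{\mathcal{X}}$ with contraction modulus given by the discount factor $\gamma$. That is, for any two value functions $V, V' \in \mathbb{R}^{\mathcal{X}}$,

$$\|T^\pi V - T^\pi V'\|_\infty \le \gamma \|V - V'\|_\infty .$$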

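Proof. The proof is most easily stated in vector notation. Here, we make use of two properties of the operator $P_\pi$. First, $P_\pi$ is linear, in the sense that for any $V, V' \in \mathbb{R}^{\mathcal{X}}$,

$$P_\pi V + P_\pi V' = P_\pi (V + V') .$$

Second, because $(P_\pi V)(x)$ is a convex combination of elements from $V$, it must be that

$$\|P_\pi V\|_\infty \le \|V\|_\infty .$$

From here, we have

$$\begin{aligned}
\|T^\pi V - T^\pi V'\|_\infty &= \|(r_\pi + \gamma P_\pi V) - (r_\pi + \gamma P_\pi V')\|_\infty \\
&= \|\gamma P_\pi V - \gamma P_\pi V'\|_\infty \\
&= \gamma \|P_\pi (V - V')\|_\infty \\
&\le \gamma \|V - V'\|_\infty ,
\end{aligned}$$

as desired.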
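The inequality of Proposition 4.4 can also be checked numerically. The sketch below assumes a randomly generated row-stochastic matrix standing in for $P_\pi$ and a random reward vector; the names `P_pi`, `r_pi`, and `sup_norm` are hypothetical choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9

# Hypothetical policy-averaged model: random row-stochastic P_pi and reward vector r_pi.
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.normal(size=n_states)


def bellman_operator(V):
    """T^pi V = r_pi + gamma * P_pi V."""
    return r_pi + gamma * P_pi @ V


def sup_norm(V):
    """The L-infinity norm: largest absolute entry."""
    return np.max(np.abs(V))


# Ratio of distances after and before applying T^pi, over random pairs of estimates.
ratios = []
for _ in range(1000):
    V, V_prime = rng.normal(size=n_states), rng.normal(size=n_states)
    before = sup_norm(V - V_prime)
    after = sup_norm(bellman_operator(V) - bellman_operator(V_prime))
    ratios.append(after / before)

max_ratio = max(ratios)
```

In this sketch `max_ratio` never exceeds `gamma`. The reward vector cancels in the difference, so only the convex-combination property of `P_pi` matters, mirroring the proof.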

The fact that $T^\pi$ is a contraction mapping guarantees the uniqueness of $V^\pi$ as a solution to the equation $V = T^\pi V$. As made formal by the following proposition, because the operator $T^\pi$ brings any two value functions closer together, it cannot keep more than one value function fixed.


Proposition 4.5. Let $(M, d)$ be a metric space and $\mathcal{O} : M \to M$ be a contraction mapping. Then $\mathcal{O}$ has at most one fixed point in $M$.

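Proof. Let $\beta \in [0, 1)$ be the contraction modulus of $\mathcal{O}$, and suppose $U, U' \in M$ are distinct fixed points of $\mathcal{O}$, so that $d(U, U') > 0$ (following Definition 4.2). Then we have

$$d(U, U') = d(\mathcal{O}U, \mathcal{O}U') \le \beta\, d(U, U') ,$$

which is a contradiction, since $\beta < 1$ and $d(U, U') > 0$ imply $\beta\, d(U, U') < d(U, U')$.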

Because we know that $V^\pi$ is a fixed point of the Bellman operator $T^\pi$, following Proposition 4.5, we deduce that there are no other such fixed points – and hence no other solutions to the Bellman equation. As the phrasing of Proposition 4.5 suggests, in some metric spaces, it is possible for $\mathcal{O}$ to be a contraction mapping yet to not possess a fixed point. This can matter when dealing with return functions, as we will see in the second half of this chapter and in Chapter 5.

Example 4.6. Consider the no-loop operator

$$(T^\pi_{\mathrm{NL}} V)(x) = \mathbb{E}_\pi\big[ R + \gamma V(X') \mathbb{1}\{X' \neq x\} \mid X = x \big] ,$$

where the name denotes the fact that we omit the next-state value whenever a transition from $x$ to itself occurs. By inspection, we can determine that the fixed point of this operator is

$$V^\pi_{\mathrm{NL}}(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{T-1} \gamma^t R_t \,\Big|\, X_0 = x \Big] ,$$

where $T$ denotes the (random) first time at which $X_T = X_{T-1}$. In words, this fixed point describes the discounted sum of rewards obtained until the first time that an action leaves the state unchanged.

Exercise 4.1 asks you to show that $T^\pi_{\mathrm{NL}}$ is a contraction mapping with modulus

$$\beta = \gamma \max_{x \in \mathcal{X}} \mathbb{P}_\pi\big( X' \neq x \mid X = x \big) .$$

Following Proposition 4.5, we deduce that this is the unique fixed point to $T^\pi_{\mathrm{NL}}$.

25. The reader is invited to consider the kind of environments in which policies that maximize the no-loop return are substantially different from those that maximize the usual expected return.


When an operator $\mathcal{O}$ is contractive, we can also straightforwardly construct a mathematical approximation to its fixed point. This approximation is given by the sequence $(U_k)_{k \ge 0}$, defined by an initial value $U_0 \in M$, and the recursive relationship

$$U_{k+1} = \mathcal{O} U_k .$$

By contractivity, successive iterates of this sequence must come progressively closer to the operator’s fixed point. This is formalized by the following.

Proposition 4.7. Let $(M, d)$ be a metric space, and let $\mathcal{O} : M \to M$ be a contraction mapping with contraction modulus $\beta \in [0, 1)$ and fixed point $U^* \in M$. Then for any initial point $U_0 \in M$, the sequence $(U_k)_{k \ge 0}$ defined by $U_{k+1} = \mathcal{O} U_k$ is such that

$$d(U_k, U^*) \le \beta^k d(U_0, U^*) , \tag{4.5}$$

and in particular $d(U_k, U^*) \to 0$ as $k \to \infty$.

Proof. We will prove Equation 4.5 by induction, from which we obtain convergence of $(U_k)_{k \ge 0}$. For $k = 0$, Equation 4.5 trivially holds. Now suppose for some $k \ge 0$, we have

$$d(U_k, U^*) \le \beta^k d(U_0, U^*) .$$

Then note that

$$d(U_{k+1}, U^*) \overset{(a)}{=} d(\mathcal{O}U_k, \mathcal{O}U^*) \overset{(b)}{\le} \beta\, d(U_k, U^*) \overset{(c)}{\le} \beta^{k+1} d(U_0, U^*) ,$$

where (a) follows from the definition of the sequence $(U_k)_{k \ge 0}$ and the fact that $U^*$ is fixed by $\mathcal{O}$, (b) follows from the contractivity of $\mathcal{O}$, and (c) follows from the inductive hypothesis. By induction, we conclude that Equation 4.5 holds for all $k \in \mathbb{N}$.

In the case of the Bellman operator $T^\pi$, Proposition 4.7 means that repeated application of $T^\pi$ to any initial value function estimate $V_0 \in \mathbb{R}^{\mathcal{X}}$ produces a sequence of estimates $(V_k)_{k \ge 0}$ that are progressively closer to $V^\pi$. This observation serves as the starting point for a number of computational approaches that approximate $V^\pi$, including dynamic programming (Chapter 5) and temporal-difference learning (Chapter 6).
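To see the bound of Equation 4.5 at work, the following sketch iterates the Bellman operator on a small hypothetical two-state chain and records the distance to the fixed point at each step. The transition matrix, rewards, and initial estimate are assumptions chosen for illustration.

```python
import numpy as np

gamma = 0.9
# Hypothetical two-state chain under a fixed policy (illustrative numbers).
P_pi = np.array([[0.7, 0.3], [0.4, 0.6]])
r_pi = np.array([1.0, -1.0])

# Exact fixed point, obtained by solving V = r_pi + gamma * P_pi V.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V = np.array([10.0, -10.0])        # arbitrary initial estimate V_0
d0 = np.max(np.abs(V - V_pi))      # d(V_0, V^pi) in the L-infinity metric

errors = []
for k in range(50):
    errors.append(np.max(np.abs(V - V_pi)))   # d(V_k, V^pi)
    V = r_pi + gamma * P_pi @ V               # V_{k+1} = T^pi V_k

# Equation 4.5 predicts errors[k] <= gamma**k * d0 for every k.
bounds = [gamma**k * d0 for k in range(50)]
```

Comparing `errors` with `bounds` entry by entry shows the geometric decay guaranteed by Proposition 4.7. On this example the observed error can shrink faster than the bound, since $\gamma$ is only a worst-case rate.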

26. We use the term mathematical approximation to distinguish it from an approximation that can be computed. That is, there may or may not exist an algorithm that can determine the elements of the sequence $(U_k)_{k \ge 0}$ given the initial estimate $U_0$.


4.3 The Distributional Bellman Operator

Designing distributional reinforcement learning algorithms such as categorical

temporal-diﬀerence learning involves a few choices – such as how to represent

probability distributions in a computer’s memory – that do not have an equiva-

lent in classical reinforcement learning. Throughout this book, we will make

use of the distributional Bellman operator to understand and characterize many

of these choices. To begin, recall the random-variable Bellman equation:

$$G^\pi(x) \overset{\mathcal{D}}{=} R + \gamma G^\pi(X') , \quad X = x . \tag{4.6}$$

As in the expected-value setting, we construct a random-variable operator by viewing the right-hand side of Equation 4.6 as a transformation of $G^\pi$. In this case, we break down the transformation of $G^\pi$ into three operations, each of which produces a new random variable (Figure 4.1):

(a) $G^\pi(X')$: the indexing of the collection of random variables $G^\pi$ by $X'$;
(b) $\gamma G^\pi(X')$: the multiplication of the random variable $G^\pi(X')$ with the scalar $\gamma$;
(c) $R + \gamma G^\pi(X')$: the addition of two random variables ($R$ and $\gamma G^\pi(X')$).

More generally, we may apply these operations to any state-indexed collection of random variables $G = (G(x) : x \in \mathcal{X})$, taken to be independent of the random transitions used to define the transformation. With some mathematical caveats discussed below, let us introduce the random-variable Bellman operator

$$(T^\pi G)(x) \overset{\mathcal{D}}{=} R + \gamma G(X') , \quad X = x . \tag{4.7}$$

Equation 4.7 states that the application of the Bellman operator to $G$ (evaluated at $x$; the left-hand side) produces a random variable that is equal in distribution to the random variable constructed on the right-hand side. Because this holds for all $x$, we think of $T^\pi$ as mapping $G$ to a new collection of random variables $T^\pi G$.

The random-variable operator is appealing because it is concise and easily understood. In many circumstances, this makes it the tool of choice for reasoning about distributional reinforcement learning problems. One issue, however, is that its definition above is mathematically incomplete. This is because it specifies the probability distribution of $(T^\pi G)(x)$, but not its identity as a mapping from some sample space to the real numbers. As discussed in Section 2.7, without care we may produce random variables that exhibit undesirable behavior: for example, because rewards at different points in time are improperly correlated. More immediately, the theory of contraction mappings needs a clear definition of the space on which an operator is defined – in the case of the random-variable operator, this requires us to specify a space of random variables to operate in. Properly defining such a space is possible but requires some technical subtlety and measure-theoretic considerations; we refer the interested reader to Section 4.9.

Figure 4.1. The random-variable Bellman operator is composed of three operations: (a) indexing into a collection of random variables, (b) multiplication by the discount factor, and (c) addition of two random variables. Here, we assume that $R$ and $X'$ take on a single value for clarity.

A more direct solution is to consider the distributional Bellman operator as a mapping on the space of return-distribution functions. Starting with the distributional Bellman equation

$$\eta^\pi(x) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \eta^\pi(X') \mid X = x \big] ,$$

we again view the right-hand side as the result of applying a series of transformations, in this case to probability distributions.

Definition 4.8. The distributional Bellman operator $T^\pi : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is the mapping defined by

$$(T^\pi \eta)(x) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \eta(X') \mid X = x \big] . \tag{4.8}$$

Here, the operations on probability distributions are expressed (rather compactly) by the expectation in Equation 4.8 and the use of the pushforward distribution derived from the bootstrap function $b_{R,\gamma}$; these are the operations of mixing, scaling, and translation previously described in Section 2.8.

We can gain additional insight into how the operator transforms a return function $\eta$ by considering the situation in which the random reward $R$ and the return distributions $\eta(x)$ admit respective probability densities $p_R(r \mid x, a)$ and $p_x(z)$. In this case, the probability density of $(T^\pi \eta)(x)$, denoted $p'_x$, is

$$p'_x(z) = \gamma^{-1} \sum_{a \in \mathcal{A}} \pi(a \mid x) \int_{r \in \mathbb{R}} p_R(r \mid x, a) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a)\, p_{x'}\Big( \frac{z - r}{\gamma} \Big) \, \mathrm{d}r . \tag{4.9}$$

Expressed in terms of probability densities, the indexing of a collection of random variables becomes a mixture of densities, while their addition becomes a convolution; this is in fact what is depicted in Figure 4.1. In terms of cumulative distribution functions, we have

$$F_{(T^\pi \eta)(x)}(z) = \mathbb{E}_\pi\Big[ F_{\eta(X')}\Big( \frac{z - R}{\gamma} \Big) \,\Big|\, X = x \Big] .$$

However, we prefer the operator that deals directly with probability distributions (Equation 4.8) as it can be used to concisely express more complex operations on distributions. One such operation is the projection of a probability distribution onto a finitely parameterized set, which we will use in Chapter 5 to construct algorithms for approximating $\eta^\pi$.

Using Definition 4.8, we can formally express the fact that the return-distribution function $\eta^\pi$ is the only solution to the equation

$$\eta = T^\pi \eta .$$

The proof is relatively technical and will be given in Section 4.8.

Proposition 4.9. The return-distribution function $\eta^\pi$ satisfies

$$\eta^\pi = T^\pi \eta^\pi$$

and is the unique fixed point of the distributional Bellman operator $T^\pi$.

When working with the distributional Bellman operator, one should be mindful that the random reward $R$ and next state $X'$ are generally not independent, because they both depend on the chosen action $A$ (we briefly mentioned this concern in Section 2.8). In Equation 4.9, this is explicitly handled by the outer sum over actions. Analogously, we can make explicit the dependency on $A$ by introducing a second expectation in Equation 4.8:

$$(T^\pi \eta)(x) = \mathbb{E}_\pi\Big[ \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \eta(X') \mid X = x, A \big] \,\Big|\, X = x \Big] .$$

By conditioning the inner expectation on the action $A$, we make the random variables $R$ and $\gamma G(X')$ conditionally independent in the inner expectation. We will make use of this technique in proving Theorem 4.25, the main theoretical result of this chapter.

In some circumstances, it is useful to translate between operations on proba-

bility distributions and those on random variables. We do this by means of a

representative set of random variables called an instantiation.

Definition 4.10. Given a probability distribution $\nu \in \mathscr{P}(\mathbb{R})$, we say that a random variable $Z$ is an instantiation of $\nu$ if its distribution is $\nu$, written $Z \sim \nu$. Similarly, we say that a collection of random variables $G = (G(x) : x \in \mathcal{X})$ is an instantiation of a return-distribution function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ if for every $x \in \mathcal{X}$, we have $G(x) \sim \eta(x)$.


Given a return-distribution function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, the new return-distribution function $T^\pi \eta$ can be obtained by constructing an instantiation $G$ of $\eta$, performing the transformation on the collection of random variables $G$ as described at the beginning of this section, and then extracting the distributions of the resulting random variables. This is made formal as follows.

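Proposition 4.11. Let $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and let $G = (G(x) : x \in \mathcal{X})$ be an instantiation of $\eta$. For each $x \in \mathcal{X}$, let $(X = x, A, R, X')$ be a sample transition independent of $G$. Then $R + \gamma G(X')$ has the distribution $(T^\pi \eta)(x)$:

$$\mathcal{D}_\pi\big( R + \gamma G(X') \mid X = x \big) = (T^\pi \eta)(x) .$$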

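Proof. The result follows immediately from the definition of the distributional Bellman operator. For clarity, we step through the argument again, mirroring the transformations set out at the beginning of the section. First, the indexing transformation gives

$$\mathcal{D}_\pi\big( G(X') \mid X = x \big) = \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\, \eta(x') = \mathbb{E}_\pi\big[ \eta(X') \mid X = x \big] .$$

Next, scaling by $\gamma$ yields

$$\mathcal{D}_\pi\big( \gamma G(X') \mid X = x \big) = \mathbb{E}_\pi\big[ (b_{0,\gamma})_\# \eta(X') \mid X = x \big] ,$$

and finally adding the immediate reward $R$ gives the result

$$\mathcal{D}_\pi\big( R + \gamma G(X') \mid X = x \big) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_\# \eta(X') \mid X = x \big] .$$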

Proposition 4.11 is an instance of a recurring principle in distributional rein-

forcement learning that “diﬀerent routes lead to the same answer.” Throughout

this book, we will illustrate this point as it arises with a commutative diagram;

the particular case under consideration is depicted in Figure 4.2.
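The two routes of Proposition 4.11 can be compared on finitely supported distributions. The sketch below is a minimal illustration under assumed data: the dictionaries `transitions` and `eta`, their numbers, and the helper names are hypothetical, and the model is taken to be already averaged over the policy.

```python
import numpy as np

gamma = 0.5
# Hypothetical model: for each state x, the joint law of (R, X') given X = x,
# already averaged over the policy, as (probability, reward, next_state) triples.
transitions = {
    0: [(0.5, 1.0, 0), (0.5, 0.0, 1)],
    1: [(1.0, 2.0, 0)],
}
# A return-distribution function: eta[x] is a list of (atom, probability) pairs.
eta = {
    0: [(0.0, 0.5), (4.0, 0.5)],
    1: [(-1.0, 0.25), (1.0, 0.75)],
}


def distributional_bellman(eta):
    """Equation 4.8 for finitely supported distributions: mix, scale, translate."""
    new_eta = {}
    for x, outcomes in transitions.items():
        mass = {}
        for prob, reward, x_next in outcomes:
            for atom, p_atom in eta[x_next]:
                z = reward + gamma * atom              # bootstrap function b_{r,gamma}
                mass[z] = mass.get(z, 0.0) + prob * p_atom
        new_eta[x] = sorted(mass.items())
    return new_eta


def mean(dist):
    return sum(z * p for z, p in dist)


# Route 1: apply the operator directly to the distributions.
t_eta = distributional_bellman(eta)

# Route 2: instantiate eta as random variables and transform samples.
rng = np.random.default_rng(0)


def sample_transformed(x, n):
    """Draw n samples of R + gamma * G(X') given X = x."""
    probs, rewards, nexts = zip(*transitions[x])
    idx = rng.choice(len(probs), size=n, p=probs)
    out = np.empty(n)
    for i, j in enumerate(idx):
        atoms, p = zip(*eta[nexts[j]])
        out[i] = rewards[j] + gamma * rng.choice(atoms, p=p)
    return out


samples = sample_transformed(0, 5000)
```

In this sketch the sample mean of `samples` approximates `mean(t_eta[0])`, and the empirical frequencies of the sampled values approximate the probabilities stored in `t_eta[0]`. The dictionary representation is one convenient choice for small examples; it is not how the text represents distributions.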

4.4 Wasserstein Distances for Return Functions

Many desirable properties of reinforcement learning algorithms (for example, the fact that they produce a good approximation of the value function) are due to the contractive nature of the Bellman operator $T^\pi$. In this section, we will establish that the distributional Bellman operator $T^\pi$, too, is a contraction mapping – analogous to the value-based operator, the application of $T^\pi$ brings return functions closer together.

One difference between expected-value and distributional reinforcement learning is that the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is substantially different from the space of value functions. To measure distances between value functions, we can simply treat them as finite-dimensional vectors, taking the absolute difference of value estimates at individual states. By contrast, it is somewhat less intuitive to see what “close” means when comparing probability distributions. Throughout this chapter, we will consider a number of probability metrics that measure distances between distributions, each presenting different mathematical and computational properties. We begin with the family of Wasserstein distances.

Figure 4.2. A commutative diagram illustrating two perspectives on the application of the distributional Bellman operator. The top horizontal line represents the direct application to the return-distribution function $\eta$, yielding $T^\pi \eta$. The alternative path first instantiates the return-distribution function $\eta$ as a collection of random variables $G = (G(x) : x \in \mathcal{X})$, transforms $G$ to obtain another collection of random variables $G'$, and then extracts the distributions of these random variables to obtain $T^\pi \eta$.

Definition 4.12. Let $\nu \in \mathscr{P}(\mathbb{R})$ be a probability distribution with cumulative distribution function $F_\nu$. Let $Z$ be an instantiation of $\nu$ (in particular, $F_Z = F_\nu$). The generalized inverse $F_\nu^{-1}$ is given by

$$F_\nu^{-1}(\tau) = \inf_{z \in \mathbb{R}} \{ z : F_\nu(z) \ge \tau \} .$$

We additionally write $F_Z^{-1} = F_\nu^{-1}$.

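Definition 4.13. Let $p \in [1, \infty)$. The $p$-Wasserstein distance is a function $w_p : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ given by

$$w_p(\nu, \nu') = \Big( \int_0^1 \big| F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau) \big|^p \, \mathrm{d}\tau \Big)^{1/p} .$$

The $\infty$-Wasserstein distance $w_\infty : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ is

$$w_\infty(\nu, \nu') = \sup_{\tau \in (0,1)} \big| F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau) \big| .$$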

Graphically, the Wasserstein distances between two probability distributions measure the area between their cumulative distribution functions, with values along the abscissa taken to the $p$th power; see Figure 4.3. When $p = \infty$, this becomes the largest horizontal difference between the inverse cumulative distribution functions. The $p$-Wasserstein distances satisfy the definition of a metric, except that they may not be finite for arbitrary pairs of distributions in $\mathscr{P}(\mathbb{R})$; see Exercise 4.6. Properly speaking, they are said to be extended metrics, since they may take values on the real line extended to include infinity. Most probability metrics that we will consider are extended metrics rather than metrics in the sense of Definition 4.2. We measure distances between return-distribution functions in terms of the largest Wasserstein distance between probability distributions at individual states.
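For two empirical distributions with the same number of equally weighted samples, the inverse cumulative distribution functions are step functions on a common grid of quantile levels, so the integral in Definition 4.13 reduces to an average over sorted samples. The sketch below relies on that equal-size assumption; the function name `wasserstein_p` is a hypothetical choice.

```python
import numpy as np


def wasserstein_p(samples_a, samples_b, p):
    """p-Wasserstein distance between two empirical distributions with the
    same number of equally weighted samples. Sorting both samples evaluates
    the two inverse CDFs at matching quantile levels."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equally many samples on each side")
    gaps = np.abs(a - b)
    if np.isinf(p):
        return gaps.max()                     # largest quantile difference
    return np.mean(gaps**p) ** (1.0 / p)      # (integral over tau)^(1/p)


rng = np.random.default_rng(0)
z = rng.normal(size=1000)
shifted = z + 2.0   # translating a distribution by 2 moves every quantile by 2
```

Because translating a distribution moves every quantile by the same amount, `wasserstein_p(z, shifted, p)` equals the size of the shift for every `p`, including `np.inf`. Distributions with unequal sample counts or unequal weights would need the more general quantile-merging computation, which this sketch does not attempt.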

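Definition 4.14. Let $p \in [1, \infty]$. The supremum $p$-Wasserstein distance $\overline{w}_p$ between two return-distribution functions $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is defined by

$$\overline{w}_p(\eta, \eta') = \sup_{x \in \mathcal{X}} w_p\big( \eta(x), \eta'(x) \big) .$$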

The supremum $p$-Wasserstein distances fulfill all requirements of an extended metric on the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$; see Exercise 4.7. Based on these distances, we give our first contractivity result regarding the distributional Bellman operator; its proof is given at the end of the section.

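Proposition 4.15. The distributional Bellman operator is a contraction mapping on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ in the supremum $p$-Wasserstein distance, for all $p \in [1, \infty]$. More precisely,

$$\overline{w}_p(T^\pi \eta, T^\pi \eta') \le \gamma\, \overline{w}_p(\eta, \eta') ,$$

for all $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$.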

Proposition 4.15 is significant in that it establishes a close parallel between the expected-value and distributional operators. Following the line of reasoning given in Section 4.2, it provides the mathematical justification for the development and analysis of computational approaches for finding the return function $\eta^\pi$. More immediately, it also enables us to characterize the convergence of the sequence

$$\eta_{k+1} = T^\pi \eta_k \tag{4.10}$$

to the return function $\eta^\pi$.

27. Because we assume that there are finitely many states, we can equivalently write max in the definition of supremum distance. However, we prefer the more generally applicable sup.


Figure 4.3. Left: Illustration of the $p$-Wasserstein distance between a normal distribution and a mixture of two normal distributions

(

−

5) +

5).

Right: Illustration of the $\ell_p$ metric for the same distributions (see Section 4.5). In both cases, the shading indicates the axis along which the differences are taken to the $p$th exponent.

Proposition 4.16. Suppose that for each $(x, a) \in \mathcal{X} \times \mathcal{A}$, the reward distribution $P_{\mathcal{R}}(\cdot \mid x, a)$ is supported on the interval $[R_{\min}, R_{\max}]$. Then for any initial return function $\eta_0$ whose distributions are bounded on the interval $\big[ \tfrac{R_{\min}}{1-\gamma}, \tfrac{R_{\max}}{1-\gamma} \big]$, the sequence

$$\eta_{k+1} = T^\pi \eta_k$$

converges to $\eta^\pi$ in the supremum $p$-Wasserstein distance (for all $p \in [1, \infty]$).

The restriction to bounded rewards in Proposition 4.16 is necessary to make use of the tools developed in Section 4.2, at least without further qualification. This is because Proposition 4.7 requires all distances to be finite, which is not guaranteed under our definition of a probability metric. If, for example, the initial condition $\eta_0$ is such that

$$\overline{w}_p(\eta_0, \eta^\pi) = \infty ,$$

then Proposition 4.15 is not of much use. A less restrictive but more technically elaborate set of assumptions will be presented later in the chapter. For now, we provide the proof of the two preceding results. First, we obtain a reasonably simple proof of Proposition 4.15 by considering an alternative formulation of the $p$-Wasserstein distances in terms of couplings.

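Definition 4.17. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ be two probability distributions. A coupling between $\nu$ and $\nu'$ is a joint distribution $\upsilon \in \mathscr{P}(\mathbb{R}^2)$ such that if $(Z, Z')$ is an instantiation of $\upsilon$, then also $Z$ has distribution $\nu$ and $Z'$ has distribution $\nu'$. We write $\Gamma(\nu, \nu') \subseteq \mathscr{P}(\mathbb{R}^2)$ for the set of all couplings of $\nu$ and $\nu'$.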


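Proposition 4.18 (see Villani (2008) for a proof). Let $p \in [1, \infty)$. Expressed in terms of an optimal coupling, the $p$-Wasserstein distance between two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ is

$$w_p(\nu, \nu') = \min_{\upsilon \in \Gamma(\nu, \nu')} \mathbb{E}_{(Z, Z') \sim \upsilon}\big[ |Z - Z'|^p \big]^{1/p} .$$

The $\infty$-Wasserstein distance between $\nu$ and $\nu'$ can be written as

$$w_\infty(\nu, \nu') = \min_{\upsilon \in \Gamma(\nu, \nu')} \inf\big\{ z \in \mathbb{R} : \mathbb{P}_{(Z, Z') \sim \upsilon}(|Z - Z'| > z) = 0 \big\} .$$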

Informally, the optimal coupling finds an arrangement of the two probability distributions that maximizes “agreement”: it produces outcomes that are as close as possible. In Proposition 4.18, the optimal coupling takes on a very simple form given by inverse cumulative distribution functions. For $\nu, \nu' \in \mathscr{P}(\mathbb{R})$, an optimal coupling is the probability distribution of the random variable

$$\big( F_\nu^{-1}(\tau), F_{\nu'}^{-1}(\tau) \big) , \quad \tau \sim \mathcal{U}\big( [0, 1] \big) . \tag{4.11}$$

This can be understood by noting how the 1-Wasserstein distance between $\nu$ and $\nu'$ is obtained by measuring the horizontal distance between the two cumulative distribution functions, at each level $\tau \in [0, 1]$ (Figure 4.3).
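The role of the quantile coupling in Equation 4.11 can be illustrated on samples. In the sketch below, two sample sets of equal size stand in for $\nu$ and $\nu'$, and each way of pairing their elements defines a coupling of the two empirical distributions. The chosen distributions, sample size, and helper name `expected_cost` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(loc=0.0, scale=1.0, size=n)      # samples standing in for nu
z_prime = rng.exponential(scale=2.0, size=n)    # samples standing in for nu'


def expected_cost(a, b, p):
    """E[|Z - Z'|^p]^(1/p) under the coupling that pairs a[i] with b[i]."""
    return np.mean(np.abs(a - b) ** p) ** (1.0 / p)


p = 2
# Quantile coupling (Equation 4.11): pair equal quantile levels by sorting both samples.
cost_quantile = expected_cost(np.sort(z), np.sort(z_prime), p)
# Independent-style coupling: pair the samples in the arbitrary order they were drawn.
cost_independent = expected_cost(z, z_prime, p)
# Antimonotone coupling: pair the smallest values of one with the largest of the other.
cost_reversed = expected_cost(np.sort(z), np.sort(z_prime)[::-1], p)
```

Among these three pairings, the sorted one attains the smallest cost, consistent with the minimum over couplings in Proposition 4.18. The other two are still valid couplings, since each leaves both marginals unchanged.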

Proof of Proposition 4.15. Let $p \in [1, \infty)$ be fixed. For each $x \in \mathcal{X}$, consider the optimal coupling between $\eta(x)$ and $\eta'(x)$ and instantiate it as the pair of random variables $\big( G(x), G'(x) \big)$. Next, denote by $(X = x, A, R, X')$ the random transition beginning in $x \in \mathcal{X}$, constructed to be independent from $G(y)$ and $G'(y)$, for all $y \in \mathcal{X}$. With these variables, write

$$\tilde{G}(x) = R + \gamma G(X') , \qquad \tilde{G}'(x) = R + \gamma G'(X') .$$

By Proposition 4.11, $\tilde{G}(x)$ has distribution $(T^\pi \eta)(x)$ and $\tilde{G}'(x)$ has distribution $(T^\pi \eta')(x)$. The pair $\big( \tilde{G}(x), \tilde{G}'(x) \big)$ therefore forms a valid coupling of these distributions. Now

$$\begin{aligned}
w_p^p\big( (T^\pi \eta)(x), (T^\pi \eta')(x) \big)
&\overset{(a)}{\le} \mathbb{E}_\pi\Big[ \big| \big( R + \gamma G(X') \big) - \big( R + \gamma G'(X') \big) \big|^p \,\Big|\, X = x \Big] \\
&\overset{(b)}{=} \gamma^p\, \mathbb{E}_\pi\Big[ \big| G(X') - G'(X') \big|^p \,\Big|\, X = x \Big] \\
&\overset{(c)}{\le} \gamma^p \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\, \mathbb{E}\big[ | G(x') - G'(x') |^p \big] \\
&\overset{(d)}{\le} \gamma^p \sup_{x' \in \mathcal{X}} \mathbb{E}\big[ | G(x') - G'(x') |^p \big] \\
&\overset{(e)}{=} \gamma^p\, \overline{w}_p^p(\eta, \eta') .
\end{aligned}$$

Taking a supremum over $x \in \mathcal{X}$ on the left-hand side and the $p$th root of both sides yields the result. Here, (a) follows since the Wasserstein distance is defined as a minimum over couplings, (b) follows from algebraic manipulation of the expectation, (c) follows from independence of the sample transition $(X = x, A, R, X')$ and the random variables $(G(x), G'(x) : x \in \mathcal{X})$, (d) because the maximum of nonnegative quantities is at least as great as their weighted average, and (e) follows since $(G(x'), G'(x'))$ was defined as an optimal coupling of $\eta(x')$ and $\eta'(x')$. The proof for $p = \infty$ is similar (see Exercise 4.8).

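Proof of Proposition 4.16. Let us denote by $\mathscr{P}_B(\mathbb{R})$ the space of distributions bounded on $[V_{\min}, V_{\max}]$, where as usual

$$V_{\min} = \frac{R_{\min}}{1 - \gamma} , \qquad V_{\max} = \frac{R_{\max}}{1 - \gamma} .$$

We will show that under the assumption of rewards bounded on $[R_{\min}, R_{\max}]$,

(a) the return function $\eta^\pi$ is in $\mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$, and
(b) the distributional Bellman operator maps $\mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$ to itself.

Consequently, we can invoke Proposition 4.7 with $\mathcal{O} = T^\pi$ and $M = \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$ to conclude that for any initial $\eta_0 \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$, the sequence of iterates $(\eta_k)_{k \ge 0}$ converges to $\eta^\pi$ with respect to $d = \overline{w}_p$, for any $p \in [1, \infty]$.

To prove (a), note that for any state $x \in \mathcal{X}$,

$$G^\pi(x) = \sum_{t=0}^{\infty} \gamma^t R_t , \quad X_0 = x ,$$

and since $R_t \in [R_{\min}, R_{\max}]$ for all $t$, then also $G^\pi(x) \in [V_{\min}, V_{\max}]$. For (b), let $\eta \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$ and denote by $G$ an instantiation of this return-distribution function. For any $x \in \mathcal{X}$,

$$\begin{aligned}
\mathbb{P}_\pi\big( R + \gamma G(X') \le V_{\max} \mid X = x \big)
&= \mathbb{P}_\pi\big( \gamma G(X') \le V_{\max} - R \mid X = x \big) \\
&\ge \mathbb{P}_\pi\big( \gamma G(X') \le V_{\max} - R_{\max} \mid X = x \big) \\
&= \mathbb{P}_\pi\big( G(X') \le V_{\max} \mid X = x \big) \\
&= 1 ,
\end{aligned}$$

where the third line uses $V_{\max} - R_{\max} = \gamma V_{\max}$. By the same reasoning,

$$\mathbb{P}_\pi\big( R + \gamma G(X') \ge V_{\min} \mid X = x \big) = 1 .$$

Since $R + \gamma G(X'), X = x$ is an instantiation of $(T^\pi \eta)(x)$ for each $x$, we conclude that if $\eta \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$, then also $T^\pi \eta \in \mathscr{P}_B(\mathbb{R})^{\mathcal{X}}$.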


4.5 $\ell_p$ Probability Metrics and the Cramér Distance

The previous section established that the distributional Bellman operator is well behaved with respect to the family of Wasserstein distances. However, these are but a few among many standard probability metrics. We will see in Chapter 5 that theoretical analysis sometimes requires us to study the behavior of the distributional operator with respect to other metrics. In addition, many practical algorithms directly optimize a metric (typically expressed as a loss function) as part of their operation (see Chapter 10). The Cramér distance, a member of the broader family of $\ell_p$ metrics, is of particular interest to us.

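Definition 4.19. Let $p \in [1, \infty)$. The distance $\ell_p : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ is a probability metric defined by

$$\ell_p(\nu, \nu') = \Big( \int_{\mathbb{R}} | F_\nu(z) - F_{\nu'}(z) |^p \, \mathrm{d}z \Big)^{1/p} . \tag{4.12}$$

For $p = 2$, this is the Cramér distance. The $\ell_\infty$ or Kolmogorov–Smirnov distance is given by

$$\ell_\infty(\nu, \nu') = \sup_{z \in \mathbb{R}} \big| F_\nu(z) - F_{\nu'}(z) \big| .$$

The respective supremum $\ell_p$ distances are given by ($\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$)

$$\overline{\ell}_p(\eta, \eta') = \sup_{x \in \mathcal{X}} \ell_p\big( \eta(x), \eta'(x) \big) .$$

These are extended metrics on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.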

Where the $p$-Wasserstein distances measure differences in outcomes, the $\ell_p$ distances measure differences in the probabilities associated with these outcomes. This is because the exponent $p$ is applied to cumulative probabilities (this is illustrated in Figure 4.3). The distributional Bellman operator is also a contraction mapping under the $\ell_p$ distances for $p \in [1, \infty)$, albeit with a larger contraction modulus.

28. Historically, the Cramér distance has been defined as the square of $\ell_2$. In our context, it seems unambiguous to use the word for $\ell_2$ itself.

29. For $p = 1$, the distributional Bellman operator has contraction modulus $\gamma$; this is sensible given that $\ell_1 = w_1$.


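Proposition 4.20. For $p \in [1, \infty)$, the distributional Bellman operator $T^\pi$ is a contraction mapping on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ with respect to $\overline{\ell}_p$, with contraction modulus $\gamma^{1/p}$. That is,

$$\overline{\ell}_p(T^\pi \eta, T^\pi \eta') \le \gamma^{1/p}\, \overline{\ell}_p(\eta, \eta')$$

for all $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$.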

The proof of Proposition 4.20 will follow as a corollary of a more general

result given in Section 4.6. One way to relate it to our earlier result is to consider

the behavior of the sequence defined by

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k . \tag{4.13}$$

As measured in the $p$-Wasserstein distance, the sequence $(\eta_k)_{k \ge 0}$ approaches $\eta^\pi$ at a rate of $\gamma$; but if we instead measure distances using the $\ell_p$ metric, this rate is slower – only $\gamma^{1/p}$. Measured in terms of $\ell_\infty$ (the Kolmogorov–Smirnov distance), the sequence of iterates may in fact not seem to approach $\eta^\pi$ at all.

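To see this, it suffices to consider a single-state process with zero reward (that is, $\mathbb{P}_\pi(X' = x \mid X = x) = 1$ and $R = 0$) and a discount factor $\gamma = 0.9$. In this case, $\eta^\pi(x) = \delta_0$. For the initial condition $\eta_0(x) = \delta_1$, we obtain $\eta_1(x) = (\mathcal{T}^\pi \eta_0)(x) = \delta_\gamma$. Now, the (supremum) $\ell_\infty$ distance between $\eta^\pi$ and $\eta_0$ is 1, because for any $z \in (0, 1)$,

$$F_{\eta_0(x)}(z) = 0 , \qquad F_{\eta^\pi(x)}(z) = 1 .$$

However, the $\ell_\infty$ distance between $\eta^\pi$ and $\eta_1$ is also 1, by the same argument (but now restricted to $z \in (0, \gamma)$). Hence, there is no $\beta \in [0, 1)$ for which

$$\bar{\ell}_\infty(\eta_1, \eta^\pi) < \beta \, \bar{\ell}_\infty(\eta_0, \eta^\pi) .$$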

Exercise 4.16 asks you to prove a similar result for a probability metric called

the total variation distance (see also Figure 4.4).
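The single-state example above can be checked numerically. The sketch below relies on a closed form that holds only for Dirac deltas: $|F_{\delta_a} - F_{\delta_b}|$ equals 1 on the interval between $a$ and $b$ and 0 elsewhere, so $\ell_p(\delta_a, \delta_b) = |a - b|^{1/p}$ and $\ell_\infty(\delta_a, \delta_b) = 1$ whenever $a \ne b$. The helper name `dirac_lp` is an assumption for this illustration.

```python
import numpy as np

def dirac_lp(a, b, p):
    """l_p distance between the Dirac deltas at a and b (closed form, Diracs only)."""
    if np.isinf(p):
        return 0.0 if a == b else 1.0
    return abs(a - b) ** (1.0 / p)

gamma = 0.9
# Iterates of the example: eta_k(x) = delta_{gamma^k}, while eta^pi(x) = delta_0.
iterates = [gamma ** k for k in range(6)]

for name, p in [("l_1 (= w_1)", 1.0), ("l_2 (Cramer)", 2.0), ("l_inf (KS)", np.inf)]:
    dists = [dirac_lp(z, 0.0, p) for z in iterates]
    ratios = [dists[k + 1] / dists[k] for k in range(len(dists) - 1)]
    print(f"{name}: per-step ratio = {ratios[0]:.4f}")
```

The per-step ratios are $\gamma$ for $\ell_1$, $\gamma^{1/2}$ for the Cramér distance, and 1 for the Kolmogorov–Smirnov distance, matching the rates discussed in the text.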

The more general point is that diﬀerent probability metrics are sensitive to

diﬀerent characteristics of probability distributions, and to varying degrees.

At one extreme, the

∞

-Wasserstein distance is eﬀectively insensitive to the

probability associated with diﬀerent outcomes, while at the other extreme,

the Kolmogorov–Smirnov distance is insensitive to the scale of the diﬀerence

between outcomes. In Section 4.6, we will show that a metric’s sensitivity to

diﬀerences in outcomes determines the contraction modulus of the distributional

Bellman operator under that metric; informally speaking, this explains the “nice”

behavior of the distributional Bellman operator under the Wasserstein distances.

[Figure 4.4: panels (a) and (b) plot probability density against return; panels (c) and (d) plot cumulative probability against return.]

Figure 4.4
The distributional Bellman operator is not a contraction mapping in either the supremum form of (a, b) total variation distance (shaded in the top panels; see Exercise 4.16 for a definition) or (c, d) Kolmogorov–Smirnov distance $\ell_\infty$ (vertical distance in the bottom panels). The left panels show the density (a) and cumulative distribution function (c) of two distributions $\eta(x)$ (blue) and $\eta'(x)$ (red). The right panels show the same after applying the distributional Bellman operator (b, d), specifically considering the transformation induced by the discount factor $\gamma$. The lack of contractivity can be explained by the fact that neither the total variation distance nor $\ell_\infty$ is a homogeneous probability metric (Section 4.6).

Before moving on, let us summarize the results established thus far. By combining the theory of contraction mappings with suitable probability metrics, we were able to characterize the behavior of the iterates

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k . \tag{4.14}$$

In the following chapters, we will use this as the basis for the design of implementable algorithms that approximate the return distribution $\eta^\pi$ and will appeal to contraction mapping theory to provide theoretical guarantees for these algorithms. In particular, in Chapter 6, we will analyze categorical temporal-difference learning under the lens of the Cramér distance. While the results presented until now suffice for most practical purposes, the following sections deal with some of the more technical considerations that arise from studying Equation 4.14 under general conditions, particularly the issue of infinite distances between distributions.

4.6 Suﬃcient Conditions for Contractivity

In the remainder of this chapter, we characterize in greater generality the

behavior of the sequence of return function estimates described by Equation

4.14, viewed under the lens of diﬀerent probability metrics. We begin with a

formal deﬁnition of what it means for a function d to be a probability metric.

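Definition 4.21. A probability metric is an extended metric on the space of probability distributions, written

$$d : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty] .$$

Its supremum extension is the function $\bar{d} : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \times \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to \mathbb{R}$ defined as

$$\bar{d}(\eta, \eta') = \sup_{x \in \mathcal{X}} d\big(\eta(x), \eta'(x)\big) .$$

We refer to $\bar{d}$ as a return-function metric; it is an extended metric on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$. △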

Our analysis is based on three properties that a probability metric should possess in order to guarantee contractivity. These three properties relate closely to the three fundamental operations that make up the distributional Bellman operator: scaling, convolution, and mixture of distributions (equivalently: scaling, addition, and indexing of random variables). In this analysis, we will find that some properties are more easily stated in terms of random variables, others in terms of probability distributions. Accordingly, given two probability distributions $\nu, \nu'$ with instantiations $Z, Z'$, let us overload notation and write

$$d(Z, Z') = d(\nu, \nu') .$$

Definition 4.22. Let $c > 0$. The probability metric $d$ is $c$-homogeneous if for any scalar $\gamma \in [0, 1)$ and any two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$, we have

$$d(\gamma Z, \gamma Z') = \gamma^c \, d(Z, Z') .$$

In terms of probability distributions, this is equivalently given by the condition

$$d\big((b_{0,\gamma})_\# \nu, (b_{0,\gamma})_\# \nu'\big) = \gamma^c \, d(\nu, \nu') .$$

If no such $c$ exists, we say that $d$ is not homogeneous. △

Definition 4.23. The probability metric $d$ is regular if for any two distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$, and an independent random variable $W$, we have

$$d(W + Z, W + Z') \le d(Z, Z') . \tag{4.15}$$

In terms of distributions, this is

$$d\big(\mathbb{E}_W[(b_{W,1})_\# \nu], \mathbb{E}_W[(b_{W,1})_\# \nu']\big) \le d(\nu, \nu') .$$ △

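Definition 4.24. Given $p \in [1, \infty)$, the probability metric $d$ is $p$-convex if for any $\alpha \in (0, 1)$ and distributions $\nu_1, \nu_2, \nu'_1, \nu'_2 \in \mathscr{P}(\mathbb{R})$, we have

$$d^p\big(\alpha \nu_1 + (1 - \alpha) \nu_2, \, \alpha \nu'_1 + (1 - \alpha) \nu'_2\big) \le \alpha \, d^p(\nu_1, \nu'_1) + (1 - \alpha) \, d^p(\nu_2, \nu'_2) .$$ △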

Although this $p$-convexity property is given for a mixture of two distributions, it implies an analogous property for mixtures of finitely many distributions.

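Theorem 4.25. Consider a probability metric $d$. Suppose that $d$ is regular, $c$-homogeneous for some $c > 0$, and that there exists $p \in [1, \infty)$ such that $d$ is $p$-convex. Then for all return-distribution functions $\eta, \eta' \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, we have

$$\bar{d}(\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta') \le \gamma^c \, \bar{d}(\eta, \eta') .$$ △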

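Proof. Fix a state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. For this state, consider the sample transition $(X = x, A = a, R, X')$ (Equation 2.12), and recall that $R$ and $X'$ are independent given $X$ and $A$, since

$$R \sim P_{\mathcal{R}}(\cdot \mid X, A) , \qquad X' \sim P_{\mathcal{X}}(\cdot \mid X, A) .$$

Let $G$ and $G'$ be instantiations of $\eta$ and $\eta'$, respectively. We introduce a state-action variant of the distributional Bellman operator, $\mathcal{T} : \mathscr{P}(\mathbb{R})^{\mathcal{X}} \to \mathscr{P}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, given by

$$(\mathcal{T}\eta)(x, a) = \mathbb{E}\big[(b_{R,\gamma})_\# \eta(X') \mid X = x, A = a\big] .$$

Note that this operator is defined independently of the policy $\pi$, since the action $a$ is specified as an argument. We then calculate directly, indicating where each hypothesis of the result is used:

$$\begin{aligned}
d^p\big((\mathcal{T}\eta)(x, a), (\mathcal{T}\eta')(x, a)\big) &= d^p\big(R + \gamma G(X'), R + \gamma G'(X')\big) \\
&\overset{(a)}{\le} d^p\big(\gamma G(X'), \gamma G'(X')\big) \\
&\overset{(b)}{=} \gamma^{pc} \, d^p\big(G(X'), G'(X')\big) \\
&\overset{(c)}{\le} \gamma^{pc} \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x, A = a) \, d^p\big(G(x'), G'(x')\big) \\
&\le \gamma^{pc} \sup_{x' \in \mathcal{X}} d^p\big(G(x'), G'(x')\big) \\
&= \gamma^{pc} \, \bar{d}^p(\eta, \eta') .
\end{aligned}$$

Here, (a) follows from regularity of $d$, (b) follows from the $c$-homogeneous property of $d$, and (c) follows from $p$-convexity, where the mixture is over the values taken by the random variable $X'$. We also note that

$$(\mathcal{T}^\pi \eta)(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \, (\mathcal{T}\eta)(x, a) ,$$

and hence by $p$-convexity of $d$, we have

$$d^p\big((\mathcal{T}^\pi \eta)(x), (\mathcal{T}^\pi \eta')(x)\big) \le \sum_{a \in \mathcal{A}} \pi(a \mid x) \, d^p\big((\mathcal{T}\eta)(x, a), (\mathcal{T}\eta')(x, a)\big) \le \gamma^{pc} \, \bar{d}^p(\eta, \eta') .$$

Taking the supremum over $x \in \mathcal{X}$ on the left-hand side and taking $p$th roots then yields the result.

30. This matches the usual definition of convexity (for $d^p$) if one treats the pair $(\nu, \nu')$ as a single argument from $\mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R})$.

31. Note: the proof does not assume the independence of $G(x)$ and $G(y)$, $x \ne y$. See Exercise 4.12.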

Theorem 4.25 illustrates how the contractivity of the distributional Bellman operator in the probability metric $d$ (specifically, in its supremum extension) follows from natural properties of $d$. We see that the contraction modulus is closely tied to the homogeneity of $d$, which informally characterizes the extent to which scaling random variables by a factor $\gamma$ brings them "closer together." The theorem provides an alternative to our earlier result regarding Wasserstein distances and enables us to establish contractivity under the $\ell_p$ distances but also under other probability metrics. Exercise 4.19 explores contractivity under the so-called maximum mean discrepancy (MMD) family of distances.
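To make the contraction statement concrete, the sketch below applies the distributional Bellman operator to two randomly chosen return-distribution functions and compares their supremum Cramér distance before and after. It assumes a hypothetical two-state Markov reward process with deterministic rewards (the policy is folded into the transition matrix `P`), and represents each $\eta(x)$ as a finite list of atoms and probabilities; all names and numbers are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def lp_distance(a, b, p=2.0):
    """l_p distance between finitely supported distributions a, b = (atoms, probs)."""
    (za, pa), (zb, pb) = a, b
    grid = np.union1d(za, zb)
    gap = np.abs(np.array([pa[za <= z].sum() - pb[zb <= z].sum() for z in grid]))
    return float(np.sum(gap[:-1] ** p * np.diff(grid)) ** (1.0 / p))

def bellman(eta, P, r, gamma):
    """(T eta)(x) = sum over x' of P[x, x'] times the pushforward of eta(x') by z -> r[x] + gamma z."""
    n = len(eta)
    new_eta = []
    for x in range(n):
        atoms = np.concatenate([r[x] + gamma * eta[y][0] for y in range(n)])
        probs = np.concatenate([P[x, y] * eta[y][1] for y in range(n)])
        new_eta.append((atoms, probs))
    return new_eta

def sup_lp(eta, eta_prime, p=2.0):
    """Supremum extension: largest l_p distance over states."""
    return max(lp_distance(a, b, p) for a, b in zip(eta, eta_prime))

rng = np.random.default_rng(0)
gamma, p = 0.9, 2.0
P = np.array([[0.7, 0.3], [0.4, 0.6]])  # hypothetical transition matrix under the policy
r = np.array([1.0, -0.5])               # hypothetical deterministic rewards

def random_eta():
    # Five equally weighted atoms per state.
    return [(rng.normal(size=5), np.full(5, 0.2)) for _ in range(2)]

eta, eta_prime = random_eta(), random_eta()
before = sup_lp(eta, eta_prime, p)
after = sup_lp(bellman(eta, P, r, gamma), bellman(eta_prime, P, r, gamma), p)
print(after, "<=", gamma ** (1 / p) * before)
```

Proposition 4.20 guarantees the printed inequality for every choice of the two return-distribution functions; the random draw here is only one instance.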

Proof of Proposition 4.20 (contractivity in $\ell_p$ distances). We will apply Theorem 4.25 to the probability metric $\ell_p$, for $p \in [1, \infty)$. It is therefore sufficient to demonstrate that $\ell_p$ is $1/p$-homogeneous, regular, and $p$-convex.

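$1/p$-homogeneity. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with associated random variables $Z, Z'$. We make use of the fact that for $\gamma \in [0, 1)$,

$$F_{\gamma Z}(z) = F_Z\!\left(\tfrac{z}{\gamma}\right) .$$

Writing $\ell_p^p$ for the $p$th power of $\ell_p$, we have

$$\begin{aligned}
\ell_p^p(\gamma Z, \gamma Z') &= \int_{\mathbb{R}} |F_{\gamma Z}(z) - F_{\gamma Z'}(z)|^p \, \mathrm{d}z \\
&= \int_{\mathbb{R}} \left|F_Z\!\left(\tfrac{z}{\gamma}\right) - F_{Z'}\!\left(\tfrac{z}{\gamma}\right)\right|^p \mathrm{d}z \\
&\overset{(a)}{=} \int_{\mathbb{R}} \gamma \, |F_Z(z) - F_{Z'}(z)|^p \, \mathrm{d}z \\
&= \gamma \, \ell_p^p(Z, Z') ,
\end{aligned} \tag{4.16}$$

where (a) follows from a change of variables $z/\gamma \mapsto z$ in the integral. Therefore, we deduce that $\ell_p(\gamma Z, \gamma Z') = \gamma^{1/p} \, \ell_p(Z, Z')$.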

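Regularity. Let $\nu, \nu' \in \mathscr{P}(\mathbb{R})$, with $Z, Z'$ independent instantiations of $\nu$ and $\nu'$, respectively, and let $W$ be a random variable independent of $Z, Z'$. Then,

$$\begin{aligned}
\ell_p^p(W + Z, W + Z') &= \int_{\mathbb{R}} |F_{W+Z}(z) - F_{W+Z'}(z)|^p \, \mathrm{d}z \\
&= \int_{\mathbb{R}} \big|\mathbb{E}_W[F_Z(z - W)] - \mathbb{E}_W[F_{Z'}(z - W)]\big|^p \, \mathrm{d}z \\
&\overset{(a)}{\le} \int_{\mathbb{R}} \mathbb{E}_W\big[|F_Z(z - W) - F_{Z'}(z - W)|^p\big] \, \mathrm{d}z \\
&\overset{(b)}{=} \mathbb{E}_W\left[\int_{\mathbb{R}} |F_Z(z - W) - F_{Z'}(z - W)|^p \, \mathrm{d}z\right] \\
&= \ell_p^p(Z, Z') ,
\end{aligned}$$

where (a) follows from Jensen's inequality, and (b) follows by swapping the integral and expectation (more formally justified by Tonelli's theorem).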

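$p$-convexity. Let $\alpha \in (0, 1)$, and $\nu_1, \nu_2, \nu'_1, \nu'_2 \in \mathscr{P}(\mathbb{R})$. Note that

$$F_{\alpha \nu_1 + (1 - \alpha) \nu_2}(z) = \alpha F_{\nu_1}(z) + (1 - \alpha) F_{\nu_2}(z) ,$$

and similarly for the primed distributions $\nu'_1, \nu'_2$. By convexity of the function $z \mapsto |z|^p$ on the real numbers and Jensen's inequality,

$$\big|F_{\alpha \nu_1 + (1 - \alpha) \nu_2}(z) - F_{\alpha \nu'_1 + (1 - \alpha) \nu'_2}(z)\big|^p \le \alpha \, |F_{\nu_1}(z) - F_{\nu'_1}(z)|^p + (1 - \alpha) \, |F_{\nu_2}(z) - F_{\nu'_2}(z)|^p$$

for all $z \in \mathbb{R}$. Hence,

$$\ell_p^p\big(\alpha \nu_1 + (1 - \alpha) \nu_2, \, \alpha \nu'_1 + (1 - \alpha) \nu'_2\big) \le \alpha \, \ell_p^p(\nu_1, \nu'_1) + (1 - \alpha) \, \ell_p^p(\nu_2, \nu'_2) ,$$

and $\ell_p$ is $p$-convex.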
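The three properties established in this proof can be spot-checked numerically on finitely supported distributions. The sketch below is only a sanity check under stated assumptions: the helper names (`lp_p`, `scale`, `add_independent`, `mix`), the use of `mu1`, `mu2` for the primed distributions, and the randomly drawn test distributions are all hypothetical choices for illustration.

```python
import numpy as np

def lp_p(a, b, p):
    """p-th power of l_p between finitely supported distributions (atoms, probs)."""
    (za, pa), (zb, pb) = a, b
    grid = np.union1d(za, zb)
    gap = np.abs(np.array([pa[za <= z].sum() - pb[zb <= z].sum() for z in grid]))
    return float(np.sum(gap[:-1] ** p * np.diff(grid)))

def scale(nu, gamma):
    """Distribution of gamma * Z."""
    return (gamma * nu[0], nu[1])

def add_independent(w, nu):
    """Distribution of W + Z when W and Z are independent."""
    atoms = (w[0][:, None] + nu[0][None, :]).ravel()
    probs = (w[1][:, None] * nu[1][None, :]).ravel()
    return (atoms, probs)

def mix(alpha, nu1, nu2):
    """Mixture alpha * nu1 + (1 - alpha) * nu2."""
    return (np.concatenate([nu1[0], nu2[0]]),
            np.concatenate([alpha * nu1[1], (1 - alpha) * nu2[1]]))

rng = np.random.default_rng(1)

def random_dist(n=4):
    return (rng.normal(size=n), rng.dirichlet(np.ones(n)))

p, gamma, alpha = 3.0, 0.8, 0.3
nu1, nu2, mu1, mu2, w = (random_dist() for _ in range(5))

# Equation 4.16: the p-th power of l_p scales by exactly gamma.
homogeneity_gap = lp_p(scale(nu1, gamma), scale(mu1, gamma), p) - gamma * lp_p(nu1, mu1, p)
# Regularity: adding an independent W cannot increase the distance.
regularity_ok = (lp_p(add_independent(w, nu1), add_independent(w, mu1), p)
                 <= lp_p(nu1, mu1, p) + 1e-12)
# p-convexity of the mixture.
convexity_ok = (lp_p(mix(alpha, nu1, nu2), mix(alpha, mu1, mu2), p)
                <= alpha * lp_p(nu1, mu1, p) + (1 - alpha) * lp_p(nu2, mu2, p) + 1e-12)
print(homogeneity_gap, regularity_ok, convexity_ok)
```

A passing check on one random draw is of course no substitute for the proof; it is meant only to make the three statements tangible.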

4.7 A Matter of Domain

Suppose that we have demonstrated, by means of Theorem 4.25, that the distributional Bellman operator is a contraction mapping in the supremum extension $\bar{d}$ of some probability metric $d$. Is this sufficient to guarantee that the sequence

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$

converges to the return function $\eta^\pi$, by means of Proposition 4.7? In general, no, because $d$ may assign infinite distances to certain pairs of distributions. To invoke Proposition 4.7, we identify a subset of probability distributions $\mathscr{P}_d(\mathbb{R})$ that are all within finite $d$-distance of each other and then ensure that the distributional Bellman operator is well behaved on this subset. Specifically, we identify a set of conditions under which

(a) the distributional Bellman operator $\mathcal{T}^\pi$ maps $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$ to itself, and

(b) the return function $\eta^\pi$ (the fixed point of $\mathcal{T}^\pi$) lies in $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.

For most common probability metrics and natural problem settings, these requirements are easily verified. In Proposition 4.16, for example, we demonstrated that when the reward distributions are bounded, Proposition 4.7 can be applied with the Wasserstein distances. The aim of this section is to extend the analysis to a broader set of probability metrics and also to a greater number of problem settings, including those where the reward distributions are not bounded.

Definition 4.26. Let $d$ be a probability metric. Its finite domain $\mathscr{P}_d(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ is the set of probability distributions with finite first moment and finite $d$-distance to the distribution that puts all of its mass on zero:

$$\mathscr{P}_d(\mathbb{R}) = \big\{ \nu \in \mathscr{P}(\mathbb{R}) : d(\nu, \delta_0) < \infty, \; \mathbb{E}_{Z \sim \nu}[|Z|] < \infty \big\} . \tag{4.17}$$

By the triangle inequality, for any two distributions $\nu, \nu' \in \mathscr{P}_d(\mathbb{R})$, we are guaranteed $d(\nu, \nu') < \infty$. Although the choice of $\delta_0$ as the reference point is somewhat arbitrary, it is sensible given that many reinforcement learning problems include the possibility of receiving no reward at all (e.g., $G^\pi(x) = 0$). The finite first-moment assumption is made in light of Assumption 2.5, which guarantees that return distributions have well-defined expectations.

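Proposition 4.27. Let $d$ be a probability metric satisfying the conditions of Theorem 4.25, with finite domain $\mathscr{P}_d(\mathbb{R})$. Let $\mathcal{T}^\pi$ be the distributional Bellman operator corresponding to a given Markov decision process $(\mathcal{X}, \mathcal{A}, \xi_0, P_{\mathcal{X}}, P_{\mathcal{R}})$. Suppose that

(a) $\eta^\pi \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, and

(b) $\mathscr{P}_d(\mathbb{R})$ is closed under $\mathcal{T}^\pi$: for any $\eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, we have that $\mathcal{T}^\pi \eta \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$.

Then for any initial condition $\eta_0 \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, the sequence of iterates defined by

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$

converges to $\eta^\pi$ with respect to $\bar{d}$. △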

Proposition 4.27 is a specialization of Proposition 4.7 to the distributional setting and generalizes our earlier result regarding bounded reward distributions. Effectively, it allows us to prove the convergence of the sequence $(\eta_k)_{k \ge 0}$ for a family of Markov decision processes satisfying the two conditions above. The condition $\eta^\pi \in \mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$, while seemingly benign, does not automatically hold; Exercise 4.20 illustrates the issue using a modified $p$-Wasserstein distance.

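Example 4.28 ($\ell_p$ metrics). For a given $p \in [1, \infty)$, the finite domain of the probability metric $\ell_p$ is

$$\mathscr{P}_{\ell_p}(\mathbb{R}) = \{ \nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|] < \infty \} ,$$

the set of distributions with finite first moment. This follows because

$$\begin{aligned}
\ell_p^p(\nu, \delta_0) &= \int_{-\infty}^{0} F_\nu(z)^p \, \mathrm{d}z + \int_{0}^{\infty} (1 - F_\nu(z))^p \, \mathrm{d}z \\
&\le \int_{-\infty}^{0} F_\nu(z) \, \mathrm{d}z + \int_{0}^{\infty} (1 - F_\nu(z)) \, \mathrm{d}z \\
&\overset{(a)}{=} \mathbb{E}[\max(0, -Z)] + \mathbb{E}[\max(0, Z)] \\
&= \mathbb{E}[|Z|] ,
\end{aligned}$$

where (a) follows from expressing $F_\nu(z)$ and $1 - F_\nu(z)$ as integrals and using Tonelli's theorem:

$$\begin{aligned}
\int_{0}^{\infty} (1 - F_\nu(z)) \, \mathrm{d}z &= \int_{0}^{\infty} \mathbb{E}_{Z \sim \nu}\big[\mathbb{1}\{Z > z\}\big] \, \mathrm{d}z \\
&= \mathbb{E}_{Z \sim \nu}\left[\int_{0}^{\infty} \mathbb{1}\{Z > z\} \, \mathrm{d}z\right] \\
&= \mathbb{E}_{Z \sim \nu}[\max(0, Z)] ,
\end{aligned}$$

and similarly for the integral from $-\infty$ to 0.

The conditions of Proposition 4.27 are guaranteed by Assumption 2.5 (rewards have finite first moments). From Chapter 2, we know that under this assumption, the random return has finite expected value and hence $\eta^\pi(x) \in \mathscr{P}_{\ell_p}(\mathbb{R})$, for all $x \in \mathcal{X}$. Similarly, we can show from elementary operations that if $R$ and $G(x)$ satisfy

$$\mathbb{E}_\pi\big[|R| \mid X = x\big] < \infty , \quad \mathbb{E}\big[|G(x)|\big] < \infty , \quad \text{for all } x \in \mathcal{X} ,$$

then also

$$\mathbb{E}_\pi\big[|R + \gamma G(X')| \mid X = x\big] < \infty , \quad \text{for all } x \in \mathcal{X} .$$

Following Proposition 4.27, provided that $\eta_0 \in \mathscr{P}_{\ell_p}(\mathbb{R})^{\mathcal{X}}$, then the sequence $(\eta_k)_{k \ge 0}$ converges to $\eta^\pi$ with respect to $\bar{\ell}_p$. △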
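The bound $\ell_p^p(\nu, \delta_0) \le \mathbb{E}[|Z|]$ from Example 4.28 can be checked on a finitely supported distribution. In the sketch below, the function name `lp_p_to_dirac0` and the randomly generated distribution are assumptions for illustration; for $p = 1$ the bound holds with equality, which the code also prints.

```python
import numpy as np

def lp_p_to_dirac0(atoms, probs, p):
    """p-th power of l_p(nu, delta_0) for a finitely supported nu."""
    atoms, probs = np.asarray(atoms, float), np.asarray(probs, float)
    grid = np.union1d(atoms, [0.0])  # breakpoints of F_nu together with the atom of delta_0
    cdf = np.array([probs[atoms <= z].sum() for z in grid])
    # |F_nu - F_{delta_0}| equals F_nu below zero and 1 - F_nu from zero onwards.
    gap = np.abs(cdf - (grid >= 0.0))
    return float(np.sum(gap[:-1] ** p * np.diff(grid)))

rng = np.random.default_rng(2)
atoms = rng.normal(scale=3.0, size=6)
probs = rng.dirichlet(np.ones(6))
first_moment = float(np.sum(probs * np.abs(atoms)))  # E|Z|

for p in (1.0, 2.0, 4.0):
    print(p, lp_p_to_dirac0(atoms, probs, p), "<=", first_moment)
```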

When $d$ is the $p$-Wasserstein distance ($p \in [1, \infty]$), the finite domain takes on a particularly useful form that we denote by $\mathscr{P}_p(\mathbb{R})$:

$$\mathscr{P}_p(\mathbb{R}) = \big\{ \nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|^p] < \infty \big\} , \quad p \in [1, \infty) ,$$

$$\mathscr{P}_\infty(\mathbb{R}) = \big\{ \nu \in \mathscr{P}(\mathbb{R}) : \exists C > 0 \text{ s.t. } \nu([-C, C]) = 1 \big\} .$$

For $p < \infty$, this is the set of distributions with bounded $p$th moments; for $p = \infty$, the set of distributions with bounded support. In particular, observe that the finite domains of $\ell_p$ and $w_1$ coincide: $\mathscr{P}_{\ell_p}(\mathbb{R}) = \mathscr{P}_1(\mathbb{R})$.

As with the $\ell_p$ distances, we can satisfy the conditions of Proposition 4.27 for $w_p$ by introducing an assumption on the reward distributions. In this case, we simply require that these be in $\mathscr{P}_p(\mathbb{R})$. As this assumption will recur throughout the book, we state it here in full; Exercise 4.11 goes through the steps of the corresponding result.

Assumption 4.29($p$). For each state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$, the reward distribution $P_{\mathcal{R}}(\cdot \mid x, a)$ is in $\mathscr{P}_p(\mathbb{R})$. △

Proposition 4.30. Let $p \in [1, \infty]$. Under Assumption 4.29($p$), the return function $\eta^\pi$ has finite $p$th moments (for $p = \infty$, it is bounded). In addition, for any initial return function $\eta_0 \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$, the sequence defined by

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$

converges to $\eta^\pi$ with respect to the supremum $p$-Wasserstein metric. △

4.8 Weak Convergence of Return Functions*

Proposition 4.27 implies that if each distribution $\eta^\pi(x)$ lies in the finite domain $\mathscr{P}_d(\mathbb{R})$ of a given probability metric $d$ that is regular, $c$-homogeneous, and $p$-convex, then $\eta^\pi$ is the unique solution to the equation

$$\eta = \mathcal{T}^\pi \eta \tag{4.18}$$

in the space $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$. It does not, however, rule out the existence of solutions outside this space. This concern can be addressed by showing that for any $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, the sequence of probability distributions $(\eta_k(x))_{k \ge 0}$ defined by

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k$$

converges weakly to the return distribution $\eta^\pi(x)$, for each state $x \in \mathcal{X}$. In addition to giving an alternative perspective on the quantitative convergence results of these iterates, the uniqueness of $\eta^\pi$ as a solution to Equation 4.18 (stated as Proposition 4.9) follows immediately from Proposition 4.34 below.

Definition 4.31. Let $(\nu_k)_{k \ge 0}$ be a sequence of distributions in $\mathscr{P}(\mathbb{R})$, and let $\nu \in \mathscr{P}(\mathbb{R})$ be another probability distribution. We say that $(\nu_k)_{k \ge 0}$ converges weakly to $\nu$ if for every $z \in \mathbb{R}$ at which $F_\nu$ is continuous, we have $F_{\nu_k}(z) \to F_\nu(z)$. △

We will show that for each $x \in \mathcal{X}$, the sequence $(\eta_k(x))_{k \ge 0}$ converges weakly to $\eta^\pi(x)$. A simple approach is to consider the relationships between well-chosen instantiations of $\eta_k$ (for each $k \in \mathbb{N}$) and $\eta^\pi$, by means of the following classical result (see e.g., Billingsley 2012). Recall that a sequence of random variables $(Z_k)_{k \ge 0}$ converges to $Z$ with probability 1 if

$$\mathbb{P}\big(\lim_{k \to \infty} Z_k = Z\big) = 1 .$$

Lemma 4.32. Let $(\nu_k)_{k \ge 0}$ be a sequence in $\mathscr{P}(\mathbb{R})$ and $\nu \in \mathscr{P}(\mathbb{R})$ be another probability distribution. Let $(Z_k)_{k \ge 0}$ and $Z$ be instantiations of these distributions all defined on the same probability space. If $Z_k \to Z$ with probability 1, then $\nu_k \to \nu$ weakly. △

Lemma 4.32 is not a uniformly applicable approach to demonstrating weak convergence; such instantiations always exist by Skorokhod's representation theorem (Billingsley 2012, Section 25), but finding them is not always straightforward. However, in our case, there is a very natural set of instantiations that work, constructed by the following result.

Lemma 4.33. Let $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and let $G$ be an instantiation of $\eta$. For $x \in \mathcal{X}$, if $(X_t, A_t, R_t)_{t \ge 0}$ is a random trajectory with initial state $X_0 = x$ and generated by following $\pi$, independent of $G$, then

$$\sum_{t=0}^{k-1} \gamma^t R_t + \gamma^k G(X_k)$$

is an instantiation of $((\mathcal{T}^\pi)^k \eta)(x)$. △

Proof. This follows by inductively applying Proposition 4.11.

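Proposition 4.34. Let $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$, and for $k \ge 0$ define

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k .$$

Then we have $\eta_k(x) \to \eta^\pi(x)$ weakly for each $x \in \mathcal{X}$, and consequently $\eta^\pi$ is the unique fixed point of $\mathcal{T}^\pi$ in $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$. △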

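Proof. Fix $x \in \mathcal{X}$, let $G_0$ be an instantiation of $\eta_0$, and on the same probability space, let $(X_t, A_t, R_t)_{t \ge 0}$ be a trajectory generated by $\pi$ with initial state $X_0 = x$, independent of $G_0$ (the existence of such a probability space is guaranteed by the Ionescu–Tulcea theorem, as described in Remark 2.1). Then

$$G_k(x) = \sum_{t=0}^{k-1} \gamma^t R_t + \gamma^k G_0(X_k)$$

is an instantiation of $\eta_k(x)$ by Lemma 4.33. Furthermore, $\sum_{t=0}^{\infty} \gamma^t R_t$ is an instantiation of $\eta^\pi(x)$. We have

$$\Big| G_k(x) - \sum_{t=0}^{\infty} \gamma^t R_t \Big| = \Big| \gamma^k G_0(X_k) - \sum_{t=k}^{\infty} \gamma^t R_t \Big| \le \gamma^k \big| G_0(X_k) \big| + \Big| \sum_{t=k}^{\infty} \gamma^t R_t \Big| \to 0 ,$$

with probability 1. The convergence of the first term follows since $|G_0(X_k)| \le \max_{x \in \mathcal{X}} |G_0(x)|$ is bounded with probability 1 (w.p. 1) and $\gamma^k \to 0$, and the convergence of the second term follows from convergence of $\sum_{t=0}^{k-1} \gamma^t R_t$ to $\sum_{t=0}^{\infty} \gamma^t R_t$ w.p. 1. Therefore, we have $G_k(x) \to \sum_{t=0}^{\infty} \gamma^t R_t$ w.p. 1, and so $\eta_k(x) \to \eta^\pi(x)$ weakly. Finally, this implies that there can be no other fixed point of $\mathcal{T}^\pi$ in $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$; if $\eta_0$ were such a fixed point, then we would have $\eta_k = \eta_0$ for all $k \ge 0$ and would simultaneously have $\eta_k(x) \to \eta^\pi(x)$ weakly for all $x \in \mathcal{X}$, a contradiction unless $\eta_0 = \eta^\pi$.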
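The coupling used in this proof can be simulated along a single trajectory. The sketch below assumes a hypothetical two-state Markov reward process (policy folded into `P`), bounded reward noise, and a long truncation horizon `T` standing in for the infinite sum; it then compares $G_k(x)$ from Lemma 4.33 with the discounted return of the same trajectory. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.9
P = np.array([[0.7, 0.3], [0.4, 0.6]])  # hypothetical transition matrix under the policy
reward_mean = np.array([1.0, -0.5])      # hypothetical mean rewards; noise is uniform on [-1, 1]
G0 = rng.normal(size=2)                  # one draw of G_0, independent of the trajectory

T = 400                                  # truncation horizon standing in for infinity
states = np.zeros(T + 1, dtype=int)      # X_0 = 0
for t in range(T):
    states[t + 1] = rng.choice(2, p=P[states[t]])
rewards = reward_mean[states[:T]] + rng.uniform(-1.0, 1.0, size=T)

discounted = gamma ** np.arange(T) * rewards
full_return = discounted.sum()           # approximates the sum of gamma^t R_t over all t

def G_k(k):
    """Instantiation of eta_k(x) from Lemma 4.33, built on this one trajectory."""
    return discounted[:k].sum() + gamma ** k * G0[states[k]]

r_max = np.abs(reward_mean).max() + 1.0  # bound on |R_t| in this toy process
for k in (0, 10, 50, 100):
    error = abs(G_k(k) - full_return)
    bound = gamma ** k * (np.abs(G0).max() + r_max / (1 - gamma))
    print(k, error, bound)
```

The printed error is dominated by a bound of order $\gamma^k$, mirroring the two terms in the displayed inequality; note that this geometric bound relies on the rewards being bounded in the toy process, which the proposition itself does not require.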

As the name indicates, the notion of weak convergence is not as strong as

many other notions of convergence of probability distributions. In general, it

does not even guarantee convergence of the mean of the sequence of distribu-

tions; see Exercise 4.21. In addition, we lose any notion of convergence rate

provided by the contractive nature of the distributional operator under speciﬁc

metrics. The need for stronger guarantees, such as those oﬀered by Proposition

4.27, motivates the contraction mapping theory developed in this chapter.


4.9 Random-Variable Bellman Operators*

In this chapter, we defined the distributional Bellman operator $\mathcal{T}^\pi$ as a mapping on the space of return-distribution functions $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$. We also saw that the action of the operator on a return function $\eta \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ can be understood both through direct manipulation of the probability distributions or through manipulation of a collection of random variables instantiating these distributions.

Viewing the operator through its eﬀect on the distribution of a collection of

representative random variables is a useful tool for understanding distributional

reinforcement learning and may prompt the reader to ask whether it is possible

to avoid referring to probability distributions at all, working instead directly

with random variables. We describe one approach to this below using the tools

of probability theory and then discuss some of its shortcomings.

Let $G_0 = (G_0(x) : x \in \mathcal{X})$ be an initial collection of real-valued random variables, indexed by state, supported on a probability space $(\Omega_0, \mathcal{F}_0, \mathbb{P}_0)$. For each $k \in \mathbb{N}^+$, let $(\Omega_k, \mathcal{F}_k, \mathbb{P}_k)$ be another probability space, supporting a collection of random variables $((A_k(x), R_k(x, a), X'_k(x, a)) : x \in \mathcal{X}, a \in \mathcal{A})$, with $A_k(x) \sim \pi(\cdot \mid x)$, and independently $R_k(x, a) \sim P_{\mathcal{R}}(\cdot \mid x, a)$, $X'_k(x, a) \sim P_{\mathcal{X}}(\cdot \mid x, a)$. We then consider the product probability space on $\Omega = \prod_{k \in \mathbb{N}} \Omega_k$. All random variables defined above can naturally be viewed as functions on this joint probability space which depend on $\omega = (\omega_0, \omega_1, \ldots) \in \Omega$ only through the coordinate $\omega_k$ that matches the index $k$ on the random variable. Note that under this construction, all random variables with distinct indices are independent.

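Now define $\mathscr{X}_{\mathbb{N}}$ as the set of real-valued random variables on $(\Omega, \mathcal{F}, \mathbb{P})$ (where $\mathcal{F}$ is the product $\sigma$-algebra) that depend on only finitely many coordinates of $\omega \in \Omega$. We can define a Bellman operator $\mathcal{T}^\pi : \mathscr{X}_{\mathbb{N}}^{\mathcal{X}} \to \mathscr{X}_{\mathbb{N}}^{\mathcal{X}}$ as follows. Given $G = (G(x) : x \in \mathcal{X}) \in \mathscr{X}_{\mathbb{N}}^{\mathcal{X}}$, let $K \in \mathbb{N}$ be the smallest integer such that the random variables $(G(x) : x \in \mathcal{X})$ depend on $\omega = (\omega_0, \omega_1, \ldots) \in \Omega$ only through $\omega_0, \ldots, \omega_{K-1}$; such an integer exists due to the definition of $\mathscr{X}_{\mathbb{N}}$ and the finiteness of $\mathcal{X}$. We then define $\mathcal{T}^\pi G \in \mathscr{X}_{\mathbb{N}}^{\mathcal{X}}$ by

$$(\mathcal{T}^\pi G)(x) = R_K(x, A_K(x)) + \gamma G\big(X'_K(x, A_K(x))\big) .$$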

With this definition, we can obtain a sequence of collections of random variables $(G_k)_{k \ge 0}$, defined iteratively by $G_{k+1} = \mathcal{T}^\pi G_k$, for $k \ge 0$. We have therefore formalized an operator entirely within the realm of random variables, without reference to the distributions of the iterates $(G_k)_{k \ge 0}$. By construction, the distribution of the random variables $(G_k)_{k \ge 0}$ matches the sequence of distributions that would be obtained by working directly on the space of probability distributions with the usual distributional Bellman operator. More concretely, if $\eta_0 \in \mathscr{P}(\mathbb{R})^{\mathcal{X}}$ is such that $\eta_0(x)$ is the distribution of $G_0(x)$ for each $x \in \mathcal{X}$, then we have that $\eta_k$, defined by $\eta_k = (\mathcal{T}^\pi)^k \eta_0$, is such that $\eta_k(x)$ is the distribution of $G_k(x)$, for each $x \in \mathcal{X}$. Thus, the random-variable Bellman operator constructed above is consistent with the distributional Bellman operator that is the main focus of this chapter.

32. This is real equality between random variables, rather than in distribution.

One difficulty with this random variable operator is that it does not have a fixed point; while the distribution of the random variables $G_k(x)$ converges to that of $G^\pi(x)$, the random variables themselves, as functions on the probability space, do not converge. Thus, while it is one way to view distributional reinforcement learning purely in terms of random variables, it is much less natural to analyze algorithms from this perspective, rather than through probability distributions as described in this chapter.

4.10 Technical Remarks

Remark 4.1. Our exposition in this chapter has focused on the standard distributional Bellman equation

$$G^\pi(x) \overset{\mathcal{D}}{=} R + \gamma G^\pi(X') , \quad X = x .$$

A similar development is possible with the alternative notions of random return mentioned in Section 2.9, including the random-horizon return. In general, the distributional operators that arise from these alternative notions of the random return are still contraction mappings, although the metrics and contraction moduli involved in these statements differ. For example, while the standard distributional Bellman operator is not a contraction in total variation distance, the distributional Bellman operator associated with the random-horizon return is; Exercise 4.17 explores this point in greater detail. △

Remark 4.2. The ideas developed in this chapter, and indeed the other chapters of the book, can also be applied to learning other properties related to the return. Achab (2020) explores several methods for learning the distribution of the random variables $R + \gamma V^\pi(X')$, under the different possible initial conditions $X = x$; these objects interpolate between the expected return and the full distribution of the return. Exercise 4.22 explores the development of contractive operators for the distributions of these objects. △

Remark 4.3. As well as allowing us to quantify the contractivity of the distributional Bellman operator, the probability metrics described in this chapter can be used to measure the accuracy of algorithms that aim to approximate the return distribution. In particular, they can be used to give quantitative guarantees on the accuracy of the nonparametric distributional Monte Carlo algorithm described in Remark 3.1. These guarantees can then be used to determine how many sample trajectories are required to approximate the true return distribution at a required level of accuracy: for example, when evaluating the performance of the algorithms in Chapters 5 and 6. We assume that $K$ returns $(G_k)_{k=1}^{K}$ beginning at state $x_0$ have been generated independently via the policy $\pi$ and consider the accuracy of the nonparametric distributional Monte Carlo estimator

$$\hat{\eta}(x_0) = \frac{1}{K} \sum_{k=1}^{K} \delta_{G_k} .$$

Different realizations of the sampled returns will lead to different estimates; the following result provides a guarantee in the Kolmogorov–Smirnov distance $\ell_\infty$ that holds with high probability over the possible realizations of the returns. For any $\varepsilon > 0$, we have

$$\ell_\infty\big(\hat{\eta}(x_0), \eta^\pi(x_0)\big) \le \varepsilon , \quad \text{with probability at least } 1 - 2 \exp(-2 K \varepsilon^2) .$$

Thus, for a desired level of accuracy $\varepsilon$ and a desired confidence $1 - \delta$, $K$ can be selected so that $2 \exp(-2 K \varepsilon^2) < \delta$, which then yields the guarantee that with probability at least $1 - \delta$, we have $\ell_\infty(\hat{\eta}(x_0), \eta^\pi(x_0)) \le \varepsilon$. This is in fact a key result in empirical process theory known as the Dvoretzky–Kiefer–Wolfowitz inequality (Dvoretzky et al. 1956; Massart 1990). Its uses specifically in reinforcement learning include the works of Keramati et al. (2020) and Chandak et al. (2021). Similar concentration bounds are possible under other probability metrics (such as the Wasserstein distances; see, e.g., Weed and Bach 2019; Bobkov and Ledoux 2019), though typically some form of a priori information about the return distribution, such as bounds on minimum/maximum returns, is required to establish such bounds. △

33. This is a fairly subtle point – the reader is invited to consider what happens to $G_k(x)$ and $G_{k+1}(x)$ as mappings from $\Omega$ to $\mathbb{R}$.
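The sample-size calculation described in Remark 4.3 amounts to solving $2 \exp(-2 K \varepsilon^2) < \delta$ for $K$, that is, $K > \log(2/\delta) / (2 \varepsilon^2)$. A minimal sketch follows; the function name `dkw_sample_size` and the chosen values of $\varepsilon$ and $\delta$ are assumptions for illustration.

```python
import math

def dkw_sample_size(eps, delta):
    """Smallest integer K satisfying 2 * exp(-2 * K * eps**2) < delta."""
    threshold = math.log(2.0 / delta) / (2.0 * eps ** 2)
    return math.floor(threshold) + 1  # strict inequality, so step past the threshold

K = dkw_sample_size(eps=0.05, delta=0.05)
print(K, 2 * math.exp(-2 * K * 0.05 ** 2))
```

With $\varepsilon = \delta = 0.05$, this gives $K = 738$ sampled returns. The required $K$ grows as $1/\varepsilon^2$ but only logarithmically in $1/\delta$.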

4.11 Bibliographical Remarks

4.1–4.2.

The use of operator theory to understand reinforcement learning algo-

rithms is standard to most textbooks in the ﬁeld (Bertsekas and Tsitsiklis 1996;

Szepesvári 2010; Puterman 2014; Sutton and Barto 2018), and is commonly

used in theoretical reinforcement learning (Munos 2003; Bertsekas 2011; Scher-

rer 2014). Puterman (2014) provides a thorough introduction to vector notation

for Markov decision processes. Improved convergence results can be obtained

by studying the eigenspectrum of the transition kernel, as shown by, for exam-

ple, Morton (1971) and Bertsekas (1994, 2012). The elementary contraction

mapping theory described in Section 4.2 goes back to Banach (1922). Our

reference on metric spaces is the deﬁnitive textbook by Rudin (1976).


Not all reinforcement learning algorithms are readily analyzed using the

theory of contraction mappings. This is the case for policy-gradient algorithms

(Sutton et al. 2000; but see Ghosh et al. 2020; Bhandari and Russo 2021), but

also value-based algorithms such as advantage learning (Baird 1999; Bellemare

et al. 2016).

4.3. The random-variable Bellman operator presented here is from our earlier work (Bellemare et al. 2017a), which provided an analysis in the $p$-Wasserstein distances. There, the technical issues discussed in Section 4.9 were obviated by declaring $R$, $X'$ and the collection $(G(x) : x \in \mathcal{X})$ to be independent. Those issues were raised in a later paper (Rowland et al. 2018), which also introduced the distributional Bellman equation in terms of probability measures and the pushforward notation. The probability density equation (Equation 4.9) can be found in Morimura et al. (2010b). Earlier instances of distributional operators are given by Chung and Sobel (1987) and Morimura et al. (2010a), who provide an operator on cumulative distribution functions, and Jaquette (1976), who provides an operator on Laplace transforms.

4.4.

The Wasserstein distance can be traced back to Leonid Kantorovich (1942)

and has been rediscovered (and renamed) multiple times in its history. The name

we use here is common but a misnomer as Leonid Vaserstein (after whom the

distance is named) did not himself do any substantial work on the topic. Among

other names, we note the Mallows metric (Bickel and Freedman 1981) and the

Earth–Mover distance (Rubner et al. 1998). Much earlier, Monge (1781) was

the ﬁrst to study the problem of optimal transportation from a transport theory

perspective. See Vershik (2013) and Panaretos and Zemel (2020) for further

historical comments and Villani (2008) for a survey of theoretical properties.

A version of the contraction analysis in $p$-Wasserstein distances was given by Bellemare et al. (2017a); we owe the proof of Proposition 4.15 in terms of optimal couplings to Philip Amortila.

The use of contraction mapping theory to analyze stochastic ﬁxed-point equa-

tions was introduced by Rösler (1991), who analyzes the Quicksort algorithm

by characterizing the distributional ﬁxed points of contraction mappings in 2-

Wasserstein distance. Applications and generalization of this technique include

the analysis of further recursive algorithms, models in stochastic geometry, and

branching processes (Rösler 1992; Rachev and Rüschendorf 1995; Neininger

1999; Rösler and Rüschendorf 2001; Rösler 2001; Neininger 2001; Neininger

and Rüschendorf 2004; Rüschendorf 2006; Rüschendorf and Neininger 2006;

Alsmeyer 2012). Although the random-variable Bellman equation (Equation 2.15) can be viewed as a system of recursive distributional equations, the emphasis on a collection of effectively independent random variables ($G^\pi$) differs from the usual treatment of such equations.

4.5.

The family of



distances described in this chapter is covered at length

in the work of Rachev et al. (2013), which studies an impressive variety of

probability metrics. A version of the contraction analysis in Cramér distance

was originally given by Rowland et al. (2018). In two and more dimensions, the

Cramér distance is generalized by the energy distance (Székely 2002; Székely

and Rizzo 2013; Rizzo and Székely 2016), itself a member of the maximum

mean discrepancy (MMD) family (Gretton et al. 2012); contraction analysis

in terms of MMD metrics was undertaken by Nguyen et al. (2021) (see Exercise 4.19

for further details). Another special case of the $\ell_p$ metrics considered in

this chapter is the Kolmogorov–Smirnov distance ($\ell_\infty$), which features in results

in empirical process theory, such as the Glivenko–Cantelli theorem. Many of

these metrics are integral probability metrics (Müller 1997), which allows for a

dual formulation with appealing algorithmic consequences. Chung and Sobel

(1987) provide a nonexpansion result in total variation distance (without naming

it as such; the proof uses an integral probability metric formulation).

4.6. The properties of regularity, convexity, and $c$-homogeneity were introduced

by Zolotarev (1976) in a slightly more general setting. Our earlier work pre-

sented these in a modern context (Bellemare et al. 2017b), albeit with only a

mention of their potential use in reinforcement learning. Although that work

proposed the term “sum-invariant” as mnemonically simpler, this is only techni-

cally correct when Equation 4.15 holds with equality; we have thus chosen to

keep the original name. Theorem 4.25 is new to this book.

The characterization of the Wasserstein distance as an optimal transport prob-

lem in Proposition 4.18 is the standard presentation of the Wasserstein distance

in more abstract settings, which allows it to be applied to probability distri-

butions over reasonably general metric spaces (Villani 2003, 2008; Ambrosio

et al. 2005; Santambrogio 2015). Optimal transport has also increasingly found

application within machine learning in recent years, particularly in generative

modeling (Arjovsky et al. 2017). Optimal transport and couplings also arise

in the study of bisimulation metrics for Markov decision processes (Ferns et

al. 2004; Ferns and Precup 2014; Amortila et al. 2019) as well as analytical

tools for sample-based algorithms (Amortila et al. 2020). Peyré and Cuturi

(2019) provide an overview of algorithms, analysis, and applications associated

with optimal transport in machine learning and related disciplines.

4.7–4.8. Villani (2008) gives further discussion on the domain of Wasserstein

distances and on their relationship to weak convergence.

4.9. The usefulness of a random-variable operator has been a source of intense

debate between the authors of this book. The form we present here is inspired

by the “stack of rewards” model from Lattimore and Szepesvári (2020).


4.12 Exercises

Exercise 4.1. Show that the no-loop operator defined in Example 4.6 is a
contraction mapping with modulus

$$\beta = \gamma \max_{x \in \mathcal{X}} \mathbb{P}_\pi(X' \neq x \mid X = x) .$$ △

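Exercise 4.2. For $p \in [1, \infty)$, let $\|\cdot\|_p$ be the $L^p$ norm over
$\mathbb{R}^{\mathcal{X}}$, defined as

$$\|V\|_p = \Big( \sum_{x \in \mathcal{X}} |V(x)|^p \Big)^{1/p} .$$

Show that $T^\pi$ is not a contraction mapping in the metric induced by the $L^p$
norm unless $p = \infty$ (see Equation 4.4 for a definition of the $L^\infty$ metric).
Hint. A two-state example suffices. △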

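Exercise 4.3. In this exercise, you will use the ideas of Section 4.1 to study
several operators associated with expected-value reinforcement learning.

(i) For $n \in \mathbb{N}^+$, consider the $n$-step evaluation operator
$T^\pi_n : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by

$$(T^\pi_n V)(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V(X_n) \,\Big|\, X_0 = x \Big] .$$

Show that $T^\pi_n$ has $V^\pi$ as a fixed point and is a contraction mapping with
respect to the $L^\infty$ metric with contraction modulus $\gamma^n$. Hence, deduce that
repeated application of $T^\pi_n$ to any initial value function estimate converges
to $V^\pi$. Show that in fact, $T^\pi_n = (T^\pi)^n$.

(ii) Consider the $\lambda$-return operator
$T^\pi_\lambda : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ defined by

$$(T^\pi_\lambda V)(x) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \, \mathbb{E}_\pi\Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V(X_n) \,\Big|\, X_0 = x \Big] .$$

Show that $T^\pi_\lambda$ has $V^\pi$ as a fixed point and is a contraction mapping with
respect to the $L^\infty$ metric with contraction modulus

$$\gamma \, \frac{1 - \lambda}{1 - \gamma \lambda} .$$

Hence, deduce that repeated application of $T^\pi_\lambda$ to any initial value function
estimate converges to $V^\pi$. △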
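The identity between the $n$-step operator and $n$ applications of the one-step operator can be explored numerically before it is proved. The sketch below is only an illustration: the two-state transition matrix `P`, the expected rewards `R`, and the discount `GAMMA` are hypothetical values chosen for the example, and the helper names are ours.

```python
# Hypothetical two-state Markov reward process induced by a fixed policy
# (the numbers are illustrative assumptions, not taken from the text).
GAMMA = 0.9
P = [[0.5, 0.5],
     [0.2, 0.8]]      # P[x][y] = probability of moving from state x to state y
R = [1.0, -1.0]       # expected immediate reward in each state


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def bellman(values):
    """One-step operator: (T V)(x) = r(x) + gamma * sum_y P(x, y) V(y)."""
    return [r + GAMMA * pv for r, pv in zip(R, matvec(P, values))]


def n_step(values, n):
    """n-step operator, computed directly from its definition."""
    rewards_part = [0.0] * len(values)
    expected_reward = R[:]            # holds P^t r, starting at t = 0
    bootstrap = values[:]             # holds P^t V, starting at t = 0
    for t in range(n):
        rewards_part = [acc + GAMMA ** t * er
                        for acc, er in zip(rewards_part, expected_reward)]
        expected_reward = matvec(P, expected_reward)
        bootstrap = matvec(P, bootstrap)
    return [acc + GAMMA ** n * b for acc, b in zip(rewards_part, bootstrap)]


def iterate(values, n):
    """Apply the one-step operator n times."""
    for _ in range(n):
        values = bellman(values)
    return values


def sup_distance(v, w):
    """L-infinity distance between two value function estimates."""
    return max(abs(a - b) for a, b in zip(v, w))


V, W = [3.0, -2.0], [-1.0, 4.0]
print(sup_distance(n_step(V, 4), iterate(V, 4)))       # gap between the two constructions
print(sup_distance(n_step(V, 4), n_step(W, 4)),        # distance after applying the operator
      GAMMA ** 4 * sup_distance(V, W))                 # candidate bound gamma^n * distance
```

Agreement on one example is of course not a proof, but it is a useful sanity check on the statement being proved.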

Exercise 4.4. Consider the random-variable operator (Equation 4.7). Given
a return-variable function $G_0$, write $G_0(x, \omega) = G_0(x)(\omega)$ for the realization
of the random return corresponding to $\omega \in \Omega$. Additionally, for each $x \in \mathcal{X}$,
let $(x, A(x), R(x), X'(x))$ be an independent random transition defined on the
same probability space, and write $(A(x, \omega), R(x, \omega), X'(x, \omega))$ to denote the
dependence of these random variables on $\omega \in \Omega$. Suppose that for each $k \geq 0$,
we define the return-variable function

$$G_{k+1}(x, \omega) = R(x, \omega) + \gamma G_k(X'(x, \omega), \omega) .$$

For a given $x$, characterize the function

$$G^*(x, \omega) = \lim_{k \to \infty} G_k(x, \omega) .$$ △

Exercise 4.5. Suppose that you are given a description of a Markov decision
process along with a policy $\pi$, with the property that the policy $\pi$ reaches a
terminal state (i.e., one for which the return is zero) in at most $T \in \mathbb{N}$ steps.
Describe a recursive procedure that takes in a scalar $\tau$ in $[0, 1]$ and outputs
a return $z \in \mathbb{R}$ such that $\mathbb{P}_\pi(G^\pi(X_0) \leq z) = \tau$. Hint. You may want to use the fact
that sampling a random variable $Z$ can be emulated by drawing $\tau$ uniformly
from $[0, 1]$, and returning $F_Z^{-1}(\tau)$. △
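The fact quoted in the hint can be illustrated for a single random variable with finitely many outcomes. The following is a minimal sketch of inverse-CDF sampling, not the recursive procedure asked for in the exercise; the function names and the example distribution are ours.

```python
import random


def inverse_cdf(outcomes, probs, tau):
    """Generalized inverse CDF: the smallest outcome z with F(z) >= tau."""
    cumulative = 0.0
    ordered = sorted(zip(outcomes, probs))
    for z, p in ordered:
        cumulative += p
        if tau <= cumulative:
            return z
    return ordered[-1][0]   # guard against round-off when tau is close to 1


def sample(outcomes, probs, rng):
    """Draw tau uniformly from [0, 1) and push it through the inverse CDF."""
    return inverse_cdf(outcomes, probs, rng.random())


# Illustrative distribution: mass 0.5 at 0, 0.3 at 1, and 0.2 at 5.
outcomes, probs = [0.0, 1.0, 5.0], [0.5, 0.3, 0.2]
rng = random.Random(0)
draws = [sample(outcomes, probs, rng) for _ in range(20000)]
print(draws.count(0.0) / len(draws))   # empirical frequency of the outcome 0
```

Because `tau` fully determines the output, the same scalar can be reused to make a sampling procedure reproducible, which is the property the exercise builds on.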

Exercise 4.6. Let $p \in [1, \infty]$.

(i) Show that for any $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ with finite $p$th moments, we have
$w_p(\nu, \nu') < \infty$. Hint. Use the triangle inequality with intermediate distribution
$\delta_0$. Hence, prove that the $p$-Wasserstein metric is indeed a metric on
$\mathscr{P}_p(\mathbb{R})$, for $p \in [1, \infty]$.

(ii) Show that on the space $\mathscr{P}(\mathbb{R})$, the $p$-Wasserstein metric satisfies all
requirements of a metric except finiteness. For each $p \in [1, \infty]$, exhibit a pair of
distributions $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ such that $w_p(\nu, \nu') = \infty$.

(iii) Show that if $1 \leq q < p < \infty$, we have
$\mathbb{E}[|Z|^p] < \infty \implies \mathbb{E}[|Z|^q] < \infty$ for any
random variable $Z$. Deduce that $\mathscr{P}_p(\mathbb{R}) \subseteq \mathscr{P}_q(\mathbb{R})$. Hint. Consider applying
Jensen's inequality with the function $z \mapsto |z|^{p/q}$. △

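Exercise 4.7. Let $d : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to [0, \infty]$ be a probability metric with finite
domain $\mathscr{P}_d(\mathbb{R})$.

(i) Prove that the supremum distance $\overline{d}$ is an extended metric on $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.

(ii) Prove that it is a metric on $\mathscr{P}_d(\mathbb{R})^{\mathcal{X}}$. △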

Exercise 4.8. Prove Proposition 4.15 for $p = \infty$. △

Exercise 4.9. Consider two normally distributed random variables with means
$\mu$ and $\mu'$ and common variance $\sigma^2$. Derive an expression for the $\infty$-Wasserstein
distance between these two distributions. Conclude that Assumption 4.29($\infty$) is
sufficient, but not necessary, for two distributions to have finite $\infty$-Wasserstein
distance. How does the situation change if the two normal distributions in
question have unequal variances? △


Exercise 4.10. Explain, in words, why Assumption 4.29($p$) is needed for
Proposition 4.30. By considering the definition of the $p$-Wasserstein distance, explain
why for $p > 1$ and $1 \leq q < p$, Assumption 4.29($q$) is not sufficient to guarantee
convergence in the $p$-Wasserstein distance. △

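Exercise 4.11. This exercise guides you through the proof of Proposition 4.30.

(i) First, show that under Assumption 4.29($p$), the return distributions $\eta^\pi(x)$
have finite $p$th moments, for all $x \in \mathcal{X}$. You may find it useful to deal with
$p = \infty$ separately, and in the case $p \in [1, \infty)$, you may find it useful to rewrite

$$\mathbb{E}_\pi\Big[ \Big| \sum_{t=0}^{\infty} \gamma^t R_t \Big|^p \,\Big|\, X = x \Big] = (1 - \gamma)^{-p} \, \mathbb{E}_\pi\Big[ \Big| \sum_{t=0}^{\infty} (1 - \gamma) \gamma^t R_t \Big|^p \,\Big|\, X = x \Big]$$

and use Jensen's inequality on the function $z \mapsto |z|^p$.

(ii) Let $\eta \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$, and let $G$ be an instantiation of $\eta$. First letting
$p \in [1, \infty)$, use the inequality $|z_1 + z_2|^p \leq 2^{p-1}(|z_1|^p + |z_2|^p)$ to argue that if
$(x, A, R, X')$ is a sample transition independent of $G$, then

$$\mathbb{E}_\pi\big[ |R + \gamma G(X')|^p \mid X = x \big] < \infty .$$

Hence, argue that under Assumption 4.29($p$), $\mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$ is closed under $\mathcal{T}^\pi$.
Argue separately that this holds for $p = \infty$ too.

Hence, argue that Proposition 4.27 applies, and hence conclude that
Proposition 4.30 holds. △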

Exercise 4.12. In the proof of Theorem 4.25, we did not need to assume that
for the two distinct states $x, y \in \mathcal{X}$, their associated returns $G(x)$ and $G(y)$ are
independent. Explain why. △

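Exercise 4.13. The (1,1)-Pareto distribution $\nu_{\text{par}}$ has cumulative distribution
function

$$F_{\nu_{\text{par}}}(z) = \begin{cases} 0 & \text{if } z < 1, \\ 1 - \frac{1}{z} & \text{if } z \geq 1 . \end{cases}$$

Justify the necessity of including Assumption 2.5 (finite-mean rewards) in
Definition 4.26 by demonstrating that
$\ell_2(\nu_{\text{par}}, \delta_0) < \infty$ yet $\mathbb{E}_{Z \sim \nu_{\text{par}}}[Z] = \infty$. △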

Exercise 4.14. The purpose of this exercise is to contrast the $p$-Wasserstein and
$\ell_p$ distances. For each of the following, find a pair of probability distributions
$\nu, \nu' \in \mathscr{P}(\mathbb{R})$ such that, for a given $\varepsilon > 0$,

(i) $w_2(\nu, \nu')$ smaller than $\ell_2(\nu, \nu')$;

(ii) $w_2(\nu, \nu')$ larger than $\ell_2(\nu, \nu')$;

(iii) $w_\infty(\nu, \nu') = \varepsilon$ and $\ell_\infty(\nu, \nu') = 1$;

(iv) $w_\infty(\nu, \nu') = 1$ and $\ell_\infty(\nu, \nu') = \varepsilon$. △
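When searching for such pairs, it can help to compute both distances for candidate distributions with finitely many atoms. The sketch below assumes the chapter's definitions, namely $w_p$ as the $L^p$ distance between inverse cumulative distribution functions over $(0, 1)$ and $\ell_p$ as the $L^p$ distance between cumulative distribution functions over $\mathbb{R}$; the representation of a distribution as a pair `(support, probs)` and all function names are ours.

```python
import math


def cdf_steps(support, probs):
    """Sorted list of (atom, cumulative probability) pairs."""
    steps, cumulative = [], 0.0
    for z, p in sorted(zip(support, probs)):
        cumulative += p
        steps.append((z, cumulative))
    return steps


def wasserstein(nu, nu_prime, p):
    """w_p between finitely supported distributions, via their inverse CDFs."""
    a, b = cdf_steps(*nu), cdf_steps(*nu_prime)
    i = j = 0
    tau, total, largest_gap = 0.0, 0.0, 0.0
    while i < len(a) and j < len(b):
        next_tau = min(a[i][1], b[j][1])
        gap = abs(a[i][0] - b[j][0])        # |F^{-1}(t) - F'^{-1}(t)| on (tau, next_tau]
        if next_tau - tau > 1e-12:          # skip quantile ranges of zero length
            largest_gap = max(largest_gap, gap)
            if not math.isinf(p):
                total += (next_tau - tau) * gap ** p
        tau = next_tau
        if a[i][1] <= tau + 1e-12:
            i += 1
        if b[j][1] <= tau + 1e-12:
            j += 1
    return largest_gap if math.isinf(p) else total ** (1.0 / p)


def cdf_at(steps, z):
    """Right-continuous CDF evaluated at z."""
    value = 0.0
    for atom, cumulative in steps:
        if atom <= z:
            value = cumulative
    return value


def ell_p(nu, nu_prime, p):
    """l_p between finitely supported distributions, via their CDFs."""
    a, b = cdf_steps(*nu), cdf_steps(*nu_prime)
    atoms = sorted({z for z, _ in a} | {z for z, _ in b})
    total, largest_diff = 0.0, 0.0
    for left, right in zip(atoms, atoms[1:]):
        diff = abs(cdf_at(a, left) - cdf_at(b, left))   # constant on [left, right)
        largest_diff = max(largest_diff, diff)
        if not math.isinf(p):
            total += (right - left) * diff ** p
    return largest_diff if math.isinf(p) else total ** (1.0 / p)


nu = ([0.0], [1.0])                    # Dirac delta at 0
nu_prime = ([0.0, 2.0], [0.5, 0.5])    # equal mixture of Dirac deltas at 0 and 2
print(wasserstein(nu, nu_prime, 1), ell_p(nu, nu_prime, 1))
```

For $p = 1$ the two quantities agree on this pair (both equal 1), as they do for any pair of distributions on the real line; passing `p=2` or `p=math.inf` shows how they separate.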


Exercise 4.15. Show that the dependence on $\gamma^{1/p}$ is tight in Proposition 4.20.

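Exercise 4.16. The total variation distance
$d_{\text{TV}} : \mathscr{P}(\mathbb{R}) \times \mathscr{P}(\mathbb{R}) \to \mathbb{R}$ is defined by³⁴

$$d_{\text{TV}}(\nu, \nu') = \sup_{U \subseteq \mathbb{R}} |\nu(U) - \nu'(U)| , \qquad (4.19)$$

for all $\nu, \nu' \in \mathscr{P}(\mathbb{R})$. Show, by means of a counterexample, that the
distributional Bellman operator is not a contraction mapping in the supremum extension
of this distance. △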
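For distributions with finitely many atoms, the supremum in Equation 4.19 can be restricted to subsets of the joint support, and it equals half the $L^1$ distance between the probability mass functions. The sketch below computes the distance both ways as a cross-check; representing a distribution as a dictionary from atoms to masses is our choice, as are the function names.

```python
from itertools import combinations


def total_variation(nu, nu_prime):
    """Half the L1 distance between two probability mass functions."""
    atoms = set(nu) | set(nu_prime)
    return 0.5 * sum(abs(nu.get(z, 0.0) - nu_prime.get(z, 0.0)) for z in atoms)


def total_variation_brute_force(nu, nu_prime):
    """Largest |nu(U) - nu'(U)| over all subsets U of the joint support."""
    atoms = sorted(set(nu) | set(nu_prime))
    best = 0.0
    for size in range(len(atoms) + 1):
        for subset in combinations(atoms, size):
            gap = abs(sum(nu.get(z, 0.0) - nu_prime.get(z, 0.0) for z in subset))
            best = max(best, gap)
    return best


# Two illustrative distributions on {0, 1, 2}.
nu = {0.0: 0.5, 1.0: 0.5}
nu_prime = {0.0: 0.2, 1.0: 0.3, 2.0: 0.5}
print(total_variation(nu, nu_prime), total_variation_brute_force(nu, nu_prime))
```

The brute-force version enumerates every subset and is only practical for a handful of atoms; it is included to make the definition concrete rather than as a method of computation.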

Exercise 4.17. Consider the alternative notion of the return introduced in
Section 2.9, the random-horizon return, for which $\gamma$ is treated as the
probability of continuing. Write down the distributional Bellman operator that
corresponds to this random-horizon return. For which metrics considered in this
chapter is this distributional Bellman operator a contraction mapping? Show in
particular that this distributional Bellman operator is a contraction with respect
to the supremum version of the total variation distance over return-distribution
functions, introduced in Exercise 4.16. What is its contraction modulus? △

Exercise 4.18. Remark 2.3 describes some differences between Markov decision
processes with finite state spaces (as we consider throughout the book) and
generalizations with infinite state spaces. The contraction mapping theory in
this chapter is one case where stronger assumptions are required when moving
to larger state spaces. Using the example described in Remark 2.3, show that
Assumption 4.29($p$) is insufficient to make $\mathscr{P}_p(\mathbb{R})^{\mathcal{X}}$ closed under the
distributional Bellman operator $\mathcal{T}^\pi$ when the state space is countably infinite. How
could this assumption be strengthened to guarantee closedness? △

34. For the reader with a measure-theoretic background, the supremum here is over measurable
subsets of $\mathbb{R}$.

Exercise 4.19 (*). The goal of this exercise is to explore the contractivity of
the distributional Bellman operator with respect to a class of metrics known
as maximum mean discrepancies; this analysis was undertaken by Nguyen et
al. (2021). A kernel on $\mathbb{R}$ is a function $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with the property that for
any finite set $\{z_1, \dots, z_m\} \subseteq \mathbb{R}$, the $m \times m$ matrix with $(i, j)$th entry
$K(z_i, z_j)$ is positive semi-definite. A function $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ satisfying the
weaker condition that

$$\sum_{i,j=1}^{m} c_i c_j K(z_i, z_j) \geq 0$$

whenever $\sum_{i=1}^{m} c_i = 0$ is called a conditionally positive-definite kernel.
Conditionally positive-definite kernels form a measure of similarity between pairs of
points in $\mathbb{R}$ and can also be used to define notions of distance over probability
distributions. The maximum mean discrepancy (MMD) associated with the
conditionally positive-definite kernel $K$ is defined by

$$\text{MMD}_K(\nu, \nu') = \Big( \mathbb{E}_{X \sim \nu,\, X' \sim \nu}[K(X, X')] + \mathbb{E}_{Y \sim \nu',\, Y' \sim \nu'}[K(Y, Y')] - 2\, \mathbb{E}_{X \sim \nu,\, Y \sim \nu'}[K(X, Y)] \Big)^{1/2} ,$$

where each pair of random variables in the expectations above is taken to be
independent.

(i) Consider the function $K_\alpha(z_1, z_2) = -|z_1 - z_2|^\alpha$, with $\alpha \in (0, 2)$. Székely and
Rizzo (2013, Proposition 2) show that this defines a conditionally
positive-definite kernel. Show that $\text{MMD}_{K_\alpha}$ is regular, $c$-homogeneous (for some
$c > 0$), and $p$-convex (for some $p \in [1, \infty)$). Hence, use Theorem 4.25 to establish
that the distributional Bellman operator is a contraction with respect to
$\overline{\text{MMD}}_{K_\alpha}$, under suitable assumptions.

(ii) The Gaussian kernel, or squared exponential kernel, with variance $\sigma^2$
and length scale $\lambda > 0$ is defined by
$K(z_1, z_2) = \sigma^2 \exp(-(z_1 - z_2)^2 / (2\lambda^2))$.
Show, through the use of a counterexample, that the MMD corresponding
to the Gaussian kernel is not $c$-homogeneous for any $c > 0$, and so
Theorem 4.25 cannot be applied. Further, find an MDP and policy $\pi$ that serve as
a counterexample to the contractivity of the distributional Bellman operator
with respect to this MMD metric. △
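For distributions with finitely many atoms, the three expectations in the definition of the MMD reduce to double sums, so both parts of the exercise can be explored numerically before attempting a proof. The sketch below makes several assumptions: distributions are dictionaries from atoms to masses, the Gaussian kernel uses the common $1/(2\lambda^2)$ scaling inside the exponential, and every function name is ours.

```python
import math


def energy_kernel(alpha):
    """K_alpha(z1, z2) = -|z1 - z2|**alpha, as in part (i)."""
    return lambda z1, z2: -abs(z1 - z2) ** alpha


def gaussian_kernel(variance, length_scale):
    """Squared exponential kernel; the 1 / (2 * length_scale**2) scaling is an assumption."""
    return lambda z1, z2: variance * math.exp(
        -((z1 - z2) ** 2) / (2.0 * length_scale ** 2))


def expected_kernel(kernel, nu, nu_prime):
    """E[K(X, Y)] for independent X ~ nu and Y ~ nu_prime."""
    return sum(p * q * kernel(x, y)
               for x, p in nu.items() for y, q in nu_prime.items())


def mmd(kernel, nu, nu_prime):
    """Maximum mean discrepancy between two finitely supported distributions."""
    squared = (expected_kernel(kernel, nu, nu)
               + expected_kernel(kernel, nu_prime, nu_prime)
               - 2.0 * expected_kernel(kernel, nu, nu_prime))
    return math.sqrt(max(squared, 0.0))   # clip tiny negative round-off


def scale(nu, factor):
    """Distribution of factor * Z when Z ~ nu."""
    scaled = {}
    for z, p in nu.items():
        scaled[factor * z] = scaled.get(factor * z, 0.0) + p
    return scaled


nu, nu_prime = {0.0: 1.0}, {1.0: 1.0}       # Dirac deltas at 0 and 1
for name, kernel in [("energy, alpha = 1", energy_kernel(1.0)),
                     ("gaussian", gaussian_kernel(1.0, 1.0))]:
    before = mmd(kernel, nu, nu_prime)
    after = mmd(kernel, scale(nu, 0.25), scale(nu_prime, 0.25))
    print(name, before, after, after / before)
```

Repeating the comparison for several scaling factors and plotting the ratio on a log-log scale gives a quick empirical picture of whether a single exponent $c$ can describe how each MMD responds to scaling.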

Exercise 4.20. Let $p \in [1, \infty]$, and consider a modification of the $p$-Wasserstein
distance, $\tilde{w}_p$, such that $\tilde{w}_p(\nu, \nu') = w_p(\nu, \nu')$ if both $\nu, \nu' \in \mathscr{P}(\mathbb{R})$ are expressible
as finite mixtures of Dirac deltas, and $\tilde{w}_p(\nu, \nu') = \infty$ otherwise.

(i) Show that $\mathscr{P}_{\tilde{w}_p}(\mathbb{R})$ is the set of distributions expressible as finite mixtures
of Dirac deltas.

(ii) Exhibit a Markov decision process and policy $\pi$ for which all conditions of
Proposition 4.27 hold except for the condition $\eta^\pi \in \mathscr{P}_{\tilde{w}_p}(\mathbb{R})^{\mathcal{X}}$.

(iii) Show that the sequence of iterates $(\eta_k)_{k \geq 0}$ does not converge to $\eta^\pi$ under
$\overline{\tilde{w}}_p$ in this case. △

Exercise 4.21. Consider the sequence of distributions $(\nu_k)_{k=1}^{\infty}$ defined by

$$\nu_k = \frac{k - 1}{k} \delta_0 + \frac{1}{k} \delta_k .$$

Show that this sequence converges weakly to another distribution $\nu$. From this,
deduce that weak convergence does not imply convergence of expectations.
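A quick numerical look at the sequence can guide the argument. The sketch below assumes the sequence takes the form $\nu_k = \frac{k-1}{k}\delta_0 + \frac{1}{k}\delta_k$, that is, mass $(k-1)/k$ at 0 and mass $1/k$ at $k$; the dictionary representation and the function names are ours. Exact rational arithmetic is used so that no rounding obscures the pattern.

```python
from fractions import Fraction


def nu_k(k):
    """Assumed form of the k-th distribution: mass (k - 1)/k at 0 and 1/k at k."""
    return {0: Fraction(k - 1, k), k: Fraction(1, k)}


def mean(nu):
    """Expectation of a finitely supported distribution."""
    return sum(z * p for z, p in nu.items())


def cdf(nu, z):
    """Cumulative distribution function evaluated at z."""
    return sum(p for atom, p in nu.items() if atom <= z)


for k in (1, 10, 100, 1000):
    # Columns: k, the mean of nu_k, and the CDF of nu_k at the fixed point z = 0.5.
    print(k, mean(nu_k(k)), float(cdf(nu_k(k), 0.5)))
```

Comparing the second and third columns as $k$ grows suggests which distribution the sequence approaches and how its expectation relates to those of the $\nu_k$.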


Exercise 4.22. Achab (2020) considers the random variables

$$\tilde{G}^\pi(x) = R + \gamma V^\pi(X') , \quad X = x .$$

What does the distribution of this random variable capture about the underlying
MDP and policy $\pi$? Using the tools of this chapter, derive an operator over
$\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ that has the collection of distributions corresponding to this random
variable under each of the initial conditions in $\mathcal{X}$ as a fixed point, and analyze
the properties of this operator. △
