7

Control

A superpressure balloon is a kind of aircraft whose altitude is determined by the relative pressure between its envelope and the ambient atmosphere, and which can be flown high in the stratosphere. Like a submarine, a superpressure balloon ascends when it becomes lighter and descends when it becomes heavier. Once in flight, superpressure balloons are passively propelled by the winds around them, so that their direction of travel can be influenced simply by changing their altitude. This makes it possible to steer such a balloon in an energy-efficient manner and have it operate autonomously for months at a time. Determining the most efficient way to control the flight of a superpressure balloon by means of altitude changes is an example of a control problem, the topic of this chapter.

In reinforcement learning, control problems are concerned with finding policies that achieve or maximise specified objectives. This is in contrast with prediction problems, which are concerned with characterising or quantifying the consequences of following a particular policy. The study of control problems involves not only the design of algorithms for learning optimal policies, but also the study of the behaviour of these algorithms under different conditions, such as when learning occurs one sample at a time (as per Chapter 6), when noise is injected into the process, or when only a finite amount of data is made available for learning. Under the distributional perspective, the dynamics of control algorithms exhibit a surprising complexity. This chapter gives a brief and necessarily incomplete overview of the control problem. In particular, our treatment of control differs from most textbooks in that we focus on the distributional component, and for conciseness omit some traditional material such as policy iteration and λ-return algorithms.


7.1 Risk-Neutral Control

The problem of finding a policy that maximises the agent's expected return is called the risk-neutral control problem, as it is insensitive to the deviations of returns from their mean. We have already encountered risk-neutral control when we introduced the Q-learning algorithm in Section 3.7. We begin this chapter by providing a theoretical justification for this algorithm.

Problem 7.1 (Risk-neutral control). Given an MDP $(\mathcal{X}, \mathcal{A}, \xi_0, P_{\mathcal{X}}, P_{\mathcal{R}})$ and discount factor $\gamma \in [0, 1)$, find a policy $\pi$ maximising the objective function
$$J(\pi) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t\Big].$$

A solution $\pi^*$ that maximises $J$ is called an optimal policy.

Implicit in the definition of risk-neutral control and our definition of a policy in Chapter 2 is the fact that the objective $J$ is maximised by a policy that only depends on the current state, that is, one that takes the form
$$\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A}).$$
As noted in Section 2.2, policies of this type are more properly called stationary Markov policies, and are but a subset of possible decision rules. With stationary Markov policies, the action $A_t$ is independent of the random variables $X_0, A_0, R_0, \dots, X_{t-1}, A_{t-1}, R_{t-1}$ given $X_t$. In addition, the distribution of $A_t$, conditional on $X_t$, is the same for all time indices $t$.

By contrast, history-dependent policies select actions on the basis of the entire trajectory up to and including $X_t$ (the history). Formally, a history-dependent policy is a time-indexed collection of mappings
$$\pi_t : (\mathcal{X} \times \mathcal{A} \times \mathbb{R})^{t} \times \mathcal{X} \to \mathscr{P}(\mathcal{A}).$$
In this case, we have that
$$A_t \mid (X_{0:t}, A_{0:t-1}, R_{0:t-1}) \sim \pi_t(\cdot \mid X_0, A_0, R_0, \dots, A_{t-1}, R_{t-1}, X_t). \tag{7.1}$$
When clear from context, we omit the time subscript to $\pi_t$ and write $\mathbb{P}_\pi$, $\mathbb{E}_\pi$, $G^\pi$, and $\eta^\pi$ to denote the joint distribution, expectation, random-variable function, and return-distribution function implied by the generative equations but with Equation 7.1 substituting the earlier definition from Section 2.2. We write $\Pi_{\text{MS}}$ for the space of stationary Markov policies, and $\Pi_{\text{H}}$ for the space of history-dependent policies.


It is clear that every stationary Markov policy is a history-dependent policy, but the converse is not true. However, in risk-neutral control the added degree of freedom provided by history-dependent policies is not needed to achieve optimality; this is made formal by the following proposition (recall that a policy is deterministic if it always selects the same action for a given state or history).

Proposition 7.2. Let $J(\pi)$ be as in Problem 7.1. There exists a deterministic stationary Markov policy $\pi^* \in \Pi_{\text{MS}}$ such that
$$J(\pi^*) \geq J(\pi) \quad \forall \pi \in \Pi_{\text{H}}.$$

Proposition 7.2 is a central result in reinforcement learning: from a computational point of view, for example, it is easier to deal with deterministic policies (there are finitely many of them) than stochastic policies. Remark 7.1 discusses some other beneficial consequences of Proposition 7.2. Its proof involves a surprising amount of detail; we refer the interested reader to Puterman [2014, Section 6.2.4].

7.2 Value Iteration and Q-Learning

The main consequence of Proposition 7.2 is that when optimising the risk-neutral objective we can restrict our attention to deterministic stationary Markov policies. In turn, this makes it possible to find an optimal policy $\pi^*$ by computing the optimal state-action value function $Q^*$, defined as
$$Q^*(x, a) = \sup_{\pi \in \Pi_{\text{MS}}} \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X = x, A = a\Big]. \tag{7.2}$$
Just as the value function $V^\pi$ for a given policy $\pi$ satisfies the Bellman equation, $Q^*$ satisfies the Bellman optimality equation:
$$Q^*(x, a) = \mathbb{E}\Big[R + \gamma \max_{a' \in \mathcal{A}} Q^*(X', a') \,\Big|\, X = x, A = a\Big]. \tag{7.3}$$
The optimal state-action value function describes the expected return obtained by acting so as to maximise the risk-neutral objective when beginning from the state-action pair $(x, a)$. Intuitively, we may understand Equation 7.3 as describing this maximising behaviour recursively. While there might be multiple optimal policies, they must (by definition) achieve the same objective value in Problem 7.1. This value is
$$\mathbb{E}_\pi[V^*(X_0)],$$
where $V^*$ is the optimal value function:
$$V^*(x) = \max_{a \in \mathcal{A}} Q^*(x, a).$$


Given $Q^*$, an optimal policy is obtained by acting greedily with respect to $Q^*$: that is, choosing in state $x$ any action $a$ that maximises $Q^*(x, a)$.
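As a concrete illustration of greedy action selection, here is a minimal sketch for a tabular $Q$ stored as a NumPy array; this is illustrative code, not an algorithm from the text.

```python
import numpy as np

def greedy_policy(Q):
    """Extract a deterministic greedy policy from a tabular Q of shape
    (num_states, num_actions): in each state, pick an action that
    maximises Q(x, a). Ties are broken by taking the first maximiser."""
    return Q.argmax(axis=1)
```

With multiple maximising actions, any of them yields an optimal policy; `argmax` simply commits to one.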

Value iteration is a procedure for finding the optimal state-action value function $Q^*$ iteratively, from which $\pi^*$ can then be recovered by choosing actions that have maximal state-action values. In Chapter 5 we discussed a procedure for computing $V^\pi$ based on repeated applications of the Bellman operator $T^\pi$. Value iteration replaces the Bellman operator $T^\pi$ in this procedure with the Bellman optimality operator
$$(TQ)(x, a) = \mathbb{E}\Big[R + \gamma \max_{a' \in \mathcal{A}} Q(X', a') \,\Big|\, X = x, A = a\Big]. \tag{7.4}$$
Let us define the $L^\infty$ norm of a state-action value function $Q \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ as
$$\|Q\|_\infty = \sup_{x \in \mathcal{X},\, a \in \mathcal{A}} |Q(x, a)|.$$
As the following establishes, the Bellman optimality operator is a contraction mapping in this norm.

Lemma 7.3. The Bellman optimality operator $T$ is a contraction in $L^\infty$ norm with modulus $\gamma$. That is, for any $Q, Q' \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$,
$$\|TQ - TQ'\|_\infty \leq \gamma \|Q - Q'\|_\infty.$$

Corollary 7.4. The optimal state-action value function $Q^*$ is the only value function that satisfies the Bellman optimality equation. Further, for any $Q_0 \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, the sequence $(Q_k)_{k \geq 0}$ defined by $Q_{k+1} = TQ_k$ (for $k \geq 0$) converges to $Q^*$.

Corollary 7.4 is an immediate consequence of Lemma 7.3 and Proposition 4.7. Before we give the proof of Lemma 7.3, it is instructive to note that despite a visual similarity and the same contractive property in supremum norm, the optimality operator behaves somewhat differently from the fixed-policy operator $T^\pi$, defined for state-action value functions as
$$(T^\pi Q)(x, a) = \mathbb{E}_\pi\big[R + \gamma Q(X', A') \mid X = x, A = a\big], \tag{7.5}$$
where conditional on the random variables $(X = x, A = a, R, X')$, we have $A' \sim \pi(\cdot \mid X')$. In the context of this chapter, we call $T^\pi$ a policy evaluation operator.

Such an operator is said to be affine: for any two Q-functions $Q, Q'$ and $\alpha \in [0, 1]$, it satisfies
$$T^\pi(\alpha Q + (1 - \alpha)Q') = \alpha T^\pi Q + (1 - \alpha)T^\pi Q'. \tag{7.6}$$
Equivalently, the difference between $T^\pi Q$ and $T^\pi Q'$ can be expressed as
$$T^\pi Q - T^\pi Q' = \gamma P^\pi (Q - Q').$$
The optimality operator, on the other hand, is not affine. While affine operators can be analysed almost as if they were linear,⁵⁵ the optimality operator is generally a nonlinear operator. As such, its analysis requires a slightly different approach.

Proof of Lemma 7.3. The proof relies on a special property of the maximum function. For $f_1, f_2 : \mathcal{A} \to \mathbb{R}$, it can be shown that
$$\Big|\max_{a \in \mathcal{A}} f_1(a) - \max_{a \in \mathcal{A}} f_2(a)\Big| \leq \max_{a \in \mathcal{A}} \big|f_1(a) - f_2(a)\big|.$$

Now let $Q, Q' \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, and fix $x \in \mathcal{X}$, $a \in \mathcal{A}$. Let us write $\mathbb{E}_{xa,\pi}[\cdot] = \mathbb{E}_\pi[\cdot \mid X = x, A = a]$. By linearity of expectations, we have
$$\begin{aligned}
\big|(TQ)(x, a) - (TQ')(x, a)\big| &= \Big|\mathbb{E}_{xa,\pi}\Big[R + \gamma \max_{a' \in \mathcal{A}} Q(X', a') - R - \gamma \max_{a' \in \mathcal{A}} Q'(X', a')\Big]\Big| \\
&= \gamma\, \Big|\mathbb{E}_{xa,\pi}\Big[\max_{a' \in \mathcal{A}} Q(X', a') - \max_{a' \in \mathcal{A}} Q'(X', a')\Big]\Big| \\
&\leq \gamma\, \mathbb{E}_{xa,\pi}\Big[\Big|\max_{a' \in \mathcal{A}} Q(X', a') - \max_{a' \in \mathcal{A}} Q'(X', a')\Big|\Big] \\
&\leq \gamma\, \mathbb{E}_{xa,\pi}\Big[\max_{a' \in \mathcal{A}} \big|Q(X', a') - Q'(X', a')\big|\Big] \\
&\leq \gamma \max_{(x', a') \in \mathcal{X} \times \mathcal{A}} \big|Q(x', a') - Q'(x', a')\big| \\
&= \gamma \|Q - Q'\|_\infty.
\end{aligned}$$
Since the bound holds for any $(x, a)$ pair, it follows that
$$\|TQ - TQ'\|_\infty \leq \gamma \|Q - Q'\|_\infty.$$

Corollary 7.5. For any initial state-action value function $Q_0 \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, the sequence of iterates $Q_{k+1} = TQ_k$ converges to $Q^*$ in the $L^\infty$ norm.
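The iteration $Q_{k+1} = TQ_k$ can be sketched directly for a small tabular MDP. The following is a minimal illustrative implementation (not from the text); the transition and reward arrays are hypothetical placeholders supplied by the caller.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration: repeatedly apply the Bellman
    optimality operator T until the sup-norm change falls below tol.

    P: transition probabilities, shape (num_states, num_actions, num_states)
    R: expected immediate rewards, shape (num_states, num_actions)
    """
    num_states, num_actions = R.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        # (TQ)(x, a) = E[R + gamma * max_a' Q(X', a') | X = x, A = a]
        TQ = R + gamma * P @ Q.max(axis=1)
        if np.abs(TQ - Q).max() < tol:  # sup-norm stopping rule
            return TQ
        Q = TQ
```

By Lemma 7.3, the sup-norm distance to $Q^*$ shrinks by a factor $\gamma$ per sweep, so the loop terminates for any $\gamma < 1$.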

We can use the unbiased estimation method of Section 6.2 to derive an incremental algorithm for learning $Q^*$, since the contractive operator $T$ is expressible as an expectation over a sample transition. Given a realisation $(x, a, r, x')$, we construct the sample target
$$r + \gamma \max_{a' \in \mathcal{A}} Q(x', a').$$

55. Consider the proof of Proposition 4.4.

We then incorporate this target into an update rule to obtain the Q-learning algorithm first encountered in Chapter 3:
$$Q(x, a) \leftarrow (1 - \alpha)Q(x, a) + \alpha\Big(r + \gamma \max_{a' \in \mathcal{A}} Q(x', a')\Big).$$
Under appropriate conditions, the convergence of Q-learning to $Q^*$ can be established with Lemma 7.3 and a suitable extension of the analysis of Chapter 6 to the space of action-value functions.

7.3 Distributional Value Iteration

Analogous to value iteration, we can devise a distributional dynamic programming procedure for the risk-neutral control problem; such a procedure determines an approximation to the return function of an optimal policy. As we will see, in some circumstances this can be accomplished without complications, giving some theoretical justification for the distributional algorithm presented in Section 3.7.

As in Chapter 5, we perform distributional dynamic programming by implementing the combination of a projection step with a distributional optimality operator.⁵⁶ Because it is not possible to "directly maximise" a probability distribution, we instead define the operator via a greedy selection rule $\mathcal{G}$.

We can view the expected-value optimality operator $T$ as substituting the expectation over the next-state action $A'$ in Equation 7.5 by a maximisation over $a'$. As such, it can be rewritten as a particular policy evaluation operator $T^\pi$ whose policy $\pi$ depends on the operand $Q$; the mapping from $Q$ to $\pi$ is what we call a greedy selection rule.

Definition 7.6. A greedy selection rule is a mapping $\mathcal{G} : \mathbb{R}^{\mathcal{X} \times \mathcal{A}} \to \Pi_{\text{MS}}$ with the property that for any $Q \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$, $\mathcal{G}(Q)$ is greedy with respect to $Q$. That is,
$$\mathcal{G}(Q)(a \mid x) > 0 \implies Q(x, a) = \max_{a' \in \mathcal{A}} Q(x, a').$$

56. As before, this implementation is at its simplest when there are finitely many possible states, actions, and rewards, and the projection step can be computed efficiently. Alternatives include a sample-based approach and, as we will see in Chapter 9, function approximation.


We extend $\mathcal{G}$ to return functions by defining, for $\eta \in \mathscr{P}_1(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, the induced state-action value function
$$Q_\eta(x, a) = \mathbb{E}_{Z \sim \eta(x,a)}[Z],$$
and then letting
$$\mathcal{G}(\eta) = \mathcal{G}(Q_\eta).$$
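This extension can be sketched concretely for categorical return distributions; the representation below (a dict of support/probability arrays) is a hypothetical convention chosen for illustration, not from the text.

```python
import numpy as np

def greedy_from_returns(eta, actions):
    """Apply a greedy selection rule to a return function.

    eta: dict mapping (state, action) -> (support, probs), a categorical
         return distribution given as two equal-length numpy arrays.
    Returns a deterministic policy dict state -> action, greedy with
    respect to the induced Q_eta(x, a) = E_{Z ~ eta(x, a)}[Z].
    """
    states = {x for (x, _) in eta}
    policy = {}
    for x in states:
        # Induced state-action values: the mean of each distribution.
        q = {a: float(np.dot(*eta[(x, a)])) for a in actions}
        policy[x] = max(q, key=q.get)  # ties broken arbitrarily
    return policy
```

Note that only the means of the distributions matter for selection; the rest of the distribution is carried along unchanged by the backup.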

A greedy selection rule may produce stochastic policies, for example when assigning equal probability to two equally-valued actions. However, it must select actions which are maximally valued according to $Q$. Given a greedy selection rule, we may rewrite the Bellman optimality operator as
$$TQ = T^{\mathcal{G}(Q)} Q. \tag{7.7}$$
In the distributional setting, we must make explicit the dependency of the operator on the greedy selection rule $\mathcal{G}$. Mirroring Equation 7.7, the distributional Bellman optimality operator derived from $\mathcal{G}$ is
$$\mathcal{T}_{\mathcal{G}}\, \eta = \mathcal{T}^{\mathcal{G}(\eta)} \eta.$$
We will see that, unlike in the expected-value setting, different choices of the greedy selection rule result in different operators with possibly different dynamics; we thus speak of the distributional Bellman optimality operators, in the plural.

Distributional value iteration algorithms combine a greedy selection rule $\mathcal{G}$ and a projection step (described by the operator $\Pi_{\mathcal{F}}$ and implying a particular choice of representation $\mathcal{F}$) to compute an approximate optimal return function. Given some initial return function $\eta_0 \in \mathcal{F}^{\mathcal{X} \times \mathcal{A}}$, distributional value iteration implements the iterative procedure
$$\eta_{k+1} = \Pi_{\mathcal{F}}\, \mathcal{T}^{\mathcal{G}(\eta_k)} \eta_k. \tag{7.8}$$
In words, distributional value iteration selects a policy which at each state $x$ is greedy with respect to the expected values of $\eta_k(x, \cdot)$, and computes the return function resulting from a single step of distributional dynamic programming with that policy.

The induced value function $Q_{\eta_k}$ plays an important role in distributional value iteration as it is used to derive the greedy policy $\pi_k = \mathcal{G}(\eta_k)$. When $\Pi_{\mathcal{F}}$ is mean-preserving (Section 5.11), $Q_{\eta_k}$ behaves as if it had been computed from standard value iteration (Figure 7.1). That is,
$$Q_{\eta_{k+1}} = TQ_{\eta_k}.$$


Figure 7.1
When the projection step $\Pi_{\mathcal{F}}$ is mean-preserving, distributional value iteration produces the same sequence of state-action value functions as standard value iteration.

By induction, distributional value iteration then produces the same sequence of state-action value functions as regular value iteration. That is, given the initial condition
$$Q_0(x, a) = \mathbb{E}_{Z \sim \eta_0(x,a)}[Z], \qquad (x, a) \in \mathcal{X} \times \mathcal{A},$$
and the recursion
$$Q_{k+1} = TQ_k,$$
we have
$$Q_{\eta_k} = Q_k \quad \text{for all } k \geq 0.$$
As a consequence, these two processes also produce the same greedy policies:
$$\pi_k = \mathcal{G}(Q_k).$$

In the following section we will use this equivalence to argue that distributional value iteration finds an approximation to an optimal return function. When $\Pi_{\mathcal{F}}$ is not mean-preserving, however, $Q_{\eta_k}$ may deviate from $Q_k$. If that is the case, the greedy policy $\pi_k$ is likely to be different from $\mathcal{G}(Q_k)$, and distributional value iteration may converge to the return function of a suboptimal policy. We discuss this point in Remark 7.2.

Before moving on, let us remark on an alternative procedure for approximating an optimal return function. This procedure first performs standard value iteration to obtain an approximation $\hat{Q}$ to $Q^*$. A greedy policy $\hat{\pi}$ is then extracted from $\hat{Q}$. Finally, distributional dynamic programming is used to approximate the return function $\eta^{\hat{\pi}}$. If $\hat{\pi}$ is an optimal policy, this directly achieves our stated aim, and suggests doing away with the added complexity of distributional value iteration. In larger problems, however, it is difficult or undesirable to wait until $Q^*$ has been determined before learning or computing the return function, or it may not be possible to decouple value and return predictions. In these situations it is sensible to consider distributional value iteration.


7.4 Dynamics of Distributional Optimality Operators

In this section and the next, we analyse the behaviour of distributional optimality operators; recall from Section 7.3 that there is one such operator for each choice of greedy selection rule $\mathcal{G}$, in contrast to non-distributional value iteration. Combined with a mean-preserving projection, our analysis also informs the behaviour of distributional value iteration. As we will see, even in the absence of approximation due to a finite-parameter representation, distributional optimality operators exhibit complex behaviour.

Thus far, we have analysed distributional dynamic programming algorithms by appealing to contraction mapping theory. Demonstrating that an operator is a contraction provides a good deal of understanding about algorithms implementing its repeated application: they converge at a geometric rate to the operator's fixed point (when such a fixed point exists). Unfortunately, distributional optimality operators are not contraction mappings, as the following shows.

Proposition 7.7. Consider a probability metric $d$ and let $\overline{d}$ be its supremum extension. Suppose that for any $z, z' \in \mathbb{R}$, $d(\delta_z, \delta_{z'}) < \infty$. If $d$ is $c$-homogeneous, there exist a Markov decision process and two return-distribution functions $\eta, \eta'$ such that for any greedy selection rule $\mathcal{G}$ and any discount factor $\gamma \in (0, 1]$,
$$\overline{d}(\mathcal{T}_{\mathcal{G}}\eta, \mathcal{T}_{\mathcal{G}}\eta') > \overline{d}(\eta, \eta'). \tag{7.9}$$

Proof. Consider an MDP with two non-terminal states $x, y$ and two actions $a, b$ (Figure 7.2). From state $x$ all actions transition to state $y$ and yield no reward. In state $y$, action $a$ results in a reward of $\varepsilon > 0$, while action $b$ results in a reward that is $-1$ or $1$ with equal probability; both actions are terminal. We will argue that $\varepsilon$ can be chosen to make $\mathcal{T}_{\mathcal{G}}$ satisfy Equation 7.9.

First note that any optimal policy must choose action $a$ in state $y$, and all optimal policies share the same return-distribution function $\eta^*$. Consider another return-distribution function $\eta$ that is equal to $\eta^*$ at all state-action pairs except that $\eta(y, a) = \delta_{-\varepsilon}$ (Figure 7.2, right). This implies that the greedy selection rule must select $b$ in $y$, and hence
$$(\mathcal{T}_{\mathcal{G}}\eta)(x, a) = (\mathcal{T}_{\mathcal{G}}\eta)(x, b) = (b_{0,\gamma})_\#\, \eta(y, b) = (b_{0,\gamma})_\#\, \eta^*(y, b).$$


$$\begin{array}{c|ccc}
 & \eta & \mathcal{T}_{\mathcal{G}}\eta & \eta^* \\
\hline
(x, \cdot) & \gamma\varepsilon & \pm\gamma & \gamma\varepsilon \\
(y, a) & -\varepsilon & \varepsilon & \varepsilon \\
(y, b) & \pm 1 & \pm 1 & \pm 1
\end{array}$$

Figure 7.2
Left. A Markov decision process for which no distributional Bellman optimality operator is a contraction mapping. Right. The optimal return-distribution function $\eta^*$ and initial return-distribution function $\eta$ used in the proof of Proposition 7.7 (expressed in terms of their support). Given a $c$-homogeneous probability metric $d$, the proof chooses $\varepsilon$ so as to make $\overline{d}(\mathcal{T}_{\mathcal{G}}\eta, \eta^*) > \overline{d}(\eta, \eta^*)$.

Because both actions are terminal from $y$, we also have that
$$(\mathcal{T}_{\mathcal{G}}\eta)(y, a) = \eta^*(y, a), \qquad (\mathcal{T}_{\mathcal{G}}\eta)(y, b) = \eta^*(y, b).$$

Let us write $\nu = \tfrac{1}{2}\delta_{-1} + \tfrac{1}{2}\delta_1$ for the reward distribution associated with $(y, b)$. We have
$$\overline{d}(\eta, \eta^*) = d\big(\eta(y, a), \eta^*(y, a)\big) = d(\delta_{-\varepsilon}, \delta_\varepsilon), \tag{7.10}$$
and
$$\begin{aligned}
\overline{d}(\mathcal{T}_{\mathcal{G}}\eta, \mathcal{T}_{\mathcal{G}}\eta^*) &= d\big((\mathcal{T}_{\mathcal{G}}\eta)(x, a), (\mathcal{T}_{\mathcal{G}}\eta^*)(x, a)\big) \\
&= d\big((b_{0,\gamma})_\#\, \eta^*(y, b),\ (b_{0,\gamma})_\#\, \eta^*(y, a)\big) \\
&= \gamma^c\, d(\nu, \delta_\varepsilon), \tag{7.11}
\end{aligned}$$

where the last line follows by $c$-homogeneity (Definition 4.22). We will show that for sufficiently small $\varepsilon > 0$ we have
$$d(\nu, \delta_\varepsilon) > \gamma^{-c}\, d(\delta_{-\varepsilon}, \delta_\varepsilon),$$
from which the result follows by Equations 7.10 and 7.11. To this end, note that
$$d(\delta_{-\varepsilon}, \delta_\varepsilon) = \varepsilon^c\, d(\delta_{-1}, \delta_1). \tag{7.12}$$
Since $d$ is finite for any pair of Dirac deltas, we have that $d(\delta_{-1}, \delta_1) < \infty$ and so
$$\lim_{\varepsilon \to 0} d(\delta_{-\varepsilon}, \delta_\varepsilon) = 0. \tag{7.13}$$


On the other hand, by the triangle inequality we have
$$d(\nu, \delta_\varepsilon) + d(\delta_0, \delta_\varepsilon) \geq d(\nu, \delta_0) > 0, \tag{7.14}$$
where the second inequality follows because $\nu$ and $\delta_0$ are different distributions. Again by $c$-homogeneity of $d$, we deduce that
$$\lim_{\varepsilon \to 0} d(\delta_0, \delta_\varepsilon) = 0,$$
and hence
$$\liminf_{\varepsilon \to 0}\, d(\nu, \delta_\varepsilon) \geq d(\nu, \delta_0) > 0.$$
Therefore, for $\varepsilon > 0$ sufficiently small, we have
$$d(\nu, \delta_\varepsilon) > \gamma^{-c}\, d(\delta_{-\varepsilon}, \delta_\varepsilon),$$
as required.

Proposition 7.7 establishes that for any metric $d$ that is sufficiently well-behaved to guarantee that policy evaluation operators are contraction mappings (finite on bounded-support distributions, $c$-homogeneous), optimality operators are not generally contraction mappings with respect to $\overline{d}$. As a consequence, we cannot directly apply the tools of Chapter 4 to characterise distributional optimality operators. The issue, which is implied in the proof above, is that it is possible for distributions to have similar expectations yet differ substantially, for example in their variance.

With a more careful line of reasoning, we can still identify situations in which the iterates $\eta_{k+1} = \mathcal{T}_{\mathcal{G}}\eta_k$ do converge. The most common scenario is when there is a unique optimal policy. In this case, the analysis is simplified by the existence of an action gap in the optimal value function.⁵⁷

Definition 7.8. Let $Q \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$. The action gap at a state $x$ is the difference between the highest-valued and second highest-valued actions:
$$\text{GAP}(Q, x) = \min\Big\{Q(x, a^*) - Q(x, a) : a^*, a \in \mathcal{A},\ a^* \neq a,\ Q(x, a^*) = \max_{a' \in \mathcal{A}} Q(x, a')\Big\}.$$
The action gap of $Q$ is the smallest gap over all states:
$$\text{GAP}(Q) = \min_{x \in \mathcal{X}} \text{GAP}(Q, x).$$
By extension, the action gap for a return function $\eta$ is
$$\text{GAP}(\eta) = \min_{x \in \mathcal{X}} \text{GAP}(Q_\eta, x), \qquad Q_\eta(x, a) = \mathbb{E}_{Z \sim \eta(x,a)}[Z].$$

57. Example 6.1 introduced an update rule that increases this action gap.
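For a tabular $Q$, the action gap reduces to a per-state difference between the two largest entries, as this minimal sketch (illustrative, not from the text) shows:

```python
import numpy as np

def action_gap(Q):
    """Smallest action gap of a tabular Q of shape (num_states, num_actions),
    with at least two actions. GAP(Q, x) is the difference between the
    highest and second-highest action values at x; GAP(Q) is the minimum
    of these gaps over states."""
    sorted_q = np.sort(Q, axis=1)            # ascending within each state
    gaps = sorted_q[:, -1] - sorted_q[:, -2]  # top minus second-best
    return float(gaps.min())
```

Note that the minimum in Definition 7.8 is attained by the second-highest action, which is why sorting suffices; ties at the top correctly yield a gap of zero.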


Definition 7.8 is such that if two actions are optimal at a given state, then $\text{GAP}(Q^*) = 0$. When $\text{GAP}(Q^*) > 0$, however, there is a unique action that is optimal at each state (and vice versa). In this case, we can identify an iteration $K$ from which $\mathcal{G}(\eta_k) = \pi^*$ for all $k \geq K$. From that point on, any distributional optimality operator reduces to the $\pi^*$ evaluation operator, which enjoys now-familiar convergence guarantees.

Theorem 7.9. Let $\mathcal{T}_{\mathcal{G}}$ be the distributional Bellman optimality operator instantiated with a greedy selection rule $\mathcal{G}$. Suppose that there is a unique optimal policy $\pi^*$ and let $p \in [1, \infty]$. Under Assumption 4.29($p$) (well-behaved reward distributions), for any initial return-distribution function $\eta_0 \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, the sequence of iterates defined by
$$\eta_{k+1} = \mathcal{T}_{\mathcal{G}}\eta_k$$
converges to $\eta^{\pi^*}$ with respect to the metric $\overline{w}_p$.

Theorem 7.9 is stated in terms of supremum $p$-Wasserstein distances for conciseness. Following the conditions of Theorem 4.25 and under a different set of assumptions, we can of course also establish convergence in, say, the supremum Cramér distance.

Corollary 7.10. Suppose that $\Pi_{\mathcal{F}}$ is a mean-preserving projection for some representation $\mathcal{F}$, and that there is a unique optimal policy $\pi^*$. If Assumption 5.22($w_p$) holds and $\Pi_{\mathcal{F}}$ is a non-expansion in $\overline{w}_p$, then under the conditions of Theorem 7.9 the sequence
$$\eta_{k+1} = \Pi_{\mathcal{F}}\, \mathcal{T}_{\mathcal{G}}\eta_k$$
produced by distributional value iteration converges to the fixed point $\hat\eta^{\pi^*} = \Pi_{\mathcal{F}}\, \mathcal{T}^{\pi^*} \hat\eta^{\pi^*}$.

Proof of Theorem 7.9. As there is a unique optimal policy, it must be that the action gap of $Q^*$ is strictly greater than zero. Fix $\varepsilon = \tfrac{1}{2}\text{GAP}(Q^*)$. Following the discussion of the previous section, we have that
$$Q_{\eta_{k+1}} = TQ_{\eta_k}.$$
Because $T$ is a contraction in $L^\infty$ norm, we know that there exists a $K_\varepsilon \in \mathbb{N}$ after which
$$\|Q_{\eta_k} - Q^*\|_\infty < \varepsilon \quad \forall k \geq K_\varepsilon. \tag{7.15}$$


For a fixed $x$, let $a^*$ be the optimal action in that state. From Equation 7.15 we deduce that for any $a \neq a^*$,
$$\begin{aligned}
Q_{\eta_k}(x, a^*) &\geq Q^*(x, a^*) - \varepsilon \\
&\geq Q^*(x, a) + \text{GAP}(Q^*) - \varepsilon \\
&> Q_{\eta_k}(x, a) + \text{GAP}(Q^*) - 2\varepsilon \\
&= Q_{\eta_k}(x, a).
\end{aligned}$$
Thus the greedy action in state $x$ after time $K_\varepsilon$ is the optimal action for that state. It follows that $\mathcal{G}(\eta_k) = \pi^*$ for $k \geq K_\varepsilon$ and
$$\eta_{k+1} = \mathcal{T}^{\pi^*} \eta_k \quad \forall k \geq K_\varepsilon.$$
We can treat $\eta_{K_\varepsilon}$ as a new initial condition $\eta_0'$, and apply Proposition 4.30 to conclude that $\eta_k \to \eta^{\pi^*}$.

In Section 3.7 we introduced the categorical Q-learning algorithm in terms of a deterministic greedy policy. Generalised to an arbitrary greedy selection rule $\mathcal{G}$, categorical Q-learning is defined by the update
$$\eta(x, a) \leftarrow (1 - \alpha)\eta(x, a) + \alpha \sum_{a' \in \mathcal{A}} \mathcal{G}(\eta)(a' \mid x')\, \Pi_{\text{C}}\, (b_{r,\gamma})_\#\, \eta(x', a').$$
Because the categorical projection is mean-preserving, its induced state-value function follows the update
$$Q_\eta(x, a) \leftarrow (1 - \alpha)Q_\eta(x, a) + \alpha \underbrace{\sum_{a' \in \mathcal{A}} \mathcal{G}(\eta)(a' \mid x')\big(r + \gamma Q_\eta(x', a')\big)}_{=\, r + \gamma \max_{a' \in \mathcal{A}} Q_\eta(x', a')}.$$

Using the tools of Chapter 6, one can establish the convergence of categorical Q-learning under certain conditions, including the assumption that there is a unique optimal policy $\pi^*$. The proof essentially combines the insight that under certain conditions the sequence of greedy policies tracked by the algorithm matches that of a non-distributional Q-learning algorithm, and hence eventually converges to $\pi^*$, at which point the algorithm is essentially performing categorical policy evaluation of $\pi^*$. The actual proof is somewhat technical; we omit it here and refer the interested reader to Rowland et al. [2018].
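The categorical Q-learning update above can be sketched for a deterministic greedy selection rule and a fixed support. This is an illustrative implementation under those assumptions (the array layout and function name are hypothetical conventions, not from the text); the projection step is the usual triangular-kernel categorical projection onto the support.

```python
import numpy as np

def categorical_q_update(eta, x, a, r, x_next, z, alpha=0.1, gamma=0.99):
    """One categorical Q-learning update with a deterministic greedy rule.

    eta: array (num_states, num_actions, m) of probabilities over the
         fixed, increasing support z of shape (m,); modified in place.
    """
    # Greedy action at x' under the induced values Q_eta(x', .) = eta @ z.
    q_next = eta[x_next] @ z
    a_star = int(np.argmax(q_next))
    # Push the support through b_{r,gamma}(u) = r + gamma * u,
    # then project each atom back onto z (categorical projection).
    target = np.zeros_like(z)
    for u, p in zip(r + gamma * z, eta[x_next, a_star]):
        u = np.clip(u, z[0], z[-1])
        i = np.searchsorted(z, u, side='right') - 1
        if i == len(z) - 1:               # lands exactly on the last atom
            target[i] += p
        else:                             # split mass between neighbours
            w = (u - z[i]) / (z[i + 1] - z[i])
            target[i] += p * (1 - w)
            target[i + 1] += p * w
    eta[x, a] = (1 - alpha) * eta[x, a] + alpha * target
```

Because the projection preserves means, the induced values $Q_\eta$ follow the ordinary Q-learning update displayed above.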

7.5 Dynamics in the Presence of Multiple Optimal Policies*

In the value-based setting, it does not matter which greedy selection rule is used to represent the optimality operator: by definition, any greedy selection rule must be equivalent to directly maximising over $Q(x, \cdot)$. In the distributional setting, however, different rules usually result in different operators. As a concrete example, compare the rule "among all actions whose expected value is maximal, pick the one with smallest variance" to "assign equal probability to actions whose expected value is maximal".

Theorem 7.9 relies on the fact that, when there is a unique optimal policy $\pi^*$, we can identify a time after which the distributional optimality operator behaves like a policy evaluation operator. When there are multiple optimal policies, however, the action gap of the optimal value function $Q^*$ is zero and the argument cannot be used. To understand why this is problematic, it is useful to write the iterates $(\eta_k)_{k \geq 0}$ more explicitly in terms of the policies $\pi_k = \mathcal{G}(\eta_k)$:
$$\eta_{k+1} = \mathcal{T}^{\pi_k} \eta_k = \mathcal{T}^{\pi_k} \mathcal{T}^{\pi_{k-1}} \eta_{k-1} = \mathcal{T}^{\pi_k} \cdots \mathcal{T}^{\pi_0} \eta_0. \tag{7.16}$$

When the action gap is zero, the sequence of policies $\pi_k, \pi_{k+1}, \dots$ may continue to vary over time, depending on the greedy selection rule. Although all optimal actions have the same optimal value (guaranteeing the convergence of the expected values to $Q^*$), they may correspond to different distributions. Thus distributional value iteration, even with a mean-preserving projection, may mix together the distributions of different optimal policies. Even if $(\pi_k)_{k \geq 0}$ converges, the policy it converges to may depend on initial conditions (Exercise 7.5). In the worst case, the iterates $(\eta_k)_{k \geq 0}$ might not even converge, as the following example shows.

Example 7.11 (Failure to converge). Consider a Markov decision process with a single non-terminal state $x$ and two actions, $a$ and $b$ (Figure 7.3, left). Action $a$ gives a reward of 0 and leaves the state unchanged, while action $b$ gives a reward of $-1$ or $1$ with equal probability and leads to a terminal state. Note that the expected return from taking either action is 0.

Let $\gamma \in (0, 1)$. We will exhibit a sequence of return function estimates $(\eta_k)_{k \geq 0}$ that is produced by distributional value iteration and does not have a limit. We do so by constructing a greedy selection rule that achieves the desired behaviour. Suppose that
$$\eta_0(x, b) = \tfrac{1}{2}(\delta_{-1} + \delta_1).$$
For any initial parameter $c_0 \in \mathbb{R}$, if
$$\eta_0(x, a) = \tfrac{1}{2}(\delta_{-c_0} + \delta_{c_0}),$$
then by induction there is a sequence of scalars $(c_k)_{k \geq 0}$ such that
$$\eta_k(x, a) = \tfrac{1}{2}(\delta_{-c_k} + \delta_{c_k}) \quad \forall k \in \mathbb{N}. \tag{7.17}$$


Figure 7.3
Left. A Markov decision process in which the sequence of return function estimates $(\eta_k)_{k \geq 0}$ does not converge (Example 7.11). Right. The return-distribution function estimate at $(x, a)$ as a function of $k$, beginning with $c_0 = 1$. At each $k$, the pair of dots indicates the support of the distribution.

This is because
$$\eta_{k+1}(x, a) = (b_{0,\gamma})_\#\, \eta_k(x, a) \quad \text{or} \quad \eta_{k+1}(x, a) = (b_{0,\gamma})_\#\, \eta_k(x, b), \qquad k \geq 0.$$
Define the following greedy selection rule: at iteration $k + 1$, choose $a$ if $c_k \geq \tfrac{1}{10}$, otherwise choose $b$. With some algebra, this leads to the recursion ($k \geq 0$)
$$c_{k+1} = \begin{cases} \gamma & \text{if } c_k < \tfrac{1}{10}, \\ \gamma c_k & \text{otherwise.} \end{cases}$$
Over a period of multiple iterations, the estimate $\eta_k(x, a)$ exhibits cyclical behaviour (Figure 7.3, right).
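The cyclical behaviour can be reproduced numerically by iterating the recursion for $c_k$; the following is a small illustrative sketch (not from the text).

```python
def support_half_width(c0, gamma, steps):
    """Iterate c_{k+1} = gamma if c_k < 1/10 else gamma * c_k,
    the recursion governing the support {-c_k, +c_k} of the
    estimate eta_k(x, a) in the non-convergence example."""
    cs = [c0]
    for _ in range(steps):
        c = cs[-1]
        cs.append(gamma if c < 0.1 else gamma * c)
    return cs

# With gamma = 0.5 and c_0 = 1, the sequence shrinks geometrically
# until it drops below 1/10, then resets to gamma and repeats.
```

For $\gamma = 0.5$ the iterates settle into the cycle $0.5, 0.25, 0.125, 0.0625$, so the sequence has no limit.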

The example illustrates how, without additional constraints on the greedy selection rule, it is not possible to guarantee that the iterates converge. However, one can prove a weaker result based on the fact that only optimal actions must eventually be chosen.

Definition 7.12. For a given Markov decision process $\mathcal{M}$, the set of non-stationary Markov optimal return-distribution functions is
$$\eta_{\text{NMO}} = \big\{\eta^{\bar\pi} : \bar\pi = (\pi_k)_{k \geq 0},\ \pi_k \in \Pi_{\text{MS}} \text{ is optimal for } \mathcal{M},\ \forall k \in \mathbb{N}\big\}.$$

In particular, any history-dependent policy $\bar\pi$ satisfying the definition above is also optimal for $\mathcal{M}$.


Theorem 7.13. Let $\mathcal{T}_{\mathcal{G}}$ be a distributional Bellman optimality operator instantiated with some greedy selection rule, and let $p \in [1, \infty]$. Under Assumption 4.29($p$), for any initial condition $\eta_0 \in \mathscr{P}_p(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$, the sequence of iterates $\eta_{k+1} = \mathcal{T}_{\mathcal{G}}\eta_k$ converges to the set $\eta_{\text{NMO}}$, in the sense that
$$\lim_{k \to \infty}\, \inf_{\eta \in \eta_{\text{NMO}}} \overline{w}_p(\eta, \eta_k) = 0.$$

Proof. Along the lines of the proof of Theorem 7.9, there must be a time $K \in \mathbb{N}$ after which the greedy policies $\pi_k$, $k \geq K$, are optimal. For $l \in \mathbb{N}^+$, let us construct the history-dependent policy
$$\bar\pi = \big(\pi_{K+l-1}, \pi_{K+l-2}, \dots, \pi_{K+1}, \pi_K, \pi^*, \pi^*, \dots\big),$$

where $\pi^*$ is some stationary Markov optimal policy. Denote the return of this policy from the state-action pair $(x, a)$ by $G^{\bar\pi}(x, a)$, and its return-distribution function by $\eta^{\bar\pi}$. Because $\bar\pi$ is an optimal policy, we have that $\eta^{\bar\pi} \in \eta_{\text{NMO}}$. Let $G_k$ be an instantiation of $\eta_k$, for each $k \in \mathbb{N}$. Now for a fixed $x \in \mathcal{X}$, $a \in \mathcal{A}$, let $(X_t, A_t, R_t)_{t=0}^{l}$ be the initial segment of a random trajectory generated by following $\bar\pi$ beginning with $X_0 = x$, $A_0 = a$. More precisely, for $t = 1, \dots, l$ we have
$$A_t \mid (X_{0:t}, A_{0:t-1}, R_{0:t-1}) \sim \pi_{K+l-t}(\cdot \mid X_t).$$

From this, we write
$$G^{\bar\pi}(x, a) \overset{D}{=} \sum_{t=0}^{l-1} \gamma^t R_t + \gamma^l\, G^{\pi^*}(X_l, A_l).$$

Because $\eta_{K+l} = \mathcal{T}^{\pi_{K+l-1}} \cdots \mathcal{T}^{\pi_K} \eta_K$, by inductively applying Proposition 4.11 we also have
$$G_{K+l}(x, a) \overset{D}{=} \sum_{t=0}^{l-1} \gamma^t R_t + \gamma^l\, G_K(X_l, A_l).$$

Hence
$$\begin{aligned}
w_p\big(\eta_{K+l}(x, a),\, \eta^{\bar\pi}(x, a)\big) &= w_p\big(G_{K+l}(x, a),\, G^{\bar\pi}(x, a)\big) \\
&= w_p\Big(\sum_{t=0}^{l-1} \gamma^t R_t + \gamma^l\, G_K(X_l, A_l),\ \sum_{t=0}^{l-1} \gamma^t R_t + \gamma^l\, G^{\pi^*}(X_l, A_l)\Big).
\end{aligned}$$

Consequently,
$$\begin{aligned}
w_p\big(\eta_{K+l}(x, a),\, \eta^{\bar\pi}(x, a)\big) &\leq w_p\big(\gamma^l\, G_K(X_l, A_l),\ \gamma^l\, G^{\pi^*}(X_l, A_l)\big) \\
&= \gamma^l\, w_p\big(G_K(X_l, A_l),\ G^{\pi^*}(X_l, A_l)\big) \\
&\leq \gamma^l\, \overline{w}_p\big(G_K, G^{\pi^*}\big),
\end{aligned}$$
following the arguments in the proof of Proposition 4.15. The result now follows by noting that $\overline{w}_p(G_K, G^{\pi^*}) < \infty$ under our assumptions, taking the supremum over $(x, a)$ on the left-hand side, and taking the limit with respect to $l$.

Theorem 7.13 shows that, even before the effect of the projection step is taken into account, the behaviour of distributional value iteration is in general quite complex. When the iterates $\eta_k$ are approximated (for example, because they are estimated from samples), non-stationary Markov return functions may also be produced by distributional value iteration, even when there is a unique optimal policy.

It may appear that the convergence issues highlighted by Example 7.11 and

Theorem 7.13 are consequences of using the wrong greedy selection rule. To

address these issues, one may be tempted to impose an ordering on policies (for

example, always prefer the action with the lowest variance, at equal expected

values). However, it is not clear how to do this in a satisfying way. One hurdle is

that, to avoid the cyclical behaviour demonstrated in Example 7.11, we would

like a greedy selection rule which is continuous with respect to its input. This

seems problematic, however, since we also need this rule to return the correct

answer when there is a unique optimal (and thus deterministic) policy (Exercise

7.7). This suggests that when learning to control with a distributional approach,

the learned return distributions may simultaneously reﬂect the random returns

from multiple policies.

7.6 Risk and Risk-Sensitive Control

Imagine being invited to interview for a desirable position at a prestigious

research institute abroad. Your plane tickets and the hotel have been booked

weeks in advance. Now the night before an early morning ﬂight, you make

arrangements for a cab to pick you up from your house and take you to the

airport. How long in advance of your plane’s actual departure do you request

the cab for? If someone tells you that, on average, a cab to your local airport

takes an hour – is that sufﬁcient information to make the booking? How does

your answer change when the ﬂight is scheduled around rush hour, rather than

early morning?

Fundamentally, it is often desirable that our choices be informed by the variabil-

ity in the process that produces outcomes from these choices. In this context,

we call this variability risk. Risk may be inherent to the process, or may arise from incomplete knowledge about the state of the world (including any potential traffic jams, and the mechanical condition of the hired cab).

In contrast to risk-neutral behaviour, decisions that take risk into account are

called risk-sensitive. The language of distributional reinforcement learning is

particularly well-suited for this purpose, since it lets us reason about the full

spectrum of outcomes, along with their associated probabilities. The rest of

the chapter gives an overview of how one may account for risk in the decision-

making process and of the computational challenges that arise when doing so.

Rather than be exhaustive, here we take the much more modest aim of exposing

the reader to some of the major themes in risk-sensitive control and their relation

to distributional reinforcement learning; references to more extensive surveys

are provided in the bibliographical remarks.

Recall that the risk-neutral objective is to maximise the expected return from the (possibly random) initial state X_0:

J(π) = 𝔼_π[ ∑_{t=0}^{∞} γ^t R_t ] = 𝔼_π[ G^π(X_0) ].

Here, we may think of the expectation as mapping the random variable G^π(X_0) to a scalar. Risk-sensitive control is the problem that arises when we replace this expectation by a risk measure.

Definition 7.14. A risk measure⁵⁸ is a mapping

ρ : 𝒫_ρ(ℝ) → [–∞, ∞),

defined on a subset 𝒫_ρ(ℝ) ⊆ 𝒫(ℝ) of probability distributions. By extension, for a random variable Z instantiating the distribution ν, we write ρ(Z) = ρ(ν). △

Problem 7.15 (Risk-sensitive control). Given an MDP (𝒳, 𝒜, ξ_0, P_X, P_R), a discount factor γ ∈ [0, 1), and a risk measure ρ, find a policy π ∈ Π_H maximising⁵⁹

J_ρ(π) = ρ( ∑_{t=0}^{∞} γ^t R_t ).   (7.18)

△

58. More precisely, this is a static risk measure, in that it is only concerned with the return from time t = 0. See bibliographical remarks.

59. Typically, Problem 7.15 is formulated in terms of a risk to be minimised, which linguistically is a more natural objective. Here, however, we consider the maximisation of J_ρ(π) so as to keep the presentation unified with the rest of the book.

Figure 7.4
Illustration of value-at-risk (VaR) and conditional value-at-risk (CVaR). Depicted is the cumulative distribution function of the mixture of normal distributions ν = ½N(0, 1) + ½N(4, 1). The dashed line corresponds to VaR; CVaR (τ = 0.4) can be determined from a suitable transformation of the shaded area (see Section 7.8 and Exercise 7.10).

In Problem 7.15, we assume that the distribution of the random return lies in 𝒫_ρ(ℝ), similar to our treatment of probability metrics in Chapter 4. From a technical perspective, subsequent examples and results should be interpreted with this assumption in mind.

The risk measure ρ may take into account higher-order moments of the return distribution, be sensitive to rare events, and even disregard the expected value altogether. Note that according to this definition, ρ = 𝔼 also corresponds to a risk-sensitive control problem. However, we reserve the term for risk measures that are sensitive to more than only expected values.

Example 7.16 (Mean-variance criterion). Let λ > 0. The variance-penalised risk measure penalises high-variance outcomes:

ρ_MV(Z) = 𝔼[Z] – λ Var(Z). △

Example 7.17 (Entropic risk). Let β > 0. Entropic risk puts more weight on smaller-valued outcomes:

ρ_ER(Z) = –(1/β) log 𝔼[e^{–βZ}]. △

Example 7.18 (Value-at-risk). Let τ ∈ (0, 1). The value-at-risk measure (Figure 7.4) corresponds to the τth quantile of the return distribution:

ρ^τ_VaR(Z) = F_Z^{–1}(τ). △
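Each of these risk measures can be estimated from a sample of returns. The following sketch is our own illustration (the estimator functions and their names are not from the text); for a Gaussian return, the entropic risk has the closed form μ – βσ²/2, which the estimate should approach:

```python
import numpy as np

def mean_variance(z, lam):
    # rho_MV(Z) = E[Z] - lam * Var(Z)
    return z.mean() - lam * z.var()

def entropic_risk(z, beta):
    # rho_ER(Z) = -(1/beta) * log E[exp(-beta * Z)]
    return -np.log(np.exp(-beta * z).mean()) / beta

def value_at_risk(z, tau):
    # rho_VaR^tau(Z) = F_Z^{-1}(tau), the tau-th quantile
    return np.quantile(z, tau)

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=100_000)  # mu = 1, sigma = 2

print(mean_variance(z, lam=0.1))   # ≈ 1 - 0.1 * 4 = 0.6
print(entropic_risk(z, beta=0.5))  # ≈ mu - beta * sigma^2 / 2 = 0
print(value_at_risk(z, tau=0.05))  # ≈ 1 - 1.645 * 2 ≈ -2.29
```

Note that only the mean-variance criterion ever penalises better-than-expected outcomes; the other two measures are driven by the lower tail alone.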

7.7 Challenges In Risk-Sensitive Control

Many convenient properties of the risk-neutral objective do not carry over to risk-sensitive control. As a consequence, finding an optimal policy is usually significantly more involved. This remains true even when the risk-sensitive objective (Equation 7.18) can be evaluated efficiently, for example by using distributional dynamic programming to approximate the return-distribution function η^π. In this section we illustrate some of these challenges by characterising optimal policies for the variance-constrained control problem.

The variance-constrained problem introduces risk sensitivity by forbidding policies whose return variance is too high. Given a parameter C ≥ 0, the objective is to

maximise 𝔼_π[ G^π(X_0) ]
subject to Var_π( G^π(X_0) ) ≤ C.   (7.19)

Equation 7.19 can be shown to satisfy our definition of a risk-sensitive control problem if we express it in terms of a Lagrange multiplier:

J_VC(π) = min_{λ≥0} 𝔼_π[ G^π(X_0) ] – λ( Var_π( G^π(X_0) ) – C ).

The variance-penalised and variance-constrained problems are related in that they share the Pareto set Π_PAR ⊆ Π_H of possibly optimal solutions. A policy π is in the set Π_PAR if we have that for all π′ ∈ Π_H,

(a) Var( G^π(X_0) ) > Var( G^{π′}(X_0) ) ⟹ 𝔼[G^π(X_0)] > 𝔼[G^{π′}(X_0)], and
(b) Var( G^π(X_0) ) = Var( G^{π′}(X_0) ) ⟹ 𝔼[G^π(X_0)] ≥ 𝔼[G^{π′}(X_0)].

In words, between two policies with equal variances, the one with lower expectation is never a solution to either problem. However, these problems are generally not equivalent (Exercise 7.8).

Proposition 7.1 establishes the existence of a solution of the risk-neutral control

problem that is a) deterministic, b) stationary, and c) Markov. By contrast,

the solution to the variance-constrained problem may lack any or all of these

properties.

Example 7.19 (The optimal policy may not be deterministic). Consider the problem of choosing between two actions, a and b. Action a always yields a reward of 0, while action b yields a reward of 0 or 1 with equal probability (Figure 7.5a). If we seek the policy that maximises the expected return subject to the variance constraint C = 3/16, the best deterministic policy respecting the variance constraint must choose a, for a reward of 0. On the other hand, the policy that selects a and b with equal probability achieves an expected reward of 1/4 and a variance of 3/16. △

Figure 7.5
Examples demonstrating how the optimal policy for the variance-constrained control problem might not be (a) deterministic, (b) Markov, or (c) stationary.
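The numbers in Example 7.19 can be verified with exact arithmetic: under the mixed policy, the reward is 0 with probability 3/4 (choose a, or choose b and receive 0) and 1 with probability 1/4. A quick check:

```python
from fractions import Fraction

# Mixed policy of Example 7.19: choose a or b with probability 1/2 each;
# a always pays 0, b pays 0 or 1 with equal probability.
outcomes = {Fraction(0): Fraction(3, 4), Fraction(1): Fraction(1, 4)}

mean = sum(r * p for r, p in outcomes.items())
variance = sum(p * (r - mean) ** 2 for r, p in outcomes.items())

print(mean)      # 1/4
print(variance)  # 3/16 -- exactly the constraint C
```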

Example 7.20 (The optimal policy may not be Markov). Consider the Markov decision process in Figure 7.5b. Suppose that we seek a policy that maximises the expected return from state x, now subject to the variance constraint C = 0. Let us assume γ = 1 for simplicity. Action a has no variance and is therefore a possible solution, with zero return. Action b gives a greater expected return, at the cost of some variance. Any policy π that depends on the state alone and chooses b in state x must incur this variance. On the other hand, the following history-dependent policy achieves a positive expected return without violating the variance constraint: In state x, choose b; if the first reward R_0 is 0, select action b in x, otherwise select action a. In all cases, the return is 1, an improvement over the best Markov policy. △

Example 7.21 (The optimal policy may not be stationary). In general, the optimal policy may require keeping track of time. Consider the problem of maximising the expected return from the unique state x in Figure 7.5c, subject to Var_π( G^π(X_0) ) ≤ C, for C ≤ 1/4. Exercise 7.9 asks you to show that a simple time-dependent policy that chooses a for T_C steps and then selects b achieves an expected return of up to √C. This is possible because the variance of the return decays at a rate of γ², while its expected value decays at the slower rate of γ. By contrast, the best randomised stationary policy performs substantially worse for a discount factor γ close to 1 and small values of C (Figure 7.6). Intuitively, a randomised policy must choose a with a sufficiently large probability to avoid receiving the random reward early, which prevents it from selecting b quickly beyond the threshold of T_C time steps. △

Figure 7.6
Expected return as a function of the discount factor γ and variance constraint C in Example 7.21. Solid and dashed lines indicate the expected return of the best stationary and time-dependent policies, respectively. The peculiar zig-zag shape of the curve for the time-varying policy arises because the time T_C at which action b is taken must be an integer (see Exercise 7.9).

The last two examples establish that the variance-constrained risk measure

is time-inconsistent: informally, the agent’s preference for one outcome over

another at time t may be reversed at a later time. Compared to the risk-neutral

problem, the variance-constrained problem is more challenging because the

space of policies that one needs to consider is much larger. Among other things,

the lack of an optimal policy that is Markov with respect to the state alone

also implies that the dependency on the initial distribution in Equation 7.19 is

necessary to keep things well-deﬁned. The variance-constrained objective must

be optimised for globally, considering the policy at all states at once; this is in

contrast with the risk-neutral setting, where value iteration can make overall

improvements to the policy by acting greedily with respect to the value function

at individual states (see Remark 7.1). In fact, ﬁnding an optimal policy for the

variance-constrained control problem is NP-hard [Mannor and Tsitsiklis, 2011].

7.8 Conditional Value-At-Risk*

In the previous section, we saw that solutions to the variance-constrained control

problem can take unintuitive forms, including the need to penalise better-than-

expected outcomes. One issue is that variance only coarsely measures what we

mean by “risk” in the common sense of the word. To reﬁne our meaning, we

may identify two types of risk: downside risk, involving undesirable outcomes


such as greater-than-expected losses, and upside risk, involving what we may

informally call a stroke of luck. In some situations, it is possible and useful to

separately account for these two types of risk.

To illustrate this point, we now present a distributional algorithm for optimising

conditional value-at-risk (CVaR), based on work by Bäuerle and Ott [2011]

and Chow et al. [2015]. One beneﬁt of working with full return distributions

is that the algorithmic template we present here can be reasonably adjusted to

deal with other risk measures, including the entropic risk measure described in

Example 7.17. For conciseness, in what follows we will state without proof a

few technical facts about conditional value-at-risk which can be found in those

sources and the work of Rockafellar and Uryasev [2002].

Conditional value-at-risk measures downside risk by focusing on the lower tail behaviour of the return distribution, specifically the expected value of this tail. This expected value quantifies the magnitude of losses in extreme scenarios. Let Z be a random variable with cumulative and inverse cumulative distribution functions F_Z and F_Z^{–1}, respectively. For a parameter τ ∈ (0, 1), the CVaR of Z is

CVaR_τ(Z) = (1/τ) ∫_0^τ F_Z^{–1}(u) du.   (7.20)

When the inverse cumulative distribution function F_Z^{–1} is strictly increasing, the right-hand side of Equation 7.20 is equivalent to

𝔼[ Z | Z ≤ F_Z^{–1}(τ) ].   (7.21)

In this case, CVaR quantifies the expected return, conditioned on the event that this return is no greater than the return's τth quantile – i.e., is within the τth fraction of lowest returns.⁶⁰ In a reinforcement learning context, this leads to the risk-sensitive objective

J_CVaR(π) = CVaR_τ( ∑_{t=0}^{∞} γ^t R_t ).   (7.22)
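Equations 7.20 and 7.21 suggest two sample-based estimators of CVaR that should agree when the inverse distribution function is strictly increasing: average the quantiles below τ, or average the returns that fall below the τth quantile. A sketch, with function names of our own choosing, using the mixture from Figure 7.4:

```python
import numpy as np

def cvar_from_quantiles(z, tau, n_grid=10_000):
    # Equation 7.20: (1/tau) * integral_0^tau F_Z^{-1}(u) du,
    # approximated on a uniform grid over (0, tau).
    u = (np.arange(n_grid) + 0.5) / n_grid * tau
    return np.quantile(z, u).mean()

def cvar_from_tail(z, tau):
    # Equation 7.21: E[Z | Z <= F_Z^{-1}(tau)]
    var_tau = np.quantile(z, tau)
    return z[z <= var_tau].mean()

rng = np.random.default_rng(1)
# The mixture (1/2) N(0,1) + (1/2) N(4,1) from Figure 7.4
z = np.concatenate([rng.normal(0, 1, 50_000), rng.normal(4, 1, 50_000)])

print(cvar_from_quantiles(z, 0.4), cvar_from_tail(z, 0.4))  # the two estimates agree
```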

In general, there may not be an optimal policy that depends on the state x alone; however, one can show that optimality can be achieved with a deterministic, stationary Markov policy on an augmented state that incorporates information about the return thus far. In the context of this section, we assume that rewards are bounded in [R_MIN, R_MAX]. At a high level, we optimise the CVaR objective by

60. In other fields, CVaR is applied to losses rather than returns, in which case it measures the expected loss conditioned on this loss being above the τth percentile. For example, Equation 7.21 becomes 𝔼[ Z | Z ≥ F_Z^{–1}(τ) ], and the subsequent derivations need to be adjusted accordingly.

(a) Defining the requisite augmented state;
(b) Performing a form of risk-sensitive value iteration on this augmented state, using a suitable selection rule;
(c) Extracting the resulting policy.

We now explain each of these steps.

Augmented state. Central to the algorithm and to the state augmentation procedure is a reformulation of CVaR in terms of a desired minimum return or target b ∈ ℝ. Let [x]⁺ denote the function that is 0 if x < 0, and x otherwise. For a random variable Z and τ ∈ (0, 1), Rockafellar and Uryasev [2002] establish that

CVaR_τ(Z) = max_{b∈ℝ} ( b – τ^{–1} 𝔼[b – Z]⁺ ).   (7.23)
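This characterisation can be probed numerically: scanning candidate targets b, the objective b – τ^{–1}𝔼[b – Z]⁺ should peak near the τth quantile, where its value matches CVaR_τ(Z). A sketch under our own naming (b_star, var_tau, cvar_tau are illustrative variable names, not the book's):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=100_000)  # standard normal returns
tau = 0.1

# Scan the Rockafellar-Uryasev objective b - (1/tau) * E[b - Z]^+ over a grid
bs = np.linspace(-3.0, 3.0, 301)
objective = [b - np.maximum(b - z, 0.0).mean() / tau for b in bs]
b_star = bs[int(np.argmax(objective))]

var_tau = np.quantile(z, tau)      # F_Z^{-1}(tau), about -1.28 for N(0, 1)
cvar_tau = z[z <= var_tau].mean()  # about -1.75 for N(0, 1)

print(b_star, var_tau)             # the maximiser sits at the tau-quantile
print(max(objective), cvar_tau)    # the maximum value equals CVaR
```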

When F_Z^{–1} is strictly increasing, the maximum-achieving b for Equation 7.23 is the quantile F_Z^{–1}(τ). In fact, taking the derivative of the expression inside the brackets with respect to b yields the quantile update rule (Equation 6.11; see Exercise 7.11). The advantage of this formulation is that it is more easily optimised in the context of a policy-dependent return. To see this, let us write

G^π = ∑_{t=0}^{∞} γ^t R_t

to denote the random return from the initial state X_0, following some history-dependent policy π ∈ Π_H. We then have that

max_{π∈Π_H} J_CVaR(π) = max_{π∈Π_H} max_{b∈ℝ} ( b – τ^{–1} 𝔼[b – G^π]⁺ )
  = max_{b∈ℝ} ( b – τ^{–1} min_{π∈Π_H} 𝔼[b – G^π]⁺ ).   (7.24)

In words, the CVaR objective can be optimised by jointly finding an optimal target b and a policy that minimises the expected shortfall 𝔼[b – G^π]⁺. For a fixed target b, we will see that it is possible to minimise the expected shortfall by means of dynamic programming. By adjusting b appropriately, one then obtains an optimal policy.

Based on Equation 7.24, let us now consider an augmented state (X_t, B_t), where X_t is as usual the current state and B_t takes on values in ℬ = [V_MIN, V_MAX]; we will describe its dynamics in a moment. With this augmented state we may consider a class of stationary Markov policies, Π_CVaR, which take the form π : 𝒳 × ℬ → 𝒫(𝒜).⁶¹

61. Since B_t is a function of the trajectory up to time t, Π_CVaR is a strict subset of Π_H.

We use the variable B_t to keep track of the amount of discounted reward that should be obtained from X_t onwards in order to achieve a desired minimum return of b_0 ∈ ℝ over the entire trajectory. The transition structure of the Markov decision process over the augmented state is defined by modifying the generative equations (Section 2.3):

B_0 = b_0
A_t | (X_{0:t}, B_{0:t}, A_{0:t–1}, R_{0:t–1}) ∼ π( · | X_t, B_t)
B_{t+1} | (X_{0:t+1}, B_{0:t}, A_{0:t}, R_{0:t}) = γ^{–1}(B_t – R_t);

we similarly extend the sample transition model with the variables B and B′. This definition of the variables (B_t)_{t≥0} can be understood by noting, for example, that a minimum return of b_0 is achieved over the whole trajectory if

∑_{t=1}^{∞} γ^{t–1} R_t ≥ γ^{–1}(b_0 – R_0).

If the reward R_t is small or negative, the new target B_{t+1} may of course be larger than B_t. Note that the value of b_0 is a parameter of the algorithm, rather than given by the environment.
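The target dynamics are simple to simulate. The helper below is our own illustration of the update B_{t+1} = γ^{–1}(B_t – R_t); the closing assertion checks that the gap between the realised return and the original target b_0 is γ times the gap between the remaining return and the updated target, which is what makes the bookkeeping consistent:

```python
# Target update B_{t+1} = (B_t - R_t) / gamma from the augmented-state dynamics.
def roll_target(b0, rewards, gamma):
    b = b0
    targets = [b]
    for r in rewards:
        b = (b - r) / gamma
        targets.append(b)
    return targets

gamma = 0.9
rewards = [1.0, -0.5, 2.0]
targets = roll_target(2.0, rewards, gamma)

# Return from time 0, and the return of the tail trajectory viewed from time 1
ret = sum(gamma ** t * r for t, r in enumerate(rewards))
remaining = sum(gamma ** t * r for t, r in enumerate(rewards[1:]))

# (return - b_0) = gamma * (remaining return - updated target B_1)
assert abs((ret - 2.0) - gamma * (remaining - targets[1])) < 1e-12
print(targets)
```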

Risk-sensitive value iteration. We next construct a method for optimising the expected shortfall given a target b_0 ∈ ℝ. Let us write η ∈ 𝒫(ℝ)^{𝒳×ℬ×𝒜} for a return-distribution function on the augmented state-action space, instantiated as a return-variable function G. For ease of exposition, we will mostly work with this latter form of the return function. As usual, we write G^π for the return-variable function associated with a policy π ∈ Π_CVaR. With this notation in mind, we write

J_CVaR(π, b_0) = max_{b∈ℝ} ( b – τ^{–1} 𝔼[b – G^π(X_0, b_0, A_0)]⁺ )   (7.25)

to denote the conditional value-at-risk obtained by following policy π from the initial state (X_0, b_0), with A_0 ∼ π( · | X_0, b_0).

Similar to distributional value iteration, the algorithm constructs a series of approximations to the return-distribution function by repeatedly applying the distributional Bellman operator with a policy derived from a greedy selection rule 𝒢̃. Specifically, we write

a_G(x, b) = arg min_{a∈𝒜} 𝔼[b – G(x, b, a)]⁺   (7.26)

for the greedy action at the augmented state (x, b), breaking ties arbitrarily. The selection rule 𝒢̃ is itself given by

𝒢̃(η)(a | x, b) = 𝒢̃(G)(a | x, b) = 𝟙{a = a_G(x, b)}.

The algorithm begins by initialising η_0(x, b, a) = δ_0 for all x ∈ 𝒳, b ∈ ℬ, and a ∈ 𝒜, and iterating

η_{k+1} = 𝒯_{𝒢̃} η_k,   (7.27)

as in distributional value iteration. Expressed in terms of random-variable functions, this is

G_{k+1}(x, b, a) =_D R + γ G_k( X′, B′, a_{G_k}(X′, B′) ),   X = x, B = b, A = a.

After k iterations, the policy π_k = 𝒢̃(η_k) can be extracted according to Equation 7.26. As suggested by Equation 7.25, a suitable choice of starting state is

b_0 = arg max_{b∈ℬ} ( b – τ^{–1} 𝔼[b – G_k( X_0, b, a_{G_k}(X_0, b) )]⁺ ).   (7.28)

As given, there are two hurdles to producing a tractable implementation of Equation 7.27: in addition to the usual concern that return distributions may need to be projected onto a finite-parameter representation, we also have to contend with a real-valued state variable B_t. Before discussing how this can be addressed, we first establish that Equation 7.27 is a sound approach to finding an optimal policy for the CVaR objective.

Theorem 7.22. Consider the sequence of return-distribution functions (η_k)_{k≥0} defined by Equation 7.27. Then the greedy policy π_k = 𝒢̃(η_k) is such that for all x ∈ 𝒳, b ∈ ℬ, and a ∈ 𝒜,

𝔼[b – G^{π_k}(x, b, a)]⁺ ≤ min_{π∈Π_CVaR} 𝔼[b – G^π(x, b, a)]⁺ + γ^k (V_MAX – V_MIN)/(1 – γ).

Consequently, we also have that the conditional value-at-risk of this policy satisfies (with b_0 given by Equation 7.28)

J_CVaR(π_k, b_0) ≥ max_{b∈ℬ} max_{π∈Π_CVaR} J_CVaR(π, b) – γ^k (V_MAX – V_MIN)/(τ(1 – γ)). △

The proof is somewhat technical, and is provided in Remark 7.3.

Theorem 7.22 establishes that with sufficiently many iterations, the policy π_k is close to optimal. Of course, when distribution approximation is introduced the resulting policy will in general only approximately optimise the CVaR objective, with an error term that depends on the expressivity of the probability distribution representation (i.e., the parameter m in Chapter 5). To perform dynamic programming with the state variable B, one may use function approximation, the subject of the next chapter. Another solution is to consider a discrete number of values for B and to extend the operator 𝒯_{𝒢̃} to operate on this discrete set (Exercise 7.12).
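The discretisation route can be made concrete on a toy problem. The sketch below is our own construction (a single one-step decision with two actions, so the backup of Equation 7.27 reduces to the immediate return distribution, and ℬ is a grid of candidate targets); it scans targets as in Equation 7.28 and picks greedy actions as in Equation 7.26. With τ = 1/4, CVaR prefers the safe action even though both actions have the same expected return:

```python
import numpy as np

tau = 0.25
# One-step return distributions: action 'a' always pays 0.5,
# action 'b' pays 0 or 1 with equal probability.
returns = {"a": np.array([0.5]), "b": np.array([0.0, 1.0])}

def shortfall(b, z):
    # Expected shortfall E[b - Z]^+ for the empirical distribution z
    return np.maximum(b - z, 0.0).mean()

# Jointly scan targets b (Equation 7.28) and greedy actions (Equation 7.26):
# maximising b - shortfall/tau over actions is the same as taking the
# shortfall-minimising action at each target.
grid = np.linspace(0.0, 1.0, 101)
best = max(
    (b - shortfall(b, returns[a]) / tau, a, b)
    for b in grid
    for a in returns
)
value, action, target = best
print(action, target, value)  # CVaR picks the safe action 'a'
```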

7.9 Technical Remarks

Remark 7.1. In Section 7.7 we presented some of the challenges involved with finding an optimal policy for the variance-constrained objective. In some sense, these challenges should not be too surprising given that we are looking to maximise a function J of an infinite-dimensional object (a history-dependent policy). Rather, what should be surprising is the relative ease with which one can obtain an optimal policy in the risk-neutral setting.

From a technical perspective, this ease is a consequence of Lemma 7.3, which guarantees that Q* (and hence π*) can be efficiently approximated. However, another important property of the risk-neutral setting is that the policy can be improved locally, that is, at each state simultaneously. To see this, consider a state-action value function Q^π for a given policy π, and denote by π′ a greedy policy with regards to Q^π. Then

T Q^π = T_{π′} Q^π ≥ T_π Q^π = Q^π.   (7.29)

That is, a single step of value iteration applied to the value function of a policy π results in a new value function that is at least as good as Q^π at all states – the Bellman operator is said to be monotone. Because this single step also corresponds to the value of a non-stationary policy that acts according to π′ for one step then switches to π, we can equivalently interpret it as constructing, one step at a time, a deterministic history-dependent policy for solving the risk-neutral problem.

By contrast, it is not possible to use a direct dynamic programming approach over the objective J to find the optimal policy for an arbitrary risk-sensitive control problem. A practical alternative is to perform the optimisation instead with an ascent procedure (e.g., a policy gradient-type algorithm). Ascent algorithms can often be computed in closed form, and tend to be simpler to implement. On the other hand, convergence is typically only guaranteed to local optima, seemingly unavoidable when the optimisation problem is known to be computationally hard. △

Remark 7.2. When the projection Π_F is not mean-preserving, distributional value iteration induces a state-action value function Q_{η_k} that is different from the value function Q_k determined by standard value iteration under equivalent initial conditions. Under certain conditions on the distributions of rewards, it is possible to bound this difference as k → ∞. To do so, we use a standard error bound on approximate value iteration [see e.g. Bertsekas, 2012].

Lemma 7.23. Let (Q_k)_{k≥0} be a sequence of iterates in ℝ^{𝒳×𝒜} satisfying

‖Q_{k+1} – T Q_k‖_∞ ≤ ε

for some ε > 0, and where T is the Bellman optimality operator. Then

lim sup_{k→∞} ‖Q_k – Q*‖_∞ ≤ ε/(1 – γ). △

In the context of distributional value iteration, we need to bound the difference ‖Q_{η_{k+1}} – T Q_{η_k}‖_∞. When the rewards are bounded on the interval [R_MIN, R_MAX] and the projection step is Π_Q, the w_1-projection onto the m-quantile representation, a simple bound follows from an intermediate result used in proving Lemma 5.30 (see Exercise 5.20). In this case, for any ν bounded on [V_MIN, V_MAX],

w_1(Π_Q ν, ν) ≤ (V_MAX – V_MIN)/(2m).

Conveniently, the 1-Wasserstein distance bounds the difference of means between any distributions ν, ν′ ∈ 𝒫_1(ℝ):

| 𝔼_{Z∼ν}[Z] – 𝔼_{Z∼ν′}[Z] | ≤ w_1(ν, ν′).

This follows from the dual representation of the Wasserstein distance [Villani, 2008]. Consequently, for any (x, a),

| Q_{η_{k+1}}(x, a) – (T_{π_k} Q_{η_k})(x, a) | ≤ w_1( η_{k+1}(x, a), (𝒯_{π_k} η_k)(x, a) )
  = w_1( (Π_Q 𝒯_{π_k} η_k)(x, a), (𝒯_{π_k} η_k)(x, a) )
  ≤ (V_MAX – V_MIN)/(2m).

By taking the maximum over (x, a) on the left-hand side of the above and combining with Lemma 7.23, we obtain

lim sup_{k→∞} ‖Q_{η_k} – Q*‖_∞ ≤ (V_MAX – V_MIN)/(2m(1 – γ)). △
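Both ingredients of this argument are easy to probe numerically: the m-quantile projection that places mass 1/m at the quantile midpoints F^{–1}((2i – 1)/(2m)), and the fact that the projected mean differs from the original mean by at most the projection's w_1 distance. A sketch with our own function name:

```python
import numpy as np

def quantile_projection(z, m):
    # m-quantile projection: mass 1/m at the quantile midpoints (2i-1)/(2m)
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)
    return np.quantile(z, taus)

rng = np.random.default_rng(3)
z = rng.uniform(-1.0, 1.0, 100_000)  # bounded on [V_MIN, V_MAX] = [-1, 1]
m = 10
proj = quantile_projection(z, m)

# |E[Z] - E[proj]| is bounded by w1(proj, nu) <= (V_MAX - V_MIN)/(2m) = 0.1
gap = abs(z.mean() - proj.mean())
print(gap)  # well below the bound of 0.1
```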

Remark 7.3. Theorem 7.22 is proven from a few facts regarding partial returns, which we now give. Let us write a_k(x, b) = a_{G_k}(x, b). We define the mapping U_k : 𝒳 × ℬ × 𝒜 → ℝ as

U_k(x, b, a) = 𝔼[b – G_k(x, b, a)]⁺,

and for a policy π ∈ Π_CVaR similarly write

U^π(x, b, a) = 𝔼[b – G^π(x, b, a)]⁺.

Lemma 7.24. For any x ∈ 𝒳, a ∈ 𝒜, b ∈ ℬ, and k ∈ ℕ, we have

U_{k+1}(x, b, a) = γ 𝔼[ U_k( X′, B′, a_k(X′, B′) ) | X = x, B = b, A = a ].

In addition, if π ∈ Π_CVaR is a stationary Markov policy on 𝒳 × ℬ, we have

U^π(x, b, a) = γ 𝔼_π[ U^π( X′, B′, A′ ) | X = x, B = b, A = a ]. △

Proof. The result follows by time-homogeneity and the Markov property. Consider the sample transition model (X, B, A, R, X′, B′, A′), with A′ = a_k(X′, B′). Simultaneously, consider the partial trajectory (X_t, B_t, A_t, R_t)_{t=0}^{k} for which A_0 = A′ and A_t ∼ π_{k–t}( · | X_t, B_t) for t > 0. As B′ = γ^{–1}(B – R), we have

γ 𝔼[ U_k(X′, B′, A′) | X = x, B = b, A = a ]
= γ 𝔼[ 𝔼[ [B′ – G_k(X′, B′, A′)]⁺ ] | X = x, B = b, A = a ]
= γ 𝔼[ 𝔼[ [B′ – ∑_{t=0}^{k} γ^t R_t]⁺ | X_0 = X′, B_0 = B′, A_0 = A′ ] | X = x, B = b, A = a ]
= 𝔼[ 𝔼[ [b – R – γ ∑_{t=0}^{k} γ^t R_t]⁺ | X_0 = X′, B_0 = B′, A_0 = A′ ] | X = x, B = b, A = a ]
= 𝔼[ 𝔼[ [b – R – ∑_{t=1}^{k+1} γ^t R_t]⁺ | X_1 = X′, B_1 = B′, A_1 = A′ ] | X = x, B = b, A = a ]
= 𝔼[ [b – ∑_{t=0}^{k+1} γ^t R_t]⁺ | X_0 = x, B_0 = b, A_0 = a ].

The second statement follows similarly.

Lemma 7.25. Suppose that V_MIN ≤ 0 and V_MAX ≥ 0. Let (R_t)_{t≥0} be a sequence of rewards in [R_MIN, R_MAX]. For any b ∈ ℝ and k ∈ ℕ,

𝔼[b – ∑_{t=0}^{∞} γ^t R_t]⁺ + γ^{k+1} V_MIN ≤ 𝔼[b – ∑_{t=0}^{k} γ^t R_t]⁺ ≤ 𝔼[b – ∑_{t=0}^{∞} γ^t R_t]⁺ + γ^{k+1} V_MAX.

△

Proof. First note that, for any b, z, z′ ∈ ℝ,

[b – z]⁺ ≤ [b – z′]⁺ + [z′ – z]⁺.   (7.30)

To obtain the first inequality in the statement, we set z = ∑_{t=0}^{∞} γ^t R_t and z′ = ∑_{t=0}^{k} γ^t R_t, so that

[b – ∑_{t=0}^{∞} γ^t R_t]⁺ ≤ [b – ∑_{t=0}^{k} γ^t R_t]⁺ + [– ∑_{t=k+1}^{∞} γ^t R_t]⁺.

Since rewards are bounded in [R_MIN, R_MAX], we have that

– ∑_{t=k+1}^{∞} γ^t R_t ≤ – γ^{k+1} R_MIN/(1 – γ) = – γ^{k+1} V_MIN.

As we have assumed that V_MIN ≤ 0, it follows that

[b – ∑_{t=0}^{∞} γ^t R_t]⁺ ≤ [b – ∑_{t=0}^{k} γ^t R_t]⁺ – γ^{k+1} V_MIN.

The second inequality in the statement is obtained analogously.

Lemma 7.26. The sequence (η_k)_{k≥0} defined by Equation 7.27 satisfies, for any x ∈ 𝒳, b ∈ ℬ, and a ∈ 𝒜,

U_k(x, b, a) = min_{π∈Π_CVaR} 𝔼_π[ [b – ∑_{t=0}^{k} γ^t R_t]⁺ | X = x, B = b, A = a ].   (7.31)

△

Proof (sketch). Our choice of G_0(x, b, a) = 0 guarantees that the statement is true for k = 0. The result then follows by Lemma 7.24, the fact that the policy 𝒢̃(η_k) chooses the action minimising the left-hand side of Equation 7.31, and by induction on k.

Proof of Theorem 7.22. Let us assume that V_MIN ≤ 0 and V_MAX ≥ 0 so that Lemma 7.25 can be applied. This is without loss of generality, as otherwise we may first construct a new sequence of rewards shifted by an appropriate constant C, such that R_MIN = 0, R_MAX ≥ 0; by inspection, this transformation does not affect the statement of Theorem 7.22.

Let π* ∈ Π_CVaR be an optimal deterministic policy, in the sense that

U^{π*}(x, b, a) = min_{π∈Π_CVaR} U^π(x, b, a).

Combining Lemmas 7.25 and 7.26, we have

U^{π*}(x, b, a) + γ^{k+1} V_MIN ≤ U_k(x, b, a) ≤ U^{π*}(x, b, a) + γ^{k+1} V_MAX.   (7.32)

Write 𝔼_xba[·] = 𝔼_π[· | X = x, B = b, A = a]. By Lemma 7.24,

U^{π_k}(x, b, a) – U_k(x, b, a)
= γ 𝔼_xba[ U^{π_k}( X′, B′, a_k(X′, B′) ) – U_{k–1}( X′, B′, a_{k–1}(X′, B′) ) ]
= γ 𝔼_xba[ U^{π_k}( X′, B′, a_k(X′, B′) ) – U_k( X′, B′, a_k(X′, B′) )
   + U_k( X′, B′, a_k(X′, B′) ) – U_{k–1}( X′, B′, a_{k–1}(X′, B′) ) ]
≤ γ 𝔼_xba[ U^{π_k}( X′, B′, a_k(X′, B′) ) – U_k( X′, B′, a_k(X′, B′) ) ] + γ^{k+1} V_MAX – γ^k V_MIN.   (7.33)

Now, the quantity

ε_k(x, b, a) = U^{π_k}(x, b, a) – U_k(x, b, a)

is bounded above and hence

ε̄_k = sup_{x∈𝒳, b∈ℬ, a∈𝒜} ε_k(x, b, a)

exists. Taking the supremum over x, b and a on both sides of Equation 7.33, we have

ε̄_k ≤ γ ε̄_k + γ^{k+1} V_MAX – γ^k V_MIN

and consequently for all x, b, a,

U^{π_k}(x, b, a) – U_k(x, b, a) ≤ γ^k (V_MAX – V_MIN)/(1 – γ).   (7.34)

Because U_k(x, b, a) ≤ U^{π*}(x, b, a) + γ^{k+1} V_MAX, then

U^{π_k}(x, b, a) ≤ U^{π*}(x, b, a) + γ^{k+1} V_MAX + γ^k (V_MAX – V_MIN)/(1 – γ).

Now,

γ^{k+1} + γ^{k+1}/(1 – γ) = γ^k ( γ(1 – γ) + γ )/(1 – γ) ≤ γ^k/(1 – γ).

Hence we conclude that also

U^{π_k}(x, b, a) ≤ min_{π∈Π_CVaR} U^π(x, b, a) + γ^k (V_MAX – V_MIN)/(1 – γ),

as desired.

For the second statement, following Equation 7.28 we have

b_0 = arg max_{b∈ℬ} ( b – τ^{–1} 𝔼[b – G_k( X_0, b, a_k(X_0, b) )]⁺ ).

The algorithm begins in state (X_0, b_0), selects action A_0 = a_k(X_0, b_0), and then executes policy π_k from there on; its return is G^{π_k}(X_0, b_0, A_0). In particular,

J_CVaR(π_k, b_0) = max_{b∈ℬ} ( b – τ^{–1} 𝔼[b – G^{π_k}(X_0, b_0, A_0)]⁺ )
  ≥ b_0 – τ^{–1} 𝔼[b_0 – G^{π_k}(X_0, b_0, A_0)]⁺.   (7.35)

Write a^{π*}(x, b) for the action selected by π* in (x, b). Because Equation 7.32 holds for all x, b, and a, we have

min_{a∈𝒜} U_k(x, b, a) ≤ min_{a∈𝒜} U^{π*}(x, b, a) + γ^{k+1} V_MAX
⟹ U_k( x, b, a_k(x, b) ) ≤ U^{π*}( x, b, a^{π*}(x, b) ) + γ^{k+1} V_MAX,

since a_k(x, b) is the action a that minimises U_k(x, b, a). Hence, for any state x we have

max_{b∈ℬ} ( b – τ^{–1} min_{a∈𝒜} U_k(x, b, a) ) ≥ max_{b∈ℬ} ( b – τ^{–1} min_{a∈𝒜} U^{π*}(x, b, a) ) – γ^{k+1} V_MAX/τ

⟹ b_0 – τ^{–1} U_k( X_0, b_0, a_k(X_0, b_0) ) ≥ max_{b∈ℬ} ( b – τ^{–1} U^{π*}( X_0, b, a^{π*}(X_0, b) ) ) – γ^{k+1} V_MAX/τ
  = max_{b∈ℬ} J_CVaR(π*, b) – γ^{k+1} V_MAX/τ.

Combined with Equations 7.34 and 7.35, this yields

J_CVaR(π_k, b_0) ≥ max_{b∈ℬ} max_{π∈Π_CVaR} J_CVaR(π, b) – γ^k (V_MAX – V_MIN)/(τ(1 – γ)).

△

7.10 Bibliographical Remarks

7.0.

The balloon navigation example at the beginning of the chapter is from

Bellemare et al. [2020]. Sutton and Barto [2018] separates the “control problem”
from the “prediction problem”; the latter figures more prominently in this book.
In earlier literature, the control problem comes first [see e.g. Bellman, 1957a]
and prediction is typically used as a subroutine for control [Howard, 1960].

7.1.

Time-dependent policies are common in ﬁnite-horizon scenarios, and are

studied at length by Puterman [2014]. The technical core of Proposition 7.2

involves demonstrating that any feasible value function can be attained by a

stationary Markov policy; see the results by Puterman [2014, Theorem 5.5.1],

Altman [1999] and the discussion by Szepesvári [2020].


In reinforcement learning, history-dependent policies are also used to deal with

partially observable environments, in which the agent receives an observation o

at each time step rather than the identity of its state. For example, McCallum

[1995] uses a variable-length history to represent state-action values, while

Veness et al. [2011] uses a history-based probabilistic model to learn a model

of the environment. History-dependent policies also play a central role in the

study of optimality in the fairly large class of computable environments [Hutter,

2005].

7.2.

The canonical reference for value iteration is the book by Bellman [1957a];

see also Bellman [1957b] for an asymptotic analysis in the undiscounted set-

ting. Lemma 7.3 is standard and can be found in most reinforcement learning

textbooks [Bertsekas and Tsitsiklis, 1996, Szepesvári, 2010, Puterman, 2014].

State-action value functions were introduced along with the Q-learning algo-

rithm [Watkins, 1989] and subsequently used in the development of SARSA

[Rummery and Niranjan, 1994]. Watkins and Dayan [1992] gives a restricted

result regarding the convergence of Q-learning, which is more thoroughly estab-

lished by Jaakkola et al. [1994], Tsitsiklis [1994], and Bertsekas and Tsitsiklis

[1996].

7.3–7.5.

The expression of the optimality operator as a ﬁxed-policy operator

whose policy varies with the input is common in the analysis of control algo-

rithms [see e.g. Munos, 2003, Scherrer, 2014]. The view of value iteration as

constructing a history-dependent policy is taken by Scherrer and Lesner [2012]

to derive more accurate value learning algorithms in the approximate setting.

The extension to distributional value iteration and Theorem 7.13 are from

[Bellemare et al., 2017a]. The correspondence between standard value iteration

and distributional value iteration with a mean-preserving projection is given by

Lyle et al. [2019].

The notion of action gap plays an important role in understanding the rela-

tionship between value function estimates and policies, in particular when

estimates are approximate. Farahmand [2011] gives a gap-dependent bound on

the expected return obtained by a policy derived from an approximate value

function. Bellemare et al. [2016] derives an algorithm for increasing the action

gap so as to improve performance in the approximate setting.

An example of a selection rule that explicitly incorporates distributional infor-

mation is the lexicographical rule of Jaquette [1973], which orders policies

according to the magnitude of their moments.


7.6.

The notion of risk and risk-sensitive decisions can be traced back to

Markowitz [1952], who introduced the concept of trading off expected gains and

variations in those gains in the context of constructing an investment portfolio;

see also Steinbach [2001] for a retrospective. Artzner et al. [1999] proposes

a collection of desirable characteristics that make a risk measure coherent

in the sense that it satisﬁes certain preference axioms. Of the risk measures

mentioned here, CVaR is coherent but the variance-constrained objective is

not. Artzner et al. [2007] discusses coherent risk measures in the context of

sequential decisions. Ruszczyński [2010] introduces the notion of dynamic risk

measures for Markov decision processes, which are amenable to optimisation

via Bellman-style recursions; see also Chow [2017] for a discussion of static

and dynamic risk measures as well as time consistency. Jiang and Powell [2018]

develop sample-based optimisation methods for dynamic risk measures based

on quantiles.

Howard and Matheson [1972] considered the optimisation of an exponential

utility function applied to the random return by means of policy iteration. The

same objective is given a distributional treatment by Chung and Sobel [1987].

Heger [1994] considers optimising for worst-case returns. Haskell and Jain

[2015] study the use of occupancy measures over augmented state spaces as an

approach for ﬁnding optimal policies for risk-sensitive control; similarly, an

occupancy measure-based approach to CVaR optimisation is studied by Carpin

et al. [2016]. Mihatsch and Neuneier [2002] and Shen et al. [2013] extend

Q-learning to the optimisation of recursive risk measures, where a base risk

measure is applied at each time step. Recursive risk measures are more easily

optimised than risk measures directly applied to the random return, but are not

as easily interpreted. Martin et al. [2020] consider combining distributional

reinforcement learning with the notion of second-order stochastic dominance

as a means of action selection. Quantile criteria are considered by Filar et al.

[1995] in the case of average-reward MDPs, and more recently by Gilbert et al.

[2017] and Li et al. [2021]. Delage and Mannor [2010] solves a risk-constrained

optimisation problem to handle uncertainty in a learned model’s parameters. See

Prashanth and Fu [2021] for a survey on risk-sensitive reinforcement learning.

7.7.

Sobel [1982] establishes that an operator constructed directly from the

variance-penalised objective does not have the monotone improvement property,

making its optimisation more challenging. The examples demonstrating the need

for randomisation and a history-dependent policy are adapted from Mannor and

Tsitsiklis [2011], who also prove the NP-hardness of the problem of optimising

the variance-constrained objective. Tamar et al. [2012] propose a policy gradient

algorithm for optimising a mean-variance objective and for the CVaR objective


[Tamar et al., 2015]; see also Prashanth and Ghavamzadeh [2013], Chow and

Ghavamzadeh [2014] for actor-critic algorithms for these criteria. Chow et al.

[2018] augment the state with the return-so-far in order to extend gradient-based

algorithms to a broader class of risk measures.

7.8.

The reformulation of the conditional value-at-risk (CVaR) of a random

variable in terms of the (convex) optimisation of a function of a variable $b \in \mathbb{R}$
is due to Rockafellar and Uryasev [2000]; see also Rockafellar and Uryasev

[2002], Shapiro et al. [2009]. Bäuerle and Ott [2011] provide an algorithm for

optimising the CVaR of the random return in Markov decision processes. Their

work forms the basis for the algorithm presented in this section, although the

treatment in terms of return-distribution functions is new here. Another closely

related algorithm is due to Chow et al. [2015], who additionally provide an

approximation error bound on the computed CVaR. Brown et al. [2020] apply

Rockafellar and Uryasev’s approach to design an agent that is risk-sensitive with

respect to a prior distribution over possible reward functions. Keramati et al.

[2020] combine categorical temporal-difference learning with an exploration

bonus derived from the DKW inequality to develop an algorithm to optimise

for conditional value-at-risk.

7.11 Exercises

Exercise 7.1. Find a counterexample that shows that the Bellman optimality operator is not an affine operator. △

Exercise 7.2. Consider the Markov decision process depicted in Figure 2.4a. For which values of the discount factor $\gamma \in [0, 1)$ is there more than one optimal action from state $x$? Use this result to argue that the optimal policy depends on the discount factor. △

Exercise 7.3. Proposition 7.7 establishes that distributional Bellman optimality operators are not contraction mappings.

(i) Instantiate the result with the 1-Wasserstein distance. Provide a visual explanation for the result by drawing the relevant cumulative distribution functions before and after the application of the operator.
(ii) Discuss why it was necessary, in the proof of Proposition 7.7, to assume that the probability metric $d$ is $c$-homogeneous. △

Exercise 7.4. Suppose that there is a unique optimal policy $\pi^*$, as per Section 7.4. Consider the use of a projection $\Pi_{\mathcal{F}}$ for a probability representation $\mathcal{F}$ and the iterates
\[
\eta_{k+1} = \Pi_{\mathcal{F}} \mathcal{T} \eta_k \,. \tag{7.36}
\]
Discuss under what conditions the sequence of greedy policies $(\mathcal{G}(\eta_k))_{k \ge 0}$ converges to $\pi^*$ when $\Pi_{\mathcal{F}}$ is

(i) the $m$-categorical projection $\Pi_{\mathcal{C}}$;
(ii) the $m$-quantile projection $\Pi_{\mathcal{Q}}$.

Where necessary, provide proofs of your statements. Does your answer depend on $m$, or on $\theta_1, \dots, \theta_m$ for the case of the categorical representation? △

Exercise 7.5. Give a Markov decision process for which the limit of the sequence of iterates defined by $\eta_{k+1} = \mathcal{T}_{\mathcal{G}} \eta_k$ depends on the initial condition $\eta_0$, irrespective of the greedy selection rule $\mathcal{G}$. Hint. Construct a scenario where the implied policy $\pi_k$ is the same for all $k$, but depends on $\eta_0$. △

Exercise 7.6. Consider the greedy selection rule that selects an action with minimal variance amongst those with maximal expected value, breaking ties uniformly at random. Provide an example Markov decision process in which this rule results in a sequence of return-distribution functions that does not converge, as per Example 7.11. Hint. Consider reward distributions of the form $\frac{1}{3} \sum_{i=1}^{3} \delta_{\theta_i}$. △

Exercise 7.7 (*). Consider the 1-Wasserstein distance $w_1$ and its supremum extension $\overline{w}_1$. In addition, let $d$ be a metric on $\mathscr{P}(\mathcal{A})$. Suppose that we are given a mapping
\[
\tilde{\mathcal{G}} : \mathscr{P}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}} \to \Pi_{\mathrm{MS}}
\]
which is continuous at every state, in the sense that for any $\varepsilon > 0$, there exists a $\delta > 0$ such that for any return functions $\eta, \eta'$,
\[
\overline{w}_1(\eta, \eta') < \delta \implies \max_{x \in \mathcal{X}} d \big( \tilde{\mathcal{G}}(\eta)(\cdot \mid x), \tilde{\mathcal{G}}(\eta')(\cdot \mid x) \big) < \varepsilon \,.
\]
Show that this mapping cannot be a greedy selection rule in the sense of Definition 7.6. △

Exercise 7.8. Consider the Markov decision process depicted in Figure 7.5a. Show that there is no $\lambda \ge 0$ such that the policy maximising
\[
J_{\mathrm{MV}}(\pi) = \mathbb{E}_\pi \big[ G^\pi(X_0) \big] - \lambda \, \mathrm{Var}_\pi \big( G^\pi(X_0) \big)
\]
is stochastic. This illustrates how the variance-constrained and variance-penalised control problems are not equivalent. △


Exercise 7.9. Consider the Markov decision process depicted in Figure 7.5c.

(i) Solve for the optimal stopping time $T_C$ maximising the return of a time-dependent policy that selects action a for $T_C$ time steps, then selects action b (under the constraint that the variance should be no greater than $C$).
(ii) Prove that this policy can achieve an expected return of up to $\sqrt{C}$.
(iii) Based on your conclusions, design a policy that improves on this strategy.
(iv) Show that the expectation and variance of the return of a randomised stationary policy that selects action b with probability $p$ are given by
\[
\mathbb{E}_\pi [G^\pi(x)] = \frac{p \gamma^2}{1 - (1 - p)\gamma} \,, \qquad
\mathrm{Var}_\pi \big( G^\pi(x) \big) = \frac{p \gamma^4}{1 - \gamma^2 (1 - p)} - \bigg( \frac{p \gamma^2}{1 - (1 - p)\gamma} \bigg)^2 \,.
\]
(v) Using your favourite visualisation program, chart the returns achieved by the optimal randomised policy and the optimal time-dependent policy, for values of $C$ and $\gamma$ different from those shown in Figure 7.5c. What do you observe? Hint. Use a root-finding algorithm to determine the maximum expected return of a randomised policy under the constraint $\mathrm{Var}_\pi ( G^\pi(x) ) \le C$. △
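For part (v), the closed-form expressions from part (iv) can be evaluated directly. The sketch below (function names are my own, and the formulas are transcribed with the standard placement of the discount factor $\gamma$) scans a grid of values of $p$ in place of a full root-finding routine, which is already enough to chart the constrained optimum:

```python
def expected_return(p, gamma):
    # E_pi[G^pi(x)] for the randomised policy choosing action b with
    # probability p at each step (formula from part (iv)).
    return p * gamma**2 / (1 - (1 - p) * gamma)

def return_variance(p, gamma):
    # Var_pi(G^pi(x)) = E[G^2] - E[G]^2 for the same policy.
    second_moment = p * gamma**4 / (1 - (1 - p) * gamma**2)
    return second_moment - expected_return(p, gamma)**2

def best_constrained_return(gamma, C, n=10_001):
    # Coarse stand-in for the root-finding hint: scan p on a grid and keep
    # the best expected return whose variance satisfies the constraint.
    best = 0.0
    for i in range(n):
        p = i / (n - 1)
        if return_variance(p, gamma) <= C:
            best = max(best, expected_return(p, gamma))
    return best
```

Plotting `best_constrained_return(gamma, C)` against the time-dependent policy's return for several $(C, \gamma)$ pairs reproduces the comparison the exercise asks for.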

Exercise 7.10. Explain the relationship between the shaded area in Figure 7.4 and conditional value-at-risk for the depicted distribution. △
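The quantity behind this figure can also be checked numerically. By Rockafellar and Uryasev's reformulation (the basis of Equation 7.23), $\mathrm{CVaR}_\tau$ equals $\max_b \{ b - \tau^{-1}\mathbb{E}[(b - Z)^+] \}$, and for an empirical distribution this maximum matches the average of the worst $\tau$-fraction of outcomes. A small sketch, with helper names of my own:

```python
def cvar_dual(samples, tau):
    # CVaR_tau via Rockafellar & Uryasev: max_b { b - E[(b - Z)^+] / tau }.
    # The objective is concave and piecewise linear in b, so for an empirical
    # distribution the maximum is attained at one of the sample values.
    n = len(samples)
    def objective(b):
        return b - sum(max(b - z, 0.0) for z in samples) / (tau * n)
    return max(objective(b) for b in samples)

def cvar_tail_average(samples, tau):
    # Direct definition: mean of the worst tau-fraction of outcomes
    # (exact when tau * n is an integer).
    xs = sorted(samples)
    k = round(tau * len(xs))
    return sum(xs[:k]) / k

returns = [1.0, 2.0, 3.0, 4.0, 5.0] * 2
print(cvar_dual(returns, 0.2), cvar_tail_average(returns, 0.2))  # prints: 1.0 1.0
```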

Exercise 7.11. Following Equation 7.23, for a distribution $\nu \in \mathscr{P}(\mathbb{R})$, consider the function
\[
f(\theta) = \theta - \tau^{-1} \mathbb{E}_{Z \sim \nu} \big[ (\theta - Z)^+ \big] \,.
\]
Show that for $\tau \in (0, 1)$, the update
\[
\theta \leftarrow \theta + \alpha \tau \frac{\mathrm{d}}{\mathrm{d}\theta} f(\theta)
\]
is equivalent to the quantile update rule (Equation 6.11), in expectation. △
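The equivalence asked for here can be observed empirically: a sample-based gradient step on $f$, scaled by $\tau$, reduces to the increment $\alpha(\tau - \mathbb{1}\{Z \le \theta\})$, whose fixed point in expectation is the $\tau$-quantile of $\nu$. A minimal sketch, assuming $\nu = \mathrm{Uniform}(0, 1)$ (whose $\tau$-quantile is $\tau$ itself) and with naming of my own:

```python
import random

def quantile_sgd_step(theta, z, tau, alpha):
    # Sample-based ascent step on f(theta) = theta - E[(theta - Z)^+] / tau,
    # scaled by tau: the increment alpha * (tau - 1{z <= theta}) estimates
    # alpha * tau * f'(theta), i.e. the quantile update, in expectation.
    return theta + alpha * (tau - (1.0 if z <= theta else 0.0))

random.seed(1)
tau, theta, total = 0.7, 0.0, 0.0
for t in range(100_000):
    theta = quantile_sgd_step(theta, random.random(), tau, alpha=0.01)
    if t >= 50_000:
        total += theta  # average the late iterates to smooth the noise
avg = total / 50_000
# avg settles near the 0.7-quantile of Uniform(0, 1), i.e. close to 0.7
```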

Exercise 7.12 (*). Consider a uniform discretisation $\mathcal{B}_\varepsilon$ of the interval $[V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$ into intervals of width $\varepsilon$ (endpoints included). For a return-distribution function $\eta$ on the discrete space $\mathcal{X} \times \mathcal{B}_\varepsilon \times \mathcal{A}$, define its extension to $\mathcal{X} \times [V_{\mathrm{MIN}}, V_{\mathrm{MAX}}] \times \mathcal{A}$ by
\[
\tilde{\eta}(x, b, a) = \eta \big( x, \varepsilon \lceil b / \varepsilon \rceil, a \big) \,.
\]
Suppose that probability distributions can be represented exactly (i.e., without needing to resort to a finite-parameter representation). For the CVaR objective (Equation 7.22), derive an upper bound for the suboptimality
\[
\max_{\pi \in \Pi_{\mathrm{CVaR}}} J(\pi) - J(\tilde{\pi}) \,,
\]
where $\tilde{\pi}$ is found by the procedure of Section 7.8 applied to the discrete space $\mathcal{X} \times \mathcal{B}_\varepsilon \times \mathcal{A}$ and using the extension $\eta \mapsto \tilde{\eta}$ to implement the operator $\mathcal{T}_{\tilde{\mathcal{G}}}$. △
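The discretisation underlying this exercise is easy to make concrete. The sketch below is one plausible reading, assuming the extension maps each $b$ to the nearest grid point of $\mathcal{B}_\varepsilon$ at or above it; the constants and function name are illustrative only:

```python
import math

V_MIN, V_MAX, EPS = -1.0, 1.0, 0.25  # illustrative values

def grid_point(b):
    # Map b in [V_MIN, V_MAX] to the nearest point of B_eps at or above it,
    # as in the extension eta~(x, b, a) = eta(x, grid_point(b), a).
    b = min(max(b, V_MIN), V_MAX)
    steps = math.ceil((b - V_MIN) / EPS - 1e-12)  # guard against float error
    return min(V_MIN + steps * EPS, V_MAX)
```

The suboptimality bound then amounts to quantifying how much replacing $b$ by `grid_point(b)` (a perturbation of at most $\varepsilon$) can move the CVaR objective.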