7
Control
A superpressure balloon is a kind of aircraft whose altitude is determined by the
relative pressure between its envelope and the ambient atmosphere, and which
can be flown high in the stratosphere. Like a submarine, a superpressure balloon
ascends when it becomes lighter and descends when it becomes heavier. Once
in flight, superpressure balloons are passively propelled by the winds around
them, so that their direction of travel can be influenced simply by changing their
altitude. This makes it possible to steer such a balloon in an energy-efficient
manner and have it operate autonomously for months at a time. Determining
the most efficient way to control the flight of a superpressure balloon by means
of altitude changes is an example of a control problem, the topic of this chapter.
In reinforcement learning, control problems are concerned with finding policies
that achieve or maximise specified objectives. This is in contrast with pre-
diction problems, which are concerned with characterising or quantifying the
consequences of following a particular policy. The study of control problems
involves not only the design of algorithms for learning optimal policies, but
also the study of the behaviour of these algorithms under different conditions,
such as when learning occurs one sample at a time (as per Chapter 6), when
noise is injected into the process, or when only a finite amount of data is made
available for learning. Under the distributional perspective, the dynamics of
control algorithms exhibit a surprising complexity. This chapter gives a brief
and necessarily incomplete overview of the control problem. In particular, our
treatment of control differs from most textbooks in that we focus on the distri-
butional component, and for conciseness omit some traditional material such as
policy iteration and -return algorithms.
205
206 Chapter 7
7.1 Risk-Neutral Control
The problem of finding a policy that maximises the agent’s expected return is
called the risk-neutral control problem, as it is insensitive to the deviations of
returns from their mean. We have already encountered risk-neutral control when
we introduced the Q-learning algorithm in Section 3.7. We begin this chapter
by providing a theoretical justification for this algorithm.
Problem 7.1
(Risk-neutral control)
.
Given an MDP (
X
,
A
,
0
, P
X
, P
R
) and
discount factor 2[0, 1), find a policy maximising the objective function
J()=E
1
t=0
t
R
t
. 4
A solution
that maximises J is called an optimal policy.
Implicit in the definition of risk-neutral control and our definition of a policy
in Chapter 2 is the fact that the objective J is maximised by a policy that only
depends on the current state, that is one that takes the form
: X!P(A).
As noted in Section 2.2, policies of this type are more properly called sta-
tionary Markov policies, and are but a subset of possible decision rules. With
stationary Markov policies, the action A
t
is independent of the random vari-
ables X
0
, A
0
, R
0
,
...
, X
t–1
, A
t–1
, R
t–1
given X
t
. In addition, the distribution of A
t
,
conditional on X
t
, is the same for all time indices t.
By contrast, history-dependent policies select actions on the basis on the entire
trajectory up to and including X
t
(the history). Formally, a history-dependent
policy is a time-indexed collection of mappings
t
:(XAR)
t–1
X !P(A).
In this case, we have that
A
t
|(X
0:t
, A
0:t–1
, R
0:t–1
)
t
(· | X
0
, A
0
, R
0
, ..., A
t–1
, R
t–1
, X
t
) . (7.1)
When clear from context, we omit the time subscript to
t
and write
P
,
E
, G
,
and
to denote the joint distribution, expectation, random-variable function,
and return-distribution function implied by the generative equations but with
Equation 7.1 substituting the earlier definition from Section 2.2. We write
MS
for the space of stationary Markov policies, and
H
for the space of
history-dependent policies.
Control 207
It is clear that every stationary Markov policy is a history-dependent policy,
but the converse is not true. However, in risk-neutral control the added degree
of freedom provided by history-dependent policies is not needed to achieve
optimality; this is made formal by the following proposition (recall that a policy
is deterministic if it always selects the same action for a given state or history).
Proposition 7.2.
Let J(
) be as in Problem 7.1. There exists a determinis-
tic stationary Markov policy
2
MS
such that
J(
) J() 8 2
H
. 4
Proposition 7.2 is a central result in reinforcement learning from a computa-
tional point of view, for example, it is easier to deal with deterministic policies
(there are finitely many of them) than stochastic policies. Remark 7.1 discusses
some other beneficial consequences of Proposition 7.2. Its proof involves a
surprising amount of detail; we refer the interested reader to Puterman [2014,
Section 6.2.4].
7.2 Value Iteration and Q-Learning
The main consequence of Proposition 7.2 is that when optimising the risk-
neutral objective we can restrict our attention to deterministic stationary Markov
policies. In turn, this makes it possible to find an optimal policy
by computing
the optimal state-action value function Q
, defined as
Q
(x, a) = sup
2
MS
E
1
t=0
t
R
t
| X = x, A = a
. (7.2)
Just as the value function V
for a given policy
satisfies the Bellman equation,
Q
satisfies the Bellman optimality equation:
Q
(x, a)=E
R + max
a
0
2A
Q
(X
0
, a
0
)|X = x, A = a
. (7.3)
The optimal state-action value function describes the expected return obtained
by acting so as to maximise the risk-neutral objective when beginning from
the state-action pair (x, a). Intuitively, we may understand Equation 7.3 as
describing this maximising behaviour recursively. While there might be multiple
optimal policies, they must (by definition) achieve the same objective value in
Problem 7.1. This value is
E
[V
(X
0
)] ,
where V
is the optimal value function:
V
(x) = max
a2A
Q
(x, a).
208 Chapter 7
Given Q
, an optimal policy is obtained by acting greedily with respect to Q
that is, choosing in state x any action a that maximises Q
(x, a).
Value iteration is a procedure for finding the optimal state-action value function
Q
iteratively, from which
can then be recovered by choosing actions that
have maximal state-action values. In Chapter 5 we discussed a procedure for
computing V
based on repeated applications of the Bellman operator T
. Value
iteration replaces the Bellman operator T
in this procedure with the Bellman
optimality operator
(TQ)(x, a)=E
R + max
a
0
2A
Q(X
0
, a
0
)|X = x, A = a
. (7.4)
Let us define the L
1
norm of a state-action value function Q 2R
XA
as
kQk
1
= sup
x2X,a2A
|Q(x, a)|.
As the following establishes, the Bellman optimality operator is a contraction
mapping in this norm.
Lemma 7.3.
The Bellman optimality operator T is a contraction in L
1
norm
with modulus . That is, for any Q, Q
0
2R
XA
,
kTQ TQ
0
k
1
kQ Q
0
k
1
. 4
Corollary 7.4.
The optimal state-action value function Q
is the only value
function that satisfies the Bellman optimality equation. Further, for any Q
0
2
R
XA
, the sequence (Q
k
)
k0
defined by Q
k+1
= TQ
k
(for k
0) converges to
Q
. 4
Corollary 7.4 is an immediate consequence of Lemma 7.3 and Proposition 4.7.
Before we give the proof of Lemma 7.3, it is instructive to note that despite
a visual similarity and the same contractive property in supremum norm, the
optimality operator behaves somewhat differently from the fixed-policy operator
T
, defined for state-action value functions as
(T
Q)(x, a)=E
R + Q(X
0
, A
0
)|X = x, A = a] , (7.5)
where conditional on the random variables (X = x, A = a, R, X
0
), we have A
0
(
·
| X
0
). In the context of this chapter, we call T
a policy evaluation operator.
Such an operator is said to be affine: For any two Q-functions Q, Q
0
and
2
[0, 1], it satisfies
T
(Q + (1 )Q
0
)=T
Q + (1 )T
Q
0
. (7.6)
Control 209
Equivalently, the difference between T
Q and T
Q
0
can be expressed as
T
Q T
Q
0
= P
(Q Q
0
).
The optimality operator, on the other hand, is not affine. While affine operators
can be analysed almost as if they were linear,
55
the optimality operator is
generally a nonlinear operator. As such, its analysis requires a slightly different
approach.
Proof of Lemma 7.3.
The proof relies on a special property of the maximum
function. For f
1
, f
2
: A!R, it can be shown that
max
a2A
f
1
(a) max
a2A
f
2
(a)
max
a2A
f
1
(a)–f
2
(a)
.
Now let Q, Q
0
2R
XA
, and fix x
2X
, a
2A
. Let us write
E
xa,
[
·
]=
E
[
·
| X =
x, A = a]. By linearity of expectations, we have
|(TQ)(x, a)–(TQ
0
)(x, a)| =
E
xa,
[R + max
a
0
2A
Q(X
0
, a
0
)–R max
a
0
2A
Q
0
(X
0
, a
0
)]
=
E
xa,
[ max
a
0
2A
Q(X
0
, a
0
)– max
a
0
2A
Q
0
(X
0
, a
0
)]
=
E
xa,
[max
a
0
2A
Q(X
0
, a
0
) max
a
0
2A
Q
0
(X
0
, a
0
)]
E
xa,
| max
a
0
2A
Q(X
0
, a
0
) max
a
0
2A
Q
0
(X
0
, a
0
)|
E
xa,
max
a
0
2A
|Q(X
0
, a
0
)–Q
0
(X
0
, a
0
)|
max
(x
0
,a
0
)2XA
|Q(x
0
, a
0
)–Q
0
(x
0
, a
0
)|
= kQ Q
0
k
1
.
Since the bound holds for any (x, a) pair, it follows that
kTQ TQ
0
k
1
kQ Q
0
k
1
.
Corollary 7.5.
For any initial state-action value function Q
0
2R
XA
, the
sequence of iterates Q
k+1
= TQ
k
converges to Q
in the L
1
norm. 4
We can use the unbiased estimation method of Section 6.2 to derive an incre-
mental algorithm for learning Q
, since the contractive operator T is expressible
as an expectation over a sample transition. Given a realisation (x, a, r, x
0
), we
55. Consider the proof of Proposition 4.4.
210 Chapter 7
construct the sample target
r + max
a
0
2A
Q(x
0
, a
0
).
We then incorporate this target to an update rule to obtain the Q-learning
algorithm first encountered in Chapter 3:
Q(x, a) (1 )Q(x, a)+(r + max
a
0
2X
Q(x
0
, a
0
)) .
Under appropriate conditions, the convergence of Q-learning to Q
can be
established with Lemma 7.3 and a suitable extension of the analysis of Chapter
6 to the space of action-value functions.
7.3 Distributional Value Iteration
Analogous to value iteration, we can devise a distributional dynamic pro-
gramming procedure for the risk-neutral control problem; such a procedure
determines an approximation to the return function of an optimal policy. As we
will see, in some circumstances this can be accomplished without complications,
giving some theoretical justification for the distributional algorithm presented
in Section 3.7.
As in Chapter 5, we perform distributional dynamic programming by imple-
menting the combination of a projection step with a distributional optimality
operator.
56
Because it is not possible to “directly maximise” a probability
distribution, we instead define the operator via a greedy selection rule G.
We can view the expected-value optimality operator T as substituting the expec-
tation over the next-state action A
0
in Equation 7.5 by a maximisation over a
0
.
As such, it can be rewritten as a particular policy evaluation operator T
whose
policy
depends on the operand Q; the mapping from Q to
is what we call a
greedy selection rule.
Definition 7.6.
A greedy selection rule is a mapping
G
:
R
XA
!
MS
with
the property that for any Q
2R
XA
,
G
(Q) is greedy with respect to Q. That is,
G(Q)(a | x)>0 =) Q(x, a) = max
a
0
2A
Q(x, a
0
).
56.
As before, this implementation is at its simplest when there are finitely many possible states,
actions, and rewards, and the projection step can be computed efficiently. Alternatives include a
sample-based approach and, as we will see in Chapter 9, function approximation.
Control 211
We extend
G
to return functions by defining, for
2P
1
(
R
)
XA
, the induced
state-action value function
Q
(x, a)= E
Z(x,a)
[Z],
and then letting
G()=G(Q
). 4
A greedy selection rule may produce stochastic policies, for example when
assigning equal probability to two equally-valued actions. However, it must
select actions which are maximally valued according to Q. Given a greedy
selection rule, we may rewrite the Bellman optimality operator as
TQ = T
G(Q)
Q . (7.7)
In the distributional setting, we must make explicit the dependency of the oper-
ator on the greedy selection rule
G
. Mirroring Equation 7.7, the distributional
Bellman optimality operator derived from G is
T
G
= T
G()
,
We will see that, unlike the expected-value setting, different choices of the
greedy selection rule result in different operators with possibly different dynam-
ics we thus speak of the distributional Bellman optimality operators, in the
plural.
Distributional value iteration algorithms combine a greedy selection rule
G
and a projection step (described by the operator
F
and implying a particular
choice of representation
F
) to compute an approximate optimal return function.
Given some initial return function
0
2F
XA
, distributional value iteration
implements the iterative procedure
k+1
=
F
T
G(
k
)
k
. (7.8)
In words, distributional value iteration selects a policy which at each state x is
greedy with respect to the expected values of
k
(x,
·
), and computes the return
function resulting from a single step of distributional dynamic programming
with that policy.
The induced value function Q
k
plays an important role in distributional value
iteration as it is used to derive the greedy policy
k
=
G
(
k
). When
F
is mean-
preserving (Section 5.11), Q
k
behaves as if it had been computed from standard
value iteration (Figure 7.1). That is,
Q
k+1
= TQ
k
.
212 Chapter 7
0
Q Q
0
F
T
G
T
Figure 7.1
When the projection step
F
is mean-preserving, distributional value iteration produces
the same sequence of state-action value functions as standard value iteration.
By induction, distributional value iteration then produces the same sequence of
state-action value functions as regular value iteration. That is, given the initial
condition
Q
0
(x, a)= E
Z
0
(x,a)
[Z], (x, a) 2XA,
and the recursion
Q
k+1
= TQ
k
,
we have
Q
k
= Q
k
, for all k 0.
As a consequence, these two processes also produce the same greedy policies:
k
= G(Q
k
).
In the following section we will use this equivalence to argue that distributional
value iteration finds an approximation to an optimal return function. When
F
is not mean-preserving, however, Q
k
may deviate from Q
k
. If that is the case,
the greedy policy
k
is likely to be different from
G
(Q
k
), and distributional
value iteration may converge to the return function of a suboptimal policy. We
discuss this point in Remark 7.2.
Before moving on, let us remark on an alternative procedure for approximating
an optimal return function. This procedure first performs standard value iteration
to obtain an approximation
ˆ
Q
to Q
. A greedy policy
ˆ
is then extracted from
ˆ
Q
.
Finally, distributional dynamic programming is used to approximate the return
function
ˆ
. If
ˆ
is an optimal policy, this directly achieves our stated aim, and
suggests doing away with the added complexity of distributional value iteration.
In larger problems, however, it is difficult or undesirable to wait until Q
has
been determined before learning or computing the return function, or it may
not be possible to decouple value and return predictions. In these situations it is
sensible to consider distributional value iteration.
Control 213
7.4 Dynamics of Distributional Optimality Operators
In this section and the next, we analyse the behaviour of distributional optimality
operators; recall from Section 7.3 that there is one such operator for each choice
of greedy selection rule
G
, in contrast to non-distributional value iteration.
Combined with a mean-preserving projection, our analysis also informs the
behaviour of distributional value iteration. As we will see, even in the absence of
approximation due to a finite-parameter representation, distributional optimality
operators exhibit complex behaviour.
Thus far, we have analysed distributional dynamic programming algorithms
by appealing to contraction mapping theory. Demonstrating that an opera-
tor is a contraction provides a good deal of understanding about algorithms
implementing its repeated application: they converge at a geometric rate to the
operator’s fixed point (when such a fixed point exists). Unfortunately, distri-
butional optimality operators are not contraction mappings, as the following
shows.
Proposition 7.7.
Consider a probability metric d and let
d
be its supremum
extension. Suppose that for any z, z
0
2R,
d(
z
,
z
0
)<1.
If d is c-homogeneous, there exist a Markov decision process and two
return-distribution functions
,
0
such that for any greedy selection rule
G
and any discount factor 2(0, 1],
d(T
G
, T
G
0
)>d(,
0
) . (7.9)
4
Proof. Consider a MDP with two non-terminal states x, y and two actions a, b
(Figure 7.2). From state x all actions transition to state y and yield no reward. In
state y, action a results in a reward of
"
> 0, while action b results in a reward
that is –1 or 1 with equal probability; both actions are terminal. We will argue
that " can be chosen to make T
G
satisfy Equation 7.9.
First note that any optimal policy must choose action a in state y, and all optimal
policies share the same return-distribution function
. Consider another return-
distribution function
that is equal to
at all state-action pairs except that
(y, a)=
"
(Figure 7.2, right). This implies that the greedy selection rule must
select b in y, and hence
(T
G
)(x, a)=(T
G
)(x, b) = (b
0,
)
#
(y, b) = (b
0,
)
#
(y, b).
214 Chapter 7
T
G
⌘⌘
x, · " ±"
y, a "" "
y, b ±1 ±1 ±1
Figure 7.2
Left.
A Markov decision process for which no distributional Bellman optimality operator
is a contraction mapping.
Right.
The optimal return-distribution function
and initial
return-distribution function
used in the proof of Proposition 7.7 (expressed in terms of
their support). Given a c-homogeneous probability metric d, the proof chooses
"
so as to
make d(T
G
,
)>d(,
).
Because both actions are terminal from y, we also have that
(T
G
)(y, a)=
(y, a)
(T
G
)(y, b)=
(y, b).
Let us write
=
1
2
–1
+
1
2
1
for the reward distribution associated with (y, b). We
have
d(,
)=d
(y, a),
(y, a)
= d(
"
,
"
) , (7.10)
and
d(T
G
, T
G
)=d
(T
G
)(x, a), (T
G
)(x, a)
= d
(b
0,
)
#
(y, b), (b
0,
)
#
(y, a)
=
c
d(,
"
) , (7.11)
where the last line follows by c-homogeneity (Definition 4.22). We will show
that for sufficiently small " > 0 we have
d(,
"
)>
c
d(
"
,
"
),
from which the result follows by Equations 7.10 and 7.11. To this end, note that
d(
"
,
"
)="
c
d(
–1
,
1
). (7.12)
Since d is finite for any pair of Dirac deltas, we have that d(
–1
,
1
)<
1
and so
lim
"!0
d(
"
,
"
) = 0 . (7.13)
Control 215
On the other hand, by the triangle inequality we have
d(,
"
)+d(
0
,
"
) d(,
0
) > 0 , (7.14)
where the second inequality follows because
and
0
are different distributions.
Again by c-homogeneity of d, we deduce that
lim
"!0
d(
0
,
"
)=0,
and hence
lim inf
"!0
d(,
"
) d(,
0
)>0.
Therefore, for " > 0 sufficiently small, we have
d(,
"
)>
c
d(
"
,
"
),
as required.
Proposition 7.7 establishes that for any metric d that is sufficiently well-behaved
to guarantee that policy evaluation operators are contraction mappings (finite on
bounded-support distributions, c-homogeneous), optimality operators are not
generally contraction mappings with respect to
d
. As a consequence we cannot
directly apply the tools of Chapter 4 to characterise distributional optimality
operators. The issue, which is implied in the proof above, is that it is possible
for distributions to have similar expectations yet differ substantially e.g. in their
variance.
With a more careful line of reasoning, we can still identify situations in which
the iterates
k+1
=
T
G
k
do converge. The most common scenario is when there
is a unique optimal policy. In this case, the analysis is simplified by the existence
of an action gap in the optimal value function.
57
Definition 7.8.
Let Q
2R
XA
. The action gap at a state x is the difference
between the highest-valued and second highest-valued actions:
GAP(Q, x) = min
Q(x, a
)–Q(x, a):a
, a 2A, a
6= a, Q(x, a
) = max
a
0
2A
Q(x, a
0
)
.
The action gap of Q is the smallest gap over all states:
GAP(Q) = min
x2X
GAP(Q, x).
By extension, the action gap for a return function is
GAP() = min
x2X
GAP(Q
, x), Q
(x, a)= E
Z(x,a)
[Z]. 4
57. Example 6.1 introduced an update rule that increases this action gap.
216 Chapter 7
Definition 7.8 is such that if two actions are optimal at a given state, then
GAP
(Q
) = 0. When
GAP
(Q
) > 0, however, there is a unique action that is
optimal at each state (and vice-versa). In this case, we can identify an iteration
K from which
G
(
k
)=
for all k
K. From that point on, any distributional
optimality operator reduces to the
evaluation operator which enjoys now-
familiar convergence guarantees.
Theorem 7.9.
Let
T
G
be the distributional Bellman optimality operator
instantiated with a greedy selection rule
G
. Suppose that there is a unique
optimal policy
and let p
2
[1,
1
]. Under Assumption 4.29(p) (well-
behaved reward distributions), for any initial return-distribution function
0
2P(R) the sequence of iterates defined by
k+1
= T
G
k
.
converges to
with respect to the metric w
p
. 4
Theorem 7.9 is stated in terms of supremum p-Wasserstein distances for con-
ciseness. Following the conditions of Theorem 4.25 and under a different set
of assumptions, we can of course also establish the convergence in say, the
supremum Cramér distance.
Corollary 7.10.
Suppose that
F
is a mean-preserving projection for some
representation
F
, and that there is a unique optimal policy
. If Assump-
tion 5.22(w
p
) holds and
F
is a non-expansion in
w
p
, then under the conditions
of Theorem 7.9 the sequence
k+1
=
F
T
G
k
produced by distributional value iteration converges to the fixed point
ˆ
=
F
T
ˆ
. 4
Proof of Theorem 7.9.
As there is a unique optimal policy, it must be that the
action gap of Q
is strictly greater than zero. Fix
"
=
1
2
GAP
(Q
). Following the
discussion of the previous section, we have that
Q
k+1
= TQ
k
.
Because T is a contraction in L
1
norm, we know that there exists a K
"
2N
after which
kQ
k
Q
k
1
< " 8k K
"
. (7.15)
Control 217
For a fixed x, let a
be the optimal action in that state. From Equation 7.15 we
deduce that for any a 6= a
,
Q
k
(x, a
) Q
(x, a
)–"
Q
(x, a)+GAP(Q
)–"
> Q
k
(x, a)+GAP(Q
)–2"
= Q
k
(x, a).
Thus the greedy action in state x after time K
"
is the optimal action for that state.
Thus G(
k
)=
for k K
"
and
k+1
= T
k
k K
"
.
We can treat
K
"
as a new initial condition
0
0
, and apply Proposition 4.30 to
conclude that
k
!
.
In Section 3.7 we introduced the categorical Q-learning algorithm in terms of a
deterministic greedy policy. Generalised to an arbitrary greedy selection rule
G
,
categorical Q-learning is defined by the update
(x, a) (1 )(x, a)+
a
0
2A
G()(a
0
| x
0
)
C
(b
r,
)
#
x
0
, a
0
.
Because the categorical projection is mean-preserving, its induced state-value
function follows the update
Q
(x, a) (1 )(x, a)+
a
0
2A
G()(a
0
| x
0
)
r + Q
(x
0
, a
0
)

r+ max
a
0
2A
Q
(x
0
,a
0
)
.
Using the tools of Chapter 6, one can establish the convergence of categorical Q-
learning under certain conditions, including the assumption that there is a unique
optimal policy
. The proof essentially combines the insight that under certain
conditions the sequence of greedy policies tracked by the algorithm matches that
of a non-distributional Q-learning algorithm, and hence eventually converges
to
, at which point the algorithm is essentially performing categorical policy
evaluation of
. The actual proof is somewhat technical; we omit it here and
refer the interested reader to Rowland et al. [2018].
7.5 Dynamics In the Presence of Multiple Optimal Policies*
In the value-based setting, it does not matter which greedy selection rule is
used to represent the optimality operator: By definition, any greedy selection
rule must be equivalent to directly maximising over Q(x,
·
). In the distributional
218 Chapter 7
setting, however, different rules usually result in different operators. As a
concrete example, compare the rule “among all actions whose expected value is
maximal, pick the one with smallest variance” to “assign equal probability to
actions whose expected value is maximal”.
Theorem 7.9 relies on the fact that, when there is a unique optimal policy
, we can identify a time after which the distributional optimality operator
behaves like a policy evaluation operator. When there are multiple optimal
policies, however, the action gap of the optimal value function Q
is zero and
the argument cannot be used. To understand why this is problematic, it is useful
to write the iterates (
k
)
k0
more explicitly in terms of the policies
k
= G(
k
):
k+1
= T
k
k
= T
k
T
k–1
k–1
= T
k
···T
0
0
. (7.16)
When the action gap is zero, the sequence of policies
k
,
k+1
,
...
may continue
to vary over time, depending on the greedy selection rule. Although all optimal
actions have the same optimal value (guaranteeing the convergence of the
expected values to Q
), they may correspond to different distributions. Thus
distributional value iteration even with a mean-preserving projection may
mix together the distribution of different optimal policies. Even if (
k
)
k0
converges, the policy it converges to may depend on initial conditions (Exercise
7.5). In the worst case, the iterates (
k
)
k0
might not even converge, as the
following example shows.
Example 7.11
(Failure to converge)
.
Consider a Markov decision process with
a single non-terminal state x and two actions, a and b (Figure 7.3, left). Action
a gives a reward of 0 and leaves the state unchanged, while action b gives a
reward of –1 or 1 with equal probability and leads to a terminal state. Note that
the expected return from taking either action is 0.
Let
2
(0, 1). We will exhibit a sequence of return function estimates (
k
)
k0
that is produced by distributional value iteration and does not have a limit. We
do so by constructing a greedy selection rule that achieves the desired behaviour.
Suppose that
0
(x, b)=
1
2
(
–1
+
1
).
For any initial parameter c
0
2R, if
0
(x, a)=
1
2
(
c
0
+
c
0
),
then by induction there is a sequence of scalars (c
k
)
k0
such that
k
(x, a)=
1
2
(
c
k
+
c
k
) 8k 2N . (7.17)
Control 219
Figure 7.3
Left.
A Markov decision process in which the sequence of return function estimates
(
k
)
k0
do not converge (Example 7.11).
Right.
The return-distribution function estimate
at (x, a) as a function of k and beginning with c
0
= 1. At each k, the pair of dots indicates
the support of the distribution.
This is because
k+1
(x, a) = (b
0,
)
#
k
(x, a) or
k+1
(x, a) = (b
0,
)
#
k
(x, b), k 0.
Define the following greedy selection rule: at iteration k + 1, choose a if c
k
1
10
,
otherwise choose b. With some algebra, this leads to the recursion (k 0)
c
k+1
=
if c
k
<
1
10
,
c
k
otherwise.
Over a period of multiple iterations, the estimate
k
(x, a) exhibits cyclical
behaviour (Figure 7.3, right). 4
The example illustrates how, without additional constraints on the greedy selec-
tion rule, it is not possible to guarantee that the iterates converge. However,
one can prove a weaker result based on the fact that only optimal actions must
eventually be chosen.
Definition 7.12.
For a given Markov decision process
M
, the set of non-
stationary Markov optimal return-distribution functions is
NMO
={
¯
: ¯ =(
k
)
k0
,
k
2
MS
is optimal for M, 8k 2N}. 4
In particular, any history-dependent policy
¯
satisfying the definition above is
also optimal for M.
220 Chapter 7
Theorem 7.13.
Let
T
G
be a distributional Bellman optimality operator
instantiated with some greedy selection rule, and let p
2
[1,
1
]. Under
Assumption 4.29(p), for any initial condition
0
2P
p
(
R
)
X
the sequence
of iterates
k+1
= T
G
k
converges to the set
NMO
, in the sense that
lim
k!1
inf
2
NMO
w
p
(,
k
) = 0. 4
Proof.
Along the lines of the proof of Theorem 7.9, there must be a time K
2N
after which the greedy policies
k
, k
K are optimal. For l
2N
+
, let us construct
the history-dependent policy
¯ =
K+l–1
,
K+l–2
, ...,
K+1
,
K
,
,
, ... ,
where
is some stationary Markov optimal policy. Denote the return of this
policy from the state-action pair (x, a) by G
¯
(x, a), and its return-distribution
function by
¯
. Because
¯
is an optimal policy, we have that
¯
2
NMO
. Let
G
k
be an instantiation of
k
, for each k
2N
. Now for a fixed x
2X
, a
2A
,
let (X
t
, A
t
, R
t
)
l
t=0
be the initial segment of a random trajectory generated by
following
¯
beginning with X
0
= x, A
0
= a. More precisely, for t = 1,
...
, l we
have
A
t
|(X
0:t
, A
0:t–1
, R
0:t–1
)
K+lt
( · | X
t
).
From this, we write
G
¯
(x, a)
D
=
l–1
t=0
t
R
t
+
l
G
(X
l
, A
l
).
Because
K+l
=
T
K+l–1
···T
K
K
, by inductively applying Proposition 4.11 we
also have
G
K+l
(x, a)
D
=
l–1
t=0
t
R
t
+
l
G
K
(X
l
, A
l
).
Hence
w
p
K+l
(x, a),
¯
(x, a)
= w
p
G
K+l
(x, a), G
¯
(x, a)
= w
p
l–1
t=0
t
R
t
+
l
G
K
(X
l
, A
l
),
l–1
t=0
t
R
t
+
l
G
(X
l
, A
l
)
.
Consequently,
w
p
K+l
(x, a),
¯
(x, a)
w
p
(
l
G
K
(X
l
, A
l
),
l
G
(X
l
, A
l
))
=
l
w
p
(G
K
(X
l
, A
l
), G
(X
l
, A
l
))
Control 221
l
w
p
(G
K
, G
),
following the arguments in the proof of Proposition 4.15. The result now follows
by noting that
w
p
(G
K
, G
)<
1
under our assumptions, taking the supremum
over (x, a) on the left-hand side, and taking the limit with respect to l.
Theorem 7.13 shows that, even before the effect of the projection step is taken
into account, the behaviour of distributional value iteration is in general quite
complex. When the iterates
k
are approximated (for example, because they are
estimated from samples), non-stationary Markov return functions may also be
produced by distributional value iteration even when there is a unique optimal
policy.
It may appear that the convergence issues highlighted by Example 7.11 and
Theorem 7.13 are consequences of using the wrong greedy selection rule. To
address these issues, one may be tempted to impose an ordering on policies (for
example, always prefer the action with the lowest variance, at equal expected
values). However, it is not clear how to do this in a satisfying way. One hurdle is
that, to avoid the cyclical behaviour demonstrated in Example 7.11, we would
like a greedy selection rule which is continuous with respect to its input. This
seems problematic, however, since we also need this rule to return the correct
answer when there is a unique optimal (and thus deterministic) policy (Exercise
7.7). This suggests that when learning to control with a distributional approach,
the learned return distributions may simultaneously reflect the random returns
from multiple policies.
7.6 Risk and Risk-Sensitive Control
Imagine being invited to interview for a desirable position at a prestigious
research institute abroad. Your plane tickets and the hotel have been booked
weeks in advance. Now the night before an early morning flight, you make
arrangements for a cab to pick you up from your house and take you to the
airport. How long in advance of your plane’s actual departure do you request
the cab for? If someone tells you that, on average, a cab to your local airport
takes an hour is that sufficient information to make the booking? How does
your answer change when the flight is scheduled around rush hour, rather than
early morning?
Fundamentally, it is often desirable that our choices be informed by the variabil-
ity in the process that produces outcomes from these choices. In this context,
we call this variability risk. Risk may be inherent to the process, or incomplete
222 Chapter 7
knowledge about the state of the world (including any potential traffic jams,
and the mechanical condition of the hired cab).
In contrast to risk-neutral behaviour, decisions that take risk into account are
called risk-sensitive. The language of distributional reinforcement learning is
particularly well-suited for this purpose, since it lets us reason about the full
spectrum of outcomes, along with their associated probabilities. The rest of
the chapter gives an overview of how one may account for risk in the decision-
making process and of the computational challenges that arise when doing so.
Rather than be exhaustive, here we take the much more modest aim of exposing
the reader to some of the major themes in risk-sensitive control and their relation
to distributional reinforcement learning; references to more extensive surveys
are provided in the bibliographical remarks.
Recall that the risk-neutral objective is to maximise the expected return from
the (possibly random) initial state X
0
:
J()=E
1
t=0
t
R
t
= E
G
(X
0
)
.
Here, we may think of the expectation as mapping the random variable G
(X
0
)
to a scalar. Risk-sensitive control is the problem that arises when we replace
this expectation by a risk measure.
Definition 7.14. A risk measure
58
is a mapping
: P
(R) ![–1, 1),
defined on a subset
P
(
R
)
P
(
R
) of probability distributions. By extension,
for a random variable Z instantiating the distribution
we write
(Z)=
(
).
4
Problem 7.15
(Risk-sensitive control)
.
Given an MDP (
X
,
A
,
0
, P
X
, P
R
),
a discount factor
2
[0, 1), and a risk measure
, find a policy
2
H
maximising
59
J
()=
1
t=0
t
R
t
. (7.18)
4
58.
More precisely, this is a static risk measure, in that it is only concerned with the return from
time t = 0. See bibliographical remarks.
59.
Typically, Problem 7.15 is formulated in terms of a risk to be minimised, which linguistically is
a more natural objective. Here, however, we consider the maximisation of J
(
) so as to keep the
presentation unified with the rest of the book.
Control 223
Figure 7.4
Illustration of value-at-risk (VaR) and conditional value-at-risk (CVaR). Depicted is the
cumulative distribution function of the mixture of normal distributions
=
1
2
N
(0, 1) +
1
2
N
(4, 1). The dashed line corresponds to VaR; CVaR (
= 0.4) can be determined from
a suitable transformation of the shaded area (see Section 7.8 and Exercise 7.10).
In Problem 7.15, we assume that the distribution of the random return lies in
P
(
R
), similar to our treatment of probability metrics in Chapter 4. From a
technical perspective, subsequent examples and results should be interpreted
with this assumption in mind.
The risk measure
may take into account higher-order moments of the return
distribution, be sensitive to rare events, and even disregard the expected value
altogether. Note that according to this definition,
=
E
also corresponds to a
risk-sensitive control problem. However, we reserve the term for risk measures
that are sensitive to more than only expected values.
Example 7.16
(Mean-variance criterion)
.
Let
> 0. The variance-penalised
risk measure penalises high-variance outcomes:
MV
(Z)=E[Z]–Var(Z). 4
Example 7.17
(Entropic risk)
.
Let
> 0. Entropic risk puts more weight on
smaller-valued outcomes:
ER
(Z)=–
1
log E[e
Z
]. 4
Example 7.18
(Value-at-risk)
.
Let
2
(0, 1). The value-at-risk measure (Figure
7.4) corresponds to the
th
quantile of the return distribution:
VAR
(Z)=F
–1
Z
(). 4
224 Chapter 7
7.7 Challenges In Risk-Sensitive Control
Many convenient properties of the risk-neutral objective do not carry over to
risk-sensitive control. As a consequence, finding an optimal policy is usually
significantly more involved. This remains true even when the risk-sensitive
objective (Equation 7.18) can be evaluated efficiently, for example by using
distributional dynamic programming to approximate the return-distribution func-
tion
. In this section we illustrate some of these challenges by characterising
optimal policies for the variance-constrained control problem.
The variance-constrained problem introduces risk sensitivity by forbidding poli-
cies whose return variance is too high. Given a parameter C
0, the objective
is to
maximise E
[G
(X
0
)]
subject to Var
G
(X
0
)
C .
(7.19)
Equation 7.19 can be shown to satisfy our definition of a risk-sensitive control
problem if we express it in terms of a Lagrange multiplier:
J
VC
() = min
0
E
G
(X
0
)
Var
G
(X
0
)
C
.
The variance-penalised and variance-constrained problems are related in that
they share the Pareto set
PAR
H
of possibly optimal solutions. A policy
is in the set
PAR
if we have that for all
0
2
H
,
(a) Var
G
(X
0
)
> Var
G
0
(X
0
)
=) E[G
(X
0
)] > E[G
0
(X
0
)], and
(b) Var
G
(X
0
)
= Var
G
0
(X
0
)
=) E[G
(X
0
)] E[G
0
(X
0
)].
In words, between two policies with equal variances, the one with lower expecta-
tion is never a solution to either problem. However, these problems are generally
not equivalent (Exercise 7.8).
Proposition 7.1 establishes the existence of a solution of the risk-neutral control
problem that is a) deterministic, b) stationary, and c) Markov. By contrast,
the solution to the variance-constrained problem may lack any or all of these
properties.
Example 7.19
(The optimal policy may not be deterministic)
.
Consider the
problem of choosing between two actions, a and b. Action a always yields a
reward of 0, while action b yields a reward of 0 or 1 with equal probability
(Figure 7.5a). If we seek the policy that maximises the expected return subject
to the variance constraint C =
3
/
16
, the best deterministic policy respecting the
variance constraint must choose a, for a reward of 0. On the other hand, the
Control 225
(a) (b) (c)
Figure 7.5
Examples demonstrating how the optimal policy for the variance-constrained control
problem might not be (a) deterministic, (b) Markov, or (c) stationary.
policy that selects a and b with equal probability achieves an expected reward
of
1
/4 and a variance of
3
/16. 4
Example 7.20
(The optimal policy may not be Markov)
.
Consider the Markov
decision process in Figure 7.5b. Suppose that we seek a policy that maximises
the expected return from state x, now subject to the variance constraint C = 0.
Let us assume
= 1 for simplicity. Action a has no variance and is therefore
a possible solution, with zero return. Action b gives a greater expected return,
at the cost of some variance. Any policy
that depends on the state alone
and chooses b in state x must incur this variance. On the other hand, the
following history-dependent policy achieves a positive expected return without
violating the variance constraint: In state x, choose b; if the first reward R
0
is
0 select action b in x, otherwise select action a. In all cases, the return is 1, an
improvement over the best Markov policy. 4
Example 7.21
(The optimal policy may not be stationary)
.
In general, the
optimal policy may require keeping track of time. Consider the problem of
maximising the expected return from the unique state x in Figure 7.5c, subject
to
Var
(G
(X
0
))
C, for C
1
/
4
. Exercise 7.9 asks you to show that a simple
time-dependent policy that chooses a for T
C
steps then selects b achieves an
expected return of up to
p
C
. This is possible because the variance of the return
decays at a rate of
2
, while its expected value decays at the slower rate of
.
By contrast, the best randomised stationary policy performs substantially worse
for a discount factor
close to 1 and small values of C (Figure 7.6). Intuitively,
a randomised policy must choose a with a sufficiently large probability to avoid
receiving the random reward early, which prevents it from selecting b quickly
beyond the threshold of T
C
time steps. 4
226 Chapter 7
Figure 7.6
Expected return as a function of the discount factor
and variance constraint C in
Example 7.21. Solid and dashed lines indicate the expected return of the best stationary
and time-dependent policies, respectively. The peculiar ziz-zag shape of the curve for
the time-varying policy arises because the time T
C
at which action b is taken must be an
integer (see Exercise 7.9).
The last two examples establish that the variance-constrained risk measure
is time-inconsistent: informally, the agent’s preference for one outcome over
another at time t may be reversed at a later time. Compared to the risk-neutral
problem, the variance-constrained problem is more challenging because the
space of policies that one needs to consider is much larger. Among other things,
the lack of an optimal policy that is Markov with respect to the state alone
also implies that the dependency on the initial distribution in Equation 7.19 is
necessary to keep things well-defined. The variance-constrained objective must
be optimised for globally, considering the policy at all states at once; this is in
contrast with the risk-neutral setting, where value iteration can make overall
improvements to the policy by acting greedily with respect to the value function
at individual states (see Remark 7.1). In fact, finding an optimal policy for the
variance-constrained control problem is NP-hard [Mannor and Tsitsiklis, 2011].
7.8 Conditional Value-At-Risk*
In the previous section, we saw that solutions to the variance-constrained control
problem can take unintuitive forms, including the need to penalise better-than-
expected outcomes. One issue is that variance only coarsely measures what we
mean by “risk” in the common sense of the word. To refine our meaning, we
may identify two types of risk: downside risk, involving undesirable outcomes
Control 227
such as greater-than-expected losses, and upside risk, involving what we may
informally call a stroke of luck. In some situations, it is possible and useful to
separately account for these two types of risk.
To illustrate this point, we now present a distributional algorithm for optimising
conditional value-at-risk (CVaR), based on work by Bäuerle and Ott [2011]
and Chow et al. [2015]. One benefit of working with full return distributions
is that the algorithmic template we present here can be reasonably adjusted to
deal with other risk measures, including the entropic risk measure described in
Example 7.17. For conciseness, in what follows we will state without proof a
few technical facts about conditional value-at-risk which can be found in those
sources and the work of Rockafellar and Uryasev [2002].
Conditional value-at-risk measures downside risk by focusing on the lower tail
behaviour of the return distribution, specifically the expected value of this tail.
This expected value quantifies the magnitude of losses in extreme scenarios.
Let Z be a random variable with cumulative and inverse cumulative distribution
functions F
Z
and F
–1
Z
, respectively. For a parameter
2
(0, 1), the CVaR of Z is
CVAR
(Z)=
1
0
F
–1
Z
(u)du . (7.20)
When the inverse cumulative distribution F
–1
Z
is strictly increasing, the right-
hand side of Equation 7.20 is equivalent to
E[Z | Z F
–1
Z
()] . (7.21)
In this case, CVaR quantifies the expected return, conditioned on the event that
this return is no greater than the return’s
th
quantile i.e., is within the
th
fraction of lowest returns.
60
In a reinforcement learning context, this leads to
the risk-sensitive objective
J
CVAR
()=CVAR
1
t=0
t
R
t
. (7.22)
In general, there may not be an optimal policy that depends on the state x alone;
however, one can show that optimality can be achieved with a deterministic,
stationary Markov policy on an augmented state that incorporates information
about the return thus far. In the context of this section, we assume that rewards
are bounded in [R
MIN
, R
MAX
]. At a high level, we optimise the CVaR objective
by
60.
In other fields, CVaR is applied to losses rather than returns, in which case it measures the
expected loss subject that this loss is above the
th percentile. For example, Equation 7.21 becomes
E[Z | Z F
–1
Z
( )], and the subsequent derivations need to be adjusted accordingly.
228 Chapter 7
(a) Defining the requisite augmented state;
(b)
Performing a form of risk-sensitive value iteration on this augmented state,
using a suitable selection rule;
(c) Extracting the resulting policy.
We now explain each of these steps.
Augmented state.
Central to the algorithm and to the state augmentation pro-
cedure is a reformulation of CVaR in terms of a desired minimum return or
target b
2R
. Let [x]
+
denote the function that is 0 if x < 0, and x otherwise. For
a random variable Z and
2
(0, 1), Rockafellar and Uryasev [2002] establish
that
CVAR
(Z) = max
b2R
b
–1
E
[b Z]
+
. (7.23)
When F
–1
Z
is strictly increasing, the maximum-achieving b for Equation 7.23
is the quantile F
–1
Z
(
). In fact, taking the derivative of the expression inside
the brackets with respect to
yields the quantile update rule (Equation 6.11;
see Exercise 7.11). The advantage of this formulation is that it is more easily
optimised in the context of a policy-dependent return. To see this, let us write
G
=
1
t=0
t
R
t
to denote the random return from the initial state X
0
, following some history-
dependent policy 2
H
. We then have that
max
2
H
J
CVAR
() = max
2
H
max
b2R
b
–1
E
[b G
]
+
= max
b2R
b
–1
min
2
H
E
[b G
]
+
. (7.24)
In words, the CVaR objective can be optimised by jointly finding an optimal
target b and a policy that minimises the expected shortfall
E
[b G
]
+
. For a
fixed target b, we will see that it is possible to minimise the expected shortfall
by means of dynamic programming. By adjusting b appropriately, one then
obtains an optimal policy.
Based on Equation 7.24, let us now consider an augmented state (X
t
, B
t
), where
X
t
is as usual the current state and B
t
takes on values in
B
=[V
MIN
, V
MAX
]; we
will describe its dynamics in a moment. With this augmented state we may
consider a class of stationary Markov policies,
CVAR
, which take the form
: XB!P(A).
61
61. Since B
t
is a function of the trajectory up to time t,
CVAR
is a strict subset of
H
.
Control 229
We use the variable B
t
to keep track of the amount of discounted reward that
should be obtained from X
t
onwards in order to achieve a desired minimum
return of b
0
2R
over the entire trajectory. The transition structure of the Markov
decision process over the augmented state is defined by modifying the generative
equations (Section 2.3):
B
0
= b
0
A
t
|(X
0:t
, B
0:t
, A
0:t–1
, R
0:t–1
) (· | X
t
, B
t
)
B
t+1
|(X
0:t+1
, B
0:t
, A
0:t
, R
0:t
)=
B
t
R
t
;
we similarly extend the sample transition model with the variables B and B
0
.
This definition of the variables (B
t
)
t0
can be understood by noting, for example,
that a minimum return of b
0
is achieved over the whole trajectory if
1
t=1
t–1
R
t
b
0
R
0
.
If the reward R
t
is small or negative, the new target B
t+1
may of course be larger
than B
t
. Note that the value of b
0
is a parameter of the algorithm, rather than
given by the environment.
Risk-sensitive value iteration.
We next construct a method for optimising the
expected shortfall given a target b
0
2R
. Let us write
2P
(
R
)
XBA
for a
return-distribution function on the augmented state-action space, instantiated
as a return-variable function G. For ease of exposition, we will mostly work
with this latter form of the return function. As usual, we write G
for the
return-variable function associated with a policy
2
CVAR
. With this notation
in mind, we write
J
CVAR
(, b
0
) = max
b2R
b
–1
E
[b G
(X
0
, b
0
, A
0
)]
+
(7.25)
to denote the conditional value-at-risk obtained by following policy
from the
initial state (X
0
, b
0
), with A
0
(· | X
0
, b
0
).
Similar to distributional value iteration, the algorithm constructs a series of
approximations to the return-distribution function by repeatedly applying the
distributional Bellman operator with a policy derived from a greedy selection
rule
˜
G. Specifically, we write
a
G
(x, b) = arg min
a2A
E
[b G(x, b, a)]
+
(7.26)
230 Chapter 7
for the greedy action at the augmented state (x, b), breaking ties arbitrarily. The
selection rule
˜
G is itself given by
˜
G()(a | x, b)=
˜
G(G)(a | x, b)= {a =a
G
(x, b)}.
The algorithm begins by initialising
0
(x, b, a)=
0
for all x
2X
, b
2B
, and
a 2A, and iterating
k+1
= T
˜
G
k
, (7.27)
as in distributional value iteration. Expressed in terms of random-variable
functions, this is
G
k+1
(x, b, a)
D
= R + G
k
(X
0
, B
0
,a
G
k
(X
0
, B
0
)), X = x, B = b, A = a.
After k iterations, the policy
k
=
˜
G
(
k
) can be extracted according to Equation
7.26. As suggested by Equation 7.25, a suitable choice of starting state is
b
0
= arg max
b2B
b
–1
E
[b G
k
(X
0
, b,a
G
k
(X
0
, b))]
+
, (7.28)
As given, there are two hurdles to producing a tractable implementation of
Equation 7.27: in addition to the usual concern that return distributions may
need to be projected onto a finite-parameter representation, we also have to
contend with a real-valued state variable B
t
. Before discussing how this can be
addressed, we first establish that Equation 7.27 is a sound approach to finding
an optimal policy for the CVaR objective.
Theorem 7.22.
Consider the sequence of return-distribution functions
(
k
)
k0
defined by Equation 7.27. Then the greedy policy
k
=
˜
G
(
k
) is
such that for all x 2X, b 2B, and a 2A,
E
[b G
k
(x, b, a)]
+
min
2
CVAR
E
[b G
(X
0
, b, a)]
+
+
k
V
MAX
V
MIN
1–
.
Consequently, we also have that the conditional value-at-risk of this policy
satisfies (with b
0
given by Equation 7.28)
J
CVAR
(
k
, b
0
) max
b2B
max
2
CVAR
J
CVAR
(, b)–
k
V
MAX
V
MIN
(1 )
. 4
The proof is somewhat technical, and is provided in Remark 7.3.
Theorem 7.22 establishes that with sufficiently many iterations, the policy
k
is
close to optimal. Of course, when distribution approximation is introduced the
resulting policy will in general only approximately optimise the CVaR objective,
with an error term that depends on the expressivity of the probability distribu-
tion representation (i.e., the parameter m in Chapter 5). To perform dynamic
Control 231
programming with the state variable B, one may use function approximation,
the subject of the next chapter. Another solution is to consider a discrete number
of values for B and to extend the operator
T
˜
G
to operate on this discrete set
(Exercise 7.12).
7.9 Technical Remarks
Remark 7.1.
In Section 7.7 we presented some of the challenges involved with
finding an optimal policy for the variance-constrained objective. In some sense,
these challenges should not be too surprising given that that we are looking to
maximise a function J of an infinite-dimensional object (a history-dependent
policy). Rather, what should be surprising is the relative ease with which one
can obtain an optimal policy in the risk-neutral setting.
From a technical perspective, this ease is a consequence of Lemma 7.3, which
guarantees that Q
(and hence
) can be efficiently approximated. However,
another important property of the risk-neutral setting is that the policy can be
improved locally, that is at each state simultaneously. To see this, consider a
state-action value function Q
for a given policy
, and denote by
0
a greedy
policy with regards to Q
. Then
TQ
= T
0
Q
T
Q
= Q
. (7.29)
That is, a single step of value iteration applied to the value function of a policy
results in a new value function that is at least as good as Q
at all states
the Bellman operator is said to be monotone. Because this single step also
corresponds to the value of a non-stationary policy that acts according to
0
for
one step then switches to
, we can equivalently interpret it as constructing,
one step at a time, a deterministic history-dependent policy for solving the
risk-neutral problem.
By contrast, it is not possible to use a direct dynamic programming approach
over the objective J to find the optimal policy for an arbitrary risk-sensitive con-
trol problem. A practical alternative is to perform the optimisation instead
with an ascent procedure (e.g., a policy gradient-type algorithm). Ascent
algorithms can often be computed in closed-form, and tend to be simpler
to implement. On the other hand, convergence is typically only guaranteed to
local optima, seemingly unavoidable when the optimisation problem is known
to be computationally hard. 4
Remark 7.2.
When the projection
F
is not mean-preserving, distributional
value iteration induces a state-action value function Q
k
that is different from
the value function Q
k
determined by standard value iteration under equivalent
232 Chapter 7
initial conditions. Under certain conditions on the distributions of rewards, it is
possible to bound this difference as k
!1
. To do so, we use a standard error
bound on approximate value iteration [see e.g. Bertsekas, 2012].
Lemma 7.23. Let (Q
k
)
k0
be a sequence of iterates in R
XA
satisfying
kQ
k+1
TQ
k
k
1
"
for some " > 0, and where T is the Bellman optimality operator. Then
lim sup
k!1
kQ
k
Q
k
1
"
1–
. 4
In the context of distributional value iteration, we need to bound the difference
kQ
k+1
TQ
k
k
1
.
When the rewards are bounded on the interval [R
MIN
, R
MAX
] and the projection
step is
Q
, the w
1
-projection onto the m-quantile representation, a simple bound
follows from an intermediate result used in proving Lemma 5.30 (see Exercise
5.20). In this case, for any bounded on [V
MIN
, V
MAX
],
w
1
(
Q
, )
V
MAX
V
MIN
2m
;
Conveniently, the 1-Wasserstein distance bounds the difference of means
between any distributions ,
0
2P
1
(R):
E
Z
[Z]– E
Z
0
[Z]
w
1
(,
0
).
This follows from the dual representation of the Wasserstein distance [Villani,
2008]. Consequently, for any (x, a),
Q
k+1
(x, a)–(T
k
Q
k
)(x, a)
w
1
k+1
(x, a), (T
k
k
)(x, a)
= w
1
(
Q
T
k
k
)(x, a), (T
k
k
)(x, a)
V
MAX
V
MIN
2m
.
By taking the maximum over (x, a) on the left-hand side of the above and
combining with Lemma 7.23, we obtain
lim
k!1
kQ
k
Q
k
1
V
MAX
V
MIN
2m(1 )
. 4
Remark 7.3.
Theorem 7.22 is proven from a few facts regarding partial returns,
which we now give. Let us write
a
k
(x, b)=
a
G
k
(x, b). We define the mapping
U
k
: XBA!R as
U
k
(x, b, a)=E
[b G
k
(x, b, a)]
+
,
Control 233
and for a policy 2
CVAR
similarly write
U
(x, b, a)=E
[b G
(x, b, a)]
+
.
Lemma 7.24. For any x 2X, a 2A, b 2B, and k 2N, we have
U
k+1
(x, b, a)= E
U
k
X
0
, B
0
,a
k
(X
0
, B
0
)
| X = x, B = b, A = a
.
In addition, if 2
CVAR
is a stationary Markov policy on XB, we have
U
(x, b, a)= E
U
X
0
, B
0
, A
0
| X = x, B = b, A = a
. 4
Proof.
The result follows by time-homogeneity and the Markov property. Con-
sider the sample transition model (X, B, A, R, X
0
, B
0
, A
0
), with A
0
=
a
k
(X
0
, B
0
).
Simultaneously, consider the partial trajectory (X
t
, B
t
, A
t
, R
t
)
k
t=0
for which A
0
= A
0
and A
t
kt
(· | X
t
, B
t
) for t > 0. As B
0
= B R, we have
E
U
k
X
0
, B
0
, A
0
| X = x, B = b, A = a
= E
E
B
0
G
k
(X
0
, B
0
, A
0
)
+
| X = x, B = b, A = a
= E
E
B
0
k
t=0
t
R
t
+
| X
0
= X
0
, B
0
= B
0
, A
0
= A
0
| X = x, A = a
= E
E
b R
k
t=0
t
R
t
+
| X
0
= X
0
, B
0
= B
0
, A
0
= A
0
| X = x, B = b, A = a
= E
E
b R
k+1
t=1
t
R
t
+
| X
1
= X
0
, B
1
= B
0
, A
1
= A
0
| X = x, B = b, A = a
= E
b
k+1
t=0
t
R
t
+
| X
0
= x, B
0
= b, A
0
= a
.
The second statement follows similarly.
Lemma 7.25.
Suppose that V
MIN
0 and V
MAX
0. Let (R
t
)
t0
be a sequence
of rewards in [R
MIN
, R
MAX
]. For any b 2R and k 2N,
E
[b
1
t=0
t
R
t
]
+
]+
k+1
V
MIN
E
[b
k
t=0
t
R
t
]
+
] E
[b
1
t=0
t
R
t
]
+
]+
k+1
V
MAX
.
4
234 Chapter 7
Proof. First note that, for any b, z, z
0
2R,
[b z]
+
[b z
0
]
+
+[z
0
z]
+
. (7.30)
To obtain the first inequality in the statement, we set
[b
1
t=0
t
R
t
]
+
[b
k
t=0
t
R
t
]
+
+ [–
1
t=k+1
t
R
t
]
+
Since rewards are bounded in [R
MIN
, R
MAX
], we have that
1
t=k+1
t
R
t
k+1
R
MIN
1–
=–
k+1
V
MIN
.
As we have assumed that V
MIN
0, it follows that
[b
1
t=0
t
R
t
]
+
[b
k
t=0
t
R
t
]
+
k+1
V
MIN
.
The second inequality in the statement is obtained analogously.
Lemma 7.26.
The sequence (
k
)
k0
defined by Equation 7.27 satisfies, for any
x 2X, b 2B, and a 2A,
U
k
(x, b, a) = min
2
CVAR
E
[b
k
t=0
t
R
t
]
+
| X = x, B = b, A = a
. (7.31)
4
Proof (sketch).
Our choice of G
0
(x, b, a) = 0 guarantees that the statement is
true for k = 0. The result then follows by Lemma 7.24, the fact that the policy
˜
G
(
k
) chooses the action minimising the left-hand side of Equation 7.31, and
by induction on k.
Proof of Theorem 7.22.
Let us assume that V
MIN
0 and V
MAX
0 so that
Lemma 7.25 can be applied. This is without loss of generality, as otherwise
we may first construct a new sequence of rewards shifted by an appropriate
constant C, such that R
MIN
= 0, R
MAX
0; by inspection, this transformation does
not affect the statement of the Theorem 7.22.
Let
2
CVAR
be an optimal deterministic policy, in the sense that
U
(x, b, a) = min
2
CVAR
U
(x, b, a).
Combining Lemmas 7.25 and 7.26, we have
U
(x, b, a)+
k+1
V
MIN
U
k
(x, b, a) U
(x, b, a)+
k+1
V
MAX
. (7.32)
Control 235
Write E
xba
[·]=E
[· | X = x, B = b, A = a]. By Lemma 7.24,
U
k
(x, b, a)–U
k
(x, b, a)
= E
xba
U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k–1
(X
0
, B
0
,a
k–1
(X
0
, B
0
))
= E
xba
U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k
(X
0
, B
0
,a
k
(X
0
, B
0
))
+ U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k–1
(X
0
, B
0
,a
k–1
(X
0
, B
0
))
E
xba
U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k
(X
0
, B
0
,a
k
(X
0
, B
0
))
+
k+1
V
MAX
k
V
MIN
.
(7.33)
Now, the quantity
$$\varepsilon_k(x, b, a) = U^{\pi_k}(x, b, a) - U_k(x, b, a)$$
is bounded above and hence
$$\varepsilon_k = \sup_{x \in X,\, b \in B,\, a \in A} \varepsilon_k(x, b, a)$$
exists. Taking the supremum over $x$, $b$, and $a$ on both sides of Equation 7.33, we have
$$\varepsilon_k \le \gamma\, \varepsilon_k + \gamma^{k+1}(V_{\mathrm{MAX}} - V_{\mathrm{MIN}}),$$
and consequently, for all $x, b, a$,
$$U^{\pi_k}(x, b, a) - U_k(x, b, a) \le \frac{\gamma^{k+1}(V_{\mathrm{MAX}} - V_{\mathrm{MIN}})}{1 - \gamma}. \qquad (7.34)$$
Because $U_k(x, b, a) \le U^{\pi^*}(x, b, a) + \gamma^{k+1} V_{\mathrm{MAX}}$, it follows that
$$U^{\pi_k}(x, b, a) \le U^{\pi^*}(x, b, a) + \gamma^{k+1} V_{\mathrm{MAX}} + \frac{\gamma^{k+1}(V_{\mathrm{MAX}} - V_{\mathrm{MIN}})}{1 - \gamma}.$$
Now,
$$\gamma^{k+1} + \frac{\gamma^{k+1}}{1 - \gamma} = \frac{\gamma^k\big(\gamma(1 - \gamma) + \gamma\big)}{1 - \gamma} \le \frac{\gamma^k}{1 - \gamma},$$
since $\gamma(1 - \gamma) + \gamma = 1 - (1 - \gamma)^2 \le 1$. As $V_{\mathrm{MAX}} \le V_{\mathrm{MAX}} - V_{\mathrm{MIN}}$, we hence conclude that also
$$U^{\pi_k}(x, b, a) \le \min_{\pi \in \Pi_{\mathrm{CVaR}}} U^\pi(x, b, a) + \frac{\gamma^k (V_{\mathrm{MAX}} - V_{\mathrm{MIN}})}{1 - \gamma},$$
as desired.
For the second statement, following Equation 7.28 we have
$$b_0 = \arg\max_{b \in B} \Big\{ b - \tau^{-1}\, \mathbb{E}\big[\, [b - G_k(X_0, b, \bar a_k(X_0, b))]_+ \,\big] \Big\}.$$
The algorithm begins in state $(X_0, b_0)$, selects action $A_0 = \bar a_k(X_0, b_0)$, and then executes policy $\pi_k$ from there on; its return is $G^{\pi_k}(X_0, b_0, A_0)$. In particular,
$$J_{\mathrm{CVaR}}(\pi_k, b_0) = \max_{b \in B} \Big\{ b - \tau^{-1}\, \mathbb{E}\big[\, [b - G^{\pi_k}(X_0, b_0, A_0)]_+ \,\big] \Big\} \;\ge\; b_0 - \tau^{-1}\, \mathbb{E}\big[\, [b_0 - G^{\pi_k}(X_0, b_0, A_0)]_+ \,\big]. \qquad (7.35)$$
Write $a_{\pi^*}(x, b)$ for the action selected by $\pi^*$ in $(x, b)$. Because Equation 7.32 holds for all $x$, $b$, and $a$, we have
$$\min_{a \in A} U_k(x, b, a) \le \min_{a \in A} U^{\pi^*}(x, b, a) + \gamma^{k+1} V_{\mathrm{MAX}}
\implies U_k\big(x, b, \bar a_k(x, b)\big) \le U^{\pi^*}\big(x, b, a_{\pi^*}(x, b)\big) + \gamma^{k+1} V_{\mathrm{MAX}},$$
since $\bar a_k(x, b)$ is the action $a$ that minimises $U_k(x, b, a)$. Hence, for any state $x$ we have
\begin{align*}
\max_{b \in B} \Big\{ b - \tau^{-1} \min_{a \in A} U_k(x, b, a) \Big\}
&\ge \max_{b \in B} \Big\{ b - \tau^{-1} \min_{a \in A} U^{\pi^*}(x, b, a) \Big\} - \tau^{-1} \gamma^{k+1} V_{\mathrm{MAX}} \\
\implies\quad b_0 - \tau^{-1}\, U_k\big(X_0, b_0, \bar a_k(X_0, b_0)\big)
&\ge \max_{b \in B} \Big\{ b - \tau^{-1}\, U^{\pi^*}\big(X_0, b, a_{\pi^*}(X_0, b)\big) \Big\} - \tau^{-1} \gamma^{k+1} V_{\mathrm{MAX}} \\
&= \max_{b \in B}\, J_{\mathrm{CVaR}}(\pi^*, b) - \tau^{-1} \gamma^{k+1} V_{\mathrm{MAX}},
\end{align*}
where the second implication uses the definition of $b_0$ in Equation 7.28.
Combined with Equations 7.34 and 7.35, and using the same algebraic bound as above, this yields
$$J_{\mathrm{CVaR}}(\pi_k, b_0) \;\ge\; \max_{b \in B}\; \max_{\pi \in \Pi_{\mathrm{CVaR}}} J_{\mathrm{CVaR}}(\pi, b) - \frac{\gamma^k (V_{\mathrm{MAX}} - V_{\mathrm{MIN}})}{\tau (1 - \gamma)}.$$
4
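To make the procedure analysed by Theorem 7.22 concrete, here is a minimal sketch of the budget-selection step of Equation 7.28, assuming that estimates of $U_k$ (for instance, Monte Carlo estimates) are available over a finite grid of budgets. The helper names, the grid, and the Gaussian return distributions are illustrative assumptions rather than part of the formal development.

```python
import numpy as np

def select_initial_budget(budgets, actions, U_estimate, tau):
    """Select b_0 = argmax_b { b - (1/tau) * min_a U(x_0, b, a) },
    where U_estimate(b, a) approximates E[(b - G_k(x_0, b, a))_+]."""
    best_b, best_value = None, -np.inf
    for b in budgets:
        u_min = min(U_estimate(b, a) for a in actions)  # greedy action's U-value
        value = b - u_min / tau                         # Rockafellar-Uryasev objective
        if value > best_value:
            best_b, best_value = b, value
    return best_b, best_value

# Illustrative use: each action's return is Gaussian (an assumption for the demo).
rng = np.random.default_rng(1)
samples = {
    "safe": rng.normal(1.0, 0.1, size=10_000),
    "risky": rng.normal(1.5, 2.0, size=10_000),
}

def U_estimate(b, a):
    return np.mean(np.maximum(b - samples[a], 0.0))

budgets = np.linspace(-5.0, 5.0, 201)  # discretised budget set B
b0, value = select_initial_budget(budgets, list(samples), U_estimate, tau=0.1)
print(b0, value)  # value of the Rockafellar-Uryasev objective at the maximising budget b_0
```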
7.10 Bibliographical Remarks
7.0.
The balloon navigation example at the beginning of the chapter is from
Bellemare et al. [2020]. Sutton and Barto [2018] separates “control problem”
from “prediction problem”; the latter features more prominently in this book.
In earlier literature, the control problem comes first [see e.g. Bellman, 1957a]
and prediction is typically used as a subroutine for control [Howard, 1960].
7.1.
Time-dependent policies are common in finite-horizon scenarios, and are
studied at length by Puterman [2014]. The technical core of Proposition 7.2
involves demonstrating that any feasible value function can be attained by a
stationary Markov policy; see the results by Puterman [2014, Theorem 5.5.1],
Altman [1999] and the discussion by Szepesvári [2020].
In reinforcement learning, history-dependent policies are also used to deal with
partially observable environments, in which the agent receives an observation o
at each time step rather than the identity of its state. For example, McCallum
[1995] uses a variable-length history to represent state-action values, while
Veness et al. [2011] uses a history-based probabilistic model to learn a model
of the environment. History-dependent policies also play a central role in the
study of optimality in the fairly large class of computable environments [Hutter,
2005].
7.2.
The canonical reference for value iteration is the book by Bellman [1957a];
see also Bellman [1957b] for an asymptotic analysis in the undiscounted set-
ting. Lemma 7.3 is standard and can be found in most reinforcement learning
textbooks [Bertsekas and Tsitsiklis, 1996, Szepesvári, 2010, Puterman, 2014].
State-action value functions were introduced along with the Q-learning algo-
rithm [Watkins, 1989] and subsequently used in the development of SARSA
[Rummery and Niranjan, 1994]. Watkins and Dayan [1992] gives a restricted
result regarding the convergence of Q-learning, which is more thoroughly estab-
lished by Jaakkola et al. [1994], Tsitsiklis [1994], and Bertsekas and Tsitsiklis
[1996].
7.3–7.5.
The expression of the optimality operator as a fixed-policy operator
whose policy varies with the input is common in the analysis of control algo-
rithms [see e.g. Munos, 2003, Scherrer, 2014]. The view of value iteration as
constructing a history-dependent policy is taken by Scherrer and Lesner [2012]
to derive more accurate value learning algorithms in the approximate setting.
The extension to distributional value iteration and Theorem 7.13 are from
Bellemare et al. [2017a]. The correspondence between standard value iteration
and distributional value iteration with a mean-preserving projection is given by
Lyle et al. [2019].
The notion of action gap plays an important role in understanding the rela-
tionship between value function estimates and policies, in particular when
estimates are approximate. Farahmand [2011] gives a gap-dependent bound on
the expected return obtained by a policy derived from an approximate value
function. Bellemare et al. [2016] derives an algorithm for increasing the action
gap so as to improve performance in the approximate setting.
An example of a selection rule that explicitly incorporates distributional infor-
mation is the lexicographical rule of Jaquette [1973], which orders policies
according to the magnitude of their moments.
7.6.
The notion of risk and risk-sensitive decisions can be traced back to
Markowitz [1952], who introduced the concept of trading off expected gains and
variations in those gains in the context of constructing an investment portfolio;
see also Steinbach [2001] for a retrospective. Artzner et al. [1999] proposes
a collection of desirable characteristics that make a risk measure coherent
in the sense that it satisfies certain preference axioms. Of the risk measures
mentioned here, CVaR is coherent but the variance-constrained objective is
not. Artzner et al. [2007] discusses coherent risk measures in the context of
sequential decisions. Ruszczyński [2010] introduces the notion of dynamic risk
measures for Markov decision processes, which are amenable to optimisation
via Bellman-style recursions; see also Chow [2017] for a discussion of static
and dynamic risk measures as well as time consistency. Jiang and Powell [2018]
develop sample-based optimisation methods for dynamic risk measures based
on quantiles.
Howard and Matheson [1972] considered the optimisation of an exponential
utility function applied to the random return by means of policy iteration. The
same objective is given a distributional treatment by Chung and Sobel [1987].
Heger [1994] considers optimising for worst-case returns. Haskell and Jain
[2015] study the use of occupancy measures over augmented state spaces as an
approach for finding optimal policies for risk-sensitive control; similarly, an
occupancy measure-based approach to CVaR optimisation is studied by Carpin
et al. [2016]. Mihatsch and Neuneier [2002] and Shen et al. [2013] extend
Q-learning to the optimisation of recursive risk measures, where a base risk
measure is applied at each time step. Recursive risk measures are more easily
optimised than risk measures directly applied to the random return, but are not
as easily interpreted. Martin et al. [2020] consider combining distributional
reinforcement learning with the notion of second-order stochastic dominance
as a means of action selection. Quantile criteria are considered by Filar et al.
[1995] in the case of average-reward MDPs, and more recently by Gilbert et al.
[2017] and Li et al. [2021]. Delage and Mannor [2010] solves a risk-constrained
optimisation problem to handle uncertainty in a learned model’s parameters. See
Prashanth and Fu [2021] for a survey on risk-sensitive reinforcement learning.
7.7.
Sobel [1982] establishes that an operator constructed directly from the
variance-penalised objective does not have the monotone improvement property,
making its optimisation more challenging. The examples demonstrating the need
for randomisation and a history-dependent policy are adapted from Mannor and
Tsitsiklis [2011], who also prove the NP-hardness of the problem of optimising
the variance-constrained objective. Tamar et al. [2012] propose a policy gradient
algorithm for optimising a mean-variance objective and for the CVaR objective
[Tamar et al., 2015]; see also Prashanth and Ghavamzadeh [2013], Chow and
Ghavamzadeh [2014] for actor-critic algorithms for these criteria. Chow et al.
[2018] augment the state with the return-so-far in order to extend gradient-based
algorithms to a broader class of risk measures.
7.8.
The reformulation of the conditional value-at-risk (CVaR) of a random
variable in terms of the (convex) optimisation of a function of a variable $b \in \mathbb{R}$
is due to Rockafellar and Uryasev [2000]; see also Rockafellar and Uryasev
[2002], Shapiro et al. [2009]. Bäuerle and Ott [2011] provide an algorithm for
optimising the CVaR of the random return in Markov decision processes. Their
work forms the basis for the algorithm presented in this section, although the
treatment in terms of return-distribution functions is new here. Another closely
related algorithm is due to Chow et al. [2015], who additionally provide an
approximation error bound on the computed CVaR. Brown et al. [2020] apply
Rockafellar and Uryasev’s approach to design an agent that is risk-sensitive with
respect to a prior distribution over possible reward functions. Keramati et al.
[2020] combine categorical temporal-difference learning with an exploration
bonus derived from the DKW inequality to develop an algorithm to optimise
for conditional value-at-risk.
7.11 Exercises
Exercise 7.1.
Find a counterexample that shows that the Bellman optimality
operator is not an affine operator. 4
Exercise 7.2.
Consider the Markov decision process depicted in Figure 2.4a.
For which values of the discount factor $\gamma \in [0, 1)$ is there more than one optimal
action from state $x$? Use this result to argue that the optimal policy depends on
the discount factor. 4
Exercise 7.3.
Proposition 7.7 establishes that distributional Bellman optimality
operators are not contraction mappings.
(i)
Instantiate the result with the 1-Wasserstein distance. Provide a visual
explanation for the result by drawing the relevant cumulative distribution
functions before and after the application of the operator.
(ii)
Discuss why it was necessary, in the proof of Proposition 7.7, to assume
that the probability metric d is c-homogeneous. 4
Exercise 7.4. Suppose that there is a unique optimal policy $\pi^*$, as per Section 7.4. Consider the use of a projection $\Pi_{\mathcal{F}}$ for a probability representation $\mathcal{F}$ and the iterates
$$\eta_{k+1} = \Pi_{\mathcal{F}}\, \mathcal{T}\, \eta_k. \qquad (7.36)$$
Discuss under what conditions the sequence of greedy policies $(\mathcal{G}(\eta_k))_{k \ge 0}$ converges to $\pi^*$ when $\Pi_{\mathcal{F}}$ is
(i) The $m$-categorical projection $\Pi_{\mathrm{C}}$;
(ii) The $m$-quantile projection $\Pi_{\mathrm{Q}}$.
Where necessary, provide proofs of your statements. Does your answer depend on $m$, or on $\theta_1, \ldots, \theta_m$ for the case of the categorical representation? 4
Exercise 7.5. Give a Markov decision process for which the limit of the sequence of iterates defined by $\eta_{k+1} = \mathcal{T}^{\mathcal{G}(\eta_k)} \eta_k$ depends on the initial condition $\eta_0$, irrespective of the greedy selection rule $\mathcal{G}$. Hint. Construct a scenario where the implied policy $\pi_k$ is the same for all $k$, but depends on $\eta_0$. 4
Exercise 7.6.
Consider the greedy selection rule that selects an action with
minimal variance amongst those with maximal expected value, breaking ties
uniformly at random. Provide an example Markov decision process in which
this rule results in a sequence of return-distribution functions that does not
converge, as per Example 7.11. Hint. Consider reward distributions of the form
$\frac{1}{3}\sum_{i=1}^{3} \delta_{\theta_i}$. 4
Exercise 7.7 (*). Consider the 1-Wasserstein distance $w_1$ and its supremum extension $\overline{w}_1$. In addition, let $d$ be a metric on $\mathcal{P}(A)$. Suppose that we are given a mapping
$$\tilde{\mathcal{G}} : \mathcal{P}(\mathbb{R})^{X \times A} \to \Pi_{\mathrm{MS}}$$
which is continuous at every state, in the sense that for any $\varepsilon > 0$, there exists a $\delta > 0$ such that for any return functions $\eta, \eta'$,
$$\overline{w}_1(\eta, \eta') < \delta \implies \max_{x \in X} d\big(\tilde{\mathcal{G}}(\eta)(\cdot \mid x),\, \tilde{\mathcal{G}}(\eta')(\cdot \mid x)\big) < \varepsilon.$$
Show that this mapping cannot be a greedy selection rule in the sense of Definition 7.6. 4
Exercise 7.8. Consider the Markov decision process depicted in Figure 7.5a. Show that there is no $\lambda \ge 0$ such that the policy maximising
$$J_{\mathrm{MV}}(\pi) = \mathbb{E}\big[G^\pi(X_0)\big] - \lambda\, \mathrm{Var}\big(G^\pi(X_0)\big)$$
is stochastic. This illustrates how the variance-constrained and variance-penalised control problems are not equivalent. 4
Exercise 7.9. Consider the Markov decision process depicted in Figure 7.5c.
(i) Solve for the optimal stopping time $T_C$ maximising the return of a time-dependent policy that selects action $a$ for $T_C$ time steps, then selects action $b$ (under the constraint that the variance should be no greater than $C$).
(ii) Prove that this policy can achieve an expected return of up to $\sqrt{C}$.
(iii) Based on your conclusions, design a policy that improves on this strategy.
(iv) Show that the expectation and variance of the return of a randomised stationary policy $\pi$ that selects action $b$ with probability $p$ are given by
$$\mathbb{E}\big[G^\pi(x)\big] = \frac{p\,\gamma\,\epsilon^2}{1 - \gamma(1 - p)}, \qquad \mathrm{Var}\big(G^\pi(x)\big) = \frac{p\,\gamma^2 \epsilon^4}{1 - \gamma^2(1 - p)} - \left(\frac{p\,\gamma\,\epsilon^2}{1 - \gamma(1 - p)}\right)^2.$$
(v) Using your favourite visualisation program, chart the returns achieved by the optimal randomised policy and the optimal time-dependent policy, for values of $C$ and $\gamma$ different from those shown in Figure 7.5c. What do you observe? Hint. Use a root-finding algorithm to determine the maximum expected return of a randomised policy under the constraint $\mathrm{Var}\big(G^\pi(x)\big) \le C$. 4
Exercise 7.10.
Explain the relationship between the shaded area in Figure 7.4
and conditional value-at-risk for the depicted distribution. 4
Exercise 7.11. Following Equation 7.23, for a distribution $\nu \in \mathcal{P}(\mathbb{R})$ consider the function
$$f(\theta) = \theta - \tau^{-1}\, \mathbb{E}_{Z \sim \nu}\big[\, [\theta - Z]_+ \,\big].$$
Show that for $\alpha \in (0, 1)$, the update
$$\theta \leftarrow \theta + \alpha \tau \frac{\mathrm{d}}{\mathrm{d}\theta} f(\theta)$$
is equivalent to the quantile update rule (Equation 6.11), in expectation. 4
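As a quick numerical sanity check of the claim in Exercise 7.11 (not a substitute for the derivation), the following sketch compares $\alpha\tau$ times a finite-difference estimate of $f'(\theta)$ with the expected quantile-regression-style increment $\alpha(\tau - \mathbb{1}\{Z \le \theta\})$. The choice of distribution for $Z$, the step sizes, and the sample size are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # Z ~ nu (illustrative choice)
alpha, tau, theta = 0.1, 0.25, 0.3

def f(t):
    """f(t) = t - (1/tau) * E[(t - Z)_+], estimated from the samples of Z."""
    return t - np.mean(np.maximum(t - samples, 0.0)) / tau

# alpha * tau * f'(theta), with f' estimated by a central finite difference.
h = 1e-3
grad_step = alpha * tau * (f(theta + h) - f(theta - h)) / (2 * h)

# Expected quantile-style increment: E[alpha * (tau - 1{Z <= theta})].
quantile_step = alpha * np.mean(tau - (samples <= theta))

print(grad_step, quantile_step)  # the two increments agree up to Monte Carlo error
```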
Exercise 7.12 (*). Consider a uniform discretisation $B_\varepsilon$ of the interval $[V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$ into intervals of width $\varepsilon$ (endpoints included). For a return-distribution function $\eta$ on the discrete space $X \times B_\varepsilon \times A$, define its extension $\tilde\eta$ to $X \times [V_{\mathrm{MIN}}, V_{\mathrm{MAX}}] \times A$ by
$$\tilde\eta(x, b, a) = \eta\big(x,\, \varepsilon \lfloor b / \varepsilon \rfloor,\, a\big).$$
Suppose that probability distributions can be represented exactly (i.e., without needing to resort to a finite-parameter representation). For the CVaR objective (Equation 7.22), derive an upper bound for the suboptimality
$$\max_{\pi \in \Pi_{\mathrm{CVaR}}} J(\pi) - J(\tilde\pi),$$
where $\tilde\pi$ is found by the procedure of Section 7.8 applied to the discrete space $X \times B_\varepsilon \times A$ and using the extension $\eta \mapsto \tilde\eta$ to implement the operator $\mathcal{T}^{\tilde{\mathcal{G}}(\tilde\eta)}$.
4
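For intuition about the discretisation used in Exercise 7.12, here is a minimal sketch of the floor-based extension $\tilde\eta(x, b, a) = \eta(x, \varepsilon \lfloor b / \varepsilon \rfloor, a)$, assuming that the discrete return-distribution function is stored in a dictionary keyed by grid budgets; the bounds, grid width, and storage format are illustrative assumptions.

```python
import math

V_MIN, V_MAX, EPS = -1.0, 1.0, 0.25  # illustrative bounds and grid width

def snap_to_grid(b, eps=EPS, v_min=V_MIN, v_max=V_MAX):
    """Map a continuous budget b to the grid point eps * floor(b / eps),
    after clipping b to the interval [v_min, v_max]."""
    b = min(max(b, v_min), v_max)
    return round(eps * math.floor(b / eps), 10)  # round to avoid float-key mismatches

def extend(eta_discrete, x, b, a):
    """Extension of a discrete return-distribution function to arbitrary budgets."""
    return eta_discrete[(x, snap_to_grid(b), a)]

# Illustrative use: a single state-action pair with one entry per grid budget.
grid = [round(V_MIN + i * EPS, 10) for i in range(int((V_MAX - V_MIN) / EPS) + 1)]
eta_discrete = {("x0", g, "a0"): f"distribution at budget {g}" for g in grid}
print(extend(eta_discrete, "x0", 0.37, "a0"))  # looks up the entry at budget 0.25
```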