7
Control
A superpressure balloon is a kind of aircraft whose altitude is determined by the
relative pressure between its envelope and the ambient atmosphere, and which
can be ﬂown high in the stratosphere. Like a submarine, a superpressure balloon
ascends when it becomes lighter and descends when it becomes heavier. Once
in ﬂight, superpressure balloons are passively propelled by the winds around
them, so that their direction of travel can be inﬂuenced simply by changing their
altitude. This makes it possible to steer such a balloon in an energy-efﬁcient
manner and have it operate autonomously for months at a time. Determining
the most efﬁcient way to control the ﬂight of a superpressure balloon by means
of altitude changes is an example of a control problem, the topic of this chapter.
In reinforcement learning, control problems are concerned with ﬁnding policies
that achieve or maximise speciﬁed objectives. This is in contrast with pre-
diction problems, which are concerned with characterising or quantifying the
consequences of following a particular policy. The study of control problems
involves not only the design of algorithms for learning optimal policies, but
also the study of the behaviour of these algorithms under different conditions,
such as when learning occurs one sample at a time (as per Chapter 6), when
noise is injected into the process, or when only a ﬁnite amount of data is made
available for learning. Under the distributional perspective, the dynamics of
control algorithms exhibit a surprising complexity. This chapter gives a brief
and necessarily incomplete overview of the control problem. In particular, our
treatment of control differs from most textbooks in that we focus on the distri-
butional component, and for conciseness omit some traditional material such as
policy iteration and -return algorithms.
205
206 Chapter 7
7.1 Risk-Neutral Control
The problem of ﬁnding a policy that maximises the agent’s expected return is
called the risk-neutral control problem, as it is insensitive to the deviations of
returns from their mean. We have already encountered risk-neutral control when
we introduced the Q-learning algorithm in Section 3.7. We begin this chapter
by providing a theoretical justiﬁcation for this algorithm.
Problem 7.1
(Risk-neutral control)
.
Given an MDP (
X
,
A
,
0
, P
X
, P
R
) and
discount factor 2[0, 1), ﬁnd a policy maximising the objective function
J()=E
1
t=0
t
R
t
. 4
A solution
that maximises J is called an optimal policy.
Implicit in the deﬁnition of risk-neutral control and our deﬁnition of a policy
in Chapter 2 is the fact that the objective J is maximised by a policy that only
depends on the current state, that is one that takes the form
: X!P(A).
As noted in Section 2.2, policies of this type are more properly called sta-
tionary Markov policies, and are but a subset of possible decision rules. With
stationary Markov policies, the action A
t
is independent of the random vari-
ables X
0
, A
0
, R
0
,
...
, X
t–1
, A
t–1
, R
t–1
given X
t
. In addition, the distribution of A
t
,
conditional on X
t
, is the same for all time indices t.
By contrast, history-dependent policies select actions on the basis on the entire
trajectory up to and including X
t
(the history). Formally, a history-dependent
policy is a time-indexed collection of mappings
t
:(XAR)
t–1
X !P(A).
In this case, we have that
A
t
|(X
0:t
, A
0:t–1
, R
0:t–1
)
t
(· | X
0
, A
0
, R
0
, ..., A
t–1
, R
t–1
, X
t
) . (7.1)
When clear from context, we omit the time subscript to
t
and write
P
,
E
, G
,
and
to denote the joint distribution, expectation, random-variable function,
and return-distribution function implied by the generative equations but with
Equation 7.1 substituting the earlier deﬁnition from Section 2.2. We write
MS
for the space of stationary Markov policies, and
H
for the space of
history-dependent policies. Control 207
It is clear that every stationary Markov policy is a history-dependent policy,
but the converse is not true. However, in risk-neutral control the added degree
of freedom provided by history-dependent policies is not needed to achieve
optimality; this is made formal by the following proposition (recall that a policy
is deterministic if it always selects the same action for a given state or history).
Proposition 7.2.
Let J(
) be as in Problem 7.1. There exists a determinis-
tic stationary Markov policy
2
MS
such that
J(
) J() 8 2
H
. 4
Proposition 7.2 is a central result in reinforcement learning from a computa-
tional point of view, for example, it is easier to deal with deterministic policies
(there are ﬁnitely many of them) than stochastic policies. Remark 7.1 discusses
some other beneﬁcial consequences of Proposition 7.2. Its proof involves a
surprising amount of detail; we refer the interested reader to Puterman [2014,
Section 6.2.4].
7.2 Value Iteration and Q-Learning
The main consequence of Proposition 7.2 is that when optimising the risk-
neutral objective we can restrict our attention to deterministic stationary Markov
policies. In turn, this makes it possible to ﬁnd an optimal policy
by computing
the optimal state-action value function Q
, deﬁned as
Q
(x, a) = sup
2
MS
E
1
t=0
t
R
t
| X = x, A = a
. (7.2)
Just as the value function V
for a given policy
satisﬁes the Bellman equation,
Q
satisﬁes the Bellman optimality equation:
Q
(x, a)=E
R + max
a
0
2A
Q
(X
0
, a
0
)|X = x, A = a
. (7.3)
The optimal state-action value function describes the expected return obtained
by acting so as to maximise the risk-neutral objective when beginning from
the state-action pair (x, a). Intuitively, we may understand Equation 7.3 as
describing this maximising behaviour recursively. While there might be multiple
optimal policies, they must (by deﬁnition) achieve the same objective value in
Problem 7.1. This value is
E
[V
(X
0
)] ,
where V
is the optimal value function:
V
(x) = max
a2A
Q
(x, a).
208 Chapter 7
Given Q
, an optimal policy is obtained by acting greedily with respect to Q
that is, choosing in state x any action a that maximises Q
(x, a).
Value iteration is a procedure for ﬁnding the optimal state-action value function
Q
iteratively, from which
can then be recovered by choosing actions that
have maximal state-action values. In Chapter 5 we discussed a procedure for
computing V
based on repeated applications of the Bellman operator T
. Value
iteration replaces the Bellman operator T
in this procedure with the Bellman
optimality operator
(TQ)(x, a)=E
R + max
a
0
2A
Q(X
0
, a
0
)|X = x, A = a
. (7.4)
Let us deﬁne the L
1
norm of a state-action value function Q 2R
XA
as
kQk
1
= sup
x2X,a2A
|Q(x, a)|.
As the following establishes, the Bellman optimality operator is a contraction
mapping in this norm.
Lemma 7.3.
The Bellman optimality operator T is a contraction in L
1
norm
with modulus . That is, for any Q, Q
0
2R
XA
,
kTQ TQ
0
k
1
kQ Q
0
k
1
. 4
Corollary 7.4.
The optimal state-action value function Q
is the only value
function that satisﬁes the Bellman optimality equation. Further, for any Q
0
2
R
XA
, the sequence (Q
k
)
k0
deﬁned by Q
k+1
= TQ
k
(for k
0) converges to
Q
. 4
Corollary 7.4 is an immediate consequence of Lemma 7.3 and Proposition 4.7.
Before we give the proof of Lemma 7.3, it is instructive to note that despite
a visual similarity and the same contractive property in supremum norm, the
optimality operator behaves somewhat differently from the ﬁxed-policy operator
T
, deﬁned for state-action value functions as
(T
Q)(x, a)=E
R + Q(X
0
, A
0
)|X = x, A = a] , (7.5)
where conditional on the random variables (X = x, A = a, R, X
0
), we have A
0
(
·
| X
0
). In the context of this chapter, we call T
a policy evaluation operator.
Such an operator is said to be afﬁne: For any two Q-functions Q, Q
0
and
2
[0, 1], it satisﬁes
T
(Q + (1 )Q
0
)=T
Q + (1 )T
Q
0
. (7.6) Control 209
Equivalently, the difference between T
Q and T
Q
0
can be expressed as
T
Q T
Q
0
= P
(Q Q
0
).
The optimality operator, on the other hand, is not afﬁne. While afﬁne operators
can be analysed almost as if they were linear,
55
the optimality operator is
generally a nonlinear operator. As such, its analysis requires a slightly different
approach.
Proof of Lemma 7.3.
The proof relies on a special property of the maximum
function. For f
1
, f
2
: A!R, it can be shown that
max
a2A
f
1
(a) max
a2A
f
2
(a)
max
a2A
f
1
(a)–f
2
(a)
.
Now let Q, Q
0
2R
XA
, and ﬁx x
2X
, a
2A
. Let us write
E
xa,
[
·
]=
E
[
·
| X =
x, A = a]. By linearity of expectations, we have
|(TQ)(x, a)–(TQ
0
)(x, a)| =
E
xa,
[R + max
a
0
2A
Q(X
0
, a
0
)–R max
a
0
2A
Q
0
(X
0
, a
0
)]
=
E
xa,
[ max
a
0
2A
Q(X
0
, a
0
)– max
a
0
2A
Q
0
(X
0
, a
0
)]
=
E
xa,
[max
a
0
2A
Q(X
0
, a
0
) max
a
0
2A
Q
0
(X
0
, a
0
)]
E
xa,
| max
a
0
2A
Q(X
0
, a
0
) max
a
0
2A
Q
0
(X
0
, a
0
)|
E
xa,
max
a
0
2A
|Q(X
0
, a
0
)–Q
0
(X
0
, a
0
)|
max
(x
0
,a
0
)2XA
|Q(x
0
, a
0
)–Q
0
(x
0
, a
0
)|
= kQ Q
0
k
1
.
Since the bound holds for any (x, a) pair, it follows that
kTQ TQ
0
k
1
kQ Q
0
k
1
.
Corollary 7.5.
For any initial state-action value function Q
0
2R
XA
, the
sequence of iterates Q
k+1
= TQ
k
converges to Q
in the L
1
norm. 4
We can use the unbiased estimation method of Section 6.2 to derive an incre-
mental algorithm for learning Q
, since the contractive operator T is expressible
as an expectation over a sample transition. Given a realisation (x, a, r, x
0
), we
55. Consider the proof of Proposition 4.4.
210 Chapter 7
construct the sample target
r + max
a
0
2A
Q(x
0
, a
0
).
We then incorporate this target to an update rule to obtain the Q-learning
algorithm ﬁrst encountered in Chapter 3:
Q(x, a) (1 )Q(x, a)+(r + max
a
0
2X
Q(x
0
, a
0
)) .
Under appropriate conditions, the convergence of Q-learning to Q
can be
established with Lemma 7.3 and a suitable extension of the analysis of Chapter
6 to the space of action-value functions.
7.3 Distributional Value Iteration
Analogous to value iteration, we can devise a distributional dynamic pro-
gramming procedure for the risk-neutral control problem; such a procedure
determines an approximation to the return function of an optimal policy. As we
will see, in some circumstances this can be accomplished without complications,
giving some theoretical justiﬁcation for the distributional algorithm presented
in Section 3.7.
As in Chapter 5, we perform distributional dynamic programming by imple-
menting the combination of a projection step with a distributional optimality
operator.
56
Because it is not possible to “directly maximise” a probability
distribution, we instead deﬁne the operator via a greedy selection rule G.
We can view the expected-value optimality operator T as substituting the expec-
tation over the next-state action A
0
in Equation 7.5 by a maximisation over a
0
.
As such, it can be rewritten as a particular policy evaluation operator T
whose
policy
depends on the operand Q; the mapping from Q to
is what we call a
greedy selection rule.
Deﬁnition 7.6.
A greedy selection rule is a mapping
G
:
R
XA
!
MS
with
the property that for any Q
2R
XA
,
G
(Q) is greedy with respect to Q. That is,
G(Q)(a | x)>0 =) Q(x, a) = max
a
0
2A
Q(x, a
0
).
56.
As before, this implementation is at its simplest when there are ﬁnitely many possible states,
actions, and rewards, and the projection step can be computed efﬁciently. Alternatives include a
sample-based approach and, as we will see in Chapter 9, function approximation.
Control 211
We extend
G
to return functions by deﬁning, for
2P
1
(
R
)
XA
, the induced
state-action value function
Q
(x, a)= E
Z(x,a)
[Z],
and then letting
G()=G(Q
). 4
A greedy selection rule may produce stochastic policies, for example when
assigning equal probability to two equally-valued actions. However, it must
select actions which are maximally valued according to Q. Given a greedy
selection rule, we may rewrite the Bellman optimality operator as
TQ = T
G(Q)
Q . (7.7)
In the distributional setting, we must make explicit the dependency of the oper-
ator on the greedy selection rule
G
. Mirroring Equation 7.7, the distributional
Bellman optimality operator derived from G is
T
G
= T
G()
,
We will see that, unlike the expected-value setting, different choices of the
greedy selection rule result in different operators with possibly different dynam-
ics we thus speak of the distributional Bellman optimality operators, in the
plural.
Distributional value iteration algorithms combine a greedy selection rule
G
and a projection step (described by the operator
F
and implying a particular
choice of representation
F
) to compute an approximate optimal return function.
Given some initial return function
0
2F
XA
, distributional value iteration
implements the iterative procedure
k+1
=
F
T
G(
k
)
k
. (7.8)
In words, distributional value iteration selects a policy which at each state x is
greedy with respect to the expected values of
k
(x,
·
), and computes the return
function resulting from a single step of distributional dynamic programming
with that policy.
The induced value function Q
k
plays an important role in distributional value
iteration as it is used to derive the greedy policy
k
=
G
(
k
). When
F
is mean-
preserving (Section 5.11), Q
k
behaves as if it had been computed from standard
value iteration (Figure 7.1). That is,
Q
k+1
= TQ
k
. 212 Chapter 7
0
Q Q
0
F
T
G
T
Figure 7.1
When the projection step
F
is mean-preserving, distributional value iteration produces
the same sequence of state-action value functions as standard value iteration.
By induction, distributional value iteration then produces the same sequence of
state-action value functions as regular value iteration. That is, given the initial
condition
Q
0
(x, a)= E
Z
0
(x,a)
[Z], (x, a) 2XA,
and the recursion
Q
k+1
= TQ
k
,
we have
Q
k
= Q
k
, for all k 0.
As a consequence, these two processes also produce the same greedy policies:
k
= G(Q
k
).
In the following section we will use this equivalence to argue that distributional
value iteration ﬁnds an approximation to an optimal return function. When
F
is not mean-preserving, however, Q
k
may deviate from Q
k
. If that is the case,
the greedy policy
k
is likely to be different from
G
(Q
k
), and distributional
value iteration may converge to the return function of a suboptimal policy. We
discuss this point in Remark 7.2.
Before moving on, let us remark on an alternative procedure for approximating
an optimal return function. This procedure ﬁrst performs standard value iteration
to obtain an approximation
ˆ
Q
to Q
. A greedy policy
ˆ
is then extracted from
ˆ
Q
.
Finally, distributional dynamic programming is used to approximate the return
function
ˆ
. If
ˆ
is an optimal policy, this directly achieves our stated aim, and
suggests doing away with the added complexity of distributional value iteration.
In larger problems, however, it is difﬁcult or undesirable to wait until Q
has
been determined before learning or computing the return function, or it may
not be possible to decouple value and return predictions. In these situations it is
sensible to consider distributional value iteration. Control 213
7.4 Dynamics of Distributional Optimality Operators
In this section and the next, we analyse the behaviour of distributional optimality
operators; recall from Section 7.3 that there is one such operator for each choice
of greedy selection rule
G
, in contrast to non-distributional value iteration.
Combined with a mean-preserving projection, our analysis also informs the
behaviour of distributional value iteration. As we will see, even in the absence of
approximation due to a ﬁnite-parameter representation, distributional optimality
operators exhibit complex behaviour.
Thus far, we have analysed distributional dynamic programming algorithms
by appealing to contraction mapping theory. Demonstrating that an opera-
tor is a contraction provides a good deal of understanding about algorithms
implementing its repeated application: they converge at a geometric rate to the
operator’s ﬁxed point (when such a ﬁxed point exists). Unfortunately, distri-
butional optimality operators are not contraction mappings, as the following
shows.
Proposition 7.7.
Consider a probability metric d and let
d
be its supremum
extension. Suppose that for any z, z
0
2R,
d(
z
,
z
0
)<1.
If d is c-homogeneous, there exist a Markov decision process and two
return-distribution functions
,
0
such that for any greedy selection rule
G
and any discount factor 2(0, 1],
d(T
G
, T
G
0
)>d(,
0
) . (7.9)
4
Proof. Consider a MDP with two non-terminal states x, y and two actions a, b
(Figure 7.2). From state x all actions transition to state y and yield no reward. In
state y, action a results in a reward of
"
> 0, while action b results in a reward
that is –1 or 1 with equal probability; both actions are terminal. We will argue
that " can be chosen to make T
G
satisfy Equation 7.9.
First note that any optimal policy must choose action a in state y, and all optimal
policies share the same return-distribution function
. Consider another return-
distribution function
that is equal to
at all state-action pairs except that
(y, a)=
"
(Figure 7.2, right). This implies that the greedy selection rule must
select b in y, and hence
(T
G
)(x, a)=(T
G
)(x, b) = (b
0,
)
#
(y, b) = (b
0,
)
#
(y, b). 214 Chapter 7
T
G
⌘⌘
x, · " ±"
y, a "" "
y, b ±1 ±1 ±1
Figure 7.2
Left.
A Markov decision process for which no distributional Bellman optimality operator
is a contraction mapping.
Right.
The optimal return-distribution function
and initial
return-distribution function
used in the proof of Proposition 7.7 (expressed in terms of
their support). Given a c-homogeneous probability metric d, the proof chooses
"
so as to
make d(T
G
,
)>d(,
).
Because both actions are terminal from y, we also have that
(T
G
)(y, a)=
(y, a)
(T
G
)(y, b)=
(y, b).
Let us write
=
1
2
–1
+
1
2
1
for the reward distribution associated with (y, b). We
have
d(,
)=d
(y, a),
(y, a)
= d(
"
,
"
) , (7.10)
and
d(T
G
, T
G
)=d
(T
G
)(x, a), (T
G
)(x, a)
= d
(b
0,
)
#
(y, b), (b
0,
)
#
(y, a)
=
c
d(,
"
) , (7.11)
where the last line follows by c-homogeneity (Deﬁnition 4.22). We will show
that for sufﬁciently small " > 0 we have
d(,
"
)>
c
d(
"
,
"
),
from which the result follows by Equations 7.10 and 7.11. To this end, note that
d(
"
,
"
)="
c
d(
–1
,
1
). (7.12)
Since d is ﬁnite for any pair of Dirac deltas, we have that d(
–1
,
1
)<
1
and so
lim
"!0
d(
"
,
"
) = 0 . (7.13) Control 215
On the other hand, by the triangle inequality we have
d(,
"
)+d(
0
,
"
) d(,
0
) > 0 , (7.14)
where the second inequality follows because
and
0
are different distributions.
Again by c-homogeneity of d, we deduce that
lim
"!0
d(
0
,
"
)=0,
and hence
lim inf
"!0
d(,
"
) d(,
0
)>0.
Therefore, for " > 0 sufﬁciently small, we have
d(,
"
)>
c
d(
"
,
"
),
as required.
Proposition 7.7 establishes that for any metric d that is sufﬁciently well-behaved
to guarantee that policy evaluation operators are contraction mappings (ﬁnite on
bounded-support distributions, c-homogeneous), optimality operators are not
generally contraction mappings with respect to
d
. As a consequence we cannot
directly apply the tools of Chapter 4 to characterise distributional optimality
operators. The issue, which is implied in the proof above, is that it is possible
for distributions to have similar expectations yet differ substantially e.g. in their
variance.
With a more careful line of reasoning, we can still identify situations in which
the iterates
k+1
=
T
G
k
do converge. The most common scenario is when there
is a unique optimal policy. In this case, the analysis is simpliﬁed by the existence
of an action gap in the optimal value function.
57
Deﬁnition 7.8.
Let Q
2R
XA
. The action gap at a state x is the difference
between the highest-valued and second highest-valued actions:
GAP(Q, x) = min
Q(x, a
)–Q(x, a):a
, a 2A, a
6= a, Q(x, a
) = max
a
0
2A
Q(x, a
0
)
.
The action gap of Q is the smallest gap over all states:
GAP(Q) = min
x2X
GAP(Q, x).
By extension, the action gap for a return function is
GAP() = min
x2X
GAP(Q
, x), Q
(x, a)= E
Z(x,a)
[Z]. 4
57. Example 6.1 introduced an update rule that increases this action gap. 216 Chapter 7
Deﬁnition 7.8 is such that if two actions are optimal at a given state, then
GAP
(Q
) = 0. When
GAP
(Q
) > 0, however, there is a unique action that is
optimal at each state (and vice-versa). In this case, we can identify an iteration
K from which
G
(
k
)=
for all k
K. From that point on, any distributional
optimality operator reduces to the
evaluation operator which enjoys now-
familiar convergence guarantees.
Theorem 7.9.
Let
T
G
be the distributional Bellman optimality operator
instantiated with a greedy selection rule
G
. Suppose that there is a unique
optimal policy
and let p
2
[1,
1
]. Under Assumption 4.29(p) (well-
behaved reward distributions), for any initial return-distribution function
0
2P(R) the sequence of iterates deﬁned by
k+1
= T
G
k
.
converges to
with respect to the metric w
p
. 4
Theorem 7.9 is stated in terms of supremum p-Wasserstein distances for con-
ciseness. Following the conditions of Theorem 4.25 and under a different set
of assumptions, we can of course also establish the convergence in say, the
supremum Cramér distance.
Corollary 7.10.
Suppose that
F
is a mean-preserving projection for some
representation
F
, and that there is a unique optimal policy
. If Assump-
tion 5.22(w
p
) holds and
F
is a non-expansion in
w
p
, then under the conditions
of Theorem 7.9 the sequence
k+1
=
F
T
G
k
produced by distributional value iteration converges to the ﬁxed point
ˆ
=
F
T
ˆ
. 4
Proof of Theorem 7.9.
As there is a unique optimal policy, it must be that the
action gap of Q
is strictly greater than zero. Fix
"
=
1
2
GAP
(Q
). Following the
discussion of the previous section, we have that
Q
k+1
= TQ
k
.
Because T is a contraction in L
1
norm, we know that there exists a K
"
2N
after which
kQ
k
Q
k
1
< " 8k K
"
. (7.15) Control 217
For a ﬁxed x, let a
be the optimal action in that state. From Equation 7.15 we
deduce that for any a 6= a
,
Q
k
(x, a
) Q
(x, a
)–"
Q
(x, a)+GAP(Q
)–"
> Q
k
(x, a)+GAP(Q
)–2"
= Q
k
(x, a).
Thus the greedy action in state x after time K
"
is the optimal action for that state.
Thus G(
k
)=
for k K
"
and
k+1
= T
k
k K
"
.
We can treat
K
"
as a new initial condition
0
0
, and apply Proposition 4.30 to
conclude that
k
!
.
In Section 3.7 we introduced the categorical Q-learning algorithm in terms of a
deterministic greedy policy. Generalised to an arbitrary greedy selection rule
G
,
categorical Q-learning is deﬁned by the update
(x, a) (1 )(x, a)+
a
0
2A
G()(a
0
| x
0
)
C
(b
r,
)
#
x
0
, a
0
.
Because the categorical projection is mean-preserving, its induced state-value
function follows the update
Q
(x, a) (1 )(x, a)+
a
0
2A
G()(a
0
| x
0
)
r + Q
(x
0
, a
0
)

r+ max
a
0
2A
Q
(x
0
,a
0
)
.
Using the tools of Chapter 6, one can establish the convergence of categorical Q-
learning under certain conditions, including the assumption that there is a unique
optimal policy
. The proof essentially combines the insight that under certain
conditions the sequence of greedy policies tracked by the algorithm matches that
of a non-distributional Q-learning algorithm, and hence eventually converges
to
, at which point the algorithm is essentially performing categorical policy
evaluation of
. The actual proof is somewhat technical; we omit it here and
refer the interested reader to Rowland et al. .
7.5 Dynamics In the Presence of Multiple Optimal Policies*
In the value-based setting, it does not matter which greedy selection rule is
used to represent the optimality operator: By deﬁnition, any greedy selection
rule must be equivalent to directly maximising over Q(x,
·
). In the distributional 218 Chapter 7
setting, however, different rules usually result in different operators. As a
concrete example, compare the rule “among all actions whose expected value is
maximal, pick the one with smallest variance” to “assign equal probability to
actions whose expected value is maximal”.
Theorem 7.9 relies on the fact that, when there is a unique optimal policy
, we can identify a time after which the distributional optimality operator
behaves like a policy evaluation operator. When there are multiple optimal
policies, however, the action gap of the optimal value function Q
is zero and
the argument cannot be used. To understand why this is problematic, it is useful
to write the iterates (
k
)
k0
more explicitly in terms of the policies
k
= G(
k
):
k+1
= T
k
k
= T
k
T
k–1
k–1
= T
k
···T
0
0
. (7.16)
When the action gap is zero, the sequence of policies
k
,
k+1
,
...
may continue
to vary over time, depending on the greedy selection rule. Although all optimal
actions have the same optimal value (guaranteeing the convergence of the
expected values to Q
), they may correspond to different distributions. Thus
distributional value iteration even with a mean-preserving projection may
mix together the distribution of different optimal policies. Even if (
k
)
k0
converges, the policy it converges to may depend on initial conditions (Exercise
7.5). In the worst case, the iterates (
k
)
k0
might not even converge, as the
following example shows.
Example 7.11
(Failure to converge)
.
Consider a Markov decision process with
a single non-terminal state x and two actions, a and b (Figure 7.3, left). Action
a gives a reward of 0 and leaves the state unchanged, while action b gives a
reward of –1 or 1 with equal probability and leads to a terminal state. Note that
the expected return from taking either action is 0.
Let
2
(0, 1). We will exhibit a sequence of return function estimates (
k
)
k0
that is produced by distributional value iteration and does not have a limit. We
do so by constructing a greedy selection rule that achieves the desired behaviour.
Suppose that
0
(x, b)=
1
2
(
–1
+
1
).
For any initial parameter c
0
2R, if
0
(x, a)=
1
2
(
c
0
+
c
0
),
then by induction there is a sequence of scalars (c
k
)
k0
such that
k
(x, a)=
1
2
(
c
k
+
c
k
) 8k 2N . (7.17) Control 219
Figure 7.3
Left.
A Markov decision process in which the sequence of return function estimates
(
k
)
k0
do not converge (Example 7.11).
Right.
The return-distribution function estimate
at (x, a) as a function of k and beginning with c
0
= 1. At each k, the pair of dots indicates
the support of the distribution.
This is because
k+1
(x, a) = (b
0,
)
#
k
(x, a) or
k+1
(x, a) = (b
0,
)
#
k
(x, b), k 0.
Deﬁne the following greedy selection rule: at iteration k + 1, choose a if c
k
1
10
,
otherwise choose b. With some algebra, this leads to the recursion (k 0)
c
k+1
=
if c
k
<
1
10
,
c
k
otherwise.
Over a period of multiple iterations, the estimate
k
(x, a) exhibits cyclical
behaviour (Figure 7.3, right). 4
The example illustrates how, without additional constraints on the greedy selec-
tion rule, it is not possible to guarantee that the iterates converge. However,
one can prove a weaker result based on the fact that only optimal actions must
eventually be chosen.
Deﬁnition 7.12.
For a given Markov decision process
M
, the set of non-
stationary Markov optimal return-distribution functions is
NMO
={
¯
: ¯ =(
k
)
k0
,
k
2
MS
is optimal for M, 8k 2N}. 4
In particular, any history-dependent policy
¯
satisfying the deﬁnition above is
also optimal for M. 220 Chapter 7
Theorem 7.13.
Let
T
G
be a distributional Bellman optimality operator
instantiated with some greedy selection rule, and let p
2
[1,
1
]. Under
Assumption 4.29(p), for any initial condition
0
2P
p
(
R
)
X
the sequence
of iterates
k+1
= T
G
k
converges to the set
NMO
, in the sense that
lim
k!1
inf
2
NMO
w
p
(,
k
) = 0. 4
Proof.
Along the lines of the proof of Theorem 7.9, there must be a time K
2N
after which the greedy policies
k
, k
K are optimal. For l
2N
+
, let us construct
the history-dependent policy
¯ =
K+l–1
,
K+l–2
, ...,
K+1
,
K
,
,
, ... ,
where
is some stationary Markov optimal policy. Denote the return of this
policy from the state-action pair (x, a) by G
¯
(x, a), and its return-distribution
function by
¯
. Because
¯
is an optimal policy, we have that
¯
2
NMO
. Let
G
k
be an instantiation of
k
, for each k
2N
. Now for a ﬁxed x
2X
, a
2A
,
let (X
t
, A
t
, R
t
)
l
t=0
be the initial segment of a random trajectory generated by
following
¯
beginning with X
0
= x, A
0
= a. More precisely, for t = 1,
...
, l we
have
A
t
|(X
0:t
, A
0:t–1
, R
0:t–1
)
K+lt
( · | X
t
).
From this, we write
G
¯
(x, a)
D
=
l–1
t=0
t
R
t
+
l
G
(X
l
, A
l
).
Because
K+l
=
T
K+l–1
···T
K
K
, by inductively applying Proposition 4.11 we
also have
G
K+l
(x, a)
D
=
l–1
t=0
t
R
t
+
l
G
K
(X
l
, A
l
).
Hence
w
p
K+l
(x, a),
¯
(x, a)
= w
p
G
K+l
(x, a), G
¯
(x, a)
= w
p
l–1
t=0
t
R
t
+
l
G
K
(X
l
, A
l
),
l–1
t=0
t
R
t
+
l
G
(X
l
, A
l
)
.
Consequently,
w
p
K+l
(x, a),
¯
(x, a)
w
p
(
l
G
K
(X
l
, A
l
),
l
G
(X
l
, A
l
))
=
l
w
p
(G
K
(X
l
, A
l
), G
(X
l
, A
l
)) Control 221
l
w
p
(G
K
, G
),
following the arguments in the proof of Proposition 4.15. The result now follows
by noting that
w
p
(G
K
, G
)<
1
under our assumptions, taking the supremum
over (x, a) on the left-hand side, and taking the limit with respect to l.
Theorem 7.13 shows that, even before the effect of the projection step is taken
into account, the behaviour of distributional value iteration is in general quite
complex. When the iterates
k
are approximated (for example, because they are
estimated from samples), non-stationary Markov return functions may also be
produced by distributional value iteration even when there is a unique optimal
policy.
It may appear that the convergence issues highlighted by Example 7.11 and
Theorem 7.13 are consequences of using the wrong greedy selection rule. To
address these issues, one may be tempted to impose an ordering on policies (for
example, always prefer the action with the lowest variance, at equal expected
values). However, it is not clear how to do this in a satisfying way. One hurdle is
that, to avoid the cyclical behaviour demonstrated in Example 7.11, we would
like a greedy selection rule which is continuous with respect to its input. This
seems problematic, however, since we also need this rule to return the correct
answer when there is a unique optimal (and thus deterministic) policy (Exercise
7.7). This suggests that when learning to control with a distributional approach,
the learned return distributions may simultaneously reﬂect the random returns
from multiple policies.
7.6 Risk and Risk-Sensitive Control
Imagine being invited to interview for a desirable position at a prestigious
research institute abroad. Your plane tickets and the hotel have been booked
weeks in advance. Now the night before an early morning ﬂight, you make
arrangements for a cab to pick you up from your house and take you to the
airport. How long in advance of your plane’s actual departure do you request
the cab for? If someone tells you that, on average, a cab to your local airport
takes an hour is that sufﬁcient information to make the booking? How does
your answer change when the ﬂight is scheduled around rush hour, rather than
early morning?
Fundamentally, it is often desirable that our choices be informed by the variabil-
ity in the process that produces outcomes from these choices. In this context,
we call this variability risk. Risk may be inherent to the process, or incomplete
222 Chapter 7
knowledge about the state of the world (including any potential trafﬁc jams,
and the mechanical condition of the hired cab).
In contrast to risk-neutral behaviour, decisions that take risk into account are
called risk-sensitive. The language of distributional reinforcement learning is
particularly well-suited for this purpose, since it lets us reason about the full
spectrum of outcomes, along with their associated probabilities. The rest of
the chapter gives an overview of how one may account for risk in the decision-
making process and of the computational challenges that arise when doing so.
Rather than be exhaustive, here we take the much more modest aim of exposing
the reader to some of the major themes in risk-sensitive control and their relation
to distributional reinforcement learning; references to more extensive surveys
are provided in the bibliographical remarks.
Recall that the risk-neutral objective is to maximise the expected return from
the (possibly random) initial state X
0
:
J()=E
1
t=0
t
R
t
= E
G
(X
0
)
.
Here, we may think of the expectation as mapping the random variable G
(X
0
)
to a scalar. Risk-sensitive control is the problem that arises when we replace
this expectation by a risk measure.
Deﬁnition 7.14. A risk measure
58
is a mapping
: P
(R) ![–1, 1),
deﬁned on a subset
P
(
R
)
P
(
R
) of probability distributions. By extension,
for a random variable Z instantiating the distribution
we write
(Z)=
(
).
4
Problem 7.15
(Risk-sensitive control)
.
Given an MDP (
X
,
A
,
0
, P
X
, P
R
),
a discount factor
2
[0, 1), and a risk measure
, ﬁnd a policy
2
H
maximising
59
J
()=
1
t=0
t
R
t
. (7.18)
4
58.
More precisely, this is a static risk measure, in that it is only concerned with the return from
time t = 0. See bibliographical remarks.
59.
Typically, Problem 7.15 is formulated in terms of a risk to be minimised, which linguistically is
a more natural objective. Here, however, we consider the maximisation of J
(
) so as to keep the
presentation uniﬁed with the rest of the book. Control 223
Figure 7.4
Illustration of value-at-risk (VaR) and conditional value-at-risk (CVaR). Depicted is the
cumulative distribution function of the mixture of normal distributions
=
1
2
N
(0, 1) +
1
2
N
(4, 1). The dashed line corresponds to VaR; CVaR (
= 0.4) can be determined from
a suitable transformation of the shaded area (see Section 7.8 and Exercise 7.10).
In Problem 7.15, we assume that the distribution of the random return lies in
P
(
R
), similar to our treatment of probability metrics in Chapter 4. From a
technical perspective, subsequent examples and results should be interpreted
with this assumption in mind.
The risk measure
may take into account higher-order moments of the return
distribution, be sensitive to rare events, and even disregard the expected value
altogether. Note that according to this deﬁnition,
=
E
also corresponds to a
risk-sensitive control problem. However, we reserve the term for risk measures
that are sensitive to more than only expected values.
Example 7.16
(Mean-variance criterion)
.
Let
> 0. The variance-penalised
risk measure penalises high-variance outcomes:
MV
(Z)=E[Z]–Var(Z). 4
Example 7.17
(Entropic risk)
.
Let
> 0. Entropic risk puts more weight on
smaller-valued outcomes:
ER
(Z)=–
1
log E[e
Z
]. 4
Example 7.18
(Value-at-risk)
.
Let
2
(0, 1). The value-at-risk measure (Figure
7.4) corresponds to the
th
quantile of the return distribution:
VAR
(Z)=F
–1
Z
(). 4
224 Chapter 7
7.7 Challenges In Risk-Sensitive Control
Many convenient properties of the risk-neutral objective do not carry over to
risk-sensitive control. As a consequence, ﬁnding an optimal policy is usually
signiﬁcantly more involved. This remains true even when the risk-sensitive
objective (Equation 7.18) can be evaluated efﬁciently, for example by using
distributional dynamic programming to approximate the return-distribution func-
tion
. In this section we illustrate some of these challenges by characterising
optimal policies for the variance-constrained control problem.
The variance-constrained problem introduces risk sensitivity by forbidding poli-
cies whose return variance is too high. Given a parameter C
0, the objective
is to
maximise E
[G
(X
0
)]
subject to Var
G
(X
0
)
C .
(7.19)
Equation 7.19 can be shown to satisfy our deﬁnition of a risk-sensitive control
problem if we express it in terms of a Lagrange multiplier:
J
VC
() = min
0
E
G
(X
0
)
Var
G
(X
0
)
C
.
The variance-penalised and variance-constrained problems are related in that
they share the Pareto set
PAR
H
of possibly optimal solutions. A policy
is in the set
PAR
if we have that for all
0
2
H
,
(a) Var
G
(X
0
)
> Var
G
0
(X
0
)
=) E[G
(X
0
)] > E[G
0
(X
0
)], and
(b) Var
G
(X
0
)
= Var
G
0
(X
0
)
=) E[G
(X
0
)] E[G
0
(X
0
)].
In words, between two policies with equal variances, the one with lower expecta-
tion is never a solution to either problem. However, these problems are generally
not equivalent (Exercise 7.8).
Proposition 7.1 establishes the existence of a solution of the risk-neutral control
problem that is a) deterministic, b) stationary, and c) Markov. By contrast,
the solution to the variance-constrained problem may lack any or all of these
properties.
Example 7.19
(The optimal policy may not be deterministic)
.
Consider the
problem of choosing between two actions, a and b. Action a always yields a
reward of 0, while action b yields a reward of 0 or 1 with equal probability
(Figure 7.5a). If we seek the policy that maximises the expected return subject
to the variance constraint C =
3
/
16
, the best deterministic policy respecting the
variance constraint must choose a, for a reward of 0. On the other hand, the Control 225
(a) (b) (c)
Figure 7.5
Examples demonstrating how the optimal policy for the variance-constrained control
problem might not be (a) deterministic, (b) Markov, or (c) stationary.
policy that selects a and b with equal probability achieves an expected reward
of
1
/4 and a variance of
3
/16. 4
Example 7.20
(The optimal policy may not be Markov)
.
Consider the Markov
decision process in Figure 7.5b. Suppose that we seek a policy that maximises
the expected return from state x, now subject to the variance constraint C = 0.
Let us assume
= 1 for simplicity. Action a has no variance and is therefore
a possible solution, with zero return. Action b gives a greater expected return,
at the cost of some variance. Any policy
that depends on the state alone
and chooses b in state x must incur this variance. On the other hand, the
following history-dependent policy achieves a positive expected return without
violating the variance constraint: In state x, choose b; if the ﬁrst reward R
0
is
0 select action b in x, otherwise select action a. In all cases, the return is 1, an
improvement over the best Markov policy. 4
Example 7.21
(The optimal policy may not be stationary)
.
In general, the
optimal policy may require keeping track of time. Consider the problem of
maximising the expected return from the unique state x in Figure 7.5c, subject
to
Var
(G
(X
0
))
C, for C
1
/
4
. Exercise 7.9 asks you to show that a simple
time-dependent policy that chooses a for T
C
steps then selects b achieves an
expected return of up to
p
C
. This is possible because the variance of the return
decays at a rate of
2
, while its expected value decays at the slower rate of
.
By contrast, the best randomised stationary policy performs substantially worse
for a discount factor
close to 1 and small values of C (Figure 7.6). Intuitively,
a randomised policy must choose a with a sufﬁciently large probability to avoid
receiving the random reward early, which prevents it from selecting b quickly
beyond the threshold of T
C
time steps. 4 226 Chapter 7
Figure 7.6
Expected return as a function of the discount factor
and variance constraint C in
Example 7.21. Solid and dashed lines indicate the expected return of the best stationary
and time-dependent policies, respectively. The peculiar ziz-zag shape of the curve for
the time-varying policy arises because the time T
C
at which action b is taken must be an
integer (see Exercise 7.9).
The last two examples establish that the variance-constrained risk measure
is time-inconsistent: informally, the agent’s preference for one outcome over
another at time t may be reversed at a later time. Compared to the risk-neutral
problem, the variance-constrained problem is more challenging because the
space of policies that one needs to consider is much larger. Among other things,
the lack of an optimal policy that is Markov with respect to the state alone
also implies that the dependency on the initial distribution in Equation 7.19 is
necessary to keep things well-deﬁned. The variance-constrained objective must
be optimised for globally, considering the policy at all states at once; this is in
contrast with the risk-neutral setting, where value iteration can make overall
improvements to the policy by acting greedily with respect to the value function
at individual states (see Remark 7.1). In fact, ﬁnding an optimal policy for the
variance-constrained control problem is NP-hard [Mannor and Tsitsiklis, 2011].
7.8 Conditional Value-At-Risk*
In the previous section, we saw that solutions to the variance-constrained control
problem can take unintuitive forms, including the need to penalise better-than-
expected outcomes. One issue is that variance only coarsely measures what we
mean by “risk” in the common sense of the word. To reﬁne our meaning, we
may identify two types of risk: downside risk, involving undesirable outcomes Control 227
such as greater-than-expected losses, and upside risk, involving what we may
informally call a stroke of luck. In some situations, it is possible and useful to
separately account for these two types of risk.
To illustrate this point, we now present a distributional algorithm for optimising
conditional value-at-risk (CVaR), based on work by Bäuerle and Ott 
and Chow et al. . One beneﬁt of working with full return distributions
is that the algorithmic template we present here can be reasonably adjusted to
deal with other risk measures, including the entropic risk measure described in
Example 7.17. For conciseness, in what follows we will state without proof a
few technical facts about conditional value-at-risk which can be found in those
sources and the work of Rockafellar and Uryasev .
Conditional value-at-risk measures downside risk by focusing on the lower tail
behaviour of the return distribution, speciﬁcally the expected value of this tail.
This expected value quantiﬁes the magnitude of losses in extreme scenarios.
Let Z be a random variable with cumulative and inverse cumulative distribution
functions F
Z
and F
–1
Z
, respectively. For a parameter
2
(0, 1), the CVaR of Z is
CVAR
(Z)=
1
0
F
–1
Z
(u)du . (7.20)
When the inverse cumulative distribution F
–1
Z
is strictly increasing, the right-
hand side of Equation 7.20 is equivalent to
E[Z | Z F
–1
Z
()] . (7.21)
In this case, CVaR quantiﬁes the expected return, conditioned on the event that
this return is no greater than the return’s
th
quantile i.e., is within the
th
fraction of lowest returns.
60
In a reinforcement learning context, this leads to
the risk-sensitive objective
J
CVAR
()=CVAR
1
t=0
t
R
t
. (7.22)
In general, there may not be an optimal policy that depends on the state x alone;
however, one can show that optimality can be achieved with a deterministic,
stationary Markov policy on an augmented state that incorporates information
about the return thus far. In the context of this section, we assume that rewards
are bounded in [R
MIN
, R
MAX
]. At a high level, we optimise the CVaR objective
by
60.
In other ﬁelds, CVaR is applied to losses rather than returns, in which case it measures the
expected loss subject that this loss is above the
th percentile. For example, Equation 7.21 becomes
E[Z | Z F
–1
Z
( )], and the subsequent derivations need to be adjusted accordingly.
228 Chapter 7
(a) Deﬁning the requisite augmented state;
(b)
Performing a form of risk-sensitive value iteration on this augmented state,
using a suitable selection rule;
(c) Extracting the resulting policy.
We now explain each of these steps.
Augmented state.
Central to the algorithm and to the state augmentation pro-
cedure is a reformulation of CVaR in terms of a desired minimum return or
target b
2R
. Let [x]
+
denote the function that is 0 if x < 0, and x otherwise. For
a random variable Z and
2
(0, 1), Rockafellar and Uryasev  establish
that
CVAR
(Z) = max
b2R
b
–1
E
[b Z]
+
. (7.23)
When F
–1
Z
is strictly increasing, the maximum-achieving b for Equation 7.23
is the quantile F
–1
Z
(
). In fact, taking the derivative of the expression inside
the brackets with respect to
yields the quantile update rule (Equation 6.11;
see Exercise 7.11). The advantage of this formulation is that it is more easily
optimised in the context of a policy-dependent return. To see this, let us write
G
=
1
t=0
t
R
t
to denote the random return from the initial state X
0
, following some history-
dependent policy 2
H
. We then have that
max
2
H
J
CVAR
() = max
2
H
max
b2R
b
–1
E
[b G
]
+
= max
b2R
b
–1
min
2
H
E
[b G
]
+
. (7.24)
In words, the CVaR objective can be optimised by jointly ﬁnding an optimal
target b and a policy that minimises the expected shortfall
E
[b G
]
+
. For a
ﬁxed target b, we will see that it is possible to minimise the expected shortfall
by means of dynamic programming. By adjusting b appropriately, one then
obtains an optimal policy.
Based on Equation 7.24, let us now consider an augmented state (X
t
, B
t
), where
X
t
is as usual the current state and B
t
takes on values in
B
=[V
MIN
, V
MAX
]; we
will describe its dynamics in a moment. With this augmented state we may
consider a class of stationary Markov policies,
CVAR
, which take the form
: XB!P(A).
61
61. Since B
t
is a function of the trajectory up to time t,
CVAR
is a strict subset of
H
. Control 229
We use the variable B
t
to keep track of the amount of discounted reward that
should be obtained from X
t
onwards in order to achieve a desired minimum
return of b
0
2R
over the entire trajectory. The transition structure of the Markov
decision process over the augmented state is deﬁned by modifying the generative
equations (Section 2.3):
B
0
= b
0
A
t
|(X
0:t
, B
0:t
, A
0:t–1
, R
0:t–1
) (· | X
t
, B
t
)
B
t+1
|(X
0:t+1
, B
0:t
, A
0:t
, R
0:t
)=
B
t
R
t
;
we similarly extend the sample transition model with the variables B and B
0
.
This deﬁnition of the variables (B
t
)
t0
can be understood by noting, for example,
that a minimum return of b
0
is achieved over the whole trajectory if
1
t=1
t–1
R
t
b
0
R
0
.
If the reward R
t
is small or negative, the new target B
t+1
may of course be larger
than B
t
. Note that the value of b
0
is a parameter of the algorithm, rather than
given by the environment.
Risk-sensitive value iteration.
We next construct a method for optimising the
expected shortfall given a target b
0
2R
. Let us write
2P
(
R
)
XBA
for a
return-distribution function on the augmented state-action space, instantiated
as a return-variable function G. For ease of exposition, we will mostly work
with this latter form of the return function. As usual, we write G
for the
return-variable function associated with a policy
2
CVAR
. With this notation
in mind, we write
J
CVAR
(, b
0
) = max
b2R
b
–1
E
[b G
(X
0
, b
0
, A
0
)]
+
(7.25)
to denote the conditional value-at-risk obtained by following policy
from the
initial state (X
0
, b
0
), with A
0
(· | X
0
, b
0
).
Similar to distributional value iteration, the algorithm constructs a series of
approximations to the return-distribution function by repeatedly applying the
distributional Bellman operator with a policy derived from a greedy selection
rule
˜
G. Speciﬁcally, we write
a
G
(x, b) = arg min
a2A
E
[b G(x, b, a)]
+
(7.26) 230 Chapter 7
for the greedy action at the augmented state (x, b), breaking ties arbitrarily. The
selection rule
˜
G is itself given by
˜
G()(a | x, b)=
˜
G(G)(a | x, b)= {a =a
G
(x, b)}.
The algorithm begins by initialising
0
(x, b, a)=
0
for all x
2X
, b
2B
, and
a 2A, and iterating
k+1
= T
˜
G
k
, (7.27)
as in distributional value iteration. Expressed in terms of random-variable
functions, this is
G
k+1
(x, b, a)
D
= R + G
k
(X
0
, B
0
,a
G
k
(X
0
, B
0
)), X = x, B = b, A = a.
After k iterations, the policy
k
=
˜
G
(
k
) can be extracted according to Equation
7.26. As suggested by Equation 7.25, a suitable choice of starting state is
b
0
= arg max
b2B
b
–1
E
[b G
k
(X
0
, b,a
G
k
(X
0
, b))]
+
, (7.28)
As given, there are two hurdles to producing a tractable implementation of
Equation 7.27: in addition to the usual concern that return distributions may
need to be projected onto a ﬁnite-parameter representation, we also have to
contend with a real-valued state variable B
t
. Before discussing how this can be
addressed, we ﬁrst establish that Equation 7.27 is a sound approach to ﬁnding
an optimal policy for the CVaR objective.
Theorem 7.22.
Consider the sequence of return-distribution functions
(
k
)
k0
deﬁned by Equation 7.27. Then the greedy policy
k
=
˜
G
(
k
) is
such that for all x 2X, b 2B, and a 2A,
E
[b G
k
(x, b, a)]
+
min
2
CVAR
E
[b G
(X
0
, b, a)]
+
+
k
V
MAX
V
MIN
1–
.
Consequently, we also have that the conditional value-at-risk of this policy
satisﬁes (with b
0
given by Equation 7.28)
J
CVAR
(
k
, b
0
) max
b2B
max
2
CVAR
J
CVAR
(, b)–
k
V
MAX
V
MIN
(1 )
. 4
The proof is somewhat technical, and is provided in Remark 7.3.
Theorem 7.22 establishes that with sufﬁciently many iterations, the policy
k
is
close to optimal. Of course, when distribution approximation is introduced the
resulting policy will in general only approximately optimise the CVaR objective,
with an error term that depends on the expressivity of the probability distribu-
tion representation (i.e., the parameter m in Chapter 5). To perform dynamic
Control 231
programming with the state variable B, one may use function approximation,
the subject of the next chapter. Another solution is to consider a discrete number
of values for B and to extend the operator
T
˜
G
to operate on this discrete set
(Exercise 7.12).
7.9 Technical Remarks
Remark 7.1.
In Section 7.7 we presented some of the challenges involved with
ﬁnding an optimal policy for the variance-constrained objective. In some sense,
these challenges should not be too surprising given that that we are looking to
maximise a function J of an inﬁnite-dimensional object (a history-dependent
policy). Rather, what should be surprising is the relative ease with which one
can obtain an optimal policy in the risk-neutral setting.
From a technical perspective, this ease is a consequence of Lemma 7.3, which
guarantees that Q
(and hence
) can be efﬁciently approximated. However,
another important property of the risk-neutral setting is that the policy can be
improved locally, that is at each state simultaneously. To see this, consider a
state-action value function Q
for a given policy
, and denote by
0
a greedy
policy with regards to Q
. Then
TQ
= T
0
Q
T
Q
= Q
. (7.29)
That is, a single step of value iteration applied to the value function of a policy
results in a new value function that is at least as good as Q
at all states
the Bellman operator is said to be monotone. Because this single step also
corresponds to the value of a non-stationary policy that acts according to
0
for
one step then switches to
, we can equivalently interpret it as constructing,
one step at a time, a deterministic history-dependent policy for solving the
risk-neutral problem.
By contrast, it is not possible to use a direct dynamic programming approach
over the objective J to ﬁnd the optimal policy for an arbitrary risk-sensitive con-
trol problem. A practical alternative is to perform the optimisation instead
with an ascent procedure (e.g., a policy gradient-type algorithm). Ascent
algorithms can often be computed in closed-form, and tend to be simpler
to implement. On the other hand, convergence is typically only guaranteed to
local optima, seemingly unavoidable when the optimisation problem is known
to be computationally hard. 4
Remark 7.2.
When the projection
F
is not mean-preserving, distributional
value iteration induces a state-action value function Q
k
that is different from
the value function Q
k
determined by standard value iteration under equivalent 232 Chapter 7
initial conditions. Under certain conditions on the distributions of rewards, it is
possible to bound this difference as k
!1
. To do so, we use a standard error
bound on approximate value iteration [see e.g. Bertsekas, 2012].
Lemma 7.23. Let (Q
k
)
k0
be a sequence of iterates in R
XA
satisfying
kQ
k+1
TQ
k
k
1
"
for some " > 0, and where T is the Bellman optimality operator. Then
lim sup
k!1
kQ
k
Q
k
1
"
1–
. 4
In the context of distributional value iteration, we need to bound the difference
kQ
k+1
TQ
k
k
1
.
When the rewards are bounded on the interval [R
MIN
, R
MAX
] and the projection
step is
Q
, the w
1
-projection onto the m-quantile representation, a simple bound
follows from an intermediate result used in proving Lemma 5.30 (see Exercise
5.20). In this case, for any bounded on [V
MIN
, V
MAX
],
w
1
(
Q
, )
V
MAX
V
MIN
2m
;
Conveniently, the 1-Wasserstein distance bounds the difference of means
between any distributions ,
0
2P
1
(R):
E
Z
[Z]– E
Z
0
[Z]
w
1
(,
0
).
This follows from the dual representation of the Wasserstein distance [Villani,
2008]. Consequently, for any (x, a),
Q
k+1
(x, a)–(T
k
Q
k
)(x, a)
w
1
k+1
(x, a), (T
k
k
)(x, a)
= w
1
(
Q
T
k
k
)(x, a), (T
k
k
)(x, a)
V
MAX
V
MIN
2m
.
By taking the maximum over (x, a) on the left-hand side of the above and
combining with Lemma 7.23, we obtain
lim
k!1
kQ
k
Q
k
1
V
MAX
V
MIN
2m(1 )
. 4
Remark 7.3.
Theorem 7.22 is proven from a few facts regarding partial returns,
which we now give. Let us write
a
k
(x, b)=
a
G
k
(x, b). We deﬁne the mapping
U
k
: XBA!R as
U
k
(x, b, a)=E
[b G
k
(x, b, a)]
+
, Control 233
and for a policy 2
CVAR
similarly write
U
(x, b, a)=E
[b G
(x, b, a)]
+
.
Lemma 7.24. For any x 2X, a 2A, b 2B, and k 2N, we have
U
k+1
(x, b, a)= E
U
k
X
0
, B
0
,a
k
(X
0
, B
0
)
| X = x, B = b, A = a
.
CVAR
is a stationary Markov policy on XB, we have
U
(x, b, a)= E
U
X
0
, B
0
, A
0
| X = x, B = b, A = a
. 4
Proof.
The result follows by time-homogeneity and the Markov property. Con-
sider the sample transition model (X, B, A, R, X
0
, B
0
, A
0
), with A
0
=
a
k
(X
0
, B
0
).
Simultaneously, consider the partial trajectory (X
t
, B
t
, A
t
, R
t
)
k
t=0
for which A
0
= A
0
and A
t
kt
(· | X
t
, B
t
) for t > 0. As B
0
= B R, we have
E
U
k
X
0
, B
0
, A
0
| X = x, B = b, A = a
= E
E
B
0
G
k
(X
0
, B
0
, A
0
)
+
| X = x, B = b, A = a
= E
E
B
0
k
t=0
t
R
t
+
| X
0
= X
0
, B
0
= B
0
, A
0
= A
0
| X = x, A = a
= E
E
b R
k
t=0
t
R
t
+
| X
0
= X
0
, B
0
= B
0
, A
0
= A
0
| X = x, B = b, A = a
= E
E
b R
k+1
t=1
t
R
t
+
| X
1
= X
0
, B
1
= B
0
, A
1
= A
0
| X = x, B = b, A = a
= E
b
k+1
t=0
t
R
t
+
| X
0
= x, B
0
= b, A
0
= a
.
The second statement follows similarly.
Lemma 7.25.
Suppose that V
MIN
0 and V
MAX
0. Let (R
t
)
t0
be a sequence
of rewards in [R
MIN
, R
MAX
]. For any b 2R and k 2N,
E
[b
1
t=0
t
R
t
]
+
]+
k+1
V
MIN
E
[b
k
t=0
t
R
t
]
+
] E
[b
1
t=0
t
R
t
]
+
]+
k+1
V
MAX
.
4 234 Chapter 7
Proof. First note that, for any b, z, z
0
2R,
[b z]
+
[b z
0
]
+
+[z
0
z]
+
. (7.30)
To obtain the ﬁrst inequality in the statement, we set
[b
1
t=0
t
R
t
]
+
[b
k
t=0
t
R
t
]
+
+ [–
1
t=k+1
t
R
t
]
+
Since rewards are bounded in [R
MIN
, R
MAX
], we have that
1
t=k+1
t
R
t
k+1
R
MIN
1–
=–
k+1
V
MIN
.
As we have assumed that V
MIN
0, it follows that
[b
1
t=0
t
R
t
]
+
[b
k
t=0
t
R
t
]
+
k+1
V
MIN
.
The second inequality in the statement is obtained analogously.
Lemma 7.26.
The sequence (
k
)
k0
deﬁned by Equation 7.27 satisﬁes, for any
x 2X, b 2B, and a 2A,
U
k
(x, b, a) = min
2
CVAR
E
[b
k
t=0
t
R
t
]
+
| X = x, B = b, A = a
. (7.31)
4
Proof (sketch).
Our choice of G
0
(x, b, a) = 0 guarantees that the statement is
true for k = 0. The result then follows by Lemma 7.24, the fact that the policy
˜
G
(
k
) chooses the action minimising the left-hand side of Equation 7.31, and
by induction on k.
Proof of Theorem 7.22.
Let us assume that V
MIN
0 and V
MAX
0 so that
Lemma 7.25 can be applied. This is without loss of generality, as otherwise
we may ﬁrst construct a new sequence of rewards shifted by an appropriate
constant C, such that R
MIN
= 0, R
MAX
0; by inspection, this transformation does
not affect the statement of the Theorem 7.22.
Let
2
CVAR
be an optimal deterministic policy, in the sense that
U
(x, b, a) = min
2
CVAR
U
(x, b, a).
Combining Lemmas 7.25 and 7.26, we have
U
(x, b, a)+
k+1
V
MIN
U
k
(x, b, a) U
(x, b, a)+
k+1
V
MAX
. (7.32) Control 235
Write E
xba
[·]=E
[· | X = x, B = b, A = a]. By Lemma 7.24,
U
k
(x, b, a)–U
k
(x, b, a)
= E
xba
U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k–1
(X
0
, B
0
,a
k–1
(X
0
, B
0
))
= E
xba
U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k
(X
0
, B
0
,a
k
(X
0
, B
0
))
+ U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k–1
(X
0
, B
0
,a
k–1
(X
0
, B
0
))
E
xba
U
k
(X
0
, B
0
,a
k
(X
0
, B
0
)) U
k
(X
0
, B
0
,a
k
(X
0
, B
0
))
+
k+1
V
MAX
k
V
MIN
.
(7.33)
Now, the quantity
"
k
(x, b, a)=U
k
(x, b, a)–U
k
(x, b, a)
is bounded above and hence
"
k
= sup
x2X,b2B,a2A
"
k
(x, b, a)
exists. Taking the supremum over x, b and a on both sides of Equation 7.33, we
have
"
k
"
k
+
k+1
V
MAX
k
V
MIN
and consequently for all x, b, a,
U
k
(x, b, a)–U
k
(x, b, a)
k
(V
MAX
V
MIN
)
1–
. (7.34)
Because U
k
(x, b, a) U
(x, b, a)+
k+1
V
MAX
, then
U
k
(x, b, a) U
(x, b, a)+
k+1
V
MAX
+
k
(V
MAX
V
MIN
)
1–
.
Now,
k+1
+
k+1
1–
=
k
+ (1 )
1–
k
1–
.
Hence we conclude that also
U
k
(x, b, a) min
2
CVAR
U
(x, b, a)+
k
(V
MAX
V
MIN
)
1–
,
as desired.
For the second statement, following Equation 7.28 we have
b
0
= arg max
b2B
b
–1
E
[b G
k
(X
0
, b,a
k
(X
0
, b))]
+
. 236 Chapter 7
The algorithm begins in state (X
0
, b
0
), selects action A
0
=
a
k
(X
0
, b
0
), and then
executes policy
k
from there on; its return is G
k
(X
0
, b
0
, A
0
). In particular,
J
CVAR
(
k
, b
0
) = max
b2B
b
–1
E
[b G
k
(X
0
, b
0
, A
0
)]
+
b
0
–1
E
[b
0
G
k
(X
0
, b
0
, A
0
)]
+
. (7.35)
Write
a
(x, b) for the action selected by
in (x, b). Because Equation 7.32
holds for all x, b, and a, we have
min
a2A
U
k
(x, b, a) min
a2A
U
(x, b, a)+
k+1
V
MAX
=) U
k
(x, b,a
k
(x, b)) U
(x, b,a
(x, b)) +
k+1
V
MAX
,
since
a
k
(x, b) is the action a that minimises U
k
(x, b, a). Hence, for any state x
we have
max
b2B
b
–1
min
a2A
U
k
(x, b, a)
max
b2B
b
–1
min
a2A
U
(x, b, a)
k+1
V
MAX
=) b
0
–1
U
k
(X
0
, b
0
,a
k
(X
0
, b
0
)) max
b2B
b
–1
U
(X
0
, b,a
(x, b))
k+1
V
MAX
= max
b2B
J
CVAR
(
, b)–
k+1
V
MAX
.
Combined with Equations 7.34 and 7.35, this yields
J
CVAR
(
k
, b
0
) max
b2B
max
2
CVAR
J
CVAR
(, b)–
k
(V
MAX
V
MIN
)
(1 )
.
4
7.10 Bibliographical Remarks
7.0.
The balloon navigation example at the beginning of the chapter is from
Bellemare et al. . Sutton and Barto  separates “control problem”
from “prediction problem”; the latter ﬁgures more predominantly in this book.
In earlier literature, the control problem comes ﬁrst [see e.g. Bellman, 1957a]
and prediction is typically used as an subroutine for control [Howard, 1960].
7.1.
Time-dependent policies are common in ﬁnite-horizon scenarios, and are
studied at length by Puterman . The technical core of Proposition 7.2
involves demonstrating that any feasible value function can be attained by a
stationary Markov policy; see the results by Puterman [2014, Theorem 5.5.1],
Altman  and the discussion by Szepesvári .
Control 237
In reinforcement learning, history-dependent policies are also used to deal with
partially observable environments, in which the agent receives an observation o
at each time step rather than the identity of its state. For example, McCallum
 uses a variable-length history to represent state-action values, while
Veness et al.  uses a history-based probabilistic model to learn a model
of the environment. History-dependent policies also play a central role in the
study of optimality in the fairly large class of computable environments [Hutter,
2005].
7.2.
The canonical reference for value iteration is the book by Bellman [1957a];
see also Bellman [1957b] for an asymptotic analysis in the undiscounted set-
ting. Lemma 7.3 is standard and can be found in most reinforcement learning
textbooks [Bertsekas and Tsitsiklis, 1996, Szepesvári, 2010, Puterman, 2014].
State-action value functions were introduced along with the Q-learning algo-
rithm [Watkins, 1989] and subsequently used in the development of SARSA
[Rummery and Niranjan, 1994]. Watkins and Dayan  gives a restricted
result regarding the convergence of Q-learning, which is more thoroughly estab-
lished by Jaakkola et al. , Tsitsiklis , and Bertsekas and Tsitsiklis
.
7.3–7.5.
The expression of the optimality operator as a ﬁxed-policy operator
whose policy varies with the input is common in the analysis of control algo-
rithms [see e.g. Munos, 2003, Scherrer, 2014]. The view of value iteration as
constructing a history-dependent policy is taken by Scherrer and Lesner 
to derive more accurate value learning algorithms in the approximate setting.
The extension to distributional value iteration and Theorem 7.13 are from
[Bellemare et al., 2017a]. The correspondence between standard value iteration
and distributional value iteration with a mean-preserving projection is given by
Lyle et al. .
The notion of action gap plays an important role in understanding the rela-
tionship between value function estimates and policies, in particular when
estimates are approximate. Farahmand  gives a gap-dependent bound on
the expected return obtained by a policy derived from an approximate value
function. Bellemare et al.  derives an algorithm for increasing the action
gap so as to improve performance in the approximate setting.
An example of a selection rule that explicitly incorporates distributional infor-
mation is the lexicographical rule of Jaquette , which orders policies
according to the magnitude of their moments.
238 Chapter 7
7.6.
The notion of risk and risk-sensitive decisions can be traced back to
Markowitz , who introduced the concept of trading off expected gains and
variations in those gains in the context of constructing an investment portfolio;
see also Steinbach  for a retrospective. Artzner et al.  proposes
a collection of desirable characteristics that make a risk measure coherent
in the sense that it satisﬁes certain preference axioms. Of the risk measures
mentioned here, CVaR is coherent but the variance-constrained objective is
not. Artzner et al.  discusses coherent risk measures in the context of
sequential decisions. Ruszczy
´
nski  introduces the notion of dynamic risk
measures for Markov decision processes, which are amenable to optimisation
via Bellman-style recursions; see also Chow  for a discussion of static
and dynamic risk measures as well as time consistency. Jiang and Powell 
develop sample-based optimisation methods for dynamic risk measures based
on quantiles.
Howard and Matheson  considered the optimisation of an exponential
utility function applied to the random return by means of policy iteration. The
same objective is given a distributional treatment by Chung and Sobel .
Heger  considers optimising for worst-case returns. Haskell and Jain
 study the use of occupancy measures over augmented state spaces as an
approach for ﬁnding optimal policies for risk-sensitive control; similarly, an
occupancy measure-based approach to CVaR optimisation is studied by Carpin
et al. . Mihatsch and Neuneier  and Shen et al.  extend
Q-learning to the optimisation of recursive risk measures, where a base risk
measure is applied at each time step. Recursive risk measures are more easily
optimised than risk measures directly applied to the random return, but are not
as easily interpreted. Martin et al.  consider combining distributional
reinforcement learning with the notion of second-order stochastic dominance
as a means of action selection. Quantile criteria are considered by Filar et al.
 in the case of average-reward MDPs, and more recently by Gilbert et al.
 and Li et al. . Delage and Mannor  solves a risk-constrained
optimisation problem to handle uncertainty in a learned model’s parameters. See
Prashanth and Fu  for a survey on risk-sensitive reinforcement learning.
7.7.
Sobel  establishes that an operator constructed directly from the
variance-penalised objective does not have the monotone improvement property,
making it optimisation more challenging. The examples demonstrating the need
for randomisation and a history-dependent policy are adapted from Mannor and
Tsitsiklis , who also prove the NP-hardness of the problem of optimising
the variance-constrained objective. Tamar et al.  propose a policy gradient
algorithm for optimising a mean-variance objective and for the CVaR objective
Control 239
Ghavamzadeh  for actor-critic algorithms for these criteria. Chow et al.
 augment the state with the return-so-far in order to extend gradient-based
algorithms to a broader class of risk measures.
7.8.
The reformulation of the conditional value-at-risk (CVaR) of a random
variable in terms of the (convex) optimisation of a function of a variable b
2R
is due to Rockafellar and Uryasev ; see also Rockafellar and Uryasev
, Shapiro et al. . Bäuerle and Ott  provide an algorithm for
optimising the CVaR of the random return in Markov decision processes. Their
work forms the basis for the algorithm presented in this section, although the
treatment in terms of return-distribution functions is new here. Another closely
related algorithm is due to Chow et al. , who additionally provide an
approximation error bound on the computed CVaR. Brown et al.  apply
Rockafellar and Uryasev’s approach to design an agent that is risk-sensitive with
respect to a prior distribution over possible reward functions. Keramati et al.
 combine categorical temporal-difference learning with an exploration
bonus derived from the DKW inequality to develop an algorithm to optimise
for conditional value-at-risk.
7.11 Exercises
Exercise 7.1.
Find a counterexample that shows that the Bellman optimality
operator is not an afﬁne operator. 4
Exercise 7.2.
Consider the Markov decision process depicted in Figure 2.4a.
For which values of the discount factor
2
[0, 1) is there more than one optimal
action from state x ? Use this result to argue that the optimal policy depends on
the discount factor. 4
Exercise 7.3.
Proposition 7.7 establishes that distributional Bellman optimality
operators are not contraction mappings.
(i)
Instantiate the result with the 1-Wasserstein distance. Provide a visual
explanation for the result by drawing the relevant cumulative distribution
functions before and after the application of the operator.
(ii)
Discuss why it was necessary, in the proof of Proposition 7.7, to assume
that the probability metric d is c-homogeneous. 4
Exercise 7.4.
Suppose that there is a unique optimal policy
, as per Section
7.4. Consider the use of a projection
F
for a probability representation
F 240 Chapter 7
and the iterates
k+1
=
F
T
k
. (7.36)
Discuss under what conditions the sequence of greedy policies (
G
(
k
))
k0
converges to
when
F
is
(i) The m-categorical projection
C
;
(ii) The m-quantile projection
Q
.
on m, or on
1
, ...,
m
for the case of the categorical representation? 4
Exercise 7.5.
Give a Markov decision process for which the limit of the
sequence of iterates deﬁned by
k+1
=
T
G
k
depends on the initial condition
0
,
irrespective of the greedy selection rule
G
. Hint. Construct a scenario where the
implied policy
k
is the same for all k, but depends on
0
. 4
Exercise 7.6.
Consider the greedy selection rule that selects an action with
minimal variance amongst those with maximal expected value, breaking ties
uniformly at random. Provide an example Markov decision process in which
this rule results in a sequence of return-distribution functions that does not
converge, as per Example 7.11. Hint. Consider reward distributions of the form
1
3
3
i=1
i
. 4
Exercise 7.7
(*)
.
Consider the 1-Wasserstein distance w
1
and its supremum
extension
w
1
. In addition, let d be a metric on
P
(
A
). Suppose that we are given
a mapping
˜
G : P(R)
XA
!
MS
which is continuous at every state, in the sense that for any
"
> 0, there exists a
> 0 such that for any return functions ,
0
,
w
1
(,
0
)< =) max
x2X
d
˜
G()(· | x),
˜
G(
0
)(· | x)
< ".
Show that this mapping cannot be a greedy selection rule in the sense of
Deﬁnition 7.6. 4
Exercise 7.8.
Consider the Markov decision process depicted in Figure 7.5a.
Show that there is no 0 such that the policy maximising
J
MV
()=E
G
(X
0
)
Var
G
(X
0
)
is stochastic. This illustrates how the variance-constrained and variance-
penalised control problems are not equivalent. 4 Control 241
Exercise 7.9. Consider the Markov decision process depicted in Figure 7.5c.
(i)
Solve for the optimal stopping time T
C
maximising the return of a time-
dependent policy that selects action a for T
C
time steps, then selects action
b (under the constraint that the variance should be no greater than C).
(ii) Prove that this policy can achieve an expected return of up to
p
C.
(iii) Based on your conclusions, design a policy that improves on this strategy.
(iv)
Show that the expectation and variance of the return of a randomised
stationary policy that selects action b with probability p are given by
E
[G
(x)] =
p
2
1–(1 p)
Var
G
(x)
=
p
4
2
1–
2
(1 p)
p
(1 (1 p))
2
.
(v)
Using your favourite visualisation program, chart the returns achieved by
the optimal randomised policy and the optimal time-dependent policy, for
values of C and
different from those shown in Figure 7.5c. What do
you observe?Hint. Use a root-ﬁnding algorithm to determine the maximum
expected return of a randomised policy under the constraint
Var
G
(x)
C. 4
Exercise 7.10.
Explain the relationship between the shaded area in Figure 7.4
and conditional value-at-risk for the depicted distribution. 4
Exercise 7.11.
Following Equation 7.23, for a distribution
2P
(
R
) consider
the function
f ()=
–1
E
Z
[ Z]
+
.
Show that for 2(0, 1),
↵⌧
d
d
f ()
is equivalent to the quantile update rule (Equation 6.11), in expectation. 4
Exercise 7.12
(*)
.
Consider a uniform discretisation
B
"
of the interval
[V
MIN
, V
MAX
] into intervals of width
"
(endpoints included). For a return-
distribution function
on the discrete space
XB
"
A
, deﬁne its extension
to X[V
MIN
, V
MAX
] A
˜(x, b, a)=(x, "
b
"
, a).
242 Chapter 7
Suppose that probability distributions can be represented exactly (i.e., without
needing to resort to a ﬁnite-parameter representation). For the CVaR objective
(Equation 7.22), derive an upper bound for the suboptimality
max
2
CVAR
J()–J(˜),
where
˜
is found by the procedure of Section 7.8 applied to the discrete space
XB
"
A
and using the extension
7! ˜
to implement the operator
T
˜
G
.
4