8
Statistical Functionals
The development of distributional reinforcement learning in previous chapters
has focused on approximating the full return function with parametrised families
of distributions. In our analysis, we quantified the accuracy of an algorithm’s
estimate according to its distance from the true return-distribution function,
measured using a suitable probability metric.
Rather than try to approximate the full distribution of the return, we may instead
select specific properties of this distribution and directly estimate these proper-
ties. Implicitly, this is the approach taken when estimating the expected return.
Other common properties of interest include quantiles of the distributions, high-
probability tail bounds, and the risk-sensitive objectives described in Chapter 7.
In this chapter, we introduce the language of statistical functionals to describe
such properties.
In some cases, the statistical functional approach allows us to obtain accurate
estimates of quantities of interest in a more straightforward manner. As a
concrete example, there is a low-cost dynamic programming procedure to
determine the variance of the return distribution.⁶² By contrast, categorical and
quantile dynamic programming usually under- or overestimate this variance.
This chapter develops the framework of statistical functional dynamic program-
ming as a general method for approximately determining the values of statistical
functionals. As we demonstrate in Section 8.4, it is in fact possible to interpret
both categorical and quantile dynamic programming as operating over statistical
functionals. We will see that while some characteristics of the return (including
its variance) can be accurately estimated by an iterative procedure, in general
some care must be taken when estimating arbitrary statistical functionals.
62.
In fact, the return variance can be determined to machine precision by solving a linear system
of equations, similar to what was done in Section 5.1 for the value function.
8.1 Statistical Functionals
A functional maps functions to real values. By extension, a statistical functional
maps probability distributions to the reals. In this book, we view statistical
functionals as measuring a particular property or characteristic of a probability
distribution. For example, the mapping
$$\nu \mapsto \mathbb{P}_{Z \sim \nu}(Z \geq 0) \, , \qquad \nu \in \mathscr{P}(\mathbb{R})$$
is a statistical functional that measures how much probability mass its argument
puts on the non-negative reals. Statistical functionals express quantifiable
properties of probability distributions such as their mean and variance. The
following formalises this point.
Definition 8.1.
A statistical functional $\psi$ is a mapping from a subset of probability distributions $\mathscr{P}_\psi(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ to the reals, written
$$\psi : \mathscr{P}_\psi(\mathbb{R}) \to \mathbb{R} \, .$$
We call the particular scalar $\psi(\nu)$ associated with a probability distribution $\nu$ a functional value, and the set $\mathscr{P}_\psi(\mathbb{R})$ the domain of the functional. 4
Example 8.2.
The mean functional maps probability distributions to their expected values. As before, let
$$\mathscr{P}_1(\mathbb{R}) = \{\nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|] < \infty\}$$
be the set of distributions with finite first moment. For $\nu \in \mathscr{P}_1(\mathbb{R})$, the mean functional is
$$\mu_1(\nu) = \mathbb{E}_{Z \sim \nu}[Z] \, .$$
The restriction to $\mathscr{P}_1(\mathbb{R})$ is necessary to exclude from the definition distributions without a well-defined mean. 4
The purpose of this chapter is to study how functional values of the return
distribution can be approximated using dynamic programming procedures and
incremental algorithms. In general, we will be interested in a collection of
such functionals that exhibit desirable properties, for example because they
can be jointly determined by dynamic programming or because they provide
complementary information about the return function. We call such a collection
a distribution sketch.
Definition 8.3.
A distribution sketch (or simply sketch) $\psi : \mathscr{P}_\psi(\mathbb{R}) \to \mathbb{R}^m$ is a vector-valued function specified by a tuple $(\psi_1, \dots, \psi_m)$ of statistical functionals. Its domain is
$$\mathscr{P}_\psi(\mathbb{R}) = \bigcap_{i=1}^{m} \mathscr{P}_{\psi_i}(\mathbb{R}) \, ,$$
and it is defined as
$$\psi(\nu) = \big(\psi_1(\nu), \dots, \psi_m(\nu)\big) \, , \qquad \nu \in \mathscr{P}_\psi(\mathbb{R}) \, .$$
Its image is
$$I_\psi = \big\{\psi(\nu) : \nu \in \mathscr{P}_\psi(\mathbb{R})\big\} \subseteq \mathbb{R}^m \, .$$
We also extend this notation to return-distribution functions:
$$\psi(\eta) = \big(\psi(\eta(x)) : x \in \mathcal{X}\big) \, , \qquad \eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} \, .$$ 4
Example 8.4.
The quantile functionals are a family of statistical functionals indexed by $\tau \in (0, 1)$ and defined over $\mathscr{P}(\mathbb{R})$. The $\tau$-quantile functional is defined in terms of the inverse cumulative distribution function of its argument (Definition 4.12):
$$Q_\tau(\nu) = F_\nu^{-1}(\tau) \, .$$
A finite collection of quantile functionals (say, for $\tau_1, \dots, \tau_m \in (0, 1)$) constitutes a sketch. 4
Example 8.5.
To prove the convergence of categorical temporal-difference learning (Section 6.10) we introduced the isometry $I : \mathscr{F}_{C,m} \to \mathbb{R}^m$ defined as
$$I(\nu) = \big(F_\nu(\theta_i) : i \in \{1, \dots, m\}\big) \, , \qquad (8.1)$$
where $(\theta_i)_{i=1}^m$ is the set of locations for the categorical representation. This isometry is also a sketch in the sense of Definition 8.3. If we extend its domain to be $\mathscr{P}(\mathbb{R})$, Equation 8.1 still defines a valid sketch but it is no longer an isometry: it is not possible to recover the distribution $\nu$ from its functional values $I(\nu)$. 4
8.2 Moments
Moments are an especially important class of statistical functionals. For an integer $p \in \mathbb{N}_+$, the $p$-th moment of a distribution $\nu \in \mathscr{P}_p(\mathbb{R})$ is given by
$$\mu_p(\nu) = \mathbb{E}_{Z \sim \nu}[Z^p] \, .$$
In particular, the first moment of $\nu$ is its mean, while the variance of $\nu$ is the difference between its second moment and squared mean:
$$\mu_2(\nu) - \big(\mu_1(\nu)\big)^2 \, . \qquad (8.2)$$
Moments are ubiquitous in mathematics. They form a natural way of capturing important aspects of a probability distribution, and the infinite sequence of moments $(\mu_p(\nu))_{p=1}^{\infty}$ uniquely characterises many probability distributions of interest; see Remark 8.3.
Our goal in this section is to describe a dynamic programming approach to determining the moments of the return distribution. Fix a policy $\pi$, and consider a state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. The $p$-th moment of the return distribution $\eta^\pi(x, a)$ is given by
$$\mathbb{E}\big[\big(G^\pi(x, a)\big)^p\big] \, ,$$
where as before $G^\pi(x, a)$ is an instantiation of $\eta^\pi(x, a)$. Although we can also study dynamic programming approaches to learning the $p$-th moment of state-indexed return distributions,
$$\mathbb{E}\big[\big(G^\pi(x)\big)^p\big] \, ,$$
this is complicated by a potential conditional dependency between the reward $R$ and next state $X'$ due to the action $A$. One solution is to assume independence of $R$ and $X'$, as we did in Section 5.4. Here, however, to avoid making this assumption we work with functions indexed by state-action pairs.
To begin, let us fix $m \in \mathbb{N}_+$. The $m$-moment function $M^\pi$ is
$$M^\pi(x, a, i) = \mathbb{E}\big[\big(G^\pi(x, a)\big)^i\big] = \mu_i\big(\eta^\pi(x, a)\big) \, , \quad \text{for } i = 1, \dots, m \, . \qquad (8.3)$$
As with value functions, we view $M^\pi$ as the function (or vector) in $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ describing the collection of the first $m$ moments of the random return. In particular, $M^\pi(\cdot, \cdot, 1)$ is the usual state-action value function. As elsewhere in the book, to ensure that the expectation in Equation 8.3 is well-defined we assume that all reward distributions have finite $p$-th moments, for $p = 1, \dots, m$. In fact, it is sufficient to assume that this holds for $p = m$ (Assumption 4.29($m$)).
As with the standard Bellman equation, from the state-action random-variable Bellman equation
$$G^\pi(x, a) = R + \gamma G^\pi(X', A') \, , \qquad X = x, \, A = a$$
we can derive Bellman equations for the moments of the return distribution. To do so, we raise both sides to the $i$-th power and take expectations with respect to both the random return variables $G^\pi$ and the random transition $(X = x, A = a, R, X', A')$:
$$\mathbb{E}\big[\big(G^\pi(x, a)\big)^i\big] = \mathbb{E}\big[\big(R + \gamma G^\pi(X', A')\big)^i \,\big|\, X = x, A = a\big] \, .$$
From the binomial expansion of the term inside the expectation, we obtain
$$\mathbb{E}\big[\big(G^\pi(x, a)\big)^i\big] = \mathbb{E}\bigg[\sum_{j=0}^{i} \gamma^{i-j} \binom{i}{j} R^j \big(G^\pi(X', A')\big)^{i-j} \,\bigg|\, X = x, A = a\bigg] \, .$$
Since $R$ and $G^\pi(X', A')$ are independent given $X$ and $A$, we can rewrite the above as
$$M^\pi(x, a, i) = \sum_{j=0}^{i} \gamma^{i-j} \binom{i}{j} \mathbb{E}[R^j \mid X = x, A = a] \, \mathbb{E}\big[M^\pi(X', A', i-j) \mid X = x, A = a\big] \, ,$$
where by convention we take $M^\pi(x', a', 0) = 1$ for all $x' \in \mathcal{X}$ and $a' \in \mathcal{A}$. This is a recursive characterisation of the $i$-th moment of a return distribution, analogous to the familiar Bellman equation for the mean. The recursion is cast into the familiar framework of operators with the following definition.
Definition 8.6.
Let $m \in \mathbb{N}_+$. The $m$-moment Bellman operator $T_\pi^{(m)} : \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m} \to \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ is given by
$$(T_\pi^{(m)} M)(x, a, i) = \sum_{j=0}^{i} \gamma^{i-j} \binom{i}{j} \mathbb{E}[R^j \mid X = x, A = a] \, \mathbb{E}\big[M(X', A', i-j) \mid X = x, A = a\big] \, . \qquad (8.4)$$
4
The collection of moments $\big(M^\pi(x, a, i) : (x, a) \in \mathcal{X} \times \mathcal{A}, \, i = 1, \dots, m\big)$ is a fixed point of the operator $T_\pi^{(m)}$. In general, the $m$-moment Bellman operator is not a contraction mapping with respect to the $L^\infty$ metric (except, of course, for $m = 1$; see Exercise 8.1). However, with a more nuanced analysis, we can still show that $T_\pi^{(m)}$ has a unique fixed point to which the iterates
$$M_{k+1} = T_\pi^{(m)} M_k \qquad (8.5)$$
converge.
Proposition 8.7.
Let $m \in \mathbb{N}_+$. Under Assumption 4.29($m$), $M^\pi$ is the unique fixed point of $T_\pi^{(m)}$. In addition, for any initial condition $M_0 \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, the iterates of Equation 8.5 converge to $M^\pi$. 4
Proof.
We begin by constructing a suitable notion of distance between $m$-moment functions in $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$. For $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, let
$$\|M\|_{\infty, i} = \sup_{(x, a) \in \mathcal{X} \times \mathcal{A}} |M(x, a, i)| \, , \quad \text{for } i = 1, \dots, m$$
$$\|M\|_{\infty, <i} = \sup_{j = 1, \dots, i-1} \|M\|_{\infty, j} \, , \quad \text{for } i = 2, \dots, m \, .$$
Each of $\|\cdot\|_{\infty, i}$ (for $i = 1, \dots, m$) and $\|\cdot\|_{\infty, <i}$ (for $i = 2, \dots, m$) is a semi-norm; they fulfil the requirements of a norm, except that neither $\|M\|_{\infty, i} = 0$ nor $\|M\|_{\infty, <i} = 0$ implies that $M = 0$. From these semi-norms, we construct the pseudo-metrics
$$(M, M') \mapsto \|M - M'\|_{\infty, i} \, ,$$
noting that it is possible for the distance between $M$ and $M'$ to be zero even when $M$ is different from $M'$.
The structure of the proof is to argue that $T_\pi^{(m)}$ is a contraction with modulus $\gamma$ with respect to $\|\cdot\|_{\infty, 1}$, and then to show inductively that it satisfies an inequality of the form
$$\|T_\pi^{(m)} M - T_\pi^{(m)} M'\|_{\infty, i} \le C_i \|M - M'\|_{\infty, <i} + \gamma^i \|M - M'\|_{\infty, i} \, , \qquad (8.6)$$
for each $i = 2, \dots, m$, and some constant $C_i$ that depends on $P_R$. Chaining these results together then leads to the convergence statement, and uniqueness follows as an immediate corollary.
To see that $T_\pi^{(m)}$ is a contraction with respect to $\|\cdot\|_{\infty, 1}$, let $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, and write $M^{(i)} = (M(x, a, i) : (x, a) \in \mathcal{X} \times \mathcal{A})$ for the function in $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ corresponding to the $i$-th moment function estimates given by $M$. By inspecting Equation 8.4 with $i = 1$, it follows that
$$\big(T_\pi^{(m)} M\big)^{(1)} = T^\pi M^{(1)} \, ,$$
where $T^\pi$ is the usual Bellman operator. Furthermore, $\|M\|_{\infty, 1} = \|M^{(1)}\|_\infty$, and so the statement that $T_\pi^{(m)}$ is a contraction with respect to the pseudo-metric implied by $\|\cdot\|_{\infty, 1}$ is equivalent to the contractivity of $T^\pi$ on $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ with respect to the $L^\infty$ norm, which was shown in Proposition 4.4.
To see that $T_\pi^{(m)}$ satisfies the bound of Equation 8.6 for $i > 1$, let $L \in \mathbb{R}$ be such that
$$\big|\mathbb{E}[R^i \mid X = x, A = a]\big| \le L \, , \quad \text{for all } (x, a) \in \mathcal{X} \times \mathcal{A} \text{ and } i = 1, \dots, m \, .$$
Observe that
$$\begin{aligned}
&\big|(T_\pi^{(m)} M)(x, a, i) - (T_\pi^{(m)} M')(x, a, i)\big| \\
&= \bigg| \sum_{j=0}^{i-1} \gamma^{i-j} \binom{i}{j} \mathbb{E}\big[R^j \mid X = x, A = a\big] \sum_{x' \in \mathcal{X}} \sum_{a' \in \mathcal{A}} P_{\mathcal{X}}(x' \mid x, a) \, \pi(a' \mid x') \, (M - M')(x', a', i-j) \bigg| \\
&\le \sum_{j=1}^{i-1} \gamma^{i-j} \binom{i}{j} \big|\mathbb{E}\big[R^j \mid X = x, A = a\big]\big| \times \|M - M'\|_{\infty, <i} + \gamma^i \|M - M'\|_{\infty, i} \\
&\le L \sum_{j=1}^{i-1} \gamma^{i-j} \binom{i}{j} \|M - M'\|_{\infty, <i} + \gamma^i \|M - M'\|_{\infty, i} \\
&\le (2^i - 2) L \|M - M'\|_{\infty, <i} + \gamma^i \|M - M'\|_{\infty, i} \, .
\end{aligned}$$
Taking $C_i = (2^i - 2)L$, we have
$$\|T_\pi^{(m)} M - T_\pi^{(m)} M'\|_{\infty, i} \le C_i \|M - M'\|_{\infty, <i} + \gamma^i \|M - M'\|_{\infty, i} \, , \quad \text{for } i = 2, \dots, m \, .$$
To chain these results together, first observe that
$$\|M_k - M^\pi\|_{\infty, 1} \to 0 \, .$$
We next argue inductively that if, for a given $i < m$, $(M_k)_{k \ge 0}$ converges to $M^\pi$ in the pseudo-metric induced by $\|\cdot\|_{\infty, <i}$, then also $\|M_k - M^\pi\|_{\infty, i} \to 0$, and hence $\|M_k - M^\pi\|_{\infty, <(i+1)} \to 0$.
Let $y_k = \|M_k - M^\pi\|_{\infty, <i}$ and $z_k = \|M_k - M^\pi\|_{\infty, i}$. Then the generalised contraction result states that $z_{k+1} \le C_i y_k + \gamma^i z_k$. Taking the limit superior on both sides yields
$$\limsup_{k \to \infty} z_k \le \limsup_{k \to \infty} \big( C_i y_k + \gamma^i z_k \big) = \gamma^i \limsup_{k \to \infty} z_k \, ,$$
where we have used the result $y_k \to 0$. From this, we deduce
$$\limsup_{k \to \infty} z_k \le 0 \, ,$$
but since $(z_k)_{k \ge 0}$ is a non-negative sequence, we therefore have $z_k \to 0$. This completes the inductive step, and we therefore obtain $\|M_k - M^\pi\|_{\infty, i} \to 0$, as required.
In essence, Proposition 8.7 establishes that the $m$-moment Bellman operator behaves in a similar fashion to the usual Bellman operator, in the sense that its iterates converge to the fixed point $M^\pi$. From here, we may follow the derivations of Chapter 5 to construct a dynamic programming algorithm for learning these moments;⁶³ or those of Chapter 6 to construct the corresponding incremental algorithm (Section 8.8). Although the proof above does not demonstrate the contractive nature of the moment Bellman operator, for $m = 2$ this can be achieved using a different norm and analysis technique (Exercise 8.4).
63.
When the reward distributions take on a finite number of values, in particular, the expectations
of Definition 8.6 can be implemented as sums.
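As a concrete illustration, the following is a minimal Python sketch of one application of the $m$-moment Bellman operator of Definition 8.6, for a small tabular MDP whose reward distributions are finitely supported (so that the expectations are sums, as in the footnote above). The array layout and names are illustrative assumptions, not the book's reference code.

```python
import numpy as np
from math import comb

def moment_bellman_operator(M, reward_probs, reward_values, transition, policy, gamma):
    """One application of the m-moment Bellman operator (Equation 8.4).

    M:             array [X, A, m + 1]; by convention M[..., 0] == 1.
    reward_probs:  array [X, A, K], probabilities of the K possible reward values.
    reward_values: array [K], the finitely-many possible reward values.
    transition:    array [X, A, X], next-state probabilities.
    policy:        array [X, A], action probabilities of the policy pi.
    """
    m = M.shape[-1] - 1
    # E[R^j | x, a] for j = 0, ..., m (implemented as finite sums).
    reward_moments = np.stack(
        [np.sum(reward_probs * reward_values ** j, axis=-1) for j in range(m + 1)],
        axis=-1)
    # E[M(X', A', i) | x, a] for each i, averaging over next states and actions.
    next_M = np.einsum('xay,yb,ybi->xai', transition, policy, M)
    new_M = np.zeros_like(M)
    new_M[..., 0] = 1.0
    for i in range(1, m + 1):
        for j in range(i + 1):
            new_M[..., i] += (gamma ** (i - j)) * comb(i, j) \
                * reward_moments[..., j] * next_M[..., i - j]
    return new_M
```

Iterating `M = moment_bellman_operator(M, ...)` from an arbitrary initial condition converges to the moment function $M^\pi$, as guaranteed by Proposition 8.7.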
Figure 8.1
A sketch $\psi$ is Bellman closed if there is an operator $\mathcal{T}_\psi$ such that in the diagram above, the composite functions $\psi \circ \mathcal{T}^\pi$ and $\mathcal{T}_\psi \circ \psi$ coincide.
8.3 Bellman Closedness
In preceding chapters, our approach to distributional reinforcement learning considered approximations of the return distributions that could be tractably manipulated by algorithms. The $m$-moment Bellman operator, on the other hand, is not directly applied to probability distributions: compared to, say, an $m$-categorical distribution, there is no immediate procedure for drawing a sample from a collection of $m$ moments. Compared to the categorical and quantile projected operators, however, the $m$-moment operator yields an error-free dynamic programming procedure: with sufficiently many iterations and under some finiteness assumptions, we can determine the moments of the return function to any degree of accuracy. The concept of Bellman closedness formalises this idea.
Definition 8.8.
A sketch $\psi = (\psi_1, \dots, \psi_m)$ is Bellman closed if, whenever its domain $\mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$ is closed under the distributional Bellman operator:
$$\eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} \implies \mathcal{T}^\pi \eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} \, ,$$
there is an operator $\mathcal{T}_\psi : I_\psi^{\mathcal{X}} \to I_\psi^{\mathcal{X}}$ such that
$$\psi(\mathcal{T}^\pi \eta) = \mathcal{T}_\psi \, \psi(\eta) \quad \text{for all } \eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} \, .$$
The operator $\mathcal{T}_\psi$ is said to be the Bellman operator for the sketch $\psi$. 4
As was demonstrated in the preceding section, the collection of the first $m$ moments $\mu_1, \dots, \mu_m$ is a Bellman-closed sketch. Its associated operator is the $m$-moment operator $T_\pi^{(m)}$.
When a sketch $\psi$ is Bellman closed, the operator $\mathcal{T}_\psi$ mirrors the application of the distributional Bellman operator to the return-distribution function $\eta$; see Figure 8.1. The concept of Bellman closedness is related to that of a diffusion-free projection (Chapter 5), and we will in fact establish an equivalence between the two in Section 8.4. In addition, Bellman-closed sketches are particularly interesting from a computational perspective because they support an exact dynamic programming procedure, as the following establishes.
Proposition 8.9.
Let $\psi = (\psi_1, \dots, \psi_m)$ be a Bellman-closed sketch and suppose that $\mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$ is closed under $\mathcal{T}^\pi$. Then for any initial condition $\eta_0 \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$, and sequences $(\eta_k)_{k \ge 0}$, $(s_k)_{k \ge 0}$ defined by
$$\eta_{k+1} = \mathcal{T}^\pi \eta_k \, , \qquad s_0 = \psi(\eta_0) \, , \qquad s_{k+1} = \mathcal{T}_\psi s_k \, ,$$
we have, for $k \ge 0$,
$$s_k = \psi(\eta_k) \, .$$
In addition, the functional values $s^\pi = \psi(\eta^\pi)$ of the return-distribution function are a fixed point of the operator $\mathcal{T}_\psi$. 4
Proof.
Both parts of the result follow immediately from the definition of the operator $\mathcal{T}_\psi$. First suppose that $s_k = \psi(\eta_k)$, for some $k \ge 0$. Then note that
$$s_{k+1} = \mathcal{T}_\psi s_k = \mathcal{T}_\psi \psi(\eta_k) = \psi(\mathcal{T}^\pi \eta_k) = \psi(\eta_{k+1}) \, .$$
Thus, by induction the first statement is proven. For the second statement, we have
$$s^\pi = \psi(\eta^\pi) = \psi(\mathcal{T}^\pi \eta^\pi) = \mathcal{T}_\psi \psi(\eta^\pi) = \mathcal{T}_\psi s^\pi \, .$$
Of course, dynamic programming is only feasible if the operator $\mathcal{T}_\psi$ can itself be implemented in a computationally tractable manner. In the case of the $m$-moment operator, we know this is possible under similar assumptions as were made in Chapter 5.
Proposition 8.9 illustrates how, when the sketch $\psi$ is Bellman closed, we can do away with probability distributions and work exclusively with functional values. However, many sketches of interest fail to be Bellman closed, as the following examples demonstrate.
Example 8.10 (The median functional).
A median of a distribution $\nu$ is its 0.5-quantile $F_\nu^{-1}(0.5)$.⁶⁴ Perhaps surprisingly, there is in general no way to determine the median of a return distribution based solely on the medians at the successor states. To see this, consider a state $x$ that leads to state $y_1$ with probability 1/3 and to state $y_2$ with probability 2/3, with zero reward.
64. As usual, there might be multiple values of $z$ for which $\mathbb{P}_{Z \sim \nu}(Z \le z) = 0.5$; recall that $F_\nu^{-1}$ takes the smallest such value.
Figure 8.2
Illustration of Example 8.10. (a) A Markov decision process in which state $x$ leads to states $y_1$ and $y_2$ with probability 1/3 and 2/3 respectively. (b) Case 1, in which the median of $\eta^\pi(x)$ matches the median of $\eta^\pi(y_2)$. (c) Case 2, in which the median of $\eta^\pi(x)$ differs from the median of $\eta^\pi(y_2)$. [Panels (b) and (c) plot the cumulative probability of $\eta^\pi(x)$, $\eta^\pi(y_1)$, and $\eta^\pi(y_2)$ against the return.]
The following are two scenarios in which the median returns at $y_1$ and $y_2$ are the same, but the median at $x$ is different (see Figure 8.2):
Case 1. The return distributions at $y_1$ and $y_2$ are Dirac deltas at 0 and 1, respectively, and these are also the medians of these distributions. The median at $x$ is also 1.
Case 2. The return distributions at $y_1$ and $y_2$ are a Dirac delta at 0 and the uniform distribution on [0.5, 1.5], respectively, and have the same medians as in Case 1. However, the median at $x$ is now 0.75. 4
Example 8.11 (At-least functionals).
For $\nu \in \mathscr{P}(\mathbb{R})$ and $z \in \mathbb{R}$, let us define the at-least functional
$$\psi_{\ge z}(\nu) = \mathbb{1}\big\{\mathbb{P}_{Z \sim \nu}(Z \ge z) > 0\big\} \, ,$$
measuring whether $\nu$ assigns positive probability to values in $[z, \infty)$. Now consider a state $x$ which deterministically leads to $y$, with no reward, and suppose that there is a single action $a$ available. The statement "it is possible to obtain a return of at least 10 at state $y$" corresponds to
$$\psi_{\ge 10}\big(\eta^\pi(y)\big) = 1 \, . \qquad (8.7)$$
If Equation 8.7 holds, can we deduce whether or not a return of at least 10 is possible at state $x$? The answer is no. Suppose that $\gamma = 0.9$, and consider the following two situations:
Case 1. $\eta^\pi(y) = \delta_{10}$. Then $\psi_{\ge 10}(\eta^\pi(y)) = 1$, $\eta^\pi(x) = \delta_9$ and $\psi_{\ge 10}(\eta^\pi(x)) = 0$.
Case 2. $\eta^\pi(y) = \delta_{20}$. Then $\psi_{\ge 10}(\eta^\pi(y)) = 1$ still, but $\eta^\pi(x) = \delta_{18}$ and $\psi_{\ge 10}(\eta^\pi(x)) = 1$. 4
What goes wrong in the examples above is that we do not have sufficient information about the return distribution at the successor states to compute the functional values for the return distribution of state $x$. Consequently, we cannot use an iterative procedure to determine the functional values of $\eta^\pi$, at least not without error.
As it turns out, $m$-moment sketches are somewhat special in being Bellman closed. As the following theorem establishes, any sketch whose functionals are expectations of functions must encode the same information as a moment sketch.
Theorem 8.12.
Let $\psi = (\psi_1, \dots, \psi_m)$ be a sketch. Suppose that $\psi$ is Bellman closed and that for each $i = 1, \dots, m$, there is a function $f_i : \mathbb{R} \to \mathbb{R}$ for which
$$\psi_i(\nu) = \mathbb{E}_{Z \sim \nu}[f_i(Z)] \, .$$
Then $\psi$ is equivalent to the first $n$ moment functionals, for some $n \le m$, in the sense that there are real-valued coefficients $(b_{ij})$ and $(c_{ij})$ such that for any $\nu \in \mathscr{P}_\psi(\mathbb{R}) \cap \mathscr{P}_m(\mathbb{R})$,
$$\psi_i(\nu) = \sum_{j=1}^{n} b_{ij} \mu_j(\nu) + b_{i0} \, , \quad i = 1, \dots, m \, ;$$
$$\mu_j(\nu) = \sum_{i=1}^{m} c_{ij} \psi_i(\nu) + c_{0j} \, , \quad j = 1, \dots, n \, .$$ 4
The proof is somewhat lengthy and is given in Remark 8.2 at the end of the chapter.
As a corollary, we may deduce that any sketch that can be expressed as an invertible function of the first $m$ moments is also Bellman closed. More precisely, if $\psi'$ is a sketch that is an invertible transformation of the sketch $\psi$ corresponding to the first $m$ moments, say $\psi' = h \circ \psi$, then $\psi'$ is Bellman closed with corresponding Bellman operator $h \circ \mathcal{T}_\psi \circ h^{-1}$. Thus, for example, we may deduce that the sketch corresponding to the mean and variance functionals is Bellman closed, since the mean and variance are expressible as an invertible function of the mean and uncentered second moment. On the other hand, many other statistical functionals (including quantile functionals) are not covered by Theorem 8.12. In the latter case, this is because there is no function $f : \mathbb{R} \to \mathbb{R}$ whose expectation for an arbitrary distribution $\nu$ recovers the $\tau$-th quantile of $\nu$ (Exercise 8.5). Still, as established in Example 8.10, quantile sketches are not Bellman closed.
8.4 Statistical Functional Dynamic Programming
When a sketch $\psi$ is not Bellman closed, we lack an operator $\mathcal{T}_\psi$ that emulates the combination of the distributional Bellman operator and this sketch. This precludes a dynamic programming approach that bootstraps its functional value estimates directly from the previous estimates. However, approximate dynamic programming with arbitrary statistical functionals is still possible if we introduce an additional imputation step that reconstructs plausible probability distributions from functional values. As we will now see, this allows us to apply the distributional Bellman operator to the reconstructed distributions and then extract the functional values of the resulting return function estimate.
Definition 8.13.
An imputation strategy for the sketch $\psi : \mathscr{P}_\psi(\mathbb{R}) \to \mathbb{R}^m$ is a function $\iota : I_\psi \to \mathscr{P}_\psi(\mathbb{R})$. We say that it is exact if for any valid functional values $(s_1, \dots, s_m) \in I_\psi$, we have
$$\psi_i\big(\iota(s_1, \dots, s_m)\big) = s_i \, , \quad i = 1, \dots, m \, .$$
Otherwise, we say that it is approximate.
By extension, we write $\iota(s) \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$ for the return-distribution function corresponding to the collection of functional values $s \in I_\psi^{\mathcal{X}}$. 4
In other words, if $\iota$ is an exact imputation strategy for the sketch $\psi = (\psi_1, \dots, \psi_m)$, then for any valid values $s_1, \dots, s_m$ of the functionals $\psi_1, \dots, \psi_m$, we have that $\iota(s_1, \dots, s_m)$ is a probability distribution with the required values under each functional. In a certain sense, $\iota$ is a pseudo-inverse to the vector-valued map $\psi : \nu \mapsto (\psi_1(\nu), \dots, \psi_m(\nu))$. Note that a true inverse to $\psi$ does not exist, as $\psi$ generally does not capture all aspects of the distribution $\nu$.
Once an imputation strategy has been selected, it is possible to write down an approximate dynamic programming algorithm for the functional values under consideration. An abstract framework is given in Algorithm 8.1. In effect, such an algorithm recursively computes the iterates
$$s_{k+1} = \psi\big(\mathcal{T}^\pi \iota(s_k)\big) \qquad (8.8)$$
from an initial $s_0 \in I_\psi^{\mathcal{X}}$. Procedures that implement the iterative process described by Equation 8.8 are referred to as statistical functional dynamic programming (SFDP) algorithms. When the sketch $\psi$ is Bellman closed and its imputation strategy $\iota$ is exact, the sequence of iterates $(s_k)_{k \ge 0}$ converges to $\psi(\eta^\pi)$.
Algorithm 8.1: Statistical functional dynamic programming
Algorithm parameters: statistical functionals $\psi_1, \dots, \psi_m$,
   imputation strategy $\iota$,
   initial functional values $\big((s_i(x))_{i=1}^m : x \in \mathcal{X}\big)$,
   desired number of iterations $K$
for $k = 1, \dots, K$ do
   ▷ Impute distributions
   $\eta \leftarrow \big(\iota(s_1(x), \dots, s_m(x)) : x \in \mathcal{X}\big)$
   ▷ Apply distributional Bellman operator
   $\tilde{\eta} \leftarrow \mathcal{T}^\pi \eta$
   foreach state $x \in \mathcal{X}$ do
      for $i = 1, \dots, m$ do
         ▷ Update statistical functional values
         $s_i(x) \leftarrow \psi_i(\tilde{\eta}(x))$
      end for
   end foreach
end for
return $\big((s_i(x))_{i=1}^m : x \in \mathcal{X}\big)$
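To make the abstract framework concrete, here is a minimal Python sketch of the loop in Algorithm 8.1, for a small Markov reward process with deterministic per-state rewards. Distributions are represented as (values, probabilities) arrays, and the sketch and imputation strategy are supplied as functions; all names and the simplified transition model are illustrative assumptions rather than part of the text.

```python
import numpy as np

def sfdp(psi, iota, s0, rewards, transition, gamma, num_iterations):
    """Statistical functional dynamic programming (Algorithm 8.1).

    psi:        maps a distribution (values, probs) to an array of m functional values.
    iota:       imputation strategy mapping m functional values to (values, probs).
    s0:         array [X, m] of initial functional values.
    rewards:    array [X] of deterministic per-state rewards.
    transition: array [X, X] of next-state probabilities.
    """
    num_states = len(rewards)
    s = np.array(s0, dtype=float)
    for _ in range(num_iterations):
        # Impute a distribution at every state from the current functional values.
        imputed = [iota(s[x]) for x in range(num_states)]
        new_s = np.zeros_like(s)
        for x in range(num_states):
            # Distributional Bellman backup: mixture over successor states of the
            # pushforward distributions of r(x) + gamma * Z'.
            values = np.concatenate(
                [rewards[x] + gamma * imputed[y][0] for y in range(num_states)])
            probs = np.concatenate(
                [transition[x, y] * imputed[y][1] for y in range(num_states)])
            # Extract the functional values of the backed-up distribution.
            new_s[x] = psi((values, probs))
        s = new_s
    return s
```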
Example 8.14.
For the quantile functionals $(Q_{\tau_i})_{i=1}^m$ with $\tau_i = \frac{2i-1}{2m}$ for $i = 1, \dots, m$, an exact imputation strategy is
$$(q_1, \dots, q_m) \mapsto \frac{1}{m} \sum_{i=1}^m \delta_{q_i} \, . \qquad (8.9)$$
This follows because the $\frac{2i-1}{2m}$-quantile of $\frac{1}{m} \sum_{i=1}^m \delta_{q_i}$ is precisely $q_i$.
Note that when $\tau_1, \dots, \tau_m \in (0, 1)$ are arbitrary levels with quantile values $(q_1, \dots, q_m)$, however, it is generally not true that Equation 8.9 is an exact imputation strategy for the corresponding quantile functionals. 4
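As a sketch of how Example 8.14 plugs into the generic routine above (using the same illustrative (values, probabilities) representation of distributions; these helper names are assumptions, not the book's code), the quantile functionals at levels $\tau_i = (2i-1)/(2m)$ and the imputation strategy of Equation 8.9 might be written as follows.

```python
import numpy as np

def quantile_sketch(dist, m):
    """Quantile functional values Q_{tau_i}(nu) at levels tau_i = (2i - 1) / (2m)."""
    values, probs = dist
    order = np.argsort(values)
    values, probs = values[order], probs[order]
    cdf = np.cumsum(probs)
    taus = (2.0 * np.arange(1, m + 1) - 1.0) / (2.0 * m)
    # F_nu^{-1}(tau): the smallest value whose cumulative probability reaches tau.
    return np.array([values[np.searchsorted(cdf, tau)] for tau in taus])

def quantile_imputation(q):
    """Exact imputation strategy of Equation 8.9: a uniform mixture of Dirac deltas."""
    q = np.asarray(q, dtype=float)
    return q, np.full(len(q), 1.0 / len(q))
```

With `psi = lambda dist: quantile_sketch(dist, m)` and `iota = quantile_imputation`, the `sfdp` routine above carries out quantile dynamic programming.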
Example 8.15.
Categorical dynamic programming can be interpreted as an SFDP algorithm. Indeed, the parameters $p_1, \dots, p_m$ found by the categorical projection correspond to the values of the following statistical functionals:
$$\psi^C_i(\nu) = \mathbb{E}_{Z \sim \nu}\big[h_i\big(\varsigma_m^{-1}(Z - \theta_i)\big)\big] \, , \quad i = 1, \dots, m \qquad (8.10)$$
where $(h_i)_{i=1}^m$ are the triangular and half-triangular kernels defining the categorical projection on $(\theta_i)_{i=1}^m$ (Section 5.6). An exact imputation strategy in this case is the function that returns the unique distribution supported on $(\theta_i)_{i=1}^m$ that matches the estimated functional values $p_i = \psi^C_i(\nu)$, $i = 1, \dots, m$:
$$(p_1, \dots, p_m) \mapsto \sum_{i=1}^m p_i \delta_{\theta_i} \, .$$
4
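In the same illustrative representation, the categorical functionals of Equation 8.10 can be sketched as follows, assuming evenly spaced locations $\theta_1 < \cdots < \theta_m$ with gap $\varsigma_m$; this is a minimal version of the categorical projection of Section 5.6, not a definitive implementation.

```python
import numpy as np

def categorical_functionals(dist, locations):
    """Values of the categorical functionals psi_i^C of Equation 8.10 for a
    discrete distribution, via the (half-)triangular kernels of Section 5.6."""
    values, probs = dist
    theta = np.asarray(locations, dtype=float)
    stride = theta[1] - theta[0]  # the gap varsigma_m, assuming even spacing
    # Clipping to the support implements the half-triangular kernels at the ends.
    z = np.clip(values, theta[0], theta[-1])
    weights = np.clip(1.0 - np.abs(z[None, :] - theta[:, None]) / stride, 0.0, 1.0)
    return weights @ probs
```

Paired with the exact imputation strategy $(p_1, \dots, p_m) \mapsto \sum_i p_i \delta_{\theta_i}$, this recovers categorical dynamic programming as an SFDP algorithm.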
Mathematically, an exact imputation strategy always exists, because we defined
imputation strategies in terms of valid functional values. However, there is no
guarantee that an efficient algorithm exists to compute the application of this
strategy to arbitrary functional values. In practice we may favour approximate
strategies with efficient implementations. For example, we may map functional
values to probability distributions from a representation $\mathscr{F}$ by optimising some notion of distance between functional values. The optimisation process may not yield an exact match in $\mathscr{F}$ (one may not even exist), but can often be performed efficiently.
Example 8.16.
Let $\psi^C_1, \dots, \psi^C_m$ be the categorical functionals from Equation 8.10. Suppose we are given the corresponding functional values $p_1, \dots, p_m$ of a probability distribution $\nu$:
$$p_i = \psi^C_i(\nu) \, , \quad i = 1, \dots, m \, .$$
An approximate imputation strategy for these functionals is to find the $n$-quantile distribution ($n$ possibly different from $m$)
$$\hat{\nu} = \frac{1}{n} \sum_{i=1}^n \delta_{\theta_i}$$
that best fits $p_i$ according to the loss
$$L(\hat{\nu}) = \sum_{i=1}^m \big(p_i - \psi^C_i(\hat{\nu})\big)^2 \, . \qquad (8.11)$$
Exercise 8.7 asks you to demonstrate that this strategy is approximate for $m > 2$. Although in this context we know of an exact imputation strategy based on categorical distributions, this illustrates that it is possible to impute distributions from a different representation. 4
8.5 Relationship With Distributional Dynamic Programming
In Chapter 5 we introduced distributional dynamic programming (DDP) as a class of methods that operates over return-distribution functions. In fact, every statistical functional dynamic programming algorithm is also a DDP algorithm (but not the other way around; see Exercise 8.8). This relationship is established by considering the implied representation
$$\mathscr{F} = \{\iota(s) : s \in I_\psi\} \subseteq \mathscr{P}(\mathbb{R})$$
and the projection $\Pi_{\mathscr{F}} = \iota \circ \psi$ (see Figure 8.3).
Figure 8.3
The interpretation of SFDP algorithms as distributional dynamic programming algorithms. Traversing along the diagram from $\eta$ to $\eta'$ corresponds to dynamic programming implementing a projected Bellman operator, while the path from $s$ to $s'$ corresponds to statistical functional dynamic programming (SFDP).
From this correspondence, we may establish the relationship between Bellman closedness and the notion of a diffusion-free projection developed in Chapter 5.
Proposition 8.17.
Let $\psi$ be a Bellman-closed sketch. Then for any choice of exact imputation strategy $\iota : I_\psi \to \mathscr{P}_\psi(\mathbb{R})$, the projection operator $\Pi_{\mathscr{F}} = \iota \circ \psi$ is diffusion-free. 4
Proof.
We may directly check the diffusion-free property (omitting parentheses for conciseness):
$$\Pi_{\mathscr{F}} \mathcal{T}^\pi \Pi_{\mathscr{F}} \eta = \iota \psi \mathcal{T}^\pi \iota \psi \eta \overset{(a)}{=} \iota \mathcal{T}_\psi \psi \iota \psi \eta \overset{(b)}{=} \iota \mathcal{T}_\psi \psi \eta \overset{(a)}{=} \iota \psi \mathcal{T}^\pi \eta = \Pi_{\mathscr{F}} \mathcal{T}^\pi \eta \, ,$$
where steps marked (a) follow from the identity $\psi \mathcal{T}^\pi = \mathcal{T}_\psi \psi$, and (b) follows from the identity $\psi \iota \psi = \psi$ for any exact imputation strategy $\iota$ for $\psi$.
Imputation strategies formalise how one might interpret functional values as parameters of a probability distribution. Naturally, the chosen imputation strategy affects the approximation artefacts from distributional dynamic programming, the rate of convergence, and whether the algorithm converges at all.
Compared with representation-based algorithms of the style introduced in Chapter 5, working with statistical functionals allows us to design the projection $\Pi_{\mathscr{F}}$ in two separate pieces: a sketch $\psi$ and an imputation strategy $\iota$. In particular, this makes it possible to learn statistical functionals that would be difficult to directly capture in a probability distribution representation. As the next section demonstrates, this allows us to create new kinds of distributional reinforcement learning algorithms.
8.6 Expectile Dynamic Programming
Expectiles form a family of statistical functionals parametrised by a level $\tau \in (0, 1)$. They extend the notion of the mean of a distribution ($\tau = 0.5$) similarly to how quantiles extend the notion of a median. Expectiles have classically found application in econometrics and finance as a form of risk measure (see the bibliographical remarks for further details). Based on the principles of statistical functional dynamic programming, expectile dynamic programming⁶⁵ uses an approximate imputation strategy in order to iteratively estimate the expectiles of the return function.
65.
The incremental analogue is called expectile temporal-difference learning [Rowland et al.,
2019].
Definition 8.18.
For a given $\tau \in (0, 1)$, the $\tau$-expectile of a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$ is
$$\psi^E_\tau(\nu) = \operatorname*{arg\,min}_{z \in \mathbb{R}} \mathrm{ER}_\tau(z; \nu) \, , \qquad (8.12)$$
where
$$\mathrm{ER}_\tau(z; \nu) = \mathbb{E}_{Z \sim \nu}\big[\,\big|\mathbb{1}\{Z < z\} - \tau\big| \, (Z - z)^2\,\big] \qquad (8.13)$$
is the expectile loss. 4
The loss appearing in Definition 8.18 is strongly convex [Boyd and Vandenberghe, 2004] and bounded below by 0. As a consequence, Equation 8.12 has a unique minimiser for a given $\nu$; this verifies that the corresponding expectile is uniquely defined.
To understand the relationship to the mean functional and develop some intuition for the statistical property that an expectile encodes, observe that the mean of a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$ can be expressed as
$$\mu_1(\nu) = \operatorname*{arg\,min}_{z \in \mathbb{R}} \mathbb{E}_{Z \sim \nu}\big[(Z - z)^2\big] \, .$$
Similar to how a quantile is derived from a loss that weights errors asymmetrically (depending on whether the realisation from $Z$ is smaller or greater than $z$), the expectile loss for $\tau \in (0, 1)$ is the asymmetric version of the above. For $\tau$ greater than 1/2, one can think of the expectile as an "optimistic" summary of the distribution: a value that emphasises outcomes that are greater than the mean. Conversely, for $\tau$ smaller than 1/2 the corresponding expectile is in a sense "pessimistic".
Expectile dynamic programming (EDP) estimates the values of a finite set of expectile functionals with levels $0 < \tau_1 < \cdots < \tau_m < 1$. For a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$, let us write
$$e_i = \psi^E_{\tau_i}(\nu) \, .$$
Given the collection of expectile values $e_1, \dots, e_m$, EDP uses an imputation strategy that outputs an $n$-quantile probability distribution that approximately has these expectile values.⁶⁶
66. Of course, this particular form for the imputation strategy is a design choice; the reader is invited to consider what other imputation strategies might be sensible here.
The imputation strategy finds a suitable reconstruction by finding a solution to a root-finding problem. To begin, this strategy outputs an $n$-quantile distribution
$\hat{\nu}$, with $n$ possibly different from $m$:
$$\hat{\nu} = \frac{1}{n} \sum_{j=1}^n \delta_{\theta_j} \, .$$
Following Definition 8.13, for this imputation to be exact the expectiles of $\hat{\nu}$ at $\tau_1, \dots, \tau_m$ should be equal to $e_1, \dots, e_m$:
$$\psi^E_{\tau_i}(\hat{\nu}) = e_i \, , \quad i = 1, \dots, m \, .$$
This constraint implies that the derivatives of the expectile loss, instantiated with $\tau_1, \dots, \tau_m$ and evaluated with $\hat{\nu}$, should all be 0:
$$\partial_z \, \mathrm{ER}_{\tau_i}\big(z; \hat{\nu}\big)\big|_{z = e_i} = 0 \, , \quad i = 1, \dots, m \, . \qquad (8.14)$$
Written out in full for the choice of $\hat{\nu}$ above, these derivatives take the form
$$\partial_z \, \mathrm{ER}_{\tau_i}\big(z; \hat{\nu}\big)\big|_{z = e_i} = \frac{1}{n} \sum_{j=1}^n 2(e_i - \theta_j)\big|\mathbb{1}\{\theta_j < e_i\} - \tau_i\big| \, , \quad i = 1, \dots, m \, .$$
An alternative to the root-finding problem expressed in Equation 8.14 is the following optimisation problem:
$$\text{minimise} \quad \sum_{i=1}^m \Big( \partial_z \, \mathrm{ER}_{\tau_i}\big(z; \hat{\nu}\big)\big|_{z = e_i} \Big)^2 \, . \qquad (8.15)$$
A practical implementation of this imputation strategy therefore applies an optimisation algorithm to the objective in Equation 8.15, or a root-finding method to Equation 8.14, viewed as functions of $\theta_1, \dots, \theta_n$. Because the optimisation algorithm may return a solution that does not exactly satisfy Equation 8.14, this method is an approximate (rather than exact) imputation strategy. It can be used in the impute distributions step of Algorithm 8.1, yielding a dynamic programming algorithm that aims to approximately learn return distribution expectiles. If the root-finding algorithm is always able to find $\hat{\nu}$ exactly satisfying Equation 8.14, then the imputation strategy is exact in this instance; otherwise it is approximate. A specific implementation is explored in detail in Exercise 8.10.
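A minimal numerical sketch of this imputation step is given below (Python; plain gradient descent on the objective of Equation 8.15 stands in for a production-quality root-finder or optimiser, and all names are illustrative assumptions).

```python
import numpy as np

def expectile_loss_gradients(thetas, expectiles, taus):
    """The derivatives of Equation 8.14: d/dz ER_{tau_i}(z; nu_hat) at z = e_i,
    where nu_hat = (1/n) sum_j delta_{theta_j}."""
    n = len(thetas)
    indicators = (thetas[None, :] < expectiles[:, None]).astype(float)
    weights = np.abs(indicators - taus[:, None])   # |1{theta_j < e_i} - tau_i|
    return (2.0 / n) * np.sum(weights * (expectiles[:, None] - thetas[None, :]), axis=1)

def impute_from_expectiles(expectiles, taus, n, num_steps=2000, step_size=0.05):
    """Approximate imputation strategy: fit particle locations theta_1, ..., theta_n
    so that the imputed n-quantile distribution approximately has the target
    expectiles, by descending the squared objective of Equation 8.15."""
    expectiles, taus = np.asarray(expectiles, float), np.asarray(taus, float)
    thetas = np.linspace(expectiles.min(), expectiles.max(), n)
    for _ in range(num_steps):
        grads = expectile_loss_gradients(thetas, expectiles, taus)   # shape [m]
        indicators = (thetas[None, :] < expectiles[:, None]).astype(float)
        weights = np.abs(indicators - taus[:, None])
        # d grads[i] / d theta_j, treating the indicator as locally constant.
        d_grads = -(2.0 / n) * weights                               # shape [m, n]
        objective_grad = 2.0 * grads @ d_grads                       # shape [n]
        thetas = thetas - step_size * objective_grad
    return thetas, np.full(n, 1.0 / n)
```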
8.7 Infinite Collections of Statistical Functionals
Thus far, our treatment of statistical functionals has focused on finite collections of statistical functionals, what we call a sketch. From a computational standpoint this is sensible since, to implement an SFDP algorithm, one needs to be able to operate on individual functional values. On the other hand, in Section 8.3 we saw that many sketches are not Bellman closed and must be combined with an imputation strategy in order to perform dynamic programming. An alternative, which we will study in greater detail in Chapter 10, is to implicitly parametrise an infinite family of statistical functionals.
Definition 8.19.
Let $\Psi$ be a set of statistical functionals. We say that $\Psi$ characterises probability distributions over the real numbers if for each $\nu \in \mathscr{P}(\mathbb{R})$ there is a unique collection of functional values $(\psi(\nu) : \psi \in \Psi)$. 4
The following families of statistical functionals all characterise probability distributions over $\mathbb{R}$.
The cumulative distribution function. The functionals mapping distributions $\nu$ to the probabilities $\mathbb{P}_{Z \sim \nu}(Z \le z)$, indexed by $z \in \mathbb{R}$. Closely related are upper-tail probabilities,
$$\nu \mapsto \mathbb{P}_{Z \sim \nu}(Z \ge z) \, ,$$
and the quantile functionals
$$\nu \mapsto F^{-1}_\nu(\tau) \, ,$$
indexed by $\tau \in (0, 1)$.
The characteristic function. Functionals of the form
$$\nu \mapsto \mathbb{E}_{Z \sim \nu}[e^{iuZ}] \in \mathbb{C} \, ,$$
indexed by $u \in \mathbb{R}$ (and where $i^2 = -1$). The corresponding collection of statistical values is the characteristic function of $\nu$, denoted $\chi_\nu$.
Moments and cumulants. The infinite collection of moment functionals $(\mu_p)_{p=1}^\infty$ does not unconditionally characterise the distribution $\nu$: there are distinct distributions that have the same sequence of moments. However, if the sequence of moments does not grow too quickly, uniqueness is restored. In particular, a sufficient condition for uniqueness is that the underlying distribution $\nu$ has a moment-generating function
$$u \mapsto \mathbb{E}_{Z \sim \nu}[e^{uZ}]$$
which is finite in an open neighbourhood of $u = 0$; see Remark 8.3 for further details. Under this condition, the moment-generating function itself also characterises the distribution, as does the cumulant-generating function, defined as the logarithm of the moment-generating function,
$$u \mapsto \log \mathbb{E}_{Z \sim \nu}[e^{uZ}] \, .$$
The cumulants $(\kappa_p)_{p=1}^\infty$ are defined through a power series expansion of the cumulant-generating function
$$\log \mathbb{E}_{Z \sim \nu}[e^{uZ}] = \sum_{p=1}^\infty \kappa_p \frac{u^p}{p!} \, .$$
Under the condition that the moment-generating function is finite in an open
neighbourhood of the origin, the sequence of cumulants and moments are
determined by one another, and so the sequence of cumulants is another
characterisation of the distribution under this condition.
Example 8.20. Consider the return-variable Bellman equation
$$G^\pi(x, a) \overset{D}{=} R + \gamma G^\pi(X', A') \, , \qquad X = x, \, A = a \, .$$
If for each $u \in \mathbb{R}$ we apply the functional $\nu \mapsto \mathbb{E}_{Z \sim \nu}[e^{iuZ}]$ to the distribution of the random variables on each side, we obtain the characteristic function Bellman equation:
$$\begin{aligned}
\chi_{\eta^\pi(x,a)}(u) &= \mathbb{E}\big[e^{iu(R + \gamma G^\pi(X', A'))} \mid X = x, A = a\big] \\
&= \mathbb{E}\big[e^{iuR} \mid X = x, A = a\big] \, \mathbb{E}\big[e^{iu\gamma G^\pi(X', A')} \mid X = x, A = a\big] \\
&= \chi_{P_R(\cdot \mid x, a)}(u) \, \mathbb{E}\big[\chi_{\eta^\pi(X', A')}(\gamma u) \mid X = x, A = a\big] \, .
\end{aligned}$$
This is a different kind of distributional Bellman equation in which the addition of independent random variables corresponds to a multiplication of their characteristic functions. The equation highlights that the characteristic function of $\eta^\pi$ evaluated at $u$ depends on the next-state characteristic functions evaluated at $\gamma u$. This shows that for a set $S \subseteq \mathbb{R}$, the sketch $(\nu \mapsto \chi_\nu(u) : u \in S)$ cannot be Bellman closed unless $S$ is infinite or $S = \{0\}$. Exercise 8.12 asks you to give a theoretical analysis of a dynamic programming approach based on characteristic functions. 4
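The point can also be illustrated numerically. The sketch below (Python; characteristic-function values stored on a finite grid of $u$ values, with linear interpolation standing in for the missing evaluations at $\gamma u$; all names are illustrative) performs one backup of the characteristic-function Bellman equation, in the spirit of the characteristic value iteration algorithm mentioned in the bibliographical remarks.

```python
import numpy as np

def characteristic_backup(chi, u_grid, reward_chi, transition, policy, gamma):
    """One application of the characteristic-function Bellman equation of
    Example 8.20, on a finite grid of u values.

    chi:        complex array [X, A, U] of current estimates of chi_{eta(x, a)}(u).
    u_grid:     increasing array [U] of evaluation points.
    reward_chi: complex array [X, A, U], characteristic functions of the reward
                distributions evaluated on u_grid.
    """
    num_states, num_actions, _ = chi.shape
    # The backup needs chi_{eta(X', A')} evaluated at gamma * u, which does not lie
    # on the grid: this is exactly why a finite set of u values is not Bellman closed.
    scaled = np.empty_like(chi)
    for x in range(num_states):
        for a in range(num_actions):
            scaled[x, a] = (np.interp(gamma * u_grid, u_grid, chi[x, a].real)
                            + 1j * np.interp(gamma * u_grid, u_grid, chi[x, a].imag))
    expected_next = np.einsum('xay,yb,ybu->xau', transition, policy, scaled)
    return reward_chi * expected_next
```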
Another way to understand collections of statistical functionals that are characterising (in the sense of Definition 8.19) is to interpret them in light of our definition of a probability distribution representation (Definition 5.2). Recall that a representation $\mathscr{F}$ is a collection of distributions indexed by a parameter $\theta$:
$$\mathscr{F} = \{\nu_\theta \in \mathscr{P}(\mathbb{R}) : \theta \in \Theta\} \, .$$
Here, the functional values associated with the set of statistical functionals $\Psi$ correspond to the (infinite-dimensional) parameter $\theta$, so that $\mathscr{F}_\Psi = \mathscr{P}(\mathbb{R})$. This clearly implies that $\mathscr{F}_\Psi$ is closed under the distributional Bellman operator $\mathcal{T}^\pi$ (Section 5.3) and hence that approximation-free distributional dynamic programming is (mathematically) possible with $\mathscr{F}_\Psi$.
8.8 Moment Temporal-Difference Learning*
In Section 8.2 we introduced the $m$-moment Bellman operator, from which an exact dynamic programming algorithm can be derived. A natural follow-up is to apply the tools of Chapter 6 to derive an incremental algorithm for learning the moments of the return-distribution function from samples. Here, an algorithm that incrementally updates an estimate $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ of the first $m$ moments of the return function can be directly obtained through the unbiased estimation approach, as the corresponding operator can be written as an expectation. Given a sample transition $(x, a, r, x', a')$, the unbiased estimation approach yields the update rule (for $i = 1, \dots, m$)
$$M(x, a, i) \leftarrow (1 - \alpha) M(x, a, i) + \alpha \sum_{j=0}^{i} \gamma^{i-j} \binom{i}{j} r^j M(x', a', i - j) \, , \qquad (8.16)$$
where again we take $M(\cdot, \cdot, 0) = 1$ by convention.
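A minimal sketch of this update rule (Python; tabular storage and illustrative names, not the book's reference code) is the following.

```python
from math import comb

def moment_td_update(M, x, a, r, next_x, next_a, gamma, alpha):
    """Single-transition update of Equation 8.16.

    M is a NumPy array of shape [X, A, m + 1], with M[..., 0] fixed to 1 by convention."""
    m = M.shape[-1] - 1
    # Descend from the highest moment so that every target is computed from the
    # pre-update estimates, even when (x, a) == (next_x, next_a).
    for i in range(m, 0, -1):
        target = sum((gamma ** (i - j)) * comb(i, j) * (r ** j) * M[next_x, next_a, i - j]
                     for j in range(i + 1))
        M[x, a, i] = (1 - alpha) * M[x, a, i] + alpha * target
    return M
```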
Unlike the TD and CTD algorithms analysed in Chapter 6, this algorithm is derived from an operator, $T_\pi^{(m)}$, which is not a contraction in a supremum norm over states. As a result, the theory developed in Chapter 6 cannot immediately be applied to demonstrate convergence of this algorithm under appropriate conditions. With some care, however, a proof is possible; we now give an overview of what is needed.
The proof of Proposition 8.7 demonstrates that the behaviour of $T_\pi^{(m)}$ is closely related to that of a contraction mapping. Specifically, the behaviour of $T_\pi^{(m)}$ in updating the estimates of $i$-th moments of returns is contractive if the lower moment estimates are sufficiently close to their correct values. To turn these observations into a proof of convergence, an inductive argument on the moments being learnt must be made, as in the proof of Proposition 8.7. Further, the approach of Chapter 6 needs to be extended to deal with a vanishing bias term in the update to account for this 'near-contractivity' of $T_\pi^{(m)}$; to this end, one may for example begin from the analysis of Bertsekas and Tsitsiklis [1996, Proposition 4.5].
Before moving on, let us remark that in practice, we are likely to be interested in centered moments such as the variance ($m = 2$); these take the form
$$\mathbb{E}\bigg[\bigg( \sum_{t=0}^{\infty} \gamma^t R_t - Q^\pi(x, a) \bigg)^{m} \,\bigg|\, X_0 = x, A_0 = a\bigg] \, .$$
These can be derived from their uncentered counterparts; for example, the variance of the return distribution $\eta^\pi(x, a)$ is obtained from the first two uncentered moments via Equation 8.2.
It is also possible to perform dynamic programming on centered moments directly, as was shown in the context of the mean and variance in Section 5.4 (Exercise 8.14 asks you to derive the Bellman operators for the more general case of the first $m$ centered moments). Given in terms of state-action pairs, the Bellman equation for the return variances $\bar{M}^\pi(\cdot, \cdot, 2) \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ is
$$\bar{M}^\pi(x, a, 2) = \operatorname{Var}(R \mid X = x, A = a) + \gamma^2 \operatorname{Var}\big(Q^\pi(X', A') \mid X = x, A = a\big) + \gamma^2 \, \mathbb{E}\big[\bar{M}^\pi(X', A', 2) \mid X = x, A = a\big] \, ; \qquad (8.17)$$
contrast with Equation 5.20.
One challenge with deriving an incremental algorithm for learning the variance directly is that unbiasedly estimating some of the variance terms on the right-hand side requires multiple samples. For example, an unbiased estimator of
$$\operatorname{Var}\big(Q^\pi(X', A') \mid X = x, A = a\big)$$
in general requires two independent realisations of $X', A'$ for a given source state-action pair $x, a$. Consequently, unbiased estimation of the corresponding operator application with a single transition is not feasible in this case. Despite the fact that the first $m$ centered and uncentered moments of a probability distribution can be recovered from one another, there is a distinct advantage associated with working with uncentered moments when learning from samples.
8.9 Technical Remarks
Remark 8.1.
Theorem 8.12 illustrates how dynamic programming over func-
tional values must incur some approximation error, unless the underlying sketch
is Bellman closed. One way to avoid this error is to augment the state space
with additional information, for example the return accumulated so far. We in
fact took this approach when optimising the conditional value-at-risk (CVaR)
of the return in Chapter 7; in fact, risk measures are statistical functionals that
may also take on the value 1 (see Definition 7.14). 4
Remark 8.2 (Proof of Theorem 8.12).
It is sufficient to consider a pair of states, $x$ and $y$, such that $x$ deterministically transitions to $y$ with reward $r$. Because $\psi$ is Bellman closed, we can identify an associated Bellman operator $\mathcal{T}_\psi$. For a given return function $\eta$ whose state-indexed collection of functional values is $s = \psi(\eta)$, let us write $(\mathcal{T}_\psi s)_i(x)$ for the $i$-th functional value at state $x$, for $i = 1, \dots, m$. By construction and definition of the operator $\mathcal{T}_\psi$, $(\mathcal{T}_\psi s)_i(x)$ is a function of the functional values at $y$ as well as the reward $r$ and discount factor $\gamma$, and so we may write
$$(\mathcal{T}_\psi s)_i(x) = g_i\big(r, \gamma, \psi_1(\eta(y)), \dots, \psi_m(\eta(y))\big)$$
for some function $g_i$. We next argue that $g_i$ is affine⁶⁷ in the inputs $\psi_1(\eta(y)), \dots, \psi_m(\eta(y))$. This is readily observed as each functional $\psi_1, \dots, \psi_m$ is affine in its input distribution,
$$\psi_i\big(\alpha\nu + (1 - \alpha)\nu'\big) = \mathbb{E}_{Z \sim \alpha\nu + (1-\alpha)\nu'}[f_i(Z)] = \alpha \, \mathbb{E}_{Z \sim \nu}[f_i(Z)] + (1 - \alpha) \, \mathbb{E}_{Z \sim \nu'}[f_i(Z)] = \alpha \, \psi_i(\nu) + (1 - \alpha) \, \psi_i(\nu') \, ,$$
and
$$(\mathcal{T}_\psi s)_i(x) = \mathbb{E}_{Z \sim \eta(y)}\big[f_i(r + \gamma Z)\big]$$
is also affine as a function of $\eta(y)$. This affineness would be contradicted if $g_i$ were not also affine. Hence, there exist functions $\beta_{ij} : \mathbb{R} \times [0, 1) \to \mathbb{R}$, for $j = 0, 1, \dots, m$, such that
$$g_i\big(r, \gamma, \psi_1(\eta(y)), \dots, \psi_m(\eta(y))\big) = \beta_{i0}(r, \gamma) + \sum_{j=1}^{m} \beta_{ij}(r, \gamma) \, \psi_j(\eta(y)) \, ,$$
and therefore
$$\mathbb{E}_{Z \sim \eta(y)}\big[f_i(r + \gamma Z)\big] = \mathbb{E}_{Z \sim \eta(y)}\bigg[\sum_{j=0}^{m} \beta_{ij}(r, \gamma) f_j(Z)\bigg] \, ,$$
where $f_0(z) = 1$. Taking $\eta(y)$ to be a Dirac delta $\delta_z$ then gives the following identity:
$$f_i(r + \gamma z) = \sum_{j=0}^{m} \beta_{ij}(r, \gamma) f_j(z) \, .$$
We therefore have that the finite-dimensional function space spanned by $f_0, f_1, \dots, f_m$ (where $f_0$ is the constant function equal to 1) is closed under translation (by $r \in \mathbb{R}$) and scaling (by $\gamma \in [0, 1)$). Engert [1970] shows that the only finite-dimensional subspaces of measurable functions closed under translation are contained in the span of finitely-many functions of the form $z \mapsto z^{\ell} \exp(\lambda z)$, with $\ell \in \mathbb{N}$ and $\lambda \in \mathbb{C}$. Since we further require closure under scaling by $\gamma \in [0, 1)$, we deduce that we must have $\lambda = 0$ in any such function, and the subspace must be equal to the space spanned by the first $n$ monomials (and the constant function).
To conclude, since each monomial $z \mapsto z^i$ for $i = 1, \dots, n$ is expressible as a linear combination of $f_0, \dots, f_m$, the corresponding expectations $\mathbb{E}_{Z \sim \nu}[Z^i]$ are expressible as linear combinations of the expectations $\mathbb{E}_{Z \sim \nu}[f_j(Z)]$, for any distribution $\nu$. The converse also holds, and so we conclude that the sketch $\psi$ encodes the same distributional information as the first $n$ moments. 4
67. Recall that a function $h : M \to M'$ between vector spaces $M$ and $M'$ is affine if for $u_1, u_2 \in M$, $\alpha \in (0, 1)$, we have $h(\alpha u_1 + (1 - \alpha) u_2) = \alpha h(u_1) + (1 - \alpha) h(u_2)$.
Remark 8.3.
The question of whether a distribution is characterised by its sequence of moments has been a subject of study in probability theory for over a century. The sufficient condition on the moment-generating function described in Section 8.7 means that the characteristic function of such a distribution can be written as a power series with scaled moments as coefficients, ensuring uniqueness of the distribution; see e.g. Billingsley [2012] for a detailed discussion. Lin [2017] gives a survey of known sufficient conditions for characterisation, as well as examples where characterisation does not hold. 4
8.10 Bibliographical Remarks
8.1.
Statistical functionals are a core notion in statistics; see for example the clas-
sic text by Van der Vaart [2000]. In reinforcement learning, specific functionals
such as moments, quantiles, and CVaR have been of interest for risk-sensitive
control (more on this in the bibliographical remarks of Chapter 7). Chandak
et al. [2021] consider the problem of off-policy Monte Carlo policy evaluation
of arbitrary statistical functionals of the return distribution.
8.2, 8.8.
Sobel [1982] gives a Bellman equation for return distribution moments
for state-indexed value functions with deterministic policies. More recent work
in this direction includes that of Lattimore and Hutter [2012] and Azar et al.
[2013, 2017], who make use of variance estimates in combination with Bern-
stein’s inequality to improve the efficiency of exploration algorithms, as well
as the work of White and White [2016], who use estimated return variance
to set trace coefficients in multi-step TD learning methods. Sato et al. [2001],
Tamar et al. [2012], Tamar et al. [2013], and Prashanth and Ghavamzadeh
[2013] further develop methods for learning the variance of the return. Tamar
et al. [2016] shows that the operator $T_\pi^{(2)}$ is a contraction under a weighted
norm (see Exercise 8.4), develops an incremental algorithm with a proof of
convergence using the ODE method, and studies both dynamic programming
and incremental algorithms under linear function approximation (the topic of
Chapter 9).
8.3-8.5.
The notion of Bellman closedness is due to Rowland et al. [2019],
although our presentation here is a revised take on the idea. The noted connec-
tion between Bellman closedness and diffusion-free representations and the
term “statistical functional dynamic programming” are new to this book.
8.6.
The expectile dynamic programming algorithm is new to this book, but is
directly derived from expectile temporal-difference learning [Rowland et al.,
2019]. Expectiles themselves were introduced by Newey and Powell [1987] in
the context of testing in econometric regression models, with the asymmetric
squared loss defining expectiles already appearing in Aigner et al. [1976].
Expectiles have since found further application as risk measures, particularly
within finance [Taylor, 2008, Kuan et al., 2009, Bellini et al., 2014, Ziegel,
2016, Bellini and Di Bernardino, 2017]. Our presentation here focuses on the
asymmetric squared loss, requiring a finite second moment assumption, but an
equivalent definition allows expectiles to be defined for all distributions with a
finite first moment [Newey and Powell, 1987].
8.7.
The study of characteristic functions in distributional reinforcement learn-
ing is due to Farahmand [2019], who additionally provides error propagation
analysis for the characteristic value iteration algorithm, in which value iteration
is carried out with characteristic function representations of return distributions.
Earlier, Mandl [1971] studied the characteristic function of the return in Markov
decision processes with deterministic immediate rewards and policies. Chow
et al. [2015] combine a state augmentation method (see Chapter 7) with an
infinite-dimensional Bellman equation for CVaR values to learn a CVaR-optimal
policy. They develop an implementable version of the algorithm by tracking
finitely-many CVaR values, and using linear interpolation for the remainder, an
approach related to the imputation strategies described earlier in the chapter.
Characterisation via the quantile function has driven the success of several large-
scale distributional reinforcement learning algorithms [Dabney et al., 2018a,
Yang et al., 2019], and is the subject of further study in Chapter 10.
8.11 Exercises
Exercise 8.1.
Consider the $m$-moment Bellman operator $T_\pi^{(m)}$ (Definition 8.6). For $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, define the norm
$$\|M\|_{\infty, \mathrm{MAX}} = \max_{i \in \{1, \dots, m\}} \sup_{x \in \mathcal{X}, \, a \in \mathcal{A}} |M(x, a, i)| \, .$$
By means of a counterexample, show that $T_\pi^{(m)}$ is not a contraction mapping in the metric induced by $\|\cdot\|_{\infty, \mathrm{MAX}}$. 4
Exercise 8.2.
Let $\varepsilon > 0$. Determine a bound on the computational cost (in $O(\cdot)$ notation) of performing iterative policy evaluation with the $m$-moment Bellman operator to obtain an approximation $\hat{M}$ such that
$$\max_{i \in \{1, \dots, m\}} \sup_{x \in \mathcal{X}, \, a \in \mathcal{A}} \big|\hat{M}(x, a, i) - M^\pi(x, a, i)\big| < \varepsilon \, .$$
You may find it convenient to refer to the proof of Proposition 8.7. 4
Exercise 8.3.
Equation 5.2 gives the value function $V^\pi$ as the solution of the linear system of equations
$$V = r_\pi + \gamma P_\pi V \, .$$
Provide the analogous linear system for the moment function $M^\pi$. 4
Exercise 8.4.
The purpose of this exercise is to show that $T_\pi^{(2)}$ is a contraction mapping on $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$ in a weighted $L^\infty$ norm, as shown by Tamar et al. [2016]. Let $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$ be a moment function estimate (specifically, for the first two moments). For each $\lambda \in (0, 1)$, define the $\lambda$-weighted norm on $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$ by
$$\|M\|_\lambda = \lambda \|M^{(1)}\|_\infty + (1 - \lambda) \|M^{(2)}\|_\infty \, ,$$
where $M^{(i)} = M(\cdot, \cdot, i) \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$. For any $M, M' \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$, show that
$$\|T_\pi^{(2)}(M - M')\|_\lambda \le \lambda \gamma \big\|P_\pi (M^{(1)} - M'^{(1)})\big\|_\infty + (1 - \lambda) \big\|2\gamma C_r P_\pi (M^{(1)} - M'^{(1)}) + \gamma^2 P_\pi (M^{(2)} - M'^{(2)})\big\|_\infty \, ,$$
where $P_\pi$ is the state-action transition operator, defined by
$$(P_\pi Q)(x, a) = \sum_{(x', a') \in \mathcal{X} \times \mathcal{A}} P_{\mathcal{X}}(x' \mid x, a) \, \pi(a' \mid x') \, Q(x', a') \, ,$$
and $C_r$ is the diagonal reward operator
$$(C_r Q)(x, a) = \mathbb{E}[R \mid X = x, A = a] \, Q(x, a) \, .$$
Writing $\alpha \ge 0$ for the Lipschitz constant of $C_r P_\pi$ with respect to the $L^\infty$ metric, deduce that
$$\|T_\pi^{(2)}(M - M')\|_\lambda \le \big(\lambda\gamma + 2\gamma\alpha(1 - \lambda)\big) \|M^{(1)} - M'^{(1)}\|_\infty + (1 - \lambda)\gamma^2 \|M^{(2)} - M'^{(2)}\|_\infty \, .$$
Hence deduce that there exist parameters $\lambda \in (0, 1)$, $\beta \in [0, 1)$ such that
$$\|T_\pi^{(2)}(M - M')\|_\lambda \le \beta \|M - M'\|_\lambda \, ,$$
as required. 4
Exercise 8.5. Consider the median functional
$$\nu \mapsto F_\nu^{-1}(0.5) \, .$$
Show that there does not exist a function $f : \mathbb{R} \to \mathbb{R}$ such that, for any $\nu \in \mathscr{P}(\mathbb{R})$,
$$\mathbb{E}_{Z \sim \nu}[f(Z)] = F_\nu^{-1}(0.5) \, .$$ 4
Exercise 8.6.
Consider the subset of probability distributions endowed with a probability density $f_\nu$. Repeat the preceding exercise for the differential entropy functional
$$\nu \mapsto -\int_{z \in \mathbb{R}} f_\nu(z) \log f_\nu(z) \, \mathrm{d}z \, .$$ 4
Exercise 8.7. For the imputation strategy of Example 8.16,
(i) Show that for $m = 2$, the imputation strategy is exact, for any $n \in \mathbb{N}_+$.
(ii) Show that for $m > 2$, this imputation strategy is inexact. Hint. Find a distribution for which
$$\psi^C_i\big(\iota(p_1, \dots, p_m)\big) \ne p_i \, , \quad \text{for some } i = 1, \dots, m \, .$$ 4
Exercise 8.8.
In Section 8.4 we argued that every statistical functional dynamic
programming algorithm is a distributional dynamic programming algorithm.
Explain why the converse is false. Under what circumstances may we favour
either an algorithm that operates on statistical functionals or one that operates
on probability distribution representations? 4
Exercise 8.9.
Consider an imputation strategy $\iota$ for a sketch $\psi$. We say the $(\psi, \iota)$ pair is mean-preserving if, for any probability distribution $\nu \in \mathscr{P}_\psi(\mathbb{R})$,
$$\nu' = \iota(\psi(\nu))$$
satisfies
$$\mathbb{E}_{Z \sim \nu'}[Z] = \mathbb{E}_{Z \sim \nu}[Z] \, .$$
Show that in this case, the operator $\iota \circ \psi \circ \mathcal{T}^\pi$ is also mean-preserving. 4
Exercise 8.10.
Using your favourite numerical computation software, implement the expectile imputation strategy described in Section 8.6. Specifically,
(i) Implement a procedure for approximately determining the expectile values $e_1, \dots, e_m$ of a given distribution. Hint. An incremental approach in the style of quantile regression, or a binary search approach, will allow you to deal with continuous distributions.
(ii) Given a set of expectile values $e_1, \dots, e_m$, implement a procedure that imputes an $n$-quantile distribution
$$\frac{1}{n} \sum_{i=1}^n \delta_{\theta_i}$$
by minimising the objective given in Equation 8.15.
Test your implementation on discrete and continuous distributions, and compare it with the best $m$-quantile approximation of those distributions. Is one method better suited to discrete distributions than the other? More generally, when might one method be preferred over the other? 4
Exercise 8.11.
Formulate a variant of expectile dynamic programming that
imputes n-quantile distributions and whose (possibly approximate) imputation
strategy is mean-preserving in the sense of Exercise 8.9. 4
Exercise 8.12 (*).
This exercise applies the line of reasoning from Chapter 4 to characteristic functions, and is based on Farahmand [2019]. For a probability distribution $\nu$ recall that its characteristic function $\chi_\nu$ is:
$$\chi_\nu(u) = \mathbb{E}_{Z \sim \nu}\big[e^{iuZ}\big] \, .$$
Now for $p \in [1, \infty)$ define the probability metric
$$d_{1,p}(\nu, \nu') = \int_{u \in \mathbb{R}} \frac{|\chi_\nu(u) - \chi_{\nu'}(u)|}{|u|^p} \, \mathrm{d}u \, ,$$
and its supremum extension to return functions
$$\overline{d}_{1,p}(\eta, \eta') = \sup_{(x, a) \in \mathcal{X} \times \mathcal{A}} d_{1,p}\big(\eta(x, a), \eta'(x, a)\big) \, .$$
(i) Determine a subset $\mathscr{P}_{\chi,p}(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ on which $d_{1,p}$ is a proper metric.
(ii) Provide assumption(s) under which the return function $\eta^\pi$ lies in $\mathscr{P}_{\chi,p}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$.
(iii) Prove that for $p \ge 2$, the distributional Bellman operator is a contraction mapping in $\overline{d}_{1,p}$, with modulus $\gamma^{p-1}$. 4
Exercise 8.13 (*). Consider the probability metric
$$d_{2,2}(\nu, \nu') = \bigg( \int_{u \in \mathbb{R}} \frac{\big|\chi_\nu(u) - \chi_{\nu'}(u)\big|^2}{u^2} \, \mathrm{d}u \bigg)^{1/2} \, .$$
Show that $d_{2,2}$ is the Cramér distance $\ell_2$. Hint. Use the Parseval-Plancherel identity. 4
Exercise 8.14.
Let $m \in \mathbb{N}_+$. Derive a Bellman operator on $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ whose unique fixed point $\tilde{M}^\pi$ is the collection of centered moments:
$$\tilde{M}^\pi(x, a, i) = \mathbb{E}\Big[\big(G^\pi(x, a) - Q^\pi(x, a)\big)^i\Big] \, , \quad i = 1, \dots, m \, .$$ 4