Statistical Functionals

The development of distributional reinforcement learning in previous chapters

has focused on approximating the full return function with parameterized fami-

lies of distributions. In our analysis, we quantiﬁed the accuracy of an algorithm’s

estimate according to its distance from the true return-distribution function,

measured using a suitable probability metric.

Rather than try to approximate the full distribution of the return, we may

instead select speciﬁc properties of this distribution and directly estimate these

properties. Implicitly, this is the approach taken when estimating the expected

return. Other common properties of interest include quantiles of the distribu-

tions, high-probability tail bounds, and the risk-sensitive objectives described in

Chapter 7. In this chapter, we introduce the language of statistical functionals

to describe such properties.

In some cases, the statistical functional approach yields accurate estimates of quantities of interest in a more straightforward manner. As a concrete example, there is a low-cost dynamic programming procedure to determine the variance of the return distribution (footnote 61). By contrast, categorical and quantile dynamic programming usually under- or overestimate this variance.

This chapter develops the framework of statistical functional dynamic pro-

gramming as a general method for approximately determining the values of

statistical functionals. As we demonstrate in Section 8.4, it is in fact possible

to interpret both categorical and quantile dynamic programming as operating

over statistical functionals. We will see that while some characteristics of the

return (including its variance) can be accurately estimated by an iterative proce-

dure, in general, some care must be taken when estimating arbitrary statistical

functionals.

61. In fact, the return variance can be determined to machine precision by solving a linear system of equations, similar to what was done in Section 5.1 for the value function.

Draft version. 233

234 Chapter 8

8.1 Statistical Functionals

A functional maps functions to real values. By extension, a statistical functional

maps probability distributions to the reals. In this book, we view statistical

functionals as measuring a particular property or characteristic of a probability

distribution. For example, the mapping

$$\nu \mapsto \mathbb{P}_{Z \sim \nu}(Z \geq 0), \quad \nu \in \mathscr{P}(\mathbb{R})$$

is a statistical functional that measures how much probability mass its argument

puts on the nonnegative reals. Statistical functionals express quantiﬁable

properties of probability distributions such as their mean and variance. The

following formalizes this point.

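Definition 8.1. A statistical functional $\psi$ is a mapping from a subset of probability distributions $\mathscr{P}_\psi(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ to the reals, written

$$\psi : \mathscr{P}_\psi(\mathbb{R}) \to \mathbb{R} .$$

We call the particular scalar $\psi(\nu)$ associated with a probability distribution $\nu$ a functional value and the set $\mathscr{P}_\psi(\mathbb{R})$ the domain of the functional. △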

Example 8.2. The mean functional maps probability distributions to their expected values. As before, let

$$\mathscr{P}_1(\mathbb{R}) = \{\nu \in \mathscr{P}(\mathbb{R}) : \mathbb{E}_{Z \sim \nu}[|Z|] < \infty\}$$

be the set of distributions with finite first moment. For $\nu \in \mathscr{P}_1(\mathbb{R})$, the mean functional is

$$\mu_1(\nu) = \mathbb{E}_{Z \sim \nu}[Z] .$$

The restriction to $\mathscr{P}_1(\mathbb{R})$ is necessary to exclude from the definition distributions without a well-defined mean. △

The purpose of this chapter is to study how functional values of the return

distribution can be approximated using dynamic programming procedures and

incremental algorithms. In general, we will be interested in a collection of

such functionals that exhibit desirable properties: for example, because they

can be jointly determined by dynamic programming or because they provide

complementary information about the return function. We call such a collection

a distribution sketch.

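Definition 8.3. A distribution sketch (or simply sketch) $\psi : \mathscr{P}_\psi(\mathbb{R}) \to \mathbb{R}^m$ is a vector-valued function specified by a tuple $(\psi_1, \dots, \psi_m)$ of statistical functionals. Its domain is

$$\mathscr{P}_\psi(\mathbb{R}) = \bigcap_{i=1}^m \mathscr{P}_{\psi_i}(\mathbb{R}) ,$$

and it is defined as

$$\psi(\nu) = (\psi_1(\nu), \dots, \psi_m(\nu)), \quad \nu \in \mathscr{P}_\psi(\mathbb{R}) .$$

Its image is

$$I_\psi = \{\psi(\nu) : \nu \in \mathscr{P}_\psi(\mathbb{R})\} \subseteq \mathbb{R}^m .$$

We also extend this notation to return-distribution functions:

$$\psi(\eta) = (\psi(\eta(x)) : x \in \mathcal{X}), \quad \eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} .$$ △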

Example 8.4. The quantile functionals are a family of statistical functionals indexed by $\tau \in (0, 1)$ and defined over $\mathscr{P}(\mathbb{R})$. The $\tau$-quantile functional is defined in terms of the inverse cumulative distribution function of its argument (Definition 4.12):

$$\psi^{Q}_\tau(\nu) = F_\nu^{-1}(\tau) .$$

A finite collection of quantile functionals (say, for $\tau_1, \dots, \tau_m \in (0, 1)$) constitutes a sketch. △

Example 8.5. To prove the convergence of categorical temporal-difference learning (Section 6.10), we introduced the isometry $I : \mathscr{F}_{C,m} \to \mathbb{R}^m$ defined as

$$I(\nu) = (F_\nu(\theta_i) : i \in \{1, \dots, m\}) , \qquad (8.1)$$

where $(\theta_i)_{i=1}^m$ is the set of locations for the categorical representation. This isometry is also a sketch in the sense of Definition 8.3. If we extend its domain to be $\mathscr{P}(\mathbb{R})$, Equation 8.1 still defines a valid sketch but it is no longer an isometry: it is not possible to recover the distribution $\nu$ from its functional values $I(\nu)$. △

8.2 Moments

Moments are an especially important class of statistical functionals. For an integer $p \in \mathbb{N}^+$, the $p$th moment of a distribution $\nu \in \mathscr{P}_p(\mathbb{R})$ is given by

$$\mu_p(\nu) = \mathbb{E}_{Z \sim \nu}[Z^p] .$$

In particular, the first moment of $\nu$ is its mean, while the variance of $\nu$ is the difference between its second moment and squared mean:

$$\mu_2(\nu) - (\mu_1(\nu))^2 . \qquad (8.2)$$

Moments are ubiquitous in mathematics. They form a natural way of capturing important aspects of a probability distribution, and the infinite sequence of moments $(\mu_p(\nu))_{p=1}^{\infty}$ uniquely characterizes many probability distributions of interest; see Remark 8.3.


Our goal in this section is to describe a dynamic programming approach to determining the moments of the return distribution. Fix a policy $\pi$, and consider a state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$. The $p$th moment of the return distribution $\eta^\pi(x, a)$ is given by

$$\mathbb{E}[(G^\pi(x, a))^p] ,$$

where as before, $G^\pi(x, a)$ is an instantiation of $\eta^\pi(x, a)$. Although we can also study dynamic programming approaches to learning the $p$th moment of state-indexed return distributions,

$$\mathbb{E}[(G^\pi(x))^p] ,$$

this is complicated by a potential conditional dependency between the reward $R$ and next state $X'$ due to the action $A$. One solution is to assume independence of $R$ and $X'$, as we did in Section 5.4. Here, however, to avoid making this assumption, we work with functions indexed by state-action pairs.

To begin, let us fix $m \in \mathbb{N}^+$. The $m$-moment function $M^\pi$ is

$$M^\pi(x, a, i) = \mathbb{E}[(G^\pi(x, a))^i] = \mu_i(\eta^\pi(x, a)), \quad \text{for } i = 1, \dots, m . \qquad (8.3)$$

As with value functions, we view $M^\pi$ as the function (or vector) in $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ describing the collection of the first $m$ moments of the random return. In particular, $M^\pi(\cdot, \cdot, 1)$ is the usual state-action value function. As elsewhere in the book, to ensure that the expectation in Equation 8.3 is well defined, we assume that all reward distributions have finite $i$th moments, for $i = 1, \dots, m$. In fact, it is sufficient to assume that this holds for $i = m$ (Assumption 4.29($m$)).

As with the standard Bellman equation, from the state-action random-variable Bellman equation

$$G^\pi(x, a) \overset{\mathcal{D}}{=} R + \gamma G^\pi(X', A'), \quad X = x, A = a ,$$

we can derive Bellman equations for the moments of the return distribution. To do so, we raise both sides to the $i$th power and take expectations with respect to both the random return variables $G^\pi$ and the random transition $(X = x, A = a, R, X', A')$:

$$\mathbb{E}[(G^\pi(x, a))^i] = \mathbb{E}_\pi[(R + \gamma G^\pi(X', A'))^i \mid X = x, A = a] .$$

From the binomial expansion of the term inside the expectation, we obtain

$$\mathbb{E}[(G^\pi(x, a))^i] = \mathbb{E}_\pi\Big[ \sum_{j=0}^{i} \binom{i}{j} \gamma^{i-j} R^j \, G^\pi(X', A')^{i-j} \,\Big|\, X = x, A = a \Big] .$$


Since $R$ and $G^\pi(X', A')$ are independent given $X$ and $A$, we can rewrite the above as

$$M^\pi(x, a, i) = \sum_{j=0}^{i} \binom{i}{j} \gamma^{i-j} \, \mathbb{E}_\pi[R^j \mid X = x, A = a] \, \mathbb{E}_\pi\big[ M^\pi(X', A', i - j) \mid X = x, A = a \big] ,$$

where by convention we take $M^\pi(x', a', 0) = 1$ for all $x' \in \mathcal{X}$ and $a' \in \mathcal{A}$. This is a recursive characterization of the $i$th moment of a return distribution, analogous to the familiar Bellman equation for the mean. The recursion is cast into the familiar framework of operators with the following definition.

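Definition 8.6. Let $m \in \mathbb{N}^+$. The $m$-moment Bellman operator $\mathcal{T}^\pi_{(m)} : \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m} \to \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ is given by

$$(\mathcal{T}^\pi_{(m)} M)(x, a, i) = \sum_{j=0}^{i} \binom{i}{j} \gamma^{i-j} \, \mathbb{E}_\pi[R^j \mid X = x, A = a] \, \mathbb{E}_\pi\big[ M(X', A', i - j) \mid X = x, A = a \big] . \qquad (8.4)$$ △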
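As a minimal sketch of Equation 8.4, the operator can be written in plain Python under some simplifying assumptions that are not part of the text: a finite set of state-action pairs, reward distributions with finitely many values, and a transition kernel that has already been composed with the policy. The names `moment_bellman_update`, `iterate_moments`, `P`, and `rewards` are illustrative only.

```python
from math import comb


def moment_bellman_update(M, P, rewards, gamma):
    """Apply the m-moment Bellman operator of Equation 8.4 once.

    M[u][i]    -- estimate of the i-th moment at state-action pair u,
                  for i = 0..m, with the convention M[u][0] == 1.
    P[u][v]    -- probability that pair u is followed by pair v
                  (assumed: transition kernel composed with the policy).
    rewards[u] -- list of (reward, probability) pairs (assumed finite).
    """
    m = len(M[0]) - 1
    n = len(M)
    new_M = []
    for u in range(n):
        # Reward moments E[R^j | u] for j = 0..m.
        r_mom = [sum(p * r ** j for r, p in rewards[u]) for j in range(m + 1)]
        row = [1.0]  # zeroth moment is 1 by convention
        for i in range(1, m + 1):
            total = 0.0
            for j in range(i + 1):
                # E[M(X', A', i - j) | u]
                next_mom = sum(P[u][v] * M[v][i - j] for v in range(n))
                total += comb(i, j) * gamma ** (i - j) * r_mom[j] * next_mom
            row.append(total)
        new_M.append(row)
    return new_M


def iterate_moments(P, rewards, gamma, m, num_iters):
    """Iterate the operator starting from all-zero moment estimates."""
    M = [[1.0] + [0.0] * m for _ in range(len(P))]
    for _ in range(num_iters):
        M = moment_bellman_update(M, P, rewards, gamma)
    return M


# Toy problem: one self-looping pair, reward 0 or 1 with equal probability.
M = iterate_moments([[1.0]], [[(0.0, 0.5), (1.0, 0.5)]], 0.5, 2, 200)
mean, second = M[0][1], M[0][2]
variance = second - mean ** 2  # Equation 8.2
```

In this toy problem the rewards are independent across time steps, so the return has mean $0.5 / (1 - 0.5) = 1$ and variance $0.25 / (1 - 0.25) = 1/3$, which is what the iterates approach.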

The collection of moments $(M^\pi(x, a, i) : (x, a) \in \mathcal{X} \times \mathcal{A}, i = 1, \dots, m)$ is a fixed point of the operator $\mathcal{T}^\pi_{(m)}$. In general, the $m$-moment Bellman operator is not a contraction mapping with respect to the $L^\infty$ metric (except, of course, for $m = 1$; see Exercise 8.1). However, with a more nuanced analysis, we can still show that $\mathcal{T}^\pi_{(m)}$ has a unique fixed point to which the iterates

$$M_{k+1} = \mathcal{T}^\pi_{(m)} M_k \qquad (8.5)$$

converge.

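Proposition 8.7. Let $m \in \mathbb{N}^+$. Under Assumption 4.29($m$), $M^\pi$ is the unique fixed point of $\mathcal{T}^\pi_{(m)}$. In addition, for any initial condition $M_0 \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, the iterates of Equation 8.5 converge to $M^\pi$. △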

Proof. We begin by constructing a suitable notion of distance between moment functions in $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$. For $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, let

$$\|M\|_{\infty,i} = \sup_{(x,a) \in \mathcal{X} \times \mathcal{A}} |M(x, a, i)|, \quad \text{for } i = 1, \dots, m ,$$

$$\|M\|_{\infty,<i} = \sup_{j = 1, \dots, i-1} \|M\|_{\infty,j}, \quad \text{for } i = 2, \dots, m .$$

Each of $\|\cdot\|_{\infty,i}$ (for $i = 1, \dots, m$) and $\|\cdot\|_{\infty,<i}$ (for $i = 2, \dots, m$) is a semi-norm; they fulfill the requirements of a norm, except that neither $\|M\|_{\infty,i} = 0$ nor $\|M\|_{\infty,<i} = 0$ implies that $M = 0$. From these semi-norms, we construct the pseudo-metrics

$$(M, M') \mapsto \|M - M'\|_{\infty,i} ,$$

noting that it is possible for the distance between $M$ and $M'$ to be zero even when $M$ is different from $M'$.

The structure of the proof is to argue that $\mathcal{T}^\pi_{(m)}$ is a contraction with modulus $\gamma$ with respect to $\|\cdot\|_{\infty,1}$ and then to show inductively that it satisfies an inequality of the form

$$\|\mathcal{T}^\pi_{(m)} M - \mathcal{T}^\pi_{(m)} M'\|_{\infty,i} \leq C_i \|M - M'\|_{\infty,<i} + \gamma^i \|M - M'\|_{\infty,i} , \qquad (8.6)$$

for each $i = 2, \dots, m$, and some constant $C_i$ that depends on the reward distributions $P_{\mathcal{R}}$. Chaining these results together then leads to the convergence statement, and uniqueness follows as an immediate corollary.

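To see that $\mathcal{T}^\pi_{(m)}$ is a contraction with respect to $\|\cdot\|_{\infty,1}$, let $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, and write $M_{(i)} = (M(x, a, i) : (x, a) \in \mathcal{X} \times \mathcal{A})$ for the function in $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ corresponding to the $i$th moment function estimates given by $M$. By inspecting Equation 8.4 with $i = 1$, it follows that

$$(\mathcal{T}^\pi_{(m)} M)_{(1)} = \mathcal{T}^\pi M_{(1)} ,$$

where $\mathcal{T}^\pi$ is the usual Bellman operator. Furthermore, $\|M\|_{\infty,1} = \|M_{(1)}\|_\infty$, and so the statement that $\mathcal{T}^\pi_{(m)}$ is a contraction with respect to the pseudo-metric implied by $\|\cdot\|_{\infty,1}$ is equivalent to the contractivity of $\mathcal{T}^\pi$ on $\mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ with respect to the $L^\infty$ norm, which was shown in Proposition 4.4.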

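To see that $\mathcal{T}^\pi_{(m)}$ satisfies the bound of Equation 8.6 for $i > 1$, let $L \in \mathbb{R}$ be such that

$$\big| \mathbb{E}[R^i \mid X = x, A = a] \big| \leq L, \quad \text{for all } (x, a) \in \mathcal{X} \times \mathcal{A} \text{ and } i = 1, \dots, m .$$

Observe that

$$\big| (\mathcal{T}^\pi_{(m)} M)(x, a, i) - (\mathcal{T}^\pi_{(m)} M')(x, a, i) \big|$$

$$= \Big| \sum_{j=0}^{i-1} \binom{i}{j} \gamma^{i-j} \, \mathbb{E}_\pi[R^j \mid X = x, A = a] \sum_{x' \in \mathcal{X}} \sum_{a' \in \mathcal{A}} P_{\mathcal{X}}(x' \mid x, a) \pi(a' \mid x') (M - M')(x', a', i - j) \Big|$$

$$\leq \sum_{j=1}^{i-1} \binom{i}{j} \gamma^{i-j} \big| \mathbb{E}_\pi[R^j \mid X = x, A = a] \big| \times \|M - M'\|_{\infty,<i} + \gamma^i \|M - M'\|_{\infty,i}$$

$$\leq L \sum_{j=1}^{i-1} \binom{i}{j} \|M - M'\|_{\infty,<i} + \gamma^i \|M - M'\|_{\infty,i}$$

$$\leq (2^i - 2) L \|M - M'\|_{\infty,<i} + \gamma^i \|M - M'\|_{\infty,i} .$$

The sum stops at $j = i - 1$ because the $j = i$ term involves $(M - M')(\cdot, \cdot, 0)$, which is zero by the convention that zeroth moments equal 1.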


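Taking $C_i = (2^i - 2) L$, we have

$$\|\mathcal{T}^\pi_{(m)} M - \mathcal{T}^\pi_{(m)} M'\|_{\infty,i} \leq C_i \|M - M'\|_{\infty,<i} + \gamma^i \|M - M'\|_{\infty,i} , \quad \text{for } i = 2, \dots, m .$$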

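To chain these results together, first observe that

$$\|M_k - M^\pi\|_{\infty,1} \to 0 .$$

We next argue inductively that if, for a given $i < m$, $(M_k)_{k \geq 0}$ converges to $M^\pi$ in the pseudo-metric induced by $\|\cdot\|_{\infty,<i}$, then also

$$\|M_k - M^\pi\|_{\infty,i} \to 0 ,$$

and hence

$$\|M_k - M^\pi\|_{\infty,<(i+1)} \to 0 .$$

Let $\varepsilon_k = \|M_k - M^\pi\|_{\infty,<i}$ and $\delta_k = \|M_k - M^\pi\|_{\infty,i}$. Then the generalized contraction result states that $\delta_{k+1} \leq C_i \varepsilon_k + \gamma^i \delta_k$. Taking the limit superior on both sides yields

$$\limsup_{k \to \infty} \delta_k \leq \limsup_{k \to \infty} \big[ C_i \varepsilon_k + \gamma^i \delta_k \big] = \gamma^i \limsup_{k \to \infty} \delta_k ,$$

where we have used the result $\varepsilon_k \to 0$. From this, we deduce

$$\limsup_{k \to \infty} \delta_k \leq 0 ,$$

but since $(\delta_k)_{k \geq 0}$ is a nonnegative sequence, we therefore have $\delta_k \to 0$. This completes the inductive step, and we therefore obtain $\|M_k - M^\pi\|_{\infty,i} \to 0$, as required.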

In essence, Proposition 8.7 establishes that the $m$-moment Bellman operator behaves in a similar fashion to the usual Bellman operator, in the sense that its iterates converge to the fixed point $M^\pi$. From here, we may follow the derivations of Chapter 5 to construct a dynamic programming algorithm for learning these moments (footnote 62) or those of Chapter 6 to construct the corresponding incremental algorithm (Section 8.8). Although the proof above does not demonstrate the contractive nature of the moment Bellman operator, for $m = 2$, this can be achieved using a different norm and analysis technique (Exercise 8.4).

8.3 Bellman Closedness

In preceding chapters, our approach to distributional reinforcement learning considered approximations of the return distributions that could be tractably manipulated by algorithms. The $m$-moment Bellman operator, on the other hand, is not directly applied to probability distributions: compared to, say, an $m$-categorical distribution, there is no immediate procedure for drawing a sample from a collection of $m$ moments. Compared to the categorical and quantile projected operators, however, the $m$-moment operator yields an error-free dynamic programming procedure: with sufficiently many iterations and under some finiteness assumptions, we can determine the moments of the return function to any degree of accuracy. The concept of Bellman closedness formalizes this idea.

62. When the reward distributions take on a finite number of values, in particular, the expectations of Definition 8.6 can be implemented as sums.

Figure 8.1 (diagram relating return-distribution functions $\eta$ and functional values $s$; not reproduced)
A sketch $\psi$ is Bellman closed if there is an operator $\mathcal{T}^\pi_\psi$ such that in the diagram above, the composite functions $\psi \circ \mathcal{T}^\pi$ and $\mathcal{T}^\pi_\psi \circ \psi$ coincide.

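Definition 8.8. A sketch $\psi = (\psi_1, \dots, \psi_m)$ is Bellman closed if, whenever its domain $\mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$ is closed under the distributional Bellman operator:

$$\eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} \implies \mathcal{T}^\pi \eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} ,$$

there is an operator $\mathcal{T}^\pi_\psi : I_\psi^{\mathcal{X}} \to I_\psi^{\mathcal{X}}$ such that

$$\psi(\mathcal{T}^\pi \eta) = \mathcal{T}^\pi_\psi \psi(\eta) \quad \text{for all } \eta \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}} .$$

The operator $\mathcal{T}^\pi_\psi$ is said to be the Bellman operator for the sketch $\psi$. △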

As was demonstrated in the preceding section, the collection of the first $m$ moments $(\mu_1, \dots, \mu_m)$ is a Bellman-closed sketch. Its associated operator is the $m$-moment operator $\mathcal{T}^\pi_{(m)}$.

When a sketch $\psi$ is Bellman closed, the operator $\mathcal{T}^\pi_\psi$ mirrors the application of the distributional Bellman operator to the return-distribution function $\eta$; see Figure 8.1. The concept of Bellman closedness is related to that of a diffusion-free projection (Chapter 5), and we will in fact establish a relationship between the two in Section 8.5. In addition, Bellman-closed sketches are particularly interesting from a computational perspective because they support an exact dynamic programming procedure, as the following establishes.


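Proposition 8.9. Let $\psi = (\psi_1, \dots, \psi_m)$ be a Bellman-closed sketch and suppose that $\mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$ is closed under $\mathcal{T}^\pi$. Then for any initial condition $\eta_0 \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$, and sequences $(\eta_k)_{k \geq 0}$, $(s_k)_{k \geq 0}$ defined by

$$\eta_{k+1} = \mathcal{T}^\pi \eta_k , \quad s_0 = \psi(\eta_0) , \quad s_{k+1} = \mathcal{T}^\pi_\psi s_k ,$$

we have, for $k \geq 0$,

$$s_k = \psi(\eta_k) .$$

In addition, the functional values $s^\pi = \psi(\eta^\pi)$ of the return-distribution function are a fixed point of the operator $\mathcal{T}^\pi_\psi$. △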

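Proof. Both parts of the result follow immediately from the definition of the operator $\mathcal{T}^\pi_\psi$. First suppose that $s_k = \psi(\eta_k)$, for some $k \geq 0$. Then note that

$$s_{k+1} = \mathcal{T}^\pi_\psi s_k = \mathcal{T}^\pi_\psi \psi(\eta_k) = \psi(\mathcal{T}^\pi \eta_k) = \psi(\eta_{k+1}) .$$

Thus, by induction, the first statement is proven. For the second statement, we have

$$s^\pi = \psi(\eta^\pi) = \psi(\mathcal{T}^\pi \eta^\pi) = \mathcal{T}^\pi_\psi \psi(\eta^\pi) = \mathcal{T}^\pi_\psi s^\pi .$$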

Of course, dynamic programming is only feasible if the operator $\mathcal{T}^\pi_\psi$ can itself be implemented in a computationally tractable manner. In the case of the $m$-moment operator, we know this is possible under similar assumptions as were made in Chapter 5.

Proposition 8.9 illustrates how, when the sketch $\psi$ is Bellman closed, we can do away with probability distributions and work exclusively with functional values. However, many sketches of interest fail to be Bellman closed, as the following examples demonstrate.

Example 8.10 (The median functional). A median of a distribution $\nu$ is its 0.5-quantile $F_\nu^{-1}(0.5)$ (footnote 63). Perhaps surprisingly, there is in general no way to determine the median of a return distribution based solely on the medians at the successor states. To see this, consider a state $x$ that leads to state $y_1$ with probability $1/3$ and to state $y_2$ with probability $2/3$, with zero reward. The following are two scenarios in which the median returns at $y_1$ and $y_2$ are the same, but the median at $x$ is different (see Figure 8.2):

Case 1. The return distributions at $y_1$ and $y_2$ are Dirac deltas at 0 and 1, respectively, and these are also the medians of these distributions. The median at $x$ is also 1.

Case 2. The return distributions at $y_1$ and $y_2$ are a Dirac delta at 0 and the uniform distribution on $[0.5, 1.5]$, respectively, and have the same medians as in Case 1. However, the median at $x$ is now 0.75. △

63. As usual, there might be multiple values of $z$ for which $\mathbb{P}_{Z \sim \nu}(Z \leq z) = 0.5$; recall that $F_\nu^{-1}$ takes the smallest such value.

Figure 8.2 (panels (b) and (c) plot cumulative probability against return; not reproduced)
Illustration of Example 8.10. (a) A Markov decision process in which state $x$ leads to states $y_1$ and $y_2$ with probability $1/3$ and $2/3$, respectively. (b) Case 1, in which the median of $\eta^\pi(x)$ matches the median of $\eta^\pi(y_2)$. (c) Case 2, in which the median of $\eta^\pi(x)$ differs from the median of $\eta^\pi(y_2)$.
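The two cases can be checked numerically. The following is a small sketch, assuming transition probabilities of 1/3 and 2/3 and no discounting (which is consistent with the medians stated in the example); the helper names are illustrative and the median is computed as the smallest $z$ with $F(z) \geq 0.5$ by bisection.

```python
def smallest_quantile(cdf, tau, lo, hi, tol=1e-10):
    """Smallest z in [lo, hi] with cdf(z) >= tau, found by bisection.

    Assumes cdf is nondecreasing and right-continuous, with
    cdf(lo) < tau <= cdf(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) >= tau:
            hi = mid
        else:
            lo = mid
    return hi


def step(z, at):
    """CDF of a Dirac delta located at `at`."""
    return 1.0 if z >= at else 0.0


def uniform_cdf(z, a, b):
    """CDF of the uniform distribution on [a, b]."""
    return min(1.0, max(0.0, (z - a) / (b - a)))


def mix(cdf_y1, cdf_y2):
    """CDF at x: go to y1 w.p. 1/3 and y2 w.p. 2/3, zero reward, gamma = 1."""
    return lambda z: cdf_y1(z) / 3.0 + 2.0 * cdf_y2(z) / 3.0


def median(cdf):
    return smallest_quantile(cdf, 0.5, -1.0, 3.0)


y1 = lambda z: step(z, 0.0)
y2_case1 = lambda z: step(z, 1.0)
y2_case2 = lambda z: uniform_cdf(z, 0.5, 1.5)

medians_y2 = (median(y2_case1), median(y2_case2))   # equal in both cases
median_x_case1 = median(mix(y1, y2_case1))
median_x_case2 = median(mix(y1, y2_case2))
```

The medians at the successor states agree across the two cases, yet the medians at $x$ do not, so no function of the successor medians alone can produce the median at $x$.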

Example 8.11 (At-least functionals). For $\nu \in \mathscr{P}(\mathbb{R})$ and $z \in \mathbb{R}$, let us define the at-least functional

$$\psi_{\geq z}(\nu) = \mathbb{1}\{\mathbb{P}_{Z \sim \nu}(Z \geq z) > 0\} ,$$

measuring whether $\nu$ assigns positive probability to values in $[z, \infty)$. Now consider a state $x$ that deterministically leads to $y$, with no reward, and suppose that there is a single action $a$ available. The statement "it is possible to obtain a return of at least 10 at state $y$" corresponds to

$$\psi_{\geq 10}(\eta^\pi(y)) = 1 . \qquad (8.7)$$

If Equation 8.7 holds, can we deduce whether or not a return of at least 10 is possible at state $x$? The answer is no. Suppose that $\gamma = 0.9$, and consider the following two situations:

Case 1. $\eta^\pi(y) = \delta_{10}$. Then $\psi_{\geq 10}(\eta^\pi(y)) = 1$, $\eta^\pi(x) = \delta_{9}$ and $\psi_{\geq 10}(\eta^\pi(x)) = 0$.

Case 2. $\eta^\pi(y) = \delta_{20}$. Then $\psi_{\geq 10}(\eta^\pi(y)) = 1$ still. However, $\eta^\pi(x) = \delta_{18}$ and $\psi_{\geq 10}(\eta^\pi(x)) = 1$. △

What goes wrong in the examples above is that we do not have sufficient information about the return distribution at the successor states to compute the functional values for the return distribution of state $x$. Consequently, we cannot use an iterative procedure to determine the functional values of $\eta^\pi$, at least not without error.

As it turns out, $m$-moment sketches are somewhat special in being Bellman closed. As the following theorem establishes, any sketch whose functionals are expectations of functions must encode the same information as a moment sketch.

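Theorem 8.12. Let $\psi = (\psi_1, \dots, \psi_m)$ be a sketch. Suppose that $\psi$ is Bellman closed and that for each $i = 1, \dots, m$, there is a function $f_i : \mathbb{R} \to \mathbb{R}$ for which

$$\psi_i(\nu) = \mathbb{E}_{Z \sim \nu}[f_i(Z)] .$$

Then, $\psi$ is equivalent to the first $n$ moment functionals for some $n \leq m$, in the sense that there are real-valued coefficients $(b_{ij})$ and $(c_{ij})$ such that for any $\nu \in \mathscr{P}_\psi(\mathbb{R}) \cap \mathscr{P}_m(\mathbb{R})$,

$$\psi_i(\nu) = \sum_{j=1}^{n} b_{ij} \mu_j(\nu) + b_{i0} , \quad i = 1, \dots, m ;$$

$$\mu_j(\nu) = \sum_{i=1}^{m} c_{ij} \psi_i(\nu) + c_{0j} , \quad j = 1, \dots, n .$$ △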

The proof is somewhat lengthy and is given in Remark 8.2 at the end of the

chapter.

As a corollary, we may deduce that any sketch that can be expressed as an invertible function of the first $m$ moments is also Bellman closed. More precisely, if $\psi'$ is a sketch that is an invertible transformation $h$ of the sketch $\psi$ corresponding to the first $m$ moments, say $\psi' = h \circ \psi$, then $\psi'$ is Bellman closed with corresponding Bellman operator $h \circ \mathcal{T}^\pi_\psi \circ h^{-1}$. Thus, for example, we may deduce that the sketch corresponding to the mean and variance functionals is Bellman closed, since the mean and variance are expressible as an invertible function of the mean and uncentered second moment. On the other hand, many other statistical functionals (including quantile functionals) are not covered by Theorem 8.12. In the latter case, this is because there is no function $f : \mathbb{R} \to \mathbb{R}$ whose expectation for an arbitrary distribution $\nu$ recovers the $\tau$th quantile of $\nu$ (Exercise 8.5). Still, as established in Example 8.10, quantile sketches are not Bellman closed.

8.4 Statistical Functional Dynamic Programming

When a sketch $\psi$ is not Bellman closed, we lack an operator $\mathcal{T}^\pi_\psi$ that emulates the combination of the distributional Bellman operator and this sketch. This precludes a dynamic programming approach that bootstraps its functional value estimates directly from the previous estimates. However, approximate dynamic programming with arbitrary statistical functionals is still possible if we introduce an additional imputation step $\iota$ that reconstructs plausible probability distributions from functional values. As we will now see, this allows us to apply the distributional Bellman operator to the reconstructed distributions and then extract the functional values of the resulting return function estimate.

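Definition 8.13. An imputation strategy for the sketch $\psi : \mathscr{P}_\psi(\mathbb{R}) \to \mathbb{R}^m$ is a function $\iota : I_\psi \to \mathscr{P}_\psi(\mathbb{R})$. We say that it is exact if for any valid functional values $(s_1, \dots, s_m) \in I_\psi$, we have

$$\psi_i(\iota(s_1, \dots, s_m)) = s_i , \quad i = 1, \dots, m .$$

Otherwise, we say that it is approximate. By extension, we write $\iota(s) \in \mathscr{P}_\psi(\mathbb{R})^{\mathcal{X}}$ for the return-distribution function corresponding to the collection of functional values $s \in I_\psi^{\mathcal{X}}$. △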

In other words, if $\iota$ is an exact imputation strategy for the sketch $\psi = (\psi_1, \dots, \psi_m)$, then for any valid values $s_1, \dots, s_m$ of the functionals $\psi_1, \dots, \psi_m$, we have that $\iota(s_1, \dots, s_m)$ is a probability distribution with the required values under each functional. In a certain sense, $\iota$ is a pseudo-inverse to the vector-valued map $\nu \mapsto (\psi_1(\nu), \dots, \psi_m(\nu))$. Note that a true inverse to $\psi$ does not exist, as $\psi$ generally does not capture all aspects of the distribution $\nu$.

Once an imputation strategy has been selected, it is possible to write down an approximate dynamic programming algorithm for the functional values under consideration. An abstract framework is given in Algorithm 8.1. In effect, such an algorithm recursively computes the iterates

$$s_{k+1} = \psi(\mathcal{T}^\pi \iota(s_k)) \qquad (8.8)$$

from an initial $s_0 \in I_\psi^{\mathcal{X}}$. Procedures that implement the iterative process described by Equation 8.8 are referred to as statistical functional dynamic programming (SFDP) algorithms. When the sketch $\psi$ is Bellman closed and its imputation strategy $\iota$ is exact, the sequence of iterates $(s_k)_{k \geq 0}$ converges to $\psi(\eta^\pi)$, so long as $\psi$ is continuous (with respect to a Wasserstein metric).
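The iteration of Equation 8.8 can be sketched in a few lines of Python. This is only an illustration under assumptions not made in the text: distributions are finitely supported and stored as lists of `(value, probability)` atoms, the policy is folded into a state-to-state transition matrix `P`, and rewards are drawn independently of the next state. The function names are hypothetical. The example instantiates the loop with the mean functional and a Dirac imputation, in which case it reduces to ordinary policy evaluation.

```python
def bellman_backup(eta, P, rewards, gamma):
    """Distributional Bellman operator on finitely supported distributions.

    eta[x]     -- list of (value, probability) atoms for state x.
    P[x][y]    -- probability of moving from x to y under the fixed policy.
    rewards[x] -- list of (reward, probability) pairs, drawn independently
                  of the next state (an assumption of this sketch).
    """
    new_eta = []
    for x in range(len(eta)):
        atoms = []
        for r, p_r in rewards[x]:
            for y, p_xy in enumerate(P[x]):
                for z, p_z in eta[y]:
                    atoms.append((r + gamma * z, p_r * p_xy * p_z))
        new_eta.append(atoms)
    return new_eta


def sfdp(functionals, impute, s0, P, rewards, gamma, num_iters):
    """Iterate s_{k+1} = psi(T^pi iota(s_k)), as in Equation 8.8."""
    s = [list(s_x) for s_x in s0]
    for _ in range(num_iters):
        eta = [impute(s_x) for s_x in s]                      # impute
        eta_tilde = bellman_backup(eta, P, rewards, gamma)    # apply T^pi
        s = [[psi(nu) for psi in functionals] for nu in eta_tilde]
    return s


def mean(nu):
    """Mean functional of a finitely supported distribution."""
    return sum(p * z for z, p in nu)


def impute_dirac(s_x):
    """Exact imputation for the mean sketch: a Dirac delta at the mean."""
    return [(s_x[0], 1.0)]


# Two-state chain: state 0 moves to state 1, which loops on itself.
P = [[0.0, 1.0], [0.0, 1.0]]
rewards = [[(0.0, 1.0)], [(1.0, 1.0)]]
values = sfdp([mean], impute_dirac, [[0.0], [0.0]], P, rewards, 0.9, 500)
```

Here the looping state has value $1 / (1 - 0.9) = 10$ and the first state has value $0.9 \times 10 = 9$. Other sketches plug into the same loop by swapping `functionals` and `impute`.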

Algorithm 8.1: Statistical functional dynamic programming

Algorithm parameters: statistical functionals $\psi_1, \dots, \psi_m$,
    imputation strategy $\iota$,
    initial functional values $((s_i(x))_{i=1}^m : x \in \mathcal{X})$,
    desired number of iterations $K$

for $k = 1, \dots, K$ do
    ▷ Impute distributions
    $\eta \leftarrow (\iota(s_1(x), \dots, s_m(x)) : x \in \mathcal{X})$
    ▷ Apply distributional Bellman operator
    $\tilde\eta \leftarrow \mathcal{T}^\pi \eta$
    foreach state $x \in \mathcal{X}$ do
        for $i = 1, \dots, m$ do
            ▷ Update statistical functional values
            $s_i(x) \leftarrow \psi_i(\tilde\eta(x))$
        end for
    end foreach
end for
return $((s_i(x))_{i=1}^m : x \in \mathcal{X})$

Example 8.14. For the quantile functionals $(\psi^{Q}_{\tau_i})_{i=1}^m$ with $\tau_i = \frac{2i-1}{2m}$ for $i = 1, \dots, m$, an exact imputation strategy is

$$(q_1, \dots, q_m) \mapsto \frac{1}{m} \sum_{i=1}^{m} \delta_{q_i} . \qquad (8.9)$$

This follows because the $\frac{2i-1}{2m}$-quantile of $\frac{1}{m} \sum_{i=1}^{m} \delta_{q_i}$ is precisely $q_i$. Note that when $\tau_1, \dots, \tau_m \in (0, 1)$ are arbitrary levels with quantile values $(q_1, \dots, q_m)$, however, it is generally not true that Equation 8.9 is an exact imputation strategy for the corresponding quantile functionals. △
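Both claims of the example can be checked on a small instance. The sketch below assumes finitely supported distributions stored as `(value, probability)` atoms and uses illustrative function names; the quantile is taken to be the smallest $z$ with $F_\nu(z) \geq \tau$.

```python
def quantile(nu, tau):
    """Smallest z with F_nu(z) >= tau, for nu a list of (value, prob) atoms."""
    cumulative = 0.0
    for z, p in sorted(nu):
        cumulative += p
        if cumulative >= tau:
            return z
    return max(z for z, _ in nu)  # guard against rounding in the running sum


def impute_uniform(qs):
    """Equation 8.9: uniform mixture of Dirac deltas at the quantile values."""
    m = len(qs)
    return [(q, 1.0 / m) for q in qs]


# Midpoint levels tau_i = (2i - 1) / (2m): the imputation is exact.
m = 4
midpoint_levels = [(2 * i - 1) / (2 * m) for i in range(1, m + 1)]
qs = [-1.0, 0.5, 2.0, 7.0]
recovered = [quantile(impute_uniform(qs), tau) for tau in midpoint_levels]

# Other levels: the same map need not reproduce the given quantile values.
other_levels = [0.1, 0.2]
other_qs = [0.0, 1.0]
other_recovered = [quantile(impute_uniform(other_qs), tau) for tau in other_levels]
```

With the midpoint levels, `recovered` equals `qs`. With levels 0.1 and 0.2, the imputed distribution places mass 1/2 at 0, so both of its quantiles are 0 and the value 1 at level 0.2 is not recovered.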

Example 8.15. Categorical dynamic programming can be interpreted as an SFDP algorithm. Indeed, the parameters $p_1, \dots, p_m$ found by the categorical projection correspond to the values of the following statistical functionals:

$$\psi^{C}_i(\nu) = \mathbb{E}_{Z \sim \nu}\big[ h_i(\varsigma_m^{-1}(Z - \theta_i)) \big] , \quad i = 1, \dots, m \qquad (8.10)$$

where $(h_i)_{i=1}^m$ are the triangular and half-triangular kernels defining the categorical projection on $(\theta_i)_{i=1}^m$ (Section 5.6). An exact imputation strategy in this case is the function that returns the unique distribution supported on $(\theta_i)_{i=1}^m$ that matches the estimated functional values $p_i = \psi^{C}_i(\nu)$, $i = 1, \dots, m$:

$$(p_1, \dots, p_m) \mapsto \sum_{i=1}^{m} p_i \delta_{\theta_i} .$$ △
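To make the exactness claim concrete, here is a small sketch of the categorical functionals and their imputation. It assumes evenly spaced locations and the usual kernel shapes (triangular in the interior, half-triangular at the two ends); the function names and the test distribution are illustrative.

```python
def categorical_kernel(i, m, u):
    """Kernel h_i evaluated at u = (z - theta_i) / stride.

    Interior kernels are triangular; the first and last are half-triangular,
    so mass outside the location range is assigned to the nearest end.
    """
    if i == 0 and u <= 0.0:
        return 1.0
    if i == m - 1 and u >= 0.0:
        return 1.0
    return max(0.0, 1.0 - abs(u))


def categorical_functionals(nu, thetas):
    """Values (psi_1(nu), ..., psi_m(nu)) of Equation 8.10 for a discrete nu."""
    m = len(thetas)
    stride = thetas[1] - thetas[0]  # assumes evenly spaced locations
    return [
        sum(p * categorical_kernel(i, m, (z - thetas[i]) / stride) for z, p in nu)
        for i in range(m)
    ]


def impute_categorical(ps, thetas):
    """Exact imputation: the distribution with mass p_i at location theta_i."""
    return list(zip(thetas, ps))


thetas = [0.0, 1.0, 2.0]
nu = [(0.25, 0.5), (5.0, 0.5)]  # not supported on the locations
ps = categorical_functionals(nu, thetas)
ps_again = categorical_functionals(impute_categorical(ps, thetas), thetas)
```

Because each kernel equals 1 at its own location and 0 at every other location, applying the functionals to the imputed distribution returns the same values, which is the exactness property of Definition 8.13.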

Mathematically, an exact imputation strategy always exists, because we defined imputation strategies in terms of valid functional values. However, there is no guarantee that an efficient algorithm exists to compute the application of this strategy to arbitrary functional values. In practice, we may favor approximate strategies with efficient implementations. For example, we may map functional values to probability distributions from a representation $\mathscr{F}$ by optimizing some notion of distance between functional values. The optimization process may not yield an exact match in $\mathscr{F}$ (one may not even exist) but can often be performed efficiently.

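Example 8.16. Let $\psi^{C}_1, \dots, \psi^{C}_m$ be the categorical functionals from Equation 8.10. Suppose we are given the corresponding functional values $p_1, \dots, p_m$ of a probability distribution $\nu$:

$$p_i = \psi^{C}_i(\nu) , \quad i = 1, \dots, m .$$

An approximate imputation strategy for these functionals is to find the $n$-quantile distribution ($n$ possibly different from $m$)

$$\nu_\theta = \frac{1}{n} \sum_{i=1}^{n} \delta_{\theta_i}$$

that best fits $p_1, \dots, p_m$ according to the loss

$$\mathcal{L}(\theta) = \sum_{i=1}^{m} \big| p_i - \psi^{C}_i(\nu_\theta) \big| . \qquad (8.11)$$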

Exercise 8.7 asks you to demonstrate that this strategy is approximate for

m >

Although in this context we know of an exact imputation strategy based on categorical distributions, this illustrates that it is possible to impute distributions from a different representation. △


8.5 Relationship to Distributional Dynamic Programming

In Chapter 5, we introduced distributional dynamic programming (DDP) as a class of methods that operates over return-distribution functions. In fact, every statistical functional dynamic programming algorithm is also a DDP algorithm (but not the other way around; see Exercise 8.8). This relationship is established by considering the implied representation

$$\mathscr{F} = \{\iota(s) : s \in I_\psi\} \subseteq \mathscr{P}(\mathbb{R})$$

and the projection $\Pi_{\mathscr{F}} = \iota \circ \psi$ (see Figure 8.3).

Figure 8.3 (diagram with nodes $\eta$, $\tilde\eta$, $\eta'$ and $s$, $s'$; not reproduced)
The interpretation of SFDP algorithms as distributional dynamic programming algorithms. Traversing along the diagram from $\eta$ to $\eta'$ corresponds to dynamic programming implementing a projected Bellman operator, while the path from $s$ to $s'$ corresponds to statistical functional dynamic programming (SFDP).

From this correspondence, we may establish the relationship between Bellman closedness and the notion of a diffusion-free projection developed in Chapter 5.

Proposition 8.17. Let $\psi$ be a Bellman-closed sketch. Then for any choice of exact imputation strategy $\iota : I_\psi \to \mathscr{P}_\psi(\mathbb{R})$, the projection operator $\Pi_{\mathscr{F}} = \iota\psi$ is diffusion-free. △

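Proof. We may directly check the diffusion-free property (omitting parentheses for conciseness):

$$\Pi_{\mathscr{F}} \mathcal{T}^\pi \Pi_{\mathscr{F}} = \iota\psi \mathcal{T}^\pi \iota\psi \overset{(a)}{=} \iota \mathcal{T}^\pi_\psi \psi\iota\psi \overset{(b)}{=} \iota \mathcal{T}^\pi_\psi \psi \overset{(a)}{=} \iota\psi \mathcal{T}^\pi = \Pi_{\mathscr{F}} \mathcal{T}^\pi ,$$

where steps marked (a) follow from the identity $\psi \mathcal{T}^\pi = \mathcal{T}^\pi_\psi \psi$, and (b) follows from the identity $\psi\iota\psi = \psi$ for any exact imputation strategy $\iota$ for $\psi$.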

Imputation strategies formalize how one might interpret functional values as parameters of a probability distribution. Naturally, the chosen imputation strategy affects the approximation artifacts from distributional dynamic programming, the rate of convergence, and whether the algorithm converges at all.

Compared with representation-based algorithms of the style introduced in Chapter 5, working with statistical functionals allows us to design the projection in two separate pieces: a sketch $\psi$ and an imputation strategy $\iota$. In particular, this makes it possible to learn statistical functionals that would be difficult to directly capture in a probability distribution representation. As the next section demonstrates, this allows us to create new kinds of distributional reinforcement learning algorithms.

8.6 Expectile Dynamic Programming

Expectiles form a family of statistical functionals parameterized by a level $\tau \in (0, 1)$. They extend the notion of the mean of a distribution ($\tau = 0.5$) similar to how quantiles extend the notion of a median. Expectiles have classically found application in econometrics and finance as a form of risk measure (see the bibliographical remarks for further details). Based on the principles of statistical functional dynamic programming, expectile dynamic programming (footnote 64) uses an approximate imputation strategy in order to iteratively estimate the expectiles of the return function.

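Definition 8.18. For a given $\tau \in (0, 1)$, the $\tau$-expectile of a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$ is

$$\psi^{E}_\tau(\nu) = \arg\min_{z \in \mathbb{R}} \mathrm{ER}_\tau(z; \nu) , \qquad (8.12)$$

where

$$\mathrm{ER}_\tau(z; \nu) = \mathbb{E}_{Z \sim \nu}\big[ |\mathbb{1}\{Z < z\} - \tau| \times (Z - z)^2 \big] \qquad (8.13)$$

is the expectile loss. △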

The loss appearing in Definition 8.18 is strongly convex (Boyd and Vandenberghe 2004) and bounded below by 0. As a consequence, Equation 8.12 has a unique minimizer for a given $\tau$; this verifies that the corresponding expectile is uniquely defined.

To understand the relationship to the mean functional and develop some intuition for the statistical property that an expectile encodes, observe that the mean of a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$ can be expressed as

$$\mu_1(\nu) = \arg\min_{z \in \mathbb{R}} \mathbb{E}_{Z \sim \nu}[(Z - z)^2] .$$

Similar to how a quantile is derived from a loss that weights errors asymmetrically (depending on whether the realization from $Z$ is smaller or greater than $z$), the expectile loss for $\tau \in (0, 1)$ is the asymmetric version of the above. For $\tau$ greater than $1/2$, one can think of the expectile as an "optimistic" summary of the distribution: a value that emphasizes outcomes that are greater than the mean. Conversely, for $\tau$ smaller than $1/2$, the corresponding expectile is in a sense "pessimistic."

64. The incremental analogue is called expectile temporal-difference learning (Rowland et al. 2019).

Expectile dynamic programming (EDP) estimates the values of a finite set of expectile functionals with values $0 < \tau_1 < \cdots < \tau_m < 1$. For a distribution $\nu \in \mathscr{P}_2(\mathbb{R})$, let us write

$$e_i = \psi^{E}_{\tau_i}(\nu) .$$

Given the collection of expectile values $e_1, \dots, e_m$, EDP uses an imputation strategy that outputs an $n$-quantile probability distribution that approximately has these expectile values.

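The imputation strategy finds a suitable reconstruction by finding a solution to a root-finding problem. To begin, this strategy outputs an $n$-quantile distribution $\hat\nu$, with $n$ possibly different from $m$:

$$\hat\nu = \frac{1}{n} \sum_{j=1}^{n} \delta_{\theta_j} .$$

Following Definition 8.13, for this imputation to be exact, the expectiles of $\hat\nu$ at $\tau_1, \dots, \tau_m$ should be equal to $e_1, \dots, e_m$:

$$\psi^{E}_{\tau_i}(\hat\nu) = e_i , \quad i = 1, \dots, m .$$

This constraint implies that the derivatives of the expectile loss, instantiated with $\tau_1, \dots, \tau_m$ and evaluated with $\hat\nu$, should all be 0:

$$\partial_z \mathrm{ER}_{\tau_i}(z; \hat\nu) \big|_{z = e_i} = 0 , \quad i = 1, \dots, m . \qquad (8.14)$$

Written out in full for the choice of $\hat\nu$ above, these derivatives take the form

$$\partial_z \mathrm{ER}_{\tau_i}(z; \hat\nu) \big|_{z = e_i} = \frac{1}{n} \sum_{j=1}^{n} 2 (e_i - \theta_j) \, |\mathbb{1}\{\theta_j < e_i\} - \tau_i| , \quad i = 1, \dots, m .$$

An alternative to the root-finding problem expressed in Equation 8.14 is the following optimization problem:

$$\text{minimise} \quad \sum_{i=1}^{m} \Big( \partial_z \mathrm{ER}_{\tau_i}(z; \hat\nu) \big|_{z = e_i} \Big)^2 . \qquad (8.15)$$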
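The quantities in Equations 8.14 and 8.15 are straightforward to evaluate for a uniform mixture of Dirac deltas. The following sketch is only an illustration: the function names are hypothetical, the expectile is computed by bisection on the loss derivative rather than by any method prescribed in the text, and no optimizer over $\theta$ is included.

```python
def expectile_grad(z, thetas, tau):
    """Derivative in z of the expectile loss for the uniform mixture of
    Dirac deltas located at `thetas`."""
    n = len(thetas)
    return sum(
        2.0 * (z - th) * abs((1.0 if th < z else 0.0) - tau) for th in thetas
    ) / n


def expectile(thetas, tau, num_steps=200):
    """tau-expectile of the uniform mixture, as the root of the derivative.

    The derivative is nondecreasing in z, nonpositive at min(thetas) and
    nonnegative at max(thetas), so bisection applies.
    """
    lo, hi = min(thetas), max(thetas)
    for _ in range(num_steps):
        mid = 0.5 * (lo + hi)
        if expectile_grad(mid, thetas, tau) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def imputation_objective(thetas, es, taus):
    """Objective of Equation 8.15: sum of squared derivatives at the targets."""
    return sum(expectile_grad(e, thetas, tau) ** 2 for e, tau in zip(es, taus))


taus = [0.1, 0.5, 0.9]
true_thetas = [0.0, 1.0, 2.0, 3.0]
targets = [expectile(true_thetas, tau) for tau in taus]

# The objective vanishes at locations whose expectiles match the targets
# and is positive at a poor candidate.
loss_at_truth = imputation_objective(true_thetas, targets, taus)
loss_elsewhere = imputation_objective([0.0, 0.0, 0.0, 0.0], targets, taus)
```

For $\tau = 0.5$ the computed expectile coincides with the mean of the mixture (here 1.5), and the expectiles increase with $\tau$. A full imputation strategy would minimize `imputation_objective` over `thetas` with an optimizer of one's choice.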

65. Of course, this particular form for the imputation strategy is a design choice; the reader is invited to consider what other imputation strategies might be sensible here.


A practical implementation of this imputation strategy therefore applies an optimization algorithm to the objective in Equation 8.15, or a root-finding method to Equation 8.14, viewed as functions of $\theta_1, \dots, \theta_n$. Because the optimization algorithm may return a solution that does not exactly satisfy Equation 8.14, this method is an approximate (rather than exact) imputation strategy. It can be used in the impute distributions step of Algorithm 8.1, yielding a dynamic programming algorithm that aims to approximately learn return-distribution expectiles. If the root-finding algorithm is always able to find $\hat{\nu}$ exactly satisfying Equation 8.14, then the imputation strategy is exact in this instance; otherwise, it is approximate. A specific implementation is explored in detail in Exercise 8.10.
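To make the optimization view concrete, the following sketch evaluates a squared-residual objective of the kind in Equation 8.15 and reduces it by gradient descent with backtracking. It is one possible design, not a prescribed implementation: the names `objective` and `impute`, the step-size rule, and the use of the almost-everywhere gradient are all assumptions made for illustration.

```python
def expectile_loss_derivative(z, thetas, tau):
    """Derivative in z of the expectile loss for the uniform mixture of Diracs at `thetas`."""
    n = len(thetas)
    return sum(2.0 * (z - t) * abs((1.0 if t < z else 0.0) - tau) for t in thetas) / n


def objective(thetas, taus, es):
    """Sum over i of the squared derivative at e_i (squared-residual form, assumed)."""
    return sum(expectile_loss_derivative(e, thetas, tau) ** 2 for tau, e in zip(taus, es))


def impute(taus, es, init_thetas, steps=500):
    """Backtracking gradient descent on the objective; returns the final atoms."""
    thetas = list(init_thetas)
    n = len(thetas)
    value = objective(thetas, taus, es)
    for _ in range(steps):
        residuals = [expectile_loss_derivative(e, thetas, tau) for tau, e in zip(taus, es)]
        # Almost-everywhere gradient: d(residual_i)/d(theta_j) = -(2/n) |1{theta_j < e_i} - tau_i|.
        grad = [
            sum(2.0 * res * (-2.0 / n) * abs((1.0 if t < e else 0.0) - tau)
                for res, tau, e in zip(residuals, taus, es))
            for t in thetas
        ]
        step = 1.0
        while step > 1e-12:
            candidate = [t - step * g for t, g in zip(thetas, grad)]
            candidate_value = objective(candidate, taus, es)
            if candidate_value < value:  # accept only strict improvements
                thetas, value = candidate, candidate_value
                break
            step *= 0.5
        else:
            break  # no improving step was found
    return thetas


# Expectiles at tau = 0.1, 0.5, 0.9 of the uniform distribution on {0, 1, 2, 5}.
taus = [0.1, 0.5, 0.9]
es = [2.0 / 3.0, 2.0, 4.0]
start = [0.0, 1.0, 2.0, 3.0]
imputed = impute(taus, es, start)
```

Because a step is accepted only when it lowers the objective, the returned atoms are never worse than the starting point, but nothing here guarantees that the objective reaches zero; this is the sense in which such a strategy is approximate.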

8.7 Inﬁnite Collections of Statistical Functionals

Thus far, our treatment of statistical functionals has focused on ﬁnite collec-

tions of statistical functionals – what we call a sketch. From a computational

standpoint, this is sensible since, to implement an SFDP algorithm, one needs

to be able to operate on individual functional values. On the other hand, in

Section 8.3, we saw that many sketches are not Bellman closed and must be

combined with an imputation strategy in order to perform dynamic program-

ming. An alternative, which we will study in greater detail in Chapter 10, is to

implicitly parameterize an inﬁnite family of statistical functionals.

Many (though not all) inﬁnite families of functionals provide a lossless

encoding of probability distributions and are consequently Bellman closed

– that is, knowing the values taken on by these functionals is equivalent to

knowing the distribution itself. We encode this property with the following

deﬁnition.

Definition 8.19. Let $\Psi$ be a set of statistical functionals. We say that $\Psi$ characterizes probability distributions over the real numbers if, for each $\nu \in \mathscr{P}(\mathbb{R})$, there is a unique collection of functional values $(\psi(\nu) : \psi \in \Psi)$. △

The following families of statistical functionals all characterize probability distributions over $\mathbb{R}$.

The cumulative distribution function. The functionals mapping distributions $\nu$ to the probabilities $\mathbb{P}_{Z \sim \nu}(Z \le z)$, indexed by $z \in \mathbb{R}$. Closely related are upper-tail probabilities,

$$\nu \mapsto \mathbb{P}_{Z \sim \nu}(Z \ge z) ,$$

and the quantile functionals

$$\nu \mapsto F_\nu^{-1}(\tau) ,$$

indexed by $\tau \in (0, 1)$.


The characteristic function. Functionals of the form

$$\nu \mapsto \mathbb{E}_{Z \sim \nu}\big[ e^{iuZ} \big] \in \mathbb{C} ,$$

indexed by $u \in \mathbb{R}$ (and where $i^2 = -1$). The corresponding collection of statistical values is the characteristic function of $\nu$, denoted $\chi_\nu$.

Moments and cumulants. The infinite collection of moment functionals $(\mu_p)_{p=1}^{\infty}$ does not unconditionally characterize the distribution $\nu$: there are distinct distributions that have the same sequence of moments. However, if the sequence of moments does not grow too quickly, uniqueness is restored. In particular, a sufficient condition for uniqueness is that the underlying distribution $\nu$ has a moment-generating function

$$u \mapsto \mathbb{E}_{Z \sim \nu}\big[ e^{uZ} \big] ,$$

which is finite in an open neighborhood of $u = 0$; see Remark 8.3 for further details. Under this condition, the moment-generating function itself also characterizes the distribution, as does the cumulant-generating function, defined as the logarithm of the moment-generating function,

$$u \mapsto \log \Big( \mathbb{E}_{Z \sim \nu}\big[ e^{uZ} \big] \Big) .$$

The cumulants $(\kappa_p)_{p=1}^{\infty}$ are defined through a power series expansion of the cumulant-generating function

$$\log \Big( \mathbb{E}_{Z \sim \nu}\big[ e^{uZ} \big] \Big) = \sum_{p=1}^{\infty} \kappa_p \frac{u^p}{p!} .$$

Under the condition that the moment-generating function is finite in an open neighborhood of the origin, the sequences of cumulants and moments are determined by one another, and so the sequence of cumulants is another characterization of the distribution under this condition.
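The statement that cumulants and moments determine one another can be illustrated with the standard recursion $\kappa_n = \mu_n - \sum_{k=1}^{n-1} \binom{n-1}{k-1} \kappa_k \mu_{n-k}$. The sketch below, whose function name and list layout are illustrative assumptions, converts a finite prefix of uncentered moments into the corresponding cumulants.

```python
from math import comb


def cumulants_from_moments(moments):
    """moments[p - 1] holds the p-th uncentered moment; cumulants use the same layout."""
    cumulants = []
    for n in range(1, len(moments) + 1):
        # Contribution of lower-order cumulants to the n-th moment.
        correction = sum(
            comb(n - 1, k - 1) * cumulants[k - 1] * moments[n - k - 1]
            for k in range(1, n)
        )
        cumulants.append(moments[n - 1] - correction)
    return cumulants


# First three uncentered moments of a Poisson distribution with rate 2.
poisson_moments = [2.0, 6.0, 22.0]
poisson_cumulants = cumulants_from_moments(poisson_moments)
```

All cumulants of a Poisson distribution equal its rate, so this example should return the rate three times.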

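Example 8.20. Consider the return-variable Bellman equation

$$G^\pi(x, a) \overset{\mathcal{D}}{=} R + \gamma G^\pi(X', A') , \quad X = x , \ A = a .$$

If for each $u \in \mathbb{R}$ we apply the functional $\nu \mapsto \mathbb{E}_{Z \sim \nu}[e^{iuZ}]$ to the distribution of the random variables on each side, we obtain the characteristic function Bellman equation:

$$\begin{aligned}
\chi_{\eta^\pi(x,a)}(u) &= \mathbb{E}_\pi\big[ e^{iu(R + \gamma G^\pi(X', A'))} \mid X = x, A = a \big] \\
&= \mathbb{E}_\pi\big[ e^{iuR} \mid X = x, A = a \big] \, \mathbb{E}_\pi\big[ e^{i \gamma u G^\pi(X', A')} \mid X = x, A = a \big] \\
&= \chi_{P_{\mathcal{R}}(\cdot \mid x, a)}(u) \, \mathbb{E}_\pi\big[ \chi_{\eta^\pi(X', A')}(\gamma u) \mid X = x, A = a \big] .
\end{aligned}$$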


This is a different kind of distributional Bellman equation in which the addition of independent random variables corresponds to a multiplication of their characteristic functions. The equation highlights that the characteristic function evaluated at $u$ depends on the next-state characteristic functions evaluated at $\gamma u$. This shows that for a set $S \subseteq \mathbb{R}$, the sketch $(\nu \mapsto \chi_\nu(u) : u \in S)$ cannot be Bellman closed unless $S$ is infinite or $S = \{0\}$. Exercise 8.12 asks you to give a theoretical analysis of a dynamic programming approach based on characteristic functions. △
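The product structure of the characteristic function Bellman equation can be checked numerically in the simplest setting: a state that transitions deterministically to a successor, with a reward drawn independently of the successor's return. The distributions below are arbitrary illustrative choices, and `char_fn` and `bellman_gap` are hypothetical helper names.

```python
import cmath


def char_fn(atoms, probs, u):
    """Characteristic function of a finitely supported distribution, evaluated at u."""
    return sum(p * cmath.exp(1j * u * z) for z, p in zip(atoms, probs))


gamma = 0.9
reward_atoms, reward_probs = [0.0, 1.0], [0.3, 0.7]
next_atoms, next_probs = [-1.0, 2.0, 4.0], [0.2, 0.5, 0.3]

# Distribution of R + gamma * G at the successor, with R independent of G.
return_atoms = [r + gamma * g for r in reward_atoms for g in next_atoms]
return_probs = [pr * pg for pr in reward_probs for pg in next_probs]


def bellman_gap(u):
    """Absolute difference between the two sides of the equation at frequency u."""
    lhs = char_fn(return_atoms, return_probs, u)
    rhs = char_fn(reward_atoms, reward_probs, u) * char_fn(next_atoms, next_probs, gamma * u)
    return abs(lhs - rhs)
```

Note that the right-hand side queries the successor's characteristic function at $\gamma u$ rather than at $u$, which is the source of the closedness obstruction for finite sets $S$.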

Another way to understand collections of statistical functionals that are characterizing (in the sense of Definition 8.19) is to interpret them in light of our definition of a probability distribution representation (Definition 5.2). Recall that a representation $\mathscr{F}$ is a collection of distributions indexed by a parameter $\theta$:

$$\mathscr{F} = \big\{ P_\theta \in \mathscr{P}(\mathbb{R}) : \theta \in \Theta \big\} .$$

Here, the functional values associated with the set of statistical functionals $\Psi$ correspond to the (infinite-dimensional) parameter $\theta$, so that

$$\mathscr{F}_\Psi = \mathscr{P}(\mathbb{R}) .$$

This clearly implies that $\mathscr{F}_\Psi$ is closed under the distributional Bellman operator (Section 5.3) and hence that approximation-free distributional dynamic programming is (mathematically) possible with $\mathscr{F}_\Psi$.

8.8 Moment Temporal-Diﬀerence Learning*

In Section 8.2, we introduced the $m$-moment Bellman operator, from which an exact dynamic programming algorithm can be derived. A natural follow-up is to apply the tools of Chapter 6 to derive an incremental algorithm for learning the moments of the return-distribution function from samples. Here, an algorithm that incrementally updates an estimate $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ of the $m$ first moments of the return function can be directly obtained through the unbiased estimation approach, as the corresponding operator can be written as an expectation. Given a sample transition $(x, a, r, x', a')$, the unbiased estimation approach yields the update rule (for $i = 1, \dots, m$)

$$M(x, a, i) \leftarrow (1 - \alpha) M(x, a, i) + \alpha \left( \sum_{j=0}^{i} \binom{i}{j} \gamma^{i-j} r^{j} M(x', a', i - j) \right) , \tag{8.16}$$

where again we take $M(\cdot, \cdot, 0) = 1$ by convention.
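The update rule in Equation 8.16 translates directly into code. The sketch below assumes a tabular estimate stored as a dictionary from state-action pairs to lists of moment estimates; this data layout and the function name `moment_td_update` are illustrative choices rather than part of the text.

```python
from math import comb


def moment_td_update(M, x, a, r, x_next, a_next, gamma, alpha):
    """Apply one moment TD update to M[(x, a)], a list of the first m moment estimates."""
    m = len(M[(x, a)])
    # Prepend the zeroth moment, which equals 1 by convention.
    next_moments = [1.0] + list(M[(x_next, a_next)])
    # Compute all targets before writing, so that self-transitions use the old estimates.
    targets = []
    for i in range(1, m + 1):
        target = sum(
            comb(i, j) * gamma ** (i - j) * r ** j * next_moments[i - j]
            for j in range(i + 1)
        )
        targets.append(target)
    M[(x, a)] = [(1.0 - alpha) * old + alpha * new for old, new in zip(M[(x, a)], targets)]
    return M


# Two moments tracked per state-action pair; the successor has moments (1, 2).
M = {("x", "a"): [0.0, 0.0], ("y", "b"): [1.0, 2.0]}
moment_td_update(M, "x", "a", r=1.0, x_next="y", a_next="b", gamma=0.5, alpha=0.5)
```

With $r = 1$ and $\gamma = 0.5$, the sample targets are $1 + 0.5 \cdot 1 = 1.5$ for the first moment and $1 + 2 \cdot 0.5 \cdot 1 + 0.25 \cdot 2 = 2.5$ for the second, so a step size of $0.5$ from zero moves the estimates halfway to these values.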

Unlike the TD and CTD algorithms analyzed in Chapter 6, this algorithm is derived from an operator, $T^\pi_{(m)}$, which is not a contraction in a supremum-norm over states. As a result, the theory developed in Chapter 6 cannot immediately


be applied to demonstrate convergence of this algorithm under appropriate

conditions. With some care, however, a proof is possible; we now give an

overview of what is needed.

The proof of Proposition 8.7 demonstrates that the behavior of $T^\pi_{(m)}$ is closely related to that of a contraction mapping. Specifically, the behavior of $T^\pi_{(m)}$ in updating the estimates of the $i$th moments of returns is contractive if the lower moment estimates are sufficiently close to their correct values. To turn these observations into a proof of convergence, an inductive argument on the moments being learnt must be made, as in the proof of Proposition 8.7. Further, the approach of Chapter 6 needs to be extended to deal with a vanishing bias term in the update to account for this “near-contractivity” of $T^\pi_{(m)}$; to this end one may, for example, begin from the analysis of Bertsekas and Tsitsiklis (1996, Proposition 4.5).

Before moving on, let us remark that in practice, we are likely to be interested in centered moments such as the variance ($m = 2$); these take the form

$$\mathbb{E}_\pi \left[ \Big( \sum_{t=0}^{\infty} \gamma^t R_t - Q^\pi(x, a) \Big)^{m} \ \Big| \ X_0 = x, A_0 = a \right] .$$

These can be derived from their uncentered counterparts; for example, the variance of the return distribution $\eta^\pi(x, a)$ is obtained from the first two uncentered moments via Equation 8.2.

It is also possible to perform dynamic programming on centered moments directly, as was shown in the context of the mean and variance in Section 5.4 (Exercise 8.14 asks you to derive the Bellman operators for the more general case of the first $m$ centered moments). Given in terms of state-action pairs, the Bellman equation for the return variances $\bar{M}^\pi(\cdot, \cdot, 2) \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$ is

$$\bar{M}^\pi(x, a, 2) = \mathrm{Var}_\pi(R \mid X = x, A = a) + \gamma^2 \Big( \mathrm{Var}_\pi\big( Q^\pi(X', A') \mid X = x, A = a \big) + \mathbb{E}_\pi\big[ \bar{M}^\pi(X', A', 2) \mid X = x, A = a \big] \Big) ; \tag{8.17}$$

contrast with Equation 5.20.
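As a consistency check, the variance obtained by iterating Equation 8.17 can be compared with the variance recovered from the first two uncentered moments. The sketch below does this on a small, arbitrary two-element example in which each index stands for a state-action pair under a fixed policy; it assumes that the reward is independent of the next state-action pair given the current one, and all names are illustrative.

```python
def matvec(P, v):
    """Multiply a row-stochastic matrix (list of rows) by a vector."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def return_variances(P, r_mean, r_var, gamma, sweeps=3000):
    """Return variances computed two ways: centered recursion and uncentered moments."""
    n = len(P)
    q = [0.0] * n
    for _ in range(sweeps):  # iterative policy evaluation for the mean
        pq = matvec(P, q)
        q = [r_mean[s] + gamma * pq[s] for s in range(n)]
    pq = matvec(P, q)
    pq2 = matvec(P, [v * v for v in q])
    var_next_q = [pq2[s] - pq[s] ** 2 for s in range(n)]  # Var(Q(X', A') | x, a)

    centered = [0.0] * n
    for _ in range(sweeps):  # recursion of the form of Equation 8.17
        pv = matvec(P, centered)
        centered = [r_var[s] + gamma ** 2 * (var_next_q[s] + pv[s]) for s in range(n)]

    m2 = [0.0] * n
    for _ in range(sweeps):  # second uncentered moment
        pm2 = matvec(P, m2)
        m2 = [
            r_var[s] + r_mean[s] ** 2 + 2.0 * gamma * r_mean[s] * pq[s] + gamma ** 2 * pm2[s]
            for s in range(n)
        ]
    from_uncentered = [m2[s] - q[s] ** 2 for s in range(n)]
    return centered, from_uncentered


P = [[0.5, 0.5], [0.0, 1.0]]
centered, from_uncentered = return_variances(P, r_mean=[0.5, 2.0], r_var=[0.25, 0.0], gamma=0.9)
```

The second index has a deterministic reward and a self-loop, so its return variance is zero under both routes.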

One challenge with deriving an incremental algorithm for learning the variance directly is that unbiasedly estimating some of the variance terms on the right-hand side requires multiple samples. For example, an unbiased estimator of

$$\mathrm{Var}_\pi\big( Q^\pi(X', A') \mid X = x, A = a \big)$$

in general requires two independent realizations of $X', A'$ for a given source state-action pair $x, a$. Consequently, unbiased estimation of the corresponding operator application with a single transition is not feasible in this case. Despite the fact that the first $m$ centered and uncentered moments of a probability


distribution can be recovered from one another, there is a distinct advantage

associated with working with uncentered moments when learning from samples.
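The need for two independent realizations can be seen from the identity $\mathbb{E}\big[ \tfrac{1}{2} (Q_1 - Q_2)^2 \big] = \mathrm{Var}(Q)$ for independent copies $Q_1, Q_2$ of $Q$. The sketch below verifies this identity exactly by enumeration for an arbitrary finitely supported distribution standing in for the next-state values; the function names are illustrative.

```python
from itertools import product


def variance(atoms, probs):
    """Variance of a finitely supported distribution."""
    mean = sum(p * q for q, p in zip(atoms, probs))
    return sum(p * (q - mean) ** 2 for q, p in zip(atoms, probs))


def expected_pair_estimate(atoms, probs):
    """Exact expectation of 0.5 * (Q1 - Q2)^2 over independent draws Q1, Q2."""
    support = list(zip(atoms, probs))
    return sum(
        p1 * p2 * 0.5 * (q1 - q2) ** 2
        for (q1, p1), (q2, p2) in product(support, repeat=2)
    )


next_values = [0.0, 1.0, 4.0]
next_probs = [0.2, 0.5, 0.3]
```

With a single realization, the analogous quantity $\tfrac{1}{2}(Q_1 - Q_1)^2$ is identically zero, which illustrates why one transition does not suffice here.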

8.9 Technical Remarks

Remark 8.1. Theorem 8.12 illustrates how dynamic programming over functional values must incur some approximation error, unless the underlying sketch is Bellman closed. One way to avoid this error is to augment the state space with additional information: for example, the return accumulated so far. We took this approach when optimizing the conditional value-at-risk (CVaR) of the return in Chapter 7; indeed, risk measures are statistical functionals that may also take on the value $-\infty$ (see Definition 7.14). △

Remark 8.2 (Proof of Theorem 8.12). It is sufficient to consider a pair of states, $x$ and $y$, such that $x$ deterministically transitions to $y$ with reward $r$. Because $\psi$ is Bellman closed, we can identify an associated Bellman operator $\mathcal{T}^\pi_\psi$. For a given return function $\eta$ whose state-indexed collection of functional values is $s = \psi(\eta)$, let us write $(\mathcal{T}^\pi_\psi s)_i(x)$ for the $i$th functional value at state $x$, for $i = 1, \dots, m$. By construction and definition of the operator $\mathcal{T}^\pi_\psi$, $(\mathcal{T}^\pi_\psi s)_i(x)$ is a function of the functional values at $y$ as well as the reward $r$ and discount factor $\gamma$, and so we may write

$$(\mathcal{T}^\pi_\psi s)_i(x) = g_i\big( r, \gamma, \psi_1(\eta(y)), \dots, \psi_m(\eta(y)) \big)$$

for some function $g_i$. We next argue that $g_i$ is affine in the inputs $\psi_1(\eta(y)), \dots, \psi_m(\eta(y))$. This is readily observed as each functional $\psi_1, \dots, \psi_m$ is affine in its input distribution,

$$\begin{aligned}
\psi_i(\alpha \nu + (1 - \alpha) \nu') &= \mathbb{E}_{Z \sim \alpha \nu + (1 - \alpha) \nu'}[ f_i(Z) ] \\
&= \alpha \, \mathbb{E}_{Z \sim \nu}[ f_i(Z) ] + (1 - \alpha) \, \mathbb{E}_{Z \sim \nu'}[ f_i(Z) ] \\
&= \alpha \psi_i(\nu) + (1 - \alpha) \psi_i(\nu') ,
\end{aligned}$$

and

$$(\mathcal{T}^\pi_\psi s)_i(x) = \mathbb{E}_{Z \sim \eta(y)}[ f_i(r + \gamma Z) ]$$

is also affine as a function of $\eta(y)$. This affineness would be contradicted if $g_i$ were not also affine. Hence, there exist functions $\beta_{ij} : \mathbb{R} \times [0, 1) \to \mathbb{R}$ for $i = 1, \dots, m$, $j = 0, \dots, m$ such that

$$g_i\big( r, \gamma, \psi_1(\eta(y)), \dots, \psi_m(\eta(y)) \big) = \beta_{i0}(r, \gamma) + \sum_{j=1}^{m} \beta_{ij}(r, \gamma) \psi_j(\eta(y)) ,$$

and therefore

$$\mathbb{E}_{Z \sim \eta(y)}[ f_i(r + \gamma Z) ] = \mathbb{E}_{Z \sim \eta(y)} \Big[ \sum_{j=0}^{m} \beta_{ij}(r, \gamma) f_j(Z) \Big] ,$$

where $f_0(z) = 1$. Taking $\eta(y)$ to be a Dirac delta $\delta_z$ then gives the following identity:

$$f_i(r + \gamma z) = \sum_{j=0}^{m} \beta_{ij}(r, \gamma) f_j(z) .$$

We therefore have that the finite-dimensional function space spanned by $f_0, f_1, \dots, f_m$ (where $f_0$ is the constant function equal to 1) is closed under translation (by $r \in \mathbb{R}$) and scaling (by $\gamma \in [0, 1)$). Engert (1970) shows that the only finite-dimensional subspaces of measurable functions closed under translation are contained in the span of finitely many functions of the form $z \mapsto z^{\ell} \exp(\lambda z)$, with $\ell \in \mathbb{N}$ and $\lambda \in \mathbb{C}$. Since we further require closure under scaling by $\gamma \in [0, 1)$, we deduce that we must have $\lambda = 0$ in any such function, and the subspace must be equal to the space spanned by the first $n$ monomials (and the constant function).

To conclude, since each monomial $z \mapsto z^i$ for $i = 1, \dots, n$ is expressible as a linear combination of $f_0, \dots, f_m$, the corresponding expectations $\mathbb{E}_{Z \sim \nu}[Z^i]$ are expressible as linear combinations of the expectations $\mathbb{E}_{Z \sim \nu}[f_j(Z)]$, for any distribution $\nu$. The converse also holds, and so we conclude that the sketch $\psi$ encodes the same distributional information as the first $n$ moments. △

66. Recall that a function $h : M \to M'$ between vector spaces $M$ and $M'$ is affine if for $u_1, u_2 \in M$, $\lambda \in (0, 1)$, we have $h(\lambda u_1 + (1 - \lambda) u_2) = \lambda h(u_1) + (1 - \lambda) h(u_2)$.

Remark 8.3. The question of whether a distribution is characterized by its sequence of moments has been a subject of study in probability theory for over a century. The sufficient condition on the moment-generating function described in Section 8.7 means that the characteristic function of such a distribution can be written as a power series with scaled moments as coefficients, ensuring uniqueness of the distribution; see, for example, Billingsley (2012) for a detailed discussion. Lin (2017) gives a survey of known sufficient conditions for characterization, as well as examples where characterization does not hold. △

8.10 Bibliographical Remarks

8.1.

Statistical functionals are a core notion in statistics; see, for example,

the classic text by van der Vaart (2000). In reinforcement learning, speciﬁc


functionals such as moments, quantiles, and CVaR have been of interest for

risk-sensitive control (more on this in the bibliographical remarks of Chapter 7).

Chandak et al. (2021) consider the problem of oﬀ-policy Monte Carlo policy

evaluation of arbitrary statistical functionals of the return distribution.

8.2, 8.8.

Sobel (1982) gives a Bellman equation for return-distribution moments

for state-indexed value functions with deterministic policies. More recent work

in this direction includes that of Lattimore and Hutter (2012), Azar et al. (2013),

and Azar et al. (2017), who make use of variance estimates in combination with

Bernstein’s inequality to improve the eﬃciency of exploration algorithms, as

well as the work of White and White (2016), who use estimated return variance

to set trace coeﬃcients in multistep TD learning methods. Sato et al. (2001),

Tamar et al. (2012), Tamar et al. (2013), and Prashanth and Ghavamzadeh

(2013) further develop methods for learning the variance of the return. Tamar

et al. (2016) show that the operator $T^\pi_{(2)}$ is a contraction under a weighted

norm (see Exercise 8.4), develop an incremental algorithm with a proof of

convergence using the ODE method, and study both dynamic programming

and incremental algorithms under linear function approximation (the topic of

Chapter 9).

8.3–8.5.

The notion of Bellman closedness is due to Rowland et al. (2019),

although our presentation here is a revised take on the idea. The noted connec-

tion between Bellman closedness and diﬀusion-free representations and the

term “statistical functional dynamic programming” are new to this book.

8.6.

The expectile dynamic programming algorithm is new to this book but

is directly derived from expectile temporal-diﬀerence learning (Rowland et

al. 2019). Expectiles themselves were introduced by Newey and Powell (1987)

in the context of testing in econometric regression models, with the asymmetric

squared loss deﬁning expectiles already appearing in Aigner et al. (1976).

Expectiles have since found further application as risk measures, particularly

within ﬁnance (Taylor 2008; Kuan et al. 2009; Bellini et al. 2014; Ziegel

2016; Bellini and Di Bernardino 2017). Our presentation here focuses on the

asymmetric squared loss, requiring a ﬁnite second-moment assumption, but an

equivalent deﬁnition allows expectiles to be deﬁned for all distributions with a

ﬁnite ﬁrst moment (Newey and Powell 1987).

8.7.

The study of characteristic functions in distributional reinforcement learn-

ing is due to Farahmand (2019), who additionally provides error propagation

analysis for the characteristic value iteration algorithm, in which value iteration

is carried out with characteristic function representations of return distributions.

Earlier, Mandl (1971) studied the characteristic function of the return in Markov

decision processes with deterministic immediate rewards and policies. Chow

et al. (2015) combine a state augmentation method (see Chapter 7) with an


inﬁnite-dimensional Bellman equation for CVaR values to learn a CVaR-optimal

policy. They develop an implementable version of the algorithm by tracking

ﬁnitely many CVaR values and using linear interpolation for the remainder, an

approach related to the imputation strategies described earlier in the chapter.

Characterization via the quantile function has driven the success of several large-

scale distributional reinforcement learning algorithms (Dabney et al. 2018a;

Yang et al. 2019), and is the subject of further study in Chapter 10.

8.11 Exercises

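Exercise 8.1. Consider the $m$-moment Bellman operator $T^\pi_{(m)}$ (Definition 8.6). For $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$, define the norm

$$\| M \|_{\infty, \max} = \max_{i \in \{1, \dots, m\}} \sup_{x \in \mathcal{X}, a \in \mathcal{A}} | M(x, a, i) | .$$

By means of a counterexample, show that $T^\pi_{(m)}$ is not a contraction mapping in the metric induced by $\| \cdot \|_{\infty, \max}$. △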

Exercise 8.2. Let $\varepsilon > 0$. Determine a bound on the computational cost (in $O(\cdot)$ notation) of performing iterative policy evaluation with the $m$-moment Bellman operator to obtain an approximation $\hat{M}$ such that

$$\max_{i \in \{1, \dots, m\}} \sup_{x \in \mathcal{X}, a \in \mathcal{A}} | \hat{M}(x, a, i) - M^\pi(x, a, i) | < \varepsilon .$$

You may find it convenient to refer to the proof of Proposition 8.7. △

Exercise 8.3. Equation 5.2 gives the value function $V^\pi$ as the solution of the linear system of equations

$$V = r_\pi + \gamma P_\pi V .$$

Provide the analogous linear system for the moment function $M^\pi$. △

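Exercise 8.4. The purpose of this exercise is to show that $T^\pi_{(2)}$ is a contraction mapping on $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$ in a weighted $L^\infty$ norm, as shown by Tamar et al. (2016). Let $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$ be a moment function estimate (specifically, for the first two moments). For each $\alpha \in (0, 1)$, define the $\alpha$-weighted norm on $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$ by

$$\| M \|_\alpha = \alpha \| M^{(1)} \|_\infty + (1 - \alpha) \| M^{(2)} \|_\infty ,$$

where $M^{(i)} = M(\cdot, \cdot, i) \in \mathbb{R}^{\mathcal{X} \times \mathcal{A}}$. For any $M, M' \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times 2}$, show that

$$\| T^\pi_{(2)} M - T^\pi_{(2)} M' \|_\alpha \le \alpha \| \gamma P^\pi ( M^{(1)} - M'^{(1)} ) \|_\infty + (1 - \alpha) \| 2 \gamma C_r P^\pi ( M^{(1)} - M'^{(1)} ) + \gamma^2 P^\pi ( M^{(2)} - M'^{(2)} ) \|_\infty ,$$

where $P^\pi$ is the state-action transition operator, defined by

$$(P^\pi Q)(x, a) = \sum_{(x', a') \in \mathcal{X} \times \mathcal{A}} P_{\mathcal{X}}(x' \mid x, a) \, \pi(a' \mid x') \, Q(x', a') ,$$

and $C_r$ is the diagonal reward operator

$$(C_r Q)(x, a) = \mathbb{E}[R \mid X = x, A = a] \, Q(x, a) .$$

Writing $\lambda \ge 0$ for the Lipschitz constant of $C_r$ with respect to the $L^\infty$ metric, deduce that

$$\| T^\pi_{(2)} M - T^\pi_{(2)} M' \|_\alpha \le \big( \alpha \gamma + 2 (1 - \alpha) \gamma \lambda \big) \| M^{(1)} - M'^{(1)} \|_\infty + (1 - \alpha) \gamma^2 \| M^{(2)} - M'^{(2)} \|_\infty .$$

Hence, deduce that there exist parameters $\alpha \in (0, 1)$, $\beta \in [0, 1)$ such that

$$\| T^\pi_{(2)} M - T^\pi_{(2)} M' \|_\alpha \le \beta \| M - M' \|_\alpha ,$$

as required. △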

Exercise 8.5. Consider the median functional

$$\nu \mapsto F_\nu^{-1}(0.5) .$$

Show that there does not exist a function $f : \mathbb{R} \to \mathbb{R}$ such that, for any $\nu \in \mathscr{P}(\mathbb{R})$,

$$\mathbb{E}_{Z \sim \nu}[ f(Z) ] = F_\nu^{-1}(0.5) .$$ △

Exercise 8.6. Consider the subset of probability distributions endowed with a probability density $f_\nu$. Repeat the preceding exercise for the differential entropy functional

$$\nu \mapsto - \int_{z \in \mathbb{R}} f_\nu(z) \log\big( f_\nu(z) \big) \, dz .$$ △

Exercise 8.7. For the imputation strategy of Example 8.16:

(i) show that for $m = 2$, the imputation strategy is exact, for any $n \in \mathbb{N}$;

(ii) show that for $m > 2$, this imputation strategy is inexact. Hint. Find a distribution $\nu$ for which $\psi_i(\iota(p_1, \dots, p_m)) \neq p_i$, for some $i = 1, \dots, m$. △

Exercise 8.8.

In Section 8.4, we argued that every statistical functional dynamic

programming algorithm is a distributional dynamic programming algorithm.

Explain why the converse is false. Under what circumstances may we favor

either an algorithm that operates on statistical functionals or one that operates

on probability distribution representations? △

Exercise 8.9. Consider an imputation strategy $\iota$ for a sketch $\psi$. We say the $(\psi, \iota)$ pair is mean-preserving if, for any probability distribution $\nu \in \mathscr{P}_\psi(\mathbb{R})$,

$$\nu' = \iota \psi(\nu)$$

satisfies

$$\mathbb{E}_{Z \sim \nu'}[Z] = \mathbb{E}_{Z \sim \nu}[Z] .$$

Show that in this case, the operator $\psi \circ \mathcal{T}^\pi \circ \iota$ is also mean-preserving. △

Exercise 8.10. Using your favorite numerical computation software, implement the expectile imputation strategy described in Section 8.6. Specifically:

(i) Implement a procedure for approximately determining the expectile values $e_1, \dots, e_m$ of a given distribution. Hint. An incremental approach in the style of quantile regression, or a binary search approach, will allow you to deal with continuous distributions.

(ii) Given a set of expectile values, $e_1, \dots, e_m$, implement a procedure that imputes an $n$-quantile distribution

$$\frac{1}{n} \sum_{i=1}^{n} \delta_{\theta_i}$$

by minimizing the objective given in Equation 8.15.

Test your implementation on discrete and continuous distributions, and compare it with the best $m$-quantile approximation of those distributions. Is one method better suited to discrete distributions than the other? More generally, when might one method be preferred over the other? △

Exercise 8.11. Formulate a variant of expectile dynamic programming that imputes $n$-quantile distributions and whose (possibly approximate) imputation strategy is mean-preserving in the sense of Exercise 8.9. △

Exercise 8.12 (*). This exercise applies the line of reasoning from Chapter 4 to characteristic functions and is based on Farahmand (2019). For a probability distribution $\nu$, recall that its characteristic function $\chi_\nu$ is

$$\chi_\nu(u) = \mathbb{E}_{Z \sim \nu}\big[ e^{iuZ} \big] .$$

Now, for $p \in [1, \infty)$, define the probability metric

$$d_{1,p}(\nu, \nu') = \int_{u \in \mathbb{R}} \frac{ | \chi_\nu(u) - \chi_{\nu'}(u) | }{ |u|^p } \, du$$

and its supremum extension to return functions

$$\bar{d}_{1,p}(\eta, \eta') = \sup_{(x, a) \in \mathcal{X} \times \mathcal{A}} d_{1,p}\big( \eta(x, a), \eta'(x, a) \big) .$$

(i) Determine a subset $\mathscr{P}_{\chi,p}(\mathbb{R}) \subseteq \mathscr{P}(\mathbb{R})$ on which $d_{1,p}$ is a proper metric.

(ii) Provide assumption(s) under which the return function $\eta^\pi$ lies in $\mathscr{P}_{\chi,p}(\mathbb{R})^{\mathcal{X} \times \mathcal{A}}$.

(iii) Prove that for $p \ge 2$, the distributional Bellman operator is a contraction mapping in $\bar{d}_{1,p}$, with modulus $\gamma^{p-1}$. △

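Exercise 8.13 (*). Consider the probability metric

$$d_{2,2}(\nu, \nu') = \left( \int_{u \in \mathbb{R}} \frac{ | \chi_\nu(u) - \chi_{\nu'}(u) |^2 }{ |u|^2 } \, du \right)^{1/2} .$$

Show that $d_{2,2}$ is the Cramér distance $\ell_2$. Hint. Use the Parseval–Plancherel identity. △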

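Exercise 8.14. Let $m \in \mathbb{N}^+$. Derive a Bellman operator on $\mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ whose unique fixed point $\bar{M}^\pi$ is the collection of centered moments:

$$\bar{M}^\pi(x, a, i) = \mathbb{E}_\pi\Big[ \big( G^\pi(x, a) - Q^\pi(x, a) \big)^{i} \Big] , \quad i = 1, \dots, m .$$ △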
