Statistical Functionals 263
Here, the functional values associated with the set of statistical functionals correspond to the (infinite-dimensional) parameter $\theta$, so that $\mathscr{F} = \mathscr{P}(\mathbb{R})$. This clearly implies that $\mathscr{F}$ is closed under the distributional Bellman operator $\mathcal{T}^\pi$ (Section 5.3) and hence that approximation-free distributional dynamic programming is (mathematically) possible with $\mathscr{F}$.
8.8 Moment Temporal-Difference Learning*
In Section 8.2 we introduced the $m$-moment Bellman operator, from which an exact dynamic programming algorithm can be derived. A natural follow-up is to apply the tools of Chapter 6 to derive an incremental algorithm for learning the moments of the return-distribution function from samples. Here, an algorithm that incrementally updates an estimate $M \in \mathbb{R}^{\mathcal{X} \times \mathcal{A} \times m}$ of the first $m$ moments of the return function can be directly obtained through the unbiased estimation approach, as the corresponding operator can be written as an expectation. Given a sample transition $(x, a, r, x', a')$, the unbiased estimation approach yields the update rule (for $i = 1, \dots, m$)
$$
M(x, a, i) \leftarrow (1 - \alpha) M(x, a, i) + \alpha \sum_{j=0}^{i} \gamma^{i-j} \binom{i}{j} r^j M(x', a', i - j) \, , \qquad (8.16)
$$
where again we take M(·, ·, 0) = 1 by convention.
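As an illustration, one possible implementation of the update (8.16) is sketched below in NumPy. The array layout, function name, and parameter names are our own choices, not taken from the text; the $0$-th moment is fixed to $1$ by convention and therefore not stored.

```python
import numpy as np
from math import comb


def moment_td_update(M, x, a, r, xp, ap, alpha, gamma):
    """One moment-TD step as in Equation (8.16).

    M has shape (num_states, num_actions, m); M[x, a, i-1] estimates the
    i-th moment of the return from state-action pair (x, a). The 0-th
    moment is 1 by convention and is not stored in M.
    """
    m = M.shape[-1]
    new = M[x, a].copy()  # update all moments from the pre-update estimates
    for i in range(1, m + 1):
        # Sample target: sum_{j=0}^{i} gamma^{i-j} * C(i, j) * r^j * M(x', a', i-j)
        target = 0.0
        for j in range(i + 1):
            next_moment = 1.0 if i == j else M[xp, ap, i - j - 1]
            target += gamma ** (i - j) * comb(i, j) * r ** j * next_moment
        new[i - 1] = (1 - alpha) * M[x, a, i - 1] + alpha * target
    M[x, a] = new
    return M
```

Computing all the new estimates before writing them back ensures that, when $(x, a) = (x', a')$, the target is formed from the pre-update moment estimates.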
Unlike the TD and CTD algorithms analysed in Chapter 6, this algorithm is derived from an operator, $\mathcal{T}^\pi_{(m)}$, which is not a contraction in a supremum norm over states. As a result, the theory developed in Chapter 6 cannot immediately be applied to demonstrate convergence of this algorithm under appropriate conditions. With some care, however, a proof is possible; we now give an overview of what is needed.
The proof of Proposition 8.7 demonstrates that the behaviour of $\mathcal{T}^\pi_{(m)}$ is closely related to that of a contraction mapping. Specifically, the behaviour of $\mathcal{T}^\pi_{(m)}$ in updating the estimates of $i$th moments of returns is contractive if the lower moment estimates are sufficiently close to their correct values. To turn these observations into a proof of convergence, an inductive argument on the moments being learnt must be made, as in the proof of Proposition 8.7. Further, the approach of Chapter 6 needs to be extended to deal with a vanishing bias term in the update to account for this 'near-contractivity' of $\mathcal{T}^\pi_{(m)}$; to this end, one may for example begin from the analysis of Bertsekas and Tsitsiklis [1996, Proposition 4.5].