Included in this table is whether a representation is mean-preserving. We say that a representation F and its associated projection operator Π_F are mean-preserving if for any distribution ν from a suitable space, the mean of Π_F ν is the same as that of ν. The m-categorical algorithm presented in this chapter is mean-preserving provided the range of considered returns does not exceed the boundaries of its support; the m-quantile algorithm is not. In addition to the mean, we can also consider what happens to other aspects of the approximated distributions. For example, the m-categorical algorithm produces probability distributions that in general have more variance than those of the true return function.
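Both properties can be checked numerically. The sketch below implements the usual linear (triangular-kernel) allocation of each particle's mass between its two neighboring atoms; the function name and the test distribution are our own choices for illustration, not notation from the text.

```python
import numpy as np

def categorical_projection(support, probs, z):
    """Project a discrete distribution (support, probs) onto fixed atoms z.

    Each particle's mass is split linearly between the two nearest atoms;
    this preserves the mean whenever the particle lies inside [z[0], z[-1]].
    """
    p = np.zeros(len(z))
    dz = z[1] - z[0]                      # assumes evenly spaced atoms
    for x, w in zip(support, probs):
        x = np.clip(x, z[0], z[-1])       # clip to the supported range
        b = (x - z[0]) / dz               # fractional index on the atom grid
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                      # x lies exactly on an atom
            p[lo] += w
        else:
            p[lo] += w * (hi - b)         # closer atom receives more mass
            p[hi] += w * (b - lo)
    return p

z = np.linspace(0.0, 10.0, 11)            # atoms 0, 1, ..., 10
support = np.array([2.3, 7.6])
probs = np.array([0.4, 0.6])
p = categorical_projection(support, probs, z)

mean_before = (support * probs).sum()
mean_after = (z * p).sum()                # equal: the projection is mean-preserving
var_before = ((support - mean_before) ** 2 * probs).sum()
var_after = ((z - mean_after) ** 2 * p).sum()   # strictly larger: variance increases
```

Splitting mass between atoms leaves the first moment unchanged but spreads each particle out, which is exactly why the projected distributions carry extra variance.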
The choice of representation also has consequences beyond distributional
dynamic programming. In the next chapter, for example, we will consider the
design of incremental algorithms for learning return-distribution functions from
samples. There, we will see that the projected operator derived from the cate-
gorical representation directly translates into an incremental algorithm, while
the operator derived from the quantile representation does not. In Chapter 9,
we will also see how the choice of representation interplays with the use of
parameterized models such as neural networks to represent the return function
compactly. Another consideration, not listed here, concerns the sensitivity of the
algorithm to its parameters: for example, for small values of m, the m-quantile representation tends to be a better choice than the m-categorical representation, which suffers from large gaps between its particles' locations (as an extreme example, take m = 2).
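The m = 2 case can be made concrete with a small sketch (the variable names and test distribution are our own): for a return distribution concentrated near the middle of the range [0, 10], two quantile particles settle near the distribution's quartiles, while two categorical atoms are pinned to the ends of the range, 10 apart.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(5.0, 0.5, size=10_000)   # returns concentrated near 5

# m-quantile: particles at the quantile midpoints tau_i = (2i - 1) / (2m).
m = 2
taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)   # [0.25, 0.75]
quantile_particles = np.quantile(samples, taus)  # both land near 5

# m-categorical: atoms fixed at the ends of the range, z = [0, 10],
# with each sample's mass allocated linearly between the two atoms.
z = np.array([0.0, 10.0])
b = np.clip(samples, z[0], z[-1]) / (z[-1] - z[0])
probs = np.array([np.mean(1 - b), np.mean(b)])   # roughly half on each atom

# The categorical mean is still correct, but nearly all of the mass sits
# far from the data; the quantile particles track the distribution closely.
cat_mean = (probs * z).sum()
cat_var = (probs * (z - cat_mean) ** 2).sum()    # far above the true variance
```

With only two atoms the categorical representation must express a narrow distribution as a mixture of the two endpoints, preserving the mean at the cost of enormous spread; the quantile particles have no such constraint.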
The design and study of representations remains an active topic in distributional reinforcement learning. The representations we presented here are by no means an exhaustive portrait of the field. For example, Barth-Maron et al. (2018) considered using mixtures of m normal distributions in domains with vector-valued action spaces. Analyzing representations like these is more challenging because known projection methods suffer from local minima, which in turn implies that dynamic programming may give different (and possibly suboptimal) solutions for different initial conditions.
5.12 Technical Remarks
Remark 5.1 (Finiteness of R). In the algorithms presented in this chapter, we assumed that rewards are distributed on a finite set R. This is not actually needed for most of our analysis but makes it possible to compute expectations and convolutions in finite time and hence devise concrete dynamic programming algorithms. However, there are many problems in which the rewards are better modeled using continuous or unbounded distributions. Rewards generated from