5.4.
Sobel [1982] is usually cited as the source of the Bellman equation for the variance and its associated operator. The equation plays an important role in theoretical analyses [Lattimore and Hutter, 2012, Azar et al., 2013]. Tamar et al. [2016] study the variance equation in the context of function approximation.
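In one common notation (ours, for illustration; the symbols here are not fixed by this section), with V^π the value function and σ²_π(x) the variance of the return from state x, Sobel's equation takes the form

```latex
\sigma^2_\pi(x)
  = \mathbb{E}_\pi\!\left[\big(R + \gamma V^\pi(X') - V^\pi(x)\big)^2 \,\middle|\, X = x\right]
  + \gamma^2\, \mathbb{E}_\pi\!\left[\sigma^2_\pi(X') \,\middle|\, X = x\right].
```

It follows from the law of total variance applied to the recursion G = R + γG′: the first term captures the one-step uncertainty in reward and transition, the second the discounted variance of the return from the next state.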
5.5.
The m-categorical representation was used in a distributional setting by Bellemare et al. [2017a], inspired by the success of categorical representations in generative modelling [Van den Oord et al., 2016]. Dabney et al. [2018b] introduced the quantile representation to avoid the inefficiencies of a fixed set of evenly-spaced locations, and to derive an algorithm more closely grounded in the Wasserstein distances.
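To illustrate the contrast, a minimal sketch (function name and direct computation are ours; Dabney et al.'s algorithm learns the locations incrementally rather than computing them from samples) of the m-quantile representation, which stores m adjustable locations corresponding to the quantile midpoints τ_i = (2i − 1)/(2m) of the return distribution:

```python
import numpy as np

def quantile_representation(return_samples, m):
    """Summarize a return distribution (given here by samples) with m
    particle locations at the quantile midpoints tau_i = (2i - 1) / (2m).

    Unlike the m-categorical representation, the locations adapt to the
    distribution; no fixed evenly-spaced support is needed."""
    taus = (2.0 * np.arange(1, m + 1) - 1.0) / (2.0 * m)
    return np.quantile(return_samples, taus)
```

For example, with four equally-weighted samples [0, 1, 2, 3] and m = 2, the midpoint quantile levels are 0.25 and 0.75.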
Morimura et al. [2010a] used the m-particle representation to design a risk-sensitive distributional reinforcement learning algorithm. In a similar vein, Maddison et al. [2017] used the same representation in the context of exponential utility reinforcement learning. Both approaches are closely related to particle filtering and sequential Monte Carlo methods [Gordon et al., 1993, Doucet et al., 2001, Särkkä, 2013, Naesseth et al., 2019, Chopin and Papaspiliopoulos, 2020], which rely on stochastic sampling and resampling procedures, in contrast to the deterministic dynamic programming methods of this chapter.
5.6, 5.8.
The categorical projection was originally proposed as an ad hoc solution to the need to map the output of the distributional Bellman operator back onto the support of the distribution. Its description as the expectation of a triangular kernel was given by Rowland et al. [2018], justifying its use from a theoretical perspective and providing the proof of Proposition 5.14. Lemma 5.16 is due to Dabney et al. [2018b].
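As a rough illustration (the function name and looping implementation are ours; practical implementations vectorize this step), the projection splits each particle's probability mass between the two nearest atoms of the fixed evenly-spaced support:

```python
import numpy as np

def categorical_projection(locations, probs, z):
    """Project a discrete distribution (particles `locations` with masses
    `probs`) onto the fixed evenly-spaced support `z`, distributing each
    particle's mass linearly between its two neighbouring atoms."""
    vmin, vmax = z[0], z[-1]
    dz = z[1] - z[0]
    out = np.zeros(len(z))
    for x, p in zip(locations, probs):
        x = min(max(x, vmin), vmax)    # clip to the support's range
        b = (x - vmin) / dz            # fractional atom index
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                   # x falls exactly on an atom
            out[lo] += p
        else:                          # split mass between neighbours
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out
```

Averaging the two weights p·(hi − b) and p·(b − lo) over the particle location is precisely the triangular-kernel expectation referred to above.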
5.7, 5.9–5.10.
The language and analysis of projected operators is inherited from the theoretical analysis of linear function approximation in reinforcement learning; canonical expositions may be found in Tsitsiklis and Van Roy [1997] and Lagoudakis and Parr [2003]. Because the space of probability distributions is not a vector space, the analysis here is somewhat different and, among other things, requires more technical care (as discussed in Chapter 4).
A version of Theorem 5.28 in the special case of CDP appears in Rowland et al. [2018]. Of note, in the linear function approximation setting the main technical argument revolves around the noncontractive nature of the stochastic matrix P^π in a weighted L² norm, whereas here it is due to the c-homogeneity of the analysis metric (and does not involve P^π). See Chapter 9.
5.11.
A discussion of the mean-preserving property is given in Lyle et al. [2019].