analogy of a frog hopping around a lily pond to convey the dynamics of a
Markov decision process; we find our own analogy more vivid. The special
consideration owed to infinite sequences is studied at length by Hutter [2005].
2.4.
The work of Veness et al. [2015] makes the return (as a random variable)
the central object of interest, and is the starting point of our own investigations
into distributional reinforcement learning. Issues regarding the existence of the
random return and a proper probabilistic formulation can be found in that paper.
An early formulation of the random-variable function can be found in Jaquette
[1973], who used it to study alternative optimality criteria. The Blackjack and
cliff-walking examples are adapted from Sutton and Barto [2018]; the latter
was also inspired by a trip to Ireland by one of the authors. In both cases we
put a special emphasis on the probability distribution of the random return.
The uniform distribution example is taken from Bellemare et al. [2017a]; such
discounted sums of Bernoulli random variables also have a long history in
probability theory [Jessen and Wintner, 1935]; see Solomyak [1995], Diaconis
and Freedman [1999], Peres et al. [2000], and references therein.
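As a small, self-contained illustration (our own sketch, not taken from the cited works; the function name and parameters are ours), the distribution of such a discounted Bernoulli sum can be approximated by Monte Carlo sampling. With discount factor 1/2 and fair coin flips, the sum is, up to truncation error, uniformly distributed on [0, 2]:

```python
import random

def sample_discounted_bernoulli_sum(gamma=0.5, p=0.5, horizon=50):
    """Draw one sample of G = sum_t gamma**t * B_t with B_t ~ Bernoulli(p),
    truncated at the given horizon; the neglected tail is at most
    gamma**horizon / (1 - gamma)."""
    return sum((gamma ** t) * (random.random() < p) for t in range(horizon))

# With gamma = 1/2 and p = 1/2, the limiting distribution is uniform
# on [0, 2] -- the example discussed by Bellemare et al. [2017a].
samples = [sample_discounted_bernoulli_sum() for _ in range(10_000)]
```

A histogram of `samples` is close to flat on [0, 2]; other choices of gamma yield the more intricate Bernoulli convolutions studied in the references above.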
2.5–2.6.
The Bellman equation is covered in most textbooks on the topic. A
particularly thorough treatment can be found in the work of Puterman [2014]
and Bertsekas [2012]. The former also provides a good discussion on the
implications of the Markov property and time-homogeneity.
2.7–2.8.
Bellman equations relating quantities other than expected returns were
originally introduced in the context of risk-sensitive control, at varying levels
of generality. The formulation of the distributional Bellman equation in terms
of cumulative distribution functions was first given by Sobel [1982], under the
assumption of deterministic rewards and policies. Chung and Sobel [1987] later
gave a version for random, bounded rewards. See also the work of White [1988]
for a review of some of these approaches, and Morimura et al. [2010a] for a
more recent presentation of the CDF Bellman equation.
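In our own notation (a sketch assuming a finite state space, discrete rewards, and a discount factor γ in (0, 1); it is not reproduced from the cited works), the CDF form of the Bellman equation referred to above reads:

```latex
F_{G^\pi(x)}(z) \;=\; \sum_{x',\, r} P^\pi(x', r \mid x)\,
F_{G^\pi(x')}\!\left(\frac{z - r}{\gamma}\right),
```

since, conditional on the reward r and successor state x', the event that the return from x is at most z is the event that the return from x' is at most (z − r)/γ.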
Other versions of the distributional Bellman equation have been phrased in terms
of moments [Sobel, 1982], characteristic functions [Mandl, 1971, Farahmand,
2019] and the Laplace transform [Howard and Matheson, 1972, Jaquette, 1973,
1976, Denardo and Rothblum, 1979], again at varying levels of generality, and in
some cases using the undiscounted return. Morimura et al. [2010b] also present
a version of the equation in terms of probability densities. The formulation of
the distributional Bellman equation in terms of pushforward distributions is due
to Rowland et al. [2018]; the pushforward notation is broadly used in measure-
theoretic probability, and our use of it is influenced by optimal transport theory
[Villani, 2008].
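To make the pushforward reading concrete (a minimal sketch of our own, assuming a finite Markov chain with finite-support return distributions; all names below are hypothetical), one application of the distributional Bellman operator pushes each atom z forward through the bootstrap map z ↦ r + γz and reweights its mass by the transition probability:

```python
from collections import defaultdict

def pushforward_bellman_step(eta, transitions, gamma):
    """Apply the distributional Bellman operator once.

    eta: maps each state to a list of (atom, probability) pairs -- the
        finite-support return distribution at that state.
    transitions: maps each state x to a list of (reward, next_state, prob)
        triples describing the chain induced by the policy.
    Each atom z of eta[next_state] is pushed forward to reward + gamma * z,
    keeping its probability mass scaled by the transition probability.
    """
    new_eta = {}
    for x, outcomes in transitions.items():
        atoms = defaultdict(float)
        for reward, next_state, prob in outcomes:
            for z, mass in eta[next_state]:
                atoms[reward + gamma * z] += prob * mass
        new_eta[x] = sorted(atoms.items())
    return new_eta

# A two-state chain: from "a" we earn reward 1 and move to "b"; from "b",
# reward 0 and back to "a". Returns are initialised to the point mass at 0.
transitions = {"a": [(1.0, "b", 1.0)], "b": [(0.0, "a", 1.0)]}
eta = {"a": [(0.0, 1.0)], "b": [(0.0, 1.0)]}
eta = pushforward_bellman_step(eta, transitions, gamma=0.5)
```

A single step moves the point mass at state "a" to the atom 1.0; iterating the map drives the distributions toward the fixed point characterised by the distributional Bellman equation.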