approximation theory, allowing for a unified approach to proving almost-sure
convergence [Gurvits et al., 1994, Dayan and Sejnowski, 1994, Tsitsiklis, 1994,
Jaakkola et al., 1994, Bertsekas and Tsitsiklis, 1996, Littman and Szepesvári,
1996]. The unbiased estimation framework presented here comes from these works,
and the second principle is based on the ideas behind two-timescale algorithms
[Borkar, 1997, 2008]. A broader framework based on asymptotically
approximating the trajectories of differential equations is a central theme of
algorithm design and stochastic approximation theory more generally [Ljung,
1977, Kushner and Clark, 1978, Benveniste et al., 2012, Borkar and Meyn, 2000,
Kushner and Yin, 2003, Borkar, 2008, Meyn, 2022].
In addition to the CTD and QTD algorithms described in this chapter, several
other approaches to incremental learning of return distributions have been
proposed. Morimura et al. [2010b] propose to update parametric density models
by taking gradients of the KL divergence between the current estimates and
the result of applying the Bellman operator to these estimates. Barth-Maron
et al. [2018] also take this approach, using a representation based on mixtures
of Gaussians. Nam et al. [2021] also use mixtures of Gaussians, and minimise
the Cramér distance from a multi-step target, incorporating ideas from TD(λ)
[Sutton, 1984, 1988]. Gruslys et al. [2018] combine CTD with Retrace(λ), a
multi-step off-policy evaluation algorithm [Munos et al., 2016]. Nguyen et al.
[2021] combine the quantile representation with a loss based on the MMD
metrics described in Chapter 4. Martin et al. [2020] propose a proximal update
scheme for the quantile representation based on (regularised) Wasserstein flows
[Jordan et al., 1998, Cuturi, 2013, Peyré et al., 2019].
Example 6.1 is from Bellemare et al. [2016].
6.3. The categorical temporal-difference algorithm as a mixture update was
presented by Rowland et al. [2018]. This is a variant of the C51 algorithm
introduced by Bellemare et al. [2017a], which uses a projection in a mixture of
Kullback-Leibler divergence and Cramér distance. Distributional versions of
gradient temporal-difference learning [Sutton et al., 2008a, 2009] based on the
categorical representation have also been explored by Qu et al. [2019].
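For readers who want to see the mixture-update idea in code, the following is a minimal Python sketch of a CTD-style update with a categorical representation: the next-state distribution is pushed through a sampled reward and discount, projected back onto a fixed, evenly spaced support with the triangular-kernel projection, and mixed into the current estimate. The function names, the evenly spaced support, and the single-sample transition interface are illustrative assumptions rather than the chapter's notation.

import numpy as np

def categorical_projection(locations, probs, support):
    # Project a discrete distribution (locations, probs) onto a fixed,
    # evenly spaced support using the standard triangular-kernel projection.
    projected = np.zeros_like(support)
    z_min, z_max = support[0], support[-1]
    gap = support[1] - support[0]
    for z, p in zip(locations, probs):
        z = np.clip(z, z_min, z_max)
        b = (z - z_min) / gap            # fractional index of z on the support
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                     # z lies exactly on a support point
            projected[lo] += p
        else:                            # split the mass between its neighbours
            projected[lo] += p * (hi - b)
            projected[hi] += p * (b - lo)
    return projected

def ctd_mixture_update(p_x, p_next, reward, gamma, support, step_size):
    # One mixture update: shift and scale the support by the sampled reward
    # and discount, project the result back onto the support, and mix it in.
    target = categorical_projection(reward + gamma * support, p_next, support)
    return (1.0 - step_size) * p_x + step_size * target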
6.4. The QTD algorithm was introduced by Dabney et al. [2018b]. Quantile
regression itself is a long-established tool within statistics, introduced by
Koenker and Bassett Jr [1978]; Koenker [2005] is a classic reference on the
subject. The incremental rule for estimating quantiles of a fixed distribution was
in fact proposed by Robbins and Monro [1951], in the same paper that launched
the field of stochastic approximation.
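As a closing illustration, here is a minimal sketch of that incremental quantile rule in Python, assuming a sampler that returns independent draws from a fixed distribution; the function name, step-size schedule, and numerical example are illustrative choices, not taken from the original paper.

import numpy as np

def estimate_quantile(sample, tau, num_steps=100_000, theta=0.0):
    # Incrementally estimate the tau-quantile of the distribution sampled by
    # `sample()`: nudge the estimate up when the draw exceeds it (with weight
    # tau) and down otherwise (with weight 1 - tau), so the only fixed point
    # of the expected update is the tau-quantile.
    for t in range(1, num_steps + 1):
        z = sample()                     # fresh draw from the distribution
        step_size = 1.0 / t ** 0.7       # decreasing steps, in the spirit of stochastic approximation
        theta += step_size * (tau - float(z < theta))
    return theta

# Example: the 0.9-quantile of a standard normal is approximately 1.28.
rng = np.random.default_rng(0)
print(estimate_quantile(rng.standard_normal, tau=0.9))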