Retrace(λ), a multistep off-policy evaluation algorithm (Munos et al. 2016).
Nguyen et al. (2021) combine the quantile representation with a loss based
on the MMD metrics described in Chapter 4. Martin et al. (2020) propose a
proximal update scheme for the quantile representation based on (regularized)
Wasserstein flows (Jordan et al. 1998; Cuturi 2013; Peyré and Cuturi 2019).
Example 6.1 is from Bellemare et al. (2016).
6.3.
The categorical temporal-difference algorithm as a mixture update was
presented by Rowland et al. (2018). This is a variant of the C51 algorithm
introduced by Bellemare et al. (2017a), which uses a projection step in Cramér
distance together with a Kullback–Leibler divergence loss. Distributional versions
of gradient temporal-difference learning (Sutton et al. 2008a; Sutton et al. 2009)
based on the categorical representation have also been explored by Qu et
al. (2019).
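For concreteness, the projection step underlying this family of algorithms can be sketched as follows. This is a minimal, loop-based illustration under our own naming conventions (practical implementations vectorize it): each atom of the target distribution is shifted and scaled, clipped to the fixed support, and its probability mass split between the two nearest grid points in proportion to proximity.

```python
import numpy as np

def categorical_projection(support, probs, reward, gamma):
    """Project the distribution of reward + gamma * Z onto a fixed, evenly
    spaced support z_1 < ... < z_m, where `probs` gives the probability
    assigned to each atom of the current estimate Z."""
    m = len(support)
    z_min, z_max = support[0], support[-1]
    delta = support[1] - support[0]  # grid spacing
    new_probs = np.zeros(m)
    for z, p in zip(support, probs):
        g = np.clip(reward + gamma * z, z_min, z_max)
        b = (g - z_min) / delta              # fractional index on the grid
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            new_probs[lo] += p               # lands exactly on a grid point
        else:
            new_probs[lo] += p * (hi - b)    # mass to the lower neighbor
            new_probs[hi] += p * (b - lo)    # mass to the upper neighbor
    return new_probs
```

In the mixture-update view, the new probability vector is then the convex combination (1 − α) · probs + α · categorical_projection(...), with α the step size.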
6.4.
The QTD algorithm was introduced by Dabney et al. (2018b). Quantile
regression itself is a long-established tool within statistics, introduced by
Koenker and Bassett (1978); Koenker (2005) is a classic reference on the
subject. The incremental rule for estimating quantiles of a fixed distribution
was in fact proposed by Robbins and Monro (1951), in the same paper that
launched the field of stochastic approximation.
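The incremental rule in question can be sketched as follows, under assumptions of our own (the function name, initialization at zero, and the 1/n step-size schedule are illustrative choices): the estimate moves up by α·τ when a sampled value exceeds it, and down by α·(1 − τ) otherwise, a stochastic subgradient step on the quantile (pinball) loss.

```python
import random

def estimate_quantile(sample, tau, steps=100_000, seed=0):
    """Incrementally estimate the tau-quantile of a fixed distribution.

    `sample` is a function taking an RNG and returning one draw from the
    distribution; step sizes 1/n satisfy the usual summability conditions."""
    rng = random.Random(seed)
    theta = 0.0
    for n in range(1, steps + 1):
        z = sample(rng)
        alpha = 1.0 / n
        # Move up if the draw exceeds the estimate, down otherwise,
        # with asymmetry tau controlling the equilibrium point.
        theta += alpha * (tau - (1.0 if z <= theta else 0.0))
    return theta
```

For example, with draws from the uniform distribution on [0, 1] and τ = 0.5, the iterates settle near the median 0.5.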
6.5.
The discussion of sequences of learning rates that result in convergence
goes back to Robbins and Monro (1951), who introduced the field of stochastic
approximation. Szepesvári (1998), for example, considers this framework in
their study of the asymptotic convergence rate of Q-learning. A fine-grained
analysis in the case of temporal-difference learning algorithms, taking finite-
time concentration into account, was undertaken by Even-Dar and Mansour
(2003); see also Azar et al. (2011).
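For reference, the classical Robbins–Monro conditions on a step-size sequence $(\alpha_n)_{n \geq 0}$, under which convergence guarantees of this kind are typically stated, are

```latex
\sum_{n=0}^{\infty} \alpha_n = \infty ,
\qquad
\sum_{n=0}^{\infty} \alpha_n^2 < \infty ,
```

satisfied, for example, by $\alpha_n = (n+1)^{-1}$. Informally, the first condition ensures the iterates can travel arbitrarily far from their initialization, while the second controls the accumulated noise.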
6.6–6.10.
Our proof of Theorem 6.9 (via Propositions 6.4, 6.7, and 6.8) closely
follows the argument given by Bertsekas and Tsitsiklis (1996) and Tsitsiklis
(1994). Specifically, we adapt this argument to deal with distributional
information, rather than a single scalar value. Proposition 6.4 is a special case
of the Robbins–Siegmund theorem (Robbins and Siegmund 1971), and a particularly
clear exposition of this and related material is given by Walton (2021). We note
that this result can also be established via earlier results in the stochastic
approximation literature (Dvoretzky 1956), as noted by Jaakkola et al. (1994).
Theorem 6.10 is classical, and results of this kind can be found in Bertsekas and
Tsitsiklis (1996). Theorem 6.12 was first proven by Rowland et al. (2018), albeit
with a monotonicity argument based on that of Tsitsiklis (1994); here we instead
use a contraction-mapping argument, to match the analysis of the
temporal-difference algorithm. For further background on signed measures, see
Doob (1994).