Linear Function Approximation 301
9.4. Barnard [1993] provides a proof that temporal-difference learning is not
a true gradient-descent method, justifying the term semi-gradient [Sutton and
Barto, 2018]. The situation in which ξ differs from the steady-state (or
sampling) distribution of the induced Markov chain is called off-policy
learning; Example
9.9 is a simplified version of the original counterexample due to Baird [1995].
Baird argued for the direct minimisation of the Bellman residual by gradient
descent as a replacement for temporal-difference learning, but this method
suffers from other issues and can produce undesirable solutions even in mild
scenarios
[Sutton et al., 2008a]. The GTD line of work [Sutton et al., 2009, Maei, 2011]
is a direct attempt at handling the issue by using a pair of approximations (see
Qu et al. [2019] for applications of this idea in a distributional context); more
recent work directly considers a corrected version of the Bellman residual [Dai
et al., 2018, Chen and Jiang, 2019]. The convergence of temporal-difference
learning with linear function approximation was proven under fairly general
conditions by Tsitsiklis and Van Roy [1997], using the ODE method from
stochastic approximation theory [Ljung, 1977, Benveniste et al., 2012, Kushner
and Yin, 2003]; see also Dayan [1992].
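To make the "semi-gradient" terminology concrete, the following sketch (with hypothetical variable names) shows a linear TD(0) update in which the bootstrapped target is held fixed, so that the gradient flows only through the prediction; this is what distinguishes the method from true gradient descent on a fixed objective:

```python
import numpy as np

def semi_gradient_td0_update(w, phi_x, phi_xnext, r, gamma=0.99, alpha=0.1):
    """One semi-gradient TD(0) update for a linear estimate V(x) = w . phi(x).

    The TD error uses the bootstrapped target r + gamma * V(x'), but the
    update differentiates only the prediction V(x); the target is treated
    as a constant, hence "semi"-gradient.
    """
    td_error = r + gamma * np.dot(w, phi_xnext) - np.dot(w, phi_x)
    # The gradient of V(x) with respect to w is simply phi(x).
    return w + alpha * td_error * phi_x
```

A true gradient method would also differentiate the target term gamma * V(x') with respect to w, which changes the fixed point and, as the works cited above discuss, can produce undesirable solutions.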
Parr et al. [2007, 2008] and Sutton et al. [2008b] provide a domain-dependent
analysis of linear value function approximation, which is extended by Ghosh
and Bellemare [2020] to establish the existence of representations that are
stable under a greater array of conditions (Exercises 9.10 and 9.11). Kolter
[2011] studies the space of temporal-difference fixed points as a function of
the distribution ξ.
9.5. Linear CTD and QTD were implicitly introduced by Bellemare et al.
[2017a] and Dabney et al. [2018b], respectively, in the design of deep
reinforcement learning agents (see Chapter 10). The presentation given here is
new.
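As an illustration of how quantile-based learning combines with linear approximation, the following sketch (an assumption of this note, not the algorithms' exact pseudocode) maintains one weight vector per quantile level and moves each along the quantile-regression semi-gradient towards the bootstrapped targets:

```python
import numpy as np

def linear_qtd_update(W, phi_x, phi_xnext, r, gamma=0.99, alpha=0.05):
    """One linear QTD-style update (illustrative sketch).

    W has shape (m, d): row i parameterises the i-th quantile estimate
    theta_i(x) = W[i] . phi(x), at quantile level tau_i = (2i + 1) / (2m).
    As in semi-gradient TD, the bootstrapped targets are held fixed.
    """
    m = W.shape[0]
    taus = (2 * np.arange(m) + 1) / (2 * m)
    theta_x = W @ phi_x                     # current quantile estimates at x
    targets = r + gamma * (W @ phi_xnext)   # bootstrapped sample targets
    for i in range(m):
        # Quantile-regression gradient, averaged over the m targets.
        indicator = (targets < theta_x[i]).astype(float)
        W[i] += alpha * np.mean(taus[i] - indicator) * phi_x
    return W
```

With a tabular (one-hot) feature map this reduces to the quantile temporal-difference update; linear CTD can be sketched analogously, with one weight vector per category of the fixed support.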
9.6–9.7. The algorithm based on signed distributions is new to this book. It
improves on an algorithm proposed by Bellemare et al. [2019b] in that it adjusts
the total mass of return distributions to always be one. Lyle et al. [2019] give
evidence that the original algorithm generally underperforms the categorical
algorithm. They further establish that, in the risk-neutral control setting, distri-
butional algorithms cannot do better than value-based algorithms when using
linear approximation. The reader interested in the theory of signed measures is
referred to Doob [1994].
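One simple way to keep the total mass of a signed distribution equal to one, sketched here for illustration only (the mass-correction step discussed above may differ in detail), is to orthogonally project the signed mass vector onto the affine set of vectors summing to one:

```python
import numpy as np

def renormalise_signed_mass(p):
    """Project a signed mass vector onto {p : sum(p) = 1}.

    Components may be negative, since the representation is a signed
    distribution rather than a probability distribution; the correction
    spreads the mass deficit (or excess) evenly across coordinates.
    """
    p = np.asarray(p, dtype=float)
    return p - (p.sum() - 1.0) / p.size
```

Unlike a projection onto the probability simplex, this correction enforces only unit total mass and leaves negative components intact, which is what distinguishes the signed-distribution setting from the categorical one.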