Linear Function Approximation 277
While interesting in their own right, linear CTD and QTD are particularly
important in that they can be straightforwardly adapted to learn return-
distribution functions with nonlinear function approximation schemes such
as deep neural networks; we return to this point in Chapter 10. For now, it is
worth noting that the update rules of linear TD, linear QTD, and linear CTD
can all be expressed as
w
i
←w
i
+ αε
i
φ(x) ,
where
ε
i
is an error term. One interesting diﬀerence is that for both linear
CTD and linear QTD,
ε
i
lies in the interval [
−
1
,
1], while for linear TD, it
is unbounded. This gives evidence that we should expect diﬀerent learning
dynamics from these algorithms. In addition, combining linear TD or linear
QTD to a tabular state representation recovers the corresponding incremen-
tal algorithms from Chapter 6. For linear CTD, the update corresponds to a
tabular representation of the softmax parameters rather than the probabilities
themselves, and the correspondence is not as straightforward.
Analyzing linear QTD and CTD is complicated by the fact that the return
functions themselves are not linear in
w
1
, . . . , w
m
. One solution is to relax the
requirement that the approximation
η
w
(
x
) be a probability distribution; as we
will see in the next section, in this case the distributional approximation behaves
much like the value function approximation, and a theoretical guarantee can be
obtained.
9.6 An Algorithm Based on Signed Distributions*
So far, we made sure that the outputs of our distributional algorithms were valid
probability distributions (or could be as interpreted as such: for example, when
working with statistical functionals). This was done explicitly when using the
softmax parameterization in deﬁning linear CTD and implicitly in the mixture
update rule of CTD in Chapter 3. In this section, we consider an algorithm that
is similar to linear CTD but omits the softmax function. As a consequence of
this change, this modiﬁed algorithm’s outputs are signed distributions, which
we brieﬂy encountered in Chapter 6 in the course of analyzing categorical
temporal-diﬀerence learning.
Compared to linear CTD, this approach has the advantage of being both
closer to the tabular algorithm (it ﬁnds a best ﬁt in
2
distance, like tabular
CTD) and closer to linear value function approximation (making it amenable
to analysis). Although the learned predictions lack some aspects of probability
distributions – such as well-deﬁned quantiles – the learned signed distributions
can be used to estimate expectations of functions, including expected values.
Draft version.