9
Linear Function Approximation
A probability distribution representation is used to describe return functions
in terms of a collection of numbers that can be stored in a computer’s mem-
ory. With it, we can devise algorithms that operate on return distributions in a
computationally efﬁcient manner, including distributional dynamic program-
ming algorithms and incremental algorithms such as CTD and QTD. Function
approximation arises when our representation of the value or return function
uses parameters that are shared across states. This allows reinforcement learning
methods to be applied to domains where it is impractical or even impossible to
keep in memory a table with a separate entry for each state, as we have done
in preceding chapters. In addition, it makes it possible to make predictions
about states that have not been encountered in effect, to generalise a learned
estimate to new states.
As a concrete example, consider the problem of determining an approximation to
the optimal value function for the game of Go. In Go, players take turns placing
white and black stones on a 19
19 grid. At any time, each location of the
board is either occupied by a white or black stone, or unoccupied; consequently,
there are astronomically many possible board states.
68
Any practical algorithm
for this problem must therefore use a succinct representation of its value or
return function.
Function approximation is also used to apply reinforcement learning algorithms
to problems with continuous state variables. The classic Mountain Car domain,
in which the agent must drive an under-powered car up a steep hill, is one
such problem. Here, the state consists of the car’s position and velocity (both
bounded on some interval); learning to control the car requires being able to
68.
A naive estimate is 3
1919
. The real ﬁgure is somewhat lower due to symmetries and the
impossibility of certain states.
273 274 Chapter 9
Figure 9.1
A Markov decision process in which aliasing due to function approximation may result
in the wrong value function estimates even at the unaliased states x
1
and x
2
.
map two-dimensional points to a desired action, usually by predicting the return
obtained following this action.
69
While there are similarities between the use of probability distribution rep-
resentations (parametrising the output of a return function) and function
approximation (parametrising its input), the latter requires a different algorith-
mic treatment. When function approximation is required, it is usually because
it is infeasible to exhaustively enumerate the state space and apply dynamic
programming methods. One solution is to rely on samples (of the state space,
transitions, trajectories, etc.), but this results in additional complexities in the
algorithm’s design. Combining incremental algorithms with function approxi-
mation may result in instability or even divergence; in the distributional setting,
the analysis of these algorithms is complicated by two levels of approximation
(one for probability distributions, and one across states). With proper care,
however, function approximation provides an effective way of dealing with
large reinforcement learning problems.
9.1 Function Approximation and Aliasing
By necessity, when parameters are shared across states a single parameter usu-
ally affects the predictions (value or distribution) at multiple states. In this case,
we say that the states are aliased. State aliasing has surprising consequences in
the context of reinforcement learning, including the unwanted propagation of
errors and potential instability in the learning process.
Example 9.1.
Consider the Markov decision process in Figure 9.1, with four
non-terminal states x
1
, x
2
, y, and z, a single action, a deterministic reward
69. Domains such as Mountain Car, which have a single initial state and a deterministic transition
function can be solved without function approximation, e.g. by means of a search algorithm.
However, function approximation allows us to learn a control policy that can in theory be applied to
any given state, and has a low run-time cost.
Linear Function Approximation 275
function, an initial state distribution
0
, and no discounting. Consider an
approximation based on three parameters w
x
1
, w
x
2
, and w
yz
, such that
ˆ
V(x
1
)=w
x
1
ˆ
V(x
2
)=w
x
2
ˆ
V(y)=
ˆ
V(z)=w
yz
.
Because the rewards from y and z are different, no choice of w
yz
can yield
ˆ
V
= V
. As such, any particular choice of w
yz
y and z. When a reinforcement learning algorithm is combined with function
algorithm’s characteristics and the parameters of the learning process. For
example, the best approximation obtained by the incremental Monte Carlo
algorithm (Section 3.2) correctly learns the value of x
1
and x
2
:
ˆ
V(x
1
)=2,
ˆ
V(x
2
)=0,
but learns a parameter w
yz
that depends on the frequency at which states y and
z are visited. This is because w
yz
is updated towards 2 whenever the estimate
ˆ
V
(y) is updated, and towards 0 whenever
ˆ
V
(z) is updated. In our example, the
frequency at which this occurs is directly implied by the initial state distribution
0
, and we have
w
yz
=2P
(X
1
= y)+0P
(X
1
= z)
=2
0
(x
1
) . (9.1)
When the approximation is learned using a bootstrapping procedure, aliasing
can also result in incorrect estimates at states that are not themselves aliased.
The solution found by temporal-difference learning, w
yz
, is as per Equation 9.1,
but the algorithm also learns the incorrect value at x
1
and x
2
:
ˆ
V(x
1
)=
ˆ
V(x
2
)=0+
ˆ
V(z)
=2
0
(x
1
).
Thus, errors due to function approximation can compound in unexpected ways;
we will study this phenomenon in greater detail in Section 9.3. 4
In a linear value function approximation, the value estimate at a state x is
given by a weighted combination of features of x. This is in opposition to a
tabular representation, where value estimates are stored in a table with one
entry per state.
70
As we will see, linear approximation is simple to implement
and relatively easy to analyse.
70.
Technically, a tabular representation can also be expressed using the trivial collection of indicator
features. In practice, the two are used in distinct problem settings.
276 Chapter 9
Deﬁnition 9.2.
Let n
2N
+
.Astate representation is a mapping
:
X!R
n
.
A linear value function approximation V
w
2R
X
is parametrised by a weight
vector w 2R
n
and maps states to their expected return estimates according to
V
w
(x)=(x)
>
w .
A feature
i
(x)
2R
, i = 1,
...
, n is an individual element of
(x). We call the
vectors
i
2R
X
basis functions. 4
As its name implies, a linear value function approximation is linear in the weight
vector w. That is, for any w
1
, w
2
2R
n
, , 2R, we have
V
w
1
+w
2
= V
w
1
+ V
w
2
.
w
(x) with respect to w is given by
r
w
V
w
(x)=(x).
As we will see, these properties affect the learning dynamics of algorithms that
use linear value function approximation.
We extend linear value function approximation to state-action values in the
usual way. For a state representation : XA!R
n
, we deﬁne
Q
w
(x, a)=(x, a)
>
w .
A practical alternative is to use a distinct set of weights for each action, and use
a common representation
(x) across actions. In this case, we use a collection
of weight vectors (w
a
: a 2A), with w
a
2R
n
, and write
Q
w
(x, a)=(x)
>
w
a
.
Remark 9.1 discusses the relationship between these two methods.
9.2 Optimal Linear Value Function Approximations
In this chapter we will assume that there is a ﬁnite (but very large) number
of states. In this case, the state representation
:
X!R
n
can expressed as a
feature matrix
2R
Xn
whose rows are the vectors
(x), x
2X
. This yields
the approximation
V
w
= w .
The state representation determines a space of value function approximations
that are constructed from linear combinations of features. Expressed in terms of
the feature matrix, this space is
F
={w : w 2R
n
}. Linear Function Approximation 277
We ﬁrst consider the problem of ﬁnding the best linear approximation to a value
function V
. Because
F
is a n-dimensional linear subspace of the space of
value functions
R
X
, there are necessarily some value functions that cannot be
represented with a given state representation (unless n = N
X
). We measure the
discrepancy between a value function V
and an approximation V
w
=
w in
-weighted L
2
norm, for 2P(X):
kV
V
w
k
,2
=
x2X
(x)
V
(x)–V
w
(x)
2
1
/2
.
The weighting
reﬂects the relative importance given to different states. For
example, we may weigh states according to the frequency at which they are
visited; or we may put greater importance on initial states. Provided that
(x)>0
for all x 2X, the norm k·k
,2
induces the -weighted L
2
metric on R
X
:
71
d
,2
(V, V
0
)=kV V
0
k
,2
.
The best linear approximation under this metric is the solution to the
minimisation problem
min
w2R
n
kV
V
w
k
,2
. (9.2)
One advantage of measuring approximation error in a weighted L
2
norm, rather
than the L
1
norm used in the analyses of previous chapters, is that a solution
w
to Equation 9.2 can be easily determined by solving a least-squares system.
Proposition 9.3.
Suppose that the columns of the feature matrix
are
linearly independent and
(x) > 0 for all x
2X
. Then Equation 9.2 has a
unique solution w
given by
w
=(
>
⌅)
–1
>
V
, (9.3)
where 2R
XX
is a diagonal matrix with entries ((x):x 2X). 4
Proof. By a standard calculus argument, any optimum w must satisfy
r
w
x2X
(x)
V
(x)–(x)
>
w
2
=0
=)
x2X
(x)
V
(x)–(x)
>
w
(x)=0.
71.
Technically,
k·k
,2
is only a proper norm if
is strictly positive for all x; otherwise, it is
a seminorm. Under the same condition, d
,2
is proper metric, otherwise it is a pseudo-metric.
Assuming that (x) > 0 for all x addresses uniqueness issues and simpliﬁes the analysis. 278 Chapter 9
Written in matrix form, this is
>
(w V
)=0
=)
>
⌅w =
>
V
.
Because
has rank n, then so does
>
⌅
: for any u
2R
n
with u
6
= 0, we have
u = v 6= 0 and so
u
>
>
⌅u = v
>
v
=
x2X
(x)v(x)
2
>0,
as
(x) > 0 and v(x)
2
0 for all x
2X
, and
x2X
v(x)
2
> 0. Hence
>
⌅
is
invertible and the only solution w
to the above satisﬁes Equation 9.3.
9.3 A Projected Bellman Operator for Linear Value Function
Approximation
Dynamic programming ﬁnds an approximation to the value function V
by
successively computing the iterates
V
k+1
= T
V
k
.
As we saw in preceding chapters, dynamic programming makes it easy to
derive incremental algorithms for learning the value from samples, and also
allows us to ﬁnd an approximation to the optimal value function Q
. Often, it is
the de facto approach for ﬁnding an approximation of the return-distribution
function. It is also particularly useful when using function approximation, where
it enables algorithms that learn by extrapolating to unseen states.
When dynamic programming is combined with function approximation, we
obtain a range of methods called approximate dynamic programming. In the case
of linear value function approximation, the iterates (V
k
)
k0
are given by linear
combinations of features, which allows us to apply dynamic programming to
problems with larger state spaces than can be described in memory. In general,
however, the space of approximations
F
is not closed under the Bellman
operator, in the sense that
V 2F
6=) T
V 2F
.
Similar to the notion of a distributional projection introduced in Chapter 5, we
address the issue by projecting, for V
2R
X
, the value function T
V back onto
F
. Let us deﬁne the projection operator
,
: R
X
!R
X
as
(
,
V)(x)=(x)
>
w
such that w
2arg min
w2R
n
kV V
w
k
,2
.
Linear Function Approximation 279
This operator returns the approximation V
w
=
w
that is closest to V
2R
X
in the
-weighted L
2
norm. As established by Proposition 9.3, when
is fully
supported on
X
and the basis functions (
i
)
n
i=1
are linearly independent, then this
projection is unique.
72
By repeatedly applying the projected Bellman operator
,
T
from an initial condition V
0
2F
, we obtain the iterates
V
k+1
=
,
T
V
k
. (9.4)
Unlike the approach taken in Chapter 5, however, it is usually impractical
to implement Equation 9.4 as-is, as there are too many states to enumerate.
A simple solution is to rely on a sampling procedure that approximates the
operator itself. For example, one may sample a batch of K states and ﬁnd
the best linear ﬁt to T
V
k
at these states. In the next section we will study the
related approach of using an incremental algorithm to learn the linear value
function approximation from sample transitions. Understanding the behaviour
of the exact projected operator
,
T
informs us about the behaviour of these
approximations, as it describes in some sense the ideal behaviour that one
expects from both of these approaches.
Also different from the setting of Chapter 5 is the presence of aliasing across
states. As a consequence of this aliasing, we have limited freedom in the choice
of projection if we wish to guarantee the convergence of the iterates of Equation
9.4. To obtain such a guarantee, in general we need to impose a condition on
the distribution
that deﬁnes the projection
,
. We will demonstrate that
the projected Bellman operator is a contraction mapping with modulus
with
respect to the
-weighted L
2
norm, for a speciﬁc choice of
. Historically, this
approach predates the analysis of distributional dynamic programming and is in
fact a key inspiration for our analysis of distributional reinforcement learning
algorithms as approximating projected Bellman operators (see bibliographical
remarks).
To begin, let us introduce the convention that the Lipschitz constant of an
operator with respect to a norm (such as L
1
) follows Deﬁnition 5.20, applied
to the metric associated with the norm. In the case of L
1
, this metric deﬁnes
the distance between u, u
0
2R
X
by
ku u
0
k
1
.
Now recall that the Bellman operator T
is a contraction mapping in L
1
norm,
with modulus
. One reason this holds is because the transition matrix P
72.
If only the ﬁrst of those two conditions hold, then there may be multiple optimal weight vectors
but they all result in the same value function, and the projection remains unique.
280 Chapter 9
satisﬁes
kP
uk
1
kuk
1
, for all u 2R
X
;
we made use of this fact in the proof of Proposition 4.4. This is equivalent to
requiring that the Lipschitz constant of P
satisfy
kP
k
1
1.
Unfortunately, the Lipschitz constant of
,
in the L
1
norm may be greater
than 1, precluding a direct analysis in that norm (see Exercise 9.5). We instead
prove that the Lipschitz constant of P
in the -weighted L
2
norm satisﬁes the
same condition when
is taken to be a steady-state distribution under policy
.
Deﬁnition 9.4.
Consider a Markov decision process and let
be a policy
deﬁning the probability distribution
P
over the random transition (X, A, R, X
0
).
We say that 2P(X) is a steady-state distribution for if for all x
0
2X,
(x
0
)=
x2X
(x)P
(X
0
= x
0
| X = x). 4
Assumption 9.5.
There is a unique steady-state distribution
, and it satisﬁes
(x) > 0 for all x 2X. 4
Qualitatively, Assumption 9.5 ensures that approximation error at any state is
reﬂected in the norm
k·k
,2
; contrast with the setting in which
is nonzero
only at a handful of states. Uniqueness is not strictly necessary, but simpliﬁes the
exposition. There are a number of practical scenarios in which the assumption
does not hold, most importantly when there is a terminal state. We discuss how
to address such a situation in Remark 9.2.
Lemma 9.6. Let : X!P(A) be a policy and let
bution for this policy. The transition matrix is a non-expansion with respect to
the
-weighted L
2
metric. That is,
kP
k
,2
=1. 4
Proof.
A simple algebraic argument shows that if U
2R
X
is such that U(x)=1
for all x, then
P
U = U .
This shows that kP
k
,2
1. Now for an arbitrary U 2R
X
, write
kP
Uk
2
,2
=
x2X
(x)
(P
U)(x)
2 Linear Function Approximation 281
=
x2X
(x)
x
0
2X
P
(x
0
| x)U(x
0
)
2
(a)
x2X
(x)
x
0
2X
P
(x
0
| x)
U(x
0
)
2
=
x
0
2X
U(x
0
)
2
x2X
(x)P
(x
0
| x)
(b)
=
x
0
2X
(x
0
)
U(x
0
)
2
= kUk
2
,2
,
where (a) follows by Jensen’s inequality and (b) by Deﬁnition 9.4. Since
kP
k
,2
= sup
U2R
X
kP
Uk
,2
kUk
,2
1,
this concludes the proof.
Lemma 9.7.
For any
2P
(
X
) with
(x) > 0, the projection operator
,
is a
non-expansion in the -weighted L
2
metric, in the sense that
k
,
k
,2
=1. 4
The proof constitutes Exercise 9.4.
Theorem 9.8.
Let
be a policy and suppose that Assumption 9.5 holds; let
be the corresponding steady-state distribution. The projected Bellman
operator
,
T
is a contraction with respect to the
-weighted L
2
norm
with modulus , in the sense that for any V, V
0
2R
X
,
k
,
T
V
,
T
V
0
k
,2
kV V
0
k
,2
.
As a consequence, this operator has a unique ﬁxed point
ˆ
V
=
,
T
ˆ
V
, (9.5)
which satisﬁes
k
ˆ
V
V
k
,2
1
1–
2
k
,
V
V
k
,2
. (9.6)
In addition, for an initial value function V
0
2R
X
, the sequence of iterates
V
k+1
=
,
T
V
k
(9.7)
converges to this ﬁxed point. 4 282 Chapter 9
Proof.
The contraction result and consequent convergence of the iterates in
Equation 9.7 to a unique ﬁxed point follow from Lemmas 9.6 and 9.7, which
combined with Lemma 5.21 allows us to deduce that
k
,
T
k
,2
.
Because Assumption 9.5 guarantees that
k·k
,2
induces a proper metric, we
may then apply Banach’s ﬁxed point theorem. For the error bound of Equation
9.6, we use Pythagoras’ theorem to write
k
ˆ
V
V
k
2
,2
= k
ˆ
V
,
V
k
2
,2
+ k
,
V
V
k
2
,2
= k
,
T
ˆ
V
,
T
V
k
2
,2
+ k
,
V
V
k
2
,2
2
k
ˆ
V
V
k
2
,2
+ k
,
V
V
k
2
,2
,
since
k
,
T
k
,2
. The result follows by rearranging terms and taking
the square root of both sides.
Theorem 9.8 implies that the iterates (V
k
)
k0
are guaranteed to converge when
the projection is performed in
-weighted L
2
norm. Of course, this does not
imply that a projection in a different norm may not result in a convergent
algorithm (see Exercise 9.8), but divergence is a practical concern (we return to
this point in the next section). A sound alternative to imposing a condition on
the distribution
is to instead impose a condition on the feature matrix
; this
is explored in Exercise 9.10.
When the feature matrix
forms a basis of
R
X
(i.e. it has rank N
X
), it is always
possible to ﬁnd a weight vector w
for which
w
= T
V ,
for any given V 2R
X
. As a consequence, for any 2P(X)
,
T
= T
and Theorem 9.8 reduces to the analysis of the (unprojected) Bellman operator
given in Section 4.2. On the other hand, when n < N
X
the ﬁxed point of Equation
9.5 is in general different from the minimum-error solution
,
V
and is
called the temporal-difference learning ﬁxed point. Similar to the diffusion
effect studied in Section 5.8, successive applications of the projected Bellman
operator result in compounding approximation errors. The nature of this ﬁxed
point is by now well-studied in the literature (see bibliographical remarks).
Linear Function Approximation 283
We now consider the design of a sample-based, incremental algorithm for
learning the linear approximation of a value function V
. In the context of
domains with large state spaces, algorithms that learn from samples have an
advantage over dynamic programming approaches: whereas the latter require
some form of enumeration and hence have a computational cost that depends on
the size of
X
, the computational cost of the former instead depends on the size
of the function approximation (in the linear case, on the number of features n).
To begin, let us consider learning a linear value function approximation using
an incremental Monte Carlo algorithm. We are presented with a sequence of
state-return pairs (x
k
, g
k
)
k0
, with the assumption that the source states x
k
are
realisations of independent draws from the distribution
, and that each g
k
is a
corresponding independent realisation of the random return G
(x
k
). As before,
we are interested in the optimal weight vector w
for the problem
min
w2R
n
kV
V
w
k
,2
, V
w
(x)=(x)
>
w . (9.8)
Of note, the optimal approximation V
w
is also the solution to the problem
min
w2R
n
E[kG
V
w
k
,2
] . (9.9)
Consequently, a simple approach for ﬁnding w
descent with a loss function that reﬂects Equation 9.9. For x
2X
, z
2R
let us
deﬁne the sample loss
L(w)=
z (x)
>
w
2
,
whose gradient with respect to w is
r
w
L(w) = –2
z (x)
>
w
(x).
Stochastic gradient descent updates the weight vector w by following the (nega-
tive) gradient of the sample loss constructed from each sample. Instantiating
the sample loss with x = x
k
and z = g
k
, this results in the update rule
w w +
k
(g
k
(x
k
)
>
w
(x
k
) , (9.10)
where
k
2
[0, 1) is a time-varying step size that also subsumes the constant
from the loss. Under appropriate conditions, this update rule ﬁnds the optimal
weight vector w
[see e.g., Bottou, 1998]. Exercise 9.9 asks you to verify that
the optimal weight vector w
is a ﬁxed point of the update rule.
Let us now consider the problem of learning the value function from a sequence
of sample transitions (x
k
, a
k
, r
k
, x
0
k
)
k0
, again assumed to be independent real-
isations from the appropriate distributions. Given a weight vector w, the
temporal-difference learning target for linear value function approximation 284 Chapter 9
is
r
k
+ V
w
(x
0
k
)=r
k
+ (x
0
k
)
>
w .
We use this target in lieu of g
k
in Equation 9.10 to obtain the semi-gradient
temporal-difference learning update rule
w w +
k
r
k
+ (x
0
k
)
>
w (x
k
)
>
w

TD error
(x
k
) , (9.11)
in which the temporal-difference (TD) error appears, now with value function
estimates constructed from linear approximation. By substituting the Monte
Carlo target g
k
for the temporal-difference target, the intent is to learn an
approximation to V
by a bootstrapping process, as in the tabular setting. The
term “semi-gradient” reﬂects the fact that the update rule does not actually
r
k
+ (x
0
k
)
>
w (x
k
)
>
w
2
,
which contains additional terms related to (x
0
) (see bibliographical remarks).
We can understand the relationship between semi-gradient temporal-difference
learning and the projected Bellman operator
,
T
by way of an update rule
deﬁned in terms of a second set of weights
˜
w
, the target weights. This update
rule is
w w +
k
(r
k
+ (x
0
k
)
>
˜
w (x
k
)
>
w
(x
k
) . (9.12)
When
˜
w
= w, this is Equation 9.11. However, if
˜
w
is a separate weight vector
this is the update rule of stochastic gradient descent on the sample loss
r
k
+ (x
0
k
)
>
˜
w (x
k
)
>
w
2
.
Consequently, this update rule ﬁnds a weight vector w
that approximately
minimises
kT
V
˜
w
V
w
k
,2
,
and Equation 9.12 describes an incremental algorithm for computing a single
step of the projected Bellman operator applied to V
˜
w
; its solution w
satisﬁes
w
=
,
T
V
˜
w
.
This argument suggests that semi-gradient temporal-difference learning tracks
the behaviour of the projected Bellman operator. In particular, at the ﬁxed point
ˆ
V
=
ˆ
w
of this operator, semi-gradient TD learning (applied to realisations
from the sample transition model (X, A, R, X
0
), with X
) leaves the weight
vector unchanged, in expectation.
E

R + (X
0
)
>
ˆ
w (X)
>
ˆ
w
(X)
=0.
Linear Function Approximation 285
In semi-gradient temporal-difference learning, however, the sample target
r
k
+

(x
0
k
)
>
w depends on w, and is used to update w itself. This establishes
a feedback loop which, combined with function approximation, can result in
divergence even when the projected Bellman operator is well-behaved. The
following example illustrates this phenomenon.
Example 9.9
(Baird’s counterexample [Baird, 1995])
.
Consider the Markov
decision process depicted in Figure 9.2a and the state representation
(x) = (1, 3, 2) (y) = (4, 3, 3).
Since the reward is zero everywhere, the value function for this MDP is V
= 0.
However, if X
k
with (x)=(y) = 1/2, the semi-gradient update
w
k+1
= w
k
+ (0 + (X
0
k
)
>
w
k
(X
k
)
>
w
k
)(X
k
)
diverges unless w
0
= 0. Figure 9.2b depicts the effect of this divergence on the
approximation error: on average, the distance to the ﬁxed point
k
w
k
–0
k
,2
grows exponentially with each update. Also shown is the approximation error
over time if we take to be the steady-state distribution
(x)=
1
/11 (y)=
10
/11,
in which case the approximation error becomes close to zero as k
!1
. This
demonstrates the impact of the relative update frequency of different states on
the behaviour of the algorithm.
On the other hand, note that the basis functions implied by
span
R
X
and thus
,
T
V = T
V
for any V
2R
2
. Hence the iterates V
k+1
=
,
T
V
k
in this case converge to
0 for any initial V
0
2R
2
. This illustrates that requiring the convergence that
the sequence of iterates derived from the projected Bellman operator is not
sufﬁcient to guarantee the convergence of the semi-gradient iterates. 4
Example 9.9 illustrates how reinforcement learning with function approximation
is a more delicate task than in the tabular setting. If states are updated propor-
tionally to their steady-state distribution, however, a convergence guarantee
becomes possible (see bibliographical remarks).
9.5 Semi-Gradient Algorithms for Distributional Reinforcement
Learning
With the right modelling tools, function approximation can also be used to
tractably represent the return functions of large problems. One difference with 286 Chapter 9
0 20 40 60 80 100
Iteration
10
0
10
1
10
2
Approximation error (log)
(a) (b)
Figure 9.2
(a)
A Markov decision process for which semi-gradient temporal-difference learning can
diverge.
(b)
Approximation error (measured in unweighted L
2
norm) over the course of
100 runs of the algorithm with
= 0.01 and the same initial condition w
0
= (1, 0, 0) but
different draws of the sample transitions. Black and red lines indicate runs with
(x)=
1
/
2
and
1
/11, respectively.
the expected-value setting is that it is typically more challenging to construct
an approximation that is linear in the true sense of the word. With linear
approximations:
V
w
1
+w
2
(x)=V
w
1
(x)+V
w
2
(x).
In the distributional setting, the same cannot apply because probability dis-
tributions do not form a vector space. This means that we cannot expect a
return-distributino function representation to satisfy
w
1
+w
2
(x)
?
=
w
1
(x)+
w
2
(x) ; (9.13)
the right-hand side is not a probability distribution (it is, however, a signed
distribution: more on this in Section 9.6). An alternative is to take a slightly
broader view and consider distributions whose parameters depend linearly on
w. There are now two sources of approximation: one due to the ﬁnite parametri-
sation of probability distributions in
F
, another because those parameters are
themselves aliased. This is an expressive framework, albeit one under which
the analysis of algorithms is signiﬁcantly more complex.
Linear QTD.
Let us ﬁrst derive a linear approximation of quantile temporal-
difference learning. Linear QTD represents the locations of quantile distri-
butions using linear combinations of features. If we write w
2R
mn
for the
matrix whose columns are w
1
,
...
, w
m
2R
n
, then the linear QTD return function
estimate takes the form
w
(x)=
1
m
m
i=1
(x)
>
w
i
. Linear Function Approximation 287
One can verify that
w
(x) is not a linear combination of features, even though
its parameters are. We construct the linear QTD update rule by following the
negative gradient of the quantile loss (Equation 6.12), taken with respect to the
parameters w
1
, ..., w
m
. We ﬁrst rewrite this loss in terms of a function
:
()=|
{ < 0}
| ||
so that for a sample z
2R
and estimate
2R
, the loss of Equation 6.12 can be
expressed as
|
{z < }
| |z |=
(z ).
We instantiate the quantile loss with
=
(x)
>
w
i
and
=
i
=
2i–1
2m
to obtain the
loss
L
i
(w)=
i
(z (x)
>
w
i
) . (9.14)
By the chain rule, the gradient of this loss with respect to w
i
is
r
w
i
i
(z (x)
>
w
i
) = –(
i
{z < (x)
>
w
i
})(x) . (9.15)
As in our derivation of QTD, from a sample transition (x, a, r, x
0
) we construct
m sample targets:
g
j
= r + (x
0
)
>
w
j
, j = 1, ..., m .
By instantiating the gradient expression in Equation 9.15 with z = g
j
and taking
the average over the m sample targets, we obtain the update rule
w
i
w
i
1
m
m
j=1
r
w
i
i
g
j
(x)
>
w
i
,
which is more explicitly
w
i
w
i
+
1
m
m
j=1
i
{g
j
< (x)
>
w
i
}

quantile TD error
(x). (9.16)
Note that, by plugging g
j
into the expression for the gradient (Equation 9.15),
we obtain a semi-gradient update rule. That is, analogous to the value-based
case, Equation 9.16 is not equivalent to to the gradient update
w
i
w
i
1
m
m
j=1
r
w
i
i
r + (x
0
)
>
w
j
(x)
>
w
i
,
because in general
r
w
i
i
r + (x
0
)
>
w
i
(x)
>
w
i
6=
i
{g
i
< (x)
>
w
i
}
(x). 288 Chapter 9
Linear CTD.
To derive a linear approximation of categorical temporal-
difference learning, we represent the probabilities of categorical distributions
using linear combinations of features. Speciﬁcally, we apply the softmax func-
tion to transform the parameters (
(x)
>
w
i
)
m
i=1
into a probability distribution. We
write
p
i
(x; w)=
e
(x)
>
w
i
m
j=1
e
(x)
>
w
j
. (9.17)
Recall that the probabilities
p
i
(x; w))
m
i=1
correspond to m locations (
i
)
m
i=1
. The
softmax transformation guarantees that the expression
w
(x)=
m
i=1
p
i
(x; w)
i
describes a bona ﬁde probability distribution. We thus construct the sample
target ¯(x):
¯(x)=
C
(b
r,
)
#
w
(x
0
)=
C
m
i=1
p
i
(x
0
; w)
r+✓
i
=
m
i=1
¯
p
i
i
,
where
¯
p
i
denotes the probability assigned to the location
i
by the sample target
¯
(x). Expressed in terms of the CTD coefﬁcients (Equation 6.10), the probability
¯
p
i
is
¯
p
i
=
m
j=1
i,j
(r)p
j
(x
0
; w).
As with quantile regression, we adjust the weights w by means of a gradient
descent procedure. Here, we use the cross-entropy loss between
w
(x) and
¯(x):
73
L(w)=–
m
i=1
¯
p
i
log p
i
(x; w). (9.18)
Combined with the softmax function, Equation 9.18 becomes
L(w)=–
m
i=1
¯
p
i
(x)
>
w
i
log
m
j=1
e
(x)
>
w
j
.
73.
The choice of cross-entropy loss is justiﬁed because it is the matching loss for the softmax
function, and their combination results in a convex objective (Auer et al. 1995; see also Bishop
2006, Section 4.3.6). Linear Function Approximation 289
With some algebra and again invoking the chain rule, we obtain that the gradient
with respect to the weights w
i
is
r
w
i
L(w)=–
¯
p
i
p
i
(x; w)
(x).
By adjusting the weights in the opposite direction of this gradient, this results
in the update rule
w
i
w
i
+
¯
p
i
p
i
(x; w)

CTD error
(x). (9.19)
While interesting in their own right, linear CTD and QTD are particularly impor-
tant in that they can be straightforwardly adapted to learn return-distribution
functions with nonlinear function approximation schemes such as deep neural
networks; we return to this point in Chapter 10. For now, it is worth noting that
the update rules of linear TD, linear QTD, and linear CTD can all be expressed
as
w
i
w
i
+ ↵"
i
(x),
where
"
i
is an error term. One interesting difference is that for both linear
CTD and linear QTD,
"
i
lies in the interval [–1, 1], while for linear TD it is
unbounded, giving evidence that we should expect different learning dynamics
from these algorithms. In addition, combining linear TD or linear QTD to a
tabular state representation recovers the corresponding incremental algorithms
from Chapter 6. For linear CTD, the update corresponds to a tabular representa-
tion of the softmax parameters rather than the probabilities themselves, and the
correspondence is not as straightforward.
Analysing linear QTD and CTD is complicated by the fact that the return
functions themselves are not linear in w
1
,
...
, w
m
. One solution is to relax the
requirement that the approximation
w
(x) be a probability distribution; as we
will see in the next section, in this case the distributional approximation behaves
much like the value function approximation and a theoretical guarantee can be
obtained.
9.6 An Algorithm Based On Signed Distributions*
So far, we made sure that the outputs of our distributional algorithms were
valid probability distributions (or could be as interpreted as such, e.g. when
working with statistical functionals). This was done explicitly when using the
softmax parametrisation in deﬁning linear CTD, and implicitly in the mixture
update rule of CTD in Chapter 3. In this section we consider an algorithm that
is similar to linear CTD but omits the softmax function. As a consequence of
this change, this modiﬁed algorithm’s outputs are signed distributions, which
290 Chapter 9
we brieﬂy encountered in Chapter 6 in the course of analysing categorical
temporal-difference learning.
Compared to linear CTD, this approach has the advantage of being both closer
to the tabular algorithm (it ﬁnds a best ﬁt in
`
2
distance, like tabular CTD) and
closer to linear value function approximation (making it amenable to analysis).
Although the learned predictions lack some aspects of probability distributions
such as well-deﬁned quantiles the learned signed distributions can be used
to estimate expectations of functions, including expected values.
To begin, let us deﬁne a (ﬁnite) signed distribution
as a weighted sum of two
probability distributions:
=
1
1
+
2
2
1
,
2
2R ,
1
,
2
2P(R) . (9.20)
We write
M
(
R
) for the space of ﬁnite signed distributions. For
2M
(
R
)
decomposed as in Equation 9.20, we deﬁne its cumulative distribution function
F
and the expectation of a function f : R !R as
F
(z)=
1
F
1
(z)+
2
F
2
(z), z 2R
E
Z
[f (Z)] =
1
E
Z
1
[f (Z)] +
2
E
Z
2
[f (Z)] . (9.21)
Exercise 9.14 asks you to verify that these deﬁnitions are independent of the
decomposition of
into a sum of probability distributions. The total mass of
is given by
()=
1
+
2
.
We make these deﬁnitions explicit because signed distributions are not proba-
bility distributions; in particular, we cannot draw samples from
. In that sense,
the notation Z
in the deﬁnition of expectation is technically incorrect, but
we use it here for convenience.
Deﬁnition 9.10.
The signed m-categorical representation parametrises the
mass of m particles at ﬁxed locations (
i
)
m
i=1
:
F
S,m
=
m
i=1
p
i
i
: p
i
2R for i = 1, ..., m,
m
i=1
p
i
=1
. 4
Compared to the usual m-categorical representation, its signed analogue adds
a degree of freedom: it allows the mass of its particles to be negative and of
magnitude greater than 1 (we reserve “probability” for values strictly in [0, 1]).
However, in our deﬁnition we still require that signed m-categorical distributions
have unit total mass; as we will see, this avoids a number of technical difﬁculties. Linear Function Approximation 291
Exercise 9.15 asks you to verify that
2F
S,m
is a signed distribution in the
sense of Equation 9.20.
Recall from Section 5.6 that the categorical projection
C
:
P
(
R
)
!F
C,m
is
deﬁned in terms of the triangular and half-triangular kernels h
i
:
R !
[0, 1], i =
1,
...
, m. We use Equation 9.21 to extend this projection to signed distributions,
written
C
:
M
(
R
)
!F
S,m
. Given
2M
(
R
), the masses of
C
=
m
i=1
p
i
i
are given by
p
i
= E
Z
h
i
&
–1
m
(Z
i
)

,
where as before
&
–1
m
=
i+1
i
(i < m) and we write
E
Z
in the sense of Equation
9.21. We also extend the notation to signed return-distribution functions in the
usual way. Observe that if
is a probability distribution, then
C
matches our
earlier deﬁnition.
The distributional Bellman operator, too, can be extended to signed distribu-
tions. Let
2M
(
R
)
X
be a signed return function. We deﬁne
T
:
M
(
R
)
X
!
M (R)
X
:
(T
)(x)=E
[(b
R,
)
#
(X
0
)|X = x].
This is the same equation as before, except that now the operator constructs
convex combinations of signed distributions.
Deﬁnition 9.11.
Given a state representation
:
X!R
n
and evenly-spaced
particle locations
1
,
,
m
,asigned linear return function approximation
w
2F
X
S,m
is parametrised by a weight matrix w
2R
nm
and maps states to
signed return function estimates according to
w
(x)=
m
i=1
p
i
(x; w)
i
, p
i
(x; w)=(x)
>
w
i
+
1
m
1–
m
j=1
(x)
>
w
j
, (9.22)
where w
i
is the i
th
column of w. We denote the space of signed return functions
that can be represented in this form by F
,S,m
. 4
Equation 9.22 can be understood as approximating the mass of each particle
linearly, and then adding mass to all particles uniformly to normalise the signed
distribution to have unit total mass. It deﬁnes a subset of signed m-categorical
distributions that are constructed from linear combinations of features. Because
of the normalisation, and unlike the space of linear value function approxima-
tions constructed from
,
F
,S,m
is not a linear subspace of
M
(
R
). However,
for each x 2X the mapping
w 7!
w
(x)
292 Chapter 9
is said to be afﬁne, a property that is sufﬁcient to permit theoretical analysis
(see Remark 9.3).
Using a linearly parametrised representation of the form given in Equation 9.22,
we seek to build a distributional dynamic programming algorithm based on the
signed categorical representation. The complication now is that the distributional
Bellman operator takes the return function approximation away from our linear
parametrisation in two separate ways: ﬁrst, the distributions themselves may
move away from the categorical representation, and second, the distributions
may no longer be representable by a linear combination of features. We address
these issues with a doubly-projected distributional operator:
,,`
2
C
T
,
where the outer projection ﬁnds the best approximation to
C
T
in
F
,S,m
. By
analogy with the value-based setting, we deﬁne “best approximation” in terms
of a -weighted Cramér distance, denoted `
,2
:
`
2
2
(,
0
)=
R
(F
(z)–F
0
(z))
2
dz ,
`
2
,2
(,
0
)=
x2X
(x)`
2
2
(x),
0
(x)
,
,,`
2
= arg min
0
2F
,S,m
`
2
,2
(,
0
).
Because we are now dealing with signed distributions, the Cramér distance
`
2
2
(
,
0
) is inﬁnite if the two input signed distributions
,
0
do not have the
same total mass. This justiﬁes our restriction to signed distributions with unit
total mass in deﬁning both the signed m-categorical representation and the linear
approximation. The following lemma shows that the distributional Bellman
operator preserves this property (see Exercise 9.16).
Lemma 9.12.
Suppose that
2M
(
R
)
X
is a signed return-distribution function.
Then
(T
)(x)
=
x
0
2X
P
(X
0
= x
0
| X = x)
(x
0
)
.
In particular, if all distributions of
have unit total mass, i.e.
(
(x)) = 1 for all
x, then
(T
)(x)
=1. 4
Linear Function Approximation 293
While it is possible to derive algorithms that are not restricted to outputting
signed distributions with unit total mass, one must then deal with added com-
plexity due to the distributional Bellman operator moving total mass from state
to state.
We next derive a semi-gradient update rule based on the doubly-projected
Bellman operator. Following the unbiased estimation method for designing
incremental algorithms (Section 6.2), we construct the signed target
C
˜(x)=
C
(b
r,
)
#
w
(x
0
).
Compared to the sample target of Equation 6.9, here
w
is a signed return
function in F
X
S,m
.
i
to minimise the
`
2
distance between
C
˜
(x) and the predicted distribution
w
(x). It does so by
taking a step in direction of the negative gradient of the squared `
2
distance
74
w
i
w
i
r
w
i
`
2
2
(
w
(x),
C
˜(x)) .
We obtain an implementable version of this update rule by expressing distri-
butions in
F
S,m
in terms of m-dimensional vectors of masses. For a signed
categorical distribution
=
m
i=1
p
i
i
,
let us denote by p
2R
m
the vector (p
1
,
, p
m
); similarly, denote by p
0
the vector
of masses for a signed distribution
0
. Because the cumulative functions of
signed categorical distributions with unit total mass are constant on the intervals
[
i
,
i+1
] and equal outside of the [
1
,
m
] interval, we can write
`
2
2
(,
0
)=
R
F
(z)–F
0
(z)
2
dz
= &
m
m–1
i=1
F
(
i
)–F
0
(
i
)
2
= &
m
kCp Cp
0
k
2
2
, (9.23)
74. In this expression, ˜(x) is used as a target estimate and treated as constant with regards to w. 294 Chapter 9
where
k·k
2
is the usual Euclidean norm and C
2R
mm
is the lower-triangular
matrix
C =
10··· 00
11··· 00
.
.
.
.
.
.
.
.
.
11··· 10
11··· 11
.
Letting p(x) and
˜
p
(x) denote the vector of masses for the signed approximation
w
(x) and the signed target
C
˜
(x), respectively, we rewrite the above in terms
of matrix-vector operations (Exercise 9.17 asks you to derive the gradient of
`
2
2
with respect to w
i
):
w
i
w
i
+ ↵&
m
(
˜
p(x)–p(x))
>
C
>
C
˜
e
i
(x) , (9.24)
where
˜
e
i
2R
m
is a vector whose entries are
˜
e
ij
=
{i = j}
1
m
.
By precomputing the vectors C
>
C
˜
e
i
, this update rule can be applied in O(m
2
+
mn) operations per sample transition.
9.7 Convergence of the Signed Algorithm*
We now turn our attention to establishing a contraction result for the doubly-
projected Bellman operator, by analogy with Theorem 9.8. We write
M
`
2
,1
(R)={
1
1
+
2
2
:
1
,
2
2R,
1
+
2
= 1,
1
,
2
2P
`
2
(R)} ,
for ﬁnite signed distributions with unit total mass and ﬁnite Cramér distance
from one another. In particular, note that F
S,m
M
`
2
,1
(R).
Theorem 9.13.
Suppose Assumption 4.29(1) holds and that Assump-
tion 9.5 holds with unique stationary distribution
. Then the projected
operator
,
,`
2
C
T
:
M
`
2
,1
(
R
)
X
!M
`
2
,1
(
R
)
X
is a contraction with
respect to the metric `
,2
, with contraction modulus
1
/2
. 4
This result is established by analysing separately each of the three operators
that composed together constitute the doubly-projected operator
,
,`
2
C
T
.
Lemma 9.14.
Under the assumptions of Theorem 9.13,
T
:
M
`
2
,1
(
R
)
!
M
`
2
,1
(R) is a
1
/2
-contraction with respect to `
,2
. 4 Linear Function Approximation 295
Lemma 9.15.
The categorical projection operator
C
:
M
`
2
,1
(
R
)
!F
S,m
is a
non-expansion with respect to `
2
. 4
Lemma 9.16.
Under the assumptions of Theorem 9.13, the function approx-
imation projection operator
,
,`
2
:
F
X
S,m
!F
X
S,m
is a non-expansion with
respect to `
,2
. 4
The proof of Lemma 9.14 essentially combines the reasoning of Proposi-
tion 4.20 and Lemma 9.6, and so is left as Exercise 9.18. Similarly, the proof of
Lemma 9.15 follows according to exactly the same calculations as in the proof
of Lemma 5.23, and so it is left as Exercise 9.19. The central observation is
that the corresponding arguments made for the Cramér distance in Chapter 5
under the assumption of probability distributions do not actually make use of
the monotonicity of the cumulative distribution functions, and so extend to the
signed distributions under consideration here.
Proof of Lemma 9.16.
Let
,
0
2F
X
S,m
and write p(x), p
0
(x) for the vector of
masses of (x) and
0
(x), respectively. From Equation 9.23, we have
`
2
,2
(,
0
)=
x2X
(x)`
2
2
((x),
0
(x))
=
x2X
(x)&
m
kCp(x)–Cp
0
(x)k
2
2
= &
m
kp p
0
k
2
C
>
C
,
where
C
>
C is a positive-semideﬁnite matrix in
R
(Xm)(Xm)
, with
=
diag
(
)
2R
XX
, and
p
,
p
0
2R
Xm
are the vectorised probabilities associated
with
,
0
. Therefore,
,
,`
2
can be interpreted as a Euclidean projection
under the norm
k·k
C
>
C
, and hence is a non-expansion with respect to the
corresponding norm; the result follows as this norm precisely induces the metric
`
,2
.
Proof of Theorem 9.13.
We use similar logic to that of Lemma 5.21 to com-
bine the individual contraction results established above, taking care that the
operators to be composed have several different domains between them. Let
,
0
2M
`
2
,1
(R)
X
. Then note that
`
,2
(
,
,`
2
C
T
,
,
,`
2
C
T
0
)
(a)
`
,2
(
C
T
,
C
T
0
)
(b)
`
,2
(T
, T
0
) 296 Chapter 9
(c)
1
/2
`
,2
(,
0
),
as required. Here, (a) follows from Lemma 9.16, since
C
T
,
C
T
0
2
F
X
S,m
. (b) follows from Lemma 9.15 and the straightforward corollary that
`
,2
(
C
,
C
0
)
`
,2
(
,
0
) for
,
0
2M
`
2
,1
(
R
). Finally, (c) follows from
Lemma 9.14.
The categorical projection is a useful computational device as it allows
,,`
2
to be implemented strictly in terms of signed m-categorical representations,
and we rely on it in our analysis. Mathematically, however, it is not strictly
necessary as it is implied by the projection onto the
F
,S,m
; this is demonstrated
by the following Pythagorean lemma.
Lemma 9.17.
For any signed m-categorical distribution
2F
S,m
(for which
=
C
) and any signed distribution
0
2M
`
2
,1
(R),
`
2
2
(,
0
)=`
2
2
(,
C
0
)+`
2
2
(
C
0
,
0
) . (9.25)
Consequently, for any 2M
`
2
,1
(R),
,,`
2
C
T
=
,,`
2
T
. 4
Equation 9.25 is obtained by a similar derivation to the one given in Remark
5.4, which proves the identity when
and
0
are the usual m-categorical
distributions.
Just as in the case studied in Chapter 5 without function approximation, it is
now possible to establish a guarantee on the quality of the ﬁxed point of the
operator
,
,`
2
C
T
.
Proposition 9.18.
Suppose that the conditions of Theorem 9.13 hold, and
let
ˆ
S
be the resulting ﬁxed point of the projected operator
,
,`
2
C
T
.
We have the following guarantee on the quality of
ˆ
S
compared with the
(`
2,
, F
,S,m
)-optimal approximation of
, namely
,
,`
2
:
`
,2
(
, ˆ
S
)
`
,2
(
,
,
,`
2
)
p
1–
. 4
Proof. We calculate directly:
`
2
,2
(
, ˆ
S
)
(a)
= `
2
,2
(
,
,
,`
2
C
)+`
2
,2
(
,
,`
2
C
,
,
,`
2
C
ˆ
S
)
(b)
= `
2
,2
(
,
,
,`
2
C
)+`
2
,2
(
,
,`
2
C
T
,
,
,`
2
C
T
ˆ
S
) Linear Function Approximation 297
1.0 0.5 0.0 0.5 1.0
Return
0
2
4
6
8
10
Density
1.0 0.5 0.0 0.5 1.0
Return
0.2
0.0
0.2
0.4
Mass
1.0 0.5 0.0 0.5 1.0
Return
0.00
0.05
0.10
0.15
0.20
Mass
1.0 0.5 0.0 0.5 1.0
Return
0.0
0.1
0.2
0.3
0.4
0.5
Probability
(a) (b)
(c) (d)
Figure 9.3
Signed linear approximations of the return distribution of the initial state in Example 2.9.
(a–b)
Ground truth return-distribution function and categorical Monte Carlo approxima-
tion, for reference. (
c
) Best approximation from
F
,S,m
based on the state representation
(x) = [1, x
r
, r
c
] where x
r
, x
c
are the row and column indices of a given state, with (0, 0)
denoting the top-left corner. (d) Fixed point of the signed categorical algorithm.
(c)
`
2
,2
(
,
,
,`
2
C
)+`
2
,2
(
, ˆ
S
)
=) `
2
,2
(
, ˆ
S
)
1
1–
`
2
,2
(
,
,
,`
2
C
).
where (a) follows from the Pythagorean identity in Lemma 9.17 and a similar
identity concerning
,
,`
2
, (b) follows since
is ﬁxed by
T
and
ˆ
S
is ﬁxed
by
,
,`
2
C
T
, and (c) follows from
1
/2
-contractivity of
,
,`
2
C
T
in
`
2,
.
This result provides a quantitative guarantee on how much the approximation
error compounds when
ˆ
S
is computed by approximate dynamic programming.
Of note, here there are two sources of error: one due to the use of a ﬁnite
number m of distributional parameters, and another due to the use of function
approximation.
Example 9.19.
Consider linearly approximating the return-distribution func-
tion of the safe policy in the Cliffs domain (Example 2.9). If we use a
3-dimensional state representation
(x) = [1, x
r
, r
c
] where x
r
, x
c
are the row and
column indices of a given state, then we observe aliasing in the approximated
return distribution (Figure 9.3). In addition, the use of a signed distribution
298 Chapter 9
representation results in negative mass being assigned to some locations in
the optimal approximation. This is by design; given the limited capacity of
the approximation, the algorithms ﬁnd a solution that mitigates errors at some
locations by introducing negative mass at other locations. As usual, the use of a
bootstrapping procedure introduces diffusion error, here quite signiﬁcant due to
the low-dimensional state representation. 4
9.8 Technical Remarks
Remark 9.1.
For ﬁnite action spaces, we can easily convert a state represen-
tation
(x) to a state-action representation
(x, a) by repeating a basic feature
matrix
2R
Xn
. Let N
A
be the number of actions. We build a block-diagonal
feature matrix
X,A
2R
(XA)(nN
A
)
that contains N
A
copies of :
X,A
:=
0 ··· 0
0 ··· 0
.
.
.
.
.
.
.
.
.
.
.
.
00···
.
The weight vector w is also extended to be of dimension nN
A
, so that
Q
w
(x, a)=
X,A
w
(x, a)
as before. This is equivalent to, but somewhat more verbose than the use of
per-action weight vectors. 4
Remark 9.2.
Assumption 9.5 enabled us to demonstrate that the projected
Bellman operator
,
T
has a unique ﬁxed point, by invoking Banach’s ﬁxed
point theorem. The ﬁrst part of the assumption, on the uniqueness of
, is
relatively mild and is only used to simplify the exposition. However, if there is
a state x for which
(x) = 0, then
k·k
,2
does not deﬁne a proper metric on
R
n
and Banach’s theorem cannot be used. In this case, there might be multiple
ﬁxed points (see Exercise 9.7).
to assign zero probability to some states, the norm
k·k
,2
may not be a very interesting measure of accuracy. One common
situation where the issue arises is when there is a terminal state x
?
which is
reached with probability 1, in which case
lim
t!1
P
(X
t
= x
?
)=1.
It is easy to see that in this case
puts all of its probability mass on x
?
, so that
Theorem 9.8 becomes vacuous: the norm
k·k
,2
only measures the error at
the terminal state. A more interesting distribution to consider is the distribution Linear Function Approximation 299
of states resulting from immediately resetting to an initial state when x
?
is
reached. In particular, this corresponds to the distribution used in many practical
applications. Let
0
be the initial state distribution; without loss of generality,
let us assume that
0
(x
?
) = 0. Deﬁne the substochastic transition operator
P
X,?
(x
0
| x, a)=
0 if x
0
= x
?
,
P
X
(x
0
| x, a) otherwise.
In addition, deﬁne a transition operator that replaces transitions to the terminal
state by transition to one of the initial states, according to the initial distribution
0
:
P
X,
0
(x
0
| x, a)=P
X,?
(x
0
| x, a)+
{x
0
6= x
?
}
P
X
(x
?
| x, a)
0
(x
0
).
One can show that the Bellman operator T
?
(deﬁned from P
X,?
) satisﬁes
T
?
V = T
V
for any V
2R
X
for which V(x
?
) = 0, and that the steady-state distribution
?
induced by P
X,
0
and a policy is such that
there exists t 2N, P
(X
t
= x)>0 =)
?
(x)>0.
Let P
?
be the transition matrix corresponding to
P
X,?
(x
0
| x, a) and the policy
. Exercise 9.21 asks you to prove that
kP
?
k
?
,2
1,
from which a modiﬁed version of Theorem 9.8 can be obtained. 4
Remark 9.3.
Let M and M
0
be vector spaces. A mapping
O
: M
!
M
0
is said
to be afﬁne if, for any U, U
0
2M and 2[0, 1],
O(U + (1 )U
0
)=OU + (1 )OU
0
.
When
w
is a signed linear return function approximation parametrised by w,
the map
w 7!
w
(x)
is afﬁne for each x
2X
, as we now show. It is this property that allows us to
express the
`
2
distance between
,
0
2F
S,m
in terms of a difference of vectors
of probabilities (Equation 9.23), needed in the proof of Lemma 9.16.
Let w, w
0
2R
nm
be the parameters of signed return functions of the form of
Equation 9.22. For these, write p
w
(x) for the vector of masses determined by w:
p
w
(x)=w
>
(x)+
1
m
e
1–e
>
w
>
(x)
, 300 Chapter 9
where e is the m-dimensional vector of ones. We then have
m
i=1
(x)
>
w
i
= e
>
w
>
(x),
and similarly for w
0
. Hence
p
w+(1–)w
0
(x)=(w + (1 )w
0
)
>
(x)–
1
m
ee
>
(w + (1 )w
0
)
>
(x)
= p
w
(x) + (1 )p
w
0
(x).
We conclude that w 7!
w
(x) is indeed afﬁne in w. 4
9.9 Bibliographical Remarks
9.1–9.2.
Linear value function approximation as described in this book is in
effect a special case of linear regression where the inputs are taken from a ﬁnite
set (
(x))
x2X
and noiseless labels; see Strang , Bishop , Murphy
 for a discussion on linear regression. Early in the history of reinforce-
ment learning research, function approximation was provided by connectionist
systems [Barto et al., 1983, 1995]; and used to deal with inﬁnite state spaces
[Boyan and Moore, 1995, Sutton, 1996]. An earlier-still form of linear function
approximation and temporal-difference learning was used by Samuel  to
train a strong player for the game of checkers.
9.3.
Bertsekas and Tsitsiklis [1996, Chapter 6] studies linear value function
approximation and its combination with various reinforcement learning meth-
ods, including TD [see also Bertsekas, 2011, 2012]. Tsitsiklis and Van Roy
 establishes the contractive nature of the projected Bellman operator
,
T
under the steady-state distribution (Theorem 9.8 in this book). The
temporal-difference ﬁxed point can be determined directly by solving a least
squares problem, yielding the least-squares TD algorithm [LSTD; Bradtke and
Barto, 1996], which is extended to the control setting by least-squares policy
iteration [LSPI; Lagoudakis and Parr, 2003]. In the control setting, the method
is also called ﬁtted Q-iteration [Gordon, 1995, Ernst et al., 2005, Riedmiller,
2005].
Bertsekas  gives an early demonstration that the TD ﬁxed point is in
general different (and has greater error than) the best approximation to V
,
measured in
-weighted L
2
norm. However, it is possible to interpret the
temporal-difference ﬁxed point as the solution to an oblique projection problem
that minimises errors between consecutive states [Harmon and Baird, 1996,
Scherrer, 2010].
Linear Function Approximation 301
9.4.
Barnard  provides a proof that temporal-difference learning is not
Barto, 2018]. The situation in which
differs from the steady-state (or sampling)
distribution of the induced Markov chain is called off-policy learning; Example
9.9 is a simpliﬁed version of the original counterexample due to Baird .
Baird argued for the direct minimisation of the Bellman residual by gradient
descent as a replacement to temporal-difference learning, but this method suffers
from other issues, and can produce undesirable solutions even in mild scenarios
[Sutton et al., 2008a]. The GTD line of work [Sutton et al., 2009, Maei, 2011]
is a direct attempt at handling the issue by using a pair of approximations (see
Qu et al.  for applications of this idea in a distributional context); more
recent work directly considers a corrected version of the Bellman residual [Dai
et al., 2018, Chen and Jiang, 2019]. The convergence of temporal difference
learning with linear function approximation was proven under fairly general
conditions by Tsitsiklis and Van Roy , using the ODE method from
stochastic approximation theory [Ljung, 1977, Benveniste et al., 2012, Kushner
Parr et al. [2007, 2008], Sutton et al. [2008b] provide a domain-dependent
analysis of linear value function approximation, which is extended by Ghosh
and Bellemare  to establish the existence of representations that are
stable under a greater array of conditions (Exercises 9.10 and 9.11). Kolter
 studies the space of temporal-difference ﬁxed point as a function of the
distribution .
9.5.
Linear CTD and QTD were implicitly introduced by Bellemare et al.
[2017a] and Dabney et al. [2018b], respectively, in the design of deep rein-
forcement learning agents (see Chapter 10). Their presentation as given here is
new.
9.6–9.7.
The algorithm based on signed distributions is new to this book. It
improves on an algorithm proposed by Bellemare et al. [2019b] in that it adjusts
the total mass of return distributions to always be one. Lyle et al.  give
evidence that the original algorithm generally underperforms the categorical
algorithm. They further establish that, in the risk-neutral control setting, distri-
butional algorithms cannot do better than value-based algorithms when using
linear approximation. The reader interested in the theory of signed measures is
referred to Doob .
302 Chapter 9
9.10 Exercises
Exercise 9.1.
Use the update rules of Equation 9.10 and 9.11 to prove the
results from Example 9.1. 4
Exercise 9.2.
Prove that for a linear value function approximation V
w
=
w,
we have, for any w, w
0
2R
n
and , 2R,
V
w+w
0
= V
w
+ V
w
0
, r
w
V
w
(x)=(x). 4
Exercise 9.3. In the statement of Proposition 9.3 we required that
(i) The columns of the feature matrix be linearly independent;
(ii) (x) > 0 for all x 2X.
Explain how the result is affected when either of these requirements is omitted.
4
Exercise 9.4.
Prove Lemma 9.7. Hint. Apply Pythagoras’s theorem to a well-
chosen inner product. 4
Exercise 9.5.
The purpose of this exercise is to empirically study the con-
tractive and expansive properties of the projection in L
2
norm. Consider an
integer-valued state space x
2X
= {1,
...
, 10} with two-dimensional state rep-
resentation
(x) = (1, x), and write
for the L
2
projection of a vector v
2R
X
onto the linear subspace deﬁned by . That is,
v = (
>
)
–1
>
v,
where is the feature matrix.
With this representation, we can represent the vector u(x) = 0 exactly, and hence
u = u. Now consider the vector u
0
deﬁned by
u
0
(x) = log x.
With a numerical experiment, show that
k
u
0
uk
2
ku
0
u k
2
but k
u
0
uk
1
> ku
0
u k
1
. 4
Exercise 9.6.
Provide an example Markov decision process and state represen-
tation that result in the left- and right-hand sides of Equation 9.6 being equal.
Hint. A diagram might prove useful. 4
Exercise 9.7.
Following Remark 9.2, suppose that the steady-state distribution
is such that
(x) = 0 for some state x. Discuss the implications on the analysis
Linear Function Approximation 303
performed in this chapter, in particular on the set of solutions to Equation 9.8
and the behaviour of semi-gradient temporal-difference learning when source
states are drawn from this distribution. 4
Exercise 9.8.
Let
0
be some initial distribution and
a policy. Deﬁne the
discounted state-visitation distribution
¯
(x
0
) = (1 )
0
(x
0
)+
x2X
P
(X
0
= x
0
| X = x)
¯
(x), for all x
0
2X.
Following the line of reasoning leading to Lemma 9.6, show that the projected
Bellman operator
,
T
is a
1
/2
contraction in the
¯
-weighted L
2
metric:
kT
k
¯
,2
1
/2
. 4
Exercise 9.9.
Let
2P
(
X
). Suppose that the feature matrix
2R
Xn
has
linearly independent columns and
(x) > 0 for all x
2X
. Show that the unique
optimal weight vector w
that is a solution to Equation 9.2 satisﬁes
E
G
(X)–(X)
>
w
(X)
= 0, X . 4
Exercise 9.10
(*)
.
This exercise studies the divergence of semi-gradient
temporal-difference learning from a dynamical systems perspective. Recall
that 2R
XX
is the diagonal matrix whose entries are (x).
(i)
Show that in expectation and in matrix form, Equation 9.11 produces the
sequence of weight vectors (w
k
)
k0
given by
w
k+1
= w
k
+
k
>
r
+ P
w
k
w
k
).
(ii) Assume that
k
= > 0. Express the above as an update of the form
w
k+1
= Aw
k
+ b, (9.26)
where A 2R
nn
and b 2R
n
.
(iii)
Suppose that the matrix C =
>
(
P
I)
has eigenvalues that are all
real. Show that if one of these is positive, then A has at least one eigenvalue
greater than 1.
(iv)
Use the preceding result to conclude that under those conditions, there exists
a w
0
such that kw
k
k
2
!1.
(v)
Suppose now that all of the matrix Cs eigenvalues are real and nonpositive.
Show that in this case, there exists an
2
(0, 1) such that taking
k
=
, the
sequence (w
k
)
k0
converges to the temporal-difference ﬁxed point. 4
304 Chapter 9
Exercise 9.11 (*). Suppose that the state representation is such that
(i) the matrix P
2R
Xn
has columns that lie in the column span of , and
(ii)
the matrix has columns that are orthogonal with regards to the
-weighted
inner product. That is,
>
⌅ = I.
Show that in this case, the eigenvalues of the matrix
>
(P
I)
are all non-positive, for any choice of
2P
(
X
). Based on the previous exercise,
conclude on the dynamical properties of semi-gradient temporal-difference
learning with this representation. 4
Exercise 9.12.
Using your favourite numerical computation software, imple-
ment the Markov decision process from Baird’s counterexample (Example 9.9)
w
k+1
= w
k
+
(x
0
k
)w
k
(x
k
)w
k
(x
k
),
for the state representation of the example and a small, constant value of
.
Here, it is assumed that X
k
has distribution
and X
0
k
is drawn from the transition
kernel depicted in Figure 9.2.
(i)
Vary
(x) from 0 to 1, and plot the norm of the value function estimate,
kV
k
k
,2
, as a function of k.
(ii)
Now replace
(X
0
k
)w
k
by
(X
0
k
)
˜
w
k
, where
˜
w
k
= w
kk mod L
for some integer
L
1. That is,
˜
w
k
is kept ﬁxed for L iterations. Plot the evolution of
k
V
k
k
,2
for different values of L. What do you observe? 4
Exercise 9.13.
Repeat Exercise 3.3, replacing the uniform grid encoding the
state with a representation deﬁned by
(x)=
1, sin(x
1
), cos(x
1
), , sin(x
4
), cos(x
4
)
,
where (x
1
, x
2
) denote the Cart-Pole state variables. Implement linear CTD and
QTD and use these with the state representation
to learn a return-distribution
function approximation for the uniform and forward-leaning policies. Visualise
the learned approximations at selected states, including the initial state, and com-
pare them to ground-truth return distributions estimated by the non-parametric
Monte Carlo algorithm (Remark 3.1). 4
Linear Function Approximation 305
Exercise 9.14.
Recall that given a ﬁnite signed distribution
expressed as
a sum of probability distributions
=
1
1
+
2
2
(with
1
,
2
2R
,
1
,
2
2
P
(
R
)), we deﬁned expectations under
in terms of sums of expectations
under
1
and
2
. Verify that this deﬁnition is not dependent on the choice of
decomposition of
into a sum of probability distributions. That is, suppose that
there also exist
0
1
,
0
2
2R
and
0
1
,
0
2
2P
(
R
) with
=
0
1
0
1
+
0
2
0
2
. Show that
for any (measurable) function f : R !R with
E
Z
1
[|f (Z)|] < 1, E
Z
2
[|f (Z)|] < 1, E
Z
0
1
[|f (Z)|] < 1, E
Z
0
2
[|f (Z)|] < 1,
we have
1
E
Z
1
[f (Z)] +
2
E
X
2
[f (Z)] =
1
E
Z
0
1
[f (Z)] +
2
E
Z
0
2
[f (Z)] . 4
Exercise 9.15.
Show that any m-categorical signed distribution
2F
S,m
can
be written as the weighted sum of two m-categorical (probability) distributions
1
,
2
2F
C,m
. 4
Exercise 9.16.
Suppose that we consider a return function
2M
(
R
)
X
deﬁned
over signed distributions (not necessarily with unit total mass). Show that
(T
)(x)
=
x
0
2X
P
(X
0
= x | X = x)((x
0
)) .
Conclude that if 2F
X
S,m
, then
(T
)(x)
=1. 4
Exercise 9.17.
Consider two signed m-categorical distributions
,
0
2F
S,m
.
Denote their respective vectors of masses by p, p
0
2R
m
. Prove the correctness
of Equation 9.24. That is, show that
r
w
i
`
2
2
(,
0
) = –2&
m
(p
0
p)
>
C
>
C
˜
e
i
(x). 4
Exercise 9.18. Prove Lemma 9.14. 4
Exercise 9.19. Prove Lemma 9.15. 4
Exercise 9.20. Prove Lemma 9.17. 4
Exercise 9.21. Following the discussion in Remark 9.2, show that
kP
?
k
?
,2
1. 4