6 Incremental Algorithms
The concept of experience is central to reinforcement learning. Methods such
as TD and categorical TD learning iteratively update predictions on the basis of
transitions experienced by interacting with an environment. Such incremental
algorithms are applicable in a wide range of scenarios, including those in
which no model of the environment is known, or in which the model is too
complex to allow for dynamic programming methods to be applied. Incremental
algorithms are also often easier to implement. For these reasons, they are key in
the application of reinforcement learning to many real-world domains.
With this ease of use, however, comes an added complication. In contrast to
dynamic programming algorithms, which steadily make progress towards the
desired goal, there is no guarantee that incremental methods will generate con-
sistently improving estimates of return distributions from iteration to iteration.
For example, an unusually high reward in a sampled transition may actually
lead to a short-term degrading of the value estimate for the corresponding state.
In practice, this requires making sufficiently small steps with each update, to
average out such variations. In theory, the stochastic nature of incremental
updates makes their analysis substantially more involved than the contraction
mapping arguments used for dynamic programming.
This chapter takes a closer look at the behaviour and design of incremental
algorithms, distributional and otherwise. Using the language of operators and
probability distribution representations, we also formalise what it means for an
incremental algorithm to perform well, and discuss how to analyse its asymptotic
convergence to the desired estimate.
6.1 Computation and Statistical Estimation
Iterative policy evaluation finds an approximation to the value function $V^\pi$ by successively computing the iterates
$$V_{k+1} = T^\pi V_k \,, \qquad (6.1)$$
defined by an arbitrary initial value function estimate $V_0 \in \mathbb{R}^{\mathcal{X}}$. We can also think of temporal-difference learning as computing an approximation to the value function $V^\pi$, albeit with a different mode of operation. To begin, recall from Chapter 3 the TD learning update rule:
$$V(x) \leftarrow (1-\alpha)V(x) + \alpha\big(r + \gamma V(x')\big) \,. \qquad (6.2)$$
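To make the mechanics of Equation 6.2 concrete, here is a minimal sketch of a tabular TD update in Python; the state indexing, step size, and transition tuple are illustrative assumptions rather than part of the text.

```python
import numpy as np

def td_update(V, x, r, x_next, alpha, gamma):
    """Apply the TD update of Equation 6.2 to a tabular value estimate.

    V        : 1-D array of value estimates, indexed by state
    x, x_next: integer indices of the source and successor states
    r        : sampled reward; alpha: step size; gamma: discount factor
    """
    target = r + gamma * V[x_next]              # sample target
    V[x] = (1 - alpha) * V[x] + alpha * target  # mixture update
    return V

# Illustrative usage on a 3-state chain (hypothetical transition).
V = np.zeros(3)
V = td_update(V, x=0, r=1.0, x_next=1, alpha=0.1, gamma=0.9)
```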
One of the aims of this chapter is to study the long-term behaviour of the value
function estimate V (and eventually, of estimates produced by incremental,
distributional algorithms).
At the heart of our analysis is the behaviour of a single update. That is, for a fixed $V \in \mathbb{R}^{\mathcal{X}}$ we may understand the learning dynamics of temporal-difference learning by considering the random value function estimate $V'$ defined via the sample transition model $(X = x, A, R, X')$:
$$V'(x) = (1-\alpha)V(x) + \alpha\big(R + \gamma V(X')\big) \,, \qquad (6.3)$$
$$V'(y) = V(y), \quad y \neq x \,.$$
There is a close connection between the expected effect of the update given by Equation 6.3 and iterative policy evaluation. Specifically, the expected value of the quantity $R + \gamma V(X')$ precisely corresponds to the application of the Bellman operator to $V$, evaluated at the source state $x$:
$$\mathbb{E}_\pi\big[R + \gamma V(X') \mid X = x\big] = (T^\pi V)(x) \,.$$
Consequently, in expectation TD learning adjusts its value function estimate at $x$ in the direction given by the Bellman operator:
$$\mathbb{E}_\pi\big[V'(x) \mid X = x\big] = (1-\alpha)V(x) + \alpha (T^\pi V)(x) \,. \qquad (6.4)$$
To argue that temporal-difference learning correctly finds an approximation to $V^\pi$, we must also be able to account for the random nature of TD updates. An effective approach is to rewrite Equation 6.3 as the sum of an expected target and a mean-zero noise term:
$$V'(x) = (1-\alpha)V(x) + \alpha\Big(\underbrace{(T^\pi V)(x)}_{\text{expected target}} + \underbrace{R + \gamma V(X') - (T^\pi V)(x)}_{\text{noise}}\Big) \,; \qquad (6.5)$$
with this decomposition, we may simultaneously analyse the mean dynamics of TD learning as well as the effect of the noise on the value function estimates. In
the second half of the chapter, we will use Equation 6.5 to establish that under appropriate conditions, these dynamics can be controlled so as to guarantee the convergence of temporal-difference learning to $V^\pi$, and analogously the convergence of categorical temporal-difference learning to the fixed point $\hat\eta_C$.
6.2 From Operators To Incremental Algorithms
As illustrated in the preceding section, we can explain the behaviour of temporal-difference learning (an incremental algorithm) by relating it to the Bellman operator. New incremental algorithms can also be obtained by following the reverse process: deriving an update rule from a given operator. This technique is particularly effective in distributional reinforcement learning, where one often needs to implement incremental counterparts to a variety of dynamic programming algorithms. To describe how one might derive an update rule from an operator, we now introduce an abstract framework based on what is known as stochastic approximation theory.^{45}
Let us assume that we are given a contractive operator $\mathcal{O}$ over some state-indexed quantity, and that we are interested in determining the fixed point $U^*$ of this operator. With dynamic programming methods, we obtain an approximation of $U^*$ by computing the iterates
$$U_{k+1}(x) = (\mathcal{O} U_k)(x) \,, \quad \text{for all } x \in \mathcal{X} \,.$$
To construct a corresponding incremental algorithm, we must first identify what information is available at each update; this constitutes our sampling model. For example, in the case of temporal-difference learning, this is the sample transition model $(X, A, R, X')$. For Monte Carlo algorithms, the sampling model is the random trajectory $(X_t, A_t, R_t)_{t \geq 0}$ (see Exercise 6.1). In the context of this chapter, we assume that the sampling model takes the form $(X, Y)$, where $X$ is the source state to be updated, and $Y$ comprises all other information in the model, which we term the sample experience.
Given a step size $\alpha$ and realisations $x$ and $y$ of the source state variable $X$ and sample experience $Y$, respectively, we consider incremental algorithms whose update rule can be expressed as
$$U(x) \leftarrow (1-\alpha)U(x) + \alpha \hat{\mathcal{O}}(U, x, y) \,. \qquad (6.6)$$
Here, $\hat{\mathcal{O}}(U, x, y)$ is a sample target which may depend on the current estimate $U$. Typically, the particular setting we are in also imposes some limitation on
45. Our treatment of incremental algorithms and their relation to stochastic approximation theory is far from exhaustive; the interested reader is invited to consult the bibliographical remarks.
the form of $\hat{\mathcal{O}}$. For example, when $\mathcal{O}$ is the Bellman operator $T^\pi$, although $\hat{\mathcal{O}}(U, x, y) = (\mathcal{O}U)(x)$ is a valid instantiation of Equation 6.6, its implementation might require knowledge of the environment's transition kernel and reward function. Implicit within Equation 6.6 is the notion that the space that the estimate $U$ occupies supports a mixing operation; this will indeed be the case for the algorithms we consider in this chapter, which work either with finite-dimensional parameter sets, or probability distributions themselves.
With this framework in mind, the question is what makes a sensible choice for $\hat{\mathcal{O}}$.

Unbiased update. An important case is when the sample target $\hat{\mathcal{O}}$ can be chosen so that in expectation it corresponds to the application of the operator $\mathcal{O}$:
$$\mathbb{E}\big[\hat{\mathcal{O}}(U, X, Y) \mid X = x\big] = (\mathcal{O}U)(x) \,. \qquad (6.7)$$
In general, when Equation 6.7 holds the resulting incremental algorithm is also well-behaved. More formally, we will see that under reasonable conditions, the estimates produced by such an algorithm are guaranteed to converge to $U^*$, a generalisation of our earlier statement that temporal-difference learning converges to $V^\pi$.
Conversely, when the operator $\mathcal{O}$ can be expressed as an expectation over some function of $U$, $X$, and $Y$, then it is possible to derive a sample target simply by substituting the random variables involved with their realisations. In effect, we then use the sample experience to construct an unbiased estimate of $(\mathcal{O}U)(x)$. As a concrete example, the TD target, expressed in terms of random variables, is
$$\hat{\mathcal{O}}(V, X, Y) = R + \gamma V(X') \,;$$
the corresponding update rule is
$$V(x) \leftarrow (1-\alpha)V(x) + \alpha \underbrace{\big(r + \gamma V(x')\big)}_{\text{sample target}} \,.$$
In the next section, we will show how to use this approach to derive categorical
temporal-difference learning (introduced in Chapter 3) from the categorical-
projected Bellman operator.
Example 6.1. The consistent Bellman operator is an operator over state-action value functions based on the idea of making consistent choices at each state. At a high level, the consistent operator adds the constraint that actions that leave the state unchanged should be repeated. This operator is formalised as
$$T_{\mathrm{C}} Q(x, a) = \mathbb{E}\Big[ R + \gamma \max_{a' \in \mathcal{A}} Q(X', a')\, \mathbb{1}\{X' \neq x\} + \gamma Q(x, a)\, \mathbb{1}\{X' = x\} \,\Big|\, X = x, A = a \Big] \,.$$
Let $(x, a, r, x')$ be drawn according to the sample transition model. The update rule derived by substitution is
$$Q(x, a) \leftarrow \begin{cases} (1-\alpha)Q(x, a) + \alpha\big(r + \gamma \max_{a' \in \mathcal{A}} Q(x', a')\big) & \text{if } x' \neq x \\ (1-\alpha)Q(x, a) + \alpha\big(r + \gamma Q(x, a)\big) & \text{otherwise.} \end{cases}$$
Compared to Q-learning (Section 3.7), the consistent update rule increases the action gap at each state, in the sense that its operator's fixed point $Q^*_{\mathrm{C}}$ has the property that for all $(x, a) \in \mathcal{X} \times \mathcal{A}$,
$$\max_{a' \in \mathcal{A}} Q^*_{\mathrm{C}}(x, a') - Q^*_{\mathrm{C}}(x, a) \;\geq\; \max_{a' \in \mathcal{A}} Q^*(x, a') - Q^*(x, a) \,,$$
with strict inequality whenever $P_{\mathcal{X}}(x \mid x, a) > 0$. $\triangle$
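As a quick illustration, the following sketch implements the substitution-based update from Example 6.1 for a tabular action-value estimate; the array layout and argument names are assumptions made for the example.

```python
import numpy as np

def consistent_q_update(Q, x, a, r, x_next, alpha, gamma):
    """One consistent-Bellman update obtained by substituting realisations.

    Q is a (num_states, num_actions) array of action-value estimates.
    """
    if x_next != x:
        target = r + gamma * np.max(Q[x_next])   # usual Q-learning target
    else:
        target = r + gamma * Q[x, a]             # self-transition: repeat the action
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q
```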
A general principle. Sometimes, expressing the operator $\mathcal{O}$ in the form of Equation 6.7 requires information that is not available to our sampling model. In this case, it is sometimes still possible to construct an update rule whose repeated application approximates the operator $\mathcal{O}$. More precisely, given a fixed estimate $\tilde{U}$, with this approach we look for a sample target function $\hat{\mathcal{O}}$ such that from a suitable initial condition, repeated updates of the form
$$U(x) \leftarrow (1-\alpha)U(x) + \alpha \hat{\mathcal{O}}(\tilde{U}, x, y)$$
lead to $U \approx \mathcal{O}\tilde{U}$. In this case, a necessary condition for $\hat{\mathcal{O}}$ to be a suitable sample target is that it should leave the fixed point $U^*$ unchanged, in expectation:
$$\mathbb{E}\big[\hat{\mathcal{O}}(U^*, X, Y) \mid X = x\big] = U^*(x) \,.$$
In Section 6.4 we will introduce quantile temporal-difference learning, an algorithm that applies this principle to find the fixed point of the quantile-projected Bellman operator.
6.3 Categorical Temporal-Difference Learning
Categorical dynamic programming (CDP) computes a sequence $(\eta_k)_{k \geq 0}$ of return-distribution functions, defined by iteratively applying the projected distributional Bellman operator $\Pi_C \mathcal{T}^\pi$ to an initial return-distribution function $\eta_0$:
$$\eta_{k+1} = \Pi_C \mathcal{T}^\pi \eta_k \,.$$
As we established in Section 5.9, the sequence generated by CDP converges to the fixed point $\hat\eta_C$. Let us express this fixed point in terms of a collection of probabilities $\big((p_i(x))_{i=1}^m : x \in \mathcal{X}\big)$ associated with $m$ particles located at $\theta_1, \ldots, \theta_m$:
$$\hat\eta_C(x) = \sum_{i=1}^m p_i(x)\, \delta_{\theta_i} \,.$$
To derive an incremental algorithm from the categorical-projected Bellman operator, let us begin by expressing the projected distributional operator $\Pi_C \mathcal{T}^\pi$ in terms of an expectation over the sample transition $(X = x, A, R, X')$:
$$(\Pi_C \mathcal{T}^\pi \eta)(x) = \Pi_C\, \mathbb{E}_\pi\big[ (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big] \,. \qquad (6.8)$$
Following the line of reasoning from Section 6.2, in order to construct an unbiased sample target by substituting $R$ and $X'$ with their realisations, we need to rewrite Equation 6.8 with the expectation outside of the projection $\Pi_C$. The following establishes the validity of exchanging the order of these two operations.
Proposition 6.2. Let $\eta \in \mathscr{F}^{\mathcal{X}}_{C,m}$ be a return function based on the $m$-categorical representation. Then for each state $x \in \mathcal{X}$,
$$(\Pi_C \mathcal{T}^\pi \eta)(x) = \mathbb{E}_\pi\big[ \Pi_C (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big] \,. \quad\triangle$$
Proposition 6.2 establishes that the projected operator $\Pi_C \mathcal{T}^\pi$ can be written in such a way that the substitution of random variables with their realisations can be performed. Consequently, we deduce that the random sample target
$$\hat{\mathcal{O}}\big(\eta, x, (R, X')\big) = \Pi_C (\mathrm{b}_{R,\gamma})_\# \eta(X')$$
provides an unbiased estimate of $(\Pi_C \mathcal{T}^\pi \eta)(x)$. For a given realisation $(x, a, r, x')$ of the sample transition, this leads to the update rule^{46}
$$\eta(x) \leftarrow (1-\alpha)\eta(x) + \alpha \underbrace{\Pi_C (\mathrm{b}_{r,\gamma})_\# \eta(x')}_{\text{sample target}} \,. \qquad (6.9)$$
The last part of the CTD derivation is to express Equation 6.9 in terms of the actual parameters being updated. These parameters are the probabilities $\big((p_i(x))_{i=1}^m : x \in \mathcal{X}\big)$ of the return-distribution function estimate $\eta$:
$$\eta(x) = \sum_{i=1}^m p_i(x)\, \delta_{\theta_i} \,.$$
The sample target in Equation 6.9 is given by the pushforward transformation of an $m$-categorical distribution ($\eta(x')$), followed by a categorical projection. As we demonstrated in Section 3.6, the projection of such a transformed distribution
46. Although the action $a$ is not needed to construct the sample target, we include it for consistency.
can be expressed concisely from a set of coefficients $\big(\zeta_{i,j}(r) : i, j \in \{1, \ldots, m\}\big)$. In terms of the triangular and half-triangular kernels $(h_i)_{i=1}^m$ that define the categorical projection (Section 5.6), these coefficients are
$$\zeta_{i,j}(r) = h_i\big(\varsigma_m^{-1}(r + \gamma\theta_j - \theta_i)\big) \,. \qquad (6.10)$$
With these coefficients the update rule over the probability parameters is
$$p_i(x) \leftarrow (1-\alpha)p_i(x) + \alpha \sum_{j=1}^m \zeta_{i,j}(r)\, p_j(x') \,.$$
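The sketch below implements this probability update in Python. The triangular-kernel coefficients are realised through the familiar "split mass between the two neighbouring particles" form of the categorical projection, under the assumption of a uniformly spaced particle grid; the function and variable names are illustrative rather than taken from the text.

```python
import numpy as np

def ctd_update(p, x, r, x_next, theta, alpha, gamma):
    """Categorical TD update of the probabilities p[x, :] (Equations 6.9-6.10).

    p     : (num_states, m) array of particle probabilities
    theta : (m,) array of fixed particle locations theta_1 < ... < theta_m
    """
    m = len(theta)
    stride = theta[1] - theta[0]            # particle gap (uniform grid assumed)
    target_p = np.zeros(m)
    for j in range(m):
        g = np.clip(r + gamma * theta[j], theta[0], theta[-1])  # pushed-forward particle
        # Linear interpolation between neighbouring locations, i.e. the
        # triangular-kernel projection coefficients.
        pos = (g - theta[0]) / stride
        lo, frac = int(np.floor(pos)), pos - np.floor(pos)
        target_p[lo] += p[x_next, j] * (1 - frac)
        if lo + 1 < m:
            target_p[lo + 1] += p[x_next, j] * frac
    p[x] = (1 - alpha) * p[x] + alpha * target_p
    return p
```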
Our derivation illustrates how substituting random variables for their realisations directly leads to an incremental algorithm, provided we have the right operator to begin with. In many situations this is simpler than the step-by-step process that we originally followed in Chapter 3. Because the random sample target is an unbiased estimate of the projected Bellman operator, it is also simpler to prove its convergence to the fixed point $\hat\eta_C$; in the second half of this chapter, we will in fact apply the same technique to analyse both temporal-difference learning and CTD.
Proof of Proposition 6.2. For a given $r \in \mathbb{R}$, $x' \in \mathcal{X}$, let us write
$$\tilde\eta(r, x') = (\mathrm{b}_{r,\gamma})_\# \eta(x') \,.$$
Fix $x \in \mathcal{X}$. For conciseness, let us define, for $z \in \mathbb{R}$,
$$\tilde h_i(z) = h_i\big(\varsigma_m^{-1}(z - \theta_i)\big) \,.$$
With this notation, we have
$$\begin{aligned}
\mathbb{E}_\pi\big[ \Pi_C (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big]
&\overset{(a)}{=} \mathbb{E}_\pi\Big[ \sum_{i=1}^m \delta_{\theta_i}\, \mathbb{E}_{Z \sim \tilde\eta(R,X')}\big[\tilde h_i(Z)\big] \,\Big|\, X = x \Big] \\
&= \sum_{i=1}^m \delta_{\theta_i}\, \mathbb{E}_\pi\Big[ \mathbb{E}_{Z \sim \tilde\eta(R,X')}\big[\tilde h_i(Z)\big] \,\Big|\, X = x \Big] \\
&\overset{(b)}{=} \sum_{i=1}^m \delta_{\theta_i}\, \mathbb{E}_{Z' \sim (\mathcal{T}^\pi\eta)(x)}\big[\tilde h_i(Z')\big] \\
&= \Pi_C(\mathcal{T}^\pi\eta)(x) \,,
\end{aligned}$$
where (a) follows by definition of the categorical projection in terms of the triangular and half-triangular kernels $(h_i)_{i=1}^m$, and (b) follows by noting that if the conditional distribution of $R + \gamma G(X')$ (where $G$ is an instantiation of $\eta$ independent of the sample transition $(x, A, R, X')$) given $R, X'$ is $\tilde\eta(R, X') = (\mathrm{b}_{R,\gamma})_\# \eta(X')$, then the unconditional distribution of $R + \gamma G(X')$ when $X = x$ is $(\mathcal{T}^\pi\eta)(x)$.
6.4 Quantile Temporal-Difference Learning
Quantile regression is a method for determining the quantiles of a probability distribution incrementally and from samples.^{47} In this section, we develop an algorithm that aims to find the fixed point $\hat\eta_Q$ of the quantile-projected Bellman operator $\Pi_Q \mathcal{T}^\pi$ via quantile regression.
To begin, suppose that given $\tau \in (0, 1)$ we are interested in estimating the $\tau$th quantile of a distribution $\nu$, corresponding to $F_\nu^{-1}(\tau)$. Quantile regression maintains an estimate $\theta$ of this quantile. Given a sample $z$ drawn from $\nu$, it adjusts $\theta$ according to
$$\theta \leftarrow \theta + \alpha\big(\tau - \mathbb{1}\{z < \theta\}\big) \,. \qquad (6.11)$$
One can show that quantile regression follows the negative gradient of the quantile loss^{48}
$$L_\tau(\theta) = \big(\tau - \mathbb{1}\{z < \theta\}\big)(z - \theta) = \big|\tau - \mathbb{1}\{z < \theta\}\big|\, |z - \theta| \,. \qquad (6.12)$$
In Equation 6.12, the term $|\tau - \mathbb{1}\{z < \theta\}|$ is an asymmetric step size which is either $\tau$ or $1 - \tau$, according to whether the sample $z$ is greater or smaller than $\theta$, respectively. When $\tau < 0.5$, samples greater than $\theta$ have a lesser effect on it than samples smaller than $\theta$; the effect is reversed when $\tau > 0.5$. The update rule in Equation 6.11 will continue to adjust the estimate until the equilibrium point $\theta^*$ is reached (Exercise 6.4 asks you to visualise the behaviour of quantile regression with different distributions). This equilibrium point is the location at which smaller and larger samples have an equal effect in expectation. At that point, letting $Z \sim \nu$, we have
$$0 = \mathbb{E}\big[\tau - \mathbb{1}\{Z < \theta^*\}\big] = \tau - \mathbb{E}\big[\mathbb{1}\{Z < \theta^*\}\big] = \tau - \mathbb{P}(Z < \theta^*)$$
$$\implies \mathbb{P}(Z < \theta^*) = \tau \implies \theta^* = F_\nu^{-1}(\tau) \,. \qquad (6.13)$$
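For instance, a minimal simulation of the update in Equation 6.11 (with an assumed Gaussian sample stream and a constant step size, both arbitrary choices) shows the estimate settling near the desired quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, alpha = 0.9, 0.01
theta = 0.0                               # quantile estimate
for _ in range(50_000):
    z = rng.normal()                      # sample from nu = N(0, 1)
    theta += alpha * (tau - (z < theta))  # Equation 6.11
print(theta)  # should be close to the 0.9 quantile of N(0, 1), about 1.28
```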
47. More precisely, quantile regression is the problem of estimating a predetermined set of quantiles of a collection of probability distributions. By extension, in this book, we also use 'quantile regression' to refer to the incremental method that solves this problem.
48. More precisely, Equation 6.11 updates $\theta$ in the direction of the negative gradient of $L_\tau$ provided that $\mathbb{P}_{Z \sim \nu}(Z = \theta) = 0$. This holds trivially if $\nu$ is a continuous probability distribution.
For ease of exposition, in the final line of Equation 6.13 we assumed that there is a unique $z \in \mathbb{R}$ for which $F_\nu(z) = \tau$; Remark 6.1 discusses the general case.

Now, let us consider applying quantile regression to find an $m$-quantile approximation to the return-distribution function (ideally, the fixed point $\hat\eta_Q$). Recall that an $m$-quantile return-distribution function $\eta \in \mathscr{F}^{\mathcal{X}}_{Q,m}$ is parametrised by the locations $\big((\theta_i(x))_{i=1}^m : x \in \mathcal{X}\big)$:
$$\eta(x) = \frac{1}{m}\sum_{i=1}^m \delta_{\theta_i(x)} \,.$$
Now, the quantile projection $\Pi_Q \nu$ of a probability distribution $\nu$ is given by
$$\Pi_Q \nu = \frac{1}{m}\sum_{i=1}^m \delta_{F_\nu^{-1}(\tau_i)} \,, \qquad \tau_i = \frac{2i-1}{2m} \quad \text{for } i = 1, \ldots, m \,.$$
Given a source state $x \in \mathcal{X}$, the general idea is to perform quantile regression for all location parameters $(\theta_i)_{i=1}^m$ simultaneously, using the quantile levels $(\tau_i)_{i=1}^m$ and samples drawn from $(\mathcal{T}^\pi\eta)(x)$. To this end, let us momentarily introduce a random variable $J$ uniformly distributed on $\{1, \ldots, m\}$. By Proposition 4.11, we have
$$\mathcal{D}_\pi\big( R + \gamma\theta_J(X') \mid X = x \big) = (\mathcal{T}^\pi\eta)(x) \,. \qquad (6.14)$$
Given a realised transition $(x, a, r, x')$, we may therefore construct $m$ sample targets $\big(r + \gamma\theta_j(x')\big)_{j=1}^m$. Applying Equation 6.11 to these targets leads to the update rule
$$\theta_i(x) \leftarrow \theta_i(x) + \frac{\alpha}{m}\sum_{j=1}^m \Big(\tau_i - \mathbb{1}\big\{ r + \gamma\theta_j(x') < \theta_i(x) \big\}\Big) \,, \quad i = 1, \ldots, m \,. \qquad (6.15)$$
This is the quantile temporal-difference learning (QTD) algorithm. A concrete instantiation in the online case is summarised by Algorithm 6.1, by analogy with the presentation of categorical temporal-difference learning in Algorithm 3.4. Note that applying Equation 6.15 requires computing a total of $m^2$ terms per update; when $m$ is large, an alternative is to instead use a single term from the sum in Equation 6.15, with $j$ sampled uniformly at random. Interestingly enough, for $m$ sufficiently small the per-step cost of QTD is less than the cost of sorting the full distribution $(\mathcal{T}^\pi\eta)(x)$ (which has up to $N_{\mathcal{X}} N_{\mathcal{R}} m$ particles), suggesting that the quantile regression approach to the projection step may be useful even in the context of distributional dynamic programming.

The use of quantile regression to derive QTD can be seen as an instance of the principle introduced at the end of Section 6.2.
Algorithm 6.1: Online quantile temporal-difference learning

Algorithm parameters: step size $\alpha \in (0, 1]$, policy $\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A})$, number of quantiles $m$, initial locations $\big((\theta^0_i(x))_{i=1}^m : x \in \mathcal{X}\big)$

$\theta_i(x) \leftarrow \theta^0_i(x)$ for $i = 1, \ldots, m$, $x \in \mathcal{X}$
$\tau_i \leftarrow \frac{2i-1}{2m}$ for $i = 1, \ldots, m$
Loop for each episode:
  Observe initial state $x_0$
  Loop for $t = 0, 1, \ldots$:
    Draw $a_t$ from $\pi(\cdot \mid x_t)$
    Take action $a_t$, observe $r_t$, $x_{t+1}$
    for $i = 1, \ldots, m$ do
      $\theta'_i \leftarrow \theta_i(x_t)$
      for $j = 1, \ldots, m$ do
        if $x_{t+1}$ is terminal then
          $g \leftarrow r_t$
        else
          $g \leftarrow r_t + \gamma\theta_j(x_{t+1})$
        $\theta'_i \leftarrow \theta'_i + \frac{\alpha}{m}\big(\tau_i - \mathbb{1}\{g < \theta_i(x_t)\}\big)$
      end for
    end for
    for $i = 1, \ldots, m$ do
      $\theta_i(x_t) \leftarrow \theta'_i$
    end for
  until $x_{t+1}$ is terminal
end
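A compact NumPy sketch of the per-transition update in Algorithm 6.1 follows; the array shapes and the terminal-state handling are assumptions made for illustration, not part of the text.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, terminal, alpha, gamma):
    """One QTD update (Equation 6.15) of the quantile locations theta[x, :].

    theta : (num_states, m) array of quantile locations
    """
    m = theta.shape[1]
    tau = (2 * np.arange(1, m + 1) - 1) / (2 * m)       # quantile levels
    if terminal:
        targets = np.full(m, r)                          # g = r for every j
    else:
        targets = r + gamma * theta[x_next]              # g = r + gamma * theta_j(x')
    # For each location i, average the quantile-regression signals over the m targets.
    indicator = (targets[None, :] < theta[x, :, None])   # (m, m): 1{g_j < theta_i(x)}
    theta[x] = theta[x] + (alpha / m) * np.sum(tau[:, None] - indicator, axis=1)
    return theta
```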
Suppose that we consider an initial return function
$$\eta_0(x) = \frac{1}{m}\sum_{j=1}^m \delta_{\theta^0_j(x)} \,.$$
If we substitute the sample target in Equation 6.15 by one constructed from this initial return function, we obtain the update rule
$$\theta_i(x) \leftarrow \theta_i(x) + \frac{\alpha}{m}\sum_{j=1}^m \Big(\tau_i - \mathbb{1}\big\{ r + \gamma\theta^0_j(x') < \theta_i(x) \big\}\Big) \,, \quad i = 1, \ldots, m \,. \qquad (6.16)$$
By inspection, we see that Equation 6.16 corresponds to quantile regression applied to the problem of determining, for each state $x \in \mathcal{X}$, the quantiles of the distribution $(\mathcal{T}^\pi\eta_0)(x)$. Consequently, one may think of quantile temporal-difference learning as performing an update that would converge to the quantiles of the target distribution, if that distribution were held fixed.
Based on this observation, we can verify that QTD is a reasonable distributional reinforcement learning algorithm by considering its behaviour at the fixed point
$$\hat\eta_Q = \Pi_Q \mathcal{T}^\pi \hat\eta_Q \,,$$
the solution found by quantile dynamic programming. Let us denote the parameters of this return function by $\hat\theta_i(x)$, for $i = 1, \ldots, m$ and $x \in \mathcal{X}$. For a given state $x$, consider the intermediate target
$$\tilde\eta(x) = (\mathcal{T}^\pi \hat\eta_Q)(x) \,.$$
Now by definition of the quantile projection operator, we have
$$\hat\theta_i(x) = F^{-1}_{\tilde\eta(x)}\Big(\frac{2i-1}{2m}\Big) \,.$$
However, by Equation 6.13, we also know that the quantile regression update rule applied at $\hat\theta_i(x)$ with $\tau_i = \frac{2i-1}{2m}$ leaves the parameter unchanged in expectation. In other words, the collection of locations $\big(\hat\theta_i(x)\big)_{i=1}^m$ is a fixed point of the expected quantile regression update, and consequently the return function $\hat\eta_Q$ is a solution of the quantile temporal-difference learning algorithm. This gives some intuition that it is indeed a valid learning rule for distributional reinforcement learning with quantile representations.
Before concluding, it is useful to illustrate why the straightforward approach taken to derive categorical temporal-difference learning, based on unbiased operator estimation, cannot be applied to the quantile setting. Recall that the quantile-projected operator takes the form
$$(\Pi_Q \mathcal{T}^\pi \eta)(x) = \Pi_Q\, \mathbb{E}_\pi\big[ (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big] \,. \qquad (6.17)$$
As the following example shows, exchanging the expectation and projection results in a different operator, one whose fixed point is not $\hat\eta_Q$. Consequently, we cannot substitute random variables for their realisations, as was done in the categorical setting.
Example 6.3. Consider an MDP with a single state $x$, a single action $a$, transition dynamics such that $x$ transitions back to itself, and immediate reward distribution $\mathcal{N}(0, 1)$. Given $\eta(x) = \delta_0$, we have $(\mathcal{T}^\pi\eta)(x) = \mathcal{N}(0, 1)$, and hence the projection via $\Pi_Q$ onto $\mathscr{F}_{Q,m}$ with $m = 1$ returns a Dirac delta on the median of this distribution: $\delta_0$.

In contrast, the sample target $(\mathrm{b}_{R,\gamma})_\# \eta(X')$ is $\delta_R$, and so the projection of this target via $\Pi_Q$ remains $\delta_R$. We therefore have
$$\mathbb{E}_\pi\big[ \Pi_Q\big((\mathrm{b}_{R,\gamma})_\# \eta\big)(X') \mid X = x \big] = \mathbb{E}_\pi[\delta_R \mid X = x] = \mathcal{N}(0, 1) \,,$$
which is distinct from the result of the projected operator, $(\Pi_Q \mathcal{T}^\pi \eta)(x) = \delta_0$. $\triangle$
6.5 An Algorithmic Template for Theoretical Analysis
In the second half of this chapter, we present a theoretical analysis of a class of incremental algorithms that includes the incremental Monte Carlo algorithm (see Exercise 6.9), temporal-difference learning, and the CTD algorithm. This analysis builds on the contraction mapping theory developed in Chapter 4, but also accounts for the randomness introduced by the use of sample targets in the update rule, via stochastic approximation theory. Compared to the analysis of dynamic programming algorithms, the main technical challenge lies in characterising the effect of this randomness on the learning process.

To begin, let us view the output of the temporal-difference learning algorithm after $k$ updates as a value function estimate $V_k$. Extending the discussion from Section 6.1, this estimate is a random quantity because it depends on the particular sample transitions observed by the agent, and possibly on the randomness in the agent's choices.^{49} We are particularly interested in the sequence of random estimates $(V_k)_{k \geq 0}$. From an initial estimate $V_0$, this sequence is formally defined as
$$V_{k+1}(X_k) = (1-\alpha_k)V_k(X_k) + \alpha_k\big(R_k + \gamma V_k(X'_k)\big)$$
$$V_{k+1}(x) = V_k(x) \quad \text{if } x \neq X_k \,,$$
where $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is the sequence of random transitions used to calculate the TD updates. In our analysis, the object of interest is the limiting point of this sequence, and we seek to answer the question: does the algorithm's estimate converge to the value function $V^\pi$? We consider the limiting point because any single update may or may not improve the accuracy of the estimate $V_k$ at the source state $X_k$. We will show that, under the right conditions, the sequence $(V_k)_{k \geq 0}$ converges to $V^\pi$. That is,
$$\lim_{k \to \infty} \big| V_k(x) - V^\pi(x) \big| = 0 \,, \quad \text{for all } x \in \mathcal{X} \,.$$
49. In this context, we even allow the step size $\alpha_k$ to be random.
More precisely, the above holds with probability 1: with overwhelming odds, the variables $X_0, R_0, X'_0, X_1, \ldots$ are drawn in such a way that $V_k \to V^\pi$.^{50}
We will prove a more general result that holds for a family of incremental algorithms whose sequence of estimates can be expressed by the template
$$U_{k+1}(X_k) = (1-\alpha_k)U_k(X_k) + \alpha_k \hat{\mathcal{O}}(U_k, X_k, Y_k)$$
$$U_{k+1}(x) = U_k(x) \quad \text{if } x \neq X_k \,. \qquad (6.18)$$
Here, $X_k$ is the (possibly random) source state at time $k$, $\hat{\mathcal{O}}(U_k, X_k, Y_k)$ is the sample target, and $\alpha_k$ is an (also possibly random) step size. As in Section 6.2, the sample experience $Y_k$ describes the collection of random variables used to construct the sample target, for example a sample trajectory or a sample transition $(X_k, A_k, R_k, X'_k)$.

Under this template, the estimate $U_k$ describes the collection of variables maintained by the algorithm and constitutes its "prediction". More specifically, it is a state-indexed collection of $m$-dimensional real-valued vectors, written $U_k \in \mathbb{R}^{\mathcal{X} \times m}$. In the case of the TD algorithm, $m = 1$ and $U_k = V_k$.
We assume that there is an operator $\mathcal{O} : \mathbb{R}^{\mathcal{X} \times m} \to \mathbb{R}^{\mathcal{X} \times m}$ whose unique fixed point is the quantity to be estimated by the incremental algorithm. If we denote this fixed point by $U^*$, this implies that $\mathcal{O}U^* = U^*$.

We further assume the existence of a base norm $\|\cdot\|$ over $\mathbb{R}^m$, extended to the space of estimates according to
$$\|U\|_\infty = \sup_{x \in \mathcal{X}} \|U(x)\| \,,$$
such that $\mathcal{O}$ is a contraction mapping of modulus $\beta$ with respect to the metric induced by $\|\cdot\|_\infty$. For TD learning, $\mathcal{O} = T^\pi$ and the base norm is simply the absolute value; the contractivity of $T^\pi$ was established by Proposition 4.4.
Within this template, there is some freedom in how the source state $X_k$ is selected. Formally, $X_k$ is assumed to be drawn from a time-varying distribution $\nu_k$ that may depend on all previously observed random variables up to but excluding time $k$, as well as the initial estimate $U_0$. That is,
$$X_k \sim \nu_k\big( \cdot \mid X_{0:k-1}, Y_{0:k-1}, \alpha_{0:k-1}, U_0 \big) \,.$$
50. Put negatively, there may be realisations of $X_0, R_0, X'_0, X_1, \ldots$ for which the sequence $(V_k)_{k \geq 0}$ does not converge, but the set of such realisations has zero probability.
This includes scenarios in which source states are drawn from a fixed distribution $\nu \in \mathscr{P}(\mathcal{X})$, enumerated in a round-robin manner, or selected in proportion to the magnitude of preceding updates (called prioritised replay; see [Moore and Atkeson, 1993, Schaul et al., 2016]). It also accounts for the situation in which states are sequentially updated along a sampled trajectory, as is typical of online algorithms.
We further assume that the sample target is an unbiased estimate of the operator $\mathcal{O}$ applied to $U_k$, and evaluated at $X_k$. That is, for all $x \in \mathcal{X}$ for which $\mathbb{P}(X_k = x) > 0$,
$$\mathbb{E}\big[ \hat{\mathcal{O}}(U_k, X_k, Y_k) \mid X_{0:k-1}, Y_{0:k-1}, \alpha_{0:k-1}, U_0, X_k = x \big] = (\mathcal{O}U_k)(x) \,.$$
This implies that Equation 6.18 can be expressed in terms of a mean-zero noise $w_k$, similar to our derivation in Section 6.1:
$$U_{k+1}(X_k) = (1-\alpha_k)U_k(X_k) + \alpha_k\Big( (\mathcal{O}U_k)(X_k) + \underbrace{\hat{\mathcal{O}}(U_k, X_k, Y_k) - (\mathcal{O}U_k)(X_k)}_{w_k} \Big) \,.$$
Because $w_k$ is zero in expectation, this assumption guarantees that, on average, the incremental algorithm must make progress towards the fixed point $U^*$. That is, if we fix the source state $X_k = x$ and step size $\alpha_k$, then
$$\begin{aligned}
&\mathbb{E}\big[ U_{k+1}(x) \mid X_{0:k-1}, Y_{0:k-1}, \alpha_{0:k-1}, X_k = x, \alpha_k \big] \qquad (6.19) \\
&\qquad = (1-\alpha_k)U_k(x) + \alpha_k\, \mathbb{E}\big[ (\mathcal{O}U_k)(x) + w_k \mid X_k = x \big] \\
&\qquad = (1-\alpha_k)U_k(x) + \alpha_k (\mathcal{O}U_k)(x) \,.
\end{aligned}$$
By choosing an appropriate sequence of step sizes $(\alpha_k)_{k \geq 0}$ and under a few additional technical conditions, we can in fact provide the stronger guarantee that the sequence of iterates $(U_k)_{k \geq 0}$ converges to $U^*$ with probability 1, as the next section illustrates.
6.6 The Right Step Sizes
To understand the role of step sizes in the learning process, consider an abstract algorithm described by Equation 6.18 and for which
$$\hat{\mathcal{O}}(U_k, X_k, Y_k) = (\mathcal{O}U_k)(X_k) \,.$$
In this case, the noise term $w_k$ is always zero and can be ignored: the abstract algorithm adjusts its estimate directly towards $\mathcal{O}U_k$. Here we should take the step sizes $(\alpha_k)_{k \geq 0}$ to be large in order to make maximal progress towards $U^*$. For $\alpha_k = 1$, we obtain a kind of dynamic programming algorithm that updates its estimate one state at a time, and whose convergence to $U^*$ can be reasonably easily demonstrated; conversely, taking $\alpha_k < 1$ must in some sense slow down the learning process.
In general, however, the noise term is not zero and cannot be neglected. In this case, large step sizes amplify $w_k$ and prevent the algorithm from converging to $U^*$ (consider, in the extreme, what happens when $\alpha_k = 1$). A suitable choice of step size must therefore balance rapid learning progress and eventual convergence to the right solution.
To illustrate what "suitable choice" might mean in practice, let us distil the issue down to its essence and consider the process that estimates the mean of a distribution $\nu \in \mathscr{P}_1(\mathbb{R})$ according to the incremental update
$$V_{k+1} = (1-\alpha_k)V_k + \alpha_k R_k \,, \qquad (6.20)$$
where $(R_k)_{k \geq 0}$ are i.i.d. random variables distributed according to $\nu$. For concreteness, let us assume that $\nu = \mathcal{N}(0, 1)$, so that we would like $(V_k)_{k \geq 0}$ to converge to 0.

Suppose that the initial estimate is $V_0 = 0$ (the desired solution) and consider three step size schedules: $\alpha_k = 0.1$, $\alpha_k = \frac{1}{k+1}$, and $\alpha_k = \frac{1}{(k+1)^2}$. Figure 6.1 illustrates the sequences of estimates obtained by applying the incremental update with each of these schedules and a single, shared sequence of realisations of the random variables $(R_k)_{k \geq 0}$.
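A few lines of Python reproduce an experiment of this kind; the number of iterations and the random seed are arbitrary choices made for the sketch, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
num_steps = 1000
rewards = rng.normal(size=num_steps)          # shared sample stream from N(0, 1)

schedules = {
    "constant 0.1": lambda k: 0.1,
    "1/(k+1)":      lambda k: 1.0 / (k + 1),
    "1/(k+1)^2":    lambda k: 1.0 / (k + 1) ** 2,
}

for name, alpha in schedules.items():
    V = 0.0
    for k in range(num_steps):
        V = (1 - alpha(k)) * V + alpha(k) * rewards[k]   # Equation 6.20
    print(f"{name:>12}: final estimate {V:+.3f}")
```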
The $\frac{1}{k+1}$ schedule corresponds to the right step size schedule for the incremental Monte Carlo algorithm (Section 3.2), and accordingly we observe that it is converging to the correct expected value.^{51} By contrast, the constant schedule continues to exhibit variations over time, as the noise is not sufficiently averaged out. The quadratic schedule ($\frac{1}{(k+1)^2}$) decreases too quickly and the algorithm settles on an incorrect prediction.
To prove the convergence of algorithms that fit the template described in Section 6.5, we will require that the sequence of step sizes satisfies the Robbins-Monro conditions [Robbins and Monro, 1951]. These conditions formalise the range of step sizes that are neither too small nor too large and hence guarantee that the algorithm must eventually find the solution $U^*$. As with the source state $X_k$, the step size $\alpha_k$ at a given time $k$ may be random, and its distribution may depend on $X_k$, $X_{0:k-1}$, $\alpha_{0:k-1}$, and $Y_{0:k-1}$, but not on the sample experience $Y_k$. As in the previous section, these conditions should hold with probability 1.
51. In our example, $V_k$ is the average of $k$ i.i.d. normal random variables, and is itself normally distributed. Its standard deviation can be computed analytically and is equal to $\frac{1}{\sqrt{k}}$ ($k \geq 1$). This implies that after $k = 1000$ iterations, we expect $V_k$ to be in the range $\pm 3 \cdot \frac{1}{\sqrt{k}} = \pm 0.111$, because 99.7% of a normal random variable's probability is within three standard deviations of its mean. Compare with Figure 6.1.
Figure 6.1
The behaviour of a simple incremental update rule (Equation 6.20) for estimating the expected value of a normal distribution. Different curves represent the sequence of estimates obtained from different step size schedules. The ground truth ($V = 0$) is indicated by the dashed line.
Condition 1: not too small. In the example above, taking $\alpha_k = \frac{1}{(k+1)^2}$ results in premature convergence of the estimate (to the wrong solution). This is because when the step sizes decay too quickly, the updates made by the algorithm may not be of large enough magnitude to reach the fixed point of interest. To avoid this situation, we require that $(\alpha_k)_{k \geq 0}$ satisfy
$$\sum_{k \geq 0} \alpha_k\, \mathbb{1}\{X_k = x\} = \infty \,, \quad \text{for all } x \in \mathcal{X} \,.$$
Implicit in this assumption is also the idea that every state should be updated infinitely often. This assumption is violated, for example, if there is a state $x$ and a time $K$ after which $X_k \neq x$, for all $k \geq K$. For a reasonably well-behaved distribution of source states, this condition is satisfied for constant step sizes, including $\alpha_k = 1$: in the absence of noise, it is possible to make rapid progress towards the fixed point. On the other hand, it disallows $\alpha_k = \frac{1}{(k+1)^2}$, since
$$\sum_{k=0}^\infty \frac{1}{(k+1)^2} = \frac{\pi^2}{6} < \infty \,.$$
Condition 2: not too large. Figure 6.1 illustrates how, with a constant step size and in the presence of noise, the estimate $U_k(x)$ continues to vary substantially over time. To avoid this issue, the step size should be decreased so that individual updates result in progressively smaller changes in the estimate. To achieve this, the second requirement on the step size sequence $(\alpha_k)_{k \geq 0}$ is
$$\sum_{k \geq 0} \alpha_k^2\, \mathbb{1}\{X_k = x\} < \infty \,, \quad \text{for all } x \in \mathcal{X} \,.$$
In reinforcement learning, a simple step size schedule that satisfies both of these conditions is
$$\alpha_k = \frac{1}{N_k(X_k) + 1} \,, \qquad (6.21)$$
where $N_k(x)$ is the number of updates to a state $x$ up to but not including algorithm time $k$. We encountered this schedule in Section 3.2 when deriving the incremental Monte Carlo algorithm. As will be shown in the following sections, this schedule is also sufficient for the convergence of the TD and CTD algorithms.^{52} Exercise 6.7 asks you to verify that Equation 6.21 satisfies the Robbins-Monro conditions and investigates other step size sequences that also satisfy these conditions.
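In code, maintaining this schedule only requires a per-state visit counter; a minimal sketch (with hypothetical variable names and an arbitrary state space size) looks as follows.

```python
import numpy as np

num_states = 10                                   # illustrative state space size
visit_counts = np.zeros(num_states, dtype=int)    # N_k(x) for each state

def step_size(x):
    """Step size of Equation 6.21 for an update at state x."""
    alpha = 1.0 / (visit_counts[x] + 1)
    visit_counts[x] += 1                          # the update at x is now counted
    return alpha
```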
6.7 Overview of Convergence Analysis
Provided that an incremental algorithm satisfies the template laid out in Section 6.5, with a step size schedule that satisfies the Robbins-Monro conditions, we can prove that the sequence of estimates produced by this algorithm must converge to the fixed point $U^*$ of the implied operator $\mathcal{O}$. Before presenting the proof in detail, we illustrate the main bounding-box argument underlying the proof.

Let us consider a two-dimensional state space $\mathcal{X} = \{x_1, x_2\}$ and an incremental algorithm for estimating a one-dimensional quantity ($m = 1$). As per the template, we consider a contractive operator $\mathcal{O} : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ given by
$$\mathcal{O}U = \big(0.8\,U(x_2),\; 0.8\,U(x_2)\big) \,;$$
note that the fixed point of $\mathcal{O}$ is $U^* = (0, 0)$. At each time step, a source state ($x_1$ or $x_2$) is chosen uniformly at random and the corresponding estimate is updated. The step sizes are $\alpha_k = (k+1)^{-0.7}$, satisfying the Robbins-Monro conditions.
Suppose first that the sample target is noiseless. That is,
$$\hat{\mathcal{O}}(U_k, X_k, Y_k) = 0.8\,U_k(x_2) \,.$$
In this case, each iteration of the algorithm contracts along a particular coordinate. Figure 6.2(a) illustrates a sequence $(U_k)_{k \geq 0}$ defined by the update equations
$$U_{k+1}(X_k) = (1-\alpha_k)U_k(X_k) + \alpha_k \hat{\mathcal{O}}(U_k, X_k, Y_k)$$
$$U_{k+1}(x) = U_k(x) \,, \quad x \neq X_k \,.$$
52. Because the process of bootstrapping constructs sample targets that are not in general unbiased with regards to the value function $V^\pi$, the optimal step size schedule for TD learning decreases at a rate that is slower than $\frac{1}{k}$. See bibliographical remarks.
Figure 6.2
(a) Example behaviour of the iterates $(U_k)_{k \geq 0}$ in the noiseless case, plotted in the $(U(x_1), U(x_2))$ plane. The colour scheme indicates the iteration number from purple ($k = 0$) through to red ($k = 1000$). (b) Example behaviour of the iterates $(U_k)_{k \geq 0}$ in the general case. (c) Behaviour of iterates for 10 random seeds, with the noiseless (expected) behaviour overlaid.
As shown in the figure, the algorithm makes steady (if not direct) progress towards the fixed point with each update. To prove that $(U_k)_{k \geq 0}$ converges to $U^*$, we first show that the error $\|U_k - U^*\|_\infty$ is bounded by a fixed quantity for all $k \geq 0$ (indicated by the outermost dashed-line square around the fixed point $U^* = 0$ in Figure 6.2a). The argument then proceeds inductively: if $U_k$ lies within a given radius of the fixed point for all $k$ greater than some $K$, then there is some $K' \geq K$ for which, for all $k \geq K'$, it must lie within the next smallest dashed-line square. We will see that this follows by contractivity of $\mathcal{O}$ and the first Robbins-Monro condition. Provided that the diameter of these squares shrinks to zero, then this establishes convergence of $U_k$ to $U^*$.
Now consider adding noise to the sample target, such that
$$\hat{\mathcal{O}}(U_k, X_k, Y_k) = 0.8\,U_k(x_2) + w_k \,.$$
For concreteness, let us take $w_k$ to be an independent random variable with distribution $\mathcal{U}([-1, 1])$. In this case, the behaviour of the sequence $(U_k)_{k \geq 0}$ is more complicated (Figure 6.2b). The sequence $(U_k)_{k \geq 0}$ no longer follows a neat path to the fixed point, but can behave somewhat more erratically. Nevertheless, the long-term behaviour exhibited by the algorithm bears similarity to the noiseless case: overall, progress is made towards the fixed point $U^*$.
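The following short simulation (a sketch using the operator, step sizes, and uniform noise assumed above; the initial estimate is an arbitrary choice) generates iterates of the kind shown in Figure 6.2(b):

```python
import numpy as np

rng = np.random.default_rng(2)
U = np.array([1.0, -1.0])                  # initial estimate (U(x_1), U(x_2))
trajectory = [U.copy()]
for k in range(1000):
    x = rng.integers(2)                    # source state chosen uniformly at random
    alpha = (k + 1) ** -0.7                # step size satisfying Robbins-Monro
    noise = rng.uniform(-1.0, 1.0)
    target = 0.8 * U[1] + noise            # noisy sample target 0.8 U(x_2) + w_k
    U[x] = (1 - alpha) * U[x] + alpha * target
    trajectory.append(U.copy())
# trajectory traces the path of (U_k) towards the fixed point (0, 0).
```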
The proof of convergence follows the same pattern as for the noiseless case: prove inductively that if $\|U_k - U^*\|_\infty$ is eventually bounded by some fixed quantity $B_l \in \mathbb{R}$, then $\|U_k - U^*\|_\infty$ is eventually bounded by a smaller quantity $B_{l+1}$. As in the noiseless case, this argument is depicted by the concentric squares in Figure 6.2c. Again, if these diameters shrink to zero, this also establishes convergence of $U_k$ to $U^*$.
Because noise can increase the error between the estimate $U_k$ and the fixed point $U^*$ at any given time step, to guarantee convergence we need to progressively decrease the step size $\alpha_k$. The second Robbins-Monro condition is sufficient for this purpose, and with it the inductive step can be proven with a more delicate argument. One additional challenge is that the base case (that $\sup_{k \geq 0} \|U_k - U^*\|_\infty < \infty$) is no longer immediate; a separate argument is required to establish this fact. This property is called the stability of the sequence $(U_k)_{k \geq 0}$, and is often one of the harder aspects of the proof of convergence of incremental algorithms.
We conclude this section with a result that is crucial in understanding the influence of noise in the algorithm. In the analysis carried out in this chapter, it is the only result whose proof requires tools from advanced probability theory.^{53}

53. Specifically, the supermartingale convergence theorem; the result is a special case of the Robbins-Siegmund theorem [Robbins and Siegmund, 1971].
Proposition 6.4. Let $(Z_k)_{k \geq 0}$ be a sequence of random variables taking values in $\mathbb{R}^m$, and $(\alpha_k)_{k \geq 0}$ be a collection of step sizes. Given $\bar Z_0 = 0$, consider the sequence defined by
$$\bar Z_{k+1} = (1-\alpha_k)\bar Z_k + \alpha_k Z_k \,.$$
Suppose that the following conditions hold with probability one:
$$\mathbb{E}[Z_k \mid Z_{0:k-1}, \alpha_{0:k}] = 0 \,, \qquad \sup_{k \geq 0} \mathbb{E}\big[\|Z_k\|^2 \mid Z_{0:k-1}, \alpha_{0:k}\big] < \infty \,,$$
$$\sum_{k=0}^\infty \alpha_k = \infty \,, \qquad \sum_{k=0}^\infty \alpha_k^2 < \infty \,.$$
Then $\bar Z_k \to 0$ with probability one. $\triangle$
The proof is given in Remark 6.2; here, we provide some intuition that can be gleaned without consulting the proof.

First, parallels can be drawn with the strong law of large numbers. Expanding the definition of $\bar Z_{k+1}$ yields
$$\bar Z_{k+1} = (1-\alpha_k)\cdots(1-\alpha_1)\alpha_0 Z_0 + (1-\alpha_k)\cdots(1-\alpha_2)\alpha_1 Z_1 + \cdots + \alpha_k Z_k \,.$$
Thus, $\bar Z_{k+1}$ is a weighted average of the mean-zero terms $(Z_l)_{l=0}^k$. If $\alpha_k = \frac{1}{k+1}$, then we obtain the usual uniformly-weighted average that appears in the strong law of large numbers. We also note that unlike the standard strong law of large numbers, the noise terms $(Z_l)_{l=0}^k$ are not necessarily independent. Nevertheless, it seems reasonable that this sequence should exhibit similar behaviour to the averages that appear in the strong law of large numbers. This also provides further intuition for the conditions of Proposition 6.4. If the variance of individual noise terms $\alpha_k Z_k$ is too great, the weighted average may not 'settle down' as the number of terms increases. Similarly, if $\sum_{k=0}^\infty \alpha_k$ is too small, the initial noise term $Z_0$ will have too large an influence over the weighted average, even as $k \to \infty$.
Second, for readers familiar with stochastic gradient descent, we can rewrite the update scheme as
$$\bar Z_{k+1} = \bar Z_k + \alpha_k\big(-\bar Z_k + Z_k\big) \,.$$
This is a stochastic gradient update for the loss function $L(\bar Z_k) = \frac{1}{2}\|\bar Z_k\|^2$ (the minimiser of which is the origin). The negative gradient of this loss is $-\bar Z_k$, $Z_k$ is a mean-zero perturbation of this gradient, and $\alpha_k$ is the step size used in the update. Proposition 6.4 can therefore be interpreted as stating that stochastic gradient descent on this specific loss function converges to the origin, under the required conditions on the step sizes and noise. It is perhaps surprising that understanding the behaviour of stochastic gradient descent in this specific setting is enough to understand the general class of algorithms expressed by Equation 6.18.
6.8 Convergence of Incremental Algorithms*
We now provide a formal run-through of the arguments of the previous section, and explain how they apply to temporal-difference learning algorithms. We begin by introducing some notation to simplify the argument. We first define a per-state step size that incorporates the choice of source state $X_k$:
$$\alpha_k(x) = \begin{cases} \alpha_k & \text{if } x = X_k \\ 0 & \text{otherwise,} \end{cases} \qquad
w_k(x) = \begin{cases} \hat{\mathcal{O}}(U_k, X_k, Y_k) - (\mathcal{O}U_k)(X_k) & \text{if } x = X_k \\ 0 & \text{otherwise.} \end{cases}$$
This allows us to recursively define $(U_k)_{k \geq 0}$ in a single equation:
$$U_{k+1}(x) = (1-\alpha_k(x))U_k(x) + \alpha_k(x)\big( (\mathcal{O}U_k)(x) + w_k(x) \big) \,. \qquad (6.22)$$
Equation 6.22 encapsulates all of the random variables $(X_k)_{k \geq 0}$, $(Y_k)_{k \geq 0}$, $(\alpha_k)_{k \geq 0}$ which together determine the sequence of estimates $(U_k)_{k \geq 0}$.
It is useful to separate the effects of the noise into two separate cases; one in which the noise has been 'processed' by an application of the contractive mapping $\mathcal{O}$, and one in which the noise has not been passed through this mapping. To this end, we introduce the cumulative external noise vectors $(W_k(x) : k \geq 0, x \in \mathcal{X})$. These are random vectors, with each $W_k(x)$ taking values in $\mathbb{R}^m$, and defined by
$$W_0(x) = 0 \,, \qquad W_{k+1}(x) = (1-\alpha_k(x))W_k(x) + \alpha_k(x)w_k(x) \,.$$
We also introduce the sigma-algebras $\mathcal{F}_k = \sigma(X_{0:k}, \alpha_{0:k}, Y_{0:k-1})$; these encode the information available to the learning agent just prior to sampling $Y_k$ and applying the update rule to produce $U_{k+1}$.

We now list several assumptions we will require of the algorithm to establish the convergence result. Recall that $\|\cdot\|$ is the base norm identified in Section 6.5, which gives rise to the supremum extension $\|\cdot\|_\infty$. In particular, we assume that $\mathcal{O}$ is a $\beta$-contraction mapping in the metric induced by $\|\cdot\|_\infty$.
Assumption 6.5 (Robbins-Monro conditions). For each $x \in \mathcal{X}$, the step sizes $(\alpha_k(x))_{k \geq 0}$ satisfy
$$\sum_{k \geq 0} \alpha_k(x) = \infty \quad \text{and} \quad \sum_{k \geq 0} \alpha_k(x)^2 < \infty$$
with probability 1. $\triangle$
The second assumption encompasses the mean-zero condition described in Section 6.5, and introduces an additional condition that the variance of this noise, conditional on the state of the algorithm, does not grow too quickly.

Assumption 6.6. The noise terms $(w_k(x) : k \geq 0, x \in \mathcal{X})$ satisfy $\mathbb{E}[w_k(x) \mid \mathcal{F}_k] = 0$ with probability 1, and
$$\mathbb{E}\big[ \|w_k(x)\|^2 \mid \mathcal{F}_k \big] \leq C_1 + C_2 \|U_k\|^2_\infty$$
with probability 1, for some constants $C_1, C_2 \geq 0$, for all $x \in \mathcal{X}$ and $k \geq 0$. $\triangle$
We would like to use Proposition 6.4 to show that the cumulative external noise $(W_k(x))_{k \geq 0}$ is well-behaved, and then use this intermediate result to establish the convergence of the sequence $(U_k)_{k \geq 0}$ itself. The proposition is almost applicable to the sequence $(W_k(x))_{k \geq 0}$. The difficulty is that the proposition stipulates that the individual noise terms $Z_k$ have bounded variance, whereas Assumption 6.6 only bounds the conditional expectation of $\|w_k(x)\|^2$ in terms of $\|U_k\|^2_\infty$, which a priori may be unbounded. Unfortunately, in temporal-difference learning algorithms, the update variance typically does scale with the magnitude of current estimates, so this is not an assumption that we can weaken. To get around this difficulty, we first establish the boundedness of the sequence $(U_k)_{k \geq 0}$, as described informally in the previous section; this property is often referred to as stability in the stochastic approximation literature.
Proposition 6.7. Suppose Assumptions 6.5 and 6.6 hold. Then there is a finite random variable $B$ such that $\sup_{k \geq 0} \|U_k\|_\infty < B$ with probability 1. $\triangle$
Proof. The idea of the proof is to work with a 'renormalised' version of the noises $(w_k(x))_{k \geq 0}$ to which Proposition 6.4 can be applied. First, we show that the contractivity of $\mathcal{O}$ means that when $U$ is sufficiently far from 0, $\mathcal{O}$ contracts the iterate back towards 0. To make this precise, we first observe that
$$\|\mathcal{O}U\|_\infty \leq \|\mathcal{O}U - U^*\|_\infty + \|U^*\|_\infty \leq \beta\|U - U^*\|_\infty + \|U^*\|_\infty \leq \beta\|U\|_\infty + D \,,$$
where $D = (1 + \beta)\|U^*\|_\infty$. Let $\bar B > \frac{D}{1-\beta}$ so that $\bar B > \beta\bar B + D$, and define $\tilde\beta \in (\beta, 1)$ by $\beta\bar B + D = \tilde\beta\bar B$. Note that for any $U$ with $\|U\|_\infty \geq \bar B$, we have
$$\|\mathcal{O}U\|_\infty \leq \beta\|U\|_\infty + D = \beta\|U\|_\infty + (\tilde\beta - \beta)\bar B \leq \beta\|U\|_\infty + (\tilde\beta - \beta)\|U\|_\infty = \tilde\beta\|U\|_\infty \,.$$
Second, we construct a sequence of bounds $(\bar B_k)_{k \geq 0}$ related to the iterates $(U_k)_{k \geq 0}$ as follows. It will be convenient to introduce $1 + \varepsilon = \tilde\beta^{-1}$, the inverse of the contraction factor $\tilde\beta$ above. Take $\bar B_0 = \max(\bar B, \|U_0\|_\infty)$, and iteratively define
$$\bar B_{k+1} = \begin{cases} \bar B_k & \text{if } \|U_{k+1}\|_\infty \leq (1+\varepsilon)\bar B_k \\ (1+\varepsilon)^{l^*}\bar B_0, \text{ where } l^* = \min\{l : \|U_{k+1}\|_\infty \leq (1+\varepsilon)^l\bar B_0\} & \text{otherwise.} \end{cases}$$
Thus, the $(\bar B_k)_{k \geq 0}$ define a kind of soft 'upper envelope' on $(\|U_k\|_\infty)_{k \geq 0}$, which is only updated when a norm exceeds the previous bound $\bar B_k$ by a factor of at least $(1 + \varepsilon)$. Note that $(\|U_k\|_\infty)_{k \geq 0}$ is unbounded if and only if $\bar B_k \to \infty$.
We now use the $(\bar B_k)_{k \geq 0}$ to define a 'renormalised' noise sequence $(\tilde w_k)_{k \geq 0}$ to which Proposition 6.4 can be applied. We set $\tilde w_k = w_k / \bar B_k$, and define $\tilde W_k$ iteratively by $\tilde W_0 = 0$ and
$$\tilde W_{k+1}(x) = (1-\alpha_k(x))\tilde W_k(x) + \alpha_k(x)\tilde w_k(x) \,.$$
By Assumption 6.6, we still have $\mathbb{E}[\tilde w_k \mid \mathcal{F}_k] = 0$, and we obtain that $\mathbb{E}[\|\tilde w_k\|^2_\infty \mid \mathcal{F}_k]$ is uniformly bounded. Using Assumption 6.5, Proposition 6.4 now applies, and we deduce that $\tilde W_k \to 0$ with probability 1.
In particular, there is a (random) time $K$ such that $\|\tilde W_k(x)\| < \varepsilon$ for all $k \geq K$ and $x \in \mathcal{X}$. Now supposing $\bar B_k \to \infty$, we may also take $K$ so that $\|U_K\|_\infty \leq \bar B_K$.
We will now prove by induction that for all $k \geq K$, we have both $\|U_k\|_\infty \leq (1+\varepsilon)\bar B_K$ and $\|U_k - W_k\|_\infty < \bar B_K$; the base case is clear from the above. For the inductive step, suppose for some $k \geq K$, we have both $\|U_k\|_\infty \leq (1+\varepsilon)\bar B_K$ and $\|U_k - W_k\|_\infty < \bar B_K$. Now observe that
$$\begin{aligned}
\|U_{k+1}(x) - W_{k+1}(x)\| &= \big\| (1-\alpha_k(x))U_k(x) + \alpha_k(x)(\mathcal{O}U_k)(x) + \alpha_k(x)w_k(x) - W_{k+1}(x) \big\| \\
&\leq (1-\alpha_k(x))\|U_k(x) - W_k(x)\| + \alpha_k(x)\|(\mathcal{O}U_k)(x)\| \\
&\leq (1-\alpha_k(x))\bar B_K + \alpha_k(x)\big(\beta\|U_k\|_\infty + D\big) \\
&\leq (1-\alpha_k(x))\bar B_K + \alpha_k(x)\,\tilde\beta(1+\varepsilon)\bar B_K \\
&\leq \bar B_K \,.
\end{aligned}$$
And additionally,
$$\|U_{k+1}\|_\infty \leq \|U_{k+1} - W_{k+1}\|_\infty + \|W_{k+1}\|_\infty \leq \bar B_K + \varepsilon\bar B_K = (1+\varepsilon)\bar B_K \,,$$
as required. In particular, $\bar B_k = \bar B_K$ for all $k \geq K$, contradicting the supposition that $\bar B_k \to \infty$. It follows that $(\bar B_k)_{k \geq 0}$, and hence $(\|U_k\|_\infty)_{k \geq 0}$, is bounded with probability 1.
We can now establish the convergence of the cumulative external noise.

Proposition 6.8. Suppose Assumptions 6.5 and 6.6 hold. Then the external noise $W_k(x)$ converges to 0 with probability 1, for each $x \in \mathcal{X}$. $\triangle$

Proof. By Proposition 6.7, there exists a finite random variable $B$ such that $\|U_k\|_\infty \leq B$ for all $k \geq 0$. We therefore have
$$\mathbb{E}\big[ \|w_k(x)\|^2 \mid \mathcal{F}_k \big] \leq C_1 + C_2 B^2 =: B'$$
with probability 1 for all $x \in \mathcal{X}$ and $k \geq 0$, by Assumption 6.6. Proposition 6.4 therefore applies to give the conclusion.
With this result in hand, we now prove the central result of this section, using the stability result as the base case for the inductive argument intuitively explained above.

Theorem 6.9. Suppose Assumptions 6.5 and 6.6 hold. Then $U_k \to U^*$ with probability one. $\triangle$
Proof. By Proposition 6.7, there is a finite random variable $B_0$ such that $\|U_k - U^*\|_\infty < B_0$ for all $k \geq 0$ with probability 1. Let $\varepsilon > 0$ be such that $\beta + 2\varepsilon < 1$; we will show by induction that if $B_l = (\beta + 2\varepsilon)B_{l-1}$ for all $l \geq 1$, then for each $l \geq 0$, there is a (possibly random) finite time $K_l$ such that $\|U_k - U^*\|_\infty < B_l$ for all $k \geq K_l$, which proves the theorem.

To prove this claim by induction, let $l \geq 0$ and suppose there is a random finite time $K_l$ such that $\|U_k - U^*\|_\infty < B_l$ for all $k \geq K_l$ with probability 1. Now let $x \in \mathcal{X}$, and $k \geq K_l$. We have
$$\begin{aligned}
U_{k+1}(x) - U^*(x) - W_{k+1}(x)
&= (1-\alpha_k(x))U_k(x) + \alpha_k(x)\big((\mathcal{O}U_k)(x) + w_k(x)\big) - U^*(x) \\
&\qquad - (1-\alpha_k(x))W_k(x) - \alpha_k(x)w_k(x) \\
&= (1-\alpha_k(x))\big(U_k(x) - U^*(x) - W_k(x)\big) + \alpha_k(x)\big((\mathcal{O}U_k)(x) - U^*(x)\big) \,.
\end{aligned}$$
Since $\mathcal{O}$ is a contraction mapping under $\|\cdot\|_\infty$ with fixed point $U^*$ and contraction modulus $\beta$, we have $\|(\mathcal{O}U_k)(x) - U^*(x)\| \leq \beta\|U_k - U^*\|_\infty < \beta B_l$, and so
$$\|U_{k+1}(x) - U^*(x) - W_{k+1}(x)\| \leq (1-\alpha_k(x))\|U_k(x) - U^*(x) - W_k(x)\| + \alpha_k(x)\beta B_l \,.$$
Letting $\Delta_k(x) = \|U_k(x) - U^*(x) - W_k(x)\|$, we then have
$$\Delta_{k+1}(x) \leq (1-\alpha_k(x))\Delta_k(x) + \alpha_k(x)\beta B_l
\;\implies\; \Delta_{k+1}(x) - \beta B_l \leq (1-\alpha_k(x))\big(\Delta_k(x) - \beta B_l\big) \,.$$
Telescoping this inequality from $K_l$ to $k$ yields
$$\Delta_{k+1}(x) - \beta B_l \leq \Big(\prod_{s=K_l}^{k}(1-\alpha_s(x))\Big)\big(\Delta_{K_l}(x) - \beta B_l\big) \,.$$
If $\Delta_{K_l}(x) - \beta B_l \leq 0$, then $\Delta_k(x) \leq \beta B_l$ for all $k \geq K_l$. If not, then we can use the inequality $1 - x \leq e^{-x}$ (applied to the step sizes $\alpha_s(x) \geq 0$) to deduce
$$\Delta_{k+1}(x) - \beta B_l \leq \exp\Big( -\sum_{s=K_l}^{k}\alpha_s(x) \Big)\big(\Delta_{K_l}(x) - \beta B_l\big) \,,$$
and since $\sum_{s \geq 0}\alpha_s(x) = \infty$ by assumption, the right-hand side tends to 0. Therefore, there exists a random finite time after which $\Delta_k(x) \leq (\beta + \varepsilon)B_l$. Since $\mathcal{X}$ is finite, there is a random finite time after which this holds for all $x \in \mathcal{X}$. Finally, since $W_k(x) \to 0$ under $\|\cdot\|$ for all $x \in \mathcal{X}$ with probability 1 by Proposition 6.8, there is a random finite time after which $\|W_k(x)\| \leq \varepsilon B_l$ for all $x \in \mathcal{X}$. Letting $K_{l+1} \geq K_l$ be the maximum of all these random times, we therefore have that for $k \geq K_{l+1}$, for all $x \in \mathcal{X}$,
$$\|U_k(x) - U^*(x)\| \leq \|U_k(x) - U^*(x) - W_k(x)\| + \|W_k(x)\| \leq (\beta + \varepsilon)B_l + \varepsilon B_l = B_{l+1} \,,$$
as required.
6.9 Convergence of Temporal-Difference Learning*
We can now apply Theorem 6.9 to demonstrate the convergence of the sequence of value function estimates produced by TD learning. Formally, we consider a stream of sample transitions $(X_k, A_k, R_k, X'_k)_{k \geq 0}$, along with associated step sizes $(\alpha_k)_{k \geq 0}$, that satisfy the Robbins-Monro conditions (Assumption 6.5) and give rise to zero-mean noise terms (Assumption 6.6). More precisely, we assume there are sequences of functions $(\nu_k)_{k \geq 0}$ and $(\chi_k)_{k \geq 0}$ such that our sample model takes the following form (for $k \geq 0$):^{54}
$$\begin{aligned}
X_k \mid (X_{0:k-1}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1}) &\sim \nu_k\big( X_{0:k-1}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1} \big) \,; \\
\alpha_k \mid (X_{0:k}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1}) &\sim \chi_k\big( X_{0:k}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1} \big) \,; \\
A_k \mid (X_{0:k}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k}) &\sim \pi( \cdot \mid X_k) \,; \\
R_k \mid (X_{0:k}, A_{0:k}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k}) &\sim P_{\mathcal{R}}( \cdot \mid X_k, A_k) \,; \\
X'_k \mid (X_{0:k}, A_{0:k}, R_{0:k}, X'_{0:k-1}, \alpha_{0:k}) &\sim P_{\mathcal{X}}( \cdot \mid X_k, A_k) \,. \qquad (6.23)
\end{aligned}$$
A generative, or algorithmic, perspective on this model is that at each update step $k$, a source state $X_k$ and step size $\alpha_k$ are selected on the basis of all previously-observed random variables (possibly using an additional source of randomness to make this selection), and the variables $(A_k, R_k, X'_k)$ are sampled according to $\pi$ and the environment dynamics, conditionally independent of all random variables already observed given $X_k$. Readers may compare this with the model equations in Section 2.3 describing the joint distribution of a trajectory generated by following the policy $\pi$. As discussed in Sections 6.5 and 6.6, this is a fairly flexible model that allows us to analyse a variety of learning schemes.
Theorem 6.10. Consider the value function iterates $(V_k)_{k \geq 0}$ defined by some initial estimate $V_0$ and satisfying
$$V_{k+1}(X_k) = (1-\alpha_k)V_k(X_k) + \alpha_k\big(R_k + \gamma V_k(X'_k)\big)$$
$$V_{k+1}(x) = V_k(x) \quad \text{if } x \neq X_k \,,$$
where $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is a sequence of transitions. Suppose that:

(a) The source states $(X_k)_{k \geq 0}$ and step sizes $(\alpha_k)_{k \geq 0}$ satisfy the Robbins-Monro conditions:
$$\sum_{k=0}^\infty \alpha_k\, \mathbb{1}\{X_k = x\} = \infty \,, \qquad \sum_{k=0}^\infty \alpha_k^2\, \mathbb{1}\{X_k = x\} < \infty$$
with probability 1, for all $x \in \mathcal{X}$.
(b) The joint distribution of $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is an instance of the sampling model expressed in Equation 6.23.
(c) The reward distributions for all state-action pairs have finite variance.

Then $V_k \to V^\pi$ with probability 1. $\triangle$
54. As in previous chapters, terms of the form $X_{0:-1}$ arising when $k = 0$ above are to be interpreted as the empty sequence.
Theorem 6.10 gives formal meaning to our earlier assertion that the conver-
gence of incremental reinforcement learning algorithms can be guaranteed for a
variety of source state distributions. Interestingly enough, the condition on the
source state distribution appears only implicitly, through the Robbins-Monro
conditions: effectively, what matters is not so much when the states are updated,
but rather the “total amount of step size” by which the estimate may be moved.
Proof. We first observe that the temporal-difference algorithm described in the statement is an instance of the abstract algorithm described in Section 6.5, by taking $U_k = V_k$, $m = 1$, $\mathcal{O} = T^\pi$, $Y_k = (A_k, R_k, X'_k)$, and $\hat{\mathcal{O}}(U_k, X_k, Y_k) = R_k + \gamma V_k(X'_k)$. The base norm $\|\cdot\|$ is simply the absolute value on $\mathbb{R}$. In this case, $\mathcal{O}$ is a $\gamma$-contraction on $\mathbb{R}^{\mathcal{X}}$ by Proposition 4.4, with fixed point $V^\pi$, and the noise $w_k$ is equal to $R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k)$, by the decomposition in Equation 6.5.

It therefore remains to check that Assumptions 6.5 and 6.6 hold; Theorem 6.9 then applies to give the result. Assumption 6.5 is immediate from the conditions of the theorem. To see that Assumption 6.6 holds, first note that
$$\begin{aligned}
\mathbb{E}[w_k \mid \mathcal{F}_k] &= \mathbb{E}\big[ R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k) \mid \mathcal{F}_k \big] \\
&= \mathbb{E}\big[ R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k) \mid X_k, V_k \big] \\
&= 0 \,,
\end{aligned}$$
since conditional on $(X_k, V_k)$, the expectation of $R_k + \gamma V_k(X'_k)$ is $(T^\pi V_k)(X_k)$. Additionally, we note that
$$\begin{aligned}
\mathbb{E}\big[ |w_k|^2 \mid \mathcal{F}_k \big] &= \mathbb{E}\big[ |w_k|^2 \mid X_k, V_k \big] = \mathbb{E}\big[ |R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k)|^2 \mid X_k, V_k \big] \\
&\leq 2\,\mathbb{E}\big[ |R_k + \gamma V_k(X'_k)|^2 \mid X_k, V_k \big] + 2\big((T^\pi V_k)(X_k)\big)^2 \\
&\leq C_1 + C_2 \|V_k\|^2_\infty \,,
\end{aligned}$$
for some $C_1, C_2 > 0$, as required by the assumption on boundedness of the reward variance.
6.10 Convergence of Categorical Temporal-Difference Learning*
Let us now consider proving the convergence of categorical TD learning by means of Theorem 6.9. Writing CTD in terms of a sequence of return-distribution estimates, we have
$$\eta_{k+1}(X_k) = (1-\alpha_k)\eta_k(X_k) + \alpha_k\, \Pi_C (\mathrm{b}_{R_k,\gamma})_\# \eta_k(X'_k)$$
$$\eta_{k+1}(x) = \eta_k(x) \quad \text{if } x \neq X_k \,. \qquad (6.24)$$
Following the principles of the previous section, we may decompose the update at $X_k$ into an operator term and a noise term:
$$\eta_{k+1}(X_k) = (1-\alpha_k)\eta_k(X_k) + \alpha_k\Big( \underbrace{(\Pi_C\mathcal{T}^\pi\eta_k)(X_k)}_{(\mathcal{O}U_k)(X_k)} + \underbrace{\Pi_C(\mathrm{b}_{R_k,\gamma})_\#\eta_k(X'_k) - (\Pi_C\mathcal{T}^\pi\eta_k)(X_k)}_{w_k} \Big) \,. \qquad (6.25)$$
Assuming that $R_k$ and $X'_k$ are drawn appropriately, this decomposition is sensible by virtue of Proposition 6.2, in the sense that the expectation of the sample target is the projected distributional Bellman operator:
$$\mathbb{E}_\pi\big[ \Pi_C(\mathrm{b}_{R_k,\gamma})_\#\eta_k(X'_k) \mid X_k = x \big] = (\Pi_C\mathcal{T}^\pi\eta_k)(x) \quad \text{for all } x \,. \qquad (6.26)$$
With this decomposition, the noise term is not a probability distribution but rather a signed distribution; this is illustrated in Figure 6.3. Informally speaking, a signed distribution may assign negative "probabilities", and may not integrate to one (we will revisit signed distributions in Chapter 9). Based on Proposition 6.2, we may intuit that $w_k$ is mean-zero noise, where "zero" here is to be understood as a special signed distribution.
Figure 6.3
The sample target in a categorical TD update (a) can be decomposed into an expected update specified by the operator $\Pi_C\mathcal{T}^\pi$ (b) and a mean-zero signed distribution (c).
However, expressing the CTD update rule in terms of signed distributions is not sufficient to apply Theorem 6.9. This is because the theorem requires that the iterates $(U_k(x))_{k \geq 0}$ be elements of $\mathbb{R}^m$, whereas $(\eta_k(x))_{k \geq 0}$ are probability distributions. To address this issue, we leverage the fact that categorical distributions are represented by a finite number of parameters, and view their cumulative distribution functions as vectors in $\mathbb{R}^m$.

Recall that $\mathscr{F}^{\mathcal{X}}_{C,m}$ is the space of $m$-categorical return-distribution functions. To invoke Theorem 6.9, we construct an isometry $\iota$ between $\mathscr{F}^{\mathcal{X}}_{C,m}$ and a certain subset of $\mathbb{R}^{\mathcal{X} \times m}$. For a categorical return function $\eta \in \mathscr{F}^{\mathcal{X}}_{C,m}$, write
$$\iota(\eta) = \big( F_{\eta(x)}(\theta_i) : x \in \mathcal{X},\; i \in \{1, \ldots, m\} \big) \in \mathbb{R}^{\mathcal{X} \times m} \,,$$
where as before $\theta_1, \ldots, \theta_m$ denote the locations of the $m$ particles whose probabilities are parametrised in $\mathscr{F}_{C,m}$. The isometry $\iota$ maps return functions to elements of $\mathbb{R}^{\mathcal{X} \times m}$ describing the corresponding cumulative distribution functions (CDFs), evaluated at these particles. The image of $\mathscr{F}^{\mathcal{X}}_{C,m}$ under $\iota$ is
$$\mathbb{R}^{\mathcal{X}}_\iota = \big\{ z \in \mathbb{R}^m : 0 \leq z_1 \leq \cdots \leq z_m = 1 \big\}^{\mathcal{X}} \,.$$
The inverse map
$$\iota^{-1} : \mathbb{R}^{\mathcal{X}}_\iota \to \mathscr{F}^{\mathcal{X}}_{C,m}$$
maps vectors describing the cumulative distribution functions of categorical return functions back to their distributions.
With this construction, the metric induced by the $L^2$ norm $\|\cdot\|_2$ over $\mathbb{R}^m$ is proportional to the Cramér distance between probability distributions, and is readily extended to $\mathbb{R}^{\mathcal{X} \times m}$. That is, for $\eta, \eta' \in \mathscr{F}^{\mathcal{X}}_{C,m}$, we have
$$\|\iota(\eta) - \iota(\eta')\|_{2,\infty} = \sup_{x \in \mathcal{X}} \big\| (\iota(\eta))(x) - (\iota(\eta'))(x) \big\|_2 = \frac{1}{\sqrt{\varsigma_m}}\,\overline{\ell}_2(\eta, \eta') \,.$$
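To make the embedding concrete, here is a small sketch (with assumed particle locations and probabilities chosen only for illustration) that maps categorical probability vectors to their CDF coordinates and evaluates the supremum of the per-state $L^2$ distances appearing on the left-hand side above:

```python
import numpy as np

theta = np.array([0.0, 1.0, 2.0])              # particle locations theta_1, ..., theta_m
p = np.array([[0.2, 0.5, 0.3],                 # particle probabilities for two states
              [0.1, 0.1, 0.8]])
q = np.array([[0.3, 0.4, 0.3],
              [0.0, 0.2, 0.8]])

def embed(probs):
    """CDF embedding: cumulative sums of the particle probabilities per state."""
    return np.cumsum(probs, axis=1)            # last coordinate is always 1

diff = np.linalg.norm(embed(p) - embed(q), axis=1)   # per-state L2 distance of CDFs
print(diff.max())                                     # supremum over states
```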
We will prove the convergence of the sequence $(\eta_k)_{k \geq 0}$ defined by Equation 6.24 to the fixed point of the projected operator $\Pi_C\mathcal{T}^\pi$ by applying Theorem 6.9 to the sequence $(\iota(\eta_k))_{k \geq 0}$ and the $L^2$ metric, and arguing (by isometry) that the original sequence must also converge. An important additional property of $\iota$ is that it commutes with expectations, in the following sense.
Lemma 6.11. The isometry $\iota : \mathscr{F}^{\mathcal{X}}_{C,m} \to \mathbb{R}^{\mathcal{X}}_\iota$ is an affine map. That is, for any $\eta, \eta' \in \mathscr{F}^{\mathcal{X}}_{C,m}$ and $\alpha \in [0, 1]$,
$$\iota\big(\alpha\eta + (1-\alpha)\eta'\big) = \alpha\,\iota(\eta) + (1-\alpha)\,\iota(\eta') \,.$$
As a result, if $\eta$ is a random return-distribution function, then we have
$$\mathbb{E}[\iota(\eta)] = \iota(\mathbb{E}[\eta]) \,. \quad\triangle$$
Theorem 6.12. Let $m \geq 2$ and consider the return function iterates $(\eta_k)_{k \geq 0}$ generated by Equation 6.24 from some possibly random $\eta_0$. Suppose that:

(a) The source states $(X_k)_{k \geq 0}$ and step sizes $(\alpha_k)_{k \geq 0}$ satisfy the Robbins-Monro conditions:
$$\sum_{k=0}^\infty \alpha_k\, \mathbb{1}\{X_k = x\} = \infty \,, \qquad \sum_{k=0}^\infty \alpha_k^2\, \mathbb{1}\{X_k = x\} < \infty$$
with probability 1, for all $x \in \mathcal{X}$.
(b) The joint distribution of $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is an instance of the sampling model expressed in Equation 6.23.

Then with probability 1, $\eta_k \to \hat\eta_C$ with respect to the supremum Cramér distance $\overline{\ell}_2$, where $\hat\eta_C$ is the unique fixed point of the projected operator $\Pi_C\mathcal{T}^\pi$: $\hat\eta_C = \Pi_C\mathcal{T}^\pi\hat\eta_C$. $\triangle$
Proof.
We begin by constructing a sequence $(U_k)_{k \ge 0}$, $U_k \in \mathbb{R}^{\mathcal{X}}_{\iota}$, that parallels
the sequence of return functions $(\eta_k)_{k \ge 0}$. Write
$$
O = \iota\, \Pi_C \mathcal{T}^\pi\, \iota^{-1}
$$
and define, for each $k \in \mathbb{N}$, $U_k = \iota(\eta_k)$. By Lemma 5.24, $O$ is a contraction with
modulus $\gamma^{1/2}$ in $\|\cdot\|_{2,\infty}$ and we have
$$
\begin{aligned}
U_{k+1}(X_k) &= (1 - \alpha_k) U_k(X_k) + \alpha_k\, \iota\big( \Pi_C (\mathrm{b}_{R_k,\gamma})_\#\, \eta_k(X'_k) \big) \\
&= (1 - \alpha_k) U_k(X_k) + \alpha_k \Big( (O U_k)(X_k) + \underbrace{\iota\big( \Pi_C (\mathrm{b}_{R_k,\gamma})_\#\, (\iota^{-1} U_k)(X'_k) \big) - (O U_k)(X_k)}_{\omega_k} \Big) .
\end{aligned}
$$
To see that Assumption 6.6 (bounded, mean-zero noise) holds, first note that
by Proposition 6.2 and affineness of $\iota$ and $\iota^{-1}$ from Lemma 6.11, we have
$\mathbb{E}[\omega_k \mid \mathcal{F}_k] = 0$. Furthermore, $\omega_k$ is a bounded random variable, because each
coordinate is a difference of two probabilities, and hence in the interval $[-1, 1]$.
Hence, we have $\mathbb{E}[\|\omega_k\|_2 \mid \mathcal{F}_k] < C$ for some $C > 0$, as required.
By Banach's theorem, the operator $O$ has a unique fixed point $U^*$. We can
thus apply Theorem 6.9 to conclude that the sequence $(U_k)_{k \ge 0}$ converges to $U^*$,
satisfying
$$
U^* = \iota\, \Pi_C \mathcal{T}^\pi\, \iota^{-1} U^* . \tag{6.27}
$$
Because $\iota$ is an isometry, this implies that $(\eta_k)_{k \ge 0}$ converges to $\eta^* = \iota^{-1} U^*$.
Applying $\iota^{-1}$ to both sides of Equation 6.27, we obtain
$$
\iota^{-1} U^* = \Pi_C \mathcal{T}^\pi\, \iota^{-1} U^* ,
$$
and since $\Pi_C \mathcal{T}^\pi$ has a unique fixed point we conclude that $\eta^* = \hat{\eta}_C$.
The proof of Theorem 6.12 illustrates how the parameters of categorical distribu-
tions are by design bounded, so that stability (i.e. Proposition 6.7) is immediate.
In fact, stability is also immediate for TD learning when the reward distributions
have bounded support.
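The following hypothetical sketch illustrates the behaviour guaranteed by Theorem 6.12 on a toy two-state Markov reward process (the model, support, number of updates, and step sizes are arbitrary choices made for illustration, not anything specified in the text). It runs the sampled categorical mixture update with step sizes $\alpha_k = 1/(N_k(X_k)+1)$ and compares the result against the fixed point of $\Pi_C \mathcal{T}^\pi$ obtained by deterministic iteration; the supremum Cramér distance between the two is small and shrinks as more transitions are processed.

```python
import numpy as np

def categorical_projection(atoms, probs, support):
    """Triangular-kernel categorical projection Pi_C onto the fixed support."""
    gap = support[1] - support[0]
    out = np.zeros_like(support, dtype=float)
    for z, p in zip(atoms, probs):
        z = np.clip(z, support[0], support[-1])
        b = (z - support[0]) / gap
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            out[lo] += p
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out

rng = np.random.default_rng(4)
support = np.linspace(0.0, 10.0, 51)
gap = support[1] - support[0]
gamma, n_states = 0.8, 2
reward_mean = np.array([1.0, 2.0])       # reward = mean +/- 0.5, each w.p. 1/2
                                         # successor state uniform over states

# Sampled categorical mixture updates with step sizes 1 / (N_k(X_k) + 1).
eta = np.full((n_states, len(support)), 1.0 / len(support))
counts = np.zeros(n_states)
for _ in range(50_000):
    x = rng.integers(n_states)
    r = reward_mean[x] + rng.choice((-0.5, 0.5))
    x_next = rng.integers(n_states)
    target = categorical_projection(r + gamma * support, eta[x_next], support)
    counts[x] += 1
    eta[x] = (1.0 - 1.0 / counts[x]) * eta[x] + (1.0 / counts[x]) * target

# Fixed point of the projected operator, computed by deterministic iteration
# (the transition and reward model are known in this toy example).
def projected_bellman(cur):
    new = np.zeros_like(cur)
    for x in range(n_states):
        for x_next in range(n_states):
            for noise in (-0.5, 0.5):
                new[x] += 0.25 * categorical_projection(
                    reward_mean[x] + noise + gamma * support, cur[x_next], support)
    return new

eta_star = np.full_like(eta, 1.0 / len(support))
for _ in range(200):
    eta_star = projected_bellman(eta_star)

# Supremum Cramer distance between the sampled iterate and the fixed point.
cdf_diff = np.cumsum(eta - eta_star, axis=1)
print(np.sqrt(gap * (cdf_diff ** 2).sum(axis=1)).max())
```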
6.11 Technical Remarks
Remark 6.1.
Given a probability distribution $\nu \in \mathscr{P}(\mathbb{R})$ and a level $\tau \in (0, 1)$,
quantile regression finds a value $\theta \in \mathbb{R}$ such that
$$
F_\nu(\theta) = \tau . \tag{6.28}
$$
In some situations, for example when $\nu$ is a discrete distribution, there are
multiple values satisfying Equation 6.28. Let us write
$$
S_\tau = \{ \theta : F_\nu(\theta) = \tau \} .
$$
Then one can show that $S_\tau$ forms an interval. We can argue that quantile regression converges to this set by noting that, for $\tau \in (0, 1)$, the expected quantile
loss
$$
L_\tau(\theta) = \mathbb{E}_{Z \sim \nu}\big[ \,|\tau - \mathbb{1}\{Z < \theta\}| \cdot |Z - \theta| \,\big]
$$
is convex in $\theta$. In addition, for this loss we have that for any $\theta, \theta' \in S_\tau$ and
$\theta'' \notin S_\tau$,
$$
L_\tau(\theta) = L_\tau(\theta') < L_\tau(\theta'') .
$$
Convergence follows under appropriate conditions by appealing to standard
arguments regarding the convergence of stochastic gradient descent; see for
example Kushner and Yin [2003]. 4
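Concretely, a stochastic (sub)gradient step on the quantile loss above takes the form $\theta \leftarrow \theta + \alpha(\tau - \mathbb{1}\{Z < \theta\})$ for a sampled $Z \sim \nu$. The sketch below (hypothetical; the target distribution, step size, and sample count are arbitrary illustrative choices) runs this update and compares the resulting estimate with an empirical quantile:

```python
import numpy as np

rng = np.random.default_rng(2)
tau, alpha, theta = 0.9, 0.01, 0.0     # level, constant step size, initial estimate

# Samples from an arbitrary fixed distribution nu.
samples = rng.normal(loc=1.0, scale=2.0, size=50_000)

for z in samples:
    # Stochastic (sub)gradient step on the expected quantile loss.
    theta += alpha * (tau - float(z < theta))

print("incremental estimate:", theta)
print("empirical tau-quantile:", np.quantile(samples, tau))
```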
Remark 6.2 (Proof of Proposition 6.4).
Our goal is to show that $\|\bar{Z}_k\|_2^2$ behaves like a non-negative supermartingale, from which convergence would
follow from the supermartingale convergence theorem [see e.g. Billingsley,
2012]. We begin by expressing the squared Euclidean norms of the sequence
elements recursively, writing $\mathcal{F}_k = \sigma(Z_{0:k-1}, \alpha_{0:k})$:
$$
\begin{aligned}
\mathbb{E}[\|\bar{Z}_{k+1}\|_2^2 \mid \mathcal{F}_k]
&= \mathbb{E}[\|(1 - \alpha_k)\bar{Z}_k + \alpha_k Z_k\|_2^2 \mid \mathcal{F}_k] \\
&\overset{(a)}{=} (1 - \alpha_k)^2 \|\bar{Z}_k\|_2^2 + \alpha_k^2\, \mathbb{E}[\|Z_k\|_2^2 \mid \mathcal{F}_k] \\
&\overset{(b)}{\le} (1 - \alpha_k)^2 \|\bar{Z}_k\|_2^2 + \alpha_k^2 B
\;\le\; (1 - \alpha_k)\|\bar{Z}_k\|_2^2 + \alpha_k^2 B .
\end{aligned}
\tag{6.29}
$$
Here, (a) follows by expanding the squared norm and using $\mathbb{E}[Z_k \mid \mathcal{F}_k] = 0$, and
(b) follows from the boundedness of the conditional variance of the $(Z_k)_{k \ge 0}$,
where $B$ is a bound on such variances.
This inequality does not establish the supermartingale property, due to the
presence of the additive term $\alpha_k^2 B$ on the right-hand side. However, the ideas
behind the Robbins-Siegmund theorem [Robbins and Siegmund, 1971] can be
applied to deal with this term. The argument first constructs the sequence
$$
\varepsilon_k = \|\bar{Z}_k\|_2^2 + \sum_{s=0}^{k-1} \alpha_s \|\bar{Z}_s\|_2^2 - B \sum_{s=0}^{k-1} \alpha_s^2 .
$$
Inequality 6.29 above then shows that $(\varepsilon_k)_{k \ge 0}$ is a supermartingale, but it may
not be uniformly bounded below, meaning the supermartingale convergence
theorem still cannot be applied. Defining the stopping times $t_q = \inf\{k \ge 0 : B \sum_{s=0}^{k} \alpha_s^2 > q\}$ for each $q \in \mathbb{N}$ (with the convention that $\inf \emptyset = \infty$), each
stopped process $(\varepsilon_{k \wedge t_q})_{k \ge 0}$ is a supermartingale bounded below by $-q$, and
hence each such process converges w.p. 1 by the supermartingale convergence
theorem. However, $B \sum_{s=0}^{\infty} \alpha_s^2 < \infty$ w.p. 1 by assumption, so w.p. 1 $t_q = \infty$ for
sufficiently large $q$, and hence $\varepsilon_k$ converges w.p. 1. Since $\sum_{k=0}^{\infty} \alpha_k = \infty$ w.p.
1 by assumption, it must be the case that $\|\bar{Z}_k\|_2^2 \to 0$ as $k \to \infty$, in order for
$\varepsilon_k$ to have a finite limit, and hence we are done. Exercise 6.13 demonstrates the necessity of this somewhat involved argument. 4
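To see the conclusion of Proposition 6.4 in action, the short sketch below (hypothetical; the noise distribution, dimension, and Robbins-Monro step sizes are arbitrary illustrative choices) runs the averaging recursion $\bar{Z}_{k+1} = (1 - \alpha_k)\bar{Z}_k + \alpha_k Z_k$ with mean-zero noise and prints the shrinking norm $\|\bar{Z}_k\|_2$:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 5
z_bar = rng.normal(size=dim)             # arbitrary starting point

checkpoints = {10**2: None, 10**4: None, 10**6: None}
for k in range(10**6):
    alpha = 1.0 / (k + 1)                # Robbins-Monro step sizes
    z_noise = rng.normal(size=dim)       # mean-zero, bounded-variance noise
    z_bar = (1.0 - alpha) * z_bar + alpha * z_noise
    if k + 1 in checkpoints:
        checkpoints[k + 1] = np.linalg.norm(z_bar)

print(checkpoints)   # the norm of the running average shrinks towards zero
```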
6.12 Bibliographical Remarks
The focus of this chapter has been on developing and analysing single-step
temporal-difference algorithms. Further algorithmic developments include the
use of multi-step returns [Sutton, 1988], off-policy corrections [Precup et al.,
2000], and gradient-based algorithms [Sutton et al., 2009, 2008a]; the exercises
in this chapter develop a few such approaches.
6.1-6.2.
This chapter analyses incremental algorithms through the lens of
approximating the application of dynamic programming operators. Temporal-
difference algorithms have a long history [Samuel, 1959], and the idea of
incremental approximations to dynamic programming formed motivation for
several general-purpose temporal-difference learning algorithms [Sutton, 1984,
1988, Watkins, 1989].
Although early proofs of particular kinds of convergence for these algorithms
did not directly exploit this connection with dynamic programming [Watkins,
1989, Watkins and Dayan, 1992, Dayan, 1992], later a strong theoretical connec-
tion was established that viewed these algorithms through the lens of stochastic
approximation theory, allowing for a unified approach to proving almost-sure
convergence [Gurvits et al., 1994, Dayan and Sejnowski, 1994, Tsitsiklis, 1994,
Jaakkola et al., 1994, Bertsekas and Tsitsiklis, 1996, Littman and Szepesvári,
1996]. The unbiased estimation framework presented comes from these works,
and the second principle is based on the ideas behind two-timescale algo-
rithms [Borkar, 1997, 2008]. A broader framework based on asymptotically
approximating the trajectories of differential equations is a central theme of
algorithm design and stochastic approximation theory more generally [Ljung,
1977, Kushner and Clark, 1978, Benveniste et al., 2012, Borkar and Meyn, 2000,
Kushner and Yin, 2003, Borkar, 2008, Meyn, 2022].
In addition to the CTD and QTD algorithms described in this chapter, several
other approaches to incremental learning of return distributions have been
proposed. Morimura et al. [2010b] propose to update parametric density models
by taking gradients of the KL divergence between the current estimates, and
the result of applying the Bellman operator to these estimates. Barth-Maron
et al. [2018] also take this approach, using a representation based on mixtures
of Gaussians. Nam et al. [2021] also use mixtures of Gaussians, and minimise
the Cramér distance from a multi-step target, incorporating ideas from TD(λ)
[Sutton, 1984, 1988]. Gruslys et al. [2018] combine CTD with Retrace(λ), a
multi-step off-policy evaluation algorithm [Munos et al., 2016]. Nguyen et al.
[2021] combine the quantile representation with a loss based on the MMD
metrics described in Chapter 4. Martin et al. [2020] propose a proximal update
scheme for the quantile representation based on (regularised) Wasserstein flows
[Jordan et al., 1998, Cuturi, 2013, Peyré et al., 2019].
Example 6.1 is from Bellemare et al. [2016].
6.3.
The categorical temporal-difference algorithm as a mixture update was
presented by Rowland et al. [2018]. This is a variant of the C51 algorithm
introduced by Bellemare et al. [2017a], which uses a projection in a mixture of
Kullback-Leibler divergence and Cramér distance. Distributional versions of
gradient temporal-difference learning [Sutton et al., 2008a, 2009] based on the
categorical representation have also been explored by Qu et al. [2019].
6.4.
The QTD algorithm was introduced by Dabney et al. [2018b]. Quan-
tile regression itself is a long-established tool within statistics, introduced by
Koenker and Bassett Jr [1978]; Koenker [2005] is a classic reference on the
subject. The incremental rule for estimating quantiles of a fixed distribution was
in fact proposed by Robbins and Monro [1951], in the same paper that launched
the field of stochastic approximation.
6.5.
The discussion of sequences of learning rates that result in convergence
goes back to Robbins and Monro [1951], who introduced the field of stochastic
approximation. Szepesvári [1998], for example, considers this framework in
their study of the asymptotic convergence rate of Q-learning. A fine-grained
analysis in the case of temporal-difference learning algorithms, taking finite-
time concentration into account, was undertaken by Even-Dar and Mansour
[2003]; see also Azar et al. [2011].
6.6-6.10.
Our proof of Theorem 6.9, via Propositions 6.4, 6.7 & 6.8, closely
follows the argument given by Bertsekas and Tsitsiklis [1996] and Tsitsiklis
[1994]. Specifically, we adapt this argument to deal with distributional informa-
tion, rather than a single scalar value. Proposition 6.4 is a special case of the
Robbins-Siegmund theorem [Robbins and Siegmund, 1971], and a particularly
clear exposition of this and related material is given by Walton [2021]. We note
that this result can also be established via earlier results in the stochastic
approximation literature [Dvoretzky, 1956], as noted by Jaakkola et al. [1994].
Theorem 6.10 is classical, and results of this kind can be found in Bertsekas and
Tsitsiklis [1996]. Theorem 6.12 was first proven by Rowland et al. [2018], albeit
with a monotonicity argument based on that of Tsitsiklis [1994]; the argument
here is based on a contraction mapping argument to match the analysis of the
temporal-difference algorithm. For further background on signed measures, see
Doob [1994].
6.13 Exercises
Exercise 6.1.
In this chapter we argued for a correspondence between oper-
ators and incremental algorithms. This also holds true for the incremental
Monte Carlo algorithm introduced in Section 3.2. What is peculiar about the
corresponding operator? 4
Exercise 6.2.
Exercise 3.2 asked you to derive an incremental algorithm from
the n-step Bellman equation
$$
V^\pi(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V^\pi(X_n) \,\Big|\, X_0 = x \Big] .
$$
Describe this process in terms of the method where we substitute random
variables with their realisations, then derive the corresponding incremental
algorithm for state-action value functions. 4
Exercise 6.3
(*)
.
The n-step random-variable Bellman equation for a policy $\pi$ is given by
$$
G^\pi(x) \overset{\mathcal{D}}{=} \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n G^\pi(X_n), \qquad X_0 = x ,
$$
where the trajectory $(X_0 = x, A_0, R_0, \dots, X_n, A_n, R_n)$ is distributed according to
$P_\pi(\cdot \mid X_0 = x)$.
(i)
Write down the distributional form of this equation, and the corresponding
n-step distributional Bellman operator.
(ii)
Show that it is a contraction on a suitable subset of $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ with respect to
an appropriate metric.
(iii)
Further show that the composition of this operator with either the categorical
projection or the quantile projection is also a contraction mapping in the
appropriate metric.
(iv)
Using the approach described in this chapter, derive n-step versions of
categorical and quantile temporal-difference learning.
(v)
In the case of n-step CTD, describe an appropriate set of conditions that
allow for Theorem 6.9 to be used to obtain convergence to the projected
operator fixed point with probability 1. What happens to the fixed points of
the projected operators as $n \to \infty$?
4
Exercise 6.4.
Implement the quantile regression update rule (Equation 6.11).
Given an initial estimate $\theta_0 = 0$, visualise the sequence of estimates produced
by quantile regression for $\tau \in \{0.01, 0.1, 0.5\}$ and a constant step size $\alpha = 0.01$,
given samples from
(i) A normal distribution $N(1, 2)$;
(ii) A Bernoulli distribution $U(\{0, 1\})$;
(iii) The mixture distribution $\tfrac{1}{3}\delta_1 + \tfrac{2}{3}\,U([2, 3])$. 4
Exercise 6.5.
Let $\eta_0$ be an $m$-quantile return-distribution function, and let
$(x, a, r, x')$ denote a sample transition. Find a Markov decision process for
which the update rule
$$
\eta(x) \leftarrow \Pi_Q\big( (1 - \alpha)\,\eta(x) + \alpha\,(\mathrm{b}_{r,\gamma})_\#\,\eta(x') \big)
$$
does not converge. 4
Exercise 6.6.
Implement the TD, CTD, and QTD algorithms, and use these
algorithms to approximate the value (or return) function of the quick policy on
the Cliffs domain (Example 2.9). Compare their accuracy to the ground-truth
value function and return distribution function estimated using many Monte
Carlo rollouts, both in terms of an appropriate metric and by visually comparing
the approximations to the ground-truth functions.
Investigate how this accuracy is affected by different choices of constant step
sizes and sequences of step sizes that satisfy the requirements laid out in Section
6.6. What do you notice about the relative performance of these algorithms as
the degree of action noise p is varied?
Investigate what happens when we modify the TD algorithm by restricting
value function estimates to the interval $[-C, C]$, for a suitable $C \in \mathbb{R}_+$. Does
this restriction affect the performance of the algorithm differently from the
restriction to $[\theta_1, \theta_m]$ that is intrinsic to CTD? 4
Exercise 6.7.
Let $(X_k, A_k, R_k, X'_k)_{k \ge 0}$ be a random sequence of transitions such
that for each $x \in \mathcal{X}$, we have $X_k = x$ for infinitely many $k \ge 0$. Show that taking
$$
\alpha_k = \frac{1}{N_k(X_k) + 1}
$$
satisfies Assumption 6.5. 4
Exercise 6.8.
Theorems 6.10 & 6.12 establish convergence for state-indexed
value functions and return-distribution functions under TD and CTD learning,
respectively. Discuss how Theorem 6.9 can be used to establish convergence of
the corresponding state-action-indexed algorithms. 4
Exercise 6.9
(*)
.
Theorem 6.10 establishes that temporal-difference learning
converges for a reasonably wide parametrisation of the distribution of source
states and step size schedules. Given a source state $X_k$, consider the incremental
Monte Carlo update
$$
\begin{aligned}
V_{k+1}(X_k) &= (1 - \alpha_k) V_k(X_k) + \alpha_k G_k \\
V_{k+1}(x) &= V_k(x) \quad \text{if } x \ne X_k ,
\end{aligned}
$$
where $G_k \sim \eta^\pi(X_k)$ is a random return. Explain how Theorem 6.10 and its proof
should be adapted to prove that the sequence $(V_k)_{k \ge 0}$ converges to $V^\pi$. 4
Exercise 6.10 (Necessity of conditions for convergence of TD learning).
The purpose of this exercise is to explore the behaviour of TD learning when the
assumptions of Theorem 6.10 do not hold.
(i) Write down an MDP with a single state $x$, from which trajectories immediately terminate, and a sequence of positive step sizes $(\alpha_k(x))_{k \ge 0}$ satisfying
$\sum_{k \ge 0} \alpha_k(x) < \infty$, with the property that the TD update rule applied with
these step sizes produces a sequence of estimates $(V_k)_{k \ge 0}$ that does not
converge to $V^\pi$.
(ii) For the same MDP, write down a sequence of positive step sizes $(\alpha_k(x))_{k \ge 0}$
such that $\sum_{k \ge 0} \alpha_k^2(x) = \infty$, and show that the sequence of estimates $(V_k)_{k \ge 0}$
generated by TD learning with these step sizes does not converge to $V^\pi$.
(iii) Based on your answers to the previous two parts, for which values of $\varepsilon \ge 0$
do step size sequences of the form
$$
\alpha_k = \frac{1}{(N_k(X_k) + 1)^{\varepsilon}}
$$
lead to guaranteed TD convergence, assuming all states are visited infinitely
often?
(iv) Consider an MDP with a single state $x$, from which trajectories immediately terminate. Suppose the reward distribution at $x$ is the standard Cauchy
distribution, with density
$$
f(z) = \frac{1}{\pi(1 + z^2)} .
$$
Show that if $V_0$ also has a Cauchy distribution with median 0, then for any
positive step sizes $(\alpha_k(x))_{k \ge 0}$, $V_k$ has a Cauchy distribution with median
0, and hence the sequence does not converge to a constant. Hint. The
characteristic function of the Cauchy distribution is given by $s \mapsto \exp(-|s|)$.
(v) (*) In the proof of Theorem 6.9, the inequality $1 - u \le \exp(-u)$ was used to
deduce that the condition $\sum_{k \ge 0} \alpha_k(x) = \infty$ w.p. 1 is sufficient to guarantee
that
$$
\prod_{l=0}^{k} (1 - \alpha_l(x)) \to 0
$$
w.p. 1. Show that if $\alpha_k(x) \in [0, 1]$, the condition $\sum_{k \ge 0} \alpha_k(x) = \infty$ w.p. 1
is necessary as well as sufficient for the sequence $\prod_{l=0}^{k} (1 - \alpha_l(x))$ to also
converge to zero w.p. 1. 4
Exercise 6.11.
Using the tools from this chapter, prove the convergence of
the undiscounted, finite-horizon categorical Monte Carlo algorithm (Algorithm
3.3). 4
Exercise 6.12. Recall the no-loop operator introduced in Example 4.6:
$$
(\mathcal{T}_{\mathrm{NL}} V)(x) = \mathbb{E}_\pi\big[ \big(R + \gamma V(X')\big)\, \mathbb{1}\{X' \ne x\} \,\big|\, X = x \big] .
$$
Denote its fixed point by $V_{\mathrm{NL}}$. For a transition $(x, a, r, x')$ and time-varying
step-size $\alpha_k \in [0, 1)$, consider the no-loop update rule:
$$
V(x) \leftarrow
\begin{cases}
(1 - \alpha_k) V(x) + \alpha_k \big( r + \gamma V(x') \big) & \text{if } x' \ne x , \\
(1 - \alpha_k) V(x) + \alpha_k r & \text{if } x' = x .
\end{cases}
$$
(i) Demonstrate that this update rule can be derived by substitution applied to
the no-loop operator.
(ii) In words, describe how you would modify the online, incremental first-visit
Monte Carlo algorithm (Algorithm 3.1) to learn $V_{\mathrm{NL}}$.
(iii) (*) Provide conditions under which the no-loop update converges to $V_{\mathrm{NL}}$,
and prove that it does converge under those conditions. 4
Exercise 6.13.
Assumption 6.5 requires that the sequence of step sizes $(\alpha_k(x) : k \ge 0, x \in \mathcal{X})$ satisfy
$$
\sum_{k=0}^{\infty} \alpha_k(x)^2 < \infty \tag{6.30}
$$
with probability 1. For $k \ge 0$, let $N_k(x)$ be the number of times that $x$ has been
updated, and let $u_k(x)$ be the most recent time at which $X_l = x$, $l < k$, with the
convention that $u_k(x) = 1$ if $N_k(x) = 0$. Consider the step size schedule
$$
\alpha_{k+1} =
\begin{cases}
\dfrac{1}{N_k(X_k) + 1} & \text{if } u_k(X_k) \ge \tfrac{k}{2} , \\
1 & \text{otherwise.}
\end{cases}
$$
This schedule takes larger steps for states whose estimate has not been recently
updated. Suppose that $X_k$ is drawn from some distribution that puts positive probability mass on all states. Show that the sequence $(\alpha_k)_{k \ge 0}$ satisfies Equation 6.30
w.p. 1, yet there is no $B \in \mathbb{R}$ for which
$$
\sum_{k=0}^{\infty} \alpha_k(x)^2 < B
$$
with probability 1. This illustrates the need for the care taken in the proof of
Proposition 6.4. 4