6 Incremental Algorithms
The concept of experience is central to reinforcement learning. Methods such
as TD and categorical TD learning iteratively update predictions on the basis of
transitions experienced by interacting with an environment. Such incremental
algorithms are applicable in a wide range of scenarios, including those in
which no model of the environment is known, or in which the model is too
complex to allow for dynamic programming methods to be applied. Incremental
algorithms are also often easier to implement. For these reasons, they are key in
the application of reinforcement learning to many real-world domains.
With this ease of use, however, comes an added complication. In contrast to
dynamic programming algorithms, which steadily make progress towards the
desired goal, there is no guarantee that incremental methods will generate con-
sistently improving estimates of return distributions from iteration to iteration.
For example, an unusually high reward in a sampled transition may actually
lead to a short-term degrading of the value estimate for the corresponding state.
In practice, this requires making sufficiently small steps with each update, to
average out such variations. In theory, the stochastic nature of incremental
updates makes their analysis substantially more involved than the contraction
mapping arguments used for dynamic programming.
This chapter takes a closer look at the behaviour and design of incremental
algorithms, distributional and otherwise. Using the language of operators and
probability distribution representations, we also formalise what it means for an
incremental algorithm to perform well, and discuss how to analyse its asymptotic
convergence to the desired estimate.
6.1 Computation and Statistical Estimation
Iterative policy evaluation finds an approximation to the value function $V^\pi$ by successively computing the iterates
$$V_{k+1} = T^\pi V_k \,, \qquad (6.1)$$
defined by an arbitrary initial value function estimate $V_0 \in \mathbb{R}^{\mathcal{X}}$. We can also think of temporal-difference learning as computing an approximation to the value function $V^\pi$, albeit with a different mode of operation. To begin, recall from Chapter 3 the TD learning update rule:
$$V(x) \leftarrow (1-\alpha)V(x) + \alpha\big(r + \gamma V(x')\big) \,. \qquad (6.2)$$
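To make the mechanics of Equation 6.2 concrete, here is a minimal sketch of a tabular TD update in Python; the state indexing, step size, and transition tuple are illustrative assumptions rather than part of the text.

```python
import numpy as np

def td_update(V, x, r, x_next, alpha, gamma):
    """Apply the TD update of Equation 6.2 to a tabular value estimate.

    V        : 1-D array of value estimates, indexed by state
    x, x_next: integer indices of the source and successor states
    r        : sampled reward; alpha: step size; gamma: discount factor
    """
    target = r + gamma * V[x_next]              # sample target
    V[x] = (1 - alpha) * V[x] + alpha * target  # mixture update
    return V

# Illustrative usage on a 3-state chain (hypothetical transition).
V = np.zeros(3)
V = td_update(V, x=0, r=1.0, x_next=1, alpha=0.1, gamma=0.9)
```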
One of the aims of this chapter is to study the long-term behaviour of the value
function estimate V (and eventually, of estimates produced by incremental,
distributional algorithms).
At the heart of our analysis is the behaviour of a single update. That is, for a fixed $V \in \mathbb{R}^{\mathcal{X}}$ we may understand the learning dynamics of temporal-difference learning by considering the random value function estimate $V'$ defined via the sample transition model $(X = x, A, R, X')$:
$$V'(x) = (1-\alpha)V(x) + \alpha\big(R + \gamma V(X')\big) \,, \qquad (6.3)$$
$$V'(y) = V(y), \quad y \neq x \,.$$
There is a close connection between the expected effect of the update given by Equation 6.3 and iterative policy evaluation. Specifically, the expected value of the quantity $R + \gamma V(X')$ precisely corresponds to the application of the Bellman operator to $V$, evaluated at the source state $x$:
$$\mathbb{E}_\pi\big[R + \gamma V(X') \mid X = x\big] = (T^\pi V)(x) \,.$$
Consequently, in expectation TD learning adjusts its value function estimate at $x$ in the direction given by the Bellman operator:
$$\mathbb{E}_\pi\big[V'(x) \mid X = x\big] = (1-\alpha)V(x) + \alpha (T^\pi V)(x) \,. \qquad (6.4)$$
To argue that temporal-difference learning correctly finds an approximation to $V^\pi$, we must also be able to account for the random nature of TD updates. An effective approach is to rewrite Equation 6.3 as the sum of an expected target and a mean-zero noise term:
$$V'(x) = (1-\alpha)V(x) + \alpha\Big(\underbrace{(T^\pi V)(x)}_{\text{expected target}} + \underbrace{R + \gamma V(X') - (T^\pi V)(x)}_{\text{noise}}\Big) \,; \qquad (6.5)$$
with this decomposition, we may simultaneously analyse the mean dynamics of TD learning as well as the effect of the noise on the value function estimates. In
the second half of the chapter, we will use Equation 6.5 to establish that under appropriate conditions, these dynamics can be controlled so as to guarantee the convergence of temporal-difference learning to $V^\pi$, and analogously the convergence of categorical temporal-difference learning to the fixed point $\hat\eta_C$.
6.2 From Operators To Incremental Algorithms
As illustrated in the preceding section, we can explain the behaviour of temporal-difference learning (an incremental algorithm) by relating it to the Bellman operator. New incremental algorithms can also be obtained by following the reverse process: deriving an update rule from a given operator. This technique is particularly effective in distributional reinforcement learning, where one often needs to implement incremental counterparts to a variety of dynamic programming algorithms. To describe how one might derive an update rule from an operator, we now introduce an abstract framework based on what is known as stochastic approximation theory.^{45}
Let us assume that we are given a contractive operator $\mathcal{O}$ over some state-indexed quantity, and that we are interested in determining the fixed point $U^*$ of this operator. With dynamic programming methods, we obtain an approximation of $U^*$ by computing the iterates
$$U_{k+1}(x) = (\mathcal{O} U_k)(x) \,, \quad \text{for all } x \in \mathcal{X} \,.$$
To construct a corresponding incremental algorithm, we must first identify what information is available at each update; this constitutes our sampling model. For example, in the case of temporal-difference learning, this is the sample transition model $(X, A, R, X')$. For Monte Carlo algorithms, the sampling model is the random trajectory $(X_t, A_t, R_t)_{t \geq 0}$ (see Exercise 6.1). In the context of this chapter, we assume that the sampling model takes the form $(X, Y)$, where $X$ is the source state to be updated, and $Y$ comprises all other information in the model, which we term the sample experience.
Given a step size $\alpha$ and realisations $x$ and $y$ of the source state variable $X$ and sample experience $Y$, respectively, we consider incremental algorithms whose update rule can be expressed as
$$U(x) \leftarrow (1-\alpha)U(x) + \alpha \hat{\mathcal{O}}(U, x, y) \,. \qquad (6.6)$$
Here, $\hat{\mathcal{O}}(U, x, y)$ is a sample target which may depend on the current estimate $U$. Typically, the particular setting we are in also imposes some limitation on
45. Our treatment of incremental algorithms and their relation to stochastic approximation theory is far from exhaustive; the interested reader is invited to consult the bibliographical remarks.
the form of $\hat{\mathcal{O}}$. For example, when $\mathcal{O}$ is the Bellman operator $T^\pi$, although $\hat{\mathcal{O}}(U, x, y) = (\mathcal{O}U)(x)$ is a valid instantiation of Equation 6.6, its implementation might require knowledge of the environment's transition kernel and reward function. Implicit within Equation 6.6 is the notion that the space that the estimate $U$ occupies supports a mixing operation; this will indeed be the case for the algorithms we consider in this chapter, which work either with finite-dimensional parameter sets, or probability distributions themselves.
With this framework in mind, the question is what makes a sensible choice for $\hat{\mathcal{O}}$.

Unbiased update. An important case is when the sample target $\hat{\mathcal{O}}$ can be chosen so that in expectation it corresponds to the application of the operator $\mathcal{O}$:
$$\mathbb{E}\big[\hat{\mathcal{O}}(U, X, Y) \mid X = x\big] = (\mathcal{O}U)(x) \,. \qquad (6.7)$$
In general, when Equation 6.7 holds the resulting incremental algorithm is also well-behaved. More formally, we will see that under reasonable conditions, the estimates produced by such an algorithm are guaranteed to converge to $U^*$, a generalisation of our earlier statement that temporal-difference learning converges to $V^\pi$.
Conversely, when the operator $\mathcal{O}$ can be expressed as an expectation over some function of $U$, $X$, and $Y$, then it is possible to derive a sample target simply by substituting the random variables involved with their realisations. In effect, we then use the sample experience to construct an unbiased estimate of $(\mathcal{O}U)(x)$. As a concrete example, the TD target, expressed in terms of random variables, is
$$\hat{\mathcal{O}}(V, X, Y) = R + \gamma V(X') \,;$$
the corresponding update rule is
$$V(x) \leftarrow (1-\alpha)V(x) + \alpha \underbrace{\big(r + \gamma V(x')\big)}_{\text{sample target}} \,.$$
In the next section, we will show how to use this approach to derive categorical
temporal-difference learning (introduced in Chapter 3) from the categorical-
projected Bellman operator.
Example 6.1. The consistent Bellman operator is an operator over state-action value functions based on the idea of making consistent choices at each state. At a high level, the consistent operator adds the constraint that actions that leave the state unchanged should be repeated. This operator is formalised as
$$T_{\mathrm{C}} Q(x, a) = \mathbb{E}\Big[ R + \gamma \max_{a' \in \mathcal{A}} Q(X', a')\, \mathbb{1}\{X' \neq x\} + \gamma Q(x, a)\, \mathbb{1}\{X' = x\} \,\Big|\, X = x, A = a \Big] \,.$$
Let $(x, a, r, x')$ be drawn according to the sample transition model. The update rule derived by substitution is
$$Q(x, a) \leftarrow \begin{cases} (1-\alpha)Q(x, a) + \alpha\big(r + \gamma \max_{a' \in \mathcal{A}} Q(x', a')\big) & \text{if } x' \neq x \\ (1-\alpha)Q(x, a) + \alpha\big(r + \gamma Q(x, a)\big) & \text{otherwise.} \end{cases}$$
Compared to Q-learning (Section 3.7), the consistent update rule increases the action gap at each state, in the sense that its operator's fixed point $Q^*_{\mathrm{C}}$ has the property that for all $(x, a) \in \mathcal{X} \times \mathcal{A}$,
$$\max_{a' \in \mathcal{A}} Q^*_{\mathrm{C}}(x, a') - Q^*_{\mathrm{C}}(x, a) \;\geq\; \max_{a' \in \mathcal{A}} Q^*(x, a') - Q^*(x, a) \,,$$
with strict inequality whenever $P_{\mathcal{X}}(x \mid x, a) > 0$. $\triangle$
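As a quick illustration, the following sketch implements the substitution-based update from Example 6.1 for a tabular action-value estimate; the array layout and argument names are assumptions made for the example.

```python
import numpy as np

def consistent_q_update(Q, x, a, r, x_next, alpha, gamma):
    """One consistent-Bellman update obtained by substituting realisations.

    Q is a (num_states, num_actions) array of action-value estimates.
    """
    if x_next != x:
        target = r + gamma * np.max(Q[x_next])   # usual Q-learning target
    else:
        target = r + gamma * Q[x, a]             # self-transition: repeat the action
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * target
    return Q
```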
A general principle. Sometimes, expressing the operator $\mathcal{O}$ in the form of Equation 6.7 requires information that is not available to our sampling model. In this case, it is sometimes still possible to construct an update rule whose repeated application approximates the operator $\mathcal{O}$. More precisely, given a fixed estimate $\tilde{U}$, with this approach we look for a sample target function $\hat{\mathcal{O}}$ such that from a suitable initial condition, repeated updates of the form
$$U(x) \leftarrow (1-\alpha)U(x) + \alpha \hat{\mathcal{O}}(\tilde{U}, x, y)$$
lead to $U \approx \mathcal{O}\tilde{U}$. In this case, a necessary condition for $\hat{\mathcal{O}}$ to be a suitable sample target is that it should leave the fixed point $U^*$ unchanged, in expectation:
$$\mathbb{E}\big[\hat{\mathcal{O}}(U^*, X, Y) \mid X = x\big] = U^*(x) \,.$$
In Section 6.4 we will introduce quantile temporal-difference learning, an algorithm that applies this principle to find the fixed point of the quantile-projected Bellman operator.
6.3 Categorical Temporal-Difference Learning
Categorical dynamic programming (CDP) computes a sequence $(\eta_k)_{k \geq 0}$ of return-distribution functions, defined by iteratively applying the projected distributional Bellman operator $\Pi_C \mathcal{T}^\pi$ to an initial return-distribution function $\eta_0$:
$$\eta_{k+1} = \Pi_C \mathcal{T}^\pi \eta_k \,.$$
As we established in Section 5.9, the sequence generated by CDP converges to the fixed point $\hat\eta_C$. Let us express this fixed point in terms of a collection of probabilities $\big((p_i(x))_{i=1}^m : x \in \mathcal{X}\big)$ associated with $m$ particles located at $\theta_1, \ldots, \theta_m$:
$$\hat\eta_C(x) = \sum_{i=1}^m p_i(x)\, \delta_{\theta_i} \,.$$
To derive an incremental algorithm from the categorical-projected Bellman operator, let us begin by expressing the projected distributional operator $\Pi_C \mathcal{T}^\pi$ in terms of an expectation over the sample transition $(X = x, A, R, X')$:
$$(\Pi_C \mathcal{T}^\pi \eta)(x) = \Pi_C\, \mathbb{E}_\pi\big[ (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big] \,. \qquad (6.8)$$
Following the line of reasoning from Section 6.2, in order to construct an unbiased sample target by substituting $R$ and $X'$ with their realisations, we need to rewrite Equation 6.8 with the expectation outside of the projection $\Pi_C$. The following establishes the validity of exchanging the order of these two operations.
Proposition 6.2. Let $\eta \in \mathscr{F}^{\mathcal{X}}_{C,m}$ be a return function based on the $m$-categorical representation. Then for each state $x \in \mathcal{X}$,
$$(\Pi_C \mathcal{T}^\pi \eta)(x) = \mathbb{E}_\pi\big[ \Pi_C (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big] \,. \quad\triangle$$
Proposition 6.2 establishes that the projected operator $\Pi_C \mathcal{T}^\pi$ can be written in such a way that the substitution of random variables with their realisations can be performed. Consequently, we deduce that the random sample target
$$\hat{\mathcal{O}}\big(\eta, x, (R, X')\big) = \Pi_C (\mathrm{b}_{R,\gamma})_\# \eta(X')$$
provides an unbiased estimate of $(\Pi_C \mathcal{T}^\pi \eta)(x)$. For a given realisation $(x, a, r, x')$ of the sample transition, this leads to the update rule^{46}
$$\eta(x) \leftarrow (1-\alpha)\eta(x) + \alpha \underbrace{\Pi_C (\mathrm{b}_{r,\gamma})_\# \eta(x')}_{\text{sample target}} \,. \qquad (6.9)$$
The last part of the CTD derivation is to express Equation 6.9 in terms of the actual parameters being updated. These parameters are the probabilities $\big((p_i(x))_{i=1}^m : x \in \mathcal{X}\big)$ of the return-distribution function estimate $\eta$:
$$\eta(x) = \sum_{i=1}^m p_i(x)\, \delta_{\theta_i} \,.$$
The sample target in Equation 6.9 is given by the pushforward transformation of an $m$-categorical distribution ($\eta(x')$), followed by a categorical projection. As we demonstrated in Section 3.6, the projection of such a transformed distribution
46. Although the action $a$ is not needed to construct the sample target, we include it for consistency.
can be expressed concisely from a set of coefficients $\big(\zeta_{i,j}(r) : i, j \in \{1, \ldots, m\}\big)$. In terms of the triangular and half-triangular kernels $(h_i)_{i=1}^m$ that define the categorical projection (Section 5.6), these coefficients are
$$\zeta_{i,j}(r) = h_i\big(\varsigma_m^{-1}(r + \gamma\theta_j - \theta_i)\big) \,. \qquad (6.10)$$
With these coefficients the update rule over the probability parameters is
$$p_i(x) \leftarrow (1-\alpha)p_i(x) + \alpha \sum_{j=1}^m \zeta_{i,j}(r)\, p_j(x') \,.$$
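The sketch below implements this probability update in Python. The triangular-kernel coefficients are realised through the familiar "split mass between the two neighbouring particles" form of the categorical projection, under the assumption of a uniformly spaced particle grid; the function and variable names are illustrative rather than taken from the text.

```python
import numpy as np

def ctd_update(p, x, r, x_next, theta, alpha, gamma):
    """Categorical TD update of the probabilities p[x, :] (Equations 6.9-6.10).

    p     : (num_states, m) array of particle probabilities
    theta : (m,) array of fixed particle locations theta_1 < ... < theta_m
    """
    m = len(theta)
    stride = theta[1] - theta[0]            # particle gap (uniform grid assumed)
    target_p = np.zeros(m)
    for j in range(m):
        g = np.clip(r + gamma * theta[j], theta[0], theta[-1])  # pushed-forward particle
        # Linear interpolation between neighbouring locations, i.e. the
        # triangular-kernel projection coefficients.
        pos = (g - theta[0]) / stride
        lo, frac = int(np.floor(pos)), pos - np.floor(pos)
        target_p[lo] += p[x_next, j] * (1 - frac)
        if lo + 1 < m:
            target_p[lo + 1] += p[x_next, j] * frac
    p[x] = (1 - alpha) * p[x] + alpha * target_p
    return p
```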
Our derivation illustrates how substituting random variables for their realisations directly leads to an incremental algorithm, provided we have the right operator to begin with. In many situations this is simpler than the step-by-step process that we originally followed in Chapter 3. Because the random sample target is an unbiased estimate of the projected Bellman operator, it is also simpler to prove its convergence to the fixed point $\hat\eta_C$; in the second half of this chapter, we will in fact apply the same technique to analyse both temporal-difference learning and CTD.
Proof of Proposition 6.2. For a given $r \in \mathbb{R}$, $x' \in \mathcal{X}$, let us write
$$\tilde\eta(r, x') = (\mathrm{b}_{r,\gamma})_\# \eta(x') \,.$$
Fix $x \in \mathcal{X}$. For conciseness, let us define, for $z \in \mathbb{R}$,
$$\tilde h_i(z) = h_i\big(\varsigma_m^{-1}(z - \theta_i)\big) \,.$$
With this notation, we have
$$\begin{aligned}
\mathbb{E}_\pi\big[ \Pi_C (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big]
&\overset{(a)}{=} \mathbb{E}_\pi\Big[ \sum_{i=1}^m \delta_{\theta_i}\, \mathbb{E}_{Z \sim \tilde\eta(R,X')}\big[\tilde h_i(Z)\big] \,\Big|\, X = x \Big] \\
&= \sum_{i=1}^m \delta_{\theta_i}\, \mathbb{E}_\pi\Big[ \mathbb{E}_{Z \sim \tilde\eta(R,X')}\big[\tilde h_i(Z)\big] \,\Big|\, X = x \Big] \\
&\overset{(b)}{=} \sum_{i=1}^m \delta_{\theta_i}\, \mathbb{E}_{Z' \sim (\mathcal{T}^\pi\eta)(x)}\big[\tilde h_i(Z')\big] \\
&= \Pi_C(\mathcal{T}^\pi\eta)(x) \,,
\end{aligned}$$
where (a) follows by definition of the categorical projection in terms of the triangular and half-triangular kernels $(h_i)_{i=1}^m$, and (b) follows by noting that if the conditional distribution of $R + \gamma G(X')$ (where $G$ is an instantiation of $\eta$ independent of the sample transition $(x, A, R, X')$) given $R, X'$ is $\tilde\eta(R, X') = (\mathrm{b}_{R,\gamma})_\# \eta(X')$, then the unconditional distribution of $R + \gamma G(X')$ when $X = x$ is $(\mathcal{T}^\pi\eta)(x)$.
6.4 Quantile Temporal-Difference Learning
Quantile regression is a method for determining the quantiles of a probability distribution incrementally and from samples.^{47} In this section, we develop an algorithm that aims to find the fixed point $\hat\eta_Q$ of the quantile-projected Bellman operator $\Pi_Q \mathcal{T}^\pi$ via quantile regression.
To begin, suppose that given $\tau \in (0, 1)$ we are interested in estimating the $\tau$th quantile of a distribution $\nu$, corresponding to $F_\nu^{-1}(\tau)$. Quantile regression maintains an estimate $\theta$ of this quantile. Given a sample $z$ drawn from $\nu$, it adjusts $\theta$ according to
$$\theta \leftarrow \theta + \alpha\big(\tau - \mathbb{1}\{z < \theta\}\big) \,. \qquad (6.11)$$
One can show that quantile regression follows the negative gradient of the quantile loss^{48}
$$L_\tau(\theta) = \big(\tau - \mathbb{1}\{z < \theta\}\big)(z - \theta) = \big|\tau - \mathbb{1}\{z < \theta\}\big|\, |z - \theta| \,. \qquad (6.12)$$
In Equation 6.12, the term $|\tau - \mathbb{1}\{z < \theta\}|$ is an asymmetric step size which is either $\tau$ or $1 - \tau$, according to whether the sample $z$ is greater or smaller than $\theta$, respectively. When $\tau < 0.5$, samples greater than $\theta$ have a lesser effect on it than samples smaller than $\theta$; the effect is reversed when $\tau > 0.5$. The update rule in Equation 6.11 will continue to adjust the estimate until the equilibrium point $\theta^*$ is reached (Exercise 6.4 asks you to visualise the behaviour of quantile regression with different distributions). This equilibrium point is the location at which smaller and larger samples have an equal effect in expectation. At that point, letting $Z \sim \nu$, we have
$$0 = \mathbb{E}\big[\tau - \mathbb{1}\{Z < \theta^*\}\big] = \tau - \mathbb{E}\big[\mathbb{1}\{Z < \theta^*\}\big] = \tau - \mathbb{P}(Z < \theta^*)$$
$$\implies \mathbb{P}(Z < \theta^*) = \tau \implies \theta^* = F_\nu^{-1}(\tau) \,. \qquad (6.13)$$
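For instance, a minimal simulation of the update in Equation 6.11 (with an assumed Gaussian sample stream and a constant step size, both arbitrary choices) shows the estimate settling near the desired quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, alpha = 0.9, 0.01
theta = 0.0                               # quantile estimate
for _ in range(50_000):
    z = rng.normal()                      # sample from nu = N(0, 1)
    theta += alpha * (tau - (z < theta))  # Equation 6.11
print(theta)  # should be close to the 0.9 quantile of N(0, 1), about 1.28
```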
47. More precisely, quantile regression is the problem of estimating a predetermined set of quantiles of a collection of probability distributions. By extension, in this book, we also use 'quantile regression' to refer to the incremental method that solves this problem.
48. More precisely, Equation 6.11 updates $\theta$ in the direction of the negative gradient of $L_\tau$ provided that $\mathbb{P}_{Z \sim \nu}(Z = \theta) = 0$. This holds trivially if $\nu$ is a continuous probability distribution.
For ease of exposition, in the final line of Equation 6.13 we assumed that there is a unique $z \in \mathbb{R}$ for which $F_\nu(z) = \tau$; Remark 6.1 discusses the general case.

Now, let us consider applying quantile regression to find an $m$-quantile approximation to the return-distribution function (ideally, the fixed point $\hat\eta_Q$). Recall that an $m$-quantile return-distribution function $\eta \in \mathscr{F}^{\mathcal{X}}_{Q,m}$ is parametrised by the locations $\big((\theta_i(x))_{i=1}^m : x \in \mathcal{X}\big)$:
$$\eta(x) = \frac{1}{m}\sum_{i=1}^m \delta_{\theta_i(x)} \,.$$
Now, the quantile projection $\Pi_Q \nu$ of a probability distribution $\nu$ is given by
$$\Pi_Q \nu = \frac{1}{m}\sum_{i=1}^m \delta_{F_\nu^{-1}(\tau_i)} \,, \qquad \tau_i = \frac{2i-1}{2m} \quad \text{for } i = 1, \ldots, m \,.$$
Given a source state $x \in \mathcal{X}$, the general idea is to perform quantile regression for all location parameters $(\theta_i)_{i=1}^m$ simultaneously, using the quantile levels $(\tau_i)_{i=1}^m$ and samples drawn from $(\mathcal{T}^\pi\eta)(x)$. To this end, let us momentarily introduce a random variable $J$ uniformly distributed on $\{1, \ldots, m\}$. By Proposition 4.11, we have
$$\mathcal{D}_\pi\big( R + \gamma\theta_J(X') \mid X = x \big) = (\mathcal{T}^\pi\eta)(x) \,. \qquad (6.14)$$
Given a realised transition $(x, a, r, x')$, we may therefore construct $m$ sample targets $\big(r + \gamma\theta_j(x')\big)_{j=1}^m$. Applying Equation 6.11 to these targets leads to the update rule
$$\theta_i(x) \leftarrow \theta_i(x) + \frac{\alpha}{m}\sum_{j=1}^m \Big(\tau_i - \mathbb{1}\big\{ r + \gamma\theta_j(x') < \theta_i(x) \big\}\Big) \,, \quad i = 1, \ldots, m \,. \qquad (6.15)$$
This is the quantile temporal-difference learning (QTD) algorithm. A concrete instantiation in the online case is summarised by Algorithm 6.1, by analogy with the presentation of categorical temporal-difference learning in Algorithm 3.4. Note that applying Equation 6.15 requires computing a total of $m^2$ terms per update; when $m$ is large, an alternative is to instead use a single term from the sum in Equation 6.15, with $j$ sampled uniformly at random. Interestingly enough, for $m$ sufficiently small the per-step cost of QTD is less than the cost of sorting the full distribution $(\mathcal{T}^\pi\eta)(x)$ (which has up to $N_{\mathcal{X}} N_{\mathcal{R}} m$ particles), suggesting that the quantile regression approach to the projection step may be useful even in the context of distributional dynamic programming.

The use of quantile regression to derive QTD can be seen as an instance of the principle introduced at the end of Section 6.2.
Algorithm 6.1: Online quantile temporal-difference learning

Algorithm parameters: step size $\alpha \in (0, 1]$, policy $\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A})$, number of quantiles $m$, initial locations $\big((\theta^0_i(x))_{i=1}^m : x \in \mathcal{X}\big)$

$\theta_i(x) \leftarrow \theta^0_i(x)$ for $i = 1, \ldots, m$, $x \in \mathcal{X}$
$\tau_i \leftarrow \frac{2i-1}{2m}$ for $i = 1, \ldots, m$
Loop for each episode:
  Observe initial state $x_0$
  Loop for $t = 0, 1, \ldots$:
    Draw $a_t$ from $\pi(\cdot \mid x_t)$
    Take action $a_t$, observe $r_t$, $x_{t+1}$
    for $i = 1, \ldots, m$ do
      $\theta'_i \leftarrow \theta_i(x_t)$
      for $j = 1, \ldots, m$ do
        if $x_{t+1}$ is terminal then
          $g \leftarrow r_t$
        else
          $g \leftarrow r_t + \gamma\theta_j(x_{t+1})$
        $\theta'_i \leftarrow \theta'_i + \frac{\alpha}{m}\big(\tau_i - \mathbb{1}\{g < \theta_i(x_t)\}\big)$
      end for
    end for
    for $i = 1, \ldots, m$ do
      $\theta_i(x_t) \leftarrow \theta'_i$
    end for
  until $x_{t+1}$ is terminal
end
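A compact NumPy sketch of the per-transition update in Algorithm 6.1 follows; the array shapes and the terminal-state handling are assumptions made for illustration, not part of the text.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, terminal, alpha, gamma):
    """One QTD update (Equation 6.15) of the quantile locations theta[x, :].

    theta : (num_states, m) array of quantile locations
    """
    m = theta.shape[1]
    tau = (2 * np.arange(1, m + 1) - 1) / (2 * m)       # quantile levels
    if terminal:
        targets = np.full(m, r)                          # g = r for every j
    else:
        targets = r + gamma * theta[x_next]              # g = r + gamma * theta_j(x')
    # For each location i, average the quantile-regression signals over the m targets.
    indicator = (targets[None, :] < theta[x, :, None])   # (m, m): 1{g_j < theta_i(x)}
    theta[x] = theta[x] + (alpha / m) * np.sum(tau[:, None] - indicator, axis=1)
    return theta
```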
Suppose that we consider an initial return function
$$\eta_0(x) = \frac{1}{m}\sum_{j=1}^m \delta_{\theta^0_j(x)} \,.$$
If we substitute the sample target in Equation 6.15 by one constructed from this initial return function, we obtain the update rule
$$\theta_i(x) \leftarrow \theta_i(x) + \frac{\alpha}{m}\sum_{j=1}^m \Big(\tau_i - \mathbb{1}\big\{ r + \gamma\theta^0_j(x') < \theta_i(x) \big\}\Big) \,, \quad i = 1, \ldots, m \,. \qquad (6.16)$$
By inspection, we see that Equation 6.16 corresponds to quantile regression applied to the problem of determining, for each state $x \in \mathcal{X}$, the quantiles of the distribution $(\mathcal{T}^\pi\eta_0)(x)$. Consequently, one may think of quantile temporal-difference learning as performing an update that would converge to the quantiles of the target distribution, if that distribution were held fixed.
Based on this observation, we can verify that QTD is a reasonable distributional reinforcement learning algorithm by considering its behaviour at the fixed point
$$\hat\eta_Q = \Pi_Q \mathcal{T}^\pi \hat\eta_Q \,,$$
the solution found by quantile dynamic programming. Let us denote the parameters of this return function by $\hat\theta_i(x)$, for $i = 1, \ldots, m$ and $x \in \mathcal{X}$. For a given state $x$, consider the intermediate target
$$\tilde\eta(x) = (\mathcal{T}^\pi \hat\eta_Q)(x) \,.$$
Now by definition of the quantile projection operator, we have
$$\hat\theta_i(x) = F^{-1}_{\tilde\eta(x)}\Big(\frac{2i-1}{2m}\Big) \,.$$
However, by Equation 6.13, we also know that the quantile regression update rule applied at $\hat\theta_i(x)$ with $\tau_i = \frac{2i-1}{2m}$ leaves the parameter unchanged in expectation. In other words, the collection of locations $\big(\hat\theta_i(x)\big)_{i=1}^m$ is a fixed point of the expected quantile regression update, and consequently the return function $\hat\eta_Q$ is a solution of the quantile temporal-difference learning algorithm. This gives some intuition that it is indeed a valid learning rule for distributional reinforcement learning with quantile representations.
Before concluding, it is useful to illustrate why the straightforward approach taken to derive categorical temporal-difference learning, based on unbiased operator estimation, cannot be applied to the quantile setting. Recall that the quantile-projected operator takes the form
$$(\Pi_Q \mathcal{T}^\pi \eta)(x) = \Pi_Q\, \mathbb{E}_\pi\big[ (\mathrm{b}_{R,\gamma})_\# \eta(X') \mid X = x \big] \,. \qquad (6.17)$$
As the following example shows, exchanging the expectation and projection results in a different operator, one whose fixed point is not $\hat\eta_Q$. Consequently, we cannot substitute random variables for their realisations, as was done in the categorical setting.
Example 6.3. Consider an MDP with a single state $x$, a single action $a$, transition dynamics such that $x$ transitions back to itself, and immediate reward distribution $\mathcal{N}(0, 1)$. Given $\eta(x) = \delta_0$, we have $(\mathcal{T}^\pi\eta)(x) = \mathcal{N}(0, 1)$, and hence the projection via $\Pi_Q$ onto $\mathscr{F}_{Q,m}$ with $m = 1$ returns a Dirac delta on the median of this distribution: $\delta_0$.

In contrast, the sample target $(\mathrm{b}_{R,\gamma})_\# \eta(X')$ is $\delta_R$, and so the projection of this target via $\Pi_Q$ remains $\delta_R$. We therefore have
$$\mathbb{E}_\pi\big[ \Pi_Q\big((\mathrm{b}_{R,\gamma})_\# \eta\big)(X') \mid X = x \big] = \mathbb{E}_\pi[\delta_R \mid X = x] = \mathcal{N}(0, 1) \,,$$
which is distinct from the result of the projected operator, $(\Pi_Q \mathcal{T}^\pi \eta)(x) = \delta_0$. $\triangle$
6.5 An Algorithmic Template for Theoretical Analysis
In the second half of this chapter, we present a theoretical analysis of a class of incremental algorithms that includes the incremental Monte Carlo algorithm (see Exercise 6.9), temporal-difference learning, and the CTD algorithm. This analysis builds on the contraction mapping theory developed in Chapter 4, but also accounts for the randomness introduced by the use of sample targets in the update rule, via stochastic approximation theory. Compared to the analysis of dynamic programming algorithms, the main technical challenge lies in characterising the effect of this randomness on the learning process.

To begin, let us view the output of the temporal-difference learning algorithm after $k$ updates as a value function estimate $V_k$. Extending the discussion from Section 6.1, this estimate is a random quantity because it depends on the particular sample transitions observed by the agent, and possibly on the randomness in the agent's choices.^{49} We are particularly interested in the sequence of random estimates $(V_k)_{k \geq 0}$. From an initial estimate $V_0$, this sequence is formally defined as
$$V_{k+1}(X_k) = (1-\alpha_k)V_k(X_k) + \alpha_k\big(R_k + \gamma V_k(X'_k)\big)$$
$$V_{k+1}(x) = V_k(x) \quad \text{if } x \neq X_k \,,$$
where $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is the sequence of random transitions used to calculate the TD updates. In our analysis, the object of interest is the limiting point of this sequence, and we seek to answer the question: does the algorithm's estimate converge to the value function $V^\pi$? We consider the limiting point because any single update may or may not improve the accuracy of the estimate $V_k$ at the source state $X_k$. We will show that, under the right conditions, the sequence $(V_k)_{k \geq 0}$ converges to $V^\pi$. That is,
$$\lim_{k \to \infty} \big| V_k(x) - V^\pi(x) \big| = 0 \,, \quad \text{for all } x \in \mathcal{X} \,.$$
49. In this context, we even allow the step size $\alpha_k$ to be random.
More precisely, the above holds with probability 1: with overwhelming odds, the variables $X_0, R_0, X'_0, X_1, \ldots$ are drawn in such a way that $V_k \to V^\pi$.^{50}
We will prove a more general result that holds for a family of incremental algorithms whose sequence of estimates can be expressed by the template
$$U_{k+1}(X_k) = (1-\alpha_k)U_k(X_k) + \alpha_k \hat{\mathcal{O}}(U_k, X_k, Y_k)$$
$$U_{k+1}(x) = U_k(x) \quad \text{if } x \neq X_k \,. \qquad (6.18)$$
Here, $X_k$ is the (possibly random) source state at time $k$, $\hat{\mathcal{O}}(U_k, X_k, Y_k)$ is the sample target, and $\alpha_k$ is an (also possibly random) step size. As in Section 6.2, the sample experience $Y_k$ describes the collection of random variables used to construct the sample target, for example a sample trajectory or a sample transition $(X_k, A_k, R_k, X'_k)$.

Under this template, the estimate $U_k$ describes the collection of variables maintained by the algorithm and constitutes its "prediction". More specifically, it is a state-indexed collection of $m$-dimensional real-valued vectors, written $U_k \in \mathbb{R}^{\mathcal{X} \times m}$. In the case of the TD algorithm, $m = 1$ and $U_k = V_k$.
We assume that there is an operator $\mathcal{O} : \mathbb{R}^{\mathcal{X} \times m} \to \mathbb{R}^{\mathcal{X} \times m}$ whose unique fixed point is the quantity to be estimated by the incremental algorithm. If we denote this fixed point by $U^*$, this implies that $\mathcal{O}U^* = U^*$.

We further assume the existence of a base norm $\|\cdot\|$ over $\mathbb{R}^m$, extended to the space of estimates according to
$$\|U\|_\infty = \sup_{x \in \mathcal{X}} \|U(x)\| \,,$$
such that $\mathcal{O}$ is a contraction mapping of modulus $\beta$ with respect to the metric induced by $\|\cdot\|_\infty$. For TD learning, $\mathcal{O} = T^\pi$ and the base norm is simply the absolute value; the contractivity of $T^\pi$ was established by Proposition 4.4.
Within this template, there is some freedom in how the source state $X_k$ is selected. Formally, $X_k$ is assumed to be drawn from a time-varying distribution $\nu_k$ that may depend on all previously observed random variables up to but excluding time $k$, as well as the initial estimate $U_0$. That is,
$$X_k \sim \nu_k\big( \cdot \mid X_{0:k-1}, Y_{0:k-1}, \alpha_{0:k-1}, U_0 \big) \,.$$
50. Put negatively, there may be realisations of $X_0, R_0, X'_0, X_1, \ldots$ for which the sequence $(V_k)_{k \geq 0}$ does not converge, but the set of such realisations has zero probability.
This includes scenarios in which source states are drawn from a fixed distribution $\nu \in \mathscr{P}(\mathcal{X})$, enumerated in a round-robin manner, or selected in proportion to the magnitude of preceding updates (called prioritised replay; see [Moore and Atkeson, 1993, Schaul et al., 2016]). It also accounts for the situation in which states are sequentially updated along a sampled trajectory, as is typical of online algorithms.
We further assume that the sample target is an unbiased estimate of the operator $\mathcal{O}$ applied to $U_k$, and evaluated at $X_k$. That is, for all $x \in \mathcal{X}$ for which $\mathbb{P}(X_k = x) > 0$,
$$\mathbb{E}\big[ \hat{\mathcal{O}}(U_k, X_k, Y_k) \mid X_{0:k-1}, Y_{0:k-1}, \alpha_{0:k-1}, U_0, X_k = x \big] = (\mathcal{O}U_k)(x) \,.$$
This implies that Equation 6.18 can be expressed in terms of a mean-zero noise $w_k$, similar to our derivation in Section 6.1:
$$U_{k+1}(X_k) = (1-\alpha_k)U_k(X_k) + \alpha_k\Big( (\mathcal{O}U_k)(X_k) + \underbrace{\hat{\mathcal{O}}(U_k, X_k, Y_k) - (\mathcal{O}U_k)(X_k)}_{w_k} \Big) \,.$$
Because $w_k$ is zero in expectation, this assumption guarantees that, on average, the incremental algorithm must make progress towards the fixed point $U^*$. That is, if we fix the source state $X_k = x$ and step size $\alpha_k$, then
$$\begin{aligned}
&\mathbb{E}\big[ U_{k+1}(x) \mid X_{0:k-1}, Y_{0:k-1}, \alpha_{0:k-1}, X_k = x, \alpha_k \big] \qquad (6.19) \\
&\qquad = (1-\alpha_k)U_k(x) + \alpha_k\, \mathbb{E}\big[ (\mathcal{O}U_k)(x) + w_k \mid X_k = x \big] \\
&\qquad = (1-\alpha_k)U_k(x) + \alpha_k (\mathcal{O}U_k)(x) \,.
\end{aligned}$$
By choosing an appropriate sequence of step sizes $(\alpha_k)_{k \geq 0}$ and under a few additional technical conditions, we can in fact provide the stronger guarantee that the sequence of iterates $(U_k)_{k \geq 0}$ converges to $U^*$ with probability 1, as the next section illustrates.
6.6 The Right Step Sizes
To understand the role of step sizes in the learning process, consider an abstract algorithm described by Equation 6.18 and for which
$$\hat{\mathcal{O}}(U_k, X_k, Y_k) = (\mathcal{O}U_k)(X_k) \,.$$
In this case, the noise term $w_k$ is always zero and can be ignored: the abstract algorithm adjusts its estimate directly towards $\mathcal{O}U_k$. Here we should take the step sizes $(\alpha_k)_{k \geq 0}$ to be large in order to make maximal progress towards $U^*$. For $\alpha_k = 1$, we obtain a kind of dynamic programming algorithm that updates its estimate one state at a time, and whose convergence to $U^*$ can be reasonably easily demonstrated; conversely, taking $\alpha_k < 1$ must in some sense slow down the learning process.
In general, however, the noise term is not zero and cannot be neglected. In this case, large step sizes amplify $w_k$ and prevent the algorithm from converging to $U^*$ (consider, in the extreme, what happens when $\alpha_k = 1$). A suitable choice of step size must therefore balance rapid learning progress and eventual convergence to the right solution.
To illustrate what "suitable choice" might mean in practice, let us distil the issue down to its essence and consider the process that estimates the mean of a distribution $\nu \in \mathscr{P}_1(\mathbb{R})$ according to the incremental update
$$V_{k+1} = (1-\alpha_k)V_k + \alpha_k R_k \,, \qquad (6.20)$$
where $(R_k)_{k \geq 0}$ are i.i.d. random variables distributed according to $\nu$. For concreteness, let us assume that $\nu = \mathcal{N}(0, 1)$, so that we would like $(V_k)_{k \geq 0}$ to converge to 0.

Suppose that the initial estimate is $V_0 = 0$ (the desired solution) and consider three step size schedules: $\alpha_k = 0.1$, $\alpha_k = \frac{1}{k+1}$, and $\alpha_k = \frac{1}{(k+1)^2}$. Figure 6.1 illustrates the sequences of estimates obtained by applying the incremental update with each of these schedules and a single, shared sequence of realisations of the random variables $(R_k)_{k \geq 0}$.
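A few lines of Python reproduce an experiment of this kind; the number of iterations and the random seed are arbitrary choices made for the sketch, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
num_steps = 1000
rewards = rng.normal(size=num_steps)          # shared sample stream from N(0, 1)

schedules = {
    "constant 0.1": lambda k: 0.1,
    "1/(k+1)":      lambda k: 1.0 / (k + 1),
    "1/(k+1)^2":    lambda k: 1.0 / (k + 1) ** 2,
}

for name, alpha in schedules.items():
    V = 0.0
    for k in range(num_steps):
        V = (1 - alpha(k)) * V + alpha(k) * rewards[k]   # Equation 6.20
    print(f"{name:>12}: final estimate {V:+.3f}")
```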
The $\frac{1}{k+1}$ schedule corresponds to the right step size schedule for the incremental Monte Carlo algorithm (Section 3.2), and accordingly we observe that it is converging to the correct expected value.^{51} By contrast, the constant schedule continues to exhibit variations over time, as the noise is not sufficiently averaged out. The quadratic schedule ($\frac{1}{(k+1)^2}$) decreases too quickly and the algorithm settles on an incorrect prediction.
To prove the convergence of algorithms that fit the template described in Section 6.5, we will require that the sequence of step sizes satisfies the Robbins-Monro conditions [Robbins and Monro, 1951]. These conditions formalise the range of step sizes that are neither too small nor too large and hence guarantee that the algorithm must eventually find the solution $U^*$. As with the source state $X_k$, the step size $\alpha_k$ at a given time $k$ may be random, and its distribution may depend on $X_k$, $X_{0:k-1}$, $\alpha_{0:k-1}$, and $Y_{0:k-1}$, but not on the sample experience $Y_k$. As in the previous section, these conditions should hold with probability 1.
51. In our example, $V_k$ is the average of $k$ i.i.d. normal random variables, and is itself normally distributed. Its standard deviation can be computed analytically and is equal to $\frac{1}{\sqrt{k}}$ ($k \geq 1$). This implies that after $k = 1000$ iterations, we expect $V_k$ to be in the range $\pm 3 \cdot \frac{1}{\sqrt{k}} = \pm 0.111$, because 99.7% of a normal random variable's probability is within three standard deviations of its mean. Compare with Figure 6.1.
Figure 6.1
The behaviour of a simple incremental update rule (Equation 6.20) for estimating the expected value of a normal distribution. Different curves represent the sequence of estimates obtained from different step size schedules. The ground truth ($V = 0$) is indicated by the dashed line.
Condition 1: not too small. In the example above, taking $\alpha_k = \frac{1}{(k+1)^2}$ results in premature convergence of the estimate (to the wrong solution). This is because when the step sizes decay too quickly, the updates made by the algorithm may not be of large enough magnitude to reach the fixed point of interest. To avoid this situation, we require that $(\alpha_k)_{k \geq 0}$ satisfy
$$\sum_{k \geq 0} \alpha_k\, \mathbb{1}\{X_k = x\} = \infty \,, \quad \text{for all } x \in \mathcal{X} \,.$$
Implicit in this assumption is also the idea that every state should be updated infinitely often. This assumption is violated, for example, if there is a state $x$ and a time $K$ after which $X_k \neq x$, for all $k \geq K$. For a reasonably well-behaved distribution of source states, this condition is satisfied for constant step sizes, including $\alpha_k = 1$: in the absence of noise, it is possible to make rapid progress towards the fixed point. On the other hand, it disallows $\alpha_k = \frac{1}{(k+1)^2}$, since
$$\sum_{k=0}^\infty \frac{1}{(k+1)^2} = \frac{\pi^2}{6} < \infty \,.$$
Condition 2: not too large. Figure 6.1 illustrates how, with a constant step size and in the presence of noise, the estimate $U_k(x)$ continues to vary substantially over time. To avoid this issue, the step size should be decreased so that individual updates result in progressively smaller changes in the estimate. To achieve this, the second requirement on the step size sequence $(\alpha_k)_{k \geq 0}$ is
$$\sum_{k \geq 0} \alpha_k^2\, \mathbb{1}\{X_k = x\} < \infty \,, \quad \text{for all } x \in \mathcal{X} \,.$$
In reinforcement learning, a simple step size schedule that satisfies both of these conditions is
$$\alpha_k = \frac{1}{N_k(X_k) + 1} \,, \qquad (6.21)$$
where $N_k(x)$ is the number of updates to a state $x$ up to but not including algorithm time $k$. We encountered this schedule in Section 3.2 when deriving the incremental Monte Carlo algorithm. As will be shown in the following sections, this schedule is also sufficient for the convergence of the TD and CTD algorithms.^{52} Exercise 6.7 asks you to verify that Equation 6.21 satisfies the Robbins-Monro conditions and investigates other step size sequences that also satisfy these conditions.
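In code, maintaining this schedule only requires a per-state visit counter; a minimal sketch (with hypothetical variable names and an arbitrary state space size) looks as follows.

```python
import numpy as np

num_states = 10                                   # illustrative state space size
visit_counts = np.zeros(num_states, dtype=int)    # N_k(x) for each state

def step_size(x):
    """Step size of Equation 6.21 for an update at state x."""
    alpha = 1.0 / (visit_counts[x] + 1)
    visit_counts[x] += 1                          # the update at x is now counted
    return alpha
```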
6.7 Overview of Convergence Analysis
Provided that an incremental algorithm satisfies the template laid out in Section 6.5, with a step size schedule that satisfies the Robbins-Monro conditions, we can prove that the sequence of estimates produced by this algorithm must converge to the fixed point $U^*$ of the implied operator $\mathcal{O}$. Before presenting the proof in detail, we illustrate the main bounding-box argument underlying the proof.

Let us consider a two-dimensional state space $\mathcal{X} = \{x_1, x_2\}$ and an incremental algorithm for estimating a one-dimensional quantity ($m = 1$). As per the template, we consider a contractive operator $\mathcal{O} : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ given by
$$\mathcal{O}U = \big(0.8\,U(x_2),\; 0.8\,U(x_2)\big) \,;$$
note that the fixed point of $\mathcal{O}$ is $U^* = (0, 0)$. At each time step, a source state ($x_1$ or $x_2$) is chosen uniformly at random and the corresponding estimate is updated. The step sizes are $\alpha_k = (k+1)^{-0.7}$, satisfying the Robbins-Monro conditions.
Suppose first that the sample target is noiseless. That is,
$$\hat{\mathcal{O}}(U_k, X_k, Y_k) = 0.8\,U_k(x_2) \,.$$
In this case, each iteration of the algorithm contracts along a particular coordinate. Figure 6.2(a) illustrates a sequence $(U_k)_{k \geq 0}$ defined by the update equations
$$U_{k+1}(X_k) = (1-\alpha_k)U_k(X_k) + \alpha_k \hat{\mathcal{O}}(U_k, X_k, Y_k)$$
$$U_{k+1}(x) = U_k(x) \,, \quad x \neq X_k \,.$$
52. Because the process of bootstrapping constructs sample targets that are not in general unbiased with regards to the value function $V^\pi$, the optimal step size schedule for TD learning decreases at a rate that is slower than $\frac{1}{k}$. See bibliographical remarks.
Figure 6.2
(a) Example behaviour of the iterates $(U_k)_{k \geq 0}$ in the noiseless case, plotted in the $(U(x_1), U(x_2))$ plane. The colour scheme indicates the iteration number from purple ($k = 0$) through to red ($k = 1000$). (b) Example behaviour of the iterates $(U_k)_{k \geq 0}$ in the general case. (c) Behaviour of iterates for 10 random seeds, with the noiseless (expected) behaviour overlaid.
As shown in the figure, the algorithm makes steady (if not direct) progress towards the fixed point with each update. To prove that $(U_k)_{k \geq 0}$ converges to $U^*$, we first show that the error $\|U_k - U^*\|_\infty$ is bounded by a fixed quantity for all $k \geq 0$ (indicated by the outermost dashed-line square around the fixed point $U^* = 0$ in Figure 6.2a). The argument then proceeds inductively: if $U_k$ lies within a given radius of the fixed point for all $k$ greater than some $K$, then there is some $K' \geq K$ for which, for all $k \geq K'$, it must lie within the next smallest dashed-line square. We will see that this follows by contractivity of $\mathcal{O}$ and the first Robbins-Monro condition. Provided that the diameter of these squares shrinks to zero, then this establishes convergence of $U_k$ to $U^*$.
Now consider adding noise to the sample target, such that
$$\hat{\mathcal{O}}(U_k, X_k, Y_k) = 0.8\,U_k(x_2) + w_k \,.$$
For concreteness, let us take $w_k$ to be an independent random variable with distribution $\mathcal{U}([-1, 1])$. In this case, the behaviour of the sequence $(U_k)_{k \geq 0}$ is more complicated (Figure 6.2b). The sequence $(U_k)_{k \geq 0}$ no longer follows a neat path to the fixed point, but can behave somewhat more erratically. Nevertheless, the long-term behaviour exhibited by the algorithm bears similarity to the noiseless case: overall, progress is made towards the fixed point $U^*$.
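The following short simulation (a sketch using the operator, step sizes, and uniform noise assumed above; the initial estimate is an arbitrary choice) generates iterates of the kind shown in Figure 6.2(b):

```python
import numpy as np

rng = np.random.default_rng(2)
U = np.array([1.0, -1.0])                  # initial estimate (U(x_1), U(x_2))
trajectory = [U.copy()]
for k in range(1000):
    x = rng.integers(2)                    # source state chosen uniformly at random
    alpha = (k + 1) ** -0.7                # step size satisfying Robbins-Monro
    noise = rng.uniform(-1.0, 1.0)
    target = 0.8 * U[1] + noise            # noisy sample target 0.8 U(x_2) + w_k
    U[x] = (1 - alpha) * U[x] + alpha * target
    trajectory.append(U.copy())
# trajectory traces the path of (U_k) towards the fixed point (0, 0).
```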
The proof of convergence follows the same pattern as for the noiseless case: prove inductively that if $\|U_k - U^*\|_\infty$ is eventually bounded by some fixed quantity $B_l \in \mathbb{R}$, then $\|U_k - U^*\|_\infty$ is eventually bounded by a smaller quantity $B_{l+1}$. As in the noiseless case, this argument is depicted by the concentric squares in Figure 6.2c. Again, if these diameters shrink to zero, this also establishes convergence of $U_k$ to $U^*$.
Because noise can increase the error between the estimate $U_k$ and the fixed point $U^*$ at any given time step, to guarantee convergence we need to progressively decrease the step size $\alpha_k$. The second Robbins-Monro condition is sufficient for this purpose, and with it the inductive step can be proven with a more delicate argument. One additional challenge is that the base case (that $\sup_{k \geq 0} \|U_k - U^*\|_\infty < \infty$) is no longer immediate; a separate argument is required to establish this fact. This property is called the stability of the sequence $(U_k)_{k \geq 0}$, and is often one of the harder aspects of the proof of convergence of incremental algorithms.
We conclude this section with a result that is crucial in understanding the influence of noise in the algorithm. In the analysis carried out in this chapter, it is the only result whose proof requires tools from advanced probability theory.^{53}

53. Specifically, the supermartingale convergence theorem; the result is a special case of the Robbins-Siegmund theorem [Robbins and Siegmund, 1971].
Proposition 6.4. Let $(Z_k)_{k \geq 0}$ be a sequence of random variables taking values in $\mathbb{R}^m$, and $(\alpha_k)_{k \geq 0}$ be a collection of step sizes. Given $\bar Z_0 = 0$, consider the sequence defined by
$$\bar Z_{k+1} = (1-\alpha_k)\bar Z_k + \alpha_k Z_k \,.$$
Suppose that the following conditions hold with probability one:
$$\mathbb{E}[Z_k \mid Z_{0:k-1}, \alpha_{0:k}] = 0 \,, \qquad \sup_{k \geq 0} \mathbb{E}\big[\|Z_k\|^2 \mid Z_{0:k-1}, \alpha_{0:k}\big] < \infty \,,$$
$$\sum_{k=0}^\infty \alpha_k = \infty \,, \qquad \sum_{k=0}^\infty \alpha_k^2 < \infty \,.$$
Then $\bar Z_k \to 0$ with probability one. $\triangle$
The proof is given in Remark 6.2; here, we provide some intuition that can be gleaned without consulting the proof.

First, parallels can be drawn with the strong law of large numbers. Expanding the definition of $\bar Z_{k+1}$ yields
$$\bar Z_{k+1} = (1-\alpha_k)\cdots(1-\alpha_1)\alpha_0 Z_0 + (1-\alpha_k)\cdots(1-\alpha_2)\alpha_1 Z_1 + \cdots + \alpha_k Z_k \,.$$
Thus, $\bar Z_{k+1}$ is a weighted average of the mean-zero terms $(Z_l)_{l=0}^k$. If $\alpha_k = \frac{1}{k+1}$, then we obtain the usual uniformly-weighted average that appears in the strong law of large numbers. We also note that unlike the standard strong law of large numbers, the noise terms $(Z_l)_{l=0}^k$ are not necessarily independent. Nevertheless, it seems reasonable that this sequence should exhibit similar behaviour to the averages that appear in the strong law of large numbers. This also provides further intuition for the conditions of Proposition 6.4. If the variance of individual noise terms $\alpha_k Z_k$ is too great, the weighted average may not 'settle down' as the number of terms increases. Similarly, if $\sum_{k=0}^\infty \alpha_k$ is too small, the initial noise term $Z_0$ will have too large an influence over the weighted average, even as $k \to \infty$.
Second, for readers familiar with stochastic gradient descent, we can rewrite the update scheme as
$$\bar Z_{k+1} = \bar Z_k + \alpha_k\big(-\bar Z_k + Z_k\big) \,.$$
This is a stochastic gradient update for the loss function $L(\bar Z_k) = \frac{1}{2}\|\bar Z_k\|^2$ (the minimiser of which is the origin). The negative gradient of this loss is $-\bar Z_k$, $Z_k$ is a mean-zero perturbation of this gradient, and $\alpha_k$ is the step size used in the update. Proposition 6.4 can therefore be interpreted as stating that stochastic gradient descent on this specific loss function converges to the origin, under the required conditions on the step sizes and noise. It is perhaps surprising that understanding the behaviour of stochastic gradient descent in this specific setting is enough to understand the general class of algorithms expressed by Equation 6.18.
6.8 Convergence of Incremental Algorithms*
We now provide a formal run-through of the arguments of the previous section, and explain how they apply to temporal-difference learning algorithms. We begin by introducing some notation to simplify the argument. We first define a per-state step size that incorporates the choice of source state $X_k$:
$$\alpha_k(x) = \begin{cases} \alpha_k & \text{if } x = X_k \\ 0 & \text{otherwise,} \end{cases} \qquad
w_k(x) = \begin{cases} \hat{\mathcal{O}}(U_k, X_k, Y_k) - (\mathcal{O}U_k)(X_k) & \text{if } x = X_k \\ 0 & \text{otherwise.} \end{cases}$$
This allows us to recursively define $(U_k)_{k \geq 0}$ in a single equation:
$$U_{k+1}(x) = (1-\alpha_k(x))U_k(x) + \alpha_k(x)\big( (\mathcal{O}U_k)(x) + w_k(x) \big) \,. \qquad (6.22)$$
Equation 6.22 encapsulates all of the random variables $(X_k)_{k \geq 0}$, $(Y_k)_{k \geq 0}$, $(\alpha_k)_{k \geq 0}$ which together determine the sequence of estimates $(U_k)_{k \geq 0}$.
It is useful to separate the effects of the noise into two separate cases; one in which the noise has been 'processed' by an application of the contractive mapping $\mathcal{O}$, and one in which the noise has not been passed through this mapping. To this end, we introduce the cumulative external noise vectors $(W_k(x) : k \geq 0, x \in \mathcal{X})$. These are random vectors, with each $W_k(x)$ taking values in $\mathbb{R}^m$, and defined by
$$W_0(x) = 0 \,, \qquad W_{k+1}(x) = (1-\alpha_k(x))W_k(x) + \alpha_k(x)w_k(x) \,.$$
We also introduce the sigma-algebras $\mathcal{F}_k = \sigma(X_{0:k}, \alpha_{0:k}, Y_{0:k-1})$; these encode the information available to the learning agent just prior to sampling $Y_k$ and applying the update rule to produce $U_{k+1}$.

We now list several assumptions we will require of the algorithm to establish the convergence result. Recall that $\|\cdot\|$ is the base norm identified in Section 6.5, which gives rise to the supremum extension $\|\cdot\|_\infty$. In particular, we assume that $\mathcal{O}$ is a $\beta$-contraction mapping in the metric induced by $\|\cdot\|_\infty$.
Assumption 6.5 (Robbins-Monro conditions). For each $x \in \mathcal{X}$, the step sizes $(\alpha_k(x))_{k \geq 0}$ satisfy
$$\sum_{k \geq 0} \alpha_k(x) = \infty \quad \text{and} \quad \sum_{k \geq 0} \alpha_k(x)^2 < \infty$$
with probability 1. $\triangle$
The second assumption encompasses the mean-zero condition described in Section 6.5, and introduces an additional condition that the variance of this noise, conditional on the state of the algorithm, does not grow too quickly.

Assumption 6.6. The noise terms $(w_k(x) : k \geq 0, x \in \mathcal{X})$ satisfy $\mathbb{E}[w_k(x) \mid \mathcal{F}_k] = 0$ with probability 1, and
$$\mathbb{E}\big[ \|w_k(x)\|^2 \mid \mathcal{F}_k \big] \leq C_1 + C_2 \|U_k\|^2_\infty$$
with probability 1, for some constants $C_1, C_2 \geq 0$, for all $x \in \mathcal{X}$ and $k \geq 0$. $\triangle$
We would like to use Proposition 6.4 to show that the cumulative external noise $(W_k(x))_{k \geq 0}$ is well-behaved, and then use this intermediate result to establish the convergence of the sequence $(U_k)_{k \geq 0}$ itself. The proposition is almost applicable to the sequence $(W_k(x))_{k \geq 0}$. The difficulty is that the proposition stipulates that the individual noise terms $Z_k$ have bounded variance, whereas Assumption 6.6 only bounds the conditional expectation of $\|w_k(x)\|^2$ in terms of $\|U_k\|^2_\infty$, which a priori may be unbounded. Unfortunately, in temporal-difference learning algorithms, the update variance typically does scale with the magnitude of current estimates, so this is not an assumption that we can weaken. To get around this difficulty, we first establish the boundedness of the sequence $(U_k)_{k \geq 0}$, as described informally in the previous section; this property is often referred to as stability in the stochastic approximation literature.
Proposition 6.7. Suppose Assumptions 6.5 and 6.6 hold. Then there is a finite random variable $B$ such that $\sup_{k \geq 0} \|U_k\|_\infty < B$ with probability 1. $\triangle$
Proof. The idea of the proof is to work with a 'renormalised' version of the noises $(w_k(x))_{k \geq 0}$ to which Proposition 6.4 can be applied. First, we show that the contractivity of $\mathcal{O}$ means that when $U$ is sufficiently far from 0, $\mathcal{O}$ contracts the iterate back towards 0. To make this precise, we first observe that
$$\|\mathcal{O}U\|_\infty \leq \|\mathcal{O}U - U^*\|_\infty + \|U^*\|_\infty \leq \beta\|U - U^*\|_\infty + \|U^*\|_\infty \leq \beta\|U\|_\infty + D \,,$$
where $D = (1 + \beta)\|U^*\|_\infty$. Let $\bar B > \frac{D}{1-\beta}$ so that $\bar B > \beta\bar B + D$, and define $\tilde\beta \in (\beta, 1)$ by $\beta\bar B + D = \tilde\beta\bar B$. Note that for any $U$ with $\|U\|_\infty \geq \bar B$, we have
$$\|\mathcal{O}U\|_\infty \leq \beta\|U\|_\infty + D = \beta\|U\|_\infty + (\tilde\beta - \beta)\bar B \leq \beta\|U\|_\infty + (\tilde\beta - \beta)\|U\|_\infty = \tilde\beta\|U\|_\infty \,.$$
Second, we construct a sequence of bounds $(\bar B_k)_{k \geq 0}$ related to the iterates $(U_k)_{k \geq 0}$ as follows. It will be convenient to introduce $1 + \varepsilon = \tilde\beta^{-1}$, the inverse of the contraction factor $\tilde\beta$ above. Take $\bar B_0 = \max(\bar B, \|U_0\|_\infty)$, and iteratively define
$$\bar B_{k+1} = \begin{cases} \bar B_k & \text{if } \|U_{k+1}\|_\infty \leq (1+\varepsilon)\bar B_k \\ (1+\varepsilon)^{l^*}\bar B_0, \text{ where } l^* = \min\{l : \|U_{k+1}\|_\infty \leq (1+\varepsilon)^l\bar B_0\} & \text{otherwise.} \end{cases}$$
Thus, the $(\bar B_k)_{k \geq 0}$ define a kind of soft 'upper envelope' on $(\|U_k\|_\infty)_{k \geq 0}$, which is only updated when a norm exceeds the previous bound $\bar B_k$ by a factor of at least $(1 + \varepsilon)$. Note that $(\|U_k\|_\infty)_{k \geq 0}$ is unbounded if and only if $\bar B_k \to \infty$.
We now use the $(\bar B_k)_{k \geq 0}$ to define a 'renormalised' noise sequence $(\tilde w_k)_{k \geq 0}$ to which Proposition 6.4 can be applied. We set $\tilde w_k = w_k / \bar B_k$, and define $\tilde W_k$ iteratively by $\tilde W_0 = 0$ and
$$\tilde W_{k+1}(x) = (1-\alpha_k(x))\tilde W_k(x) + \alpha_k(x)\tilde w_k(x) \,.$$
By Assumption 6.6, we still have $\mathbb{E}[\tilde w_k \mid \mathcal{F}_k] = 0$, and we obtain that $\mathbb{E}[\|\tilde w_k\|^2_\infty \mid \mathcal{F}_k]$ is uniformly bounded. Using Assumption 6.5, Proposition 6.4 now applies, and we deduce that $\tilde W_k \to 0$ with probability 1.
In particular, there is a (random) time $K$ such that $\|\tilde W_k(x)\| < \varepsilon$ for all $k \geq K$ and $x \in \mathcal{X}$. Now supposing $\bar B_k \to \infty$, we may also take $K$ so that $\|U_K\|_\infty \leq \bar B_K$.
We will now prove by induction that for all $k \geq K$, we have both $\|U_k\|_\infty \leq (1+\varepsilon)\bar B_K$ and $\|U_k - W_k\|_\infty < \bar B_K$; the base case is clear from the above. For the inductive step, suppose for some $k \geq K$, we have both $\|U_k\|_\infty \leq (1+\varepsilon)\bar B_K$ and $\|U_k - W_k\|_\infty < \bar B_K$. Now observe that
$$\begin{aligned}
\|U_{k+1}(x) - W_{k+1}(x)\| &= \big\| (1-\alpha_k(x))U_k(x) + \alpha_k(x)(\mathcal{O}U_k)(x) + \alpha_k(x)w_k(x) - W_{k+1}(x) \big\| \\
&\leq (1-\alpha_k(x))\|U_k(x) - W_k(x)\| + \alpha_k(x)\|(\mathcal{O}U_k)(x)\| \\
&\leq (1-\alpha_k(x))\bar B_K + \alpha_k(x)\big(\beta\|U_k\|_\infty + D\big) \\
&\leq (1-\alpha_k(x))\bar B_K + \alpha_k(x)\,\tilde\beta(1+\varepsilon)\bar B_K \\
&\leq \bar B_K \,.
\end{aligned}$$
And additionally,
$$\|U_{k+1}\|_\infty \leq \|U_{k+1} - W_{k+1}\|_\infty + \|W_{k+1}\|_\infty \leq \bar B_K + \varepsilon\bar B_K = (1+\varepsilon)\bar B_K \,,$$
as required. In particular, $\bar B_k = \bar B_K$ for all $k \geq K$, contradicting the supposition that $\bar B_k \to \infty$. It follows that $(\bar B_k)_{k \geq 0}$, and hence $(\|U_k\|_\infty)_{k \geq 0}$, is bounded with probability 1.
We can now establish the convergence of the cumulative external noise.

Proposition 6.8. Suppose Assumptions 6.5 and 6.6 hold. Then the external noise $W_k(x)$ converges to 0 with probability 1, for each $x \in \mathcal{X}$. $\triangle$

Proof. By Proposition 6.7, there exists a finite random variable $B$ such that $\|U_k\|_\infty \leq B$ for all $k \geq 0$. We therefore have
$$\mathbb{E}\big[ \|w_k(x)\|^2 \mid \mathcal{F}_k \big] \leq C_1 + C_2 B^2 =: B'$$
with probability 1 for all $x \in \mathcal{X}$ and $k \geq 0$, by Assumption 6.6. Proposition 6.4 therefore applies to give the conclusion.
With this result in hand, we now prove the central result of this section, using the stability result as the base case for the inductive argument intuitively explained above.

Theorem 6.9. Suppose Assumptions 6.5 and 6.6 hold. Then $U_k \to U^*$ with probability one. $\triangle$
Proof. By Proposition 6.7, there is a finite random variable $B_0$ such that $\|U_k - U^*\|_\infty < B_0$ for all $k \geq 0$ with probability 1. Let $\varepsilon > 0$ be such that $\beta + 2\varepsilon < 1$; we will show by induction that if $B_l = (\beta + 2\varepsilon)B_{l-1}$ for all $l \geq 1$, then for each $l \geq 0$, there is a (possibly random) finite time $K_l$ such that $\|U_k - U^*\|_\infty < B_l$ for all $k \geq K_l$, which proves the theorem.

To prove this claim by induction, let $l \geq 0$ and suppose there is a random finite time $K_l$ such that $\|U_k - U^*\|_\infty < B_l$ for all $k \geq K_l$ with probability 1. Now let $x \in \mathcal{X}$, and $k \geq K_l$. We have
$$\begin{aligned}
U_{k+1}(x) - U^*(x) - W_{k+1}(x)
&= (1-\alpha_k(x))U_k(x) + \alpha_k(x)\big((\mathcal{O}U_k)(x) + w_k(x)\big) - U^*(x) \\
&\qquad - (1-\alpha_k(x))W_k(x) - \alpha_k(x)w_k(x) \\
&= (1-\alpha_k(x))\big(U_k(x) - U^*(x) - W_k(x)\big) + \alpha_k(x)\big((\mathcal{O}U_k)(x) - U^*(x)\big) \,.
\end{aligned}$$
Since $\mathcal{O}$ is a contraction mapping under $\|\cdot\|_\infty$ with fixed point $U^*$ and contraction modulus $\beta$, we have $\|(\mathcal{O}U_k)(x) - U^*(x)\| \leq \beta\|U_k - U^*\|_\infty < \beta B_l$, and so
$$\|U_{k+1}(x) - U^*(x) - W_{k+1}(x)\| \leq (1-\alpha_k(x))\|U_k(x) - U^*(x) - W_k(x)\| + \alpha_k(x)\beta B_l \,.$$
Letting $\Delta_k(x) = \|U_k(x) - U^*(x) - W_k(x)\|$, we then have
$$\Delta_{k+1}(x) \leq (1-\alpha_k(x))\Delta_k(x) + \alpha_k(x)\beta B_l
\;\implies\; \Delta_{k+1}(x) - \beta B_l \leq (1-\alpha_k(x))\big(\Delta_k(x) - \beta B_l\big) \,.$$
Telescoping this inequality from $K_l$ to $k$ yields
$$\Delta_{k+1}(x) - \beta B_l \leq \Big(\prod_{s=K_l}^{k}(1-\alpha_s(x))\Big)\big(\Delta_{K_l}(x) - \beta B_l\big) \,.$$
If $\Delta_{K_l}(x) - \beta B_l \leq 0$, then $\Delta_k(x) \leq \beta B_l$ for all $k \geq K_l$. If not, then we can use the inequality $1 - x \leq e^{-x}$ (applied to the step sizes $\alpha_s(x) \geq 0$) to deduce
$$\Delta_{k+1}(x) - \beta B_l \leq \exp\Big( -\sum_{s=K_l}^{k}\alpha_s(x) \Big)\big(\Delta_{K_l}(x) - \beta B_l\big) \,,$$
and since $\sum_{s \geq 0}\alpha_s(x) = \infty$ by assumption, the right-hand side tends to 0. Therefore, there exists a random finite time after which $\Delta_k(x) \leq (\beta + \varepsilon)B_l$. Since $\mathcal{X}$ is finite, there is a random finite time after which this holds for all $x \in \mathcal{X}$. Finally, since $W_k(x) \to 0$ under $\|\cdot\|$ for all $x \in \mathcal{X}$ with probability 1 by Proposition 6.8, there is a random finite time after which $\|W_k(x)\| \leq \varepsilon B_l$ for all $x \in \mathcal{X}$. Letting $K_{l+1} \geq K_l$ be the maximum of all these random times, we therefore have that for $k \geq K_{l+1}$, for all $x \in \mathcal{X}$,
$$\|U_k(x) - U^*(x)\| \leq \|U_k(x) - U^*(x) - W_k(x)\| + \|W_k(x)\| \leq (\beta + \varepsilon)B_l + \varepsilon B_l = B_{l+1} \,,$$
as required.
6.9 Convergence of Temporal-Difference Learning*
We can now apply Theorem 6.9 to demonstrate the convergence of the sequence of value function estimates produced by TD learning. Formally, we consider a stream of sample transitions $(X_k, A_k, R_k, X'_k)_{k \geq 0}$, along with associated step sizes $(\alpha_k)_{k \geq 0}$, that satisfy the Robbins-Monro conditions (Assumption 6.5) and give rise to zero-mean noise terms (Assumption 6.6). More precisely, we assume there are sequences of functions $(\nu_k)_{k \geq 0}$ and $(\chi_k)_{k \geq 0}$ such that our sample model takes the following form (for $k \geq 0$):^{54}
$$\begin{aligned}
X_k \mid (X_{0:k-1}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1}) &\sim \nu_k\big( X_{0:k-1}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1} \big) \,; \\
\alpha_k \mid (X_{0:k}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1}) &\sim \chi_k\big( X_{0:k}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k-1} \big) \,; \\
A_k \mid (X_{0:k}, A_{0:k-1}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k}) &\sim \pi( \cdot \mid X_k) \,; \\
R_k \mid (X_{0:k}, A_{0:k}, R_{0:k-1}, X'_{0:k-1}, \alpha_{0:k}) &\sim P_{\mathcal{R}}( \cdot \mid X_k, A_k) \,; \\
X'_k \mid (X_{0:k}, A_{0:k}, R_{0:k}, X'_{0:k-1}, \alpha_{0:k}) &\sim P_{\mathcal{X}}( \cdot \mid X_k, A_k) \,. \qquad (6.23)
\end{aligned}$$
A generative, or algorithmic, perspective on this model is that at each update step $k$, a source state $X_k$ and step size $\alpha_k$ are selected on the basis of all previously-observed random variables (possibly using an additional source of randomness to make this selection), and the variables $(A_k, R_k, X'_k)$ are sampled according to $\pi$ and the environment dynamics, conditionally independent of all random variables already observed given $X_k$. Readers may compare this with the model equations in Section 2.3 describing the joint distribution of a trajectory generated by following the policy $\pi$. As discussed in Sections 6.5 and 6.6, this is a fairly flexible model that allows us to analyse a variety of learning schemes.
Theorem 6.10. Consider the value function iterates $(V_k)_{k \geq 0}$ defined by some initial estimate $V_0$ and satisfying
$$V_{k+1}(X_k) = (1-\alpha_k)V_k(X_k) + \alpha_k\big(R_k + \gamma V_k(X'_k)\big)$$
$$V_{k+1}(x) = V_k(x) \quad \text{if } x \neq X_k \,,$$
where $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is a sequence of transitions. Suppose that:

(a) The source states $(X_k)_{k \geq 0}$ and step sizes $(\alpha_k)_{k \geq 0}$ satisfy the Robbins-Monro conditions:
$$\sum_{k=0}^\infty \alpha_k\, \mathbb{1}\{X_k = x\} = \infty \,, \qquad \sum_{k=0}^\infty \alpha_k^2\, \mathbb{1}\{X_k = x\} < \infty$$
with probability 1, for all $x \in \mathcal{X}$.
(b) The joint distribution of $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is an instance of the sampling model expressed in Equation 6.23.
(c) The reward distributions for all state-action pairs have finite variance.

Then $V_k \to V^\pi$ with probability 1. $\triangle$
54. As in previous chapters, terms of the form $X_{0:-1}$ arising when $k = 0$ above are to be interpreted as the empty sequence.
Theorem 6.10 gives formal meaning to our earlier assertion that the conver-
gence of incremental reinforcement learning algorithms can be guaranteed for a
variety of source state distributions. Interestingly enough, the condition on the
source state distribution appears only implicitly, through the Robbins-Monro
conditions: effectively, what matters is not so much when the states are updated,
but rather the “total amount of step size” by which the estimate may be moved.
Proof. We first observe that the temporal-difference algorithm described in the statement is an instance of the abstract algorithm described in Section 6.5, by taking $U_k = V_k$, $m = 1$, $\mathcal{O} = T^\pi$, $Y_k = (A_k, R_k, X'_k)$, and $\hat{\mathcal{O}}(U_k, X_k, Y_k) = R_k + \gamma V_k(X'_k)$. The base norm $\|\cdot\|$ is simply the absolute value on $\mathbb{R}$. In this case, $\mathcal{O}$ is a $\gamma$-contraction on $\mathbb{R}^{\mathcal{X}}$ by Proposition 4.4, with fixed point $V^\pi$, and the noise $w_k$ is equal to $R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k)$, by the decomposition in Equation 6.5.

It therefore remains to check that Assumptions 6.5 and 6.6 hold; Theorem 6.9 then applies to give the result. Assumption 6.5 is immediate from the conditions of the theorem. To see that Assumption 6.6 holds, first note that
$$\begin{aligned}
\mathbb{E}[w_k \mid \mathcal{F}_k] &= \mathbb{E}\big[ R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k) \mid \mathcal{F}_k \big] \\
&= \mathbb{E}\big[ R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k) \mid X_k, V_k \big] \\
&= 0 \,,
\end{aligned}$$
since conditional on $(X_k, V_k)$, the expectation of $R_k + \gamma V_k(X'_k)$ is $(T^\pi V_k)(X_k)$. Additionally, we note that
$$\begin{aligned}
\mathbb{E}\big[ |w_k|^2 \mid \mathcal{F}_k \big] &= \mathbb{E}\big[ |w_k|^2 \mid X_k, V_k \big] = \mathbb{E}\big[ |R_k + \gamma V_k(X'_k) - (T^\pi V_k)(X_k)|^2 \mid X_k, V_k \big] \\
&\leq 2\,\mathbb{E}\big[ |R_k + \gamma V_k(X'_k)|^2 \mid X_k, V_k \big] + 2\big((T^\pi V_k)(X_k)\big)^2 \\
&\leq C_1 + C_2 \|V_k\|^2_\infty \,,
\end{aligned}$$
for some $C_1, C_2 > 0$, as required by the assumption on boundedness of the reward variance.
6.10 Convergence of Categorical Temporal-Difference Learning*
Let us now consider proving the convergence of categorical TD learning by means of Theorem 6.9. Writing CTD in terms of a sequence of return-distribution estimates, we have
$$\eta_{k+1}(X_k) = (1-\alpha_k)\eta_k(X_k) + \alpha_k\, \Pi_C (\mathrm{b}_{R_k,\gamma})_\# \eta_k(X'_k)$$
$$\eta_{k+1}(x) = \eta_k(x) \quad \text{if } x \neq X_k \,. \qquad (6.24)$$
Following the principles of the previous section, we may decompose the update at $X_k$ into an operator term and a noise term:
$$\eta_{k+1}(X_k) = (1-\alpha_k)\eta_k(X_k) + \alpha_k\Big( \underbrace{(\Pi_C\mathcal{T}^\pi\eta_k)(X_k)}_{(\mathcal{O}U_k)(X_k)} + \underbrace{\Pi_C(\mathrm{b}_{R_k,\gamma})_\#\eta_k(X'_k) - (\Pi_C\mathcal{T}^\pi\eta_k)(X_k)}_{w_k} \Big) \,. \qquad (6.25)$$
Assuming that $R_k$ and $X'_k$ are drawn appropriately, this decomposition is sensible by virtue of Proposition 6.2, in the sense that the expectation of the sample target is the projected distributional Bellman operator:
$$\mathbb{E}_\pi\big[ \Pi_C(\mathrm{b}_{R_k,\gamma})_\#\eta_k(X'_k) \mid X_k = x \big] = (\Pi_C\mathcal{T}^\pi\eta_k)(x) \quad \text{for all } x \,. \qquad (6.26)$$
With this decomposition, the noise term is not a probability distribution but rather a signed distribution; this is illustrated in Figure 6.3. Informally speaking, a signed distribution may assign negative "probabilities", and may not integrate to one (we will revisit signed distributions in Chapter 9). Based on Proposition 6.2, we may intuit that $w_k$ is mean-zero noise, where "zero" here is to be understood as a special signed distribution.
Figure 6.3
The sample target in a categorical TD update (a) can be decomposed into an expected update specified by the operator $\Pi_C\mathcal{T}^\pi$ (b) and a mean-zero signed distribution (c).
However, expressing the CTD update rule in terms of signed distributions is not sufficient to apply Theorem 6.9. This is because the theorem requires that the iterates $(U_k(x))_{k \geq 0}$ be elements of $\mathbb{R}^m$, whereas $(\eta_k(x))_{k \geq 0}$ are probability distributions. To address this issue, we leverage the fact that categorical distributions are represented by a finite number of parameters, and view their cumulative distribution functions as vectors in $\mathbb{R}^m$.

Recall that $\mathscr{F}^{\mathcal{X}}_{C,m}$ is the space of $m$-categorical return-distribution functions. To invoke Theorem 6.9, we construct an isometry $\iota$ between $\mathscr{F}^{\mathcal{X}}_{C,m}$ and a certain subset of $\mathbb{R}^{\mathcal{X} \times m}$. For a categorical return function $\eta \in \mathscr{F}^{\mathcal{X}}_{C,m}$, write
$$\iota(\eta) = \big( F_{\eta(x)}(\theta_i) : x \in \mathcal{X},\; i \in \{1, \ldots, m\} \big) \in \mathbb{R}^{\mathcal{X} \times m} \,,$$
where as before $\theta_1, \ldots, \theta_m$ denote the locations of the $m$ particles whose probabilities are parametrised in $\mathscr{F}_{C,m}$. The isometry $\iota$ maps return functions to elements of $\mathbb{R}^{\mathcal{X} \times m}$ describing the corresponding cumulative distribution functions (CDFs), evaluated at these particles. The image of $\mathscr{F}^{\mathcal{X}}_{C,m}$ under $\iota$ is
$$\mathbb{R}^{\mathcal{X}}_\iota = \big\{ z \in \mathbb{R}^m : 0 \leq z_1 \leq \cdots \leq z_m = 1 \big\}^{\mathcal{X}} \,.$$
The inverse map
$$\iota^{-1} : \mathbb{R}^{\mathcal{X}}_\iota \to \mathscr{F}^{\mathcal{X}}_{C,m}$$
maps vectors describing the cumulative distribution functions of categorical return functions back to their distributions.
With this construction, the metric induced by the $L^2$ norm $\|\cdot\|_2$ over $\mathbb{R}^m$ is proportional to the Cramér distance between probability distributions, and is readily extended to $\mathbb{R}^{\mathcal{X} \times m}$. That is, for $\eta, \eta' \in \mathscr{F}^{\mathcal{X}}_{C,m}$, we have
$$\|\iota(\eta) - \iota(\eta')\|_{2,\infty} = \sup_{x \in \mathcal{X}} \big\| (\iota(\eta))(x) - (\iota(\eta'))(x) \big\|_2 = \frac{1}{\sqrt{\varsigma_m}}\,\overline{\ell}_2(\eta, \eta') \,.$$
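To make the embedding concrete, here is a small sketch (with assumed particle locations and probabilities chosen only for illustration) that maps categorical probability vectors to their CDF coordinates and evaluates the supremum of the per-state $L^2$ distances appearing on the left-hand side above:

```python
import numpy as np

theta = np.array([0.0, 1.0, 2.0])              # particle locations theta_1, ..., theta_m
p = np.array([[0.2, 0.5, 0.3],                 # particle probabilities for two states
              [0.1, 0.1, 0.8]])
q = np.array([[0.3, 0.4, 0.3],
              [0.0, 0.2, 0.8]])

def embed(probs):
    """CDF embedding: cumulative sums of the particle probabilities per state."""
    return np.cumsum(probs, axis=1)            # last coordinate is always 1

diff = np.linalg.norm(embed(p) - embed(q), axis=1)   # per-state L2 distance of CDFs
print(diff.max())                                     # supremum over states
```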
We will prove the convergence of the sequence $(\eta_k)_{k \geq 0}$ defined by Equation 6.24 to the fixed point of the projected operator $\Pi_C\mathcal{T}^\pi$ by applying Theorem 6.9 to the sequence $(\iota(\eta_k))_{k \geq 0}$ and the $L^2$ metric, and arguing (by isometry) that the original sequence must also converge. An important additional property of $\iota$ is that it commutes with expectations, in the following sense.
Lemma 6.11. The isometry $\iota : \mathscr{F}^{\mathcal{X}}_{C,m} \to \mathbb{R}^{\mathcal{X}}_\iota$ is an affine map. That is, for any $\eta, \eta' \in \mathscr{F}^{\mathcal{X}}_{C,m}$ and $\alpha \in [0, 1]$,
$$\iota\big(\alpha\eta + (1-\alpha)\eta'\big) = \alpha\,\iota(\eta) + (1-\alpha)\,\iota(\eta') \,.$$
As a result, if $\eta$ is a random return-distribution function, then we have
$$\mathbb{E}[\iota(\eta)] = \iota(\mathbb{E}[\eta]) \,. \quad\triangle$$
Theorem 6.12. Let $m \geq 2$ and consider the return function iterates $(\eta_k)_{k \geq 0}$ generated by Equation 6.24 from some possibly random $\eta_0$. Suppose that:

(a) The source states $(X_k)_{k \geq 0}$ and step sizes $(\alpha_k)_{k \geq 0}$ satisfy the Robbins-Monro conditions:
$$\sum_{k=0}^\infty \alpha_k\, \mathbb{1}\{X_k = x\} = \infty \,, \qquad \sum_{k=0}^\infty \alpha_k^2\, \mathbb{1}\{X_k = x\} < \infty$$
with probability 1, for all $x \in \mathcal{X}$.
(b) The joint distribution of $(X_k, A_k, R_k, X'_k)_{k \geq 0}$ is an instance of the sampling model expressed in Equation 6.23.

Then with probability 1, $\eta_k \to \hat\eta_C$ with respect to the supremum Cramér distance $\overline{\ell}_2$, where $\hat\eta_C$ is the unique fixed point of the projected operator $\Pi_C\mathcal{T}^\pi$: $\hat\eta_C = \Pi_C\mathcal{T}^\pi\hat\eta_C$. $\triangle$
Proof.
We begin by constructing a sequence $(U_k)_{k \ge 0}$, $U_k \in \mathbb{R}^{\mathcal{X}}_{\iota}$, that parallels
the sequence of return functions $(\eta_k)_{k \ge 0}$. Write
$$
O = \iota\, \Pi_C \mathcal{T}^\pi\, \iota^{-1}
$$
and define, for each $k \in \mathbb{N}$, $U_k = \iota(\eta_k)$. By Lemma 5.24, $O$ is a contraction with
modulus $\gamma^{1/2}$ in $\|\cdot\|_{2,\infty}$ and we have
$$
\begin{aligned}
U_{k+1}(X_k) &= (1 - \alpha_k) U_k(X_k) + \alpha_k\, \iota\big( \Pi_C (\mathrm{b}_{R_k,\gamma})_\#\, \eta_k(X'_k) \big) \\
&= (1 - \alpha_k) U_k(X_k) + \alpha_k \Big( (O U_k)(X_k) + \underbrace{\iota\big( \Pi_C (\mathrm{b}_{R_k,\gamma})_\#\, (\iota^{-1} U_k)(X'_k) \big) - (O U_k)(X_k)}_{\omega_k} \Big) .
\end{aligned}
$$
To see that Assumption 6.6 (bounded, mean-zero noise) holds, first note that
by Proposition 6.2 and affineness of $\iota$ and $\iota^{-1}$ from Lemma 6.11, we have
$\mathbb{E}[\omega_k \mid \mathcal{F}_k] = 0$. Furthermore, $\omega_k$ is a bounded random variable, because each
coordinate is a difference of two probabilities, and hence in the interval $[-1, 1]$.
Hence, we have $\mathbb{E}[\|\omega_k\|_2 \mid \mathcal{F}_k] < C$ for some $C > 0$, as required.
By Banach's theorem, the operator $O$ has a unique fixed point $U^*$. We can
thus apply Theorem 6.9 to conclude that the sequence $(U_k)_{k \ge 0}$ converges to $U^*$,
satisfying
$$
U^* = \iota\, \Pi_C \mathcal{T}^\pi\, \iota^{-1} U^* . \tag{6.27}
$$
Because $\iota$ is an isometry, this implies that $(\eta_k)_{k \ge 0}$ converges to $\eta^* = \iota^{-1} U^*$.
Applying $\iota^{-1}$ to both sides of Equation 6.27, we obtain
$$
\iota^{-1} U^* = \Pi_C \mathcal{T}^\pi\, \iota^{-1} U^* ,
$$
and since $\Pi_C \mathcal{T}^\pi$ has a unique fixed point we conclude that $\eta^* = \hat{\eta}_C$.
The proof of Theorem 6.12 illustrates how the parameters of categorical distribu-
tions are by design bounded, so that stability (i.e. Proposition 6.7) is immediate.
In fact, stability is also immediate for TD learning when the reward distributions
have bounded support.
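The following hypothetical sketch illustrates the behaviour guaranteed by Theorem 6.12 on a toy two-state Markov reward process (the model, support, number of updates, and step sizes are arbitrary choices made for illustration, not anything specified in the text). It runs the sampled categorical mixture update with step sizes $\alpha_k = 1/(N_k(X_k)+1)$ and compares the result against the fixed point of $\Pi_C \mathcal{T}^\pi$ obtained by deterministic iteration; the supremum Cramér distance between the two is small and shrinks as more transitions are processed.

```python
import numpy as np

def categorical_projection(atoms, probs, support):
    """Triangular-kernel categorical projection Pi_C onto the fixed support."""
    gap = support[1] - support[0]
    out = np.zeros_like(support, dtype=float)
    for z, p in zip(atoms, probs):
        z = np.clip(z, support[0], support[-1])
        b = (z - support[0]) / gap
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            out[lo] += p
        else:
            out[lo] += p * (hi - b)
            out[hi] += p * (b - lo)
    return out

rng = np.random.default_rng(4)
support = np.linspace(0.0, 10.0, 51)
gap = support[1] - support[0]
gamma, n_states = 0.8, 2
reward_mean = np.array([1.0, 2.0])       # reward = mean +/- 0.5, each w.p. 1/2
                                         # successor state uniform over states

# Sampled categorical mixture updates with step sizes 1 / (N_k(X_k) + 1).
eta = np.full((n_states, len(support)), 1.0 / len(support))
counts = np.zeros(n_states)
for _ in range(50_000):
    x = rng.integers(n_states)
    r = reward_mean[x] + rng.choice((-0.5, 0.5))
    x_next = rng.integers(n_states)
    target = categorical_projection(r + gamma * support, eta[x_next], support)
    counts[x] += 1
    eta[x] = (1.0 - 1.0 / counts[x]) * eta[x] + (1.0 / counts[x]) * target

# Fixed point of the projected operator, computed by deterministic iteration
# (the transition and reward model are known in this toy example).
def projected_bellman(cur):
    new = np.zeros_like(cur)
    for x in range(n_states):
        for x_next in range(n_states):
            for noise in (-0.5, 0.5):
                new[x] += 0.25 * categorical_projection(
                    reward_mean[x] + noise + gamma * support, cur[x_next], support)
    return new

eta_star = np.full_like(eta, 1.0 / len(support))
for _ in range(200):
    eta_star = projected_bellman(eta_star)

# Supremum Cramer distance between the sampled iterate and the fixed point.
cdf_diff = np.cumsum(eta - eta_star, axis=1)
print(np.sqrt(gap * (cdf_diff ** 2).sum(axis=1)).max())
```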
6.11 Technical Remarks
Remark 6.1.
Given a probability distribution $\nu \in \mathscr{P}(\mathbb{R})$ and a level $\tau \in (0, 1)$,
quantile regression finds a value $\theta \in \mathbb{R}$ such that
$$
F_\nu(\theta) = \tau . \tag{6.28}
$$
In some situations, for example when $\nu$ is a discrete distribution, there are
multiple values satisfying Equation 6.28. Let us write
$$
S_\tau = \{ \theta : F_\nu(\theta) = \tau \} .
$$
Then one can show that $S_\tau$ forms an interval. We can argue that quantile regression converges to this set by noting that, for $\tau \in (0, 1)$, the expected quantile
loss
$$
L_\tau(\theta) = \mathbb{E}_{Z \sim \nu}\big[ \,|\tau - \mathbb{1}\{Z < \theta\}| \cdot |Z - \theta| \,\big]
$$
is convex in $\theta$. In addition, for this loss we have that for any $\theta, \theta' \in S_\tau$ and
$\theta'' \notin S_\tau$,
$$
L_\tau(\theta) = L_\tau(\theta') < L_\tau(\theta'') .
$$
Convergence follows under appropriate conditions by appealing to standard
arguments regarding the convergence of stochastic gradient descent; see for
example Kushner and Yin [2003]. 4
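Concretely, a stochastic (sub)gradient step on the quantile loss above takes the form $\theta \leftarrow \theta + \alpha(\tau - \mathbb{1}\{Z < \theta\})$ for a sampled $Z \sim \nu$. The sketch below (hypothetical; the target distribution, step size, and sample count are arbitrary illustrative choices) runs this update and compares the resulting estimate with an empirical quantile:

```python
import numpy as np

rng = np.random.default_rng(2)
tau, alpha, theta = 0.9, 0.01, 0.0     # level, constant step size, initial estimate

# Samples from an arbitrary fixed distribution nu.
samples = rng.normal(loc=1.0, scale=2.0, size=50_000)

for z in samples:
    # Stochastic (sub)gradient step on the expected quantile loss.
    theta += alpha * (tau - float(z < theta))

print("incremental estimate:", theta)
print("empirical tau-quantile:", np.quantile(samples, tau))
```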
Remark 6.2 (Proof of Proposition 6.4).
Our goal is to show that $\|\bar{Z}_k\|_2^2$ behaves like a non-negative supermartingale, from which convergence would
follow from the supermartingale convergence theorem [see e.g. Billingsley,
2012]. We begin by expressing the squared Euclidean norms of the sequence
elements recursively, writing $\mathcal{F}_k = \sigma(Z_{0:k-1}, \alpha_{0:k})$:
$$
\begin{aligned}
\mathbb{E}[\|\bar{Z}_{k+1}\|_2^2 \mid \mathcal{F}_k]
&= \mathbb{E}[\|(1 - \alpha_k)\bar{Z}_k + \alpha_k Z_k\|_2^2 \mid \mathcal{F}_k] \\
&\overset{(a)}{=} (1 - \alpha_k)^2 \|\bar{Z}_k\|_2^2 + \alpha_k^2\, \mathbb{E}[\|Z_k\|_2^2 \mid \mathcal{F}_k] \\
&\overset{(b)}{\le} (1 - \alpha_k)^2 \|\bar{Z}_k\|_2^2 + \alpha_k^2 B
\;\le\; (1 - \alpha_k)\|\bar{Z}_k\|_2^2 + \alpha_k^2 B .
\end{aligned}
\tag{6.29}
$$
Here, (a) follows by expanding the squared norm and using $\mathbb{E}[Z_k \mid \mathcal{F}_k] = 0$, and
(b) follows from the boundedness of the conditional variance of the $(Z_k)_{k \ge 0}$,
where $B$ is a bound on such variances.
This inequality does not establish the supermartingale property, due to the
presence of the additive term $\alpha_k^2 B$ on the right-hand side. However, the ideas
behind the Robbins-Siegmund theorem [Robbins and Siegmund, 1971] can be
applied to deal with this term. The argument first constructs the sequence
$$
\varepsilon_k = \|\bar{Z}_k\|_2^2 + \sum_{s=0}^{k-1} \alpha_s \|\bar{Z}_s\|_2^2 - B \sum_{s=0}^{k-1} \alpha_s^2 .
$$
Inequality 6.29 above then shows that $(\varepsilon_k)_{k \ge 0}$ is a supermartingale, but it may
not be uniformly bounded below, meaning the supermartingale convergence
theorem still cannot be applied. Defining the stopping times $t_q = \inf\{k \ge 0 : B \sum_{s=0}^{k} \alpha_s^2 > q\}$ for each $q \in \mathbb{N}$ (with the convention that $\inf \emptyset = \infty$), each
stopped process $(\varepsilon_{k \wedge t_q})_{k \ge 0}$ is a supermartingale bounded below by $-q$, and
hence each such process converges w.p. 1 by the supermartingale convergence
theorem. However, $B \sum_{s=0}^{\infty} \alpha_s^2 < \infty$ w.p. 1 by assumption, so w.p. 1 $t_q = \infty$ for
sufficiently large $q$, and hence $\varepsilon_k$ converges w.p. 1. Since $\sum_{k=0}^{\infty} \alpha_k = \infty$ w.p.
1 by assumption, it must be the case that $\|\bar{Z}_k\|_2^2 \to 0$ as $k \to \infty$, in order for
$\varepsilon_k$ to have a finite limit, and hence we are done. Exercise 6.13 demonstrates the necessity of this somewhat involved argument. 4
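To see the conclusion of Proposition 6.4 in action, the short sketch below (hypothetical; the noise distribution, dimension, and Robbins-Monro step sizes are arbitrary illustrative choices) runs the averaging recursion $\bar{Z}_{k+1} = (1 - \alpha_k)\bar{Z}_k + \alpha_k Z_k$ with mean-zero noise and prints the shrinking norm $\|\bar{Z}_k\|_2$:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 5
z_bar = rng.normal(size=dim)             # arbitrary starting point

checkpoints = {10**2: None, 10**4: None, 10**6: None}
for k in range(10**6):
    alpha = 1.0 / (k + 1)                # Robbins-Monro step sizes
    z_noise = rng.normal(size=dim)       # mean-zero, bounded-variance noise
    z_bar = (1.0 - alpha) * z_bar + alpha * z_noise
    if k + 1 in checkpoints:
        checkpoints[k + 1] = np.linalg.norm(z_bar)

print(checkpoints)   # the norm of the running average shrinks towards zero
```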
6.12 Bibliographical Remarks
The focus of this chapter has been on developing and analysing single-step
temporal-difference algorithms. Further algorithmic developments include the
use of multi-step returns [Sutton, 1988], off-policy corrections [Precup et al.,
2000], and gradient-based algorithms [Sutton et al., 2009, 2008a]; the exercises
in this chapter develop a few such approaches.
6.1-6.2.
This chapter analyses incremental algorithms through the lens of
approximating the application of dynamic programming operators. Temporal-
difference algorithms have a long history [Samuel, 1959], and the idea of
incremental approximations to dynamic programming formed motivation for
several general-purpose temporal-difference learning algorithms [Sutton, 1984,
1988, Watkins, 1989].
Although early proofs of particular kinds of convergence for these algorithms
did not directly exploit this connection with dynamic programming [Watkins,
1989, Watkins and Dayan, 1992, Dayan, 1992], later a strong theoretical connec-
tion was established that viewed these algorithms through the lens of stochastic
approximation theory, allowing for a unified approach to proving almost-sure
convergence [Gurvits et al., 1994, Dayan and Sejnowski, 1994, Tsitsiklis, 1994,
Jaakkola et al., 1994, Bertsekas and Tsitsiklis, 1996, Littman and Szepesvári,
1996]. The unbiased estimation framework presented comes from these works,
and the second principle is based on the ideas behind two-timescale algo-
rithms [Borkar, 1997, 2008]. A broader framework based on asymptotically
approximating the trajectories of differential equations is a central theme of
algorithm design and stochastic approximation theory more generally [Ljung,
1977, Kushner and Clark, 1978, Benveniste et al., 2012, Borkar and Meyn, 2000,
Kushner and Yin, 2003, Borkar, 2008, Meyn, 2022].
In addition to the CTD and QTD algorithms described in this chapter, several
other approaches to incremental learning of return distributions have been
proposed. Morimura et al. [2010b] propose to update parametric density models
by taking gradients of the KL divergence between the current estimates, and
the result of applying the Bellman operator to these estimates. Barth-Maron
et al. [2018] also take this approach, using a representation based on mixtures
of Gaussians. Nam et al. [2021] also use mixtures of Gaussians, and minimise
the Cramér distance from a multi-step target, incorporating ideas from TD(λ)
[Sutton, 1984, 1988]. Gruslys et al. [2018] combine CTD with Retrace(λ), a
multi-step off-policy evaluation algorithm [Munos et al., 2016]. Nguyen et al.
[2021] combine the quantile representation with a loss based on the MMD
metrics described in Chapter 4. Martin et al. [2020] propose a proximal update
scheme for the quantile representation based on (regularised) Wasserstein flows
[Jordan et al., 1998, Cuturi, 2013, Peyré et al., 2019].
Example 6.1 is from Bellemare et al. [2016].
6.3.
The categorical temporal-difference algorithm as a mixture update was
presented by Rowland et al. [2018]. This is a variant of the C51 algorithm
introduced by Bellemare et al. [2017a], which uses a projection in a mixture of
Kullback-Leibler divergence and Cramér distance. Distributional versions of
gradient temporal-difference learning [Sutton et al., 2008a, 2009] based on the
categorical representation have also been explored by Qu et al. [2019].
6.4.
The QTD algorithm was introduced by Dabney et al. [2018b]. Quan-
tile regression itself is a long-established tool within statistics, introduced by
Koenker and Bassett Jr [1978]; Koenker [2005] is a classic reference on the
subject. The incremental rule for estimating quantiles of a fixed distribution was
in fact proposed by Robbins and Monro [1951], in the same paper that launched
the field of stochastic approximation.
6.5.
The discussion of sequences of learning rates that result in convergence
goes back to Robbins and Monro [1951], who introduced the field of stochastic
approximation. Szepesvári [1998], for example, considers this framework in
their study of the asymptotic convergence rate of Q-learning. A fine-grained
analysis in the case of temporal-difference learning algorithms, taking finite-
time concentration into account, was undertaken by Even-Dar and Mansour
[2003]; see also Azar et al. [2011].
6.6-6.10.
Our proof of Theorem 6.9, via Propositions 6.4, 6.7 & 6.8, closely
follows the argument given by Bertsekas and Tsitsiklis [1996] and Tsitsiklis
[1994]. Specifically, we adapt this argument to deal with distributional informa-
tion, rather than a single scalar value. Proposition 6.4 is a special case of the
Robbins-Siegmund theorem [Robbins and Siegmund, 1971], and a particularly
clear exposition of this and related material is given by Walton [2021]. We note
that this result can also be established via earlier results in the stochastic
approximation literature [Dvoretzky, 1956], as noted by Jaakkola et al. [1994].
Theorem 6.10 is classical, and results of this kind can be found in Bertsekas and
Tsitsiklis [1996]. Theorem 6.12 was first proven by Rowland et al. [2018], albeit
with a monotonicity argument based on that of Tsitsiklis [1994]; the argument
here is based on a contraction mapping argument to match the analysis of the
temporal-difference algorithm. For further background on signed measures, see
Doob [1994].
6.13 Exercises
Exercise 6.1.
In this chapter we argued for a correspondence between oper-
ators and incremental algorithms. This also holds true for the incremental
Monte Carlo algorithm introduced in Section 3.2. What is peculiar about the
corresponding operator? 4
Exercise 6.2.
Exercise 3.2 asked you to derive an incremental algorithm from
the n-step Bellman equation
$$
V^\pi(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V^\pi(X_n) \,\Big|\, X_0 = x \Big] .
$$
Describe this process in terms of the method where we substitute random
variables with their realisations, then derive the corresponding incremental
algorithm for state-action value functions. 4
Exercise 6.3
(*)
.
The n-step random-variable Bellman equation for a policy $\pi$ is given by
$$
G^\pi(x) \overset{\mathcal{D}}{=} \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n G^\pi(X_n), \qquad X_0 = x ,
$$
where the trajectory $(X_0 = x, A_0, R_0, \dots, X_n, A_n, R_n)$ is distributed according to
$P_\pi(\cdot \mid X_0 = x)$.
(i)
Write down the distributional form of this equation, and the corresponding
n-step distributional Bellman operator.
(ii)
Show that it is a contraction on a suitable subset of $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$ with respect to
an appropriate metric.
(iii)
Further show that the composition of this operator with either the categorical
projection or the quantile projection is also a contraction mapping in the
appropriate metric.
(iv)
Using the approach described in this chapter, derive n-step versions of
categorical and quantile temporal-difference learning.
(v)
In the case of n-step CTD, describe an appropriate set of conditions that
allow for Theorem 6.9 to be used to obtain convergence to the projected
operator fixed point with probability 1. What happens to the fixed points of
the projected operators as $n \to \infty$?
4
Exercise 6.4.
Implement the quantile regression update rule (Equation 6.11).
Given an initial estimate $\theta_0 = 0$, visualise the sequence of estimates produced
by quantile regression for $\tau \in \{0.01, 0.1, 0.5\}$ and a constant step size $\alpha = 0.01$,
given samples from
(i) A normal distribution $N(1, 2)$;
(ii) A Bernoulli distribution $U(\{0, 1\})$;
(iii) The mixture distribution $\tfrac{1}{3}\delta_1 + \tfrac{2}{3}\,U([2, 3])$. 4
Exercise 6.5.
Let $\eta_0$ be an $m$-quantile return-distribution function, and let
$(x, a, r, x')$ denote a sample transition. Find a Markov decision process for
which the update rule
$$
\eta(x) \leftarrow \Pi_Q\big( (1 - \alpha)\,\eta(x) + \alpha\,(\mathrm{b}_{r,\gamma})_\#\,\eta(x') \big)
$$
does not converge. 4
Exercise 6.6.
Implement the TD, CTD, and QTD algorithms, and use these
algorithms to approximate the value (or return) function of the quick policy on
the Cliffs domain (Example 2.9). Compare their accuracy to the ground-truth
value function and return distribution function estimated using many Monte
Carlo rollouts, both in terms of an appropriate metric and by visually comparing
the approximations to the ground-truth functions.
Investigate how this accuracy is affected by different choices of constant step
sizes and sequences of step sizes that satisfy the requirements laid out in Section
6.6. What do you notice about the relative performance of these algorithms as
the degree of action noise p is varied?
Investigate what happens when we modify the TD algorithm by restricting
value function estimates to the interval $[-C, C]$, for a suitable $C \in \mathbb{R}_+$. Does
this restriction affect the performance of the algorithm differently from the
restriction to $[\theta_1, \theta_m]$ that is intrinsic to CTD? 4
Exercise 6.7.
Let $(X_k, A_k, R_k, X'_k)_{k \ge 0}$ be a random sequence of transitions such
that for each $x \in \mathcal{X}$, we have $X_k = x$ for infinitely many $k \ge 0$. Show that taking
$$
\alpha_k = \frac{1}{N_k(X_k) + 1}
$$
satisfies Assumption 6.5. 4
Exercise 6.8.
Theorems 6.10 & 6.12 establish convergence for state-indexed
value functions and return-distribution functions under TD and CTD learning,
respectively. Discuss how Theorem 6.9 can be used to establish convergence of
the corresponding state-action-indexed algorithms. 4
Exercise 6.9
(*)
.
Theorem 6.10 establishes that temporal-difference learning
converges for a reasonably wide parametrisation of the distribution of source
states and step size schedules. Given a source state $X_k$, consider the incremental
Monte Carlo update
$$
\begin{aligned}
V_{k+1}(X_k) &= (1 - \alpha_k) V_k(X_k) + \alpha_k G_k \\
V_{k+1}(x) &= V_k(x) \quad \text{if } x \ne X_k ,
\end{aligned}
$$
where $G_k \sim \eta^\pi(X_k)$ is a random return. Explain how Theorem 6.10 and its proof
should be adapted to prove that the sequence $(V_k)_{k \ge 0}$ converges to $V^\pi$. 4
Exercise 6.10 (Necessity of conditions for convergence of TD learning).
The purpose of this exercise is to explore the behaviour of TD learning when the
assumptions of Theorem 6.10 do not hold.
(i) Write down an MDP with a single state $x$, from which trajectories immediately terminate, and a sequence of positive step sizes $(\alpha_k(x))_{k \ge 0}$ satisfying
$\sum_{k \ge 0} \alpha_k(x) < \infty$, with the property that the TD update rule applied with
these step sizes produces a sequence of estimates $(V_k)_{k \ge 0}$ that does not
converge to $V^\pi$.
(ii) For the same MDP, write down a sequence of positive step sizes $(\alpha_k(x))_{k \ge 0}$
such that $\sum_{k \ge 0} \alpha_k^2(x) = \infty$, and show that the sequence of estimates $(V_k)_{k \ge 0}$
generated by TD learning with these step sizes does not converge to $V^\pi$.
(iii) Based on your answers to the previous two parts, for which values of $\varepsilon \ge 0$
do step size sequences of the form
$$
\alpha_k = \frac{1}{(N_k(X_k) + 1)^{\varepsilon}}
$$
lead to guaranteed TD convergence, assuming all states are visited infinitely
often?
(iv) Consider an MDP with a single state $x$, from which trajectories immediately terminate. Suppose the reward distribution at $x$ is the standard Cauchy
distribution, with density
$$
f(z) = \frac{1}{\pi(1 + z^2)} .
$$
Show that if $V_0$ also has a Cauchy distribution with median 0, then for any
positive step sizes $(\alpha_k(x))_{k \ge 0}$, $V_k$ has a Cauchy distribution with median
0, and hence the sequence does not converge to a constant. Hint. The
characteristic function of the Cauchy distribution is given by $s \mapsto \exp(-|s|)$.
(v) (*) In the proof of Theorem 6.9, the inequality $1 - u \le \exp(-u)$ was used to
deduce that the condition $\sum_{k \ge 0} \alpha_k(x) = \infty$ w.p. 1 is sufficient to guarantee
that
$$
\prod_{l=0}^{k} (1 - \alpha_l(x)) \to 0
$$
w.p. 1. Show that if $\alpha_k(x) \in [0, 1]$, the condition $\sum_{k \ge 0} \alpha_k(x) = \infty$ w.p. 1
is necessary as well as sufficient for the sequence $\prod_{l=0}^{k} (1 - \alpha_l(x))$ to also
converge to zero w.p. 1. 4
Exercise 6.11.
Using the tools from this chapter, prove the convergence of
the undiscounted, finite-horizon categorical Monte Carlo algorithm (Algorithm
3.3). 4
Exercise 6.12. Recall the no-loop operator introduced in Example 4.6:
$$
(\mathcal{T}_{\mathrm{NL}} V)(x) = \mathbb{E}_\pi\big[ \big(R + \gamma V(X')\big)\, \mathbb{1}\{X' \ne x\} \,\big|\, X = x \big] .
$$
Denote its fixed point by $V_{\mathrm{NL}}$. For a transition $(x, a, r, x')$ and time-varying
step-size $\alpha_k \in [0, 1)$, consider the no-loop update rule:
$$
V(x) \leftarrow
\begin{cases}
(1 - \alpha_k) V(x) + \alpha_k \big( r + \gamma V(x') \big) & \text{if } x' \ne x , \\
(1 - \alpha_k) V(x) + \alpha_k r & \text{if } x' = x .
\end{cases}
$$
(i) Demonstrate that this update rule can be derived by substitution applied to
the no-loop operator.
(ii) In words, describe how you would modify the online, incremental first-visit
Monte Carlo algorithm (Algorithm 3.1) to learn $V_{\mathrm{NL}}$.
(iii) (*) Provide conditions under which the no-loop update converges to $V_{\mathrm{NL}}$,
and prove that it does converge under those conditions. 4
Exercise 6.13.
Assumption 6.5 requires that the sequence of step sizes $(\alpha_k(x) : k \ge 0, x \in \mathcal{X})$ satisfy
$$
\sum_{k=0}^{\infty} \alpha_k(x)^2 < \infty \tag{6.30}
$$
with probability 1. For $k \ge 0$, let $N_k(x)$ be the number of times that $x$ has been
updated, and let $u_k(x)$ be the most recent time at which $X_l = x$, $l < k$, with the
convention that $u_k(x) = 1$ if $N_k(x) = 0$. Consider the step size schedule
$$
\alpha_{k+1} =
\begin{cases}
\dfrac{1}{N_k(X_k) + 1} & \text{if } u_k(X_k) \ge \tfrac{k}{2} , \\
1 & \text{otherwise.}
\end{cases}
$$
This schedule takes larger steps for states whose estimate has not been recently
updated. Suppose that $X_k$ is drawn from some distribution that puts positive probability mass on all states. Show that the sequence $(\alpha_k)_{k \ge 0}$ satisfies Equation 6.30
w.p. 1, yet there is no $B \in \mathbb{R}$ for which
$$
\sum_{k=0}^{\infty} \alpha_k(x)^2 < B
$$
with probability 1. This illustrates the need for the care taken in the proof of
Proposition 6.4. 4