3 Learning the Return Distribution

Reinforcement learning provides a computational model of what and how an animal, or more generally an agent, learns to predict on the basis of received experience. Often, the prediction of interest is about the expected return under the initial state distribution or, counterfactually, from a query state $x \in \mathcal{X}$. In the latter case, making good predictions is equivalent to having a good estimate of the value function $V^\pi$. From these predictions, reinforcement learning also seeks to explain how the agent might best control its environment, for example to maximise its expected return.

By learning from experience, we generally mean from data rather than from the Markov decision process description of the environment (its transition kernel, reward distribution, and so on). In many settings, this data is taken from sample interactions with the environment. In their simplest forms, these interactions are realisations of the random trajectory $(X_t, A_t, R_t)_{t \ge 0}$. The record of a particular game of chess and the fluctuations of an investment account throughout the month are two examples of sample trajectories. Learning from experience is a powerful paradigm, because it frees us from needing a complete description of the environment, an often impractical if not infeasible requirement. It also enables incremental algorithms whose run time does not depend on the size of the environment and that can easily be implemented and parallelised.

The aim of this chapter is to introduce a concrete algorithm for learning return distributions from experience, called categorical temporal-difference learning. In doing so, we will provide an overview of the design choices that must be made when creating distributional reinforcement learning algorithms. While there are well-established, and in some cases definitive methods for learning to predict the expected return in classical reinforcement learning, in the distributional setting choosing the right algorithm requires balancing multiple, sometimes conflicting considerations. In great part, this is because of the unique and exciting challenges that arise when one wishes to estimate sometimes intricate probability distributions, rather than scalar expectations.

3.1 The Monte Carlo Method

Birds such as the pileated woodpecker follow a feeding routine that regularly takes them back to the same foraging grounds. The success of this routine can be measured in terms of the total amount of food obtained during a fixed period of time, say a single day. As part of a field study, it may be desirable to predict the success of a particular bird's routine on the basis of a limited set of observations; for example, to assess its survival chances at the beginning of winter based on feeding observations from the summer months. In reinforcement learning terms, we view this as the problem of learning to predict the expected return (total food per day) of a given policy $\pi$ (the feeding routine). Here, variations in weather, human activity, and other foraging animals are but a few of the factors that affect the amount of food obtained on any particular day.

In our example, the problem of learning to predict is abstractly a problem of statistical estimation. To this end, let us model the woodpecker's feeding routine as a Markov decision process.^{17} We associate each day with a sample trajectory or episode, corresponding to measurements made at regular intervals about the bird's location $x$, behaviour $a$, and per-period food intake $r$. Suppose that we have observed a set of $K$ sample trajectories,

$$\Big( (x_{k,t}, a_{k,t}, r_{k,t})_{t=0}^{T_k - 1} \Big)_{k=1}^{K}, \qquad (3.1)$$

where we use $k$ to index the trajectory and $t$ to index time, and where $T_k$ denotes the number of measurements taken each day. In this example, it is most sensible to assume a fixed number of measurements $T_k = T$, but in the general setting $T_k$ may be random and possibly dependent on the trajectory, often corresponding to the time when a terminal state is first reached. For now, let us also assume that there is a unique starting state $x_0$, such that $x_{k,0} = x_0$ for all $k$. We are interested in the problem of estimating the expected return

$$\mathbb{E}_\pi \Big[ \sum_{t=0}^{T-1} \gamma^t R_t \Big] = V^\pi(x_0),$$

corresponding to the expected per-day food intake of our bird.^{18}

17. Although this may be a reasonable approximation of reality, it is useful to remember that concepts such as the Markov property are in this case modelling assumptions, rather than actual facts.

18. In this particular example, a discount factor of $\gamma = 1$ is reasonable given that $T$ is fixed and we should have no particular preference for mornings over evenings.


Monte Carlo methods estimate the expected return by averaging the outcomes of observed trajectories. Let us denote by $g_k$ the sample return for the $k$th trajectory:

$$g_k = \sum_{t=0}^{T-1} \gamma^t r_{k,t}. \qquad (3.2)$$

The sample-mean Monte Carlo estimate is the average of these $K$ sample returns:

$$\hat{V}^\pi(x_0) = \frac{1}{K} \sum_{k=1}^{K} g_k. \qquad (3.3)$$

This is a sensible procedure that also has the benefit of being simple to implement and broadly applicable. In our example above, the Monte Carlo method corresponds to estimating the expected per-day food intake by simply averaging the total intake measured over different days.
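The sample-mean estimate of Equations 3.2 and 3.3 can be sketched in a few lines of code. In this hypothetical sketch, each trajectory is a list of (state, action, reward) triples and `gamma` is the discount factor; all names are illustrative, not from the text.

```python
def sample_return(rewards, gamma):
    """Discounted sample return g_k = sum_t gamma^t r_{k,t} of one trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

def monte_carlo_estimate(trajectories, gamma=1.0):
    """Average the K sample returns to estimate V^pi(x_0) (Equation 3.3)."""
    returns = [sample_return([r for (_, _, r) in traj], gamma)
               for traj in trajectories]
    return sum(returns) / len(returns)
```

For instance, two one-step trajectories with rewards 1 and 3 from the same starting state yield the estimate 2.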

Example 3.1. Monte Carlo tree search is a family of approaches that have proven effective for planning in stochastic environments and have been used to design state-of-the-art computer programs that play the game of Go. Monte Carlo tree search combines a search procedure with so-called Monte Carlo rollouts (simulations). In a typical implementation, a computer Go algorithm will maintain a partial search tree whose nodes are Go positions, rooted in the current board configuration. Nodes at the fringe of this tree are evaluated by performing a large number of rollouts of a fixed rollout policy (often uniformly random) from the nodes' position to the game's end.

By defining the return of one rollout to be 1 if the game is won and 0 if the game is lost, the expected undiscounted return from a leaf node corresponds to the probability of winning the game (under the rollout policy) from that position. By estimating the expected return from complete rollouts, the Monte Carlo method avoids the challenges associated with devising a heuristic to determine the value of a given position.

We can use the Monte Carlo method to estimate the value function $V^\pi$ rather than only the expected return from the initial state. Suppose now that the sample trajectories (Equation 3.1) have different starting states (that is, $x_{k,0}$ varies across trajectories). The Monte Carlo method constructs the value function estimate

$$\hat{V}^\pi(x) = \frac{1}{N(x)} \sum_{k=1}^{K} \mathbb{1}\{x_{k,0} = x\}\, g_k,$$

where

$$N(x) = \sum_{k=1}^{K} \mathbb{1}\{x_{k,0} = x\}$$

is the number of trajectories whose starting state is $x$. For simplicity of exposition, we assume here that $N(x) > 0$ for all $x \in \mathcal{X}$.

Learning the value function is useful because it provides finer-grained detail about the environment and chosen policy compared to only learning about the expected return from the initial state. Often, estimating the full value function can be done at little additional cost, because sample trajectories contain information about the expected return from multiple states. For example, given a collection of $K$ sample trajectories drawn independently from the generative equations, the first-visit Monte Carlo estimate is

$$\hat{V}^\pi_{\mathrm{FV}}(x) = \frac{1}{N_{\mathrm{FV}}(x)} \sum_{k=1}^{K} \sum_{t=0}^{T_k - 1} \underbrace{\mathbb{1}\{x_{k,t} = x,\ x_{k,0}, \dots, x_{k,t-1} \ne x\}}_{c_{k,t}} \sum_{s=t}^{T_k - 1} \gamma^{s-t} r_{k,s}$$

$$N_{\mathrm{FV}}(x) = \sum_{k=1}^{K} \sum_{t=0}^{T_k - 1} c_{k,t}.$$

The first-visit Monte Carlo estimate treats each time step as the beginning of a new trajectory; this is justified by the Markov property and time-homogeneity of the random trajectory $(X_t, A_t, R_t)_{t \ge 0}$ (Section 2.6). The restriction to the first occurrence of $x$ in a particular trajectory guarantees that $\hat{V}^\pi_{\mathrm{FV}}$ behaves like the sample-mean estimate.
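A batch version of the first-visit estimator might be sketched as follows. Here `trajectories` is a list of (states, rewards) pairs, and only the first occurrence of each state in a trajectory contributes a return; the helper names are illustrative assumptions, not from the text.

```python
from collections import defaultdict

def first_visit_mc(trajectories, gamma=1.0):
    """First-visit Monte Carlo estimate of V^pi from (states, rewards) pairs."""
    totals = defaultdict(float)   # sum of first-visit returns observed at x
    counts = defaultdict(int)     # N_FV(x): number of first visits to x
    for states, rewards in trajectories:
        seen = set()
        for t, x in enumerate(states):
            if x in seen:
                continue          # c_{k,t} = 0: x was already visited earlier
            seen.add(x)
            # discounted return from time t onwards
            g = sum(gamma ** (s - t) * rewards[s] for s in range(t, len(rewards)))
            totals[x] += g
            counts[x] += 1
    return {x: totals[x] / counts[x] for x in totals}
```

With two undiscounted trajectories, one visiting states a then b with rewards 1 and 2, the other visiting b alone with reward 5, the estimate is 3 at a and (2 + 5)/2 = 3.5 at b.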

3.2 Incremental Learning

Both in practice and in theory, it is useful to consider a learning model under which sample trajectories are processed sequentially, rather than all at once. Algorithms that operate in this fashion are called incremental algorithms, as they maintain a running value function estimate $V \in \mathbb{R}^{\mathcal{X}}$ which they improve with each sample.^{19} Under this model, we now consider an infinite sequence of sample trajectories

$$\Big( (x_{k,t}, a_{k,t}, r_{k,t})_{t=0}^{T_k - 1} \Big)_{k \ge 0}, \qquad (3.4)$$

presented one at a time to the learning algorithm. In addition, we consider the more general setting in which the initial states $(x_{k,0})_{k \ge 0}$ may be different; we call these states the source states, as with the sample transition model (Section 2.6). As in the previous section, a minimum requirement for learning $V^\pi$ is that every state $x \in \mathcal{X}$ should be the source state of some trajectories.

19. Incremental methods are also sometimes called stochastic or sample-based. We prefer the term "incremental" as these methods can be applied even in the absence of randomness, and because other estimation methods – including the sample-mean method of the preceding section – are also based on samples.

Algorithm 3.1: Online incremental first-visit Monte Carlo

Algorithm parameters: step size α ∈ (0, 1],
    policy π : X → P(A),
    initial estimate V_0 ∈ ℝ^X

V(x) ← V_0(x) for all x ∈ X
Loop for each episode:
    Observe initial state x_0
    T ← 0
    Loop for t = 0, 1, ...
        Draw a_t from π(· | x_t)
        Take action a_t, observe r_t, x_{t+1}
        T ← T + 1
    until x_{t+1} is terminal
    g ← 0
    for t = T − 1, ..., 0 do
        g ← r_t + γg
        if x_t is not in {x_0, ..., x_{t−1}} then
            V(x_t) ← (1 − α)V(x_t) + αg
    end for
end

The incremental Monte Carlo algorithm begins by initialising its value function estimate:

$$V(x) \leftarrow V_0(x), \quad \text{for all } x \in \mathcal{X}.$$

Given the $k$th trajectory, the algorithm computes the sample return $g_k$ (Equation 3.2), called the Monte Carlo target in this context. It then adjusts its value function estimate for the initial state $x_{k,0}$ towards this target, according to the update rule

$$V(x_{k,0}) \leftarrow (1 - \alpha_k)\, V(x_{k,0}) + \alpha_k\, g_k.$$

The parameter $\alpha_k \in (0, 1]$ is a time-varying step size that controls the impact of a single sample trajectory on the value estimate $V$. Because the incremental Monte Carlo update rule only depends on the starting state and the sample return, we can more generally consider learning $V^\pi$ on the basis of the sequence

of state-return pairs

$$(x_k, g_k)_{k \ge 0}. \qquad (3.5)$$

Under this simplified model, the sample return $g_k$ is assumed drawn from the return distribution $\eta^\pi(x_k)$ (rather than constructed from the sample trajectory; of course, these two descriptions are equivalent). The initial state $x_{k,0}$ is substituted for $x_k$ to yield:

$$V(x_k) \leftarrow (1 - \alpha_k)\, V(x_k) + \alpha_k\, g_k. \qquad (3.6)$$

By choosing a step size that is inversely proportional to the number of times $N_k(x_k)$ that the value of state $x_k$ has been previously encountered, we recover the sample-mean estimate of the previous section (see Exercise 3.1). In practice, it is also common to use a constant step size $\alpha$ to avoid the need to track $N_k$. We will study the effect of step sizes in greater detail in Chapter 6.
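The update rule of Equation 3.6 with the step size $\alpha_k = 1/N_k(x_k)$ can be sketched as below; the class and field names are illustrative, and the sketch shows how this step-size choice recovers the sample mean.

```python
class IncrementalMonteCarlo:
    """Incremental update V(x) <- (1 - a)V(x) + a*g with a = 1/N (Equation 3.6)."""

    def __init__(self):
        self.value = {}    # running estimate V(x)
        self.visits = {}   # N_k(x): number of updates applied at x

    def update(self, x, g):
        n = self.visits.get(x, 0) + 1
        self.visits[x] = n
        alpha = 1.0 / n    # step size inversely proportional to the visit count
        v = self.value.get(x, 0.0)
        self.value[x] = (1 - alpha) * v + alpha * g
```

Feeding the returns 1, 3, 5 at the same state yields their sample mean, 3.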

As given, the Monte Carlo update rule is agnostic to the mechanism by which the sample trajectories are generated, or when learning occurs in relation to the agent observing these trajectories. A frequent and important setting in reinforcement learning is when each trajectory is consumed by the learning algorithm immediately after being experienced, which we call the online setting.^{20} Complementing the abstract presentation given in this section, Algorithm 3.1 gives pseudocode for an incremental implementation of the first-visit Monte Carlo algorithm in the online, episodic^{21} setting.

In Equation 3.6, the notation $A \leftarrow B$ indicates that the value $B$ (which may be a scalar, a vector, or a distribution) should be stored in the variable $A$. In the case of Equation 3.6, $V$ is a collection of scalars – one per state – and the update rule describes the process of modifying the value of one of these variables "in place". This provides a succinct way of highlighting the incremental nature of the process. On the other hand, it is often useful to consider the value of the variable after a given number of iterations. For $k > 0$, we express this with the notation

$$V_{k+1}(x_k) = (1 - \alpha_k)\, V_k(x_k) + \alpha_k\, g_k$$
$$V_{k+1}(x) = V_k(x) \quad \text{for } x \ne x_k, \qquad (3.7)$$

where the second line reflects the fact that only the variable associated to state $x_k$ is modified at time $k$.

20. The online setting is sometimes defined in terms of individual transitions, which are to be consumed immediately after being experienced. Our use of the term is broader and related to the streaming setting considered in other fields of research [see e.g. Cormode and Muthukrishnan, 2005].

21. An environment is said to be episodic when all trajectories eventually reach a terminal state.


3.3 Temporal-Difference Learning

Algorithm 3.2: Online temporal-difference learning

Algorithm parameters: step size α ∈ (0, 1],
    policy π : X → P(A),
    initial estimate V_0 ∈ ℝ^X

V(x) ← V_0(x) for all x ∈ X
Loop for each episode:
    Observe initial state x_0
    Loop for t = 0, 1, ...
        Draw a_t from π(· | x_t)
        Take action a_t, observe r_t, x_{t+1}
        if x_{t+1} is terminal then
            V(x_t) ← (1 − α)V(x_t) + αr_t
        else
            V(x_t) ← (1 − α)V(x_t) + α(r_t + γV(x_{t+1}))
        end if
    until x_{t+1} is terminal
end

Incremental algorithms are useful because they do not need to maintain the entire set of sample trajectories in memory. In addition, they are often simpler to implement and enable distributed and approximate computation. Temporal-difference learning (TD learning, or simply TD) is particularly well-suited to the incremental setting, because it can learn from sample transitions, rather than entire trajectories. It does so by leveraging the relationship between the value function at successive states – effectively, it takes advantage of the Bellman equation.

To begin, let us abstract away the sequential nature of the trajectory and consider the sample transition model $(X, A, R, X')$ defined by Equation 2.12. Here, the distribution $\xi$ of the source state $X$ may correspond, for example, to the relative frequency at which states are visited over the random trajectory. We consider a sequence of sample transitions drawn independently according to this model, denoted

$$(x_k, a_k, r_k, x'_k)_{k \ge 0}. \qquad (3.8)$$

As with the incremental Monte Carlo algorithm, the update rule of temporal-difference learning is parametrised by a time-varying step size $\alpha_k$. For the $k$th sample transition, this update rule is given by

$$V(x_k) \leftarrow (1 - \alpha_k)\, V(x_k) + \alpha_k \big( r_k + \gamma V(x'_k) \big). \qquad (3.9)$$

Algorithm 3.2 instantiates this update rule in the online, episodic setting. We call the term $r_k + \gamma V(x'_k)$ in Equation 3.9 the temporal-difference target. If we again write the Bellman equation

$$V^\pi(x) = \mathbb{E}_\pi \big[ R + \gamma V^\pi(X') \mid X = x \big],$$

we see that the temporal-difference target can be thought of as a realisation of the random variable whose expectation forms the right-hand side of the Bellman equation, with the exception that the TD target uses the estimated value function $V$ in place of the true value function $V^\pi$. This highlights the fact that temporal-difference learning adjusts its value function estimate to be more like the right-hand side of the Bellman equation; we will study this relationship more formally in Chapter 6.

By rearranging terms, we can also express the temporal-difference update rule in terms of the temporal-difference error $r_k + \gamma V(x'_k) - V(x_k)$:

$$V(x_k) \leftarrow V(x_k) + \alpha_k \big( r_k + \gamma V(x'_k) - V(x_k) \big);$$

this form highlights the direction of change of the value function estimate (positive if the target is greater than our estimate, negative if it is not) and is needed to express certain reinforcement learning algorithms, as we will see in Chapter 9.
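A single TD update on a sample transition, written in its TD-error form, might be sketched as follows. In this illustrative sketch, `V` is a dictionary mapping states to value estimates, and a terminal successor is handled by treating its value as 0; the function name and signature are assumptions, not from the text.

```python
def td_update(V, x, r, x_next, alpha, gamma, terminal=False):
    """Apply V(x) <- V(x) + alpha * (r + gamma V(x') - V(x)) (Equation 3.9)."""
    target = r + (0.0 if terminal else gamma * V.get(x_next, 0.0))
    td_error = target - V.get(x, 0.0)        # temporal-difference error
    V[x] = V.get(x, 0.0) + alpha * td_error  # move the estimate towards the target
    return td_error
```

For example, with $V(x) = 0$, $V(x') = 1$, reward 1, $\gamma = 0.9$ and $\alpha = 0.5$, the target is $1.9$ and the updated estimate is $0.95$.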

The incremental Monte Carlo algorithm updates its value function estimate towards a fixed (but noisy) target. By contrast, Equation 3.9 describes a recursive process, without such a fixed target. Temporal-difference learning instead depends on the value function at the next state $V(x'_k)$ being approximately correct. As such, it is said to bootstrap from its own value function estimate. Because of this recursive dependency, the dynamics of temporal-difference learning are usually different from those of Monte Carlo methods, and are more challenging to analyse (Chapter 6).

On the other hand, temporal-difference learning offers some important advantages over Monte Carlo methods. One is the way in which value estimates are naturally shared between states, so that once a value has been estimated accurately at one state, this can often be used to improve the value estimates at other states. In many situations, the estimates produced by temporal-difference learning are consequently more accurate than their Monte Carlo counterparts. A full treatment of the statistical properties of temporal-difference learning is beyond the scope of this book, but we provide references on the topic in the bibliographical remarks.

3.4 From Values To Probabilities

In distributional reinforcement learning, we are interested in understanding the random return as it arises from interactions with the environment. In the context of this chapter, we are specifically interested in how we can learn the return-distribution function $\eta^\pi$.

As a light introduction, consider the scenario in which rewards are binary ($R_t \in \{0, 1\}$) and where we are interested in learning the distribution of the undiscounted finite-horizon return function

$$G^\pi(x) = \sum_{t=0}^{H-1} R_t, \qquad X_0 = x. \qquad (3.10)$$

Here, $H \in \mathbb{N}^+$ denotes the horizon of the learner – how far it predicts into the future. By enumeration, we see that $G^\pi(x)$ takes on one of $H + 1$ possible values, integers ranging from 0 to $H$. These form the support of the probability distribution $\eta^\pi(x)$. To learn $\eta^\pi$, we will construct an incremental algorithm that maintains a return-distribution function estimate $\eta$, the distributional analogue of $V$ from the previous sections. This estimate assigns a probability $p_i(x)$ to each possible return $i \in \{0, \dots, H\}$:

$$\eta(x) = \sum_{i=0}^{H} p_i(x)\, \delta_i, \qquad (3.11)$$

where $p_i(x) \ge 0$ and $\sum_{i=0}^{H} p_i(x) = 1$. We call Equation 3.11 a categorical representation, since each possible return can now be thought of as one of $H + 1$ categories. Under this perspective, we can think of the problem of learning the return distribution for a given state $x$ as a classification problem – assigning probabilities to each of the possible categories. We may then view the problem of learning the return function $\eta^\pi$ as a collection of classification problems (one per state).

As before, let us consider a sequence of state-return pairs $(x_k, g_k)_{k \ge 0}$, where each $g_k$ is drawn from the distribution $\eta^\pi(x_k)$. As in Section 3.2, the sample return $g_k$ provides a target for an update rule, except that now we want to adjust the probability of observing $g_k$ rather than an estimate of the expected return. For a step size $\alpha_k \in (0, 1]$, the categorical update rule is

$$p_{g_k}(x_k) \leftarrow (1 - \alpha_k)\, p_{g_k}(x_k) + \alpha_k$$
$$p_i(x_k) \leftarrow (1 - \alpha_k)\, p_i(x_k), \qquad i \ne g_k. \qquad (3.12)$$

The adjustment of the probabilities for returns other than $g_k$ ensures that the return distribution estimate at $x_k$ continues to sum to 1 after the update. Expressed as an operation over probability distributions, this update rule corresponds to changing $\eta(x_k)$ to be a mixture between itself and a distribution that puts all of its mass on $g_k$:

$$\eta(x_k) \leftarrow (1 - \alpha_k)\, \eta(x_k) + \alpha_k\, \delta_{g_k}. \qquad (3.13)$$

We call this the undiscounted finite-horizon categorical Monte Carlo algorithm. It is instantiated in the online, episodic setting in Algorithm 3.3. Similar to the other incremental algorithms presented thus far, it is possible to demonstrate that under the right conditions, this algorithm learns a good approximation to the distribution of the binary-reward, undiscounted, finite-horizon return. In the next section we will see how this idea can be carried over to the more general setting.
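The categorical update of Equation 3.12 can be sketched as below for the binary-reward, undiscounted finite-horizon setting, where returns are integers in $\{0, \dots, H\}$. In this illustrative sketch, `probs[x]` holds the probabilities of $\eta(x)$; the names are assumptions, not from the text.

```python
def categorical_mc_update(probs, x, g, alpha):
    """Move mass alpha onto the observed return g; renormalise the rest."""
    p = probs[x]
    for i in range(len(p)):
        p[i] *= (1 - alpha)   # shrink every category by (1 - alpha)...
    p[g] += alpha             # ...and add mass alpha at the observed return

H = 3
probs = {"x0": [1.0 / (H + 1)] * (H + 1)}   # uniform initial estimate
categorical_mc_update(probs, "x0", g=2, alpha=0.5)
```

After this update the estimate at `x0` is $[0.125, 0.125, 0.625, 0.125]$, which still sums to 1, illustrating why the shrinking of the other categories is needed.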

3.5 The Projection Step

The finite, undiscounted algorithm of the previous section is a sensible approach when the random return takes only a few distinct values. In the undiscounted setting, we already saw that the number of possible returns is $N_G = H + 1$ when there are two possible rewards. However, this small number is the exception, rather than the rule. If there are $N_R$ possible rewards, then $N_G$ can be as large as $\binom{N_R + H - 1}{H}$, already a potentially quite large number for $N_R > 2$ (see Exercise 3.5). Worse, when a discount factor $\gamma$ is introduced, the number of possible returns depends exponentially on $H$. To see why, recall the single-state, single-action Markov decision process of Example 2.10 with reward distribution

$$P_\pi(R_t = 0) = P_\pi(R_t = 1) = \tfrac{1}{2}.$$

With a discount factor $\gamma = \tfrac{1}{2}$, we argued that the set of possible returns for this MDP corresponds to the binary expansion of all numbers in the $[0, 2]$ interval, from which we concluded that the random return is uniformly distributed on that interval. By the same argument, we can show that the $H$-horizon return

$$G = \sum_{t=0}^{H-1} \gamma^t R_t$$

has support on all numbers in the $[0, 2]$ interval that are described by $H$ binary digits; there are $2^H$ such numbers. Of course, if we take $H$ to be infinite, there may be uncountably many possible returns.

Algorithm 3.3: Undiscounted finite-horizon categorical Monte Carlo

Algorithm parameters: step size α ∈ (0, 1],
    horizon H,
    policy π : X → P(A),
    initial probabilities ((p_i^0(x))_{i=0}^{H} : x ∈ X)

p_i(x) ← p_i^0(x) for all 0 ≤ i ≤ H, x ∈ X
Loop for each episode:
    Observe initial state x_0
    T ← 0
    Loop for t = 0, 1, ...
        Draw a_t from π(· | x_t)
        Take action a_t, observe r_t, x_{t+1}
        T ← T + 1
    until x_{t+1} is terminal
    g ← 0
    for t = T − 1, ..., 0 do
        g ← g + r_t
        if t < T − H then
            g ← g − r_{t+H}
        if x_t is not in {x_0, ..., x_{t−1}} then
            p_g(x_t) ← (1 − α)p_g(x_t) + α
            p_i(x_t) ← (1 − α)p_i(x_t), for i ≠ g
    end for
end

To deal with the issue of a large (or even infinite) set of possible returns, we insert a projection step prior to the mixture update in Equation 3.13.^{22} The purpose of the projection step is to transform an arbitrary target return $g$ into a modified target taking one of $m$ values, for $m$ reasonably small. From a classification perspective, we can think of the projection step as assigning a label (from a small set of categories) to each possible return.

22. When there are relatively few sample trajectories, a sensible alternative is to construct a non-parametric estimate of the return distribution that simply puts equal probability mass on all observed returns. The return distributions described in Section 1.2 and Example 2.9, in particular, were estimated in this manner. See Remark 3.1 for further details.

We will consider return distributions that assign probability mass to $m$ evenly-spaced values or locations. In increasing order, we denote these locations by $\theta_1 < \theta_2 < \dots < \theta_m$. We write $\varsigma_m$ for the gap between consecutive locations, so that for $i = 1, \dots, m - 1$, we have

$$\varsigma_m = \theta_{i+1} - \theta_i.$$

A common design takes $\theta_1 = V_{\mathrm{MIN}}$ and $\theta_m = V_{\mathrm{MAX}}$, in which case

$$\varsigma_m = \frac{V_{\mathrm{MAX}} - V_{\mathrm{MIN}}}{m - 1}.$$

However, other choices are possible and sometimes desirable. Note that the undiscounted algorithm of the previous section corresponds to $m = H + 1$ and $\theta_i = i - 1$.

We express the corresponding return distribution as a weighted sum of Dirac deltas at these locations:

$$\eta(x) = \sum_{i=1}^{m} p_i(x)\, \delta_{\theta_i}.$$

For mathematical convenience, let us write $\theta_0 = -\infty$ and $\theta_{m+1} = \infty$. Consider a sample return $g$, and denote by $i^*(g)$ the index of the largest element of the support (extended with $-\infty$) that is no greater than $g$:

$$i^*(g) = \arg\max_{i \in \{0, \dots, m\}} \{\theta_i : \theta_i \le g\}.$$

For this sample return, we write

$$\Pi_-(g) = \theta_{i^*} \qquad \Pi_+(g) = \theta_{i^*+1}$$

to denote the corresponding element of the support and its immediate successor; when $\theta_1, \dots, \theta_m$ are consecutive integers and $g \in [\theta_1, \theta_m]$, these are the floor and ceiling of $g$, respectively.

The projection step begins by computing

$$\zeta(g) = \frac{g - \Pi_-(g)}{\Pi_+(g) - \Pi_-(g)},$$

with the convention that $\zeta(g) = 1$ if $\Pi_-(g) = -\infty$ and $\zeta(g) = 0$ if $\Pi_+(g) = \infty$. The $\zeta(g)$ term corresponds to the distance of $g$ to the two closest elements of the support, scaled to lie in the interval $[0, 1]$ – effectively, the fraction of the distance between $\Pi_-(g)$ and $\Pi_+(g)$ at which $g$ lies. The stochastic projection of $g$ is

$$\Pi_\pm(g) = \begin{cases} \Pi_-(g) & \text{with probability } 1 - \zeta(g) \\ \Pi_+(g) & \text{with probability } \zeta(g). \end{cases} \qquad (3.14)$$

We use this projection to construct the update rule

$$\eta(x) \leftarrow (1 - \alpha)\, \eta(x) + \alpha\, \delta_{\Pi_\pm(g)},$$

where as before $x$ is a source state and $g$ a sample return.^{23} Similar to Equation 3.12, this update rule is implemented by adjusting the probabilities according to

$$p_{i_\pm}(x) \leftarrow (1 - \alpha)\, p_{i_\pm}(x) + \alpha$$
$$p_i(x) \leftarrow (1 - \alpha)\, p_i(x), \qquad i \ne i_\pm,$$

where $i_\pm$ is the index of the location $\Pi_\pm(g)$.

We can improve on the stochastic projection by constructing a modified target that contains information about both $\Pi_-(g)$ and $\Pi_+(g)$. In classification terms, this corresponds to using soft labels – meaning that the target is a probability distribution over labels, rather than a single label. This deterministic projection of $g$ results in the update rule

$$\eta(x) \leftarrow (1 - \alpha)\, \eta(x) + \alpha \Big( \big(1 - \zeta(g)\big)\, \delta_{\Pi_-(g)} + \zeta(g)\, \delta_{\Pi_+(g)} \Big). \qquad (3.15)$$

We denote the deterministic projection by $\Pi_C$. Statistically speaking, the deterministic projection produces return distribution estimates that are on average the same as those produced by the stochastic projection, but are comparatively more concentrated around their mean. Going forward, we will see that it is conceptually simpler to apply this projection to probability distributions, rather than to sample returns. Rather than $\Pi_C(g)$, we therefore write

$$\Pi_C\, \delta_g = \big(1 - \zeta(g)\big)\, \delta_{\Pi_-(g)} + \zeta(g)\, \delta_{\Pi_+(g)}.$$

We call this method the categorical Monte Carlo algorithm. This algorithm can be used to learn to predict infinite-horizon, discounted returns, and is applicable even when there are a large number of possible rewards.

Example 3.2. Montezuma's Revenge is a 1984 platform game designed by then-16-year-old Robert Jaeger. As part of the Arcade Learning Environment [Bellemare et al., 2013a], the Atari 2600 version of the game poses a challenging reinforcement learning problem due to the rare occurrence of positive rewards. The very first task in Montezuma's Revenge consists in collecting a key, an act which rewards the agent with 100 points. Let us consider the integer support $\theta_i = i - 1$, for $i = 1, 2, \dots, 101$. For a discount factor $\gamma = 0.99$, the discounted $H$-horizon return from the initial game state is

$$G = \sum_{t=0}^{H-1} \gamma^t R_t = \mathbb{1}\{\tau < H\}\, 0.99^\tau \times 100,$$

where $\tau$ denotes the time at which the key is obtained. For a fixed $\tau \in \mathbb{R}$, write

$$g = 0.99^\tau \times 100, \qquad \zeta = g - \lfloor g \rfloor.$$

For $\tau < H$, the deterministic projection of the return $0.99^\tau \times 100$ puts probability mass $1 - \zeta$ on $\lfloor g \rfloor$ and $\zeta$ on $\lceil g \rceil$. For example, if $\tau = 60$, then $100 \times 0.99^\tau \approx 54.72$ and the deterministic projection is

$$\Pi_C\, \delta_g = 0.28\, \delta_{54} + 0.72\, \delta_{55}.$$

Figure 3.1: Illustration of the projection for the 5-location grid used in Example 3.3. The middle panel depicts the proportion of probability mass assigned to the location $\theta_3 = 1$ in terms of the sample return $g$. The left and right panels illustrate this probability assignment at the boundary locations $\theta_1 = 0$ and $\theta_m = 2$.

23. From here onwards we omit the iteration index $k$ from the notation as it is not needed in the definition of the update rule. The algorithm proper should still be understood as applying this update rule to a sequence of state-return pairs $(x_k, g_k)_{k \ge 0}$, possibly with a time-varying step size $\alpha_k$.
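The deterministic projection $\Pi_C$ of a single sample return, including the computation of Example 3.2, can be sketched as below. The function name is an illustrative assumption; `theta` is assumed sorted, and returns falling outside $[\theta_1, \theta_m]$ are clamped to the nearest boundary location, matching the $\zeta$ conventions.

```python
import bisect

def project(theta, g):
    """Pi_C applied to a Dirac at g: list of (location, mass) pairs."""
    if g <= theta[0]:
        return [(theta[0], 1.0)]           # below the support: mass on theta_1
    if g >= theta[-1]:
        return [(theta[-1], 1.0)]          # above the support: mass on theta_m
    i = bisect.bisect_right(theta, g) - 1  # index of Pi_minus(g)
    lo, hi = theta[i], theta[i + 1]
    zeta = (g - lo) / (hi - lo)            # fraction of the way from lo to hi
    return [(lo, 1.0 - zeta), (hi, zeta)]

theta = list(range(101))   # integer support 0, ..., 100, as in Example 3.2
g = 100 * 0.99 ** 60       # approximately 54.72
```

Projecting this `g` splits its mass between the locations 54 and 55, with weights close to 0.28 and 0.72.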

By introducing a projection step we typically lose some information about the target return. This is by necessity: we are asking the learning algorithm to approximate complex distributions using a small, finite set of possible returns. Under the right conditions, Equation 3.15 gives rise to a convergent algorithm. The point of convergence of this algorithm is the return function $\hat\eta^\pi(x)$ for which, for all $x \in \mathcal{X}$, we have

$$\hat\eta^\pi(x) = \mathbb{E}\big[\Pi_C\, \delta_{G^\pi(x)}\big] = \mathbb{E}\Big[ \big(1 - \zeta(G^\pi(x))\big)\, \delta_{\Pi_-(G^\pi(x))} + \zeta(G^\pi(x))\, \delta_{\Pi_+(G^\pi(x))} \Big]. \qquad (3.16)$$

In fact, we may write

$$\mathbb{E}\big[\Pi_C\, \delta_{G^\pi(x)}\big] = \Pi_C\, \eta^\pi(x),$$

where $\Pi_C\, \eta^\pi(x)$ denotes the distribution supported on $\{\theta_1, \dots, \theta_m\}$ produced by projecting each possible outcome under the distribution $\eta^\pi(x)$; we will discuss this definition in further detail in Chapter 5.

Example 3.3. Recall the single-state Markov decision process of Example 2.10, whose return is uniformly distributed on the interval $[0, 2]$. Suppose that we take our support to be the uniform grid $\{0, 0.5, 1, 1.5, 2\}$. Let us write $\hat p_0, \hat p_{0.5}, \dots, \hat p_2$ for the probabilities assigned to these locations by the projected distribution $\hat\eta^\pi(x) = \Pi_C\, \eta^\pi(x)$, where $x$ is the unique state of the MDP. The probability density of the return on the interval $[0, 2]$ is 0.5. We thus have

$$\hat p_0 \overset{(a)}{=} 0.5 \int_{g \in [0,2]} \Big( \mathbb{1}\{\Pi_-(g) = 0\}\big(1 - \zeta(g)\big) + \mathbb{1}\{\Pi_+(g) = 0\}\,\zeta(g) \Big)\, \mathrm{d}g$$
$$= 0.5 \int_{g \in [0,0.5]} \big(1 - \zeta(g)\big)\, \mathrm{d}g = 0.5 \int_{g \in [0,0.5]} (1 - 2g)\, \mathrm{d}g = 0.125.$$

In (a), we re-expressed Equation 3.16 in terms of the probability assigned to $\hat p_0$. A similar computation shows that $\hat p_{0.5} = \hat p_1 = \hat p_{1.5} = 0.25$, while $\hat p_2 = 0.125$.

Figure 3.1 illustrates the process of assigning probability mass to different locations. Intuitively, the solution makes sense: there is less probability mass near the boundaries of the interval $[0, 2]$.
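The probabilities derived in Example 3.3 can be checked numerically by projecting a fine midpoint grid of returns from the uniform distribution on $[0, 2]$ onto the support $\{0, 0.5, 1, 1.5, 2\}$ and accumulating the projected mass; the variable names in this sketch are illustrative.

```python
theta = [0.0, 0.5, 1.0, 1.5, 2.0]   # support of Example 3.3
n = 20_000                           # number of grid cells over [0, 2]
mass = [0.0] * len(theta)
for k in range(n):
    g = (k + 0.5) * 2.0 / n          # midpoint of the k-th cell; weight 1/n each
    # locate the bracketing locations and split the cell's mass between them
    i = min(int(g / 0.5), len(theta) - 2)
    zeta = (g - theta[i]) / 0.5
    mass[i] += (1 - zeta) / n
    mass[i + 1] += zeta / n
```

The accumulated `mass` is approximately $[0.125, 0.25, 0.25, 0.25, 0.125]$, matching the analytic computation in the example.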

3.6 Categorical Temporal-Difference Learning

With the use of a projection step, the categorical Monte Carlo method allows us to approximate the return-distribution function of a given policy from sample trajectories and using a fixed amount of memory. Like the Monte Carlo method for value functions, however, it ignores the relationship between successive states in the trajectory. To leverage this relationship, we turn to categorical temporal-difference learning (CTD).

Consider now a sample transition $(x, a, r, x')$. Like the categorical Monte Carlo algorithm, CTD maintains a return function estimate $\eta(x)$ supported on the evenly-spaced locations $\{\theta_1, \dots, \theta_m\}$. Like temporal-difference learning, it learns by bootstrapping from its current return function estimate. In this case, however, the update target is a probability distribution supported on $\{\theta_1, \dots, \theta_m\}$. The algorithm first constructs an intermediate target by scaling the return distribution $\eta(x')$ at the next state by the discount factor $\gamma$, then shifting it by the sample reward $r$. That is, if we write
$$\eta(x') = \sum_{i=1}^{m} p_i(x')\, \delta_{\theta_i},$$

68 Chapter 3

then the intermediate target is

$$\tilde\eta(x) = \sum_{i=1}^{m} p_i(x')\, \delta_{r + \gamma\theta_i},$$

which can also be expressed in terms of a pushforward distribution:

$$\tilde\eta(x) = (b_{r,\gamma})_{\#}\, \eta(x') . \qquad (3.17)$$
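In the categorical case, the pushforward in Equation 3.17 acts on the particle locations alone: each $\theta_i$ moves to $r + \gamma\theta_i$, while its probability is untouched. A minimal sketch (the function name is ours):

```python
import numpy as np

def pushforward(r, gamma, locations, probs):
    """Apply (b_{r,gamma})# to a categorical distribution: each particle
    theta_i moves to r + gamma * theta_i; probabilities are unchanged."""
    return r + gamma * np.asarray(locations), np.asarray(probs)

locations = np.array([-1.0, 0.0, 1.0])
probs = np.array([0.2, 0.5, 0.3])
new_locations, new_probs = pushforward(r=0.1, gamma=0.9,
                                       locations=locations, probs=probs)
print(new_locations)  # [-0.8  0.1  1. ]
```

Note that the new particles generally fall off the original grid, which is exactly what motivates the projection step described next.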

Observe that the shifting and scaling operations are applied to each particle

in isolation. After shifting and scaling, however, these particles in general no

longer lie in the support of the original distribution. This motivates the use of the

projection step described in the previous section. Let us denote the intermediate

particles by
$$\tilde\theta_i = b_{r,\gamma}(\theta_i) = r + \gamma\theta_i .$$

The CTD target is formed by individually projecting each of these particles

back onto the support, and combining their probabilities. That is,

$$\Pi_C\, (b_{r,\gamma})_{\#}\, \eta(x') = \sum_{j=1}^{m} p_j(x')\, \Pi_C\, \delta_{r + \gamma\theta_j} .$$

More explicitly, this is

$$
\Pi_C\, (b_{r,\gamma})_{\#}\, \eta(x')
= \sum_{j=1}^{m} p_j(x') \Big( (1 - \zeta(\tilde\theta_j))\, \delta_{\Pi_-(\tilde\theta_j)} + \zeta(\tilde\theta_j)\, \delta_{\Pi_+(\tilde\theta_j)} \Big)
= \sum_{i=1}^{m} \Big( \sum_{j=1}^{m} p_j(x')\, \zeta_{i,j}(r) \Big)\, \delta_{\theta_i} , \qquad (3.18)
$$

where

$$\zeta_{i,j}(r) = (1 - \zeta(\tilde\theta_j))\, \mathbb{1}\{\Pi_-(\tilde\theta_j) = \theta_i\} + \zeta(\tilde\theta_j)\, \mathbb{1}\{\Pi_+(\tilde\theta_j) = \theta_i\} .$$

In Equation 3.18, the second line highlights that the CTD target is a probability distribution supported on $\{\theta_1, \dots, \theta_m\}$. The probabilities assigned to specific locations are given by the inner sum; as shown here, this assignment is obtained by weighting the next-state probabilities $p_j(x')$ by the coefficients $\zeta_{i,j}(r)$.

The CTD update adjusts the return function estimate $\eta(x)$ towards this target:
$$\eta(x) \leftarrow (1 - \alpha)\,\eta(x) + \alpha\, \Pi_C\, (b_{r,\gamma})_{\#}\, \eta(x') . \qquad (3.19)$$

Expressed in terms of the probabilities of the distributions $\eta(x)$ and $\eta(x')$, this is (for $i = 1, \dots, m$)
$$p_i(x) \leftarrow (1 - \alpha)\, p_i(x) + \alpha \sum_{j=1}^{m} \zeta_{i,j}(r)\, p_j(x') . \qquad (3.20)$$


With this form, we see that the CTD update rule adjusts each probability $p_i(x)$ of the return distribution at state $x$ towards a mixture of the probabilities of the return distribution at the next state (see Algorithm 3.4).

Algorithm 3.4: Online categorical temporal-difference learning

Algorithm parameters: step size α ∈ (0, 1],
    policy π : 𝒳 → 𝒫(𝒜),
    evenly-spaced locations θ_1, …, θ_m ∈ ℝ,
    initial probabilities ((p⁰_i(x))_{i=1}^m : x ∈ 𝒳)

(p_i(x))_{i=1}^m ← (p⁰_i(x))_{i=1}^m for all x ∈ 𝒳
Loop for each episode:
    Observe initial state x_0
    Loop for t = 0, 1, …
        Draw a_t from π(· | x_t)
        Take action a_t, observe r_t, x_{t+1}
        p̂_i ← 0 for i = 1, …, m
        for j = 1, …, m do
            if x_{t+1} is terminal then
                g ← r_t
            else
                g ← r_t + γθ_j
            if g ≤ θ_1 then
                p̂_1 ← p̂_1 + p_j(x_{t+1})
            else if g ≥ θ_m then
                p̂_m ← p̂_m + p_j(x_{t+1})
            else
                i* ← largest i such that θ_i ≤ g
                ζ ← (g − θ_{i*}) / (θ_{i*+1} − θ_{i*})
                p̂_{i*} ← p̂_{i*} + (1 − ζ) p_j(x_{t+1})
                p̂_{i*+1} ← p̂_{i*+1} + ζ p_j(x_{t+1})
        end for
        for i = 1, …, m do
            p_i(x_t) ← (1 − α) p_i(x_t) + α p̂_i
        end for
    until x_{t+1} is terminal
end
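The inner loop of Algorithm 3.4 can be sketched in Python as follows. This is a hedged sketch rather than a definitive implementation: the function and variable names are ours, and we represent the return function estimate as a dictionary mapping states to probability vectors over a common grid of locations.

```python
import numpy as np

def ctd_update(p, x, r, next_x, terminal, locations, gamma, alpha):
    """One categorical temporal-difference update, following Algorithm 3.4.

    p: dict mapping each state to an array of m probabilities.
    locations: evenly-spaced particle locations theta_1 < ... < theta_m.
    """
    m = len(locations)
    gap = locations[1] - locations[0]
    target = np.zeros(m)  # the projected target probabilities (p-hat)
    for j in range(m):
        # scale the j-th particle by gamma and shift by r (bootstrap),
        # unless the next state is terminal
        g = r if terminal else r + gamma * locations[j]
        weight = p[next_x][j]
        if g <= locations[0]:
            target[0] += weight
        elif g >= locations[-1]:
            target[-1] += weight
        else:
            i = min(int((g - locations[0]) // gap), m - 2)
            zeta = (g - locations[i]) / gap
            target[i] += (1.0 - zeta) * weight
            target[i + 1] += zeta * weight
    # mix the current estimate towards the projected target (Equation 3.20)
    p[x] = (1.0 - alpha) * p[x] + alpha * target

# one transition on a two-state example of our own construction
locations = np.linspace(-1.0, 1.0, 11)
p = {"x": np.full(11, 1 / 11), "y": np.full(11, 1 / 11)}
ctd_update(p, "x", r=0.2, next_x="y", terminal=False,
           locations=locations, gamma=0.9, alpha=0.1)
```

After the update, `p["x"]` remains a valid probability vector, since each shifted-and-scaled particle's mass is split between two grid locations but never lost.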


[Figure 3.2: eight panels plotting probability against return; graphics omitted.]

Figure 3.2
The return distribution at the initial state in the Cliffs domain, as predicted by categorical temporal-difference learning over the course of learning. Top panels. The predictions after e ∈ {0, 1000, 2000, 10000} episodes (α = 0.05; see Algorithm 3.4). Here, the return distributions are initialised by assigning equal probability to all locations. Bottom panels. Corresponding predictions when the return distributions initially put all probability mass on the zero return.

[Figure 3.3: three panels plotting density or probability against return; graphics omitted.]

Figure 3.3
(a) The return distribution at the initial state in the Cliffs domain, visualised using kernel density estimation (see Figure 2.2). (b) The return distribution estimate learned by the categorical Monte Carlo algorithm with m = 31. (c) The estimate learned by the categorical temporal-difference learning algorithm.

The deﬁnition of the intermediate target in pushforward terms (Equation

3.17) illustrates that categorical temporal-difference learning relates to the

distributional Bellman equation

$$\eta^\pi(x) = \mathbb{E}_\pi\big[ (b_{R,\gamma})_{\#}\, \eta^\pi(X') \mid X = x \big],$$

analogous to the relationship between TD learning and the classical Bellman

equation. We will continue to explore this relationship in later chapters. How-

ever, the two algorithms usually learn different solutions, due to the introduction

of approximation error from the bootstrapping process.


Example 3.4.

We can study the behaviour of the categorical temporal-

difference learning algorithm by visualising how its predictions vary over the

course of learning. We apply CTD to approximate the return function of the safe

policy in the Cliffs domain (Example 2.9). Learning takes place online (as per

Algorithm 3.4), using a constant step size of α = 0.05, and return distributions

are approximated with m = 31 locations evenly spaced between –1 and 1.

Figure 3.2 illustrates that the initial return function plays a substantial role in

the speed at which CTD learns a good approximation. Informally, this occurs

because the uniform distribution is closer to the ﬁnal approximation than the

distribution that puts all of its probability mass on zero (what “close” means in

this context will be the topic of Chapter 4). In addition, we see that categorical

temporal-difference learning learns a different approximation to the true return

function, compared to the categorical Monte Carlo algorithm (Figure 3.3). This

is due to a phenomenon we call diffusion which arises from the combination of

the bootstrapping step and projection; we will study diffusion in Chapter 5.

△

3.7 Learning To Control

A large part of this book considers the problem of learning to predict the

distribution of an agent’s returns. In Chapter 7, we will discuss how one might

instead learn to maximise or control these returns, and the role that distributional

reinforcement learning plays in this endeavour. By learning to control, we

classically mean obtaining (from experience) a policy $\pi^*$ that maximises the expected return:
$$\mathbb{E}_{\pi^*}\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \Big] \ \geq\ \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \Big], \quad \text{for all } \pi .$$

Such a policy is called an optimal policy. From Section 2.5, recall that the state-action value function $Q^\pi$ is given by
$$Q^\pi(x, a) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x,\, A_0 = a \Big] .$$

Any optimal policy $\pi^*$ has the property that its state-action value function also satisfies the Bellman optimality equation:
$$Q^{\pi^*}(x, a) = \mathbb{E}\Big[ R + \gamma \max_{a' \in \mathcal{A}} Q^{\pi^*}(X', a') \,\Big|\, X = x,\, A = a \Big] .$$

Similar in spirit to temporal-difference learning, Q-learning is an incremental algorithm that finds an optimal policy. Q-learning maintains a state-action value function estimate, $Q$, which it updates according to
$$Q(x, a) \leftarrow (1 - \alpha)\, Q(x, a) + \alpha \Big( r + \gamma \max_{a' \in \mathcal{A}} Q(x', a') \Big) . \qquad (3.21)$$


The use of the maximum in the Q-learning update rule results in different behaviour than TD learning, as the selected action depends on the current value estimate. We can think of this difference as constructing one target for each action $a'$, and updating the value estimate towards the largest of such targets. It can be shown that under the right conditions, Q-learning converges to the optimal state-action value function $Q^*$, corresponding to the expected return under any optimal policy. We extract an optimal policy from $Q^*$ by acting greedily with respect to $Q^*$, that is, choosing a policy $\pi^*$ that selects maximally-valued actions according to $Q^*$.
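The update in Equation 3.21 is straightforward to implement for finite state and action spaces. A minimal sketch (the array layout and names are ours):

```python
import numpy as np

def q_learning_update(Q, x, a, r, next_x, gamma, alpha, terminal=False):
    """One Q-learning update (Equation 3.21). Q is a |X| x |A| array."""
    # bootstrap from the largest action value at the next state
    bootstrap = 0.0 if terminal else gamma * np.max(Q[next_x])
    Q[x, a] = (1.0 - alpha) * Q[x, a] + alpha * (r + bootstrap)

Q = np.zeros((3, 2))  # three states, two actions
q_learning_update(Q, x=0, a=1, r=1.0, next_x=2, gamma=0.9, alpha=0.5)
print(Q[0, 1])  # 0.5
```

Only the entry for the visited state-action pair changes; the maximisation over $a'$ is what distinguishes this from the TD update for a fixed policy.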

The simplest way to extend Q-learning to the distributional setting is to express the maximal action in Equation 3.21 as a greedy policy. Denote by $\eta$ a return function estimate over state-action pairs, such that $\eta(x, a)$ is the return distribution associated with the state-action pair $(x, a) \in \mathcal{X} \times \mathcal{A}$. Define the greedy action
$$a_\eta(x) = \arg\max_{a \in \mathcal{A}}\ \mathbb{E}_{Z \sim \eta(x,a)}[Z],$$

breaking ties arbitrarily. The categorical Q-learning update rule is
$$\eta(x, a) \leftarrow (1 - \alpha)\, \eta(x, a) + \alpha\, \Pi_C\, (b_{r,\gamma})_{\#}\, \eta\big(x', a_\eta(x')\big) .$$
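The greedy action requires only the means of the candidate return distributions. A sketch under our own representation, with the probabilities for each action stored over a common grid of locations:

```python
import numpy as np

def greedy_action(probs_per_action, locations):
    """Select the action whose return distribution has the largest mean.
    probs_per_action: |A| x m array of categorical probabilities."""
    means = probs_per_action @ locations  # E_{Z ~ eta(x,a)}[Z] for each a
    return int(np.argmax(means))          # ties broken towards lower index

locations = np.array([-1.0, 0.0, 1.0])
probs = np.array([[0.5, 0.0, 0.5],    # mean 0.0
                  [0.1, 0.3, 0.6]])   # mean 0.5
print(greedy_action(probs, locations))  # 1
```

Both rows above are valid distributions, but only their means enter the action selection; the full distributions matter again when forming the update target.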

It can be shown that, under the same conditions as Q-learning, the mean of the return function estimates also converges to $Q^*$. The behaviour of the distributions themselves, however, may be surprisingly complicated. We can also put the learned distributions to good use and make decisions on the basis of their full characterisation, rather than from their mean alone. This forms the topic of risk-sensitive reinforcement learning. We return to both of these points in Chapter 7.

3.8 Further Considerations

Categorical temporal-difference learning learns to predict return distributions

from sample experience. As we will see in subsequent chapters, the choices that

we made in designing CTD are not unique, and the algorithm is best thought of

as a jumping-off point into a broad space of methods. For example, an important

question in distributional reinforcement learning asks how we should represent

probability distributions, given a ﬁnite memory budget. One issue with the

categorical representation is that it relies on a ﬁxed grid of locations to cover

the range $[\theta_1, \theta_m]$, which lacks flexibility and is in many situations inefficient.

We will take a closer look at this issue in Chapter 5. In many practical situations

we also need to deal with a few additional considerations, including the use of

function approximation to deal with very large state spaces (Chapters 9 and 10).


3.9 Technical Remarks

Remark 3.1 (Non-parametric distributional Monte Carlo algorithm). In

Section 3.4, we saw that the (unprojected) ﬁnite-horizon categorical Monte

Carlo algorithm can in theory learn ﬁnite-horizon return-distribution functions

when there are only a small number of possible returns. It is possible to extend

these ideas to obtain a straightforward, general-purpose algorithm that can sometimes be used to learn an accurate approximation to the return distribution.

Like the sample-mean Monte Carlo method, the non-parametric distributional Monte Carlo algorithm takes as input $K$ finite-length trajectories with a common source state $x_0$. After computing the sample returns $(g_k)_{k=1}^{K}$ from these trajectories, it constructs the estimate
$$\hat\eta^\pi(x_0) = \frac{1}{K} \sum_{k=1}^{K} \delta_{g_k} \qquad (3.22)$$
of the return distribution $\eta^\pi(x_0)$. Here, non-parametric refers to the fact that

the approximating distribution in Equation 3.22 is not described by a ﬁnite

collection of parameters; in fact, the memory required to represent this object

may grow linearly with K. Although this is not an issue when K is relatively

small, this can be undesirable when working with large amounts of data, and

moreover precludes the use of function approximation (see Chapters 9 and 10).

However, unlike the categorical Monte Carlo and temporal-difference learning

algorithms presented in this chapter, the accuracy of this estimate is only limited

by the number of trajectories K; we describe various ways to quantify this

accuracy in Remark 4.3. As such, it provides a useful baseline for measuring the

quality of other distributional algorithms. In particular, we used this algorithm

to generate the ground truth return distribution estimates in Example 2.9 and in

Figure 3.3. △
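The estimator in Equation 3.22 is simply the empirical distribution of the sampled returns. A sketch (the discounting and trajectory format are our own choices):

```python
import numpy as np

def empirical_return_distribution(trajectories, gamma):
    """Non-parametric distributional Monte Carlo: one return per trajectory,
    each carrying probability 1/K (Equation 3.22).
    trajectories: list of reward sequences, all starting from state x_0."""
    returns = np.array([
        sum(gamma**t * r for t, r in enumerate(rewards))
        for rewards in trajectories
    ])
    probs = np.full(len(returns), 1.0 / len(returns))
    return returns, probs

trajs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
atoms, probs = empirical_return_distribution(trajs, gamma=0.5)
# sample returns: 1.0, 0.5, 1.5, each with probability 1/3
```

The memory footprint grows with the number of trajectories K, which is precisely the drawback noted in the remark.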

3.10 Bibliographical Remarks

The development of a distributional algorithm in this chapter follows our own

development of the distributional perspective, beginning with our work on using

compression algorithms in reinforcement learning [Veness et al., 2015].

3.1.

The ﬁrst-visit Monte Carlo estimate is studied by Singh and Sutton [1996],

where it is used to characterise the properties of replacing eligibility traces [see

also Sutton and Barto, 2018]. Statistical properties of model-based estimates

(which solve for the Markov decision process’s parameters as an intermediate

step) are analysed by Mannor et al. [2007]. Grünewälder and Obermayer [2011]


argue that model-based methods must incur statistical bias, an argument that

also extends to temporal-difference algorithms. Their work also introduces

a reﬁned sample-mean Monte Carlo method that yields a minimum-variance

unbiased estimator (MVUE) of the value function. See Browne et al. [2012]

for a survey of Monte Carlo tree search methods, and Liu [2001], Robert and

Casella [2004], Owen [2013] for further background on Monte Carlo methods

more generally.

3.2.

Incremental algorithms are a staple of reinforcement learning and have

roots in stochastic approximation [Robbins and Monro, 1951, Widrow and Hoff,

1960, Kushner and Yin, 2003] and psychology [Rescorla and Wagner, 1972].

In the control setting, these are also called optimistic policy iteration methods,

and exhibit fairly complex behaviour [Sutton, 1999, Tsitsiklis, 2002].

3.3.

Temporal-difference learning was introduced by Sutton [1984, 1988]. The

sample transition model presented here differs from the standard algorithmic

presentation, but allows us to separate concerns of behaviour (data collection)

from learning. A similar model is used in Bertsekas and Tsitsiklis [1996] to

prove convergence of a broad class of TD methods, and by Azar et al. [2012] to

provide sample efﬁciency bounds for model-based control.

3.4.

What is effectively the undiscounted ﬁnite-horizon categorical Monte

Carlo algorithm was proposed by Veness et al. [2015]. There, the authors

demonstrate that by means of Bayes’ rule, one can learn the return distribution

by ﬁrst learning the joint distribution over returns and states (in our notation,

$P_\pi(X, G)$) by means of a compression algorithm, and subsequently using Bayes' rule to extract $P_\pi(G \mid X)$. The method proved surprisingly effective at learning

to play Atari 2600 games. Toussaint and Storkey [2006] consider the problem

of control as probabilistic inference, where the reward and trajectory length are

viewed as random variables to be optimised over, again obviating the need to

deal with a potentially large support for the return.

3.5–3.6.

Categorical temporal-difference learning for both prediction and

control was introduced by Bellemare et al. [2017a], in part to address the

shortcomings of the undiscounted algorithm. Its original form contains both the

projection step and the categorical representation as given here. The mixture

update that we study in this chapter is due to Rowland et al. [2018].

3.7.

The Q-learning algorithm is due to Watkins [1989]; see also Watkins and

Dayan [1992]. The explicit construction of a greedy policy is commonly found

in more complex reinforcement learning algorithms, including modiﬁed policy

iteration [Puterman and Shin, 1978],

λ-policy iteration [Bertsekas and Ioffe,


1996], and non-stationary policy iteration [Scherrer and Lesner, 2012, Scherrer,

2014].

3.9.

The algorithm introduced in Remark 3.1 is essentially an application of the

standard Monte Carlo method to the return distribution, and is a special case of

the framework set out by Chandak et al. [2021], who also analyse the statistical

properties of the approach.

3.11 Exercises

Exercise 3.1. Suppose that we begin with the initial value function estimate $V(x) = 0$ for all $x \in \mathcal{X}$.

(i) Consider first the setting in which we are given sample returns for a single state $x$. Show that in this case, the incremental Monte Carlo algorithm (Equation 3.6), instantiated with $\alpha_k = \frac{1}{k+1}$, is equivalent to computing the sample-mean Monte Carlo estimate for $x$. That is, after processing the sample returns $g_1, \dots, g_K$, we have
$$V_K(x) = \frac{1}{K} \sum_{i=1}^{K} g_i .$$

(ii) Now consider the case where source states are drawn from a distribution $\xi$, with $\xi(x) > 0$ for all $x$, and let $N_k(x_k)$ be the number of times $x_k$ has been updated up to but excluding time $k$. Show that the appropriate step size to match the sample-mean Monte Carlo estimate is $\alpha_k = \frac{1}{N_k(x_k) + 1}$. △
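The equivalence in part (i) can be checked numerically; the following is a quick sanity check, not a proof:

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(size=100)

# incremental Monte Carlo with alpha_k = 1 / (k + 1), starting from V = 0
V = 0.0
for k, g in enumerate(returns):
    alpha = 1.0 / (k + 1)
    V = (1.0 - alpha) * V + alpha * g

print(abs(V - returns.mean()) < 1e-9)  # True
```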

Exercise 3.2. Recall from Exercise 2.8 the n-step Bellman equation:
$$V^\pi(x) = \mathbb{E}_\pi\Big[ \sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V^\pi(X_n) \,\Big|\, X_0 = x \Big] .$$
Explain what a sensible n-step temporal-difference learning update rule might look like. △

Exercise 3.3. The Cart-Pole domain is a small, two-dimensional reinforcement learning problem [Barto et al., 1983]. In this problem, the learning agent must balance a swinging pole that is attached to a moving cart. Using an open-source implementation of Cart-Pole, implement the undiscounted finite-horizon categorical Monte Carlo algorithm and evaluate its behaviour. Construct a finite state space by discretising the four-dimensional state space using a uniform grid of size 10 × 10 × 10 × 10. Plot the learned return distribution for a fixed initial state and other states of your choice when

(i) The policy chooses actions uniformly at random;
(ii) The policy moves the cart in the direction that the pole is leaning towards.

You may want to pick the horizon H to be the maximum length of an episode. △

Exercise 3.4. Implement the categorical Monte Carlo (CMC), non-parametric categorical Monte Carlo (Remark 3.1), and categorical temporal-difference learning (CTD) algorithms. For a discount factor γ = 0.99, compare the return distributions learned by these three algorithms on the Cart-Pole domain of the previous exercise. For CMC and CTD, vary the number of particles m and the range of the support $[\theta_1, \theta_m]$. How do the approximations vary as a function of these parameters? △

Exercise 3.5. Suppose that the rewards $R_t$ take on one of $N_R$ values. Consider the undiscounted finite-horizon return
$$G = \sum_{t=0}^{H-1} R_t .$$
Denote by $N_G$ the number of possible realisations of G.

(i) Show that $N_G$ can be as large as $\binom{N_R + H - 1}{H}$.
(ii) Derive a better bound when $R_t \in \{0, 1, \dots, N_R - 1\}$.
(iii) Explain, in words, why the bound is better in this case. Are there other sets of rewards for which $N_G$ is smaller than the worst-case from (i)? △

Exercise 3.6. Recall from Section 3.5 that the categorical Monte Carlo algorithm aims to find the approximation
$$\hat\eta^\pi_m(x) = \Pi_C\, \eta^\pi(x), \quad x \in \mathcal{X},$$
where we use the subscript m to more explicitly indicate that the quality of this approximation depends on the number of locations.

Consider again the problem setting of Example 3.3. For $m \ge 2$, suppose that we take $\theta_1 = 0$ and $\theta_m = 2$, such that
$$\theta_i = 2\, \frac{i - 1}{m - 1}, \qquad i = 1, \dots, m .$$
Show that in this case $\hat\eta^\pi_m(x)$ converges to the uniform distribution on the interval $[0, 2]$, in the sense that
$$\lim_{m \to \infty} \hat\eta^\pi_m(x)\big([a, b]\big) = \frac{b - a}{2}$$
for all $a < b$; recall that $\nu([a, b])$ denotes the probability assigned to the interval $[a, b]$ by a probability distribution $\nu$. △

Exercise 3.7. The purpose of this exercise is to demonstrate that the categorical Monte Carlo algorithm is what we call mean-preserving. Consider a sequence of state-return pairs $(x_k, g_k)_{k=1}^{K}$, and an evenly-spaced grid $\{\theta_1, \dots, \theta_m\}$, $m \ge 2$. Suppose that rewards are bounded in $[R_{\mathrm{MIN}}, R_{\mathrm{MAX}}]$.

(i) Suppose that $V_{\mathrm{MIN}} \ge \theta_1$ and $V_{\mathrm{MAX}} \le \theta_m$. For a given $g \in [V_{\mathrm{MIN}}, V_{\mathrm{MAX}}]$, show that the distribution $\nu = \Pi_C\, \delta_g$ satisfies
$$\mathbb{E}_{Z \sim \nu}[Z] = g .$$

(ii) Based on this, argue that if $\eta(x)$ is a distribution with mean $V$, then after applying the update
$$\eta(x) \leftarrow (1 - \alpha)\,\eta(x) + \alpha\, \Pi_C\, \delta_g$$
the mean of $\eta(x)$ is $(1 - \alpha)V + \alpha g$.

(iii) By comparing with the incremental Monte Carlo update rule, explain why categorical Monte Carlo can be said to be mean-preserving.

(iv) Now suppose that $[V_{\mathrm{MIN}}, V_{\mathrm{MAX}}] \not\subseteq [\theta_1, \theta_m]$. How are the preceding results affected? △
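Part (i) of the preceding exercise can also be checked numerically: project a point mass onto the grid and compare means. A sanity check under our own helper name:

```python
import numpy as np

def project_point_mass(g, locations):
    """Project a Dirac delta at g onto an evenly-spaced grid (Equation 3.16),
    assuming theta_1 <= g <= theta_m."""
    gap = locations[1] - locations[0]
    probs = np.zeros(len(locations))
    i = min(int((g - locations[0]) // gap), len(locations) - 2)
    zeta = (g - locations[i]) / gap
    probs[i] = 1.0 - zeta
    probs[i + 1] = zeta
    return probs

locations = np.linspace(-1.0, 1.0, 11)
for g in [-0.73, 0.0, 0.41, 0.99]:
    probs = project_point_mass(g, locations)
    assert abs(probs @ locations - g) < 1e-12  # the mean is preserved
print("mean preserved on all test points")
```

Repeating the same check with a nearest-neighbour rule (as in Exercise 3.8) fails, since snapping all mass to one location shifts the mean by up to half the grid spacing.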

Exercise 3.8. Following the notation of Section 3.5, consider the nearest-neighbour projection method
$$\Pi_{\mathrm{NN}}\, g = \begin{cases} \Pi_-(g) & \text{if } \zeta(g) \le 0.5, \\ \Pi_+(g) & \text{otherwise.} \end{cases}$$
Show that this projection is not mean-preserving in the sense of the preceding exercise. Implement and evaluate on the Cart-Pole domain. What do you observe? △