Learning the Return Distribution

Reinforcement learning provides a computational model of what and how an

animal, or more generally an agent, learns to predict on the basis of received

experience. Often, the prediction of interest is about the expected return under

the initial state distribution or, counterfactually, from a query state

x ∈X

. In the

latter case, making good predictions is equivalent to having a good estimate

of the value function

. From these predictions, reinforcement learning also

seeks to explain how the agent might best control its environment: for example,

to maximize its expected return.

By learning from experience, we generally mean from data rather than from

the Markov decision process description of the environment (its transition ker-

nel, reward distribution, and so on). In many settings, these data are taken

from sample interactions with the environment. In their simplest forms, these

interactions are realizations of the random trajectory (

, A

, R

)

t≥0

. The record

of a particular game of chess and the ﬂuctuations of an investment account

throughout the month are two examples of sample trajectories. Learning from

experience is a powerful paradigm, because it frees us from needing a complete

description of the environment, an often impractical if not infeasible require-

ment. It also enables incremental algorithms whose run time does not depend on

the size of the environment and that can easily be implemented and parallelized.

The aim of this chapter is to introduce a concrete algorithm for learning

return distributions from experience, called categorical temporal-diﬀerence

learning. In doing so, we will provide an overview of the design choices that

must be made when creating distributional reinforcement learning algorithms.

In classical reinforcement learning, there are well-established (and in some

cases, deﬁnitive) methods for learning to predict the expected return; in the

distributional setting, choosing the right algorithm requires balancing multiple,

sometimes conﬂicting considerations. In great part, this is because of the unique

and exciting challenges that arise when one wishes to estimate sometimes

intricate probability distributions, rather than scalar expectations.

Draft version. 51

52 Chapter 3

3.1 The Monte Carlo Method

Birds such as the pileated woodpecker follow a feeding routine that regularly

takes them back to the same foraging grounds. The success of this routine can be

measured in terms of the total amount of food obtained during a ﬁxed period of

time, say a single day. As part of a ﬁeld study, it may be desirable to predict the

success of a particular bird’s routine on the basis of a limited set of observations:

for example, to assess its survival chances at the beginning of winter based

on feeding observations from the summer months. In reinforcement learning

terms, we view this as the problem of learning to predict the expected return

(total food per day) of a given policy

(the feeding routine). Here, variations in

weather, human activity, and other foraging animals are but a few of the factors

that aﬀect the amount of food obtained on any particular day.

In our example, the problem of learning to predict is abstractly a problem

of statistical estimation. To this end, let us model the woodpecker’s feeding

routine as a Markov decision process.

We associate each day with a sample

trajectory or episode, corresponding to measurements made at regular intervals

about the bird’s location

, behavior

, and per-period food intake

. Suppose

that we have observed a set of K sample trajectories,

k,t

, a

k,t

, r

k,t

)

−1

t=0

k=1

, (3.1)

where we use

to index the trajectory and

to index time, and where

denotes

the number of measurements taken each day. In this example, it is most sensible

to assume a ﬁxed number of measurements

, but in the general setting,

may be random and possibly dependent on the trajectory, often corresponding to

the time when a terminal state is ﬁrst reached. For now, let us also assume that

there is a unique starting state

, such that

k,0

for all

. We are interested

in the problem of estimating the expected return

T −1

t=0

= V

) ,

corresponding to the expected per-day food intake of our bird.

Monte Carlo methods estimate the expected return by averaging the outcomes

of observed trajectories. Let us denote by

the sample return for the

17.

Although this may be a reasonable approximation of reality, it is useful to remember that

concepts such as the Markov property are in this case modeling assumptions, rather than actual

facts.

18.

In this particular example, a discount factor of

= 1 is reasonable given that

is ﬁxed and we

should have no particular preference for mornings over evenings.

Draft version.

Learning the Return Distribution 53

trajectory:

T −1

t=0

k,t

. (3.2)

The sample-mean Monte Carlo estimate is the average of these

sample

returns:

) =

k=1

. (3.3)

This a sensible procedure that also has the beneﬁt of being simple to implement

and broadly applicable. In our example above, the Monte Carlo method corre-

sponds to estimating the expected per-day food intake by simply averaging the

total intake measured over diﬀerent days.

Example 3.1.

Monte Carlo tree search is a family of approaches that have

proven eﬀective for planning in stochastic environments and have been used

to design state-of-the-art computer programs that play the game of Go. Monte

Carlo tree search combines a search procedure with so-called Monte Carlo

rollouts (simulations). In a typical implementation, a computer Go algorithm

will maintain a partial search tree whose nodes are Go positions, rooted in the

current board conﬁguration. Nodes at the fringe of this tree are evaluated by

performing a large number of rollouts of a ﬁxed rollout policy (often uniformly

random) from the nodes’ position to the game’s end.

By deﬁning the return of one rollout to be 1 if the game is won and 0 if the

game is lost, the expected undiscounted return from a leaf node corresponds to

the probability of winning the game (under the rollout policy) from that position.

By estimating the expected return from complete rollouts, the Monte Carlo

method avoids the challenges associated with devising a heuristic to determine

the value of a given position. 4

We can use the Monte Carlo method to estimate the value function

rather

than only the expected return from the initial state. Suppose now that the sample

trajectories (Equation 3.1) have diﬀerent starting states (that is,

k,0

varies across

trajectories). The Monte Carlo method constructs the value function estimate

(x) =

N(x)

k=1

k,0

=x}

where

N(x) =

k=1

k,0

= x}

is the number of trajectories whose starting state is

. For simplicity of

exposition, we assume here that N(x) > 0 for all x ∈X.

Draft version.

54 Chapter 3

Learning the value function is useful because it provides ﬁner-grained detail

about the environment and chosen policy compared to only learning about the

expected return from the initial state. Often, estimating the full value function

can be done at little additional cost, because sample trajectories contain infor-

mation about the expected return from multiple states. For example, given a

collection of

sample trajectories drawn independently from the generative

equations, the ﬁrst-visit Monte Carlo estimate is

(x) =

(x)

k=1

−1

t=0

k,t

= x, x

k,0

, . . . , x

k,t−1

, x}

| {z }

k,t



−1

s=t

s−t

k,s



(x) =

k=1

−1

t=0

k,t

The ﬁrst-visit Monte Carlo estimate treats each time step as the beginning of a

new trajectory; this is justiﬁed by the Markov property and time-homogeneity

of the random trajectory (

, A

, R

)

t≥0

(Section 2.6). The restriction to the ﬁrst

occurrence of

in a particular trajectory guarantees that

behaves like the

sample-mean estimate.

3.2 Incremental Learning

Both in practice and in theory, it is useful to consider a learning model under

which sample trajectories are processed sequentially, rather than all at once.

Algorithms that operate in this fashion are called incremental algorithms, as

they maintain a running value function estimate

V ∈R

that they improve with

each sample.

Under this model, we now consider an inﬁnite sequence of

sample trajectories



k,t

, a

k,t

, r

k,t

)

−1

t=0



k≥0

, (3.4)

presented one at a time to the learning algorithm. In addition, we consider the

more general setting in which the initial states (

k,0

)

k≥0

may be diﬀerent; we

call these states the source states, as with the sample transition model (Section

2.6). As in the previous section, a minimum requirement for learning

is that

every state x ∈X should be the source state of some trajectories.

The incremental Monte Carlo algorithm begins by initializing its value

function estimate:

V(x) ←V

(x), for all x ∈X.

19.

Incremental methods are also sometimes called stochastic or sample based. We prefer the term

“incremental” as these methods can be applied even in the absence of randomness, and because

other estimation methods – including the sample-mean method of the preceding section – are also

based on samples.

Draft version.

Learning the Return Distribution 55

Algorithm 3.1: Online incremental ﬁrst-visit Monte Carlo

Algorithm parameters: step size α ∈(0, 1],

policy π : X→P(A),

initial estimate V

∈R

V(x) ←V

(x) for all x ∈X

Loop for each episode:

Observe initial state x

T ←0

Loop for t = 0, 1, …

Draw a

from π( · | x

)

Take action a

, observe r

, x

t+1

T ←T + 1

until x

t+1

is terminal

g ←0

for t = T −1, …, 0 do

g ←r

+ γg

if x

is not in {x

, …, x

t−1

} then

V(x

) ←(1 −α)V(x

) + αg

end for

end

Given the

th trajectory, the algorithm computes the sample return

(Equation

3.2), called the Monte Carlo target in this context. It then adjusts its value

function estimate for the initial state

k,0

toward this target, according to the

update rule

V(x

k,0

) ←(1 −α

)V(x

k,0

) + α

The parameter

∈

1) is a time-varying step size that controls the impact

of a single sample trajectory on the value estimate

. Because the incremental

Monte Carlo update rule only depends on the starting state and the sample

return, we can more generally consider learning

on the basis of the sequence

of state-return pairs



, g

)

k≥0

. (3.5)

Under this simpliﬁed model, the sample return

is assumed drawn from the

return distribution

(

) (rather than constructed from the sample trajectory; of

course, these two descriptions are equivalent). The initial state

k,0

is substituted

for x

to yield

V(x

) ←(1 −α

)V(x

) + α

. (3.6)

Draft version.

56 Chapter 3

By choosing a step size that is inversely proportional to the number of times

(

) that the value of state

has been previously encountered, we recover

the sample-mean estimate of the previous section (see Exercise 3.1). In practice,

it is also common to use a constant step size

to avoid the need to track

. We

will study the eﬀect of step sizes in greater detail in Chapter 6.

As given, the Monte Carlo update rule is agnostic to the mechanism by which

the sample trajectories are generated or when learning occurs in relation to the

agent observing these trajectories. A frequent and important setting in reinforce-

ment learning is when each trajectory is consumed by the learning algorithm

immediately after being experienced, which we call the online setting.

Com-

plementing the abstract presentation given in this section, Algorithm 3.1 gives

pseudo-code for an incremental implementation of the ﬁrst-visit Monte Carlo

algorithm in the online, episodic

setting.

In Equation 3.6, the notation

A ← B

indicates that the value

(which may

be a scalar, a vector, or a distribution) should be stored in the variable

. In the

case of Equation 3.6,

is a collection of scalars – one per state – and the update

rule describes the process of modifying the value of one of these variables “in

place.” This provides a succinct way of highlighting the incremental nature of

the process. On the other hand, it is often useful to consider the value of the

variable after a given number of iterations. For

k >

0, we express this with the

notation

k+1

) = (1 −α

) + α

k+1

(x) = V

(x) for x , x

, (3.7)

where the second line reﬂects the fact that only the variable associated to state

is modiﬁed at time k.

3.3 Temporal-Diﬀerence Learning

Incremental algorithms are useful because they do not need to maintain the

entire set of sample trajectories in memory. In addition, they are often simpler

to implement and enable distributed and approximate computation. Temporal-

diﬀerence learning (TD learning, or simply TD) is particularly well suited to

the incremental setting, because it can learn from sample transitions, rather than

entire trajectories. It does so by leveraging the relationship between the value

20.

The online setting is sometimes deﬁned in terms of individual transitions, which are to be

consumed immediately after being experienced. Our use of the term is broader and related to the

streaming setting considered in other ﬁelds of research (see, e.g., Cormode and Muthukrishnan

2005).

21. A environment is said to be episodic when all trajectories eventually reach a terminal state.

Draft version.

Learning the Return Distribution 57

Algorithm 3.2: Online temporal-diﬀerence learning

Algorithm parameters: step size α ∈(0, 1],

policy π : X→P(A)

initial estimate V

∈R

V(x) ←V

(x) for all x ∈X

Loop for each episode:

Observe initial state x

Loop for t = 0, 1, …

Draw a

from π( · | x

)

Take action a

, observe r

, x

t+1

if x

t+1

is terminal then

V(x

) ←(1 −α)V(x

) + αr

else

V(x

) ←(1 −α)V(x

) + α



+ γV(x

t+1

)



end if

until x

t+1

is terminal

end

function at successive states – eﬀectively, it takes advantage of the Bellman

equation.

To begin, let us abstract away the sequential nature of the trajectory and

consider the sample transition model (

X, A, R, X

) deﬁned by Equation 2.12.

Here, the distribution

of the source state

may correspond, for example, to

the relative frequency at which states are visited over the random trajectory. We

consider a sequence of sample transitions drawn independently according to

this model, denoted

, a

, r

, x

)

k≥0

. (3.8)

As with the incremental Monte Carlo algorithm, the update rule of temporal-

diﬀerence learning is parameterized by a time-varying step size

. For the

sample transition, this update rule is given by

V(x

) ←(1 −α

)V(x

) + α



+ γV(x

)



. (3.9)

Algorithm 3.2 instantiates this update rule in the online, episodic setting. We

call the term

γV

(

) in Equation 3.9 the temporal-diﬀerence target. If we

again write the Bellman equation

(x) = E



R + γV

) | X = x



Draft version.

58 Chapter 3

we see that the temporal-diﬀerence target can be thought of as a realization

of the random variable whose expectation forms the right-hand side of the

Bellman equation, with the exception that the TD target uses the estimated

value function

in place of the true value function

. This highlights the fact

that temporal-diﬀerence learning adjusts its value function estimate to be more

like the right-hand side of the Bellman equation; we will study this relationship

more formally in Chapter 6.

By rearranging terms, we can also express the temporal-diﬀerence update

rule in terms of the temporal-diﬀerence error r

+ γV(x

) −V(x

V(x

) ←V(x

) + α



+ γV(x

) −V(x

)



;

this form highlights the direction of change of the value function estimate

(positive if the target is greater than our estimate, negative if it is not) and is

needed to express certain reinforcement learning algorithms, as we will see in

Chapter 9.

The incremental Monte Carlo algorithm updates its value function estimate

toward a ﬁxed (but noisy) target. By contrast, Equation 3.9 describes a recur-

sive process, without such a ﬁxed target. Temporal-diﬀerence learning instead

depends on the value function at the next state

(

) being approximately

correct. As such, it is said to bootstrap from its own value function estimate.

Because of this recursive dependency, the dynamics of temporal-diﬀerence

learning are usually diﬀerent from those of Monte Carlo methods, and are more

challenging to analyze (Chapter 6).

On the other hand, temporal-diﬀerence learning oﬀers some important advan-

tages over Monte Carlo methods. One is the way in which value estimates

are naturally shared between states, so that once a value has been estimated

accurately at one state, this can often be used to improve the value estimates at

other states. In many situations, the estimates produced by temporal-diﬀerence

learning are consequently more accurate than their Monte Carlo counterparts.

A full treatment of the statistical properties of temporal-diﬀerence learning is

beyond the scope of this book, but we provide references on the topic in the

bibliographical remarks.

3.4 From Values to Probabilities

In distributional reinforcement learning, we are interested in understanding

the random return as it arises from interactions with the environment. In the

context of this chapter, we are speciﬁcally interested in how we can learn the

return-distribution function η

As a light introduction, consider the scenario in which rewards are binary

(

∈{

}

) and where we are interested in learning the distribution of the

Draft version.

Learning the Return Distribution 59

undiscounted ﬁnite-horizon return function

(x) =

H−1

t=0

, X

= x. (3.10)

Here,

H ∈N

denotes the horizon of the learner – how far it predicts into

the future. By enumeration, we see that

(

) takes on one of

+ 1 possible

values, integers ranging from 0 to

. These form the support of the probability

distribution

(

). To learn

, we will construct an incremental algorithm that

maintains a return-distribution function estimate η, the distributional analogue

from the previous sections. This estimate assigns a probability

(

) to

each possible return i ∈{0, . . . , H}:

η(x) =

i=0

(x)δ

, (3.11)

where

(

)

≥

0 and

i=0

(

) = 1. We call Equation 3.11 a categorical repre-

sentation, since each possible return can now be thought of as one of

+ 1

categories. Under this perspective, we can think of the problem of learning the

return distribution for a given state

as a classiﬁcation problem – assigning

probabilities to each of the possible categories. We may then view the problem

of learning the return function

as a collection of classiﬁcation problems (one

per state).

As before, let us consider a sequence of state-return pairs (

, g

)

k≥0

, where

each

is drawn from the distribution

(

). As in Section 3.2, the sample

return

provides a target for an update rule except that now we want to adjust

the probability of observing

rather than an estimate of the expected return.

For a step size α

∈(0, 1], the categorical update rule is

) ←(1 −α

) + α

) ←(1 −α

) , i , g

. (3.12)

The adjustment of the probabilities for returns other than

ensures that

the return-distribution estimate at

continues to sum to 1 after the update.

Expressed as an operation over probability distributions, this update rule corre-

sponds to changing

(

) to be a mixture between itself and a distribution that

puts all of its mass on g

η(x

) ←(1 −α

)η(x

) + α

. (3.13)

We call this the undiscounted ﬁnite-horizon categorical Monte Carlo algorithm.

It is instantiated in the online, episodic setting in Algorithm 3.3. Similar to the

other incremental algorithms presented thus far, it is possible to demonstrate

that under the right conditions, this algorithm learns a good approximation to

Draft version.

60 Chapter 3

the distribution of the binary-reward, undiscounted, ﬁnite-horizon return. In the

next section, we will see how this idea can be carried over to the more general

setting.

Algorithm 3.3: Undiscounted ﬁnite-horizon categorical Monte

Carlo

Algorithm parameters: step size α ∈(0, 1],

horizon H,

policy π : X→P(A),

initial probabilities



(x))

i=0

: x ∈X



(x) ← p

(x) for all 0 ≤i ≤H, x ∈X

Loop for each episode:

Observe initial state x

T ←0

Loop for t = 0, 1, . . .

Draw a

from π( · | x

)

Take action a

, observe r

, x

t+1

T ←T + 1

until x

t+1

is terminal

g ←0

for t = T −1, …, 0 do

g ←g + r

if t < T − H then

g ←g −r

t+H

if x

is not in {x

, …, x

t−1

} then

) ←(1 −α)p

) + α

) ←(1 −α)p

), for i , g

end for

end

3.5 The Projection Step

The ﬁnite, undiscounted algorithm of the previous section is a sensible approach

when the random return takes only a few distinct values. In the undiscounted

setting, we already saw that the number of possible returns is

+ 1 when

there are two possible rewards. However, this small number is the exception,

rather than the rule. If there are

possible rewards, then

can be as large as

Draft version.

Learning the Return Distribution 61



+H−1



, already a potentially quite large number for

2 (see Exercise 3.5).

Worse, when a discount factor

is introduced, the number of possible returns

depends exponentially on

. To see why, recall the single-state, single-action

Markov decision process of Example 2.10 with reward distribution

= 0) = P

= 1) =

/2 .

With a discount factor

, we argued that the set of possible returns for this

MDP corresponds to the binary expansion of all numbers in the [0

2] interval,

from which we concluded that the random return is uniformly distributed on

that interval. By the same argument, we can show that the H-horizon return

G =

H−1

t=0

has support on all numbers in the [0

2] interval that are described by

binary

digits; there are 2

such numbers. Of course, if we take

to be inﬁnite, there

may be uncountably many possible returns.

To deal with the issue of a large (or even inﬁnite) set of possible returns,

we insert a projection step prior to the mixture update in Equation 3.13.

The purpose of the projection step is to transform an arbitrary target return

into a modiﬁed target taking one of

values, for

reasonably small. From

a classiﬁcation perspective, we can think of the projection step as assigning a

label (from a small set of categories) to each possible return.

We will consider return distributions that assign probability mass to

m ≥

evenly spaced values or locations. In increasing order, we denote these locations

≤θ

≤···≤θ

. We write

for the gap between consecutive locations, so

that for i = 1, . . . , m −1, we have

= θ

i+1

−θ

A common design takes θ

= V

min

and θ

= V

max

, in which case

max

−V

min

m −1

However, other choices are possible and sometimes desirable. Note that the

undiscounted algorithm of the previous section corresponds to

+ 1 and

= i −1.

22.

When there are relatively few sample trajectories, a sensible alternative is to construct a

nonparametric estimate of the return distribution that simply puts equal probability mass on all

observed returns. The return distributions described in Section 1.2 and Example 2.9, in particular,

were estimated in this manner. See Remark 3.1 for further details.

Draft version.

62 Chapter 3

We express the corresponding return distribution as a weighted sum of Dirac

deltas at these locations:

η(x) =

i=1

(x)δ

For mathematical convenience, let us write

−∞

and

m+1

∞

. Consider

a sample return

, and denote by

∗

(

) the index of the largest element of the

support (extended with −∞) that is no greater than g:

∗

= arg max

i∈{0,...,m}

{θ

: θ

≤g}.

For this sample return, we write

−

(g) = θ

∗

(g) = θ

∗

to denote the corresponding element of the support and its immediate successor;

when

, . . . , θ

are consecutive integers and

g ∈

[

, θ

], these are the ﬂoor and

ceiling of g, respectively.

The projection step begins by computing

ζ(g) =

g −Π

−

(g)

(g) −Π

−

(g)

with the convention that

(

) = 1 if Π

−

(

) =

−∞

and

(

) = 0 if Π

(

) =

∞

. The

(

) term corresponds to the distance of

to the two closest elements of the

support, scaled to lie in the interval [0

1] – eﬀectively, the fraction of the

distance between Π

−

(

) and Π

(

) at which

lies. The stochastic projection of

g is

(g) =

(

−

(g) with probability 1 −ζ(g)

(g) with probability ζ(g).

(3.14)

We use this projection to construct the update rule

η(x) ←(1 −α)η(x) + αδ

where as before,

is a source state and

a sample return.

Similar to Equation

3.12, this update rule is implemented by adjusting the probabilities according to

(x) ←(1 −α)p

(x) + α

(x) ←(1 −α)p

(x) , i , i

where i

is the index of location Π

23.

From here onward, we omit the iteration index

from the notation as it is not needed in the

deﬁnition of the update rule. The algorithm proper should still be understood as applying this update

rule to a sequence of state-return pairs (x

, g

)

k≥0

, possibly with a time-varying step size α

Draft version.

Learning the Return Distribution 63

We can improve on the stochastic projection by constructing a modiﬁed target

that contains information about both Π

−

(

) and Π

(

). In classiﬁcation terms,

this corresponds to using soft labels: the target is a probability distribution over

labels, rather than a single label. This deterministic projection of

results in

the update rule

η(x) ←(1 −α)η(x) + α



(1 −ζ(g))δ

−

(g)

+ ζ(g)δ

(g)



. (3.15)

We denote the deterministic projection by Π

. Statistically speaking, the deter-

ministic projection produces return-distribution estimates that are on average

the same as those produced by the stochastic projection but are comparatively

more concentrated around their mean. Going forward, we will see that it is

conceptually simpler to apply this projection to probability distributions, rather

than to sample returns. Rather than Π

(g), we therefore write

= (1 −ζ(g))δ

−

(g)

+ ζ(g)δ

(g)

We call this method the categorical Monte Carlo algorithm. This algorithm can

be used to learn to predict inﬁnite-horizon, discounted returns and is applicable

even when there are a large number of possible rewards.

Example 3.2.

Montezuma’s Revenge is a 1984 platform game designed by then-

sixteen-year-old Robert Jaeger. As part of the Arcade Learning Environment

(Bellemare et al. 2013a), the Atari 2600 version of the game poses a challenging

reinforcement learning problem due to the rare occurrence of positive rewards.

The very ﬁrst task in Montezuma’s Revenge consists in collecting a key, an

act that rewards the agent with 100 points. Let us consider the integer support

i −

1, for

= 1

, . . . ,

101. For a discount factor

= 0

99, the discounted

H-horizon return from the initial game state is

G =

H−1

t=0

{τ < H}

0.99

×100,

where τ denotes the time at which the key is obtained. For a ﬁxed τ ∈R, write

g = 0.99

×100 ζ = g −bgc.

For

τ < H

, the deterministic projection of the return 0

puts probability mass

−ζ

bgc

and

dge

. For example, if

= 60, then 100

≈

72 and

the deterministic projection is

= 0.28 ×δ

+ 0.72 ×δ

. 4

By introducing a projection step, we typically lose some information about

the target return. This is by necessity: we are asking the learning algorithm to

Draft version.

64 Chapter 3

Figure 3.1

Illustration of the projection for the ﬁve-location grid used in Example 3.3. The middle

panel depicts the proportion of probability mass assigned to the location

= 1 in terms

of the sample return

. The left and right panels illustrate this probability assignment at

the boundary locations θ

= 0 and θ

= 2.

approximate complex distributions using a small, ﬁnite set of possible returns.

Under the right conditions, Equation 3.15 gives rise to a convergent algorithm.

The point of convergence of this algorithm is the return function

ˆη

(

) for which,

for all x ∈X, we have

ˆη

(x) = E[Π

(x)

] = E



1 −ζ(G

(x))



−

(x))

+ ζ(G

(x))δ

(x))



. (3.16)

In fact, we may write

E[Π

(x)

] = Π

(x) ,

where Π

(

) denotes distribution supported on

{θ

, …, θ

}

produced by pro-

jecting each possible outcome under the distribution

(

); we will discuss this

deﬁnition in further detail in Chapter 5.

Example 3.3.

Recall the single-state Markov decision process of Example 2.10,

whose return is uniformly distributed on the interval [0

2]. Suppose that we take

our support to be the uniform grid

{

}

. Let us write

ˆp

, ˆp

0.5

, . . . , ˆp

for the probabilities assigned to these locations by the projected distribution

ˆη

(

) = Π

(

), where

is the unique state of the MDP. The probability density

of the return on the interval [0, 2] is 0.5. We thus have

ˆp

(a)

= 0.5

g∈[0,2]



{Π

−

(g) = 0}

(1 −ζ(g)) +

{Π

(g) = 0}

ζ(g)



= 0.5

g∈[0,0.5]

(1 −ζ(g))dg

= 0.5

g∈[0,0.5]

(1 −2g)dg

= 0.125.

Draft version.

Learning the Return Distribution 65

In (a), we reexpressed Equation 3.16 in terms of the probability assigned to

ˆp

. A similar computation shows that

ˆp

0.5

ˆp

1.5

= 0

25, while

ˆp

= 0

125.

Figure 3.1 illustrates the process of assigning probability mass to diﬀerent

locations. Intuitively, the solution makes sense: there is less probability mass

near the boundaries of the interval [0, 2]. 4

3.6 Categorical Temporal-Diﬀerence Learning

With the use of a projection step, the categorical Monte Carlo method allows us

to approximate the return-distribution function of a given policy from sample

trajectories and using a ﬁxed amount of memory. Like the Monte Carlo method

for value functions, however, it ignores the relationship between successive

states in the trajectory. To leverage this relationship, we turn to categorical

temporal-diﬀerence learning (CTD).

Consider now a sample transition (

x, a, r, x

). Like the categorical Monte

Carlo algorithm, CTD maintains a return function estimate

(

) supported on

the evenly spaced locations

{θ

, . . . , θ

}

. Like temporal-diﬀerence learning, it

learns by bootstrapping from its current return function estimate. In this case,

however, the update target is a probability distribution supported on

{θ

, . . . , θ

}

The algorithm ﬁrst constructs an intermediate target by scaling the return

distribution

(

) at the next state by the discount factor

, then shifting it by

the sample reward r. That is, if we write

η(x

) =

i=1

)δ

then the intermediate target is

˜η(x) =

i=1

)δ

r+γθ

which can also be expressed in terms of a pushforward distribution:

˜η(x) = (b

r,γ

)

η(x

) . (3.17)

Observe that the shifting and scaling operations are applied to each particle

in isolation. After shifting and scaling, however, these particles in general no

longer lie in the support of the original distribution. This motivates the use of the

projection step described in the previous section. Let us denote the intermediate

particles by

= b

r,γ

(θ

) = r + γθ

Draft version.

66 Chapter 3

The CTD target is formed by individually projecting each of these particles

back onto the support and combining their probabilities. That is,

r,γ

)

η(x

) =

j=1

)Π

r+γθ

More explicitly, this is

r,γ

)

η(x

) =

j=1

)



(1 −ζ(

))δ

−

(

)

+ ζ(

)δ

(

)



i=1



j=1

)ζ

i, j

(r)



, (3.18)

where

i, j

(r) =



(1 −ζ(

))

{Π

−

(

) = θ

}

+ ζ(

)

{Π

(

) = θ

}



In Equation 3.18, the second line highlights that the CTD target is a probability

distribution supported on

{θ

, . . . , θ

}

. The probabilities assigned to speciﬁc

locations are given by the inner sum; as shown here, this assignment is obtained

by weighting the next-state probabilities p

) by the coeﬃcients ζ

i, j

(r).

The CTD update adjusts the return function estimate

(

) toward this target:

η(x) ←(1 −α)η(x) + α



r,γ

)

η(x

)



. (3.19)

Expressed in terms of the probabilities of the distributions

(

) and

(

), this

is (for i = 1, . . . , m)

(x) ←(1 −α)p

(x) + α

j=1

i, j

(r)p

) . (3.20)

With this form, we see that the CTD update rule adjusts each probability

(

)

of the return distribution at state

toward a mixture of the probabilities of the

return distribution at the next state (see Algorithm 3.4).

The deﬁnition of the intermediate target in pushforward terms (Equation

3.17) illustrates that categorical temporal-diﬀerence learning relates to the

distributional Bellman equation

(x) = E

[(b

R,γ

)

) |X = x] ,

analogous to the relationship between TD learning and the classical Bellman

equation. We will continue to explore this relationship in later chapters. How-

ever, the two algorithms usually learn diﬀerent solutions, due to the introduction

of approximation error from the bootstrapping process.

Example 3.4.

We can study the behavior of the categorical temporal-diﬀerence

learning algorithm by visualizing how its predictions vary over the course

Draft version.

Learning the Return Distribution 67

Algorithm 3.4: Online categorical temporal-diﬀerence learning

Algorithm parameters: step size α ∈(0, 1],

policy π : X→P(A),

evenly spaced locations θ

, …, θ

∈R,

initial probabilities



(x))

i=0

: x ∈X



(x))

i=0

←(p

(x))

i=0

for all x ∈X

Loop for each episode:

Observe initial state x

Loop for t = 0, 1, . . .

Draw a

from π( · | x

)

Take action a

, observe r

, x

t+1

ˆp

←0 for i = 1, …, m

for j = 1, …, m do

if x

t+1

is terminal then

g ←r

else

g ←r

+ γθ

if g ≤θ

then

ˆp

← ˆp

+ p

t+1

)

else if g ≥θ

then

ˆp

← ˆp

+ p

t+1

)

else

∗

← largest i s.t. θ

≤g

ζ ←

g−θ

∗

−θ

∗

ˆp

∗

← ˆp

∗

+ (1 −ζ)p

t+1

)

ˆp

∗

← ˆp

∗

+ ζp

t+1

)

end for

for i = 1, …, m do

) ←(1 −α)p

) + α ˆp

end for

until x

t+1

is terminal

end

of learning. We apply CTD to approximate the return function of the safe

policy in the Cliﬀs domain (Example 2.9). Learning takes place online (as per

Algorithm 3.4), using a constant step size of

= 0

05, and return distributions

are approximated with m = 31 locations evenly spaced between −1 and 1.

Draft version.

68 Chapter 3

1.0 0.5 0.0 0.5 1.0

0.00

0.25

0.50

0.75

1.00

Probability

Return

Probability

Return

Probability

Return

Probability

Return

1.0 0.5 0.0 0.5 1.0

0.00

0.05

0.10

0.15

Probability

Return

1.0 0.5 0.0 0.5 1.0

0.00

0.05

0.10

0.15

Probability

Return

1.0 0.5 0.0 0.5 1.0

0.00

0.05

0.10

0.15

Probability

Return

1.0 0.5 0.0 0.5 1.0

0.00

0.05

0.10

0.15

Probability

Return

1.0 0.5 0.0 0.5 1.0

0.0

0.1

0.2

0.3

0.4

0.5

1.0 0.5 0.0 0.5 1.0

0.00

0.05

0.10

0.15

Probability

Return

1.0 0.5 0.0 0.5 1.0

0.00

0.05

0.10

0.15

Probability

Return

Figure 3.2

The return distribution at the initial state in the Cliﬀs domain, as predicted by categorical

temporal-diﬀerence learning over the course of learning.

Top panels.

The predictions

after

e ∈{

1000

2000

10000

}

episodes (

= 0

05; see Algorithm 3.4). Here, the return

distributions are initialized by assigning equal probability to all locations.

Bottom pan-

els.

Corresponding predictions when the return distributions initially put all probability

mass on the zero return.

(a) (b) (c)

1.0 0.5 0.0 0.5 1.0

Return

Density

1.0 0.5 0.0 0.5 1.0

Return

0.0

0.1

0.2

0.3

0.4

0.5

Probability

1.0 0.5 0.0 0.5 1.0

Return

0.00

0.05

0.10

0.15

0.20

0.25

Probability

Figure 3.3

(a)

The return distribution at the initial state in the Cliﬀs domain, visualized using

kernel density estimation (see Figure 2.2).

(b)

The return-distribution estimate learned

by the categorical Monte Carlo algorithm with

= 31.

(c)

The estimate learned by the

categorical temporal-diﬀerence learning algorithm.

Figure 3.2 illustrates that the initial return function plays a substantial role in

the speed at which CTD learns a good approximation. Informally, this occurs

because the uniform distribution is closer to the ﬁnal approximation than the

distribution that puts all of its probability mass on zero (what “close” means in

this context will be the topic of Chapter 4). In addition, we see that categorical

temporal-diﬀerence learning learns a diﬀerent approximation to the true return

function, compared to the categorical Monte Carlo algorithm (Figure 3.3). This

is due to a phenomenon we call diﬀusion, which arises from the combination of

the bootstrapping step and projection; we will study diﬀusion in Chapter 5.

Draft version.

Learning the Return Distribution 69

3.7 Learning to Control

A large part of this book considers the problem of learning to predict the dis-

tribution of an agent’s returns. In Chapter 7, we will discuss how one might

instead learn to maximize or control these returns and the role that distribu-

tional reinforcement learning plays in this endeavor. By learning to control, we

classically mean obtaining (from experience) a policy

∗

that maximizes the

expected return:

∗

∞

t=0

≥E

∞

t=0

, for all π .

Such a policy is called an optimal policy. From Section 2.5, recall that the

state-action value function Q

is given by

(x, a) = E

∞

t=0



= x, A

= a

Any optimal policy π

∗

has the property that its state-action value function also

satisﬁes the Bellman optimality equation:

∗

(x, a) = E



R + γ max

∈A

∗

, a

) |X = x, A = a



Similar in spirit to temporal-diﬀerence learning, Q-learning is an incremental

algorithm that ﬁnds an optimal policy. Q-learning maintains a state-action value

function estimate, Q, which it updates according to

Q(x, a) ←(1 −α)Q(x, a) + α



r + γ max

∈A

Q(x

, a

)



. (3.21)

The use of the maximum in the Q-learning update rule results in diﬀerent

behavior than TD learning, as the selected action depends on the current value

estimate. We can think of this diﬀerence as constructing one target for each

action

and updating the value estimate toward the largest of such targets.

It can be shown that under the right conditions, Q-learning converges to the

optimal state-action value function

∗

, corresponding to the expected return

under any optimal policy. We extract an optimal policy from

∗

by acting

greedily with respect to

∗

: that is, choosing a policy

∗

that selects maximally

valued actions according to Q

∗

The simplest way to extend Q-learning to the distributional setting is to

express the maximal action in Equation 3.21 as a greedy policy. Denote by

a return-function estimate over state-action pairs, such that

(

x, a

) is the

return distribution associated with the state-action pair (

x, a

)

∈X×A

. Deﬁne

the greedy action

(x) = arg max

a∈A

Z∼η(x,a)

[Z] ,

Draft version.

70 Chapter 3

breaking ties arbitrarily. The categorical Q-learning update rule is

η(x, a) ←(1 −α)η(x, a) + α



r,γ

)



, a

)





It can be shown that, under the same conditions as Q-learning, the mean of the

return-function estimates also converges to

∗

. The behavior of the distributions

themselves, however, may be surprisingly complicated. We can also put the

learned distributions to good use and make decisions on the basis of their

full characterization, rather than from their mean alone. This forms the topic

of risk-sensitive reinforcement learning. We return to both of these points in

Chapter 7.

3.8 Further Considerations

Categorical temporal-diﬀerence learning learns to predict return distributions

from sample experience. As we will see in subsequent chapters, the choices that

we made in designing CTD are not unique, and the algorithm is best thought of

as a jumping-oﬀ point into a broad space of methods. For example, an important

question in distributional reinforcement learning asks how we should represent

probability distributions, given a ﬁnite memory budget. One issue with the

categorical representation is that it relies on a ﬁxed grid of locations to cover the

range [

, θ

], which lacks ﬂexibility and is in many situations ineﬃcient. We

will take a closer look at this issue in Chapter 5. In many practical situations,

we also need to deal with a few additional considerations, including the use of

function approximation to deal with very large state spaces (Chapters 9 and 10).

3.9 Technical Remarks

Remark 3.1

(Nonparametric distributional Monte Carlo algorithm)

Section 3.4, we saw that the (unprojected) ﬁnite-horizon categorical Monte

Carlo algorithm can in theory learn ﬁnite-horizon return-distribution functions

when there are only a small number of possible returns. It is possible to extend

these ideas to obtain a straightforward, general-purpose algorithm that can be

sometimes be used to learn an accurate approximation to the return distribution.

Like the sample-mean Monte Carlo method, the nonparametric distributional

Monte Carlo algorithm takes as input

ﬁnite-length trajectories with a com-

mon source state

. After computing the sample returns (

)

k=1

from these

trajectories, it constructs the estimate

ˆη

) =

k=1

(3.22)

of the return distribution

(

). Here, nonparametric refers to the fact that

the approximating distribution in Equation 3.22 is not described by a ﬁnite

Draft version.

Learning the Return Distribution 71

collection of parameters; in fact, the memory required to represent this object

may grow linearly with

. Although this is not an issue when

is relatively

small, this can be undesirable when working with large amounts of data and

moreover precludes the use of function approximation (see Chapters 9 and 10).

However, unlike the categorical Monte Carlo and temporal-diﬀerence learn-

ing algorithms presented in this chapter, the accuracy of this estimate is only

limited by the number of trajectories

; we describe various ways to quantify

this accuracy in Remark 4.3. As such, it provides a useful baseline for measuring

the quality of other distributional algorithms. In particular, we used this algo-

rithm to generate the ground-truth return-distribution estimates in Example 2.9

and in Figure 3.3. 4

3.10 Bibliographical Remarks

The development of a distributional algorithm in this chapter follows our own

development of the distributional perspective, beginning with our work on using

compression algorithms in reinforcement learning (Veness et al. 2015).

3.1.

The ﬁrst-visit Monte Carlo estimate is studied by Singh and Sutton (1996),

where it is used to characterize the properties of replacing eligibility traces (see

also Sutton and Barto 2018). Statistical properties of model-based estimates

(which solve for the Markov decision process’s parameters as an intermediate

step) are analyzed by Mannor et al. (2007). Grünewälder and Obermayer (2011)

argue that model-based methods must incur statistical bias, an argument that

also extends to temporal-diﬀerence algorithms. Their work also introduces a

reﬁned sample-mean Monte Carlo method that yields a minimum-variance

unbiased estimator (MVUE) of the value function. See Browne et al. (2012) for

a survey of Monte Carlo tree search methods and Liu (2001), Robert and Casella

(2004), and Owen (2013) for further background on Monte Carlo methods more

generally.

3.2.

Incremental algorithms are a staple of reinforcement learning and have

roots in stochastic approximation (Robbins and Monro 1951; Widrow and Hoﬀ

1960; Kushner and Yin 2003) and psychology (Rescorla and Wagner 1972). In

the control setting, these are also called optimistic policy iteration methods and

exhibit fairly complex behavior (Sutton 1999; Tsitsiklis 2002).

3.3.

Temporal-diﬀerence learning was introduced by Sutton (1984, 1988). The

sample transition model presented here diﬀers from the standard algorithmic

presentation but allows us to separate concerns of behavior (data collection)

from learning. A similar model is used in Bertsekas and Tsitsiklis (1996) to

prove convergence of a broad class of TD methods and by Azar et al. (2012) to

provide sample eﬃciency bounds for model-based control.

Draft version.

72 Chapter 3

3.4.

What is eﬀectively the undiscounted ﬁnite-horizon categorical Monte Carlo

algorithm was proposed by Veness et al. (2015). There, the authors demonstrate

that by means of Bayes’s rule, one can learn the return distribution by ﬁrst

learning the joint distribution over returns and states (in our notation,

(

X, G

))

by means of a compression algorithm and subsequently using Bayes’s rule

to extract

(

G | X

). The method proved surprisingly eﬀective at learning to

play Atari 2600 games. Toussaint and Storkey (2006) consider the problem of

control as probabilistic inference, where the reward and trajectory length are

viewed as random variables to be optimized over, again obviating the need to

deal with a potentially large support for the return.

3.5–3.6.

Categorical temporal-diﬀerence learning for both prediction and

control was introduced by Bellemare et al. (2017a), in part to address the

shortcomings of the undiscounted algorithm. Its original form contains both the

projection step and the categorical representation as given here. The mixture

update that we study in this chapter is due to Rowland et al. (2018).

3.7.

The Q-learning algorithm is due to Watkins (1989); see also Watkins and

Dayan (1992). The explicit construction of a greedy policy is commonly found

in more complex reinforcement learning algorithms, including modiﬁed policy

iteration (Puterman and Shin 1978),

-policy iteration (Bertsekas and Ioﬀe

1996), and nonstationary policy iteration (Scherrer and Lesner 2012; Scherrer

2014).

3.9.

The algorithm introduced in Remark 3.1 is essentially an application of the

standard Monte Carlo method to the return distribution and is a special case of

the framework set out by Chandak et al. (2021), who also analyze the statistical

properties of the approach.

3.11 Exercises

Exercise 3.1.

Suppose that we begin with the initial value function estimate

V(x) = 0 for all x ∈X.

(i)

Consider ﬁrst the setting in which we are given sample returns for a single

state

. Show that in this case, the incremental Monte Carlo algorithm

(Equation 3.6), instantiated with

k+1

, is equivalent to computing the

sample-mean Monte Carlo estimate for

. That is, after processing the

sample returns g

, . . . , g

, we have

(x) =

i=1

(ii)

Now consider the case where source states are drawn from a distribution

, with

(

)

0 for all

, and let

(

) be the number of times

has been

Draft version.

Learning the Return Distribution 73

updated up to but excluding time

. Show that the appropriate step size to

match the sample-mean Monte Carlo estimate is α

)+1

. 4

Exercise 3.2. Recall from Exercise 2.8 the n-step Bellman equation:

(x) = E

n−1

t=0

+ γ

) | X

= x



Explain what a sensible

-step temporal-diﬀerence learning update rule might

look like. 4

Exercise 3.3.

The Cart–Pole domain is a small, two-dimensional reinforcement

learning problem (Barto et al. 1983). In this problem, the learning agent must

balance a swinging pole that is a attached to a moving cart. Using an open-

source implementation of Cart–Pole, implement the undiscounted ﬁnite-horizon

categorical Monte Carlo algorithm and evaluate its behavior. Construct a ﬁnite

state space by discretising the four-dimensional state space using a uniform grid

of size 10

10. Plot the learned return distribution for a ﬁxed initial

state and other states of your choice when

(i) the policy chooses actions uniformly at random;

(ii)

the policy moves the cart in the direction that the pole is leaning toward.

You may want to pick the horizon H to be the maximum length of an episode.

Exercise 3.4.

Implement the categorical Monte Carlo (CMC), nonparametric

categorical Monte Carlo (Remark 3.1), and categorical temporal-diﬀerence

learning (CTD) algorithms. For a discount factor

= 0

99, compare the return

distributions learned by these three algorithms on the Cart–Pole domain of the

previous exercise. For CMC and CTD, vary the number of particles

and the

range of the support [

, θ

]. How do the approximations vary as a function of

these parameters? 4

Exercise 3.5.

Suppose that the rewards

take on one of

values. Consider

the undiscounted ﬁnite-horizon return

G =

H−1

t=0

Denote by N

the number of possible realizations of G.

(i) Show that N

can be as large as



+H−1



(ii) Derive a better bound when R

∈{0, 1, . . . , N

−1}.

(iii)

Explain, in words, why the bound is better in this case. Are there other sets

of rewards for which N

is smaller than the worst-case from (i)? 4

Draft version.

74 Chapter 3

Exercise 3.6.

Recall from Section 3.5 that the categorical Monte Carlo

algorithm aims to ﬁnd the approximation

ˆη

(x) = Π

(x), x ∈X,

where we use the subscript

to more explicitly indicate that the quality of this

approximation depends on the number of locations.

Consider again the problem setting of Example 3.3. For

m ≥

2, suppose that

we take θ

= 0 and θ

= 2, such that

= 2

i −1

m −1

i = 1, . . . m .

Show that in this case,

ˆη

(

) converges to the uniform distribution on the

interval [0, 2], in the sense that

lim

m→∞

ˆη

(x)



[a, b]



→

b −a

for all

a < b

; recall that

([

a, b

]) denotes the probability assigned to the interval

[a, b] by a probability distribution ν. 4

Exercise 3.7.

The purpose of this exercise is to demonstrate that the categorical

Monte Carlo algorithm is what we call mean-preserving. Consider a sequence

of state-return pairs (

, g

)

k=1

and an evenly spaced grid

{θ

, . . . , θ

}

m ≥

Suppose that rewards are bounded in [R

min

, R

max

(i)

Suppose that

min

≥θ

and

max

≤θ

. For a given

g ∈

[

min

, V

max

], show that

the distribution ν = Π

satisﬁes

Z∼ν

[Z] = g .

(ii)

Based on this, argue that if

(

) is a distribution with mean

, then after

applying the update

η(x) ←(1 −α)η(x) + αΠ

the mean of η(x) is (1 −α)V + αg.

(iii)

By comparing with the incremental Monte Carlo update rule, explain why

categorical Monte Carlo can be said to be mean-preserving.

(iv)

Now suppose that [

min

, V

max

]

[

, θ

]. How are the preceding results

aﬀected? 4

Exercise 3.8.

Following the notation of Section 3.5, consider the nearest

neighbor projection method

g =

(

−

(g) if ζ(g) ≤0.5

(g) otherwise.

Draft version.

Learning the Return Distribution 75

Show that this projection is not mean-preserving in the sense of the preceding

exercise. Implement and evaluate on the Cart–Pole domain. What do you

observe? 4

Draft version.