2
The Distribution of Returns
Training for a marathon. Growing a vegetable garden. Working towards a piano
recital. Many of life’s activities involve making decisions whose benefits are
realised only later in the future (whether to run on a particular Saturday morning;
whether to add fertiliser to the soil). In reinforcement learning, these benefits
are summarised by the return received following these decisions. The return is a
random quantity that describes the sum total of the consequences of a particular
activity measured in dollars, points, bits, kilograms, kilometres, or praise.
Distributional reinforcement learning studies the random return. It asks ques-
tions such as: how should it be described, or approximated? How can it be
predicted on the basis of past observations? The overarching aim of this book
is to establish a language with which such questions can be answered. By
virtue of its subject matter, this language is somewhere at the intersection of
probability theory, statistics, operations research, and of course reinforcement
learning itself. In this chapter we begin by studying how the random return
arises from sequential interactions and immediate rewards. From this, we estab-
lish the fundamental relationship of random returns: the distributional Bellman
equation.
2.1 Random Variables and Their Probability Distributions
A quantity that we wish to model as random can be represented via a random
variable. For example, the outcome of a coin toss can be represented with a
random variable, which may take on either the value ‘heads’ or ‘tails’. We
can reason about a random variable through its probability distribution, which
specifies the probability of its possible realisations.
Example 2.1. Consider driving along a country road towards a railway crossing. There are two possible states that the crossing may be in. The crossing may be open, in which case you can drive straight through; or it may be closed,
in which case you must wait for the train to pass and for the barriers to lift
before driving on. We can model the state of the crossing as a random variable
C with two outcomes, “open” and “closed”. The distribution of C is specified
by a probability mass function, which provides the probability of each possible
outcome:
$$\mathbb{P}(C = \text{"open"}) = p\,, \qquad \mathbb{P}(C = \text{"closed"}) = 1 - p\,,$$
for some $p \in [0, 1]$. ∎
Example 2.2. Suppose we arrive at the crossing described above and the barriers are down. We may model the waiting time $T$ (in minutes) until the barriers are open again as a uniform distribution on the interval $[0, 10]$; informally, any real value between 0 and 10 is equally likely. In this case, the probability distribution can be specified through a probability density function, a function $f : \mathbb{R} \to [0, \infty)$. In the case of the uniform distribution above, this function is given by
$$f(t) = \begin{cases} \tfrac{1}{10} & \text{if } t \in [0, 10] \\ 0 & \text{otherwise.} \end{cases}$$
The density then provides the probability of $T$ lying in any interval $[a, b]$ according to
$$\mathbb{P}(T \in [a, b]) = \int_a^b f(t)\,dt\,. $$
∎
In this book, we will encounter instances of random variables such as rewards and returns that are discrete, have densities, or in some cases fall in neither category. To deal with this heterogeneity, one solution is to describe probability distributions over $\mathbb{R}$ using their cumulative distribution function (CDF), which always exists. The cumulative distribution function associated with a random variable $Z$ is the function $F_Z : \mathbb{R} \to [0, 1]$ defined by
$$F_Z(z) = \mathbb{P}(Z \leq z)\,.$$
In distributional reinforcement learning, common operations on random
variables include summation, multiplication by a scalar, and indexing into col-
lections of random variables. Later in the chapter, we will see how to describe
these operations in terms of cumulative distribution functions.
Example 2.3. Suppose that now we consider the random variable $T'$ describing the total waiting time experienced at the railroad crossing. If we arrive at the crossing and it is open ($C = $ "open"), there is no need to wait and $T' = 0$. If, however, the barrier is closed ($C = $ "closed"), the waiting time is distributed
uniformly on $[0, 10]$. The cumulative distribution function of $T'$ is
$$F_{T'}(t) = \begin{cases} 0 & t < 0 \\ p & t = 0 \\ p + \frac{(1-p)t}{10} & 0 < t < 10 \\ 1 & t \geq 10\,. \end{cases}$$
Observe that there is a non-zero probability of $T'$ taking the value 0 (which occurs when the crossing is open), so the distribution cannot have a density. Nor can it have a probability mass function, as there is a continuum of possible waiting times from 0 to 10 minutes. ∎
Another solution is to treat probability distributions atomically, as elements of the space $\mathscr{P}(\mathbb{R})$ of probability distributions. In this book we favour this approach over the use of cumulative distribution functions, as it lets us concisely express operations on probability distributions. The probability distribution that puts all of its mass on $z \in \mathbb{R}$, for example, is the Dirac delta denoted $\delta_z$; the uniform distribution on the interval from $a$ to $b$ is $\mathcal{U}([a, b])$. Both distributions belong to $\mathscr{P}(\mathbb{R})$. The distribution of the random variable $T'$ in Example 2.3 also belongs to $\mathscr{P}(\mathbb{R})$. It is a mixture of distributions and can be written in terms of its constituent parts:
$$p\,\delta_0 + (1 - p)\,\mathcal{U}([0, 10])\,.$$
More formally, these objects are probability measures: functions that associate different subsets of outcomes with their respective probabilities. If $Z$ is a real-valued random variable and $\nu$ is its distribution, then for a subset $S \subseteq \mathbb{R}$,³ we write
$$\nu(S) = \mathbb{P}(Z \in S)\,.$$
In particular, the probability assigned to $S$ by a mixture distribution is the weighted sum of probabilities assigned by its constituent parts: for $\nu_1, \nu_2 \in \mathscr{P}(\mathbb{R})$ and $p \in [0, 1]$, we have
$$\big(p\,\nu_1 + (1 - p)\,\nu_2\big)(S) = p\,\nu_1(S) + (1 - p)\,\nu_2(S)\,.$$
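These definitions are easy to check numerically. The sketch below (plain Python with NumPy; `p` and `n_samples` are illustrative choices, not values from the text) draws samples from the mixture $p\,\delta_0 + (1-p)\,\mathcal{U}([0,10])$ of Example 2.3 and compares empirical probabilities of a few sets of the form $(-\infty, t]$ against the closed-form cumulative distribution function $F_{T'}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_samples = 0.3, 100_000  # illustrative values

# Sample T' from the mixture p * delta_0 + (1 - p) * U([0, 10]):
# with probability p the crossing is open (wait 0), otherwise wait ~ U([0, 10]).
open_crossing = rng.random(n_samples) < p
waits = np.where(open_crossing, 0.0, rng.uniform(0.0, 10.0, size=n_samples))

def cdf_T(t):
    """Closed-form CDF of T' from Example 2.3."""
    return np.clip(p + (1 - p) * t / 10.0, 0.0, 1.0) * (t >= 0)

for t in [0.0, 2.5, 5.0, 10.0]:
    empirical = np.mean(waits <= t)   # nu((-inf, t]) estimated from samples
    print(f"t={t:4.1f}  empirical={empirical:.3f}  closed-form={cdf_T(t):.3f}")
```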
With this language, the cumulative distribution function of $\nu \in \mathscr{P}(\mathbb{R})$ can be expressed as
$$F_\nu(y) = \nu\big((-\infty, y]\big)\,.$$
This notation also extends to distributions over outcomes that are not real-valued. For instance, $\mathscr{P}(\{\text{"open"}, \text{"closed"}\})$ is the set of probability distributions over the state of the railroad crossing in Example 2.1.
3. Readers expecting to see the qualifier ‘measurable’ here should consult Remark 2.1.
Figure 2.1
Top: The Markov decision process model of the agent's interactions with its environment (the agent selects actions; the environment returns states and rewards). Bottom: The same interactions, unrolled to show the sequence of random variables $(X_t, A_t, R_t)_{t \geq 0}$, beginning at time $t = 0$ and up to the state $X_T$.
Probability distributions (as elements of $\mathscr{P}(\mathbb{R})$) make it possible to express some operations on distributions that would be unwieldy to describe in terms of random variables.
2.2 Markov Decision Processes
In reinforcement learning, an environment is any of a wide variety of systems
that emit observations, can be influenced, and persist in one form or another
over time. A data centre cooling system, a remote-controlled helicopter, a stock
market, and a video game console can all be thought of as environments. An
agent interacts with its environment by making choices that have consequences
in this environment. These choices may be implemented simply as an IF state-
ment in a simulator or they may require a human to perform some task in our
physical world.
We assume that interactions take place or are recorded at discrete time intervals. These give rise to a sequential process in which at any given time $t \in \mathbb{N} = \{0, 1, 2, \dots\}$ the current situation is described by a state $X_t$ from a finite set $\mathcal{X}$.⁴ The initial state is a random variable $X_0$ with probability distribution $\xi_0 \in \mathscr{P}(\mathcal{X})$.
4. Things tend to become more complicated when one considers infinite state spaces; see Remark 2.3.
The agent influences its future by choosing an action $A_t$ from a finite set of actions $\mathcal{A}$. In response to this choice, the agent is provided with a real-valued reward $R_t$.⁵ This reward indicates to the agent the usefulness or worth of its choice. The action also affects the state of the system; the new state is denoted $X_{t+1}$. An illustration of this sequential interaction is given in Figure 2.1. The reward and next state are modelled by the transition dynamics $P : \mathcal{X} \times \mathcal{A} \to \mathscr{P}(\mathbb{R} \times \mathcal{X})$ of the environment, which provides the joint probability distribution of $R_t$ and $X_{t+1}$ in terms of the state $X_t$ and action $A_t$. We say that $R_t$ and $X_{t+1}$ are drawn from this distribution:
$$R_t, X_{t+1} \sim P(\cdot, \cdot \mid X_t, A_t)\,. \tag{2.1}$$
In particular, when $R_t$ is discrete, Equation 2.1 can be directly interpreted as
$$\mathbb{P}(R_t = r, X_{t+1} = x' \mid X_t = x, A_t = a) = P(r, x' \mid x, a)\,.$$
Modelling the two quantities jointly is useful in problems where the reward depends on the next state (common in board games, where the reward is associated with reaching a certain state) or when the state depends on the reward (common in domains where the state keeps track of past rewards). In this book, however, unless otherwise noted we make the simplifying assumption that the reward and next state are independent given $X_t$ and $A_t$, and separate the transition dynamics into a reward distribution and transition kernel:
$$R_t \sim P_R(\cdot \mid X_t, A_t)\,, \qquad X_{t+1} \sim P_X(\cdot \mid X_t, A_t)\,.$$
A Markov decision process is a tuple $(\mathcal{X}, \mathcal{A}, \xi_0, P_X, P_R)$ that contains all the information needed to describe how the agent's decisions influence its environment. These decisions are not themselves part of the model, but instead arise from a policy. A policy is a mapping $\pi : \mathcal{X} \to \mathscr{P}(\mathcal{A})$ from states to probability distributions over actions such that
$$A_t \sim \pi(\cdot \mid X_t)\,.$$
Such policies choose the action $A_t$ solely on the basis of the immediately preceding state $X_t$ and possibly a random draw. Technically, these are a special subset of decision-making rules that are both stationary (they do not depend on the time $t$ at which the decision is to be taken, except through the state $X_t$) and Markov (they do not depend on events prior to time $t$). Stationary Markov policies will be enough for us for most of this book, but we will study more general policies in Chapter 7.
5. An alternative convention is to denote the reward that follows action $A_t$ by $R_{t+1}$. Here we prefer $R_t$ to emphasise the association between action and reward.
Finally, a state $x_\star$ from which no other states can be reached is called absorbing or terminal. For all actions $a \in \mathcal{A}$, its next-state distribution is
$$P_X(x_\star \mid x_\star, a) = 1\,.$$
Terminal states correspond to situations in which further interactions are irrelevant: once a game of chess is won by one of the players, for example, or once a robot has successfully accomplished a desired task.
2.3 The Pinball Model
The game of American pinball provides a useful metaphor for how the various
pieces of a Markov decision process come together to describe real systems.
A classic American pinball machine consists of a slanted, glass-enclosed play
area filled with bumpers of various shapes and sizes. The player initiates the
game by using a retractable spring to launch a metal ball into the play area, a
process that can be likened to sampling from the initial state distribution. The
metal ball progresses through the play area by bouncing off the various bumpers
(the transition function), which reward the player with a variable number of
points (the reward function). The game ends when the ball escapes through a
gap at the bottom of the play area, to which it is drawn by gravity (the terminal
state). The player can prevent this fate by controlling the ball’s course with a
pair of flippers on either side of the gap (the action space). Good players also
use the flippers to aim the ball towards the most valuable bumpers or other,
special high-scoring zones, and may even physically shake the pinball cabinet
(called nudging) to exert additional control. The game’s state space describes
possible arrangements of the machine’s different moving parts, including the
ball’s location.
Turning things around, we may think of any Markov decision process as an
abstract pinball machine. Initiated by the equivalent of inserting the traditional
quarter into the machine, we call a single play through the Markov decision
process a trajectory, beginning from the random initial state and lasting until a
terminal state is reached. This trajectory is the sequence $X_0, A_0, R_0, X_1, A_1, \dots$ of random interleaved states, actions, and rewards. We use the notation $(X_t, A_t, R_t)_{t \geq 0}$ to express this sequence compactly.
The various elements of the trajectory depend on each other according to the rules set by the Markov decision process. These rules can be summarised by a collection of generative equations, which tell us how we might write a program for sampling a trajectory one variable at a time. For a time step $t \in \mathbb{N}$, let us denote by $X_{0:t}$, $A_{0:t}$, and $R_{0:t}$ the subsequences $(X_0, X_1, \dots, X_t)$, $(A_0, A_1, \dots, A_t)$,
and $(R_0, R_1, \dots, R_t)$, respectively. The generative equations are
$$\begin{aligned}
X_0 &\sim \xi_0\,; \\
A_t \mid (X_{0:t}, A_{0:t-1}, R_{0:t-1}) &\sim \pi(\cdot \mid X_t)\,, \quad \text{for all } t \geq 0; \\
R_t \mid (X_{0:t}, A_{0:t}, R_{0:t-1}) &\sim P_R(\cdot \mid X_t, A_t)\,, \quad \text{for all } t \geq 0; \\
X_{t+1} \mid (X_{0:t}, A_{0:t}, R_{0:t}) &\sim P_X(\cdot \mid X_t, A_t)\,, \quad \text{for all } t \geq 0.
\end{aligned}$$
We use the notation $Y \mid (Z_0, Z_1, \dots)$ to indicate the basic dependency structure between these variables. The equation for $A_t$, for example, is to be interpreted as
$$\mathbb{P}\big(A_t = a \mid X_0, A_0, R_0, \dots, X_{t-1}, A_{t-1}, R_{t-1}, X_t\big) = \pi(a \mid X_t)\,.$$
For $t = 0$, the notation $A_{0:t-1}$ and $R_{0:t-1}$ denotes the empty sequence. Because the policy fixes the "decision" part of the Markov decision process formalism, the trajectory can be viewed as a Markov chain over the space $\mathcal{X} \times \mathcal{A} \times \mathbb{R}$. This model is sometimes called a Markov reward process.
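Since the generative equations literally describe a sampling program, it may help to see one written out. The sketch below is a minimal Python implementation under assumed data structures: `initial_dist`, `policy`, `reward_dist`, and `transition` are hypothetical sampling routines standing in for $\xi_0$, $\pi$, $P_R$, and $P_X$, and `terminal` is an assumed set of terminal states; none of these names come from the text.

```python
import random

def sample_trajectory(initial_dist, policy, reward_dist, transition,
                      terminal=frozenset(), max_steps=1000):
    """Sample (X_t, A_t, R_t) following the generative equations of Section 2.3.

    initial_dist() -> x0;  policy(x) -> a;  reward_dist(x, a) -> r;
    transition(x, a) -> x'.  All four are assumed sampling routines.
    """
    x = initial_dist()                 # X_0 ~ xi_0
    trajectory = []
    for _ in range(max_steps):
        if x in terminal:              # stop once a terminal state is reached
            break
        a = policy(x)                  # A_t ~ pi(. | X_t)
        r = reward_dist(x, a)          # R_t ~ P_R(. | X_t, A_t)
        x_next = transition(x, a)      # X_{t+1} ~ P_X(. | X_t, A_t)
        trajectory.append((x, a, r))
        x = x_next
    return trajectory

# Toy usage: a single-state environment with Bernoulli(1/2) rewards.
traj = sample_trajectory(
    initial_dist=lambda: "x",
    policy=lambda x: "a",
    reward_dist=lambda x, a: random.randint(0, 1),
    transition=lambda x, a: "x",
    max_steps=10,
)
print(traj)
```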
By convention, terminal states yield no reward. In these situations, it is sensible to end the sequence at the time $T \in \mathbb{N}$ at which a terminal state is first encountered. It is also common to notify the agent that a terminal state has been reached. In other cases (such as Example 2.4 below) the sequence might (theoretically) go on forever.
We use the notion of a joint distribution to formally ask questions (and give answers) about the random trajectory. This is the joint distribution over all random variables involved, denoted $\mathbb{P}_\pi(\cdot)$. For example, the probability that the agent begins in state $x$, and finds itself in that state again after $t$ time steps when acting according to a policy $\pi$, can be mathematically expressed as $\mathbb{P}_\pi(X_0 = x, X_t = x)$. Similarly, the probability that a positive reward will be received at some point in time is
$$\mathbb{P}_\pi(\text{there exists } t \geq 0 \text{ such that } R_t > 0)\,.$$
The explicit policy subscript is a convention to emphasise that the agent's choices affect the distribution of outcomes. It also lets us distinguish statements about random variables derived from the random trajectory from statements about other, arbitrary random variables.
The joint distribution $\mathbb{P}_\pi$ gives rise to the expectation $\mathbb{E}_\pi$ over real-valued random variables. This allows us to write statements such as
$$\mathbb{E}_\pi\big[2 R_0 + \mathbb{1}\{X_1 = X_0\}\, R_1\big]\,.$$
Remark 2.1 provides additional technical details on how this expectation can be constructed from $\mathbb{P}_\pi$.
Example 2.4. The martingale is a betting strategy popular in the 18th century and based on the principle of doubling one's ante until a profit is made. This strategy is formalised as a Markov decision process where the (infinite) state space $\mathcal{X} = \mathbb{Z}$ is the gambler's loss thus far, with negative losses denoting gains. The action space $\mathcal{A} = \mathbb{N}$ is the gambler's bet.⁶ If the game is fair, then for each state $x$ and each action $a$,
$$P_X(x + a \mid x, a) = P_X(x - a \mid x, a) = \tfrac{1}{2}\,.$$
Placing no bet corresponds to $a = 0$. With $X_0 = 0$, the martingale policy is $A_0 = 1$ and for $t > 0$,
$$A_t = \begin{cases} X_t + 1 & X_t > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Formally speaking, the policy maps each loss $x > 0$ to $\delta_{x+1}$, the Dirac distribution at $x + 1$, and all negative states (gains) to $\delta_0$. Simple algebra shows that for $t > 0$,
$$X_t = \begin{cases} 2^t - 1 & \text{with probability } 2^{-t}, \\ -1 & \text{with probability } 1 - 2^{-t}. \end{cases}$$
That is, the gambler is assured to eventually make a profit (since $2^{-t} \to 0$), but arbitrary losses may be incurred before a positive gain is made. Calculations show that the martingale strategy has nil expected gain ($\mathbb{E}[X_t] = 0$ for $t \geq 0$), while the variance of the loss grows unboundedly with the number of rounds played ($\mathbb{E}[X_t^2] \to \infty$); as a result it is frowned upon in many gambling establishments. ∎
The notation $\mathbb{P}_\pi$ makes clear the dependence of the distribution of the random trajectory $(X_t, A_t, R_t)_{t \geq 0}$ on the agent's policy $\pi$. Often, we find it also useful to view the initial state distribution $\xi_0$ as a parameter to be reasoned explicitly about. In this case, it makes sense to more explicitly denote the joint distribution by $\mathbb{P}_{\xi_0, \pi}$. The most common situation is when we consider the same Markov decision process, but initialised at a specific starting state $x$. In this case, the distribution becomes $\xi_0 = \delta_x$, meaning that $\mathbb{P}_{\xi_0, \pi}(X_0 = x) = 1$. We then use the fairly standard shorthand
$$\mathbb{P}_\pi(\cdot \mid X_0 = x) = \mathbb{P}_{x, \pi}(\cdot)\,, \qquad \mathbb{E}_\pi[\cdot \mid X_0 = x] = \mathbb{E}_{x, \pi}[\cdot]\,.$$
Technically, this use of the conditioning bar is an abuse of notation. We are directly modifying the probability distributions of these random variables, rather than conditioning on an event as the notation normally signifies.⁷ However,
6. We are ignoring cash flow issues here.
7. If we attempt to use the actual conditional probability distribution instead, we find that it is ill-defined when $\xi_0(x) = 0$. See Exercise 2.2.
this notation is convenient and common throughout the reinforcement learning literature, and we therefore adopt it as well. It is also convenient to modify the distribution of the first action $A_0$, so that rather than this random variable being sampled from $\pi(\cdot \mid X_0)$ it is fixed at some action $a \in \mathcal{A}$. We use similar notation as above to signify the resulting distribution over trajectories, and corresponding expectations:
$$\mathbb{P}_\pi(\cdot \mid X_0 = x, A_0 = a)\,, \qquad \mathbb{E}_\pi[\cdot \mid X_0 = x, A_0 = a]\,.$$
2.4 The Return
Given a discount factor $\gamma \in [0, 1)$, the discounted return⁸ (or simply the return) is the sum of rewards received by the agent from the initial state onwards, discounted according to their time of occurrence:
$$G = \sum_{t=0}^{\infty} \gamma^t R_t\,. \tag{2.2}$$
The return is a sum of scaled, real-valued random variables, and is therefore itself a random variable. In reinforcement learning, the success of an agent's decisions is measured in terms of the return that it achieves: greater returns are better. Because it is the measure of success, the return is the fundamental object of reinforcement learning.
The discount factor encodes a preference for receiving rewards sooner rather than later. In settings that lack a terminal state, the discount factor is also used to guarantee that the return $G$ exists and is finite. This is easily seen when rewards are bounded on the interval $[R_{\text{MIN}}, R_{\text{MAX}}]$, in which case we have
$$G \in \Big[\frac{R_{\text{MIN}}}{1 - \gamma}, \frac{R_{\text{MAX}}}{1 - \gamma}\Big]\,. \tag{2.3}$$
Throughout this book we will often write $V_{\text{MIN}}$ and $V_{\text{MAX}}$ for the endpoints of the interval above, denoting the smallest and largest return obtainable when rewards are bounded on $[R_{\text{MIN}}, R_{\text{MAX}}]$. When rewards are not bounded, the existence of $G$ is still guaranteed under the mild assumption that rewards have finite first moment; we therefore adopt this assumption throughout the book.

Assumption 2.5. For each state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$, the reward distribution $P_R(\cdot \mid x, a)$ has finite first moment. That is, if $R \sim P_R(\cdot \mid x, a)$ then $\mathbb{E}\big[|R|\big] < \infty$. ∎
8. In parts of this book, we also consider the undiscounted return $\sum_{t=0}^{\infty} R_t$. We sometimes write $\gamma = 1$ to denote this return, indicating a change from the usual setting. One should be mindful that we then need to make sure that the sum of rewards converges.
Proposition 2.6. Under Assumption 2.5, the random return $G$ exists and is finite with probability 1, in the sense that
$$\mathbb{P}_\pi\big(G \in (-\infty, \infty)\big) = 1\,. $$
∎

In the usual treatment of reinforcement learning, Assumption 2.5 is taken for granted (there is little to be predicted otherwise). We still find Proposition 2.6 useful as it establishes that the assumption is sufficient to guarantee the finiteness of $G$ and hence the existence of its probability distribution. The phrase "with probability 1" allows for the fact that in some realisations of the random trajectory, $G$ is infinite (for example, if the rewards are normally distributed) but the probability of observing such a realisation is nil. Remark 2.2 gives more details on this topic, along with a proof of Proposition 2.6; see also Exercise 2.17.
We call the probability distribution of $G$ the return distribution. The return distribution determines quantities such as the expected return
$$\mathbb{E}_\pi[G] = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t\Big]\,,$$
which plays a central role in reinforcement learning, the variance of the return
$$\mathrm{Var}_\pi(G) = \mathbb{E}_\pi\big[(G - \mathbb{E}_\pi[G])^2\big]\,,$$
and tail probabilities such as
$$\mathbb{P}_\pi\Big(\sum_{t=0}^{\infty} \gamma^t R_t \geq 0\Big)\,,$$
which arise in risk-sensitive control problems (discussed in Chapter 7). As the following examples illustrate, the probability distributions of random returns vary from simple to intricate, according to the complexity of interactions between the agent and its environment.
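All three quantities above can be estimated from sampled trajectories. The sketch below is a minimal Monte Carlo estimator, not a method prescribed by the text: `sample_rewards` is an assumed routine returning one trajectory's reward sequence (for instance, collected with a trajectory sampler such as the one sketched earlier), and the remaining names are illustrative.

```python
import random
import statistics

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t R_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def monte_carlo_return_stats(sample_rewards, gamma, n_trajectories=10_000):
    """Estimate E[G], Var(G) and P(G >= 0) from sampled reward sequences."""
    returns = [discounted_return(sample_rewards(), gamma)
               for _ in range(n_trajectories)]
    return {
        "mean": statistics.fmean(returns),
        "variance": statistics.pvariance(returns),
        "tail_prob": sum(g >= 0 for g in returns) / len(returns),
    }

# Toy illustration: i.i.d. rewards in {-1, +1}, gamma = 0.9, episodes of length 50.
stats = monte_carlo_return_stats(
    sample_rewards=lambda: [random.choice([-1, 1]) for _ in range(50)],
    gamma=0.9,
)
print(stats)
```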
Example 2.7 (Blackjack). The card game of blackjack is won by drawing cards whose total value is greater than that of a dealer, who plays according to a known, fixed set of rules. Face cards count for 10, while aces may be counted as either 1 or 11, to the player's preference. The game begins with the player receiving two cards, which they can complement by hitting, i.e. requesting another card. The game is lost immediately if the player's total goes over 21, called going bust. A player satisfied with their cards may instead stick, at which
point the dealer adds cards to their own hand, until a total of 17 or more is reached.

If the player's total is greater than the dealer's or the dealer goes bust, they are paid out their ante; otherwise, the ante is lost. In the event of a tie, the player keeps the ante. We formalise this as receiving a reward of 1, −1, or 0 when play concludes. A blackjack occurs when the player's initial cards are an ace and a 10-valued card, which most casinos will pay 3 to 2 (a reward of $3/2$) provided the dealer's cards sum to less than 21. After seeing their initial two cards the player may also choose to double down and receive exactly one additional card. In this case, wins and losses are doubled (2 or −2 reward).

Since the game terminates in a finite number of steps $T$ and the objective is to maximise one's profit, let us equate the return $G$ with the payoff from playing one hand. The return takes on values from the set
$$\{-2, -1, 0, 1, 3/2, 2\}\,.$$
Let us denote the cards dealt to the player by $C_0, C_1, \dots, C_T$, the sum of these cards by $Y_p$, and the dealer's card sum by $Y_d$. With this notation we can handle the ace's two possible values by adding 10 to $Y_p$ or $Y_d$ when at least one ace was drawn, and the sum is 11 or less. We define $Y_p = 0$ and $Y_d = 0$ when the player or dealer goes bust, respectively. Consider a player who doubles when dealt cards whose total is 11. The probability distribution of the return is
$$\begin{aligned}
\mathbb{P}_\pi(G = 3/2) &\overset{(a)}{=} 2\,\mathbb{P}_\pi(C_0 = 1, C_1 = 10, Y_d \neq 21) \\
\mathbb{P}_\pi(G = -2) &\overset{(b)}{=} \mathbb{P}_\pi(C_0 + C_1 = 11, C_0 \neq 1, C_1 \neq 1, C_0 + C_1 + C_2 < Y_d) \\
\mathbb{P}_\pi(G = 2) &\overset{(c)}{=} \mathbb{P}_\pi(C_0 + C_1 = 11, C_0 \neq 1, C_1 \neq 1, C_0 + C_1 + C_2 > Y_d) \\
\mathbb{P}_\pi(G = -1) &= \mathbb{P}_\pi(Y_p < Y_d) - \mathbb{P}_\pi(G = -2) \\
\mathbb{P}_\pi(G = 0) &= \mathbb{P}_\pi(Y_p = Y_d) \\
\mathbb{P}_\pi(G = 1) &= \mathbb{P}_\pi(Y_p > Y_d) - \mathbb{P}_\pi(G = 3/2) - \mathbb{P}_\pi(G = 2)\,;
\end{aligned}$$
(a) implements the blackjack rule (noting that either $C_0$ or $C_1$ can be an ace) while (b) and (c) handle doubling (when there is no blackjack). For example, if we assume that cards are drawn with replacement, then $\mathbb{P}(Y_d = 21) \approx 0.12$ and
$$\mathbb{P}_\pi(G = 3/2) = 2\big(1 - \mathbb{P}_\pi(Y_d = 21)\big)\,\frac{1}{13}\cdot\frac{4}{13} \approx 0.042\,,$$
since there are 13 card types to draw from, one of which is an ace and four of which have value 10.

Computing the probabilities for other outcomes requires specifying the player's policy in full (when do they hit?). Were we to do this, we could then estimate these probabilities from simulation, or, more tediously, calculate them by hand. ∎
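As a small sanity check of the blackjack calculation, the sketch below reproduces the $\approx 0.042$ figure numerically, under the example's simplifying assumption that cards are drawn with replacement from 13 equally likely ranks; the variable names are illustrative.

```python
# Rank probabilities when drawing with replacement from 13 equally likely ranks:
# an ace has probability 1/13, a ten-valued card (10, J, Q, K) probability 4/13.
p_ace, p_ten = 1 / 13, 4 / 13

# The example quotes P(Y_d = 21) ~= 0.12 for the dealer's play-out; we reuse that figure.
p_dealer_21 = 0.12

# P(G = 3/2) = 2 * P(C_0 = ace, C_1 = ten-valued, Y_d != 21)
p_blackjack_payout = 2 * p_ace * p_ten * (1 - p_dealer_21)
print(f"P(G = 3/2) ~= {p_blackjack_payout:.3f}")  # prints ~0.042
```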
Example 2.8. Consider a simple solitaire game which involves repeatedly throwing a single six-sided die. If a 1 is rolled, the game ends immediately. Otherwise, the player receives a point and continues to play. Consider the undiscounted return
$$\sum_{t=0}^{\infty} R_t = \underbrace{1 + 1 + \cdots + 1}_{T \text{ times}}\,,$$
where $T$ is the time at which the terminating 1 is rolled. This is an integer-valued return ranging from 0 to $\infty$. It has the geometric distribution
$$\mathbb{P}_\pi\Big(\sum_{t=0}^{\infty} R_t = k\Big) = \frac{1}{6}\Big(\frac{5}{6}\Big)^k\,, \quad k \in \mathbb{N}\,,$$
corresponding to the probability of seeing $k$ successes before the first failure (rolling a 1). Choosing a discount factor less than 1 (perhaps modelling the player's increasing boredom) changes the support of this distribution, but not the associated probabilities. The partial sums of the geometric series correspond to
$$\sum_{t=0}^{k} \gamma^t = \frac{1 - \gamma^{k+1}}{1 - \gamma}$$
and it follows that for $k \geq 0$,
$$\mathbb{P}_\pi\Big(\sum_{t=0}^{\infty} \gamma^t R_t = \frac{1 - \gamma^k}{1 - \gamma}\Big) = \frac{1}{6}\Big(\frac{5}{6}\Big)^k\,. $$
∎
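The geometric form of this return distribution is easy to confirm by simulation. The sketch below plays the die game many times and compares empirical frequencies of the undiscounted return against $\frac{1}{6}(\frac{5}{6})^k$; the helper names are, again, illustrative only.

```python
import random
from collections import Counter

def play_die_game(rng):
    """Return the number of points scored before the first 1 is rolled."""
    points = 0
    while rng.randint(1, 6) != 1:
        points += 1
    return points

rng = random.Random(0)
n_games = 200_000
counts = Counter(play_die_game(rng) for _ in range(n_games))

for k in range(6):
    empirical = counts[k] / n_games
    exact = (1 / 6) * (5 / 6) ** k
    print(f"k={k}  empirical={empirical:.4f}  exact={exact:.4f}")
```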
When a closed form solution is tedious or difficult to obtain, we can sometimes
still estimate the return distribution from simulations. This is illustrated in the
next example.
Example 2.9. The Cliffs of Moher, located in County Clare, Ireland, are famous for both their stunning views and the strong winds blowing from the Atlantic Ocean. Inspired by the scenic walk from the nearby village of Doolin to the Cliffs' tourist centre, we consider an abstract cliff environment (Figure 2.2)
Figure 2.2
Left: The Cliffs environment, along with the path preferred by the quick and safe policies. Right: The return distribution for these policies, estimated by sampling 100,000 trajectories from the environment. The figure reports the distribution as a probability density function computed using kernel density estimation (with bandwidth 0.02).
based on a classic domain from the reinforcement learning literature [Sutton and Barto, 2018].⁹

The walk begins in the cell 'S' (Doolin village), and ends with a positive reward (+1) when the 'G' cell (tourist centre) is reached. Four actions are available, corresponding to each of the cardinal directions; however, at each step there is a probability $p = 0.25$ that the strong winds take the agent in an unintended (random) direction. For simplicity, the edges of the grid act as walls. We model falling off the cliff using a negative reward of −1, at which point the episode also terminates. Reaching the tourist centre yields a reward of +1 and also ends the episode. The reward is zero elsewhere. A discount factor of $\gamma = 0.99$ incentivises the agent to not dally too much along the way.

Figure 2.2 depicts the return distribution for two policies: a quick policy that walks along the cliff's edge, and a safe policy that walks two cells away from the edge.¹⁰ The corresponding returns are bimodal, reflecting the two possible classes of outcomes. The return distribution of the faster policy, in particular, is sharply peaked around −1 and 1: the goal may be reached quickly, but the agent is more likely to fall. ∎
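The return distributions in Figure 2.2 are estimated purely from samples. A generic version of that procedure is sketched below under stated assumptions: `sample_rewards` is a hypothetical routine standing in for rolling out the quick or safe policy in the Cliffs environment, and the histogram is an empirical stand-in for the kernel density estimate used in the figure.

```python
import numpy as np

def empirical_return_distribution(sample_rewards, gamma,
                                  n_trajectories=100_000, bins=50):
    """Estimate a return distribution by Monte Carlo.

    sample_rewards() is an assumed routine returning one trajectory's rewards.
    """
    returns = np.empty(n_trajectories)
    for i in range(n_trajectories):
        rewards = np.asarray(sample_rewards())
        returns[i] = np.sum(rewards * gamma ** np.arange(len(rewards)))
    # Normalised histogram of sampled returns.
    density, edges = np.histogram(returns, bins=bins, density=True)
    return returns, density, edges

# Usage (with a hypothetical rollout routine for the quick policy):
# returns, density, edges = empirical_return_distribution(quick_policy_rollout, gamma=0.99)
```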
The next two examples show how even simple dynamics that one might rea-
sonably encounter in real scenarios can result in return distributions that are
markedly different from the reward distributions.
9. There are a few differences which we invite the reader to discover.
10. More precisely, the safe policy always attempts to move up when it is possible to do so, unless it is in one of the rightmost cells. The quick policy simply moves right until one of these cells is reached; afterwards it goes down towards the 'G' cell.
Figure 2.3
Left: Illustration of Example 2.10, showing cumulative distribution functions (CDFs) for the truncated return $G_{0:T} = \sum_{t=0}^{T} \big(\tfrac{1}{2}\big)^t R_t$ as $T$ increases. The number of steps in this function doubles with each increment of $T$, reaching the uniform distribution in the limit as $T \to \infty$. Right: The same illustration, now for Example 2.11.
Example 2.10. Consider a single-state, single-action Markov decision process, so that $\mathcal{X} = \{x\}$ and $\mathcal{A} = \{a\}$. The initial distribution is $\xi_0 = \delta_x$ and the transition kernel is $P_X(x \mid x, a) = 1$. The reward has a Bernoulli distribution, with
$$P_R(0 \mid x, a) = P_R(1 \mid x, a) = 1/2\,.$$
Suppose we take the discount factor to be $\gamma = 1/2$. The return is
$$G = R_0 + \tfrac{1}{2} R_1 + \tfrac{1}{4} R_2 + \cdots\,. \tag{2.4}$$
We will show that $G$ has a uniform probability distribution over the interval $[0, 2]$. Observe that the right-hand side of Equation 2.4 is equivalent to the binary number
$$R_0 . R_1 R_2 \dots \tag{2.5}$$
Since $R_t \in \{0, 1\}$ for all $t$, this implies that the support of $G$ is the set of numbers that have a (possibly infinite) binary expansion as in Equation 2.5. The smallest such number is clearly 0, while the largest number is 2 (the infinite sequence of 1's). Now
$$\mathbb{P}_\pi(G \in [0, 1]) = \mathbb{P}_\pi(R_0 = 0)\,, \qquad \mathbb{P}_\pi(G \in [1, 2]) = \mathbb{P}_\pi(R_0 = 1)\,,$$
so that it is equally likely that $G$ falls either in the interval $[0, 1]$ or $[1, 2]$.¹¹ If we now subdivide the lower interval, we find that
$$\begin{aligned}
\mathbb{P}_\pi(G \in [0.5, 1]) &= \mathbb{P}_\pi(G \in [0, 1])\,\mathbb{P}_\pi(G \geq 0.5 \mid G \in [0, 1]) \\
&= \mathbb{P}_\pi(R_0 = 0)\,\mathbb{P}_\pi(R_1 = 1 \mid R_0 = 0) \\
&= \mathbb{P}_\pi(R_0 = 0)\,\mathbb{P}_\pi(R_1 = 1)\,;
\end{aligned}$$
11. The probability that G = 1 is zero.
and analogously for the intervals $[0, 0.5]$, $[1, 1.5]$, and $[1.5, 2]$. We can repeat this subdivision recursively to find that any dyadic interval $[a, b]$ whose endpoints belong to the set
$$Y = \Big\{\sum_{j=0}^{n} \frac{1}{2^j}\, a_j : n \in \mathbb{N},\ a_j \in \{0, 1\} \text{ for } 0 \leq j \leq n\Big\}$$
has probability $\frac{b-a}{2}$, i.e.
$$\mathbb{P}_\pi\big(G \in [a, b]\big) = \frac{b - a}{2} \quad \text{for } a, b \in Y,\ a < b\,. \tag{2.6}$$
Because the uniform distribution over the interval $[0, 2]$ satisfies Equation 2.6, we conclude that $G$ is distributed uniformly on $[0, 2]$. A formal argument requires us to demonstrate that Equation 2.6 uniquely determines the cumulative distribution function of $G$ (Exercise 2.6); this argument is illustrated (informally) in Figure 2.3. ∎
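A quick simulation makes the uniform limit plausible. The sketch below (an illustrative check, with assumed truncation level and sample size) samples truncated returns $G_{0:T} = \sum_{t=0}^{T}(1/2)^t R_t$ with Bernoulli(1/2) rewards and compares empirical probabilities of a few intervals against $(b-a)/2$, in line with Equation 2.6 and the left panel of Figure 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_samples = 30, 200_000                              # illustrative values

rewards = rng.integers(0, 2, size=(n_samples, T + 1))   # R_t ~ Bernoulli(1/2)
discounts = 0.5 ** np.arange(T + 1)                     # gamma^t with gamma = 1/2
returns = rewards @ discounts                           # truncated return for each sample

for a, b in [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (0.25, 1.75)]:
    empirical = np.mean((returns >= a) & (returns <= b))
    print(f"P(G in [{a}, {b}]) ~= {empirical:.3f}   (uniform: {(b - a) / 2:.3f})")
```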
Example 2.11 (*). If we substitute the reward distribution of the previous example by
$$P_R(0 \mid x, a) = P_R(2/3 \mid x, a) = 1/2$$
and take $\gamma = 1/3$, the return distribution becomes the Cantor distribution (see Exercise 2.7). The Cantor distribution has no atoms (values with probability greater than zero) nor a probability density. Its cumulative distribution function is the Cantor function (see Figure 2.3), famous for violating many of our intuitions about mathematical analysis. ∎
2.5 The Bellman Equation
The cliff-walking scenario (Example 2.9) shows how different policies can lead
to qualitatively different return distributions: one where a positive reward for
reaching the goal is likely, and one where high rewards are likely, but where
there is also a substantial chance of a low reward (due to a fall). Which should be
preferred? In reinforcement learning, the canonical way to answer this question
is to reduce return distributions to scalar values, which can be directly compared.
More precisely, we measure the quality of a policy by the expected value of its
random return,
12
or simply expected return:
E
1
t=0
t
R
t
. (2.7)
12. Our assumption that the rewards have finite first moment (Assumption 2.5) guarantees that the expectation in Expression 2.7 is finite.
Being able to determine the expected return of a given policy is thus central to most reinforcement learning algorithms. A straightforward approach is to enumerate all possible realisations of the random trajectory $(X_t, R_t, A_t)_{t \geq 0}$ up to length $T \in \mathbb{N}$. By weighting them according to their probability, we obtain the approximation
$$\mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} \gamma^t R_t\Big]\,. \tag{2.8}$$
However, even for reasonably small $T$ this is problematic, because the number of partial trajectories of length $T \in \mathbb{N}$ may grow exponentially with $T$.¹³ Even for problems as small as cliff-walking, enumeration quickly becomes impractical. The solution lies in the Bellman equation, which provides a concise characterisation of the expected return under a given policy.
To begin, consider the expected return for a policy $\pi$ from an initial state $x$. This is called the value of $x$, written
$$V^\pi(x) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x\Big]\,. \tag{2.9}$$
The value function $V^\pi$ describes the value at all states. As the name implies, it is formally a mapping from a state to its expected return under policy $\pi$. The value function lets us answer counterfactual questions ("how well would the agent do from this state?"), and also allows us to determine the expected return in Equation 2.7, since (by the generative equations)
$$\mathbb{E}_\pi\big[V^\pi(X_0)\big] = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t\Big]\,.$$
By linearity of expectations, the expected return can be decomposed into an immediate reward $R_0$, which depends on the initial state $X_0$, and the sum of future rewards:
$$\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t\Big] = \mathbb{E}_\pi\Big[R_0 + \underbrace{\sum_{t=1}^{\infty} \gamma^t R_t}_{\text{future rewards}}\Big]\,.$$
The Bellman equation expresses this relationship in terms of the value of different states; we give its proof in the next section.
13. If rewards are drawn from a continuous distribution, there is in fact an infinite number of possible sequences of length $T$.
Proposition 2.12 (The Bellman equation). Let $V^\pi$ be the value function of policy $\pi$. Then for any state $x \in \mathcal{X}$, it holds that
$$V^\pi(x) = \mathbb{E}_\pi\big[R_0 + \gamma V^\pi(X_1) \mid X_0 = x\big]\,. \tag{2.10}$$
∎

The Bellman equation transforms an infinite sum (Equation 2.7) into a recursive relationship between the value of a state $x$, its expected reward, and the value of its successor states. This makes it possible to devise efficient algorithms for determining the value function $V^\pi$, as we shall see later.
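As a preview of the kind of algorithm that Equation 2.10 enables (such algorithms are developed properly in later chapters), the sketch below repeatedly applies the Bellman equation to a table of value estimates until they stop changing. The lookup tables `policy`, `transition_probs`, and `expected_reward` are assumed inputs describing $\pi$, $P_X$, and the mean of $P_R$; the names are illustrative, not the book's.

```python
def evaluate_policy(states, actions, policy, transition_probs, expected_reward,
                    gamma, tol=1e-8):
    """Iteratively apply V(x) = E[R_0 + gamma V(X_1) | X_0 = x] until convergence.

    policy[x][a] = pi(a | x); transition_probs[x][a][y] = P_X(y | x, a);
    expected_reward[x][a] = E[R | x, a].  All are assumed lookup tables.
    """
    values = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            new_value = sum(
                policy[x][a] * (expected_reward[x][a]
                                + gamma * sum(transition_probs[x][a][y] * values[y]
                                              for y in states))
                for a in actions
            )
            delta = max(delta, abs(new_value - values[x]))
            values[x] = new_value
        if delta < tol:
            return values
```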
An alternative to the value function is the state-action value function $Q^\pi$. The state-action value function describes the expected return when the action $A_0$ is fixed to a particular choice $a$:
$$Q^\pi(x, a) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x, A_0 = a\Big]\,.$$
Throughout this book we will see a few situations for which this form is preferred, either because it simplifies the exposition or makes certain algorithms possible. Unless otherwise noted, the results we will present hold for both value functions and state-action value functions.
2.6 Properties of the Random Trajectory
The Bellman equation is made possible by two fundamental properties of the Markov decision process, time homogeneity and the Markov property. More precisely, these are properties of the random trajectory $(X_t, R_t, A_t)_{t \geq 0}$, whose probability distribution is described by the generative equations of Section 2.3.

To understand the distribution of the random trajectory $(X_t, A_t, R_t)_{t \geq 0}$, it is helpful to depict its possible outcomes as an infinite tree (Figure 2.4). The root of the tree is the initial state $X_0$, and each level of the tree consists of a state-action-reward triple, drawn according to the generative equations. Each branch of the tree then corresponds to a realisation of the trajectory. Most of the quantities that we are interested in can be extracted by "slicing" this tree in particular ways. For example, the random return corresponds to the sum of the rewards along each branch, discounted according to their level.

In order to relate the probability distributions of different parts of the random trajectory, let us introduce the notation $\mathcal{D}(Z)$ to denote the probability distribution
Figure 2.4
(a) An example Markov decision process with three states $x, y, z$, and a terminal state denoted by a double circle. Action $a$ gives a reward of 0, or 1 depending on the state, while action $b$ gives a reward of 0 in state $y$ and 2 in state $z$. In state $x$, action $b$ gives a random reward of either 0 or 2 with equal probability. Arrows indicate the (deterministic) transition kernel from each state. (b) An illustration of the Markov property (in blue): the subsequences rooted in state $X_1 = z$ have the same probability distribution; and the time-homogeneity property (in pink): the subsequence rooted at $X_2 = x$ has the same probability distribution as the full sequence beginning in $X_0 = x$.
of a random variable $Z$. When $Z$ is real-valued we have, for $S \subseteq \mathbb{R}$,
$$\mathcal{D}(Z)(S) = \mathbb{P}(Z \in S)\,.$$
One advantage of the $\mathcal{D}(\cdot)$ notation is that it often avoids unnecessarily introducing (and formally characterising) such a subset $S$, and is more easily extended to other kinds of random variables. Importantly, we write $\mathcal{D}_\pi$ to refer to the distribution of random variables derived from the joint distribution $\mathbb{P}_\pi$.¹⁴ For example, $\mathcal{D}_\pi(R_0 + R_1)$ and $\mathcal{D}_\pi(X_2)$ are the probability distribution of the sum $R_0 + R_1$ and of the third state in the random trajectory, respectively.
The Markov property states that the trajectory from a given state $x$ is independent of the states, actions and rewards that were encountered prior to $x$. Graphically, this means that for a given level of the tree $k$, the identity of the state $X_k$ suffices to determine the probability distribution of the subtree rooted at that state.
14. Because the random trajectory is infinite, the technical definition of $\mathcal{D}_\pi$ requires some care and is given in Remark 2.4.
Lemma 2.13 (Markov property). The trajectory $(X_t, A_t, R_t)_{t \geq 0}$ has the Markov property. That is, for any $k \in \mathbb{N}$, we have
$$\mathcal{D}_\pi\Big((X_t, A_t, R_t)_{t=k}^{\infty} \,\Big|\, (X_t, A_t, R_t)_{t=0}^{k-1} = (x_t, a_t, r_t)_{t=0}^{k-1},\ X_k = x\Big) = \mathcal{D}_\pi\Big((X_t, A_t, R_t)_{t=k}^{\infty} \,\Big|\, X_k = x\Big)$$
whenever the conditional distribution of the left-hand side is defined. ∎
As a concrete example, Figure 2.4 shows that there are two realisations of the trajectory for which $X_1 = z$. By the Markov property, the expected return from this point on is independent of whether the immediately preceding reward ($R_0$) was 0 or 2.
One consequence of the Markov property is that the expectation of the discounted return from time $k$ is independent of the trajectory prior to time $k$, given $X_k$:
$$\mathbb{E}_\pi\Big[\sum_{t=k}^{\infty} \gamma^t R_t \,\Big|\, (X_t, A_t, R_t)_{t=0}^{k-1} = (x_t, a_t, r_t)_{t=0}^{k-1},\ X_k = x\Big] = \mathbb{E}_\pi\Big[\sum_{t=k}^{\infty} \gamma^t R_t \,\Big|\, X_k = x\Big]\,.$$
Time homogeneity states that the distribution of the trajectory from a state x
does not depend on the time k at which this state is visited. Graphically, this
means that for a given state x, the probability distribution of the subtree rooted
in that state is the same irrespective of the level k at which it occurs.
Lemma 2.14 (Time-homogeneity). The trajectory $(X_t, A_t, R_t)_{t \geq 0}$ is time-homogeneous, in the sense that for all $k \in \mathbb{N}$,
$$\mathcal{D}_\pi\Big((X_t, A_t, R_t)_{t=k}^{\infty} \,\Big|\, X_k = x\Big) = \mathcal{D}_{x,\pi}\Big((X_t, A_t, R_t)_{t=0}^{\infty}\Big)\,, \tag{2.11}$$
whenever the conditional distribution on the left-hand side is defined. ∎
While the left-hand side of Equation 2.11 is a proper conditional distribution, the right-hand side is derived from $\mathbb{P}_\pi(\cdot \mid X_0 = x)$ and by convention is not a conditional distribution. With this in mind, we write $\mathcal{D}_{x,\pi}$ rather than the shorthand $\mathcal{D}_\pi(\cdot \mid X_0 = x)$.

Lemmas 2.13 and 2.14 follow as consequences of the definition of the trajectory distribution. A formal proof requires some measure-theoretic treatment; we provide a discussion of some of the considerations in Remark 2.4.
Proof of Proposition 2.12 (the Bellman equation). The result follows straightforwardly from Lemmas 2.13 and 2.14. We have
$$\begin{aligned}
V^\pi(x) &= \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x\Big] \\
&\overset{(a)}{=} \mathbb{E}_\pi\big[R_0 \mid X_0 = x\big] + \gamma\,\mathbb{E}_\pi\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\Big|\, X_0 = x\Big] \\
&\overset{(b)}{=} \mathbb{E}_\pi\big[R_0 \mid X_0 = x\big] + \gamma\,\mathbb{E}_\pi\Big[\mathbb{E}_\pi\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\Big|\, X_0 = x, A_0, X_1\Big] \,\Big|\, X_0 = x\Big] \\
&\overset{(c)}{=} \mathbb{E}_\pi\big[R_0 \mid X_0 = x\big] + \gamma\,\mathbb{E}_\pi\Big[\mathbb{E}_\pi\Big[\sum_{t=1}^{\infty} \gamma^{t-1} R_t \,\Big|\, X_1\Big] \,\Big|\, X_0 = x\Big] \\
&\overset{(d)}{=} \mathbb{E}_\pi\big[R_0 \mid X_0 = x\big] + \gamma\,\mathbb{E}_\pi\big[V^\pi(X_1) \mid X_0 = x\big] \\
&= \mathbb{E}_\pi\big[R_0 + \gamma V^\pi(X_1) \mid X_0 = x\big]\,,
\end{aligned}$$
where (a) is due to the linearity of expectations, (b) the law of total expectation, (c) follows by the Markov property, and (d) is due to time-homogeneity and the definition of $V^\pi$.
The Bellman equation states that the expected value of a state can be expressed in terms of the immediate action $A_0$, reward $R_0$, and the successor state $X_1$, omitting the rest of the trajectory. It is therefore convenient to define a generative model that only considers these three random variables along with the initial state $X_0$. Let $\xi \in \mathscr{P}(\mathcal{X})$ be a distribution over states. The sample transition model assigns a probability distribution to the tuple $(X, A, R, X')$ taking values in $\mathcal{X} \times \mathcal{A} \times \mathbb{R} \times \mathcal{X}$ according to
$$\begin{aligned}
X &\sim \xi\,; \\
A \mid X &\sim \pi(\cdot \mid X)\,; \\
R \mid (X, A) &\sim P_R(\cdot \mid X, A)\,; \\
X' \mid (X, A, R) &\sim P_X(\cdot \mid X, A)\,. \tag{2.12}
\end{aligned}$$
We also write $\mathbb{P}_\pi$ for the joint distribution of these random variables. We will often find it useful to consider a single source state $x$, such that as before $\xi = \delta_x$. We write $(X = x, A, R, X')$ for the resulting random tuple, with probability distribution and expectation $\mathbb{P}_\pi(\cdot \mid X = x)$ and $\mathbb{E}_\pi[\cdot \mid X = x]$.
The sample transition model allows us to omit time indices in the Bellman equation, which simplifies to
$$V^\pi(x) = \mathbb{E}_\pi\big[R + \gamma V^\pi(X') \mid X = x\big]\,.$$
In the sample transition model, we call $\xi$ the state distribution. It is generally different from the initial state distribution $\xi_0$, which describes a property of the environment. In Chapters 3 and 6, we will use the state distribution to model part of a learning algorithm's behaviour.
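In code, the sample transition model is just the one-step slice of the trajectory sampler. The sketch below draws a single tuple $(X, A, R, X')$ under an assumed state distribution `state_dist` and the same hypothetical kernels as in the earlier sketch; all function names are illustrative assumptions.

```python
def sample_transition(state_dist, policy, reward_dist, transition):
    """Draw one tuple (X, A, R, X') following Equation 2.12.

    state_dist() -> x;  policy(x) -> a;  reward_dist(x, a) -> r;
    transition(x, a) -> x'.  All four are assumed sampling routines.
    """
    x = state_dist()           # X ~ xi
    a = policy(x)              # A ~ pi(. | X)
    r = reward_dist(x, a)      # R ~ P_R(. | X, A)
    x_next = transition(x, a)  # X' ~ P_X(. | X, A)
    return x, a, r, x_next
```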
2.7 The Random-Variable Bellman Equation
The Bellman equation characterises the expected value of the random return from any state $x$ compactly, allowing us to reduce the generative equations (an infinite sequence) to the sample transition model. In fact, we can leverage time-homogeneity and the Markov property to characterise all aspects of the random return in this manner. Consider again the definition of this return as a discounted sum of random rewards:
$$G = \sum_{t=0}^{\infty} \gamma^t R_t\,.$$
As with value functions, we would like to relate the return from the initial state to the random returns that occur downstream in the trajectory. To this end, let us define the return function
$$G^\pi(x) = \sum_{t=0}^{\infty} \gamma^t R_t\,, \quad X_0 = x\,, \tag{2.13}$$
which describes the return obtained when acting according to $\pi$ starting from a given state $x$. Note that in this definition, the notation $X_0 = x$ again modifies the initial state distribution $\xi_0$. Equation 2.13 is thus understood as "the discounted sum of random rewards described by the generative equations with $\xi_0 = \delta_x$". Formally, $G^\pi$ is a collection of random variables indexed by an initial state $x$, each generated by a random trajectory $(X_t, A_t, R_t)_{t \geq 0}$ under the distribution $\mathbb{P}_\pi(\cdot \mid X_0 = x)$. Because Equation 2.13 is concerned with random variables, we will sometimes find it convenient to be more precise and call it the return-variable function.¹⁵
The infinite tree of Figure 2.4 illustrates the abstract process by which one might generate realisations from the random variable $G^\pi(x)$. We begin at the root,
15. One might wonder about the joint distribution of the random returns $(G^\pi(x) : x \in \mathcal{X})$. In this book, we will (perhaps surprisingly) not need to specify this joint distribution; however, it is valid to conceptualise these random variables as independent for concreteness.
whose value is fixed to $X_0 = x$. Each level is drawn by sampling, in succession, the action $A_t$, the reward $R_t$, and finally the successor state $X_{t+1}$, according to the generative equations. The return $G^\pi(x)$ accumulates the discounted rewards $(\gamma^t R_t)_{t \geq 0}$ along the way.

The nature of random variables poses a challenge to converting this generative process into a recursive formulation like the Bellman equation. It is tempting to try and formulate a distributional version of the Bellman equation by defining a collection of random variables $\tilde{G}^\pi$ according to the relationship
$$\tilde{G}^\pi(x) = R + \gamma \tilde{G}^\pi(X')\,, \quad X = x \tag{2.14}$$
for each $x \in \mathcal{X}$, in direct analogy with the Bellman equation for expected returns. However, these random variables $\tilde{G}^\pi$ do not have the same distribution as the random returns $G^\pi$. The following example illustrates this point.
Example 2.15. Consider the single-state Markov decision process of Example 2.10, for which the reward $R$ has a Bernoulli distribution with parameter $1/2$. Equation 2.14 becomes
$$\tilde{G}^\pi(x) = R + \gamma \tilde{G}^\pi(x)\,, \quad X = x\,,$$
which can be re-arranged to give
$$\tilde{G}^\pi(x) = \sum_{t=0}^{\infty} \gamma^t R = \frac{1}{1 - \gamma}\, R\,.$$
Since $R$ is either 0 or 1, we deduce that
$$\tilde{G}^\pi(x) = \begin{cases} 0 & \text{with probability } 1/2 \\ \frac{1}{1 - \gamma} & \text{with probability } 1/2\,. \end{cases}$$
This is different from the uniform distribution identified in Example 2.10, in the case $\gamma = 1/2$. ∎
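The failure of Equation 2.14 is easy to see numerically. The sketch below (an illustrative comparison, with assumed constants) contrasts two sampling procedures for the single-state example with $\gamma = 1/2$: one that reuses a single reward draw at every step, as the naive recursion implicitly does, and one that draws a fresh reward at each step, as the generative equations require. Only the latter spreads its mass over $[0, 2]$.

```python
import random

GAMMA, STEPS, N = 0.5, 40, 100_000

def naive_recursion_sample():
    """Reuse one Bernoulli draw R at every step, as Equation 2.14 implicitly does."""
    r = random.randint(0, 1)
    return sum(GAMMA**t * r for t in range(STEPS))   # roughly r / (1 - gamma)

def true_return_sample():
    """Draw a fresh reward R_t at every step, as the generative equations require."""
    return sum(GAMMA**t * random.randint(0, 1) for t in range(STEPS))

naive = [naive_recursion_sample() for _ in range(N)]
true_ = [true_return_sample() for _ in range(N)]

print("naive: distinct values ~", sorted(set(round(g, 3) for g in naive)))
print("true:  fraction of returns in (0.5, 1.5):",
      sum(0.5 < g < 1.5 for g in true_) / N)
```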
The issue more generally with Equation 2.14 is that it effectively reuses the reward $R$ and successor state $X'$ across multiple visits to the initial state $X = x$, which in general is not what we intend and violates the structure of the MDP in question.¹⁶ Put another way, Equation 2.14 fails because it does not correctly handle the joint distribution of the sequence of rewards $(R_t)_{t \geq 0}$ and states $(X_t)_{t \geq 0}$ encountered along a trajectory. This phenomenon is one difference between
16. Computer scientists may find it useful to view Equation 2.14 as simulating draws from $R$ and $X'$ with a pseudo-random number generator that is reinitialised to the same state after each use.
distributional and classical reinforcement learning; in the latter case, the issue
is avoided thanks to the linearity of expectations.
The solution is to appeal to the notion of equality in distribution. We say that two random variables $Z_1, Z_2$ are equal in distribution, denoted
$$Z_1 \overset{D}{=} Z_2\,,$$
if their probability distributions are equal. This is effectively shorthand for
$$\mathcal{D}(Z_1) = \mathcal{D}(Z_2)\,.$$
Equality in distribution can be thought of as breaking up the dependency of
the two random variables on their sample spaces to compare them solely on
the basis of their probability distributions. This avoids the problem posed by
directly equating random variables.
Proposition 2.16 (The random-variable Bellman equation). Let $G^\pi$ be the return-variable function of policy $\pi$. For a sample transition $(X = x, A, R, X')$ independent of $G^\pi$, it holds that for any state $x \in \mathcal{X}$,
$$G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X')\,, \quad X = x\,. \tag{2.15}$$
∎
From a generative perspective, the random-variable Bellman equation states that we can draw a sample return by sampling an immediate reward $R$ and successor state $X'$, and then recursively generating a sample return from $X'$.
Proof of Proposition 2.16. Fix $x \in \mathcal{X}$ and let $\xi_0 = \delta_x$. Consider the (partial) random return
$$G_{k:\infty} = \sum_{t=k}^{\infty} \gamma^{t-k} R_t\,, \quad k \in \mathbb{N}\,.$$
In particular, $G^\pi(x) \overset{D}{=} G_{0:\infty}$ under the distribution $\mathbb{P}_\pi(\cdot \mid X_0 = x)$. Following an analogous chain of reasoning as in the proof of Proposition 2.12, we decompose the return $G_{0:\infty}$ into the immediate reward, and the rewards obtained later in the trajectory:
$$G_{0:\infty} = R_0 + \gamma G_{1:\infty}\,.$$
We can decompose this further based on the state occupied at time 1:
$$G_{0:\infty} = R_0 + \gamma \sum_{x' \in \mathcal{X}} \mathbb{1}\{X_1 = x'\}\, G_{1:\infty}\,.$$
Now, by the Markov property, on the event that $X_1 = x'$, the return obtained from that point on is independent of the reward $R_0$. Further, by the time-homogeneity property, on the same event the return $G_{1:\infty}$ is equal in distribution to the return obtained when the episode begins at state $x'$. Thus, we have
$$R_0 + \gamma \sum_{x' \in \mathcal{X}} \mathbb{1}\{X_1 = x'\}\, G_{1:\infty} \overset{D}{=} R_0 + \gamma \sum_{x' \in \mathcal{X}} \mathbb{1}\{X_1 = x'\}\, G^\pi(x') = R_0 + \gamma G^\pi(X_1)\,,$$
and hence
$$G^\pi(x) \overset{D}{=} R_0 + \gamma G^\pi(X_1)\,, \quad X_0 = x\,.$$
The result follows by equality of distribution between $(X_0, A_0, R_0, X_1)$ and the sample transition model $(X = x, A, R, X')$ when $X_0 = x$.
Note that Proposition 2.16 implies the standard Bellman equation, in the sense that the latter is obtained by taking expectations on both sides of the distributional equation:
$$\begin{aligned}
G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X')\,,\ X = x
&\implies \mathcal{D}\big(G^\pi(x)\big) = \mathcal{D}_\pi\big(R + \gamma G^\pi(X') \mid X = x\big) \\
&\implies \mathbb{E}\big[G^\pi(x)\big] = \mathbb{E}_\pi\big[R + \gamma G^\pi(X') \mid X = x\big] \\
&\implies V^\pi(x) = \mathbb{E}_\pi\big[R + \gamma V^\pi(X') \mid X = x\big]\,,
\end{aligned}$$
where we made use of the linearity of expectations as well as the independence of $X'$ from the random variables $(G^\pi(x) : x \in \mathcal{X})$.
2.8 From Random Variables To Probability Distributions
Random variables provide an intuitive language with which to express how rewards are combined to form the random return. However, a proper random-variable Bellman equation requires the notion of equality in distribution to avoid incorrectly reusing realisations of the random reward $R$ and next state $X'$. As a consequence, Equation 2.15 is somewhat incomplete: it characterises the distribution of the random return from $x$, $G^\pi(x)$, but not the random variable itself.

A natural alternative is to do away with the random-variable function $G^\pi$ and directly relate the distribution of the random return at different states. The result is what we may properly call the distributional Bellman equation. Working with probability distributions requires mathematical notation that is somewhat unintuitive compared to the simple addition and multiplication of random
Figure 2.5
Illustration of the effects of the transformations of the random return in distributional terms, for a given source state. The discount factor is applied to the individual realisations of the return distribution, resulting in a scaling (a) of the support of this distribution. The arrows illustrate that this scaling results in a "narrower" probability distribution. The reward $r$ translates (b) the return distribution. We write these two operations as a pushforward distribution constructed from the bootstrap function $b_{r,\gamma}$. Finally, the return distributions of successor states are combined to form a mixture distribution (c) whose mixture weights are the transition probabilities to these successor states. The combined transformations are given in (d), which depicts the return distribution at the source state (dark outline).
variables, but is free of technical snags. With precision in mind, this is why we call Equation 2.15 the random-variable Bellman equation.

For a real-valued random variable $Z$ with probability distribution $\nu \in \mathscr{P}(\mathbb{R})$, recall the notation
$$\nu(S) = \mathbb{P}(Z \in S)\,, \quad S \subseteq \mathbb{R}\,.$$
This allows us to consider the probability assigned to $S$ by $\nu$ more directly, without referring to $Z$. For each state $x \in \mathcal{X}$, let us denote the distribution of the random variable $G^\pi(x)$ by $\eta^\pi(x)$. Using this notation, we have
$$\eta^\pi(x)(S) = \mathbb{P}\big(G^\pi(x) \in S\big)\,, \quad S \subseteq \mathbb{R}\,.$$
We call the collection of these per-state distributions the return-distribution function. Each $\eta^\pi(x)$ is a member of the space $\mathscr{P}(\mathbb{R})$ of probability distributions over the reals. Accordingly, the space of return-distribution functions is denoted $\mathscr{P}(\mathbb{R})^{\mathcal{X}}$.
To understand how the random-variable Bellman equation translates to the language of probability distributions, consider that the right-hand side of Equation 2.15 involves three operations on random variables: indexing into $G^\pi$, scaling, and addition. We use analogous operations over the space of probability distributions to construct the Bellman equation for return-distribution functions; these distributional operations are depicted in Figure 2.5.
Mixing. Consider the return-variable and return-distribution functions $G^\pi$ and $\eta^\pi$, respectively, as well as a source state $x \in \mathcal{X}$. In the random-variable equation, the term $G^\pi(X')$ describes the random return received at the successor state $X'$, when $X = x$ and $A$ is drawn from $\pi(\cdot \mid X)$, hence the idea of indexing the collection $G^\pi$ with the random variable $X'$.

Generatively, this describes the process of first sampling a state $x'$ from the distribution of $X'$ and then sampling a realised return from $G^\pi(x')$. If we denote the result by $G^\pi(X')$, we see that for a subset $S \subseteq \mathbb{R}$ we have
$$\begin{aligned}
\mathbb{P}_\pi\big(G^\pi(X') \in S \mid X = x\big)
&= \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\,\mathbb{P}_\pi\big(G^\pi(X') \in S \mid X' = x', X = x\big) \\
&= \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\,\mathbb{P}\big(G^\pi(x') \in S\big) \\
&= \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\,\big(\eta^\pi(x')\big)(S)\,. \tag{2.16}
\end{aligned}$$
This shows that the probability distribution of $G^\pi(X')$ is a weighted combination, or mixture of probability distributions from $\eta^\pi$. More compactly, we have
$$\mathcal{D}_\pi\big(G^\pi(X') \mid X = x\big) = \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\,\eta^\pi(x') = \mathbb{E}_\pi\big[\eta^\pi(X') \mid X = x\big]\,.$$
Consequently, the distributional analogue of indexing into the collection $G^\pi$ of random returns is the mixing of their probability distributions.
Although we prefer working directly with probability distributions, the indexing step also has a simple expression in terms of cumulative distribution functions. In Equation 2.16, taking the set $S$ to be the half-open interval $(-\infty, z]$ we obtain
$$\mathbb{P}_\pi\big(G^\pi(X') \leq z \mid X = x\big) = \mathbb{P}_\pi\big(G^\pi(X') \in (-\infty, z] \mid X = x\big) = \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\,\mathbb{P}\big(G^\pi(x') \leq z\big)\,.$$
Thus, the mixture of next-state return distributions can be described by the mixture of their cumulative distribution functions:
$$F_{G^\pi(X')}(z) = \sum_{x' \in \mathcal{X}} \mathbb{P}_\pi(X' = x' \mid X = x)\,F_{G^\pi(x')}(z)\,.$$
Scaling and translation. Suppose we are given the distribution of the next-state return $G^\pi(X')$. What is then the distribution of $R + \gamma G^\pi(X')$? To answer this
question, we must express how multiplying by the discount factor and adding a fixed reward $r$ transforms the distribution of $G^\pi(X')$.

This is an instance of a more general question: given a random variable $Z$ with distribution $\nu$ and a transformation $f : \mathbb{R} \to \mathbb{R}$, how should we express the distribution of $f(Z)$ in terms of $f$ and $\nu$? Our approach is to use the notion of a pushforward distribution. The pushforward distribution $f_\# \nu$ is defined as the distribution of $f(Z)$:
$$f_\# \nu = \mathcal{D}\big(f(Z)\big)\,.$$
One can think of it as applying the function $f$ to the individual realisations of this distribution, "pushing" the mass of the distribution around. The pushforward notation allows us to reason about the effects of these transformations on distributions themselves, without having to involve random variables.
Now, for two scalars $r \in \mathbb{R}$ and $\gamma \in [0, 1)$, let us define the bootstrap function
$$b_{r,\gamma} : z \mapsto r + \gamma z\,.$$
The pushforward operation applied with the bootstrap function scales each realisation by $\gamma$, and then adds $r$ to it. That is,
$$(b_{r,\gamma})_\# \nu = \mathcal{D}(r + \gamma Z)\,. \tag{2.17}$$
Expressed in terms of cumulative distribution functions, this is
$$F_{r + \gamma Z}(z) = F_Z\Big(\frac{z - r}{\gamma}\Big)\,.$$
We use the pushforward operation and the bootstrap function to describe the transformation of the next-state return distribution by the reward and the discount factor. If $x'$ is a state with return distribution $\eta^\pi(x')$, then
$$(b_{r,\gamma})_\# \eta^\pi(x') = \mathcal{D}\big(r + \gamma G^\pi(x')\big)\,.$$
We finally combine the pushforward and mixing operations to produce a probability distribution
$$(b_{r,\gamma})_\# \mathbb{E}_\pi\big[\eta^\pi(X') \mid X = x\big] = \mathbb{E}_\pi\big[(b_{r,\gamma})_\# \eta^\pi(X') \mid X = x\big]\,, \tag{2.18}$$
by linearity of the pushforward (see Exercises 2.11–2.13).
Equation 2.18 gives the distribution of the random return for a specific realisation of the random reward $R$. By taking the expectation over $R$ and $X'$, we obtain the distributional Bellman equation.
Proposition 2.17 (The distributional Bellman equation). Let $\eta^\pi$ be the return-distribution function of policy $\pi$. For any state $x \in \mathcal{X}$, we have
$$\eta^\pi(x) = \mathbb{E}_\pi\big[(b_{R,\gamma})_\# \eta^\pi(X') \mid X = x\big]\,. \tag{2.19}$$
∎
Figure 2.6 illustrates the recursive relationship between return distributions
described by the distributional Bellman equation.
Example 2.18. In Example 2.10 it was determined that the random return at state $x$ is uniformly distributed on the interval $[0, 2]$. Recall that there are two possible rewards (0 and 1) with equal probability, $x$ transitions back to itself, and $\gamma = 1/2$. As $X' = x$, when $r = 0$ we have
$$\begin{aligned}
\mathbb{E}_\pi\big[(b_{r,\gamma})_\# \eta^\pi(X') \mid X = x\big] &= (b_{0,\gamma})_\# \eta^\pi(x) \\
&= (b_{0,\gamma})_\# \mathcal{D}\big(G^\pi(x)\big) \\
&= (b_{0,\gamma})_\# \mathcal{U}([0, 2]) \\
&= \mathcal{U}([0, 1])\,.
\end{aligned}$$
Similarly, when $r = 1$, we have
$$(b_{1,\gamma})_\# \eta^\pi(x) = \mathcal{D}\big(1 + \gamma G^\pi(x)\big) = \mathcal{U}([1, 2])\,.$$
Consequently,
$$\begin{aligned}
\mathcal{D}_\pi\big(R + \gamma G^\pi(X') \mid X = x\big) &= \tfrac{1}{2}\,(b_{0,\gamma})_\# \eta^\pi(x) + \tfrac{1}{2}\,(b_{1,\gamma})_\# \eta^\pi(x) \\
&= \tfrac{1}{2}\,\mathcal{U}([0, 1]) + \tfrac{1}{2}\,\mathcal{U}([1, 2]) \\
&= \mathcal{U}([0, 2]) \\
&= \eta^\pi(x)\,.
\end{aligned}$$
This illustrates how the pushforward operation can be used to express, in distributional terms, the transformations at the heart of the random-variable Bellman equation. ∎
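The mixing and pushforward operations act directly on distributions, and Example 2.18 can be checked by applying them to a discretised representation. The sketch below represents each distribution by equally weighted samples (an assumption made purely for illustration, not the book's representation), implements $(b_{r,\gamma})_\#$ and mixing on that representation, and verifies that their combination maps $\mathcal{U}([0,2])$ back to (approximately) itself.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5

def pushforward_bootstrap(samples, r, gamma):
    """(b_{r,gamma})_# applied to a sample-based distribution: z -> r + gamma * z."""
    return r + gamma * samples

def mix(distributions, weights, rng):
    """Mixture of equal-length sample-based distributions with the given weights."""
    stacked = np.stack(distributions)                     # shape (k, n)
    idx = rng.choice(len(distributions), size=stacked.shape[1], p=weights)
    return stacked[idx, np.arange(stacked.shape[1])]

# eta(x) ~ U([0, 2]), represented by samples.
eta_x = rng.uniform(0.0, 2.0, size=100_000)

# Right-hand side of Equation 2.19 for Example 2.18:
# rewards 0 and 1 each with probability 1/2, and X' = x.
rhs = mix([pushforward_bootstrap(eta_x, 0.0, gamma),
           pushforward_bootstrap(eta_x, 1.0, gamma)],
          weights=[0.5, 0.5], rng=rng)

print("mean ~", rhs.mean(), " (uniform [0, 2] has mean 1.0)")
print("P(rhs <= 0.7) ~", np.mean(rhs <= 0.7), " (uniform gives 0.35)")
```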
Proposition 2.17 can also be stated in terms of cumulative distribution functions.
Half-way between Equation 2.19 and the random-variable equation, what one
might call the cumulative distribution function Bellman equation is
$$F_{G^\pi(x)}(z) = \mathbb{E}_\pi\big[F_{R + \gamma G^\pi(X')}(z) \mid X = x\big] = \mathbb{E}_\pi\Big[F_{G^\pi(X')}\Big(\frac{z - R}{\gamma}\Big) \,\Big|\, X = x\Big]\,.$$
The first form highlights the relationship between the random-variable equation and the distributional equation, whereas the second form more directly expresses the relationship between the different cumulative distribution functions; the odd indexing reflects the process of transferring probability mass "backwards", from $G^\pi(X')$ to $G^\pi(x)$. Although the cumulative distribution function Bellman equation is at first glance simpler than Equation 2.19, in later chapters we will see that working with probability distributions is usually more natural and direct.
Proof of Proposition 2.17. Let $S \subseteq \mathbb{R}$ and fix $x \in \mathcal{X}$. By the law of total expectation, we may write
$$\begin{aligned}
&\mathbb{P}_\pi\big(R + \gamma G^\pi(X') \in S \mid X = x\big) \\
&\quad= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_X(x' \mid x, a)\,\mathbb{P}_\pi\big(R + \gamma G^\pi(X') \in S \mid X = x, A = a, X' = x'\big) \\
&\quad\overset{(a)}{=} \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_X(x' \mid x, a)\,\mathbb{E}_R\Big[\mathbb{P}_\pi\big(R + \gamma G^\pi(x') \in S \mid X = x, A = a, R\big) \,\Big|\, X = x, A = a\Big] \\
&\quad= \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_X(x' \mid x, a)\,\mathbb{E}_R\Big[\big((b_{R,\gamma})_\# \eta^\pi(x')\big)(S) \,\Big|\, X = x, A = a\Big] \\
&\quad= \mathbb{E}_\pi\Big[\big((b_{R,\gamma})_\# \eta^\pi(X')\big)(S) \,\Big|\, X = x\Big] \\
&\quad= \mathbb{E}_\pi\big[(b_{R,\gamma})_\# \eta^\pi(X') \mid X = x\big](S)\,.
\end{aligned}$$
The notation $\mathbb{E}_R$ in (a) denotes the expectation with respect to $R \sim P_R(\cdot \mid x, a)$; the independence of $R$ and $X'$ given $A$ allows us to remove the conditional $X' = x'$ from the inner probability term. Now, by Proposition 2.16 we know that
$$G^\pi(x) \overset{D}{=} R + \gamma G^\pi(X')\,, \quad X = x\,.$$
By definition, $\eta^\pi(x)$ is the distribution of $G^\pi(x)$, and hence the distribution of $R + \gamma G^\pi(X')$ when $X = x$. But the derivation above shows that that distribution is
$$\mathbb{E}_\pi\big[(b_{R,\gamma})_\# \eta^\pi(X') \mid X = x\big]\,,$$
which completes the proof.
Figure 2.6
The distributional Bellman equation states that the return distribution at one state (far left) is formed by scaling the return distributions at successor states, shifting them by the realised rewards, and mixing over possible realisations.
Equation 2.19 directly relates the return distributions at different states and does away with the random-variable function $G^\pi$. This makes it particularly useful when mathematical rigour is required, such as to prove the more formal results in later chapters. The random variables X, R, and $X'$ remain, however, inside the expectation on the right-hand side. Their role is to concisely express the three operations of mixing, scaling, and translation. It is also possible to omit these random variables and write Equation 2.19 purely in terms of probability distributions, by making the expectation explicit:
$$\eta^\pi(x) = \sum_{a \in \mathcal{A}} \pi(a \mid x) \sum_{x' \in \mathcal{X}} P_{\mathcal{X}}(x' \mid x, a) \int_{\mathbb{R}} P_R(\mathrm{d}r \mid x, a)\, (b_{r,\gamma})_\# \eta^\pi(x')\,, \qquad (2.20)$$
as was partially done in the proof above. We will revisit this form of the
distributional Bellman equation in Chapter 5 when we consider algorithms for
approximating the return distribution. In general, however, Equation 2.19 is
more compact and applies equally well to discrete and continuous distributions.
Observe that the random action A is implicit in the random-variable and distributional Bellman equations, but explicit in Equation 2.20. This is because A is needed to determine the probability distributions of R and $X'$, which are only independent conditional on the action. This is a useful reminder that we need to treat the addition of R to $\gamma G^\pi(X')$ with care: these are not in general independent random variables, and their addition does not generally correspond to a simple convolution of distributions. We will consider this issue in more detail in Chapter 4.
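When the state space, action space, and reward support are all finite, the right-hand side of Equation 2.20 can be evaluated directly for return distributions represented by finitely many weighted particles. The sketch below (a minimal illustration on a made-up two-state MDP; the tables `policy`, `P_X`, and `P_R` are hypothetical) computes one such backup. Iterating backups of this kind, together with a way of keeping the representation finite, is related to the algorithms for approximating the return distribution discussed in Chapter 5.

```python
# Hypothetical two-state, one-action MDP with discrete rewards (all tables made up).
states = [0, 1]
gamma = 0.9
policy = {x: {0: 1.0} for x in states}                      # pi(a | x)
P_X = {(0, 0): {0: 0.5, 1: 0.5}, (1, 0): {1: 1.0}}          # P_X(x' | x, a)
P_R = {(0, 0): {0.0: 0.5, 1.0: 0.5}, (1, 0): {2.0: 1.0}}    # P_R(r | x, a)

def bellman_backup(eta):
    """Evaluate the right-hand side of Equation 2.20 for a return-distribution
    function `eta`, given as {state: [(value, probability), ...]}."""
    new_eta = {}
    for x in states:
        atoms = []
        for a, p_a in policy[x].items():
            for xp, p_xp in P_X[(x, a)].items():
                for r, p_r in P_R[(x, a)].items():
                    # Pushforward of eta[xp] through b_{r,gamma}, weighted by
                    # pi(a | x) * P_X(x' | x, a) * P_R(r | x, a).
                    atoms += [(r + gamma * z, p_a * p_xp * p_r * p) for z, p in eta[xp]]
        new_eta[x] = atoms
    return new_eta

# Starting from a point mass at zero in each state, apply one backup.
eta0 = {x: [(0.0, 1.0)] for x in states}
print(bellman_backup(eta0)[0])
```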
2.9 Alternative Notions of the Return Distribution*
In reinforcement learning, the discount factor $\gamma$ is typically presented as a scalar multiplier that shrinks the magnitude of future rewards. This is the interpretation we have used to present the return in this chapter. However, the discount factor can also be interpreted as a probability of continuing. Under this alternative interpretation, the rewards received by the agent are undiscounted, and instead there is a probability $1 - \gamma$ that the sequence terminates at each time step.
Under this model, the agent's interactions with its environment end after a random termination time T, which is geometrically distributed with parameter $\gamma$, independently of all other random variables in the sequence. Specifically, we have $P(T = t) = \gamma^t (1 - \gamma)$ for all $t \geq 0$. In this case, the agent's return is now given by
$$\sum_{t=0}^{T} R_t\,.$$
We call this the random-horizon return.
Straightforward calculation allows us to show that the expected value of the return is unchanged (see Exercise 2.14):
$$\mathbb{E}\bigg[\sum_{t=0}^{T} R_t\bigg] = \mathbb{E}\bigg[\sum_{t=0}^{\infty} \gamma^t R_t\bigg]\,.$$
Because of this equivalence, from an expected-return perspective it does not matter whether $\gamma$ is interpreted as a scaling factor or a probability. By contrast, this change in interpretation of $\gamma$ has a significant impact on the structure of the return distribution. For example, the random variable $\sum_{t=0}^{T} R_t$ is unbounded even when the rewards are bounded, unlike the discounted return.
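A quick simulation makes the contrast concrete. The sketch below (illustrative only; the Bernoulli(1/2) reward distribution and the specific sample sizes are our choices, not taken from the text) draws both kinds of return for i.i.d. rewards at a single state: the sample means agree, while the variances and the ranges clearly differ.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, n_samples, horizon = 0.5, 100_000, 60

# Discounted return with i.i.d. Bernoulli(1/2) rewards, truncated at a long horizon.
rewards = rng.integers(0, 2, size=(n_samples, horizon))
discounted = rewards @ gamma ** np.arange(horizon)

# Random-horizon return: undiscounted rewards summed up to a geometric time T with
# P(T = t) = gamma^t (1 - gamma); numpy's geometric counts trials, hence the -1.
T = rng.geometric(1.0 - gamma, size=n_samples) - 1
T = np.minimum(T, horizon - 1)   # truncation; P(T >= horizon) is negligible here
random_horizon = np.array([rewards[i, :T[i] + 1].sum() for i in range(n_samples)])

print("means:    ", discounted.mean(), random_horizon.mean())   # approximately equal
print("variances:", discounted.var(), random_horizon.var())     # markedly different
print("maxima:   ", discounted.max(), random_horizon.max())     # bounded by 2 vs. not
```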
Just as there exists a distributional Bellman equation for the discounted return, there is also a recursive characterisation of the return distribution for this variant of the return. By introducing an auxiliary random variable $I = \mathbb{1}\{T > 0\}$, we can write down a distributional Bellman equation for this alternative interpretation of the return. Expressing this relationship in terms of random variables, we have
$$G^\pi(x) \overset{\mathcal{D}}{=} R + I\, G^\pi(X'), \qquad X = x\,.$$
In terms of distributions themselves, we have
$$\eta^\pi(x) = (1 - \gamma)\,\mathcal{D}_\pi(R \mid X = x) + \gamma\, \mathbb{E}_\pi\big[(b_{R,1})_\# \eta^\pi(X') \mid X = x\big]\,,$$
where $\mathcal{D}_\pi(R \mid X = x)$ denotes the distribution of the reward R at x under $\pi$.
Notice that $\gamma$ in the equation above, which previously scaled the support of the distribution, now mixes two probability distributions. These results are proven in the course of Exercise 2.15, along with similar results for a more general class of return definitions. The fact that there exists a distributional Bellman equation for these variants of the return means that much of the theory of distributional reinforcement learning still applies. We encourage the reader to consider how the algorithms and analysis presented in later chapters may be modified to accommodate alternative notions of the return.
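For the single-state example with rewards in {0, 1} and $\gamma = 1/2$, the mixture form of this equation can be checked empirically. The sketch below (illustrative only, not from the text) compares samples of the random-horizon return at x with samples drawn according to the right-hand side: with probability $1 - \gamma$ a bare reward, and with probability $\gamma$ a reward plus an independent copy of the return.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, n = 0.5, 50_000

def sample_return(size):
    """Samples of the random-horizon return sum_{t=0}^T R_t at the single state,
    with R_t ~ Bernoulli(1/2) i.i.d. and P(T = t) = gamma^t (1 - gamma)."""
    T = rng.geometric(1.0 - gamma, size=size) - 1
    return np.array([rng.integers(0, 2, size=t + 1).sum() for t in T])

lhs = sample_return(n)

# Right-hand side: with probability (1 - gamma) the bare reward, and with
# probability gamma the reward plus an independent copy of the return.
R = rng.integers(0, 2, size=n)
rhs = np.where(rng.random(n) < 1.0 - gamma, R, R + sample_return(n))

# The two empirical distributions agree, up to sampling error.
for k in range(5):
    print(k, (lhs == k).mean(), (rhs == k).mean())
```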
We see that there are multiple ways in which we can define distributional
Bellman equations while preserving the relationship between expectations.
Another approach, commonly known as n-step bootstrapping in reinforcement
learning, relates the return distribution at state x with the return distribution n
steps into the future. Exercise 2.16 describes another alternative based on first
return times.
2.10 Technical Remarks
Remark 2.1. Our account of random variables in this chapter has not required a measure-theoretic formalism, and in general this will be the case for the remainder of the book. For readers with a background in measure theory, the existence of the joint distribution of the random variables $(X_t, A_t, R_t)_{t \geq 0}$ in the Markov decision process model described in this chapter can be deduced by constructing a consistent sequence of distributions for the random variables $(X_t, A_t, R_t)_{t=0}^{H}$ for each horizon $H \in \mathbb{N}$, using the conditional distributions given in the definition, and then appealing to the Ionescu-Tulcea theorem [Tulcea, 1949]. See Lattimore and Szepesvári [2020] for further discussion in the context of Markov decision processes.
We also remark that a more formal definition of probability measures restricts the subsets of outcomes under consideration to be measurable with respect to an underlying $\sigma$-algebra. Here, and elsewhere in the book, we will generally avoid repeated mention of the qualifier ‘measurable’ in cases such as these. The issue can be safely ignored by readers without a background in measure theory. For readers with such a background, measurability is always implicitly with respect to the power set $\sigma$-algebra on finite sets, the Borel $\sigma$-algebra on the reals, and the corresponding product $\sigma$-algebra on products of such spaces. 4
Remark 2.2. The existence of the random return $\sum_{t=0}^{\infty} \gamma^t R_t$ is often taken for granted in reinforcement learning. In part, this is because assuming that rewards are deterministic or have distributions with bounded support makes its existence straightforward; see Exercise 2.17.
The situation is more complicated when the support of the reward distributions is unbounded, as there might be realisations of the random trajectory for which the sequence $(G_{0:T})_{T \geq 0}$ does not converge. If, for example, the rewards are normally distributed, then it is possible that $R_0 = 1$, $R_1 = -\gamma^{-1}$, $R_2 = \gamma^{-2}$, $R_3 = -\gamma^{-3}$, $\dots$, in which case $G_{0:T}$ oscillates between 0 and 1 indefinitely. When the truncated returns do not converge, it is not meaningful to talk about the return obtained in this realisation of the trajectory.
Under the assumption that all reward distributions have finite mean, we can demonstrate convergence of the truncated return with probability 1:
$$P_\pi\Big(\lim_{T \to \infty} \sum_{t=0}^{T} \gamma^t R_t \text{ exists}\Big) = 1\,, \qquad (2.21)$$
and hence guarantee that we can meaningfully study the distributional properties of the random return. Intuitively, problematic sequences of rewards such as the sequence given above require reward samples that are increasingly ‘unlikely’. Equation 2.21 is established by appealing to the martingale convergence theorem, and the details of this derivation are explored in Exercise 2.19. 4
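Although Equation 2.21 is established analytically, the phenomenon is easy to observe numerically. The sketch below (an illustration only, not a substitute for the martingale argument; the choice of normal rewards and discount factor is ours) computes the truncated returns of a single trajectory with i.i.d. standard normal rewards: the partial sums settle down because the late increments $\gamma^T R_T$ become vanishingly small.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, horizon = 0.9, 200

# One realisation of i.i.d. standard normal rewards (finite mean, unbounded support).
rewards = rng.normal(0.0, 1.0, size=horizon)
truncated = np.cumsum(gamma ** np.arange(horizon) * rewards)   # G_{0:T}, T = 0, ..., 199

print(truncated[[9, 49, 99, 199]])                 # successive truncated returns stabilise
print(np.abs(np.diff(truncated[-20:])).max())      # the last increments are tiny
```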
Remark 2.3. The assumption of finite state and action spaces makes several technical aspects of the presentation in this book more straightforward. As a simple example, consider a Markov decision process with countably infinite state space given by the non-negative integers $\mathbb{N}$, for which state x deterministically transitions to state x + 1. Suppose in addition that the trajectory begins in state 0. If the reward at state x is $(-\gamma)^{-x}$, then the truncated return up to time T is given by
$$\sum_{t=0}^{T-1} \gamma^t R_t = \sum_{t=0}^{T-1} \gamma^t (-\gamma)^{-t} = \sum_{t=0}^{T-1} (-1)^t\,.$$
In the spirit of Remark 2.2, when T is odd this sum is 1, while for T even it is 0. Hence the limit $\sum_{t=0}^{\infty} \gamma^t R_t$ does not exist, despite the fact that all reward distributions have finite mean.
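The cancellation in this example is easy to confirm with a few lines of code (a trivial check, included only for illustration):

```python
gamma = 0.9  # any discount factor in (0, 1) exhibits the same behaviour

# Truncated returns sum_{t=0}^{T-1} gamma^t * (-gamma)^(-t) = sum_{t=0}^{T-1} (-1)^t.
for T in range(1, 7):
    print(T, sum(gamma ** t * (-gamma) ** (-t) for t in range(T)))
```

The printed values alternate between 1 and 0 (up to floating-point error), so the truncated returns have no limit.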
Dealing with continuous state spaces requires more technical sophistication still.
Constructing the random variables describing the trajectory requires taking into
account the topology of the state space in question, and careful consideration of
technical conditions on policies (such as measurability) is necessary too. For
details on how these issues may be addressed, see Puterman [2014] and Meyn
and Tweedie [2012]. 4
Remark 2.4 (Proof of time-homogeneity and the Markov property). The generative equations defining the distribution of the random trajectory $(X_t, A_t, R_t)_{t \geq 0}$ immediately give statements such as
$$P_\pi(X_{k+1} = x' \mid X_{0:k}, A_{0:k}, R_{0:k}, X_k) = P_\pi(X_{k+1} = x' \mid X_k)\,,$$
which shows that the next state in the trajectory enjoys the Markov property. Lemmas 2.13 and 2.14 state something stronger: that the entire future trajectory has this property. To prove these lemmas from the generative equations, the strategy is to note that the distribution of $(X_t, A_t, R_t)_{t=k}^{\infty}$ conditional on $(X_t, A_t, R_t)_{t=0}^{k-1}$ and $X_k$ is determined by its finite-dimensional marginal distributions; that is, conditional probabilities of the form
$$P_\pi\big(X_{t_i} = x_i, A_{t_i} = a_i, R_{t_i} \in S_i \text{ for } i = 1, \dots, n \,\big|\, (X_t, A_t, R_t)_{t=0}^{k-1}, X_k\big)$$
for all $n \in \mathbb{N}^+$ and sequences of time steps $k \leq t_1 < \cdots < t_n$ (see e.g. Billingsley [2012, Section 36]). Thus, if these conditional probabilities are equal to the corresponding quantities
$$P_\pi\big(X_{t_i} = x_i, A_{t_i} = a_i, R_{t_i} \in S_i \text{ for } i = 1, \dots, n \,\big|\, X_k\big)\,,$$
then the distributions are equal. Finally, the equality of these finite-dimensional marginals can be shown by induction from the generative equations. 4
2.11 Bibliographical Remarks
An introduction to random variables and probability distributions may be found
in any undergraduate textbook on the subject. For a more technical presentation,
see Williams [1991] and Billingsley [2012].
2.2.
Markov decision processes are generally attributed to Bellman [1957a]. A
deeper treatment than we give here can be found in most introductory textbooks,
including those by Bertsekas and Tsitsiklis [1996], Szepesvári [2010], Puterman
[2014]. Our notation is most aligned with that of Sutton and Barto [2018].
Interestingly, while it is by now standard to use reward as an indicator of
success, Bellman’s own treatment does not make it an integral part of the
formalism.
2.3.
The formulation of a trajectory as a sequence of random variables is central
to control as inference [Toussaint, 2009, Levine, 2018], which uses tools from
probabilistic reasoning to derive optimal policies. Howard [1960] used the
analogy of a frog hopping around a lily pond to convey the dynamics of a
Markov decision process; we find our own analogy more vivid. The special
consideration owed to infinite sequences is studied at length by Hutter [2005].
2.4.
The work of Veness et al. [2015] makes the return (as a random variable)
the central object of interest, and is the starting point of our own investigations
into distributional reinforcement learning. Issues regarding the existence of the
random return and a proper probabilistic formulation can be found in that paper.
An early formulation of the random-variable function can be found in Jaquette
[1973], who used it to study alternative optimality criteria. The Blackjack and
cliff-walking examples are adapted from Sutton and Barto [2018] and in the
latter case, inspired by one of the authors’ trip to Ireland. In both cases we
put a special emphasis on the probability distribution of the random return.
The uniform distribution example is taken from Bellemare et al. [2017a]; such
discounted sums of Bernoulli random variables also have a long history in
probability theory [Jessen and Wintner, 1935], see Solomyak [1995], Diaconis
and Freedman [1999], Peres et al. [2000] and references therein.
2.5–2.6.
The Bellman equation is standard to most textbooks on the topic. A
particularly thorough treatment can be found in the work of Puterman [2014]
and Bertsekas [2012]. The former also provides a good discussion on the
implications of the Markov property and time-homogeneity.
2.7–2.8.
Bellman equations relating quantities other than expected returns were
originally introduced in the context of risk-sensitive control, at varying levels
of generality. The formulation of the distributional Bellman equation in terms
of cumulative distribution functions was first given by Sobel [1982], under the
assumption of deterministic rewards and policies. Chung and Sobel [1987] later
gave a version for random, bounded rewards. See also the work of White [1988]
for a review of some of these approaches, and Morimura et al. [2010a] for a
more recent presentation of the CDF Bellman equation.
Other versions of the distributional Bellman equation have been phrased in terms
of moments [Sobel, 1982], characteristic functions [Mandl, 1971, Farahmand,
2019] and the Laplace transform [Howard and Matheson, 1972, Jaquette, 1973,
1976, Denardo and Rothblum, 1979], again at varying levels of generality, and in
some cases using the undiscounted return. Morimura et al. [2010b] also present
a version of the equation in terms of probability densities. The formulation of
the distributional Bellman equation in terms of pushforward distributions is due
to Rowland et al. [2018]; the pushforward notation is broadly used in measure-
theoretic probability, and our use of it is influenced by optimal transport theory
[Villani, 2008].
Distributional formulations have also been used to design Bayesian methods for
reasoning about uncertainty regarding the value function [Ghavamzadeh et al.,
2015]. Dearden et al. [1998] proposes an algorithm that maintains a posterior
on the return distribution under the assumption that rewards are normally
distributed. Engel et al. [2003, 2007] use a random-variable equation to derive
an algorithm based on Gaussian processes.
Our own work in the field is rooted in the theory of stochastic fixed point
equations [Rösler, 1991, 1992, Rachev and Rüschendorf, 1995], also known as
recursive distributional equations [Aldous and Bandyopadhyay, 2005], from
which we draw the notion of equality in distribution. Recursive distributional
equations have been used, among other applications, for complexity analysis of
randomised algorithms [Rösler and Rüschendorf, 2001], as well as the study of
branching processes and objects in random geometry [Liu, 1998]. See Alsmeyer
[2012] for a book-length survey of the field. In effect, the random-variable
Bellman equation for a given state x is a recursive distributional equation in the
sense of Aldous and Bandyopadhyay [2005]; however, in distributional rein-
forcement learning we typically emphasise the collection of random variables
(the return-variable function) over individual equations.
2.9.
The idea of treating the discount factor as a probability recurs in the
literature, often as a technical device within a proof. Our earliest source is
Derman [1970, Chapter 3]. The idea is used to handle discounted, infinite
horizon problems in the reinforcement learning as inference setting [Toussaint
and Storkey, 2006] and is remarked on by Sutton et al. [2011]. The probabilistic
interpretation of the discount factor has also found application in model-based
reinforcement learning [Sutton, 1995, Janner et al., 2020], and is closely related
to the notion of termination probabilities in options [Sutton et al., 1999]. The
extension to the distributional setting was worked out with Tom Schaul but
not previously published. See also White [2017] for a broader discussion on
unifying various uses of discount factors.
2.12 Exercises
Exercise 2.1. We defined a state x as being terminal if $P_{\mathcal{X}}(x \mid x, a) = 1$ for all $a \in \mathcal{A}$ and $P_R(0 \mid x, a) = 1$.
(i)
Explain why multiple terminal states are redundant from a return function
perspective.
(ii)
Suppose you are given an MDP with k terminal states. Describe a procedure
that creates a new MDP that behaves identically but has a single terminal
state. 4
Exercise 2.2. Consider the generative equations of Section 2.3. Using Bayes's rule, explain why
$$P_\pi(X_1 = x' \mid X_0 = x, A_0 = a)$$
needs to be introduced as a notational convention, rather than derived from the definition of the joint distribution $P_\pi$. 4
Exercise 2.3. Suppose that $\gamma < 1$ and $R_t \in [R_{\text{MIN}}, R_{\text{MAX}}]$ for all $t \in \mathbb{N}$. Show that
$$\sum_{t=0}^{\infty} \gamma^t R_t \in \bigg[\frac{R_{\text{MIN}}}{1 - \gamma},\, \frac{R_{\text{MAX}}}{1 - \gamma}\bigg]\,.$$
4
Exercise 2.4. Find a probability distribution $\nu$ that has unbounded support and such that
$$P\bigg(\sum_{t=0}^{\infty} \gamma^t Z_t < \infty\bigg) = 1\,, \qquad Z_t \sim \nu \text{ i.i.d. for } t \in \mathbb{N}\,.$$
4
Exercise 2.5. Using an open-source implementation of the game of blackjack
or your own, use simulations to estimate the probability distribution of the
random return under
(i) The uniformly random policy.
(ii)
The policy that doubles when starting with 10 or 11, hits on 16 and lower,
and otherwise sticks.
(iii) The policy that hits on 16 and lower and never doubles. Is it a better policy?
In all cases, aces should count as 11 whenever possible. 4
Exercise 2.6. Show that any random variable Z satisfying the condition in Equation 2.6 must have CDF given by
$$P(Z \leq z) = \frac{z}{2}\,, \qquad 0 \leq z \leq 2\,,$$
and hence that Z has distribution U([0, 2]). 4
Exercise 2.7 (*). The Cantor distribution has a cumulative distribution function F, defined for $z \in [0, 1]$ by expressing z in base 3:
$$z = \sum_{n=1}^{\infty} z_n 3^{-n}\,, \qquad z_n \in \{0, 1, 2\}\,,$$
such that
$$F(z) = \sum_{n=1}^{\infty} \mathbb{1}\{z_n > 0,\ z_i \neq 1 \text{ for all } 1 \leq i < n\}\, 2^{-n}\,.$$
Additionally, $F(z) = 0$ for $z \leq 0$ and $F(z) = 1$ for $z \geq 1$. Prove that the return distribution for the MDP in Example 2.11 is the Cantor distribution. 4
Exercise 2.8. The n-step Bellman equation relates the value function at a state x with the discounted sum of $n \in \mathbb{N}^+$ future rewards and the value of the $n$th successor state. Prove that, for each $n \in \mathbb{N}^+$,
$$V^\pi(x) = \mathbb{E}_\pi\bigg[\sum_{t=0}^{n-1} \gamma^t R_t + \gamma^n V^\pi(X_n) \,\bigg|\, X_0 = x\bigg]\,.$$
4
Exercise 2.9. In Section 2.5 we argued that the number of partial trajectories of length T can grow exponentially with T. Give examples of Markov decision processes with $N_{\mathcal{X}}$ states and $N_{\mathcal{A}}$ actions where the number of possible length-T realisations $(x_t, a_t, r_t)_{t=0}^{T}$ of the random trajectory is
(i) Bounded by a constant.
(ii) Linear in T. 4
Exercise 2.10. In Section 2.7 we argued that the equation
$$\tilde{G}^\pi(x) = R + \gamma \tilde{G}^\pi(X'), \qquad X = x$$
leads to the wrong probability distribution over returns. Are there scenarios in which this is a sensible model? 4
Exercise 2.11. The purpose of this exercise is to familiarise yourself with the transformations of the pushforward operation applied to the bootstrap function b. Let $\nu \in \mathcal{P}(\mathbb{R})$ be a probability distribution and let Z be a random variable with distribution $\nu$. Let $r = 1/2$, $\gamma = 1/3$, and let R be a Bernoulli(1/4) random variable independent of Z. For each of the following probability distributions:
(i) $\nu = \delta_1$;
(ii) $\nu = \tfrac{1}{2}\delta_{-1} + \tfrac{1}{2}\delta_1$;
(iii) $\nu = N(2, 1)$,
express the probability distributions produced by the following operations:
(i) $(b_{r,1})_\# \nu = \mathcal{D}(r + Z)$;
(ii) $(b_{0,\gamma})_\# \nu = \mathcal{D}(\gamma Z)$;
(iii) $(b_{r,\gamma})_\# \nu = \mathcal{D}(r + \gamma Z)$; and
(iv) $\mathbb{E}[(b_{R,\gamma})_\# \nu] = \mathcal{D}(R + \gamma Z)$. 4
Exercise 2.12. Repeat the preceding exercise, now with $R \sim N(0, 1)$. Conclude on the distribution of the random return
$$G = \sum_{t=0}^{\infty} \gamma^t R_t\,,$$
where $R_t \sim N(1, 1)$ for all $t \geq 0$. 4
Exercise 2.13. Let $\eta \in \mathcal{P}(\mathbb{R})^{\mathcal{X}}$ be a return-distribution function, and let $X'$ be an $\mathcal{X}$-valued random variable. Show that for any $r, \gamma \in \mathbb{R}$, the pushforward operation combined with the bootstrap function forms an affine map, in the sense that
$$(b_{r,\gamma})_\# \mathbb{E}\big[\eta(X')\big] = \mathbb{E}\big[(b_{r,\gamma})_\# \eta(X')\big]\,.$$
4
Exercise 2.14. Consider the standard discounted return $\sum_{t=0}^{\infty} \gamma^t R_t$. Recall from Section 2.9 the random-horizon return, $\sum_{t=0}^{T} R_t$, where T has a geometric probability distribution with parameter $\gamma$: $P(T = k) = \gamma^k (1 - \gamma)$. Show that
$$\mathbb{E}\bigg[\sum_{t=0}^{\infty} \gamma^t R_t\bigg] = \mathbb{E}\bigg[\sum_{t=0}^{T} R_t\bigg]\,.$$
4
Exercise 2.15. Consider again the random-horizon return described in Section 2.9.
(i) Show that if the termination time T has a geometric distribution with parameter $\gamma$ then T has the memoryless property, i.e. that $P(T \geq k + l \mid T \geq k) = P(T \geq l)$, for all $k, l \in \mathbb{N}$.
(ii) Hence, show that the return-variable function corresponding to this alternative definition of return satisfies the Bellman equation
$$G^\pi(x) \overset{\mathcal{D}}{=} R + \mathbb{1}\{T > 0\}\, G^\pi(X'), \qquad X = x$$
for each $x \in \mathcal{X}$.
Consider now a further alternative notion of return, given by considering a sequence of non-negative random variables $(I_t)_{t \geq 0}$ distributed i.i.d. and independent from all other random variables in the MDP. Suppose $\mathbb{E}[I_t] = \gamma$. Define the return to be
$$\sum_{t=0}^{\infty} \Big( \prod_{s=0}^{t-1} I_s \Big) R_t\,.$$
(iii) Show that both the standard definition of the return and the random-horizon return can be viewed as special cases of this more general notion of return, by particular choices of the distribution of the $(I_t)_{t \geq 0}$ variables.
(iv) Verify that
$$\mathbb{E}\bigg[\sum_{t=0}^{\infty} \Big( \prod_{s=0}^{t-1} I_s \Big) R_t\bigg] = \mathbb{E}\bigg[\sum_{t=0}^{\infty} \gamma^t R_t\bigg]\,.$$
(v) Show that the random-variable Bellman equation associated with this notion of return is given by
$$G^\pi(x) \overset{\mathcal{D}}{=} R + I_0\, G^\pi(X'), \qquad X = x$$
for each $x \in \mathcal{X}$. 4
Exercise 2.16. For a state $x \in \mathcal{X}$, write
$$T_x = \min\{t \geq 1 : X_t = x\}\,,$$
with $T_x = \infty$ if x is never reached from time t = 1 onwards. When $X_0 = x$, this is the random time of first return to x. Prove that the following alternative random-variable Bellman equation holds:
$$G^\pi(x) \overset{\mathcal{D}}{=} \sum_{t=0}^{T_x - 1} \gamma^t R_t + \gamma^{T_x} G^\pi(x), \qquad X_0 = x\,.$$
4
Exercise 2.17. The aim of the following two exercises is to explore under what conditions the random return $G = \sum_{t=0}^{\infty} \gamma^t R_t$ exists. As in Remark 2.2, denote the truncated return by
$$G_{0:T} = \sum_{t=0}^{T} \gamma^t R_t\,.$$
Show that if all reward distributions are supported on the interval $[R_{\text{MIN}}, R_{\text{MAX}}]$, then for any values of the random variables $(R_t)_{t \geq 0}$, the sequence $(G_{0:T})_{T \geq 0}$ converges, and hence G is well-defined. 4
Exercise 2.18 (*). The random-variable Bellman equation is effectively a system of recursive distributional equations [Aldous and Bandyopadhyay, 2005]. This exercise asks you to characterise the normal distribution as the solution to another such equation. Let Z be a normally-distributed random variable with mean zero and unit variance (that is, $Z \sim N(0, 1)$). Recall that the probability density $p_Z$ and characteristic function $\varphi_Z$ of Z are
$$p_Z(z) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{z^2}{2}\Big)\,, \qquad \varphi_Z(s) = \exp\Big(-\frac{s^2}{2}\Big)\,.$$
Suppose that $Z'$ is an independent copy of Z. Show that
$$Z \overset{\mathcal{D}}{=} 2^{-1/2} Z + 2^{-1/2} Z'$$
by considering
(i) The probability density of each side of the equation above;
(ii) The characteristic function of each side of the equation above. 4
Exercise 2.19 (*). Under the overarching assumption of this chapter and the remainder of the book that there are finitely many states and all reward distributions have finite first moment, we can show that the limit
$$\lim_{T \to \infty} \sum_{t=0}^{T} \gamma^t R_t \qquad (2.22)$$
exists with probability 1. First, show that the sequence of truncated returns $(G_{0:T})_{T \geq 0}$ forms a semi-martingale with respect to the filtration $(\mathcal{F}_t)_{t \geq 0}$, defined by $\mathcal{F}_t = \sigma(X_{0:t}, A_{0:t}, R_{0:t-1})$; the relevant decomposition is
$$G_{0:T} = G_{0:T-1} + \underbrace{\gamma^T\, \mathbb{E}[R_T \mid X_T, A_T]}_{\text{predictable increment}} + \underbrace{\gamma^T \big(R_T - \mathbb{E}[R_T \mid X_T, A_T]\big)}_{\text{martingale noise}}\,.$$
Writing
$$C_T = \sum_{t=0}^{T} \gamma^t\, \mathbb{E}[R_t \mid X_t, A_t]$$
for the sum of predictable increments, show that this sequence converges for all realisations of the random trajectory. Next, show that the sequence $(\bar{C}_T)_{T \geq 0}$ defined by $\bar{C}_T = G_{0:T} - C_T$ is a martingale, and by using the assumption of a finite state space $\mathcal{X}$, show that it satisfies
$$\sup_{T \geq 0} \mathbb{E}\big[|\bar{C}_T|\big] < \infty\,.$$
Hence use the martingale convergence theorem to argue that $(\bar{C}_T)_{T \geq 0}$ converges w.p. 1. From there, deduce that the limit of Equation 2.22 exists w.p. 1. 4
Exercise 2.20 (*). The second Borel-Cantelli lemma [Billingsley, 2012, Section 4] states that if $(E_t)_{t \geq 0}$ is a sequence of independent events and $\sum_{t \geq 0} P(E_t) = \infty$, then
$$P(\text{infinitely many } E_t \text{ occur}) = 1\,.$$
Let $\varepsilon > 0$. Exhibit a Markov decision process and policy for which the events $\{|\gamma^t R_t| > \varepsilon\}$ are independent, and satisfy
$$P\big(|\gamma^t R_t| > \varepsilon \text{ for infinitely many } t\big) = 1\,.$$
Deduce that
$$P\Big(\textstyle\sum_{t=0}^{T} \gamma^t R_t \text{ converges as } T \to \infty\Big) = 0$$
in this case. Hint. Construct a sufficiently heavy-tailed reward distribution from a suitable cumulative distribution function. 4