Notation
Notation : Description
$\mathbb{R}$ : The set of real numbers
$\mathbb{N}$, $\mathbb{N}^+$ : The set of natural numbers including (excluding) zero: $\{0, 1, 2, \dots\}$
$\mathbb{P}$ : The probability of one or many random variables producing the given outcomes
$\mathscr{P}(Y)$ : The space of probability distributions over a set $Y$
$F_\nu$, $F_\nu^{-1}$ : The cumulative distribution function (CDF) and inverse CDF, respectively, for the distribution $\nu$
$\delta_\theta$ : Dirac delta distribution at $\theta \in \mathbb{R}$, a probability distribution which assigns probability 1 to the outcome $\theta$
$N(\mu, \sigma^2)$ : Normal distribution with mean $\mu$ and variance $\sigma^2$
$U([a, b])$ : Uniform distribution over $[a, b]$, with $a, b \in \mathbb{R}$
$U(\{a, b, \dots\})$ : Uniform distribution over the set $\{a, b, \dots\}$
$Z \sim \nu$ : The random variable $Z$, with probability distribution $\nu$
$z$, $Z$ : Capital letters generally denote random variables and lower-case letters their realisations or expectations. Notable exceptions are $V$, $Q$, and $P$
$x \in \mathcal{X}$ : A state $x$ in the state space $\mathcal{X}$
$a \in \mathcal{A}$ : An action $a$ in the action space $\mathcal{A}$
$r \in \mathcal{R}$ : A reward $r$ from the set $\mathcal{R}$
$\gamma$ : Discount factor
$R_t, X_{t+1} \sim P(\cdot, \cdot \mid X_t, A_t)$ : Joint probability distribution of the reward and next state in terms of the current state and action
$P_{\mathcal{X}}$ : Transition kernel
$P_{\mathcal{R}}$ : Reward distribution function
$\xi_0$ : Initial state distribution
$x_\perp$ : A terminal state
$N_{\mathcal{X}}$, $N_{\mathcal{A}}$, $N_{\mathcal{R}}$ : Size of the state, action, and reward spaces (when finite)
$\pi$ : A policy; usually stationary and Markov, mapping states to distributions over actions
$\pi^*$ : An optimal policy
$k$ : Iteration number or index of a sample trajectory
$t$ : Time step or time index
$(X_t, A_t, R_t)_{t \ge 0}$ : A trajectory of random variables for state, action, and reward produced through interaction with a Markov decision process
$X_{0:t-1}$ : The sequence of random variables $X_0, \dots, X_{t-1}$
$T$ : Length of an episode
$\mathbb{P}_\pi$ : The distribution over trajectories induced by a Markov decision process and a policy $\pi$
$\mathbb{E}_\pi$ : The expectation operator for the distribution over trajectories induced by $\mathbb{P}_\pi$
$G$ : A random-variable function or random return
$\mathrm{Var}$, $\mathrm{Var}_\pi$ : Variance of a distribution generally, and variance under the distribution $\mathbb{P}_\pi$
$V^\pi(x)$ : The value function for policy $\pi$ at state $x \in \mathcal{X}$
$Q^\pi(x, a)$ : The state-action value function for policy $\pi$ at state $x \in \mathcal{X}$ and taking action $a \in \mathcal{A}$
$Z \stackrel{D}{=} Z'$ : Equality in distribution of the two random variables $Z$, $Z'$
$D(Z \mid Y)$ : The conditional probability distribution of a random variable $Z$ given $Y$
$G^\pi$ : The random-variable function for policy $\pi$
$\eta$ : A return-distribution function
$\eta^\pi(x)$ : The return-distribution function for policy $\pi$ at state $x \in \mathcal{X}$
$f_\# \nu$ : Push-forward distribution, obtained by passing the distribution $\nu$ through the function $f$
$b_{r,\gamma}$ : Bootstrap function with reward $r$ and discount $\gamma$ (a standard form is given after this table)
$R_{\mathrm{MIN}}$, $R_{\mathrm{MAX}}$, $V_{\mathrm{MIN}}$, $V_{\mathrm{MAX}}$ : Minimum and maximum possible reward and return within an MDP
$N_k(x)$ : Number of visits to state $x \in \mathcal{X}$ up to but excluding iteration $k$
$m$ : Number of particles or parameters of the distribution representation
$\{\theta_1, \dots, \theta_m\}$ : Support of a categorical distribution representation, with $\theta_i < \theta_j$ for $i < j$
$\varsigma_m$ : The gap between consecutive locations of the support of a categorical representation with $m$ locations
$\hat{V}(x)$, $\hat{\eta}(x)$ : An estimate of the value function or return-distribution function at state $x$ under policy $\pi$
$\alpha$, $\alpha_k$ : The step size in an update expression, and the step size used for iteration $k$
$A \leftarrow B$ : Denotes updating the variable $A$ with the contents of variable $B$
$\Pi_{\mathrm{C}}$ : The categorical projection (Sections 3.5 and 5.6)
$\Pi_{\mathrm{Q}}$ : The quantile projection (Section 5.6)
$T^\pi$, $T$ : The policy-evaluation Bellman operator and the Bellman optimality operator, respectively
$\mathcal{T}^\pi$, $\mathcal{T}$ : The policy-evaluation distributional Bellman operator and the distributional optimality operator, respectively
$\mathcal{O}U$ : An operator $\mathcal{O}$ applied to a point $U \in M$, where $(M, d)$ is a metric space
$\|\cdot\|_\infty$ : Supremum norm on a vector space
$w_p$ : $p$-Wasserstein distance (a standard form is given after this table)
$\ell_p$ : $\ell_p$ distance between probability distributions
$\ell_2$ : Cramér distance
$\overline{d}$ : The supremum extension of a probability metric $d$ to return-distribution functions, where the supremum is taken over states
$\Gamma(\nu, \nu')$ : The set of couplings (joint probability distributions) of $\nu, \nu' \in \mathscr{P}(\mathbb{R})$
$\mathscr{P}_p(\mathbb{R})$ : The set of distributions with finite $p$th moments
$\mathscr{P}_d(\mathbb{R})$ : The set of distributions with finite $d$-distance to the distribution $\delta_0$ and finite first moment; also referred to as the finite domain of $d$
CVaR : Conditional value at risk
$\mathscr{F}$ : A probability distribution representation
$\mathscr{F}_E$ : Empirical probability distribution representation
$\mathscr{F}_N$ : Normal probability distribution representation, parameterised by mean and variance
$\mathscr{F}_{C,m}$ : $m$-categorical probability distribution representation
$\mathscr{F}_{Q,m}$ : $m$-quantile probability distribution representation
$\Pi_{\mathscr{F}}$ : A projection onto the probability distribution representation $\mathscr{F}$
$\lfloor z \rfloor$, $\lceil z \rceil$ : Floor and ceiling operations, mapping $z \in \mathbb{R}$ to the nearest integer that is less than or equal to $z$ (floor), or greater than or equal to $z$ (ceiling)
$d$ : A probability metric, typically used for the purposes of contraction analysis
$L_\tau(\theta)$ : Quantile regression loss function for target threshold $\tau \in (0, 1)$ and location estimate $\theta \in \mathbb{R}$ (a standard form is given after this table)
$\mathbb{1}\{u\}$ : An indicator function that takes the value 1 when $u$ is true and 0 otherwise; also $\mathbb{1}\{\cdot\}$
$J(\pi)$ : Objective function for a control problem
$\mathcal{G}$ : Greedy policy operator, which produces a policy that is greedy with respect to a given action-value function
$T_{\mathcal{G}}$, $\mathcal{T}_{\mathcal{G}}$ : The Bellman and distributional Bellman optimality operators derived from the greedy selection rule $\mathcal{G}$
$J_\rho(\pi)$ : A risk-sensitive control objective function, with risk measure $\rho$
$\psi$ : A statistical functional or sketch
$\xi^\pi$ : Steady-state distribution under policy $\pi$
$\phi(x)$ : State representation for state $x$; a mapping $\phi : \mathcal{X} \to \mathbb{R}^n$
$\Pi_{\phi,\xi}$ : Projection onto the linear subspace generated by $\phi$, with state weighting $\xi$
$M(\mathbb{R})$ : Space of signed probability measures over the reals
$\ell_{\xi,2}$ : Weighted Cramér distance over return-distribution functions, with state weighting given by $\xi$
$\Pi_{\phi,\xi,\ell_2}$ : Projection onto the linear subspace generated by $\phi$, minimising the $\ell_{\xi,2}$ distance
$L$ : Loss function
$H_\kappa$ : The Huber loss with threshold $\kappa > 0$ (a standard form is given after this table)
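The displays below spell out, in the notation above, how the bootstrap function and the push-forward combine; this is a standard form written for illustration rather than a definition quoted from the text.
\[
b_{r,\gamma}(z) = r + \gamma z,
\qquad
(b_{r,\gamma})_\# \nu(A) = \nu\big(\{ z \in \mathbb{R} : r + \gamma z \in A \}\big)
\quad \text{for measurable } A \subseteq \mathbb{R},
\]
so that $(b_{r,\gamma})_\# \nu$ is the distribution of $r + \gamma Z$ when $Z \sim \nu$.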
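For the metrics $w_p$, $\ell_p$, and $\ell_2$, the standard integral forms in terms of the CDF notation above are, for $1 \le p < \infty$,
\[
w_p(\nu, \nu') = \left( \int_0^1 \big| F_\nu^{-1}(u) - F_{\nu'}^{-1}(u) \big|^p \, \mathrm{d}u \right)^{1/p},
\qquad
\ell_p(\nu, \nu') = \left( \int_{\mathbb{R}} \big| F_\nu(z) - F_{\nu'}(z) \big|^p \, \mathrm{d}z \right)^{1/p},
\]
with the Cramér distance corresponding to the case $p = 2$ of $\ell_p$.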
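Finally, the quantile regression loss and the Huber loss admit the following standard forms; writing the quantile regression loss as an expectation over a target distribution $\nu$ is an illustrative choice rather than notation fixed by the table.
\[
L_\tau(\theta) = \mathbb{E}_{Z \sim \nu}\Big[ \big( \tau - \mathbb{1}\{ Z < \theta \} \big)(Z - \theta) \Big],
\qquad
H_\kappa(u) =
\begin{cases}
\tfrac{1}{2} u^2 & \text{if } |u| \le \kappa, \\
\kappa \big( |u| - \tfrac{1}{2}\kappa \big) & \text{otherwise,}
\end{cases}
\]
the former being minimised at $\theta = F_\nu^{-1}(\tau)$.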