Notation
Notation                            Description

ℝ                                   The set of real numbers
ℕ, ℕ⁺                               The set of natural numbers including (excluding) zero: {0, 1, 2, ...}
ℙ                                   The probability of one or many random variables producing the given outcomes
𝒫(Y)                                The space of probability distributions over a set Y
F_ν, F_ν^{-1}                       The cumulative distribution function (CDF) and inverse CDF, respectively, for distribution ν
δ_θ                                 Dirac delta distribution at θ ∈ ℝ, a probability distribution that assigns probability 1 to outcome θ
N(µ, σ²)                            Normal distribution with mean µ and variance σ²
U([a, b])                           Uniform distribution over [a, b], with a, b ∈ ℝ
U({a, b, ...})                      Uniform distribution over the set {a, b, ...}
Z ~ ν                               The random variable Z, with probability distribution ν
z, Z                                Capital letters generally denote random variables and lowercase letters their realizations or expectations. Notable exceptions are V, Q, and P
x ∈ 𝒳                               A state x in the state space 𝒳
a ∈ 𝒜                               An action a in the action space 𝒜
r ∈ ℛ                               A reward r from the set ℛ
γ                                   Discount factor
R_t, X_{t+1} ~ P(·, · | X_t, A_t)   Joint probability distribution of the reward and next state in terms of the current state and action
P_𝒳                                 Transition kernel
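As a concrete example of the CDF and inverse CDF notation, take ν = U([a, b]) with a < b; a standard computation gives
\[
F_\nu(z) = \frac{z - a}{b - a} \quad \text{for } z \in [a, b],
\qquad
F_\nu^{-1}(\tau) = a + \tau (b - a) \quad \text{for } \tau \in (0, 1),
\]
so that ℙ(Z ≤ F_ν^{-1}(τ)) = τ for Z ~ ν.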
P_ℛ                                 Reward distribution function
ξ_0                                 Initial state distribution
x                                   A terminal state
N_𝒳, N_𝒜, N_ℛ                       Size of the state, action, and reward spaces (when finite)
π                                   A policy; usually stationary and Markov, mapping states to distributions over actions
π*                                  An optimal policy
k                                   Iteration number or index of a sample trajectory
t                                   Time step or time index
(X_t, A_t, R_t)_{t≥0}               A trajectory of random variables for state, action, and reward produced through interaction with a Markov decision process
X_{0:t-1}                           A sequence of random variables
T                                   Length of an episode
ℙ_π                                 The distribution over trajectories induced by a Markov decision process and a policy π
𝔼_π                                 The expectation operator for the distribution over trajectories induced by ℙ_π
G                                   A random-variable function or random return
Var, Var_π                          Variance of a distribution generally and variance under the distribution ℙ_π
V^π(x)                              The value function for policy π at state x ∈ 𝒳
Q^π(x, a)                           The state-action value function for policy π at state x ∈ 𝒳 and taking action a ∈ 𝒜
Z =_D Z′                            Equality in distribution of two random variables Z, Z′
𝒟(Z | Y)                            The conditional probability distribution of a random variable Z given Y
G^π                                 The random-variable function for policy π
η                                   A return-distribution function
η^π(x)                              The return-distribution function for policy π at state x ∈ 𝒳
f_# ν                               Pushforward distribution obtained by passing distribution ν through the function f
b_{r,γ}                             Bootstrap function with reward r and discount γ
R_min, R_max, V_min, V_max          Minimum and maximum possible reward and return within an MDP
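As a brief illustration of the pushforward and bootstrap notation (taking the bootstrap function to act as b_{r,γ}(z) = r + γz, in keeping with its description above),
\[
Z \sim \nu \quad \Longrightarrow \quad r + \gamma Z \;\sim\; (b_{r,\gamma})_{\#}\, \nu ;
\]
this composition of bootstrap and pushforward is the building block from which the distributional Bellman operators below are assembled.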
N_k(x)                              Number of visits to state x ∈ 𝒳 up to but excluding iteration k
m                                   Number of particles or parameters of the distribution representation
{θ_1, ..., θ_m}                     Support of a categorical distribution representation, with θ_i < θ_j for i < j
ς_m                                 The gap between consecutive locations for the support of a categorical representation with m locations
V̂^π(x), η̂^π(x)                      An estimate of the value function or return-distribution function at state x under policy π
α, α_k                              The step size in an update expression and the step size used for iteration k
A ← B                               Denotes updating the variable A with the contents of variable B
Π_c                                 The categorical projection (Sections 3.5 and 5.6)
Π_q                                 The quantile projection (Section 5.6)
T^π, T*                             The policy-evaluation Bellman operator and Bellman optimality operator, respectively
𝒯^π, 𝒯*                             The policy-evaluation distributional Bellman operator and distributional optimality operator, respectively
OU                                  An operator O applied to a point U ∈ M, where (M, d) is a metric space
‖·‖                                 Supremum norm on a vector space
w_p                                 p-Wasserstein distance
ℓ_p                                 ℓ_p distance between probability distributions
ℓ_2                                 Cramér distance
d̄                                   The supremum extension of a probability metric d to return-distribution functions, where the supremum is taken over states
Γ(ν, ν′)                            The set of couplings (joint probability distributions) of ν, ν′ ∈ 𝒫(ℝ)
𝒫_p(ℝ)                              The set of distributions with finite pth moments
𝒫_d(ℝ)                              The set of distributions with finite d-distance to the distribution δ_0 and finite first moment. Also referred to as the finite domain of d
CVaR                                Conditional value at risk
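For reference, the w_p and ℓ_p metrics above admit the following standard expressions in terms of inverse CDFs and CDFs, respectively, for p ∈ [1, ∞):
\[
w_p(\nu, \nu') = \left( \int_0^1 \bigl| F_\nu^{-1}(\tau) - F_{\nu'}^{-1}(\tau) \bigr|^p \, \mathrm{d}\tau \right)^{1/p},
\qquad
\ell_p(\nu, \nu') = \left( \int_{\mathbb{R}} \bigl| F_\nu(z) - F_{\nu'}(z) \bigr|^p \, \mathrm{d}z \right)^{1/p} ;
\]
the Cramér distance ℓ_2 is the case p = 2.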
ℱ                                   A probability distribution representation
ℱ_E                                 Empirical probability distribution representation
ℱ_N                                 Normal probability distribution representation, parameterized by mean and variance
ℱ_{C,m}                             m-categorical probability distribution representation
ℱ_{Q,m}                             m-quantile probability distribution representation
Π_ℱ                                 A projection onto the probability distribution representation ℱ
⌊z⌋, ⌈z⌉                            Floor and ceiling operations, mapping z ∈ ℝ to the nearest integer less than or equal to z (floor) or greater than or equal to z (ceiling)
d                                   A probability metric, typically used for the purposes of contraction analysis
L_τ(θ)                              Quantile regression loss function for target threshold τ ∈ (0, 1) and location estimate θ ∈ ℝ
𝟙{u}                                An indicator function that takes the value 1 when u is true and 0 otherwise; also written 𝟙{·}
J(π)                                Objective function for a control problem
𝒢                                   Greedy policy operator; produces a policy that is greedy with respect to a given action-value function
T^𝒢, 𝒯^𝒢                            The Bellman and distributional Bellman optimality operators derived from greedy selection rule 𝒢
J_ρ(π)                              A risk-sensitive control objective function, with risk measure ρ
ψ                                   A statistical functional or sketch
ξ^π                                 Steady-state distribution under policy π
φ(x)                                State representation for state x, a mapping φ : 𝒳 → ℝ^n
Π_{φ,ξ}                             Projection onto the linear subspace generated by φ, with state weighting ξ
M(ℝ)                                Space of signed probability measures over the reals
ℓ_{ξ,2}                             Weighted Cramér distance over return-distribution functions, with state weighting given by ξ
Π_{φ,ξ,2}                           Projection onto the linear subspace generated by φ, minimizing the ℓ_{ξ,2} distance
L                                   Loss function
H_κ                                 The Huber loss with threshold κ > 0
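For reference, the quantile regression loss and Huber loss above are commonly written as
\[
L_\tau(\theta) = \mathbb{E}_{Z \sim \nu}\bigl[ \bigl( \tau - \mathbb{1}\{ Z < \theta \} \bigr)\,( Z - \theta ) \bigr],
\qquad
H_\kappa(u) =
\begin{cases}
\tfrac{1}{2} u^2 & \text{if } |u| \le \kappa, \\
\kappa \bigl( |u| - \tfrac{1}{2}\kappa \bigr) & \text{otherwise,}
\end{cases}
\]
where ν is the distribution whose τth quantile is being estimated; a minimizer of L_τ over θ ∈ ℝ is given by the quantile F_ν^{-1}(τ).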