Notation

Notation                Description

R                       The set of real numbers
N, N_+                  The set of natural numbers including (excluding) zero: {0, 1, 2, ...}
P                       The probability of one or more random variables producing the given outcomes
P(Y)                    The space of probability distributions over a set Y
F_ν, F_ν^{-1}           The cumulative distribution function (CDF) and inverse CDF, respectively, for distribution ν
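For a distribution ν represented by equally weighted samples, the CDF F_ν and its generalized inverse take a simple form. A minimal sketch (function names here are illustrative, not from the text):

```python
# Sketch: F_nu and its inverse for an empirical distribution given by
# equally weighted samples.
import math

def empirical_cdf(samples, z):
    """F_nu(z) = P(Z <= z) under the empirical distribution."""
    return sum(1 for v in samples if v <= z) / len(samples)

def empirical_inverse_cdf(samples, tau):
    """F_nu^{-1}(tau): smallest z with F_nu(z) >= tau, for tau in (0, 1]."""
    s = sorted(samples)
    idx = math.ceil(tau * len(s)) - 1
    return s[max(idx, 0)]

samples = [0.0, 1.0, 2.0, 3.0]
print(empirical_cdf(samples, 1.5))          # 0.5
print(empirical_inverse_cdf(samples, 0.5))  # 1.0
```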

δ_θ                     The Dirac delta distribution at θ ∈ R, a probability distribution that assigns probability 1 to the outcome θ
N(µ, σ²)                Normal distribution with mean µ and variance σ²
U([a, b])               Uniform distribution over [a, b], with a, b ∈ R
U({a, b, ...})          Uniform distribution over the set {a, b, ...}
Z ∼ ν                   The random variable Z, with probability distribution ν
z, Z                    Capital letters generally denote random variables and lowercase letters their realizations or expectations; notable exceptions are V, Q, and P
x ∈ X                   A state x in the state space X
a ∈ A                   An action a in the action space A
r ∈ R                   A reward r from the set R
γ                       Discount factor
R_t, X_{t+1} ∼ P(·, · | X_t, A_t)
                        Joint probability distribution of the reward and next state in terms of the current state and action
P_X                     Transition kernel

Draft version.

P_R                     Reward distribution function
ξ_0                     Initial state distribution
x_∅                     A terminal state
N_X, N_A, N_R           Size of the state, action, and reward spaces (when finite)
π                       A policy; usually stationary and Markov, mapping states to distributions over actions
π*                      An optimal policy
k                       Iteration number or index of a sample trajectory
t                       Time step or time index
(X_t, A_t, R_t)_{t≥0}   A trajectory of random variables for state, action, and reward produced through interaction with a Markov decision process
X_{0:t-1}               A sequence of random variables
T                       Length of an episode
P_π                     The distribution over trajectories induced by a Markov decision process and a policy π
E_π                     The expectation operator for the distribution over trajectories induced by P_π
G                       A random-variable function or random return
Var, Var_π              Variance of a distribution generally, and variance under the distribution P_π
V^π(x)                  The value function for policy π at state x ∈ X
Q^π(x, a)               The state-action value function for policy π at state x ∈ X and taking action a ∈ A
Z =_D Z'                Equality in distribution of two random variables Z, Z'
D(Z | Y)                The conditional probability distribution of a random variable Z given Y
G^π                     The random-variable function for policy π
η                       A return-distribution function
η^π(x)                  The return-distribution function for policy π at state x ∈ X
f#ν                     Pushforward distribution, obtained by passing the distribution ν through the function f
b_{r,γ}                 Bootstrap function with reward r and discount γ
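The pushforward f#ν and the bootstrap function b_{r,γ} combine naturally: (b_{r,γ})#ν is the distribution of r + γZ for Z ∼ ν. A minimal sketch for a particle-based representation (function names are mine, not the book's):

```python
# Sketch: pushforward f#nu of a particle-based distribution nu, and the
# bootstrap function b_{r,gamma}(z) = r + gamma * z.

def pushforward(f, particles):
    """f#nu: apply f to every outcome; particle weights are unchanged."""
    return [f(z) for z in particles]

def bootstrap(r, gamma):
    """Return the function b_{r,gamma}: z -> r + gamma * z."""
    return lambda z: r + gamma * z

nu = [0.0, 2.0, 4.0]  # equally weighted particles
print(pushforward(bootstrap(1.0, 0.5), nu))  # [1.0, 2.0, 3.0]
```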

R_min, R_max, V_min, V_max
                        Minimum and maximum possible reward and return within an MDP
N_k(x)                  Number of visits to state x ∈ X up to but excluding iteration k
m                       Number of particles or parameters of the distribution representation
{θ_1, ..., θ_m}         Support of a categorical distribution representation, with θ_i < θ_j for i < j
ς_m                     The gap between consecutive locations in the support of a categorical representation with m locations
V̂^π(x), η̂^π(x)          An estimate of the value function or return-distribution function at state x under policy π
α, α_k                  The step size in an update expression, and the step size used for iteration k
A ← B                   Denotes updating the variable A with the contents of variable B
Π_c                     The categorical projection (Sections 3.5 and 5.6)
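The categorical projection Π_c can be illustrated for its simplest case: projecting a single Dirac delta δ_z onto an evenly spaced support {θ_1, ..., θ_m}, splitting its mass between the two nearest support points. This sketch assumes even spacing and a Dirac input; the full operator applies this to each particle of a distribution and sums the results.

```python
# Sketch: project a Dirac delta at z onto an evenly spaced support.
# Returns the probability assigned to each support location.

def project_dirac(z, thetas):
    m = len(thetas)
    gap = thetas[1] - thetas[0]              # varsigma_m: support spacing
    z = min(max(z, thetas[0]), thetas[-1])   # clip to [theta_1, theta_m]
    pos = (z - thetas[0]) / gap              # fractional index into support
    lo = int(pos)
    hi = min(lo + 1, m - 1)
    p = [0.0] * m
    if hi == lo:                             # z sits exactly on the last point
        p[lo] = 1.0
    else:
        p[lo] = hi - pos                     # mass to the left neighbor
        p[hi] = pos - lo                     # mass to the right neighbor
    return p

print(project_dirac(0.25, [0.0, 1.0, 2.0]))  # [0.75, 0.25, 0.0]
```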

Π_q                     The quantile projection (Section 5.6)
T^π, T                  The policy-evaluation Bellman operator and Bellman optimality operator, respectively
𝒯^π, 𝒯                  The policy-evaluation distributional Bellman operator and distributional optimality operator, respectively
OU                      An operator O applied to a point U ∈ M, where (M, d) is a metric space
‖·‖_∞                   Supremum norm on a vector space
w_p                     The p-Wasserstein distance
ℓ_p                     The ℓ_p distance between probability distributions
ℓ_2                     Cramér distance
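On the real line these metrics have simple finite-sample forms: w_p between two distributions given by n equally weighted samples each reduces to comparing sorted samples, and ℓ_2 integrates the squared CDF difference, which is a finite sum for categorical distributions on a common support. A sketch under those assumptions (function names are mine):

```python
# Sketch: p-Wasserstein distance between equal-size, equally weighted
# samples, and the Cramér (l_2) distance between categorical
# distributions on a common evenly spaced support.
import math

def wasserstein_p(xs, ys, p):
    """w_p via the sorted-samples identity for distributions on R."""
    xs, ys = sorted(xs), sorted(ys)
    n = len(xs)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / n) ** (1 / p)

def cramer(ps, qs, gap):
    """l_2 between probability vectors ps, qs on a support with spacing gap."""
    total, fp, fq = 0.0, 0.0, 0.0
    for p_i, q_i in zip(ps, qs):
        fp += p_i                      # running CDF of the first distribution
        fq += q_i                      # running CDF of the second
        total += (fp - fq) ** 2 * gap  # constant between support points
    return math.sqrt(total)

print(wasserstein_p([0.0, 1.0], [1.0, 2.0], 1))  # 1.0
print(cramer([1.0, 0.0], [0.0, 1.0], 1.0))       # 1.0 (delta_0 vs. delta_1)
```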

d̄                       The supremum extension of a probability metric d to return-distribution functions, where the supremum is taken over states
Γ(ν, ν')                The set of couplings (joint probability distributions) of ν, ν' ∈ P(R)
P_p(R)                  The set of distributions with finite pth moments
P_d(R)                  The set of distributions with finite d-distance to the distribution δ_0 and finite first moment; also referred to as the finite domain of d
CVaR                    Conditional value at risk
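A common sample-based estimate of CVaR at level τ is the mean of the worst τ-fraction of outcomes, where "worst" here means lowest returns. A minimal sketch (illustrative only; this is one of several equivalent estimators):

```python
# Sketch: CVaR at level tau estimated from samples as the average of
# the lowest ceil(tau * n) outcomes.
import math

def cvar(samples, tau):
    s = sorted(samples)                  # ascending: worst outcomes first
    k = max(1, math.ceil(tau * len(s)))  # number of tail samples
    return sum(s[:k]) / k

print(cvar([1.0, 2.0, 3.0, 4.0], 0.5))  # 1.5
```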

F                       A probability distribution representation
F_E                     Empirical probability distribution representation
F_N                     Normal probability distribution representation, parameterized by mean and variance
F_{C,m}                 m-categorical probability distribution representation
F_{Q,m}                 m-quantile probability distribution representation
Π_F                     A projection onto the probability distribution representation F
⌊z⌋, ⌈z⌉                Floor and ceiling operations, mapping z ∈ R to the nearest integer less than or equal to z (floor) or greater than or equal to z (ceiling)
d                       A probability metric, typically used for the purposes of contraction analysis
L_τ(θ)                  Quantile regression loss function for target threshold τ ∈ (0, 1) and location estimate θ ∈ R
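The quantile regression loss is an asymmetric absolute loss: averaging it over samples of Z and minimizing over θ recovers the τ-quantile of Z. A per-sample sketch (illustrative code, not taken from the text):

```python
# Sketch: quantile regression loss for a single sample z and estimate
# theta. Underestimates (theta < z) are weighted by tau, overestimates
# by 1 - tau.

def quantile_loss(theta, z, tau):
    """(tau - 1{z - theta < 0}) * (z - theta)."""
    u = z - theta
    return (tau - (1.0 if u < 0 else 0.0)) * u

print(quantile_loss(0.0, 1.0, 0.75))  # 0.75 (theta below z, weight tau)
print(quantile_loss(2.0, 1.0, 0.75))  # 0.25 (theta above z, weight 1 - tau)
```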

𝟙{u}                    An indicator function that takes the value 1 when u is true and 0 otherwise; also written 𝟙{·}
J(π)                    Objective function for a control problem
G                       Greedy policy operator; produces a policy that is greedy with respect to a given action-value function
T^G, 𝒯^G                The Bellman and distributional Bellman optimality operators derived from greedy selection rule G
J_ρ(π)                  A risk-sensitive control objective function, with risk measure ρ
ψ                       A statistical functional or sketch
ξ^π                     Steady-state distribution under policy π
φ(x)                    State representation for state x, a mapping φ : X → R^n
Π_{φ,ξ}                 Projection onto the linear subspace generated by φ, with state weighting ξ
M(R)                    Space of signed probability measures over the reals
ℓ_{ξ,2}                 Weighted Cramér distance over return-distribution functions, with state weighting given by ξ
Π_{φ,ξ,2}               Projection onto the linear subspace generated by φ, minimizing the ℓ_{ξ,2} distance
L                       Loss function
H_κ                     The Huber loss with threshold κ ≥ 0
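The Huber loss H_κ is quadratic for errors of magnitude at most κ and linear beyond, which bounds the gradient for large errors. A minimal sketch (one common parameterization; the book may scale it differently):

```python
# Sketch: Huber loss with threshold kappa, quadratic near zero and
# linear in the tails.

def huber(u, kappa):
    """H_kappa(u): 0.5*u^2 for |u| <= kappa, else kappa*(|u| - 0.5*kappa)."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)

print(huber(0.5, 1.0))  # 0.125 (quadratic branch)
print(huber(2.0, 1.0))  # 1.5   (linear branch)
```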
