1 Introduction

A hallmark of intelligence is the ability to adapt behaviour to reﬂect external

feedback. In reinforcement learning, this feedback is provided as a real-valued

quantity called the reward. Stubbing one’s toe on the dining table or forgetting

soup on the stove are situations associated with negative reward, while (for

some of us) the ﬁrst cup of coffee of the day is associated with positive reward.

We are interested in agents that seek to maximise their cumulative reward – or

return – obtained from interactions with an environment. An agent maximises

its return by making decisions that either have immediate positive consequences,

or steer it into a desirable state. A particular assignment of reward to states and

decisions determines the agent’s objective. For example, in the game of Go

the objective is represented by a positive reward for winning. Meanwhile, the

objective of keeping a helicopter in ﬂight is represented by a per-step negative

reward (typically expressed as a cost) proportional to how much the aircraft

deviates from a desired ﬂight path. In this case, the agent’s return is the total

cost accrued over the duration of the ﬂight.

Often, a decision will have uncertain consequences. Travellers know that it is

almost impossible to guarantee that a trip will go as planned, even though a

three-hour layover is usually more than enough to catch a connecting ﬂight.

Nor are all decisions equal: transiting through Chicago O’Hare may be a riskier

choice than transiting through Toronto Pearson. To model this uncertainty,

reinforcement learning introduces an element of chance to the rewards and to

the effects of the agent’s decisions on its environment. Because the return is the

sum of rewards received along the way, it too is random.

Historically, most of the ﬁeld’s efforts have gone towards modelling the mean of

the random return. Doing so is useful, as it allows us to make the right decisions:

when we talk of “maximising the cumulative reward”, we typically mean

“maximising the expected return”. The idea has deep roots in probability, the


law of large numbers, and subjective utility theory. In fact, most reinforcement

learning textbooks axiomatise the maximisation of expectation. Quoting Richard

Bellman, for example:

The general idea, and this is fairly unanimously accepted, is to use some average of

the possible outcomes as a measure of the value of a policy.

This book takes the perspective that modelling the expected return alone fails to

account for many complex, interesting phenomena that arise from interactions

with one’s environment. This is evident in many of the decisions that we make:

the ﬁrst rule of investment states that expected proﬁts should be weighed

against volatility. Similarly, lottery tickets offer negative expected returns but

attract buyers with the promise of a high payoff. During a snowstorm, relying

on the average frequency at which buses arrive at a stop is likely to lead to

disappointment. More generally, hazards big and small result in a wide range

of possible returns, each with its own probability of occurrence. These returns

and their probabilities can be collectively described by a return distribution, our

main object of study.

1.1 Why Distributional Reinforcement Learning?

Just as a colour photograph conveys more information about a scene than a

black and white photograph, the return distribution contains more information

about the consequences of the agent’s decisions than the expected return. The

expected return is a scalar, while the return distribution is inﬁnite-dimensional;

it is possible to compute the expectation of the return from its distribution, but

not the other way around (to continue the analogy, one cannot recover hue from

luminance).

By considering the return distribution, rather than just the expected return,

we gain a fresh perspective on the fundamental problems of reinforcement

learning. This includes an understanding of how optimal decisions should be

made, methods for creating effective representations of an agent’s state, and

the consequences of interacting with other learning agents. In fact, many of the

tools we develop here are useful beyond reinforcement learning and decision

making. We call the process of computing return distributions distributional

dynamic programming. Incremental algorithms for learning return distributions,

such as quantile temporal-difference learning (Chapter 6), are likely to find uses

wherever probabilities need to be estimated.

Throughout this book, we will encounter many examples in which we use the

tools of distributional reinforcement learning to characterise the consequences

of the agent’s choices. This is a modelling decision, rather than a reﬂection of

some underlying truth about these examples.

Figure 1.1
(a) In Kuhn poker, each player is dealt one card and then bets on whether they hold the highest card. The diagram depicts one particular play through; the house's card (bottom) is hidden until betting is over. (b) A game tree with all possible states shaded according to their frequency of occurrence in our example. Leaf nodes depict immediate gains and losses, which we equate with different values of the received reward.

From a theoretical perspective,

justifying our use of the distributional model requires us to make a number of

probabilistic assumptions. These include the notion that the random nature of

the interactions is intrinsically irreducible (what is sometimes called aleatoric

uncertainty) and unchanging. As we encounter these examples, the reader is

invited to reﬂect on these assumptions, and their effect on the learning process.

Furthermore, there are many situations in which such assumptions do not

completely hold but where distributional reinforcement learning still provides

a rich picture of how the environment operates. For example, an environment

may appear random because some parts of it are not described to the agent (it

is said to be partially observable), or because the environment changes over

the course of the agent’s interactions with it (multi-agent learning, the topic

of Section 11.1, is an example of this). Changes in the agent itself, such as a

change in behaviour, also introduce non-stationarity in the observed data. In

practice, we have found that the return distributions are a valuable reﬂection

of the underlying phenomena, even when there is no aleatoric uncertainty at

play. Put another way, the usefulness of distributional reinforcement learning

methods does not end with the theorems that characterise them.

1.2 An Example: Kuhn Poker

Kuhn poker is a simpliﬁed variant of the well-known card game. It is played

with three cards of a single suit (jack, queen, and king) and over a single round

of betting, as depicted in Figure 1.1a. Each player is dealt one card and must

ﬁrst bet a ﬁxed ante, which for the purpose of this example we will take to be

$1. After looking at their card, the first player decides whether to raise, which doubles their bet, or check. In response to a raise, the second player can call and match the new bet or fold and lose their $1 ante. If the first player chooses to check instead (keep the bet as-is), the option to raise is given to the second player, symmetrically. If neither player folded, the player with the higher card wins the pot ($1 or $2, depending on whether the ante was raised). Figure 1.1b visualises a single play of the game as a 55-state game tree.

Figure 1.2
Distribution over winnings for the player after playing T rounds. For T = 1, this corresponds to the distribution of immediate gains and losses. For T = 5, we see a single mode appear roughly centred on the expected winnings. For larger T, two additional modes appear, one in which the agent goes bankrupt and one where the player has successfully doubled their stake. As T → ∞, only these two outcomes have non-zero probability.

Consider a player who begins with $10 and plays a total of up to T hands of

Kuhn poker, stopping early if they go bankrupt or double their initial stake.

To keep things simple, we assume that this player always goes ﬁrst and that

their opponent, the house, makes decisions uniformly at random. The player’s

strategy depends on the card they are dealt, and also incorporates an element of

randomness. There are two situations in which a choice must be made: whether

to raise or check at ﬁrst, and whether to call or fold when the other player raises.

The following table of probabilities describes a concrete strategy as a function

of the player’s dealt card:

Holding a ...             Jack    Queen    King
Probability of raising    1/3     0        1
Probability of calling    0       2/3      1

If we associate a positive and negative reward with each round’s gains or losses,

then the agent’s random return corresponds to their total winnings at the end of

the T rounds, and ranges from -10 to 10.

How likely is the player to go bankrupt? How long does it take before the player

is more likely to be ahead than not? What is the mean and variance of the

player’s winnings after T = 15 hands have been played? These three questions


(and more) can be answered by using distributional dynamic programming to

determine the distribution of returns obtained after T rounds. Figure 1.2 shows

the distribution of winnings (change in money held by the player) as a function

of T. After the ﬁrst round (T = 1) the most likely outcome is to have lost $1, but

the expected reward is positive. Consequently, over time the player is likely to

be able to achieve their objective. By the ﬁfteenth round the player is much more

likely to have doubled their money than to have gone broke, with a bell-shaped

distribution of values in-between. If the game is allowed to continue until the

end, the player has either gone bankrupt or doubled their stake. In our example,

the probability that the player comes out a winner is approximately 85%.
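For readers who would like to reproduce this computation, the sketch below enumerates the single-hand winnings distribution implied by the strategy table and the uniformly random house, then propagates it over the player's bankroll with a simple dynamic programme. It is an illustrative sketch written for this introduction rather than the book's reference code; in particular, clamping the bankroll at $0 and $20, and assuming every hand can be played at full stakes, are simplifications made here.

```python
# A minimal sketch of the Kuhn poker example above (not the book's reference
# implementation). Assumptions: the house acts uniformly at random, the bankroll
# is clamped at $0 and $20, and every hand can be played at full stakes.
import itertools
from collections import defaultdict

RAISE_PROB = {"J": 1/3, "Q": 0.0, "K": 1.0}  # player raises holding this card
CALL_PROB = {"J": 0.0, "Q": 2/3, "K": 1.0}   # player calls a raise holding this card
RANK = {"J": 0, "Q": 1, "K": 2}

def hand_distribution():
    """Probability of each single-hand outcome (the player's winnings in dollars)."""
    dist = defaultdict(float)
    deals = list(itertools.permutations("JQK", 2))       # (player card, house card)
    for player, house in deals:
        p_deal = 1.0 / len(deals)
        sign = 1 if RANK[player] > RANK[house] else -1   # showdown outcome for the player
        pr, pc = RAISE_PROB[player], CALL_PROB[player]
        # Player raises; the house calls or folds with equal probability.
        dist[1] += p_deal * pr * 0.5                     # house folds: win its ante
        dist[2 * sign] += p_deal * pr * 0.5              # house calls: showdown for $2
        # Player checks; the house raises or checks with equal probability.
        dist[sign] += p_deal * (1 - pr) * 0.5            # both check: showdown for $1
        dist[-1] += p_deal * (1 - pr) * 0.5 * (1 - pc)   # player folds to the house's raise
        dist[2 * sign] += p_deal * (1 - pr) * 0.5 * pc   # player calls: showdown for $2
    return dict(dist)

def bankroll_distribution(T, start=10, low=0, high=20):
    """Distribution over the bankroll after up to T hands, stopping at low or high."""
    hand = hand_distribution()
    dist = {start: 1.0}
    for _ in range(T):
        new = defaultdict(float)
        for money, p in dist.items():
            if money <= low or money >= high:            # absorbed: bankrupt or doubled up
                new[money] += p
            else:
                for delta, q in hand.items():
                    new[min(max(money + delta, low), high)] += p * q
        dist = dict(new)
    return dist

if __name__ == "__main__":
    for T in (1, 5, 15, 200):
        dist = bankroll_distribution(T)
        print(T, {money - 10: round(p, 3) for money, p in sorted(dist.items())})
```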

1.3 How Is Distributional Reinforcement Learning Different?

In reinforcement learning, the value function describes the expected return

that one would counterfactually obtain from beginning in any given state. It is

reasonable to say that its fundamental object of interest – the expected return

– is a scalar, and that algorithms that operate on value functions operate on

collections of scalars (one per state). On the other hand, the fundamental object

of distributional reinforcement learning is a probability distribution over returns:

the return distribution. The return distribution characterises the probability of

different returns that can be obtained as an agent interacts with its environment

from a given state. Distributional reinforcement learning algorithms operate on

collections of probability distributions that we call return-distribution functions

(or simply return functions).
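To make the distinction concrete, the short sketch below (with an invented two-state example) contrasts the two objects: a value function stores one scalar per state, while a return-distribution function stores one probability distribution per state.

```python
# A type-level sketch of the distinction above; the states and numbers are
# invented for illustration.
from typing import Dict, List, Tuple

State = str

# A return-distribution function: one distribution over returns per state,
# here represented as a list of (return, probability) pairs.
return_function: Dict[State, List[Tuple[float, float]]] = {
    "x1": [(-1.0, 0.25), (0.0, 0.25), (2.0, 0.5)],
    "x2": [(-2.0, 0.5), (0.0, 0.5)],
}

# The corresponding value function: one scalar (the expected return) per state.
# The expectation is recoverable from the distribution, but not the reverse.
value_function: Dict[State, float] = {
    state: sum(g * p for g, p in dist) for state, dist in return_function.items()
}
```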

More than a simple type substitution, going from scalars to probability dis-

tributions results in changes across the spectrum of reinforcement learning

topics.

In distributional reinforcement learning, equations relating scalars become

equations relating random variables. For example, the Bellman equation states

that the expected return at a state x, denoted V^π(x), equals the expectation of the immediate reward R, plus the discounted expected return at the next state X':

$$
V^\pi(x) = \mathbb{E}_\pi\big[\, R + \gamma V^\pi(X') \mid X = x \,\big].
$$

Here π is the agent's policy – a description of how it chooses actions in different

states. By contrast, the distributional Bellman equation states that the random

return at a state x, denoted G^π(x), is itself related to the random immediate reward and the random next-state return according to a distributional equation:¹

$$
G^\pi(x) \overset{D}{=} R + \gamma\, G^\pi(X'), \qquad X = x.
$$

In this case, G^π(x), R, X', and G^π(X') are random variables, and the superscript D indicates equality between their distributions. Correctly interpreting the distributional Bellman equation requires identifying the dependency between random variables, in particular between R and X'. It also requires understanding how discounting affects the probability distribution of G^π(x), and how to manipulate the collection of random variables G^π implied by the definition.

1. Later we will consider a form that equates probability distributions directly.
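As a concrete, if toy, illustration of the distributional equation, the sketch below compares samples of the return G(x) with samples of R + γG(X') on a small two-state chain, where R and X' are drawn jointly. The chain, its rewards, and the discount factor are invented for this example.

```python
# A toy numerical check of the distributional Bellman equation. The two-state
# chain, its rewards, and the discount factor are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.9

def step(x):
    """One transition from state x; note that R and X' are drawn jointly (correlated)."""
    if x == 1:
        return 0.0, 1              # state 1 is absorbing, with zero reward
    if rng.random() < 0.5:
        return 1.0, 1              # reward +1 and move to the absorbing state
    return -1.0, 0                 # reward -1 and remain in state 0

def sample_return(x, horizon=100):
    """Sample the discounted return G(x) by rolling the chain forward."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        if x == 1:                 # no further reward can be collected
            break
        r, x = step(x)
        g += discount * r
        discount *= GAMMA
    return g

n = 20_000
lhs = np.array([sample_return(0) for _ in range(n)])             # samples of G(0)
rhs = np.array([r + GAMMA * sample_return(x_next)                # samples of R + gamma * G(X')
                for r, x_next in (step(0) for _ in range(n))])

# The two empirical distributions should agree (up to sampling error);
# here we compare a few of their quantiles.
print(np.quantile(lhs, [0.1, 0.5, 0.9]))
print(np.quantile(rhs, [0.1, 0.5, 0.9]))
```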

Another change concerns how we quantify the behaviour of learning algorithms,

and how we measure the quality of an agent’s predictions. Because value func-

tions are real-valued vectors, the distance between a value function estimate and

the desired expected return is measured as the absolute difference between those

two quantities. On the other hand, when analysing a distributional reinforcement

learning algorithm, we must instead measure the distance between probability

distributions using a probability metric. As we will see, some probability met-

rics are better suited to distributional reinforcement learning than others, but

no single metric can be identiﬁed as the “natural” metric for comparing return

distributions.
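The point can be made concrete with a small example: two return distributions with the same mean are indistinguishable by the absolute difference between expected returns, yet a probability metric such as the 1-Wasserstein distance (one of the metrics studied later in the book) separates them. The two distributions below are invented for the illustration.

```python
# Two invented return distributions with equal means, compared by the absolute
# difference of their expectations and by the 1-Wasserstein probability metric.
import numpy as np

support_a, probs_a = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
support_b, probs_b = np.array([-10.0, 10.0]), np.array([0.5, 0.5])

mean_gap = abs(np.dot(support_a, probs_a) - np.dot(support_b, probs_b))  # 0.0

# 1-Wasserstein distance: integrate the absolute difference between the two
# cumulative distribution functions over a fine grid.
grid = np.linspace(-15.0, 15.0, 6001)
dz = grid[1] - grid[0]
cdf_a = np.array([probs_a[support_a <= z].sum() for z in grid])
cdf_b = np.array([probs_b[support_b <= z].sum() for z in grid])
wasserstein_gap = float(np.sum(np.abs(cdf_a - cdf_b)) * dz)              # approximately 9.0

print(mean_gap, wasserstein_gap)
```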

Implementing distributional reinforcement learning algorithms also poses some

concrete computational challenges. In general, the return distribution is sup-

ported on a range of possible returns and its shape can be quite complex. To

represent this distribution with a finite number of parameters, some approximation is necessary; the practitioner is faced with a variety of choices and

trade-offs. One approach is to discretise the support of the distribution uniformly

and assign a variable probability to each interval, what we call the categorical

representation. Another is to represent the distribution using a ﬁnite number

of uniformly-weighted particles whose locations are parameterised, called the

quantile representation. In practice and in theory, we ﬁnd that the choice of dis-

tribution representation impacts the quality of the return function approximation

and also the ease with which it can be computed.
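As a rough picture of the two representations just named, the sketch below contrasts their parameters; the number of atoms, the support bounds, and the array layout are choices made for this example rather than a prescribed interface.

```python
# An illustrative sketch of the categorical and quantile representations; the
# sizes and bounds are arbitrary choices made for this example.
import numpy as np

num_atoms = 51

# Categorical representation: a fixed, uniformly spaced support; the probability
# assigned to each location is the learned parameter.
categorical_support = np.linspace(-10.0, 10.0, num_atoms)   # fixed
categorical_probs = np.full(num_atoms, 1.0 / num_atoms)     # learned

# Quantile representation: a fixed number of uniformly weighted particles; the
# particle locations are the learned parameters.
quantile_weights = np.full(num_atoms, 1.0 / num_atoms)      # fixed
quantile_locations = np.zeros(num_atoms)                    # learned

# Either representation lets us read off statistics such as the expected return.
mean_categorical = float(np.dot(categorical_support, categorical_probs))
mean_quantile = float(quantile_locations.mean())
```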

Learning return distributions from sampled experience is also more challenging

than learning to predict expected returns. The issue is particularly acute when

learning proceeds by bootstrapping, i.e. when the return function estimate at

one state is learned on the basis of the estimate at successor states. When the

return function estimates are deﬁned by a deep neural network, as is common in


practice, one must also take care in choosing a loss function that is compatible

with a stochastic gradient descent scheme.
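As one concrete possibility, used by the categorical approach developed in later chapters, the prediction can be parameterised as a softmax over a fixed support and trained with a cross-entropy loss against a bootstrapped target distribution, whose gradient is well suited to stochastic gradient descent. The numbers below are invented for illustration.

```python
# A sketch of a cross-entropy loss between a predicted categorical return
# distribution (parameterised by logits) and a bootstrapped target distribution.
# The numbers are invented for illustration.
import numpy as np

predicted_logits = np.array([0.2, -0.1, 0.4, 0.0, -0.3])  # one logit per return atom
target_probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])         # bootstrapped target distribution

log_probs = predicted_logits - np.log(np.sum(np.exp(predicted_logits)))  # log-softmax
loss = -np.dot(target_probs, log_probs)                    # cross-entropy loss
grad_logits = np.exp(log_probs) - target_probs             # gradient with respect to the logits

print(loss, grad_logits)
```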

For an agent that only knows about expected returns, it is natural (almost

necessary) to deﬁne optimal behaviour in terms of maximising this quantity.

The Q-learning algorithm, which performs credit assignment by maximising

over state-action values, learns a policy with exactly this objective in mind.

Knowledge of the return function, however, allows us to deﬁne behaviours

that depend on the full distributions of returns – what is called risk-sensitive

reinforcement learning. For example, it may be desirable to act so as to avoid

states that carry a high probability of failure, or penalise decisions that have

high variance. In many circumstances, distributional reinforcement learning

enables behaviour that is more robust to variations and, perhaps, better suited to

real-world applications.
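A simple instance of such a criterion, computed directly from a return distribution, is the mean penalised by the standard deviation; the numbers and the trade-off parameter below are invented for illustration.

```python
# A mean-minus-standard-deviation criterion computed from an invented return
# distribution; the risk penalty is an arbitrary choice for the example.
import numpy as np

returns = np.array([-2.0, 0.0, 1.0, 3.0])   # possible returns for some decision
probs = np.array([0.1, 0.3, 0.4, 0.2])      # their probabilities

mean = float(np.dot(returns, probs))
variance = float(np.dot((returns - mean) ** 2, probs))
risk_penalty = 1.0                          # trade-off between mean and variability

risk_sensitive_value = mean - risk_penalty * np.sqrt(variance)
print(mean, risk_sensitive_value)
```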

1.4 Intended Audience and Organisation

This book is intended for advanced undergraduates, graduate students, and

researchers who have some exposure to reinforcement learning and are inter-

ested in understanding its distributional counterpart. We present core ideas from

classical reinforcement learning as they are needed to contextualise distribu-

tional topics, but often omit longer discussions and a presentation of specialised

methods in order to keep the exposition concise. The reader wishing a more

in-depth review of classical reinforcement learning is invited to consult one

of the literature’s many excellent books on the topic, including Bertsekas and

Tsitsiklis [1996], Szepesvári [2010], Bertsekas [2012], Puterman [2014], Sutton

and Barto [2018], Meyn [2022].

Already, an exhaustive treatment of distributional reinforcement learning would

require a substantially larger book. Instead, here we emphasise key concepts and

challenges of working with return distributions, in a mathematical language that

aims to be both technically correct and easily applied. Our choice of topics

is driven by practical considerations (such as scalability in terms of available

computational resources), a topic’s relative maturity, and our own domains of

expertise. In particular, this book contains only one chapter about what is com-

monly called the control problem, and focuses on dynamic programming and

temporal-difference algorithms over Monte Carlo methods. Where appropriate,

in the bibliographical remarks we provide references on these omitted topics.

In general, we chose to include proofs when they pertain to major results in the

chapter, or are instructive in their own right. We defer the proof of a number of

smaller results to exercises.


Each chapter of this book is structured like a hiking trail.² The first sections

(the “foothills”) introduce a concept from classical reinforcement learning and

extend it to the distributional setting. Here, a knowledge of undergraduate-level

probability theory and computer science usually sufﬁce. Later sections (the

“incline”) dive into more technical points, for example a proof of convergence

or more complex algorithms. These may be skipped without affecting the

reader’s understanding of the fundamentals of distributional reinforcement

learning. Finally, most chapters end on a few additional results or remarks that

are interesting yet easily omitted (the “side trail”). These are indicated by an

asterisk (*). For the latter part of the chapter's journey, the reader may wish to

come equipped with tools from advanced probability theory; our own references

are Billingsley [2012] and Williams [1991].

2. Based on one of the authors' experience hiking around Banff, Canada.

The book is divided into three parts. The ﬁrst part introduces the building

blocks of distributional reinforcement learning. We begin by introducing our

fundamental objects of study, the return distribution and the distributional

Bellman equation (Chapter 2). Chapter 3 then introduces categorical temporal-

difference learning, a simple algorithm for learning return distributions. By

the end of Chapter 3, the reader should understand the basic principles of

distributional reinforcement learning, and should be able to use it in simple

practical settings.

The second part develops the theory of distributional reinforcement learning.

Chapter 4 introduces a language for measuring distances between return dis-

tributions, and operators for transforming these distributions. Chapter 5

introduces the notion of a probability representation, needed to implement

distributional reinforcement learning; it subsequently considers the problem of

computing and approximating return distributions using such representations,

introducing the framework of distributional dynamic programming. Chapter 6

studies how return distributions can be learned from samples and in an incre-

mental fashion, giving a formal construction of categorical temporal-difference

learning as well as other algorithms such as quantile temporal-difference learn-

ing. Chapter 7 extends these ideas to the setting of optimal decision making

(also called the control setting). Finally, Chapter 8 introduces a different perspec-

tive on distributional reinforcement learning based on the notion of statistical

functionals. By the end of the second part, the reader should understand the

challenges that arise when designing distributional reinforcement learning

algorithms, and the available tools to address these challenges.


The third and ﬁnal part develops distributional reinforcement learning for

practical scenarios. Chapter 9 reviews the principles of linear value function

approximation and extends these ideas to the distributional setting. Chapter 10

discusses how to combine distributional methods with deep neural networks

to obtain algorithms for deep reinforcement learning. Chapter 11 discusses the

emerging use of distributional reinforcement learning in two further domains of

research (multi-agent learning and neuroscience) and concludes.

Code for examples and exercises, as well as standard implementations of some

of the algorithms presented here, can be found at http://distributional-rl.org.

1.5 Bibliographical Remarks

1.0. The quote is due to Bellman [1957a].

1.2. Kuhn poker is due to Kuhn [1950], who gave an exhaustive characterisation

of the game’s Nash equilibria. The player’s strategy used in the main text forms

part of such a Nash equilibrium.