The history of this book begins one evening in November 2016, when after an
especially unfruitful day of research Will and Marc decided to try a different
approach to reinforcement learning. The idea took inspiration from the earlier
“Compress and Control” algorithm [Veness et al., 2015] and recent successes
in using classification algorithms to perform regression [Van den Oord et al.,
2016], yet was unfamiliar, confusing, exhilarating. Working from one of the
many whiteboards in the DeepMind offices at King’s Cross, the two made many
false starts and reinvented the wheel more than once. But eventually C51, a distributional
reinforcement learning algorithm, came to be. The analysis of the distributional
Bellman operator proceeded in parallel with algorithmic development, and
by the ICML 2017 deadline there was a theorem regarding the contraction
of this operator in the Wasserstein distance, and state-of-the-art performance
at playing Atari 2600 video games. These results were swiftly followed by a
second paper that aimed to explain the fairly large gap between the contraction
result and the actual C51 algorithm. The trio was completed when Mark joined
for a summer internship, and at that point the first real theoretical results
came regarding distributional reinforcement learning algorithms. The QR-DQN,
Implicit Quantile Networks (IQN), and expectile temporal-difference learning
algorithms then followed. In parallel, we also began studying how one could
theoretically explain just why distributional reinforcement learning led to
better performance in large-scale settings; the first results suggested that it
should not, only deepening a mystery that we continue to work to solve today.
One of the great pleasures of working on a book together has been to be able
to take the time to produce a more complete picture of the scientific ancestry
of distributional reinforcement learning. Bellman [1957a] himself expressed
in passing that quantities other than the expected return should be of interest;
Howard and Matheson [1972] considered the question explicitly. Earlier studies
focused on a single characteristic of the return distribution, often a criterion to be
optimised, for example the variance of the return [Sobel, 1982]. Similarly, many
results in risk-sensitive reinforcement learning have focused on optimising a
specific measure of risk, such as variance-penalised expectation [Mannor and
Tsitsiklis, 2011] or conditional-value-at-risk [Chow and Ghavamzadeh, 2014].
Our contribution to this vast body of work is perhaps to treat these criteria
and characteristics in a more unified manner, focusing squarely on the return
distribution as the main object of interest, from which everything can be derived.
We see signs of this unified treatment paying off in answering related questions
[Chandak et al., 2021]. Of course, we have only been able to get there because
of relatively recent advances in the study of probability metrics [Székely, 2002,
Rachev et al., 2013], better tools with which to study recursive distributional
relationships [Rösler, 1992, Rachev and Rüschendorf, 1995], and key results
from stochastic approximation theory.
Our hope is that, by providing a more comprehensive treatment of distributional
reinforcement learning, we may pave the way for further developments in sequential
decision-making and reinforcement learning. The most immediate effects should
be seen in deep reinforcement learning, which has since that first ICML paper
used distributional predictions to improve performance across a wide variety of
problems, real and simulated. In particular, we are quite excited to see how
risk-sensitive reinforcement learning may improve the reliability and effectiveness
of reinforcement learning for robotics [Vecerik et al., 2019, Bodnar et al.,
2020, Cabi et al., 2020]. Research in computational neuroscience has already
demonstrated the value of taking a distributional perspective, even to explain
biological phenomena [Dabney et al., 2020b]. Eventually, we hope that our
work can generally help further our understanding of what it means for an agent
to interact with its environment.
In developing the material for this book, we have been immensely lucky to work
with a few esteemed mentors, collaborators, and students who were willing to
indulge us in the first steps of this journey. Rémi Munos was instrumental in
shaping this first project and helping us articulate its value to DeepMind and the
scientific community. Yee Whye Teh provided invaluable advice, pointers to
the statistics literature, lodging, and eventually brought the three of us together.
Pablo Samuel Castro and Georg Ostrovski built, distilled, removed technical
hurdles, and served as the voice of reason. Clare Lyle, Philip Amortila, Robert
Dadashi, Saurabh Kumar, Nicolas Le Roux, John Martin, and Rosie Zhao
helped answer a fresh set of questions that we had until then lacked the formal
language to describe, eventually creating more problems than answers; such is
the way of science. Yunhao Tang and Harley Wiltzer graciously agreed to be
the first consumers of this book, and their feedback on all parts of the notation,
ideas, and manuscript has been invaluable.
We are very grateful for the excellent feedback, both narrative and technical,
provided to us by Adam White and our anonymous reviewers, which allowed us
to make substantial improvements on the original draft. We thank Rich Sutton,
Andy Barto, Csaba Szepesvári, Kevin Murphy, Aaron Courville, Doina Precup,
Prakash Panangaden, David Silver, Joelle Pineau, and Dale Schuurmans for
discussions on book-writing and for serving as role models in taking on an effort
larger than anything else we had previously done. We appreciate the technical
and conceptual input of many of our colleagues at Google, DeepMind, Mila,
and beyond: Pierre-Luc Bacon, Hado van Hasselt, Thomas Degris, Tom Schaul,
Adam Oberman, Derek Nowrouzezahrai, Danny Tarlow, Bernardo Avila Pires,
Bilal Piot, Audrunas Gruslys, Volodymyr Mnih, Shie Mannor, Yoshua Bengio,
Sal Candido, Olivier Pietquin, Michael Bowling, and Jason Baldridge. We
further thank the many people who reviewed parts of this book, and helped fill in
some of the gaps in our knowledge: Blake Richards, Chris Finlay, Yinlam Chow,
Erick Delage, Elliot Ludvig, Amir massoud Farahmand, Jesse Farebrother,
Pierluca D’Oro, Simone Totaro, Tadashi Kozuno, Andrea Michi, Daniel Slater,
Tyler Kastner, Rylan Schaeffer, Karolis Ramanauskas, Jun Tian, Doug Eck, and
Hugo Larochelle. Finally, we thank Francis Bach, Elizabeth Swayze, and the
team at MIT Press for championing this work and making it a possibility.
Marc gives further thanks to Judy Loewen, Frédéric Lavoie, Jacqueline Smith,
Madeleine Fugère, Samantha Work, Damon MacLeod, and Andreas Fidjeland,
for support along the scientific journey; and to Lauren Busheikin, for being an
incredibly supportive partner for over a decade. Further thanks go to CIFAR
and the Mila academic community for providing the fertile scientific ground
from which the writing of this book began, and DeepMind and Google Brain
for providing support and inspiration to take on ever larger challenges.
Will wishes to additionally thank Zeb Kurth-Nelson and Matt Botvinick for
their patience and scientific rigor as we explored distributional RL in
neuroscience; Koray Kavukcuoglu and Demis Hassabis for their enthusiasm and
encouragement surrounding the project; Rémi Munos for supporting our pursuit
of random, risky research ideas; and Blair Lyonev for being a supportive partner,
providing both encouragement and advice surrounding the challenges of writing
a book.
Mark would like to thank Maciej Dunajski, Andrew Thomason, Adrian Weller,
Krzysztof Choromanski, Rich Turner, and John Aston for their supervision and
mentorship, and his family and Kristin Goffe for all their support.