The history of this book begins one evening in November 2016, when after an
especially unfruitful day of research, Will and Marc decided to try a different
approach to reinforcement learning. The idea took inspiration from the earlier
“Compress and Control” algorithm (Veness et al. 2015) and recent successes in
using classification algorithms to perform regression (van den Oord et al. 2016),
yet was unfamiliar, confusing, exhilarating. Working from one of the many
whiteboards in the DeepMind offices at King's Cross, they made many false starts
and did much reinventing of the wheel. But eventually C51, a distributional
reinforcement learning algorithm, came to be. The analysis of the distributional Bellman
operator proceeded in parallel with algorithmic development, and by the ICML
2017 deadline, there was a theorem regarding the contraction of this operator in
the Wasserstein distance and state-of-the-art performance at playing Atari 2600
video games. These results were swiftly followed by a second paper that aimed
to explain the fairly large gap between the contraction result and the actual C51
algorithm. The trio was completed when Mark joined for a summer internship,
and at that point the first real theoretical results came regarding distributional
reinforcement learning algorithms. The QR-DQN, Implicit Quantile Networks
(IQN), and expectile temporal-difference learning algorithms then followed. In
parallel, we also began studying how one could theoretically explain why
distributional reinforcement learning led to better performance in large-scale
settings; the first results suggested that it should not, only deepening a
mystery that we continue to work to solve today.
One of the great pleasures of working on a book together has been to be able
to take the time to produce a more complete picture of the scientific ancestry
of distributional reinforcement learning. Bellman (1957b) himself expressed
in passing that quantities other than the expected return should be of interest;
Howard and Matheson (1972) considered the question explicitly. Earlier studies
focused on a single characteristic of the return distribution, often a criterion to be
optimized: for example, the variance of the return (Sobel 1982). Similarly, many
results in risk-sensitive reinforcement learning have focused on optimizing a
specific measure of risk, such as variance-penalized expectation (Mannor and
Tsitsiklis 2011) or conditional-value-at-risk (Chow and Ghavamzadeh 2014).
Our contribution to this vast body of work is perhaps to treat these criteria
and characteristics in a more unified manner, focusing squarely on the return
distribution as the main object of interest, from which everything can be derived.
We see signs of this unified treatment paying off in answering related questions
(Chandak et al. 2021). Of course, we have only been able to get there because
of relatively recent advances in the study of probability metrics (Székely 2002;
Rachev et al. 2013), better tools with which to study recursive distributional
relationships (Rösler 1992; Rachev and Rüschendorf 1995), and key results
from stochastic approximation theory.
Our hope is that, by providing a more comprehensive treatment of distributional
reinforcement learning, we may pave the way for further developments in
sequential decision-making and reinforcement learning. The most immediate
effects should be seen in deep reinforcement learning, which has since that first
ICML paper used distributional predictions to improve performance across a
wide variety of problems, real and simulated. In particular, we are quite excited
to see how risk-sensitive reinforcement learning may improve the reliability
and effectiveness of reinforcement learning for robotics (Vecerik et al. 2019;
Bodnar et al. 2020; Cabi et al. 2020). Research in computational neuroscience
has already demonstrated the value of taking a distributional perspective, even
to explain biological phenomena (Dabney et al. 2020b). Eventually, we hope
that our work can generally help further our understanding of what it means for
an agent to interact with its environment.
In developing the material for this book, we have been immensely lucky to
work with a few esteemed mentors, collaborators, and students who were willing
to indulge us in the first steps of this journey. Rémi Munos was instrumental in
shaping this first project and helping us articulate its value to DeepMind and
the scientific community. Yee Whye Teh provided invaluable advice, pointers
to the statistics literature, and lodging, and eventually brought the three of us
together. Pablo Samuel Castro and Georg Ostrovski built, distilled, removed
technical hurdles, and served as the voice of reason. Clare Lyle, Philip Amortila,
Robert Dadashi, Saurabh Kumar, Nicolas Le Roux, John Martin, and Rosie
Zhao helped answer a fresh set of questions that we had until then lacked the
formal language to describe, eventually creating more problems than answers –
such is the way of science. Yunhao Tang and Harley Wiltzer gracefully accepted
to be the first consumers of this book, and their feedback on all parts of the
notation, ideas, and manuscript has been invaluable.
We are very grateful for the excellent feedback – narrative and technical –
provided to us by Adam White and our anonymous reviewers, which allowed
us to make substantial improvements on the original draft. We thank Rich
Sutton, Andy Barto, Csaba Szepesvári, Kevin Murphy, Aaron Courville, Doina
Precup, Prakash Panangaden, David Silver, Joelle Pineau, and Dale Schuurmans
for discussions on book writing and for serving as role models in taking on
an effort larger than anything else we had previously done. We appreciate the
technical and conceptual input of many of our colleagues at Google, DeepMind,
Mila, and beyond: Bernardo Avila Pires, Jason Baldridge, Pierre-Luc Bacon,
Yoshua Bengio, Michael Bowling, Sal Candido, Peter Dayan, Thomas Degris,
Audrunas Gruslys, Hado van Hasselt, Shie Mannor, Volodymyr Mnih, Derek
Nowrouzezahrai, Adam Oberman, Bilal Piot, Tom Schaul, Danny Tarlow, and
Olivier Pietquin. We further thank the many people who reviewed parts of
this book and helped fill in some of the gaps in our knowledge: Yinlam Chow,
Erick Delage, Pierluca D’Oro, Doug Eck, Amir-massoud Farahmand, Jesse
Farebrother, Chris Finlay, Tadashi Kozuno, Hugo Larochelle, Elliot Ludvig,
Andrea Michi, Blake Richards, Daniel Slater, and Simone Totaro. We further
thank Vektor Dewanto, Tyler Kastner, Karolis Ramanauskas, Rylan Schaeffer,
Eugene Tarassov, and Jun Tian for their feedback on the online draft and the
COMP-579 students at McGill University for beta-testing our presentation of
the material. We were lucky to perform this research within DeepMind and
Google Brain, which provided support, both moral and material, and the
inspiration to take on ever larger challenges. Finally, we thank Francis Bach, Elizabeth
Swayze, Matt Valades, and the team at MIT Press for championing this work
and making it a possibility.
Marc gives further thanks to Judy Loewen, Frédéric Lavoie, Jacqueline Smith,
Madeleine Fugère, Samantha Work, Damon MacLeod, and Andreas Fidjeland,
for support along the scientific journey, and to Lauren Busheikin, for being
an incredibly supportive partner. Further thanks go to CIFAR and the Mila
academic community for providing the fertile scientific ground from which the
writing of this book began.
Will wishes to additionally thank Zeb Kurth-Nelson and Matt Botvinick for
their patience and scientific rigor as we explored distributional reinforcement
learning in neuroscience; Koray Kavukcuoglu and Demis Hassabis for their
enthusiasm and encouragement surrounding the project; Rémi Munos for sup-
porting our pursuit of random, risky research ideas; and Blair Lyonev for being
a supportive partner, providing both encouragement and advice surrounding the
challenges of writing a book.
Mark would like to thank Maciej Dunajski, Andrew Thomason, Adrian
Weller, Krzysztof Choromanski, Rich Turner, and John Aston for their
supervision and mentorship, and his family and Kristin Goffe for all their
support.