Preface
The history of this book begins one evening in November 2016, when after an
especially unfruitful day of research, Will and Marc decided to try a different
approach to reinforcement learning. The idea took inspiration from the earlier
“Compress and Control” algorithm (Veness et al. 2015) and recent successes in
using classification algorithms to perform regression (van den Oord et al. 2016),
yet was unfamiliar, confusing, exhilarating. Working from one of the many
whiteboards in DeepMind offices at King’s Cross, there were many false starts
and much reinventing the wheel. But eventually C51, a distributional
reinforcement learning algorithm, came to be. The analysis of the distributional Bellman
operator proceeded in parallel with algorithmic development, and by the ICML
2017 deadline, there was a theorem regarding the contraction of this operator in
the Wasserstein distance and state-of-the-art performance at playing Atari 2600
video games. These results were swiftly followed by a second paper that aimed
to explain the fairly large gap between the contraction result and the actual C51
algorithm. The trio was completed when Mark joined for a summer internship,
and at that point came the first real theoretical results regarding distributional
reinforcement learning algorithms. The QR-DQN, Implicit Quantile Networks
(IQN), and expectile temporal-difference learning algorithms then followed. In
parallel, we also began studying how one could theoretically explain just why
distributional reinforcement learning led to better performance in large-scale
settings; the first results suggested that it should not, only deepening a
mystery that we continue to work to solve today.
One of the great pleasures of working on a book together has been to be able
to take the time to produce a more complete picture of the scientific ancestry
of distributional reinforcement learning. Bellman (1957b) himself expressed
in passing that quantities other than the expected return should be of interest;
Howard and Matheson (1972) considered the question explicitly. Earlier studies
focused on a single characteristic of the return distribution, often a criterion to be
optimized: for example, the variance of the return (Sobel 1982). Similarly, many
results in risk-sensitive reinforcement learning have focused on optimizing a
specific measure of risk, such as variance-penalized expectation (Mannor and
Tsitsiklis 2011) or conditional value-at-risk (Chow and Ghavamzadeh 2014).
Our contribution to this vast body of work is perhaps to treat these criteria
and characteristics in a more unified manner, focusing squarely on the return
distribution as the main object of interest, from which everything can be derived.
We see signs of this unified treatment paying off in answering related questions
(Chandak et al. 2021). Of course, we have only been able to get there because
of relatively recent advances in the study of probability metrics (Székely 2002;
Rachev et al. 2013), better tools with which to study recursive distributional
relationships (Rösler 1992; Rachev and Rüschendorf 1995), and key results
from stochastic approximation theory.
Our hope is that, by providing a more comprehensive treatment of distributional
reinforcement learning, we may pave the way for further developments in
sequential decision-making and reinforcement learning. The most immediate
effects should be seen in deep reinforcement learning, which has since that first
ICML paper used distributional predictions to improve performance across a
wide variety of problems, real and simulated. In particular, we are quite excited
to see how risk-sensitive reinforcement learning may improve the reliability
and effectiveness of reinforcement learning for robotics (Vecerik et al. 2019;
Bodnar et al. 2020; Cabi et al. 2020). Research in computational neuroscience
has already demonstrated the value of taking a distributional perspective, even
to explain biological phenomena (Dabney et al. 2020b). Eventually, we hope
that our work can generally help further our understanding of what it means for
an agent to interact with its environment.
In developing the material for this book, we have been immensely lucky to
work with a few esteemed mentors, collaborators, and students who were willing
to indulge us in the first steps of this journey. Rémi Munos was instrumental in
shaping this first project and helping us articulate its value to DeepMind and
the scientific community. Yee Whye Teh provided invaluable advice, pointers
to the statistics literature, and lodging, and eventually brought the three of us
together. Pablo Samuel Castro and Georg Ostrovski built, distilled, removed
technical hurdles, and served as the voice of reason. Clare Lyle, Philip Amortila,
Robert Dadashi, Saurabh Kumar, Nicolas Le Roux, John Martin, and Rosie
Zhao helped answer a fresh set of questions that we had until then lacked the
formal language to describe, eventually creating more problems than answers –
such is the way of science. Yunhao Tang and Harley Wiltzer graciously agreed
to be the first consumers of this book, and their feedback on all parts of the
notation, ideas, and manuscript has been invaluable.
We are very grateful for the excellent feedback – narrative and technical –
provided to us by Adam White and our anonymous reviewers, which allowed
us to make substantial improvements on the original draft. We thank Rich
Sutton, Andy Barto, Csaba Szepesvári, Kevin Murphy, Aaron Courville, Doina
Precup, Prakash Panangaden, David Silver, Joelle Pineau, and Dale Schuurmans
for discussions on book writing and for serving as role models in taking on
an effort larger than anything else we had previously done. We appreciate the
technical and conceptual input of many of our colleagues at Google, DeepMind,
Mila, and beyond: Bernardo Avila Pires, Jason Baldridge, Pierre-Luc Bacon,
Yoshua Bengio, Michael Bowling, Sal Candido, Peter Dayan, Thomas Degris,
Audrunas Gruslys, Hado van Hasselt, Shie Mannor, Volodymyr Mnih, Derek
Nowrouzezahrai, Adam Oberman, Bilal Piot, Tom Schaul, Danny Tarlow, and
Olivier Pietquin. We further thank the many people who reviewed parts of
this book and helped fill in some of the gaps in our knowledge: Yinlam Chow,
Erick Delage, Pierluca D’Oro, Doug Eck, Amir-massoud Farahmand, Jesse
Farebrother, Chris Finlay, Tadashi Kozuno, Hugo Larochelle, Elliot Ludvig,
Andrea Michi, Blake Richards, Daniel Slater, and Simone Totaro. We further
thank Vektor Dewanto, Tyler Kastner, Karolis Ramanauskas, Rylan Schaeffer,
Eugene Tarassov, and Jun Tian for their feedback on the online draft and the
COMP-579 students at McGill University for beta-testing our presentation of
the material. We were lucky to perform this research within DeepMind and
Google Brain, which provided support, both moral and material, and inspiration
to take on ever larger challenges. Finally, we thank Francis Bach, Elizabeth
Swayze, Matt Valades, and the team at MIT Press for championing this work
and making it a possibility.
Marc gives further thanks to Judy Loewen, Frédéric Lavoie, Jacqueline Smith,
Madeleine Fugère, Samantha Work, Damon MacLeod, and Andreas Fidjeland,
for support along the scientific journey, and to Lauren Busheikin, for being
an incredibly supportive partner. Further thanks go to CIFAR and the Mila
academic community for providing the fertile scientific ground from which the
writing of this book began.
Will wishes to additionally thank Zeb Kurth-Nelson and Matt Botvinick for
their patience and scientific rigor as we explored distributional reinforcement
learning in neuroscience; Koray Kavukcuoglu and Demis Hassabis for their
enthusiasm and encouragement surrounding the project; Rémi Munos for sup-
porting our pursuit of random, risky research ideas; and Blair Lyonev for being
a supportive partner, providing both encouragement and advice surrounding the
challenges of writing a book.
Mark would like to thank Maciej Dunajski, Andrew Thomason, Adrian
Weller, Krzysztof Choromanski, Rich Turner, and John Aston for their
supervision and mentorship, and his family and Kristin Goffe for all their
support.