11.3 Conclusion
The act of learning is fundamentally an anticipatory activity. It allows us to
deduce that eating certain kinds of foods might be hazardous to our health, and
consequently avoid them. It helps the footballer decide how to kick the ball into
the opposing team’s net, and the goalkeeper to prepare for the save before the
kick is even made. It informs us that studying leads to better grades; experience
teaches us to avoid the highway at rush hour. In a rich, complex world, many
phenomena carry an element of unpredictability, which in reinforcement learning
we model as randomness. In that respect, learning to predict the full range of
possible outcomes – the return distribution – is only natural: it improves our
understanding of the environment “for free”, in the sense that it can be done in
parallel with the usual learning of expected returns.
For the authors of this book, the roots of the distributional perspective lie
in deep reinforcement learning, as a technique for obtaining more accurate
representations of the world. By now, it is clear that this is but one potential
application. Distributional reinforcement learning has proven useful in settings
far beyond what was expected, including in modelling the behaviours of co-evolving
agents and the dynamics of dopaminergic neurons. We expect this trend to
continue, and look forward to seeing its greater application in mathematical
finance, engineering, and life sciences. We hope this book will provide a sturdy
foundation on which these ideas can be built.
11.4 Bibliographical Remarks
11.1. Game theory and the study of multi-agent interactions form a research discipline that dates back almost a century [von Neumann, 1928, Morgenstern and von Neumann, 1944]. Shoham and Leyton-Brown [2009] provide a modern
summary of a wide range of topics relating to multi-agent interactions, and
Oliehoek and Amato [2016] provide a recent overview from a reinforcement
learning perspective. The MMDP model described here was introduced by
Boutilier [1996], and forms a special case of the general class of Markov games
[Shapley, 1953, van der Wal, 1981, Littman, 1994]. A commonly encountered
generalisation of the MMDP is the Dec-POMDP [Bernstein et al., 2002], which
also allows for partial observations of the state. Lauer and Riedmiller [2000]
propose an optimistic algorithm with convergence guarantees in deterministic
MMDPs, and many other (non-distributional) approaches to decentralised con-
trol in MMDPs have since been considered in the literature (see e.g. Bowling
and Veloso [2002], Panait et al. [2003, 2006], Matignon et al. [2007, 2012], Wei
and Luke [2016]), including in combination with deep reinforcement learning
[Tampuu et al., 2017, Omidshafiei et al., 2017, Palmer et al., 2018, 2019]. There