as a challenge domain for artificial intelligence. Early results on the Arcade
Learning Environment included both reinforcement learning [Bellemare et al.,
2012a,b] and planning [Bellemare et al., 2013b, Lipovetzky et al., 2015] solu-
tions. The DQN algorithm demonstrated the ability of deep neural networks
to effectively tackle this domain [Mnih et al., 2015]. Since then, deep rein-
forcement learning has been applied to produce high-performing policies for a
variety of video games and image-based control problems [e.g. Beattie et al.,
2016, Levine et al., 2016, Kempka et al., 2016, Bhonker et al., 2017, Cobbe
et al., 2020]. Machado et al. [2018] study the relative performance of linear
and deep methods in the context of the Arcade Learning Environment. See
François-Lavet et al. [2018] and Arulkumaran et al. [2017] for reviews of deep
reinforcement learning and Graesser and Keng [2019] for a practical overview.
Montfort and Bogost [2009] give an excellent history of the Atari 2600 video
game console itself.
10.2–10.3.
The C51, QR-DQN, and IQN agent architectures and algorithms
were respectively introduced by Bellemare et al. [2017b], Dabney et al. [2018b],
and Dabney et al. [2018a]. Open-source implementations of these three algo-
rithms are available in the Dopamine framework [Castro et al., 2018] and the
DQN Zoo [Quan and Ostrovski, 2020]. The idea of implicitly parametrising
other arguments of the prediction function has been used extensively to deal
with continuous actions; see e.g. Lillicrap et al. [2016b] and Barth-Maron et al.
[2018].
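To make this idea concrete, the sketch below shows one way a value network can take the quantile level τ as an additional input, in the spirit of IQN's cosine embedding of τ. It is a minimal illustration only: the layer sizes and names are our own, PyTorch is used purely for concreteness, and the code is not drawn from any published implementation.

import math

import torch
import torch.nn as nn


class ImplicitQuantileHead(nn.Module):
    # Illustrative head mapping (state features, quantile level tau) to one
    # estimated return quantile per action, in the style of IQN; all sizes
    # and names here are assumptions made for exposition.
    def __init__(self, feature_dim: int, num_actions: int, embedding_dim: int = 64):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.tau_embedding = nn.Linear(embedding_dim, feature_dim)
        self.output = nn.Linear(feature_dim, num_actions)

    def forward(self, state_features: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        # state_features: [batch, feature_dim]; tau: [batch, num_taus], values in (0, 1).
        i = torch.arange(1, self.embedding_dim + 1,
                         dtype=torch.float32, device=tau.device)
        # Cosine embedding of tau, followed by a learned projection and a ReLU.
        phi = torch.relu(self.tau_embedding(torch.cos(math.pi * i * tau.unsqueeze(-1))))
        # An elementwise product mixes the tau embedding with the state features,
        # so a single network represents a whole family of quantile predictions.
        return self.output(state_features.unsqueeze(1) * phi)  # [batch, num_taus, num_actions]

The same mechanism applies when the extra argument is a continuous action rather than a quantile level: the action is embedded and combined with the state features, and the network is queried at whichever argument values are needed.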
There is by now a wide variety of deep distributional reinforcement learn-
ing algorithms, many of which outperform IQN. FQF [Yang et al., 2019]
approximates the return distribution with a weighted combination of Diracs by
combining the IQN architecture with a method for selecting which values of
τ ∈ (0, 1) to feed into the network. MM-DQN [Nguyen et al., 2021] uses an
architecture based on QR-DQN in combination with an MMD-based loss as
described in Chapter 4; the Gaussian kernel has typically been found to provide the best empirical performance, in spite of a lack of theoretical guarantees.
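As a rough illustration of an MMD-based objective of this kind, the sketch below estimates the squared MMD between two sets of return particles under a Gaussian kernel. The bandwidth, function names, and structure are our own choices for exposition and are not taken from the MM-DQN implementation.

import torch


def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, bandwidth: float) -> torch.Tensor:
    # x: [m], y: [n]; returns the [m, n] matrix exp(-(x_i - y_j)^2 / (2 h^2)).
    diff = x.unsqueeze(1) - y.unsqueeze(0)
    return torch.exp(-diff.pow(2) / (2.0 * bandwidth ** 2))


def squared_mmd(pred: torch.Tensor, target: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    # Plug-in (biased) estimate of the squared MMD between the empirical
    # distributions supported on the particles `pred` and `target`.
    k_pp = gaussian_kernel(pred, pred, bandwidth).mean()
    k_tt = gaussian_kernel(target, target, bandwidth).mean()
    k_pt = gaussian_kernel(pred, target, bandwidth).mean()
    return k_pp + k_tt - 2.0 * k_pt

In an agent of this kind, pred would typically hold the particles produced by the online network and target the Bellman targets computed from a target network, treated as constants when differentiating the loss.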
Both Freirich et al. [2019] and Doan et al. [2018] propose the use of generative
adversarial networks [Goodfellow et al., 2014] to model the return distribution.
Freirich et al. also extend this approach to the case of multivariate rewards.
There are also several recent modifications to the QR-DQN architecture that
seek to address the quantile-crossing problem; namely, that the outputs of the
QR-DQN network need not satisfy the natural monotonicity constraints of
distribution quantiles. Yue et al. [2020] propose to use deep generative models
combined with a post-processing sorting step to obtain monotonic quantile
estimates. Zhou et al. [2021] parametrise the difference between successive