References
Mastane Achab. Ranking and risk-aware reinforcement learning. PhD thesis, Institut
Polytechnique de Paris, 2020.
Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspec-
tive on offline reinforcement learning. In Proceedings of the International Conference
on Machine Learning, 2020.
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G.
Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In
Advances in Neural Information Processing Systems, 2021.
D. J. Aigner, Takeshi Amemiya, and Dale J. Poirier. On the estimation of production
frontiers: Maximum likelihood estimation of the parameters of a discontinuous density
function. International Economic Review, 17(2):377–96, 1976.
David J. Aldous and Antar Bandyopadhyay. A survey of max-type recursive
distributional equations. The Annals of Applied Probability, 15(2):1047–1110, 2005.
Gerold Alsmeyer. Random recursive equations and their distributional fixed points.
Lecture notes, 2012.
Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: In metric spaces
and in the space of probability measures. Springer Science & Business Media, 2005.
Philip Amortila, Marc G. Bellemare, Prakash Panangaden, and Doina Precup. Tempo-
rally extended metrics for Markov decision processes. In SafeAI: AAAI Workshop on
Artificial Intelligence Safety, 2019.
Philip Amortila, Doina Precup, Prakash Panangaden, and Marc G. Bellemare. A
distributional analysis of sampling-based reinforcement learning algorithms. In
Proceedings of the International Conference on Artificial Intelligence and Statistics,
2020.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In
Proceedings of the International Conference on Machine Learning, 2017.
Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent
measures of risk. Mathematical finance, 9(3):203–228, 1999.
Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, David Heath, and Hyejin Ku.
Coherent multiperiod risk adjusted values and Bellman’s principle. Annals of
Operations Research, 152(1):5–22, 2007.
Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath.
A brief survey of deep reinforcement learning. IEEE Signal Processing Magazine,
Special Issue on Deep Learning for Image Understanding, 2017.
Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local
minima for single neurons. In Advances in Neural Information Processing Systems,
1995.
Mohammad Gheshlaghi Azar, Rémi Munos, Mohammad Ghavamzadeh, and Hilbert J.
Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems,
2011.
Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. On the sample
complexity of reinforcement learning with a generative model. In Proceedings of the
International Conference on Machine Learning, 2012.
Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC
bounds on the sample complexity of reinforcement learning with a generative model.
Machine learning, 91(3):325–349, 2013.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds
for reinforcement learning. In Proceedings of the International Conference on
Machine Learning, 2017.
Frederico A.C. Azevedo, Ludmila R.B. Carvalho, Lea T. Grinberg, José Marcelo
Farfel, Renata E.L. Ferretti, Renata E.P. Leite, Wilson Jacob Filho, Roberto Lent, and
Suzana Herculano-Houzel. Equal numbers of neuronal and nonneuronal cells make
the human brain an isometrically scaled-up primate brain. Journal of Comparative
Neurology, 513(5):532–541, 2009.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine transla-
tion by jointly learning to align and translate. In Proceedings of the International
Conference on Learning Representations, 2015.
Leemon C. Baird. Residual algorithms: Reinforcement learning with function approx-
imation. In Proceedings of the International Conference on Machine Learning,
1995.
Leemon C. Baird. Reinforcement learning through gradient descent. PhD thesis,
Carnegie Mellon University, 1999.
Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux
équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap,
Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph
Modayil, Greg Wayne, Hubert Soyer, Fabio Viola, Brian Zhang, Ross Goroshin, Neil
Rabinowitz, Razvan Pascanu, Charlie Beattie, Stig Petersen, Amir Sadik, Stephen
Gaffney, Helen King, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, and Dhar-
shan Kumaran. Vector-based navigation using grid-like representations in artificial
agents. Nature, 557(7705):429–433, 2018.
André Barbeau. Drugs affecting movement disorders. Annual Review of Pharmacol-
ogy, 14(1):91–113, 1974.
Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis
Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain
Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling.
The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:
103216, 2020.
Etienne Barnard. Temporal-difference methods and Markov models. IEEE Transac-
tions on Systems, Man, and Cybernetics, 1993.
André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van
Hasselt, and David Silver. Successor features for transfer in reinforcement learning.
In Advances in Neural Information Processing Systems, 2017.
Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan,
Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed
distributional deterministic policy gradients. In Proceedings of the International
Conference on Learning Representations, 2018.
Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive
elements that can solve difficult learning control problems. IEEE Transactions on
Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983.
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using
real-time dynamic programming. Artificial intelligence, 72(1-2):81–138, 1995.
Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk
criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright,
Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian
Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton,
Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen.
DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.
Marc G. Bellemare, Joel Veness, and Michael Bowling. Investigating contingency
awareness using Atari 2600 games. In Proceedings of the Twenty-Sixth AAAI
Conference on Artificial Intelligence, 2012a.
Marc G. Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value
function approximation. In Advances in Neural Information Processing Systems,
2012b.
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade
Learning Environment: An evaluation platform for general agents. Journal of Artificial
Intelligence Research, 47:253–279, June 2013a.
Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recur-
sively factored environments. In Proceedings of the International Conference on
Machine Learning, 2013b.
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade
Learning Environment: An evaluation platform for general agents, extended abstract.
In European Workshop on Reinforcement Learning, 2015.
Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos.
Increasing the action gap: New operators for reinforcement learning. In Proceedings
of the AAAI Conference on Artificial Intelligence, 2016.
Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on
reinforcement learning. In Proceedings of the International Conference on Machine
Learning, 2017a.
Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshmi-
narayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to
biased Wasserstein gradients. arXiv, 2017b.
Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel
Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A
geometric perspective on optimal representations for reinforcement learning. In
Advances in Neural Information Processing Systems, 2019a.
Marc G. Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra.
Distributional reinforcement learning with linear function approximation. In Pro-
ceedings of the International Conference on Artificial Intelligence and Statistics,
2019b.
Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C.
Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous
navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):
77–82, 2020.
Fabio Bellini and Elena Di Bernardino. Risk management with expectiles. The
European Journal of Finance, 23(6):487–506, 2017.
Fabio Bellini, Bernhard Klar, Alfred Müller, and Emanuela Rosazza Gianin. General-
ized quantiles as risk measures. Insurance: Mathematics and Economics, 54:41–48,
2014.
Richard Bellman. Dynamic Programming. Dover Publications, 1957a.
Richard E. Bellman. A Markovian decision process. Journal of Mathematics and
Mechanics, 6(5), 1957b.
Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and
stochastic approximations, volume 22. Springer Science & Business Media, 2012.
Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The
complexity of decentralized control of Markov decision processes. Mathematics of
operations research, 27(4):819–840, 2002.
Dimitri P. Bertsekas. Generic rank-one corrections for value iteration in Markovian
decision problems. Technical report, Massachusetts Institute of Technology, 1994.
Dimitri P. Bertsekas. A counterexample to temporal differences learning. Neural
computation, 7(2):270–279, 1995.
Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods.
Journal of Control Theory and Applications, 9(3):310–335, 2011.
Dimitri P. Bertsekas. Dynamic programming and optimal control, volume 2. Athena
Scientific, 4th edition, 2012.
Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration
and applications in neuro-dynamic programming. Technical report, Massachusetts
Institute of Technology, 1996.
Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena
Scientific, 1996.
Jalaj Bhandari and Daniel Russo. On the linear convergence of policy gradient meth-
ods for finite MDPs. In Proceedings of the International Conference on Artificial
Intelligence and Statistics, 2021.
Nadav Bhonker, Shai Rozenberg, and Itay Hubara. Playing SNES in the retro learn-
ing environment. In Proceedings of the International Conference on Learning
Representations, 2017.
Peter J. Bickel and David A. Freedman. Some asymptotic theory for the bootstrap.
The Annals of Statistics, pages 1196–1217, 1981.
Patrick Billingsley. Probability and measure. John Wiley & Sons, 4th edition, 2012.
Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order
statistics, and Kantorovich transport distances. Memoirs of the American Mathematical Society, 261(1259),
2019.
Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan.
Quantile QT-OPT for risk-aware vision-based robotic grasping. In Robotics: Science
and Systems, 2020.
Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control
Letters, 29(5):291–294, 1997.
Vivek S. Borkar. Stochastic approximation: A dynamical systems viewpoint. Cam-
bridge University Press, 2008.
Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochas-
tic approximation and reinforcement learning. SIAM Journal on Control and
Optimization, 38(2):447–469, 2000.
Léon Bottou. Online learning and stochastic approximations. On-line learning in
neural networks, 17(9):142, 1998.
Craig Boutilier. Planning, learning and coordination in multiagent decision processes.
In Proceedings of the Conference on Theoretical Aspects of Rationality and Knowledge (TARK), pages 195–210, 1996.
Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning
rate. Artificial Intelligence, 136(2):215–250, 2002.
Justin Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely
approximating the value function. In Advances in Neural Information Processing
Systems, 1995.
Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University
Press, 2004.
Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal
difference learning. Machine learning, 22(1):33–57, 1996.
Todd S. Braver, Deanna M. Barch, and Jonathan D. Cohen. Cognition and control in
schizophrenia: a computational model of dopamine and prefrontal function. Biological
psychiatry, 46(3):312–328, 1999.
Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov
Chain Monte Carlo. CRC Press, 2011.
Daniel Brown, Scott Niekum, and Marek Petrik. Bayesian robust optimization for
imitation learning. In Advances in Neural Information Processing Systems, 2020.
Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I.
Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis,
and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions
on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova,
Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg
Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu
Wang. Scaling data-driven robotics with reward sketching and batch reinforcement
learning. In Proceedings of Robotics: Science and Systems, 2020.
Barbara Cagniard, Peter D. Balsam, Daniela Brunner, and Xiaoxi Zhuang. Mice with
chronically elevated dopamine exhibit enhanced motivation, but not learning, for a
food reward. Neuropsychopharmacology, 31(7):1362–1370, 2006.
Stefano Carpin, Yinlam Chow, and Marco Pavone. Risk aversion in finite Markov
decision processes using total cost criteria and average value at risk. In Proceedings
of the IEEE International Conference on Robotics and Automation, 2016.
Pablo S. Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G.
Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv,
2018.
Johan Samir Obando Ceron and Pablo Samuel Castro. Revisiting Rainbow: Promoting
more insightful and inclusive deep reinforcement learning research. In Proceedings
of the International Conference on Machine Learning, 2021.
Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma
Brunskill, and Philip S. Thomas. Universal off-policy evaluation. In Advances in
Neural Information Processing Systems, 2021.
Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforce-
ment learning. In Proceedings of the International Conference on Machine Learning,
2019.
Nicolas Chopin and Omiros Papaspiliopoulos. An introduction to sequential Monte
Carlo. Springer, 2020.
Yinlam Chow. Risk-sensitive and data-driven sequential decision making. PhD thesis,
Stanford University, 2017.
Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in
MDPs. In Advances in Neural Information Processing Systems, 2014.
Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust
decision-making: a CVaR optimization approach. In Advances in Neural Information
Processing Systems, 2015.
Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-
constrained reinforcement learning with percentile risk criteria. Journal of Machine
Learning Research, 2018.
Kun-Jen Chung and Matthew J. Sobel. Discounted MDPs: Distribution functions and
exponential utility maximization. SIAM Journal on Control and Optimization, 25(1):
49–62, 1987.
Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks
for nonlinear value function approximation. In Proceedings of the International
Conference on Learning Representations, 2018.
Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in coop-
erative multiagent systems. In Proceedings of the AAAI Conference on Artificial
Intelligence, 1998.
William R. Clements, Benoit-Marie Robaglia, Bastien Van Delft, Reda Bahi Slaoui,
and Sebastien Toth. Estimating risk and uncertainty in deep reinforcement learning.
In Workshop on Uncertainty and Robustness in Deep Learning at the International
Conference on Machine Learning, 2020.
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural
generation to benchmark reinforcement learning. In Proceedings of the International
Conference on Machine Learning, 2020.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to algorithms. MIT Press, 2001.
Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min
sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In
Advances in Neural Information Processing Systems, 2013.
Felipe Leno Da Silva, Anna Helena Reali Costa, and Peter Stone. Distributional
reinforcement learning applied to robot soccer simulation. In Adaptive and Learn-
ing Agents Workshop at the International Conference on Autonomous Agents and
Multiagent Systems, 2019.
Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile
networks for distributional reinforcement learning. In Proceedings of the International
Conference on Machine Learning, 2018a.
Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional
reinforcement learning with quantile regression. In AAAI Conference on Artificial
Intelligence, 2018b.
Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G.
Bellemare, and David Silver. The value-improvement path: Towards better repre-
sentations for reinforcement learning. In Proceedings of the AAAI Conference on
Artificial Intelligence, 2020a.
Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis
Hassabis, Rémi Munos, and Matthew Botvinick. A distributional code for value in
dopamine-based reinforcement learning. Nature, 577(7792):671–675, 2020b.
Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and
Le Song. SBEED: Convergent reinforcement learning with nonlinear function approx-
imation. In Proceedings of the International Conference on Machine Learning,
2018.
Nathaniel D. Daw. Reinforcement learning models of the dopamine system and their
behavioral implications. PhD thesis, Carnegie Mellon University, 2003.
Nathaniel D. Daw and Philippe N. Tobler. Value learning through reinforcement: the
basics of dopamine and reinforcement learning. In Paul W. Glimcher and Ernst Fehr,
editors, Neuroeconomics, pages 283–298. Academic Press, 2014.
Peter Dayan. The convergence of TD(λ) for general λ. Machine learning, 8(3-4):341–362, 1992.
Peter Dayan. Improving generalization for temporal difference learning: The successor
representation. Neural Computation, 5(4):613–624, 1993.
Peter Dayan and Terrence J. Sejnowski. TD(λ) converges with probability 1. Machine Learning, 14(3):295–301, 1994.
Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Proceed-
ings of the Fifteenth National Conference on Artificial Intelligence, pages 761–768,
1998.
Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes
with parameter uncertainty. Operations Research, 2010.
Eric V. Denardo and Uriel G. Rothblum. Optimal stopping, exponential utility, and
linear programming. Mathematical Programming, 16(1):228–244, 1979.
Cyrus Derman. Finite state Markovian decision processes. Academic Press, 1970.
Persi Diaconis and David Freedman. Iterated random functions. SIAM review, 41(1):
45–76, 1999.
Thang Doan, Bogdan Mazoure, and Clare Lyle. GAN Q-learning. arXiv preprint
arXiv:1805.04874, 2018.
J. L. Doob. Measure Theory. Springer, 1994.
Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing:
Fifteen years later. Handbook of nonlinear filtering, 2011.
Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo methods
in practice. Springer, 2001.
Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, and Bo Cheng.
Distributional soft actor-critic: Off-policy reinforcement learning for addressing value
estimation errors. IEEE Transactions on Neural Networks and Learning Systems,
2021.
Aryeh Dvoretzky. On stochastic approximation. In Proceedings of the Berkeley
Symposium on Mathematical Statistics and Probability, pages 39–55, 1956.
Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character
of the sample distribution function and of the classical multinomial estimator. The
Annals of Mathematical Statistics, pages 642–669, 1956.
Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets Bellman: The Gaussian
process approach to temporal difference learning. In Proceedings of the International
Conference on Machine Learning, 2003.
Yaakov Engel, Shie Mannor, and Ron Meir. Bayesian reinforcement learning with
Gaussian process temporal difference methods. Unpublished, 2007.
Martin Engert. Finite dimensional translation invariant subspaces. Pacific Journal of
Mathematics, 32(2):333–343, 1970.
Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforce-
ment learning. Journal of Machine Learning Research, 6:503–556, 2005.
Neir Eshel, Michael Bukwich, Vinod Rao, Vivian Hemmelder, Ju Tian, and Naoshige
Uchida. Arithmetic and local circuitry underlying dopamine prediction errors. Nature,
525(7568):243–246, 2015.
Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of
Machine Learning Research, 5(1), 2003.
Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In
Advances in Neural Information Processing Systems, 2011.
Amir-massoud Farahmand. Value function in frequency domain and the characteristic
value iteration algorithm. In Advances in Neural Information Processing Systems,
2019.
William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo
Larochelle. Hyperbolic discounting and learning over multiple horizons. In
Multi-Disciplinary Conference on Reinforcement Learning and Decision-Making,
2019.
Eugene A. Feinberg. Constrained discounted Markov decision processes and
Hamiltonian cycles. Mathematics of Operations Research, 25(1):130–140, 2000.
Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov
decision processes. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2004.
Norman Ferns and Doina Precup. Bisimulation metrics are optimal value functions.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2014.
Jerzy A. Filar, Dmitry Krass, and Keith W. Ross. Percentile performance criteria
for limiting average Markov decision processes. IEEE Transactions on Automatic
Control, 40(1):2–10, 1995.
Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband,
Alex Graves, Vlad Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles
Blundell, and Shane Legg. Noisy networks for exploration. In Proceedings of the
International Conference on Learning Representations, 2018.
Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and
Joelle Pineau. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018.
Dror Freirich, Tzahi Shimkin, Ron Meir, and Aviv Tamar. Distributional multivariate
policy evaluation and exploration with the Bellman GAN. In Proceedings of the
International Conference on Machine Learning, 2019.
Matthew P.H. Gardner, Geoffrey Schoenbaum, and Samuel J. Gershman. Rethinking
dopamine as generalized prediction error. Proceedings of the Royal Society B, 285
(1891):20181645, 2018.
Dwight C. German, Kebreten Manaye, Wade K. Smith, Donald J. Woodward, and
Clifford B. Saper. Midbrain dopaminergic cell loss in Parkinson’s disease: computer
visualization. Annals of Neurology: Official Journal of the American Neurological
Association and the Child Neurology Society, 26(4):507–514, 1989.
Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian
reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6):359–483, 2015.
Dibya Ghosh and Marc G. Bellemare. Representations for stable off-policy reinforce-
ment learning. In Proceedings of the International Conference on Machine Learning,
2020.
Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux. An operator view of policy
gradient methods. In Advances in Neural Information Processing Systems, 2020.
Hugo Gilbert, Paul Weng, and Yan Xu. Optimizing quantiles in preference-based
Markov decision processes. In Proceedings of the AAAI Conference on Artificial
Intelligence, 2017.
Paul W. Glimcher. Understanding dopamine and reinforcement learning: the dopamine
reward prediction error hypothesis. Proceedings of the National Academy of Sciences,
108(Supplement 3):15647–15654, 2011.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in Neural Information Processing Systems, 2014.
Ian Goodfellow, Aaron Courville, and Yoshua Bengio. Deep learning. MIT Press,
2016.
Geoffrey Gordon. Stable function approximation in dynamic programming. In
Proceedings of the International Conference on Machine Learning, 1995.
Neil J. Gordon, David J. Salmond, and Adrian F.M. Smith. Novel approach to
nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and
Signal Processing), 140(2):107–113, 1993.
Laura Graesser and Wah Loon Keng. Foundations of deep reinforcement learning:
Theory and practice in Python. Addison-Wesley Professional, 2019.
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and
Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research,
13:723–773, 2012.
Steffen Grünewälder and Klaus Obermayer. The optimal unbiased value estimator and
its relation to LSTD, TD and MC. Machine Learning, 83(3), 2011.
Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Belle-
mare, and Rémi Munos. The Reactor: A fast and sample-efficient actor-critic agent for
reinforcement learning. In Proceedings of the International Conference on Learning
Representations, 2018.
Zhaohan Daniel Guo, Bernardo Avila Pires, Bilal Piot, Jean-Bastien Grill, Flo-
rent Altché, Rémi Munos, and Mohammad Gheshlaghi Azar. Bootstrap latent-
predictive representations for multitask reinforcement learning. In Proceedings of the
International Conference on Machine Learning, 2020.
Leonid Gurvits, Long-Ji Lin, and Stephen José Hanson. Incremental learning of evalu-
ation functions for absorbing Markov chains: New methods and theorems. Technical
report, Siemens Corporate Research, 1994.
Mance E. Harmon and Leemon C. Baird. A response to Bertsekas’ a counterexample
to temporal-differences learning. Technical report, Wright Laboratory, 1996.
William B. Haskell and Rahul Jain. A convex analytic approach to risk-aware Markov
decision processes. SIAM Journal on Control and Optimization, 53(3):1569–1598,
2015.
Shane V. Hegarty, Aideen M. Sullivan, and Gerard W. O’Keeffe. Midbrain dopamin-
ergic neurons: a review of the molecular circuitry that regulates their development.
Developmental biology, 379(2):123–138, 2013.
Matthias Heger. Consideration of risk in reinforcement learning. In Machine Learning
Proceedings 1994, pages 105–111. Elsevier, 1994.
Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and
David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI
Conference on Artificial Intelligence, 2018.
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski,
Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow:
Combining improvements in deep reinforcement learning. In Proceedings of the
AAAI Conference on Artificial Intelligence, 2018.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural
computation, 9(8):1735–1780, 1997.
Oleh Hornykiewicz. Dopamine (3-hydroxytyramine) and brain function. Pharmaco-
logical reviews, 18(2):925–964, 1966.
Ronald A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
Ronald A. Howard and James E. Matheson. Risk-sensitive Markov decision processes.
Management science, 18(7):356–369, 1972.
Oliver D. Howes and Shitij Kapur. The dopamine hypothesis of schizophrenia: version
III—the final common pathway. Schizophrenia bulletin, 35(3):549–562, 2009.
Marcus Hutter. Universal artificial intelligence: Sequential decisions based on
algorithmic probability. Springer, 2005.
Ehsan Imani and Martha White. Improving regression performance with distributional
losses. In Proceedings of the International Conference on Machine Learning, 2018.
Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of
stochastic iterative dynamic programming algorithms. Neural Computation, 6(6),
1994.
Max Jaderberg, Volodymyr Mnih, Wojciech M. Czarnecki, Tom Schaul, Joel Z.
Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsuper-
vised auxiliary tasks. In Proceedings of the International Conference on Learning
Representations, 2017.
Michael Janner, Igor Mordatch, and Sergey Levine. Generative temporal difference
learning for infinite-horizon prediction. In Advances in Neural Information Processing
Systems, 2020.
Stratton C. Jaquette. Markov decision processes with a new optimality criterion:
Discrete time. The Annals of Statistics, pages 496–505, 1973.
Stratton C. Jaquette. A utility criterion for Markov decision processes. Management
Science, 23(1):43–49, 1976.
Børge Jessen and Aurel Wintner. Distribution functions and the Riemann zeta function.
Transactions of the American Mathematical Society, 38(1):48–88, 1935.
Daniel R. Jiang and Warren B. Powell. Risk-averse approximate dynamic programming
with quantile-based risk measures. Mathematics of Operations Research, 43(2):
554–579, 2018.
Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the
Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and
acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134,
1998.
Leon J. Kamin. "Attention-like" processes in classical conditioning. In Miami
symposium on the prediction of behavior: Aversive stimulation, pages 9–31, 1968.
Leonid V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR
(NS), volume 37, pages 199–201, 1942.
Spiros Kapetanakis and Daniel Kudenko. Reinforcement learning of coordination in
cooperative multi-agent systems. In Proceedings of the AAAI Conference on Artificial
Intelligence, 2002.
Bilal Kartal, Pablo Hernandez-Leal, and Matthew E. Taylor. Terminal prediction
as an auxiliary task for deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence and Interactive Digital Entertainment, 2019.
Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech
Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement
learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG),
pages 1–8, 2016.
Ramtin Keramati, Christoph Dann, Alex Tamkin, and Emma Brunskill. Being opti-
mistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the
AAAI Conference on Artificial Intelligence, 2020.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
Proceedings of the International Conference on Learning Representations, 2015.
Roger Koenker. Quantile Regression. Cambridge University Press, 2005.
Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal
of the Econometric Society, pages 33–50, 1978.
J. Zico Kolter. The fixed points of off-policy TD. In Advances in Neural Information
Processing Systems, 2011.
George D. Konidaris, Sarah Osentoski, and Philip S. Thomas. Value function approxi-
mation in reinforcement learning using the Fourier basis. In Proceedings of the AAAI
Conference on Artificial Intelligence, 2011.
Chung-Ming Kuan, Jin-Huei Yeh, and Yu-Chin Hsu. Assessing value at risk with care,
the conditional autoregressive expectile models. Journal of Econometrics, 150(2):
261–270, 2009.
Harold W. Kuhn. A simplified two-person poker. Contributions to the Theory of
Games, 1:97–103, 1950.
Zeb Kurth-Nelson and A. David Redish. Temporal-difference reinforcement learning
with distributed representations. PLoS One, 4(10):e7362, 2009.
Harold Kushner and Dean Clark. Stochastic Approximation Methods for Constrained
and Unconstrained Systems. Springer, 1978.
Harold Kushner and G. George Yin. Stochastic approximation and recursive algorithms
and applications. Springer Science & Business Media, 2003.
Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Control-
ling overestimation bias with truncated mixture of continuous distributional quantile
critics. In Proceedings of the International Conference on Machine Learning, 2020.
Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine
Learning Research, 4:1107–1149, 2003.
Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep rein-
forcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence,
2017.
Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsuper-
vised representations for reinforcement learning. In Proceedings of the International
Conference on Machine Learning, 2020.
Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In Proceedings
of the International Conference on Algorithmic Learning Theory, 2012.
Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press,
2020.
Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement
learning in cooperative multi-agent systems. In Proceedings of the International
Conference on Machine Learning, 2000.
Charline Le Lan, Stephen Tu, Adam Oberman, Rishabh Agarwal, and Marc G.
Bellemare. On the generalization of representations in reinforcement learning. In
Proceedings of the International Conference on Artificial Intelligence and Statistics,
2022.
Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time
series. In The handbook of brain theory and neural networks. MIT Press, 1995.
Daewoo Lee, Boris Defourny, and Warren B. Powell. Bias-corrected Q-learning
to control max-operator bias in Q-learning. In Symposium on Adaptive Dynamic
Programming And Reinforcement Learning, 2013.
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial
and review. arXiv preprint arXiv:1805.00909, 2018.
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training
of deep visuomotor policies. Journal of Machine Learning Research, 2016.
Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile Markov decision
processes. Operations Research, 2021.
Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman.
Random synaptic feedback weights support error backpropagation for deep learning.
Nature Communications, 7(1):1–10, 2016a.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep rein-
forcement learning. In Proceedings of the International Conference on Learning
Representations, 2016b.
Gwo Dong Lin. Recent developments on the moment problem. Journal of Statistical
Distributions and Applications, 4(1):1–17, 2017.
Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine learning, 8(3):293–321, 1992.
Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, and Guangwen Yang.
Distributional reward decomposition for reinforcement learning. In Advances in
Neural Information Processing Systems, 2019.
Nir Lipovetzky, Miquel Ramirez, and Hector Geffner. Classical planning with sim-
ulators: Results on the Atari video games. In Proceedings of International Joint
Conference on Artificial Intelligence, 2015.
Michael L. Littman. Markov games as a framework for multi-agent reinforcement
learning. In Proceedings of the International Conference on Machine Learning, 1994.
Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning
model: Convergence and applications. In Proceedings of the International Conference
on Machine Learning, 1996.
Jun S. Liu. Monte Carlo strategies in scientific computing, volume 10. Springer, 2001.
Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose
Bayesian inference algorithm. In Advances in Neural Information Processing Systems,
2016.
Quansheng Liu. Fixed points of a generalized smoothing transformation and applica-
tions to the branching random walk. Advances in Applied Probability, 30(1):85–112,
1998.
Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on
Automatic Control, 22(4):551–575, 1977.
Tomas Ljungberg, Paul Apicella, and Wolfram Schultz. Responses of monkey
dopamine neurons during learning of behavioral reactions. Journal of neurophysiol-
ogy, 67(1):145–163, 1992.
Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida.
Distributional reinforcement learning in the brain. Trends in Neurosciences, 2020.
Elliot A. Ludvig, Marc G. Bellemare, and Keir G. Pearson. A primer on reinforcement
learning in the brain: Psychological, computational, and neural perspectives. Com-
putational neuroscience for advancing artificial intelligence: Models, methods and
applications, pages 111–144, 2011.
Yudong Luo, Guiliang Liu, Haonan Duan, Oliver Schulte, and Pascal Poupart. Dis-
tributional reinforcement learning with monotonic splines. In Proceedings of the
International Conference on Learning Representations, 2021.
Clare Lyle, Pablo Samuel Castro, and Marc G. Bellemare. A comparative analysis
of expected and distributional reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, 2019.
Clare Lyle, Mark Rowland, Georg Ostrovski, and Will Dabney. On the effect of
auxiliary tasks on representation dynamics. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2021.
Xueguang Lyu and Christopher Amato. Likelihood quantile networks for coordinating
multi-agent reinforcement learning. In Proceedings of the International Conference
on Autonomous Agents and Multiagent Systems, 2020.
Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew
Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment:
Evaluation protocols and open problems for general agents. Journal of Artificial
Intelligence Research, 2018.
David J.C. MacKay. Information theory, inference and learning algorithms. Cam-
bridge University Press, 2003.
Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Arnaud Doucet,
Andriy Mnih, and Yee Whye Teh. Particle value functions. In Proceedings of the
International Conference on Learning Representations (Workshop Track), 2017.
João Guilherme Madeira Araújo, Johan Samir Obando Ceron, and Pablo Samuel
Castro. Lifting the veil on hyper-parameters for value-based deep reinforcement
learning. In NeurIPS 2021 Workshop: LatinX in AI, 2021.
Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis,
University of Alberta, 2011.
Petr Mandl. On the variance in controlled Markov chains. Kybernetika, 7(1):1–12,
1971.
Shie Mannor and John Tsitsiklis. Mean-variance optimization in Markov decision
processes. In Proceedings of the International Conference on Machine Learning,
2011.
Shie Mannor, Duncan Simester, Peng Sun, and John N. Tsitsiklis. Bias and variance
approximation in value function estimates. Management Science, 53, 2007.
Harry M. Markowitz. Portfolio selection. Journal of Finance, 7:77–91, 1952.
John Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. Stochastically
dominant distributional reinforcement learning. In Proceedings of the International
Conference on Machine Learning, 2020.
Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The
Annals of Probability, pages 1269–1283, 1990.
Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Hysteretic Q-
learning: an algorithm for decentralized reinforcement learning in cooperative multi-
agent teams. In IEEE International Conference on Intelligent Robots and Systems,
2007.
Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent
reinforcement learners in cooperative Markov games: A survey regarding coordination
problems. The Knowledge Engineering Review, 27(1):1–31, 2012.
Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu.
Distributional reinforcement learning for efficient exploration. In Proceedings of the
International Conference on Machine Learning, 2019.
Andrew K. McCallum. Reinforcement learning with selective perception and hidden
state. PhD thesis, University of Rochester, 1995.
Sean Meyn. Control Systems and Reinforcement Learning. Cambridge University
Press, 2022.
Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability.
Cambridge University Press, 2012.
Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine
learning, 49(2):267–290, 2002.
Ralph R. Miller, Robert C. Barnet, and Nicholas J. Grahame. Assessment of the
Rescorla-Wagner model. Psychological bulletin, 117(3):363, 1995.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness,
Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg
Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen
King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-
level control through deep reinforcement learning. Nature, 518(7540):529–533,
2015.
Gordon J. Mogenson, Douglas L. Jones, and Chi Yiu Yim. From motivation to action:
functional interface between the limbic system and the motor system. Progress in
neurobiology, 14(2-3):69–97, 1980.
Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de
l’Académie Royale des Sciences de Paris, 1781.
P. Read Montague, Peter Dayan, and Terrence J. Sejnowski. A framework for mes-
encephalic dopamine systems based on predictive Hebbian learning. Journal of
neuroscience, 16(5):1936–1947, 1996.
Nick Montfort and Ian Bogost. Racing the beam: The Atari Video Computer System.
MIT Press, 2009.
Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement
learning with less data and less time. Machine Learning, 1993.
Oskar Morgenstern and John von Neumann. Theory of games and economic behavior.
Princeton University Press, 1944.
Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and
Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforce-
ment learning. In Proceedings of the International Conference on Machine Learning,
2010a.
Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and
Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2010b.
Thomas E. Morton. On the asymptotic convergence rate of cost differences for
Markovian decision processes. Operations Research, 19(1):244–248, 1971.
Bradford W. Mott, Stephen Anthony, and the Stella team. Stella: A multi-platform
Atari 2600 VCS emulator. http://stella.sourceforge.net, 1995–2021.
Alfred Müller. Integral probability metrics and their generating classes of functions.
Advances in Applied Probability, 29(2):429–443, 1997.
Timothy H. Muller, James L. Butler, Sebastijan Veselic, Bruno Miranda, Timothy E.J.
Behrens, Zeb Kurth-Nelson, and Steven W. Kennerley. Distributional reinforcement
learning in prefrontal cortex. bioRxiv, 2021.
Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the
International Conference on Machine Learning, 2003.
Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe
and efficient off-policy reinforcement learning. In Advances in Neural Information
Processing Systems, 2016.
Kevin P. Murphy. Machine learning: A probabilistic perspective. MIT Press, 2012.
Christian A. Naesseth, Fredrik Lindsten, and Thomas B. Schön. Elements of sequential
Monte Carlo. Foundations and Trends® in Machine Learning, 12(3):307–392, 2019.
Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann
machines. In Proceedings of the International Conference on Machine Learning,
2010.
Daniel W. Nam, Younghoon Kim, and Chan Y. Park. GMAC: A distributional perspec-
tive on actor-critic framework. In Proceedings of the International Conference on
Machine Learning, 2021.
Ralph Neininger. Limit Laws for Random Recursive Structures and Algorithms. PhD
thesis, University of Freiburg, 1999.
Ralph Neininger. On a multivariate contraction method for random recursive structures
with applications to Quicksort. Random Structures & Algorithms, 19(3-4):498–524,
2001.
Ralph Neininger and Ludger Rüschendorf. A general limit theorem for recursive
algorithms and combinatorial structures. The Annals of Applied Probability, 14(1):
378–418, 2004.
Whitney K. Newey and James L. Powell. Asymmetric least squares estimation and
testing. Econometrica: Journal of the Econometric Society, pages 819–847, 1987.
Thanh Tang Nguyen, Sunil Gupta, and Svetha Venkatesh. Distributional reinforcement
learning via moment matching. In Proceedings of the AAAI Conference on Artificial
Intelligence, 2021.
André Nieoullon. Dopamine and the regulation of cognition and attention. Progress
in neurobiology, 67(1):53–83, 2002.
Nikolay Nikolov, Johannes Kirschner, Felix Berkenkamp, and Andreas Krause.
Information-directed exploration for deep reinforcement learning. In Proceedings of
the International Conference on Learning Representations, 2019.
Yael Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology,
53(3):139–154, 2009.
Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Kather-
ine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3), 2018.
Frans A. Oliehoek and Christopher Amato. A concise introduction to decentralized
POMDPs. Springer, 2016.
Frans A. Oliehoek, Matthijs T.J. Spaan, and Nikos Vlassis. Optimal and approximate
Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence
Research, 32:289–353, 2008.
Ditte Olsen, Niels Wellner, Mathias Kaas, Inge E.M. de Jong, Florence Sotty, Michael
Didriksen, Simon Glerup, and Anders Nykjaer. Altered dopaminergic firing pat-
tern and novelty response underlie ADHD-like behavior of SorCS2-deficient mice.
Translational Psychiatry, 11(1):1–14, 2021.
Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P. How, and John
Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial
observability. In Proceedings of the International Conference on Machine Learning,
2017.
Art B. Owen. Monte Carlo theory, methods and examples. 2013.
Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-
agent deep reinforcement learning. In Proceedings of the International Conference
on Autonomous Agents and Multiagent Systems, 2018.
Gregory Palmer, Rahul Savani, and Karl Tuyls. Negative update intervals in deep
multi-agent reinforcement learning. In Proceedings of the International Conference
on Autonomous Agents and Multiagent Systems, 2019.
Liviu Panait, R. Paul Wiegand, and Sean Luke. Improving coevolutionary search for
optimal multiagent behaviors. In Proceedings of the International Joint Conference
on Artificial Intelligence, 2003.
Liviu Panait, Keith Sullivan, and Sean Luke. Lenient learners in cooperative multiagent
systems. In Proceedings of the International Conference on Autonomous Agents and
Multiagent Systems, 2006.
Victor M. Panaretos and Yoav Zemel. An invitation to statistics in Wasserstein space.
Springer Nature, 2020.
Ronald Parr, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. Ana-
lyzing feature generation for value-function approximation. In Proceedings of the
International Conference on Machine Learning, 2007.
Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L.
Littman. An analysis of linear models, linear value-function approximation, and
feature selection for reinforcement learning. In Proceedings of the International
Conference on Machine Learning, 2008.
Ivan P. Pavlov. Conditioned reflexes: An investigation of the physiological activity of
the cerebral cortex, 1927.
Yuval Peres, Wilhelm Schlag, and Boris Solomyak. Sixty years of Bernoulli
convolutions. In Fractal geometry and stochastics II, pages 39–65. Springer, 2000.
Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
L.A. Prashanth and Michael Fu. Risk-sensitive reinforcement learning. arXiv preprint
arXiv:1810.09126, 2021.
L.A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-
sensitive MDPs. In Advances in Neural Information Processing Systems, 2013.
Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-
policy policy evaluation. In Proceedings of the International Conference on Machine
Learning, 2000.
Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol
Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic
control. In Proceedings of the International Conference on Machine Learning, 2017.
Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic
programming. John Wiley & Sons, 2014.
Martin L. Puterman and Moon Chirl Shin. Modified policy iteration algorithms for
discounted Markov decision problems. Management Science, 24(11):1127–1137,
1978.
Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana
Obraztsova, and Zinovi Rabinovich. RMIX: Learning risk-sensitive policies for coop-
erative reinforcement learning agents. In Advances in Neural Information Processing
Systems, 2021.
Chao Qu, Shie Mannor, and Huan Xu. Nonlinear distributional gradient temporal-
difference learning. In Proceedings of the International Conference on Machine
Learning, 2019.
John Quan and Georg Ostrovski. DQN Zoo: Reference implementations of DQN-based
agents, 2020. URL http://github.com/deepmind/dqn_zoo.
Svetlozar T. Rachev and Ludger Rüschendorf. Probability metrics and recursive algorithms.
Advances in Applied Probability, 27(3):770–799, 1995.
Svetlozar T. Rachev, Lev Klebanov, Stoyan V. Stoyanov, and Frank Fabozzi. The
methods of distances in the theory of probability and statistics. Springer Science &
Business Media, 2013.
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob
Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation
for deep multi-agent reinforcement learning. In Proceedings of the International
Conference on Machine Learning, 2018.
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar,
Jakob N. Foerster, and Shimon Whiteson. Monotonic value function factorisation for
deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178), 2020.
Robert A. Rescorla and Allan R. Wagner. A theory of Pavlovian conditioning: Vari-
ations in the effectiveness of reinforcement and nonreinforcement. In Classical
conditioning II, chapter 3, pages 64–99. Appleton-Century-Crofts, 1972.
Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural
reinforcement learning method. In Proceedings of the European Conference on
Machine Learning, 2005.
Martin Riedmiller, Thomas Gabel, Roland Hafner, and Sascha Lange. Reinforcement
learning for robot soccer. Autonomous Robots, 27(1):55–73, 2009.
Maria L. Rizzo and Gábor J. Székely. Energy distance. Wiley Interdisciplinary Reviews:
Computational Statistics, 8(1):27–38, 2016.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals
of mathematical statistics, pages 400–407, 1951.
Herbert Robbins and David Siegmund. A convergence theorem for non negative almost
supermartingales and some applications. In Optimizing methods in statistics, pages
233–257. Elsevier, 1971.
Christian Robert and George Casella. Monte Carlo statistical methods. Springer
Science & Business Media, 2004.
R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk.
Journal of risk, 2:21–42, 2000.
R. Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general
loss distributions. Journal of banking & finance, 26(7):1443–1471, 2002.
Uwe Rösler. A limit theorem for “quicksort”. RAIRO-Theoretical Informatics and
Applications, 25(1):85–100, 1991.
Uwe Rösler. A fixed point theorem for distributions. Stochastic Processes and their
Applications, 42(2):195–214, 1992.
Uwe Rösler. On the analysis of stochastic divide and conquer algorithms. Algorithmica,
29(1):238–261, 2001.
Uwe Rösler and Ludger Rüschendorf. The contraction method for recursive algorithms.
Algorithmica, 29(1-2):3–33, 2001.
Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh.
An analysis of categorical distributional reinforcement learning. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, 2018.
Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare,
and Will Dabney. Statistics and samples in distributional reinforcement learning. In
Proceedings of the International Conference on Machine Learning, 2019.
Mark Rowland, Shayegan Omidshafiei, Daniel Hennes, Will Dabney, Andrew Jaegle,
Paul Muller, Julien Pérolat, and Karl Tuyls. Temporal difference and return optimism
in cooperative multi-agent reinforcement learning. In Adaptive and Learning Agents
Workshop at the International Conference on Autonomous Agents and Multiagent
Systems, 2021.
Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with
applications to image databases. In Sixth International Conference on Computer
Vision, pages 59–66. IEEE, 1998.
Walter Rudin. Principles of mathematical analysis. McGraw-Hill, 3rd edition, 1976.
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning
representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist
systems. Technical report, Cambridge University Engineering Department, 1994.
Ludger Rüschendorf. On stochastic recursive equations of sum and max type. Journal
of applied probability, 43(3):687–703, 2006.
Ludger Rüschendorf and Ralph Neininger. A survey of multivariate aspects of the
contraction method. Discrete Mathematics & Theoretical Computer Science, 8, 2006.
Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision
processes. Mathematical programming, 125(2):235–261, 2010.
Arthur L. Samuel. Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 1959.
Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of
Variations, PDEs and Modeling. Birkhäuser, 2015.
Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013.
Makoto Sato, Hajime Kimura, and Shigenobu Kobayashi. TD algorithm for the
variance of return and mean-variance reinforcement learning. Transactions of the
Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience
replay. In Proceedings of the International Conference on Learning Representations,
2016.
Bruno Scherrer. Should one compute the temporal difference fix point or minimize
the Bellman residual? the unified oblique projection view. In Proceedings of the
International Conference on Machine Learning, 2010.
Bruno Scherrer. Approximate policy iteration schemes: A comparison. In Proceedings
of the International Conference on Machine Learning, 2014.
Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary
infinite-horizon Markov decision processes. In Advances in Neural Information
Processing Systems, 2012.
Matthew Schlegel, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White,
and Martha White. General value function networks. Journal of Artificial Intelligence
Research (JAIR), 2021.
Wolfram Schultz. Responses of midbrain dopamine neurons to behavioral trigger
stimuli in the monkey. Journal of neurophysiology, 56(5):1439–1461, 1986.
Wolfram Schultz. Getting formal with dopamine and reward. Neuron, 36(2):241–263,
2002.
Wolfram Schultz. Dopamine reward prediction-error signalling: a two-component
response. Nature reviews neuroscience, 17(3):183–195, 2016.
Wolfram Schultz and Ranulfo Romo. Dopamine neurons of the monkey midbrain: Con-
tingencies of responses to stimuli eliciting immediate behavioral reactions. Journal
of neurophysiology, 63(3):607–624, 1990.
Wolfram Schultz, Paul Apicella, and Tomas Ljungberg. Responses of monkey
dopamine neurons to reward and conditioned stimuli during successive steps of
learning a delayed response task. Journal of neuroscience, 13(3):900–913, 1993.
Wolfram Schultz, Peter Dayan, and P. Read Montague. A neural substrate of prediction
and reward. Science, 275(5306):1593–1599, 1997.
Ashvin Shah. Psychological and neuroscientific connections with reinforcement
learning. In Reinforcement Learning, pages 507–537. Springer, 2012.
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on
stochastic programming: Modeling and theory. SIAM, 2009.
Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences,
39(10):1095–1100, 1953.
Yun Shen, Wilhelm Stannat, and Klaus Obermayer. Risk-sensitive Markov control
processes. SIAM Journal on Control and Optimization, 51(5):3652–3672, 2013.
Yoav Shoham and Kevin Leyton-Brown. Multiagent systems. Cambridge University
Press, 2009.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George
van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya
Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis. Mastering the game of Go with deep neural networks and tree
search. Nature, 529(7587):484–489, 2016.
Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing
eligibility traces. Machine Learning, 22:123–158, 1996.
Matthew J. Sobel. The variance of discounted Markov decision processes. Journal of
Applied Probability, 19(4):794–802, 1982.
Boris Solomyak. On the random series $\sum \pm \lambda^n$ (an Erdős problem). Annals of
Mathematics, 142(3):611–625, 1995.
Thomas A. Stalnaker, James D. Howard, Yuji K. Takahashi, Samuel J. Gershman,
Thorsten Kahnt, and Geoffrey Schoenbaum. Dopamine neuron ensembles signal the
content of sensory prediction errors. eLife, 8:e49315, 2019.
Marc C. Steinbach. Markowitz revisited: Mean-variance models in financial portfolio
analysis. SIAM review, 43(1):31–85, 2001.
Gilbert Strang. Introduction to linear algebra. Wellesley-Cambridge Press, 1993.
Felipe Petroski Such, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel
Castro, Yulun Li, Ludwig Schubert, Marc G. Bellemare, Jeff Clune, and Joel Lehman.
An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement
learning agents. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2019.
Wei-Fang Sun, Cheng-Kuang Lee, and Chun-Yi Lee. DFAC framework: Factorizing
the value function via quantile mixture for multi-agent distributional Q-learning. In
Proceedings of the International Conference on Machine Learning, 2021.
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius
Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl
Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent
learning. arXiv preprint arXiv:1706.05296, 2017.
Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD
thesis, University of Massachusetts, Amherst, 1984.
Richard S. Sutton. Learning to predict by the methods of temporal differences.
Machine learning, 3(1):9–44, 1988.
Richard S. Sutton. TD models: Modeling the world at a mixture of time scales. In
Proceedings of the International Conference on Machine Learning, 1995.
Richard S. Sutton. Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In Advances in Neural Information Processing Systems,
1996.
Richard S. Sutton. Open theoretical questions in reinforcement learning. In European
Conference on Computational Learning Theory, 1999.
Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction.
MIT Press, 2018.
Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning. Artificial
intelligence, 112(1-2):181–211, 1999.
Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour.
Policy gradient methods for reinforcement learning with function approximation. In
Advances in Neural Information Processing Systems, 2000.
Richard S. Sutton, Csaba Szepesvári, and Hamid Reza Maei. A convergent
O(n) temporal-difference algorithm for off-policy learning with linear function
approximation. In Advances in Neural Information Processing Systems, 2008a.
Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael Bowling.
Dyna-style planning with linear function approximation and prioritized sweeping. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008b.
Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David
Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for
temporal-difference learning with linear function approximation. In Proceedings of
the International Conference on Machine Learning, 2009.
Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski,
Adam White, and Doina Precup. Horde: A scalable real-time architecture for learn-
ing knowledge from unsupervised sensorimotor interaction. In Proceedings of the
International Conference on Autonomous Agents and Multiagents Systems, 2011.
Gábor J. Székely. E-statistics: The energy of statistical samples. Technical Report
02-16, Bowling Green State University, Department of Mathematics and Statistics,
2002.
Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on
distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.
Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in
Neural Information Processing Systems, 1998.
Csaba Szepesvári. Algorithms for reinforcement learning. Morgan & Claypool
Publishers, 2010.
Csaba Szepesvári. Constrained MDPs and the reward hypothesis.
https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html,
2020. Accessed June 25, 2021.
Yuji K. Takahashi, Hannah M. Batchelor, Bing Liu, Akash Khanna, Marisela Morales,
and Geoffrey Schoenbaum. Dopamine neurons respond to errors in the prediction of
sensory features of expected rewards. Neuron, 95(6):1395–1405, 2017.
Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related
risk criteria. In Proceedings of the International Conference on Machine Learning,
2012.
Aviv Tamar, Dotan Di Castro, and Shie Mannor. Temporal difference methods for
the variance of the reward to go. In Proceedings of the International Conference on
Machine Learning, 2013.
Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling.
In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-
to-go. Journal of Machine Learning Research, 17(1):361–396, 2016.
Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan
Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep
reinforcement learning. PLoS ONE, 12(4):e0172395, 2017.
Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents.
In Proceedings of the International Conference on Machine Learning, 1993.
Pablo Tano, Peter Dayan, and Alexandre Pouget. A local temporal difference code for
distributional reinforcement learning. In Advances in Neural Information Processing
Systems, 2020.
James W. Taylor. Estimating value at risk and expected shortfall using expectiles.
Journal of Financial Econometrics, 6(2):231–252, 2008.
Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of
the ACM, 38(3), 1995.
Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization:
An alternative approach for continuous control. In Advances in Neural Information
Processing Systems, 2019.
T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its
recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
Marc Toussaint. Robot trajectory optimization using approximate inference. In
Proceedings of the International Conference on Machine Learning, 2009.
Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and
continuous state Markov decision processes. In Proceedings of the International
Conference on Machine Learning, 2006.
John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine
learning, 16(3):185–202, 1994.
John N. Tsitsiklis. On the convergence of optimistic policy iteration. Journal of
Machine Learning Research, 3:59–72, 2002.
John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning
with function approximation. IEEE Transactions on Automatic Control, 42(5):674–
690, 1997.
Cassius T. Ionescu Tulcea. Mesures dans les espaces produits. Atti Accademia
Nazionale Lincei Rend, 8(7), 1949.
Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neu-
ral networks. In Proceedings of the International Conference on Machine Learning,
2016.
Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.
J. van der Wal. Stochastic dynamic programming: Successive approximations and
nearly optimal strategies for Markov decision processes and Markov games. Stichting
Mathematisch Centrum, 1981.
Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with
double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence,
2016.
Hado P. van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver.
Learning values across many orders of magnitude. In Advances in Neural Information
Processing Systems, 2016.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances
in Neural Information Processing Systems, 2017.
Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon
Scholz. A practical approach to insertion with variable socket position using deep rein-
forcement learning. In IEEE International Conference on Robotics and Automation,
2019.
Joel Veness, Kee Siong Ng, Marcus Hutter, William T. B. Uther, and David Silver. A
Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research, 40:
95–142, 2011.
Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Des-
jardins. Compress and control. In Proceedings of the AAAI Conference on Artificial
Intelligence, 2015.
A.M. Vershik. Long history of the Monge-Kantorovich transportation problem. The
Mathematical Intelligencer, 35(4):1–9, 2013.
Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Munchausen reinforcement
learning. In Advances in Neural Information Processing Systems, 2020.
Cédric Villani. Topics in optimal transportation. Graduate Studies in Mathematics.
American Mathematical Society, 2003.
Cédric Villani. Optimal transport: old and new. Springer Science & Business Media,
2008.
J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100
(1):295–320, 1928.
Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families,
and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):
1–305, 2008.
Neil Walton. Lecture Notes on Stochastic Control. Unpublished, 2021.
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando
de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings
of the International Conference on Machine Learning, 2016.
Christopher J.C.H. Watkins. Learning from delayed rewards. PhD thesis, King’s
College, Cambridge, 1989.
Christopher J.C.H. Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):
279–292, 1992.
Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of conver-
gence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648,
2019.
Ermo Wei and Sean Luke. Lenient learning in independent-learner stochastic
cooperative games. Journal of Machine Learning Research, 17(1):2914–2955, 2016.
Paul J. Werbos. Applications of advances in nonlinear sensitivity analysis. In System
modeling and optimization, pages 762–770. Springer, 1982.
D. J. White. Mean, variance, and probabilistic criteria in finite Markov decision
processes: a review. Journal of Optimization Theory and Applications, 56(1):1–29,
1988.
Martha White. Unifying task specification in reinforcement learning. In Proceedings
of the International Conference on Machine Learning, 2017.
Martha White and Adam White. A greedy approach to adapting the trace parameter
for temporal difference learning. In Proceedings of the International Conference on
Autonomous Agents and Multiagent Systems, 2016.
Norman M. White and Marc Viaud. Localized intracaudate dopamine d2 receptor
activation during the post-training period improves memory for visual or olfactory
conditioned emotional responses in rats. Behavioral and neural biology, 55(3):
255–269, 1991.
Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In WESCON
Convention Record Part IV, 1960.
David Williams. Probability with martingales. Cambridge University Press, 1991.
Roy A. Wise. Dopamine, learning and motivation. Nature reviews neuroscience, 5(6):
483–494, 2004.
Peter R. Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik
Subramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert,
Florian Fuchs, Leilani Gilpin, Piyush Khandelwal, Varun Kompella, HaoChih Lin,
Patrick MacAlpine, Declan Oller, Takuma Seno, Craig Sherstan, Michael D. Thomure,
Houmehr Aghabozorgi, Leon Barrett, Rory Douglas, Dion Whitehead, Peter Dürr,
Peter Stone, Michael Spranger, and Hiroaki Kitano. Outracing champion Gran
Turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.
Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully
parameterized quantile function for distributional reinforcement learning. In Advances
in Neural Information Processing Systems, 2019.
Kenny Young and Tian Tian. MinAtar: An Atari-inspired testbed for thorough and
reproducible reinforcement learning experiments. arXiv, 2019.
Yuguang Yue, Zhendong Wang, and Mingyuan Zhou. Implicit distributional rein-
forcement learning. In Advances in Neural Information Processing Systems,
2020.
Shangtong Zhang and Hengshuai Yao. QUOTA: The quantile option architecture
for reinforcement learning. In Proceedings of the AAAI Conference on Artificial
Intelligence, 2019.
Fan Zhou, Zhoufan Zhu, Qi Kuang, and Liwen Zhang. Non-decreasing quantile
function network with efficient exploration for distributional reinforcement learning.
In Proceedings of the International Joint Conference on Artificial Intelligence, 2021.
Johanna F. Ziegel. Coherence and elicitability. Mathematical Finance, 26(4):901–918,
2016.
Vladimir M. Zolotarev. Metric distances in spaces of random variables and their
distributions. Sbornik: Mathematics, 30(3):373–401, 1976.