References

Mastane Achab. Ranking and risk-aware reinforcement learning. PhD thesis, Institut Polytechnique de Paris, 2020.

Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems, 2021.

D. J. Aigner, Takeshi Amemiya, and Dale J. Poirier. On the estimation of production frontiers: Maximum likelihood estimation of the parameters of a discontinuous density function. International Economic Review, 17(2):377–396, 1976.

David J. Aldous and Antar Bandyopadhyay. A survey of max-type recursive distributional equations. The Annals of Applied Probability, 15(2):1047–1110, 2005.

Gerold Alsmeyer. Random recursive equations and their distributional fixed points. Lecture notes, 2012.

Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: In metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.

Philip Amortila, Marc G. Bellemare, Prakash Panangaden, and Doina Precup. Temporally extended metrics for Markov decision processes. In SafeAI: AAAI Workshop on Artificial Intelligence Safety, 2019.

Philip Amortila, Doina Precup, Prakash Panangaden, and Marc G. Bellemare. A distributional analysis of sampling-based reinforcement learning algorithms. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2020.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In Proceedings of the International Conference on Machine Learning, 2017.

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, David Heath, and Hyejin Ku. Coherent multiperiod risk adjusted values and Bellman's principle. Annals of Operations Research, 152(1):5–22, 2007.

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. IEEE Signal Processing Magazine, Special Issue on Deep Learning for Image Understanding, 2017.

Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, 1995.

Mohammad Gheshlaghi Azar, Rémi Munos, Mohammad Ghavamzadeh, and Hilbert J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011.

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, 2012.

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.

Frederico A. C. Azevedo, Ludmila R. B. Carvalho, Lea T. Grinberg, José Marcelo Farfel, Renata E. L. Ferretti, Renata E. P. Leite, Wilson Jacob Filho, Roberto Lent, and Suzana Herculano-Houzel. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain. Journal of Comparative Neurology, 513(5):532–541, 2009.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, 2015.

Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning, 1995.

Leemon C. Baird. Reinforcement learning through gradient descent. PhD thesis, Carnegie Mellon University, 1999.

Stefan Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.

Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph Modayil, Greg Wayne, Hubert Soyer, Fabio Viola, Brian Zhang, Ross Goroshin, Neil Rabinowitz, Razvan Pascanu, Charlie Beattie, Stig Petersen, Amir Sadik, Stephen Gaffney, Helen King, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, and Dharshan Kumaran. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429–433, 2018.

André Barbeau. Drugs affecting movement disorders. Annual Review of Pharmacology, 14(1):91–113, 1974.

Nolan Bard, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020.

Etienne Barnard. Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 1993.

André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, 2017.

Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. In Proceedings of the International Conference on Learning Representations, 2018.

Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983.

Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.

Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016.

Marc G. Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012a.

Marc G. Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems, 2012b.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013a.

Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the International Conference on Machine Learning, 2013b.

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents, extended abstract. In European Workshop on Reinforcement Learning, 2015.

Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017a.

Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv, 2017b.

Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. A geometric perspective on optimal representations for reinforcement learning. In Advances in Neural Information Processing Systems, 2019a.

Marc G. Bellemare, Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra. Distributional reinforcement learning with linear function approximation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2019b.

Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C. Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, 2020.

Fabio Bellini and Elena Di Bernardino. Risk management with expectiles. The European Journal of Finance, 23(6):487–506, 2017.

Fabio Bellini, Bernhard Klar, Alfred Müller, and Emanuela Rosazza Gianin. Generalized quantiles as risk measures. Insurance: Mathematics and Economics, 54:41–48, 2014.

Richard Bellman. Dynamic Programming. Dover Publications, 1957a.

Richard E. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 1957b.

Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.

Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.

Dimitri P. Bertsekas. Generic rank-one corrections for value iteration in Markovian decision problems. Technical report, Massachusetts Institute of Technology, 1994.

Dimitri P. Bertsekas. A counterexample to temporal differences learning. Neural Computation, 7(2):270–279, 1995.

Dimitri P. Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310–335, 2011.

Dimitri P. Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific, 4th edition, 2012.

Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical report, Massachusetts Institute of Technology, 1996.

Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.

Jalaj Bhandari and Daniel Russo. On the linear convergence of policy gradient methods for finite MDPs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2021.

Nadav Bhonker, Shai Rozenberg, and Itay Hubara. Playing SNES in the retro learning environment. In Proceedings of the International Conference on Learning Representations, 2017.

Peter J. Bickel and David A. Freedman. Some asymptotic theory for the bootstrap. The Annals of Statistics, pages 1196–1217, 1981.

Patrick Billingsley. Probability and measure. John Wiley & Sons, 4th edition, 2012.

Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.

Sergey Bobkov and Michel Ledoux. One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Memoirs of the AMS, 261(1259), 2019.

Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan. Quantile QT-Opt for risk-aware vision-based robotic grasping. In Robotics: Science and Systems, 2020.

Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.

Vivek S. Borkar. Stochastic approximation: A dynamical systems viewpoint. Cambridge University Press, 2008.

Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

Léon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In TARK, volume 96, pages 195–210, 1996.

Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136(2):215–250, 2002.

Justin Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems, 1995.

Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.

Todd S. Braver, Deanna M. Barch, and Jonathan D. Cohen. Cognition and control in schizophrenia: a computational model of dopamine and prefrontal function. Biological Psychiatry, 46(3):312–328, 1999.

Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Daniel Brown, Scott Niekum, and Marek Petrik. Bayesian robust optimization for imitation learning. In Advances in Neural Information Processing Systems, 2020.

Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. In Proceedings of Robotics: Science and Systems, 2020.

Barbara Cagniard, Peter D. Balsam, Daniela Brunner, and Xiaoxi Zhuang. Mice with chronically elevated dopamine exhibit enhanced motivation, but not learning, for a food reward. Neuropsychopharmacology, 31(7):1362–1370, 2006.

Stefano Carpin, Yinlam Chow, and Marco Pavone. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In Proceedings of the IEEE International Conference on Robotics and Automation, 2016.

Pablo S. Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv, 2018.

Johan Samir Obando Ceron and Pablo Samuel Castro. Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In Proceedings of the International Conference on Machine Learning, 2021.

Yash Chandak, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma Brunskill, and Philip S. Thomas. Universal off-policy evaluation. In Advances in Neural Information Processing Systems, 2021.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2019.

Nicolas Chopin and Omiros Papaspiliopoulos. An introduction to sequential Monte Carlo. Springer, 2020.

Yinlam Chow. Risk-sensitive and data-driven sequential decision making. PhD thesis, Stanford University, 2017.

Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, 2014.

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, 2015.

Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 2018.

Kun-Jen Chung and Matthew J. Sobel. Discounted MDPs: Distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1):49–62, 1987.

Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks for nonlinear value function approximation. In Proceedings of the International Conference on Learning Representations, 2018.

Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, 1998.

William R. Clements, Benoit-Marie Robaglia, Bastien Van Delft, Reda Bahi Slaoui, and Sebastien Toth. Estimating risk and uncertainty in deep reinforcement learning. In Workshop on Uncertainty and Robustness in Deep Learning at the International Conference on Machine Learning, 2020.

Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2001.

G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.

Felipe Leno Da Silva, Anna Helena Reali Costa, and Peter Stone. Distributional reinforcement learning applied to robot soccer simulation. In Adaptive and Learning Agents Workshop at the International Conference on Autonomous Agents and Multiagent Systems, 2019.

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018a.

Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In AAAI Conference on Artificial Intelligence, 2018b.

Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G. Bellemare, and David Silver. The value-improvement path: Towards better representations for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020a.

Will Dabney, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis Hassabis, Rémi Munos, and Matthew Botvinick. A distributional code for value in dopamine-based reinforcement learning. Nature, 577(7792):671–675, 2020b.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proceedings of the International Conference on Machine Learning, 2018.

Nathaniel D. Daw. Reinforcement learning models of the dopamine system and their behavioral implications. PhD thesis, Carnegie Mellon University, 2003.

Nathaniel D. Daw and Philippe N. Tobler. Value learning through reinforcement: the basics of dopamine and reinforcement learning. In Paul W. Glimcher and Ernst Fehr, editors, Neuroeconomics, pages 283–298. Academic Press, 2014.

Peter Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3-4):341–362, 1992.

Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.

Peter Dayan and Terrence J. Sejnowski. TD(λ) converges with probability 1. Machine Learning, 14(3):295–301, 1994.

Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 761–768, 1998.

Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 2010.

Eric V. Denardo and Uriel G. Rothblum. Optimal stopping, exponential utility, and linear programming. Mathematical Programming, 16(1):228–244, 1979.

Cyrus Derman. Finite state Markovian decision processes. Academic Press, 1970.

Persi Diaconis and David Freedman. Iterated random functions. SIAM Review, 41(1):45–76, 1999.

Thang Doan, Bogdan Mazoure, and Clare Lyle. GAN Q-learning. arXiv preprint arXiv:1805.04874, 2018.

J. L. Doob. Measure Theory. Springer, 1994.

Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 2011.

Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo methods in practice. Springer, 2001.

Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, and Bo Cheng. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Transactions on Neural Networks and Learning Systems, 2021.

Aryeh Dvoretzky. On stochastic approximation. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, pages 39–55, 1956.

Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.

Yaakov Engel, Shie Mannor, and Ron Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the International Conference on Machine Learning, 2003.

Yaakov Engel, Shie Mannor, and Ron Meir. Bayesian reinforcement learning with Gaussian process temporal difference methods. Unpublished, 2007.

Martin Engert. Finite dimensional translation invariant subspaces. Pacific Journal of Mathematics, 32(2):333–343, 1970.

Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.

Neir Eshel, Michael Bukwich, Vinod Rao, Vivian Hemmelder, Ju Tian, and Naoshige Uchida. Arithmetic and local circuitry underlying dopamine prediction errors. Nature, 525(7568):243–246, 2015.

Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(1), 2003.

Amir-massoud Farahmand. Action-gap phenomenon in reinforcement learning. In Advances in Neural Information Processing Systems, 2011.

Amir-massoud Farahmand. Value function in frequency domain and the characteristic value iteration algorithm. In Advances in Neural Information Processing Systems, 2019.

William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. In Multi-Disciplinary Conference on Reinforcement Learning and Decision-Making, 2019.

Eugene A. Feinberg. Constrained discounted Markov decision processes and Hamiltonian cycles. Mathematics of Operations Research, 25(1):130–140, 2000.

Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.

Norman Ferns and Doina Precup. Bisimulation metrics are optimal value functions. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2014.

Jerzy A. Filar, Dmitry Krass, and Keith W. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In Proceedings of the International Conference on Learning Representations, 2018.

Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018.

Dror Freirich, Tzahi Shimkin, Ron Meir, and Aviv Tamar. Distributional multivariate policy evaluation and exploration with the Bellman GAN. In Proceedings of the International Conference on Machine Learning, 2019.

Matthew P. H. Gardner, Geoffrey Schoenbaum, and Samuel J. Gershman. Rethinking dopamine as generalized prediction error. Proceedings of the Royal Society B, 285(1891):20181645, 2018.

Dwight C. German, Kebreten Manaye, Wade K. Smith, Donald J. Woodward, and Clifford B. Saper. Midbrain dopaminergic cell loss in Parkinson's disease: computer visualization. Annals of Neurology: Official Journal of the American Neurological Association and the Child Neurology Society, 26(4):507–514, 1989.

Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, and Aviv Tamar. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6):359–483, 2015.

Dibya Ghosh and Marc G. Bellemare. Representations for stable off-policy reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.

Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux. An operator view of policy gradient methods. In Advances in Neural Information Processing Systems, 2020.

Hugo Gilbert, Paul Weng, and Yan Xu. Optimizing quantiles in preference-based Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.

Paul W. Glimcher. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis. Proceedings of the National Academy of Sciences, 108(Supplement 3):15647–15654, 2011.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Ian Goodfellow, Aaron Courville, and Yoshua Bengio. Deep learning. MIT Press, 2016.

Geoffrey Gordon. Stable function approximation in dynamic programming. In Proceedings of the International Conference on Machine Learning, 1995.

Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and Signal Processing), 140(2):107–113, 1993.

Laura Graesser and Wah Loon Keng. Foundations of deep reinforcement learning: Theory and practice in Python. Addison-Wesley Professional, 2019.

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Steffen Grünewälder and Klaus Obermayer. The optimal unbiased value estimator and its relation to LSTD, TD and MC. Machine Learning, 83(3), 2011.

Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, and Rémi Munos. The Reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2018.

Zhaohan Daniel Guo, Bernardo Avila Pires, Bilal Piot, Jean-Bastien Grill, Florent Altché, Rémi Munos, and Mohammad Gheshlaghi Azar. Bootstrap latent-predictive representations for multitask reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.

Leonid Gurvits, Long-Ji Lin, and Stephen José Hanson. Incremental learning of evaluation functions for absorbing Markov chains: New methods and theorems. Technical report, Siemens Corporate Research, 1994.

Mance E. Harmon and Leemon C. Baird. A response to Bertsekas' "A counterexample to temporal-differences learning". Technical report, Wright Laboratory, 1996.

William B. Haskell and Rahul Jain. A convex analytic approach to risk-aware Markov decision processes. SIAM Journal on Control and Optimization, 53(3):1569–1598, 2015.

Shane V. Hegarty, Aideen M. Sullivan, and Gerard W. O'Keeffe. Midbrain dopaminergic neurons: a review of the molecular circuitry that regulates their development. Developmental Biology, 379(2):123–138, 2013.

Matthias Heger. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pages 105–111. Elsevier, 1994.

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Oleh Hornykiewicz. Dopamine (3-hydroxytyramine) and brain function. Pharmacological Reviews, 18(2):925–964, 1966.

R. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.

Ronald A. Howard and James E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

Oliver D. Howes and Shitij Kapur. The dopamine hypothesis of schizophrenia: version III—the final common pathway. Schizophrenia Bulletin, 35(3):549–562, 2009.

Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer, 2005.

Ehsan Imani and Martha White. Improving regression performance with distributional losses. In Proceedings of the International Conference on Machine Learning, 2018.

Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6), 1994.

Max Jaderberg, Volodymyr Mnih, Wojciech M. Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Learning Representations, 2017.

Michael Janner, Igor Mordatch, and Sergey Levine. Generative temporal difference learning for infinite-horizon prediction. In Advances in Neural Information Processing Systems, 2020.

Stratton C. Jaquette. Markov decision processes with a new optimality criterion: Discrete time. The Annals of Statistics, pages 496–505, 1973.

Stratton C. Jaquette. A utility criterion for Markov decision processes. Management Science, 23(1):43–49, 1976.

Børge Jessen and Aurel Wintner. Distribution functions and the Riemann zeta function. Transactions of the American Mathematical Society, 38(1):48–88, 1935.

Daniel R. Jiang and Warren B. Powell. Risk-averse approximate dynamic programming with quantile-based risk measures. Mathematics of Operations Research, 43(2):554–579, 2018.

Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

Leon J. Kamin. "Attention-like" processes in classical conditioning. In Miami Symposium on the Prediction of Behavior: Aversive Stimulation, pages 9–31, 1968.

Leonid V. Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.

Spiros Kapetanakis and Daniel Kudenko. Reinforcement learning of coordination in cooperative multi-agent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2002.

Bilal Kartal, Pablo Hernandez-Leal, and Matthew E. Taylor. Terminal prediction

as an auxiliary task for deep reinforcement learning. In Proceedings of the AAAI

Conference on Artiﬁcial Intelligence and Interactive Digital Entertainment, 2019.

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech

Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement

learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG),

pages 1–8, 2016.

Ramtin Keramati, Christoph Dann, Alex Tamkin, and Emma Brunskill. Being opti-

mistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the

AAAI Conference on Artiﬁcial Intelligence, 2020.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.

In Proceedings of the International Conference on Learning Representations, 2015.

Roger Koenker. Quantile Regression. Cambridge University Press, 2005.

Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal

of the Econometric Society, pages 33–50, 1978.

J. Zico Kolter. The ﬁxed points of off-policy TD. In Advances in Neural Information

Processing Systems, 2011.

George D. Konidaris, Sarah Osentoski, and Philip S. Thomas. Value function approxi-

mation in reinforcement learning using the Fourier basis. In Proceedings of the AAAI

Conference on Artiﬁcial Intelligence, 2011.

Chung-Ming Kuan, Jin-Huei Yeh, and Yu-Chin Hsu. Assessing value at risk with care,

the conditional autoregressive expectile models. Journal of Econometrics, 150(2):

261–270, 2009.

Harold W. Kuhn. A simpliﬁed two-person poker. Contributions to the Theory of

Games, 1:97–103, 1950.

Zeb Kurth-Nelson and A. David Redish. Temporal-difference reinforcement learning

with distributed representations. PLoS One, 4(10):e7362, 2009.

Harold Kushner and Dean Clark. Stochastic Approximation Methods for Constrained

and Unconstrained Systems. Springer, 1978.

Harold Kushner and G. George Yin. Stochastic approximation and recursive algorithms

and applications. Springer Science & Business Media, 2003.

Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Control-

ling overestimation bias with truncated mixture of continuous distributional quantile

critics. In Proceedings of the International Conference on Machine Learning, 2020.

Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine

Learning Research, 4:1107–1149, 2003.

Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep rein-

forcement learning. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence,

2017.

Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsuper-

vised representations for reinforcement learning. In Proceedings of the International

Conference on Machine Learning, 2020.

Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In Proceedings

of the International Conference on Algorithmic Learning Theory, 2012.

Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press,

2020.

Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement

learning in cooperative multi-agent systems. In Proceedings of the International

Conference on Machine Learning, 2000.

Charline Le Lan, Stephen Tu, Adam Oberman, Rishabh Agarwal, and Marc G.

Bellemare. On the generalization of representations in reinforcement learning. In

Proceedings of the International Conference on Artiﬁcial Intelligence and Statistics,

2022.

Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time

series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.

Daewoo Lee, Boris Defourny, and Warren B. Powell. Bias-corrected Q-learning

to control max-operator bias in Q-learning. In Symposium on Adaptive Dynamic

Programming And Reinforcement Learning, 2013.

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial

and review. arXiv preprint arXiv:1805.00909, 2018.

Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training

of deep visuomotor policies. Journal of Machine Learning Research, 2016.

Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile Markov decision

processes. Operations Research, 2021.

Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman.

Random synaptic feedback weights support error backpropagation for deep learning.

Nature Communications, 7(1):1–10, 2016a.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,

Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep rein-

forcement learning. In Proceedings of the International Conference on Learning

Representations, 2016b.

Gwo Dong Lin. Recent developments on the moment problem. Journal of Statistical

Distributions and Applications, 4(1):1–17, 2017.

Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning

and teaching. Machine learning, 8(3):293–321, 1992.

Zichuan Lin, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, and Guangwen Yang.

Distributional reward decomposition for reinforcement learning. In Advances in

Neural Information Processing Systems, 2019.

Nir Lipovetzky, Miquel Ramirez, and Hector Geffner. Classical planning with sim-

ulators: Results on the Atari video games. In Proceedings of the International Joint

Conference on Artiﬁcial Intelligence, 2015.

Michael L. Littman. Markov games as a framework for multi-agent reinforcement

learning. In Proceedings of the International Conference on Machine Learning, 1994.

Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning

model: Convergence and applications. In Proceedings of the International Conference

on Machine Learning, 1996.

Jun S. Liu. Monte Carlo strategies in scientiﬁc computing, volume 10. Springer, 2001.

Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose

Bayesian inference algorithm. In Advances in Neural Information Processing Systems,

2016.

Quansheng Liu. Fixed points of a generalized smoothing transformation and applica-

tions to the branching random walk. Advances in Applied Probability, 30(1):85–112,

1998.

Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on

Automatic Control, 22(4):551–575, 1977.

Tomas Ljungberg, Paul Apicella, and Wolfram Schultz. Responses of monkey

dopamine neurons during learning of behavioral reactions. Journal of neurophysiol-

ogy, 67(1):145–163, 1992.

Adam S. Lowet, Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida.

Distributional reinforcement learning in the brain. Trends in Neurosciences, 2020.

Elliot A. Ludvig, Marc G. Bellemare, and Keir G. Pearson. A primer on reinforcement

learning in the brain: Psychological, computational, and neural perspectives. Com-

putational neuroscience for advancing artiﬁcial intelligence: Models, methods and

applications, pages 111–144, 2011.

Yudong Luo, Guiliang Liu, Haonan Duan, Oliver Schulte, and Pascal Poupart. Dis-

tributional reinforcement learning with monotonic splines. In Proceedings of the

International Conference on Learning Representations, 2021.

Clare Lyle, Pablo Samuel Castro, and Marc G. Bellemare. A comparative analysis

of expected and distributional reinforcement learning. In Proceedings of the AAAI

Conference on Artiﬁcial Intelligence, 2019.

Clare Lyle, Mark Rowland, Georg Ostrovski, and Will Dabney. On the effect of

auxiliary tasks on representation dynamics. In Proceedings of the International

Conference on Artiﬁcial Intelligence and Statistics, 2021.

Xueguang Lyu and Christopher Amato. Likelihood quantile networks for coordinating

multi-agent reinforcement learning. In Proceedings of the International Conference

on Autonomous Agents and Multiagent Systems, 2020.

Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew

Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment:

Evaluation protocols and open problems for general agents. Journal of Artiﬁcial

Intelligence Research, 2018.

David J.C. MacKay. Information theory, inference and learning algorithms. Cam-

bridge University Press, 2003.

Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Arnaud Doucet,

Andriy Mnih, and Yee Whye Teh. Particle value functions. In Proceedings of the

International Conference on Learning Representations (Workshop Track), 2017.

João Guilherme Madeira Araújo, Johan Samir Obando Ceron, and Pablo Samuel

Castro. Lifting the veil on hyper-parameters for value-based deep reinforcement

learning. In NeurIPS 2021 Workshop: LatinX in AI, 2021.

Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis,

University of Alberta, 2011.

Petr Mandl. On the variance in controlled Markov chains. Kybernetika, 7(1):1–12,

1971.

Shie Mannor and John Tsitsiklis. Mean-variance optimization in Markov decision

processes. In Proceedings of the International Conference on Machine Learning,

2011.

Shie Mannor, Duncan Simester, Peng Sun, and John N. Tsitsiklis. Bias and variance

approximation in value function estimates. Management Science, 53, 2007.

Harry M. Markowitz. Portfolio selection. Journal of Finance, 7:77–91, 1952.

John Martin, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. Stochastically

dominant distributional reinforcement learning. In Proceedings of the International

Conference on Machine Learning, 2020.

Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The

Annals of Probability, pages 1269–1283, 1990.

Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Hysteretic Q-

learning: an algorithm for decentralized reinforcement learning in cooperative multi-

agent teams. In IEEE International Conference on Intelligent Robots and Systems,

2007.

Laëtitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. Independent

reinforcement learners in cooperative Markov games: A survey regarding coordination

problems. The Knowledge Engineering Review, 27(1):1–31, 2012.

Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu.

Distributional reinforcement learning for efﬁcient exploration. In Proceedings of the

International Conference on Machine Learning, 2019.

Andrew K. McCallum. Reinforcement learning with selective perception and hidden

state. PhD thesis, University of Rochester, 1995.

Sean Meyn. Control Systems and Reinforcement Learning. Cambridge University

Press, 2022.

Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability.

Cambridge University Press, 2012.

Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine

learning, 49(2):267–290, 2002.

Ralph R. Miller, Robert C. Barnet, and Nicholas J. Grahame. Assessment of the

Rescorla-Wagner model. Psychological bulletin, 117(3):363, 1995.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness,

Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg

Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen

King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-

level control through deep reinforcement learning. Nature, 518(7540):529–533,

2015.

Gordon J. Mogenson, Douglas L. Jones, and Chi Yiu Yim. From motivation to action:

functional interface between the limbic system and the motor system. Progress in

neurobiology, 14(2-3):69–97, 1980.

Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de

l’Académie Royale des Sciences de Paris, 1781.

P. Read Montague, Peter Dayan, and Terrence J. Sejnowski. A framework for mes-

encephalic dopamine systems based on predictive Hebbian learning. Journal of

neuroscience, 16(5):1936–1947, 1996.

Nick Montfort and Ian Bogost. Racing the beam: The Atari Video Computer System.

MIT Press, 2009.

Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement

learning with less data and less time. Machine Learning, 1993.

Oskar Morgenstern and John von Neumann. Theory of games and economic behavior.

Princeton University Press, 1944.

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and

Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforce-

ment learning. In Proceedings of the International Conference on Machine Learning,

2010a.

Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and

Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning.

In Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, 2010b.

Thomas E. Morton. On the asymptotic convergence rate of cost differences for

Markovian decision processes. Operations Research, 19(1):244–248, 1971.

Bradford W. Mott, Stephen Anthony, and the Stella team. Stella: A multi-platform

Atari 2600 VCS emulator. http://stella.sourceforge.net, 1995–2021.

Alfred Müller. Integral probability metrics and their generating classes of functions.

Advances in Applied Probability, 29(2):429–443, 1997.

Timothy H. Muller, James L. Butler, Sebastijan Veselic, Bruno Miranda, Timothy E.J.

Behrens, Zeb Kurth-Nelson, and Steven W. Kennerley. Distributional reinforcement

learning in prefrontal cortex. bioRxiv, 2021.

Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the

International Conference on Machine Learning, 2003.

Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe

and efﬁcient off-policy reinforcement learning. In Advances in Neural Information

Processing Systems, 2016.

Kevin P. Murphy. Machine learning: A probabilistic perspective. MIT Press, 2012.

Christian A. Naesseth, Fredrik Lindsten, and Thomas B. Schön. Elements of sequential

Monte Carlo. Foundations and Trends® in Machine Learning, 12(3):307–392, 2019.

Vinod Nair and Geoffrey E. Hinton. Rectiﬁed linear units improve restricted Boltzmann

machines. In Proceedings of the International Conference on Machine Learning,

2010.

Daniel W. Nam, Younghoon Kim, and Chan Y. Park. GMAC: A distributional perspec-

tive on actor-critic framework. In Proceedings of the International Conference on

Machine Learning, 2021.

Ralph Neininger. Limit Laws for Random Recursive Structures and Algorithms. PhD

thesis, University of Freiburg, 1999.

Ralph Neininger. On a multivariate contraction method for random recursive structures

with applications to Quicksort. Random Structures & Algorithms, 19(3-4):498–524,

2001.

Ralph Neininger and Ludger Rüschendorf. A general limit theorem for recursive

algorithms and combinatorial structures. The Annals of Applied Probability, 14(1):

378–418, 2004.

Whitney K. Newey and James L. Powell. Asymmetric least squares estimation and

testing. Econometrica: Journal of the Econometric Society, pages 819–847, 1987.

Thanh Tang Nguyen, Sunil Gupta, and Svetha Venkatesh. Distributional reinforcement

learning via moment matching. In Proceedings of the AAAI Conference on Artiﬁcial

Intelligence, 2021.

André Nieoullon. Dopamine and the regulation of cognition and attention. Progress

in neurobiology, 67(1):53–83, 2002.

Nikolay Nikolov, Johannes Kirschner, Felix Berkenkamp, and Andreas Krause.

Information-directed exploration for deep reinforcement learning. In Proceedings of

the International Conference on Learning Representations, 2019.

Yael Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology,

53(3):139–154, 2009.

Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Kather-

ine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill,

3(3), 2018.

Frans A. Oliehoek and Christopher Amato. A concise introduction to decentralized

POMDPs. Springer, 2016.

Frans A. Oliehoek, Matthijs T.J. Spaan, and Nikos Vlassis. Optimal and approximate

Q-value functions for decentralized POMDPs. Journal of Artiﬁcial Intelligence

Research, 32:289–353, 2008.

Ditte Olsen, Niels Wellner, Mathias Kaas, Inge E.M. de Jong, Florence Sotty, Michael

Didriksen, Simon Glerup, and Anders Nykjaer. Altered dopaminergic ﬁring pat-

tern and novelty response underlie ADHD-like behavior of SorCS2-deficient mice.

Translational Psychiatry, 11(1):1–14, 2021.

Shayegan Omidshaﬁei, Jason Pazis, Christopher Amato, Jonathan P. How, and John

Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial

observability. In Proceedings of the International Conference on Machine Learning,

2017.

Art B. Owen. Monte Carlo theory, methods and examples. 2013.

Gregory Palmer, Karl Tuyls, Daan Bloembergen, and Rahul Savani. Lenient multi-

agent deep reinforcement learning. In Proceedings of the International Conference

on Autonomous Agents and Multiagent Systems, 2018.

Gregory Palmer, Rahul Savani, and Karl Tuyls. Negative update intervals in deep

multi-agent reinforcement learning. In Proceedings of the International Conference

on Autonomous Agents and Multiagent Systems, 2019.

Liviu Panait, R. Paul Wiegand, and Sean Luke. Improving coevolutionary search for

optimal multiagent behaviors. In Proceedings of the International Joint Conference

on Artiﬁcial Intelligence, 2003.

Liviu Panait, Keith Sullivan, and Sean Luke. Lenient learners in cooperative multiagent

systems. In Proceedings of the International Conference on Autonomous Agents and

Multiagent Systems, 2006.

Victor M. Panaretos and Yoav Zemel. An invitation to statistics in Wasserstein space.

Springer Nature, 2020.

Ronald Parr, Christopher Painter-Wakeﬁeld, Lihong Li, and Michael Littman. Ana-

lyzing feature generation for value-function approximation. In Proceedings of the

International Conference on Machine Learning, 2007.

Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakeﬁeld, and Michael L.

Littman. An analysis of linear models, linear value-function approximation, and

feature selection for reinforcement learning. In Proceedings of the International

Conference on Machine Learning, 2008.

Ivan P. Pavlov. Conditioned reﬂexes: An investigation of the physiological activity of

the cerebral cortex., 1927.

Yuval Peres, Wilhelm Schlag, and Boris Solomyak. Sixty years of Bernoulli

convolutions. In Fractal geometry and stochastics II, pages 39–65. Springer, 2000.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications

to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607,

2019.

L.A. Prashanth and Michael Fu. Risk-sensitive reinforcement learning. arXiv preprint

arXiv:1810.09126, 2021.

L.A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-

sensitive MDPs. In Advances in Neural Information Processing Systems, 2013.

Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-

policy policy evaluation. In Proceedings of the International Conference on Machine

Learning, 2000.

Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol

Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic

control. In Proceedings of the International Conference on Machine Learning, 2017.

Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic

programming. John Wiley & Sons, 2014.

Martin L. Puterman and Moon Chirl Shin. Modiﬁed policy iteration algorithms for

discounted Markov decision problems. Management Science, 24(11):1127–1137,

1978.

Wei Qiu, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana

Obraztsova, and Zinovi Rabinovich. RMIX: Learning risk-sensitive policies for coop-

erative reinforcement learning agents. In Advances in Neural Information Processing

Systems, 2021.

Chao Qu, Shie Mannor, and Huan Xu. Nonlinear distributional gradient temporal-

difference learning. In Proceedings of the International Conference on Machine

Learning, 2019.

John Quan and Georg Ostrovski. DQN Zoo: Reference implementations of DQN-based

agents, 2020. URL http://github.com/deepmind/dqn_zoo.

Svetlozar T. Rachev and Ludger Rüschendorf. Probability metrics and recursive algorithms.

Advances in Applied Probability, 27(3):770–799, 1995.

Svetlozar T. Rachev, Lev Klebanov, Stoyan V. Stoyanov, and Frank Fabozzi. The

methods of distances in the theory of probability and statistics. Springer Science &

Business Media, 2013.

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob

Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation

for deep multi-agent reinforcement learning. In Proceedings of the International

Conference on Machine Learning, 2018.

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar,

Jakob N. Foerster, and Shimon Whiteson. Monotonic value function factorisation for

deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21:

178–1, 2020.

Robert A. Rescorla and Allan R. Wagner. A theory of Pavlovian conditioning: Vari-

ations in the effectiveness of reinforcement and nonreinforcement. In Classical

conditioning II, chapter 3, pages 64–99. Appleton-Century-Crofts, 1972.

Martin Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural

reinforcement learning method. In Proceedings of the European Conference on

Machine Learning, 2005.

Martin Riedmiller, Thomas Gabel, Roland Hafner, and Sascha Lange. Reinforcement

learning for robot soccer. Autonomous Robots, 27(1):55–73, 2009.

Maria L. Rizzo and Gábor J. Székely. Energy distance. Wiley Interdisciplinary Reviews:

Computational Statistics, 8(1):27–38, 2016.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals

of mathematical statistics, pages 400–407, 1951.

Herbert Robbins and David Siegmund. A convergence theorem for non negative almost

supermartingales and some applications. In Optimizing methods in statistics, pages

233–257. Elsevier, 1971.

Christian Robert and George Casella. Monte Carlo statistical methods. Springer

Science & Business Media, 2004.

R. Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk.

Journal of risk, 2:21–42, 2000.

R. Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general

loss distributions. Journal of banking & ﬁnance, 26(7):1443–1471, 2002.

Uwe Rösler. A limit theorem for “quicksort”. RAIRO-Theoretical Informatics and

Applications, 25(1):85–100, 1991.

Uwe Rösler. A ﬁxed point theorem for distributions. Stochastic Processes and their

Applications, 42(2):195–214, 1992.

Uwe Rösler. On the analysis of stochastic divide and conquer algorithms. Algorithmica,

29(1):238–261, 2001.

Uwe Rösler and Ludger Rüschendorf. The contraction method for recursive algorithms.

Algorithmica, 29(1-2):3–33, 2001.

Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh.

An analysis of categorical distributional reinforcement learning. In Proceedings of

the International Conference on Artiﬁcial Intelligence and Statistics, 2018.

Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare,

and Will Dabney. Statistics and samples in distributional reinforcement learning. In

Proceedings of the International Conference on Machine Learning, 2019.

Mark Rowland, Shayegan Omidshaﬁei, Daniel Hennes, Will Dabney, Andrew Jaegle,

Paul Muller, Julien Pérolat, and Karl Tuyls. Temporal difference and return optimism

in cooperative multi-agent reinforcement learning. In Adaptive and Learning Agents

Workshop at the International Conference on Autonomous Agents and Multiagent

Systems, 2021.

Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with

applications to image databases. In Sixth International Conference on Computer

Vision, pages 59–66. IEEE, 1998.

Walter Rudin. Principles of mathematical analysis, volume 3. McGraw-Hill New

York, 1976.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning

representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist

systems. Technical report, Cambridge University Engineering Department, 1994.

Ludger Rüschendorf. On stochastic recursive equations of sum and max type. Journal

of applied probability, 43(3):687–703, 2006.

Ludger Rüschendorf and Ralph Neininger. A survey of multivariate aspects of the

contraction method. Discrete Mathematics & Theoretical Computer Science, 8, 2006.

Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision

processes. Mathematical programming, 125(2):235–261, 2010.

Arthur L. Samuel. Some studies in machine learning using the game of checkers. IBM

Journal of Research and Development, 1959.

Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of

Variations, PDEs and Modeling. Birkhäuser, 2015.

Simo Särkkä. Bayesian ﬁltering and smoothing. Cambridge University Press, 2013.

Makoto Sato, Hajime Kimura, and Shigenobu Kobayashi. TD algorithm for the

variance of return and mean-variance reinforcement learning. Transactions of the

Japanese Society for Artiﬁcial Intelligence, 16(3):353–362, 2001.

Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience

replay. In Proceedings of the International Conference on Learning Representations,

2016.

Bruno Scherrer. Should one compute the temporal difference ﬁx point or minimize

the Bellman residual? The unified oblique projection view. In Proceedings of the

International Conference on Machine Learning, 2010.

Bruno Scherrer. Approximate policy iteration schemes: A comparison. In Proceedings

of the International Conference on Machine Learning, 2014.

Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary

inﬁnite-horizon Markov decision processes. In Advances in Neural Information

Processing Systems, 2012.

Matthew Schlegel, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White,

and Martha White. General value function networks. Journal of Artiﬁcial Intelligence

Research (JAIR), 2021.

Wolfram Schultz. Responses of midbrain dopamine neurons to behavioral trigger

stimuli in the monkey. Journal of neurophysiology, 56(5):1439–1461, 1986.

Wolfram Schultz. Getting formal with dopamine and reward. Neuron, 36(2):241–263,

2002.

Wolfram Schultz. Dopamine reward prediction-error signalling: a two-component

response. Nature reviews neuroscience, 17(3):183–195, 2016.

Wolfram Schultz and Ranulfo Romo. Dopamine neurons of the monkey midbrain: Con-

tingencies of responses to stimuli eliciting immediate behavioral reactions. Journal

of neurophysiology, 63(3):607–624, 1990.

Wolfram Schultz, Paul Apicella, and Tomas Ljungberg. Responses of monkey

dopamine neurons to reward and conditioned stimuli during successive steps of

learning a delayed response task. Journal of neuroscience, 13(3):900–913, 1993.

Wolfram Schultz, Peter Dayan, and P. Read Montague. A neural substrate of prediction

and reward. Science, 275(5306):1593–1599, 1997.

Ashvin Shah. Psychological and neuroscientiﬁc connections with reinforcement

learning. In Reinforcement Learning, pages 507–537. Springer, 2012.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on

stochastic programming: Modeling and theory. SIAM, 2009.

Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences,

39(10):1095–1100, 1953.

Yun Shen, Wilhelm Stannat, and Klaus Obermayer. Risk-sensitive Markov control

processes. SIAM Journal on Control and Optimization, 51(5):3652–3672, 2013.

Yoav Shoham and Kevin Leyton-Brown. Multiagent systems. Cambridge University

Press, 2009.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George

van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,

Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya

Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,

and Demis Hassabis. Mastering the game of Go with deep neural networks and tree

search. Nature, 529(7587):484–489, 2016.

Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing

eligibility traces. Machine Learning, 22:123–158, 1996.

Matthew J. Sobel. The variance of discounted Markov decision processes. Journal of

Applied Probability, 19(4):794–802, 1982.

Boris Solomyak. On the random series ∑ ±λⁿ (an Erdős problem). Annals of

Mathematics, 142(3):611–625, 1995.

Thomas A. Stalnaker, James D. Howard, Yuji K. Takahashi, Samuel J. Gershman,

Thorsten Kahnt, and Geoffrey Schoenbaum. Dopamine neuron ensembles signal the

content of sensory prediction errors. eLife, 8:e49315, 2019.

Marc C. Steinbach. Markowitz revisited: Mean-variance models in ﬁnancial portfolio

analysis. SIAM review, 43(1):31–85, 2001.

Gilbert Strang. Introduction to linear algebra. Wellesley-Cambridge Press, 1993.

Felipe Petroski Such, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel

Castro, Yulun Li, Ludwig Schubert, Marc G. Bellemare, Jeff Clune, and Joel Lehman.

An Atari model zoo for analyzing, visualizing, and comparing deep reinforcement

learning agents. In Proceedings of the International Joint Conference on Artiﬁcial

Intelligence, 2019.

Wei-Fang Sun, Cheng-Kuang Lee, and Chun-Yi Lee. DFAC framework: Factorizing

the value function via quantile mixture for multi-agent distributional Q-learning. In

Proceedings of the International Conference on Machine Learning, 2021.

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius

Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl

Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent

learning. arXiv preprint arXiv:1706.05296, 2017.

Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD

thesis, University of Massachusetts, Amherst, 1984.

Richard S. Sutton. Learning to predict by the methods of temporal differences.

Machine learning, 3(1):9–44, 1988.

Richard S. Sutton. TD models: Modeling the world at a mixture of time scales. In

Proceedings of the International Conference on Machine Learning, 1995.

Richard S. Sutton. Generalization in reinforcement learning: Successful examples

using sparse coarse coding. In Advances in Neural Information Processing Systems,

1996.

Richard S. Sutton. Open theoretical questions in reinforcement learning. In European

Conference on Computational Learning Theory, 1999.

Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction.

MIT Press, 2018.

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-

MDPs: A framework for temporal abstraction in reinforcement learning. Artiﬁcial

intelligence, 112(1-2):181–211, 1999.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour.

Policy gradient methods for reinforcement learning with function approximation. In

Advances in Neural Information Processing Systems, 2000.

Richard S. Sutton, Csaba Szepesvári, and Hamid Reza Maei. A convergent

O(n) temporal-difference algorithm for off-policy learning with linear function

approximation. In Advances in Neural Information Processing Systems, 2008a.

Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael Bowling.

Dyna-style planning with linear function approximation and prioritized sweeping. In

Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, 2008b.

Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David

Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for

temporal-difference learning with linear function approximation. In Proceedings of

the International Conference on Machine Learning, 2009.

Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski,

Adam White, and Doina Precup. Horde: A scalable real-time architecture for learn-

ing knowledge from unsupervised sensorimotor interaction. In Proceedings of the

International Conference on Autonomous Agents and Multiagent Systems, 2011.

Gábor J. Székely. E-statistics: The energy of statistical samples. Technical Report

02-16, Bowling Green State University, Department of Mathematics and Statistics,

2002.

Gábor J. Székely and Maria L. Rizzo. Energy statistics: A class of statistics based on

distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013.

Csaba Szepesvári. The asymptotic convergence-rate of Q-learning. In Advances in

Neural Information Processing Systems, 1998.

Csaba Szepesvári. Algorithms for reinforcement learning. Morgan & Claypool

Publishers, 2010.

Csaba Szepesvári. Constrained MDPs and the reward hypothesis. https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html, 2020. Accessed June 25, 2021.

Yuji K. Takahashi, Hannah M. Batchelor, Bing Liu, Akash Khanna, Marisela Morales,

and Geoffrey Schoenbaum. Dopamine neurons respond to errors in the prediction of

sensory features of expected rewards. Neuron, 95(6):1395–1405, 2017.

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related

risk criteria. In Proceedings of the International Conference on Machine Learning,

2012.

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Temporal difference methods for

the variance of the reward to go. In Proceedings of the International Conference on

Machine Learning, 2013.

Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling.

In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, 2015.

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the variance of the reward-

to-go. Journal of Machine Learning Research, 17(1):361–396, 2016.

Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan

Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep

reinforcement learning. PloS one, 12(4):e0172395, 2017.

Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents.

In Proceedings of the International Conference on Machine Learning, 1993.

Pablo Tano, Peter Dayan, and Alexandre Pouget. A local temporal difference code for

distributional reinforcement learning. In Advances in Neural Information Processing

Systems, 2020.

James W. Taylor. Estimating value at risk and expected shortfall using expectiles.

Journal of Financial Econometrics, 6(2):231–252, 2008.

Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of

the ACM, 38(3), 1995.

Chen Tessler, Guy Tennenholtz, and Shie Mannor. Distributional policy optimization:

An alternative approach for continuous control. In Advances in Neural Information

Processing Systems, 2019.

T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its

recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Marc Toussaint. Robot trajectory optimization using approximate inference. In

Proceedings of the International Conference on Machine Learning, 2009.

Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and

continuous state Markov decision processes. In Proceedings of the International

Conference on Machine Learning, 2006.

John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine

learning, 16(3):185–202, 1994.

John N. Tsitsiklis. On the convergence of optimistic policy iteration. Journal of

Machine Learning Research, 3:59–72, 2002.

John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning

with function approximation. IEEE Transactions on Automatic Control, 42(5):674–

690, 1997.

Cassius T. Ionescu Tulcea. Mesures dans les espaces produits. Atti Accademia

Nazionale Lincei Rend, 8(7), 1949.

Aaron Van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neu-

ral networks. In Proceedings of the International Conference on Machine Learning,

2016.

Aad W. Van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.

J. van der Wal. Stochastic dynamic programming: Successive approximations and

nearly optimal strategies for Markov decision processes and Markov games. Stichting

Mathematisch Centrum, 1981.

Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with

double Q-learning. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence,

2016.

Hado P. van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver.

Learning values across many orders of magnitude. In Advances in Neural Information

Processing Systems, 2016.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.

Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances

in Neural Information Processing Systems, 2017.

Mel Vecerik, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon

Scholz. A practical approach to insertion with variable socket position using deep rein-

forcement learning. In IEEE International Conference on Robotics and Automation,

2019.

Joel Veness, Kee Siong Ng, Marcus Hutter, William T. B. Uther, and David Silver. A

Monte-Carlo AIXI Approximation. Journal of Artiﬁcial Intelligence Research, 40:

95–142, 2011.

Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Des-

jardins. Compress and control. In Proceedings of the AAAI Conference on Artiﬁcial

Intelligence, 2015.

A. M. Vershik. Long history of the Monge-Kantorovich transportation problem. The

Mathematical Intelligencer, 35(4):1–9, 2013.

Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Munchausen reinforcement

learning. In Advances in Neural Information Processing Systems, 2020.

Cédric Villani. Topics in optimal transportation. Graduate Studies in Mathematics.

American Mathematical Society, 2003.

Cédric Villani. Optimal transport: old and new. Springer Science & Business Media,

2008.

J. von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100

(1):295–320, 1928.

Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Neil Walton. Lecture Notes on Stochastic Control. Unpublished, 2021.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings

of the International Conference on Machine Learning, 2016.

Christopher J.C.H. Watkins. Learning from delayed rewards. PhD thesis, King’s

College, Cambridge, 1989.

Christopher J.C.H. Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):

279–292, 1992.

Jonathan Weed and Francis Bach. Sharp asymptotic and ﬁnite-sample rates of conver-

gence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648,

2019.

Ermo Wei and Sean Luke. Lenient learning in independent-learner stochastic

cooperative games. Journal of Machine Learning Research, 17(1):2914–2955, 2016.

Paul J. Werbos. Applications of advances in nonlinear sensitivity analysis. In System

modeling and optimization, pages 762–770. Springer, 1982.

D. J. White. Mean, variance, and probabilistic criteria in ﬁnite Markov decision

processes: a review. Journal of Optimization Theory and Applications, 56(1):1–29,

1988.

Martha White. Unifying task speciﬁcation in reinforcement learning. In Proceedings

of the International Conference on Machine Learning, 2017.

Martha White and Adam White. A greedy approach to adapting the trace parameter

for temporal difference learning. In Proceedings of the International Conference on

Autonomous Agents and Multiagent Systems, 2016.

Norman M. White and Marc Viaud. Localized intracaudate dopamine D2 receptor

activation during the post-training period improves memory for visual or olfactory

conditioned emotional responses in rats. Behavioral and neural biology, 55(3):

255–269, 1991.

Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In WESCON

Convention Record Part IV, 1960.

David Williams. Probability with martingales. Cambridge University Press, 1991.

Roy A. Wise. Dopamine, learning and motivation. Nature reviews neuroscience, 5(6):

483–494, 2004.

Peter R. Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik

Subramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert,

Florian Fuchs, Leilani Gilpin, Piyush Khandelwal, Varun Kompella, HaoChih Lin,

Patrick MacAlpine, Declan Oller, Takuma Seno, Craig Sherstan, Michael D. Thomure,

Houmehr Aghabozorgi, Leon Barrett, Rory Douglas, Dion Whitehead, Peter Dürr,

Peter Stone, Michael Spranger, and Hiroaki Kitano. Outracing champion Gran

Turismo drivers with deep reinforcement learning. Nature, 602(7896):223–228, 2022.

Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully

parameterized quantile function for distributional reinforcement learning. In Advances

in Neural Information Processing Systems, 2019.

Kenny Young and Tian Tian. MinAtar: An Atari-inspired testbed for thorough and

reproducible reinforcement learning experiments. arXiv preprint, 2019.

Yuguang Yue, Zhendong Wang, and Mingyuan Zhou. Implicit distributional rein-

forcement learning. In Advances in Neural Information Processing Systems,

2020.

Shangtong Zhang and Hengshuai Yao. QUOTA: The quantile option architecture

for reinforcement learning. In Proceedings of the AAAI Conference on Artiﬁcial

Intelligence, 2019.

Fan Zhou, Zhoufan Zhu, Qi Kuang, and Liwen Zhang. Non-decreasing quantile

function network with efﬁcient exploration for distributional reinforcement learning.

In Proceedings of the International Joint Conference on Artiﬁcial Intelligence, 2021.

Johanna F. Ziegel. Coherence and elicitability. Mathematical Finance, 26(4):901–918,

2016.

Vladimir M. Zolotarev. Metric distances in spaces of random variables and their

distributions. Sbornik: Mathematics, 30(3):373–401, 1976.