References
Achab, Mastane. 2020. Ranking and risk-aware reinforcement learning. PhD diss.,
Institut Polytechnique de Paris.
Agarwal, Rishabh, Dale Schuurmans, and Mohammad Norouzi. 2020. An optimistic
perspective on offline reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Agarwal, Rishabh, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G.
Bellemare. 2021. Deep reinforcement learning at the edge of the statistical precipice. In
Advances in Neural Information Processing Systems.
Aigner, D. J., Takeshi Amemiya, and Dale J. Poirier. 1976. On the estimation of pro-
duction frontiers: Maximum likelihood estimation of the parameters of a discontinuous
density function. International Economic Review 17 (2): 377–396.
Aldous, David J., and Antar Bandyopadhyay. 2005. A survey of max-type recursive
distributional equations. The Annals of Applied Probability 15 (2): 1047–1110.
Alsmeyer, Gerold. 2012. Random recursive equations and their distributional fixed
points. Unpublished manuscript.
Altman, Eitan. 1999. Constrained Markov decision processes. Vol. 7. CRC Press.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savaré. 2005. Gradient flows: In metric
spaces and in the space of probability measures. Springer Science & Business Media.
Amortila, Philip, Marc G. Bellemare, Prakash Panangaden, and Doina Precup. 2019.
Temporally extended metrics for Markov decision processes. In SafeAI: AAAI Workshop
on Artificial Intelligence Safety.
Amortila, Philip, Doina Precup, Prakash Panangaden, and Marc G. Bellemare. 2020.
A distributional analysis of sampling-based reinforcement learning algorithms. In
Proceedings of the International Conference on Artificial Intelligence and Statistics.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. In
Proceedings of the International Conference on Machine Learning.
Artzner, Philippe, Freddy Delbaen, Jean-Marc Eber, and David Heath. 1999. Coherent
measures of risk. Mathematical Finance 9 (3): 203–228.
Artzner, Philippe, Freddy Delbaen, Jean-Marc Eber, David Heath, and Hyejin Ku. 2007.
Coherent multiperiod risk adjusted values and Bellman’s principle. Annals of Operations
Research 152 (1): 5–22.
Arulkumaran, Kai, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath.
2017. A brief survey of deep reinforcement learning. IEEE Signal Processing Magazine,
Special Issue on Deep Learning for Image Understanding.
Auer, Peter, Mark Herbster, and Manfred K. Warmuth. 1995. Exponentially many local
minima for single neurons. In Advances in Neural Information Processing Systems.
Azar, Mohammad Gheshlaghi, Rémi Munos, Mohammad Ghavamzadeh, and Hilbert
J. Kappen. 2011. Speedy Q-learning. In Advances in Neural Information Processing
Systems.
Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert J. Kappen. 2012. On the sample
complexity of reinforcement learning with a generative model. In Proceedings of the
International Conference on Machine Learning.
Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert J. Kappen. 2013. Minimax
PAC bounds on the sample complexity of reinforcement learning with a generative
model. Machine Learning 91 (3): 325–349.
Azar, Mohammad Gheshlaghi, Ian Osband, and Rémi Munos. 2017. Minimax regret
bounds for reinforcement learning. In Proceedings of the International Conference on
Machine Learning.
Azevedo, Frederico A. C., Ludmila R. B. Carvalho, Lea T. Grinberg, José Marcelo
Farfel, Renata E. L. Ferretti, Renata E. P. Leite, Wilson Jacob Filho, Roberto Lent, and
Suzana Herculano-Houzel. 2009. Equal numbers of neuronal and nonneuronal cells
make the human brain an isometrically scaled-up primate brain. Journal of Comparative
Neurology 513 (5): 532–541.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine trans-
lation by jointly learning to align and translate. In Proceedings of the International
Conference on Learning Representations.
Baird, Leemon C. 1995. Residual algorithms: Reinforcement learning with function
approximation. In Proceedings of the International Conference on Machine Learning.
Baird, Leemon C. 1999. Reinforcement learning through gradient descent. PhD diss.,
Carnegie Mellon University.
Banach, Stefan. 1922. Sur les opérations dans les ensembles abstraits et leur application
aux équations intégrales. Fundamenta Mathematicae 3 (1): 133–181.
Banino, Andrea, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap,
Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph Modayil,
Greg Wayne, Hubert Soyer, Fabio Viola, Brian Zhang, Ross Goroshin, Neil Rabinowitz,
Razvan Pascanu, Charlie Beattie, Stig Petersen, Amir Sadik, Stephen Gaffney, Helen
King, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, and Dharshan Kumaran. 2018.
Vector-based navigation using grid-like representations in artificial agents. Nature 557
(7705): 429–433.
Barbeau, André. 1974. Drugs affecting movement disorders. Annual Review of
Pharmacology 14 (1): 91–113.
Bard, Nolan, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis
Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain
Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling.
2020. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence
280:103216.
Barnard, Etienne. 1993. Temporal-difference methods and Markov models. IEEE
Transactions on Systems, Man, and Cybernetics 23 (2): 357–365.
Barreto, André, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van
Hasselt, and David Silver. 2017. Successor features for transfer in reinforcement learning.
In Advances in Neural Information Processing Systems.
Barth-Maron, Gabriel, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan,
Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. 2018. Distributed dis-
tributional deterministic policy gradients. In Proceedings of the International Conference
on Learning Representations.
Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. 1995. Learning to act using
real-time dynamic programming. Artificial Intelligence 72 (1): 81–138.
Barto, Andrew G., Richard S. Sutton, and Charles W. Anderson. 1983. Neuronlike
adaptive elements that can solve difficult learning control problems. IEEE Transactions
on Systems, Man, and Cybernetics 13 (5): 834–846.
Bäuerle, Nicole, and Jonathan Ott. 2011. Markov decision processes with average-value-
at-risk criteria. Mathematical Methods of Operations Research 74 (3): 361–379.
Beattie, Charles, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright,
Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian
Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton,
Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. 2016.
DeepMind Lab. arXiv preprint arXiv:1612.03801.
Bellemare, Marc G., Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C.
Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. 2020. Autonomous
navigation of stratospheric balloons using reinforcement learning. Nature 588 (7836):
77–82.
Bellemare, Marc G., Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel
Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. 2019a. A
geometric perspective on optimal representations for reinforcement learning. In Advances
in Neural Information Processing Systems.
Bellemare, Marc G., Will Dabney, and Rémi Munos. 2017a. A distributional perspective
on reinforcement learning. In Proceedings of the International Conference on Machine
Learning.
Bellemare, Marc G., Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshmi-
narayanan, Stephan Hoyer, and Rémi Munos. 2017b. The Cramer distance as a solution
to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743.
Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. 2013a. The
Arcade Learning Environment: An evaluation platform for general agents. Journal of
Artificial Intelligence Research 47 (June): 253–279.
Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. 2015. The
Arcade Learning Environment: An evaluation platform for general agents, extended
abstract. In European Workshop on Reinforcement Learning.
Bellemare, Marc G., Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi
Munos. 2016. Increasing the action gap: New operators for reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence.
Bellemare, Marc G., Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra.
2019b. Distributional reinforcement learning with linear function approximation. In
Proceedings of the International Conference on Artificial Intelligence and Statistics.
Bellemare, Marc G., Joel Veness, and Michael Bowling. 2012a. Investigating contingency
awareness using Atari 2600 games. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Bellemare, Marc G., Joel Veness, and Michael Bowling. 2012b. Sketch-based linear
value function approximation. In Advances in Neural Information Processing Systems.
Bellemare, Marc G., Joel Veness, and Michael Bowling. 2013b. Bayesian learning of
recursively factored environments. In Proceedings of the International Conference on
Machine Learning.
Bellini, Fabio, and Elena Di Bernardino. 2017. Risk management with expectiles. The
European Journal of Finance 23 (6): 487–506.
Bellini, Fabio, Bernhard Klar, Alfred Müller, and Emanuela Rosazza Gianin. 2014.
Generalized quantiles as risk measures. Insurance: Mathematics and Economics 54:41–
48.
Bellman, Richard E. 1957a. A Markovian decision process. Journal of Mathematics and
Mechanics 6 (5): 679–684.
Bellman, Richard E. 1957b. Dynamic programming. Dover Publications.
Benveniste, Albert, Michel Métivier, and Pierre Priouret. 2012. Adaptive algorithms and
stochastic approximations. Springer Science & Business Media.
Bernstein, Daniel S., Robert Givan, Neil Immerman, and Shlomo Zilberstein. 2002.
The complexity of decentralized control of Markov decision processes. Mathematics of
Operations Research 27 (4): 819–840.
Bertsekas, Dimitri P. 1994. Generic rank-one corrections for value iteration in
Markovian decision problems. Technical report. Massachusetts Institute of Technology.
Bertsekas, Dimitri P. 1995. A counterexample to temporal differences learning. Neural
Computation 7 (2): 270–279.
Bertsekas, Dimitri P. 2011. Approximate policy iteration: A survey and some new
methods. Journal of Control Theory and Applications 9 (3): 310–335.
Bertsekas, Dimitri P. 2012. Dynamic programming and optimal control. 4th ed. Vol. 2.
Athena Scientific.
Bertsekas, Dimitri P., and Sergey Ioffe. 1996. Temporal differences-based policy itera-
tion and applications in neuro-dynamic programming. Technical report. Massachusetts
Institute of Technology.
Bertsekas, Dimitri P., and John N. Tsitsiklis. 1996. Neuro-dynamic programming. Athena
Scientific.
Bhandari, Jalaj, and Daniel Russo. 2021. On the linear convergence of policy gradient
methods for finite MDPs. In Proceedings of the International Conference on Artificial
Intelligence and Statistics.
Bhonker, Nadav, Shai Rozenberg, and Itay Hubara. 2017. Playing SNES in the Retro
Learning Environment. In Proceedings of the International Conference on Learning
Representations.
Bickel, Peter J., and David A. Freedman. 1981. Some asymptotic theory for the bootstrap.
The Annals of Statistics 9 (6): 1196–1217.
Billingsley, Patrick. 2012. Probability and measure. 4th ed. John Wiley & Sons.
Bishop, Christopher M. 2006. Pattern recognition and machine learning. Springer.
Bobkov, Sergey, and Michel Ledoux. 2019. One-dimensional empirical measures, order
statistics, and Kantorovich transport distances. American Mathematical Society.
Bodnar, Cristian, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan.
2020. Quantile QT-Opt for risk-aware vision-based robotic grasping. In Proceedings of
Robotics: Science and Systems.
Borkar, Vivek S. 1997. Stochastic approximation with two time scales. Systems &
Control Letters 29 (5): 291–294.
Borkar, Vivek S. 2008. Stochastic approximation: A dynamical systems viewpoint.
Cambridge University Press.
Borkar, Vivek S., and Sean P. Meyn. 2000. The ODE method for convergence of
stochastic approximation and reinforcement learning. SIAM Journal on Control and
Optimization 38 (2): 447–469.
Bottou, Léon. 1998. Online learning and stochastic approximations. On-line Learning in
Neural Networks 17 (9): 142.
Boutilier, Craig. 1996. Planning, learning and coordination in multiagent decision
processes. In Proceedings of the Conference on Theoretical Aspects of Rationality and
Knowledge.
Bowling, Michael, and Manuela Veloso. 2002. Multiagent learning using a variable
learning rate. Artificial Intelligence 136 (2): 215–250.
Boyan, Justin, and Andrew W. Moore. 1995. Generalization in reinforcement learning:
Safely approximating the value function. In Advances in Neural Information Processing
Systems.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex optimization. Cambridge
University Press.
Bradtke, Steven J., and Andrew G. Barto. 1996. Linear least-squares algorithms for
temporal difference learning. Machine Learning 22 (1): 33–57.
Braver, Todd S., Deanna M. Barch, and Jonathan D. Cohen. 1999. Cognition and
control in schizophrenia: A computational model of dopamine and prefrontal function.
Biological Psychiatry 46 (3): 312–328.
Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of
Markov chain Monte Carlo. CRC Press.
Brown, Daniel, Scott Niekum, and Marek Petrik. 2020. Bayesian robust optimization
for imitation learning. In Advances in Neural Information Processing Systems.
Browne, Cameron B., Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter
I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samoth-
rakis, and Simon Colton. 2012. A survey of Monte Carlo tree search methods. IEEE
Transactions on Computational Intelligence and AI in Games 4 (1): 1–43.
Cabi, Serkan, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova,
Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg
Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu
Wang. 2020. Scaling data-driven robotics with reward sketching and batch reinforcement
learning. In Proceedings of Robotics: Science and Systems.
Cagniard, Barbara, Peter D. Balsam, Daniela Brunner, and Xiaoxi Zhuang. 2006. Mice
with chronically elevated dopamine exhibit enhanced motivation, but not learning, for a
food reward. Neuropsychopharmacology 31 (7): 1362–1370.
Carpin, Stefano, Yinlam Chow, and Marco Pavone. 2016. Risk aversion in finite Markov
decision processes using total cost criteria and average value at risk. In Proceedings of
the IEEE International Conference on Robotics and Automation.
Castro, Pablo S., Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G.
Bellemare. 2018. Dopamine: A research framework for deep reinforcement learning.
arXiv preprint arXiv:1812.06110.
Ceron, Johan Samir Obando, and Pablo Samuel Castro. 2021. Revisiting Rainbow:
Promoting more insightful and inclusive deep reinforcement learning research. In
Proceedings of the International Conference on Machine Learning.
Chandak, Yash, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma
Brunskill, and Philip S. Thomas. 2021. Universal off-policy evaluation. In Advances in
Neural Information Processing Systems.
Chapman, David, and Leslie Pack Kaelbling. 1991. Input generalization in delayed
reinforcement learning: An algorithm and performance comparisons. In Proceedings of
the International Joint Conference on Artificial Intelligence.
Chen, Jinglin, and Nan Jiang. 2019. Information-theoretic considerations in batch
reinforcement learning. In Proceedings of the International Conference on Machine
Learning.
Chopin, Nicolas, and Omiros Papaspiliopoulos. 2020. An introduction to sequential
Monte Carlo. Springer.
Chow, Yinlam. 2017. Risk-sensitive and data-driven sequential decision making. PhD
diss., Stanford University.
Chow, Yinlam, and Mohammad Ghavamzadeh. 2014. Algorithms for CVaR optimization
in MDPs. In Advances in Neural Information Processing Systems.
Chow, Yinlam, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2018.
Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine
Learning Research 18 (1): 6070–6120.
Chow, Yinlam, Aviv Tamar, Shie Mannor, and Marco Pavone. 2015. Risk-sensitive
and robust decision-making: A CVaR optimization approach. In Advances in Neural
Information Processing Systems.
Chung, Kun-Jen, and Matthew J. Sobel. 1987. Discounted MDPs: Distribution functions
and exponential utility maximization. SIAM Journal on Control and Optimization 25
(1): 49–62.
Chung, Wesley, Somjit Nath, Ajin Joseph, and Martha White. 2018. Two-timescale
networks for nonlinear value function approximation. In Proceedings of the International
Conference on Learning Representations.
Claus, Caroline, and Craig Boutilier. 1998. The dynamics of reinforcement learning in
cooperative multiagent systems. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Clements, William R., Benoit-Marie Robaglia, Bastien Van Delft, Reda Bahi Slaoui, and
Sebastien Toth. 2020. Estimating risk and uncertainty in deep reinforcement learning.
In Workshop on Uncertainty and Robustness in Deep Learning at the International
Conference on Machine Learning.
Cobbe, Karl, Chris Hesse, Jacob Hilton, and John Schulman. 2020. Leveraging procedu-
ral generation to benchmark reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001.
Introduction to algorithms. MIT Press.
Cormode, G., and S. Muthukrishnan. 2005. An improved data stream summary: The
count-min sketch and its applications. Journal of Algorithms 55 (1): 58–75.
Cuturi, Marco. 2013. Sinkhorn distances: Lightspeed computation of optimal transport.
In Advances in Neural Information Processing Systems.
Da Silva, Felipe Leno, Anna Helena Reali Costa, and Peter Stone. 2019. Distributional
reinforcement learning applied to robot soccer simulation. In Adaptive and Learning
Agents Workshop at the International Conference on Autonomous Agents and Multiagent
Systems.
Dabney, Will, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G.
Bellemare, and David Silver. 2020a. The value-improvement path: Towards better
representations for reinforcement learning. In Proceedings of the AAAI Conference on
Artificial Intelligence.
Dabney, Will, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis
Hassabis, Rémi Munos, and Matthew Botvinick. 2020b. A distributional code for value
in dopamine-based reinforcement learning. Nature 577 (7792): 671–675.
Dabney, Will, Georg Ostrovski, David Silver, and Rémi Munos. 2018a. Implicit quantile
networks for distributional reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Dabney, Will, Mark Rowland, Marc G. Bellemare, and Rémi Munos. 2018b. Distribu-
tional reinforcement learning with quantile regression. In AAAI Conference on Artificial
Intelligence.
Dai, Bo, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and
Le Song. 2018. SBEED: Convergent reinforcement learning with nonlinear function
approximation. In Proceedings of the International Conference on Machine Learning.
Daw, Nathaniel D. 2003. Reinforcement learning models of the dopamine system and
their behavioral implications. PhD diss., Carnegie Mellon University.
Daw, Nathaniel D., and Philippe N. Tobler. 2014. Value learning through reinforcement:
The basics of dopamine and reinforcement learning. In Neuroeconomics, edited by
Paul W. Glimcher and Ernst Fehr, 283–298. Academic Press.
Dayan, Peter. 1992. The convergence of TD(λ) for general λ. Machine Learning 8 (3–4): 341–362.
Dayan, Peter. 1993. Improving generalization for temporal difference learning: The
successor representation. Neural Computation 5 (4): 613–624.
Dayan, Peter, and Terrence J. Sejnowski. 1994. TD(λ) converges with probability 1. Machine Learning 14 (3): 295–301.
Dearden, Richard, Nir Friedman, and Stuart Russell. 1998. Bayesian Q-learning. In
Proceedings of the AAAI Conference on Artificial Intelligence.
Degrave, Jonas, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey,
Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de
las Casas, Craig Donner, Leslie Fritz, Cristian Galperti, Andrea Huber, James Keeling,
Maria Tsimpoukelli, Jackie Kay, Antoine Merle, Jean-Marc Moret, Seb Noury, Federico
Pesamosca, David Pfau, Olivier Sauter, Cristian Sommariva, Stefano Coda, Basil Duval,
Ambrogio Fasoli, Pushmeet Kohli, Koray Kavukcuoglu, Demis Hassabis, and Martin
Riedmiller. 2022. Magnetic control of tokamak plasmas through deep reinforcement
learning. Nature 602:414–419.
Delage, Erick, and Shie Mannor. 2010. Percentile optimization for Markov decision
processes with parameter uncertainty. Operations Research 58 (1): 203–213.
Denardo, Eric V., and Uriel G. Rothblum. 1979. Optimal stopping, exponential utility,
and linear programming. Mathematical Programming 16 (1): 228–244.
Derman, Cyrus. 1970. Finite state Markovian decision processes. Academic Press.
Diaconis, Persi, and David Freedman. 1999. Iterated random functions. SIAM Review 41
(1): 45–76.
Doan, Thang, Bogdan Mazoure, and Clare Lyle. 2018. GAN Q-learning. arXiv preprint
arXiv:1805.04874.
Doob, J. L. 1994. Measure theory. Springer.
Doucet, Arnaud, Nando De Freitas, and Neil Gordon. 2001. Sequential Monte Carlo
methods in practice. Springer.
Doucet, Arnaud, and Adam M. Johansen. 2011. A tutorial on particle filtering and
smoothing: Fifteen years later. In The Oxford handbook of nonlinear filtering, edited by
Dan Crisan and Boris Rozovskii. Oxford University Press.
Duan, Jingliang, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, and Bo Cheng.
2021. Distributional soft actor-critic: Off-policy reinforcement learning for addressing
value estimation errors. IEEE Transactions on Neural Networks and Learning Systems.
Dvoretzky, Aryeh. 1956. On stochastic approximation. In Proceedings of the Berkeley
Symposium on Mathematical Statistics and Probability, 39–55.
Dvoretzky, Aryeh, Jack Kiefer, and Jacob Wolfowitz. 1956. Asymptotic minimax char-
acter of the sample distribution function and of the classical multinomial estimator. The
Annals of Mathematical Statistics 27 (3): 642–669.
Engel, Yaakov, Shie Mannor, and Ron Meir. 2003. Bayes meets Bellman: The Gaussian
process approach to temporal difference learning. In Proceedings of the International
Conference on Machine Learning.
Engel, Yaakov, Shie Mannor, and Ron Meir. 2007. Bayesian reinforcement learning
with Gaussian process temporal difference methods. Unpublished manuscript.
Engert, Martin. 1970. Finite dimensional translation invariant subspaces. Pacific Journal
of Mathematics 32 (2): 333–343.
Ernst, Damien, Pierre Geurts, and Louis Wehenkel. 2005. Tree-based batch mode
reinforcement learning. Journal of Machine Learning Research 6:503–556.
Eshel, Neir, Michael Bukwich, Vinod Rao, Vivian Hemmelder, Ju Tian, and Naoshige
Uchida. 2015. Arithmetic and local circuitry underlying dopamine prediction errors.
Nature 525 (7568): 243–246.
Even-Dar, Eyal, and Yishay Mansour. 2003. Learning rates for Q-learning. Journal of
Machine Learning Research 5 (1): 1–25.
Farahmand, Amir-massoud. 2011. Action-gap phenomenon in reinforcement learning.
In Advances in Neural Information Processing Systems.
Farahmand, Amir-massoud. 2019. Value function in frequency domain and the char-
acteristic value iteration algorithm. In Advances in Neural Information Processing
Systems.
Fedus, William, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo
Larochelle. 2019. Hyperbolic discounting and learning over multiple horizons. In
Multi-Disciplinary Conference on Reinforcement Learning and Decision-Making.
Feinberg, Eugene A. 2000. Constrained discounted Markov decision processes and
Hamiltonian cycles. Mathematics of Operations Research 25 (1): 130–140.
Ferns, Norm, Prakash Panangaden, and Doina Precup. 2004. Metrics for finite Markov
decision processes. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence.
Ferns, Norman, and Doina Precup. 2014. Bisimulation metrics are optimal value
functions. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Filar, Jerzy A., Dmitry Krass, and Keith W. Ross. 1995. Percentile performance crite-
ria for limiting average Markov decision processes. IEEE Transactions on Automatic
Control 40 (1): 2–10.
Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband,
Alex Graves, Vlad Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles
Blundell, and Shane Legg. 2018. Noisy networks for exploration. In Proceedings of the
International Conference on Learning Representations.
François-Lavet, Vincent, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle
Pineau. 2018. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning 11 (3–4): 219–354.
Freirich, Dror, Tzahi Shimkin, Ron Meir, and Aviv Tamar. 2019. Distributional multi-
variate policy evaluation and exploration with the Bellman GAN. In Proceedings of the
International Conference on Machine Learning.
Gardner, Matthew P. H., Geoffrey Schoenbaum, and Samuel J. Gershman. 2018. Rethink-
ing dopamine as generalized prediction error. Proceedings of the Royal Society B 285
(1891): 20181645.
German, Dwight C., Kebreten Manaye, Wade K. Smith, Donald J. Woodward, and Clif-
ford B. Saper. 1989. Midbrain dopaminergic cell loss in Parkinson’s disease: Computer
visualization. Annals of Neurology 26 (4): 507–514.
Ghavamzadeh, Mohammad, Shie Mannor, Joelle Pineau, and Aviv Tamar. 2015.
Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning 8 (5–6): 359–483.
Ghosh, Dibya, and Marc G. Bellemare. 2020. Representations for stable off-policy
reinforcement learning. In Proceedings of the International Conference on Machine
Learning.
Ghosh, Dibya, Marlos C. Machado, and Nicolas Le Roux. 2020. An operator view of
policy gradient methods. In Advances in Neural Information Processing Systems.
Gilbert, Hugo, Paul Weng, and Yan Xu. 2017. Optimizing quantiles in preference-
based Markov decision processes. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Glimcher, Paul W. 2011. Understanding dopamine and reinforcement learning: The
dopamine reward prediction error hypothesis. Proceedings of the National Academy of
Sciences 108 (Suppl. 3): 15647–15654.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets.
In Advances in Neural Information Processing Systems.
Gordon, Geoffrey. 1995. Stable function approximation in dynamic programming. In
Proceedings of the International Conference on Machine Learning.
Gordon, Neil J., David J. Salmond, and Adrian F. M. Smith. 1993. Novel approach
to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and
Signal Processing) 140 (2): 107–113.
Graesser, Laura, and Wah Loon Keng. 2019. Foundations of deep reinforcement learning:
Theory and practice in Python. Addison-Wesley Professional.
Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexan-
der Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research 13
(1): 723–773.
Grünewälder, Steffen, and Klaus Obermayer. 2011. The optimal unbiased value estimator
and its relation to LSTD, TD and MC. Machine Learning 83 (3): 289–330.
Gruslys, Audrunas, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Belle-
mare, and Rémi Munos. 2018. The Reactor: A fast and sample-efficient actor-critic agent
for reinforcement learning. In Proceedings of the International Conference on Learning
Representations.
Guo, Zhaohan Daniel, Bernardo Avila Pires, Bilal Piot, Jean-Bastien Grill, Florent Altché,
Rémi Munos, and Mohammad Gheshlaghi Azar. 2020. Bootstrap latent-predictive
representations for multitask reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Gurvits, Leonid, Long-Ji Lin, and Stephen José Hanson. 1994. Incremental learning
of evaluation functions for absorbing Markov chains: New methods and theorems.
Technical report. Siemens Corporate Research.
Harmon, Mance E., and Leemon C. Baird. 1996. A response to Bertsekas’ “A
counterexample to temporal-differences learning”. Technical report. Wright Laboratory.
Haskell, William B., and Rahul Jain. 2015. A convex analytic approach to risk-aware
Markov decision processes. SIAM Journal on Control and Optimization 53 (3): 1569–
1598.
Hegarty, Shane V., Aideen M. Sullivan, and Gerard W. O’Keeffe. 2013. Midbrain
dopaminergic neurons: A review of the molecular circuitry that regulates their
development. Developmental Biology 379 (2): 123–138.
Heger, Matthias. 1994. Consideration of risk in reinforcement learning. In Proceedings
of the International Conference on Machine Learning.
Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and
David Meger. 2018. Deep reinforcement learning that matters. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Hessel, Matteo, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will
Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow:
Combining improvements in deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
Computation 9 (8): 1735–1780.
Hornykiewicz, Oleh. 1966. Dopamine (3-hydroxytyramine) and brain function. Pharma-
cological Reviews 18 (2): 925–964.
Howard, Ronald A. 1960. Dynamic programming and Markov processes. MIT Press.
Howard, Ronald A., and James E. Matheson. 1972. Risk-sensitive Markov decision
processes. Management Science 18 (7): 356–369.
Howes, Oliver D., and Shitij Kapur. 2009. The dopamine hypothesis of schizophrenia:
Version III—the final common pathway. Schizophrenia Bulletin 35 (3): 549–562.
Hutter, Marcus. 2005. Universal artificial intelligence: Sequential decisions based on
algorithmic probability. Springer.
Imani, Ehsan, and Martha White. 2018. Improving regression performance with
distributional losses. In Proceedings of the International Conference on Machine
Learning.
Jaakkola, Tommi, Michael I. Jordan, and Satinder P. Singh. 1994. On the convergence
of stochastic iterative dynamic programming algorithms. Neural Computation 6 (6):
1185–1201.
Jaderberg, Max, Volodymyr Mnih, Wojciech M. Czarnecki, Tom Schaul, Joel Z. Leibo,
David Silver, and Koray Kavukcuoglu. 2017. Reinforcement learning with unsuper-
vised auxiliary tasks. In Proceedings of the International Conference on Learning
Representations.
Janner, Michael, Igor Mordatch, and Sergey Levine. 2020. Generative temporal dif-
ference learning for infinite-horizon prediction. In Advances in Neural Information
Processing Systems.
Jaquette, Stratton C. 1973. Markov decision processes with a new optimality criterion:
Discrete time. The Annals of Statistics 1 (3): 496–505.
Jaquette, Stratton C. 1976. A utility criterion for Markov decision processes. Manage-
ment Science 23 (1): 43–49.
Jessen, Børge, and Aurel Wintner. 1935. Distribution functions and the Riemann zeta
function. Transactions of the American Mathematical Society 38 (1): 48–88.
Jiang, Daniel R., and Warren B. Powell. 2018. Risk-averse approximate dynamic pro-
gramming with quantile-based risk measures. Mathematics of Operations Research 43
(2): 554–579.
Jordan, Richard, David Kinderlehrer, and Felix Otto. 1998. The variational formulation
of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis 29 (1): 1–17.
Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. Planning
and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134.
Kamin, Leon J. 1968. “Attention-like” processes in classical conditioning. In Miami
Symposium on the Prediction of Behavior: Aversive Stimulation, 9–31.
Kantorovich, Leonid V. 1942. On the translocation of masses. Proceedings of the USSR
Academy of Sciences 37 (7–8): 227–229.
Kapetanakis, Spiros, and Daniel Kudenko. 2002. Reinforcement learning of coordination
in cooperative multi-agent systems. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Kartal, Bilal, Pablo Hernandez-Leal, and Matthew E. Taylor. 2019. Terminal predic-
tion as an auxiliary task for deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence and Interactive Digital Entertainment.
Kempka, Michał, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech
Jaśkowski. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games, 1–8.
Keramati, Ramtin, Christoph Dann, Alex Tamkin, and Emma Brunskill. 2020. Being
optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Kingma, Diederik, and Jimmy Ba. 2015. Adam: A method for stochastic optimization.
In Proceedings of the International Conference on Learning Representations.
Koenker, Roger. 2005. Quantile regression. Cambridge University Press.
Koenker, Roger, and Gilbert Bassett Jr. 1978. Regression quantiles. Econometrica 46
(1): 33–50.
Kolter, J. Zico. 2011. The fixed points of off-policy TD. In Advances in Neural
Information Processing Systems.
Konidaris, George D., Sarah Osentoski, and Philip S. Thomas. 2011. Value function
approximation in reinforcement learning using the Fourier basis. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Kuan, Chung-Ming, Jin-Huei Yeh, and Yu-Chin Hsu. 2009. Assessing value at risk with
care, the conditional autoregressive expectile models. Journal of Econometrics 150 (2):
261–270.
Kuhn, Harold W. 1950. A simplified two-person poker. Contributions to the Theory of
Games 1:97–103.
Kurth-Nelson, Zeb, and A. David Redish. 2009. Temporal-difference reinforcement
learning with distributed representations. PLoS One 4 (10): e7362.
Kushner, Harold, and Dean Clark. 1978. Stochastic approximation methods for con-
strained and unconstrained systems. Springer.
Kushner, Harold, and G. George Yin. 2003. Stochastic approximation and recursive
algorithms and applications. Springer Science & Business Media.
Kuznetsov, Arsenii, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. 2020.
Controlling overestimation bias with truncated mixture of continuous distributional
quantile critics. In Proceedings of the International Conference on Machine Learning.
Lagoudakis, Michail G., and Ronald Parr. 2003. Least-squares policy iteration. Journal
of Machine Learning Research 4:1107–1149.
Lample, Guillaume, and Devendra Singh Chaplot. 2017. Playing FPS games with deep
reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Laskin, Michael, Aravind Srinivas, and Pieter Abbeel. 2020. CURL: Contrastive unsu-
pervised representations for reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Lattimore, Tor, and Marcus Hutter. 2012. PAC bounds for discounted MDPs. In
Proceedings of the International Conference on Algorithmic Learning Theory.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit algorithms. Cambridge University
Press.
Lauer, Martin, and Martin Riedmiller. 2000. An algorithm for distributed reinforce-
ment learning in cooperative multi-agent systems. In Proceedings of the International
Conference on Machine Learning.
Le Lan, Charline, Stephen Tu, Adam Oberman, Rishabh Agarwal, and Marc G. Belle-
mare. 2022. On the generalization of representations in reinforcement learning. In
Proceedings of the International Conference on Artificial Intelligence and Statistics.
LeCun, Yann, and Yoshua Bengio. 1995. Convolutional networks for images, speech, and
time series. In The handbook of brain theory and neural networks, edited by Michael A.
Arbib. MIT Press.
Lee, Daewoo, Boris Defourny, and Warren B. Powell. 2013. Bias-corrected Q-learning
to control max-operator bias in Q-learning. In Symposium on Adaptive Dynamic
Programming And Reinforcement Learning.
Levine, Sergey. 2018. Reinforcement learning and control as probabilistic inference:
Tutorial and review. arXiv preprint arXiv:1805.00909.
Levine, Sergey, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end
training of deep visuomotor policies. Journal of Machine Learning Research 17 (1):
1334–1373.
Li, Xiaocheng, Huaiyang Zhong, and Margaret L. Brandeau. 2022. Quantile Markov
decision processes. Operations Research 70 (3): 1428–1447.
Lillicrap, Timothy P., Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. 2016a.
Random synaptic feedback weights support error backpropagation for deep learning.
Nature Communications 7 (1): 1–10.
Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. 2016b. Continuous control with deep
reinforcement learning. In Proceedings of the International Conference on Learning
Representations.
Lin, Gwo Dong. 2017. Recent developments on the moment problem. Journal of
Statistical Distributions and Applications 4 (1): 1–17.
Lin, L. J. 1992. Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning 8 (3): 293–321.
Lin, Zichuan, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, and Guangwen Yang. 2019.
Distributional reward decomposition for reinforcement learning. In Advances in Neural
Information Processing Systems.
Lipovetzky, Nir, Miquel Ramirez, and Hector Geffner. 2015. Classical planning with
simulators: Results on the Atari video games. In Proceedings of International Joint
Conference on Artificial Intelligence.
Littman, Michael L. 1994. Markov games as a framework for multi-agent reinforcement
learning. In Proceedings of the International Conference on Machine Learning.
Littman, Michael L., and Csaba Szepesvári. 1996. A generalized reinforcement-learning
model: Convergence and applications. In Proceedings of the International Conference
on Machine Learning.
Liu, Jun S. 2001. Monte Carlo strategies in scientific computing. Springer.
Liu, Qiang, and Dilin Wang. 2016. Stein variational gradient descent: A general purpose
Bayesian inference algorithm. In Advances in Neural Information Processing Systems.
Liu, Quansheng. 1998. Fixed points of a generalized smoothing transformation and
applications to the branching random walk. Advances in Applied Probability 30 (1):
85–112.
Ljung, Lennart. 1977. Analysis of recursive stochastic algorithms. IEEE Transactions
on Automatic Control 22 (4): 551–575.
Ljungberg, Tomas, Paul Apicella, and Wolfram Schultz. 1992. Responses of monkey
dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology
67 (1): 145–163.
Lowet, Adam S., Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida.
2020. Distributional reinforcement learning in the brain. Trends in Neurosciences 43
(12): 980–997.
Ludvig, Elliot A., Marc G. Bellemare, and Keir G. Pearson. 2011. A primer on reinforce-
ment learning in the brain: Psychological, computational, and neural perspectives. In
Computational neuroscience for advancing artificial intelligence: Models, methods and
applications, edited by Eduardo Alonso and Esther Mondragón. IGI Global.
Luo, Yudong, Guiliang Liu, Haonan Duan, Oliver Schulte, and Pascal Poupart. 2021.
Distributional reinforcement learning with monotonic splines. In Proceedings of the
International Conference on Learning Representations.
Lyle, Clare, Pablo Samuel Castro, and Marc G. Bellemare. 2019. A comparative analysis
of expected and distributional reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence.
Lyle, Clare, Mark Rowland, Georg Ostrovski, and Will Dabney. 2021. On the effect
of auxiliary tasks on representation dynamics. In Proceedings of the International
Conference on Artificial Intelligence and Statistics.
Lyu, Xueguang, and Christopher Amato. 2020. Likelihood quantile networks for
coordinating multi-agent reinforcement learning. In Proceedings of the International
Conference on Autonomous Agents and Multiagent Systems.
Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew
Hausknecht, and Michael Bowling. 2018. Revisiting the Arcade Learning Environ-
ment: Evaluation protocols and open problems for general agents. Journal of Artificial
Intelligence Research 61:523–562.
MacKay, David J. C. 2003. Information theory, inference and learning algorithms.
Cambridge University Press.
Maddison, Chris J., Dieterich Lawson, George Tucker, Nicolas Heess, Arnaud Doucet,
Andriy Mnih, and Yee Whye Teh. 2017. Particle value functions. In Proceedings of the
International Conference on Learning Representations (Workshop Track).
Madeira Araújo, João Guilherme, Johan Samir Obando Ceron, and Pablo Samuel
Castro. 2021. Lifting the veil on hyper-parameters for value-based deep reinforcement
learning. In NeurIPS 2021 Workshop: LatinX in AI.
Maei, Hamid Reza. 2011. Gradient temporal-difference learning algorithms. PhD diss.,
University of Alberta.
Mandl, Petr. 1971. On the variance in controlled Markov chains. Kybernetika 7 (1):
1–12.
Mannor, Shie, Duncan Simester, Peng Sun, and John N. Tsitsiklis. 2007. Bias and
variance approximation in value function estimates. Management Science 53 (2): 308–
322.
Mannor, Shie, and John Tsitsiklis. 2011. Mean-variance optimization in Markov decision
processes. In Proceedings of the International Conference on Machine Learning.
Markowitz, Harry M. 1952. Portfolio selection. Journal of Finance 7:77–91.
Martin, John, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. 2020. Stochastically
dominant distributional reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Massart, Pascal. 1990. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality.
The Annals of Probability 18 (3): 1269–1283.
Matignon, Laëtitia, Guillaume J. Laurent, and Nadine Le Fort-Piat. 2007. Hysteretic
Q-learning: An algorithm for decentralized reinforcement learning in cooperative multi-
agent teams. In IEEE International Conference on Intelligent Robots and Systems.
Matignon, Laëtitia, Guillaume J. Laurent, and Nadine Le Fort-Piat. 2012. Independent
reinforcement learners in cooperative Markov games: A survey regarding coordination
problems. The Knowledge Engineering Review 27 (1): 1–31.
Mavrin, Borislav, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu. 2019.
Distributional reinforcement learning for efficient exploration. In Proceedings of the
International Conference on Machine Learning.
McCallum, Andrew K. 1995. Reinforcement learning with selective perception and
hidden state. PhD diss., University of Rochester.
Meyn, Sean. 2022. Control systems and reinforcement learning. Cambridge University
Press.
Meyn, Sean P., and Richard L. Tweedie. 2012. Markov chains and stochastic stability.
Cambridge University Press.
Mihatsch, Oliver, and Ralph Neuneier. 2002. Risk-sensitive reinforcement learning.
Machine Learning 49 (2): 267–290.
Miller, Ralph R., Robert C. Barnet, and Nicholas J. Grahame. 1995. Assessment of the
Rescorla-Wagner model. Psychological Bulletin 117 (3): 363.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc
G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski,
Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan
Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control
through deep reinforcement learning. Nature 518 (7540): 529–533.
Mogenson, Gordon J., Douglas L. Jones, and Chi Yiu Yim. 1980. From motivation to
action: Functional interface between the limbic system and the motor system. Progress
in Neurobiology 14 (2–3): 69–97.
Monge, Gaspard. 1781. Mémoire sur la théorie des déblais et des remblais. Histoire de
l’Académie Royale des Sciences de Paris: 666–704.
Montague, P. Read, Peter Dayan, and Terrence J. Sejnowski. 1996. A framework for
mesencephalic dopamine systems based on predictive Hebbian learning. Journal of
Neuroscience 16 (5): 1936–1947.
Montfort, Nick, and Ian Bogost. 2009. Racing the beam: The Atari video computer
system. MIT Press.
Moore, Andrew W., and Christopher G. Atkeson. 1993. Prioritized sweeping: Rein-
forcement learning with less data and less time. Machine Learning 13 (1): 103–
130.
Morgenstern, Oskar, and John von Neumann. 1944. Theory of games and economic
behavior. Princeton University Press.
Morimura, Tetsuro, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and
Toshiyuki Tanaka. 2010a. Nonparametric return distribution approximation for rein-
forcement learning. In Proceedings of the International Conference on Machine
Learning.
Morimura, Tetsuro, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and
Toshiyuki Tanaka. 2010b. Parametric return density estimation for reinforcement
learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Morton, Thomas E. 1971. On the asymptotic convergence rate of cost differences for
Markovian decision processes. Operations Research 19 (1): 244–248.
Mott, Bradford W., Stephen Anthony, and the Stella team. 1995–2023. Stella: A multi-
platform Atari 2600 VCS Emulator. http://stella.sourceforge.net.
Müller, Alfred. 1997. Integral probability metrics and their generating classes of
functions. Advances in Applied Probability 29 (2): 429–443.
Muller, Timothy H., James L. Butler, Sebastijan Veselic, Bruno Miranda, Timothy
E. J. Behrens, Zeb Kurth-Nelson, and Steven W. Kennerley. 2021. Distributional
reinforcement learning in prefrontal cortex. bioRxiv 2021.06.14.448422.
Munos, Rémi. 2003. Error bounds for approximate policy iteration. In Proceedings of
the International Conference on Machine Learning.
Munos, Rémi, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. 2016. Safe
and efficient off-policy reinforcement learning. In Advances in Neural Information
Processing Systems.
Murphy, Kevin P. 2012. Machine learning: A probabilistic perspective. MIT Press.
Naddaf, Yavar. 2010. Game-independent AI agents for playing Atari 2600 console
games. Master’s thesis, University of Alberta.
Naesseth, Christian A., Fredrik Lindsten, and Thomas B. Schön. 2019. Elements of
sequential Monte Carlo. Foundations and Trends® in Machine Learning 12 (3): 307–392.
Nair, Vinod, and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted
Boltzmann machines. In Proceedings of the International Conference on Machine
Learning.
Nam, Daniel W., Younghoon Kim, and Chan Y. Park. 2021. GMAC: A distributional
perspective on actor-critic framework. In Proceedings of the International Conference
on Machine Learning.
Neininger, Ralph. 1999. Limit laws for random recursive structures and algorithms.
PhD diss., University of Freiburg.
Neininger, Ralph. 2001. On a multivariate contraction method for random recursive
structures with applications to Quicksort. Random Structures & Algorithms 19 (3–4):
498–524.
Neininger, Ralph, and Ludger Rüschendorf. 2004. A general limit theorem for recursive
algorithms and combinatorial structures. The Annals of Applied Probability 14 (1): 378–
418.
Newey, Whitney K., and James L. Powell. 1987. Asymmetric least squares estimation
and testing. Econometrica 55 (4): 819–847.
Nguyen, Thanh Tang, Sunil Gupta, and Svetha Venkatesh. 2021. Distributional rein-
forcement learning via moment matching. In Proceedings of the AAAI Conference on
Artificial Intelligence.
Nieoullon, André. 2002. Dopamine and the regulation of cognition and attention.
Progress in Neurobiology 67 (1): 53–83.
Nikolov, Nikolay, Johannes Kirschner, Felix Berkenkamp, and Andreas Krause. 2019.
Information-directed exploration for deep reinforcement learning. In Proceedings of the
International Conference on Learning Representations.
Niv, Yael. 2009. Reinforcement learning in the brain. Journal of Mathematical
Psychology 53 (3): 139–154.
Olah, Chris, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Kather-
ine Ye, and Alexander Mordvintsev. 2018. The building blocks of interpretability.
Distill.
Oliehoek, Frans A., and Christopher Amato. 2016. A concise introduction to decentral-
ized POMDPs. Springer.
Oliehoek, Frans A., Matthijs T. J. Spaan, and Nikos Vlassis. 2008. Optimal and approxi-
mate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence
Research 32 (1): 289–353.
Olsen, Ditte, Niels Wellner, Mathias Kaas, Inge E. M. de Jong, Florence Sotty, Michael
Didriksen, Simon Glerup, and Anders Nykjaer. 2021. Altered dopaminergic firing
pattern and novelty response underlie ADHD-like behavior of SorCS2-deficient mice.
Translational Psychiatry 11 (1): 1–14.
Omidshafiei, Shayegan, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian.
2017. Deep decentralized multi-task multi-agent reinforcement learning under partial
observability. In Proceedings of the International Conference on Machine Learning.
Owen, Art B. 2013. Monte Carlo theory, methods and examples.
Palmer, Gregory, Rahul Savani, and Karl Tuyls. 2019. Negative update intervals in deep
multi-agent reinforcement learning. In Proceedings of the International Conference on
Autonomous Agents and Multiagent Systems.
Palmer, Gregory, Karl Tuyls, Daan Bloembergen, and Rahul Savani. 2018. Lenient
multi-agent deep reinforcement learning. In Proceedings of the International Conference
on Autonomous Agents and Multiagent Systems.
Panait, Liviu, Keith Sullivan, and Sean Luke. 2006. Lenient learners in cooperative
multiagent systems. In Proceedings of the International Conference on Autonomous
Agents and Multiagent Systems.
Panait, Liviu, R. Paul Wiegand, and Sean Luke. 2003. Improving coevolutionary search
for optimal multiagent behaviors. In Proceedings of the International Joint Conference
on Artificial Intelligence.
Panaretos, Victor M., and Yoav Zemel. 2020. An invitation to statistics in Wasserstein
space. Springer Nature.
Parr, Ronald, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael
L. Littman. 2008. An analysis of linear models, linear value-function approximation,
and feature selection for reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Parr, Ronald, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. 2007.
Analyzing feature generation for value-function approximation. In Proceedings of the
International Conference on Machine Learning.
Pavlov, Ivan P. 1927. Conditioned reflexes: An investigation of the physiological activity
of the cerebral cortex. Oxford University Press.
Peres, Yuval, Wilhelm Schlag, and Boris Solomyak. 2000. Sixty years of Bernoulli con-
volutions. In Fractal geometry and stochastics II, edited by Christoph Bandt, Siegfried
Graf, and Martina Zähle. Springer.
Peyré, Gabriel, and Marco Cuturi. 2019. Computational optimal transport: With applica-
tions to data science. Foundations and Trends® in Machine Learning 11 (5–6): 355–607.
Prashanth, L. A., and Michael Fu. 2021. Risk-sensitive reinforcement learning. arXiv
preprint arXiv:1810.09126.
Prashanth, L. A., and Mohammad Ghavamzadeh. 2013. Actor-critic algorithms for
risk-sensitive MDPs. In Advances in Neural Information Processing Systems.
Precup, Doina, Richard S. Sutton, and Satinder P. Singh. 2000. Eligibility traces for
off-policy policy evaluation. In Proceedings of the International Conference on Machine
Learning.
Pritzel, Alexander, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol
Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. 2017. Neural episodic
control. In Proceedings of the International Conference on Machine Learning.
Puterman, Martin L. 2014. Markov decision processes: Discrete stochastic dynamic
programming. John Wiley & Sons.
Puterman, Martin L., and Moon Chirl Shin. 1978. Modified policy iteration algorithms
for discounted Markov decision problems. Management Science 24 (11): 1127–1137.
Qiu, Wei, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana
Obraztsova, and Zinovi Rabinovich. 2021. RMIX: Learning risk-sensitive policies
for cooperative reinforcement learning agents. In Advances in Neural Information
Processing Systems.
Qu, Chao, Shie Mannor, and Huan Xu. 2019. Nonlinear distributional gradient temporal-
difference learning. In Proceedings of the International Conference on Machine
Learning.
Quan, John, and Georg Ostrovski. 2020. DQN Zoo: Reference implementations of
DQN-based agents. Version 1.0.0. http://github.com/deepmind/dqn_zoo.
Rachev, Svetlozar T., Lev Klebanov, Stoyan V. Stoyanov, and Frank Fabozzi. 2013.
The methods of distances in the theory of probability and statistics. Springer Science &
Business Media.
Rachev, Svetlozar T., and Ludger Rüschendorf. 1995. Probability metrics and recursive
algorithms. Advances in Applied Probability 27 (3): 770–799.
Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar,
Jakob N. Foerster, and Shimon Whiteson. 2020. Monotonic value function factorisation
for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21
(1): 7234–7284.
Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar,
Jakob Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic value function factori-
sation for deep multi-agent reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Rescorla, Robert A., and Allan R. Wagner. 1972. A theory of Pavlovian conditioning:
Variations in the effectiveness of reinforcement and nonreinforcement. In Classical
conditioning II: Current research and theory, edited by Abraham H. Black and William
F. Prokasy, 64–99. Appleton-Century-Crofts.
Riedmiller, M. 2005. Neural fitted Q iteration – first experiences with a data efficient
neural reinforcement learning method. In Proceedings of the European Conference on
Machine Learning.
Riedmiller, Martin, Thomas Gabel, Roland Hafner, and Sascha Lange. 2009. Reinforce-
ment learning for robot soccer. Autonomous Robots 27 (1): 55–73.
Rizzo, Maria L., and Gábor J. Székely. 2016. Energy distance. Wiley Interdisciplinary
Reviews: Computational Statistics 8 (1): 27–38.
Robbins, Herbert, and Sutton Monro. 1951. A stochastic approximation method. The
Annals of Mathematical Statistics 22 (3): 400–407.
Robbins, Herbert, and David Siegmund. 1971. A convergence theorem for non negative
almost supermartingales and some applications. In Optimizing methods in statistics,
edited by Jagdish S. Rustagi, 233–257. Academic Press.
Robert, Christian, and George Casella. 2004. Monte Carlo statistical methods. Springer
Science & Business Media.
Rockafellar, R. Tyrrell, and Stanislav Uryasev. 2000. Optimization of conditional value-
at-risk. Journal of Risk 2:21–42.
Rockafellar, R. Tyrrell, and Stanislav Uryasev. 2002. Conditional value-at-risk for
general loss distributions. Journal of Banking & Finance 26 (7): 1443–1471.
Rösler, Uwe. 1991. A limit theorem for “Quicksort.” RAIRO-Theoretical Informatics and Applications 25 (1): 85–100.
Rösler, Uwe. 1992. A fixed point theorem for distributions. Stochastic Processes and
Their Applications 42 (2): 195–214.
Rösler, Uwe. 2001. On the analysis of stochastic divide and conquer algorithms.
Algorithmica 29 (1): 238–261.
Rösler, Uwe, and Ludger Rüschendorf. 2001. The contraction method for recursive
algorithms. Algorithmica 29 (1–2): 3–33.
Rowland, Mark, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh.
2018. An analysis of categorical distributional reinforcement learning. In Proceedings of
the International Conference on Artificial Intelligence and Statistics.
Rowland, Mark, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare,
and Will Dabney. 2019. Statistics and samples in distributional reinforcement learning.
In Proceedings of the International Conference on Machine Learning.
Rowland, Mark, Shayegan Omidshafiei, Daniel Hennes, Will Dabney, Andrew Jaegle,
Paul Muller, Julien Pérolat, and Karl Tuyls. 2021. Temporal difference and return
optimism in cooperative multi-agent reinforcement learning. In Adaptive and Learning
Agents Workshop at the International Conference on Autonomous Agents and Multiagent
Systems.
Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. 1998. A metric for distributions
with applications to image databases. In Sixth International Conference on Computer
Vision.
Rudin, Walter. 1976. Principles of mathematical analysis. McGraw-Hill.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning
representations by back-propagating errors. Nature 323 (6088): 533–536.
Rummery, Gavin A., and Mahesan Niranjan. 1994. On-line Q-learning using connec-
tionist systems. Technical report. Cambridge University Engineering Department.
Rüschendorf, Ludger. 2006. On stochastic recursive equations of sum and max type.
Journal of Applied Probability 43 (3): 687–703.
Rüschendorf, Ludger, and Ralph Neininger. 2006. A survey of multivariate aspects of
the contraction method. Discrete Mathematics & Theoretical Computer Science 8:31–56.
Ruszczyński, Andrzej. 2010. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming 125 (2): 235–261.
Samuel, Arthur L. 1959. Some studies in machine learning using the game of checkers.
IBM Journal of Research and Development 11 (6): 601–617.
Santambrogio, Filippo. 2015. Optimal transport for applied mathematicians: Calculus
of variations, PDEs and modeling. Birkhäuser.
Särkkä, Simo. 2013. Bayesian filtering and smoothing. Cambridge University Press.
Sato, Makoto, Hajime Kimura, and Shigenobu Kobayashi. 2001. TD algorithm for
the variance of return and mean-variance reinforcement learning. Transactions of the
Japanese Society for Artificial Intelligence 16 (3): 353–362.
Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized
experience replay. In Proceedings of the International Conference on Learning
Representations.
Scherrer, Bruno. 2010. Should one compute the temporal difference fix point or mini-
mize the Bellman residual? The unified oblique projection view. In Proceedings of the
International Conference on Machine Learning.
Scherrer, Bruno. 2014. Approximate policy iteration schemes: A comparison. In
Proceedings of the International Conference on Machine Learning.
Scherrer, Bruno, and Boris Lesner. 2012. On the use of non-stationary policies for sta-
tionary infinite-horizon Markov decision processes. In Advances in Neural Information
Processing Systems.
Schlegel, Matthew, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White,
and Martha White. 2021. General value function networks. Journal of Artificial
Intelligence Research (JAIR) 70:497–543.
Schultz, Wolfram. 1986. Responses of midbrain dopamine neurons to behavioral trigger
stimuli in the monkey. Journal of Neurophysiology 56 (5): 1439–1461.
Schultz, Wolfram. 2002. Getting formal with dopamine and reward. Neuron 36 (2):
241–263.
Schultz, Wolfram. 2016. Dopamine reward prediction-error signalling: A two-component
response. Nature Reviews Neuroscience 17 (3): 183–195.
Schultz, Wolfram, Paul Apicella, and Tomas Ljungberg. 1993. Responses of monkey
dopamine neurons to reward and conditioned stimuli during successive steps of learning
a delayed response task. Journal of Neuroscience 13 (3): 900–913.
Schultz, Wolfram, Peter Dayan, and P. Read Montague. 1997. A neural substrate of
prediction and reward. Science 275 (5306): 1593–1599.
Schultz, Wolfram, and Ranulfo Romo. 1990. Dopamine neurons of the monkey midbrain:
Contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal
of Neurophysiology 63 (3): 607–624.
Shah, Ashvin. 2012. Psychological and neuroscientific connections with reinforcement
learning. In Reinforcement learning, edited by Marco Wiering and Martijn van Otterlo,
507–537. Springer.
Shapiro, Alexander, Darinka Dentcheva, and Andrzej Ruszczyński. 2009. Lectures on
stochastic programming: Modeling and theory. SIAM.
Shapley, Lloyd S. 1953. Stochastic games. Proceedings of the National Academy of
Sciences 39 (10): 1095–1100.
Shen, Yun, Wilhelm Stannat, and Klaus Obermayer. 2013. Risk-sensitive Markov control
processes. SIAM Journal on Control and Optimization 51 (5): 3652–3672.
Shoham, Yoav, and Kevin Leyton-Brown. 2009. Multiagent systems. Cambridge
University Press.
Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George
van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya
Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and
tree search. Nature 529 (7587): 484–489.
Singh, Satinder P., and Richard S. Sutton. 1996. Reinforcement learning with replacing
eligibility traces. Machine Learning 22:123–158.
Sobel, Matthew J. 1982. The variance of discounted Markov decision processes. Journal
of Applied Probability 19 (4): 794–802.
Solomyak, Boris. 1995. On the random series Σ ±λⁿ (an Erdős problem). Annals of Mathematics 142 (3): 611–625.
Stalnaker, Thomas A., James D. Howard, Yuji K. Takahashi, Samuel J. Gershman,
Thorsten Kahnt, and Geoffrey Schoenbaum. 2019. Dopamine neuron ensembles signal
the content of sensory prediction errors. eLife 8:e49315.
Steinbach, Marc C. 2001. Markowitz revisited: Mean-variance models in financial
portfolio analysis. SIAM Review 43 (1): 31–85.
Strang, Gilbert. 1993. Introduction to linear algebra. Wellesley-Cambridge Press.
Such, Felipe Petroski, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel Castro,
Yulun Li, Ludwig Schubert, Marc G. Bellemare, Jeff Clune, and Joel Lehman. 2019. An
Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning
agents. In Proceedings of the International Joint Conference on Artificial Intelligence.
Sun, Wei-Fang, Cheng-Kuang Lee, and Chun-Yi Lee. 2021. DFAC framework: Factoriz-
ing the value function via quantile mixture for multi-agent distributional Q-learning. In
Proceedings of the International Conference on Machine Learning.
Sunehag, Peter, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius
Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls,
and Thore Graepel. 2017. Value-decomposition networks for cooperative multi-agent
learning. arXiv preprint arXiv:1706.05296.
Sutton, Richard S. 1984. Temporal credit assignment in reinforcement learning. PhD
diss., University of Massachusetts, Amherst.
Sutton, Richard S. 1988. Learning to predict by the methods of temporal differences.
Machine Learning 3 (1): 9–44.
Sutton, Richard S. 1995. TD models: Modeling the world at a mixture of time scales. In
Proceedings of the International Conference on Machine Learning.
Sutton, Richard S. 1996. Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In Advances in Neural Information Processing Systems.
Sutton, Richard S. 1999. Open theoretical questions in reinforcement learning. In
European Conference on Computational Learning Theory.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement learning: An introduction.
MIT Press.
Sutton, Richard S., Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Sil-
ver, Csaba Szepesvári, and Eric Wiewiora. 2009. Fast gradient-descent methods for
temporal-difference learning with linear function approximation. In Proceedings of the
International Conference on Machine Learning.
Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000.
Policy gradient methods for reinforcement learning with function approximation. In
Advances in Neural Information Processing Systems.
Sutton, Richard S., Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski,
Adam White, and Doina Precup. 2011. Horde: A scalable real-time architecture for
learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the
International Conference on Autonomous Agents and Multiagent Systems.
Sutton, Richard S., Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning. Artificial
Intelligence 112 (1–2): 181–211.
Sutton, Richard S., Csaba Szepesvári, and Hamid Reza Maei. 2008a. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
Sutton, Richard S., Csaba Szepesvári, Alborz Geramifard, and Michael Bowling. 2008b.
Dyna-style planning with linear function approximation and prioritized sweeping. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Székely, Gábor J. 2002. E-statistics: The energy of statistical samples. Technical
report 02-16. Bowling Green State University, Department of Mathematics and Statistics.
Székely, Gábor J., and Maria L. Rizzo. 2013. Energy statistics: A class of statistics based
on distances. Journal of Statistical Planning and Inference 143 (8): 1249–1272.
Szepesvári, Csaba. 1998. The asymptotic convergence-rate of Q-learning. In Advances
in Neural Information Processing Systems.
Szepesvári, Csaba. 2010. Algorithms for reinforcement learning. Morgan & Claypool
Publishers.
Szepesvári, Csaba. 2020. Constrained MDPs and the reward hypothesis. https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html. Accessed June 25, 2021.
Takahashi, Yuji K., Hannah M. Batchelor, Bing Liu, Akash Khanna, Marisela Morales,
and Geoffrey Schoenbaum. 2017. Dopamine neurons respond to errors in the prediction
of sensory features of expected rewards. Neuron 95 (6): 1395–1405.
Tamar, Aviv, Dotan Di Castro, and Shie Mannor. 2012. Policy gradients with vari-
ance related risk criteria. In Proceedings of the International Conference on Machine
Learning.
Tamar, Aviv, Dotan Di Castro, and Shie Mannor. 2013. Temporal difference methods
for the variance of the reward to go. In Proceedings of the International Conference on
Machine Learning.
Tamar, Aviv, Dotan Di Castro, and Shie Mannor. 2016. Learning the variance of the
reward-to-go. Journal of Machine Learning Research 17 (1): 361–396.
Tamar, Aviv, Yonatan Glassner, and Shie Mannor. 2015. Optimizing the CVaR via
sampling. In Proceedings of the AAAI Conference on Artificial Intelligence.
Tampuu, Ardi, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan
Aru, Jaan Aru, and Raul Vicente. 2017. Multiagent cooperation and competition with
deep reinforcement learning. PloS One 12 (4): e0172395.
Tan, Ming. 1993. Multi-agent reinforcement learning: Independent vs. cooperative
agents. In Proceedings of the International Conference on Machine Learning.
Tano, Pablo, Peter Dayan, and Alexandre Pouget. 2020. A local temporal difference code
for distributional reinforcement learning. In Advances in Neural Information Processing
Systems.
Taylor, James W. 2008. Estimating value at risk and expected shortfall using expectiles.
Journal of Financial Econometrics 6 (2): 231–252.
Tesauro, Gerald. 1995. Temporal difference learning and TD-Gammon. Communications
of the ACM 38 (3): 58–68.
Tessler, Chen, Guy Tennenholtz, and Shie Mannor. 2019. Distributional policy optimiza-
tion: An alternative approach for continuous control. In Advances in Neural Information
Processing Systems.
Tieleman, Tijmen, and Geoffrey Hinton. 2012. rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Toussaint, Marc. 2009. Robot trajectory optimization using approximate inference. In
Proceedings of the International Conference on Machine Learning.
Toussaint, Marc, and Amos Storkey. 2006. Probabilistic inference for solving discrete
and continuous state Markov decision processes. In Proceedings of the International
Conference on Machine Learning.
Tsitsiklis, John N. 1994. Asynchronous stochastic approximation and Q-learning.
Machine Learning 16 (3): 185–202.
Tsitsiklis, John N. 2002. On the convergence of optimistic policy iteration. Journal of
Machine Learning Research 3:59–72.
Tsitsiklis, John N., and Benjamin Van Roy. 1997. An analysis of temporal-difference
learning with function approximation. IEEE Transactions on Automatic Control 42 (5):
674–690.
Tulcea, Cassius T. Ionescu. 1949. Mesures dans les espaces produits. Atti Accademia
Nazionale Lincei Rend 8 (7): 208–211.
van den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent
neural networks. In Proceedings of the International Conference on Machine Learning.
van der Vaart, Aad W. 2000. Asymptotic statistics. Cambridge University Press.
van der Wal, Johannes. 1981. Stochastic dynamic programming: Successive approxima-
tions and nearly optimal strategies for Markov decision processes and Markov games.
Stichting Mathematisch Centrum.
van Hasselt, Hado, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Sil-
ver. 2016a. Learning values across many orders of magnitude. In Advances in Neural
Information Processing Systems.
van Hasselt, Hado, Arthur Guez, and David Silver. 2016b. Deep reinforcement learning
with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In
Advances in Neural Information Processing Systems.
Vecerik, Mel, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon
Scholz. 2019. A practical approach to insertion with variable socket position using deep
reinforcement learning. In IEEE International Conference on Robotics and Automation.
Veness, Joel, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins.
2015. Compress and control. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Veness, Joel, Kee Siong Ng, Marcus Hutter, William T. B. Uther, and David Silver.
2011. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research
40:95–142.
Vershik, A. M. 2013. Long history of the Monge-Kantorovich transportation problem.
The Mathematical Intelligencer 35 (4): 1–9.
Vieillard, Nino, Olivier Pietquin, and Matthieu Geist. 2020. Munchausen reinforcement
learning. In Advances in Neural Information Processing Systems.
Villani, Cédric. 2003. Topics in optimal transportation. Graduate Studies in Mathematics.
American Mathematical Society.
Villani, Cédric. 2008. Optimal transport: Old and new. Springer Science & Business
Media.
von Neumann, John. 1928. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen
100 (1): 295–320.
Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 (1–2): 1–305.
Walton, Neil. 2021. Lecture notes on stochastic control. Unpublished manuscript.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. 2016. Dueling network architectures for deep reinforcement learning. In
Proceedings of the International Conference on Machine Learning.
Watkins, Christopher J. C. H. 1989. Learning from delayed rewards. PhD diss., King’s
College, Cambridge.
Watkins, Christopher J. C. H., and Peter Dayan. 1992. Q-learning. Machine Learning 8
(3–4): 279–292.
Weed, Jonathan, and Francis Bach. 2019. Sharp asymptotic and finite-sample rates of
convergence of empirical measures in Wasserstein distance. Bernoulli 25 (4A): 2620–
2648.
Wei, Ermo, and Sean Luke. 2016. Lenient learning in independent-learner stochastic
cooperative games. Journal of Machine Learning Research 17 (1): 2914–2955.
Werbos, Paul J. 1982. Applications of advances in nonlinear sensitivity analysis. In
System modeling and optimization, edited by Rudolph F. Drenick and Frank Kozin,
762–770. Springer.
White, D. J. 1988. Mean, variance, and probabilistic criteria in finite Markov decision
processes: A review. Journal of Optimization Theory and Applications 56 (1): 1–29.
White, Martha. 2017. Unifying task specification in reinforcement learning. In
Proceedings of the International Conference on Machine Learning.
White, Martha, and Adam White. 2016. A greedy approach to adapting the trace param-
eter for temporal difference learning. In Proceedings of the International Conference on
Autonomous Agents and Multiagent Systems.
White, Norman M., and Marc Viaud. 1991. Localized intracaudate dopamine D2 receptor
activation during the post-training period improves memory for visual or olfactory
conditioned emotional responses in rats. Behavioral and Neural Biology 55 (3): 255–
269.
Widrow, Bernard, and Marcian E. Hoff. 1960. Adaptive switching circuits. In WESCON
Convention Record Part IV.
Williams, David. 1991. Probability with martingales. Cambridge University Press.
Wise, Roy A. 2004. Dopamine, learning and motivation. Nature Reviews Neuroscience
5 (6): 483–494.
Wurman, Peter R., Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik
Subramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert,
Florian Fuchs, Leilani Gilpin, Piyush Khandelwal, Varun Kompella, HaoChih Lin,
Patrick MacAlpine, Declan Oller, Takuma Seno, Craig Sherstan, Michael D. Thomure,
Houmehr Aghabozorgi, Leon Barrett, Rory Douglas, Dion Whitehead, Peter Dürr, Peter
Stone, Michael Spranger, and Hiroaki Kitano. 2022. Outracing champion Gran Turismo
drivers with deep reinforcement learning. Nature 602 (7896): 223–228.
Yang, Derek, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. 2019. Fully
parameterized quantile function for distributional reinforcement learning. In Advances
in Neural Information Processing Systems.
Young, Kenny, and Tian Tian. 2019. MinAtar: An Atari-inspired testbed for thorough
and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176.
Yue, Yuguang, Zhendong Wang, and Mingyuan Zhou. 2020. Implicit distributional
reinforcement learning. In Advances in Neural Information Processing Systems.
Zhang, Shangtong, and Hengshuai Yao. 2019. QUOTA: The quantile option architec-
ture for reinforcement learning. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Zhou, Fan, Zhoufan Zhu, Qi Kuang, and Liwen Zhang. 2021. Non-decreasing quantile
function network with efficient exploration for distributional reinforcement learning. In
Proceedings of the International Joint Conference on Artificial Intelligence.
Ziegel, Johanna F. 2016. Coherence and elicitability. Mathematical Finance 26 (4):
901–918.
Zolotarev, Vladimir M. 1976. Metric distances in spaces of random variables and their
distributions. Sbornik: Mathematics 30 (3): 373–401.