References
Achab, Mastane. 2020. Ranking and risk-aware reinforcement learning. PhD diss.,
Institut Polytechnique de Paris.
Agarwal, Rishabh, Dale Schuurmans, and Mohammad Norouzi. 2020. An optimistic
perspective on offline reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Agarwal, Rishabh, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G.
Bellemare. 2021. Deep reinforcement learning at the edge of the statistical precipice. In
Advances in Neural Information Processing Systems.
Aigner, D. J., Takeshi Amemiya, and Dale J. Poirier. 1976. On the estimation of pro-
duction frontiers: Maximum likelihood estimation of the parameters of a discontinuous
density function. International Economic Review 17 (2): 377–396.
Aldous, David J., and Antar Bandyopadhyay. 2005. A survey of max-type recursive
distributional equations. The Annals of Applied Probability 15 (2): 1047–1110.
Alsmeyer, Gerold. 2012. Random recursive equations and their distributional fixed
points. Unpublished manuscript.
Altman, Eitan. 1999. Constrained Markov decision processes. Vol. 7. CRC Press.
Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savaré. 2005. Gradient flows: In metric
spaces and in the space of probability measures. Springer Science & Business Media.
Amortila, Philip, Marc G. Bellemare, Prakash Panangaden, and Doina Precup. 2019.
Temporally extended metrics for Markov decision processes. In SafeAI: AAAI Workshop
on Artificial Intelligence Safety.
Amortila, Philip, Doina Precup, Prakash Panangaden, and Marc G. Bellemare. 2020.
A distributional analysis of sampling-based reinforcement learning algorithms. In
Proceedings of the International Conference on Artificial Intelligence and Statistics.
Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. In
Proceedings of the International Conference on Machine Learning.
Artzner, Philippe, Freddy Delbaen, Jean-Marc Eber, and David Heath. 1999. Coherent
measures of risk. Mathematical Finance 9 (3): 203–228.
Artzner, Philippe, Freddy Delbaen, Jean-Marc Eber, David Heath, and Hyejin Ku. 2007.
Coherent multiperiod risk adjusted values and Bellman’s principle. Annals of Operations
Research 152 (1): 5–22.
Arulkumaran, Kai, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath.
2017. A brief survey of deep reinforcement learning. IEEE Signal Processing Magazine,
Special Issue on Deep Learning for Image Understanding.
Auer, Peter, Mark Herbster, and Manfred K. Warmuth. 1995. Exponentially many local
minima for single neurons. In Advances in Neural Information Processing Systems.
Azar, Mohammad Gheshlaghi, Rémi Munos, Mohammad Ghavamzadeh, and Hilbert
J. Kappen. 2011. Speedy Q-learning. In Advances in Neural Information Processing
Systems.
Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert J. Kappen. 2012. On the sample
complexity of reinforcement learning with a generative model. In Proceedings of the
International Conference on Machine Learning.
Azar, Mohammad Gheshlaghi, Rémi Munos, and Hilbert J. Kappen. 2013. Minimax
PAC bounds on the sample complexity of reinforcement learning with a generative
model. Machine Learning 91 (3): 325–349.
Azar, Mohammad Gheshlaghi, Ian Osband, and Rémi Munos. 2017. Minimax regret
bounds for reinforcement learning. In Proceedings of the International Conference on
Machine Learning.
Azevedo, Frederico A. C., Ludmila R. B. Carvalho, Lea T. Grinberg, José Marcelo
Farfel, Renata E. L. Ferretti, Renata E. P. Leite, Wilson Jacob Filho, Roberto Lent, and
Suzana Herculano-Houzel. 2009. Equal numbers of neuronal and nonneuronal cells
make the human brain an isometrically scaled-up primate brain. Journal of Comparative
Neurology 513 (5): 532–541.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine trans-
lation by jointly learning to align and translate. In Proceedings of the International
Conference on Learning Representations.
Baird, Leemon C. 1995. Residual algorithms: Reinforcement learning with function
approximation. In Proceedings of the International Conference on Machine Learning.
Baird, Leemon C. 1999. Reinforcement learning through gradient descent. PhD diss.,
Carnegie Mellon University.
Banach, Stefan. 1922. Sur les opérations dans les ensembles abstraits et leur application
aux équations intégrales. Fundamenta Mathematicae 3 (1): 133–181.
Banino, Andrea, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap,
Piotr Mirowski, Alexander Pritzel, Martin J. Chadwick, Thomas Degris, Joseph Modayil,
Greg Wayne, Hubert Soyer, Fabio Viola, Brian Zhang, Ross Goroshin, Neil Rabinowitz,
Razvan Pascanu, Charlie Beattie, Stig Petersen, Amir Sadik, Stephen Gaffney, Helen
King, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, and Dharshan Kumaran. 2018.
Vector-based navigation using grid-like representations in artificial agents. Nature 557
(7705): 429–433.
Barbeau, André. 1974. Drugs affecting movement disorders. Annual Review of
Pharmacology 14 (1): 91–113.
Bard, Nolan, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis
Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, Iain
Dunning, Shibl Mourad, Hugo Larochelle, Marc G. Bellemare, and Michael Bowling.
2020. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence
280:103216.
Barnard, Etienne. 1993. Temporal-difference methods and Markov models. IEEE
Transactions on Systems, Man, and Cybernetics 23 (2): 357–365.
Barreto, André, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van
Hasselt, and David Silver. 2017. Successor features for transfer in reinforcement learning.
In Advances in Neural Information Processing Systems.
Barth-Maron, Gabriel, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan,
Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. 2018. Distributed dis-
tributional deterministic policy gradients. In Proceedings of the International Conference
on Learning Representations.
Barto, Andrew G., Steven J. Bradtke, and Satinder P. Singh. 1995. Learning to act using
real-time dynamic programming. Artificial Intelligence 72 (1): 81–138.
Barto, Andrew G., Richard S. Sutton, and Charles W. Anderson. 1983. Neuronlike
adaptive elements that can solve difficult learning control problems. IEEE Transactions
on Systems, Man, and Cybernetics 13 (5): 834–846.
Bäuerle, Nicole, and Jonathan Ott. 2011. Markov decision processes with average-value-
at-risk criteria. Mathematical Methods of Operations Research 74 (3): 361–379.
Beattie, Charles, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright,
Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian
Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton,
Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. 2016.
DeepMind Lab. arXiv preprint arXiv:1612.03801.
Bellemare, Marc G., Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C.
Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. 2020. Autonomous
navigation of stratospheric balloons using reinforcement learning. Nature 588 (7836):
77–82.
Bellemare, Marc G., Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel
Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, and Clare Lyle. 2019a. A
geometric perspective on optimal representations for reinforcement learning. In Advances
in Neural Information Processing Systems.
Bellemare, Marc G., Will Dabney, and Rémi Munos. 2017a. A distributional perspective
on reinforcement learning. In Proceedings of the International Conference on Machine
Learning.
Bellemare, Marc G., Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshmi-
narayanan, Stephan Hoyer, and Rémi Munos. 2017b. The Cramer distance as a solution
to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743.
Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. 2013a. The
Arcade Learning Environment: An evaluation platform for general agents. Journal of
Artificial Intelligence Research 47 (June): 253–279.
Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. 2015. The
Arcade Learning Environment: An evaluation platform for general agents, extended
abstract. In European Workshop on Reinforcement Learning.
Bellemare, Marc G., Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi
Munos. 2016. Increasing the action gap: New operators for reinforcement learning.
In Proceedings of the AAAI Conference on Artificial Intelligence.
Bellemare, Marc G., Nicolas Le Roux, Pablo Samuel Castro, and Subhodeep Moitra.
2019b. Distributional reinforcement learning with linear function approximation. In
Proceedings of the International Conference on Artificial Intelligence and Statistics.
Bellemare, Marc G., Joel Veness, and Michael Bowling. 2012a. Investigating contingency
awareness using Atari 2600 games. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Bellemare, Marc G., Joel Veness, and Michael Bowling. 2012b. Sketch-based linear
value function approximation. In Advances in Neural Information Processing Systems.
Bellemare, Marc G., Joel Veness, and Michael Bowling. 2013b. Bayesian learning of
recursively factored environments. In Proceedings of the International Conference on
Machine Learning.
Bellini, Fabio, and Elena Di Bernardino. 2017. Risk management with expectiles. The
European Journal of Finance 23 (6): 487–506.
Bellini, Fabio, Bernhard Klar, Alfred Müller, and Emanuela Rosazza Gianin. 2014.
Generalized quantiles as risk measures. Insurance: Mathematics and Economics 54:41–
48.
Bellman, Richard E. 1957a. A Markovian decision process. Journal of Mathematics and
Mechanics 6 (5): 679–684.
Bellman, Richard E. 1957b. Dynamic programming. Dover Publications.
Benveniste, Albert, Michel Métivier, and Pierre Priouret. 2012. Adaptive algorithms and
stochastic approximations. Springer Science & Business Media.
Bernstein, Daniel S., Robert Givan, Neil Immerman, and Shlomo Zilberstein. 2002.
The complexity of decentralized control of Markov decision processes. Mathematics of
Operations Research 27 (4): 819–840.
Bertsekas, Dimitri P. 1994. Generic rank-one corrections for value iteration in
Markovian decision problems. Technical report. Massachusetts Institute of Technology.
Bertsekas, Dimitri P. 1995. A counterexample to temporal differences learning. Neural
Computation 7 (2): 270–279.
Bertsekas, Dimitri P. 2011. Approximate policy iteration: A survey and some new
methods. Journal of Control Theory and Applications 9 (3): 310–335.
Bertsekas, Dimitri P. 2012. Dynamic programming and optimal control. 4th ed. Vol. 2.
Athena Scientific.
Bertsekas, Dimitri P., and Sergey Ioffe. 1996. Temporal differences-based policy itera-
tion and applications in neuro-dynamic programming. Technical report. Massachusetts
Institute of Technology.
Bertsekas, Dimitri P., and John N. Tsitsiklis. 1996. Neuro-dynamic programming. Athena
Scientific.
Bhandari, Jalaj, and Daniel Russo. 2021. On the linear convergence of policy gradient
methods for finite MDPs. In Proceedings of the International Conference on Artificial
Intelligence and Statistics.
Bhonker, Nadav, Shai Rozenberg, and Itay Hubara. 2017. Playing SNES in the Retro
Learning Environment. In Proceedings of the International Conference on Learning
Representations.
Bickel, Peter J., and David A. Freedman. 1981. Some asymptotic theory for the bootstrap.
The Annals of Statistics 9 (6): 1196–1217.
Billingsley, Patrick. 2012. Probability and measure. 4th ed. John Wiley & Sons.
Bishop, Christopher M. 2006. Pattern recognition and machine learning. Springer.
Bobkov, Sergey, and Michel Ledoux. 2019. One-dimensional empirical measures, order
statistics, and Kantorovich transport distances. American Mathematical Society.
Bodnar, Cristian, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan.
2020. Quantile QT-Opt for risk-aware vision-based robotic grasping. In Proceedings of
Robotics: Science and Systems.
Borkar, Vivek S. 1997. Stochastic approximation with two time scales. Systems &
Control Letters 29 (5): 291–294.
Borkar, Vivek S. 2008. Stochastic approximation: A dynamical systems viewpoint.
Cambridge University Press.
Borkar, Vivek S., and Sean P. Meyn. 2000. The ODE method for convergence of
stochastic approximation and reinforcement learning. SIAM Journal on Control and
Optimization 38 (2): 447–469.
Bottou, Léon. 1998. Online learning and stochastic approximations. On-line Learning in
Neural Networks 17 (9): 142.
Boutilier, Craig. 1996. Planning, learning and coordination in multiagent decision
processes. In Proceedings of the Conference on Theoretical Aspects of Rationality and
Knowledge.
Bowling, Michael, and Manuela Veloso. 2002. Multiagent learning using a variable
learning rate. Artificial Intelligence 136 (2): 215–250.
Boyan, Justin, and Andrew W. Moore. 1995. Generalization in reinforcement learning:
Safely approximating the value function. In Advances in Neural Information Processing
Systems.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex optimization. Cambridge
University Press.
Bradtke, Steven J., and Andrew G. Barto. 1996. Linear least-squares algorithms for
temporal difference learning. Machine Learning 22 (1): 33–57.
Braver, Todd S., Deanna M. Barch, and Jonathan D. Cohen. 1999. Cognition and
control in schizophrenia: A computational model of dopamine and prefrontal function.
Biological Psychiatry 46 (3): 312–328.
Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of
Markov chain Monte Carlo. CRC Press.
Brown, Daniel, Scott Niekum, and Marek Petrik. 2020. Bayesian robust optimization
for imitation learning. In Advances in Neural Information Processing Systems.
Browne, Cameron B., Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter
I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samoth-
rakis, and Simon Colton. 2012. A survey of Monte Carlo tree search methods. IEEE
Transactions on Computational Intelligence and AI in Games 4 (1): 1–43.
Cabi, Serkan, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova,
Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg
Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu
Wang. 2020. Scaling data-driven robotics with reward sketching and batch reinforcement
learning. In Proceedings of Robotics: Science and Systems.
Cagniard, Barbara, Peter D. Balsam, Daniela Brunner, and Xiaoxi Zhuang. 2006. Mice
with chronically elevated dopamine exhibit enhanced motivation, but not learning, for a
food reward. Neuropsychopharmacology 31 (7): 1362–1370.
Carpin, Stefano, Yinlam Chow, and Marco Pavone. 2016. Risk aversion in finite Markov
decision processes using total cost criteria and average value at risk. In Proceedings of
the IEEE International Conference on Robotics and Automation.
Castro, Pablo S., Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G.
Bellemare. 2018. Dopamine: A research framework for deep reinforcement learning.
arXiv preprint arXiv:1812.06110.
Ceron, Johan Samir Obando, and Pablo Samuel Castro. 2021. Revisiting Rainbow:
Promoting more insightful and inclusive deep reinforcement learning research. In
Proceedings of the International Conference on Machine Learning.
Chandak, Yash, Scott Niekum, Bruno Castro da Silva, Erik Learned-Miller, Emma
Brunskill, and Philip S. Thomas. 2021. Universal off-policy evaluation. In Advances in
Neural Information Processing Systems.
Chapman, David, and Leslie Pack Kaelbling. 1991. Input generalization in delayed
reinforcement learning: An algorithm and performance comparisons. In Proceedings of
the International Joint Conference on Artificial Intelligence.
Chen, Jinglin, and Nan Jiang. 2019. Information-theoretic considerations in batch
reinforcement learning. In Proceedings of the International Conference on Machine
Learning.
Chopin, Nicolas, and Omiros Papaspiliopoulos. 2020. An introduction to sequential
Monte Carlo. Springer.
Chow, Yinlam. 2017. Risk-sensitive and data-driven sequential decision making. PhD
diss., Stanford University.
Chow, Yinlam, and Mohammad Ghavamzadeh. 2014. Algorithms for CVaR optimization
in MDPs. In Advances in Neural Information Processing Systems.
Chow, Yinlam, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2018.
Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine
Learning Research 18 (1): 6070–6120.
Chow, Yinlam, Aviv Tamar, Shie Mannor, and Marco Pavone. 2015. Risk-sensitive
and robust decision-making: A CVaR optimization approach. In Advances in Neural
Information Processing Systems.
Chung, Kun-Jen, and Matthew J. Sobel. 1987. Discounted MDPs: Distribution functions
and exponential utility maximization. SIAM Journal on Control and Optimization 25
(1): 49–62.
Chung, Wesley, Somjit Nath, Ajin Joseph, and Martha White. 2018. Two-timescale
networks for nonlinear value function approximation. In Proceedings of the International
Conference on Learning Representations.
Claus, Caroline, and Craig Boutilier. 1998. The dynamics of reinforcement learning in
cooperative multiagent systems. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Clements, William R., Benoit-Marie Robaglia, Bastien Van Delft, Reda Bahi Slaoui, and
Sebastien Toth. 2020. Estimating risk and uncertainty in deep reinforcement learning.
In Workshop on Uncertainty and Robustness in Deep Learning at the International
Conference on Machine Learning.
Cobbe, Karl, Chris Hesse, Jacob Hilton, and John Schulman. 2020. Leveraging procedu-
ral generation to benchmark reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001.
Introduction to algorithms. MIT Press.
Cormode, G., and S. Muthukrishnan. 2005. An improved data stream summary: The
count-min sketch and its applications. Journal of Algorithms 55 (1): 58–75.
Cuturi, Marco. 2013. Sinkhorn distances: Lightspeed computation of optimal transport.
In Advances in Neural Information Processing Systems.
Da Silva, Felipe Leno, Anna Helena Reali Costa, and Peter Stone. 2019. Distributional
reinforcement learning applied to robot soccer simulation. In Adaptive and Learning
Agents Workshop at the International Conference on Autonomous Agents and Multiagent
Systems.
Dabney, Will, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G.
Bellemare, and David Silver. 2020a. The value-improvement path: Towards better
representations for reinforcement learning. In Proceedings of the AAAI Conference on
Artificial Intelligence.
Dabney, Will, Zeb Kurth-Nelson, Naoshige Uchida, Clara Kwon Starkweather, Demis
Hassabis, Rémi Munos, and Matthew Botvinick. 2020b. A distributional code for value
in dopamine-based reinforcement learning. Nature 577 (7792): 671–675.
Dabney, Will, Georg Ostrovski, David Silver, and Rémi Munos. 2018a. Implicit quantile
networks for distributional reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Dabney, Will, Mark Rowland, Marc G. Bellemare, and Rémi Munos. 2018b. Distribu-
tional reinforcement learning with quantile regression. In AAAI Conference on Artificial
Intelligence.
Dai, Bo, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and
Le Song. 2018. SBEED: Convergent reinforcement learning with nonlinear function
approximation. In Proceedings of the International Conference on Machine Learning.
Daw, Nathaniel D. 2003. Reinforcement learning models of the dopamine system and
their behavioral implications. PhD diss., Carnegie Mellon University.
Daw, Nathaniel D., and Philippe N. Tobler. 2014. Value learning through reinforcement:
The basics of dopamine and reinforcement learning. In Neuroeconomics, edited by
Paul W. Glimcher and Ernst Fehr, 283–298. Academic Press.
Dayan, Peter. 1992. The convergence of TD(λ) for general λ. Machine Learning 8 (3–4): 341–362.
Dayan, Peter. 1993. Improving generalization for temporal difference learning: The
successor representation. Neural Computation 5 (4): 613–624.
Dayan, Peter, and Terrence J. Sejnowski. 1994. TD(λ) converges with probability 1. Machine Learning 14 (3): 295–301.
Dearden, Richard, Nir Friedman, and Stuart Russell. 1998. Bayesian Q-learning. In
Proceedings of the AAAI Conference on Artificial Intelligence.
Degrave, Jonas, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey,
Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de
las Casas, Craig Donner, Leslie Fritz, Cristian Galperti, Andrea Huber, James Keeling,
Maria Tsimpoukelli, Jackie Kay, Antoine Merle, Jean-Marc Moret, Seb Noury, Federico
Pesamosca, David Pfau, Olivier Sauter, Cristian Sommariva, Stefano Coda, Basil Duval,
Ambrogio Fasoli, Pushmeet Kohli, Koray Kavukcuoglu, Demis Hassabis, and Martin
Riedmiller. 2022. Magnetic control of tokamak plasmas through deep reinforcement
learning. Nature 602:414–419.
Delage, Erick, and Shie Mannor. 2010. Percentile optimization for Markov decision
processes with parameter uncertainty. Operations Research 58 (1): 203–213.
Denardo, Eric V., and Uriel G. Rothblum. 1979. Optimal stopping, exponential utility,
and linear programming. Mathematical Programming 16 (1): 228–244.
Derman, Cyrus. 1970. Finite state Markovian decision processes. Academic Press.
Diaconis, Persi, and David Freedman. 1999. Iterated random functions. SIAM Review 41
(1): 45–76.
Doan, Thang, Bogdan Mazoure, and Clare Lyle. 2018. GAN Q-learning. arXiv preprint
arXiv:1805.04874.
Doob, J. L. 1994. Measure theory. Springer.
Doucet, Arnaud, Nando De Freitas, and Neil Gordon. 2001. Sequential Monte Carlo
methods in practice. Springer.
Doucet, Arnaud, and Adam M. Johansen. 2011. A tutorial on particle filtering and
smoothing: Fifteen years later. In The Oxford handbook of nonlinear filtering, edited by
Dan Crisan and Boris Rozovskii. Oxford University Press.
Duan, Jingliang, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, and Bo Cheng.
2021. Distributional soft actor-critic: Off-policy reinforcement learning for addressing
value estimation errors. IEEE Transactions on Neural Networks and Learning Systems.
Dvoretzky, Aryeh. 1956. On stochastic approximation. In Proceedings of the Berkeley
Symposium on Mathematical Statistics and Probability, 39–55.
Dvoretzky, Aryeh, Jack Kiefer, and Jacob Wolfowitz. 1956. Asymptotic minimax char-
acter of the sample distribution function and of the classical multinomial estimator. The
Annals of Mathematical Statistics 27 (3): 642–669.
Engel, Yaakov, Shie Mannor, and Ron Meir. 2003. Bayes meets Bellman: The Gaussian
process approach to temporal difference learning. In Proceedings of the International
Conference on Machine Learning.
Engel, Yaakov, Shie Mannor, and Ron Meir. 2007. Bayesian reinforcement learning
with Gaussian process temporal difference methods. Unpublished manuscript.
Engert, Martin. 1970. Finite dimensional translation invariant subspaces. Pacific Journal
of Mathematics 32 (2): 333–343.
Ernst, Damien, Pierre Geurts, and Louis Wehenkel. 2005. Tree-based batch mode
reinforcement learning. Journal of Machine Learning Research 6:503–556.
Eshel, Neir, Michael Bukwich, Vinod Rao, Vivian Hemmelder, Ju Tian, and Naoshige
Uchida. 2015. Arithmetic and local circuitry underlying dopamine prediction errors.
Nature 525 (7568): 243–246.
Even-Dar, Eyal, and Yishay Mansour. 2003. Learning rates for Q-learning. Journal of
Machine Learning Research 5 (1): 1–25.
Farahmand, Amir-massoud. 2011. Action-gap phenomenon in reinforcement learning.
In Advances in Neural Information Processing Systems.
Farahmand, Amir-massoud. 2019. Value function in frequency domain and the char-
acteristic value iteration algorithm. In Advances in Neural Information Processing
Systems.
Fedus, William, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, and Hugo
Larochelle. 2019. Hyperbolic discounting and learning over multiple horizons. In
Multi-Disciplinary Conference on Reinforcement Learning and Decision-Making.
Feinberg, Eugene A. 2000. Constrained discounted Markov decision processes and
Hamiltonian cycles. Mathematics of Operations Research 25 (1): 130–140.
Ferns, Norm, Prakash Panangaden, and Doina Precup. 2004. Metrics for finite Markov
decision processes. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence.
Ferns, Norman, and Doina Precup. 2014. Bisimulation metrics are optimal value
functions. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Filar, Jerzy A., Dmitry Krass, and Keith W. Ross. 1995. Percentile performance crite-
ria for limiting average Markov decision processes. IEEE Transactions on Automatic
Control 40 (1): 2–10.
Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband,
Alex Graves, Vlad Mnih, Rémi Munos, Demis Hassabis, Olivier Pietquin, Charles
Blundell, and Shane Legg. 2018. Noisy networks for exploration. In Proceedings of the
International Conference on Learning Representations.
François-Lavet, Vincent, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle
Pineau. 2018. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning 11 (3–4): 219–354.
Freirich, Dror, Tzahi Shimkin, Ron Meir, and Aviv Tamar. 2019. Distributional multi-
variate policy evaluation and exploration with the Bellman GAN. In Proceedings of the
International Conference on Machine Learning.
Gardner, Matthew P. H., Geoffrey Schoenbaum, and Samuel J. Gershman. 2018. Rethink-
ing dopamine as generalized prediction error. Proceedings of the Royal Society B 285
(1891): 20181645.
German, Dwight C., Kebreten Manaye, Wade K. Smith, Donald J. Woodward, and Clif-
ford B. Saper. 1989. Midbrain dopaminergic cell loss in Parkinson’s disease: Computer
visualization. Annals of Neurology 26 (4): 507–514.
Ghavamzadeh, Mohammad, Shie Mannor, Joelle Pineau, and Aviv Tamar. 2015.
Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning 8 (5–6): 359–483.
Ghosh, Dibya, and Marc G. Bellemare. 2020. Representations for stable off-policy
reinforcement learning. In Proceedings of the International Conference on Machine
Learning.
Ghosh, Dibya, Marlos C. Machado, and Nicolas Le Roux. 2020. An operator view of
policy gradient methods. In Advances in Neural Information Processing Systems.
Gilbert, Hugo, Paul Weng, and Yan Xu. 2017. Optimizing quantiles in preference-
based Markov decision processes. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Glimcher, Paul W. 2011. Understanding dopamine and reinforcement learning: The
dopamine reward prediction error hypothesis. Proceedings of the National Academy of
Sciences 108 (Suppl. 3): 15647–15654.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets.
In Advances in Neural Information Processing Systems.
Gordon, Geoffrey. 1995. Stable function approximation in dynamic programming. In
Proceedings of the International Conference on Machine Learning.
Gordon, Neil J., David J. Salmond, and Adrian F. M. Smith. 1993. Novel approach
to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F (Radar and
Signal Processing) 140 (2): 107–113.
Graesser, Laura, and Wah Loon Keng. 2019. Foundations of deep reinforcement learning:
Theory and practice in Python. Addison-Wesley Professional.
Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexan-
der Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research 13
(1): 723–773.
Grünewälder, Steffen, and Klaus Obermayer. 2011. The optimal unbiased value estimator
and its relation to LSTD, TD and MC. Machine Learning 83 (3): 289–330.
Gruslys, Audrunas, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Belle-
mare, and Rémi Munos. 2018. The Reactor: A fast and sample-efficient actor-critic agent
for reinforcement learning. In Proceedings of the International Conference on Learning
Representations.
Guo, Zhaohan Daniel, Bernardo Avila Pires, Bilal Piot, Jean-Bastien Grill, Florent Altché,
Rémi Munos, and Mohammad Gheshlaghi Azar. 2020. Bootstrap latent-predictive
representations for multitask reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Gurvits, Leonid, Long-Ji Lin, and Stephen José Hanson. 1994. Incremental learning
of evaluation functions for absorbing Markov chains: New methods and theorems.
Technical report. Siemens Corporate Research.
Harmon, Mance E., and Leemon C. Baird. 1996. A response to Bertsekas’ “A
counterexample to temporal-differences learning”. Technical report. Wright Laboratory.
Haskell, William B., and Rahul Jain. 2015. A convex analytic approach to risk-aware
Markov decision processes. SIAM Journal on Control and Optimization 53 (3): 1569–
1598.
Hegarty, Shane V., Aideen M. Sullivan, and Gerard W. O’Keeffe. 2013. Midbrain
dopaminergic neurons: A review of the molecular circuitry that regulates their
development. Developmental Biology 379 (2): 123–138.
Heger, Matthias. 1994. Consideration of risk in reinforcement learning. In Proceedings
of the International Conference on Machine Learning.
Henderson, Peter, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and
David Meger. 2018. Deep reinforcement learning that matters. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Hessel, Matteo, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will
Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. 2018. Rainbow:
Combining improvements in deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
Computation 9 (8): 1735–1780.
Hornykiewicz, Oleh. 1966. Dopamine (3-hydroxytyramine) and brain function. Pharma-
cological Reviews 18 (2): 925–964.
Howard, Ronald A. 1960. Dynamic programming and Markov processes. MIT Press.
Howard, Ronald A., and James E. Matheson. 1972. Risk-sensitive Markov decision
processes. Management Science 18 (7): 356–369.
Howes, Oliver D., and Shitij Kapur. 2009. The dopamine hypothesis of schizophrenia:
Version III—the final common pathway. Schizophrenia Bulletin 35 (3): 549–562.
Hutter, Marcus. 2005. Universal artificial intelligence: Sequential decisions based on
algorithmic probability. Springer.
Imani, Ehsan, and Martha White. 2018. Improving regression performance with
distributional losses. In Proceedings of the International Conference on Machine
Learning.
Jaakkola, Tommi, Michael I. Jordan, and Satinder P. Singh. 1994. On the convergence
of stochastic iterative dynamic programming algorithms. Neural Computation 6 (6):
1185–1201.
Jaderberg, Max, Volodymyr Mnih, Wojciech M. Czarnecki, Tom Schaul, Joel Z. Leibo,
David Silver, and Koray Kavukcuoglu. 2017. Reinforcement learning with unsuper-
vised auxiliary tasks. In Proceedings of the International Conference on Learning
Representations.
Janner, Michael, Igor Mordatch, and Sergey Levine. 2020. Generative temporal dif-
ference learning for infinite-horizon prediction. In Advances in Neural Information
Processing Systems.
Jaquette, Stratton C. 1973. Markov decision processes with a new optimality criterion:
Discrete time. The Annals of Statistics 1 (3): 496–505.
Jaquette, Stratton C. 1976. A utility criterion for Markov decision processes. Manage-
ment Science 23 (1): 43–49.
Jessen, Børge, and Aurel Wintner. 1935. Distribution functions and the Riemann zeta
function. Transactions of the American Mathematical Society 38 (1): 48–88.
Jiang, Daniel R., and Warren B. Powell. 2018. Risk-averse approximate dynamic pro-
gramming with quantile-based risk measures. Mathematics of Operations Research 43
(2): 554–579.
Jordan, Richard, David Kinderlehrer, and Felix Otto. 1998. The variational formulation
of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis 29 (1): 1–17.
Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. Planning
and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134.
Kamin, Leon J. 1968. “Attention-like” processes in classical conditioning. In Miami
Symposium on the Prediction of Behavior: Aversive Stimulation, 9–31.
Kantorovich, Leonid V. 1942. On the translocation of masses. Proceedings of the USSR
Academy of Sciences 37 (7–8): 227–229.
Kapetanakis, Spiros, and Daniel Kudenko. 2002. Reinforcement learning of coordination
in cooperative multi-agent systems. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Kartal, Bilal, Pablo Hernandez-Leal, and Matthew E. Taylor. 2019. Terminal predic-
tion as an auxiliary task for deep reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence and Interactive Digital Entertainment.
Kempka, Michał, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech
Jaśkowski. 2016. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games, 1–8.
Keramati, Ramtin, Christoph Dann, Alex Tamkin, and Emma Brunskill. 2020. Being
optimistic to be conservative: Quickly learning a CVaR policy. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Kingma, Diederik, and Jimmy Ba. 2015. Adam: A method for stochastic optimization.
In Proceedings of the International Conference on Learning Representations.
Koenker, Roger. 2005. Quantile regression. Cambridge University Press.
Koenker, Roger, and Gilbert Bassett Jr. 1978. Regression quantiles. Econometrica 46
(1): 33–50.
Kolter, J. Zico. 2011. The fixed points of off-policy TD. In Advances in Neural
Information Processing Systems.
Konidaris, George D., Sarah Osentoski, and Philip S. Thomas. 2011. Value function
approximation in reinforcement learning using the Fourier basis. In Proceedings of the
AAAI Conference on Artificial Intelligence.
Kuan, Chung-Ming, Jin-Huei Yeh, and Yu-Chin Hsu. 2009. Assessing value at risk with
care, the conditional autoregressive expectile models. Journal of Econometrics 150 (2):
261–270.
Kuhn, Harold W. 1950. A simplified two-person poker. Contributions to the Theory of
Games 1:97–103.
Kurth-Nelson, Zeb, and A. David Redish. 2009. Temporal-difference reinforcement
learning with distributed representations. PLoS One 4 (10): e7362.
Kushner, Harold, and Dean Clark. 1978. Stochastic approximation methods for con-
strained and unconstrained systems. Springer.
Kushner, Harold, and G. George Yin. 2003. Stochastic approximation and recursive
algorithms and applications. Springer Science & Business Media.
Kuznetsov, Arsenii, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. 2020.
Controlling overestimation bias with truncated mixture of continuous distributional
quantile critics. In Proceedings of the International Conference on Machine Learning.
Lagoudakis, Michail G., and Ronald Parr. 2003. Least-squares policy iteration. Journal
of Machine Learning Research 4:1107–1149.
Lample, Guillaume, and Devendra Singh Chaplot. 2017. Playing FPS games with deep
reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Laskin, Michael, Aravind Srinivas, and Pieter Abbeel. 2020. CURL: Contrastive unsu-
pervised representations for reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Lattimore, Tor, and Marcus Hutter. 2012. PAC bounds for discounted MDPs. In
Proceedings of the International Conference on Algorithmic Learning Theory.
Lattimore, Tor, and Csaba Szepesvári. 2020. Bandit algorithms. Cambridge University
Press.
Lauer, Martin, and Martin Riedmiller. 2000. An algorithm for distributed reinforce-
ment learning in cooperative multi-agent systems. In Proceedings of the International
Conference on Machine Learning.
Le Lan, Charline, Stephen Tu, Adam Oberman, Rishabh Agarwal, and Marc G. Belle-
mare. 2022. On the generalization of representations in reinforcement learning. In
Proceedings of the International Conference on Artificial Intelligence and Statistics.
LeCun, Yann, and Yoshua Bengio. 1995. Convolutional networks for images, speech, and
time series. In The handbook of brain theory and neural networks, edited by Michael A.
Arbib. MIT Press.
Lee, Daewoo, Boris Defourny, and Warren B. Powell. 2013. Bias-corrected Q-learning
to control max-operator bias in Q-learning. In Symposium on Adaptive Dynamic
Programming And Reinforcement Learning.
Levine, Sergey. 2018. Reinforcement learning and control as probabilistic inference:
Tutorial and review. arXiv preprint arXiv:1805.00909.
Levine, Sergey, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end
training of deep visuomotor policies. Journal of Machine Learning Research 17 (1):
1334–1373.
Li, Xiaocheng, Huaiyang Zhong, and Margaret L. Brandeau. 2022. Quantile Markov
decision processes. Operations Research 70 (3): 1428–1447.
Lillicrap, Timothy P., Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. 2016a.
Random synaptic feedback weights support error backpropagation for deep learning.
Nature Communications 7 (1): 1–10.
Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. 2016b. Continuous control with deep
reinforcement learning. In Proceedings of the International Conference on Learning
Representations.
Lin, Gwo Dong. 2017. Recent developments on the moment problem. Journal of
Statistical Distributions and Applications 4 (1): 1–17.
Lin, L. J. 1992. Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning 8 (3): 293–321.
Lin, Zichuan, Li Zhao, Derek Yang, Tao Qin, Tie-Yan Liu, and Guangwen Yang. 2019.
Distributional reward decomposition for reinforcement learning. In Advances in Neural
Information Processing Systems.
Lipovetzky, Nir, Miquel Ramirez, and Hector Geffner. 2015. Classical planning with
simulators: Results on the Atari video games. In Proceedings of International Joint
Conference on Artificial Intelligence.
Littman, Michael L. 1994. Markov games as a framework for multi-agent reinforcement
learning. In Proceedings of the International Conference on Machine Learning.
Littman, Michael L., and Csaba Szepesvári. 1996. A generalized reinforcement-learning
model: Convergence and applications. In Proceedings of the International Conference
on Machine Learning.
Liu, Jun S. 2001. Monte Carlo strategies in scientific computing. Springer.
Liu, Qiang, and Dilin Wang. 2016. Stein variational gradient descent: A general purpose
Bayesian inference algorithm. In Advances in Neural Information Processing Systems.
Liu, Quansheng. 1998. Fixed points of a generalized smoothing transformation and
applications to the branching random walk. Advances in Applied Probability 30 (1):
85–112.
Ljung, Lennart. 1977. Analysis of recursive stochastic algorithms. IEEE Transactions
on Automatic Control 22 (4): 551–575.
Ljungberg, Tomas, Paul Apicella, and Wolfram Schultz. 1992. Responses of monkey
dopamine neurons during learning of behavioral reactions. Journal of Neurophysiology
67 (1): 145–163.
Lowet, Adam S., Qiao Zheng, Sara Matias, Jan Drugowitsch, and Naoshige Uchida.
2020. Distributional reinforcement learning in the brain. Trends in Neurosciences 43
(12): 980–997.
Ludvig, Elliot A., Marc G. Bellemare, and Keir G. Pearson. 2011. A primer on reinforce-
ment learning in the brain: Psychological, computational, and neural perspectives. In
Computational neuroscience for advancing artificial intelligence: Models, methods and
applications, edited by Eduardo Alonso and Esther Mondragón. IGI Global.
Luo, Yudong, Guiliang Liu, Haonan Duan, Oliver Schulte, and Pascal Poupart. 2021.
Distributional reinforcement learning with monotonic splines. In Proceedings of the
International Conference on Learning Representations.
Lyle, Clare, Pablo Samuel Castro, and Marc G. Bellemare. 2019. A comparative analysis
of expected and distributional reinforcement learning. In Proceedings of the AAAI
Conference on Artificial Intelligence.
Lyle, Clare, Mark Rowland, Georg Ostrovski, and Will Dabney. 2021. On the effect
of auxiliary tasks on representation dynamics. In Proceedings of the International
Conference on Artificial Intelligence and Statistics.
Lyu, Xueguang, and Christopher Amato. 2020. Likelihood quantile networks for
coordinating multi-agent reinforcement learning. In Proceedings of the International
Conference on Autonomous Agents and Multiagent Systems.
Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew
Hausknecht, and Michael Bowling. 2018. Revisiting the Arcade Learning Environ-
ment: Evaluation protocols and open problems for general agents. Journal of Artificial
Intelligence Research 61:523–562.
MacKay, David J. C. 2003. Information theory, inference and learning algorithms.
Cambridge University Press.
Maddison, Chris J., Dieterich Lawson, George Tucker, Nicolas Heess, Arnaud Doucet,
Andriy Mnih, and Yee Whye Teh. 2017. Particle value functions. In Proceedings of the
International Conference on Learning Representations (Workshop Track).
Madeira Araújo, João Guilherme, Johan Samir Obando Ceron, and Pablo Samuel
Castro. 2021. Lifting the veil on hyper-parameters for value-based deep reinforcement
learning. In NeurIPS 2021 Workshop: LatinX in AI.
Maei, Hamid Reza. 2011. Gradient temporal-difference learning algorithms. PhD diss.,
University of Alberta.
Mandl, Petr. 1971. On the variance in controlled Markov chains. Kybernetika 7 (1):
1–12.
Mannor, Shie, Duncan Simester, Peng Sun, and John N. Tsitsiklis. 2007. Bias and
variance approximation in value function estimates. Management Science 53 (2): 308–
322.
Mannor, Shie, and John Tsitsiklis. 2011. Mean-variance optimization in Markov decision
processes. In Proceedings of the International Conference on Machine Learning.
Markowitz, Harry M. 1952. Portfolio selection. Journal of Finance 7:77–91.
Martin, John, Michal Lyskawinski, Xiaohu Li, and Brendan Englot. 2020. Stochastically
dominant distributional reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Massart, Pascal. 1990. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality.
The Annals of Probability 18 (3): 1269–1283.
Matignon, Laëtitia, Guillaume J. Laurent, and Nadine Le Fort-Piat. 2007. Hysteretic
Q-learning: An algorithm for decentralized reinforcement learning in cooperative multi-
agent teams. In IEEE International Conference on Intelligent Robots and Systems.
Matignon, Laëtitia, Guillaume J. Laurent, and Nadine Le Fort-Piat. 2012. Independent
reinforcement learners in cooperative Markov games: A survey regarding coordination
problems. The Knowledge Engineering Review 27 (1): 1–31.
Mavrin, Borislav, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu. 2019.
Distributional reinforcement learning for efficient exploration. In Proceedings of the
International Conference on Machine Learning.
McCallum, Andrew K. 1995. Reinforcement learning with selective perception and
hidden state. PhD diss., University of Rochester.
Meyn, Sean. 2022. Control systems and reinforcement learning. Cambridge University
Press.
Meyn, Sean P., and Richard L. Tweedie. 2012. Markov chains and stochastic stability.
Cambridge University Press.
Mihatsch, Oliver, and Ralph Neuneier. 2002. Risk-sensitive reinforcement learning.
Machine Learning 49 (2): 267–290.
Miller, Ralph R., Robert C. Barnet, and Nicholas J. Grahame. 1995. Assessment of the
Rescorla-Wagner model. Psychological Bulletin 117 (3): 363.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc
G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski,
Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan
Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control
through deep reinforcement learning. Nature 518 (7540): 529–533.
Mogenson, Gordon J., Douglas L. Jones, and Chi Yiu Yim. 1980. From motivation to
action: Functional interface between the limbic system and the motor system. Progress
in Neurobiology 14 (2–3): 69–97.
Monge, Gaspard. 1781. Mémoire sur la théorie des déblais et des remblais. Histoire de
l’Académie Royale des Sciences de Paris: 666–704.
Montague, P. Read, Peter Dayan, and Terrence J. Sejnowski. 1996. A framework for
mesencephalic dopamine systems based on predictive Hebbian learning. Journal of
Neuroscience 16 (5): 1936–1947.
Montfort, Nick, and Ian Bogost. 2009. Racing the beam: The Atari video computer
system. MIT Press.
Moore, Andrew W., and Christopher G. Atkeson. 1993. Prioritized sweeping: Rein-
forcement learning with less data and less time. Machine Learning 13 (1): 103–
130.
Morgenstern, Oskar, and John von Neumann. 1944. Theory of games and economic
behavior. Princeton University Press.
Morimura, Tetsuro, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and
Toshiyuki Tanaka. 2010a. Nonparametric return distribution approximation for rein-
forcement learning. In Proceedings of the International Conference on Machine
Learning.
Morimura, Tetsuro, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and
Toshiyuki Tanaka. 2010b. Parametric return density estimation for reinforcement
learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Morton, Thomas E. 1971. On the asymptotic convergence rate of cost differences for
Markovian decision processes. Operations Research 19 (1): 244–248.
Mott, Bradford W., Stephen Anthony, and the Stella team. 1995–2023. Stella: A multi-
platform Atari 2600 VCS Emulator. http://stella.sourceforge.net.
Müller, Alfred. 1997. Integral probability metrics and their generating classes of
functions. Advances in Applied Probability 29 (2): 429–443.
Muller, Timothy H., James L. Butler, Sebastijan Veselic, Bruno Miranda, Timothy
E. J. Behrens, Zeb Kurth-Nelson, and Steven W. Kennerley. 2021. Distributional
reinforcement learning in prefrontal cortex. bioRxiv 2021.06.14.448422.
Munos, Rémi. 2003. Error bounds for approximate policy iteration. In Proceedings of
the International Conference on Machine Learning.
Munos, Rémi, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. 2016. Safe
and efficient off-policy reinforcement learning. In Advances in Neural Information
Processing Systems.
Murphy, Kevin P. 2012. Machine learning: A probabilistic perspective. MIT Press.
Naddaf, Yavar. 2010. Game-independent AI agents for playing Atari 2600 console
games. Master’s thesis, University of Alberta.
Naesseth, Christian A., Fredrik Lindsten, and Thomas B. Schön. 2019. Elements of
sequential Monte Carlo. Foundations and Trends® in Machine Learning 12 (3): 307–392.
Nair, Vinod, and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted
Boltzmann machines. In Proceedings of the International Conference on Machine
Learning.
Nam, Daniel W., Younghoon Kim, and Chan Y. Park. 2021. GMAC: A distributional
perspective on actor-critic framework. In Proceedings of the International Conference
on Machine Learning.
Neininger, Ralph. 1999. Limit laws for random recursive structures and algorithms.
PhD diss., University of Freiburg.
Neininger, Ralph. 2001. On a multivariate contraction method for random recursive
structures with applications to Quicksort. Random Structures & Algorithms 19 (3–4):
498–524.
Neininger, Ralph, and Ludger Rüschendorf. 2004. A general limit theorem for recursive
algorithms and combinatorial structures. The Annals of Applied Probability 14 (1): 378–
418.
Newey, Whitney K., and James L. Powell. 1987. Asymmetric least squares estimation
and testing. Econometrica 55 (4): 819–847.
Nguyen, Thanh Tang, Sunil Gupta, and Svetha Venkatesh. 2021. Distributional rein-
forcement learning via moment matching. In Proceedings of the AAAI Conference on
Artificial Intelligence.
Nieoullon, André. 2002. Dopamine and the regulation of cognition and attention.
Progress in Neurobiology 67 (1): 53–83.
Nikolov, Nikolay, Johannes Kirschner, Felix Berkenkamp, and Andreas Krause. 2019.
Information-directed exploration for deep reinforcement learning. In Proceedings of the
International Conference on Learning Representations.
Niv, Yael. 2009. Reinforcement learning in the brain. Journal of Mathematical
Psychology 53 (3): 139–154.
Olah, Chris, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Kather-
ine Ye, and Alexander Mordvintsev. 2018. The building blocks of interpretability.
Distill.
Oliehoek, Frans A., and Christopher Amato. 2016. A concise introduction to decentral-
ized POMDPs. Springer.
Oliehoek, Frans A., Matthijs T. J. Spaan, and Nikos Vlassis. 2008. Optimal and approxi-
mate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence
Research 32 (1): 289–353.
Olsen, Ditte, Niels Wellner, Mathias Kaas, Inge E. M. de Jong, Florence Sotty, Michael
Didriksen, Simon Glerup, and Anders Nykjaer. 2021. Altered dopaminergic firing
pattern and novelty response underlie ADHD-like behavior of SorCS2-deficient mice.
Translational Psychiatry 11 (1): 1–14.
Omidshafiei, Shayegan, Jason Pazis, Christopher Amato, Jonathan P. How, and John Vian.
2017. Deep decentralized multi-task multi-agent reinforcement learning under partial
observability. In Proceedings of the International Conference on Machine Learning.
Owen, Art B. 2013. Monte Carlo theory, methods and examples.
Palmer, Gregory, Rahul Savani, and Karl Tuyls. 2019. Negative update intervals in deep
multi-agent reinforcement learning. In Proceedings of the International Conference on
Autonomous Agents and Multiagent Systems.
Palmer, Gregory, Karl Tuyls, Daan Bloembergen, and Rahul Savani. 2018. Lenient
multi-agent deep reinforcement learning. In Proceedings of the International Conference
on Autonomous Agents and Multiagent Systems.
Panait, Liviu, Keith Sullivan, and Sean Luke. 2006. Lenient learners in cooperative
multiagent systems. In Proceedings of the International Conference on Autonomous
Agents and Multiagent Systems.
Panait, Liviu, R. Paul Wiegand, and Sean Luke. 2003. Improving coevolutionary search
for optimal multiagent behaviors. In Proceedings of the International Joint Conference
on Artificial Intelligence.
Panaretos, Victor M., and Yoav Zemel. 2020. An invitation to statistics in Wasserstein
space. Springer Nature.
Parr, Ronald, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael
L. Littman. 2008. An analysis of linear models, linear value-function approximation,
and feature selection for reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Parr, Ronald, Christopher Painter-Wakefield, Lihong Li, and Michael Littman. 2007.
Analyzing feature generation for value-function approximation. In Proceedings of the
International Conference on Machine Learning.
Pavlov, Ivan P. 1927. Conditioned reflexes: An investigation of the physiological activity
of the cerebral cortex. Oxford University Press.
Peres, Yuval, Wilhelm Schlag, and Boris Solomyak. 2000. Sixty years of Bernoulli con-
volutions. In Fractal geometry and stochastics II, edited by Christoph Bandt, Siegfried
Graf, and Martina Zähle. Springer.
Peyré, Gabriel, and Marco Cuturi. 2019. Computational optimal transport: With applica-
tions to data science. Foundations and Trends® in Machine Learning 11 (5–6): 355–607.
Prashanth, L. A., and Michael Fu. 2021. Risk-sensitive reinforcement learning. arXiv
preprint arXiv:1810.09126.
Prashanth, L. A., and Mohammad Ghavamzadeh. 2013. Actor-critic algorithms for
risk-sensitive MDPs. In Advances in Neural Information Processing Systems.
Precup, Doina, Richard S. Sutton, and Satinder P. Singh. 2000. Eligibility traces for
off-policy policy evaluation. In Proceedings of the International Conference on Machine
Learning.
Pritzel, Alexander, Benigno Uria, Sriram Srinivasan, Adria Puigdomenech Badia, Oriol
Vinyals, Demis Hassabis, Daan Wierstra, and Charles Blundell. 2017. Neural episodic
control. In Proceedings of the International Conference on Machine Learning.
Puterman, Martin L. 2014. Markov decision processes: Discrete stochastic dynamic
programming. John Wiley & Sons.
Puterman, Martin L., and Moon Chirl Shin. 1978. Modified policy iteration algorithms
for discounted Markov decision problems. Management Science 24 (11): 1127–1137.
Qiu, Wei, Xinrun Wang, Runsheng Yu, Xu He, Rundong Wang, Bo An, Svetlana
Obraztsova, and Zinovi Rabinovich. 2021. RMIX: Learning risk-sensitive policies
for cooperative reinforcement learning agents. In Advances in Neural Information
Processing Systems.
Qu, Chao, Shie Mannor, and Huan Xu. 2019. Nonlinear distributional gradient temporal-
difference learning. In Proceedings of the International Conference on Machine
Learning.
Quan, John, and Georg Ostrovski. 2020. DQN Zoo: Reference implementations of
DQN-based agents. Version 1.0.0. http://github.com/deepmind/dqn_zoo.
Rachev, Svetlozar T., Lev Klebanov, Stoyan V. Stoyanov, and Frank Fabozzi. 2013.
The methods of distances in the theory of probability and statistics. Springer Science &
Business Media.
Rachev, Svetlozar T., and Ludger Rüschendorf. 1995. Probability metrics and recursive
algorithms. Advances in Applied Probability 27 (3): 770–799.
Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar,
Jakob N. Foerster, and Shimon Whiteson. 2020. Monotonic value function factorisation
for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21
(1): 7234–7284.
Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar,
Jakob Foerster, and Shimon Whiteson. 2018. QMIX: Monotonic value function factori-
sation for deep multi-agent reinforcement learning. In Proceedings of the International
Conference on Machine Learning.
Rescorla, Robert A., and Allan R. Wagner. 1972. A theory of Pavlovian conditioning:
Variations in the effectiveness of reinforcement and nonreinforcement. In Classical
conditioning II: Current research and theory, edited by Abraham H. Black and William
F. Prokasy, 64–99. Appleton-Century-Crofts.
Riedmiller, M. 2005. Neural fitted Q iteration – first experiences with a data efficient
neural reinforcement learning method. In Proceedings of the European Conference on
Machine Learning.
Riedmiller, Martin, Thomas Gabel, Roland Hafner, and Sascha Lange. 2009. Reinforce-
ment learning for robot soccer. Autonomous Robots 27 (1): 55–73.
Rizzo, Maria L., and Gábor J. Székely. 2016. Energy distance. Wiley Interdisciplinary
Reviews: Computational Statistics 8 (1): 27–38.
Robbins, Herbert, and Sutton Monro. 1951. A stochastic approximation method. The
Annals of Mathematical Statistics 22 (3): 400–407.
Robbins, Herbert, and David Siegmund. 1971. A convergence theorem for non negative
almost supermartingales and some applications. In Optimizing methods in statistics,
edited by Jagdish S. Rustagi, 233–257. Academic Press.
Robert, Christian, and George Casella. 2004. Monte Carlo statistical methods. Springer
Science & Business Media.
Rockafellar, R. Tyrrell, and Stanislav Uryasev. 2000. Optimization of conditional value-
at-risk. Journal of Risk 2:21–42.
Rockafellar, R. Tyrrell, and Stanislav Uryasev. 2002. Conditional value-at-risk for
general loss distributions. Journal of Banking & Finance 26 (7): 1443–1471.
Rösler, Uwe. 1991. A limit theorem for “Quicksort.” RAIRO-Theoretical Informatics and Applications 25 (1): 85–100.
Rösler, Uwe. 1992. A fixed point theorem for distributions. Stochastic Processes and
Their Applications 42 (2): 195–214.
Rösler, Uwe. 2001. On the analysis of stochastic divide and conquer algorithms.
Algorithmica 29 (1): 238–261.
Rösler, Uwe, and Ludger Rüschendorf. 2001. The contraction method for recursive
algorithms. Algorithmica 29 (1–2): 3–33.
Rowland, Mark, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh.
2018. An analysis of categorical distributional reinforcement learning. In Proceedings of
the International Conference on Artificial Intelligence and Statistics.
Rowland, Mark, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare,
and Will Dabney. 2019. Statistics and samples in distributional reinforcement learning.
In Proceedings of the International Conference on Machine Learning.
Rowland, Mark, Shayegan Omidshafiei, Daniel Hennes, Will Dabney, Andrew Jaegle,
Paul Muller, Julien Pérolat, and Karl Tuyls. 2021. Temporal difference and return
optimism in cooperative multi-agent reinforcement learning. In Adaptive and Learning
Agents Workshop at the International Conference on Autonomous Agents and Multiagent
Systems.
Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. 1998. A metric for distributions
with applications to image databases. In Sixth International Conference on Computer
Vision.
Rudin, Walter. 1976. Principles of mathematical analysis. McGraw-Hill.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning
representations by back-propagating errors. Nature 323 (6088): 533–536.
Rummery, Gavin A., and Mahesan Niranjan. 1994. On-line Q-learning using connec-
tionist systems. Technical report. Cambridge University Engineering Department.
Rüschendorf, Ludger. 2006. On stochastic recursive equations of sum and max type.
Journal of Applied Probability 43 (3): 687–703.
Rüschendorf, Ludger, and Ralph Neininger. 2006. A survey of multivariate aspects of
the contraction method. Discrete Mathematics & Theoretical Computer Science 8:31–56.
Ruszczyński, Andrzej. 2010. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming 125 (2): 235–261.
Samuel, Arthur L. 1959. Some studies in machine learning using the game of checkers.
IBM Journal of Research and Development 11 (6): 601–617.
Santambrogio, Filippo. 2015. Optimal transport for applied mathematicians: Calculus
of variations, PDEs and modeling. Birkhäuser.
Särkkä, Simo. 2013. Bayesian filtering and smoothing. Cambridge University Press.
Sato, Makoto, Hajime Kimura, and Shigenobu Kobayashi. 2001. TD algorithm for
the variance of return and mean-variance reinforcement learning. Transactions of the
Japanese Society for Artificial Intelligence 16 (3): 353–362.
Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized
experience replay. In Proceedings of the International Conference on Learning
Representations.
Scherrer, Bruno. 2010. Should one compute the temporal difference fix point or mini-
mize the Bellman residual? The unified oblique projection view. In Proceedings of the
International Conference on Machine Learning.
Scherrer, Bruno. 2014. Approximate policy iteration schemes: A comparison. In
Proceedings of the International Conference on Machine Learning.
Scherrer, Bruno, and Boris Lesner. 2012. On the use of non-stationary policies for sta-
tionary infinite-horizon Markov decision processes. In Advances in Neural Information
Processing Systems.
Schlegel, Matthew, Andrew Jacobsen, Zaheer Abbas, Andrew Patterson, Adam White,
and Martha White. 2021. General value function networks. Journal of Artificial
Intelligence Research (JAIR) 70:497–543.
Schultz, Wolfram. 1986. Responses of midbrain dopamine neurons to behavioral trigger
stimuli in the monkey. Journal of Neurophysiology 56 (5): 1439–1461.
Schultz, Wolfram. 2002. Getting formal with dopamine and reward. Neuron 36 (2):
241–263.
Schultz, Wolfram. 2016. Dopamine reward prediction-error signalling: A two-component
response. Nature Reviews Neuroscience 17 (3): 183–195.
Schultz, Wolfram, Paul Apicella, and Tomas Ljungberg. 1993. Responses of monkey
dopamine neurons to reward and conditioned stimuli during successive steps of learning
a delayed response task. Journal of Neuroscience 13 (3): 900–913.
Schultz, Wolfram, Peter Dayan, and P. Read Montague. 1997. A neural substrate of
prediction and reward. Science 275 (5306): 1593–1599.
Schultz, Wolfram, and Ranulfo Romo. 1990. Dopamine neurons of the monkey midbrain:
Contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal
of Neurophysiology 63 (3): 607–624.
Shah, Ashvin. 2012. Psychological and neuroscientific connections with reinforcement
learning. In Reinforcement learning, edited by Marco Wiering and Martijn van Otterlo,
507–537. Springer.
Shapiro, Alexander, Darinka Dentcheva, and Andrzej Ruszczyński. 2009. Lectures on
stochastic programming: Modeling and theory. SIAM.
Shapley, Lloyd S. 1953. Stochastic games. Proceedings of the National Academy of
Sciences 39 (10): 1095–1100.
Shen, Yun, Wilhelm Stannat, and Klaus Obermayer. 2013. Risk-sensitive Markov control
processes. SIAM Journal on Control and Optimization 51 (5): 3652–3672.
Shoham, Yoav, and Kevin Leyton-Brown. 2009. Multiagent systems. Cambridge
University Press.
Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George
van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya
Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and
tree search. Nature 529 (7587): 484–489.
Singh, Satinder P., and Richard S. Sutton. 1996. Reinforcement learning with replacing
eligibility traces. Machine Learning 22:123–158.
Sobel, Matthew J. 1982. The variance of discounted Markov decision processes. Journal
of Applied Probability 19 (4): 794–802.
Solomyak, Boris. 1995. On the random series Σ ±λⁿ (an Erdős problem). Annals of Mathematics 142 (3): 611–625.
Stalnaker, Thomas A., James D. Howard, Yuji K. Takahashi, Samuel J. Gershman,
Thorsten Kahnt, and Geoffrey Schoenbaum. 2019. Dopamine neuron ensembles signal
the content of sensory prediction errors. eLife 8:e49315.
Steinbach, Marc C. 2001. Markowitz revisited: Mean-variance models in financial
portfolio analysis. SIAM Review 43 (1): 31–85.
Strang, Gilbert. 1993. Introduction to linear algebra. Wellesley-Cambridge Press.
Such, Felipe Petroski, Vashisht Madhavan, Rosanne Liu, Rui Wang, Pablo Samuel Castro,
Yulun Li, Ludwig Schubert, Marc G. Bellemare, Jeff Clune, and Joel Lehman. 2019. An
Atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning
agents. In Proceedings of the International Joint Conference on Artificial Intelligence.
Sun, Wei-Fang, Cheng-Kuang Lee, and Chun-Yi Lee. 2021. DFAC framework: Factoriz-
ing the value function via quantile mixture for multi-agent distributional Q-learning. In
Proceedings of the International Conference on Machine Learning.
Sunehag, Peter, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius
Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls,
and Thore Graepel. 2017. Value-decomposition networks for cooperative multi-agent
learning. arXiv preprint arXiv:1706.05296.
Sutton, Richard S. 1984. Temporal credit assignment in reinforcement learning. PhD
diss., University of Massachusetts, Amherst.
Sutton, Richard S. 1988. Learning to predict by the methods of temporal differences.
Machine Learning 3 (1): 9–44.
Sutton, Richard S. 1995. TD models: Modeling the world at a mixture of time scales. In
Proceedings of the International Conference on Machine Learning.
Sutton, Richard S. 1996. Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In Advances in Neural Information Processing Systems.
Sutton, Richard S. 1999. Open theoretical questions in reinforcement learning. In
European Conference on Computational Learning Theory.
Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement learning: An introduction.
MIT Press.
Sutton, Richard S., Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Sil-
ver, Csaba Szepesvári, and Eric Wiewiora. 2009. Fast gradient-descent methods for
temporal-difference learning with linear function approximation. In Proceedings of the
International Conference on Machine Learning.
Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000.
Policy gradient methods for reinforcement learning with function approximation. In
Advances in Neural Information Processing Systems.
Sutton, Richard S., Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski,
Adam White, and Doina Precup. 2011. Horde: A scalable real-time architecture for
learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the
International Conference on Autonomous Agents and Multiagent Systems.
Sutton, Richard S., Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning. Artificial
Intelligence 112 (1–2): 181–211.
Sutton, Richard S., Csaba Szepesvári, and Hamid Reza Maei. 2008a. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
Sutton, Richard S., Csaba Szepesvári, Alborz Geramifard, and Michael Bowling. 2008b.
Dyna-style planning with linear function approximation and prioritized sweeping. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Székely, Gábor J. 2002. E-statistics: The energy of statistical samples. Technical
report 02-16. Bowling Green State University, Department of Mathematics and Statistics.
Székely, Gábor J., and Maria L. Rizzo. 2013. Energy statistics: A class of statistics based
on distances. Journal of Statistical Planning and Inference 143 (8): 1249–1272.
Szepesvári, Csaba. 1998. The asymptotic convergence-rate of Q-learning. In Advances
in Neural Information Processing Systems.
Szepesvári, Csaba. 2010. Algorithms for reinforcement learning. Morgan & Claypool
Publishers.
Szepesvári, Csaba. 2020. Constrained MDPs and the reward hypothesis. https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html. Accessed June 25, 2021.
Takahashi, Yuji K., Hannah M. Batchelor, Bing Liu, Akash Khanna, Marisela Morales,
and Geoffrey Schoenbaum. 2017. Dopamine neurons respond to errors in the prediction
of sensory features of expected rewards. Neuron 95 (6): 1395–1405.
Tamar, Aviv, Dotan Di Castro, and Shie Mannor. 2012. Policy gradients with vari-
ance related risk criteria. In Proceedings of the International Conference on Machine
Learning.
Tamar, Aviv, Dotan Di Castro, and Shie Mannor. 2013. Temporal difference methods
for the variance of the reward to go. In Proceedings of the International Conference on
Machine Learning.
Tamar, Aviv, Dotan Di Castro, and Shie Mannor. 2016. Learning the variance of the
reward-to-go. Journal of Machine Learning Research 17 (1): 361–396.
Tamar, Aviv, Yonatan Glassner, and Shie Mannor. 2015. Optimizing the CVaR via
sampling. In Proceedings of the AAAI Conference on Artificial Intelligence.
Tampuu, Ardi, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan
Aru, Jaan Aru, and Raul Vicente. 2017. Multiagent cooperation and competition with
deep reinforcement learning. PloS One 12 (4): e0172395.
Tan, Ming. 1993. Multi-agent reinforcement learning: Independent vs. cooperative
agents. In Proceedings of the International Conference on Machine Learning.
Tano, Pablo, Peter Dayan, and Alexandre Pouget. 2020. A local temporal difference code
for distributional reinforcement learning. In Advances in Neural Information Processing
Systems.
Taylor, James W. 2008. Estimating value at risk and expected shortfall using expectiles.
Journal of Financial Econometrics 6 (2): 231–252.
Tesauro, Gerald. 1995. Temporal difference learning and TD-Gammon. Communications
of the ACM 38 (3): 58–68.
Tessler, Chen, Guy Tennenholtz, and Shie Mannor. 2019. Distributional policy optimiza-
tion: An alternative approach for continuous control. In Advances in Neural Information
Processing Systems.
Tieleman, Tijmen, and Geoffrey Hinton. 2012. rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Toussaint, Marc. 2009. Robot trajectory optimization using approximate inference. In
Proceedings of the International Conference on Machine Learning.
Toussaint, Marc, and Amos Storkey. 2006. Probabilistic inference for solving discrete
and continuous state Markov decision processes. In Proceedings of the International
Conference on Machine Learning.
Tsitsiklis, John N. 1994. Asynchronous stochastic approximation and Q-learning.
Machine Learning 16 (3): 185–202.
Tsitsiklis, John N. 2002. On the convergence of optimistic policy iteration. Journal of
Machine Learning Research 3:59–72.
Tsitsiklis, John N., and Benjamin Van Roy. 1997. An analysis of temporal-difference
learning with function approximation. IEEE Transactions on Automatic Control 42 (5):
674–690.
Tulcea, Cassius T. Ionescu. 1949. Mesures dans les espaces produits. Atti Accademia
Nazionale Lincei Rend 8 (7): 208–211.
van den Oord, Aäron, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent
neural networks. In Proceedings of the International Conference on Machine Learning.
van der Vaart, Aad W. 2000. Asymptotic statistics. Cambridge University Press.
van der Wal, Johannes. 1981. Stochastic dynamic programming: Successive approxima-
tions and nearly optimal strategies for Markov decision processes and Markov games.
Stichting Mathematisch Centrum.
van Hasselt, Hado, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Sil-
ver. 2016a. Learning values across many orders of magnitude. In Advances in Neural
Information Processing Systems.
van Hasselt, Hado, Arthur Guez, and David Silver. 2016b. Deep reinforcement learning
with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In
Advances in Neural Information Processing Systems.
Vecerik, Mel, Oleg Sushkov, David Barker, Thomas Rothörl, Todd Hester, and Jon
Scholz. 2019. A practical approach to insertion with variable socket position using deep
reinforcement learning. In IEEE International Conference on Robotics and Automation.
Veness, Joel, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins.
2015. Compress and control. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Veness, Joel, Kee Siong Ng, Marcus Hutter, William T. B. Uther, and David Silver.
2011. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research
40:95–142.
Vershik, A. M. 2013. Long history of the Monge-Kantorovich transportation problem.
The Mathematical Intelligencer 35 (4): 1–9.
Vieillard, Nino, Olivier Pietquin, and Matthieu Geist. 2020. Munchausen reinforcement
learning. In Advances in Neural Information Processing Systems.
Villani, Cédric. 2003. Topics in optimal transportation. Graduate Studies in Mathematics.
American Mathematical Society.
Villani, Cédric. 2008. Optimal transport: Old and new. Springer Science & Business
Media.
von Neumann, John. 1928. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen
100 (1): 295–320.
Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1 (1–2): 1–305.
Walton, Neil. 2021. Lecture notes on stochastic control. Unpublished manuscript.
Wang, Ziyu, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. 2016. Dueling network architectures for deep reinforcement learning. In
Proceedings of the International Conference on Machine Learning.
Watkins, Christopher J. C. H. 1989. Learning from delayed rewards. PhD diss., King’s
College, Cambridge.
Watkins, Christopher J. C. H., and Peter Dayan. 1992. Q-learning. Machine Learning 8
(3–4): 279–292.
Weed, Jonathan, and Francis Bach. 2019. Sharp asymptotic and finite-sample rates of
convergence of empirical measures in Wasserstein distance. Bernoulli 25 (4A): 2620–
2648.
Wei, Ermo, and Sean Luke. 2016. Lenient learning in independent-learner stochastic
cooperative games. Journal of Machine Learning Research 17 (1): 2914–2955.
Werbos, Paul J. 1982. Applications of advances in nonlinear sensitivity analysis. In
System modeling and optimization, edited by Rudolph F. Drenick and Frank Kozin,
762–770. Springer.
White, D. J. 1988. Mean, variance, and probabilistic criteria in finite Markov decision
processes: A review. Journal of Optimization Theory and Applications 56 (1): 1–29.
White, Martha. 2017. Unifying task specification in reinforcement learning. In
Proceedings of the International Conference on Machine Learning.
White, Martha, and Adam White. 2016. A greedy approach to adapting the trace param-
eter for temporal difference learning. In Proceedings of the International Conference on
Autonomous Agents and Multiagent Systems.
White, Norman M., and Marc Viaud. 1991. Localized intracaudate dopamine D2 receptor
activation during the post-training period improves memory for visual or olfactory
conditioned emotional responses in rats. Behavioral and Neural Biology 55 (3): 255–
269.
Widrow, Bernard, and Marcian E. Hoff. 1960. Adaptive switching circuits. In WESCON
Convention Record Part IV.
Williams, David. 1991. Probability with martingales. Cambridge University Press.
Wise, Roy A. 2004. Dopamine, learning and motivation. Nature Reviews Neuroscience
5 (6): 483–494.
Wurman, Peter R., Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik
Subramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert,
Florian Fuchs, Leilani Gilpin, Piyush Khandelwal, Varun Kompella, HaoChih Lin,
Patrick MacAlpine, Declan Oller, Takuma Seno, Craig Sherstan, Michael D. Thomure,
Houmehr Aghabozorgi, Leon Barrett, Rory Douglas, Dion Whitehead, Peter Dürr, Peter
Stone, Michael Spranger, and Hiroaki Kitano. 2022. Outracing champion Gran Turismo
drivers with deep reinforcement learning. Nature 602 (7896): 223–228.
Yang, Derek, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. 2019. Fully
parameterized quantile function for distributional reinforcement learning. In Advances
in Neural Information Processing Systems.
Young, Kenny, and Tian Tian. 2019. MinAtar: An Atari-inspired testbed for thorough
and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176.
Yue, Yuguang, Zhendong Wang, and Mingyuan Zhou. 2020. Implicit distributional
reinforcement learning. In Advances in Neural Information Processing Systems.
Zhang, Shangtong, and Hengshuai Yao. 2019. QUOTA: The quantile option architec-
ture for reinforcement learning. In Proceedings of the AAAI Conference on Artificial
Intelligence.
Zhou, Fan, Zhoufan Zhu, Qi Kuang, and Liwen Zhang. 2021. Non-decreasing quantile
function network with efficient exploration for distributional reinforcement learning. In
Proceedings of the International Joint Conference on Artificial Intelligence.
Ziegel, Johanna F. 2016. Coherence and elicitability. Mathematical Finance 26 (4):
901–918.
Zolotarev, Vladimir M. 1976. Metric distances in spaces of random variables and their
distributions. Sbornik: Mathematics 30 (3): 373–401.