Contents
Preface ix
1 Introduction 1
1.1 Why distributional reinforcement learning? 2
1.2 An example: Kuhn poker 3
1.3 How is distributional reinforcement learning different? 5
1.4 Intended audience and organisation 7
1.5 Bibliographical remarks 9
2 The Distribution of Returns 11
2.1 Random variables and their probability distributions 11
2.2 Markov decision processes 14
2.3 The pinball model 16
2.4 The return 19
2.5 The Bellman equation 25
2.6 Properties of the random trajectory 27
2.7 The random-variable Bellman equation 31
2.8 From random variables to probability distributions 34
2.9 Alternative notions of the return distribution* 41
2.10 Technical remarks 42
2.11 Bibliographical remarks 44
2.12 Exercises 46
3 Learning the Return Distribution 53
3.1 The Monte Carlo method 54
3.2 Incremental learning 56
3.3 Temporal-difference learning 59
3.4 From values to probabilities 61
3.5 The projection step 62
3.6 Categorical temporal-difference learning 67
3.7 Learning to control 71
3.8 Further considerations 72
3.9 Technical remarks 73
3.10 Bibliographical remarks 73
3.11 Exercises 75
4 Operators and Metrics 79
4.1 The Bellman operator 80
4.2 Contraction mappings 81
4.3 The distributional Bellman operator 85
4.4 Wasserstein distances for return functions 89
4.5 `
p
probability metrics and the Cramér distance 94
4.6 Sufficient conditions for contractivity 97
4.7 A matter of domain 101
4.8 Weak convergence of return functions* 105
4.9 Random variable Bellman operators* 107
4.10 Technical remarks 108
4.11 Bibliographical remarks 110
4.12 Exercises 112
5 Distributional Dynamic Programming 119
5.1 Computational model 119
5.2 Representing return-distribution functions 122
5.3 The empirical representation 124
5.4 The normal representation 129
5.5 Fixed-size empirical representations 132
5.6 The projection step 135
5.7 Distributional dynamic programming 140
5.8 Error due to diffusion 144
5.9 Convergence of distributional dynamic programming 146
5.10 Quality of the distributional approximation 150
5.11 Designing distributional dynamic programming algorithms 153
5.12 Technical remarks 154
5.13 Bibliographical remarks 160
5.14 Exercises 162
6 Incremental Algorithms 167
6.1 Computation and statistical estimation 168
6.2 From operators to incremental algorithms 169
6.3 Categorical temporal-difference learning 171
6.4 Quantile temporal-difference learning 174
6.5 An algorithmic template for theoretical analysis 178
6.6 The right step sizes 180
6.7 Overview of convergence analysis 183
6.8 Convergence of incremental algorithms* 186
6.9 Convergence of temporal-difference learning* 190
6.10 Convergence of categorical temporal-difference learning* 192
6.11 Technical remarks 196
6.12 Bibliographical remarks 197
6.13 Exercises 199
7 Control 205
7.1 Risk-neutral control 206
7.2 Value iteration and Q-learning 207
7.3 Distributional value iteration 210
7.4 Dynamics of distributional optimality operators 213
7.5 Dynamics in the presence of multiple optimal policies* 217
7.6 Risk and risk-sensitive control 221
7.7 Challenges in risk-sensitive control 224
7.8 Conditional value-at-risk* 226
7.9 Technical remarks 231
7.10 Bibliographical remarks 236
7.11 Exercises 239
8 Statistical Functionals 243
8.1 Statistical functionals 244
8.2 Moments 245
8.3 Bellman closedness 250
8.4 Statistical functional dynamic programming 254
8.5 Relationship with distributional dynamic programming 257
8.6 Expectile dynamic programming 258
8.7 Infinite collections of statistical functionals 260
8.8 Moment temporal-difference learning* 263
8.9 Technical remarks 264
8.10 Bibliographical remarks 266
8.11 Exercises 268
9 Linear Function Approximation 273
9.1 Function approximation and aliasing 274
9.2 Optimal linear value function approximations 276
9.3 A projected Bellman operator for linear value function approximation 278
9.4 Semi-gradient temporal-difference learning 283
9.5 Semi-gradient algorithms for distributional reinforcement learning 285
9.6 An algorithm based on signed distributions* 289
9.7 Convergence of the signed algorithm* 294
9.8 Technical remarks 298
9.9 Bibliographical remarks 300
9.10 Exercises 302
10 Deep Reinforcement Learning 307
10.1 Learning with a deep neural network 308
10.2 Distributional reinforcement learning with deep neural networks 312
10.3 Implicit parametrisations 315
10.4 Evaluation of deep reinforcement learning agents 318
10.5 How predictions shape state representations 323
10.6 Technical remarks 325
10.7 Bibliographical remarks 326
10.8 Exercises 329
11 Two Applications and a Conclusion 333
11.1 Multi-agent reinforcement learning 333
11.2 Computational neuroscience 337
11.3 Conclusion 344
11.4 Bibliographical remarks 344
Notation 347
References 355