Contents
Preface ix
1 Introduction 1
1.1 Why Distributional Reinforcement Learning? . . . . . . . . . 2
1.2 An Example: Kuhn Poker . . . . . . . . . . . . . . . . . . . . 3
1.3 How Is Distributional Reinforcement Learning Different? . . . 5
1.4 Intended Audience and Organization . . . . . . . . . . . . . . 7
1.5 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 9
2 The Distribution of Returns 11
2.1 Random Variables and Their Probability Distributions . . . . . 11
2.2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . 14
2.3 The Pinball Model . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 The Return . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 The Bellman Equation . . . . . . . . . . . . . . . . . . . . . 25
2.6 Properties of the Random Trajectory . . . . . . . . . . . . . . 27
2.7 The Random-Variable Bellman Equation . . . . . . . . . . . . 30
2.8 From Random Variables to Probability Distributions . . . . . 33
2.9 Alternative Notions of the Return Distribution* . . . . . . . . 40
2.10 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 43
2.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3 Learning the Return Distribution 51
3.1 The Monte Carlo Method . . . . . . . . . . . . . . . . . . . . 52
3.2 Incremental Learning . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Temporal-Difference Learning . . . . . . . . . . . . . . . . 56
3.4 From Values to Probabilities . . . . . . . . . . . . . . . . . . 58
3.5 The Projection Step . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Categorical Temporal-Difference Learning . . . . . . . . . . 65
3.7 Learning to Control . . . . . . . . . . . . . . . . . . . . . . . 69
3.8 Further Considerations . . . . . . . . . . . . . . . . . . . . . 70
3.9 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 70
3.10 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 71
3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4 Operators and Metrics 77
4.1 The Bellman Operator . . . . . . . . . . . . . . . . . . . . . 78
4.2 Contraction Mappings . . . . . . . . . . . . . . . . . . . . . 79
4.3 The Distributional Bellman Operator . . . . . . . . . . . . . . 83
4.4 Wasserstein Distances for Return Functions . . . . . . . . . . 86
4.5 ℓp Probability Metrics and the Cramér Distance . . . . . . . 92
4.6 Sufficient Conditions for Contractivity . . . . . . . . . . . 95
4.7 A Matter of Domain . . . . . . . . . . . . . . . . . . . . . . . 98
4.8 Weak Convergence of Return Functions* . . . . . . . . . . . 102
4.9 Random-Variable Bellman Operators* . . . . . . . . . . . . . 104
4.10 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 105
4.11 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 106
4.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5 Distributional Dynamic Programming 115
5.1 Computational Model . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Representing Return-Distribution Functions . . . . . . . . . . 118
5.3 The Empirical Representation . . . . . . . . . . . . . . . . . 120
5.4 The Normal Representation . . . . . . . . . . . . . . . . . . . 125
5.5 Fixed-Size Empirical Representations . . . . . . . . . . . . . 128
5.6 The Projection Step . . . . . . . . . . . . . . . . . . . . . . . 130
5.7 Distributional Dynamic Programming . . . . . . . . . . . . . 135
5.8 Error Due to Diffusion . . . . . . . . . . . . . . . . . . . 138
5.9 Convergence of Distributional Dynamic Programming . . . . 141
5.10 Quality of the Distributional Approximation . . . . . . . . . . 145
5.11 Designing Distributional Dynamic Programming Algorithms . 147
5.12 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 148
5.13 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 154
5.14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6 Incremental Algorithms 161
6.1 Computation and Statistical Estimation . . . . . . . . . . . . . 161
6.2 From Operators to Incremental Algorithms . . . . . . . . . . 163
6.3 Categorical Temporal-Difference Learning . . . . . . . . . . 165
6.4 Quantile Temporal-Difference Learning . . . . . . . . . . . 167
6.5 An Algorithmic Template for Theoretical Analysis . . . . . . 171
6.6 The Right Step Sizes . . . . . . . . . . . . . . . . . . . . . . 174
6.7 Overview of Convergence Analysis . . . . . . . . . . . . . . . 176
6.8 Convergence of Incremental Algorithms* . . . . . . . . . . . 179
6.9 Convergence of Temporal-Difference Learning* . . . . . . . . 183
6.10 Convergence of Categorical Temporal-Difference Learning* . 185
6.11 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 188
6.12 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 190
6.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
7 Control 197
7.1 Risk-Neutral Control . . . . . . . . . . . . . . . . . . . . . . 197
7.2 Value Iteration and Q-Learning . . . . . . . . . . . . . . . . . 199
7.3 Distributional Value Iteration . . . . . . . . . . . . . . . . . . 202
7.4 Dynamics of Distributional Optimality Operators . . . . . . . 204
7.5 Dynamics in the Presence of Multiple Optimal Policies* . . . 209
7.6 Risk and Risk-Sensitive Control . . . . . . . . . . . . . . . . 213
7.7 Challenges in Risk-Sensitive Control . . . . . . . . . . . . . . 214
7.8 Conditional Value-at-Risk* . . . . . . . . . . . . . . . . . 218
7.9 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 222
7.10 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 227
7.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
8 Statistical Functionals 233
8.1 Statistical Functionals . . . . . . . . . . . . . . . . . . . . . . 234
8.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.3 Bellman Closedness . . . . . . . . . . . . . . . . . . . . . . . 239
8.4 Statistical Functional Dynamic Programming . . . . . . . . . 244
8.5 Relationship to Distributional Dynamic Programming . . . . . 247
8.6 Expectile Dynamic Programming . . . . . . . . . . . . . . . . 248
8.7 Infinite Collections of Statistical Functionals . . . . . . . . . . 250
8.8 Moment Temporal-Difference Learning* . . . . . . . . . . . . 252
8.9 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 254
8.10 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 255
8.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
9 Linear Function Approximation 261
9.1 Function Approximation and Aliasing . . . . . . . . . . . . . 262
9.2 Optimal Linear Value Function Approximations . . . . . . . . 264
9.3 A Projected Bellman Operator for Linear Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . . . 266
9.4 Semi-Gradient Temporal-Difference Learning . . . . . . . . . 270
9.5 Semi-Gradient Algorithms for Distributional Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
9.6 An Algorithm Based on Signed Distributions* . . . . . . . . . 277
9.7 Convergence of the Signed Algorithm* . . . . . . . . . . . . 281
9.8 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 285
9.9 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 287
9.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
10 Deep Reinforcement Learning 293
10.1 Learning with a Deep Neural Network . . . . . . . . . . . . . 294
10.2 Distributional Reinforcement Learning with Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
10.3 Implicit Parameterizations . . . . . . . . . . . . . . . . . . . 301
10.4 Evaluation of Deep Reinforcement Learning Agents . . . . . . 304
10.5 How Predictions Shape State Representations . . . . . . . . . 309
10.6 Technical Remarks . . . . . . . . . . . . . . . . . . . . . . . 311
10.7 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 312
10.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
11 Two Applications and a Conclusion 319
11.1 Multiagent Reinforcement Learning . . . . . . . . . . . . . . 319
11.2 Computational Neuroscience . . . . . . . . . . . . . . . . . . 323
11.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
11.4 Bibliographical Remarks . . . . . . . . . . . . . . . . . . . . 330
Notation 333
References 337
Index 365