Contents
Preface ix
1 Introduction 1
1.1 Why distributional reinforcement learning? 2
1.2 An example: Kuhn poker 3
1.3 How is distributional reinforcement learning different? 5
1.4 Intended audience and organisation 7
1.5 Bibliographical remarks 9
2 The Distribution of Returns 11
2.1 Random variables and their probability distributions 11
2.2 Markov decision processes 14
2.3 The pinball model 16
2.4 The return 19
2.5 The Bellman equation 25
2.6 Properties of the random trajectory 27
2.7 The random-variable Bellman equation 31
2.8 From random variables to probability distributions 34
2.9 Alternative notions of the return distribution* 41
2.10 Technical remarks 42
2.11 Bibliographical remarks 44
2.12 Exercises 46
3 Learning the Return Distribution 53
3.1 The Monte Carlo method 54
3.2 Incremental learning 56
3.3 Temporal-difference learning 59
3.4 From values to probabilities 61
3.5 The projection step 62
3.6 Categorical temporal-difference learning 67
3.7 Learning to control 71
3.8 Further considerations 72
3.9 Technical remarks 73
3.10 Bibliographical remarks 73
3.11 Exercises 75
4 Operators and Metrics 79
4.1 The Bellman operator 80
4.2 Contraction mappings 81
4.3 The distributional Bellman operator 85
4.4 Wasserstein distances for return functions 89
4.5 `
p
probability metrics and the Cramér distance 94
4.6 Sufficient conditions for contractivity 97
4.7 A matter of domain 101
4.8 Weak convergence of return functions* 105
4.9 Random variable Bellman operators* 107
4.10 Technical remarks 108
4.11 Bibliographical remarks 110
4.12 Exercises 112
5 Distributional Dynamic Programming 119
5.1 Computational model 119
5.2 Representing return-distribution functions 122
5.3 The empirical representation 124
5.4 The normal representation 129
5.5 Fixed-size empirical representations 132
5.6 The projection step 135
5.7 Distributional dynamic programming 140
5.8 Error due to diffusion 144
5.9 Convergence of distributional dynamic programming 146
5.10 Quality of the distributional approximation 150
5.11 Designing distributional dynamic programming algorithms 153
5.12 Technical remarks 154
5.13 Bibliographical remarks 160
5.14 Exercises 162
6 Incremental Algorithms 167
6.1 Computation and statistical estimation 168
6.2 From operators to incremental algorithms 169
6.3 Categorical temporal-difference learning 171
6.4 Quantile temporal-difference learning 174
6.5 An algorithmic template for theoretical analysis 178
6.6 The right step sizes 180
6.7 Overview of convergence analysis 183
6.8 Convergence of incremental algorithms* 186
6.9 Convergence of temporal-difference learning* 190
6.10 Convergence of categorical temporal-difference learning* 192
6.11 Technical remarks 196
6.12 Bibliographical remarks 197
6.13 Exercises 199
7 Control 205
7.1 Risk-neutral control 206
7.2 Value iteration and Q-learning 207
7.3 Distributional value iteration 210
7.4 Dynamics of distributional optimality operators 213
7.5 Dynamics in the presence of multiple optimal policies* 217
7.6 Risk and risk-sensitive control 221
7.7 Challenges in risk-sensitive control 224
7.8 Conditional value-at-risk* 226
7.9 Technical remarks 231
7.10 Bibliographical remarks 236
7.11 Exercises 239
8 Statistical Functionals 243
8.1 Statistical functionals 244
8.2 Moments 245
8.3 Bellman closedness 250
8.4 Statistical functional dynamic programming 254
8.5 Relationship with distributional dynamic programming 257
8.6 Expectile dynamic programming 258
8.7 Infinite collections of statistical functionals 260
8.8 Moment temporal-difference learning* 263
8.9 Technical remarks 264
8.10 Bibliographical remarks 266
8.11 Exercises 268
9 Linear Function Approximation 273
9.1 Function approximation and aliasing 274
9.2 Optimal linear value function approximations 276
9.3 A projected Bellman operator for linear value function approximation 278
9.4 Semi-gradient temporal-difference learning 283
9.5 Semi-gradient algorithms for distributional reinforcement learning 285
9.6 An algorithm based on signed distributions* 289
9.7 Convergence of the signed algorithm* 294
9.8 Technical remarks 298
9.9 Bibliographical remarks 300
9.10 Exercises 302
10 Deep Reinforcement Learning 307
10.1 Learning with a deep neural network 308
10.2 Distributional reinforcement learning with deep neural networks 312
10.3 Implicit parametrisations 315
10.4 Evaluation of deep reinforcement learning agents 318
10.5 How predictions shape state representations 323
10.6 Technical remarks 325
10.7 Bibliographical remarks 326
10.8 Exercises 329
11 Two Applications and a Conclusion 333
11.1 Multi-agent reinforcement learning 333
11.2 Computational neuroscience 337
11.3 Conclusion 344
11.4 Bibliographical remarks 344
Notation 347
References 355