7.6.
The notion of risk and risk-sensitive decisions can be traced back to
Markowitz [1952], who introduced the concept of trading off expected gains and
variations in those gains in the context of constructing an investment portfolio;
see also Steinbach [2001] for a retrospective. Artzner et al. [1999] propose
a collection of desirable characteristics that make a risk measure coherent
in the sense that it satisfies certain preference axioms. Of the risk measures
mentioned here, CVaR is coherent but the variance-constrained objective is
not. Artzner et al. [2007] discuss coherent risk measures in the context of
sequential decisions. Ruszczyński [2010] introduces the notion of dynamic risk
measures for Markov decision processes, which are amenable to optimisation
via Bellman-style recursions; see also Chow [2017] for a discussion of static
and dynamic risk measures as well as time consistency. Jiang and Powell [2018]
develop sample-based optimisation methods for dynamic risk measures based
on quantiles.
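To make the coherence claim above concrete, we recall the relevant definitions in a return convention, where larger values of the risk measure are preferred; the notation may differ slightly from that of the main text. A functional $\rho$ acting on a return variable $Z$ is coherent in the sense of Artzner et al. [1999] if it satisfies monotonicity ($Z \le Z'$ almost surely implies $\rho(Z) \le \rho(Z')$), translation invariance ($\rho(Z + c) = \rho(Z) + c$ for constants $c$), positive homogeneity ($\rho(\lambda Z) = \lambda\,\rho(Z)$ for $\lambda \ge 0$), and superadditivity ($\rho(Z + Z') \ge \rho(Z) + \rho(Z')$). The conditional value-at-risk
\[
\mathrm{CVaR}_\tau(Z) = \frac{1}{\tau} \int_0^\tau F_Z^{-1}(u)\,\mathrm{d}u , \qquad \tau \in (0, 1] ,
\]
satisfies these axioms, whereas variance-based criteria such as the variance-penalised objective $\mathbb{E}[Z] - \beta\,\mathrm{Var}(Z)$ fail monotonicity and positive homogeneity: a pointwise-larger but more variable return can receive a lower score.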
Howard and Matheson [1972] consider the optimisation of an exponential
utility function applied to the random return by means of policy iteration. The
same objective is given a distributional treatment by Chung and Sobel [1987].
Heger [1994] considers optimising for worst-case returns. Haskell and Jain
[2015] study the use of occupancy measures over augmented state spaces as an
approach for finding optimal policies for risk-sensitive control; similarly, an
occupancy measure-based approach to CVaR optimisation is studied by Carpin
et al. [2016]. Mihatsch and Neuneier [2002] and Shen et al. [2013] extend
Q-learning to the optimisation of recursive risk measures, where a base risk
measure is applied at each time step. Recursive risk measures are more easily
optimised than risk measures directly applied to the random return, but are not
as easily interpreted. Martin et al. [2020] consider combining distributional
reinforcement learning with the notion of second-order stochastic dominance
as a means of action selection. Quantile criteria are considered by Filar et al.
[1995] in the case of average-reward MDPs, and more recently by Gilbert et al.
[2017] and Li et al. [2021]. Delage and Mannor [2010] solve a risk-constrained
optimisation problem to handle uncertainty in a learned model’s parameters. See
Prashanth and Fu [2021] for a survey on risk-sensitive reinforcement learning.
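As a rough sketch of the recursive constructions mentioned above (the precise formulations of Ruszczyński [2010] and of Mihatsch and Neuneier [2002] differ in their details), the idea is to apply a base risk measure $\rho$ one step at a time rather than to the full random return, yielding a Bellman-style recursion of the form
\[
V^{\pi}(x) = \rho\big( R + \gamma V^{\pi}(X') \mid X = x \big) ,
\]
so that the overall criterion is the nested quantity $\rho\big(R_0 + \gamma\,\rho(R_1 + \gamma\,\rho(R_2 + \cdots))\big)$ rather than $\rho\big(\textstyle\sum_{t \ge 0} \gamma^t R_t\big)$. The recursion is what makes dynamic-programming and Q-learning-style updates available, at the cost of the less direct interpretation noted above.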
7.7.
Sobel [1982] establishes that an operator constructed directly from the
variance-penalised objective does not have the monotone improvement property,
making its optimisation more challenging. The examples demonstrating the need
for randomisation and a history-dependent policy are adapted from Mannor and
Tsitsiklis [2011], who also prove the NP-hardness of the problem of optimising
the variance-constrained objective. Tamar et al. [2012] propose a policy gradient
algorithm for optimising a mean-variance objective and for the CVaR objective