Tim Sullivan

Outlines of Some Research Topics

My general area of research is uncertainty quantification, at the intersection of applied mathematics, probability, statistics, and computational science. The brief descriptions below outline some of my research interests, past work, and ideas for the future. If you are interested in any of these areas, then please get in touch!

A common theme underlying many of my current research interests is to add ‘the right amount of uncertainty’ to an otherwise certain or near-certain prediction, in order to correctly compensate for over-confidence in the modelling process. In the words of Julian Baggini:

The mark of a mature, psychologically healthy mind is indeed the ability to live with uncertainty and ambiguity, but only as much as there really is.

Probabilistic Numerical Methods

There is a long tradition of giving probabilistic solutions to deterministic problems such as quadrature, linear algebra, optimisation, etc.; early contributions in this direction date back to Poincaré at the start of the 20th Century, continued by the work of Larkin in the 1970s, and seminal papers of Diaconis (1988) and O'Hagan (1992). For example, a Bayesian approach to computing the integral of some function \(f \colon [a, b] \to \mathbb{R}\) would be to specify a prior probability measure on a suitable space of integrands, treat the evaluations \(y_{i} = f(x_{i})\) of \(f\) at \(n\) quadrature nodes \(x_{i} \in [a, b]\) as observed data, and then form the posterior distribution

\( \mathbb{P} \left( \int_{a}^{b} f(x) \, \mathrm{d} x \,\middle|\, f(x_{i}) = y_{i} \text{ for } i = 1, \dots, n \right) . \)

The expected value of this distribution gives an estimate of the value of the integral, and the variance can be used as a measure of accuracy. Indeed, a now-classical result of Sul′din shows that, under the Brownian motion prior on \(f\), the posterior mean / MAP estimator for \(\int_{a}^{b} f(x) \, \mathrm{d} x\) is the elementary trapezoidal rule for quadrature — there is an underlying posterior mean estimator for \(f\), namely the piecewise-linear interpolant of the data \(f(x_{i}) = y_{i}\), and the integral of this interpolant is the posterior mean estimator of \(\int_{a}^{b} f(x) \, \mathrm{d} x\). The moral is that many deterministic numerical methods can be seen as point estimators for statistical procedures; conversely, statistical procedures offer a richer quantification of uncertainty in numerical tasks, and this UQ can be fed forward into other parts of scientific pipelines (e.g. inverse problems).

This probabilistic perpective on numerical analysis has received renewed attention in the last five years from the statistical and machine learning communities (Hennig et al., 2015), with an increasing emphasis on mathematical analysis and foundations, as well as new classes of methods.

Mathematical and Statistical Foundations

On a general level, I am interested in foundational statistical questions, e.g. what it means for a probabilistic numerical method to be Bayesian, and how to exactly or approximately realise the Bayesian posterior for a probabilistic numerical method (Cockayne et al., 2017). I am also interested in concrete applications, e.g. to data-centric engineering, molecular dynamics, atomistics, and other multiscale modelling problems.

Applications to Inverse Problems

One motivation for the use of probabilistic numerical methods is to quantify the impact of discretisation error in a statistical way, in particular when they are used as solvers in Bayesian inverse problems. To properly understand this impact, one needs to extend the analysis of Bayesian inverse problems to consider the use of randomised forward models and log-likelihoods (Lie, Sullivan, and Teckentrup, 2017).

Differential Equations

A fully Bayesian perspective on the solution of ODEs was advocated by Skilling (1992). As is often the case with dynamical systems that have a “time's arrow”, a fully Bayesian perspective is often too demanding, and a sequential filtering-based approach is more appropriate (Schober et al., 2014). However, another approach is based in the perturbation of the numerical flow, as advocated by Conrad et al. (2016), who pointed out the problems associated with over-confident inferences arising from under-resolved models, and established an optimal scaling for Gaussian process perturbations of standard numerical schemes (e.g. implicit Euler or RK4) that introduces maximal uncertainty without changing the deterministic convergence rate for the mean. Thus, for an ODE

\( u'(t) = f(u(t)) \)

and a time step \(\tau > 0\), one replaces the deterministic numerical integrator (discrete time flow map \(\Psi^{\tau}\))

\( U_{k + 1} = \Psi^{\tau}(U_{k}), \)

which generates deterministic approximations \(U_{k} \approx u(k \tau)\) by a randomised integrator

\( U_{k + 1} = \Psi^{\tau}(U_{k}) + \xi_{k}(\tau), \)

where the \(\xi_{k}(t)\) are suitably-chosen stochastic processes: one desirable property is that if \(\Psi^{\tau}\) has truncation error of order \(q\), then so too should \(\mathbb{E}[U]\) — though, of course, the randomised trajectories will exhibit some ‘spread’ about the mean. This approach provides a statistical characterisation of the impact of numerical errors in the integration scheme, which is highly desirable in Bayesian inverse problems, in which the numerical solution \(U\) may be used to help infer important physical parameters.

I am interested in furthering this line of work, both in developing the theory of probabilistic integrators for more general ordinary (Lie, Stuart, and Sullivan, 2017; Teymur, Lie, Sullivan, and Calderhead, 2018) and partial differential equations (Cockayne, Oates, Sullivan, and Girolami, 2016; 2017).

Linear Algebra

A major computational bottleneck in the application of Gaussian process techniques in machine learning and artificial intelligence is the cost of inverting a large square matrix of the form

\( \Theta = \begin{bmatrix} k(x_{1}, x_{1}) & \cdots & k(x_{1}, x_{N}) \\ \vdots & \ddots & \vdots \\ k(x_{N}, x_{1}) & \cdots & k(x_{N}, x_{N}) \end{bmatrix} \in \mathbb{R}^{N \times N} , \)

where \(\{ x_{1}, \dots, x_{N} \}\) is a data set and \(k\) is a covariance kernel function.

In Schäfer, Sullivan, and Owhadi (2017), a probabilistic interpretation of Cholesky factorisation motivates a constructive choice of near-linearly many entries of the \(N^{2}\) entries of \(\Theta\), followed by a zero-fill-in incomplete Cholesky factorisation, yielding an exponentially accurate approximate inverse for \(\Theta\) at near-linear computational cost.

Bayesian Inverse Problems

Bayesian procedures are popular methods for the quantification of uncertainty, especially in inverse problems. In many modern applications, the underlying models are infinite-dimensional, and discretisation only enters as a computational consideration. The general form of an inverse problem is, given spaces \(\mathcal{U}\) and \(\mathcal{Y}\) and a known operator \(G \colon \mathcal{U} \to \mathcal{Y}\), to recover \(u \in \mathcal{U}\) from \(y \in \mathcal{Y}\), where \(y\) is a (perhaps noisily corrupted version of) \(G(u)\). A simple example is the additive noise model

\( y = G(u) + \eta , \)

where \(\mathcal{Y}\) is a vector space and \(\eta\) is a \(\mathcal{Y}\)-valued random variable. In modern applications, the forward model \(G\) may correspond to solving an ordinary or partial differential equation, and the parameter \(u\) may be a scalar, vector, or tensor field — and so \(\mathcal{U}\) is an infinite-dimensional space. Inverse problems are typically ill-posed in the strict sense (i.e. no such \(u \in \mathcal{U}\) exists, or there are multiple solutions, or they depend sensitively upon the problem setup. In the Bayesian paradigm, closely related to the tradition of regularisation, the inverse problem is rendered well-posed by seeking a conditional posterior probability distribution for \(u\) given \(y\), not just a point estimate. In modern terminology, following seminal contributions of Stuart (2010), the Bayesian posterior distribution \(\mu^{y}\) on \(\mathcal{U}\) of \(u\) given the data \(y\) is given in terms of its density with respect to the prior distribution \(\mu_{0}\):

\( \begin{align*} \dfrac{\mathrm{d} \mu^{y}}{\mathrm{d} \mu_{0}} (u) & = \dfrac{\exp(- \Phi(u; y))}{Z(y)} , \\ Z(y) & = \int_{\mathcal{U}} \exp(- \Phi(u; y)) \, \mu_{0} (\mathrm{d} u) , \end{align*} \)

where \(\Phi(\cdot; y) \colon \mathcal{U} \to \mathbb{R}\) denotes the negative log-likelihood function (misfit) for the data \(y\). The basic example, when the data space \(\mathcal{Y}\) is finite-dimensional and \(\eta \sim \mathcal{N}(0, \Gamma)\) is centred additive Gaussian noise of covariance \(\Gamma\), is the quadratic misfit functional

\( \Phi(u; y) = \dfrac{1}{2} \| \Gamma^{-1/2} ( y - G(u) ) \|^{2} . \)

Stability and Consistency of Bayesian inverse problems

From both frequentist and numerical analytic points of view, it is natural to ask whether Bayesian methods are well-defined, stable, and consistent:

  • well-definedness: does \(\mu^{y}\), as defined above, really correspond to the notion of conditioning of probability measures, and are there any issues of uniquness?
  • stability: does \(\mu^{y}\) change “only a little bit” if the data \(y\), the log-likelihood \(\Phi\), forward model \(G\), or prior \(\mu_{0}\) are slightly altered?
  • consistency: if exposed to enough high-quality data generated by a “true” data-generating parameter value \(u^{\dagger}\), will the posterior measure \(\mu^{y}\) concentrate on \(u^{\dagger}\)?

In this sense, what kinds of priors and likelihoods make for good modelling choices, and also lead to well-posed inverse problems? This is a subtle topic with many positive and negative results (Owhadi, Scovel, and Sullivan, 2015a, 2015b). I am also interested in the use of heavy-tailed priors (Sullivan, 2017; Lie and Sullivan, 2018) and randomised models (Lie, Sullivan, and Teckentrup, 2017) in Bayesian inverse problems.

Maximum a Posteriori Estimators

A general probability measure \(\mu\), such as a Bayesian posterior, is generally too complicated to realise exactly, so some kinds of low-order summary statistics must be used. Expected values and covariances are common choices for such summaries; another choice is a “most likely point under \(\mu\)”, a so-called mode or maximum a posteriori estimator. In finite-dimensional spaces \(\mathcal{U}\), a mode is simply a maximiser of the probability density function of \(\mu\). In general, though, there is no Lebesgue measure with which to take densities, so modes must be defined intrinsically.

As is often the case in functional analysis, there are multiple ways of defining modes on an infinite-dimensional space \(\mathcal{U}\), e.g. a separable Banach space. Dashti et al. (2013) proposed defining a strong mode of \(\mu\) to be any point \(u^{\ast} \in \mathcal{U}\) such that

\( \displaystyle \lim_{r \to 0} \frac{\sup_{v \in \mathcal{U}} \mu(B(v, r))}{\mu(B(u^{\ast}, r))} = 1 . \)

In later work, Helin and Burger (2015) proposed a weak mode with respect to a dense subspace \(E \subset \mathcal{U}\). It turns out that, under reasonable conditions, every weak mode is a strong mode and vice versa (Lie and Sullivan, 2017). On the other hand, there are pathological measures, even in finite-dimensional spaces such as \(\mathcal{U} = \mathbb{R}^{2}\), that have weak modes but no strong modes, and so in general these two notions are distinct.

Exploration of Bayesian Posterior Distributions

In principle, methods such as Markov chain Monte Carlo (MCMC) can be used to explore and sample an arbitrary probability distribution \(\mu\) on a space \(\mathcal{U}\), such as a Bayesian posterior distribution \(\mu^{y}\). However, these methods often converge poorly — if at all — in practice when \(\mathcal{U}\) has high dimension, or when \(\mu\) is concentrated on a low-dimensional subset of \(\mathcal{U}\). In such situations, it is desirable to approximate \(\mu\) by any reasonable means in order to obtain an informative approximate solution in finite time. Methods of interest include

Optimal Distributionally-Robust Uncertainty Quantification

In its most general form, uncertainty quantification can be cast as a constrained optimisation problem over a large space of parameters, functions, and probability measures, with information acting as constraints on the feasible set (Owhadi et al., 2013; Sullivan et al., 2013). Typically, one is interested in the expected value (and other moments) of some quantity of interest \(q\), as the probability distribution \(\mu\) of the inputs \(X\) and any associated response function \(f\) range over a general admissible set \(\mathcal{A}\); that is, we need to calculate and minimise/maximise

\( \mathbb{E}_{X \sim \mu}[q(X, f(X))], \quad (\mu, f) \in \mathcal{A}. \)

The admissible set \(\mathcal{A}\) corresponds to the available information about the system of interest: for example, one might know that the mean of \(X\) must lie between \(a\) and \(b\), and so the constraint \(a \leq \mathbb{E}_{X \sim \mu}[X] \leq b\) would form part of the definition of \(\mathcal{A}\). Similarly, one one could consider higher-order statistics related to \(\mu\), and information about the response function \(f\), e.g. numerical models and approximations, archives of legacy data, and so on (Sullivan et al., 2013; Kamga et al., 2014).

In principle, the optimisation problem over \((\mu, f) \in \mathcal{A}\) is infinite-dimensional, and therefore computationally intractable. However, extensions of classical Choquet theory (Owhadi et al., 2013) can be used to show that, under generalised independence and moment constraints on the probability measures \(\mu\), this optimisation problem reduces to a search over a finite-dimensional effective admissible set \(\mathcal{A}_{\Delta} \subset \mathcal{A}\), and this reduced problem can be solved numerically — in some cases, it is even a convex problem (Han et al., 2015). In the reduced setting \(\mathcal{A}_{\Delta}\), the measures \(\mu\) are discrete measures with finite support, and the values of \(f\) are only needed on the finitely many points where \(\mu\) assigns positive probability mass:

\( \mu \to \displaystyle \sum_{i = 0}^{N} w_{i} \delta_{x_{i}}, \quad f \to (x_{i}, f(x_{i}))_{i = 0}^{N} . \)

For the forward uncertainty propagation problem, this approach allows for the optimal propagation of distributional uncertainty (e.g. histogram envelopes) through general forward models. In practice, this opens up new possibilities for probabilistic reliability forecasting and quality control. For inverse problems, this approach allows one to study the robustness of Bayesian inversion to perturbations of the prior and likelihood model.

Concentration of Measure

A powerful heuristic first observed by Lévy (1951) and Milman (1971) is that a ‘nice’ function of many independent or weakly correlated random variables has overwhelming probability to take values near its mean, median, or a low-dimensional subset of its range. This has implications for uncertainty quantification of complex systems: in situations where concentration of measure results can be proved, a large stochastic system becomes effectively deterministic. In the past, I have worked on applications of concentration-of-measure methods to the certification of terminal ballistics (Sullivan et al., 2011; Adams et al., 2012; Kidane et al., 2012).