A robust, discrete-gradient descent procedure for optimisation with time-dependent PDE and norm constraints

Many physical questions in fluid dynamics can be recast as norm-constrained optimisation problems, which in turn can be further recast as unconstrained problems on spherical manifolds. Owing to the nonlinearities of the governing PDEs and the computational cost of performing optimal control on such systems, improving the numerical convergence of the optimisation procedure is crucial. Borrowing tools from the optimisation-on-manifolds community, we outline a numerically consistent, discrete formulation of the direct-adjoint looping method, accompanied by gradient descent and line-search algorithms with global convergence guarantees. We numerically demonstrate the robustness of this formulation on three example problems of relevance in fluid dynamics and provide an accompanying library, SphereManOpt.


Introduction
Applications of norm-constrained, nonlinear optimal control abound in fluid dynamics, spanning the improvement of transport and mixing efficiencies in complex fluids (e.g. [17,12]), the stability analysis of thermo-acoustic oscillations (e.g. [27]), the modelling of transitions between a dynamical system's multiple basins of attraction (e.g. [33,28]) and the forecasting of oceanographic systems (e.g. [53,42,4]). Because the governing partial differential equations (PDEs) are fully nonlinear, solutions to the corresponding optimisation problems must in general be calculated numerically. Due to the large dimension of both the control and state vectors and the significant cost of direct numerical simulations, efficient optimisation routines must be applied to address such problems. As our motivation is essentially numerical, we restrict our attention to finite-dimensional problems in what follows.
Given a discretised set of equations describing the evolution of the state vector X ∈ R^n, a general formulation of such a norm-constrained problem is

min_{X_0} J(X, X_0) subject to G(X, X_0) ≡ dX/dt − F(X, t) = 0, ||X_0||^2 = E_0, (1.1)

where X_0 ∈ R^n is the control variable we seek to optimise, and E_0 its energy. The objective function J(X, X_0) is a quantity of interest to be minimised (or maximised), and F(X, t) is a nonlinear operator describing how the system evolves in time. For example, X_0 can be the initial velocity field of kinetic energy E_0, yielding optimal mixing over a finite time window t ∈ (0, T] as the flow evolves according to its governing equations. In this case, well-mixedness can be achieved by minimising some measure of the fluid heterogeneity (such as the scale-sensitive mix-norms [38,54,24]). In the context of stability analysis, X_0 can be the initial perturbation which, for a given energy input, maximises for example the nonlinear growth of perturbations, thus allowing for a better characterisation of laminar-turbulent transitions in shear flows [46,28]. Although (1.1) assumes control of the initial condition only, this can easily be modified to consider a time-dependent forcing or an alternative control variable.
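As a minimal illustration of the norm constraint in (1.1), the following Python sketch rescales a random control vector so that its squared norm equals the energy budget E_0. The helper name normalise_energy is ours, for illustration only, and is not part of SphereManOpt:

```python
import numpy as np

def normalise_energy(X0, E0):
    """Rescale a control vector so that ||X0||^2 = E0 (illustrative helper)."""
    return X0 * np.sqrt(E0 / np.dot(X0, X0))

# A random discrete control of dimension n = 128, rescaled to energy E0 = 0.5
rng = np.random.default_rng(0)
X0 = normalise_energy(rng.standard_normal(128), E0=0.5)
```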

Adjoint formalism
To solve the optimisation problem (1.1) one formulates the Lagrangian L, related to the pure objective functional, as

L(X, X_0; q, λ) ≡ Ĵ(X_0) + λ c(X_0) = J(X, X_0) + ∫_0^T ⟨q, G(X, X_0)⟩ dt + λ(||X_0||^2 − E_0), (1.2)

where q, λ denote the Lagrange multipliers (also termed adjoint variables) used to enforce these constraints, ⟨a, b⟩ denotes a vector inner product and ||a||^2 = ⟨a, a⟩ the associated norm. Here we choose to denote the pure objective functional by Ĵ(X_0), thus assuming X(X_0) is implicitly defined for all X_0 by the constraint equations. Any stationary point of L satisfies the following Euler-Lagrange equations:

δL/δλ = ||X_0||^2 − E_0 = 0, (1.3)
δL/δq = G(X, X_0) = 0, (1.4)
δL/δX_T = δJ/δX_T + q_T = 0, (1.5)
δL/δX = δJ/δX − dq/dt − (∂F/∂X)^† q = 0, (1.6)
δL/δX_0 = δJ/δX_0 − q_0 + 2λX_0 = 0, (1.7)

where X_T, q_T refer to the vectors at the final time, while the † notation is used to denote the adjoint operator such that ⟨x, Ly⟩ = ⟨L^† x, y⟩. In general, reaching a stationary point of (1.2) demands an iterative procedure, whereby we solve (1.3) to (1.7) from top to bottom as given. That is: a random initial vector X_0 is normalised such that ||X_0||^2 = E_0, and the PDE constraint equations are solved forward in time from t = 0 → T using an appropriate time-stepping scheme [3,2], whilst simultaneously calculating Ĵ(X_0). Subject to the final time condition (1.5), the adjoint equations (1.6) are then solved backwards from t = T → 0.
Denoting the current and subsequent iterates of a gradient descent procedure by k and k + 1 respectively, the control parameter is finally updated using the gradient δL/δX_0^k at iteration k, via

X_0^{k+1} = X_0^k − α (δL/δX_0^k)(λ), (1.8)

where α ∈ R^+ is an appropriately chosen step-size to ensure descent [59] and λ is determined to enforce

||X_0^{k+1}||^2 = E_0. (1.9)

The first challenge when performing gradient descent is to accurately and efficiently estimate the gradient δL/δX_0^k (λ). Using open-source automatic differentiation (AD) to obtain discrete gradients is one possible solution [23]. While this approach works well on forward codes written so as to be amenable to AD, such as dolfin-adjoint for example [13,40], it can otherwise consume large amounts of memory and be extremely challenging to debug [43,44]. Moreover, using AD to differentiate MPI-parallel codes also raises important difficulties regarding how to manage communication dataflow and dependencies [56,39], which in practice hinders its applicability to computationally expensive fluid dynamics problems. When AD is impractical for some of the reasons mentioned above, the alternative approach is to derive the discrete adjoint equations analytically using summation by parts. This approach, albeit established in the optimisation literature [20,61], has until recently been less discussed in the aeronautics [60,18] and fluid dynamics literature [57,51,30,14].
Even when armed with an accurate gradient, however, we must move in the correct direction and use an efficient gradient descent routine. Typically one enforces the equality constraint c(X_0) using a Lagrange multiplier λ as outlined above. Given, however, that this constraint naturally restricts X_0 to a spherical manifold, we can implement a consistent and therefore more efficient routine by exploiting this structure [1,48], which we recall in section 3.

Issue 1: Discrete versus continuous
To highlight the inconsistencies that typically arise when discretising the previous (continuous) derivation, we recall that, although a continuous linear operator L possesses an adjoint L^† defined through integration by parts, the matrix obtained by discretising L^† generally differs from the adjoint of the matrix obtained by discretising L, L^†_{i,j} ≠ (L_{i,j})^†, depending on the discretisation used [25,34]. Similarly, integration by parts of the temporal derivative followed by discretisation does not generally commute with discretisation followed by discrete integration (summation) by parts. To clarify this issue we consider a discrete Lagrangian consisting of a linear constraint equation X_t = LX, discretised using a Crank-Nicolson temporal scheme with Δt = T/N and X = (X_0, ..., X_N), where the constraint equations MX = f (incorporating the constraint on the initial condition) are now written in discrete form. Taking variations with respect to X we obtain M^† q = δJ/δX + 2λX_0, which in component form yields the discrete Euler-Lagrange equations (1.12).

PM Mannix, CS Skene, D Auroux, & F Marcotte
Comparing these discrete Euler-Lagrange equations (1.12) with their continuous counterpart (1.5)-(1.7), one notices that: (1) the adjoint reverses the direction of information propagation; (2) the discrete adjoint of a centred differencing scheme (here for the temporal derivative) would remain consistent with its continuous adjoint if and only if L = L^†; and (3) the discrete terminal/compatibility conditions at t = 0, T differ from their continuous counterpart at n = 0, N. As we will show in this paper, the latter issue is compounded for higher-order schemes, which contribute additional variations to more of the top and bottom rows. Throughout the remainder of this paper the term discrete gradient will refer to the approach where the adjoint numerical scheme is obtained by differentiating the already-discretised direct equations. Conversely, continuous gradient will refer to the approach where the adjoint equations are first derived from the continuous problem and then discretised.
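The discrete summation-by-parts identity underlying points (1) and (3) can be verified directly; the toy sketch below shows that the adjoint of a forward temporal difference is a backward difference (information propagating in the reverse direction) plus boundary terms at n = 0 and n = N − 1:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
X, q = rng.standard_normal(N), rng.standard_normal(N)

# Discrete summation by parts for the forward difference:
#   sum_{n=0}^{N-2} q_n (X_{n+1} - X_n)
#     = q_{N-2} X_{N-1} - q_0 X_0  -  sum_{n=1}^{N-2} (q_n - q_{n-1}) X_n
# i.e. a backward difference acting on q, plus two boundary terms that
# become the discrete terminal/compatibility conditions.
lhs = sum(q[n] * (X[n + 1] - X[n]) for n in range(N - 1))
rhs = (q[N - 2] * X[N - 1] - q[0] * X[0]
       - sum((q[n] - q[n - 1]) * X[n] for n in range(1, N - 1)))
```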
A scheme for which the solutions to both approaches coincide is termed adjoint/dual consistent [22,5].

Issue 2: Constrained gradient update
The second issue is that we still have a norm-constrained update step (1.9). For every choice of step-size α we must apply the normalisation constraint to choose a corresponding λ. This precludes the direct application of line-search algorithms for unconstrained optimisation, which allow us to choose step-sizes α guaranteeing descent [59]. Aside from the inapplicability of standard line-search algorithms, further issues exist with the Lagrange multiplier approach. Using ∇f = δf/δX_0 to denote the variational derivative, a necessary condition for an optimum is that a λ exists such that

∇Ĵ(X_0) + λ∇c(X_0) = 0, (1.13)

implying that the gradient of the objective ∇Ĵ(X_0) and that of the norm constraint ∇c(X_0) are parallel. To render this condition sufficient requires that λ satisfy the Karush-Kuhn-Tucker (KKT) conditions [59], which in turn requires computing the Hessian of (1.13). If ∇Ĵ(X_0), ∇c(X_0) are not parallel, (1.13) is not satisfied and we ideally, according to a first-order approximation, choose a search direction d such that ∇Ĵ(X_0)^T d < 0 and ∇c(X_0)^T d = 0. A good candidate, which also defines the gradient on the spherical manifold S [7], is the tangent vector, as it satisfies the previous inequality and equality constraints but, crucially, is independent of λ. Thus by restricting X_0 to S and updating X_0 along geodesics of S for all values of the step size α [11], we can omit the equality constraint in (1.2). Furthermore ∇Ĵ(X_0) = 0 now provides a necessary condition for a local minimiser [7]. What remains, following [49,48], is to appropriately adapt the existing line-search algorithms such that updating X_0^k leads to global convergence, and to assess their efficiency on a suite of non-trivial fluid dynamics examples. Aside from the Pymanopt library for optimisation on matrix manifolds [55], SLSQP [31] presents the only alternative for problems with equality constraints. Owing to the construction, storage O(n^2) and inversion O(n^3) of the Hessian matrix, SLSQP is numerically too demanding to handle high-dimensional control variables [59,26]. Conversely, while Pymanopt implements computationally efficient methods, its line-search does not satisfy the conditions necessary to guarantee global convergence [49].
In what follows, we address the previously highlighted issues and test their numerical remedy against a suite of physical problems. Given the additional complexity represented by the derivation of a consistent, discrete adjoint numerical scheme, compared to a continuous gradient approach (where the same numerical scheme is used to discretise the forward and backward equations), we also conduct a systematic numerical comparison between discrete and continuous adjoint optimisation, which, excluding the studies of [57] and [15], is currently lacking. To this end, we first derive the discrete adjoint equations in section 2; following [48] we then outline a consistent gradient update with guaranteed global descent in section 3; and finally we combine these developments in section 4 to demonstrate their superior performance. Perspectives for further work are discussed in the concluding section.

Discrete adjoint equations
A set of spatially discretised coupled evolution equations can be written as

M dX/dt + LX = F(X, t), (2.1)

where X is the vector of unknowns, M, L are linear operators encompassing both the constraint equations and system boundary conditions, and the right-hand side F encompasses all nonlinear and non-autonomous terms. A general s-step multistep implicit-explicit (IMEX) method [3] then reads

Σ_{j=0}^{s} (a_j M + b_j L) X^{n+j} = Σ_{j=0}^{s−1} c_j F(X^{n+j}, t^{n+j}), (2.2)

where index n denotes the n-th time step, j the internal stages, and the right-hand side is evaluated and treated as a vector. Notably, when initiating time-stepping from a single initial condition we have insufficient data to use second or higher-order schemes directly and must revert to a first-order scheme. While diagonally implicit Runge-Kutta (DIRK/SDIRK) integration schemes [2] solve this issue and are L-stable, their numerical overhead, both in terms of CPU and memory, is much higher, and their discrete adjoint, derived by [20,47,57,60], grows in complexity more quickly. We first present the discrete adjoint equations in their full generality, and subsequently consider their numerical performance on particular examples by making comparisons with their continuous formulation. As multistep schemes of order s > 2 are not A-stable and objective functions can be time-integrated with O(Δt^2) accuracy at best, higher-order schemes are not considered. We note that the adjoint of a subclass of IMEX schemes known as backward differencing schemes (BDF) has previously been derived up to third order by [45].

A second order IMEX scheme
Partitioning the time interval t ∈ (0, T] as t_n = nΔt, n ∈ (0, N = T/Δt], the discrete Lagrangian, with normalisation constraints omitted, is formed as in (1.2), where tilde notation is used to denote the coefficients retained if considering a first-order scheme only, and the objective function is for example given either by a terminal-time measure J = ⟨X^N, W X^N⟩ or by its time-integrated counterpart, where W is a known matrix operator. Taking variations with respect to the adjoint variables q^n and setting δL/δq^n = 0, we recover the initial condition and the constraint equations, while with respect to the primal variables X^n we obtain (2.5), where we have permuted the indices in each of these terms and ∂F/∂X^n denotes the Jacobian of the nonlinear term. Grouping terms of (2.5) by their indices and setting them equal to zero, we obtain the discrete Euler-Lagrange equations (2.6)-(2.9). The solution of the O(Δt)-accurate counterpart of these equations using a preconditioned scheme, applicable for example to discretisations of wall-bounded domains, is given in Appendix A. Concentrating on the minor and feasible modifications of simple schemes which help achieve accurate convergence, we have implemented primarily first-order schemes in this paper.
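To illustrate the discretise-then-differentiate recipe in a setting small enough to check by finite differences, the sketch below applies it to a linear system advanced by Crank-Nicolson with a quadratic terminal cost. The operator and cost are arbitrary toy choices, standing in for the general equations (2.6)-(2.9):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, dt = 5, 20, 0.05
L = rng.standard_normal((n, n))       # a generic, non-self-adjoint operator
I = np.eye(n)
# One Crank-Nicolson step: (I - dt/2 L) X^{n+1} = (I + dt/2 L) X^n
A = np.linalg.solve(I - 0.5 * dt * L, I + 0.5 * dt * L)

def cost(X0):
    """Terminal-time quadratic cost J = ||X^N||^2 of the discrete scheme."""
    X = X0.copy()
    for _ in range(N):
        X = A @ X
    return X @ X

X0 = rng.standard_normal(n)
XN = X0.copy()
for _ in range(N):
    XN = A @ XN

# Discrete adjoint: terminal condition q^N = 2 X^N, then march backwards
# with the *transposed* step matrix (discretise-then-differentiate).
q = 2.0 * XN
for _ in range(N):
    q = A.T @ q
grad = q                              # exact gradient of the discrete cost

# J is quadratic in X0, so a central difference is exact up to round-off
e, h = rng.standard_normal(n), 1e-4
fd = (cost(X0 + h * e) - cost(X0 - h * e)) / (2 * h)
```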
Comparing expressions (2.6)-(2.9) with their continuous counterpart (1.5)-(1.7), we see that while the time-stepping scheme is unaffected for s ≤ n ≤ N − 1, the compatibility conditions demand the inversion of non-local linear operators at n = N, t = T and an explicit evaluation at n = 0, t = 0. For a fully periodic domain discretised using a self-adjoint Fourier basis the error is likely to be small, but for wall-bounded domains where boundary layers are present, such as the optimal mixing example presented in section 2.2.3, the error is potentially significant.

Numerical performance
To validate the adjoint of the discrete equations, and to justify the additional complexity of its implementation, we now report numerical experiments on three example cases that showcase its superior performance. The consistency of the adjoint is checked by testing the Taylor remainder, i.e. verifying that for an arbitrary perturbation δX_0 of the initial condition

|Ĵ(X_0 + hδX_0) − Ĵ(X_0)| = O(h), (2.10)

and

|Ĵ(X_0 + hδX_0) − Ĵ(X_0) − h⟨∇Ĵ(X_0), δX_0⟩| = O(h^2). (2.11)

An additional test of interest is that the first difference in (2.11), divided by h, is equivalent to the inner product in the last term, to a number of decimal places. This is preferable when h ≪ 1 as it avoids multiplying very small and large numbers. In all tests which follow we generate X_0, δX_0 randomly and re-scale them to a fixed norm. In practice, it is strongly advised whenever possible to also choose X_0 close to a representative state of interest of the problem studied, as this ensures consistency of the gradient definition with respect to problem-relevant parameter vectors.
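A generic implementation of the Taylor-remainder test (2.11) might look as follows; the toy objective and its hand-coded gradient below are stand-ins for Ĵ and the adjoint-based gradient of an actual PDE problem:

```python
import numpy as np

def taylor_test(J, gradJ, X0, dX, hs=(1e-2, 5e-3, 2.5e-3)):
    """Return the observed convergence orders of the second-order Taylor
    remainder |J(X0 + h dX) - J(X0) - h <gradJ(X0), dX>|; a consistent
    gradient should give orders close to 2."""
    r = [abs(J(X0 + h * dX) - J(X0) - h * (gradJ(X0) @ dX)) for h in hs]
    return [np.log(r[i] / r[i + 1]) / np.log(hs[i] / hs[i + 1])
            for i in range(len(r) - 1)]

# Toy smooth objective with a known exact gradient (stand-in for J-hat)
rng = np.random.default_rng(3)
M = rng.standard_normal((10, 10)); M = M + M.T
J = lambda X: X @ M @ X + np.sum(X**4)
gradJ = lambda X: 2 * M @ X + 4 * X**3
orders = taylor_test(J, gradJ, rng.standard_normal(10), rng.standard_normal(10))
```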

First test-case: a dynamo problem
In many astrophysical objects, magnetic fields are generated by the motions of electrically conducting fluids through a process known as the dynamo effect [32]. A dynamo optimisation problem, first considered by [58], comprises finding the velocity field U and the initial magnetic field B_0 (constrained by some energy budget) which maximise the magnetic field's growth by some target time T in a triply periodic box. Here the velocity field is assumed to be frozen, i.e. independent of time and of the evolution of the magnetic field. Following [58], this optimisation problem is written in dimensionless form as

max_{U, B_0} (1/V) ∫_V |B(x, T)|^2 dV subject to ∂B/∂t = ∇×(U×B) + Rm^{−1}∇^2 B, ∇·B = 0, (2.12)

together with the norm constraints on U and B_0, where V is the volume of the box and Rm (the magnetic Reynolds number) is a control parameter quantifying the relative strength of electromagnetic induction compared to ohmic dissipation. Due to the domain periodicity, the problem can be spatially discretised using a Fourier basis. Spatial differentiation amounts to multiplication by ik in Fourier space (where k is the wavenumber), such that integration by parts performed discretely or continuously yields the same adjoint operators. As a result, only errors associated with the temporal discretisation can arise in the gradient estimation. Equation (2.12) and its adjoint (C.2) are time-integrated in the open-source code Dedalus [8] using a Crank-Nicolson Adams-Bashforth 1 (CNAB1) scheme. The problem setup and numerical resolutions are the same as in [58], and the dimensionless control parameters are here both set to Rm = T = 1. First we test the gradient with respect to the magnetic field δL/δB_0, holding U fixed as prescribed in [58], and subsequently the gradient with respect to the velocity field δL/δU by holding B_0 fixed. Table 1 compares the gradient approximation obtained from the continuous adjoint equations (C.2) with that of the discrete adjoint equations (C.6)-(C.11). Both are derived in terms of a Lagrangian formulation of (2.12) in section 2.
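The assertion that Fourier differentiation yields identical discrete and continuous adjoint operators is straightforward to verify numerically. The sketch below (using an odd number of grid points so that no Nyquist mode arises) checks that the FFT-based derivative is exactly skew-adjoint, ⟨u, ∂_x v⟩ = −⟨∂_x u, v⟩, to round-off:

```python
import numpy as np

def ddx(u):
    """Fourier spectral derivative of a real periodic signal on [0, 2*pi)."""
    N = u.size
    k = 1j * np.fft.fftfreq(N) * N      # integer wavenumbers times i
    return np.real(np.fft.ifft(k * np.fft.fft(u)))

Npts = 33                                # odd: avoids the Nyquist mode
x = 2 * np.pi * np.arange(Npts) / Npts
rng = np.random.default_rng(4)
u, v = rng.standard_normal(Npts), rng.standard_normal(Npts)

lhs = u @ ddx(v)                         # <u, d/dx v>
rhs = -(ddx(u) @ v)                      # <-d/dx u, v>: continuous adjoint
```

A bandlimited sanity check also holds exactly: ddx applied to sin(x) returns cos(x) to machine precision.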
To calculate the order of convergence when assessing the validity of (2.11), we assume that the remainder scales as Ch^p for some constant C, and estimate p from evaluations at successive values of h. In all tables a significant improvement is seen when using the discrete adjoint, while the continuous formulation cannot provide the mathematically consistent approximation demanded by (2.11).

Second test-case: Swift-Hohenberg multi-stability
The Swift-Hohenberg equation constitutes a paradigm for studying pattern formation in physics. It is known to have multiple stable solutions coexisting for the same parameter values, characterised by different energy levels [29]. This naturally raises the question of triggering a transition from one stable state to another. Following [33] we consider the problem of selecting an initial condition u_0 = u(x, t = 0) of energy E_0 which, when added to the stable, zero-energy equilibrium u = 0, induces a transition to a new, higher-energy equilibrium. This can be achieved by maximising the energy of the solution at the target time, subject to the Swift-Hohenberg equation

∂u/∂t = au − (1 + ∂_x^2)^2 u + N(u), (2.15)

where N(u) denotes the nonlinearity, on a periodic domain of dimensionless length L = 12π with parameter a = −0.3 and target time T = 50, following [33]. These values ensure that the zero-energy state u = 0 is linearly stable to infinitesimal perturbations, but unstable to some finite-amplitude perturbations which can drive the system away to a different stable state. Discretising the spatial operators using a Fourier basis, we once again ensure that only temporal errors can contaminate the gradient estimate. Equation (2.15) and its discrete adjoint (B.5) are time-integrated in Dedalus using a first-order backwards differencing (SBDF1) scheme.
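A single SBDF1 step of a Swift-Hohenberg-type equation reduces, in Fourier space, to a diagonal implicit solve for the linear terms with the nonlinearity treated explicitly. The sketch below assumes a cubic nonlinearity purely for illustration (the form used in [33] may differ) and checks that a small perturbation about u = 0 decays, consistent with the linear stability of that state for a = −0.3:

```python
import numpy as np

# SBDF1 (implicit linear / explicit nonlinear) step for a Swift-Hohenberg
# type equation u_t = a u - (1 + d_xx)^2 u + N(u) on a periodic domain.
Lx, Nx, a, dt = 12 * np.pi, 128, -0.3, 0.05
k = 2 * np.pi * np.fft.rfftfreq(Nx, d=Lx / Nx)
Lhat = a - (1 - k**2) ** 2               # linear operator, diagonal in Fourier

def sbdf1_step(u):
    Nu = -u**3                           # assumed nonlinearity (explicit)
    rhs = np.fft.rfft(u) / dt + np.fft.rfft(Nu)
    return np.fft.irfft(rhs / (1 / dt - Lhat), n=Nx)

# A small perturbation: all linear growth rates a - (1 - k^2)^2 <= a < 0,
# so its energy should decay under time-stepping.
u = 1e-3 * np.cos(2 * np.pi * 6 * np.arange(Nx) / Nx)
e0 = np.sum(u**2)
for _ in range(40):
    u = sbdf1_step(u)
```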
Table 2. Test of the gradient's accuracy for N = 256, Δt = 0.05 and an SBDF1 time-stepping scheme as per [33]. The discrete equations achieve a consistent gradient valid to 5 decimal places; smaller h would provide a yet more consistent approximation.

Table 2 reports a comparison of the continuous and discrete gradient, considering a fixed perturbation of noise for different h. While several digits of accuracy and a consistent gradient are obtained using the discrete equations, we were unable to obtain a consistent gradient using the continuous equations for any choice of Δt or h. Having demonstrated the superiority of the discrete adjoint approach on two spatially self-adjoint discretisations, we now consider a wall-bounded problem in which both the temporal and the spatial discretisation can contaminate the gradient.

Third test-case: optimal mixing problem
In both the natural environment and industrial applications the mixing of fluids plays a crucial role. In the ocean, for example, irreversible mixing is the consequence of a complex interplay between density stratification and background shear, playing out over a vast range of length-scales. A grand challenge in the fluid dynamics community is to understand how the turbulent motions which ensue from these effects transport warm saline water, biomass and organisms such as phytoplankton across density surfaces in the oceans [9]. An idealised example of optimal mixing is the optimal stirring strategy in a plane, pressure-driven channel flow [16,37,24]: what is the most efficient way of mixing a stably stratified layer of hot and cold fluid? This question may be posed as an optimisation problem, where the initial velocity perturbation u_0 is determined so as to optimise well-mixedness. This can be achieved by minimising the mix-norm of the density deviation at the target time (2.16), where the non-dimensional Reynolds and Péclet numbers Re, Pe = 500 characterise the strength of flow inertia relative to viscous and thermal diffusion respectively, and the Richardson number Ri = 0.05 quantifies the strength of buoyancy relative to the background shear. Although we do not present it here, in SphereManOpt we include an additional test case where the time-averaged kinetic energy is maximised instead, as in [37]. We assume periodicity in the streamwise direction x ∈ [0, 4π], while in the shearwise direction z ∈ [−1, 1] we impose no-slip velocity conditions, the pressure gauge, the no-flux condition on the deviation density and its accompanying gauge (2.17).

The latter gauge is essential as it (1) ensures that the mix-norm is well defined [54] and (2) prevents the optimisation from stalling due to an undefined gauge freedom. The background shear flow and the initial density deviation profile are prescribed as in previous studies, save for a profile parameter set to a value larger than in those studies to facilitate the use of a spectral method. Following [37] the initial amplitude is set to E_0 = 0.02 and the optimisation window to T = 5. This slightly shorter optimisation window is chosen so as to avoid excessively small values of the mix-norm, which were found to induce rounding errors. Equation (2.16) and its discrete adjoint (D.2) are time-integrated in Dedalus using a first-order backwards differencing (SBDF1) scheme. In contrast to the previous examples, the spatial operators are discretised using a Fourier basis in the streamwise direction and a Chebyshev basis in the shearwise direction. Consequently the discrete operators appearing in (2.6)-(2.9) are no longer self-adjoint, such that both temporal and spatial errors can now contaminate the gradient estimate. Table 3 reports a comparison of the continuous and discrete gradient, considering a fixed perturbation of noise for different h and for different objective functions. We find that several digits of accuracy and a consistent gradient are obtained using the discrete equations, while we were unable to obtain a consistent gradient using the continuous equations for any choice of Δt or h.

Run time
Table 4 reports the efficiency of this section's gradient tests. The adjoint of the kinematic dynamo problem, having twice the forward solve's dimension, costs almost double, as anticipated. The optimal mixing and Swift-Hohenberg problems cost the same or substantially less. Computations require storing the forward solve in coefficient space only, thus minimising memory usage as compared with automatic differentiation.
Table 4. Ratio of the adjoint integration time versus the forwards integration time as computed using a single node with 2 AMD Epyc 7302 @ 3 GHz -16 cores processors.To minimise memory usage only the forwards solve in spectral space is stored.

Gradient descent on a spherical manifold
In the previous section we detailed how to obtain a numerically consistent approximation of the gradient of the pure functional Ĵ(X_0). Extending this further we now consider

min_{X̃_0 ∈ S} Ĵ(X̃_0), where S = {X̃_0 ∈ R^n : ||X̃_0||^2 = E_0}, (3.1)

such that, by restricting X̃_0 to the spherical manifold S, the equality constraints are built in. By defining (3.1) as an unconstrained optimisation problem we can now proceed to outline an appropriate gradient descent procedure, either by using the rotation method [11] (also termed the exponential map in the literature on optimisation on Riemannian manifolds [1]), which moves X̃_0 along geodesics of S in the search direction d_k, or by using the retraction formulae of [1], which are computationally simpler and generalise easily to L^p norms [50] (our discussion concentrates exclusively on L^2 norms).
In the fluid mechanics community, the rotation method was first used to perform norm-constrained optimisation on a nonlinear PDE problem by [17], who combined it with a Polak-Ribière conjugate gradient method [21]. They achieved faster convergence with the rotation method than with the Lagrange multiplier method. More recently [52] highlighted this link and utilised Riemannian optimisation techniques to solve a unit-norm constrained linear optimisation problem. In what follows we outline recent theoretical developments concerning steepest descent [7] and conjugate-gradient descent [48,49]. This is followed by our numerical validation of their formulae, which, using a stringent line-search, is shown to provide convergence consistent with the theoretical guarantees.

Conjugate gradient formulation
Following [7,48] we denote the parameter vector at iterate k by X̃_k, the Euclidean gradient of the objective function by ∇Ĵ(X̃_k), and the tangent-gradient of the objective function (i.e. its gradient on the spherical manifold S) by

g_k = ∇Ĵ(X̃_k) − (⟨∇Ĵ(X̃_k), X̃_k⟩/||X̃_k||^2) X̃_k.

Starting from an initial guess X̃_0, restricted to S, gradient descent proceeds via the recurrence

X̃_{k+1} = R_{X̃_k}(α_k d_k), d_k = −g_k + β_k T_{X̃_k}(d_{k−1}), (3.4)

where R denotes the update (rotation or retraction) map, β_k is the conjugate-gradient update parameter [49,48] and the step size α_k is determined using a line-search procedure in order to find an improved minimum of Ĵ(X̃_0). For steepest descent β_k = 0 is set, while in the conjugate-gradient method we use a combined Riemannian Polak-Ribière and Fletcher-Reeves type update, following a general formulation put forward by [49], who demonstrated that this choice of β_k (3.5)-(3.6) yields global convergence of the conjugate-gradient algorithm.
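The tangent-gradient projection and the rescaling retraction can be sketched as follows (the helper names are ours, for illustration; SphereManOpt's interface may differ):

```python
import numpy as np

def tangent_grad(X, euclid_grad, E0):
    """Project the Euclidean gradient onto the tangent space of the
    sphere ||X||^2 = E0 (sketch of the tangent-gradient g_k)."""
    return euclid_grad - (X @ euclid_grad / E0) * X

def retract(X, d, E0):
    """Map X + d back onto the sphere by rescaling: a retraction in the
    sense of [1] (the rotation/exponential map is an alternative)."""
    Y = X + d
    return np.sqrt(E0) * Y / np.linalg.norm(Y)

rng = np.random.default_rng(5)
E0 = 2.0
X = retract(rng.standard_normal(6), 0.0, E0)   # random point on the sphere
g = tangent_grad(X, rng.standard_normal(6), E0)
Y = retract(X, -0.1 * g, E0)                   # one descent step
```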

The vector transport operator and scaling coefficient account for the following: (1) vector addition on the manifold S only makes sense if the vectors live in the same tangent space, and (2) the inequality (3.8) must be satisfied for convergence of (3.4); that is, the transport operator must not increase the vector's norm [49]. In all cases the iterative procedure repeats until the gradient residual r_k, which measures the angle between the parameter vector X̃_k and the Euclidean gradient ∇Ĵ_k, becomes less than some numerically prescribed tolerance, meaning that no further progress can be made without departing too far from the constraint manifold. We note that this measure provides an immediate connection both to the Lagrangian method (1.13) and to the Armijo condition (3.10) when d_k ≈ −g_k.
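A projection-based, scaled vector transport respecting the non-expansion requirement (3.8) can be sketched as below; this is our illustrative reading of the scaled transport of [49], not a verbatim reproduction of it:

```python
import numpy as np

def transport(Xnew, d, E0):
    """Transport d into the tangent space at Xnew by projection, then
    rescale so the norm never increases (cf. inequality (3.8)).  For a
    pure projection the scaling branch is vacuous, but it is kept to
    mirror the general scaled-transport construction."""
    Td = d - (Xnew @ d / E0) * Xnew
    nd, nTd = np.linalg.norm(d), np.linalg.norm(Td)
    if nTd > nd:
        Td *= nd / nTd
    return Td

rng = np.random.default_rng(6)
E0 = 1.0
X = rng.standard_normal(5); X /= np.linalg.norm(X)
d = rng.standard_normal(5)
Td = transport(X, d, E0)
```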

Conditions on the step size
Convergence guarantees for the scheme (3.4) can be achieved by imposing conditions on the step-size α [49,48]. Assuming that Ĵ(X̃_k) is continuously differentiable, bounded from below, and that the gradient is Lipschitz continuous for all X̃_k ∈ S, the (Riemannian) strong Wolfe conditions

Ĵ(X̃_{k+1}(α_k)) ≤ Ĵ(X̃_k) + c_1 α_k ⟨g_k, d_k⟩, (3.10)

|⟨g_{k+1}(α_k), T_{X̃_{k+1}}(d_k)⟩| ≤ c_2 |⟨g_k, d_k⟩|, (3.11)

with 0 < c_1 < c_2 < 1/2, ensure global convergence to a stationary point [49]. Here we have used the notation X̃_{k+1}(α_k) to denote the step-size-dependent choice of parameter vector and g_{k+1}(α_k) to denote its corresponding tangent-gradient. In contrast to Euclidean gradient descent, the Riemannian Wolfe conditions include an additional transport of the search direction T_{X̃_{k+1}}(d_k) to ensure that the inner product, which is now a metric on S, remains well defined.
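Given the objective values and directional slopes ⟨g, d⟩ at the current point and at the trial point (the latter evaluated after transport of d), checking the strong Wolfe conditions (3.10)-(3.11) reduces to two inequalities. The helper below is a hypothetical sketch, not SphereManOpt's actual routine:

```python
def strong_wolfe(J0, slope0, J1, slope1, alpha, c1=1e-4, c2=0.4):
    """Check the strong Wolfe conditions (3.10)-(3.11).
    J0, slope0: objective and <g_k, d_k> at the current point;
    J1, slope1: objective and transported slope at the trial point.
    Constants chosen with 0 < c1 < c2 < 1/2 as required in the text."""
    armijo = J1 <= J0 + c1 * alpha * slope0          # sufficient decrease
    curvature = abs(slope1) <= c2 * abs(slope0)      # strong curvature
    return armijo and curvature
```

For instance, along the 1-D slice f(α) = (α − 1)^2 the exact minimiser α = 1 satisfies both conditions, whereas a trial point with no decrease fails the Armijo test.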

Principle component analysis
To assess the efficacy of the update procedure and the practical cost of enforcing (3.10), (3.11), we solve

min_{X̃ ∈ S} −⟨X̃, M X̃⟩, (3.12)

which is equivalent to principal component analysis (PCA) or, if M corresponds to the discretisation of a self-adjoint linear operator, to finding its leading eigenvector. Assuming an exact line-search, the rate of convergence of gradient descent to a stationary point follows analytically from the condition number κ(M) = λ_n/λ_1 of the Hessian, where λ_1, λ_n refer to the smallest and largest eigenvalues of M respectively. Denoting the solution of (3.12) by X̃_*, the result of [35] states that for steepest descent

Ĵ(X̃_{k+1}) − Ĵ(X̃_*) ≤ ((κ(M) − 1)/(κ(M) + 1))^2 (Ĵ(X̃_k) − Ĵ(X̃_*)) (3.13)

for large k, such that X̃_k → X̃_* at a linear rate. To test this bound we generate random M ∈ R^{N×N} and perform gradient descent. For N = 100 we calculate ((κ(M) − 1)/(κ(M) + 1))^2 = 0.9799 and measure the slope (Ĵ_{k+1} − Ĵ_*)/(Ĵ_k − Ĵ_*) = 0.9187, showing that the calculated estimate obeys the bound (3.13), and monotonic linear convergence is observed. Performing the same optimisation using the conjugate gradient method (3.4), subject to the strong Wolfe conditions, we obtain super-linear convergence, as shown in Table 5 and Figure 1.
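The following sketch solves (3.12) for a matrix with known spectrum using Riemannian steepest descent with a fixed step-size and a rescaling retraction, a simplification of the line-search procedure described above:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
# Symmetric test matrix with a prescribed spectrum (1, ..., 10), so that
# the leading eigenvalue and spectral gap are known exactly.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
evals = np.linspace(1.0, 10.0, n)
M = Q @ np.diag(evals) @ Q.T
lam_max = evals[-1]

# Riemannian steepest descent for (3.12): min_{||X||=1} -<X, M X>.
# The minimiser is the leading eigenvector of M.
X = rng.standard_normal(n)
X /= np.linalg.norm(X)
alpha = 0.5 / lam_max
for _ in range(500):
    eg = -2.0 * M @ X            # Euclidean gradient of -<X, M X>
    g = eg - (X @ eg) * X        # tangent gradient on the unit sphere
    X = X - alpha * g
    X /= np.linalg.norm(X)       # retraction back onto the sphere
rayleigh = X @ M @ X             # converges to lam_max
```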
Table 5. Comparison of the steepest descent and conjugate gradient methods applied to (3.12). We set the maximum step-size α_max = 1 and terminate optimisation when r ≤ 1e-06 or when iterations > 200. A star * denotes a stalled run. Despite conducting a more expensive line-search, the conjugate-gradient routine converges super-linearly, thus requiring fewer evaluations, as shown in Figure 1.

To facilitate the application of the descent routine outlined in this section, we have combined the update formula of subsection 3.1 with an implementation of an Armijo (3.10) line-search based on cubic interpolation, and an implementation of a strong Wolfe condition line-search (3.11) following [41,59]. This is available in the companion package to this paper, SphereManOpt [36], which also contains all examples presented in this paper. In contrast to the libraries manopt and pymanopt [6,55], our library focuses exclusively on spherical manifolds of variable radii (as desirable in fluid dynamics applications) and implements a line-search routine satisfying the strong Wolfe conditions. Such a line-search, while necessary to ensure global convergence of the conjugate gradient method [48], is currently unavailable in manopt or pymanopt [6,55].

Numerical performance
Having outlined the calculation of a consistent gradient estimate in section 2, and a line-search following [48,49] with global convergence to a stationary point in section 3, we now combine the developments of these sections and apply them to solve for the optimal solutions of the examples previously considered in section 2.

Swift-Hohenberg multi-stability
For the choice of parameters given in section 2 the Swift-Hohenberg equation has four stable equilibria. Following [33], we find the smallest perturbation u_0 of energy M^2 which, when added to the zero-energy state O, triggers a transition to the next least-energy state u_T of energy S^2. (In practice, M^2 is determined by repeatedly computing the optimal u_0, of gradually decreasing energy M, that maximises the energy growth by target time T. Since larger energies characterise the searched-for transition, M^2 corresponds to the smallest value of M for which the optimal u_0 still triggers the transition.) Figure 2 compares the amplitudes of the perturbation M^2 and stable equilibrium S^2 with the results of [33]. All steady equilibria are correctly identified to four digits of precision, while the perturbation amplitudes are correct to one digit. The differences with the results of [33] are explained by the fact that we performed a small number of runs initiated from random noise, to probe a cost functional with numerous local minima. In contrast, they performed several hundred runs so as to explore many possible basins of attraction.
Figure 2. Optimisation using the discrete gradient recovers the energies of the initial condition M² and of the equilibrium state S² (inset). These are given in figures 1 and 5 of [33] as (M², S²) = (0.2048, 0.5164).

Discrete vs. continuous adjoint
Having established that our code faithfully reproduces the results of [33], we now compare the performance of the discrete and continuous gradients when optimising for the perturbation u_0. The convergence of these runs is reported in rows (a) and (c) of Table 6. Both runs use steepest descent (SD) with an Armijo line-search and are initiated from random noise. While the continuous gradient seems to converge faster to a small residual, this appearance is in fact misleading. By running the optimum (found with the 'continuous', i.e. differentiate-then-discretize, approach) once through the direct and discrete adjoint solvers (i.e. discretize-then-differentiate), we obtain the corresponding 'discrete residual', which is in fact much higher. This is highlighted in row (a) of Table 6, where the discrete residual is six orders of magnitude greater than the continuous one, showing that only the discrete gradient can accurately yield convergence to a stationary point. A further disadvantage of the continuous gradient, shown in row (b) of Table 6, is that when supplied to the conjugate-gradient (CG) routine it stalls, thus eliminating the possibility of super-linear convergence.
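The inconsistency of a continuous gradient can be exposed without any optimisation at all, using the standard finite-difference (Taylor) test: for a gradient consistent with the discrete objective the remainder decays quadratically in the step size, whereas an inconsistent gradient plateaus at the level of the discretisation error. The following sketch assumes generic callables `J` and `grad`; the names are illustrative.

```python
import numpy as np

def gradient_error(J, grad, x, dx, epsilons):
    """Taylor-test remainders |J(x + eps*dx) - J(x) - eps*<grad(x), dx>|.

    For a gradient consistent with the discrete objective these decay as
    eps**2; an inconsistent (continuous-adjoint) gradient stalls at the
    discretisation error instead."""
    g = grad(x)
    J0 = J(x)
    return [abs(J(x + eps * dx) - J0 - eps * np.dot(g, dx)) for eps in epsilons]
```

Running this test with the discrete adjoint of the time-stepper is precisely what certifies that the residual reported by the optimiser measures distance to a true stationary point of the discrete problem.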

Line-search algorithms
With the superiority of the discrete gradient established, we now consider the efficiency and convergence behaviour of three different descent routines. The convergence of each routine is reported in rows (c)-(f) of Table 6 and frames (c)-(f) of Figure 3. While all routines reach a small residual error, the most commonly used routine, a backtracking line-search with an Armijo condition, performs the worst. As anticipated, the conjugate-gradient method, which implements the strong Wolfe conditions and thus enjoys super-linear convergence, performs best.
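For reference, the two acceptance tests can be stated compactly. The sketch below writes them in terms of the one-dimensional restriction φ(α) = Ĵ(x_k(α)) along the search direction; the constants c1 = 1e-4 and c2 = 0.1 are conventional choices for nonlinear conjugate-gradient methods, not values taken from the paper's (3.10)-(3.11).

```python
def satisfies_armijo(phi0, dphi0, phi_a, alpha, c1=1e-4):
    """Sufficient decrease: phi(a) <= phi(0) + c1 * a * phi'(0)."""
    return phi_a <= phi0 + c1 * alpha * dphi0

def satisfies_strong_wolfe(phi0, dphi0, phi_a, dphi_a, alpha, c1=1e-4, c2=0.1):
    """Armijo plus the strong curvature condition |phi'(a)| <= c2 * |phi'(0)|."""
    return (satisfies_armijo(phi0, dphi0, phi_a, alpha, c1)
            and abs(dphi_a) <= c2 * abs(dphi0))
```

The curvature test is what distinguishes a strong Wolfe search from plain backtracking: it rejects steps at which the directional derivative is still large, which is the property needed to keep conjugate-gradient directions descent directions.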
Table 6. Test summary and numerical cost of reaching a stationary point using different descent routines, as shown in Figure 3. We set the maximum step-size α_max = π and terminate the optimisation when r ≤ 1e-06 or when iterations > 200. A star * denotes a stalled run. In rows (a)-(c) we report the continuous residual as well as the discrete residual.

Kinematic dynamo
Having established the validity of the gradient in 2.2, where only the discrete adjoint gives a consistent estimate, and similarly the validity of our code by comparison with [58] (see Figure 4), we first demonstrate the superiority of the discrete gradient when optimising for the fields U and B_0. In contrast to the previous examples, this demands choosing a single step-size α_k to simultaneously update all n components of the parameter vector X_k, whilst retaining a termination condition based on the residual r_k^i of each component. We argue that this is preferable to taking a summed quantity such as r_k = Σ_i r_k^i, in which over- and under-converged directions could compensate one another.
One weakness of the outlined approach is that, when different components of X_k converge at different rates, restricting ourselves to a single step-size limits the overall convergence rate. Preconditioning the update step presents a possible solution to this problem [21]. Rows (a) and (c) of Table 7 compare the performance of the continuous and discrete gradients using steepest descent (SD) combined with an Armijo line-search. While the continuous gradient appears to converge in fewer gradient/function evaluations, we find that upon diverging after ∼ 25 iterations it does so towards a spurious solution characterised by a non-zero magnetic divergence and net magnetic flux (Figure 5(a)). The optimisation using the discrete gradient, by contrast, converges monotonically to the desired convergence criteria (Figure 5(c)). As previously, it was not possible to make use of the conjugate-gradient (CG) method when using a continuous gradient (row (b) of Table 7).
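The combined update described above can be sketched as follows: one shared step-size updates every field, each field is retracted onto its own sphere, and termination tests every component's residual separately. This is a schematic illustration of the strategy, not SphereManOpt code; all names are illustrative.

```python
import numpy as np

def converged(residuals, tol=1e-6):
    """Terminate only when every component residual r_k^i is small.

    Testing each r_k^i separately avoids the failure mode of a summed
    residual r_k = sum_i r_k^i, in which an over-converged component can
    mask an under-converged one."""
    return all(r <= tol for r in residuals)

def combined_update(fields, grads, alpha, radii):
    """Update all fields (e.g. U and B_0) with one shared step-size alpha,
    retracting each onto its own sphere of prescribed radius."""
    updated = []
    for x, g, radius in zip(fields, grads, radii):
        y = x - alpha * g                                 # shared descent step
        updated.append(radius * y / np.linalg.norm(y))    # per-field retraction
    return updated
```

The per-component test is stricter than a summed one: a run only terminates once the slowest direction has converged, which is exactly the behaviour the single shared step-size can otherwise limit.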
Table 7. Numerical cost of reaching a stationary point using different descent routines, as shown in Figure 5. We set α_max = 100 and terminate the optimisation when r_k^i ≤ 1e-06 or when iterations > 200. A star * denotes a stalled run, while ** denotes a non-physical solution. All rows are given in terms of their discrete gradient residual.

Having established the necessity of the discrete gradient, we now consider the efficiency of the strong Wolfe line-search and conjugate-gradient routines when using a combined update approach. Columns (d) and (f) of Table 7 repeat the same calculation using a strong Wolfe line-search, with and without the conjugate-gradient update. Comparing Figure 5(c) with figures 5(d,f), both routines converge in fewer iterations; however, their total number of function and gradient evaluations is slightly larger.

Optimal Mixing
The previous examples demonstrate that our gradient descent algorithm efficiently achieves a high degree of convergence, but only when using the discrete gradient. We now combine all of this paper's developments to treat a PDE discretised using a combined Fourier/Chebyshev basis, as is required to solve the optimal mixing problem (2.16). To date, adjoint optimisation studies with a discrete gradient have used finite difference [57], finite element [60,40] or finite volume methods [51], and in doing so have encountered difficulties in treating boundary conditions and pressure gauge conditions. By using the Dedalus code, which treats the pressure and divergence-free condition (and similar integral conditions) alongside all other variables during a single time-step, no splitting methods or complications arise in our implementation (A). Given the prevalence of flow features with large gradients, such as velocity boundary layers or density fronts, the accurate solution of control problems using a Chebyshev pseudospectral method is an essential step. Figure 6(a) shows the sharp initial density deviation b_0 and the optimal velocity perturbation (here plotted as the vorticity ∇ × u_0), which, as shown in Figure 6(b), allows the mixing process to unfold via Taylor dispersion [37]. These figures, computed by solving (2.16) using the discrete gradient and the conjugate-gradient method with a Wolfe line-search (Table 8(f)), also demonstrate that our implementation faithfully reproduces the results of [37].

Discrete vs. continuous adjoint
With our code's validity established, we now compare the performance of the discrete and continuous gradients when optimising for the perturbation u_0. The convergence of these runs, reported in Table 8, establishes that only the discrete gradient can satisfy the imposed convergence criteria. Although the conjugate-gradient method using the continuous gradient works well initially, it subsequently stalls due to a poor gradient estimate. While both gradient estimates select a dominant wavelength k_x = 7, the discrete gradient consistently achieves a slightly better optimum than the continuous gradient.

Line-search algorithms
As shown in Figure 7 and Table 8, the discrete gradient allows us to use the conjugate-gradient algorithm and obtain fast convergence. Combining the discrete gradient with either a conjugate-gradient method (e) or a strong Wolfe line-search (f), the imposed convergence criteria can also be satisfied, albeit less efficiently. Comparing row (e) with (f), and row (c) with (d), it is also apparent that the strong Wolfe line-search greatly influences convergence despite its greater cost.
For each problem considered we have prescribed a hyper-parameter α_max. For an Armijo line-search it is the initial step-size α_0 = α_max from which back-tracking begins, while in a strong Wolfe line-search it constitutes the maximum admissible step-size. By reducing or increasing the number of iterations required to achieve convergence, this parameter was found to have a substantial impact on performance in all problems considered. Typically, as was the case in the PCA and Swift-Hohenberg problems, setting α_max ≈ 1 works well. In the kinematic dynamo and optimal mixing problems, however, we found that the range of step-sizes considered by the line-search must be larger, particularly when using the strong Wolfe conditions, (α_min, α_max) = (10⁻⁶, 10²); steepest descent was also found to converge in 20-40 fewer iterations when a large maximum/initial step-size was chosen for these problems. We understand this behaviour to be a consequence of Ĵ(X_{k+1}) being (informally speaking) poorly bounded during the initial iterations, due to the unfavourable initial dynamo velocity field U or mixing perturbation velocity field u. Consequently, a larger than usual α_k must be selected for the Armijo condition (3.10) to become active and for (3.11) to be easily satisfied. A further discussion of this issue is given in the introduction of [41].

Conclusion
In this paper we have focused on inconsistencies arising from the gradient estimation and the application of equality constraints in the numerical direct-adjoint-based control procedure [28], showing on practical examples why these issues occur and how they can be resolved. Deriving the discrete adjoint equations of time-dependent PDE constraints for (preconditioned) multistep schemes [3], we demonstrated in section 2, on three example problems, a consistent gradient approximation that is readily obtainable with minor code modifications. In section 3 we recalled, following [7,48], how an equality-constrained problem can be solved as an unconstrained problem on a spherical manifold. Considering three examples, we numerically demonstrated that this procedure, available as SphereManOpt [36], is capable of giving robust linear to super-linear convergence to stationary points with small residual error when both nonlinear PDE and multiple norm constraints are present. Useful extensions of this work would consider the discrete adjoint of Runge-Kutta schemes, preconditioning schemes for problems with multiple norm constraints, and the exploitation of additional geometric structures of the optimisation problems considered. Further developments of SphereManOpt would include L^p norms and the integration of an optimal checkpointing strategy [19].
The discrete adjoint for ∇⁻² is found directly from the Dedalus implementation of (D.4). In this manner, we find b†(x, T) by solving (∇²)† b†(x, T) = ∇† W† ∇ψ.

Figure 1.
Figure 1. Convergence history of the optimisation (3.12) using the steepest-descent method with an Armijo line-search (solid line), the conjugate-gradient method with an Armijo line-search (chained line) and the conjugate-gradient method with a Wolfe I & II line-search (dotted line). The absolute value of the objective function |Ĵ_k| is shown in red and the gradient residual r_k in blue. Despite conducting a more expensive line-search, the conjugate-gradient method, which exhibits super-linear convergence, ultimately requires fewer iterations to converge (see Table 5).

Figure 3.
Figure 3. Convergence of the optimisation runs in rows (d)-(g) of Table 6, in terms of the cost functional J (dotted, red) and gradient residual r (chained, blue), to the perturbation M² when initiated with random noise of the same amplitude. The steepest-descent (SD) methods (d),(e) converge linearly, while the conjugate-gradient (CG) method (g) converges super-linearly, thus demanding fewer iterations.

Figure 4.
Figure 4. The optimal (a) magnetic and (b) velocity fields for Rm = T = 1, which yield ⟨B_T²⟩ = 0.4329. To be compared with figures 1 and 2 of [58]. This optimum was calculated using an Armijo-based line-search, corresponding to row (c) of Table 7.

Figure 5.
Figure 5. Convergence history in terms of the residual errors r_k^B (shown as c_0) and r_k^U (shown as c_1). Optimisation (a) used a continuous gradient and (c-f) a discrete gradient; their routines and convergence cost are summarised in Table 7. While the continuous gradient appears to converge faster, it converges to a spurious solution with non-zero divergence and net magnetic flux. Routines implementing the Wolfe conditions converge in fewer iterations, but their line-search procedures are slightly more costly.

Figure 6.
Figure 6. The optimal (a) initial velocity perturbation (given in terms of the vorticity ∇ × u_0) and (b) the resulting deviation density and perturbation velocity fields at T = 5 for Ri = 0.05. To be compared with figure 2 of [37]. This optimum was calculated using a Wolfe-based line-search and conjugate-gradient method, corresponding to Table 8(f).

Figure 7.
Figure 7. Convergence of the optimisation runs in rows (c)-(f) of Table 8, in terms of the cost function J (dotted, red) and gradient residual r (chained, blue), to the perturbation u_0 when initiated with random noise. The steepest-descent (SD) methods (c),(d) converge sub-linearly, while the conjugate-gradient (CG) method (f) converges at least linearly, thus demanding fewer iterations.

Table 1.
Comparison of the continuous and discrete gradients for N_{x,y,z} = 24 Fourier modes and time-step ∆t = 1e-03, as per [58]. (a) The velocity field U as given in [58] is held fixed. (b) The initial magnetic field B_0 is randomly chosen and held fixed. In both cases the vectors B_0, δB, δU of unit norm are generated by time-stepping vectors of noise for a small number of iterations and then re-normalising. This ensures that the boundary and divergence-free conditions are satisfied whilst avoiding any biasing of the initial data.

Table 3.
Test of the gradient's accuracy for N_x, N_z = 256, 128 and ∆t = 1e-03, using an SBDF1 scheme. The initial perturbation velocities u, δu of unit norm are generated by time-stepping vectors of noise for a small number of iterations and then re-normalising.
N | Line-search | Obj evals Ĵ | Grad evals ∇Ĵ | Residual r

Table 8.
Test summary and numerical cost of reaching a stationary point using different descent routines, as shown in figures 6 and 7. We set α_max = 100 and terminate the optimisation when r_k^i ≤ 1e-06 or when iterations > 200. A star * denotes a stalled run. Rows (a),(b) report the continuous residual and rows (c)-(f) the discrete residual.