Inverse stochastic-dynamic models for high-resolution Greenland ice core records

. Proxy records from Greenland ice cores have been studied for several decades, yet many open ques-tions remain regarding the climate variability encoded therein. Here, we use a Bayesian framework for inferring inverse, stochastic–dynamic models from δ 18 O and dust records of unprecedented, subdecadal temporal resolution. The records stem from the North Greenland Ice Core Project (NGRIP), and we focus on the time interval 59–22 ka b2k. Our model reproduces the dynamical characteristics of both the δ 18 O and dust proxy records, including the millennial-scale Dansgaard–Oeschger variability, as well as statistical properties such as probability


Introduction
Data-driven stochastic difference equation models have recently been successfully applied to a wide range of climatic phenomena (Kondrashov et al., , 2006Kravtsov et al., 2005Kravtsov et al., , 2009. The striking success of this empiricalmodel reduction (EMR) approach in reproducing dynamical and statistical properties of the underlying dynamical systems has been explained by embedding EMR models in the larger class of multilayer stochastic models (MSMs); the latter models, in turn, were shown to be solidly grounded in the Mori-Zwanzig (MZ;Zwanzig, 1964;Mori, 1965) formalism of statistical physics (Kondrashov et al., 2015). In addition to enhancing our understanding of the geophysical systems un-der consideration, EMR-MSM models have also been shown to be well suited for predictive purposes; e.g., they have considerable skill in predicting certain key variables associated with the El Niño-Southern Oscillation (Barnston et al., 2012;Chekroun et al., 2011) and the Madden-Julian Oscillation (Kondrashov et al., 2013).
In general terms, stochastic-dynamic models are derived by approximating the discrete-time divided differences of observed time series by a deterministic function F plus residual noise (e.g., Hasselmann, 1976): Published by Copernicus Publications on behalf of the European Geosciences Union.
here x i is the empirical observations, x i = x i+1 − x i , and t i = t i+1 − t i denotes the time spans from one observation to the next. The specific functional form of F may use some a priori knowledge of the system under study, while the parameters of the proposed model are always inferred by training it on a given set of time series produced by the system. In this sense, the approach is semiempirical, rather than being entirely hypothesis-free. Most existing methodologies for empirical-model derivation are based on least-squares fitting to determine optimal parameters for F.
In the present study, we are specifically interested in fitting low-dimensional stochastic-dynamic models to highresolution time series of two paleoclimatic proxy records, namely the δ 18 O ratios and dust concentrations obtained by the North Greenland Ice Core Project (NGRIP; Ruth et al., 2003;Andersen et al., 2004;Gkinis et al., 2014); see Fig. 1. Low-dimensional conceptual models have a long history in paleoclimate (e.g., Källén et al., 1979;Le Treut and Ghil, 1983;Saltzman and Maasch, 1990;Ghil, 1994;Tziperman et al., 2006;De Saedeleer et al., 2013). Such models typically incorporate a few global or regional climate variables, such as global temperature and ice-sheet volume, nonlinear interactions among these, and astronomical forcing subject to possible stochastic fluctuations (Saltzman and Maasch, 1991;Crucifix and Rougier, 2009).
In contrast, we choose here a data-driven, stochasticdynamic approach: we intend to find a system of stochastic differential equations (SDEs) to simulate time series that reproduce the statistical and dynamical properties of the observed δ 18 O and dust time series and do so without taking into account exogenous astronomical forcing. The main issue with the naive approach of Eq. (1) is that the noise term η, which represents the unobserved variables, will typically be correlated with the state vector x. To overcome this problem, we adapt the recently developed non-Markovian data-driven closure models introduced by Kondrashov et al. (2015) as follows.
Given a multivariate, low-dimensional time series x(t) of partial observations of a much higher-dimensional system, the MZ formalism yields the abstract generalized Langevin equation (GLE): In our application, x is the two-vector of δ 18 O and log(dust), while the much larger system is the climate system. The first term F(x) is Markovian and it accounts for the nonlinear self-interactions among the observed variables, while the non-Markovian integral term accounts for the crossinteractions between the observed and unobserved variables; in this formulation, the latter are not present in F(x). This non-Markovian integral term involves the past of the observed variables and thus introduces memory effects into the closed system. The term η(t) accounts for the stochastic forcing that is now uncorrelated with x, as guaranteed by the MZ formalism. The noise term, however, is not necessarily white in time or space; see Kondrashov et al. (2015) for a detailed derivation of the GLE approach and Appendix A for a sketch of the main ideas and more technical details. The temporal evolution of both the δ 18 O and dust time series indicates that there exist two alternative, relatively steady states for the underlying dynamical system, namely the colder stadials and the warmer interstadials (Dansgaard et al., 1993;Rasmussen et al., 2014). Transitions from the stadials to the interstadials occur very abruptly, within several decades, during the so-called Dansgaard-Oeschger (DO) events (Johnsen et al., 1992;Dansgaard et al., 1993;Ditlevsen et al., 2005), while transitions in the opposite direction are characterized by a comparably slow relaxation process that may last centuries to millennia, depending on the specific event.
This bistability suggests using a system of two coupled SDEs with a double-well potential as a model of the processes generating the two time series of δ 18 O ratios and dust (Ditlevsen, 1999;Ditlevsen and Ditlevsen, 2009). In addition, we will take into account here possible memory effects in the climate system (Bhattacharya et al., 1982;Ghil et al., 2015) by including explicit non-Markovian terms in the model. In particular, these memory terms are included in order to reproduce the temporal asymmetry of the observed time series: the sharp transitions from the stadials to the interstadials, followed by rather smooth transitions from the interstadials back to the stadials. For these data, we thus propose a two-dimensional stochastic delay differential equation as an approximation of the GLE (2): Here x is the two-dimensional time series of δ 18 O and dust, which is sampled at time steps t i in the NGRIP ice core. The model has a cubic drift term and retarded, non-Markovian arguments in the linear terms. The matrix Q denotes the Cholesky decomposition of the covariance matrix of the noise, and W (t) denotes a multidimensional Wiener process. Model parameters will be inferred using maximum likelihood estimation (MLE) and, for comparison, ordinary least-squares fitting. It will turn out that both approaches are in fact equivalent for the specific optimization problem proposed here; see Sect. 2.2 for further details and the explicit version of this SDE that we use in practice.
Time series simulated by our empirical model will be compared to the original time series in terms of statistical properties, such as the probability density functions (PDFs) of the time series, their power spectra and the average waiting time between sharp transitions from stadials to interstadials. Furthermore, we will test the relevance of the different model ingredients, such as the nonlinear terms, the memory and the coupling terms, using Bayesian model selection criteria.
The general potential of Bayesian parameter inference such as MLE has recently been discussed for stochasticdynamic climate models from incomplete data (Peavoy et al., 2015). A Bayesian framework to compare different types of models has also recently been used for the specific case of the NGRIP δ 18 O record, including a double-well potential model, a relaxation oscillator, and two versions of a mixture of locally linear stochastic models. Based on the Bayesian information criterion, it was concluded there that the two locally linear mixture models are best supported by the observations if three local models are used in each mixture (Kwasniok, 2013). Furthermore, a comprehensive Bayesian approach was employed to infer parameters for a double-well potential model from the NGRIP δ 18 O record (Krumscheid et al., 2015). Most recently, a double-well potential model was compared to a relaxation oscillator model with different external forcings using comparative Bayesian statistics (Mitsui and Crucifix, 2017). There, the conclusion was that the oscillator model is the more likely model candidate given the data and in particular that external forcing in terms of the ice volume significantly improves the statistical model. The latter three studies, however, used a lower-resolution version of the record, and did not consider either memory terms or coupling between the δ 18 O record and the dust record, as will be done here.

High-resolution NGRIP data
We employ proxy records of δ 18 O ratios (Andersen et al., 2004;Gkinis et al., 2014) and dust concentrations (Ruth et al., 2003) from the same core at the NGRIP drilling site. The δ 18 O ratios were regularly sampled every 5 cm, while the dust concentrations were sampled at a resolution of 1 mm (Ruth et al., 2002(Ruth et al., , 2003. The dust record was resampled here to the lower 5 cm resolution for consistency with the δ 18 O record. The proxy time series for the two variables have a common chronology, referred to as GICC05 , for the time interval starting at approximately 59 ka b2k. Here, 1 ka equals 1000 a and "b2k" refers to "before AD 2000." The GICC05 chronology is based on counting annual layers, which are distinguishable due to seasonal variations in the ice (Andersen et al., 2006), and the sampling intervals along the core range from 1 to 7 a.
Dating uncertainties were reported for each measurement of the raw data, and they accumulate toward the more remote past, as a consequence of the layer-counting procedure, which starts from the top of the core Rasmussen et al., 2014). The dating uncertainties are reported to be roughly 5 % and reach a maximum counting error of 2573 a at an age of 59 420 a b2k. Note that there are no uncertainties in the relative timing of the δ 18 O and dust, since they are obtained from the same ice core.
The δ 18 O ratios are interpreted as proxies for surface air temperature variability, with algebraically higher ratios corresponding to warmer temperatures (Johnsen et al., 1992(Johnsen et al., , 2001aDansgaard et al., 1993;Andersen et al., 2004). The dust concentrations have been proposed as proxies of largescale atmospheric circulation changes, with higher particle counts being associated with stronger winds and thus with larger equator-to-pole temperature differences (Fischer et al., 2007;Steffensen et al., 2008). Although the two proxies thus represent very distinct climatic features, they exhibit a high degree of co-variability for the time interval 59-22 ka b2k, as previously reported, e.g., by Johnsen et al. (1997), Ruth et al. (2003) and Rasmussen et al. (2014). This co-variability is best seen when considering the negative natural logarithm − log(dust) of the data along with the δ 18 O data (Fig. 1c).
The raw time series (blue curves in Fig. 1a and b) are preprocessed as follows: first, both time series are interpolated to an equidistant time axis with 5 a intervals. Second, gaps of varying length present in the dust time series and totaling about 6 % of the data points are filled by next-neighbor interpolation. Third, the multimillennial trend is removed from both time series using a 40 ka running mean. Fourth, the noise level of the δ 18 O record is reduced by applying a Butterworth low-pass filter of order 4, with a cutoff frequency of 0.02, corresponding to a period of 50 a.

Approximating the GLE in practice
It has recently been shown in general terms how to approximate the GLE (2) by a set of SDEs that is relatively easy to derive from the observables x (Kondrashov et al., 2015). Their MSMs both generalize EMR models and provide the correct time-continuous limit of such models. Based on their work, we directly derive here an approximation of Eq. (2) in terms of Eq. (3).
Compared to the GLE (2), the non-Markovian term with general kernel G is discretized and replaced by a sequence of Dirac kernels to obtain the second term on the right-hand side of Eq. (3), while the cubic term D(x, x, x) guarantees the stability of the solutions. Essentially, the proposed model is thus a dynamical system with a two-dimensional double-well potential that accounts for two alternative stable states. The additional memory term modulates the transitions between the two wells, which are stochastically forced by Gaussian white noise.
In practical applications, the SDE (3) is approximated by the following system of k coupled, discrete difference equations with delays: Here x n ∈ R k denotes the k-component observed variable x(t) at time t = t n , with 1 ≤ n ≤ N, and the χ n ∈ R k denote k-dimensional, independent white-noise increments. Note that for modeling the NGRIP records, we have k = 2 variables given by the δ 18 O and dust measurements.
In contrast to all other model parameters, the values for d and τ used in the linear memory part in Eq. (4) are not varied in the parameter optimization: in accordance with the MZ formalism, choosing d = 2 suffices to obtain residuals that are sufficiently uncorrelated with the observations x. Optimal values of the memory step width τ are chosen in an outer loop so as to have the averaged PDFs of the modelsimulated time series, along with the average waiting time between subsequent transitions, as close as possible to those of the observed ones. A choice of τ = 12 time steps, corresponding to 60 a, will provide optimal approximations of these statistical characteristics; see The explicit coupled SDE system governing our stochastic-dynamic model is hence given by s 11 x n−sτ + B s 12 y n−sτ + C 11 x 2 n + C 12 x n y n + C 13 y 2 n + D 11 x 3 n + D 12 x 2 n y n + D 13 x n y 2 n + D 14 y 3 n δt n x 2 n y n + D 23 x n y 2 n + D 24 y 3 n δt n where x n = (x n , y n ) and τ = 60 a. In total, there are thus 31 parameters to be estimated. Note that the total set of model parameters includes the standard deviations of the noise residuals, as well as their correlation.

Maximum likelihood estimation
Recently, Chorin and Lu (2015) introduced a general methodological framework for discrete stochastic parameterizations. In principle, an optimal parameter set * for the forms A, B, C and D of order 0-3 can be determined by regressing the right-hand side of Eq. (4) onto the observed increments δx n = x n+1 − x n , e.g., by ordinary least-squares (LSQ) optimization: * = arg min where F denotes the operator corresponding to the righthand side of Eq. (4), dropping the noise term and using a parameter combination .
Following, e.g., Kwasniok (2013), Chorin and Lu (2015), Krumscheid et al. (2015), Mitsui and Crucifix (2017), we rely here on Bayesian parameter inference for reduced stochastic models. For the present modeling task, we propose the Gaussian likelihood function here is the covariance matrix of the noise, estimated from the residuals of the least-squares optimization. Note that the functional form of the likelihood function in Eq. (7) assumes that the residuals are normally and independently distributed. The MZ formalism only ensures that there exists a GLE with noise forcing that is uncorrelated with x. The additional assumption that the noise is white in time is not theoretically guaranteed. Therefore, one has to check empirically how well Gaussian white noise approximates the residual. Given that the stochastic difference equation (Eq. 4) only approximates the theoretical GLE provided by the MZ formalism, one also needs to validate empirically that the residuals are uncorrelated with the observations x (Kondrashov et al., 2015).
We remark that the approximation of the divided differences δx/δt by the function F is in fact a multivariate multiple linear regression, with regressors chosen to be the polynomials in Eq. (5). It can easily be seen that MLE and LSQ are equivalent for the case of univariate data and Gaussian errors, since the same term δx n /δt n − F is minimized in Eqs. (6) and (7). In fact, subject to the assumption of uncorrelated -but not necessarily Gaussian -errors with zero mean and identical variance, the Gauss-Markov theorem ensures that parameters obtained from LSQ, i.e., from Eq. (6), are the best linear unbiased estimators for such a regression. These estimators are best in the sense that they have the lowest variance.
On the other hand, the MLE is asymptotically optimal. In particular, it is efficient in the sense that the variance of the parameter estimates achieves the so-called Cramér-Rao lower bound, which is optimal, as the number of samples tends to infinity (Andersen, 1970). Srivastava (1965) extended the Gauss-Markov theorem to the multivariate case, where correlations among the variables lead to correlations between the error terms of the distinct variables and thus result in cross-terms in the exponent of Eq. (7). Although both optimization approaches are thus equivalent in the case at hand, the MLE approach has the advantage that the parameter estimates can be interpreted as the most likely ones, given the observed data.

Results
As a training set for the parameter optimization of our stochastic-dynamic model in Eq. (5), we choose the time interval 59-22 ka b2k; this interval roughly coincides with Marine Isotope Stage 3 (approximately 60-28 ka b2k). Our choice results in N = 7529 data points for each time series. The reason for this choice is that the layer-counted chronology has only been carried out until 59 ka b2k and that the co-variability between δ 18 O and log(dust) is substantially reduced for the more recent part of the record, as already noted by Ruth et al. (2002Ruth et al. ( , 2003 and apparent in Fig. 1c here. We use the natural logarithm of the dust time series (i) because of the large range of dust concentration values and (ii) because it has high co-variability with δ 18 O (cf. Fig. 1c); it also guarantees that the simulated dust values are positive. This logarithmic scale requires, however, high accuracy in the modeling of the dust concentrations to resolve the multiplicity of abrupt variations that span several orders of magnitude; cf. Fig. 1b.
Our results indicate that for d = 2 memory steps, the residuals of the least-squares optimization Eq. (6) are indeed uncorrelated with the observations x -the correlations are less than 10 −8 -and they are approximately Gaussian distributed (Fig. A1), while their autocorrelations decay very fast (not shown). These tests empirically support our choice of a Gaussian likelihood function of the form Eq. (7) for the MLE. Furthermore, these results allow us to integrate the stochastic-dynamic model given in Eq. (5) with an Euler-Maruyama scheme.
The specific values of the coefficients of Eq. (5), as derived via MLE, are given in Table A1. Note, in particular, the crucial nonlinear and lagged cross-interaction terms. For the reasons explained above, the parameters that minimize the least-squares problem Eq. (6) are identical to the most likely parameter values as determined via MLE up to numerical precision, i.e., with a relative error less than 10 −5 .
In order to simulate optimal sample time series for the δ 18 O and log(dust) variables, the parameter combination * that maximizes the likelihood function (Eq. 7) is used. The model is then integrated by the Euler-Maruyama scheme with a uniform step size of δt = 10 −5 . Compared to the observational data, this step size corresponds to the sampling size of 5 a. Essentially, we have thus rescaled the time unit in order to guarantee a stable numerical integration. The stochastic forcing is given by two-dimensional Gaussian white-noise increments multiplied by the Cholesky matrix Q. The residuals of the δ 18 O and log(dust) are correlated with a Pearson's correlation coefficient r P = −0.13, resulting in a non-diagonal covariance matrix = QQ T .
Illustrative time series of δ 18 O and dust, simulated using the most likely model parameter combinations, are shown in Fig. 2. In Fig. 3a and b, the dashed lines show averages and uncertainties in PDFs of 1000 time series simulated using the most likely parameter combinations. Figure 3c and d show the average spectral densities of the latter time series. In this case, error bars would not be visible and are therefore omitted.
The means and standard deviations of the observed time series are reproduced well by the simulations: for the preprocessed δ 18 O, the mean and standard deviation equal −41.79 ± 1.72 compared to −41.82 ± 1.73 for the simulations; the corresponding preprocessed and simulated values for log(dust) are 11.98 ± 0.97 and 12.00 ± 0.92. For the sim-ulations, these values are computed as averages over 1000 simulated sample time series, obtained using the parameter combinations that maximize the likelihood function Eq. (7).
The simulated time series (Fig. 2) exhibit abrupt changes that resemble the so-called Dansgaard-Oeschger events, which mark the sharp transitions from colder stadials to warmer interstadials (Dansgaard et al., 1993;Johnsen et al., 2001b;Rasmussen et al., 2014) in the original time series (Fig. 1). Given the more gradual temperature changes from interstadials to stadials, the red curve in Fig. 2a has a dominant sawtooth-shape pattern. Recall that time here, as in Fig. 1, runs from right to left, as it does in Fig. 3a and c of (Krumscheid et al., 2015), which also display a high qualitative resemblance between their model-simulated and observed δ 18 O time series.
The observed time series are, due to the sawtooth-shaped transitions between stadials and interstadials, not symmetric under time reversal. A quantitative measure of the timereversal asymmetry is provided by the third-order moment (Kwasniok, 2013):   Note how the bimodality, corresponding to the transitions between stadials and interstadials, only arises after applying a low-pass filter to the time series, leading to the preprocessed time series. (b) PDFs for the raw, preprocessed, and simulated log(dust) time series. PDFs are obtained as averages over 1000 simulated time series, each obtained from the most likely model parameters (solid red). Error bars indicate the 2σ range of uncertainties derived from theses 1000 sample time series. (c) Power spectral densities for the interpolated, preprocessed (i.e., interpolated and smoothed) and simulated δ 18 O. (d) Power spectral densities (PSDs) for the preprocessed and simulated log(dust). For the simulated time series, Gaussian white noise with the same standard deviation as the residual of the LSQ (Eq. 6) was added before computing the PSD. The PSDs are estimated using the multi-taper method (Vautard et al., 1992), with a total of seven tapers. The PSD shown for the simulations is an average over 1000 simulated time series and is therefore strongly smoothed. For the spectra, the uncertainties would be hardly visible and are therefore omitted. what similar behavior to the M(θ ) computed from the observed dust. Quantitatively, the similarity between M(θ) computed for the observed and simulated dust is also supported by Kendall's τ value (red curve in Fig. 5d). The temporal asymmetry is, however, not reproduced by the δ 18 O simulations (Fig. 4a), as also confirmed by Kendall's τ (blue curve in Fig. 5d).
Given the very strong correlations r P between the two variables, with r P ≈ 0.9 for both observations and simulations, the discrepancies between the resulting M(θ ) given δ 18 O or dust data are rather surprising. These discrepancies suggest that M(θ ) might not be the best measure for quantifying whether sawtooth-shaped oscillations are present or not.
Following Krumscheid et al. (2015), we define the average waiting time τ DO between Dansgaard-Oeschger events as the sum of the average residence times in stadials and interstadials. For this purpose, stadials and interstadials are determined as intervals for which the time series are, respectively, below or above the mean of the series. Due to their comparably high noise level, the high-resolution time series employed here are further smoothed by singular spectrum analysis (SSA) (Vautard et al., 1992;Ghil et al., 2002) using a window size of 500 a and keeping only the five leading reconstructed components. For the observed δ 18 O and log(dust), we obtain τ DO = 1506 and 1744 a, respectively, compared to τ DO = 1370 ± 224 a and 1559 ± 346 a for the simulations, computed as averages over 1000 simulated time series. Note that this definition of waiting times may not be optimal in view of the high noise level of the time series, leading to different values for δ 18 O and log(dust). This definition is nevertheless employed here for the purpose of facilitating comparison with the results of Krumscheid et al. (2015).
For the dust concentrations (Fig. 2b), there is a striking similarity between the observed and simulated time series in terms of episodes with low dust concentrations of variable durations, as well as the presence of burst episodes of variable magnitudes. Over the training interval, the preprocessed time series are correlated at r P = −0.84, which equals the average correlation between the simulated time series.
The PDFs of the simulated δ 18 O and log(dust) (red solid lines), obtained as averages over 1000 time series, are quite similar in shape to those of the preprocessed time series (blue solid lines) of both observed variables; see Fig. 3a and b. In particular, the bimodality of the PDFs, which reflects the relative persistence of stadials and interstadials, is reproduced by our inverse model. Uncertainties in the PDFs are derived on the basis of ensembles of 1000 simulated time series, produced by sampling from the most likely parameter combination. Error bars on the PDFs in Fig. 3a and b reflect the 2σ range of these ensembles.
The spectra of both the δ 18 O and the log(dust) are well reproduced by our model for the entire frequency range, including sub-centennial periodicities; see the black and red curves in Fig. 3c.

Discussion
We studied the δ 18 O and dust time series obtained from the high-resolution NGRIP record for the 37 000 a interval 59-22 ka b2k. The results described above show that the statistical properties of these time series -such as their PDFs, spectra and average waiting times -can be approximated quite well by the proposed stochastic-dynamic model of Eq. (3).
The main features of our inverse model are (i) the nonlinear terms in the Markovian part, (ii) the inclusion of non-Markovian memory terms and (iii) the coupling terms between the two time series. Cubic terms have previously been used to model the δ 18 O time series of the NGRIP record (Ditlevsen et al., 2007;Kwasniok, 2013;Krumscheid et al., 2015). Non-Markovian contributions in an SDE model have, so far, only been considered in very few paleoclimatic studies. Pelletier (2003) studied the influence of delayed ice volume feedbacks on glacial-interglacial variability. In a deterministic setting, Rial (2004) used a forced logistic delay differential equation model to study DO cycles as well as glacial-interglacial transitions, and Berger (1999) discussed memory contributions in the context of the 100 ka cycle and its association with Milankovitch forcing. We are not aware of modeling efforts that take advantage of the co-variability between the δ 18 O and log(dust) variables. It is shown in the following that the coupling terms between the two variables are crucial in order to capture the statistical characteristics of the measured NGRIP time series presented above, while the non-Markovian contribution is also significant.
The nonlinear terms can be physically motivated by the fact that the observed time series oscillate between two quasiequilibria, namely the stadials and interstadials. If only linear terms were used, the bimodality of the observed time series, and hence the existence of two quasi-stable states, could not be reproduced (see Figs. A2 and A3).
The purely Markovian form of the model approximates the PDFs of the observed time series less well (Figs. 5c and A5). In particular, the bimodality of the PDFs for both δ 18 O and log(dust) is weaker in this case, indicating that the memory terms do contribute to an appropriate modeling of the persistence in the stadials and interstadials, as well as of the transitions between them. In addition, the average waiting times between DO-like transitions are τ DO = 1404 a for the δ 18 O and 1254 a for log(dust), and they are thus too short for the log(dust) time series, in particular.
When removing all coupling terms between the two variables from the inverse-model Eq. (4), sawtooth-shaped transitions are completely absent from the simulated time series (Fig. A6). Furthermore, the bimodality of the PDFs is missed when excluding the couplings (Fig. A7), and the average waiting times are much too short, namely τ DO = 1208 a for the δ 18 O record and 867 a for the log(dust) record.
Following previous authors (Kwasniok, 2013;Krumscheid et al., 2015;Mitsui and Crucifix, 2017), we also compare the different model versions in terms of the Bayesian (BIC)  Table 1. Sample-size-corrected Akaike information criteria (AICc) and Bayesian information criteria (BIC) for the different model versions. Note that the model parameters include the standard deviations and correlation that appear in the respective model's noise term. For the models with memory, we chose d = 2 memory steps with step size τ = 12, corresponding to 60 a. and sample-size-corrected Akaike (AICc) information criteria; these are defined as AICc = 2pn/(n − p − 1) − 2 log(L * ) and BIC = p log(n) − 2 log(L * ). Here, p denotes the number of model parameters, n = N − 1 = 7528 the total number of data points (a total number of data points N leads to N −1 increments to be approximated) and L * the maximum value of the likelihood function Eq. (7). The lower the value of AICc or BIC for a given model, the higher the relative confidence in that model. For the case at hand, both AICc and BIC consistently favor the full model, which includes nonlinear, memory and coupling terms. This is followed by the linear coupled model with memory terms, the nonlinear coupled model without memory terms, the nonlinear model with memory but without coupling terms, and finally the linear model without memory terms (Table 1).
Note that the AICc penalizes higher numbers p of model parameters less strongly than the BIC. Although the AICc was found, under certain conditions, to be optimal in selecting model candidates (e.g., Burnham and Anderson, 2002;Yang, 2005), notable counterexamples are known (e.g., Penland et al., 1991).
We thus suggest interpreting the values in Table 1 with caution. For example, in the case at hand both AICc and BIC propose higher confidence in the linear model than in the model without memory terms. This would suggest that the memory terms are more important than the higher-order parameters that correspond to the double-well shape of the potential. However, the PDFs of the cubic model without memory terms are considerably closer to the observed PDFs than the PDFs obtained from the linear model including memory terms. In particular, by construction, the latter model cannot reproduce the bimodality of the observed PDFs. We would thus still argue that the nonlinear contributions are more important than the memory terms. Nevertheless, the AICc and BIC values presented herein provide information-theoretic evidence that the inclusion of memory terms does substantially improve the model.
We emphasize that the full model proposed herein has the highest number of parameters out of the different candidates but is still the one with the lowest BIC and AIC. Therefore, it can be argued that this number of parameters is not too high, and it is not likely that the full model over-fits the observed data.
Furthermore, it should be noted that the values of AICc and BIC can only be compared on the basis of the same underlying data. Since we use a higher-resolution version of the NGRIP data as compared to the previous authors (Kwasniok, 2013;Krumscheid et al., 2015;Mitsui and Crucifix, 2017), the values for AICc and BIC presented here cannot be compared to the values reported in those studies.
As noted above, we chose for the memory step size τ = 60 a, corresponding to 12 time steps. In fact, the smallest values of both AICc and BIC are obtained at a considerably smaller step size of τ = 1 time step (5 a) (Fig. 5a). However, the approximation of the correct standard deviations, PDFs, third-order moments and average waiting times by the simulations is strongly improved for larger values of τ (Fig. 5b-e). Therefore, we suggest that a choice of τ = 60 a provides a good tradeoff between the AICc and BIC, on the one hand, and the statistical characteristics of the simulated time series, on the other. In particular, an accurate reproduction of the correct average waiting time between subsequent transitions from stadials to interstadials is only achieved for 55 a τ 60 a. It should be emphasized, moreover, that both the AICc and BIC criteria consistently favor the models with memory terms over the models without memory terms, regardless of the specific value of τ .
The model results presented here appear only in the highresolution version of the NGRIP ice core record, which was originally sampled every 5 cm, a depth sampling that yielded temporal step sizes between 1 and 7 a. For example, interpolating the raw data to a uniform grid with t = 10 a, instead of 5 a, leads to substantially less accurate approximations of the observed statistical properties of both the δ 18 O and dust time series (not shown). On the other hand, interpolating the raw data with a uniform sampling step of 3 a is problematic because of the original temporal step sizes, and does not further improve the results (not shown). It would thus appear that the 5 a uniform grid size is nearly optimal, given the irregularities in the sampling and the uncertainties in both the dating and the values of the records.

Conclusions
We have shown that a coupled, two-dimensional stochasticdynamic model with cubic drift term and linear delay terms is capable of reproducing the statistical properties of δ 18 O and dust time series derived from the high-resolution NGRIP record for the interval 59-22 ka b2k, which roughly corresponds to Marine Isotope Stage 3. These statistical properties are expressed in terms of the PDFs of the time series, their power spectra and the waiting times between sharp transitions from stadials to interstadials.
Key ingredients for an accurate simulation of the observed time series are as follows.
1. High-resolution time series have to be used as training data, indicating that the high-frequency variability present in the records plays a vital role for the overall evolution of the climate processes that generated the NGRIP ice core. Interpolation of the raw data, which is sampled at depth intervals of 5 cm in the core, to 5 a intervals in the preprocessed time series was found to be optimal in our inverse-model setup. This finding is qualitatively consistent with the assertion of Rypdal (2016) that an increase in decadal-scale variability may be a statistical precursor for the abrupt transitions from stadials to interstadials during Dansgaard-Oeschger events.
2. Cubic terms need to be included in the Markovian part of the model. This can be physically motivated by the presence of two quasi-equilibria in the observed time series -the stadials and interstadials -that could not be modeled without two such quasi-stable states in the underlying dynamical system. Cubic terms have already been considered in previous attempts to model the δ 18 O time series of the NGRIP record (Ditlevsen et al., 2007;Kwasniok, 2013;Krumscheid et al., 2015); these attempts, however, did not include the dust series, used lower-resolution data and did not consider memory effects.
3. Coupling terms between the δ 18 O and dust variables substantially improve the statistical characteristics of the simulated time series: the reproduction of the bimodality of the PDFs, as well as of the correct average waiting time between subsequent transitions from stadials to interstadials, are substantially improved when coupling terms are included.
4. Non-Markovian terms that account for memory effects are helpful. Their inclusion allows, to some extent, reproducing the time-reversal asymmetry of the dust time series. The main contribution of including memory terms into the model is, however, to improve the average simulated waiting times between subsequent transitions from stadials to interstadials (cf. Fig. 2e) for 55 a τ 60 a. The memory terms also help improve the PDFs of simulated time series, in particular for the δ 18 O time series. Furthermore, in general, the AICc and BIC criteria support the model versions that include memory terms.
Our results demonstrate that the statistical characteristics of the roughly 40 ka long, high-resolution NGRIP time series of δ 18 O and dust considered here can be reproduced by a nonlinear inverse model without taking into account exogenous forcing, whether astronomical, solar or volcanic. Based on our results alone, there would thus be no reason to assume that the temporal evolution of the δ 18 O ratios and dust concentrations -and hence that of the climatic variabilities they represent, in particular the DO transitions between stadials and interstadials -are externally forced.
We note, though, that Mitsui and Crucifix (2017) have recently found evidence that including orbital forcing in modeling the 20 a averaged NGRIP Ca 2+ time series improves their model's BIC score. This forcing -however important on longer, astronomical timescales -does not appear to be directly relevant for the millennial-scale DO transitions, according to the present results.
The predictive power of the proposed stochastic-dynamic model for the abrupt transitions from stadials to interstadials should be addressed in future work.
Data availability. The high-resolution NGRIP data used in this study are available online at http://www.icecores.dk.

Appendix A: Key ideas in deriving the GLE
The approach to data-driven stochastic-dynamic modeling taken here is rooted in the MZ formalism of statistical mechanics (Zwanzig, 1964;Mori, 1965;Chorin et al., 2002;Chorin and Hald, 2013), which proposes an integrodifferential closed form for model inference from partial data. While the derivation of such an MZ model does not allow one to easily simulate its solutions, it is possible, under suitable hypotheses, to obtain a good approximation thereof by a finite number of coupled SDEs (Chekroun et al., 2011;Kondrashov et al., 2015).
Assume that z is a high-dimensional state vector, whose temporal evolution is governed by the following system of ordinary differential equations, which is not explicitly known: where R denotes the set of real numbers. Assume furthermore that z can be decomposed into a sum of an observed vector x and an unobserved vector y, i.e., z = x + y. By orthogonally projecting Eq. (A1) onto the subspace spanned by the observed variables x, via P z = x, we obtain which depends on the unobserved variable y. By introducing the averaging where µ x denotes the probability distribution of y conditioned on x, we obtain P F(x + y) = P F(x) averaged part The parameterization of the fluctuating part in Eq. (A4) is at the core of any stochastic parameterization method -whether linear (Penland and Sardeshmukh, 1995) or nonlinear (Majda et al., 2001) -as well as of the MZ formalism. Ergodic-type arguments show that the averaged part can in principle be learned from a time series, assuming the existence of a "nice" invariant measure (Chekroun et al., 2011;Kondrashov et al., 2015): which holds for almost all y 0 with respect to the Lebesgue measure; see Lemma 4.1 in Kondrashov et al. (2015) for further details. x n B 0 11 = −3.16 × 10 5 B 0 21 = 9.03 × 10 5 y n B 0 12 = 8.01 × 10 5 B 0 22 = −3.69 × 10 6 x n−τ B 1 11 = 1.26 × 10 3 B 1 21 = 9.03 × 10 2 y n−τ B 1 12 = 5.49 × 10 3 B 1 22 = 1.38 × 10 4 x n−2τ B 2 11 = −3.89 × 10 1 B 2 21 = 2.31 × 10 3 y n−2τ B 2 12 = 9.19 × 10 2 B 2 22 = 6.30 × 10 3 x 2 n C 11 = −5.09 × 10 3 C 21 = 1.38 × 10 4 x n y n C 12 = 1.68 × 10 4 C 22 = −6.51 × 10 4 y 2 n C 13 = −3.97 × 10 4 C 23 = 1.92 × 10 5 Note that Eq. (3) can be approximated by a Markovian SDE by using the Galerkin approximation techniques of Chekroun et al. (2016) (their Appendix C and Remark 5.1). In that respect, Eq. (3) can be put within an SDE format consistent with, although different from, the MSMs discussed by Kondrashov et al. (2015).  Figure A1. Probability density of the least-squares residuals for δ 18 O (blue) and log(dust) (red). The residuals for the log(dust) can be approximated very well by a Gaussian, but for the δ 18 O the approximation is also sufficiently good to justify the choice of a Gaussian likelihood function for the maximum likelihood estimation.   Figure A3. Same as Fig. 3a and b in the main text but for the model without nonlinear terms.  Figure A5. Same as Fig. 3a and b in the main text but for the model without memory terms.  Figure A6. Same as Fig. 2 in the main text but for the model without coupling terms. Note that a coevolution of the observed δ 18 O and log(dust) time series can in this case not be expected to be reproduced. www.earth-syst-dynam.net/8/1171/2017/