The impact of structural error on parameter constraint in a climate model

Doug McNeall1, Jonny Williams2,4, Ben Booth1, Richard Betts1, Peter Challenor3, Andy Wiltshire1, and David Sexton1
1Met Office Hadley Centre, FitzRoy Road, Exeter, EX1 3PB, UK
2BRIDGE, School of Geographical Sciences, University of Bristol, Bristol, BS8 1SS, UK
3University of Exeter, North Park Road, Exeter, EX4 4QE, UK
4Now at NIWA, 301 Evans Bay Parade, Hataitai, Wellington 6021, New Zealand
Correspondence to: Doug McNeall (doug.mcneall@metoffice.gov.uk)

point out that strong prior information is required to distinguish between parameter uncertainty and discrepancy, and that this information is often lacking. Further, even inadequate (as opposed to outright wrong) specification of a simulator discrepancy can lead to overconfidence and bias in parameters and predictions.

Calibration of land surface components
Parametric uncertainty in the land surface and carbon cycle components of models is expected to represent a large fraction of current uncertainty in future climate projections (Booth et al., 2012, 2013; Huntingford et al., 2009). These components have been introduced into climate simulators more recently, and have not yet been subjected to the same depth of systematic evaluation as, for example, atmospheric components. There is therefore much focus on identifying parameter sets consistent with observed climate metrics, and on reducing future land carbon cycle uncertainty by identifying parts of simulator parameter space inconsistent with observed properties of the real climate system. Using statistical and data assimilation approaches to constrain land surface simulator process parameters extends back at least to Sellers et al. (1996). Recent examples are community efforts to develop a systematic set of observations to benchmark land surface processes against metrics of real world processes, for example the International Land Model Benchmarking Project (Luo et al., 2012), and PALS (Abramowitz, 2012). Such benchmarks use an extensive set of metrics, covering a broad cross-section of simulator processes, enabling an assessment of overall skill and highlighting areas where simulators fall short. They provide a framework to assess improvements in skill arising from continual simulator development, as well as prioritising resources towards processes that are less well simulated. Using many observed metrics for diverse processes also discourages overtuning to a particular process, to the detriment of wider simulator performance. One limitation of the benchmarking approach is that there is limited understanding of what information a given observed metric implies about the simulator formulation or parameters, or what this might imply about future projected changes.

Paper aims and outline
We aim to identify parameter sets of the land surface module of the climate simulator FAMOUS where simulator output and observations of forest fraction are consistent to an acceptable degree. An initial attempt using history matching suggests that FAMOUS is unable to simulate the Amazon forest and other forests simultaneously at any set of parameters within the experiment design. We argue that this is due to a fundamental simulator discrepancy, which has implications for constraining the input parameters of FAMOUS. We use a number of techniques to characterise and find the drivers of this structural error, before performing a second history match with an appropriate discrepancy function.
In Sect. 2 we describe the ensemble of the climate simulator, and the emulator and history matching techniques used to explore simulator discrepancy in Sect. 2.5 and 2.6 respectively. We perform an initial history matching exercise in Sect. 3.1. We use the emulator to quantify relationships between the simulated forest fraction and a set of simulator input parameters in a sensitivity analysis in Sect. 3.2. Next, we measure the performance of the ensemble in simulating forest fraction in Sect. 3.3. We see how much input space would be ruled out as implausible in various scenarios of data combination and uncertainty budget in Sect. 3.4, and we learn what each individual observation tells us about input space in Sect. 3.5. In Sect. 3.6, we use the emulator and an implausibility measure to find the nominal "best" set of parameters for each forest, and project the consequences of using those parameters on the other forests. Finally, we perform a history matching exercise with a credible discrepancy function to constrain input parameters in Sect. 3.7. In Sect. 4, we discuss the consequences of our findings for simulators of the Amazon rainforest, before offering conclusions in Sect. 5.

The FAMOUS climate simulator
We use a pre-existing ensemble of the climate simulator FAMOUS throughout this study. The Fast Met Office UK Universities Simulator FAMOUS (Jones et al., 2005; Smith et al., 2008) is a reduced resolution climate simulator, based on, and tuned to replicate, the climate model HadCM3 (Gordon et al., 2000; Pope et al., 2000). Computational efficiency is gained primarily through reduced resolution. Atmospheric grid boxes are four times the size of those in HadCM3, and ocean grid boxes are also larger. There are fewer levels in the atmosphere (11 compared to 19), and the ocean timestep is 12 hours compared to 1 hour for HadCM3. In the atmosphere, the timestep is 1 hour, doubled from HadCM3. The dynamic vegetation component is called TRIFFID and is described in detail in Cox (2001). FAMOUS runs approximately ten times faster than HadCM3, making it ideal for running large ensembles, or long integrations, with modest supercomputing facilities.
Smith (2012) describes improvements to FAMOUS in sea ice, ozone, hydrological cycle conservation and upper tropospheric dynamics. Williams et al. (2013) describe the inclusion of the carbon cycle in the simulator via perturbed physics ensembles of terrestrial and ocean parameters, of which the terrestrial ensemble is studied in this paper. Most recently, Williams et al. (2014) give details of the inclusion of a scheme to simulate the cycling of oxygen in the ocean and its coupling with the carbon cycle. The version of the simulator used here is that of Williams et al. (2013), which introduces surface tiling in the newer MOSES2 scheme. Five different vegetation types are simulated: broadleaf and needleleaf trees, C3 and C4 grasses, and shrubs, each with a fractional coverage in a gridbox. Several surface types represent the absence of vegetation: bare soil, land ice, urbanised land use and inland water. Williams et al. (2013) describe the optimisation of carbon cycle parameters in the terrestrial and ocean domains, validated against observations and reanalysis products, and present climatologies using both fixed and dynamic vegetation.

FAMOUS shows a northern-hemisphere-winter surface air temperature cold bias with respect to HadCM3, and also overestimates the fractions of needleleaf trees in North America and of C3 grassland in northern Eurasia. The initial version of FAMOUS used the MOSES1 surface exchange scheme, and did not explicitly include any vegetation cover, instead using gridbox averages of surface quantities such as root depth, surface albedo and roughness length to describe momentum and water exchange between the surface and the atmosphere. Biases were already present in climate

The ensemble
We use an ensemble of 100 simulations of FAMOUS detailed in Williams et al. (2013), and build upon the results of that study. The ensemble was run to test the utility of including the carbon cycle in FAMOUS.

The ensemble design perturbs 7 vegetation and land surface control parameters (see Table 1) in a Latin hypercube configuration (McKay et al., 1979). This kind of design efficiently spans parameter space, and is commonly used for constructing response-surface-type statistical models known as emulators (see e.g. Urban and Fricker, 2010).
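A Latin hypercube design of this kind can be sketched in a few lines. The parameter names below are the seven perturbed in the ensemble, but the bounds here are purely illustrative placeholders, not the ranges of the actual design (which are given in Table 1):

```python
import numpy as np
from scipy.stats import qmc

# The 7 perturbed land surface parameters; bounds are illustrative only.
param_bounds = {
    "F0": (0.5, 1.0),
    "LAI_MIN": (1.0, 10.0),
    "NL0": (0.03, 0.1),
    "Q10": (1.5, 4.0),
    "R_GROW": (0.15, 0.45),
    "TUPP": (32.0, 45.0),
    "V_CRIT_ALPHA": (0.0, 1.0),
}

sampler = qmc.LatinHypercube(d=len(param_bounds), seed=42)
unit_design = sampler.random(n=100)            # 100 members on [0, 1]^7
lows = np.array([b[0] for b in param_bounds.values()])
highs = np.array([b[1] for b in param_bounds.values()])
design = qmc.scale(unit_design, lows, highs)   # map to parameter ranges
```

Each of the 100 members falls in a distinct stratum of each parameter's range, which is what gives the design its space-filling efficiency.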
This design builds upon a previous ensemble run by Gregoire et al. (2010), and implicitly contains a further parameter, β, that indexes into that other ensemble. The β parameter indexes the top 10 performing simulations with regard to the atmospheric climate. The β parameter is uncorrelated with the land surface parameters and the simulator output, so we exclude it from the ensemble design, essentially treating it as a nuisance parameter.
Ranges for the land surface parameters follow those used in the study by Booth et al. (2012) and, as that paper makes clear, were chosen for a number of reasons, not necessarily to represent plausible ranges of their uncertainty. However, we are confident that the parameter ranges are wide enough to span the space which might a priori be considered reasonable.

The ensemble simulates the preindustrial climate, with ensemble members spun up over a 200 year period to ensure that the vegetation is in equilibrium with the climate at 290 ppm of CO2. The vegetation dynamics component of the simulator, TRIFFID, is run in "fast spin-up" mode, for the equivalent of 10,000 years for each decade of climate simulation, to allow for the long adjustment time of dynamic vegetation. The climatology is constructed using the final 30 year period of the ensemble.

Simulator outputs and observations
We compare simulated forest fraction against observations adapted from Loveland et al. (2000), consisting of regionally aggregated versions of the data used in the previous study by Williams et al. (2013). We use broadleaf only for the tropical forest, and a mixture of broadleaf and needleleaf for the North American forest. A spatial summary of the ensemble and observations can be found in Fig. 1. Figure 2 shows every input and summary output, plotted against each other. This shows the marginal relationships of 1) the inputs against the inputs (which, as expected, show no obvious relationship), 2) the inputs against the outputs, showing the strength of each marginal relationship, and 3) the outputs against the outputs, which highlights where outputs vary together. Parameter ranges do not represent uncertainty, so the ensemble mean and standard deviation are not a meaningful representation of data uncertainty, but they provide a useful summary of the data. To summarise the forest fraction data, we find the mean forest fraction in each of the Amazon, Central African, South East Asian, North American and global regions (see supplementary material Fig. S1 for region details).
South East Asian and Central African forests vary together very strongly across the ensemble, whereas the Central African and North American forests show a weaker relationship. The latter might be expected, given the different structure of the North American forests compared with the tropical forests. The scatter plot also identifies NL0 (leaf nitrogen) and V_CRIT_ALPHA (soil moisture control on photosynthesis) as important controls on forest fraction, as the output seems to vary most with these parameters.

Training an emulator
FAMOUS is not fast enough to run at every point within input space required for our analyses. We therefore use a computationally cheap statistical proxy for the simulator, called an emulator. The emulator is a non-parametric regression model conditioned on the ensemble, providing a prediction of simulator output and corresponding uncertainty orders of magnitude faster than the original simulator. Once trained, any analysis that might have been done with the simulator can be done with the emulator, provided we include the extra uncertainty term to account for the fact that the emulator is not a perfect prediction of the simulator output. A useful introduction to emulators and their uses can be found in O'Hagan (2006), and recent developments in emulator use in climate studies can be found, for example, in Tran et al. (2016) and Bounceur et al. (2015).
We use a Gaussian process emulator that assumes zero uncertainty at points where the simulator has already been run, with uncertainty growing larger away from those points. We treat the output g(x) of the simulator FAMOUS as a deterministic function of a vector of input parameters x. We train a number of emulators of the ensemble, the details of each depending on the application.
Details of the emulator, training and verification can be found in the supplementary material.
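The interpolating behaviour described above can be sketched with a minimal Gaussian process in plain NumPy. The kernel, its fixed hyperparameters and the toy two-input function standing in for FAMOUS are all illustrative assumptions; the real emulator's covariance choices and training are in the supplement:

```python
import numpy as np

def rbf_kernel(A, B, length=0.3, var=1.0):
    """Squared-exponential covariance between two sets of input points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / length**2)

def gp_predict(X_train, y_train, X_new, jitter=1e-6):
    """Posterior mean and variance of a zero-mean GP conditioned on runs."""
    K = rbf_kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    Ks = rbf_kernel(X_new, X_train)
    Kss = rbf_kernel(X_new, X_new)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.clip(np.diag(cov), 0.0, None)

# Toy "simulator": a smooth function of two inputs (illustrative only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(25, 2))          # a small "ensemble design"
y = np.tanh(3 * X[:, 0]) * X[:, 1]
y_mean = y.mean()

# Uncertainty is near zero at an ensemble member, large far from the design
mean_at_run, var_at_run = gp_predict(X, y - y_mean, X[:1])
mean_far, var_far = gp_predict(X, y - y_mean, np.array([[2.0, 2.0]]))
```

At a design point the emulator reproduces the training output with near-zero variance; far from the design the predictive variance reverts towards the prior.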

History matching
After Williamson et al. (2014), we use history matching to find a region of parameter space consistent with observations to within the level of observational and acceptable simulator uncertainty. This requires finding a set of input parameters where the output of the simulator is tolerably close to the observations, given uncertainty in the observations and known deficiencies of the simulator. Constraining parameters in this way helps identify the range of projected futures of the forest consistent with the observations, rather than a single set of "best" parameters.
In contrast to simulator calibration, where a probability distribution over the parameters is estimated, history matching rejects inputs inconsistent with observations, and otherwise classifies them "Not Ruled Out Yet" (NROY). We regard NROY inputs as conditionally accepted, contingent on new observations or information. History matching was developed by Craig et al. (1997), and has been used extensively in hydrocarbon extraction sciences, and astronomy (e.g. Vernon et al. (2010)).
Sometimes termed precalibration, it has been used to confront climate simulators with observations, for example by Lee et al. Observations of the system are denoted z, and we assume that they are made with uncorrelated and independent errors ε, such that z = y + ε, where y represents the true state of the climate being observed. Denoting the "best" set of input parameters x*, and assuming the simulator contains a systematic structural error δ, the observations are related to the simulator output at the best inputs by

z = g(x*) + δ + ε.     (1)

We could find the NROY region for x* by running the simulator at a large number of candidate points in a Monte Carlo fashion. FAMOUS is not fast enough for this, and it is also our intention to develop methods that can be used on even more computationally expensive simulators. We therefore use the emulator as a proxy for the simulator output, replacing g(x) with η(x) in Eq. (1), and including a term for emulator uncertainty in the history matching calculations.
Each candidate point is assigned an implausibility I, according to the emulated forest fraction and its uncertainty:

I(x) = |z − E[η(x)]| / sqrt( Var(ε) + Var(δ) + Var(η(x)) ).     (2)

Inputs that produce forest fraction further from the observations are deemed more implausible. Those same inputs are less implausible if there is greater uncertainty about the observation, the simulator discrepancy, or the emulated output at that input. A threshold above which a candidate input can be safely ruled out as implausible is usually set at 3, roughly equivalent to a 95% credible interval of a posterior distribution in a Bayesian analysis. This follows from Pukelsheim's three-sigma rule: for any unimodal distribution, 95% of the probability mass is contained within 3 standard deviations of the mean (Pukelsheim, 1994). Input parameter sets with an implausibility score below the threshold are designated NROY and retained for further analysis. This does not necessarily mean the input settings are good, merely that evidence from observations is not yet sufficient to rule them out as implausible. Inputs may be ruled out as more observations or simulator runs become available.
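As a concrete sketch of Eq. (2) and the NROY cut-off, with purely illustrative numbers chosen only to mirror the scale of the forest fraction values discussed later:

```python
import numpy as np

def implausibility(z, em_mean, em_var, disc_var, obs_var):
    """Implausibility of Eq. (2): standardised distance between the
    observation z and the emulated output, with the total variance
    pooling observational, discrepancy and emulator uncertainty."""
    return np.abs(z - em_mean) / np.sqrt(em_var + disc_var + obs_var)

# Illustrative values: observed fraction 0.6; two candidate inputs whose
# emulated outputs are 0.30 and 0.55; obs sd 0.05; emulator sd 0.02.
I_far = implausibility(0.6, 0.30, 0.02**2, 0.0, 0.05**2)
I_near = implausibility(0.6, 0.55, 0.02**2, 0.0, 0.05**2)
ruled_out = I_far >= 3      # far candidate rejected as implausible
nroy = I_near < 3           # near candidate is Not Ruled Out Yet
```

Note that inflating any of the three variance terms shrinks I, so a generous uncertainty budget retains more of input space.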

An initial history match
In this section we find regions of land surface parameter space in FAMOUS that remain NROY given some defensible assumptions about observational uncertainty. Figure 3 shows how the regionally aggregated simulated forest fraction varies across the ensemble, compared with the corresponding observations. Although the simulator was not run at the "standard" parameter settings in the ensemble, we can use the emulator to estimate its output and uncertainty (±1 standard deviation) at those settings, and show these on the plot, in black.
The simulator run at the standard inputs significantly underestimates the forest fraction in the Amazon region, by more than 0.3 at the best estimate. The other tropical forests are slightly overestimated, and North American forests are very slightly underestimated.
Global forest fraction is simulated close to the observed fraction. Most ensemble members overestimate forest fraction in Central Africa, Southeast Asia, and North America. Some ensemble members simulate an Amazon forest fraction around, and above, the observed fraction. This gives us cause to hope that it is possible to find a set of parameters where the Amazon and other forests are simultaneously well simulated, without using a simulator discrepancy function.
We aim to find regions of parameter space where simulator error is removed, or minimised to a level consistent with observational uncertainty. In practice, this requires finding a region where the large negative bias in Amazon forest fraction is minimised while keeping the other forests well represented.
On the advice of domain experts, we assume observational uncertainty of 0.05 (one standard deviation) in the Amazon, Central African, South East Asian and North American forests as broadly representative, or at least usefully illustrative.
This corresponds to an expectation that the true 95% confidence interval is contained within the interval of ±0.15, following Pukelsheim's rule. This is nearly a third of the available range of zero to one, and it would be hard to argue that this represents an over-constraint.
We sample uniformly across input parameter space and run the emulator at these locations. We history match the samples using all four individual forest observations, and visualise the space where max[I] < 3. Figure 4 shows a density pairs plot of the approximately 12% of the 10,000 samples from the emulator that are Not Ruled Out Yet by the history match.
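This sampling-and-thresholding workflow can be sketched as follows, with a toy emulator mean standing in for the trained Gaussian process (the four output functions, observations and uncertainty values are all illustrative assumptions, not the FAMOUS quantities):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10_000, 2))    # uniform samples of input space

# Toy emulated means for four regional forest fractions (illustrative)
em_mean = np.column_stack([
    X[:, 0] * X[:, 1],
    0.8 * X[:, 0],
    0.7 * X[:, 1],
    0.5 * (X[:, 0] + X[:, 1]),
])
em_sd = 0.02                               # assumed emulator sd
obs = np.array([0.6, 0.5, 0.45, 0.55])     # illustrative observations
obs_sd = 0.05                              # assumed observational sd

# Implausibility for each output; keep samples whose maximum I is below 3
I = np.abs(obs - em_mean) / np.sqrt(em_sd**2 + obs_sd**2)
nroy = I.max(axis=1) < 3
frac_nroy = nroy.mean()                    # proportion of space retained
```

Because the maximum implausibility across all outputs must fall below the threshold, each additional observation can only shrink (or leave unchanged) the NROY region.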
Does this region represent a viable set of inputs, perhaps to replace the default set of parameters, or should we include a non-zero discrepancy term (δ in Eq. 1)? Although it appears that we may have found regions where both the Amazon and the other forests are plausible, we are suspicious of this region, for three reasons. First, the default set of parameters is ruled out, in this case by comparison of the simulator with observations of the Amazon (Table 2). Second, it appears that in the active parameter space projections, these candidates are near the edges and corners of the input space considered plausible. The failure to rule out these points could be due to a relatively large emulator uncertainty, for example. Third, plotting histograms of the "best estimate" emulator output at these NROY points (Fig. 5) shows that they can be seen as compromise candidates. In general, if the simulator is run at points in this region, it will overestimate the Central African, South East Asian and, most likely, North American forest fraction while underestimating the Amazon forest fraction. They are still included as NROY at these values because of the combination of the emulator uncertainty and the assumed observational uncertainty. In the remainder of this section, we use a number of analysis techniques to investigate why a region on the edge of parameter space initially considered plausible, that does not contain the default parameter settings, is identified as NROY.

Finding the active parameters with sensitivity analysis
We perform a sensitivity analysis to identify the active subspace of simulator inputs and quantify relationships between inputs and outputs. In a descriptive sensitivity analysis, we show emulated mean regional and global forest fraction with inputs sampled from across input parameter space in a one-factor-at-a-time fashion, holding all but one parameter at their standard values while varying the remaining parameter (Fig. 6). The emulator is not a perfect representation of the simulator, and so we include the emulator uncertainty estimates at ± one standard deviation, shown as shaded regions in the plot.
V_CRIT_ALPHA and NL0 are the most influential individual parameters, and counter each other when both are increased.
The Q10 parameter has little or no influence on forest fraction. The TUPP parameter is important only to the Central African (termed "Congo" here, for brevity) and Southeast Asian forest fraction, much less important to the Amazon, and not important at all to the North American forests.
The relationships change across parameter space and are therefore dependent on the somewhat arbitrary ranges of the initial input parameters of the ensemble design. Sensitivities can change in importance as parts of input space are ruled out. For example, the forests are most sensitive to NL0 in the lower part of the ensemble range, and most sensitive to V_CRIT_ALPHA in the upper part of the ensemble range.
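The one-factor-at-a-time sweep can be sketched against a toy emulator mean. The function and the roles assigned to the parameter names are illustrative assumptions (with a deliberately inert Q10-like input), not the fitted FAMOUS emulator:

```python
import numpy as np

# Toy emulator mean standing in for emulated forest fraction;
# columns: [NL0-like, V_CRIT_ALPHA-like, Q10-like] inputs on [0, 1].
def em_mean(x):
    return np.tanh(2 * x[:, 0]) * (1 - 0.5 * x[:, 1]) + 0.0 * x[:, 2]

defaults = np.full(3, 0.5)          # "standard" parameter settings
grid = np.linspace(0, 1, 21)

oat_range = {}
for j, name in enumerate(["NL0", "V_CRIT_ALPHA", "Q10"]):
    X = np.tile(defaults, (grid.size, 1))   # hold the others at defaults
    X[:, j] = grid                          # vary one parameter at a time
    y = em_mean(X)
    oat_range[name] = y.max() - y.min()     # crude sensitivity measure
```

The output range of each sweep gives a simple first-look sensitivity ranking, at the cost of ignoring interactions and of depending on the chosen default values.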
Following Carslaw et al. (2013), we quantify the sensitivity of the simulated forest fraction to the input parameters using the FAST methodology (Saltelli et al., 1999), conveniently coded in the R package sensitivity (Pujol et al., 2015), and easily calculated using the emulator. We calculate the global sensitivity of the simulator output due to each input, as both a main effect and a total effect including interaction terms (Fig. 7). V_CRIT_ALPHA (soil moisture photosynthesis control parameter) is the most important parameter across the tropical forests and globally, with a total effect index of around 0.6. In tropical forests, NL0 (leaf nitrogen parameter) is next most important, with a total effect index between 0.2 and 0.3. In all cases, interaction terms are relatively unimportant, accounting for only a few percent of the variance. North American forests show slightly different results, with NL0 being the most important parameter with a sensitivity index near 0.4, followed by LAI_MIN (leaf area index parameter) at around 0.3 and V_CRIT_ALPHA at 0.25. This difference is unsurprising, as the North American forests are a mix of broadleaf and needleleaf trees, which will have different sensitivities from a broadleaf tropical forest.
Parameter Q10 has almost no influence on forest fraction, in line with the expectations of land surface modellers. The non-zero estimate of its sensitivity is likely due to the fact that the emulator is not a perfect representation of the simulator, and a zero sensitivity is well within the uncertainty bounds of the sensitivity analysis. Parameters TUPP and R_GROW have very little impact on forest fraction. Parameter F0 has virtually no influence away from the tropics; conversely, LAI_MIN is only important in the North American forest.
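The FAST calculation itself lives in the R sensitivity package; as a language-agnostic sketch of what a first-order (main effect) index measures, Var(E[Y|x_j])/Var(Y) can be approximated with a simple conditional-mean binning estimator against the same kind of toy function (not the FAST algorithm itself, and all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200_000, 3))
Y = np.tanh(2 * X[:, 0]) * (1 - 0.5 * X[:, 1])   # toy emulator output

# First-order index S1_j = Var(E[Y | x_j]) / Var(Y), estimated by
# binning each input and taking conditional means of the output.
edges = np.linspace(0, 1, 51)
S1 = []
for j in range(3):
    bins = np.digitize(X[:, j], edges[1:-1])     # 50 equal-width bins
    means = np.array([Y[bins == b].mean() for b in range(50)])
    counts = np.bincount(bins, minlength=50)
    S1.append(np.average((means - Y.mean())**2, weights=counts) / Y.var())
```

An input the output does not depend on yields an index near zero, up to sampling noise, mirroring the small but non-zero Q10 estimate discussed above.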

Mapping simulator error in parameter space
In this section, we examine the ability of the simulator to reproduce the observed forest fraction, how that ability varies across input parameter space, and assess the region of parameter space which is consistent with each of the forest fraction observations. We show a map of simulator error in the two-dimensional space of the most important parameters identified in Sect. 3.2 (Fig. 8). For the Central African, South East Asian and North American forests, there is a region of this space where simulator error is close to zero. The Amazon input space does not have this region: only the high NL0, high V_CRIT_ALPHA corner has a simulator error close to zero, suggesting a bias not common to all forests. The fact that the regions of this reduced input space where the simulator error is close to zero do not overlap means we are unlikely to find parameter sets where a simulator discrepancy term is not needed. It is possible to find a portion of parameter space where the error is similar for all simulator outputs, in the low NL0, high V_CRIT_ALPHA corner. However, the error is rather large (at least -0.6) at this point.

How much input space is ruled out by combinations of observations?
We explore the potential of the history matching technique to rule out parameter space under a number of scenarios of tolerance to observational and simulator structural error. The denominator of Eq. (2) contains the sum of the variances of the emulator, discrepancy and observational uncertainties. Our emulator uncertainty is emergent, but we can experiment by assuming an overall uncertainty budget, or by partitioning assumed uncertainty between observations and simulator discrepancy. We calculate an implausibility score for each output independently, and reject those candidates with a value larger than 3 in any. A danger of this approach is that a single poorly specified emulator or simulator discrepancy term could lead to large swathes of parameter space being incorrectly ruled out. As the number of comparisons with data goes up, so does the probability of including a poorly specified simulator discrepancy. For example, comparing a simulator with a serious but undiagnosed bias could lead to all a priori plausible parameter space being ruled out as a poor match to the observations. It is important to first combine knowledge and judgement about the system being modelled, and the way that the parameters represent their real world counterparts (or don't), before relying on observations to remove plausible parameter space.
A conservative approach is to reject a candidate point only if it is judged implausible using a number of measures. This will be more robust to a poorly specified simulator discrepancy term. Vernon et al. (2010) use the 2nd and 3rd highest implausibility scores where a simulator has implausibility scores calculated for multiple outputs. This is to guard against poor emulators, but in practice works just as well for a poorly specified simulator discrepancy. An alternative suggested by Vernon et al. (2010) is to use a multivariate measure of implausibility.
To understand the value of individual observations, we ask: what is our tolerance to error? What level of uncertainty in observations or simulator discrepancy can we tolerate before our observations become ineffective for history matching? Figure 9 shows the declining proportion of input parameter space ruled out as we increase tolerance to error in a number of scenarios.

Tolerance to error is specified as a single standard deviation, so the full distribution of the uncertainty of the observation or discrepancy (e.g. the 95% range) will be at least three times as large, using Pukelsheim's rule.
North American, South East Asian and Central African forest observations each constrain NROY space to between 40% and 50% of parameter space, even when our tolerance to error is very low. The proportion of NROY space increases quickly with tolerance, particularly using North American forest fraction, which offers no constraint at all once our error tolerance is above 0.07 (1 standard deviation). The other forests offer some constraint up to about 0.1 (1 standard deviation), and the Amazon is more of a constraint, only losing power when our tolerance to error is above 0.15 (1 standard deviation).
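This tolerance-to-error sweep can be sketched by recomputing the NROY fraction while inflating the pooled uncertainty standard deviation (toy one-input emulator and illustrative numbers, as before):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(20_000, 1))
em_mean = 0.8 * X[:, 0]        # toy emulated forest fraction
em_sd = 0.02                   # assumed emulator sd
obs = 0.5                      # illustrative observation

# Increase the tolerance to error (pooled obs + discrepancy sd) and
# record how much of input space remains NROY at each level.
tolerances = np.linspace(0.01, 0.2, 20)
frac_nroy = np.array([
    (np.abs(obs - em_mean) / np.sqrt(em_sd**2 + tol**2) < 3).mean()
    for tol in tolerances
])
```

The retained fraction can only grow as tolerance increases; the interesting question, as in Fig. 9, is how quickly each observation's constraint saturates at 100% NROY.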
Combining data, and using the maximum implausibility across datasets, improves the constraint, particularly when the tolerance to error is low. However, we urge caution. The fact that (a) the performance of the Amazon data set appears different from the other observations, and (b) all parameter space is ruled out at lower tolerance values, even though there is emulator uncertainty, again raises concerns of a poorly specified Amazon simulator discrepancy.
A more robust calculation of tolerance to error can be found by excluding the Amazon observations and using the maximum implausibility from the other observations. This excludes more input parameter space than any single observation on its own, up to a tolerance to error of around 0.085 (1 standard deviation), where it performs in a similar manner to using Southeast Asian forest fraction.
To what extent do the input spaces that are NROY when history matching with two forests overlap? We suppose that data that suggest highly overlapping input spaces give us confidence that those input spaces are valid. Another perspective is that overlapping input spaces give us little extra information, and we should seek out those that minimise overlap. We sample uniformly from the input space, and test each point using a comparison with each forest observation to see if it is ruled out.

If a point has the same status using both forests in the history match, we class that as an overlapping point. Table 3 gives the proportion of the samples which have the same status using each permutation of two forests for the history matching.
The most similar input space is found if we use the Southeast Asian and Central African rainforests. Comparing these forests with the North American forests gives a fairly high overlap -61% and 66% for Southeast Asia and Central Africa respectively. The Amazon has markedly lower overlap with the other forests -40% at the most with North America, and only 26% with South East Asia.
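The overlap statistic behind Table 3 can be sketched as follows. The toy emulated fractions are illustrative: two "forests" respond to the same input and one responds to a different input, mimicking the tropical/Amazon contrast described above:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(10_000, 2))

# Toy emulated forest fractions (illustrative): SEAsia and CAfrica
# respond to the same input; Amazon responds to the other input.
em = {
    "SEAsia":  0.70 * X[:, 1],
    "CAfrica": 0.65 * X[:, 1] + 0.05 * X[:, 0],
    "Amazon":  0.60 * X[:, 0],
}
obs = {"SEAsia": 0.45, "CAfrica": 0.45, "Amazon": 0.60}
tot_sd = 0.06      # pooled emulator + observational sd (assumed)

nroy = {k: np.abs(obs[k] - v) / tot_sd < 3 for k, v in em.items()}

def overlap(a, b):
    """Proportion of samples with the same NROY status under both forests."""
    return (nroy[a] == nroy[b]).mean()

similar_pair = overlap("SEAsia", "CAfrica")   # same controlling input
different_pair = overlap("SEAsia", "Amazon")  # different controlling inputs
```

Outputs driven by the same inputs produce highly overlapping NROY classifications, while outputs driven by different inputs agree only at roughly the rate expected by chance.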

What do the individual forests tell us about the best parameters?
To more fully explore the causes of simulator discrepancy and its consequences, we make the illustrative assumption that simulator discrepancy uncertainty is zero and that observational uncertainty is very low. We sample a large number of points uniformly across input space, assuming a simulator discrepancy uncertainty of zero and an observational uncertainty of 0.01.
We classify as NROY only those emulated samples where the implausibility (or maximum implausibility in the case of combined data) is below 3. Setting such a demanding threshold allows us to find and describe the relatively small regions of input space where the simulator performs best, in two cases: first, using the South East Asian, Central African and North American forest fractions in the history matching exercise; second, using the Amazon forest fraction.

Plotted in two-dimensional projections in Fig. 10, we see that the "best" sets of parameters, as defined by matching to the observed Amazon forest fraction and to the other forests, form almost non-overlapping sets in the most active subspace comprising V_CRIT_ALPHA and NL0. Again, we see a swathe of input parameter space running from low V_CRIT_ALPHA, low NL0 through to high values of those parameters. This pattern is confirmed when using the individual data sets for history matching (not shown). The three non-Amazonian forests have a high degree of overlap of NROY space.

FAMOUS struggles to simulate both the Amazon and the other forests simultaneously, at any parameter combination when using a low threshold of implausibility. It is very difficult to reconcile the simulation of the Amazon simultaneously with the other forests if there is little uncertainty about the observations. A simulator discrepancy term and corresponding uncertainty is therefore necessary to attain an adequately performing simulator.

The forests at best parameters
To examine the implications of using each observation separately to tune the simulator, we use the emulator to project each forest at the set of "best" inputs: those where the simulator reproduces each forest with a very small tolerance to error. We then use the emulator to project the Amazon forest fraction using the "best" parameters for each forest, and the forest fraction for each of those forests using the "best" parameters for the Amazon, in Fig. 11. As there is some uncertainty, due to emulator uncertainty and a small tolerance to error, these are plotted as histograms.
We find that using the best set of parameters as defined for each non-Amazon forest would likely lead to an underestimate of the Amazon forest fraction by around 50% compared to the observed fraction (around 0.3, compared to an observation of around 0.6). Conversely, using the best parameters as defined for the Amazon leads to an overestimate of the other forests (around 0.3 for the tropical forests, and 0.15 for the North American forest), even though the observed aggregate forest fraction is very similar for the tropical forests.
To explore this difference further, we project the "best" sets of input parameters, found by matching the simulator against the Amazon and African forests respectively, over a map of the entire FAMOUS land surface. In each case, an independent emulator is trained on the ensemble for each grid box. The maps of the mean forest fraction for each parameter set, and the difference between them, are shown in Fig. 12.
Even using the "best" Amazon parameters, the simulator underestimates the Amazon coverage in the north-east of South America. This makes it very difficult to simulate a sensible aggregate forest fraction, even when overestimating the forest fraction in places where the simulator does have forest cover.

History matching allowing for discrepancy in the Amazon
The previous sections show that the inputs where FAMOUS best simulates the Central African, South East Asian and North American forests cover a similar input space, whereas the best inputs for the Amazon lie in a different region. A parsimonious approach would be to use a non-zero-mean discrepancy for the Amazon: allowing the Amazon to be less vigorous in our simulations, while maintaining that the simulator output should broadly match the other forests. We perform a history match using all of the forest observations, along with a simulator discrepancy term for the Amazon forest. We use the best estimate of the difference between the Amazon observations and the output simulated by FAMOUS at the default set of parameters as the best estimate of the discrepancy mean. The difference in forest fraction at the default parameters is approximately 0.3. Figure 13 shows the histograms of emulated simulator output using this discrepancy term, along with credible estimates for observational uncertainty (1 standard deviation = 0.05) and tolerable discrepancy uncertainty (1 standard deviation = 0.03). The corresponding two-dimensional density plots of NROY emulated input samples can be seen in Fig. 14.
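The implausibility calculation with a non-zero-mean discrepancy, as described above, can be sketched as follows. All numerical values are illustrative: the emulator mean and standard deviation stand in for a trained statistical emulator's prediction at one input, while the observation, discrepancy mean and the two uncertainty terms use the values quoted in the text.

```python
import numpy as np

def implausibility(z, em_mean, em_sd, disc_mean=0.0, disc_sd=0.0, obs_sd=0.0):
    """Univariate implausibility: distance from the observation to the
    discrepancy-corrected emulator mean, in units of total uncertainty."""
    total_sd = np.sqrt(em_sd**2 + disc_sd**2 + obs_sd**2)
    return np.abs(z - (em_mean + disc_mean)) / total_sd

# Illustrative Amazon numbers from the text: observed fraction ~0.6,
# discrepancy mean 0.3 (the simulator runs low at the default parameters),
# obs sd = 0.05, discrepancy sd = 0.03. The emulator values are hypothetical.
I_amazon = implausibility(z=0.6, em_mean=0.32, em_sd=0.02,
                          disc_mean=0.3, disc_sd=0.03, obs_sd=0.05)

# A common choice is to keep (Not Rule Out) inputs with I < 3.
nroy = I_amazon < 3.0
```

With the discrepancy mean absorbing most of the 0.3 offset, this hypothetical input is comfortably retained; with `disc_mean=0` the same input would be strongly ruled out.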
The remaining NROY input space represents around 57% of the original input space defined by the input design, meaning that we have ruled out 43% of the space. This contrasts with ruling out around 88% of the space in the initial history match in Sect. 3.1. Marginal histograms of the relative density of NROY points for each individual input parameter (not shown) indicate that no part of the marginal input space is completely ruled out, and so we cannot "constrain" any of the parameters in an individual dimension.
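The distinction between a joint and a marginal constraint can be illustrated with a toy calculation: sample the input space, apply the implausibility cut, and check whether any marginal histogram bin is emptied. The "emulator" here is a hypothetical smooth function of two inputs, not FAMOUS itself, and all numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an emulator mean over [0, 1]^2
# (V_CRIT_ALPHA-like and NL0-like axes with counteracting effects).
def toy_emulator(x):
    return 0.6 - 0.5 * x[:, 0] + 0.4 * x[:, 1]

x = rng.uniform(0.0, 1.0, size=(100_000, 2))   # uniform samples of input space
I = np.abs(0.55 - toy_emulator(x)) / 0.1       # implausibility, total sd = 0.1
nroy = I < 3.0

frac_nroy = nroy.mean()                        # fraction of space not ruled out

# Marginal check: if every histogram bin of an input retains NROY points,
# that parameter is not constrained on its own, even though a diagonal
# swathe of the joint space has been ruled out.
bins = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(x[nroy, 0], bins=bins)
marginally_constrained = np.any(counts == 0)
```

Because the two inputs counteract each other, the ruled-out region is a pair of corners of the joint space, and every marginal bin survives, mirroring the behaviour reported above.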
Our analysis illustrates the challenges in distinguishing between simulator discrepancy, parameter uncertainty and observational uncertainty during simulator development. For example, forest fraction in the simulator can be tuned largely by using the two most active parameters: V_CRIT_ALPHA and NL0. As these parameters alter forest fraction in counteracting directions, a number of solutions can be found that give plausible forest fractions. Information from outside sources about the "true" value of one of these parameters might therefore offer a strong constraint on the value of the other. NL0 is the leaf nitrogen parameter: the ratio of nitrogen to carbon found in leaves. In theory, this is something that is well observed and recorded, but it is uncertain what the value should be to reflect the observational range across the spatial scale of FAMOUS. Nitrogen content determines the maximum rate of photosynthesis, and therefore how much CO2 can be assimilated, i.e. the productivity of a plant. Low (high) NL0 values correspond to low (high) nitrogen content, and hence a low (high) productivity plant. V_CRIT_ALPHA is the soil moisture threshold below which plants are water limited, so if this parameter is high the plant is more often in a water-limited regime; if it is low, the plant is less often water limited.
Using observations of the Amazon rainforest along with the other major forests in the history matching exercise results in ruling out a large swathe of parameter space, including the default set of parameters, and leaving a corner of parameter space Not Ruled Out Yet. While it appears that simulator output is tolerably close to the observations in this corner, given a zero-mean discrepancy, there are good reasons to be suspicious of this region. For illustration, we imagine a situation where we are forced to choose between keeping the default parameters and including a simulator discrepancy function, or rejecting them and accepting a candidate from the new NROY region. Our choice will be dictated by the objective of our analysis: do we wish to provide only the best possible prediction, or do we wish to find parameter values which are, to some extent, "true"?
For a simple prediction problem, we will be less concerned that the parameters accurately reflect something we might measure in the real system, and might be less inclined to include a discrepancy term. However, sustainable development of the simulator requires that we get things right for the right reasons. We argue that we should include a larger discrepancy function for the Amazon rather than ruling out the default parameters, for a number of reasons.
First, the NROY region excludes the default set of parameters, chosen as the result of multiple lines of evidence, scientific judgement, and experience using this and other simulators. Second, the NROY region is close to the edge of the ensemble in the active parameter subspace, so that emulator uncertainty, combined with the generous observational and discrepancy uncertainty, may dominate the implausibility calculation. Emulators tend to increase in uncertainty near the edge of an ensemble, as they are forced to extrapolate more than at its centre. Third, the information obtained from using each of the four forests shows that the Central African, South East Asian and North American forests all indicate very similar, highly overlapping NROY regions; in contrast, the NROY region suggested by comparing FAMOUS to observations from the Amazon is very different. Finally, tuning to the "best" parameters for each of the forests suggests that the NROY region produces an inevitable compromise: the Amazon will very likely be underestimated, and the other forests overestimated, if observational uncertainty is reduced. It is possible that there are correlated errors in the other forests, rather than in the Amazon, but we argue that this is less likely, given that the other forests include tropical forests (like the Amazon) and the boreal forest of North America.
We therefore urge caution with a naive or automatic application of history matching conclusions, particularly when using multiple observations for comparison with the simulator. Even in our relatively simple history matching exercise, there is a clear need to include a simulator discrepancy, to increase simulator discrepancy uncertainty, or to apply a conservative version of the measure of implausibility. One strategy, adopted for example by Vernon et al. (2014), is to reject parameter space that has a second- or third-highest implausibility metric larger than some threshold. This would be effective in the case of our comparison. Another strategy might be to reject only parameter space where the minimum implausibility is higher than some threshold; we believe that this would not rule out much input space in many circumstances. We call for more research on the behaviour of measures of implausibility when the number of data comparisons is high and there is a chance that many of them may suffer from structural biases. Conducting a full probabilistic calibration as an alternative approach might offer a powerful tool to overcome some of the difficulties we mention here. In particular, it would allow us to weight inputs as candidates for the "best", using the rules of probability, at the cost of expending effort in specifying prior distributions and likelihood functions.
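These multi-output strategies can be sketched in a few lines, assuming a precomputed matrix of implausibilities (candidate inputs by outputs); the forest values below are invented for illustration.

```python
import numpy as np

def nroy_mask(I, threshold=3.0, order=1):
    """Keep an input only if its `order`-th largest implausibility across
    outputs is below the threshold. order=1 is the usual maximum rule;
    order=2 or 3 gives the conservative variants of Vernon et al. (2014),
    which tolerate one or two outputs with structural bias."""
    I_sorted = np.sort(I, axis=1)          # ascending along the output axis
    return I_sorted[:, -order] < threshold

# Hypothetical implausibilities for 4 forests at 3 candidate inputs.
I = np.array([[1.0, 1.2, 0.8, 4.5],       # only an Amazon-like output fails
              [3.5, 3.6, 1.0, 1.0],       # two outputs fail
              [0.5, 0.6, 0.7, 0.9]])      # all outputs pass

max_rule = nroy_mask(I, order=1)          # [False, False, True]
second_rule = nroy_mask(I, order=2)       # [True, False, True]
```

Under the maximum rule, the single discrepant output rules out the first input; under the second-highest rule it survives, which is exactly the behaviour that makes the conservative variant robust to one structurally biased comparison.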
We are able to offer a counter example to the hypothesis of Williamson et al. (2014), in which what was thought to be a structural error in the simulator was significantly reduced. In our case, we believe it likely that better observations would simply confirm that the "best" regions of parameter space for the Amazon and the other forests are non-overlapping. While individual forest fraction observations may have some uncertainty, we would expect the uncertainty on the differences between those observations to be smaller. For example, a systematic bias in the way that the forests are measured would be common to all observations, even though it would need to be taken into account in the uncertainty calculation for an individual observation.
We find that forest fraction does not offer a marginal constraint on the parameters: that is, there is little or no constraint on each parameter individually, but there is a significant constraint on the joint input space of the parameters. Approximately 43% of the a priori parameter space is ruled out, which is relatively little compared to other studies. This is explained by several factors: 1) the ensemble covers a relatively small input space compared to other studies, because the simulator is based on a well-studied climate model, HadCM3; 2) our observational uncertainty is assumed conservatively large; and 3) we have only a single wave of history matching. A further experiment could run the climate simulator within the NROY space in order to reduce emulator uncertainty, and provide a basis to rule out further input space. The value of further waves of history matching might be diminished by the fact that the simulator likely has a large discrepancy in the Amazon, and the simulator discrepancy uncertainty is likely a large component of the overall uncertainty budget.

Causes of discrepancy
We suggest three possible causes of fundamental structural error, both external and internal to the vegetation model, although a combination of these causes is not ruled out. First, is there a problem with the emulator that would lead us to think that such a discrepancy exists? We believe that this is not the case, as the emulator performs sufficiently well across parameter space in cross-validation experiments (see supplementary material, Fig. S2).
Second, is there a missing process in the vegetation model that impacts the Amazon or other forests in FAMOUS, or has the real Amazon developed in ways not seen in the other forests? For example, it is possible that the real Amazon can access water at a deeper level than other forests, through deep rooting. This would cause the low Amazon bias seen in the simulator output: if the simulated Amazon cannot access water through deep enough roots, and the simulator parameters were tuned to make the Amazon as vigorous as the real one, the other forests would be more vigorous in the simulator than in observations.
A bias that leads to a reduction in Amazon forest extent (whether climatological or related to root depth) is likely to lead to further rainfall reductions, and associated warming, as the region loses the water cycling capability that the forest canopy provided. This is a feedback, and can be expected to enhance any dry/warm bias that results from other factors, and in turn enhance any forest loss. Such a simulator discrepancy could be countered by allowing different parameters in different regions, perhaps through ancillary parameter maps. Alternatively, the number of plant functional types allowed in the simulator could be increased, an approach adopted by many vegetation modelling efforts. Similar climatic biases have been noted across the CMIP5 archive (2012). We suggest that attributing the simulator discrepancy to these causes might be a fruitful direction for further study.

Conclusions
We analyse an ensemble of the fast climate simulator FAMOUS, with the aim of constraining carbon cycle parameters through a comparison of simulator output with forest observations. We find that we are unable to constrain the parameters individually, but that areas of joint parameter space are effectively ruled out. With a defensible simulator discrepancy term for the Amazon and an assumed observational uncertainty, we are able to rule out 43% of the input parameter space defined by the ensemble design.
We identify moisture control on photosynthesis (V_CRIT_ALPHA) as the most important parameter controlling forest fraction, with the next most important parameter, leaf nitrogen (NL0), approximately half as important, and in turn twice as important as any other parameter. These parameters have counteracting effects on the forest fraction, so we are unable to rule out a broad swathe of their joint space.
We suggest that care should be exercised when using observations of the Amazon rainforest to constrain the input parameters of FAMOUS, as an apparent structural bias in the climate simulator could lead to misleading results. Using the Amazon forest as an observational constraint suggests very different parts of input parameter space as not implausible than using the other forests. Although we are able to find a region of parameter space that we are unable to rule out, given a defensible assumed observational uncertainty, we have reason to suspect that this region does not offer a credible alternative to the default parameter settings. Further investigation reveals that choosing this region would systematically overestimate the forest fraction of the Central African, South East Asian and North American forests, while simultaneously underestimating the Amazon. We fail to find a set of parameters that eliminates the discrepancy between the simulated fraction of the Amazon and that of the other tropical and boreal forests; that is, we cannot find a set of vegetation model parameters that improve the Amazon without making the other forests worse. This satisfies the criterion of Williamson et al. (2014) for identifying a simulator bias.

Using a history matching technique, we investigate the limits of observational and simulator discrepancy uncertainty, beyond which observations no longer offer a constraint on input parameter space. We find that if this total error budget is larger than approximately 0.1 (1 standard deviation of forest fraction), and excluding the Amazon rainforest as a comparison, the observations will not offer any form of constraint on the current ensemble, even in joint parameter space.

Figure 9. Proportion of NROY (Not Ruled Out Yet) input space plotted against "tolerance to error", the total error budget including emulator, observational and simulator discrepancy uncertainty.

Figure 13. Histograms of emulated simulator output using credible estimates for observational uncertainty, a simulator discrepancy term for the Amazon, and credible discrepancy uncertainty.

Figure 14. A density plot of the two-dimensional projections of NROY samples from the design input space, using all forest observations and a discrepancy function for the Amazon.
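The tolerance-to-error sweep behind Fig. 9 can be reproduced in miniature: hold the input samples fixed, widen the total error budget, and record the proportion of space left Not Ruled Out Yet. The toy emulator and observation below are hypothetical stand-ins, not FAMOUS output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for an emulator mean over a 2-D input space.
def toy_emulator(x):
    return 0.6 - 0.5 * x[:, 0] + 0.4 * x[:, 1]

x = rng.uniform(size=(50_000, 2))
z = 0.55   # invented "observation"

# Sweep the total "tolerance to error" (one combined standard deviation
# for emulator, observational and discrepancy uncertainty) and record the
# NROY proportion at the usual I < 3 cut.
tolerances = np.linspace(0.01, 0.15, 15)
frac_nroy = [float(np.mean(np.abs(z - toy_emulator(x)) / s < 3.0))
             for s in tolerances]
```

Because the same samples are reused, the NROY sets are nested, so the proportion grows monotonically with the error budget and approaches 1 once the budget stops constraining the ensemble, which is the qualitative shape of Fig. 9.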