A novel bias correction methodology for climate impact simulations

Abstract. Understanding, quantifying and attributing the impacts of extreme weather and climate events in the terrestrial biosphere is crucial for societal adaptation in a changing climate. However, climate model simulations generated for this purpose typically exhibit biases in their output that hinder any straightforward assessment of impacts. To overcome this issue, various bias correction strategies are routinely used to alleviate climate model deficiencies, most of which have been criticized for physical inconsistency and the nonpreservation of the multivariate correlation structure. In this study, we introduce a novel, resampling-based bias correction scheme that fully preserves the physical consistency and multivariate correlation structure of the model output. This procedure strongly improves the representation of climatic extremes and variability in a large regional climate model ensemble (HadRM3P, climateprediction.net/weatherathome ), which is illustrated for summer extremes in temperature and rainfall over Central Europe. Moreover, we simulate biosphere–atmosphere fluxes of carbon and water using a terrestrial ecosystem model (LPJmL) driven by the bias-corrected climate forcing. The resampling-based bias correction yields strongly improved statistical distributions of carbon and water fluxes, including the extremes. Our results thus highlight the importance of carefully considering statistical moments beyond the mean for climate impact simulations. In conclusion, the present study introduces an approach to alleviate climate model biases in a physically consistent way and demonstrates that this yields strongly improved simulations of climate extremes and associated impacts in the terrestrial biosphere. A wider uptake of our methodology by the climate and impact modelling community therefore seems desirable for accurately quantifying changes in past, current and future extremes.


Introduction
Weather and climate extreme events such as heat waves, droughts or storms cause major impacts upon human societies and ecosystems (IPCC, 2012). In recent years, these climatic events have changed in intensity and frequency in many parts of the world (Barriopedro et al., 2011;Donat et al., 2013;Seneviratne et al., 2014) and changes are likely to continue throughout the 21st century (Sillmann et al., mate system (Reichstein et al., 2013;Frank et al., 2015). Indeed, on continental to global scales, it has been shown that large-scale reductions in photosynthetic uptake of carbon by plants are mainly driven by water limitations (Zscheischler et al., 2014a, b). Furthermore, it has been demonstrated that a single large event such as the European heat and drought summer of 2003 alone might undo several years of ecosystem carbon sequestration (Ciais et al., 2005), thus potentially jeopardizing the terrestrial carbon sink potential (Lewis et al., 2011).
A widely debated question in this realm is whether the observed changes in the occurrence of climatic extremes and associated impacts can be attributed to specific changes in climate forcing, both anthropogenic or natural (Allen, 2003;Stone and Allen, 2005;Stone et al., 2009). To this end, large climate model ensembles are needed in order to derive robust probabilistic conclusions about changes in the odds of these events (Bindoff et al., 2013;Massey et al., 2014), because direct assessments of rare extremes are often prohibited by the lack of long and good quality observational time series. Hence, climate models are indispensable tools to study present and future climate extremes on various spatial and temporal scales, and the availability of such simulations is often a prerequisite for studying climate impacts.
However, despite considerable progress in recent years, global and regional climate models typically exhibit biases in various statistical moments of their simulated variables (Ehret et al., 2012;Wang et al., 2014), which often impedes direct assessments of climate extremes (Otto et al., 2012;Sippel and Otto, 2014) or simulating impacts (Maraun et al., 2010;Hempel et al., 2013). These biases are often due to an imperfect representation of physical processes in the models, parametrizations of sub-grid-scale processes, and an overor underestimation of feedbacks with the land-atmosphere or ocean-atmosphere feedbacks (Ehret et al., 2012;Mueller and Seneviratne, 2014). Due to the various origins of model biases, these biases are frequently varying depending on weather patterns both spatially and temporally, for instance in the distributed weather@home ensemble-based modelling framework (Massey et al., 2014) or in an ensemble of regional climate models (Vautard et al., 2013).
To alleviate this issue, various bias correction schemes have been developed in recent years that generally aim to statistically transform biased model output in order to derive more realistic simulations (see e.g. Maraun et al., 2010;Teutschbein and Seibert, 2012). To do so, a statistical relationship ("transfer function") is built between the statistical distribution of an observed and simulated variable (Piani et al., 2010). Such methods span a wide range from very simple parametric transformations adjusting simulated means to observations (also called the "delta method" (additive) or "linear scaling" (multiplicative); Teutschbein and Seibert, 2012) to sophisticated, nonparametric approaches that aim to correct various statistical moments of the simulated distri-butions such as quantile mapping approaches (Wood et al., 2004;Gudmundsson et al., 2012).
However, the application of bias correction implicitly requires that a range of assumptions are met, which might be questionable in many cases and are discussed in detail in Ehret et al. (2012). Most importantly, the application of bias correction implicitly assumes that the statistical transformation improves the simulated output time series ("effectiveness"), whilst the signal of interest, e.g. the climate change signal or properties of the extremes, remains accurately detectable ("reliability"). Those assumptions are not always fulfilled since statistical bias correction methods are not based on physical principles, but operate rather heuristically on an observed model-data mismatch. To this end, even relatively simple methods that are designed to adjust "only" simulated long-term monthly means to observations (e.g. Hempel et al., 2013) lack a sound physical rationale as to whether these adjustments are to be made additively or multiplicatively. Further, the assumption of time-invariant biases that currently underlies state-of-the-art bias correction procedures (Christensen et al., 2008;Ehret et al., 2012) might be especially critical for century-long climate simulations spanning several degrees of warming (Christensen et al., 2008;Buser et al., 2009) including changing landatmosphere feedback processes (Seneviratne et al., 2006). Recent studies have shown that this assumption is questionable for future climate simulations (Maraun, 2012;Bellprat et al., 2013), and have made attempts to address timedependent biases.
In conclusion, accounting for biases in climate model output is crucial in order to produce credible climate model simulations. Nonetheless, statistical transformations are to be applied with caution and the changes induced to the simulated statistical moments, multivariate dependencies and spatiotemporal patterns deserve considerable attention. Since the tails of statistical distributions are especially sensitive to changes in statistical moments such as the mean and variance (Katz and Brown, 1992), the latter holds in particular for assessments of extreme events and highlights the need for physically consistent ways to alleviate climate model biases.
In this paper, we demonstrate how a physically consistent bias correction of a regional climate model ensemble might aid to better simulate climatic extreme events and impacts in the terrestrial biosphere (see Fig. 1 for the methodological workflow of the paper).
First, we introduce a novel methodology to alleviate biases in the output of climate model ensembles that successfully circumvents major deficiencies of statistical bias correction (Sect. 3): an ensemble-based probabilistic resampling approach retains the physical consistency of the regional climate model output. This includes the preservation of the multivariate correlation structure, and the procedure is shown to considerably improve the simulation of various statistical moments of the simulated variables. Secondly, we assess contemporary temperature and precipitation extreme events in Central Europe on monthly to seasonal timescales by comparing a widely used "standard" statistical bias correction methodology (Hempel et al., 2013) with the original model simulations and the probabilistic resampling (Sects. 4.1 and 4.2). This evaluation also focuses on the uncertainty induced by different observational data sets used as a basis for any bias correction approach. Thirdly, we explicitly test how differently corrected climatic data propagate into the simulation of impacts on major component fluxes of terrestrial carbon -net ecosystem exchange (NEE), gross primary production (GPP) and ecosystem respiration (Reco) -and water cycling (actual evapotranspiration, AET) in the terrestrial biosphere using a dynamic vegetation model (LPJmL,Sect. 4.3). To this end, we demonstrate that different ways to deal with biases in climate simulations yield both qualitatively and quantitatively different results regarding simulated impacts, which affect both central moments of the distribution as well as extremes and variability.

Climate model simulations
In this study, regional climate model ensemble simulations spanning 26 years (1986-2011) with approx. 800 ensemble members per year from the weather@home distributed computing platform are investigated (Massey et al., 2014). The "atmosphere-only" simulations were conducted over the European region (identical to the EURO-CORDEX region; Giorgi et al., 2009) using a regional model (HadRM3P) on a rotated grid nested into the global HadAM3P model. Both models share the same model formulation and are described in Pope et al. (2000). The regional (global) simulations are run with a spatial resolution of 0.44 • × 0.44 • (1.875 • × 1.25 • ) with 19 vertical levels, and the temporal resolution is set to 5 (15) min (Massey et al., 2014). The models are driven by observed sea surface temperatures and sea ice fractions, the observed composition of the atmosphere (greenhouse gases, aerosols) and anomalies in the solar cycle (Massey et al., 2014). To derive different ensemble members, the initial conditions of the driving GCM are perturbed on 1 December of each 1-year simulation (ibid.). For further analysis and bias correction, the ensemble simulations were remapped to 0.5 • spatial resolution using a conservative remapping scheme (Jones, 1999). Massey et al. (2014) demonstrate that the ensemble setup described above produces a realistic representation and statistics of European weather events, including the extremes for most seasons and regions. However, despite these encouraging results, a relatively large mismatch remains between the statistical distribution of the ensemble simulations and the observations in Northern Hemisphere summer, which holds for the means of simulated seasonal temperature and precipitation (Massey et al., 2014) as well as for higher statistical moments, shown in the Supplement against the ERA-Interim reanalysis data set (Dee et al., 2011). Especially in more continental parts of the European model region, HadRM3P shows a pronounced hot and dry bias in simulated summer weather (Figs. S1-S3 in the Supplement). However, note that the ensemble setup still captures the entire range of the observed distribution ( Fig. S1 in the Supplement). In HadRM3P, these biases are likely related to an imperfect parametrization of cloud processes in the model, leading to an overestimation of incoming solar radiation, which in turn triggers warm and dry summer conditions (R. Jones, personal communication, 2015) that are further amplified by strong soil moisture-temperature coupling in the model (Fig. S4 in the Supplement). In this context, it is worthwhile to note that these biases are not a peculiarity of the regional climate model employed in this study, but indeed hold for many dynamically downscaled regional climate model simulations over Europe (Buser et al., 2009;Boberg and Christensen, 2012).

Simulation of atmosphere-biosphere carbon and water fluxes
To assess terrestrial biosphere impacts of bias-correcting regional climate simulations (see Sect. 4.3), we simulate ensembles of atmosphere-biosphere fluxes of carbon (NEE, GPP, Reco) and water (AET) using the Lund-Potsdam-Jena managed land scheme (LPJmL, Version 3.5, Sitch et al., 2003;Bondeau et al., 2007), a state-of-the-art process-based dynamic global vegetation model that accounts for human land use. We follow Schulze (2006) and Chapin III et al. (2006) in their definition of major components of carbon cycling in terrestrial ecosystems: gross primary productivity (GPP) denotes the vegetation's gross photosynthetic uptake of carbon from the atmosphere, whereas ecosystem respiration (Reco) is defined as the respiratory release of carbon by plants and microbes in the ecosystem, i.e. including both (autotrophic) plant respiration and (heterotrophic) soil organic matter decomposition. Net ecosystem exchange (NEE) constitutes the net carbon flux from the ecosystem to the atmosphere, i.e. the difference between Reco and GPP. LPJmL simulates vegetation dynamics (growth, competition and mortality) and fully coupled cycling of carbon (photosynthesis, autotrophic and heterotrophic respiration) and water (transpiration, evaporation, interception, runoff) in terrestrial ecosystems and managed systems (Sitch et al., 2003;Gerten et al., 2004;Bondeau et al., 2007). The model is driven with monthly or daily climatic input data (temperature, precipitation, incoming shortwave radiation and net longwave radiation), atmospheric carbon dioxide concentrations and soil texture. Vegetation structure in LPJmL is characterized by the fractional coverage of 11 plant functional types that differ in their bioclimatic limits and ecophysiological parameters. Vegetation dynamics and competition are explicitly represented using a set of allometric and empirical equations and updated annually (Sitch et al., 2003).
GPP in LPJmL follows the process-oriented coupled photosynthesis and water balance scheme of the BIOME3 model (Haxeltine and Prentice, 1996). Subsequently, autotrophic (growth and maintenance) respiration is subtracted from GPP, and the net carbon uptake is allocated to plant compartments based on a set of allometric constraints (Sitch et al., 2003). Ecosystem heterotrophic respiration depends on temperature and moisture in each litter and soil carbon pool; carbon decomposition dynamics are simulated as first-order kinetics with specified decomposition rate in each pool (Sitch et al., 2003). Water cycling in LPJmL has been improved by Gerten et al. (2004) and Schaphoff et al. (2013), where actual evapotranspiration (the sum of evaporation, transpiration and interception) is computed as a function from atmospheric demand and soil moisture supply. Phenology and photosynthesis-related parameters in the LPJmL version used in this paper have been optimized against remote sensing observations for an improved simulation of natural vegetation greenness dynamics (Forkel et al., 2014), including the introduction of a novel phenology scheme.
LPJmL has been applied in a range of studies assessing ecosystem responses to anomalous climatic conditions Van Oijen et al., 2014;Zscheischler et al., 2014b;Rolinski et al., 2015). Rolinski et al. (2015) argued that the model might be able to capture various ecosystem physiological responses to climatic extreme events such as heat or drought through various pathways. These include a water stress response through reduced stomatal conductance, which in turn decreases both photosynthetic carbon uptake and transpiration. Further, the model responds to very high temperatures by a photosynthesis inhibition and increased respiration .
Furthermore, an adjustment of daily variability does not necessarily improve monthly statistics, thus emphasizing the role of timescales at which bias correction is conducted (Haerter et al., 2011). Lastly, if impact simulations are to be conducted with bias-corrected output of numerical cli- mate models, the multivariate correlation structure between climate variables deserves attention: most bias correction schemes that are currently in use to simulate impacts have been suggested to correct each variable separately (Hempel et al., 2013) and hence dependencies between variables are often not retained. This is especially critical for assessments of extreme events and "compound events" (Leonard et al., 2014), where inter-variable interactions, such as soil moisture-temperature feedbacks, might play an important role (Seneviratne et al., 2006). Although recent progress has been made to derive bivariate bias correction schemes (Hoffmann and Rath, 2012; Piani and Haerter, 2012;Li et al., 2014), to the best of our knowledge currently no bias correction scheme retains a multivariate correlation structure of a larger set of input variables for impact simulations.
In this paper, we use the weather@home climate data to derive ensemble-based simulations of the functioning of terrestrial ecosystems. LPJmL simulations are conducted in natural vegetation mode (i.e. no human land use, fire or permafrost) in 0.5 • spatial resolution and monthly time steps over Central Europe. For each bias-corrected ensemble data set, 2000 years of spinup to equilibrate soil carbon pools were conducted, using randomly chosen years from the first 10 years of the HadRM3P ensemble. Subsequently, atmosphere-biosphere fluxes were simulated at the monthly timescale for 1986-2010 over Central Europe (see Fig. 1 for methodological workflow). This procedure was repeated five times to check that no carry-over effects from the randomized spinup affect simulated biosphere-atmosphere carbon fluxes in the transient period. Since this was not the case, differences in carbon and water fluxes and their extremes can be directly attributed to the bias correction of the climatic forcing in the transient period, and are analysed in Sect. 4.3.

Observations
Any statistical assessment or correction method of biases requires reference data sets, and the quality of bias adjustment is thus restricted by the quality of observations or reanalysis data available (Ehret et al., 2012;Hempel et al., 2013). Consequently, the sensitivity of "bias corrected" model output to any given set of observations needs to be tested. In this study, a range of observational data sets is used in order to characterize uncertainty induced by using different observations for bias correction. In total, seven different temperature and precipitation data sets consisting of gridded observations/reanalysis were used (one at a time) for the univariate bias correction (Sect. 4.2) and are detailed in Table 1. The simultaneous correction of multiple variables for the impact simulations in the terrestrial biosphere presented in Sect. 4.2 are conducted using ERA-Interim as reference data set (Dee et al., 2011, see Table 1).
To conduct the sensitivity analysis of climatic extremes and associated biosphere impacts to the type of bias correction applied, we select one focus region in Central Europe. This region roughly encompasses Germany (47.5-55.0 • N, 6.0-15.0 • E, see e.g. Fig. S2a in the Supplement) and consists of temperate mid-latitude climate with maritime influence to the northwest and more continental characteristics to the east. In addition, to sample local (i.e. grid cell scale) variability we test different bias correction scheme on one single grid cell located in Central Germany ("Jena pixel", 50.75 • N, 11.75 • E). Cumulative density function

Methods
In this section, we describe the different bias correction methods deployed in this study. First, a bias correction methodology for impact simulations that has been adopted widely is summarized (Hempel et al., 2013). Second, we introduce the novel resampling-based bias correction scheme and lastly the methodologies for evaluation are described.

Statistical bias correction
Hempel et al. (2013) presented a bias correction that is designed to preserve long-term trends in simulated impacts and that has been used widely in simulating effects of climatic changes in different sectors such as water, agriculture, ecosystems, health, coastal infrastructure and agro-economy (see Warszawski et al., 2014, for an overview). The approach builds on earlier, conventional statistical bias correction schemes (Piani et al., 2010;Haerter et al., 2011) and is based on linear transfer functions of the form Here, x and x cor represent the simulated and corrected climatic variable, a and b are coefficients to be calibrated. In Hempel et al. (2013) the transfer function is applied additively (for temperature, i.e. b = 1), such that where T mod and T obs represent the means of simulated and observed monthly temperatures, respectively. To account for positivity constraints for precipitation and radiation components, Hempel et al. (2013) suggested a multiplicative adjustment of those variables (i.e. a = 0), such that b = x obs x mod .
These parametric transformations are applied on each grid cell and for each month separately to account for potential temporal and spatial structure in the biases. By applying this transfer function, long-term monthly means of the simulated distributions are matched with those in observations for each grid cell (Hempel et al., 2013). In addition to adjusting monthly means, Hempel et al. (2013) also adjust daily variability about the monthly means, but (importantly) the yearto-year variability at monthly timescales remains unchanged. In our present analysis, we follow this conventional bias correction scheme for comparison and denote it by "ISIMIP".
Furthermore, to isolate the effects of bias-correcting the full suite of impact variables (temperature, precipitation and radiation) vs. correcting simulated precipitation only, we conduct impact simulations with a "precipitation only" biascorrected scenario ("PRECIPCOR").

A novel resampling-based ensemble bias correction scheme
In this section, we introduce a novel "bias correction scheme" suitable for ensemble simulations that retains the physical consistency and multivariate correlation structure of the model output. The idea is to resample plausible ensemble members from a large ensemble simulation given the statistical distribution of an observable meteorological metric ("constraint"). The procedure is illustrated using the weather@home ensemble described above. The largest biases in the HadRM3P simulation occur in the summer season (JJA) over the European model domain, where the model ensemble produces too frequent and too pronounced hot and dry conditions (Massey et al., 2014). Importantly however, the ensemble spans the entire distribution of observed summer conditions in most parts of Europe, i.e. some (but too few) ensemble members produce relatively wet and cold summers. Therefore, our resampling-based correction approach is designed to alleviate the representation of summer conditions in the model ensemble.
The bias correction procedure consists of the following steps and is illustrated in Fig. 2: 1. Define an observable meteorological metric that is poorly represented ("biased") in the model ensemble.
In this paper, we use summer mean temperatures over Central Europe, which are relatively well-constrained in observational data sets.
2. Estimate the probability distribution function of the meteorological constraint from observational data sets using e.g. a kernel density fit (f obs (x), see e.g. Fig. 2a, blue line for an illustration), where x denotes the constraint.
Here, we use a Gaussian kernel with reliable data-based bandwidth selection (Sheather and Jones, 1991) fitted over the observed meteorological constraint for the period 1986-2011 using various observational data sets (one at a time).
3. Estimate the probability distribution of the meteorological constraint in the model ensemble using the same estimation procedure as above over all ensemble members and all years (f mod (x), see Fig. 2a, red line). The deviation between the red and blue line in Fig. 2a illustrates the temperature bias in the weather@home ensemble.
4. Derive a transfer function that maps any given quantile in the observations (q obs,X ) to the respective quantile in the model ensemble (q mod,X , see Fig. 2b), such that q mod,X = T F (q obs,X ) using the fitted kernelsf obs (x) andf mod (x) to determine empirical quantile functions. For example, a "median temperature" summer over Central Europe (approx. 17.2 • C) would correspond to the 50th percentile in the observations-based kernel (by definition). The transfer function would then map the 50th percentile inf obs to the corresponding 20.4th percentile inf mod (i.e. average summer temperatures of 17.2 • would correspond to the 20.4th percentile in the model ensemble, see Fig. 2b). In this study, we use Cubic Hermite splines (Fritsch and Carlson, 1980) to determine the transfer function shown in Fig. 2b.
5. Derive a new "bias-corrected" ensemble (of sample size n) by randomly resampling n times from observed percentiles (q obs,X ) and retaining the ensemble member that corresponds to q mod,X as given by the transfer function (n = 800 per year in our study).
Hence, the outlined procedure does not adjust any output variable in the model ensemble thus preserving physical consistency, but rather selects plausible ensemble members. This procedure invariably leads to a reduction in the effective ensemble size: for example in the HadRM3P ensemble, roughly the hottest 20 % of simulations are effectively not chosen for the resampled ensemble since they are implausibly hot (Fig. 2a). However, an evaluation of the sample size in the bias-corrected ensemble shows that at least 4 % of the ensemble simulations match any decile of observations ( Fig. 2d; in an unbiased ensemble exactly 10 % of ensemble simulations would match each decile of observations), corresponding to an effective sample size of at least approx. 1000 model years (= ensemble members) per decile of observations (Fig. 2d).
In conclusion, the outlined approach to bias correction is conceptually similar to earlier ideas of assigning weights to different regional climate model projections based on each model's performance in order to derive probabilistic multimodel projections (Piani et al., 2005;Collins, 2007;Knutti, 2010;Christensen et al., 2010). However, instead of a weight assignment, specific ensemble members are selected and combined into a new ensemble using the statistical distribution of observed meteorological constraints.

Analysis methodology
In Sect. 4.1 the outlined bias correction method is evaluated for the simulation of temperature, rainfall and radiation using standard evaluation metrics such as seasonal mean values and interannual variability. Further, we evaluate soil moisture coupling in the original and bias-corrected ensemble against reanalysis data and upscaled observations by computing the correlation between summer mean temperatures and the mean latent heat flux following Seneviratne et al. (2006). Moreover, we analyse empirical return times of the original and bias-corrected ensembles that are derived by plotting each ensemble value against its rank both for climatological extremes (Sect. 4.2: monthly summer temperatures and cumulative summer rainfall) and simulated ecosystematmosphere annual fluxes of water and carbon (Sect. 4.3).
To further understand discrepancies between the biascorrected ensemble simulations and observed climate extremes (Sect. 4.2), we characterize the tails of simulated and observed variables by extreme value theory (Coles, 2001). Hence, generalized extreme value distributions (GEV) are derived from monthly temperature and precipitation in a procedure similar to Sippel et al. (2015a), i.e. by resampling block-maxima in randomly concatenated 10-year sequences of ensemble data and fitted to a GEV model using generalized maximum likelihood estimation. In observational data, only a relatively small sample size is available (mostly 1901-2014 only) that is additionally plagued by nonstationarity and does not match the period in which ensemble simulations are available . Hence, for monthly temperatures we subtract the trend and seasonal cycle from the original time series using Singular Spectrum Analysis , and subsequently resample (monthly) summer temperature anomalies (for the whole time series) by adding a trend and seasonal cycle component drawn randomly from the period of available ensemble simulations (1986-2011, each observational data set is analysed separately). Approximate stationarity was assumed for seasonal precipitation, and hence no further adjustments were made. Lastly, GEV models were fitted to the observations following the procedure as described above.

Results
This section is structured as follows. First, we evaluate the bias correction procedure both for resampling based on an area mean and grid-cell-based constraint. Second, climate extreme statistics and their sensitivity to bias correction schemes are investigated (Sect. 4.2). More specifically, the probabilistic resampling scheme introduced in Sect. 3.2 is evaluated against a conventional bias correction scheme (Hempel et al., 2013, Sect. 3.1) and compared against the uncorrected simulations and different observational data sets. Third, we illustrate how biases and their "correction" propagate into climatic impacts exemplified by simulations ecosystem water and carbon fluxes in Central European natural vegetation.

Evaluation of resampling bias correction
An evaluation of the distribution of variables in the resampled ensemble in Central Europe shows that it not only im-     Figure 4. Return times of hot (a, c) and cold (b, d) temperature extremes in summer (JJA) in the original regional model simulations ("ORIG"), in the resampled ensemble ("PROBCOR") and the mean-adjusted ensemble ("ISIMIP"). Plots are shown as spatial averages over Central Europe (top panels) and for an illustrative grid cell (Jena, bottom panels). Black dots in each plot indicate empirical return times estimated from observations taken from seven different data sets that were used for bias correction.
proves the simulation of seasonal mean temperatures (which it does by construction), but also yields considerable improvements to the simulation of rainfall and radiation components (Fig. 3). This suggests that these biases are related to specific synoptic situations in summer, justifying us in applying the bias correction approach to summer months. Hence, the multivariate covariance structure between temperature, precipitation and radiation as simulated by HadRM3P appears to be well represented in the model simulations posterior to the updating procedure given the reanalysis/observational data. Moreover, while this procedure also improves the simulation of summer temperatures and precipitation on a monthly timescale, virtually no changes in the ensemble statistics are induced to non-summer months (Fig. S1 in the Supplement), indicating that the timescales of temporal decorrelation are short enough for a successful application of the resampling procedure (Fig. S1 in the Supplement). However, while conventional statistical bias correction following Hempel et al. (2013) adjusts monthly means of the distributions of precipitation and radiation (by construction), changes are induced by the multiplicative adjustment to the width and shape of the distribution, including its tails (Fig. 3, see also Sect. 4.2).
An evaluation of the resulting spatial patterns of the resampling bias correction shows that the representation of the simulated statistical distributions of temperature and precipitation are considerably improved in Central Europe (area mean constraint) and across the entire European model region (single grid cell constraints, Figs. S2-S3 in the Supplement). Remarkably, this holds not only for seasonal averages, but also for higher statistical moments such as the inter-decile range (Figs. S2-S3 in the Supplement).
Furthermore, we test the representation of landatmosphere coupling in the original and resampled model ensemble by investigating the correlation strength between summer mean temperatures (T) with latent heat (LE) fluxes following Seneviratne et al. (2006). The original HadRM3P ensemble shows strong water limitation of evapotranspiration in summer (negative correlation between LE and T) for most temperate and Mediterranean European regions, thus overestimating soil moisture control compared to reanalysis data and upscaled observations (Fig. S4 in the Supplement). In the resampled ensemble, land-atmosphere coupling remains strongly soil moisture controlled in the Mediterranean regions, but reduces in temperate European regions, resulting in spatial patterns that resemble those of land-atmosphere coupling in ERA-Interim (Fig. S4 in the Supplement). The latter finding indicates that the procedure of eliminating implausible ensemble members also yields an improved representation of physical processes such as land-atmosphere coupling in the resampled ensemble.
4.2 Sensitivity of climatic extremes to bias correction 4.2.1 Summertime temperature extremes Summertime monthly extreme temperatures are shown in Fig. 4 as a spatial average for the study region located over Central Europe and for an illustrative and randomly chosen grid cell ("Jena grid cell").
The location, slope and shape of the lines in the return time plots shown in Fig. 4 reveal that the tails of simulated monthly temperature extremes are highly sensitive to the type of bias correction applied, both for a regional average and a single grid cell: uncorrected simulations overestimate both location and scale (i.e. slope of the line in the return time plot) of positive temperature anomalies in summer, while this is not the case for anomalously cold summer months (Fig. 4). An additive adjustment of monthly means (orange lines in Fig. 4, Hempel et al., 2013) preserves slope and shape of the tail, i.e. preserves the year-to-year variability of simulated monthly temperatures (and biases therein) in the ensemble. Note that this procedure cannot account for the asymmetry between the upper and lower tail of simulated monthly temperatures -i.e. the offset correction leads to an overcorrection of cold months, whereas the statistics of the hot tails improve only marginally. This is confirmed by a statistical extreme value analysis (Figs. S5-S6 in the Supplement): the temperature offset approach adjusts only the location of the GEV yielding spurious artefacts in the (originally well simulated) cold tail, whilst not accounting fully for the upper tail due to the aforementioned asymmetries. This is a fundamental drawback of using linear parametric transfer functions, i.e. even if the variability of the simulated distributions would have been adjusted along with the means (see e.g. Sippel and Otto, 2014), the outlined "asymmetry" issue would not necessarily improve. On the other hand, the probabilistic resampling procedure alters both the location and slope of the lines in the return time plot, where resampling based on a spatial average as well as on a grid cell constraint yield relatively similar representations of the tails. An evaluation of the extreme value statistics shows that the probabilistic procedure indeed considerably improves the statistical characteristics of the simulated tails in the ensemble compared to (long-term) observations (Figs. S5-S6 in the Supplement). To this end, resampling the original ensemble changes location and scale of the extreme value distributions, but the shape parameter of the tails remain effectively unchanged. Some caution is required due to the relatively scarce availability of observed monthly mean temperatures (i.e. 1901-2014), which induces considerable uncertainties to the pa-rameters of the fitted GEV distributions (Figs. S5-S6 in the Supplement). Moreover, the different time periods of observations and ensemble simulations (1986-2011) impede a direct "evaluation" of the bias correction. Nonetheless, this indicative comparison yields very promising results of biascorrecting without invasive changes to the simulated statistical distribution.
Lastly, our analysis shows that any bias correction based on a single grid cell level induces some uncertainty due to the choice of observational data set. This is an important issue to consider if impact model simulations on a grid cell scale are to be conducted, whereas regional averages are not as strongly affected. Figure 4 shows that resampling the ensemble based on a spatial average constraint reduces this uncertainty as compared to adjusting monthly means or resampling on a grid cell scale.

Summertime rainfall extremes
We extend the analysis of the previous paragraph to investigate how resampling based on a temperature constraint alters the representation of summer precipitation in a large ensemble simulation. The original HadRM3P simulated summer seasons are too dry on average over Central Europe (Fig. S2), which is largely due to a much too dry lower tail (Fig. 5), whilst simulated heavy monthly precipitation matches relatively well the available observational data (Fig. 5).
The tails of simulated (cumulative) seasonal precipitation are sensitive to bias correction. As above, the plots in Fig. 5 illustrate that a statistical adjustment of the means can be detrimental to statistics of extremes and variability. For instance, scaling monthly means to match observations (Hempel et al., 2013) leads to an inflation of very wet seasons that are physically implausible given the observations (Fig. 5, orange lines). Likewise, the (biased) dry tail in HadRM3P improves only to a very limited extent if the scaling approach is used. The extreme value analysis (Fig. S6 in the Supplement) shows that the multiplicative adjustment changes both location and scale of the tail distribution -and that both parameters are not necessarily improving (indeed often deteriorating, see e.g. scale parameters in Fig. S6 in the Supplement) by applying a simple statistical bias correction. However, resampling based on a temperature constraint yields a new ensemble, in which the simulation of both tails has improved (Fig. 5, Fig. S6 in the Supplement). Only minor changes have been induced to the (well-simulated) wet tail, whilst the previously strongly biased dry tail has considerably improved (Fig. 5, Fig. S6 in the Supplement), indicating that temperature-based resampling as deployed here successfully separates "plausible" ensemble members from the (unrealistic) hot and dry members. The extreme value analysis shows that resampling largely alters the location of the simulated distribution of seasonal rainfall extremes, whilst the scale and shape of the tails remain largely unchanged. Precipitation, JJA [mm]  Figure 5. Return times of wet (a, c) and dry (c, d) rainfall extremes in summer (JJA) in the original regional model simulations ("ORIG"), in the resampled ensemble ("PROBCOR") and the mean-adjusted ensemble ("ISIMIP"). Plots are shown as spatial averages over Central Europe (top panels) and for an illustrative grid cell ("Jena pixel", bottom panels). Black dots in each plot indicate empirical return times estimated from observations taken from seven different data sets that were used for bias correction.
To conclude, it was shown that resampling based on a univariate observations-based temperature constraint improves the simulation of rainfall variability and extremes by teasing out ensemble members that are implausibly hot and dry in our case study region.

The impact of bias correction on simulated ecosystem water and carbon fluxes
In this section, we present HadRM3P-LPJmL ensembles of simulated fluxes of carbon and water and discuss bias correction methods with a focus on the extreme tails of the simulated distributions. Further, we investigate the sensitivity of the simulated carbon fluxes to an accurate representation of rainfall in the climatic input data. Annual mean fluxes across the large ensemble of NEE, GPP, Reco and AET are shown in Table 2 for the 1986-2010 period for each bias correction and the control simulation. Conventional statistical bias correction that matches monthly means of the HadRM3P ensemble exactly to those of the ERA-Interim control climate simulation yields differences in fluxes of −6.6, −7.5 and −4.7 % for GPP, Reco and AET, respectively. Note that differences in the resampled HadRM3P ensemble are less pronounced (−4.2, −4.5 and −2.0 %, respectively), although no attempt has been made to adjust the statistical properties of the model output. Those differences in simulated annual mean fluxes are related to higher statistical moments of the statistical distributions and shown in Fig. 6.
To this end, simulated GPP, NEE and AET show strong asymmetry in their simulated distributions (Fig. 6): Negative anomalies in GPP and AET are much more pronounced than positive ones; this holds also for NEE but with an inverted sign (ecosystem carbon release corresponds to positive fluxes). However, the simulation of these extremes is highly sensitive to bias correction, where the lower tails of GPP and AET in the original and statistically bias-corrected ensemble strongly overestimate reductions in carbon and water flux. In contrast, negative GPP and AET anomalies in the resampled ensemble (corresponding to positive ones in NEE) exhibit a much less pronounced lower tail and asymmetry and agree well with the control simulations.
For example, a positive anomaly in NEE corresponding to a 30-year return period exceeds +200 g C m −2 yr −1 in the conventionally bias-corrected simulations and the original   ensemble, whereas such an anomaly in the resampled ensemble hardly reaches +150 g C m −2 yr −1 (Fig. 6b), roughly corresponding to an empirical 30-year return event in the ERA-Interim control simulations. Similar arguments can be made for negative anomalies in annual GPP and annual AET (Fig. 6). The different tails of the simulations occur because the original meteorological ensemble implies large hot and dry biases in summer, inducing negative anomalies in ecosystem-atmosphere carbon and water cycling. These biases are not accounted for by conventional statistical bias correction but they are alleviated if an ensemble resampling scheme is used (see previous section). However, this is remarkable because monthly means of precipitation in PRE-CIPCOR and ISIMIP are identical to the control climate simulation, which highlights the importance to consider statistical moments beyond the mean for impact simulations. However, note that the positive tails of GPP and AET are not as strongly affected. Furthermore, ecosystem respiratory fluxes show a relatively lower sensitivity to bias correction (i.e. to hot and dry summer conditions).
Further, we investigate whether different bias correction schemes induce different sensitivities of LPJmL simulated carbon fluxes to rainfall. Here, the relationship between a growing season rainfall proxy (April-September rainfall sums) and annual NEE is characterized using piecewise linear regression (Fig. 7a-d). Figure 7e shows that LPJmL simulated annual NEE responds to rainfall in a roughly similar way across different bias correction schemes, which highlights the need of an accurate representation of precipitation in climate impact simulations in the terrestrial biosphere. However, characterizing the annual NEE response for each quantile of the rainfall distribution shows that the resampled rainfall distribution (PROBCOR) leads to a less negative NEE response to rainfall (larger slopes in Fig. 7f), whereas a dry summer tail (in the ORIG, ISIMIP and PRECIPCOR simulations) yields a generally stronger NEE response (more negatively sloped in Fig. 7f).
In conclusion, different bias correction methods induce different statistical properties of simulated ecosystematmosphere fluxes of carbon and water. This affects the variability and skewness of NEE, GPP and AET simulations (as shown in Fig. 6), where hot and dry biases in summer imply a disproportional reduction in carbon and water fluxes in climatically "unfavourable" years. Conventional statistical bias correction cannot account for this issue, whereas the novel probabilistic bias correction schemes alleviates those biases to a very large extent.

Discussion
In this paper, we have introduced a novel ensemble-based resampling bias correction approach that retains the physical consistency and multivariate correlation structure of regional climate model output. The approach thus relies on a physically consistent set of climate model simulations (i.e. closure of water and energy balances). The methodology is conceptually similar to earlier approaches designed to constrain future probabilistic climate predictions based on observational constraints (Piani et al., 2005;Collins, 2007). Its application has been shown in this paper to yield considerably improved simulations of weather and climate extremes. Remarkably, the improvement holds for variables that have not been con-strained upon (i.e. constraining on seasonal mean temperatures improves the representation of mean and extreme precipitation), which indeed emphasizes the importance to bias correct in a physically meaningful way.
Furthermore, simple but widely used statistical bias correction methodologies (e.g. Hempel et al., 2013) have been evaluated with respect to the effect on the representation of weather and climate extremes on monthly to seasonal timescales. These methods cannot account for biases associated with e.g. specific synoptic situations that result in biases in higher statistical moments of the simulated distributions, which indeed emphasizes the importance to bias correct in a physically meaningful way. We demonstrated that this shortcoming of conventional methodologies can be detrimental to statistics of weather and climate extremes and their variability. More sophisticated statistical bias correction schemes (see Gudmundsson et al., 2012, for an overview) that might have an improved skill in rectifying biases in higher statistical moments (such as e.g. asymmetries in simulated distributions) have not been explicitly tested in this study. However, the fundamental question of how physical consistency can be preserved after bias correction (Ehret et al., 2012), including multivariate dependencies between variables, remains elusive. Therefore nonlinear and nonparametric bias correction techniques (Gudmundsson et al., 2012) might potentially improve statistics of extreme events if a large enough sample of observations is available, but cannot retain physical consistency (Sippel and Otto, 2014) and may ultimately fall short for correcting a set of input variables.
To this end, we have explicitly simulated an ensemble of ecosystem-atmosphere fluxes of carbon and water using a state-of-the-art biosphere model (LPJmL) in order to test the sensitivity to bias correction. Similarly to above, we find that bias correction induces strong effects on the representation of extremes and variability in carbon and water fluxes (Sect. 4.3). Mechanistically, the stark contrast between the bias correction schemes can be traced back to the sensitivity of the LPJmL model to dry conditions (see e.g. Rammig et al., 2014;Rolinski et al., 2015): NEE, GPP and AET in Central Europe are to a large extent driven by the availability of rainfall in the growing season, except for wet conditions, under which the relationship levels off (Fig. 7). Bias correction strongly affects the variability and extremes of rainfall (as shown above), thus inducing pronounced asymmetries in simulated water and carbon fluxes (Figs. 7f and 6). Therefore, our results highlight the importance to account not only for biases in the mean but also for higher moments in the climatic input in order to generate robust insights into the past, present and future climate impacts. Our results demonstrate that physically consistent bias correction schemes might be preferable for this task. Moreover, it has been shown recently that climatic drivers exert multivariate controls on ecosystem responses such as phenology and vegetation greenness dynamics (Forkel et al., 2015), therefore accurate ecosystem shorter time periods (e.g. single years) from smaller ensembles such as CORDEX regional simulations (Giorgi et al., 2009) would provide a promising topic for further study. In this context, the applicability of the resampling methodology would depend on the remaining effective sample size after the resampling step. The latter is a function of the biases in the model and the number of ensemble members available, and could be tested in an evaluation step similarly to Fig. 2d. Thirdly, the applicability of bias correction methods for future projections is currently unclear, since previous studies have shown that biases in climate projections (e.g. for the 21st century) might not be stationary (Ehret et al., 2012;Maraun, 2012). However, an application of the resampling approach to future projections similarly to the current practice of statistical bias correction (Hempel et al., 2013, e.g.) would be straightforward, i.e. based on a calibration using present or past conditions. Lastly, a clear distinction between bias correction and statistical downscaling is crucial (Maraun, 2013): while the resampling bias correction is designed to account for the former, no attempt at statistical downscaling or bridging any scale mismatches is made (see e.g. Maraun, 2013, for a detailed discussion).
Notwithstanding these limitations however, we show the functionality of the novel bias correction scheme that might be a useful and physically consistent alternative to conventional statistical bias correction as long as global and regional dynamic climate models suffer from pertinent biases.

Conclusions
In this paper, we introduced a novel bias correction method that retains physical consistency and the multivariate correlation structure of the climate model output based on an ensemble resampling approach. We showed that such an approach strongly improves (a) statistics of weather and climate extreme events, and (b) the simulation of climate impacts such as ecosystematmosphere fluxes of carbon and water, including extremes and variability therein.
The methodology could be readily taken up in probabilistic event attribution studies that deploy large ensemble simulations (see  for an overview) in order to more realistically describe the statistics of (changing) extreme events.
Furthermore, detecting and attributing the impacts of climatic variability and extremes on hydrological and socioecological systems has emerged as a highly topical research area (Stone et al., 2009, including demonstrated interest by stakeholders across various sectors (Schiermeier, 2011;Stott and Walton, 2013;Sippel et al., 2015b). To this end, our study showed that it is crucial to account for higher statistical moments in biased climatic input data, and to correct climatic biases in a physically consistent way. Therefore, our methodology could be taken up by the climate impact modelling community to reduce climate forcing biases to a very large extent without requiring any modifications to the climate model output.
The Supplement related to this article is available online at doi:10.5194/esd-7-71-2016-supplement.