Reconciling the signal and noise of atmospheric warming on decadal timescales

Interactions between externally forced and internally generated climate variations on decadal timescales are a major determinant of changing climate risk. Severe testing is applied to observed global and regional surface and satellite temperatures and modelled surface temperatures to determine whether these variations are independent, as in the traditional signal-to-noise model, or whether they interact, resulting in step-like warming. The multistep bivariate test is used to detect step changes in temperature data. The resulting data are then subject to six tests designed to distinguish between the two statistical hypotheses, h_step and h_trend. Test 1: since the mid-20th century, most observed warming has taken place in four events: in 1979/80 and 1997/98 at the global scale, 1988/89 in the Northern Hemisphere and 1968–70 in the Southern Hemisphere. Temperature is more step-like than trend-like on a regional basis. Satellite temperature is more step-like than surface temperature. Warming from internal trends is less than 40 % of the total for four of five global records tested (1880–2013/14). Test 2: correlations between step-change frequency in observations and models (1880–2005) are 0.32 (CMIP3) and 0.34 (CMIP5). For the period 1950–2005, grouping selected events (1963/64, 1968–70, 1976/77, 1979/80, 1987/88 and 1996–98), the correlation increases to 0.78. Test 3: steps and shifts (steps minus internal trends) from a 107-member climate model ensemble (2006–2095) explain total warming and equilibrium climate sensitivity better than internal trends. Test 4: in three regions tested, the change between stationary and non-stationary temperatures is step-like and attributable to external forcing. Test 5: step-like changes are also present in tide gauge observations, rainfall, ocean heat content and related variables.
Test 6: across a selection of tests, a simple stepladder model better represents the internal structures of warming than a simple trend, providing strong evidence that the climate system is exhibiting complex system behaviour on decadal timescales. This model indicates that in situ warming of the atmosphere does not occur; instead, a store-and-release mechanism from the ocean to the atmosphere is proposed. It is physically plausible and theoretically sound. The presence of step-like – rather than gradual – warming is important information for characterising and managing future climate risk.


Introduction
The dominant paradigm for how the climate changes over decadal timescales is based on the standard signal-to-noise model, where the externally driven signal of climate change forms a trend surrounded by the internally generated noise of climate variability. Here, the external driver of interest is radiative forcing produced by anthropogenic greenhouse gas emissions, mediated by other anthropogenic emissions such as sulfate aerosols and black carbon. This paradigm is widely represented by trend analysis, which extracts a monotonic signal from a noisy time series (e.g. North et al., 1995; Hegerl and Zwiers, 2011; Santer et al., 2011). The resulting methodology dominates climate practice, forming the basis for detection and attribution, projection, prediction and characterisation of climate risk.
However, it is not the only theoretically plausible representation of a changing climate (Palmer, 1999; Branstator and Selten, 2009; Solomon et al., 2011; Kirtman et al., 2013). The two main hypotheses that describe how externally driven and internally generated climate may be related over decadal timescales are (Corti et al., 1999; Hasselmann, 2002)
H1. Externally forced climate change and internally generated natural variability change independently of each other.
H2. They interact, for example, where patterns of the response project principally onto modes of climate variability (Corti et al., 1999) or form a two-way relationship (Branstator and Selten, 2009).
Published by Copernicus Publications on behalf of the European Geosciences Union.
These interactions can lead to a range of different outcomes. For global mean surface temperature, the signal is generally portrayed as following a linear pathway that conforms to the relationship δT = λδF, where T is temperature, F is forcing and λ is a constant related to feedback processes (Ramaswamy et al., 2001; Andrews et al., 2015). This is widely accepted for both H1 and H2 over longer timescales (e.g. > 50 years), but how boundary-limited and initial-condition uncertainties combine over shorter timescales remains unclear.
For H1, if the response to external forcing is considered to be independent of variability over shorter timescales (< 50 years), the trend model will hold, despite often being obscured by variability. Such variability is generally represented as stochastic behaviour in annual to decadal phenomena, where teleconnections, lagged effects and regime changes all potentially interact (Solomon et al., 2011; Kirtman et al., 2013). Alternatively, instead of a gradual line or curve, a segmented trend is sometimes proposed, where the signal of atmospheric warming is modified by varying decadal regimes governing oceanic sources and sinks of heat (Meehl et al., 2013; Cahill et al., 2015; Trenberth, 2015). All these statistical models are linked by the representation of warming as a gradual process, leading to the gradualistic narrative of change (Jones et al., 2013).
The potential behaviour of warming under H2 has many possible permutations because the signal may project onto the regime-like structures of decadal climate variability, or it may dynamically modify those structures. Although a number of nonlinear and often abrupt changes in climate are recognised as part of decadal change, these are overwhelmingly attributed to changes in climate variability. Here, we deal with one such type of response, manifesting as step changes.
Step changes have been detected in warming and related climatic variables by several different methods (Jones, 2010; Reid and Beaugrand, 2012; Jones et al., 2013; Belolipetsky, 2014; Belolipetsky et al., 2015; Bartsev et al., 2016; Reid et al., 2016); in one case, step-like warming over SE Australia was attributed to anthropogenic forcing (Jones, 2012). The purpose of this paper is to detect step changes in a range of temperature records and to apply severe testing to steps and trends to determine which carries the greater part of the warming signal. The results are used to determine whether H1 or H2 is the more viable hypothesis and, if the signal is shown to be non-gradual, to explore the nature of the interaction between external forcing and internal variability.
We apply a methodology combining theoretical-mechanistic and statistical-inductive reasoning to test which statistical model, step or trend, better represents the warming signal on decadal timescales. It is applied to the substantive null of model adequacy approach described by Mayo and Cox (2010) as part of severe testing principles articulated by Mayo and Spanos (2010). Although a test may provide a small p value for the null hypothesis, other tests may do so as well, in which case the hypothesis that test represents is provisional. Support for both H1 and H2 in the literature shows this to be the case. The presence of several statistical models with similar p values also shows that there are viable alternatives to the simple trend model (Seidel and Lanzante, 2004).
A substantive null of model adequacy is where a test closely supports a hypothesis and where a rival test has a high probability of detecting a specific discrepancy from that hypothesis, if that rival hypothesis is correct (Mayo and Cox, 2010). The testing model can be adapted for a single hypothesis or rival hypotheses. If the rival test fails, then the original hypothesis succeeds; if the rival test succeeds, then the original test should also have a low probability of detecting a specific discrepancy from the rival hypothesis. When rival hypotheses are being tested, confirmation and falsification provide two different views of the same issue.
The theoretical-mechanistic component describes plausible, alternative physical processes in the climate system required to sustain steps and trends respectively. Step changes are measured using an objective rule-based multistep adaptation of the bivariate test of Maronna and Yohai (1978) to analyse regional and global surface air temperature, global satellite temperature of the lower troposphere and global mean temperature from the CMIP3 and CMIP5 climate model archives. The data produced by those analyses are then subject to six tests designed to distinguish between steps and trends as the main drivers of the anthropogenic climate signal over decadal timescales.

Methodology
The process of theoretical-mechanistic and statistical-inductive reasoning requires matching scientific hypotheses (H) with statistical hypotheses (h) in order to distinguish between alternative hypotheses. The next few sections detail how this has been carried out. This employs a hierarchy of models between theory and data, as suggested by Suppes (1962) and articulated by Haig (2016). Underlying theory is used to inform plausible mechanisms for alternative types of change (steps and trends), experimental analyses test those mechanisms and statistical models that detect those alternative types of change are used to prepare climate data for testing. By and large, statistical models are used to undertake error testing, whereas the experimental analyses undertake probative testing designed to provide evidence for the hypotheses being tested.
Here, linearity of response is defined by the δT = λδF relationship, where forcing produces a continuous response in temperature that can be masked by climate variability. Even if the λ function increases over time (e.g. Rypdal and Rypdal, 2014; Andrews et al., 2015), the response will be gradual but will accelerate with increasing forcing. This relationship is also used to define the concept of model equilibrium climate sensitivity (ECS), measured as the atmospheric warming caused by a forcing of 2 × CO2 in the atmospheric component of a climate model. The relationship between step-like and trend-like behaviour in climate model output and ECS can be used to test how strongly each responds to radiative forcing. The results will show whether forcing produces gradual or episodic warming over decadal timescales.
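As a minimal numerical sketch of the δT = λδF relationship, the snippet below evaluates the linear response for a few forcing values and the corresponding ECS. The value of λ and the forcing assigned to 2 × CO2 are illustrative assumptions, not values taken from this paper.

```python
def linear_response(forcings, lam):
    """Temperature change dT = lam * dF for each forcing value (W m-2)."""
    return [lam * f for f in forcings]

# Illustrative assumptions only, not values from the paper:
F_2XCO2 = 3.7   # W m-2, a commonly quoted forcing for doubled CO2
LAMBDA = 0.8    # K per (W m-2), hypothetical sensitivity parameter

forcings = [0.0, 1.0, 2.0, F_2XCO2]
warming = linear_response(forcings, LAMBDA)
ecs = LAMBDA * F_2XCO2   # warming at 2 x CO2 under a strictly linear response
```

Under these assumptions the response is strictly proportional to forcing, which is the gradual behaviour that h_trend represents.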

Development of physical mechanisms for probative testing
Application of a theoretical-mechanistic process starts with well-agreed theoretical positions (core theory) and then builds on those theories to explore alternative mechanisms required to support competing hypotheses. The exploration of plausible mechanisms produces probative criteria for severe testing. This paper cannot undertake a full survey of the theory behind anthropogenic global warming, but the trapping of heat by added greenhouse gases, creating an imbalance between the surface and the top of the atmosphere and between the equator and the poles, is widely agreed on as the foundational theory, i.e. radiative transfer theory and global warming resulting from the enhanced greenhouse effect (IPCC, 2013). However, between the time when heat is trapped in the atmosphere and when it is measured as a change in temperature there is a gap in understanding, which has competing explanations. These explanations focus on where that trapped heat is stored within the climate system and how it is subsequently distributed. Because H1 implies a gradual signal and H2 a discontinuous or episodic signal, represented here as step-like change, these pathways will be distinctly different.
For H1, close adherence to a warming trend implies that the atmosphere warms gradually. If so, this must occur via either or both of the following processes:
1. A measurable proportion of radiatively forced anthropogenic warming trapped in the atmosphere is retained in situ, as represented by models of radiative convective transfer (Ramanathan and Coakley, 1978), gradually warming the air mass, especially over land. Such warming would also be expected to produce a trend in lower troposphere satellite temperatures since the air mass warms gradually from the surface.
2. Most of the heat trapped by anthropogenic greenhouse gas forcing is absorbed by the ocean, with the ocean retaining an estimated 93 % of historically trapped heat (Levitus et al., 2012; Roemmich et al., 2015). Models of upwelling diffusion assume a constant release of heat into the atmosphere (Raper et al., 2001, 2002) and the assumption of gradual release follows through into much of the literature. Recent papers discuss the role of decadal variability within the oceans mediating trends in atmospheric warming (England et al., 2014; Watanabe et al., 2014; Dai et al., 2015; Meehl, 2015; Trenberth, 2015; Meehl et al., 2016) through variations in ocean surface temperatures and/or overturning processes.
This combination of processes forms the dominant paradigm, where the anthropogenic warming signal is widely considered largely as forming a monotonic trend (Swanson et al., 2009; Zhou and Tung, 2013; Ji et al., 2014). However, mental (conceptual) models held by individual scientists vary widely (Benestad, 2016). Under a scenario of changing decadal regimes, it is also possible that internally driven step changes could be detected in temperature time series, forming a stepladder, as suggested by Trenberth (2015). However, if H1 were to hold, these would have to be unrelated to forcing.
Non-gradual warming (H2) requires mechanisms such as regime change combining with storage and release processes. On decadal timescales, ocean-atmosphere interaction is the only realistic source for such changes. If warming is mediated by the hydrothermal ocean-atmosphere system, it could be entrained by the nonlinear processes involved in the distribution of energy skywards and polewards from the equator through quasi-oscillatory systems (Ozawa et al., 2003; Lucarini and Ragone, 2011). Lucarini and Ragone (2011) describe the overall process of distribution of heat energy within the climate system as the generation of entropy, where moist static energy is transformed into mechanical energy like a heat engine. This could flip between different states, modulated by Lorenzian strange attractors as described by Palmer (1993). One important distinguishing characteristic of nonlinear behaviour in a changing climate is whether it is internally generated, and thus essentially random, or forced, in which case the response will be related to changing boundary conditions (Lorenz, 1975; Hasselmann, 2002). Distinguishing between these possibilities is the focus of the testing regime: whether gradual or step-like changes provide the better explanation for the response to external forcing.

Development of severe testing
The aim of severe testing is to produce highly probed (evidential) rather than highly probable results (Mayo, 2005). A hypothesis H passes a severe test T with data x if (Mayo and Spanos, 2010)
1. x agrees with H, and
2. with very high probability, test T produces a result that accords less well with H than does x, if H is false or incorrect.
Two sets of data are produced, representing competing statistical hypotheses h_step and h_trend. These are linked to rival hypotheses H1 and H2. Previous statistical testing of alternative structures for warming has been inconclusive. For example, when Seidel and Lanzante (2004) tested trends, steps, segmented trends, and step-and-trend statistical models, no single model stood out. They concluded that detection and attribution studies should consider abrupt changes.
Studies that extract short-term components of climate variability from time series to produce a more trend-like result (Foster and Rahmstorf, 2011; Werner et al., 2015) or decompose temperature time series into separate signal and noise components (Wu et al., 2011; Yao et al., 2016) all implicitly assume H1. Consequently, the exact nature of change on decadal timescales remains an open question (Trenberth, 2015). If warming conforms to a long-term complex trend and is additive (Marvel et al., 2015), such studies will only produce a trend-like output because they are not configured to detect alternative structures. However, because they are framed on H1, these tests do not show that such structures do not exist. Therefore, h_trend has never been severely tested to the point where its alternatives have been eliminated. The usual null hypothesis for h_trend is "no trend has emerged from background variability". Accordingly, the null hypothesis testing of trends is usually carried out assuming H1. Where step changes are detected, they are generally attributed to internal variability. However, non-gradual change on decadal timescales has become part of the "climate wars", being used to challenge global warming theory on the basis that if observed change is not gradual, climate change is either disproved or overstated (e.g. Legates et al., 2015). Evidence of nonlinear change, such as step change, is therefore widely associated with challenges to global warming theory (e.g. see Skeptical Science, 2015). This asymmetry in null hypotheses means that severe testing needs to cover both H1 and H2, testing h_step against h_trend.
The following six tests are used to test the relationship between gradual and step-like change and their responses to external forcing:
Test 1 What patterns of step changes can be detected in temperature observations? Do particular dates and locations line up with known events or processes?
Test 2 Do models forced by historical emissions reproduce the patterns of step changes shown in observations?
Test 3 How do the different components of change (steps, internal trends and shifts) relate to each other and to total warming and ECS?
Test 4 Can step-like change be identified using attribution methods?
Test 5 Do other climate variables also undergo step changes?
Test 6 Are temperature time series more step-like or trend-like?
The first four tests can be considered largely probative, where h_step and h_trend are tested to determine whether H1 or H2 provides the better explanation for the relationship between external forcing and internal variability. The last two focus mainly on error testing to see how well h_step and h_trend explain the climate data. The combination of different tests means that deriving a single probability through an objective process is not possible. The procedure we follow here uses a two-sided test between h_trend and h_step as representatives of H1 and H2 respectively. Paraphrasing Mayo and Spanos (2010) to address the results, with very high probability, Tests 1–6 would produce a result that accords less well with H2 than does H1, if H2 were false or incorrect (and conversely).

The multistep Maronna-Yohai bivariate test
The Maronna-Yohai bivariate test (MYBT, Maronna and Yohai, 1978) is used to detect step changes in temperature data. This test has been widely used to detect inhomogeneities in climate variables (Potter, 1981; Bücher and Dessens, 1991; Kirono and Jones, 2007; Sahin and Cigizoglu, 2010), decadal regime shifts in climate-related data and step changes in a wide range of climatic time series (Buishand, 1984; Vivès and Jones, 2005; Boucharel et al., 2011; Jones, 2012; Jones et al., 2013). One of us (Jones) has been using it for 25 years, both for adjusting inhomogeneous data (Jones, 1995; Kirono and Jones, 2007) and for detecting abrupt changes in climate variables. Surprisingly, the MYBT is rarely included in reviews of change point analysis techniques (Rodionov, 2005; Reeves et al., 2007) despite being on par with or better than other techniques (Vivès and Jones, 2005). For example, it performed similarly to the STARS test in Jones et al. (2013) but has the advantage of not needing tuning and being able to accommodate a reference data set, providing a degree of flexibility that few other tests have. That made it our testing model of choice, especially because all six tests used here compare step changes in time series to a null reference, and Test 4 assesses step changes between correlated variables.
The test, which originally assessed only single change points, was adapted by developing an objective set of rules that detect a minimal and stable configuration of multiple step changes. Previously, this involved a trial-and-error process of constructing a robust set of step changes one at a time. A multistep, rule-based application of the MYBT was developed to carry this out (Ricketts, 2015; see Supplement for details).
The test adapts the formulation of Bücher and Dessens (1991) by testing a single serially independent variate (x_i) against a reference variate (y_i) using a random time series following Vivès and Jones (2005). The important outputs of the test in a time series of length N are (1) the T_i statistic, which is defined for times i < N, (2) the T_i0 value, which is the maximum T_i value, (3) i_0, the time associated with T_i0, (4) shift at that time, and (5) p, the probability of zero shift. Note that i_0 is the last year prior to the change. In this paper, we routinely give the year of change.
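The outputs listed above can be illustrated with a much simpler single-step scan. The sketch below is not the MYBT: it swaps in a pooled two-sample t value at each candidate split as a stand-in for T_i, uses no reference series and omits the probability p. The function name and the example series are hypothetical.

```python
import math

def single_step_scan(x):
    """Return (i0, t_max, shift) for the most prominent single step in x.

    i0 is the index of the last value before the change, following the
    convention in the text. NOTE: the statistic is a pooled two-sample
    t value, a simple stand-in for the MYBT T_i statistic (assumption).
    """
    n = len(x)
    best = (None, 0.0, 0.0)
    for i in range(2, n - 1):            # at least two values on each side
        left, right = x[:i], x[i:]
        m_left, m_right = sum(left) / i, sum(right) / (n - i)
        ss = (sum((v - m_left) ** 2 for v in left)
              + sum((v - m_right) ** 2 for v in right))
        se = math.sqrt(ss / (n - 2) * (1 / i + 1 / (n - i)))
        if se == 0:
            continue                     # perfectly flat segments, skip
        t = (m_right - m_left) / se
        if abs(t) > abs(best[1]):
            best = (i - 1, t, m_right - m_left)
    return best

# Hypothetical anomaly series with one upward step after index 4
series = [0.0, 0.1, -0.1, 0.05, -0.05, 1.0, 1.1, 0.9, 1.05, 0.95]
i0, t_max, shift = single_step_scan(series)
```

Here i0 plays the role of i_0 (the last value before the change) and shift is the difference between the segment means either side of it.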
A single time series analysis consists of a screening pass, followed by a convergent pass. In both passes, we apply a resampling test to each segment being examined, where the test is repeated 100 times, resampling the random number reference series. The screening pass starts from the most significant shift in a time series, determined using the resampling test; if p < 0.01, the series is divided into shorter time series either side of the step and these are tested until all steps have been detected. This is a recursive procedure whereby the first steps detected may be influenced by as-yet-unlocated steps. The convergent pass then serially refines these segments to provide a causal sequence. The convergent process is repeated until a stable set of step changes is produced.
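The recursive structure of the screening pass can be sketched as follows. This is an illustration under stated assumptions, not the rule set of Ricketts (2015): the detector is a between-segment sum-of-squares scan rather than the MYBT, significance comes from shuffling the data rather than resampling a random reference series, and the convergent pass is not shown. All function names and the example series are hypothetical.

```python
import random
from itertools import accumulate

def best_split(x):
    """Return (i, score) for the split x[:i] | x[i:] with the largest
    between-segment sum of squares (stand-in statistic, not the MYBT)."""
    n = len(x)
    csum = list(accumulate(x))           # csum[i - 1] == sum(x[:i])
    best_i, best_score = None, -1.0
    for i in range(2, n - 1):            # at least two values on each side
        diff = (csum[-1] - csum[i - 1]) / (n - i) - csum[i - 1] / i
        score = i * (n - i) / n * diff * diff
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score

def resampling_p(x, score, rng, n_resamples=200):
    """Share of shuffled copies of x whose best split scores as high."""
    hits = 0
    for _ in range(n_resamples):
        y = x[:]
        rng.shuffle(y)
        if best_split(y)[1] >= score:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

def screening_pass(x, offset=0, alpha=0.01, rng=None, min_len=6):
    """Recursively split x at significant steps; return change indices."""
    rng = rng or random.Random(0)        # fixed seed for repeatability
    if len(x) < min_len:
        return []
    i, score = best_split(x)
    if i is None or resampling_p(x, score, rng) >= alpha:
        return []
    left = screening_pass(x[:i], offset, alpha, rng, min_len)
    right = screening_pass(x[i:], offset + i, alpha, rng, min_len)
    return left + [offset + i] + right

# Hypothetical series: three flat levels with small deterministic noise
noise = [0.05, -0.03, 0.02, -0.04, 0.01, -0.01] * 2
series = [level + e for level in (0.0, 1.0, 2.0) for e in noise]
steps = screening_pass(series)
```

Each returned index is the first value of a new segment, corresponding to the "year of change" convention rather than the last value before the change.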
The analysis above is run 100 times. This procedure may produce several different but related solutions (sets of change dates); the most common solution is returned as the best estimate. Alternatives often indicate the presence of localised events embedded in larger-scale areally averaged data. Most historical temperature records analysed contain one or two stable configurations for surface temperature and zero or one for satellite temperature. Climate model data may produce a larger number of stable solutions, especially for the higher forcing scenarios.
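Selecting the most common solution from repeated runs amounts to taking the mode over sets of change dates. A minimal sketch, assuming each run returns a list of change years; the function name and the run outcomes are hypothetical:

```python
from collections import Counter

def best_estimate(solutions):
    """Return the most frequent set of change dates and its share of runs."""
    counts = Counter(tuple(sorted(s)) for s in solutions)
    dates, hits = counts.most_common(1)[0]
    return list(dates), hits / len(solutions)

# Hypothetical outcomes of ten repeated analyses (change years)
runs = [[1930, 1979, 1997]] * 7 + [[1937, 1979, 1997]] * 3
dates, share = best_estimate(runs)
```

The share of runs supporting the returned solution gives a rough indication of how stable the configuration is; the less frequent alternatives can be kept for inspection.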
Mean annual data for observations are considered serially independent, and in most cases applied in the paper, the MYBT is reliable. Deseasonalised quarterly and monthly data can be used to locate a shift within 1 year, but are not serially independent and are thus used here in combination with the t test either side of the change date to assess significance. A resampling test that shuffles data either side of a shift will also indicate whether a change point is abrupt or the time series is trend-like. As discussed in Sect. 4.3, 21st century model data are not serially independent under high rates of forcing.
For error testing, we routinely use thresholds of p < 0.01 for the bivariate test (exceptions are noted), and p < 0.01, p < 0.05 and non-significant (NS, p > 0.05) for trend analysis and the t test.

Regional attribution
Regional attribution of step changes (Test 4) uses a technique detailed in Jones (2012). The basic methodology is suitable for continental mid-latitude areas where annual average maximum temperature (T_max) is correlated with total rainfall (P), and minimum temperature (T_min) is correlated with T_max (Power et al., 1998; Nicholls et al., 2004; Karoly and Braganza, 2005). For Central England Temperature, a largely maritime climate, diurnal temperature is assessed against precipitation instead of T_max. The method uses the following steps:
1. Homogeneous regional average data are obtained for T_max, T_min and P.
2. A period of stationary climate is calculated by testing when the relationship between T_min and T_max undergoes a statistically significant step change. The relationship between T_max and P will change at the same time or a later date.
3. Linear regressions are calculated between each pair (T_max/P and T_min/T_max) for the stationary period.
4. Externally forced warming is estimated for the non-stationary period using these regressions.
5. The results are tested for step changes.
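Steps 3 and 4 above can be sketched for the T_max/P pair as follows. The sketch assumes that the externally forced component is taken as the residual between observed T_max and the value predicted from P by the stationary-period regression; that reading, the function names and the synthetic data are assumptions for illustration, and Jones (2012) should be consulted for the actual procedure.

```python
def ols(x, y):
    """Ordinary least squares fit y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def forced_component(p, tmax, split):
    """Residual warming after `split`, using the regression fitted before it."""
    a, b = ols(p[:split], tmax[:split])
    return [t - (a + b * r) for r, t in zip(p[split:], tmax[split:])]

# Synthetic example: T_max depends linearly on rainfall, and a 0.5 degree
# offset is added in the (hypothetical) non-stationary period.
split = 5
rain = [400, 500, 600, 700, 450, 550, 650, 480]
tmax = [25 - 0.002 * r + (0.5 if k >= split else 0.0)
        for k, r in enumerate(rain)]
residuals = forced_component(rain, tmax, split)
```

In this synthetic case the residuals recover the added offset; in practice the residual series would then be passed to the step-change test (step 5).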

Observed data
Time series tested here are mean annual global air temperature anomalies from five groups (NCDC, Peterson and Vose, 1997; GISS, Hansen et al., 2010; HadCRU, Morice et al., 2012; BEST, Rohde et al., 2012; C&W, Cowtan and Way, 2014), hemispheric temperatures from three groups (HadCRU, NCDC and GISS) and zonal temperatures from two groups (NCDC and GISS) to see how prevalent step changes are and whether they coincide across different records, and to investigate the relationship between step changes and trends. Lower tropospheric satellite temperatures from two groups (UAH, Christy et al., 2003, 2007; RSS, Mears and Wentz, 2009) are also tested. For the regional data, Australian data were sourced from the Australian Bureau of Meteorology, Texas data from the National Climatic Data Center and central England temperatures from the Met Office Hadley Climate Centre. Tide gauge records were sourced from the Permanent Service for Mean Sea Level and the ocean heat content records from the KNMI Climate Explorer. The specific records used are described in the Supplement.

Metrics
Measurement of change where nonlinear behaviour is present is not an exact process, and there is no established terminology that carries commonly understood technical meanings; thus, we define here a limited number of terms used in the paper. The MYBT measures total change between segments of a time series, ignoring any trend that may be present. We refer to these as steps. Internal trends are calculated between steps and the distance between the end of one trend and the start of the next is referred to as a shift. We call the process of calculating steps then trends the step-and-trend model. Steps, internal trends and shifts all provide data for severe testing. Shifts and internal trends are not strictly additive; summed over a number of steps, they can add up to more or less than the change in temperature measured between the beginning and end of a series. These differences are largest in records containing reversals and negative trends.
The main phenomena analysed are (Fig. 1)
steps, which are the measurements of the whole change across a discontinuity produced by the bivariate test, assuming stationarity, i.e. no trend either side of the step.
internal trends, which are the measurements of trends between steps using ordinary least squares trend analysis.
shifts, which are the measurements of the internal step between the end of a preceding trend and the beginning of the next trend.
trend / step ratio, which is the ratio between total internal trends and total steps in a multistep time series. Because shifts and internal trends are not additive, this measure gives a slight preference to trends over shifts as a ratio.
trend / shift ratio, which is the ratio between total internal trends and internal shifts (steps minus trends).
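The metrics defined above can be sketched for a series with known change points. The sketch assumes that a step is the difference between adjacent segment means, an internal trend is the change along the least-squares line fitted within a segment, and a shift is the gap between the end of one fitted line and the start of the next. Function names and the example series are hypothetical, and the authors' own calculation may differ in detail.

```python
def fit_line(y):
    """OLS line through y against 0..n-1; returns fitted (start, end)."""
    n = len(y)
    mt, my = (n - 1) / 2, sum(y) / n
    stt = sum((t - mt) ** 2 for t in range(n))
    slope = sum((t - mt) * (v - my) for t, v in enumerate(y)) / stt
    start = my - slope * mt
    return start, start + slope * (n - 1)

def step_and_trend(series, change_points):
    """Steps, internal trends, shifts and their ratios for given change points.

    change_points holds the index of the first value of each new segment;
    every segment needs at least two values.
    """
    bounds = [0, *change_points, len(series)]
    segments = [series[a:b] for a, b in zip(bounds, bounds[1:])]
    means = [sum(s) / len(s) for s in segments]
    lines = [fit_line(s) for s in segments]
    steps = [m2 - m1 for m1, m2 in zip(means, means[1:])]
    trends = [end - start for start, end in lines]
    shifts = [nxt[0] - prev[1] for prev, nxt in zip(lines, lines[1:])]
    return {
        "steps": steps,
        "internal_trends": trends,
        "shifts": shifts,
        "trend_step_ratio": sum(trends) / sum(steps),
        "trend_shift_ratio": sum(trends) / sum(shifts),
    }

# Hypothetical series: two gently rising segments separated by a jump
series = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]
metrics = step_and_trend(series, [4])
```

In this example the step (1.0) differs from the shift (0.7) because part of the difference between segment means is carried by the internal trends, which illustrates why the two ratios are reported separately.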
Results – observations

Global and zonal temperatures
In the first half of the 20th century, three global records show positive steps in 1920/21 and in 1937, and two in 1930 (Fig. 2). The GISS record also shows a downward step in 1902, coinciding with the Northern Hemisphere (NH) ocean, the tropics and the Southern Hemisphere. The records fall into two groups based on early 20th century differences: GISS, BEST and C&W in one group and HadCRU and NCDC in the other. The anomaly averaged from all five records shows upward step changes in 1930, 1979 and 1997, coinciding with the HadCRU and NCDC records. Differences emerge between ocean and land records. The global HadSST (HadCRU) records shifts in 1937, 1979 and 1997, whereas the ERSST (NCDC) records shifts in 1890, 1930, 1977, 1987 and 1997. Global land records from both CRU and NCDC shifted in 1920/21, 1980 and 1997. Northern [...] in 1902 and upward steps in 1933 or 1937, 1968 or 1970, 1977/1978 or 1984, and 1997 or 1998. SH high-latitude data are not very reliable, being absent for NCDC 60–90° S. The GISS 64–90° S average anomaly steps downward in 1912 and upward in 1955.
Figure 3 shows the internal trends and their error significance for the five global mean temperature records. Steps and trends are consistent for the last two periods 1979/80 to 1996 and 1997 to 2013/14 but diverge in the middle of the record due to differences in the timing and magnitude of steps and accompanying internal trends. Data quality may be an issue in the earlier parts of the record. For example, the version of GISS data used here shows five steps in 1902, 1920, 1937, 1980 and 1997, whereas a version from before 2013 stabilised at steps in 1930, 1979 and 1997, consistent with the average anomaly of all five records. This indicates that the timing and magnitude of steps in the early 20th century can be influenced by adjustments made to improve data quality. However, all global step change dates coincide with regional steps, showing that while the relative importance of dates associated with step changes may be different, the dates themselves are quite stable. This gives us added confidence that we are not detecting false positives. Internal trends are mainly p > 0.05 in the early record, the exception being the GISS 1920–37 period. The 1979/80 to 1996 trend is at p < 0.01 in two records (HadCRU and NCDC) and p < 0.05 in the other three records. The NH step change in 1987 seen in all three records tested strongly influences this trend, which is examined further in the next section. The post-1997 period is p > 0.05 in two records and p < 0.05 in three records.

Step / trend and shift / trend ratios
There is no objective way to partition shifts and internal trends. Giving the first preference to internal trends in calculating ratios gives a slight preference to gradual change in contrast to episodic change, preferring the methodological status quo. Expressed as a ratio between internal trends and steps, four global records range between 0.32 and 0.38, with the GISS record yielding a ratio of 0.62 due to the cool reversal in the early 20th century. For trends and shifts, the ratio ranges between 0.44 and 0.58, with the GISS record being an outlier at 1.38.
Test 2 aims to determine whether trends or steps are more prominent at the regional level than at the global scale. The global trend / shift ratio for the HadCRU record, for example, is 0.55 (0.30 / 0.55 °C), 0.31 for the NH, 0.28 for the SH and 0.33 for the tropics (30° N–30° S); this is close to the average of the two hemispheres. When divided into land and ocean, the HadCRU and NCDC records show 0.90 and 1.15 for land and 0.16 and 0.26 for ocean respectively, which shows the oceans to be more step-like and the land to have roughly equal measure. The SH ocean is very step-like (0.16) and SH land is less so (0.39). The mid-latitudes are also very step-like, as is the tropical ocean. High ratios (> 1) often involve a temporary cool reversal around the early 20th century.
This also holds for single steps on a regional basis. In 1997–98 the global shift was 0.16 ± 0.01 °C, a ratio of about 50 % compared to the step change of 0.32 °C. For the NH, this ratio varied between 57 and 68 % for three land and three ocean data sets. For the NH mid-latitudes, land and ocean from two data sets (NCDC 30–60° N, GISS 24–44° N) show a step / shift ratio that measures 0.43 / 0.44 °C, close to a 1 : 1 ratio, which indicates no trend.
The more step-like character of both the oceans and the mid-latitudes is consistent with those areas being the loci of change in terms of decadal regimes and nonlinear equator-to-pole transport. This is inconsistent with the hypothesis of gradual warming. Varying shift dates and rates of change at regional scales contribute to the global record being more trend-like than individual regions.

Satellite-era records
A comparison of surface and lower tropospheric satellite temperatures stratifies records according to altitude and source of measurement, which is also consistent with Test 2. Satellite records of annual and seasonal lower-troposphere anomalies sourced from the RSS and UAH records beginning in December 1978 were analysed for step changes. Mean annual global and zonal temperatures show 1995 and 1998 as the two main step dates, with 1995 being more prominent at the global scale (Table 1). Seasonal temperatures were assessed to distinguish between these dates. For individual seasons, steps in 1995 are dominated by the NH June, July and August (JJA) and September, October and November (SON) periods, especially on land. This can be traced back to warm El Niño conditions in 1994/95. For the quarterly time series (4 seasons × 36 years), the JJA and SON quarters of 1997 dominate the UAH global record; this is less so for the RSS record.
Quarterly anomalies for the RSS and UAH satellites and HadCRU and GISS surface mean global temperatures were compared to provide more precision on dates of step changes. For the bivariate test, quarterly time series are affected by autocorrelation due to the El Niño-Southern Oscillation (ENSO), making results robust for timing but not for the probability of false positive (Type I) errors. The Student's t test (two sided, unequal variance), which is insensitive to serial correlation, was used as a backup.
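As a minimal sketch of the backup test, the unequal-variance (Welch) form of the t statistic can be computed for the quarters either side of a candidate step date. The anomaly values below are hypothetical, not taken from any of the records, and the function name is illustrative. Converting the statistic to a two-sided probability needs the Student's t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(before, after):
    """Return (t, df) for a two-sample t test with unequal variances."""
    n1, n2 = len(before), len(after)
    v1 = variance(before) / n1  # squared standard error, first segment
    v2 = variance(after) / n2   # squared standard error, second segment
    t = (mean(after) - mean(before)) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical quarterly anomalies (deg C) before and after a candidate date.
before = [0.10, 0.05, 0.12, 0.08, 0.02, 0.11, 0.07, 0.09]
after = [0.35, 0.42, 0.30, 0.38, 0.41, 0.33, 0.36, 0.40]

t_stat, dof = welch_t(before, after)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```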
that available heat energy is not a limiting factor for abrupt changes.
In Fig. 4, both surface and satellite temperature records are very step-like. The trend / shift ratios for the HadCRU and GISS records are 0.19 and 0.27 respectively and for the RSS and UAH records they are −0.55 and −0.40 respectively, showing the effect of the negative internal trends. Shifts are consequently higher than steps in the satellite data. These are clearly due to the presence of the ENSO cycle within the data, where La Niña events precede shifts and El Niño events accompany them. If they are not assumed to be a "contaminating influence" of noise affecting the signal, there is no clear way to allow for them; therefore, the data are analysed and presented as they are. As we discuss later in the paper, it appears that El Niño has an active role in step-like warming.

Regional attribution
This section on regional attribution covers the issue of stationarity and the character of change over regional areas, and it addresses Test 4. Regional attribution of step changes in annual temperature has previously been carried out for southeastern Australia (SEA, Jones, 2012) and is repeated here for Texas and central England. The methodology is suitable for continental mid-latitude areas where annual average minimum temperature (Tmin) is correlated with maximum temperature (Tmin / Tmax), and Tmax is correlated with total annual rainfall (Tmax / P) (Power et al., 1998; Nicholls et al., 2004; Karoly and Braganza, 2005). For maritime areas such as central England, diurnal temperature range (DTR) is used (DTR / P) instead of Tmax / P. The method uses the bivariate test to test the dependent variable against the reference variable. A shift in the dependent variable denotes a regime change.
SEA climate was stationary until 1967 when a step change increased Tmin by 0.6 °C with respect to Tmax (Jones, 2012). Six independent climate model simulations for the same region become non-stationary by the same means between 1964 and 2003, showing steps of 0.4 to 0.7 °C (Jones, 2012). Texas becomes non-stationary in 1990, with an increase in Tmin / Tmax of 0.5 °C. Tmax increased by 0.8 °C against P in 1998. For central England, Tmin increased against DTR by 0.3 °C and Tmax against P by 0.9 °C in 1989. Tmax also increased against P in 1911 by 0.5 °C (Table 2).
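The idea of testing a dependent variable against a reference variable can be illustrated with a deliberately simplified stand-in. The sketch below is not the bivariate test used in this paper: it differences the two series and searches for the single split date that maximises an unequal-variance t statistic. All data are synthetic and every name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def best_single_shift(years, dependent, reference, min_seg=5):
    """Return (first year of new regime, t statistic, shift size).

    Simplified stand-in: scans the dependent-minus-reference series for the
    split that maximises |t|. It is NOT the bivariate test itself.
    """
    diff = [d - r for d, r in zip(dependent, reference)]
    best = None
    for i in range(min_seg, len(diff) - min_seg + 1):
        left, right = diff[:i], diff[i:]
        se = sqrt(variance(left) / len(left) + variance(right) / len(right))
        if se == 0:
            continue
        shift = mean(right) - mean(left)
        t = shift / se
        if best is None or abs(t) > abs(best[1]):
            best = (years[i], t, shift)
    return best


# Synthetic example: Tmin tracks Tmax until 1970, then rises 0.6 deg C against it.
years = list(range(1950, 1990))
tmax = [20.0 + 0.3 * ((k * 7) % 5 - 2) for k in range(40)]
tmin = [
    ref - 10.0 + 0.1 * ((k * 3) % 4 - 1.5) + (0.6 if yr >= 1970 else 0.0)
    for k, (yr, ref) in enumerate(zip(years, tmax))
]

year, t_stat, shift = best_single_shift(years, tmin, tmax)
print(year, round(t_stat, 1), round(shift, 2))
```

Differencing removes variability shared with the reference variable, so only the change in the relationship between the two remains to be detected.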
The stationary period is used to establish regression relationships that calculate Tmax and Tmin from P and Tmax respectively. These regressions are used to estimate how Tmax and Tmin would have evolved during the non-stationary period. The residual is then attributed to anthropogenic regional warming and is tested using the bivariate test. Here, the residuals for Tmax and Tmin are averaged to estimate externally forced warming (TavARW).
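The regression step can be sketched as below, under stated assumptions: ordinary least squares fitted on the stationary years only, observed Tmax used as the predictor for Tmin in the later years (an assumption of this sketch), and synthetic values in place of any observed record. Function and variable names are hypothetical.

```python
def ols(x, y):
    """Least-squares slope and intercept of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx


def attributed_residual(predictor, target, n_stationary):
    """Observed minus regression-estimated target for the non-stationary years."""
    slope, intercept = ols(predictor[:n_stationary], target[:n_stationary])
    return [
        t - (intercept + slope * p)
        for p, t in zip(predictor[n_stationary:], target[n_stationary:])
    ]


# Synthetic record: 8 stationary years followed by 4 non-stationary years.
rain = [600.0, 450.0, 700.0, 520.0, 640.0, 480.0, 560.0, 610.0,
        500.0, 650.0, 470.0, 590.0]
n_stat = 8
# Tmax depends on rainfall; 0.5 deg C is added in the non-stationary years.
tmax = [25.0 - 0.004 * p + (0.5 if i >= n_stat else 0.0)
        for i, p in enumerate(rain)]
# Tmin depends on Tmax; 0.3 deg C is added in the non-stationary years.
tmin = [0.5 * t - 2.0 + (0.3 if i >= n_stat else 0.0)
        for i, t in enumerate(tmax)]

tmax_arw = attributed_residual(rain, tmax, n_stat)
tmin_arw = attributed_residual(tmax, tmin, n_stat)
tav_arw = [(a + b) / 2 for a, b in zip(tmax_arw, tmin_arw)]
print([round(v, 2) for v in tav_arw])
```

The resulting residual series would then be passed to the change-point test rather than interpreted directly.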
In SEA, TavARW shifted upward by 0.5 °C in 1973 (Fig. 5). Similar patterns were found for 11 climate model simulations for SEA, undergoing a series of step changes until 2100 (Jones, 2012). For Texas, TavARW shifted by 0.8 ...
None of the internal trends in Fig. 5 achieve p < 0.05. The trend / shift ratios for Tav (not shown in Fig. 5) and attributed to external forcing (TavARW) are 0.23 and 0.88 respectively for SEA, 0.45 and −0.53 for Texas, and −0.01 and 0.33 for central England (1878-2014). The lower ratio in SEA TavARW is because reduced rainfall post-1997 produces lower attributed TmaxARW, but if that rainfall reduction is also a response to external forcing (Timbal et al., 2010), TmaxARW will be un- ... (Franzke, 2012; Capparelli et al., 2013).
These results show that the transition from stationarity to non-stationarity is abrupt for regional temperature at three locations on three continents and for six independent climate model simulations for one of those locations (SE Australia). The close association of the observed transition in SEA in 1968 with the widespread shift date over the SH mid-latitudes indicates that the onset of the warming signal in these broader regions is abrupt (Jones, 2012). The changes in central England in 1989 and Texas in 1990 may also be associated with a widespread step change in the NH mid-latitudes in 1987/88 (Overland et al., 2008; Boucharel et al., 2009; Lo and Hsu, 2010; Reid and Beaugrand, 2012; North et al., 2013; Menberg et al., 2014; Reid et al., 2016).
The low trend / shift ratios shown for ocean and some zonal areas also occur over the three land areas analysed. This suggests that shifts may be more distinct at regional scales, integrating into a more trend-like global average. This is the case for sea level rise data, where individual tide gauge records exhibit stepladder-like behaviour at individual locations and global mean sea level follows a curve (Jones et al., 2013).

Other climate variables
If climate changes in a stepwise manner, it would be expected that other variables would show signs of this (Test 5). Instances of step changes in the literature are widespread, and are mentioned elsewhere in this paper (e.g. Table 6). For rainfall, notable examples are a step change in the Sahel in 1970 (L'Hôte et al., 2002; Mahé and Paturel, 2009), south-west Western Australia (WA) in the late 1960s and early 1970s (Li et al., 2005; Power et al., 2005; Hope et al., 2010) and the western US in the 1930s (Narisma et al., 2007). Similar changes have been detected in streamflow records world-wide, showing that regime changes in moisture have been a long-standing aspect of climate variability (Whetton et al., 1990). A few more recent changes have been directly attributed to increasing greenhouse gases, although south-west WA is an exception (Cai and Cowan, 2006; Timbal et al., 2006; Delworth and Zeng, 2014), with large-scale shifts in synoptic types accompanying a rapid decrease in rainfall (Hope et al., 2006). The bivariate test identifies a step change in south-west WA winter rainfall in 1969 (shown in Fig. 6a), with an upward step in summer rainfall in northern Australia 1 year later.
Ocean heat content of the upper ocean also shows step changes occurring in 1977, 1996 and 2003 (Fig. 6b). Changes in long-run tide gauge records also show a stepladder-like process of sea level rise, with the San Francisco record, quality controlled and dating back to 1855, being a good example; it shows step changes in 1866, 1935, 1957 and 1982 (Fig. 6c). Step changes in the Fremantle tide gauge record, one of the longest in the Southern Hemisphere, show that most of the decline in the average return intervals of extreme events noted by Church et al. (2006) before and after 1950 occurred in two events (Fig. 6d) in the late 1940s and the late 1990s. This variation in rise was noted by White et al. (2014). None of the internal trends in Fig. 6a-d ...
These sections report on the multistep analysis of 102 simulations of global mean surface warming from the CMIP3 archive and 295 simulations from the CMIP5 archive. Further information on the archives can be found in the Supplement. The relevant test for models is to identify phenomena similar to observations. Here we describe analyses of the timing of change points and their relationship with known regime changes and the measurement of the relative contributions of steps, shifts and internal trends in the temperature record (covering Tests 1, 2 and 3).
Step changes (p < 0.01) identified by the bivariate test.
Starting with observations, the percentages of annual steps (p < 0.01) in the 45 time series of mean annual surface temperature from Fig. 2 are shown in Fig. 7a. Two-thirds of all historical records shift in 1997 and one-third shifts in 1980 and 1937. Lesser peaks of 10-15 % occur in 1920, 1921, 1926, 1930, 1968/69, 1987 and 1988. The three shifts in 1979/80, 1987/88 and 1997/98 are the main contributors to the higher rate of trend noted from around 1970. Because these peaks measure how strongly steps occur globally and regionally, percentages denote how pervasive a step is. The models only register a significant step at the global scale, meaning they will only pick up the most extensive step changes. Any steps occurring below the assigned level of probability (p < 0.01) will show up as part of a trend, as is the case for 1987/88 in the observations.
Figure 7b shows step changes from the CMIP3-combined SRES A1B and A2 simulations for the 20th and 21st centuries: 84 are independent and 18 are ensemble averages. The CMIP3 models were driven by observed forcing, including sulfate aerosols, until 1999-2000 and not all contain natural forcings (see Table S2 in the Supplement). They do a reasonable job of capturing the three main post-1950 peaks. Figure 7c-f show the CMIP5 RCP2.6, RCP4.5, RCP6.0 and RCP8.5 ensemble results respectively. The models were driven by observed forcing, including natural volcanic and solar forcing, until 2005. Visually, the CMIP5 results illustrate the observed peaks and troughs better than CMIP3. This is presumably due to the improved representation of forcing factors and physical processes and to improved model resolution (Table S3).
The RCP4.5 result (Fig. 7d), with 107 independent members, is the largest multi-model ensemble (MME). The three major post-1950 step changes are reproduced as follows: 55 % (58 of 107) of the runs undergo a step change in 1996-98 (17 % step in 1996, 16 % in 1997 and 22 % in 1998), 40 % of the runs peak in 1976-78, just missing the observed peak in 1979/80, and 19 % peak in 1986-88. In the mid-1970s, the models may have picked up the observed regime shift in 1976/77 in the Pacific Ocean (Ebbesmeyer et al., 1991; Miller et al., 1994; Mantua et al., 1997; Hare and Mantua, 2000) as a contemporaneous increase in warming. With weak El Niños affecting observations during 1977-1980 (Wolter and Timlin, 2011), this step change may have been delayed in the observed temperature record until 1979/80.
Of the pre-1950 peaks, the models peak around 1916, rather than 1920, and 1936/37 forms a minor peak, less prominent than in the observations. The volcanic eruptions of Krakatoa (1883) and Mount Agung (1963) both feature in the model simulations but less so in the observations. The mid-20th century period of little change is also reasonably well reproduced.
Correlations over the full period 1880-2005 between observations and the CMIP3 and CMIP5 models are 0.32 and 0.34 respectively (p < 0.01). For the period 1950-2005, the correlations rise to 0.45 and 0.40 respectively. If specific events from 1963/64, 1968-70, 1976/77, 1979/80, 1987/88 and 1996-98 are grouped, and all other years are analysed individually, then the correlation increases to 0.78 for both CMIP3 and CMIP5 records (note that this treats the simulated and observed peaks in the 1970s separately). We consider this a reasonable test because all these dates have been linked to regime changes or break points in temperature in the literature. Finessing the exact years involved around these events makes little difference to the result; thus, the correlation is robust.
Although collectively the model ensembles reproduce the observed peaks, single models do not fare as well. We experimented with a skill score that matched steps between models and observations, but the resulting scores did not correlate with any other factor. The only event reproduced widely by the models was the 1996-98 step change, peaking in 1997, when 58 of the MME (55 %) underwent a step change, although 40 % of the MME produces a step in 1976-78.
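The effect of grouping event years before correlating can be sketched as follows. The percentages are hypothetical, and summing the frequencies within an event is an assumption of this sketch, since the aggregation rule is not spelled out above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def group_events(freq_by_year, events):
    """Sum step frequencies over each multi-year event; keep other years as is."""
    grouped, used = [], set()
    for year in sorted(freq_by_year):
        if year in used:
            continue
        event = next((e for e in events if year in e), (year,))
        grouped.append(sum(freq_by_year[y] for y in event))
        used.update(event)
    return grouped


# Hypothetical percentage of series stepping in each year.
obs = {1994: 0, 1995: 2, 1996: 0, 1997: 60, 1998: 5, 1999: 0, 2000: 3}
mod = {1994: 1, 1995: 0, 1996: 17, 1997: 16, 1998: 22, 1999: 2, 2000: 0}
events = [(1996, 1997, 1998)]

r_annual = pearson(list(obs.values()), list(mod.values()))
r_grouped = pearson(group_events(obs, events), group_events(mod, events))
print(round(r_annual, 2), round(r_grouped, 2))
```

Grouping rewards models that place a step within the event window even when the exact year differs, which is why the grouped correlation is higher in this example.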

Relationship between steps and trends over time
Here, we report on the relationships between steps, shifts, and trends; the magnitude of warming; and ECS to estimate the proportion of signal in each warming component, addressing Test 3. Total warming over time can be represented by straightforward differencing, by the change measured from a simple trend, or by the sum of various components, such as the sum of steps or the sum of shifts and trends. All come up with slightly different answers but describe a process that over many decades largely conforms to a trend.
Warming components measured here are steps, the internal trends between steps and the shifts from one trend to the next. Counting shifts as the remainder between internal trends favours trends over shifts (by about 5 % in the hindcast period). When each is contrasted with an indepen- ...
For the hindcasts (1861-2005), total warming (the 2000-05 average minus the 1861-99 average) is positively correlated with total steps (0.93, p < 0.01). Their means are 0.97 and 0.94 °C. The correlation between total warming and internal trends is 0.36 (p < 0.01) and is 0.58 between total warming and shifts (p < 0.01). Shifts therefore explain 2.5 times the variance explained by internal trends in estimating total warming (Fig. 8a). A simple linear trend measured over the entire period has the same correlation with steps (0.93, p < 0.01) but averages 0.76 °C, thus underestimating total warming by 0.18 °C. Total warming, total steps, total shifts and total internal trends correlate poorly with ECS (−0.01, −0.01, 0.07 and −0.09, all NS; Table 4, Fig. 8b).
The ratio of total internal trends to total steps slightly favours shifts (mean 0.44), ranging between −0.09 and 1.22. A low ratio means that trends either cancel each other out or are negligible. A high ratio usually indicates that the time series contains one or more negative shifts and/or a number of positive trends. Observations fit comfortably within this distribution, with ratios of 0.32 to 0.38, except for the GISS time series, which has a ratio of 0.62 because of a downward shift and upward trend in the early part of the record (Fig. 8c). The MME ratios are slightly negative with respect to total warming (−0.14, NS), suggesting that the mix of shifts and trends is largely unrelated to the amount of hindcast warming.
For the historical period, total warming and its various components (steps, shifts or trends) are unrelated to ECS. The relationship between total shifts and total internal trends is negative (−0.47, p < 0.01), which is to be expected, but the lack of a relationship between the shift / trend ratios and warming or ECS suggests that this uncertainty is stochastic.
For the projection period, total warming over 2006-95 is based on the difference between 5-year averages centred on 2006 and 2095. Total warming averages 1.55 °C, total steps average 1.57 °C and they are highly correlated (0.98, p < 0.01). The correlations of shifts and internal trends with total warming are 0.70 and 0.74 respectively, with trends having a slightly higher correlation (Fig. 8d). However, correlations between ECS and total steps, shifts and trends are 0.81, 0.72 and 0.43 respectively (all p < 0.01, Fig. 8e). This shows that the time series are becoming more trend-like at higher rates of forcing, when compared to the hindcast period. Shifts have 2.9 times more explanatory power than trends with respect to ECS, but 0.9 times the explanatory power with respect to total warming over 2006-2095. We take this to mean that shifts (steps minus internal trends) carry most of the signal and that trends are more random since they are affected by short-term (interannual) stochastic behaviour. Some of the signal embedded in trends could also be due to shifts occurring at regional scales, which are too small to register statistically as steps at the global scale.
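One way to read "explanatory power" is as the ratio of variance explained, that is, the ratio of squared correlation coefficients; that reading is an assumption of this sketch. Using the rounded correlations quoted in the text, the ratios come out close to, though not identical with, the reported 2.5, 2.9 and 0.9, as would be expected if the published figures were computed from unrounded values.

```python
def variance_explained_ratio(r_a, r_b):
    """Ratio of variance explained (r squared) by predictor a relative to b."""
    return (r_a ** 2) / (r_b ** 2)


# Hindcast period: shifts (0.58) vs internal trends (0.36) against total warming.
hindcast_warming = variance_explained_ratio(0.58, 0.36)

# Projection period: shifts (0.72) vs trends (0.43) against ECS.
projection_ecs = variance_explained_ratio(0.72, 0.43)

# Projection period: shifts (0.70) vs trends (0.74) against total warming.
projection_warming = variance_explained_ratio(0.70, 0.74)

for label, value in [
    ("hindcast, total warming", hindcast_warming),
    ("projection, ECS", projection_ecs),
    ("projection, total warming", projection_warming),
]:
    print(f"{label}: {value:.2f}")
```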
The ratio of trends to steps is 0.51, ranging from 0.14 to 0.88. The ratio of trends to shifts favours trend (1.22) but has a large range (0.15 to 3.25). The correlations of both ratios with warming are very low (0.07 and 0.03 respectively, NS). This paradox, where there is no correlation with the amount of warming but there is with ECS, when both ECS and warming are correlated, can be examined by plotting the different modelling groups according to the relationship between shifts and trends. Individual models plot along linear pathways, as was the case for the hindcast ensemble (Fig. 8f). The high-sensitivity models plot towards the upper right and lower sensitivity models plot towards the lower left. The trend / step ratios for these individual groups vary widely; the CSIRO eight-model ensemble has ratios from 0.25 to 0.56 and the GISS-E2-R 17-member ensemble ranges from 0.17 to 0.72. The potential for the same model to produce very different shift / trend ratios shows high stochastic uncertainty, probably generated by ocean-atmosphere interactions. The timing of these interactions appears to be largely unrelated to climate sensitivity, although the warming response to steps when they do occur is related to sensitivity. Interestingly, the GISS models form two groups, the main difference being the ocean configuration (see Schmidt et al., 2014a), where the Russell ocean model produces more step-like outcomes and the HYCOM ocean model produces more trend-like outcomes.
For each individual decade from 1876-1885 to 2086-2095, correlations were performed between step size and ECS (Table 3). The late 19th century produces downward steps in response to the Krakatoa eruption in 1883 and is negatively correlated with ECS. Positive steps dominate from 1886 through to 1945 and are positively correlated at levels of low or no significance. The period 1946 to 1965 is negatively correlated with ECS; in 1956-65, corresponding with the 1963 Mount Agung eruption, downward steps result in a negative correlation of −0.52 (p < 0.05). Correlations between ECS and step size become positive after 1965, being 0.41 for 1976-85 and 0.49 for 1986-95 (both p < 0.01). For the decade 1996-2005, 101 of the 107 members of the MME underwent an upward step, but the correlation with ECS is only 0.19 (NS). This low correlation may partly be due to a rebound from the negative forcing of the 1991 Mount Pinatubo eruption in the models, which has been overestimated by about one-third (Schmidt et al., 2014b). Correlations for the forcing period (2006-2095) rose to 0.68 in 2006-15 and vary between 0.57 and 0.82 for subsequent decades to 2095.
The lack of predictability in the hindcasts is a result of negative aerosol forcing due to volcanic eruptions and anthropogenic sources occurring after 1950. The more-sensitive models produce strong positive and negative responses depending on the direction of forcing, whereas in the less-sensitive models this effect is reduced. This effect cancels out any consistent relationship between ECS and step size over the historical period. The implication of this finding is that the magnitude of 20th century warming in the models has little predictive skill and is not a reliable guide to potential future risk.
The hindcast results are also uncorrelated with the 21st century projections. Total warming (1861-2005) is negatively correlated with 21st century warming (2006-95, −0.25, p ∼ 0.01) and is uncorrelated with respect to ECS (−0.01). Total steps from the hindcast and forecast periods show similar negative correlations. Internal trends 1861-2005 are also uncorrelated with future total warming, steps or trends. This strongly indicates that 20th century warming may not be a good guide to future warming, if observations are being affected in a similar way.
A final analysis looks at the explanatory power of different change models with respect to ECS over time. Linear and quadratic trends, steps and warming to date are calculated for successive decades for each ensemble member and the results correlated with ECS. Both trends and warming difference respond to negative forcing in the first part of the record.
Step changes are less volatile, remaining close to zero until increasing from 1995 and remaining higher than the other models until the end of the century (Fig. 9a). The standard error measured from total accrued warming was also least out of the three statistical models. Although it would be possible to derive a closer fit for some of those models with a greater number of factors, step changes clearly carry the greatest signal with respect to ECS over time. The analysis repeated from 1965 produces a similar result (Fig. 9b).
This result is further evidence that step changes carry the signal. Warming to date assesses any warming irrespective of its cause, whereas if step changes are part of a direct response to forcing, they would be a better predictor. This is the case for climate models, and may therefore apply to observations as well. The advantage of using warming to date as a measure is that it has roughly a decade's advantage over statistical tests, which require hindsight. Therefore, unless the physical mechanism(s) for steps become known, both have roughly equivalent predictive skill at the present time.

21st century forcing profiles
If increased forcing raises the rate of entropy production, we would expect to see step-like behaviour becoming more trend-like over time. Such behaviour would involve either an increase in the frequency and distribution of regional step changes that would integrate to become more trend-like at the global scale, or we would see an increase in the rate of diffuse warming, producing widespread trend-like behaviour.
If either is the case, then simulations for the four different emission pathways, RCP2.6, 4.5, 6.0 and 8.5, should show this. Figure 7c-f show the percentage of step changes in any given year for the multi-model ensemble for each of these pathways. For RCP2.6, peaks occur until about 2050, after which the ensemble stabilises. Some models step downward, the earliest of which occurs in 2051. Individual members stabilise between 2018 and 2092, with 48 of the final shifts being positive and 13 negative. This timing is weakly correlated with ECS (0.18, NS). ECS is uncorrelated with the size of the final shift or with the gradient of the following trend. The RCP4.5 ensemble produces frequent steps that peak around 2025 and decline towards the end of the century. RCP6.0 produces a fairly constant rate of steps and RCP8.5 produces sustained steps throughout the century, peaking in the 2080s at a higher rate than 1996-98. This evolution shows a stepladder-like process in the 20th century that changes into an elevator-like process in the 21st century, becoming more trend-like with increasing forcing. Depending on the subsequent rate of forcing, trend-like processes can either recede back to a step-like process or even stabilise. The HadGEM2-ES single model ensemble is used to illustrate this (Fig. 10a).
This ensemble shares the same historical forcing until 2005. It warms by less than observations until 2010, with a reversal in 1964-1980; then it warms substantially in a series of steps over the next few decades. It undergoes a step change of 0.37 °C and shift of 0.18 °C in 1998, 1 year after the observed shift. The next steps occur in 2012, 2013, 2014 and 2015 in the four simulations, ranging from 0.40 to 0.49 °C in absolute terms and 0.19 to 0.27 °C as the shift from the pre-step trend to the post-step trend. The first half of the 21st century shows the influence of decadal variability on mediating step changes. In 2021, the RCP2.6 simulation undergoes a step change and is higher than the others for most of that decade. The RCP6.0 simulation is lower than the others from 2025 to 2045 before accelerating under a sustained step-and-trend process. The relative proportion of internal trends to total warming under the four scenarios is 0.34, 0.60, 0.57 and 0.79 for warming of 1.9, 2.9, 3.7 and 5.3 °C respectively. The RCP4.5 simulation has a higher trend ratio than RCP6.0, showing the stochastic uncertainty inherent in the simulations.
Like most statistical tests that detect change points, the bivariate test is considerably weakened under autocorrelated data, where its timing is fairly robust but p(H0) becomes increasingly sensitive. Such autocorrelations may be caused by simple trends, with lag-1 or longer lag processes influencing the complex nature of warming. Removing these without assuming an underlying process is difficult. Thus, one way of assessing its influence is to pass a moving window through a time series. If the data are step-like and largely free of autocorrelation, a distinct step will produce a line of horizontal Ti0 statistics on a single date as it passes through the window. If there are no steps within a window period and autocorrelation is low, background Ti0 values will return to low values (single digits). With autocorrelation, background Ti0 values remain above the p < 0.01 threshold and form a "cloud", rather than steps producing horizontal lines.
In Fig. 10b-e, successive horizontal lines extending right from low Ti0 values indicate stepladder-like behaviour in the 20th century. Horizontal lines that stay on the right without returning to low Ti0 values indicate both step-like and trending behaviour. A cloud to the far right, as in Fig. 10e, shows a trend-dominated process. Summarising 21st century behaviour under increasing emissions, RCP2.6 shows a return to step-like changes, stabilising around 2050; RCP4.5 shows a return to step-like change late century; RCP6.0 shows increasing trend-like behaviour over the century and RCP8.5 shows a consistent trend until the end of the century, with few steps.
An indication of change at the regional scale and how it may relate to global change is illustrated by using selected CMIP3 models for south-east Australia, as described in Jones (2012). For example, for the CSIRO Mark3.5 A1B simulation, for global mean warming, internal trends comprise 52 % of total warming from 2006 to 2095, whereas for SEA Tmax the proportion is 13 % and for Tmin 47 %. These were consistent for A1B- and A2-forced simulations, which are roughly equivalent to RCP4.5 and 6.0. The number of step changes is also notable: four and five at the local scale and 12 at the global scale (Fig. 11). The higher ratio for Tmin compared to Tmax may be due to Tmin being related to large-scale sea surface temperature patterns and Tmax being related to more local soil moisture patterns, as is the case for the central and western US (Alfaro et al., 2006). Jones et al. (2013) showed that such changes at the local scale produce significant increases in impact risks.
These analyses do not support increasing trend-like behaviour at the local scale, and therefore favour the first alternative above, but further work across more regions is required to confirm this.

Testing of steps versus trends
Earlier sections have identified steps and trends in temperature and tested how trends, steps and trend-shift relationships relate to total warming and the independent variable ECS. This section examines how well trend, step and step-and-trend models reproduce the temperature records examined throughout the paper. This tests h trend against h step. The error value assigning p(H0) is not the principal measure being sought. Instead, the statistical model that combines low error with unstructured residuals while sustaining physically plausible assumptions is preferred. Another aim is, if possible, to provide likelihoods for severe testing.
Four statistical models are tested: ordinary least squares trend, LOWESS, step, and step and trend. The LOWESS model (locally weighted regression; Cleveland and Devlin, 1988) was applied with a bandwidth of 0.5 to assess sensitivity to fluctuations in the data, contrasting those with both the trend and step models. It is not considered a valid statistical rival because it is fitted without regard to physical process. Likewise, although the step-and-trend model will fit well to the data, the step model is the one used for severe testing, being a straightforward measure of h step. The trend model represents h trend.
With the data produced, we look at goodness of fit (r²), the residual sum of squares (ResSS), cumulative residuals (ΣR) and cumulative residuals squared (ΣR²). Residuals (R) show how much variance is explained by the model, cumulative residuals will show whether residuals are showing structure not explained by the model and cumulative residuals squared show accumulating error, including rapid changes not accounted for. Four more tests have been added to these: F tests for autocorrelation (F-auto) and heteroscedasticity (F-hetero) of the residuals over the whole record and percentage of exceedance over moving 40-year windows. White's test (White, 1980) is used for heteroscedasticity. The first four of these tests use absolute error, or the amount of a time series not explained by the statistical models, and the second four show patterns, working on accuracy and precision. The statistical models that fail to combine both are therefore the weakest.
Results are shown in Fig. 12 and Tables 4 and 5. The data and statistical models for the HadCRU record for 1880-2014 are shown in Fig. 12a. Cumulative residuals that track close to zero (Fig. 12b) show the model mimicking the data closely and sustained departures show significant deviation. Here, the trend model deviates substantially and the LOWESS model less so, while the step and step-and-trend models deviate least. This follows through to the cumulative residuals squared. The less change the better, whereas upward kinks show rapid changes or large outliers (positive or negative) not incorporated into the model (Fig. 12c). Trend analysis produces an r² value of 0.76 and residual sum of squares of 0.87, and the other three statistical models have an r² of 0.87 and ResSS of 0.8. For ΣR² the trend model behaves more poorly than the other three.
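The first four measures can be sketched for a trend model and a step model as below. The series is synthetic (a three-level staircase with a small deterministic wiggle), the break positions are supplied rather than detected, and summarising the cumulative residuals by their largest absolute excursion is an assumption of this sketch.

```python
def trend_fit(y):
    """Fitted values from an ordinary least squares line against time index."""
    n = len(y)
    x = list(range(n))
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return [my + slope * (a - mx) for a in x]


def step_fit(y, breaks):
    """Fitted values from segment means between the given break indices."""
    edges = [0, *breaks, len(y)]
    fitted = []
    for lo, hi in zip(edges, edges[1:]):
        seg_mean = sum(y[lo:hi]) / (hi - lo)
        fitted += [seg_mean] * (hi - lo)
    return fitted


def fit_metrics(y, fitted):
    """r squared, ResSS and running sums of residuals and squared residuals."""
    resid = [a - b for a, b in zip(y, fitted)]
    my = sum(y) / len(y)
    res_ss = sum(r * r for r in resid)
    tot_ss = sum((a - my) ** 2 for a in y)
    cum, cum_sq, run, run_sq = [], [], 0.0, 0.0
    for r in resid:
        run += r
        run_sq += r * r
        cum.append(run)
        cum_sq.append(run_sq)
    return {
        "r2": 1 - res_ss / tot_ss,
        "res_ss": res_ss,
        "cum_resid": cum,
        "cum_resid_sq": cum_sq,
        "max_abs_cum_resid": max(abs(c) for c in cum),
    }


# Synthetic staircase: levels of 0.0, 0.3 and 0.6 deg C over 30 time steps.
levels = [0.0] * 10 + [0.3] * 10 + [0.6] * 10
series = [lv + 0.05 * ((k * 3) % 4 - 1.5) for k, lv in enumerate(levels)]

trend_stats = fit_metrics(series, trend_fit(series))
step_stats = fit_metrics(series, step_fit(series, breaks=[10, 20]))
for name, s in [("trend", trend_stats), ("step", step_stats)]:
    print(name, round(s["r2"], 3), round(s["res_ss"], 3),
          round(s["max_abs_cum_resid"], 3))
```

Because both models have residuals that sum to zero over the whole record, the final cumulative residual says little on its own; it is the path taken in between that reveals unexplained structure.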
The LOWESS model performs less well on the autocorrelation and heteroscedasticity tests for the 40-year windows. Although the LOWESS model performs well over the whole record, it is subject to deviations within the record that cancel each other out, akin to cutting corners. The step-and-trend model performs worst for F-hetero over the whole record, but the best over 40-year windows. This is due to high variance within the early part of the record and is an issue of precision, as standard error of this relationship is almost half that of the trend model (not shown, but is similar to the ΣR² relationship). The step model is clearly superior to the trend model for the moving window tests. The results for the other four long-term global warming records, BEST, C&W, GISS and NCDC, are not shown but have similar results.
These tests, omitting LOWESS, were carried out for HadCRU 1965-2014, a period with a sustained radiative forcing signal (Fig. 12d). The results for the different statistical models are similar, with r² values of 0.85, 0.86 and 0.89 respectively. The step-and-trend model is still the best performed, but the step model is only slightly better than the trend model. This is due to the NH shift in 1987/88 being incorporated into the global mean trend. Dividing this time series into quarters will bring 1987/88 into the picture but will also make both the MYBT and Student's t tests more sensitive.
Also shown in Table 4 are the zonal temperatures from NCDC 30-60° N (1880-2014), where total internal trends are negligible. Here, the step model is clearly superior to the trend model, which fails White's test for the whole record, fails the 40-year F-auto at a level of 51 % and has a ResSS double that for steps. This record is entirely made up of steps, showing the lack of trend occurring within some regions. The quarterly record of HadCRU from Fig. 4 (1965-2014) is more fine-grained, incorporating the 1987/88 shift (Table 4). If warming is gradual, the results for trends should be scalable; however, they perform less well at this timescale. The respective r² results are 0.69, 0.72, 0.75 and 0.76, whereas the differences in the cumulative residuals are 2.0, 0.5, 0.7 and 0.2, where zero is a perfect score. Here, the LOWESS model performs similarly to the step model because it closely follows the data. The step model performs better than the trend model for HadCRU quarterly data, and almost as well as the step-and-trend model. For the GISS quarterly data, the results are similar.
The satellite records are more step-like than surface temperature when measured using cumulative residuals. The step-and-trend model for the 40-step window heteroscedasticity tests for satellite data fails for both RSS and UAH. This is due to two instances of short-term departures on an otherwise stable background that measures heteroscedasticity as significant with the F test: (1) a warm period during 1998, which is represented as a single step but lasts four quarters, and (2) a small warming event associated with an El Niño event in 2010 lasting two quarters. Removing this short-term warming from these sequences removes the heteroscedasticity. Therefore, although not all deviations are removed by representing the satellite record as stepwise, it still provides a better explanation of change than the trend model.
Simulated global annual mean surface temperatures from climate models show results consistent with observations (Table 5). The data from Fig. 10 were analysed in the same way, except that quadratic (RCP4.5, RCP6.0), cubic (RCP8.5) and quartic (RCP2.6) polynomial functions were used instead of a linear trend. The LOWESS model used here at 0.5 record length is relatively low resolution, providing 120-year smoothing. The step model outperforms both the trend and the LOWESS model in all simulations, with the exception of the ResSS in the RCP8.5 simulation. The RCP2.6 simulation is the most step-like. In the RCP4.5 simulation, the step model does slightly worse than in the RCP6.0 simulation, which is actually more step-like. This shows the role of stochastic uncertainty in the warming process, as portrayed in Fig. 8f. The RCP8.5 simulation is the most trend-like; the step model fails in the final decades of the 21st century because the bivariate test detects no steps, but the climate continues to warm. This is what we would expect if shifts became more local and more frequent, integrating into a curve at the global level, much like sea level rise does today.

Severe testing summary
A range of statistical tests have been used to examine h step and h trend as representatives of scientific hypotheses H1 and H2. The focus is on whether atmospheric warming is gradual, forming a monotonic or even segmented trend, or is stepwise and periodic, forming a complex trend over time.
As stated in the introductory sections, no single test can undertake that task. We rely on the multistep Maronna-Yohai bivariate test to identify step changes in the input data, but we make as few assumptions as possible. A total of six tests with links to the two substantive hypotheses were proposed earlier in the paper. These are designed to pinpoint discrepancies between H1 and H2 by analysing the temperature data they seek to explain. The data generated consist of steps, trends and shifts calculated using the multistep MYBT model and least-squares trend analysis. The use of statistical models such as LOWESS is for sensitivity testing and not part of the probative assessment. The test results are summarised through the following findings:
Test 1 What patterns of step changes can be detected in temperature observations?
- Global and regional analyses of steps show a highly coherent pattern of change points, where warming in the second half of the 20th century aligns with known regime changes associated with changes in decadal variability (Table 6). These events comprise the major proportion of historical warming until 2014.
- Analysis of steps, internal trends and shifts in observations attributes higher proportions of warming to shifts at the zonal scale (up to 100 %), moving to lower proportions at the global scale. Three regional assessments also contain high shift/step ratios, with trends playing a lesser role.
- This effect is larger in mid-latitude regions and with SST, indicating the role of equator-to-pole hydrothermal transport of energy in the ocean-atmosphere system. Their timing shows that a strong role is being played by decadal variability.
- Surface and satellite temperatures undergo contemporaneous shifts at the global scale, largely removing the discrepancy between trends within the two data sets. Both surface and satellite temperature records are very step-like, with surface trend/shift ratios of 0.19 and 0.27 and satellite ratios of −0.55 and −0.40 showing the effect of downward internal trends. Shifts are consequently higher than steps in the satellite data.
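The trend/shift ratios quoted above follow from simple bookkeeping, sketched here under stated assumptions. A shift is taken, as in the caption to Fig. 1, from the end of one internal trend to the start of the next. The function `decompose` and the segment values are hypothetical, invented to show how downward internal trends produce a negative ratio.

```python
def decompose(segments):
    """Split total change into internal trends and shifts.

    `segments` is a time-ordered list of (start, end) fitted values,
    one pair per internal trend between successive step changes.
    """
    trends = sum(end - start for start, end in segments)
    shifts = sum(nxt[0] - prev[1] for prev, nxt in zip(segments, segments[1:]))
    return trends, shifts


# Invented example: three slightly cooling segments separated by upward jumps.
segments = [(0.00, -0.05), (0.20, 0.10), (0.35, 0.30)]

trends, shifts = decompose(segments)
total = segments[-1][1] - segments[0][0]   # net change over the record
ratio = trends / shifts                    # negative when internal trends cool
```

Here the internal trends sum to about −0.20, the shifts to about 0.50 and the net change to 0.30, so the shifts exceed the net warming and the ratio is about −0.4.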
Test 2 Do models reproduce the patterns of step changes shown in observations?
- Correlations between step change frequency in the observed 44-member group of global and regional data and the CMIP3 and CMIP5 MMEs analysed (1880-2005) are 0.32 and 0.34 respectively (p < 0.01). For the period 1950-2005, correlations rise to 0.45 and 0.40 respectively. Grouping specific events (1963/64, 1968-70, 1976/77, 1979/80, 1987/88 and 1996-98) and analysing other years individually, correlation increases to 0.78 for both CMIP3 and CMIP5 records. Variations in forcing, especially from volcanoes, may affect the timing and direction of step changes, but they are not the sole cause, given that 21st century simulations produce step changes from smoothly varying changes in forcing.
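The effect of grouping events before correlating step frequencies can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the event windows are those listed above, but the step years in `observed` and `modelled` are contrived, and the helper names `group`, `step_counts` and `pearson` are hypothetical.

```python
from math import sqrt

# Event windows grouped in the text for 1950-2005.
EVENTS = [(1963, 1964), (1968, 1970), (1976, 1977),
          (1979, 1980), (1987, 1988), (1996, 1998)]


def group(year):
    """Map a year inside an event window to the window's first year."""
    for lo, hi in EVENTS:
        if lo <= year <= hi:
            return lo
    return year


def step_counts(records, first, last, grouped=False):
    """Count, per year, how many records show a step change in that year."""
    counts = [0] * (last - first + 1)
    for years in records:
        for year in years:
            if grouped:
                year = group(year)
            if first <= year <= last:
                counts[year - first] += 1
    return counts


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Contrived step years: each inner list is one record (or one model run).
observed = [[1977, 1997], [1979, 1997], [1988, 1997]]
modelled = [[1976, 1998], [1980, 1997], [1987, 1996]]

r_raw = pearson(step_counts(observed, 1950, 2005),
                step_counts(modelled, 1950, 2005))
r_grouped = pearson(step_counts(observed, 1950, 2005, grouped=True),
                    step_counts(modelled, 1950, 2005, grouped=True))
```

Because the contrived steps differ only by a year or so within each window, the year-by-year correlation is weak while the grouped correlation is perfect. Real records would not align this neatly.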
Test 3 What is the relationship between different components of change?
- For simulated historical warming during 1861-2005, the r² values for steps, shifts and trends in explaining total warming are 0.87, 0.43 and 0.13 respectively. Simulated warming for this period is not correlated with ECS.
- For the 21st century, the r² values for steps, shifts and trends in explaining total warming are 0.96, 0.54 and 0.49 respectively. The r² values for steps, shifts and trends in explaining ECS are 0.65, 0.52 and 0.18 respectively.
Test 4 Can step-like change be identified using attribution methods?
- In all three locations on three continents tested, and for six independent climate model simulations for south-eastern Australia, warming commenced with a step change in Tmin and sometimes Tmax. Warming is not slowly emergent in any of these data as would be expected if it were gradual. The coincident timing of shifts in south-eastern Australia with southern hemispheric step changes, and of those in the UK and US with northern hemispheric changes, suggests that warming has commenced abruptly in different areas of the globe at different times and that the separation between stationarity and non-stationarity in the temperature record is abrupt.
Test 5 Do other climate variables also undergo step changes?
- Step changes exhibiting similar timing have been shown for tide gauge observations, rainfall, ocean heat content, forest fire danger index and a range of other climate variables, in addition to many impact variables (Jones et al., 2013). These are overwhelmingly attributed to random climate variability, including abrupt changes identified as part of decadal regime change.
Test 6 Are temperature time series more step-like or trend-like?
- For observations and selected model data, the simple stepladder model performs better than the monotonic trend model for goodness of fit (r²), the residual sum of squares (ResSS), cumulative residuals (ΣR) and cumulative residuals squared (ΣR²), White's test for heteroscedasticity, a moving 40-year window regression of the residuals and a moving 40-year window of White's test. Although objections could be made to each of these on an individual basis, collectively they show that for externally forced warming on decadal scales, h step is better supported than h trend.
In summary, these tests show that h step is a close approximation of the data when analysing decadal-scale warming. Over the long term, this warming conforms to a complex trend that can be simplified as a monotonic curve, but the actual pathway is step-like. As outlined in Sect. 3.3, this rules out gradual warming, either in situ in the atmosphere or as a gradual release from the ocean, in favour of a more abrupt process of storage and release. This conclusion supports the substantive hypothesis H2 over H1, where climate change and variability interact, rather than varying independently.

Proposed mechanisms for step-like warming
The correlation between step-like warming and ECS in the models, between the timing of steps in model hindcasts and observations and between steps and known regime changes in observations (Table 6), provides strong evidence that warming is non-gradual on decadal timescales. The high correlations of steps and shifts with model ECS indicate that atmospheric feedback processes respond to abrupt releases of heat into the atmosphere. The presence of negligible internal trends occurring over some oceanic regions, the region 30-60° N, and in tropospheric satellite temperatures, suggests that little of the heat being trapped in the atmosphere by anthropogenic greenhouse gases actually remains there.
One justification given for rejecting externally driven step-like warming is the presumption that there is no plausible physical mechanism for it (Cahill et al., 2015; Foster and Abraham, 2015). However, to suggest that the stepwise release of heat energy is physically implausible overlooks the energetics of the ocean-atmosphere system. Hydrodynamic processes are quite capable of supplying the energy required (Ozawa et al., 2003; Lucarini and Ragone, 2011; Ghil, 2012). The atmosphere contains as much heat energy as the top 3.2 m of ocean (Bureau of Meteorology, 2003). About 93 % of historically added heat currently resides in the ocean (Levitus et al., 2012; Roemmich et al., 2015), whereas the atmosphere contains about 3 % of the total. A similar amount of the heat has been stored within the land mass (Balmaseda et al., 2013) and on an annual basis a similar flux is absorbed in melting ice (Hansen et al., 2011). A physical reorganisation of the ocean-atmosphere system, as part of a regime change, is therefore large enough to provide the relatively small amount of energy required to cause abrupt sea surface and atmospheric warming (Roemmich et al., 2015; Reid et al., 2016), as shown by rapid changes in shallow ocean heat content (Fig. 6b; Roemmich and Gilson, 2011; Reid, 2016).
For example, Reid et al. (2016), in describing the late 1980s regime change, show it was associated with large-scale shifts in temperature and multiple impacts across terrestrial and marine systems, mainly in the Northern Hemisphere. Changes in the North Pacific in 1977 were considered even more extensive (Hare and Mantua, 2000), as were those in 1997/98 involving both the Pacific and Atlantic oceans (Chikamoto et al., 2012a, b). In developing tests for detection and attribution, Jones (2012) noted two types of regime change over land: one where codependent variables such as maximum temperature and rainfall undergo a step change but remain in a stationary relationship, and the other, non-stationary change, where warming undergoes a step change independent of rainfall change. This suggests that although regime changes are a normal part of internal climate variability, they can be enhanced, releasing extra heat. The step changes summarised in Table 6 coincide with El Niño events but the heat emitted by other El Niño events dissipates and is absorbed back into the ocean within months; thus, an added mechanism is required. We propose that there is negligible in situ atmospheric warming and that almost all of the added heat trapped by anthropogenic greenhouse gases is absorbed by and stored in the ocean. It is subsequently released through the action of oscillatory mechanisms associated with regime shifts.
Most heat (long-wave radiation) is trapped near the ground or ocean surface and much of that is radiated downwards (Trenberth, 2011). The atmosphere as a whole has little intrinsic heat memory and does not warm independently of the surface. This is supported by observations on land where the overpassing air mass takes on the characteristics of the underlying surface, achieving energy balance within a 300 m distance (Morton, 1983). When passing from land to water, this will see all of the available heat energy taken up by water if the temperature of the air mass exceeds that of water (Morton, 1983, 1986), with the temperature of the overpassing air mass reaching equilibrium with the water beneath within a very short time. Very little of the heat trapped over land can be absorbed by the land surface, but will be transported from land to ocean within a few days to a few weeks, where it can be absorbed (the high latitudes being an exception). Given that the atmosphere interacts with the top 70 m of ocean over an annual cycle (Hartmann, 1994), there is ample opportunity for the majority of available heat trapped over land that is not absorbed by land, lakes and ice to be absorbed by the ocean.
In terms of energy budgets, the additional direct forcing from anthropogenic greenhouse gases is roughly 1.5 % (2.3 W m−2; IPCC, 2013) of the estimated total annual budget of 155 W m−2 trapped mainly by water vapour and CO2 (Schmidt et al., 2010). Since > 90 % of that 1.5 % is already accepted as being absorbed by the ocean, it is not clear why the roughly 3 % of that 1.5 % (0.07 W m−2) not absorbed by land, snow and ice would remain in the atmosphere if its absorption by the ocean is not energy limited, i.e. in the low to mid-latitudes. Negligible internal trends in lower tropospheric satellite temperatures also indicate that the air column is not warming in situ but exhibits stable temperatures punctuated by step changes (Fig. 4). This suggests that climate forms a series of oscillating steady-state regimes, with the temperature of the atmosphere being controlled by ocean-atmosphere interactions.
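The budget arithmetic in this paragraph can be checked directly. The sketch below uses only the figures quoted in the text; the variable names are hypothetical.

```python
# Figures quoted in the text (all in W m^-2 unless noted).
total_trapped = 155.0   # long-wave flux trapped by the atmosphere (Schmidt et al., 2010)
added_forcing = 2.3     # direct anthropogenic greenhouse gas forcing (IPCC, 2013)
atmos_fraction = 0.03   # share of the added heat attributed to the atmosphere

forcing_share = added_forcing / total_trapped   # fraction of the total budget
atmos_flux = atmos_fraction * added_forcing     # flux notionally left in the atmosphere

print(f"added forcing as a share of the budget: {forcing_share:.1%}")  # 1.5%
print(f"atmospheric share of the added forcing: {atmos_flux:.2f}")     # 0.07
```

The added forcing is about 1.5 % of the trapped flux, and 3 % of that forcing is about 0.07 W m−2, matching the values in the text.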
Step-like warming requires a trigger and release mechanism. Recently, Peyser et al. (2016) linked dynamic sea level in the Pacific Ocean, measured using an east-west see-saw index, to rapid changes in global mean surface temperature. In 1996-1997, that index underwent a west-to-east see-saw movement of 149 mm. This would mark the release of a large tongue of warm water from the western Pacific warm pool to the east, making heat available for discharge into the atmosphere. Based on a linear regression between the see-saw index and surface temperature calculated from control runs of 38 CMIP5 climate models, they estimate a jump in surface temperature of 0.29 ± 0.10 °C in 1997-1998, close to our estimate of 0.32, or 0.25 °C if 1987/88 is taken into account. They estimated another see-saw change of 111 mm in 2014/15 as contributing to a rapid warming of 0.21 ± 0.07 °C in 2016. We interpret their observations of rapid sea level rise in the western Pacific region as representing the sustained storage of heat in the Indo-Pacific warm pool. Heat absorbed in the tropical Pacific is blown westward into the warm pool, where it accumulates, maintaining the tropical Pacific as a region of generally low warming (Power et al., 2016). As the warm pool reaches critical limits, it becomes unstable, releasing surplus heat as a tongue of warm water from the western to eastern Pacific during an El Niño event. Meehl et al. (2016) have also suggested that the negative phase of the Interdecadal Pacific Oscillation that commenced in 1997/98 (Overland et al., 2008; Meehl et al., 2013) could change to positive during 2015-2019 as part of oscillatory mechanisms associated with the build-up of heat in the western Pacific. O'Kane et al. (2014) provide evidence that such changes may be identified years in advance. An accompanying regime change emplacing large areas of warmer water required to sustain higher temperatures after the initial outburst is consistent with widespread coral bleaching in 2014-2016 (Normile, 2016), rivalling that of 1998. Note that both Peyser et al. (2016) and Meehl et al. (2013) interpret their results as variability acting on a long-term trend; however, we reinterpret their findings as supporting a heat pulse and regime change, producing step-like warming.
In storing heat for redistribution, the Indo-Pacific warm pool acts as a global heat engine (Bosc et al., 2009), a function it has fulfilled for millions of years over a wide range of climatic changes (Gagan et al., 2004; de Garidel-Thoron et al., 2005; Abram et al., 2009). The storage and release mechanism identified by Peyser et al. (2016) may therefore be an additional response to a build-up of heat over and above oscillations associated with ongoing decadal regime change. Storage and release mechanisms may exist in other ocean basins but would need to be identified.

Discussion
There are many reasons why H1, where climate change and variability are considered to be independent of each other, has dominated climate research despite the lack of a conclusive theoretical or statistical case. They include historical, social, theoretical and political considerations too broad to cover here. Benestad (2016) reviews models used to build a mental picture of the greenhouse effect, nominating radiative-convective and heat balance models as two types historically used for this purpose. He describes the basic processes of radiative transfer as being well understood but insufficient to explain the warming process. Radiative transfer theory constitutes core greenhouse theory. However, the subsequent process of heat diffusion through the climate system is less well understood, although the understanding that if greenhouse gases are increased, the atmosphere will warm until the radiative balance at the top of the atmosphere is achieved also constitutes core theory.
Our conclusion that the atmosphere does not warm in situ will challenge many who consider that to be a basic part of the greenhouse effect. However, an exhaustive search of the literature failed to find any direct evidence that this actually takes place. We find it hard to perceive how an additional increment of long-wave radiation on the order of ∼ 0.2 W m−2 (direct forcing and feedback derived from Schmidt et al., 2010) can behave differently to the ∼ 155 W m−2 produced in the atmosphere year to year without being absorbed by the wider climate system. Given that climate models exhibit step-like warming, where the abrupt component carries a greater part of the signal than internal trends, they produce emergent behaviour that is not identified by mainstream analytic approaches.
Overwhelmingly, model- and statistically based studies represent the global warming signal as changing gradually. Some are prescriptive because of their structure or because they apply simplified assumptions about a more complex climate system; other models examine a small part of the system, and some have a historical legacy bestowing familiarity and reliability. Modern climate models are almost as complex as the climate, and thus need to be understood through simpler models (Held, 2005; Benestad, 2016), forming a nested modelling approach from simple through to complex (Schneider and Dickinson, 1974; Ghil, 2015). The linking of trend analysis methods with gradual change may overlook the distinction between process-based and diagnostic models. A diagnostic model may identify a trend without necessarily indicating a gradual process. A large part of the climate wars has been fought over this very point.
Nonlinear responses in climate are being investigated by researchers, with an interest in complex system behaviour via dynamical systems and related theory. Our conclusions suggest that the processes of radiative transfer and subsequent warming take place in two separate domains of the climate system, separated by a delay. The absorption of radiation is a linear process that is quite separate from the behaviour of turbulent dissipation of heat energy within the climate system, which is fundamentally nonlinear (Ozawa et al., 2003). Developments based on deterministic nonlinear and stochastic linear behaviour originating from work by Lorenz (1963) and Hasselmann (1976) respectively explore a range of interrelated phenomena, such as non-equilibrium stable states, oscillators, strange attractors, bifurcations and entropy production, in order to develop a unified theory of climate (Ozawa et al., 2003; Lucarini et al., 2014; Franzke et al., 2015; Ghil, 2015). Studying how the free and forced aspects of change combine to alter the statistical properties of climate is a specific goal (Lucarini and Sarno, 2011; Ghil, 2012, 2015).
Our focus is on understanding the role of linear and nonlinear behaviour in changing climate risk over decadal timescales, specifically how initial-condition and boundary-limited uncertainties (as described by Lorenz, 1975 and Hasselmann, 2002) combine. Initial-condition uncertainty is boundary limited, varying within a certain amplitude, with the outcome depending on the pathway taken within those limits (Lorenz, 1975). There is also a time-dependent window that serves as a predictability barrier. Changing boundary conditions are intransitive, with the outcome being insensitive to initial conditions. The nested nature of climate phenomena over different timescales results in decadal-scale climates being both an initial-condition and intransitive process, combining to produce stochastically driven step changes in warming that integrate into a long-term complex trend. The coincident timing of step changes in both observations and models (Fig. 7) suggests that other factors, such as short-term volcanic forcing, can also influence the timing of step changes. Lorenz (1968) referred to the outcome of forced climate change on century timescales as almost intransitive. The "almost" is due to initial-condition uncertainties operating within the boundary limitations of decadal variability. The almost-intransitive model (Lorenz, 1968) is described via linear response theory (Lucarini et al., 2014; Ragone et al., 2016) and shown to be robust for concepts such as effective radiative forcing (Hansen et al., 2005) and effective climate sensitivity (Andrews et al., 2015), although these phenomena would be sensitive to bifurcations if they were to occur (Hasselmann, 2002).
If the ocean takes up the additional available heat from anthropogenic greenhouse gases while maintaining steady-state conditions within an oscillatory system of climate regimes, it can be considered as acting homeostatically with respect to the atmosphere (e.g. Kleidon, 2004). Heat will accumulate in the shallow ocean until such a time as it becomes unstable and is released as part of a stepwise regime change. The new regime, being warmer, enhances vertical and horizontal heat fluxes, which is consistent with a more energetic system. Sustained forcing would produce a series of regime changes becoming successively warmer and forming a stepladder or elevator-like record of change. Whether the oscillatory systems themselves change under greater forcing (e.g. RCP8.5) or whether warming itself becomes more diffuse has yet to be investigated. Note that these step changes are quite different to those catalogued by Drijfhout et al. (2015), who used a different method to screen the CMIP5 model ensemble for abrupt shifts that could be considered as singularities, locating 37 ocean, sea ice, snow cover, permafrost and terrestrial biosphere changes.
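The accumulate-then-release behaviour described above can be caricatured in a few lines. The sketch below is a toy under stated assumptions, not the authors' model and not a physical simulation: the function `store_and_release`, the constant yearly uptake, the fixed instability threshold and the arbitrary integer heat units are all hypothetical choices made for illustration.

```python
def store_and_release(years, uptake=1, threshold=15):
    """Toy stepladder: the ocean store fills steadily under constant forcing
    and empties into the atmosphere whenever it reaches the threshold.

    Units are arbitrary integers so that step timing is exact.
    Returns the atmospheric level at the end of each year.
    """
    store, atmos = 0, 0
    record = []
    for _ in range(years):
        store += uptake            # heat taken up by the ocean this year
        if store >= threshold:     # store becomes unstable
            atmos += store         # released to the atmosphere as one step
            store = 0
        record.append(atmos)
    return record


record = store_and_release(60)

# Years (0-based) in which the atmospheric level jumps.
step_years = [i for i in range(1, len(record)) if record[i] != record[i - 1]]
levels = sorted(set(record))
```

With these settings the record is flat between jumps in years 14, 29, 44 and 59 and passes through the levels 0, 15, 30, 45 and 60. The final level equals the total heat taken up, so the same long-term change that a smooth trend would give is delivered as a staircase.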
Statistical characterisations of changing climate variables are becoming more probabilistic, with probability distribution functions increasingly being produced from climate model ensembles. However, the presence of non-gradual change suggests that statistics developed from the path-wise analysis of individual simulations (as was carried out in this paper and as suggested by Ghil, 2015) are required, especially higher-order statistics that represent extreme events potentially subject to step changes. For example, fire risk in Victoria, Australia, increased abruptly by 38 % between 1972-97 and 1998-2010, driven by a step change in climate (Jones et al., 2013). Because methods for detection, attribution, climate forecasting and characterisation of future climate risk are almost totally dependent on being scaled to gradual change in mean variables, a stepwise process will require a substantial rethink as to how these activities can be conceptualised.
For example, seamless links between weather and climate forecasting over a range of timescales are proposed as a key scientific target (Palmer et al., 2008; Hoskins, 2013). The Global Framework for Climate Services (World Meteorological Organization, 2011) reflects that "Weather and climate research are closely intertwined; progress in our understanding of climate processes and their numerical representation is common to both. Seamless prediction (on timescales from a few hours to centuries) needs to be further developed and extended to aspects across multiple disciplines relevant to climate processes" (World Meteorological Organization, 2010). Solomon et al. (2011) state that "Long experience in weather and climate forecasting has shown that forecasts are of little utility without a priori assessment of forecast skill and reliability". The assumption that the processes involved are timescale invariant indicates that what seamless prediction means in a decision-support context has not been fully thought through. For the moment, decadal prediction concentrates on ensemble mean change in variables that show skill in climate models, whereas the prospect of non-gradual change carries the greater risk. Under this type of framing, climate services remain supply driven rather than demand driven (Gunasekera et al., 2014; Street, 2016). Projections of mean change also overlook the considerable literature on scenarios that have arisen because of the failure of multi-year predictions of mean change in systems that exhibit considerable non-linearity (Wack, 1985a, b; Börjeson et al., 2006).

Conclusions
Here, we have adapted and applied severe testing principles proposed by Mayo and Spanos (2010) to determine the role that step changes play in decadal-scale warming. This involves the linking of scientific hypotheses H1 and H2 with statistical hypotheses h trend and h step and subjecting them to severe testing. Paraphrasing the severity principle of Mayo (2010), the results of Tests 1-6 provide evidence for hypothesis H2 if and only if h step passes a severe test with very high probability, where h trend would have uncovered the falsity of H2, and yet no such error is detected.
Error and probative testing of steps against trends lends little support for the proposition that the climate warms gradually.
If trend-like behaviour were dominating warming or were on an even footing with step-like change, these tests would have identified it. H1 is only suitable for intransitive estimates of change, where the initial conditions, pathway and non-linear components of forcing are unimportant. Surface and tropospheric warming on decadal timescales is dominated by stepwise changes in temperature (Reid and Beaugrand, 2012; Jones et al., 2013; Belolipetsky et al., 2015; Bartsev et al., 2016; Reid et al., 2016). The basic physical mechanism for moving from H1 to H2 is deceptively simple: instead of warming occurring in situ in the atmosphere and/or being released gradually from the ocean, all available heat from additional greenhouse gases not absorbed by the land surface, snow and ice and in lakes is absorbed by the ocean. There, it is entrained into the nonlinear processes of climate variability, where the added forcing interacts with those processes. The most plausible explanation for step-like behaviour is that steady-state decadal regimes are punctuated by step-like bursts of warming that are subsequently maintained by higher sea surface temperature emplaced by ocean-atmosphere regime changes. This conclusion does not invalidate the considerable literature that assesses long-term (> 50 years) climate change as a relatively linear process and the warming response as being broadly additive with respect to forcing (e.g. Lucarini et al., 2010; Marvel et al., 2015). However, the signal-to-noise model of a gradually changing mean surrounded by random climate variability poorly represents warming on decadal timescales. The separation of signal and noise into "good" and "bad" is likewise poor framing for the purposes of understanding and managing risk in fundamentally nonlinear systems (Koutsoyiannis, 2010). As we show, the presence of such changes within climate models does not indicate a need to fundamentally change how climate modelling is carried out. It does, however, indicate a need to change how the results are analysed.
Climate conceptualised as a mechanistic system and described using classical statistical methods is substantially different from climate conceptualised as a complex system. With record atmospheric and surface ocean temperatures in 2015/16 variously being described as a singular event, a reinvigoration of trend-like warming or a wholesale shift to a new climate regime, this issue is too important to be left unresolved.

Figure 1. Record of mean annual surface temperature anomalies 1880-2014 from the Hadley Centre and Climate Research Unit (HadCRU), showing step changes (p < 0.01) and internal trends and shifts taken from the end of one internal trend to the start of the next across a step.

Figure 2. Dates of statistically significant step changes (p < 0.01) in 1880-2014 for a range of mean annual temperature records. Downward steps are blue and upward steps are red. Records are sourced from the Goddard Institute of Space Studies (GISS); Hadley Centre and Climate Research Unit: HadCRU (land and ocean), HadSST (ocean), CRUtem (land); National Climatic Data Center: NCDC (land, land and ocean); ERSST (ocean); Berkeley Earth Surface Temperature (BEST); and Cowtan and Way (C&W). See Supplement for details.

Figure 3. Mean global anomalies of surface temperature with internal trends. The annual anomalies (dotted lines) from five records (HadCRU, C&W, BEST, NCDC, GISS) are taken from a 1880-1899 baseline. Internal trends (dashed lines) are separated by step changes detected by the bivariate test at the p < 0.01 error level. The size of each step (in red) and change in temperature of each internal trend (in black) is shown in the figure table along with its significance, where NS is p > 0.05, * is 0.01 < p < 0.05 and ** is p < 0.01. Totals of trends, steps, shifts (change from one trend to the next) and ratios are also shown.

Figure 5 .
Figure 5. Anomalies of annual mean temperature attributed to nonlinear changes where the influences of interannual variability have been removed for (a) central England, (b) Texas and (c) southeastern Australia.Internal trends (dashed lines) are separated by step changes (p < 0.01).
attain p < 0.05, showing the dynamic nature of change and limited trend-like behaviour in these examples.

4 Results - models

4.1 20th century simulations (1861-2014)

Figure 9. Correlations between ECS and linear trends, total step changes, warming to date and quadratic trends (a) from 1861 to the current decade (warming to date: 1861-99 average subtracted from current decadal average) and (b) from 1961. Dotted lines mark p < 0.01.
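The "warming to date" measure defined in this caption is a difference of two period means. The sketch below assumes annual data and a hypothetical 2005-2014 window for the "current decade", which the caption does not specify; the series is synthetic.

```python
import numpy as np

def warming_to_date(years, temps, base=(1861, 1899), decade=(2005, 2014)):
    """Mean over the 'current' decade minus the mean over the baseline period.

    The decade window is an assumption for illustration; bounds are inclusive.
    """
    years = np.asarray(years)
    temps = np.asarray(temps, dtype=float)
    in_base = (years >= base[0]) & (years <= base[1])
    in_decade = (years >= decade[0]) & (years <= decade[1])
    return temps[in_decade].mean() - temps[in_base].mean()

# Synthetic series (not model output): steady warming of 0.005 per year from 1861.
years = np.arange(1861, 2015)
temps = 0.005 * (years - 1861)
wtd = warming_to_date(years, temps)
print(round(wtd, 4))
```

Computed once per ensemble member, values of this kind could then be correlated against each model's ECS, for example with `np.corrcoef`.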

Figure 10. Global mean surface temperature as analysed by the multistep bivariate test. (a) Step and trend breakdown of global mean surface temperature in the RCP2.6, 4.5, 6.0 and 8.5 simulations from the HadGEM-ES model, run 3; (b-e) T_i0 results from a 40-year moving window for the RCP2.6, 4.5, 6.0 and 8.5 simulations respectively.

Figure 11. Anomalies of annual mean temperature showing internal trends separated by step changes from the CSIRO Mk3.5 A1B simulation: (a) maximum temperature of south-eastern Australia; (b) minimum temperature of south-eastern Australia; and (c) global mean surface temperature. Internal trends (dashed lines) are separated by step changes (p < 0.01).

Figure 12. Testing three models against mean global anomalies of surface temperature from the HadCRU record for 1880-2014 (a-c) and 1965-2014 (d-f); (a) and (d) show mean annual anomalies and the linear, step change and shift-and-trend models; (b) and (e) show cumulative residuals for each model, where success is measured as tracking close to zero; (c) and (f) show the cumulative sum of residuals squared, where upward steps show non-linearity not explained by each model.
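The two diagnostics in panels (b), (c), (e) and (f) can be sketched as follows. This is an illustration on synthetic data, not the HadCRU record: it compares only a linear model with a step model whose breakpoint (1990) is assumed known, and it omits the shift-and-trend model. The helper `residual_diagnostics` is hypothetical.

```python
import numpy as np

def residual_diagnostics(observed, fitted):
    """Return the cumulative residuals and the cumulative sum of squared residuals."""
    resid = np.asarray(observed, dtype=float) - np.asarray(fitted, dtype=float)
    return np.cumsum(resid), np.cumsum(resid ** 2)

# Synthetic series: a single 0.5 step in 1990 plus Gaussian noise (assumed values).
rng = np.random.default_rng(0)
years = np.arange(1965, 2015)
is_late = years >= 1990
series = np.where(is_late, 0.5, 0.0) + rng.normal(0.0, 0.05, years.size)

# Linear model: ordinary least squares on centred years.
x = years - years.mean()
slope, intercept = np.polyfit(x, series, 1)
linear_fit = slope * x + intercept

# Step model: the mean of each segment either side of the assumed breakpoint.
step_fit = np.where(is_late, series[is_late].mean(), series[~is_late].mean())

cum_lin, cumsq_lin = residual_diagnostics(series, linear_fit)
cum_step, cumsq_step = residual_diagnostics(series, step_fit)

print("final residual SS, linear:", round(cumsq_lin[-1], 3))
print("final residual SS, step:  ", round(cumsq_step[-1], 3))
```

Both fits force the cumulative residuals back to zero at the end of the series, so the path in between is the informative part: a model that misses a step drifts away from zero and then returns, and its cumulative squared residuals climb where the step occurs.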

Table 1. Dates of step changes for lower tropospheric satellite temperature anomalies, with annual time series and quarterly breakdowns in parentheses (DJF, MAM, JJA, SON), and quarterly time series. Data sources include Remote Sensing Systems (RSS) and the University of Alabama, Huntsville (UAH).

Table 2. Year of non-stationarity in regional temperature for south-eastern Australia, Texas and central England. Data source, year of first change greater than 1 standard deviation for T_max against P and T_min against T_max, or DTR/P using the bivariate test. The stationary period is also shown. The negative ratio for Texas arises because T_av ARW contains negative internal trends, mostly after 1990 (largely a rainfall effect on T_max). For central England, the ratio for T_av has been calculated from the long-term record from 1659, which shows no step changes or trends between 1701 and 1920. Late 20th century warming in both central England and the continental US elsewhere has also been analysed as nonlinear

Table 3. Steps collated for each decade from 1876 to 2095 from the RCP4.5 MME, showing total steps up and down and the correlation between step size and ECS. The second part of the table shows the correlations between total warming, steps and trends over the observed and simulated periods and ECS. Correlations are classified as not significant (NS, p > 0.05), p < 0.05 (*) and p < 0.01 (**). Total correlations with the MME are n = 107 and with ECS are n = 92.

Table 4. Results of eight tests on four statistical models for selected observed global temperature data (except where noted). The statistical models tested are trends (power shown), LOWESS (0.5 total series smoothing), steps, and steps and trends. Results include the adjusted r² value, the residual sum of squares (SS), cumulative residuals and squared cumulative residuals. F tests for the whole series are shown, with p < 0.05 and p < 0.01 noted if registered, otherwise p > 0.05. F test failure for 40-year period autocorrelation and heteroscedasticity is measured at p < 0.01.

Table 5. Results of eight tests on four statistical models for representing global mean warming from HadGEM-ES climate model run 3 RCP2.6, 4.5, 6.0 and 8.5, showing the amount of warming for different measures. The statistical models tested are trends (power shown), LOWESS (0.5 total series smoothing), steps, and steps and trends. Results include the adjusted r² value, the residual sum of squares (SS), cumulative residuals and squared cumulative residuals. F tests for the whole series are shown, with p < 0.05 and p < 0.01 noted if registered, otherwise p > 0.05. F test failure for 40-year period autocorrelation and heteroscedasticity is measured at p < 0.01.

Table 6 summarises the major tests undertaken with expected outcomes for h_trend and h_step. While objections

Table 6. Selected test results that distinguish between h_trend and h_step. The null positions for each are generally not considered diametric. There is no generally accepted null with respect to h_trend that references nonlinear change, whereas for h_step the null has no significant stepwise change points or, if it does, they are completely random and do not contain an external forcing signal.