Using Network Theory and Machine Learning to predict El Ni\~no

The skill of current predictions of the warm phase of the El Ni\~no Southern Oscillation (ENSO) reduces significantly beyond a lag of six months. In this paper, we aim to increase this prediction skill at lags up to one year. The new method to do so combines a classical Autoregressive Integrated Moving Average technique with a modern machine learning approach (through an Artificial Neural Network). The attributes in such a neural network are derived from topological properties of Climate Networks and are tested on both a Zebiak-Cane-type model and observations. For predictions up to six months ahead, the results of the hybrid model give a better skill than the CFSv2 ensemble prediction by the National Centers for Environmental Prediction (NCEP). Moreover, results for a twelve month lead time prediction have a similar skill as the shorter lead time predictions.


Introduction
Approximately every 4 years, the sea surface temperature (SST) is higher than average in the eastern equatorial Pacific (Philander, 1990). This phenomenon is called an El Niño and is caused by a large-scale ocean-atmosphere interaction between the equatorial Pacific and the global atmosphere (Bjerknes, 1969), referred to as the El Niño Southern Oscillation (ENSO). It is the dominant mode of climate variability at interannual timescales and has teleconnections worldwide. As El Niño events cause enormous damage worldwide, skillful predictions, preferable for lead times up to 1 year, are highly desired.
So far, both statistical and dynamical models are used to predict ENSO (Chen et al., 2004;Yeh et al., 2009;Fedorov et al., 2003). However, El Niño events are not predicted well enough up to 6 months ahead due to the existence of the socalled predictability barrier (Goddard et al., 2001). Some theories indicate that this is due to the chaotic, yet deterministic, behaviour of the coupled atmosphere-ocean system (Jin et al., 1994;Tziperman et al., 1994). Others point out the importance of atmospheric noise, acting as a high-frequency forcing sustaining a damped oscillation (Moore and Kleeman, 1999).
Recently, attempts have been made to improve the ENSO prediction skill beyond this spring predictability boundary, for example by using machine learning (ML; Wu et al., 2006) methods, also combined with network techniques . ML has shown to be a promising tool in other branches of physics, outperforming conventional methods (Hush, 2017). As the amount of data in the climate sciences is increasing, ML methods such as artificial neural networks (ANNs), are becoming more interesting to apply to prediction studies.
Briefly, ANN is a system of linked neurons that describes, after optimization, a function from one or more input variables (or attributes) to the output variable(s). Generally, one has to choose how large and complicated the ANN structure is. The more complicated an ANN, the more it will filter the important information from the attributes itself, but it will require more input data and is computationally intensive. Therefore, simpler ANN structures are used in this arti-cle. However, techniques will have to be applied in order to reduce the amount of input variables and select the important ones, to make the problem appropriate for the simpler ANN. This reduction and selection problem can be tackled in many ways, which are crucial for the prediction. The main issue in these methods, however, is what attributes to use for ENSO prediction.
Complex networks turn out to be an efficient way to represent spatiotemporal information in climate systems (Tsonis et al., 2006;Steinhaeuser et al., 2012;Fountalis et al., 2015) and can be used as an attribute reduction technique. These climate networks are in general constructed by linking spatiotemporal locations that are significantly correlated with each other according to some measure. It has been demonstrated that relationships exist between topological properties of climate networks and nontrivial properties of the underlying dynamical system (Deza et al., 2014;Stolbova et al., 2014), also specifically for ENSO (Gozolchiani et al., 2011(Gozolchiani et al., , 2008Wang et al., 2015). Climate networks already appear to be a useful tool for more qualitative ENSO prediction, by considering a warning of the onset of El Niño when a certain network property exceeds some critical value (Ludescher et al., 2014;Meng et al., 2017;Rodríguez-Méndez et al., 2016).
In this paper, a hybrid model is introduced for ENSO prediction. The model combines the classical linear statistical method of autoregressive integrated moving average (ARIMA) and an ANN method. ANN is applied to predict the residual, due to the nonlinear processes, that is left after the ARIMA forecast (Wu et al., 2006). To motivate our choice for attributes in the ANN, we use an intermediatecomplexity model which can adequately simulate ENSO behaviour, the Zebiak-Cane (ZC) model (Zebiak and Cane, 1987). The attributes which are used in the prediction model are related to physical processes which are relevant for ENSO prediction. Moreover, network variables are considered as attributes such that they relate to a physical mechanism, but additionally contain spatial information.
Section 2 briefly describes the ZC model, the methods considering both the climate networks and ML, and the data from observations. In Sect. 3, the network methods are first applied to the ZC model. Second, the attributes selected for observations are presented. These attributes, among which there is a network variable, are applied in the hybrid prediction model in Sect. 4, which discusses the skill of this model to predict El Niño. The paper concludes with a summary and discussion in Sect. 5.

Data from observations
As observational data, we use the sea surface height (SSH) from the weekly ORAP5.0 (Ocean ReAnalysis Pilot 5.0) re-analysed dataset of ECMWF from 1979 to 2014 between 140 to 280 • E and 20 • S to 20 • N.
For recent predictions, the SSALTO/DUACS altimeter products are used for the same spatial domain, since the SSH is available from 1993 up to the present in this dataset. The SSALTO/DUACS altimeter products were produced and distributed by the Copernicus Marine and Environment Monitoring Service (http://marine.copernicus.eu/, June 2017).
In addition, the HadISST dataset of the Hadley Centre has been used for the SST and the NCEP/NCAR Reanalysis dataset for the wind stress from 1980 to the present (Rayner et al., 2003).
To quantify ENSO, the NINO3.4 index is used, i.e. the 3-month running mean of the average SST anomaly in the extended reconstructed SST dataset between 170 to 120 • W and 5 • S and 5 • N (Huang et al., 2015).
The warm water volume (WWV), being the integrated volume above the 20 • C isotherm between 5 • N-5 • S and 120-280 • E, is determined from the temperature analyses of the Bureau National Operations Centre (https://www. pmel.noaa.gov/elnino/upper-ocean-heat-content-and-enso, National Oceanic and Atmospheric Administration, 2017).

The Zebiak-Cane model
The ZC model (Zebiak and Cane, 1987) represents the coupled ocean-atmosphere system on an equatorial β-plane in the equatorial Pacific (see Fig. 1). This model is used here to infer which processes are important for ENSO prediction and to find the attributes which represent those processes. Also, a network analyses is applied to the ZC model in order to find network variables which could improve prediction, before these network variables are calculated in observations. We use the numerically implicit version of this model (van der Vaart et al., 2000;von der Heydt et al., 2011) as in Feng (2015. In the ZC model, a shallow-water ocean component is coupled to a steady shallow-water Gill atmosphere model (Gill, 1980). The atmosphere is driven by heat fluxes from the ocean, depending linearly on the anomaly of the SST T with respect to a radiative equilibrium temperature T 0 . The zonal wind stress τ x is the sum of a coupled and an external part. The external part is independent of the coupling between the atmosphere and ocean and represents a weak easterly wind stress due to the Hadley circulation. The coupled part of the zonal wind stress is proportional to the zonal wind from the atmospheric model; the meridional component of the wind stress is neglected in this model.
As shown in van der Vaart et al. (2000), the parameter measuring the magnitude of the ocean-atmosphere coupled processes is the coupling strength µ. Without any included noise, a temperature anomaly damps out to a constant value and a stationary state if µ < µ c , where µ c indicates a critical value. However, if the coupling strength exceeds the critical value µ c , a supercritical Hopf bifurcation occurs. A perturba-Figure 1. Pacific area (red rectangle) from 140-280 • E to 20 • S-20 • N, the NINO3.4 area (green rectangle) from 170-120 • W to 5 • S-5 • N and the WWV area (blue rectangle) from 120-280 • E to 5 • S-5 • N. tion then does not decay, but an oscillation is sustained with a period of approximately 4 years.
Three positive feedbacks related to the thermocline depth, upwelling and zonal advection can cause the amplification of SST anomalies (Dijkstra, 2006), while the oscillatory behaviour associated with ENSO is caused by negative delayed feedbacks. The "classical delayed oscillator" paradigm assumes this negative feedback is caused by waves through geostrophic adjustment, controlling the thermocline depth. A complementary, different view is the "recharge/discharge oscillator" (Jin, 1997), also regarding oceanic waves excited through oceanic adjustment. The waves excited to preserve the Sverdrup balance are responsible for a transport of warm surface water to higher latitudes, discharging the warm water in the tropical Pacific. The thermocline depth is raised, resulting in more cooling of SST. The WWV is the variable generally used to capture how much the tropical Pacific is "charged".
Apart from the coupled ocean-atmosphere processes, ENSO is also affected by fast processes in the atmosphere, which are considered as noise in the ZC model. An important example of atmospheric noise are the so-called westerly wind bursts (WWB). These are related to the Madden-Julian oscillation (Madden and Julian, 1994). The WWB is a strong westerly anomaly in the zonal wind field, occurring every 40 to 50 days and lasting approximately 1 week. The effect of the noise on the model behaviour depends on whether the model is in the super-or subcritical regime (i.e. whether µ is above or below µ c ). If µ < µ c , the noise excites the ENSO mode, causing irregular oscillations. In the supercritical regime, a cycle of approximately 4 years is present, and noise causes a larger amplitude of ENSO variability.
The atmospheric noise in the model is represented by obtaining a residual of the wind stress from observations as in . Since weekly data are considered, every discrete time step in the model is 1 week.

Network variables
Here we explain the methods to calculate a property of a climate network which is tested in the ZC model and observa-tions and will be used in the hybrid model. From the network analysis we found several climate network quantities with interesting properties for prediction, but which are not used in the hybrid model of the next section. The methods to calculate these properties can be found in Appendix A1.
An undirected and unweighted network is constructed making use of the Pearson correlation of climate variables related to ENSO (e.g. SST, thermocline depth or zonal wind stress). Network nodes are model or observation grid positions i and the links are stored in a symmetric adjacency matrix A, where A ij = 1 if node i is connected to node j and A ij = 0 otherwise. A ij is defined by Here R ij is the Pearson correlation between node i and j, is the threshold value and Θ denotes the Heaviside function. Hence, if the Pearson correlation exceeds the threshold , the two nodes will be linked. The δ ij is the Kronecker delta function, implemented to prevent connection of nodes with themselves.
Percolation theory is then considered, describing the connectivity of different clusters in a network. It has been found that the connectivity of some climate networks increases just before an El Niño and decreases afterwards (Rodríguez-Méndez et al., 2016), as local correlations between points increase and decrease. At such a percolation-like transition, the addition of only a few links can cause a considerable part of the network to become connected. Before the percolation transition, clusters of small sizes will form. Therefore the variable c s will warn for the transition: Here n s is the amount of clusters of size s and N the size (i.e. the total amount of nodes) of the network. Thus c s is the fraction of nodes that are part of a cluster of (generally small) size s.

Hybrid prediction model
A hybrid model (Valenzuela et al., 2008) will be applied to predict ENSO, in which the observation Z t at time t is rep-resented by Here Y t is modelled by a linear process and N t by a MLtype technique. LetỸ t be the prediction of the part Y t using ARIMA, then Z t −Ỹ t is the residual with respect to the observed value. This residual will be predicted by the feedforward ANN: Here f is a nonlinear function of the N attributes x 1 (t), · · · , x N (t) andÑ t the prediction of residual Z t −Ỹ t at time t. Notice the nonlinear function f does not depend on history, whereas the ARIMA partỸ does. The final predic-tionZ t of the hybrid model is Previous work showed that the results of a hybrid model are in general more stable and reduce the risk of a bad prediction, compared to a single prediction method (Hibon and Evgeniou, 2005). "More stable" means that a hybrid model has a lower variability in prediction skill for different arbitrary time series. Besides, ARIMA is a simple method to include information about the history in the prediction model, which is not in the feed-forward ANN. This scheme describes a "supervised" model, implying that the predictant is "known". This known quantity is the NINO3.4 index. The standard procedure for supervised learning is to optimize the ML method on a "training set" to define an optimal model, which predicts ENSO with a certain time ahead. This function will then be tested on a test set. Here a training set of 80 % and a test set of 20 % of the total time series is used. The dataset can be represented by a T × N matrix, where T represents the length of the time series and each time t = 1, ··· , T has a set of N attributes x 1 (t), ··· , x N (t). Note that, since we are predicting time series, for any training set [t train In the following, we describe more in detail the different parts of this hybrid prediction method.
First, the training set is used to optimize an ARIMA(p, d, q) process for the NINO3.4 time series. The standard method maximizing the log likelihood function is used to fit α 1 , · · · , α p , β 1 , · · · , β q , such that t ε 2 t is minimized for time series Z t with t in months: where ε t is the residual, differencing order d determines the amount of differencing terms, p is the amount of autoregressive terms and q is the amount of moving average terms on the right-hand side; B (BZ t = Z t−1 ) is the lag operator.
Finding the most optimal ARIMA order (p, d, q) is not trivial (Zhang, 2003;Aladag et al., 2009). General methods include the Akaike's information criterion (Akaike, 1974) or minimum description length (Rissanen, 1978). However, these methods are often not satisfactory and additional methods have been proposed to determine the order (Al-Smadi and Al-Zaben, 2005). In this article we mainly present results obtained with orders p = 12, d = 1 and q = 0 or q = 1, which gave good prediction skill and it can be argued that in such a chaotic system, information from too long ago is not important anymore.
LetỸ t be the ARIMA prediction of τ > 0 months ahead, by calculatingŶ t for τ times in the future and replacing any observation Z t with the consecutive calculatedŶ t , where t is in the future and Z t therefore unknown. Similarly, if q = 1 and τ > 1 months, the residual is calculated by t−1 =Ỹ t−1 −Ŷ t−1 , since the observed value Z t−1 is in the future. Hence the ARIMA predictionỸ t will be a time extrapolation with the optimized ARIMA model. AfterỸ t is predicted by the ARIMA model, the ANN will be used for the predictionÑ t , making use of more variables than the NINO3.4 index alone. Deciding which of the variables to use is not a straightforward problem, yet crucial for the eventual prediction. Generally in an ANN, a pair of two variables can be compatible in the prediction, but perform poor when applied alone. Other pairs can be redundant and cover important information when used alone, but solely noise is included when used together (Guyon and Elisseeff, 2003). Adding a variable to the attribute set and seeing if it improves prediction can only conclude whether it improves prediction with respect to the old attribute set, not whether the variable is predictive in itself. To determine the attribute set, we consider which variables represent a certain physical mechanism that is important for the ENSO prediction. This helps to find attributes which are not related to each other, but include important information on their own. Besides, it is tested whether the prediction skill is reduced if a variable is dropped out of the attribute set.
Moreover, at every lead time an optimal attribute must be selected. Hence the final prediction model is tuned for a specific lead time and will not be a step by step prediction forward in time. Apart from considering the physical mechanisms the variables represent, two methods will help to decide which variables can improve the prediction. First, correlation between the predictor and predictant is a commonly used measure for attribute selection (Hall, 1999). Therefore the Pearson cross-correlation is calculated for the attributes at lag τ to show the predictability of a time series: Here p is the predictor, q is the predictant and lag τ ≤ 64 weeks such that no information too far in the past is considered.
However, the effect of a variable on ENSO at a short lead time increases the cross-correlation at a longer lead time, due to the effect of autocorrelation (Runge, 2014). To solve this autocorrelation problem, a Wiener-Granger causality F test (Sun et al., 2014) is performed between all predictors x 1 , · · · , x N and the predictant at lags τ . Note Granger causality is not the same as a "true" causality. If the test results in a low p value, the null hypothesis that x i does not cause in the Granger sense the predictant, due to Granger causality, is rejected at a low significance level (i.e. x i is more likely to cause the predictant due to Granger causality). Notice that both the cross-correlation and Wiener-Granger method give us merely an idea of which variables can be used for the prediction at different lead times. Both methods are linear, while the attributes will be used in a nonlinear method.
Finally, the T ×N dataset with selected attributes is used to predict the residual between the ARIMA forecast and the observations in an ANN. Besides using the NINO3.4 sequence itself, the additional attributes can be applied to add important information and improve the prediction.
In this paper, only a feed-forward ANN is applied, having a structure without loops. The input variables are linearly combined and projected to the first layer neurons according to (Bishop, 2006): Here z j is the value of the jth neuron of the layer; w (1) ji is the weight between input x i from neuron i to neuron j, where the (1) denotes the first layer. w (1) j0 is referred to as the bias. h is the sigmoid activation function, essential for incorporating the nonlinearity in the prediction model.
These z j can again be used as input for a second layer, which can be used for a third layer, and so on. Eventually this leads to some output which can be compared with the time series that must be predicted. Using a backward-propagating technique, the squared error t (y t −ŷ t ) 2 between the residual we are predicting y t and the output of the ANNŷ t will be minimized over the weights for the training set. The optimized function can then be tested on the test set. Initially, some random distribution of weights is used. The ANN part of the prediction will be performed with the toolbox Climate-Learn .
To summarize the tuning of the hybrid model: the ARIMA order and the hyperparameters controlling the ANN structure are tuned on the data, i.e. such that the prediction result is optimal. However, we will consider whether some set of different parameter values converges to similar predictions, which can show whether the hyperparameter tuning was a one lucky shot or not. The choice of the attributes is based on the ZC model giving a more physical basis for the infor-mation needed for a good prediction. To select them at a specific lag their cross-correlation and Wiener-Granger causality with the ENSO index and performance are also considered, which could lead to the replacement of an attribute with another attribute which is physically related.

Analysis of network properties and selection of ML attributes
In this section, topological properties of climate networks are analysed within the ZC model and observations, which lead to specific choices of attributes in the hybrid prediction model.

Network variables from the ZC model
Weekly spatiotemporal data on a 31 × 30 grid in the Pacific region are obtained for 45 years from the ZC model, to construct the climate networks. The first 5 years are not considered, to discard the effect of the initial conditions. A slidingwindow approach is used to calculate the network variables. This implies that a different network is calculated at each time, which is sliding 4 weeks ahead every time step. For the ZC model, either the thermocline network (from h), SST network (from T ), wind-stress network (from τ x ) or a combination of these are considered for network construction. Only the network variable which showed the same behaviour in the observations and in the ZC model is presented here. Other network variables with interesting properties can be found in Appendix A2.
The network variable of interest is c 2 (the proportion of nodes belonging to clusters of size two) of the thermocline network, because it indicates the approach to a percolation transition of the network during an El Niño event (Fig. 2). A window of 1 year is used. c 2 increases approximately 1 to 2 years before an El Niño event. This is mainly clear in the supercritical case. In the subcritical case, a clear warning of an event occurs when the oscillation of ENSO is more clear and the El Niños are stronger. Because c 2 is a warning signal of an El Niño event in the ZC model, we will look in the next section at how it behaves when it is calculated from observations.

Selecting attributes from observations
The ZC model results have given an indication of the network variables that could be used as attributes in the hybrid model to predict El Niño. Although the network variables show interesting behaviour in the ZC model for prediction, this is not always the case in observations. This section describes which variables, including a network variable, are implemented in the hybrid model and the selection of these attributes at different lead times. Notice that only anomalies of the time series in observations are considered.  First, from the recharge/discharge oscillator point of view, the WWV shows great potential for the prediction of ENSO (Bosc and Delcroix, 2008;Bunge and Clarke, 2014). Therefore it is used in the attribute set. The second attribute is a network variable related to WWV. The correlations of the SSH time series on a grid of 27 latitude points and 30 longitude points in the Pacific area are used to reconstruct a network with a threshold = 0.9 and a sliding window of 1 year. The SSH is used instead of thermocline depth, because more data is available and it is by approximation proportional to the thermocline depth (Rebert et al., 1985). During an El Niño event, the link density of this network increases in the warm pool and the cold tongue specifically, causing a percolationlike transition. As discussed in the previous section, an early warning could be obtained with c 2 . This variable allows us to extend the lead time of the WWV (Fig. 3). Third, atmospheric noise from the WWBs are a limitation for the prediction of ENSO (Moore and Kleeman, 1999;Latif et al., 1988). To obtain a variable related to the WWBs, the linear effect of the SST is subtracted from the zonal component of the wind stress. The second principal component (PC 2 ), explaining 8 % of the variance, is associated with these WWBs.  In Fig. 4, the principal component and its empirical orthogonal function (EOF) are presented. The peaks in the principal component are visible before the great El Niño events of 1982 and 1997. Thereby, the EOF has the typical WWB structure, being positive west from the dateline and negative east. Finally, the attribute set does not yet contain any information about the seasonal cycle (SC) yet. The phase locking of an El Niño event to boreal winter is very typical to ENSO. Therefore a sinusoid with a period of 1 year is used as attribute, to see if it can improve the prediction skill.
To determine at which lead time the different attributes should be applied, the cross-correlation and the p value of the Granger test between the attributes and NINO3.4 are considered (Fig. 5). The cross-correlations of PC 2 and the WWV show peaks at respectively 12 and 20 weeks, indicating their optimal lead times, since the p values of the Granger tests are low at every lag and autocorrelation does not play an important role. For c 2 , however, the cross-correlation increases up to the maximum considered lag, but the p value of the Granger test has a local minimum close to a lag of 44 weeks. According to these methods, c 2 is especially predictive at the longer lead times close to 44 weeks.
To summarize, we are interested in the variables that represent specific physical characteristics related to the prediction of ENSO, to select the attributes. Both c 2 and the WWV are related to the recharge/discharge mechanism. PC 2 is related to the atmospheric noise from WWBs. The SC is related to the phase locking of El Niño events to boreal winter. The hybrid model allows us to implement different variables in the attribute set at different lead times. Therefore, the crosscorrelations and Wiener-Granger causality were used to determine which attribute is more optimal at various lead times. This showed that it is better to use c 2 instead of WWV at lead times of more than 40 weeks. The other network variables which were interesting for the ZC model output (see the Appendix) are performing worse when applied to observations and hence are not used as attributes in the hybrid model.

Prediction results
This section presents the predictions of the hybrid model, as compared with observations and with alternative predictions from the CFSv2 model ensemble of NCEP. The skill with ANN structures up to three hidden layers is investigated. First, a comparison between both predictions is made for the year 2010 (Fig. 6). Moreover, several lead time predictions are shown and compared to the available CFSv2 lead time predictions. Next it is shown that these prediction models converge to similar results for different hyperparameters and when using different training and test sets in a crossvalidation method. Finally, a recent forecast is made and it is shown how the hybrid model predicts the development of ENSO the coming year. From now on, the normalized root mean squared error (NRMSE) is used to indicate the skill of prediction within the test set: Here y A k and y B k are respectively the NINO3.4 index and its prediction at time t k in the test set. n is the number of points in the test set. A low NRMSE indicates the prediction skill is better. For all presented hindcasts, the ARIMA prediction had a significant residual, which implies that the addition of the ANN part improved prediction.
The year 2010 is a recent example of an under-performing CFSv2 ensemble. Especially in January, all members of the ensemble overestimate the NINO3.4 index, resulting in an overestimation of the ensemble mean (see Fig. 6). The hybrid model is used to predict the same period, with ARIMA(12,1,1) and a 2 × 1 × 1 ANN structure with the 3month running mean of the WWV, PC 2 , the SC and NINO3.4 itself as attributes. In this case the hybrid model performs better than the CFSv2 ensemble. A 2 × 1 × 1 structure means a feed-forward structure with three layers of respectively two, one and one neuron. This ANN structure is found to be the best performing structure in terms of NRMSE at a 3-month lead time prediction. It will probably not be the most optimal ANN structure at other lead times.
Considering the 3-, 6-and 12-month lead time predictions, both the 3-and 6-month lead time prediction of the CFSv2 ensemble show some lag and amplification of the real NINO3.4 index (Fig. 7). The hybrid model predictions with ARIMA(12,1,0) resulting in a low NRMSE and relatively simple ANN structure within an ensemble consisting of 84 different ANN structures are also shown in Fig. 7. The 84 different structures are all structures up to three hidden layers with up to four neurons.
Comparing the 3-month lead prediction of the CFSv2 ensemble with the 4-month lead prediction of the hybrid model, both the amplification and the lag of the hybrid model prediction are smaller. While the lead time of the hybrid model is 1 month longer, the prediction skill is better in terms of NRMSE. The prediction skill of the hybrid model decreases at a 6-month lead compared to the 4-month lead time prediction. Thereby the lag and amplification of the CFSv2 prediction increase. Although the hybrid model does not suffer as much from the lag, it underestimates the El Niño event of 2010. In terms of NRMSE the hybrid model still obtains a better prediction skill.
Although the shorter lead time predictions show slightly better results than the conventional models, most important is a good prediction skill for larger lead times that appears to overcome the spring predictability barrier. To perform a 12month lead prediction which could overcome this barrier, the attributes from the shorter lead time predictions are found to be insufficient. However, c 2 of the SSH network has shown to be predictive at this lead time, according to its Granger causality and cross-correlation. Therefore the WWV is replaced by c 2 for this prediction, which is related to the same physical mechanism. In terms of NRMSE, the 12-month lead prediction even improves the 6-month lead prediction of the hybrid model. On average the prediction does not contain a lag in this period.
The hyperparameter values (i.e. the ARIMA order and the ANN structure) of the predictions in Fig. 7 could still be a lucky shot. Therefore the spread of the predictions with different hyperparameter values is shown in Fig. 8. For the ANN structures, nine optimal (in terms of NRMSE) predictions from the ensemble of 84 are considered. This resulted in a higher spread in the 6-and 12-month lead prediction compared to the 4-month lead prediction. For the ARIMA order all 9 ≤ p ≤ 14 are chosen, which resulted in almost no spread for the 3-and 12-month lead prediction and a higher spread in the 6-month lead prediction. Overall the models converge to similar predictions for those different hyperparameter values.
To test the robustness of these results, a series of crossvalidations has been performed on the prediction models of Fig. 7. Several percentage splits have been chosen for the training and test set (65-35, 70-30, 75-25 and 80-20) is not necessarily satisfied anymore. This allows us to make full use of the short time series we have (Bergmeir and Benítez, 2012). If the results for different training and test sets do not deviate much, it is evidence that   Red is the CFSv2 ensemble prediction mean and the shaded area is the spread of the ensemble. The hybrid model prediction in blue is given by predictions from hybrid models found to be most optimal at the different lead times with ARIMA(12,1,0). The dashed blue line is the running 12-month lead time prediction. the model can generalize to an arbitrary training and test set. The different percentage splits are chosen since the size of a training set could possibly have an influence on the prediction model. The cross-validation results of the hybrid models of Fig. 7 are presented in Fig. 9. At all three prediction lead times, the peaks coincide at the same NRMSE for different training-test set ratios. Therefore the different sizes of training and test sets do not seem to influence the result. However, the width of the peaks increases when the prediction lead time increases. This implies the prediction skill becomes more sensitive to the choice of the training and test set with higher lead time. Interestingly, at the 4-and 6-month lead time predictions, the average NRMSE is lower than the NRMSE of the prediction of Fig. 7. This implies the predictions with a different training and test set are on average even better than the prediction shown in Fig. 7.
Finally, a prediction is made for the coming year in Fig. 10. Different hybrid models are used at different lead times with ARIMA(12,1,0). ANN structures are chosen that are found to be optimal at the different lead times. For the predictions up to 5 months, the attributes WWV, PC 2 and the SC are used from 1980 until the present. For the 12-month lead prediction, the WWV is replaced by c 2 again. This time c 2 is computed from the SSALTO/DUACS dataset. Therefore, only a dataset from 1993 until the present has been used to train the model and perform the 12-month lead prediction.
Interestingly, as can be seen in Fig. 10, the hybrid model typically predicts much lower ENSO development than the CFSv2 ensemble. The uncertainty in the CFSv2 ensemble is large, since the spread of predictions is between a strong El Niño (NINO3.4 index between 1.5 and 2) and a moderate La Niña (NINO3.4 index between −1 and −1.5) for the coming 9 months. The hybrid models predict development to a strong La Niña (NINO3.4 index lower than −1.5) the coming year. From the time of writing, only time will tell which prediction is better. By the time of submission in early March 2018, La Niña conditions are present according to the Climate Prediction Centre of NCEP.

Summary and discussion
A successful attempt was made in this paper to use machine learning (ML) techniques in a hybrid model to improve the skill of El Niño predictions. Crucial for the success of this hybrid model is the choice of the attributes applied to the artificial neural network. Here, we have explored the use of network variables as additional attributes to several physical ones. Results of the ZC model provided several interesting network variables. Of these network variables, c 2 the amount of clusters of size two in a sea surface height (SSH) network constructed from observations, is found to provide a warning of a percolation-like transition in the SSH network. This percolation-like transition coincides with an El Niño event. This variable relates to the WWV and hence the recharge/discharge mechanism, but extends the prediction lead time of the WWV when applied in the prediction scheme. Furthermore, apart from both these quantities related to "recharge/discharge", the PC 2 and the seasonal cycle (SC) improve the prediction skill, representing respectively the WWBs and the phase locking of ENSO. The flexibility of implementing different variables at different lead times allows the hybrid model to improve on the CFSv2 ensemble at short lead times (up to 6 months). Furthermore, it had a better prediction result than all members of the CFSv2 ensemble in January 2010.
By including the network variable c 2 , we obtained a 12month lead time prediction with comparable skill to the predictions at shorter lead times. This prediction shows a step towards beating the spring predictability barrier. Using ML has the advantage of recognizing the early warning signal of c 2 as either a false or true positive. Therefore, it can be a more reliable method then considering a warning when the signal exceeds a certain threshold (Ludescher et al., 2014). Moreover, the early signal from the network variable is not only used to predict an El Niño event, but the development of ENSO, as the hybrid model provides a regression of the NINO3.4 index. ML serves as a tool which is able to recognize important, but subtle, patterns. Something the conventional statistical and dynamical models fail to do in the chaotic system. In the end, the predictions from May 2017 are discussed. By the time of writing, this is the prediction for the coming year. The CFSv2 ensemble mean predicts neutral conditions for the coming 9 months, with the spread between different members ranging from a strong El Niño to a moderate La Niña. The hybrid model predicts moderate to strong La Niña conditions for the coming year.
Although the results of the methods are promising, some adaptations to the methods which select attributes could still improve predictions. Several network variables resulted in a clear signal in the ZC model, but not necessarily for the observations. Perhaps the cross-correlation and a Granger causality test are not enough to determine the suitability of a variable in the observations. Testing all possible attribute sets in the prediction scheme and comparing results costs time. As a solution, the nonlinear methods "lagged mutual information" and "transfer entropy" can be techniques to select variables at different lead times. After all, the attributes are applied in the nonlinear part of the prediction scheme. Consequently, more variables might be found to increase the prediction skill.
Even though the currently applied network measures showed interesting properties, different climate network construction methods can still be interesting to apply. The Pearson correlation is a simple, effective method to define links between nodes. However, different properties of climate networks could be found when using mutual information instead. Moreover, the effect of spatial distance between nodes can be investigated and corrected for Berezin et al. (2012). Besides, we have limited ourselves to networks within the Pacific area itself. As ENSO is an important mode in the whole climate system, the area used for network construction might as well be extended. More specifically, it can be interesting to include the Indian Ocean in the network construction. Evidence is found that a cold SST to the west of the Indian Ocean is related to a WWB a few months later (Wieners et al., 2016). This could result in a variable related to WWBs, but increasing the lead compared to PC 2 , which is comparable to c 2 increasing the lead compared of the WWV.
By applying the ARIMA as a simple yet effective statistical method to apply in the first step of the scheme, the hybrid model shows promising results. However, the exact reason for how this model works remains a topic of investigation. The ARIMA prediction could be related to the linear wave dynamics. It can be interesting to replace the ARIMA part of the scheme by a dynamical model accounting for these linear wave dynamics. For the same reason, vector autoregression can be used instead of ARIMA. Being a multivariate generalization of an autoregressive model, this can implement the linear effect of other variables on ENSO.
Next to investigation of the exact reason the hybrid model works, some adaptations could still improve the prediction scheme. For example, it is assumed the linear and nonlinear part of the model are additive (see Eq. 3). This is not necessarily the case for the real system (Khashei and Bijari, 2011). Besides, the current model does not take into account possible nonlinear effects from the history, since the ANN describes a nonlinear function which does not depend on the history. The ANN probably succeeds here because of its performance for nonlinear time series in general. However, it could be interesting to investigate whether climate network properties comprise enough of the nonlinear dynam-ics by themselves, by combining them with a purely linear model. Moreover, the applied methods searched for a prediction model which is most optimal in terms of least squares minimization. However, it could be interesting to put larger weight on predicting the extreme events in the optimization scheme (as the 6-month lead predictions missed the 2010 El Niño event in Fig. 8), or find a function which is simpler (e.g. applying a support vector machine instead of ANN; Pai and Lin, 2005).
A general difficulty in El Niño prediction is the short available observational time series, also in other statistical prediction models (Drosdowsky, 2006). Although different hyperparameters (the ANN structure and ARIMA order) converge to a similar prediction and the prediction models perform well at different training and test sets, the short time series makes it difficult to perform another cross-validation method which completely rules out that the model is overfitting. Although the hybrid model and the attribute selection can clearly be improved, the results here have shown the potential for ML methods, in particular with network attributes, for El Niño prediction. The underlying reason for this success is likely that through the network attributes, more global correlations are taken into account which are needed to be able to overcome the spring predictability barrier.

Appendix A
This appendix summarizes the methods to calculate climate network properties. The methods improved the prediction in the ZC model, but not for observational data. Thus, they are not discussed in the main text. Appendix A1 defines the different quantities and Appendix A2 their application to the ZC model.

A1 Alternative network methods
From the unweighted network we compute the local degree d i of node i in the network as i.e. degree d i is equal to the amount of nodes that are connected to node i. The spatial symmetry of the degree distribution is of interest, since it informs where most links of the network are located. More specifically, our interest will be in the symmetry in the zonal direction in a network. Therefore, the skewness of the meridional mean of the degree in the network is calculated. This defines the zonal skewness of the degree distribution in a network.
The following two climate network properties are derived from a so-called NetOfNet approach. This is a network constructed with the same methods as previously, but using multiple variables at each grid point (as specified in Appendix A2). This gives a network consisting of the networks from the different variables interacting with each other. Only NetOfNet of two different variables are considered. First, the cross clustering contains information about the interaction between two unweighted networks. The local cross clustering of a node is the probability that two connected nodes in the other network are also connected to each other. The global cross clustering C vw is the average over all nodes in subnetwork G v of the cross clustering between G v and G w : Here r is a node in subnetwork G v of size N v , p and q are the nodes in the other subnetwork G w , and k r denotes the cross degree of node r (i.e. amount of cross links node r has with the other subnetwork). The second NetOfNet property is the algebraic connectivity. This is the second smallest eigenvalue (λ 2 ) of the Laplacian matrix as in Newman (2010) and describes the "diffusion" of information in the network. In general, λ 2 > 0 if the network has a single component.
A final network property ∆ makes use of a differently calculated network which is also undirected, but weighted. To construct it, the cross-correlation C ij (∆t) at lag ∆t, i.e. the Pearson correlation between the variables p i (t) and Figure A1. Global cross clustering between the SST and windstress network in blue and its variance in green in the ZC model. The coupling strength µ defined as a sinusoid around µc = 3 with an amplitude of 0.25 is in red. The sliding window is applied with a window of 5 years. p j (t+∆t) is considered. Then the weights between the nodes are calculated by Here max ∆t denotes the maximum, SD the standard deviation and mean the mean value over all time steps that are considered.
To calculate the property ∆ of the network, links are added to a network one by one, adding the link with the largest weight first (Eq. A3). At every step T that a link is added, the size of the largest cluster S 1 (T ) is calculated. At the point of the percolation transition, S 1 (T ) increases rapidly. The size of this jump is ∆: ∆ = max [S 1 (2) − S 1 (1), ··· , S 1 (T + 1) − S 1 (T ), ··· ] . (A4) The quantity ∆ can be used to capture the percolation-like transition (Meng et al., 2017).

A2 Climate network properties of the ZC model
Determining how strong noise can excite the ENSO mechanisms in the subcritical case, or determining whether the feedbacks sustain an oscillation in the supercritical state, could provide information to increase the prediction skill. Feng (2015) found that the skewness of the degree distribution S d of the network reconstructed from SST decreases monotonically with increasing coupling strength µ. Although S d relates to the climate stability and coupling strength, it does not inform whether the system is in either the supercritical or subcritical state.
Here, we introduce a NetOfNet variable which may represent properties of the stability of the background state: the global cross clustering (C vw ) between the SST and windstress network. A sliding window of 5 years with = 0.6 was used to compute the networks. In this case, the global cross clustering coefficient is a measure of the amount of triangles in the networks, containing one wind node and two SST nodes. In Fig. A1, this cross clustering is calculated from data from the ZC model, when coupling strength µ changes periodically in time around the critical value µ c ∼ 3.0. Under subcritical conditions, the noise has a larger influence on local correlations. This causes triangles to break and the variance in the cross clustering coefficient to increase. The cross clustering C vw is hence a diagnostic network variable which informs whether the state of the system is in the supercritical or subcritical regime.
Second, from the classical view of the oscillatory behaviour of ENSO, waves in the thermocline should contain memory of the system, because of their negative delayed feedback. The changing structure of the thermocline network is therefore of interest when predicting ENSO. Calculating this network with threshold = 0.6 and a sliding window with a length of 1 year, a zonal pattern in the change of the network close to the equator can be observed during an ENSO cycle. To compare network structures in the superand subcritical state, now constant µ = 2.7 (subcritical) and µ = 3.25 (supercritical) are taken. Generally, the degree field is quite spatially symmetric, but when the ENSO turns either from upward to downward, or from downward to upward, the degree of the nodes in the east decreases. This is at the peak El Niño or La Niña.
To capture this zonal asymmetry around the equator with a variable, the zonal skewness of the degree field will be used between 7 • S to 7 • N. The higher the skewness, the more the degree will be located west of the basin. If the skewness is close to zero, the degree is symmetrically distributed over the basin. If it is low, most of the degree is situated in the east. The skewness will show a negative peak when the ENSO index is at its highest or lowest point in the cycle (Fig. A2). In the supercritical case µ = 3.25 this effect is indeed observed. Nevertheless, in the subcritical case, the pattern is only visible once the ENSO index shows a clear oscillation (around year 32).
Third, the quantity ∆ behaves similar to c 2 , when calculated from the same (thermocline) network. Although ∆ does not depend on a chosen threshold like c 2 , it peaks closer to an El Niño event.
Finally, the algebraic connectivity (λ 2 ) can show the spread of information within a network. Specifically, when considering an unweighted NetOfNet from thermocline depth (h) and zonal wind (τ x ) with threshold = 0.6. The spread of information is relatively high before an event, but also after an event, such that λ 2 peaks both before and after an El Niño event (both for µ = 2.7 and µ = 3.25).
Data availability. All used observational data are from third parties and are either cited or can be found by URL as specified in Sect. 2.1.