Multivariate anomaly detection for Earth observations : a comparison of algorithms and feature extraction techniques

Today, many processes at the Earth’s surface are constantly monitored by multiple data streams. These observations have become central to advancing our understanding of vegetation dynamics in response to climate or land use change. Another set of important applications is monitoring effects of extreme climatic events, other disturbances such as fires, or abrupt land transitions. One important methodological question is how to reliably detect anomalies in an automated and generic way within multivariate data streams, which typically vary seasonally and are interconnected across variables. Although many algorithms have been proposed for detecting anomalies in multivariate data, only a few have been investigated in the context of Earth system science applications. In this study, we systematically combine and compare feature extraction and anomaly detection algorithms for detecting anomalous events. Our aim is to identify suitable workflows for automatically detecting anomalous patterns in multivariate Earth system data streams. We rely on artificial data that mimic typical properties and anomalies in multivariate spatiotemporal Earth observations like sudden changes in basic characteristics of time series such as the sample mean, the variance, changes in the cycle amplitude, and trends. This artificial experiment is needed as there is no “gold standard” for the identification of anomalies in real Earth observations. Our results show that a well-chosen feature extraction step (e.g., subtracting seasonal cycles, or dimensionality reduction) is more important than the choice of a particular anomaly detection algorithm. Nevertheless, we identify three detection algorithms (k-nearest neighbors mean distance, kernel density estimation, a recurrence approach) and their combinations (ensembles) that outperform other multivariate approaches as well as univariate extreme-event detection methods. Our results therefore provide an effective workflow to automatically detect anomalies in Earth system science data. Published by Copernicus Publications on behalf of the European Geosciences Union. 678 M. Flach et al.: Multivariate anomaly detection for Earth observations: a comparison


Introduction
The Earth system can be conceptualized as a system of highly interconnected subsystems (e.g., atmosphere, biosphere, hydrosphere, lithosphere). Each of these subsystems can be monitored and characterized by multiple variables. Technological progress over the past decades has led to a boost in satellite technologies (Pfeifer et al., 2011;Nagendra et al., 2013) as well as ground station development and routine monitoring (Baldocchi et al., 2001;Dorigo et al., 2011;Ciais et al., 2014). Additionally, advanced computational methods efficiently integrate remote sensing and in situ information to routinely derive novel data products (e.g., Beer et al., 2010;Jung et al., 2011;Tramontana et al., 2016). One key scientific challenge is co-interpreting these multiple views of the Earth system, in particular to address the impacts of changes in the climate system, the land use system, and other transformations.
Of particular importance is the analysis of extreme events like droughts, fires, heat waves, or floods, which are expected to change in a future climate (Kharin et al., 2013). One matter of concern is changes in hydrometeorological extremes that may translate into anomalies in vegetation dynamics, or extremes in vegetation dynamics that might result from slight changes in climatological conditions or human intervention and that can have severe consequences for vegetation and the carbon cycle (Easterling et al., 2000;Meehl and Tebaldi, 2004;Seneviratne et al., 2012;Reichstein et al., 2013). Apart from natural events, one also aims to detect events that are a direct consequence of human interference, e.g., detecting deforestation activities is required to assess the compliance with laws or agreements on forest conservation and climate change.
The flood of observational data is accompanied by a similar increase in data from Earth system models (Overpeck et al., 2011). As large numbers of data are difficult to handle and to translate into quantities of human interest, it can be easy to overlook events of particular importance. For example, using a simple semiautomatic detection scheme to identify abrupt climate shifts in simulations of future climate, Drijfhout et al. (2015) found a number of abrupt events that had previously been overlooked in simulations.
In observations, anomalous events are often detected using extreme event detection methods suitable for univariate data streams (e.g., Alexander et al., 2006;Rahmstorf and Coumou, 2011;Zhou et al., 2011;Donat et al., 2013;Lehmann et al., 2015). Univariate extreme event detection can also be used to infer knowledge about underlying drivers of extremes (Zscheischler et al., 2014a); it is particularly valid when the variable of interest is either of specific importance or integrates a wide array of relevant processes. However, some information might only be inferred when taking the multivariate combination of several data streams into account (Vicente-Serrano et al., 2010;Seneviratne et al., 2012;Fischer, 2013;Zscheischler et al., 2015). For instance, a significant fraction of events of carbon extremes in Europe is not associated with univariate climate extremes (Zscheischler et al., 2014b). Earth observations (EOs) are multivariate and naturally characterized by strong dependencies and correlations in space, time, and across dimensions (Leonard et al., 2013). We assume that any suitable anomaly detection algorithm needs to consider these data properties. By considering multivariate constellations for anomaly detection, it might become possible to gain further information, i.e., about anomalies that cannot be detected with univariate extreme event detection methods (for a review of approaches see, e.g., Ghil et al., 2011).
Multivariate approaches in geoscience make use of anomalies occurring simultaneously in multiple data streams, often referred to as coincidences or co-exceedances (e.g., Donges et al., 2011b;Rammig et al., 2015;Zscheischler et al., 2015;Donges et al., 2016;Guanche et al., 2016;Siegmund et al., 2016). An alternative is the copula approach introduced to the field by Schoelzel and Friedrichs (2008) and Durante and Salvadori (2010). However, the copula approach is limited so far to two or three simultaneous data streams (Mikosch, 2006), which makes it unsuitable for highdimensional data as used in this paper.
Interestingly, there are multiple industrial applications that likewise require anomaly detection. In this context, anomaly detection has become a standard procedure in the wake of Harold Hotelling's publication of the T 2 control chart in 1947 (Hotelling, 1947;Lowry and Woodall, 1992). Consider, for instance, several sensors observing some industrial production chain. These (potentially correlated) sensor data streams can be monitored with a statistical process control (SPC) algorithm (Lim et al., 2014;Ge et al., 2013;Lowry and Montgomery, 1995). The basic idea is to raise an alarm as soon as an anomaly according to the SPC is detected, meaning that the production chain is out of control. Despite the obvious analogy, the ideas of SPC are largely unknown in the geoscience community to the best of our knowledge. Conceptually, the industrial application is equivalent to the idea of monitoring environmental variables. However, data differ. EOs exhibit strong (potentially nonlinear) dependencies among the variables; seasonal cycles are typically present in both temporal mean and variance. The variables may also encode dynamic feedbacks and abrupt transitions. EOs are possibly more strongly corrupted by noise compared to industrial applications. Furthermore, industrial applications are typically less affected by low-frequency variability than EOs. The most problematic aspect when considering SPC concepts in Earth system sciences is, however, defining states of normality.
The objective of this study is to provide an overview and comparison of anomaly detection algorithms and their combination with feature extraction techniques for identifying multivariate anomalies in EOs. Spatiotemporal EOs are therefore stored in the Earth system data cube, which is a four-dimensional array of latitudes, longitudes, time, and dif-ferent measurement variables. To detect multivariate anomalies in EOs, we define an anomaly to be any consecutive spatiotemporal part of the data cube that differs with respect to the mean, the variance, the amplitude of the seasonal cycle, or trends from the normal rest of the data cube. We adapt algorithms from SPC and novelty detection. The study is structured as follows: first, we create a series of artificial Earth system data cubes that try to mimic a series of real world features (in terms of multiple variables, seasonal cycles, and correlation structure, etc.). We are aware that these artificial data cubes are not real simulations of Earth system data cubes. However, relying on artificial data in this paper is motivated by the fact that a meaningful quantitative evaluation of unsupervised anomaly detection algorithms and feature extraction techniques in real Earth observation data is difficult due to the lack of ground-truth data (Zimek et al., 2012). Second, we use these artificial data to evaluate the capability of different algorithms to detect multivariate anomalous events, including compound events (e.g., events in which none of the single variables are extreme, but their joint distribution is anomalous and might lead to an extreme impact) (Seneviratne et al., 2012;Leonard et al., 2013). Specifically, we evaluate the performance of the algorithms in detecting multivariate changes in the mean (comparable to an extreme event), the amplitude of the annual cycle, the variance, and the onset of trends. Using the artificial dataset as a test bed we apply various feature extraction schemes (Sect. 3.1), several detection algorithms (Sect. 3.2), and combinations of detection algorithms (ensembles, Sect. 3.4) to compare their performance in identifying anomalous events (Sect. 3.3). From this comparison we select suitable combinations of feature extraction (Sect. 4.1) and a few algorithms (Sect. 4.2) as well as ensembles of algorithms (Sect. 4.3) as the best ones applicable to EOs, including suggestions for their specific usage (Sect. 5).

Generation principle of the artificial data
Ground truth for detecting anomalies in multivariate data is rare, in particular for detecting anomalies in real EOs. Thus, we generate artificial data that represent common properties of EOs, including anomalies. In particular, we focus on the existence of seasonality, correlations among variables, and non-Gaussian distributions. Data generation assumes that each subsystem of the Earth has uncorrelated intrinsic properties, i.e., it is dominated by a few independent components. Consequently, generating these independent components (which cannot directly be monitored) is the first step. We then derive variables that contain elements of all independent components and correspond to the observable measurements as a set of correlated variables (Fig. 1).
More precisely, as a basic version we create three independent components for the artificial data, each consisting of a signal (Gaussian, SD = 1.0) that includes seasonality in Figure 1. Combination of three independent component cubes to derive 10 correlated variables X as observable measurements. The anomalous event is propagated into some variables of X. some cases (Sect. 2.2). Anomalous events are induced in one of the independent components for which we track the exact spatiotemporal location. These three independent components are then weighted with randomly generated linear (or nonlinear, Sect. 2.3) weights to create a set of 10 correlated variables, which represent the artificial data cube, i.e., try to mimic observable measurements. We add some additional measurement noise (Gaussian, SD = 0.3) to the data cube. For more technical details of this generation scheme we refer the reader to the Appendix A.

Generating anomalous events
Anomalous events are introduced in the independent components only and then propagated from the independent component to some of the variables in the data cube with random weights. The anomalies are contiguous in space and time. The center of the anomaly is assigned randomly. The challenge is to detect the propagated anomaly through the unsupervised algorithms, i.e., without using the information about the spatiotemporal location of the anomaly. With this data cube generation scheme, we can generate anomalies by controlling the type of the anomalous event (event type), the magnitude of the anomalous event, and the spatiotemporal location.
We create four data cubes using the following temporary event types: a. a shift in the baseline, i.e., shift of the running mean of a time series (BaseShift) (Fig. 2a). This event type is closely related to extremes in real world EOs.
b. an onset of a trend in the time series (TrendOnset) (Fig. 2b).
c. a change in the amplitude of the mean seasonal cycle of a time series (MSCChange) (Fig. 2c) Figure 2. Visualization of the four different event types (a-d) with two variables along time t i,j = 1, . . ., T (T = 300). The two variables contain an anomalous event (red shape) that is propagated through the underlying independent components with randomly drawn weights within the generation process of the variables. For illustration purposes two variables are shown for one specific magnitude of the anomaly. The artificial data farm encompasses 10 variables and anomalous events of 20 different magnitudes ranging from very subtle to exceptionally high changes.
happen in the real world carbon cycle as a response to combined drought-heat waves (Ciais et al., 2005).

Additional data properties
Apart from the basic data cubes, we want to test the influence of certain data properties on the anomaly detection algorithm. In order to do so, we create data cubes, each with one added data property, i.e., we increase the number of independent components (MoreIndepComponents) or use a squared dependency among independent components (NonLinearDep) instead of a linear one. Furthermore, typical EO variables are often driven by extrinsic forcings, i.e., the Earth's solar system orbit, rotation, and axis tilt. Thus, we add a seasonal cycle modifying the signal (SeasonalCycle).
In a global context, the mean is rarely constant; we therefore introduce a linear latitudinal trend into the baseline (Latitu-dinalGradient). In the basic case, the signal of our independent components follows a Gaussian distribution. In the more complicated versions, we also implement alternative scenarios with Laplacian (doubly exponentially) distributed signals (LaplacianNoise) and signals that exhibit spatiotemporal correlation with red noise (CorrelatedNoise). Signal-to-noise ratio is 0.3 in the basic version, one additional data property increases the signal-to-noise ratio to 1.0 (NoiseIncrease). Also, the shape and duration of anomalous events differ. We double (LongExtremes) or reduce the temporal duration of the anomalous events (ShortExtremes) and change the spatial shape from rectangular to randomly affecting neighboring grid cells (RandomWalkExtreme).

Experiment design
Each data cube with a specific type of the event is generated 20 times, each time with a different magnitude of the anomalous event (Appendix A). We introduce 10 spatially contiguous anomalous events into the independent components, with a spatial extent of 20 latitude and longitude steps each. Each event has a temporal extent of five time steps (which would be equivalent to 40 consecutive anomalous days in a 6.5-year record). Our total number of anomalies equals about 3 % of the total data cube, which we consider to be a realistic scenario (comparable to Zscheischler et al., 2014a), for example. Some latitudes and longitudes do not exhibit any anomaly by design. The algorithms (Sect. 3.2) are expected to be able to deal with parts of the data cube that do not exhibit anomalies at all, as this is also very likely to happen for applications in real EOs. Our experiment comprises 36 different event-type combinations of data properties, each repeated 20 times with varying event magnitudes (Appendix A). The entire set of artificial data cubes consists of 720 data cubes, corresponding to ≈ 87 GB of data 1 . . Data processing for detecting multivariate anomalies. We extract relevant features from each artificial data cube before applying the detection algorithms. The detection algorithms output some anomaly score, which we evaluate against the known extent of the event using the area under the curve (AUC). Feature extraction elements on the right-hand side are understood as options and can be combined with each other.

Workflows to detect anomalies
The idea of this study is to elaborate workflows that contain both data preprocessing via feature extraction and algorithms for the detection of anomalous events (Fig. 3). In the following we introduce these two elements separately and explain the performance evaluation strategy afterwards.

Feature extraction
Feature extraction is a process to derive information from the data and condense it into nonredundant characteristic patterns. This may facilitate data interpretation (van der Maaten, 2009). In our study the aim is to maximize the detection of anomalous events by providing relevant features. Feature extraction is often an element of data preprocessing. A very simple form of feature extraction could be to subtract the mean seasonal cycle. We consider the anomaly time series to be the extracted feature in this case. Here, we concentrated mainly on feature extraction methods that are used in the context of classical multivariate SPC (Lowry and Montgomery, 1995), data-based process monitoring in industry (Ge et al., 2013), and univariate extreme event detection. The following feature extraction methods are used in this study: Subtracting the median seasonal cycle (sMSC) is one way to deseasonalize time series. Deseasonalization may be instrumental in detecting anomalous events across different seasons. The remaining part of the time series is often referred to as anomalies in the climatological sense. These anomalies are used here as an input feature. Please note, that the climatological anomalies are only the difference between the mean behavior and thus are not to be mixed up with anomalies (strange or rare regions in the data, closely related to extreme events) as detected through the (multivariate) anomaly detection algorithms (Sect. 3.2).
Computing the moving window variance (mwVAR) is a popular technique for detecting trends in the variance in univariate time series (e.g., Huntingford et al., 2013). We choose a window size of 10 and compute the variance in the running window along the time series of each variable. We use the estimates of the mwVAR time series as a feature to detect multivariate anomalies in the variance.
Time delay embedding (TDE) increases the feature vector ) to include temporal context information. In the univariate case, this approach ideally creates an image of the attractor of a dynamical system (Takens, 1980). In highdimensional multivariate data applications it is used to include information of the dynamics in the feature vector (e.g., Koçak et al., 2004;Ge et al., 2013;Smets et al., 2009). Critical hyperparameters are the time delay τ and the number of dimensions m. We fix m to 3 (corresponding to the number of independent components within the data farm creation) and τ to 6, which is a compromise between the typical choice of the first zero crossing of the temporal autocorrelation function or the first local minimum of the mutual information (Webber Jr. and Marwan, 2015) (here: 11.5 corresponding to one-quarter of the annual cycle with 46 time steps) and an accurate temporal detection (requires small τ ).
Principal component analysis (PCA) is a data rotation, used to find an orthogonal (uncorrelated) subspace of the data of n PC ≤ VAR variables (Von Storch and Zwiers, 2001). We choose n PC such that at least 95 % of the variance in the original data cube is explained. By assuming a homogeneous covariance structure within the entire data cube, we perform the PCA globally, i.e., with the same rotation matrix for all grid cells. The combination of TDE and PCA is sometimes referred to as dynamic PCA when considering subsequent lags in the time series (Lee et al., 2004).
Independent component analysis (ICA) can be regarded as a nonlinear alternative to PCA; it has become a standard technique of data-based process monitoring. We use one ICA variant that tries to separate different sources of data by maximizing the negentropy, a measure of non-Gaussianity of the data (Hyväringen and Oja, 2000) 2 . We apply ICA globally to each data cube. The hyperparameter is the number of independent components (sources). We choose the number of independent components to be equal to n PC (see PCA) for consistency reasons (Majeed and Avison, 2014).
Exponentially weighted moving average (EWMA) is one way of reducing the noise of the time series and taking temporal information into account. It is common in the context of classical multivariate SPC to detect only "significant" outliers (Lowry and Woodall, 1992). The multivariate feature time series Y is computed recursively as The hyperparameter λ determines the degree of exponential weighting between 1 (no weighting) and zero (common choice 0.1 ≤ λ ≤ 0.3; Santos-Fernández, 2013). We stay in this range with λ = 0.15. There is of course a multitude of alternative approaches available in the literature, but we focus on the previously summarized ones as they are widely used and efficiently implemented. Furthermore, different feature extraction methods can also be combined (Fig. 3). As the number of possible combinations is considerably large, we focus here on dimensionality reduction techniques (ICA, PCA) combined with some EWMA to reduce the noise level afterwards. Depending on the event type and data properties, additionally removing seasonality (sMSC) or including the variance mw-VAR seems to be straightforward. Information about the dynamics (TDE) can be included before applying dimensionality reduction techniques to keep the dimensionality of the system as low as possible. In the following, combinations are noted in the order in which they were applied (e.g., PCA_EWMA means first applying PCA, then applying EWMA to the PCA features). In some cases this might lead to non-commutative combinations, especially for nonlinear feature extraction techniques (ICA, TDE).

Anomaly detection algorithms
We use several detection algorithms that we implemented in the Julia package MultivariateAnomalies (https://github. com/milanflach/MultivariateAnomalies.jl). Some anomaly detection algorithms require the estimation of parameters (details are given below for each algorithm separately). In that case we fix the model parameters for the entire data cube. We estimate model parameters (σ , ε, Q, µ; see below) and train the models themselves (support vector data description, kernel Null Foley-Sammon Transform, KNFST; see below) based on a random subsample of 5000 data points obtained from the entire data cube. To account for variability in the model parameter estimation, we resample three times. More resampling is not affordable due to high computational costs of processing the large number of data cubes. However, very little random variability is observed with this sample size for the best algorithms. Thus, we consider a resampling of three times to be sufficient for a first attempt to account for variability in the parameterization. The following algorithms are investigated for anomaly detection.
Univariate approach (UNIV) is a simple approach to define extremes in univariate data by identifying all points above (or below) a certain quantile. This so-called "peakover-threshold" approach can be transferred to deal with multiple univariate data streams. In this case, one would consider a data point to be extreme if one or several of the univariate variables are above (or below) a certain quantile threshold of the marginal distributions of each single variable. (here: globally) (e.g., Ledford and Tawn, 1996;Bae et al., 2003;Donges et al., 2016). Applications of the so-called co-occurrence or coincidence analysis can be found in Donges et al. (2011b), Rammig et al. (2015), Zscheischler et al. (2015), Guanche et al. (2016), and Siegmund et al. (2016). For comparing the algorithms, we are interested in the information that at least one variable is above a certain threshold. We compute this information for different thresholds (in terms of quantiles of the marginal distributions between 0.0 and 1.0, accuracy 0.01) to get a score, i.e., a ranking of the extremeness of the data points.
Hotelling's T 2 (T 2) computes the squared Mahalanobis distance of each data point X t to its temporal mean µ weighted with the covariance matrix Q (Hotelling, 1947): A crucial prerequisite is the estimation of the covariance matrix Q, which is estimated from the random subsample of 5000 data points. Combining the feature extraction EWMA with T 2 equals the traditional multivariate exponential weighted moving average (Lowry and Woodall, 1992;Lowry and Montgomery, 1995). Apart from computing weighted distances to the mean (like T 2), it is also possible to compute pairwise Euclidean distances in variable space d(X t i , X t j ) between vectors X t i and X t j of time steps t i and t j for all possible time steps is often referred to as distance matrix or dissimilarity matrix. For real world data, variables have to be standardized with care before computing the distance matrix (Sect. 5). However, in the artificial data used the variables are already comparable by construction; thus, standardization is not needed. The following algorithms are based on pairwise distances.
K-nearest neighbors (KNNs) can be used for anomaly detection by considering the mean distance to the k-nearest neighbors (k-nearest neighbors Gamma, KNN-Gamma) and the length of the mean of the vectors pointing from X t i to its k-nearest neighbors (k-nearest neighbors Delta, KNN-Delta) (Harmeling et al., 2006;Ramaswamy et al., 2000). With the latter approach KNN-Delta also considers the direction of the neighbors, i.e., has higher values in case its nearest neighbors are pointing in one direction, which is in contrast to the directionless distance of KNN-Gamma. We fix the hyperparameter k at 10 after carefully trying different choices for k without seeing major effects on preliminary results. Furthermore, we exclude trivial temporal autocorrelations by excluding five neighboring time steps (abs(t i − t j ) ≥ 5) to also be nearest neighbors.
Recurrences (REC). Within the framework of the theory of nonlinear dynamical systems, each state of a dynamical system will revisit a particular region in its phase space, if waiting for a sufficiently long time (Poincaré, 1890). These dynamics can be visualized in the recurrence plot and are quantified with several metrics usually referred to as recurrence quantification analysis (Marwan et al., 2007). It seems straightforward to use the concept of recurrence analysis to detect states in a dynamical system that are considered to be rare or unusual. Faranda and Vaienti (2013) used the concept of recurrences and combined it with extreme value theory. We want to use a more general approach without binning the time series. We count the number of observations ζ falling into a certain ε ball in a system of multiple variables, condensed by their distance d(X t i , X t j ): ( 3) (z) is the Heaviside function, coding the distances to binary values ( (z) = 0 if z < 0, (z) = 1 otherwise). An ε hyperball containing only few recurrent observations is considered to be rare in comparison to the majority of ζ values. We compute 1−ζ ·T −1 to get anomaly scores, which are more likely to be an anomaly for high score values. ζ · T −1 is known as local recurrence rate or degree density in recurrence analysis (Marwan et al., 2007;Donner et al., 2010) (Donges et al., 2012). ε is the crucial hyperparameter, defining the radius of the ball. Typical choices of ε in recurrence analysis are quantiles of the distribution of elements of the distance matrix, e.g., 5 or 10 % (Donges et al., 2011a;. As we are not interested in small-scale variations in REC, but more in major anomalies we estimate ε as median of the distance matrices in the random subsample. This choice turned out to be the optimal choice (in terms of maximizing the area under the curve, AUC; Sect. 3.3) for ε in a small simulation, varying the thresholds between the 5 and 95 % quantiles of the element of the distance matrix (Supplement Fig. S1). We exclude five neighboring time steps to be counted as recurrences (similar to KNN). KNN has similarities to REC, as one could also choose a data-adaptive k such that ζ = k.
The distance matrix D can be transformed into a kernel matrix K = exp(−0.5 · D · σ −2 ), i.e., by computing pairwise dissimilarities using Gaussian kernels centered on each data point.
Kernel density estimation (KDE) is a standard technique for estimating densities based on column means of the kernel matrix K (Parzen, 1962). The bandwidth σ of the kernel is a hyperparameter. We estimate σ by using the median of the temporal distance matrix on the random subsample, which is a common choice Schölkopf et al., 2015). Support vector data description (SVDD) models the distribution of the training data with an enclosing hypersphere in a high-dimensional kernel feature space (Tax and Duin, 2004). As usual a kernel matrix of the random subsample is used for training. Although being a rather simple data description, a hypersphere in the kernel feature space can result in complex nonlinear decision boundaries in the original space of predictor variables if a nonlinear kernel function is used. In addition to the σ hyperparameter of the kernel function (see KDE), the SVDD approach has a parameter called outlier ratio ν (fixed to 0.2). The outlier ratio ν controls the number of training samples that can be located outside of the hyper-sphere to prevent overfitting. As anomaly score for testing, its distance to the center of the hypersphere in the kernel feature space is computed. Testing requires pairwise similarities between test and training samples. For performance reasons in terms of computation time, we used the implementation by (Chang and Lin, 2013) of the one-class support vector machine , which is an alternative formulation that leads to identical data descriptions as SVDD in our setup.
kernel Null Foley-Sammon Transform (KNFST) maps the training data into a so-called null space, in which the training samples have zero variance, i.e., all training samples are mapped to the same point called the target value (Bodesheim et al., 2013). Nonlinearity is incorporated by using a kernel matrix containing pairwise similarities of the training samples (training on the random subsample as for SVDD). Since all training samples are represented by a single target value in the one-dimensional null space, the anomaly score of a test sample is the absolute difference between its projection in the null space and this target value. The projection of the test sample requires pairwise similarities to the training samples. Compared to SVDD no parameters need to be tuned except for σ of the kernel function that is fixed to the same values for all kernel methods.

Ranking of the workflows
Given the large number of potential combinations of feature extraction and anomaly detection algorithms, we need an objective criterion to compare the performances of the numerous possible workflows. We use the area under the receiver operator characteristics curve (AUC) as our measure of detection skill for a specific event type (Fawcett, 2006). The AUC is based on the fraction of events that are correctly detected (true positives) and the fraction of detections among all non-events (false positives), for all possible decision thresholds that could be applied to scores produced by the algorithms. AUC values of 0.5 would be achieved by random detection, and values below 0.5 indicate that a lower score is more likely assigned to (true) anomalies than to nonanomalies.
For each data cube with a given event magnitude and event type we compute the AUC for each data property, feature extraction, and algorithm combination. This leads to an entire catalogue of possible combinations, namely 1.27×10 5 (4 event types, 20 event magnitudes, 11 data properties, 18 feature extraction combinations, 8 algorithms). The number of combinations strongly requires simplification to infer knowledge about which combination is advisable to use. Hence, we focus on events of magnitudes typically detected in real world data i.e., deviations from the mean (extremes) larger than 2 SD (e.g., temperature extremes in Hansen et al., 2012), a relative increase or decrease in the mean annual cycle amplitude of 25 % (which might happen in the carbon cycle after combined drought and heat waves (Ciais et al., 2005) or in the Arctic due to abrupt sea ice losses (Bintanja and van der Linden, 2013;Bathiany et al., 2016), for example) or an increase in the signal variance of 25 % (e.g., in temperature; Huntingford et al., 2013).
One way of summarizing the results of such a large number of combinations is treating the AUC values as the outcomes of an experiment in which the different design decisions (e.g., feature extraction techniques, anomaly detection algorithms) are the experimental factors. As a control treatment we introduce the simplest possible approach to detecting the anomaly: UNIV approach on the selected event type, without any further data properties (e.g., short extremes or increased measurement noise) on the event type and without prior feature extraction. In order to assess the (averaged) effect of each experimental factor, we fit a linear mixed-effect model (Pinheiro et al., 2016) to the AUC data (fixed effects: data properties, feature extraction, anomaly detection algorithms; random effect: magnitude of the event). This model's coefficients express the overall effect of a factor level with respect to the control while averaging over all other experimental factors. They are considered to be significant for p < 0.01.

Ensembles of anomaly detection algorithms
Summarizing the output of several anomaly detection algorithms is one way to create more robust results (Thompson, 1977). For better comparability of the algorithms' outputs, we rank them by computing the percentiles of the algorithm scores. These are then aggregated into ensemble scores by computing the minimum (consensus voting), the mean (balanced voting), or the maximum (risky voting) of the scores of selected well-performing algorithms (e.g., Aggarwal, 2012;Zimek et al., 2013).

Results and discussion
In the following, we present the performance of the workflows in subsections corresponding to feature extraction techniques (Sect. 4.1), anomaly detection algorithms (Sect. 4.2), and ensembles of detection algorithms (Sect. 4.3). Specifically, we present the AUC difference to the UNIV control, i.e., the output of the linear mixed-effect model on the experimental factors feature extraction and detection algorithm (Fig. 4). The corresponding tables present the estimates as well as the RVP (Tables 1, 2). Apart from the model the full range of AUC values with respect to different event magni-tudes, data properties, and event types is presented in Appendix B, Fig. B1.

Feature extraction techniques
Feature extraction techniques are often more important than the detection algorithm itself (Fig. 4). However, we find that choosing a suitable feature extraction technique largely depends on the event type of interest. Therefore, the feature extraction techniques are presented for different event types separately.
BaseShift Shifts in the baseline are simulated to mimic extreme events. Increasing the magnitude (in terms of standard deviations) of a BaseShift makes it easier to detect the event (Fig. B1). Dimensionality reduction (via PCA or ICA) is a crucial feature extraction technique step as it derives meaningful uncorrelated subsets of the data (Fig. 4a). The combination of dimensionality reduction with some temporal smoothing (EWMA) does not exhibit better overall performance (Fig. 4a) as it fails for ShortExtremes due to oversmoothing. Nevertheless, EWMA can improve the detection rate for special cases, i.e., long events (LongExtremes) and high signal-to-noise ratios (NoiseIncrease) (Fig. B1).
TrendOnset Results look very similar to those of Base-Shift, except that temporal smoothing with EWMA has a stronger positive effect than for BaseShift. This may be related to the fact that events for TrendOnset are longer than those for BaseShift. Since the algorithms used in this work are not specifically designed to detect the onset of linear trends, we speculate that their capability to detect such anomalies may be related to their ability to detect base shifts. While algorithms specifically designed to detect changes in trends (e.g., Forkel et al., 2013)) were not included in our work due to our focus on more generic types of anomalies, such specialized algorithms may perform better for this particular class of anomaly.
MSCChange In the detection of MSCChange, most feature extraction algorithms showed some skill in the detection of an amplitude increase, while only a subset of these also succeeded in detecting decreases in amplitude (Fig. B1). We focus on the latter ones, which have one step in common: they subtract the median seasonal cycle before applying the detection algorithm (sMSC) (Fig. 4c). In line with the results for TrendOnset and BaseShift, temporal smoothing in combination with dimensionality reduction improves detection by a large margin (PCA_sMSC_EWMA). Furthermore, accounting for temporal dynamics with a TDE is even more suitable (TDE_PCA_sMSC_EWMA).  VarianceChange The algorithms used are hardly able to detect any decrease in variance (Fig. B1). This may be due to an overwriting of the decrease in signal variance with the independent noise since we are using a signal-to-noise ratio of 0.3. Thus, we exclude a decrease in the variance from the evaluation of the detection algorithms compared to the control. The detection of an increase in the variance can be improved by a combination of dimensionality reduction and variance in a moving window (PCA_mwVAR) (Fig. 4d). Using the variance in a moving window is a popular approach (Huntingford et al., 2013) although it has to be applied with care when used in conjunction with normalization procedures (Sippel et al., 2015).
SeasonalCycle Seasonality occurs in most EOs. Not accounting for the seasonal cycle has a negative impact on the AUC (Appendix B, Fig. B2a, b, d). However, if we subtract the median cycle within the feature ex-traction step (PCA_sMSC_EWMA; Fig. 4a, b, d), we can almost account for the negative AUC impact of the seasonal cycle, as in our experimental setting anomalous events do occur independently of seasonality. However, depending on the research question, independence of seasonality might not always be the case: some EOs may depend on vegetation activity, for example, which results in a strong dependence on seasonality.

Performance of multivariate anomaly detection algorithms
In contrast to the investigated combinations of feature extraction methods, we can identify three of the tested algorithms performing on average almost equally well for most event types given a suitable feature extraction as discussed before (Sect. 4.1). KDE and REC These techniques exhibit overall the highest AUC and lowest RVP (Table 1). Their estimated mean differences are rather small since REC can be considered as a binary form of the KDE. As REC uses a threshold ε for defining the hyperball of recurrences, the results can exhibit slightly higher AUC than KDE (Fig. S1). However, with REC the caveat is that the parameter ε is not necessarily optimally chosen.
KNN In most of the cases, KNN-Gamma performance is better than the UNIV control, but only as good as the UNIV control for detecting TrendOnset. This may be due to the fact that for TrendOnset, the mean distance to the KNN does not change, unless considering a very large number of KNN values or excluding a large fraction of temporally near data points to be within the KNN. When excluding TrendOnset, the mean performance increases to 0.019, which is comparable to KDE and REC. In addition, we observe even superior performance of KNN-Gamma compared to KDE and REC for difficult data properties (e.g., MoreIndepComponents, CorrelatedNoise; Fig. S2). In contrast, KNN-Delta does not yield high AUC, probably because we do not construct anomalies in the data cube explicitly with a direction that is accounted for by KNN-Delta (mean length of the vectors to its KNN). The finding that simple algorithms like KNN-Gamma (or KDE, T 2) are very competitive, if not favorable algorithms, is in line with results of Harmeling et al. (2006), Killourhy andMaxion (2009), andDing et al. (2014) for various data sets.
KNFST and SVDD These techniques perform on average worse than or equally as well as UNIV. Also, the RVP is highest among the algorithms (Table 1). It has already been reported that SVDD can exhibit remarkable fluctuations in the results for sample sizes smaller than 1000 data points (Ding et al., 2014). However, we use 5000 points for training. Thus, we suggest that the fluctuations are due to the fact that SVDD and KNFST use a training set that is chosen at random and may itself contain anomalies. In the current setting the size of the training sample (5000) is rather small compared to the spatiotemporal size of the data cube (750 000), and it does not seem to be sufficient to train these algo-rithms on the data cube. Increased sample sizes, however, would heavily increase memory demand and computing time, rendering kernel algorithms computationally inapplicable. Furthermore, Ding et al. (2014) shows that the sample size has a remarkable effect for SVDD (better performance for lager sample sizes). However, even with very large sample sizes SVDD still performs worse than KNN in the setting of Ding et al. Training and testing SVDD on each pixel does also not improve the results as the number of anomalies differs between different pixels in our setting. Training and testing SVDD on each pixel assumes the same number of anomalies in each pixel (constant outlier ratio assumed by the fixed ν parameter), which is contrary to the generation of the artificial data farm.
We explicitly do not want to state that KNFST and SVDD are generally worse algorithms, i.e., they are just not built for these massive numbers of data. KNFST and SVDD outperform other algorithms in very different settings (novelty detection in images) (Bodesheim et al., 2013).
T2 exhibits good performance for detecting starting trends and shifts in the mean. However, it also exhibits the third largest RVP (Table 1), indicating that the estimation of the covariance matrix may be sensitive to random variation in the data. Nevertheless, the RVP is still far better than for SVDD. The robust estimation of the mean and covariance matrix might be a difficult task (Smetek and Bauer, 2007;Rousseeuw and Hubert, 2011) for which rather complex algorithms like the (fast) minimum determinant covariance estimator, which are closely related to T 2, have been proposed (Rousseeuw and Van Driessen, 1990). Furthermore, T 2 assumes a multivariate Gaussian distribution and linear dependencies among the variables. Thus, it is not preferable for the data properties NonLinearDep and Correlat-edNoise unless combined with a nonlinear feature extraction technique like ICA (Fig. B1).

Ensembles
The selection of algorithms for computing the ensemble is a compromise between accurate detection of and diversity amongst the selected algorithms (Zimek et al., 2013). We select the four best algorithms (KDE, REC, KNN-Gamma, T 2; referred to together as 4b) and the three best distance-based algorithms (KDE, REC, KNN-Gamma; referred to together as 3d) for computing their ensembles. We assume that this choice accounts for accuracy (best algorithms selected) as well as for diversity (different algorithms selected).
Overall, ensemble building improves the anomaly detection rate. The mean AUC of each of the ensemble members (3d: +0.030, 4b: +0.023) is lower than the AUC of the ensemble, regardless of whether the maximum or the mean is used for ensemble aggregation. Minimum aggregation of ensemble members, however, performs worse than the individual ensemble members REC and KDE. Using the maximum or mean yields consistently higher AUC than using the minimum ( Table 2). The superior performance of the maximum choice compared to the minimum indicates that single algorithms overlook more often anomalous events than raising false alarm. Nevertheless, the maximum has the caveat that even a single algorithm may cause a false alarm (Zimek et al., 2013), e.g., due to a poor parameterization or inadequate assumptions about properties of the data. Thus, a more balanced voting procedure like the mean is the preferable choice and more stable with respect to possible error sources. Among the mean ensembles, the 3d or 4b ensembles perform equally well (0.041 vs. 0.039 ± 0.001 overall) ( Table 2).

Limitations
High dimensionality The utility of distance-based outlier detection algorithms as used in this paper is often questioned in the context of high-dimensional data (Zimek et al., 2012). The "curse of dimensionality" states that the difference between near and far distances diminishes with increasing dimensionality. However, Zimek et al. (2012) showed in the case of KNN that the contrary is true for outliers with fixed magnitude in otherwise uncorrelated data. Dimensionality reduction as crucial feature extraction transforms the data into a few (ideally) meaningful and uncorrelated variables. Thus, the findings of Zimek et al. (2012) provide strong arguments for applying dimensionality reduction on correlated data. We anticipate that their findings are the reason of the superior performance of dimensionality reduction here.
Heuristic choices Within the parameterization process, several heuristic choices are made. We exclude five time steps to be counted as recurrences or k-nearest neighbors. We fix several parameters, e.g., the number of nearest neighbors is fixed to 10. Also, other parameter choices are rather heuristic (e.g., σ ), although commonly used. The artificial data farm's intrinsic dimension is 3 as it was created from three independent components. Therefore, the embedding dimension m is fixed accordingly, although it can be inferred based on the data by determining the number of false nearest neighbors (Kennel et al., 1992;Hegger et al., 1999). The signal-to-noise ratio of our artificial data farm is 0.3. Furthermore, the choice of the data properties might influence the results for each event type, as the standard deviation of AUC values over all data properties (0.05) is rather large, compared to the average AUC gain of the three best algorithms with respect to the control (+0.03). However, the ordering of the algorithms is also important to derive rankings of algorithms (Hornik and Meyer, 2007). By choosing different subsets of the data properties, we observe that the three best algorithms (KDE, REC, KNN-Gamma) are on top, independently of the chosen data property. Therefore, the data properties might have an influence on the AUC values themselves, but not on the choice of the top three candidates.

Remarks on applications for real Earth observations
Our version of the artificial data farm was generated to test different algorithms for their capability to deal with typical properties of EO data. The workflows were chosen to be as generic as possible, and therefore their application to real data with slightly different properties should be made as easy as possible. Nevertheless, several points have to be considered, when applying the algorithms to real EOs. A typical preprocessing of EOs is to center variables to zero mean and standardize to unit variance (also known as z transformation). A standardization of this kind is of key importance in global EOs. Real multivariate observations often have different physical units or ranges, which have to be made comparable before analyzing. However, standardization has to be applied with care. Differences between the mean and variance between geographically distinct or even adjacent grid cells as well as seasonal cycles might corrupt any further analysis. We recommend subtracting the median seasonal cycle before standardization. The median is preferred over the mean as mean seasonal cycles are affected by changes in the amplitude of the cycle. Standardization can be applied globally (i.e., with global spatiotemporal mean and variance), regionally (i.e., with spatiotemporal mean and variance in subregions of the globe), or locally (i.e., with temporal mean and variance in each grid cell). Global standardization might be more robust than local but detects only anomalies in high-variance regions. Local standardization assumes that the number of extreme anomalies is equal in each grid cell, which is a rather strong assumption. Thus, a regional standardization is favorable in regions with similar mean and variance.
Especially variables presenting a signal from the biosphere are known to exhibit heteroscedasticity, e.g., the variance during growing season is substantially larger than during the rest of the year (Fig. 5). Atmospheric variables in high latitudes also show higher variability during the cold season, e.g., temperature variability might be higher over ice (cold season) than over open water (warm season) (Hansen et al., 2012). Specifically for global applications, using estimates of variance or standard deviation locally (in each grid cell) leads to an underestimation of the variance during growing season and thus to an overestimation of anomalies due to standardization, especially in the northern latitudes (Guanche et al., 2016). Thus, we recommend accounting for the heteroscedastic pattern by adjusting the variance during the growing season within similar regions. We also recommend this kind of adjustment for the covariance matrix used, e.g., in T 2 or PCA as well as for the parameterization of KDE or REC.
Furthermore, anomalies are also overestimated when using a reference period for the estimation of the variance (Sippel et al., 2015). However, with 300 observations in 8-day intervals, as used in this study, this issue is expected to be less pronounced than for fewer observations as it scales with the length of the time series. Nevertheless, we rather recommend using estimates of the variance of the entire time series or correcting for the overestimation in the out-of-reference period as shown in Sippel et al. (2015).
Regarding the parameterization process of the algorithms, we use fixed parameters for σ , ε, k, ν, mean vector, and covariance matrix globally on the entire artificial data cube. Local parameterization assumes the same number of anomalies in each region, which is neither suitable for the artificial data by construction nor for real global data. Thus, we recommend parameterizing globally or within similar regions. Classification of the Earth into similar regions and applying multivariate extreme detection in each region will be the subject of future research.

Conclusions
Our aim is to identify suitable methods for detecting anomalies in highly multivariate, correlated, and seasonally varying data streams as they are common in Earth system science. In particular, we are interested in detecting shifts in mean (extremes), changes in the amplitude of the seasonal cycle, temporal changes in the variance, and onsets of trends. We test a wide range of workflows (i.e., combining feature extraction techniques and anomaly detection algorithms). All experiments are based on artificial data, designed to mimic real world Earth observations.
We can show that, on average over different anomaly types and data properties, three multivariate anomaly detection algorithms (KDE, REC, KNN-Gamma) outperform univariate extreme event detection as well as other multivariate approaches (mean AUC compared to univariate control: +0.030). Additional slight improvement can be achieved by combining the best algorithms into ensembles using an aggregation by averaging score quantiles (+0.041). In contrast, the tested machine-learning algorithms (SVDD −0.05, KN-FST −0.01) may fail due to overfitting to the training sample.
However, we also find for the considered type of events that including a suitable feature extraction technique in the detection workflow is often more important than the choice of the event detection algorithm itself. However, we find that the feature extraction has to be explicitly designed for the event type of interest, i.e., time delay embedding (for detecting changes in the cycle amplitude) and exponential weighted moving average (for detecting trends and long extremes and removing uncorrelated noise in the signal) increases the detection rate of the anomalous events. Including features of the variance within a moving window works partly for detecting increases in the variance but fails to detect a decrease in the variance due to the relatively high observational noise level. In general, if the data comprise seasonality, subtracting them and using the remaining time series as the input feature is essential. Furthermore, we improve the detection rate of multivariate anomalies in highly correlated data streams by adding a dimensionality reduction method to the workflow (in line with results of Zimek et al., 2012).
The proposed workflows are capable of dealing with common properties of Earth observations like seasonality, nonlinear dependencies, and (to a certain degree) non-Gaussian distributions and noise. Nevertheless, they have to be applied with care to Earth observations, i.e., standardization issues along with strong heteroscedastic patterns (e.g., in biosphere variables of northern latitudes) may lead to an overestimation of anomalies. Future work will explore the potential of the identified workflows in rediscovering known and potentially unknown extremes as well as other anomalies in a set of real Earth system science data streams. We anticipate that an automated application of our workflows might enable the establishment of automated Earth system process control in a very generic manner.
Data availability. The artificial data farm can be created after cloning in https://github.com/CAB-LAB/DataFarm. Generation is done with the following command within the programming language Julia, version 0.4, using SurrogateCube, Surrogate Cube. DataFarm.makeDataFarm(300,50,50,PathToFolder). Within the generation process, we assume that the signal S is additive to the baseline B. The baseline might represent reoccurring patterns like seasonality or a constant mean. In addition, binary event parameters ev t,lat,lon are introduced, which allow for switching the anomaly on (ev t,lat,lon = 0) and off (ev t,lat,lon = 0) (normality). The event type and magnitude of the event is controlled separately by a parameter for the baseline (k b ), the signal (k s ), and a mean-shift parameter (k m ) scaled with the standard deviation of the data (SD). t,lat,lon = B t,lat,lon · 2 (k b ·ev t,lat,lon ) + S t,lat,lon · 2 (k s ·ev t,lat,lon ) + k m · ev t,lat,lon · SD (A1) For a basic version, three independent components t,lat,lon,var are created with the signal consisting of Gaussian noise (SD = 1). Each component represents intrinsic properties of the Earth system. Furthermore, we assume that properties of the Earth system t,lat,lon are not measured directly but indirectly via a set of correlated variables, i.e., representing patterns of these intrinsic properties. Hence, these variables propagate anomalous events that occur in one independent component. This set of correlated variables X var is created by weighting the intrinsic properties var with randomly drawn linear (or nonlinear) weights w j plus additional measurement noise (Gaussian, SD = 0.3) added to each variable. Using this data generation scheme, a standard data cube X t i,j ,lat,lon,var is created, encompassing 300 time steps (T ), 10 temporally correlated variables (VAR), and the total number of latitudes (LAT) and longitudes (LON) fixed to 50 each. We induce anomalous events with a spatial extent of 40 % of the latitude and longitude and 10 events, each with a temporal extent of five time steps. Our total number of anomalies equals about 3 % of the total data cube.
In the basic version we create four data cubes, each with a different temporary event type, including shift in the baseline, i.e., shift of the running mean of a time series (BaseShift) (Fig. 2a); change in the variance of the time series (Vari-anceChange) (Fig. 2b); change in the amplitude of the mean seasonal cycle of a time series (MSCChange) (Fig. 2c); onset of a trend in the time series (TrendOnset) (Fig. 2d).
Regarding the data properties, some of the event type data property combinations are excluded (Table A1). In detail, we do not expect a TrendOnset to infect neighbored cells (Tren-dOnset plus RandomWalkExtreme) and a TrendOnset can hardly be called a TrendOnset if it encompasses only one time step (ShortExtremes).  Figure B1. AUC versus event magnitude for all combinations (grey) and the univariate control (red). Columns of the matrix represent different event types; rows represent data properties. Additional colored workflows represent the workflows with the five highest mean values for the magnitudes > 2 SD (> 0.6).  Figure B2. Effect of the data properties on the three best detection algorithms (KDE, REC, KNN-Gamma) presented as AUC difference of the UNIV control for the event types (a-d).