Towards improved and more routine Earth system model evaluation in CMIP

The Coupled Model Intercomparison Project (CMIP) has successfully provided the climate community with a rich collection of simulation output from Earth system models (ESMs) that can be used to understand past climate changes and make projections and uncertainty estimates of the future. Confidence in ESMs can be gained because the models are based on physical principles and reproduce many important aspects of observed climate. More research is required to identify the processes that are most responsible for systematic biases and the magnitude and uncertainty of future projections so that more relevant performance tests can be developed. At the same time, there are many aspects of ESM evaluation that are well established and considered an essential part of systematic evaluation but have been implemented ad hoc with little community coordination. Given the diversity and complexity of ESM analysis, we argue that the CMIP community has reached a critical juncture at which many baseline aspects of model evaluation need to be performed much more efficiently and consistently. Here, we provide a perspective and viewpoint on how a more systematic, open, and rapid performance assessment of the large and diverse number of models that will participate in current and future phases of CMIP can be achieved, and announce our intention to implement such a system for CMIP6. Accomplishing this could also free up valuable resources as many scientists are frequently "re-inventing the wheel" by re-writing analysis routines for well-established analysis methods. A more systematic approach for the community would be to develop and apply evaluation tools that are based on the latest scientific knowledge and observational reference, are well suited for routine use, and provide a wide range of diagnostics and performance metrics that comprehensively characterize model behaviour as soon as the output is published to the Earth System Grid Federation (ESGF).
The CMIP infrastructure enforces data standards and conventions for model output and documentation accessible via the ESGF, additionally publishing observations (obs4MIPs) and reanalyses (ana4MIPs) for model intercomparison projects using the same data structure and organization as the ESM output. This largely facilitates routine evaluation of the ESMs, but to be able to process the data automatically alongside the ESGF, the infrastructure needs to be extended with processing capabilities at the ESGF data nodes where the evaluation tools can be executed on a routine basis. Efforts are already underway to develop community-based evaluation tools, and we encourage experts to provide additional diagnostic codes that would enhance this capability for CMIP. At the same time, we encourage the community to contribute observations and reanalyses for model evaluation to the obs4MIPs and ana4MIPs archives. The intention is to produce through the ESGF a widely accepted quasi-operational evaluation framework for CMIP6 that would routinely execute a series of standardized evaluation tasks. Over time, as this capability matures, we expect to produce an increasingly systematic characterization of models which, compared with early phases of CMIP, will more quickly and openly identify the strengths and weaknesses of the simulations. This will also reveal whether long-standing model errors remain evident in newer models and will assist modelling groups in improving their models. This framework will be designed to readily incorporate updates, including new observations and additional diagnostics and metrics as they become available from the research community.


Introduction
High-profile reports such as the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (AR5; IPCC, 2013) attest to the exceptional societal interest in understanding and projecting future climate. The climate simulations considered in IPCC AR5 are mostly based on Earth system model (ESM) experiments defined and internationally coordinated as part of the World Climate Research Programme (WCRP) Coupled Model Intercomparison Project Phase 5 (CMIP5; Taylor et al., 2012). The objective of CMIP is to better understand past, present, and future climate changes in a multi-model context. However, adequate use of the simulations requires an awareness of their limitations. Therefore, it is essential to systematically evaluate models with available observations (Flato et al., 2013). More generally, model evaluation and intercomparison provides a necessary albeit insufficient perspective on the reliability of models and also facilitates the prioritization of research that aims at improving the models.
Output from CMIP5 models is archived in a common format and structure and is accessible via a distributed data archive, namely the Earth System Grid Federation (ESGF; http://esgf.llnl.gov/). The scientific contents of the models and the details of the simulations are further described via the Earth System Documentation (ES-DOC) effort (http://es-doc.org). This has enabled a diverse community of scientists with more than 27 000 registered users (Williams et al., 2015) to readily search, retrieve, and analyse these simulations. Since CMIP5, there has also been a large effort to provide observations and reanalysis products to end users of CMIP results as part of the observations (obs4MIPs; Teixeira et al., 2014) and reanalysis (ana4MIPs) for model intercomparison projects. Together, these efforts have the potential to facilitate comparisons of model simulations with observations and reanalyses. However, the full rewards of the coordinated experiments and data standards have yet to be realized, and more can be done to capitalize on the CMIP multi-model and observational infrastructure already in place (Williams et al., 2015).
Here, we provide a perspective for developing standardized analysis procedures that could routinely be applied to CMIP model output at the time of publication to the ESGF, and we announce our intention to implement such a system in time for the sixth phase of CMIP (CMIP6; Eyring et al., 2016a). The goal is to produce, along with the model output and documentation, a set of informative diagnostics and performance metrics that provide a broad, albeit incomplete, overview of model performance and simulation behaviour. With this paper we aim to attract input and development of established, yet innovative analysis codes from the broad community of scientists analysing CMIP results, including the CMIP6-Endorsed model intercomparison projects (MIPs). The CMIP standard evaluation procedure should utilize open-source and community-based evaluation tools, flexibly designed in order to allow their improvement and extension over time. Our discussion here specifically addresses the crucial infrastructure requirements generated by such community tools for ESM analysis and evaluation, including how such requirements lead to reliance on the infrastructure supporting ESM output and relevant Earth system observations. An overarching theme is that, if we are to capitalize on the community effort devoted to model development, analysis, documentation, and evaluation and if we are to fully exploit the value of coordinated multi-model simulation activities like CMIP, then further infrastructure development and maintenance will be needed. Given the CMIP6 timeline and the complex and integrated nature of the infrastructure, it is expected that requirements will have to be satisfied by modifications and additions to the current infrastructure, rather than development and deployment of a completely new approach. The proposed infrastructure relies on conventions for data and for recording model and experiment documentation that have been developed over the last two decades. Its backbone is the distributed data archive and the delivery system developed by the ESGF, which with CMIP5's success and WCRP's encouragement is increasingly being adopted by the climate research community. We hope the overview presented here inspires additional, focused efforts toward improved and more routine evaluation in CMIP. We emphasize that routine evaluation of the ESMs cannot and is not meant to replace the cutting-edge and in-depth explorative analysis and research that makes use of CMIP output, which will remain essential to close gaps in our scientific understanding. Rather, we suggest making more routine the well-established parts of ESM evaluation that have demonstrated their value in the peer-reviewed literature, for example, as part of the IPCC climate model evaluation chapters (Flato et al., 2013). This will leave more time for innovative research, for example on additional guidance in reducing systematic biases and on new diagnostics that can reduce the uncertainty in future projections.
Earth Syst. Dynam., 7, 813-830, 2016 www.earth-syst-dynam.net/7/813/2016/
Our assessment draws substantially on responses to a CMIP5 survey of representatives from the climate science community. The summer 2013 survey was developed by the CMIP Panel, a subcommittee of the WCRP Working Group on Coupled Modelling (WGCM), which is responsible for direct coordination of CMIP. The scientific gaps and recommendations for CMIP6 that were identified through this community survey are summarized by Stouffer et al. (2016). This paper is organized as follows. In Sect. 2 we argue for the development of community evaluation tools that would be routinely applied to CMIP model output as soon as it becomes available from the ESGF, and we identify the associated software infrastructural needs. In Sect. 3, we discuss some of the scientific gaps and challenges that might be addressed through innovative diagnostics that could be incorporated into future, more comprehensive evaluation tools. Section 4 closes with a summary and outlook.

Evaluation tools and corresponding infrastructure needs for routine model evaluation in CMIP
With the increasing complexity and resolution of ESMs, it is a daunting challenge to systematically analyse, evaluate, understand, and document their behaviour. Thus, it is an especially attractive idea to engage a wide range of scientific and technical experts in the development of community-based diagnostic packages.   lend themselves to this purpose. The workflow for routinely analysing and evaluating the CMIP simulations is shown in Fig. 1. It utilizes community tools and relies on the ESGF infrastructure and relevant Earth system observations. The workflow assumes that CMIP model output and observations are accessible in a common format on ESGF data nodes (Sect. 2.1), that open-source software evaluation tools exist (Sect. 2.2), and that the existing ESGF infrastructure, which is now mainly a data archive, is enhanced with additional processing capabilities enabling evaluation tools to be directly executed on at least some of the ESGF nodes (Sect. 2.3). Plans for making evaluation results traceable, well documented, and visually rendered are also discussed (Sect. 2.4).

Access to CMIP model output and observations in common formats
The CMIP5 archive of multi-model output constitutes an enormous and valuable resource that efficiently enables progress in climate research. This diverse repository, in excess of 2 PB (see Table 1), of commonly formatted climate model data also has proved valuable in the preparation of climate assessment reports such as those of the IPCC and in serving the needs of downstream users of climate model output such as impact researchers. The CMIP data format requirements are based on the Climate and Forecast (CF) self-describing Network Common Data Format (NetCDF) standards and naming convention and on tools such as the Climate Model Output Rewriter (CMOR). As a result, the CMIP model output conforms to a common standard with metadata that enables automated interpretation of file contents. The layout of data in storage and the definition of discovery metadata have also been standardized in the data reference syntax (DRS), which provides for logical and automated ways to access data across all models. This has enabled development of analysis tools capable of treating data from all models in the same way and effectively independent of the platform on which they are executed. The infrastructure supporting the publication of CMIP5 data was developed by the ESGF, which archives data accessible via a common interface but distributed among data nodes hosted by modelling and data centres. The CMIP5 survey noted that this first generation of a distributed infrastructure to serve the model data did not initially perform well, which, retrospectively, is not surprising given that it was a first major application of a distributed approach to archiving CMIP data and given the limited time and resources available for development and testing.
Figure 1. Schematic diagram of the workflow for routinely producing a broad characterization of model performance for CMIP model output using community evaluation tools that utilize relevant observations and reanalyses and rely on the ESGF infrastructure.
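Because file names follow a fixed syntax, a tool can recover basic metadata without opening a file. The following minimal sketch assumes the CMIP5-style file name pattern <variable>_<MIP table>_<model>_<experiment>_<ensemble member>[_<time range>].nc; the function name parse_cmip_filename and the example file names are illustrative only, and the DRS document remains the authoritative definition.

```python
from typing import Dict, Optional

# Component names are our own labels for the assumed CMIP5-style pattern.
FIELDS = ("variable", "mip_table", "model", "experiment", "ensemble")


def parse_cmip_filename(filename: str) -> Dict[str, Optional[str]]:
    """Split a CMIP5-style file name into its components (assumed pattern)."""
    if not filename.endswith(".nc"):
        raise ValueError(f"not a NetCDF file name: {filename!r}")
    parts = filename[:-3].split("_")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of components in {filename!r}")
    record: Dict[str, Optional[str]] = dict(zip(FIELDS, parts))
    # The time range is absent for time-independent (fixed) fields.
    record["time_range"] = parts[5] if len(parts) == 6 else None
    return record


example = parse_cmip_filename(
    "tas_Amon_HadGEM2-ES_historical_r1i1p1_185912-200511.nc"
)
print(example["model"], example["experiment"], example["time_range"])
```

The same parsing logic applies to every model, which is what allows one analysis routine to treat all model output uniformly.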
Storing, testing, and delivering these data have relied on a distributed infrastructure developed largely through community-based coordination and short-term funding. This relatively fragile approach to providing climate modelling infrastructure will face even stiffer challenges in the future. Climate modelling and evaluation, which already involves management of enormous amounts of data, is a big data challenge confronted with demands for prompt access and availability (Laney, 2012). Unless we meet the challenge of dealing with increasing volumes of data, it will be difficult to routinely and promptly evaluate CMIP models. Improvements in the functionality of the ESGF require a coordinated international undertaking. Priorities for CMIP are set by the WGCM Infrastructure Panel (WIP), and through the ESGF's own governance structure these are integrated with demands from other projects. The individual, funded projects comprising the ESGF ultimately determine what can be realized by volunteering to respond to the prioritized needs and requirements. The model evaluation activity advocated here depends on the ESGF providing automated and robust access to all published model output and relevant observational data. The quantity of data made available under CMIP5 was about 50 times larger than under CMIP3. The data volume is expected to grow by another factor of 10-20 for CMIP6, resulting in a database of between 20 and 40 PB, depending on model resolution and the number of modelling centres ultimately participating in the project (Table 1). The CMIP6 routine model evaluation activity discussed here will initially rely mostly on well-observed and commonly analysed fields, so this activity is not expected to increase the CMIP6 data request beyond the CMIP6-Endorsed MIP demands.
These tools are also included as separate namelists in the ESMValTool and will be applied to CMIP6 models together with other ESMValTool diagnostics and performance metrics as soon as the output is published to the ESGF.
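The volume figures quoted in this section can be checked with simple arithmetic. The sketch below takes the stated lower bound of 2 PB for CMIP5 as its starting point; the function and variable names are our own, and the resulting numbers are rough orders of magnitude rather than archive statistics.

```python
CMIP5_VOLUME_PB = 2.0          # "in excess of 2 PB": lower bound from the text
CMIP5_OVER_CMIP3 = 50          # CMIP5 was about 50 times larger than CMIP3
CMIP6_GROWTH_RANGE = (10, 20)  # expected growth factor from CMIP5 to CMIP6


def project_volumes(cmip5_pb, ratio_to_cmip3, growth_range):
    """Return implied CMIP3 and projected CMIP6 archive sizes in petabytes."""
    low, high = growth_range
    return {
        "cmip3_pb": cmip5_pb / ratio_to_cmip3,
        "cmip6_low_pb": cmip5_pb * low,
        "cmip6_high_pb": cmip5_pb * high,
    }


volumes = project_volumes(CMIP5_VOLUME_PB, CMIP5_OVER_CMIP3, CMIP6_GROWTH_RANGE)
for name, value in volumes.items():
    print(f"{name}: {value:g} PB")
```

Under these assumptions the projected CMIP6 range reproduces the 20 to 40 PB quoted above, and the implied CMIP3 archive is of the order of tens of terabytes.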
The convenience of dealing with CMIP output that adheres to well-defined standards and conventions is a major reason why the data have been used extensively in research. Another requirement of any model evaluation activity is well-characterized observational data. Traditionally, observations from different sources have been archived and documented in a variety of ways and formats. To encourage a more unified approach, the obs4MIPs initiative (Teixeira et al., 2014) has defined a set of specifications and criteria for technically aligning observational datasets with CMIP model output (with common file format, data, and metadata structure). Over 50 gridded datasets that conform to these standards are now archived on the ESGF alongside CMIP model output, and the archive continues to expand rapidly. Data users have enthusiastically received obs4MIPs, and the WCRP Data Advisory Council (WDAC) has established a task team to encourage the project and provide guidance and governance at the international level. The expansion of the obs4MIPs project, with additional observational products directly relevant to Earth's climate system components and process evaluation, is a clear opportunity to facilitate routine evaluation of ESMs in CMIP. A sister project, ana4MIPs, provides selected fields well suited for model evaluation from major atmospheric reanalyses. The obs4MIPs protocol requires every dataset submitted to be accompanied by a technical note, which includes, for example, discussion of uncertainties and guidance as to aspects of the data product that are particularly relevant to model evaluation. Similar documentation efforts for observations specifically meant for use in model evaluation can be found at the National Center for Atmospheric Research (NCAR) climate data guide.
Ideally, standard technical documentation as defined by obs4MIPs will be adopted broadly by the international observational community and will be hosted alongside (or integrated with) the CMIP model and simulation standard documentation (ES-DOC). Additionally, there are proposals being considered to include non-gridded data in obs4MIPs (e.g. data collected by ground stations or during aircraft campaigns), and the possibility that auxiliary data such as land-sea masks, averaging kernels, and additional uncertainty data might also be provided. Whatever datasets are used for model evaluation, it will be important to determine the size of observational error relative to the errors in the models. One approach being developed is to provide ensembles of observational estimates, all based on a single sensor or product and generated by making many different choices of retrieval algorithms or parameters, all considered to be reasonable. The goal is to be able to extend obs4MIPs in order to better characterize observational uncertainty.
Figure 2. Examples of performance metrics and diagnostics that will be calculated from CMIP6 models with the ESMValTool (Eyring et al., 2016b) as soon as the output is submitted to the ESGF. (a) Taylor diagram showing the 20-year annual average performance of CMIP5 models for total cloud fraction as compared to MODIS satellite observations, (b) aerosol optical depth from ESA-CCI satellite data (contours) compared with station measurements by AERONET (circles), (c) an emergent constraint on the carbon cycle-climate feedback (γ_LT) based on the short-term sensitivity of atmospheric CO₂ to interannual temperature variability (γ_IAV) in the tropics, (d) modelled and observed time series of September mean Arctic sea-ice extent, (e) RMSD metric of several components of the global carbon cycle, and (f) annual-mean precipitation rate (mm day⁻¹) bias from the CMIP5 multi-model mean compared to the Global Precipitation Climatology Project.

Community tools for Earth system model evaluation ready for CMIP6
There is growing awareness that community-shared software could facilitate more comprehensive and efficient evaluation of ESMs and that this could help increase the pace of understanding model behaviour and consequently also the rate of model improvement. Here we highlight several examples of capabilities that are currently under development and relevant to the goal of developing routine evaluation of CMIP simulations. Table 2 provides examples of existing diagnostic tools that can be used within CMIP6. Specifics of the design and the diagnostics included in these tools are detailed in the corresponding documentation of the tools that we refer to in the text and in Table 2. It is envisaged that well-established plots produced by the standardized evaluation process outlined here will eventually be archived and become part of model documentation. In the meantime they can also be included in publications on model evaluation: since the tools that produce them are open-source, the resulting plots are also effectively freely available. However, we would expect users to cite both software versions and technical papers produced by the tool developers to provide the formal provenance for the plots.

Evaluation tools targeting the broad characterization of ESMs in CMIP6
Our initial goal is the coupling of two capabilities to the ESGF to produce a broad characterization of CMIP6 models: the ESMValTool and the PMP. Both software packages are open-source, have a wide range of functionalities, and are being developed as community tools with the involvement of multiple institutions. CMIP6 modelling groups and users of the CMIP6 data can make use of the evaluation results produced with these tools, which will be made available to the wider community. They can also download the source code and run the tools locally before submission of the results to the ESGF for an additional quality check of the simulations.
Here we summarize some aspects of these tools but refer the reader to their respective documentation in the literature for further details.
- The ESMValTool (Eyring et al., 2016b) consists of a workflow manager and a number of diagnostic and graphical output scripts. The workflow manager is written in Python, whereas multi-language support is provided for the diagnostic and graphic routines. The ESMValTool workflow is controlled by a main namelist file defining the model and observational data to be read, the variables to be analysed, and the diagnostics to be applied. The priority of the effort so far has been to target specific scientific themes focusing on selected essential climate variables (ECVs); a range of known systematic biases common to ESMs, such as coupled tropical climate variability, monsoons, Southern Ocean processes, continental dry biases and soil hydrology-climate interactions; atmospheric CO₂ budgets; tropospheric and stratospheric ozone; and tropospheric aerosols. ESMValTool v1.0 includes a large collection of standard namelists for reproducing the analysis of many variables across atmosphere, ocean, and land domains, with diagnostics and performance metrics focusing on the mean state, trends, variability and important processes, phenomena, and emergent constraints. The collection of standard namelists allows for reproduction of, for example, the figures from the climate model evaluation chapter of IPCC AR5 (Chapt. 9; Flato et al., 2013) and parts of the projection chapter (Chapt. 12; Collins et al., 2013b), a portrait diagram comparing the time-mean root mean square difference (RMSD) over different subdomains as in Gleckler et al. (2008) and for land and ocean components of the global carbon cycle as in Anav et al. (2013). ESMValTool v1.0 also includes stand-alone packages such as the NCAR CVDP and the cloud regime metric developed by the Cloud Feedback MIP (CFMIP) community (Williams and Webb, 2009), as well as detailed diagnostics for monsoon, El Niño-Southern Oscillation (ENSO), and the Madden-Julian Oscillation (MJO).
Examples of the plots that will be produced with the ESMValTool for CMIP6 are shown in Fig. 2, and we refer the reader to the corresponding literature and the ESMValTool website (see Table 2) for full details.
- The PMP includes a diverse suite of summary statistics to objectively gauge the level of agreement between model simulations and observations across a broad range of space and timescales. It is built on Python and the Ultrascale Visualization Climate Data Analysis Tools (UV-CDAT; Williams, 2014), a powerful software tool kit that provides cutting-edge data management, diagnostic, and visualization capabilities. Example plots produced with PMP are shown in Fig. 3. The first examines how well simulated sea ice agrees with measurements on sector scales and demonstrates that the classical measure of total sea-ice area is often misleading because of compensating errors (Ivanova et al., 2016). The second highlights the amplitude and phase of the diurnal cycle of precipitation (Covey et al., 2016), and the third is a "portrait plot" comparing different versions of the same model in Atmospheric Model Intercomparison Project (AMIP) mode.
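The idea of a workflow driven by a single configuration file, as described for the ESMValTool namelists above, can be illustrated with a toy example. This is a minimal sketch only: the dictionary layout, the diagnostic names, and the dataset labels are hypothetical and do not reproduce the actual ESMValTool namelist format or its diagnostics.

```python
from statistics import mean

# Hypothetical registry of diagnostics; real tools ship far richer routines.
DIAGNOSTICS = {
    "mean_bias": lambda model, obs: mean(model) - mean(obs),
    "max_abs_error": lambda model, obs: max(abs(m - o) for m, o in zip(model, obs)),
}

# Stand-in for a namelist: the data to read, the reference, the diagnostics.
namelist = {
    "variable": "tas",
    "reference": "OBS",
    "models": ["MODEL-A", "MODEL-B"],
    "diagnostics": ["mean_bias", "max_abs_error"],
}

# Toy values (K) keyed by dataset name; a real workflow would read NetCDF files.
data = {
    "OBS": [287.0, 288.0, 289.0],
    "MODEL-A": [287.5, 288.5, 289.5],
    "MODEL-B": [286.0, 288.0, 290.0],
}


def run_namelist(namelist, data):
    """Apply every requested diagnostic to every requested model."""
    obs = data[namelist["reference"]]
    return {
        model: {
            name: DIAGNOSTICS[name](data[model], obs)
            for name in namelist["diagnostics"]
        }
        for model in namelist["models"]
    }


results = run_namelist(namelist, data)
for model, scores in results.items():
    print(model, scores)
```

Keeping the choice of data and diagnostics in a configuration object rather than in the code is what allows the same evaluation to be repeated unchanged when new model output arrives.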
Both tools are under rapid development with a priority of providing a diverse suite of diagnostics and performance metrics for all DECK and historical simulations in CMIP6 to researchers and model developers, suitable for use soon after each simulation is published on the ESGF. Since these tools are freely available, modelling groups participating in CMIP can additionally make use of these packages. They could choose, for example, to utilize the tools during the model development process in order to identify relative strengths and weaknesses of new model versions, also in the context of the performance of other models, or they could run the tools locally before publishing the model output to the ESGF. The tools are therefore highly portable and have been tested across different platforms. Another example is the NCAR CVDP, which has been designed to work on CMIP output and provides analysis of the major modes of climate variability in models and observations (Phillips et al., 2014). The NCAR CVDP is also implemented as a stand-alone namelist in the ESMValTool. Figure 4 shows a comparison of the CMIP5 models with observations for the Pacific Decadal Oscillation (PDO) to illustrate the kind of plots that can be produced with CVDP. Other available model evaluation packages that could be applied to CMIP6 output include the International Land Modeling Benchmarking Project (ILAMB), focusing on the representation of the carbon cycle and land surface processes in climate models via extensive comparison of model results with observations (Luo et al., 2012). Still other packages target model evaluation methods that are computationally demanding such as the parallel toolkit for extreme climate analysis (TECA; Prabhat et al., 2012).
There is some overlap in function between the ESMValTool and PMP and the other tools mentioned above, but efforts are underway to provide some coordination between these developing capabilities to reduce duplication of effort and to help ensure they advance in a way that best serves the CMIP modelling and research communities, including the modelling groups themselves. In any case, encouraging a diversity of technical approaches and tools rather than a single one may at this stage be beneficial as it will provide experience that will help guide a more integrated approach in the longer term, perhaps as the community prepares for CMIP7 and beyond. Current testing with the same RMSD and ENSO metrics implemented in both the ESMValTool and PMP should inform such comparisons and reliability tests of the same scientific metrics incorporated into different technical frameworks.
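A reliability test of this kind amounts to checking that two independent implementations of one metric agree on the same input. The sketch below does this for an area-weighted RMSD on a regular latitude-longitude grid, using cos(latitude) as an assumed area weight; both functions and the toy fields are our own illustration and are not taken from the ESMValTool or PMP code.

```python
import math


def rmsd_cellwise(model, obs, lats_deg):
    """Area-weighted RMSD accumulated one grid cell at a time."""
    num = 0.0
    den = 0.0
    for row_m, row_o, lat in zip(model, obs, lats_deg):
        weight = math.cos(math.radians(lat))
        for m, o in zip(row_m, row_o):
            num += weight * (m - o) ** 2
            den += weight
    return math.sqrt(num / den)


def rmsd_by_latitude(model, obs, lats_deg):
    """Same metric, computed from per-latitude mean squared differences.

    Equivalent to rmsd_cellwise when every row has the same number of cells.
    """
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    row_mse = [
        sum((m - o) ** 2 for m, o in zip(row_m, row_o)) / len(row_m)
        for row_m, row_o in zip(model, obs)
    ]
    return math.sqrt(sum(w * e for w, e in zip(weights, row_mse)) / sum(weights))


# Toy time-mean fields on a 3 x 2 grid (rows are latitudes).
lats = [-60.0, 0.0, 60.0]
model_field = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
obs_field = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]

a = rmsd_cellwise(model_field, obs_field, lats)
b = rmsd_by_latitude(model_field, obs_field, lats)
print(a, b, math.isclose(a, b))
```

Agreement within floating-point tolerance on shared test fields gives some confidence that differences between tools reflect scientific choices rather than coding errors.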
The wider community is being encouraged to contribute to the development of these tools by adding code for additional diagnostics. We refer the reader to the literature of the individual tools for details on how the development teams invite these contributions. The free availability of the codes should facilitate this task and also help to increase code quality. We stress again that the focus of these evaluation tools is on reproducing standard evaluation tasks and not on performing generic data-processing tasks, such as extracting monthly or zonal means, or reducing or regridding model data. Although they could in principle be used just for data processing, this is not their main goal and they may not include all the functionalities typically covered by pure data-processing tools.

Integration of evaluation tools in ESGF infrastructure
In order to connect multivariate results from multiple models and multiple observational datasets (Sect. 2.1) with tools for a quasi-operational evaluation of the CMIP models (Sect. 2.2), an efficient ESGF infrastructure is needed that can handle the vast amount of data and execute the evaluation tools. At the same time the workflow should be captured so that the evaluation procedure can be reproduced as new model output becomes available. This will allow changes in model performance to be monitored over a time frame of many years. Our expectation is that, for CMIP6, the ESMValTool and PMP, with contributions from other efforts such as the NCAR CVDP and ILAMB packages, will be able to operate directly on the data served by the major ESGF data nodes. While it was and is possible to run analysis tools over the CMIP5 archive, it was difficult, error-prone, and not widely done. The proposed new functionality for CMIP6 is a step toward what should become a tighter integration of model analysis tools with data servers. This advancement will be particularly advantageous given the very large and complex CMIP data archive. Here we describe the necessary associated infrastructural changes that need to be made to enable this for CMIP6. As we provide an overview of the challenges emerging from the desire to move towards more routine evaluation of the models in future CMIP phases, it should be understood that actual implementation will require specification of many important technical details not addressed here. It is envisaged that the evaluation tools will be executed at one or more of the ESGF sites that host copies (i.e. "replicas") of most of the required CMIP datasets and the observations used by the evaluation tools.
Although these replicas typically represent a significant subset of the data volume available on the ESGF, especially at the larger ESGF nodes, the complete replication of the entire CMIP model output at a single ESGF site cannot be achieved. As a consequence, some of the required CMIP model output used in the evaluation tools might still not be available even on the largest ESGF nodes. There are two possible solutions: (1) to distribute the processing of the evaluation tools at different ESGF nodes, and (2) to acquire and potentially cache data as needed for the evaluation tools. We regard the first option as not being practical in the CMIP6 time frame but a possibly promising option in the long term. The second option, which we envisage to be feasible for CMIP6, is schematically displayed in Fig. 5. The evaluation tools are executed with specific user configurations (e.g. the ESMValTool namelists; Eyring et al., 2016b). These user configurations also include the list of model and observational data to be analysed. Tools such as esgf-pyclient and synda exist which allow interrogation of local and distributed node data and which could transfer the necessary data into either a cache or the ESGF replica storage. OPeNDAP could also be used without the necessity for a cache. However, the workflow for managing this process does not yet exist and needs to be developed. Given the huge volumes of the ESGF data collections, it is realistic to assume that the requisite data will be maintained only at specific ESGF nodes where the evaluation tools will be executed.
It is therefore realistic to expect that within CMIP6 the evaluation tools will be installed and operated on selected ESGF supernodes only, currently expected to be those hosted by seven climate data centres on four continents (Beijing Normal University (China), Centre for Environmental Data Analysis (CEDA, UK), Deutsches Klimarechenzentrum (DKRZ, Germany), Institut Pierre Simon Laplace (IPSL, France), Lawrence Livermore National Laboratory (LLNL, USA), National Computational Infrastructure (NCI, Australia), and the University of Tokyo (Japan); see Williams et al., 2015). These supernodes will need to provide the necessary storage and computing resources and be integrated into the ESGF replication infrastructure, which optimizes data transport between core ESGF sites. Since it will take substantial time to replicate all output from the CMIP DECK and historical simulations to the supernodes (similar replications took months in CMIP5), we have recommended to the ESGF teams that the data used by the CMIP evaluation tools be replicated with higher priority. This should substantially speed up the evaluation of model results after submission of the simulation output to the ESGF. A prerequisite for this is that the evaluation tools provide an overview of the experiments, the subset of data from the CMIP6 data request, and the observations and reanalyses that are used. In the long term (e.g. in time for CMIP7), more automatic and rapid procedures could be developed so that the evaluation tools could be run as part of the publication process of the model output.
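The second option discussed above, acquiring and caching data as needed, can be sketched as a small lookup routine. The workflow itself does not yet exist, so everything here is hypothetical: the function resolve_datasets, the directory layout, and the fetch callable, which stands in for whatever transfer tool a site would actually use.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def resolve_datasets(
    dataset_ids: Iterable[str],
    replica_root: Path,
    cache_root: Path,
    fetch: Callable[[str, Path], None],
) -> List[Path]:
    """Return a local path per dataset, fetching into the cache only if needed."""
    paths = []
    for dataset_id in dataset_ids:
        replica = replica_root / dataset_id
        cached = cache_root / dataset_id
        if replica.exists():
            paths.append(replica)          # already replicated at this node
            continue
        if not cached.exists():
            cached.parent.mkdir(parents=True, exist_ok=True)
            fetch(dataset_id, cached)      # remote transfer happens only here
        paths.append(cached)
    return paths


# Demonstration with temporary directories and a fake transfer function.
tmp = Path(tempfile.mkdtemp())
replicas, cache = tmp / "replicas", tmp / "cache"
replicas.mkdir()
(replicas / "tas_MODEL-A.nc").write_text("placeholder")

fetched = []


def fake_fetch(dataset_id: str, target: Path) -> None:
    fetched.append(dataset_id)
    target.write_text("placeholder")


wanted = ["tas_MODEL-A.nc", "pr_MODEL-B.nc"]
first = resolve_datasets(wanted, replicas, cache, fake_fetch)
second = resolve_datasets(wanted, replicas, cache, fake_fetch)  # served from cache
print(fetched)
shutil.rmtree(tmp)
```

Passing the transfer step in as a callable keeps the lookup logic independent of the particular client used to retrieve data.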
Executing the evaluation tools directly alongside the ESGF may also require the extension of the current hardware and software infrastructure to implement processing capabilities where the tools can be run. This infrastructure will need to include new interfaces to computers, and should allow for flexible deployment and usage scenarios since we can foresee application in the spectrum of possible environments discussed above. Given the large amount of data involved, parallelization of the data handling in the evaluation tools themselves needs to be considered so that the data can be processed efficiently. A number of different technical solutions are possible, but in Europe at least, it is likely that supernodes will deploy web processing services exposing the diagnostic codes as "capabilities" to new ESGF portals which exploit backend computing and access to the ESGF data nodes.
A coordinated set of community-based diagnostic packages will require standards and conventions to be adopted governing the analysis interface and the output produced by the diagnostic procedures. Clear documentation of the procedures and codes is required, as are standards for all key interfaces. Because working towards a community-based approach represents a shift in CMIP procedure, it will likely take considerable time and effort, as it did for the data standards themselves, to establish agreed-upon software standards. In the interim, substantial progress can be made by expert teams developing diagnostic tools if they follow a set of best practices and reasonable efforts are made to coordinate them where possible. During this period the different approaches available can be assessed, and further experience with them can help lead to advancing community-based interfaces. During this time it will also be possible to experiment with different approaches to delivering the required computing within or alongside the ESGF. Given that the amount of necessary and affordable computing resources is not yet clear, it is likely that early ESGF resources will be allocated to the tool developers to provide diagnostic products centrally rather than for open computing on demand by multiple users. Multiple users could, however, still make profitable use of the tools by downloading the source codes and running them on their own local systems. For more information regarding ESGF's infrastructure and progress towards computing and tool integration, please see the 2016 5th Annual ESGF Face-to-Face Conference Report. In support of the ESGF infrastructure, a library will provide a system for indexing the output of the community-based diagnostics packages and automatically generate a user-friendly web interface for looking through the results (i.e. a "viewer").
This library will integrate with an ESGF web service to provide a simple workflow for uploading diagnostics results to a server and sharing them with collaborators. Each diagnostics run will generate provenance data that will track the data used for input, the version of the community-based package, who ran the diagnostics and at which location, etc. This information would then be bundled with the output automatically and made available within the ESGF web service as well as in the local viewer.
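To illustrate what such a provenance bundle might contain, the sketch below records the items listed above (input data, package version, who ran the diagnostics and where) and writes them as JSON next to the diagnostic output. The class, its field names, and the file name `provenance.json` are assumptions made for this example only; they are not a schema defined by the ESGF or by any of the evaluation packages.

```python
import getpass
import json
import socket
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ProvenanceRecord:
    """Illustrative provenance fields; the names are assumptions, not an ESGF schema."""

    input_datasets: list[str]
    package: str
    package_version: str
    user: str = field(default_factory=getpass.getuser)
    host: str = field(default_factory=socket.gethostname)
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def write_with_provenance(output_dir: Path, record: ProvenanceRecord) -> Path:
    """Store the record as JSON next to the diagnostic output and return its path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / "provenance.json"
    target.write_text(json.dumps(asdict(record), indent=2))
    return target
```

Writing the record into the same directory as the results is one simple way to keep output and provenance together when a run is uploaded or copied; a production system would presumably also record dataset versions and checksums.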
To summarize, we will begin in the CMIP6 time frame with the deployment of a subset of packages, in particular the ESMValTool (which itself includes other well-known packages such as NCAR CVDP) and PMP and run them on or alongside ESGF supernodes. Starting with available data in existing CMIP5 replica caches, the evaluation package developments are tested at dedicated sites (some of the supernodes) and prepared for CMIP6. In parallel, developments with respect to the supporting infrastructure (replication, cache maintenance, provenance recording, parallel processing) are starting. We expect this initial effort to spur developments toward a uniform approach to analytic package deployment. Eventually we aspire to put in place a robust and agile framework whereby new diagnostics developed by individual scientists can quickly and routinely be deployed on the large scale.

Data documentation, provenance, and visualization
For CMIP6, a specific goal will be to use the analysis tools currently being developed and to execute them on the ESGF once CMIP6 model output is published to provide a comprehensive evaluation of model behaviour. To document the process and to ensure traceability and reproducibility of the evaluation tool results, a catalogue will be created, including all the relevant information about models, observations, and versions of the tools used for evaluation along with the date on which each script was run, the applied diagnostics and variables, and the corresponding references. In this way, a record of model evolution and performance through different CMIP phases would be preserved and tracked over time (see Fig. 6). In the long term, such an evaluation could be part of the publication workflow (Sect. 2.3).
The interpretation of the model evaluation results requires a precise understanding of a model's configuration and the experimental conditions. Although these requirements are not new for CMIP, the plan to carry out routine model evaluation increases the priority for enhancing documentation in these respects. In CMIP5, with over 1000 different model-experiment combinations, the first attempt was made to capture structured metadata describing the models and the simulations themselves. Based upon the Common Information Model (CIM; Lawrence et al., 2012), the European Metafor and US Earth System Curator projects worked together to provide tools to capture documentation of models and simulations. This effort is now continuing as part of the international ES-DOC activity, which defines common controlled vocabularies (CVs) that describe models, simulations, forcings, and conformance to MIP protocols. Information from this structured representation of models and experiments can be extracted to provide comparative views of differences across models. Feedback from the CMIP5 survey indicates that improvements in the methodology used to record model documentation consistent with the CIM are needed. These developments are currently underway and will be implemented in time for CMIP6. With the focus here on model evaluation, we anticipate expanding model documentation in the longer term to include metrics of the model scientific performance in order to help characterize the simulations.
In addition, proper data citation and provenance are required. Both model output and the observations serve as the basis for large numbers of scientific papers. It is recognized that sound science and due credit require (1) that data be cited in research papers to give appropriate credit to the data creator and (2) that the provenance of data be recorded to enable results to be verified. Although these requirements were recognized in CMIP5, an automated system to generate appropriate data citation information and provenance information remained immature. For CMIP6 the WIP encourages concerted efforts in this area to meet the growing demand for formal scientific literature to cite all datasets used. Visualization of the evaluation diagnostics and metrics generated by the tools is also envisaged for CMIP6; see also Sect. 2.3.

Current Earth system model evaluation approaches and scientific challenges
Establishing a more routine evaluation approach based on performance metrics and diagnostics that have been commonly used in ESM evaluation in the peer-reviewed literature will complement model evaluation analyses existing at each individual modelling group and will more rapidly allow modelling groups and users of CMIP output to identify strengths and weaknesses of the simulations in a shared and multimodel framework. This will constitute an important step forward that will help uncover some of the main characteristics of CMIP models. However, in order to fill some of the main long-standing scientific gaps around systematic biases in the models and the spread of the models' responses to external forcings as evident, for example, in the large spread in equilibrium climate sensitivity in CMIP5 models (Collins et al., 2013b), additional research is required so that more relevant performance tests can be developed that could at a later stage be added to the community tools. Unlike numerical weather prediction models, which can routinely be tested against observations on a daily basis, ESMs produce their own interannual variability and "weather", meaning that they cannot be compared with observations of a specific day, month, or year but rather only evaluated in a statistical sense over a longer, climate-relevant time period, except when they are run in offline mode and nudged towards, for example, observed meteorology (e.g. Righi et al., 2015). Confidence in ESMs relies on them being based on physical principles and able to reproduce many important aspects of observed climate (Flato et al., 2013). Assessing ESMs' performance is essential as they are used to understand historical and present-day climate and to make scenario-based projections of the Earth's climate over many decades and centuries. 
While significant progress has been made in ESM evaluation over the last decades, there are still many important scientific research opportunities and challenges for CMIP6 that will be addressed by the various CMIP6-Endorsed MIPs with the seven WCRP Grand Challenges as their scientific backdrop. Stouffer et al. (2016) summarize the main CMIP5 scientific gaps and here we review and discuss briefly only those scientific challenges specifically related to model evaluation.
A critical aspect in ESM evaluation is that, despite significant progress in observing the Earth's climate, the ability to evaluate model performance is often still limited by deficiencies or gaps in observations (Collins et al., 2013a; Flato et al., 2013). Additional investment in sustained observations is required, while at the same time some improvements can be made by fully exploiting existing observational data and by more thoroughly taking into account observational uncertainty so that model performance can be advanced. In addition, the comparability of models and observations will need to be further improved, for example, through the development of simulators that take into account the features of the specific instrument (Aghedo et al., 2011; Bodas-Salcedo et al., 2011; Jöckel et al., 2010; Santer et al., 2008; Schutgens et al., 2016). Model evaluations must also take into account the details of any model tuning (Hourdin et al., 2016; Mauritsen et al., 2012), which necessitates comprehensive information and documentation about what tuning went into setting up the model, so that evaluations can be cognizant of any consequences. ES-DOC will be collecting the relevant information to aid this process.
A wide variety of observational datasets, including the already identified ECVs (GCOS, 2010), can be used to assess the evolving climate state (e.g. means, trends, extreme events, and variability) on a range of temporal and spatial scales. Examples include the evaluation of the simulated annual and seasonal mean surface air temperature, precipitation rate, and cloud radiative effects (e.g. Figs. 9.2–9.5 of Flato et al., 2013). In evaluating the climate state, many studies are limited to the end result of the combined effects of all processes represented in CMIP simulations, and as determined by the prescribed boundary conditions, forcings, and other experiment specifications.
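A simple instance of such a climate-state comparison is the root-mean-square error of a simulated climatological field against an observational reference. The sketch below assumes both fields are already on the same regular latitude-longitude grid and uses the cosine of latitude as an approximate area weight; the function name and the nested-list data layout are illustrative choices and are not taken from any of the evaluation packages discussed here.

```python
import math


def area_weighted_rmse(model, obs, lats):
    """RMSE between two (lat, lon) fields given as nested lists.

    Each row corresponds to one latitude in `lats` (degrees) and is weighted
    by cos(latitude), an approximation to the relative grid-cell area.
    """
    weighted_sq_err = 0.0
    weight_sum = 0.0
    for model_row, obs_row, lat in zip(model, obs, lats):
        w = math.cos(math.radians(lat))
        for m, o in zip(model_row, obs_row):
            weighted_sq_err += w * (m - o) ** 2
            weight_sum += w
    return math.sqrt(weighted_sq_err / weight_sum)
```

A single number of this kind summarizes the net outcome of all simulated processes and says nothing by itself about missing values, observational uncertainty, or the cause of a bias.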
While a necessary part of model evaluation, a limitation of this approach is that it rarely reveals the extent to which compensating model errors might be responsible for any realistic-looking behaviour, and it often fails to reveal the origins of model biases. To learn more about the sources of errors and uncertainties in models and thereby highlight specific areas that require improvements, evaluation of the underlying processes and phenomena is necessary. This approach homes in on the sources of model errors by performing process- or regime-oriented evaluations (Bony et al., 2006, 2015; Eyring et al., 2005; Waugh and Eyring, 2008; Williams and Webb, 2009). Indeed, the metrics need to be sufficiently broad in scope in order to avoid tuning towards a small subset of metrics. As an example of broad metrics applied successfully in a process-based manner to models, we refer the reader to the SPARC CCMVal report (SPARC-CCMVal, 2010). Other targeted diagnostics can determine the extent to which specific phenomena (such as natural, unforced modes of climate variability like ENSO) are accurately represented by models (Bellenger et al., 2014; Guilyardi et al., 2009; Sperber et al., 2013).
Another long-standing open scientific question is the missing relation between model performance and future projections. While the evaluation of the evolving climate state and processes can be used to build confidence in model fidelity, this does not guarantee the correct response to changing forcings in the future. One strategy is to compare model results against palaeo-observations. The response of ESMs to forcings that have been experienced during, for example, the Last Glacial Maximum or the mid-Holocene can be assessed and compared with the observational palaeo-record (Braconnot et al., 2012; Otto-Bliesner et al., 2009). Another increasingly explored option is to identify apparent relationships across an ensemble of models, between some aspect of long-term Earth system sensitivity and an observable trend or variation in the current climate. Such relationships are termed "emergent constraints", referring to the use of observations to constrain a simulated future Earth system feedback. If physically plausible relationships can be found between, for example, changes occurring on seasonal or interannual timescales and changes found in anthropogenically forced climate change, then models that correctly simulate the seasonal or interannual responses could be considered more likely to make more reliable projections. For example, Hall and Qu (2006) used the observable variation in the seasonal cycle of the snow albedo as a proxy for constraining the unobservable feedback strength to climate warming, and Cox et al. (2013) and Wenzel et al. (2014) found a good correlation between the carbon cycle-climate feedback and the observable sensitivity of interannual variations in the CO2 growth rate to temperature variations in an ensemble of models, enabling the projections to be constrained with observations.
Other examples include constraints on the CO2 fertilization effect (Wenzel et al., 2016a), equilibrium climate sensitivity and clouds (Fasullo et al., 2015; Fasullo and Trenberth, 2012; Klein and Hall, 2015; Sherwood et al., 2014), the austral jet stream (Wenzel et al., 2016b), total column ozone (Karpechko et al., 2013), and sea ice (Mahlstein and Knutti, 2012; Massonnet et al., 2012). One should keep in mind, however, that the "emergent constraint" approach is based on relationships which are uncovered in models themselves. Moreover, we must rule out the possibility that some apparent relationship might simply occur by chance or because the representation of the underlying physics is too simplistic. The key is whether the processes underlying the constraints are understood and simple enough to likely govern changes on multiple timescales (Caldwell et al., 2014; Karpechko et al., 2013; Klocke et al., 2011). In addition, different studies should not lead to contradictory results but rather confirm each other. As the approach is fairly new, more work is needed to consolidate its applicability. Related to the topic of emergent constraints, more research is required to explore the value of weighting multi-model projections based on both model performance (e.g. Knutti et al., 2010) and model interdependence, as well as the statistical interpretation of the model ensemble (Tebaldi and Knutti, 2007).
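The basic arithmetic behind an emergent constraint can be sketched as an ordinary least-squares fit across the model ensemble between an observable quantity and a projected quantity, followed by evaluation of the fitted line at the observed value. The code below is a deliberately simplified illustration with hypothetical function names; it is not the method of any particular study cited above, and it omits the observational uncertainty and the scatter about the fit that a full analysis would need to account for.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def constrained_estimate(x_models, y_models, x_observed):
    """Evaluate the across-model regression line at the observed value.

    x_models: observable quantity as simulated by each model
    y_models: projected (unobservable) quantity from the same models
    x_observed: the same observable as estimated from observations
    """
    slope, intercept = fit_line(x_models, y_models)
    return slope * x_observed + intercept
```

Because the relationship is derived from the models themselves, a tight fit does not establish that the constraint is physically meaningful; the caveats discussed above apply in full.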
With the ever-expanding range of scientific questions and communities using CMIP output, model evaluation also needs to be expanded to develop more downstream, user-oriented diagnostics and metrics that are relevant for impact studies, such as statistics (e.g. frequency and severity) of extreme events that can potentially have a significant impact on ecosystems and human activities (e.g. Ciais et al., 2005), or diagnostics for water management (e.g. Sun et al., 2007) or the energy sector (e.g. Schaeffer et al., 2012).
In summary, there is a large demand for substantially more research in the area of ESM evaluation. The evaluation tools proposed here will support this by making established approaches more routine, thus leaving more time to develop innovative diagnostics targeting open scientific questions such as the ones discussed above, which will then be included in the system as research progresses.

Summary and discussion
We provide a viewpoint here that advocates the development of community evaluation tools and the associated infrastructure that as part of CMIP6 will enable increasingly systematic and efficient ESM evaluation. This is an improvement over the existing CMIP infrastructure, which mainly supports access to the data in the CMIP database. The initial goal is to make available in shared, common analysis packages a fairly comprehensive suite of performance metrics and diagnostics, including those that appeared in the IPCC's AR5 chapter on climate model evaluation (Flato et al., 2013). Over time, an expanding collection of performance metrics and diagnostics would be produced for successive model generations. These baseline measures of model performance, applied at the time new model results are archived, would also likely uncover obvious mistakes in data processing and metadata information, thereby providing an additional level of quality control on output submitted to the CMIP archive. Routine evaluation of the ESMs cannot and is not meant to replace cutting-edge and in-depth explorative multi-model analysis and research, in particular within the various CMIP6-Endorsed MIPs. Rather, the routine evaluation would complement CMIP research by providing comprehensive baseline documentation of broad aspects of model behaviour. Furthermore, the use of the broad set of diagnostics offered by the tools highlighted here also reduces the risk that model performance is tuned to a single or limited set of metrics.
Our experience with past MIPs has been that initially the threshold effort required for standardizing data output (CMORization) is perceived as an obstacle by many groups, but time and experience have shown that this effort is well worth it. We have found that only standardized data get widely used by the community, and the analysis of those data, especially by researchers outside the major modelling centres, has been central to CMIP's success. Once the output is collected in a common format, a more routine and systematic approach to model evaluation in CMIP has clear benefits for the scientific community. The recording of a set of informative diagnostics and metrics, along with publication of the model output itself and the model and simulation documentation, would enable anyone interested in CMIP model output to obtain a broad overview of model behaviour soon after the simulation has been published to the ESGF, and with a level of efficiency that was not possible before. The information would, for example, help the climate community to analyse the multi-model ensemble and would facilitate the comparison of models more generally. In addition, the diagnostic tools could also be run locally by individual modelling groups to provide an initial check of the quality of their simulations before submission to the ESGF, thereby accelerating the model development/improvement process. The ESMValTool (Eyring et al., 2016b) and the PMP are now available to directly run on CMIP6 model output and observations alongside the ESGF and will form the starting point for routine evaluation of CMIP6 models. An international strategy is required to organize and present results from these tools and to develop a set of performance metrics and diagnostics that are most relevant for climate change studies. The WGNE/WGCM Climate Model Diagnostics and Metrics Panel is in the process of defining such a strategy in collaboration with the CMIP Panel and the CMIP community.
Such a strategy should also propose a way to mitigate the risk of restricting the evaluation of models to a predefined set of (possibly rapidly aging) metrics, however comprehensive, or to a limited subset of models or model ensembles. It should, for instance, ensure that the definitions of performance and process-based metrics evolve as scientific knowledge progresses. This requires that the relevant science expert groups be involved in the development so that they can directly feed new metrics into the evaluation infrastructure.
Modelling centres now periodically produce and distribute data compliant with the CMIP data standards and conventions. These standards critically underpin the multi-model analyses that play an ever-increasing role in supporting and enabling climate science. Development of an analysis and evaluation framework requires ongoing maintenance and evolution of that existing infrastructure. Observational and reanalysis data are also produced now in accordance with well-defined specifications and are stored on ESGF data nodes as part of obs4MIPs and ana4MIPs. The modelling, observational, and reanalysis communities should continue to nurture these efforts and ensure that these datasets include documentation in the form of technical notes, uncertainty information, and any special guidance on how to use the observations to evaluate models. This encapsulates ongoing efforts of the WCRP's data advisory council. The effort devoted to conforming data to well-defined standards should pay off in the long term and lead to a better process-level understanding of the models and the Earth's climate system while fully exploiting existing observations.
With an eventual multi-model evaluation infrastructure established, we can look forward to revolutionary advancement in how climate models are evaluated. Specifically, results from a comprehensive suite of important climate characteristics should become available soon after simulations are made publicly available, with extensive documentation and workflow traceability. Moreover, modelling centres will be able to incorporate these codes into their own development-phase workflows to gain a more comprehensive understanding of the performance of new model versions. The infrastructure will enable groups of experts to develop and contribute both standard and novel analysis codes to community-developed diagnostic packages. The ongoing efforts to establish uniform standards across models and observations will lead to standard ways to develop and integrate codes across analysis packages and languages. Successful realization of these plans will require our community to make a long-term commitment to support the envisioned infrastructure. Moreover, the wider climate research community will need encouragement to contribute innovative analysis codes to augment the community tools already under development. The resulting suite of diagnostic codes will constitute a CMIP evaluation capability that is expected to evolve over time and be run routinely on CMIP model simulations. At the same time, continuous innovative scientific research on model evaluation is required if metrics and diagnostics are to be discovered that might help in narrowing the spread in future climate projections.

Data availability
The model output from CMIP simulations as well as observations and reanalyses for model evaluation from obs4MIPs and ana4MIPs are distributed through the Earth System Grid Federation (ESGF) with digital object identifiers (DOIs). The model output and obs4MIPs/ana4MIPs data are freely accessible through data portals after registration. An example of a dataset DOI can be found here: http://dx.doi.org/10.1594/WDCC/CMIP5.MXELpc. Additional observations used for model evaluation in the example plots shown here produced with the ESMValTool, NCAR CVDP, and PMP are described in the documentation papers of these tools.