3.5 Quantifying “these results should be almost the same”; for differences: how big is too big?
Gill, David, National Center for Atmospheric Research
A number of situations arise where there is an expectation of obtaining nearly identical results when comparing generated output with an exemplar of provided model data. For example, during a benchmark exercise it is unlikely that all users are conducting tests on the same architecture, operating system, or compiler; similarly, results are usually not bit-wise identical when upgrading to a newer version of a compiler or when changing optimization levels within a single compiler. As a possible solution, introducing random perturbations to an input file has been tested with the WRF and MPAS models. The idea is to gauge the level of departure of the artificially perturbed solution from the truth solution, and then compare that difference to other scenarios (contributions from other benchmark providers, output from a different architecture, or data produced with an enhanced optimization setting). With a small sample size, this technique has not provided sufficient fidelity to correctly identify known cases that we would want classified as “essentially the same” or as “fundamentally different”. However, an ANOVA (analysis of variance) technique applied to MPAS with just a few variables over distinct geographical areas is providing promising results.
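As a minimal sketch of one plausible reading of this approach, the Python fragment below compares a perturbed run against a reference run and applies a one-way ANOVA to the difference field grouped by geographic region. The file names, variable name, and region masks are illustrative assumptions, not part of the workflow described above; only the general technique (difference statistics plus analysis of variance across areas) follows the text.

```python
# Sketch: quantify how far a perturbed run departs from a reference run,
# then test (one-way ANOVA) whether the difference pattern varies by region.
import numpy as np
from netCDF4 import Dataset          # assumes netCDF model output
from scipy.stats import f_oneway     # one-way analysis of variance


def rms_difference(ref, test):
    """Root-mean-square difference between two fields of the same shape."""
    return float(np.sqrt(np.mean((np.asarray(ref) - np.asarray(test)) ** 2)))


def regional_anova(ref_file, test_file, varname, region_masks):
    """One-way ANOVA on (test - ref) differences grouped by geographic region.

    region_masks: dict mapping region name -> boolean mask over the field.
    Returns the F statistic and p-value; a small p-value suggests the
    differences are not uniform across regions.
    """
    with Dataset(ref_file) as ref_nc, Dataset(test_file) as test_nc:
        ref = np.asarray(ref_nc.variables[varname][:])
        test = np.asarray(test_nc.variables[varname][:])
    diff = test - ref
    groups = [diff[mask].ravel() for mask in region_masks.values()]
    return f_oneway(*groups)


# Hypothetical usage (file, variable, and mask names are placeholders):
# masks = {"tropics": tropics_mask, "midlat": midlat_mask, "polar": polar_mask}
# f_stat, p_val = regional_anova("reference.nc", "perturbed.nc", "theta", masks)
```

In practice, the same statistics would be computed for each comparison scenario (another provider's output, a different architecture, a higher optimization level) so that its departure from the reference can be judged against the spread produced by the random input perturbations.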