WRF Version 2 Parallel Benchmark Page

 

Contents

 

Introduction
Disclaimer
Purpose for the benchmark
Definitions
Counting Parallel Processes
Link to Benchmark Results
Running and submitting benchmark results
Downloadable code, data, configuration
Instructions for compiling WRF
Running Benchmark: General
Benchmark Cases and Instructions
1.  Single domain, medium size. 12km CONUS, Oct. 2001
2.  Single domain, large size. 2.5 km CONUS, June 4, 2005
3.  Nested case, medium. Hurricane Ivan 12km/4km moving nest, September, 2004

 

Introduction

Disclaimer

 

We make no claims regarding the accuracy or appropriateness of these results for any purpose. All results are for the fixed-size WRF cases described here, on the specific machines listed, and no claim is made of representativeness for other model scenarios, configurations, code versions, or hardware installations.

Purpose for the benchmark

 

To demonstrate the computational performance and scaling of the WRF model on target architectures. The benchmark measures integration speed and, unless otherwise noted for a specific case, ignores I/O and initialization costs. The benchmarks are intended to provide a means of comparing the performance of different architectures and of comparing WRF computational performance and scaling with other models.

Definitions

 

Performance is model speed, ignoring I/O and initialization costs. It is measured directly as the average cost per time step over a representative period of model integration and is presented both as a normalized floating-point rate and as a simulation speed.

 

A representative period of model integration should be the smallest period that (1) includes all of the different types of time step in the proportions in which they occur in a simulation of any length, (2) provides enough complete sequences of time steps to reasonably represent performance variability, stemming both from the varying state of the atmosphere being simulated and from operational variability of the computer system itself, and (3) is far enough into a simulation to be considered spun up. The representative period is specified in the instructions for running each benchmark case.

 

Floating-point rate provides a measure of efficiency relative to the theoretical peak capability of a computing system. It is the average number of floating-point operations per time step divided by the average number of seconds per time step. The average number of floating-point operations per time step is determined by executing the test case over the integration period, counting the operations using the vendor's hardware counters, and dividing by the number of time steps in the integration period. The minimum operation count over all systems measured is used when computing the floating-point rate; using the minimum avoids overstating the performance and efficiency of the WRF code. The average time per time step is the sum of the times for each time step in the integration period divided by the number of time steps.
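As a worked illustration (the timing is hypothetical; the operation count is the one quoted for the 12km CONUS case below): a case costing about 22 billion floating-point operations per average time step, run on a system averaging 2 seconds of wall-clock time per step, would be reported at roughly 22 x 10^9 / 2 = 11 Gflop/s.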

 

Simulation speed is a measure of actual time-to-solution: the ratio of model time simulated to wall-clock time. It is defined as the ratio of the model time step (Δt) to the average time per time step as defined above.
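As a worked illustration (the timing is hypothetical): the 12km CONUS case below uses a 72-second time step, so a system averaging 2 seconds of wall-clock time per step achieves a simulation speed of 72 / 2 = 36, i.e., 36 hours of model time simulated per hour of wall-clock time.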

 

Scaling is the ratio of increase in simulation speed (or floating point rate) to the increase in the number of parallel processes.
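For example (hypothetical numbers): if doubling the number of parallel processes from 64 to 128 raises the simulation speed from 36 to 64.8, the scaling over that range is (64.8 / 36) / (128 / 64) = 0.9, i.e., 90 percent parallel efficiency.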

 

A parallel process is the independent variable of this experiment. It is the unit of parallelism that is scaled up or down when running WRF on a parallel system. There is more concerning the definition of a parallel process in the discussion that follows.

Counting Parallel Processes

                     

As noted above, the independent variable for this benchmark is the number of parallel processes. In light of the continuing evolution and increasing diversity of high-performance computing hardware, it is important to define what is being counted as a process for this benchmark: a parallel process is the finest-grained sequence of instructions and associated state that produces a separable and disjoint part of the solution. Typically, the number of processes is the number of WRF tiles executing in parallel during a given run.

 

Some examples:

 

·            The number of parallel processes running WRF on a number of single-threaded parallel MPI tasks is that number of tasks, since each task can run no more than a single tile in parallel.

·            If, in the previous example, the parallel MPI tasks each run multiple OpenMP threads, the number of processes is the total number of parallel OpenMP threads over all MPI tasks.

·            In the case of finer-grained levels of parallelism such as “multi-streaming”, the number of processes is the number of such streams, provided it is possible at the program or user level to decompose the computation over WRF tiles.

·            The number of processes running WRF on a shared-memory, ccNUMA, or multi-streaming system is the number of threads or streams used, whether the parallelization is programmer-specified or determined automatically by a parallelizing compiler.

·            Arithmetic units, vector pipes, individual stages of a pipelined processor, and multi- or “hyper-threaded” hardware threads within a processor are not processes, since they cannot be individually assigned to a tile.

·            The processing elements of a SIMD machine are processes because each element executes a “tile” that is one cell of the domain.

                                                                                                                                           

Link to Benchmark Results

 

Click here for benchmark results pages.

 

Running and submitting benchmark results

 

Anyone may download and run the benchmark code and datasets from this page for their own purposes or for the purpose of submitting results to us for posting, subject to the terms and conditions outlined below. This section provides information on downloading, building, and running WRF on the benchmark cases and on submitting candidate results for posting.

Downloadable code, data, configuration

                          

The standard benchmark code is the WRF v2.1.1 release of the model. Click here to download.

 

Notes, fixes:

 

1.      November 16, 2005. Potential divide-by-zero in the diffwrf utility (NetCDF and binary versions). This does not affect the model itself, but it may cause problems when running diffwrf to generate output comparisons for the benchmark. Click here for discussion and fix.

2.      December 30, 2005. Model hangs for 2.5km CONUS case when writing restarts if nio_tasks_per_group is greater than zero in the namelist.input file.  Click here for discussion and fix.

                                                                                                                                                                                                                                                                  

Instructions for compiling WRF

 

This section provides general instructions for building WRF.  Also refer to any instructions that may be listed with the specific benchmark case you are running (see Benchmark Cases and Instructions).

 

Download the standard WRF benchmark code from the link above. Unzip and untar:

 

    gzip -c -d WRFV2.1.1.TAR.gz | tar xf -

 

creating the WRFV2 directory.

 

WRF must be compiled with the NetCDF library in order to read the input data files provided with the benchmark. If NetCDF is not on your system, obtain and install it. Click here for more information. NetCDF may need to be compiled with the same compilers as WRF.
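The WRF configure step typically locates the library through the NETCDF environment variable, which should point to the top of the NetCDF installation (the directory containing the include and lib subdirectories). The path below is only a placeholder for your own installation:

    export NETCDF=/usr/local/netcdf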

 

Compiling WRF with a Message Passing Interface (MPI) library is needed if you intend to run WRF in distributed-memory parallel mode. If MPI is not on your system, obtain and install it. Note that there are a number of implementations of MPI, some for specific interconnection networks or vendor systems. Click here for information on the freely available MPICH reference implementation of MPI from Argonne National Laboratory. MPI may need to be compiled with the same compilers as WRF. Note, too, that most build configurations for WRF assume that MPI provides the compile commands mpif90 and mpicc; these should be accessible through your shell command path when compiling WRF for MPI.
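Before configuring, it can be useful to confirm that the MPI compile commands are visible on your path (assuming a Unix-style shell):

    which mpif90 mpicc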

 

In the WRFV2 directory, run ./configure. On systems already supported by WRF, this script will present a list of compile options for your computer; select the option you wish to compile. A configure.wrf file will be created. This configure.wrf file is to be submitted with the benchmark results. It may be modified if desired. For other systems, it may be necessary to construct a configure.wrf file by hand; refer to the templates listed in the configure.defaults file in the arch directory.

 

On some systems, the configure script will offer the option of compiling with either the RSL or RSL_LITE communication library if you are compiling for distributed memory parallel execution. WRF domains larger than 1024 cells in either horizontal dimension require RSL_LITE. Either will work for the cases on this benchmark page. Neither package is required if you intend to run without distributed memory parallelism.

 

Once a configure.wrf file has been generated, compile the code by typing

 

   ./compile em_real

 

in the top-level WRFV2 directory.  The resulting executable is wrf.exe in the run directory. A symbolic link to the executable is created in test/em_real, where the model is typically run.

 

When reconfiguring the code to compile using a different configure option, type

 

   ./clean -a

 

before issuing the configure and compile commands again.  The previous configure.wrf file will be deleted by clean -a.

 

Running Benchmark: General

 

Choose and download the case you wish to run from the next section.  Refer to additional case-specific information listed on this page or in a README file that may be included in the tar file.

 

Unless otherwise directed, it is not necessary to edit the namelist.input file that is provided with the case; however, for distributed memory parallel runs, the default domain decomposition over processors in X and Y can be overridden, if desired. Add nproc_x and nproc_y variables to the domains section of namelist.input and set these to the number of processors in X and Y, keeping in mind that the product of these two variables must equal the number of MPI tasks specified with mpirun minus the number of I/O processes (if any).
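For example (the values are illustrative only), a run with 64 MPI compute tasks and no I/O processes could be decomposed 4 ways in X and 16 ways in Y by adding the following two lines inside the existing &domains section of namelist.input:

    nproc_x = 4,
    nproc_y = 16,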

 

The namelist.input file also provides control over the number and shape of the tiling WRF uses within each patch. On most systems, the default tiling will be only over the Y dimension; the X dimension is left undecomposed over tiles. This default behavior is defined in frame/module_machine.F in the routine init_module_machine. The default number of tiles per patch is 1 if the code is not compiled for OpenMP or the value of the OpenMP function omp_get_max_threads(). This default is defined in frame/module_tiles.F in the routine set_tiles2. The default number of tiles may be overridden by setting the variable numtiles to a value greater than 1 in the domains section of the namelist.input file.  Or the default shape of a tile can be specified by setting tile_sz_x and tile_sz_y in the domains section of the namelist.input file. These will override both the default tiling and numtiles if it is set.
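For example (the value is illustrative and should match the number of OpenMP threads per MPI task on your system), a hybrid MPI/OpenMP run using 8 threads per task could make the tiling explicit by adding the following line to the &domains section of namelist.input:

    numtiles = 8,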

 

The namelist.input file also provides an option for specifying additional MPI tasks to act as output servers. Unless the benchmark case specifically measures I/O performance, there is probably no benefit to doing so.
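If you do choose to experiment with output servers, the quilting settings have their own section of namelist.input. As we recall from the WRFV2 release (check the namelist.input supplied with your case), that section looks approximately like the following, where zero I/O tasks per group disables the feature:

    &namelist_quilt
     nio_tasks_per_group = 0,
     nio_groups = 1,
    /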

 

WRF runs are typically done in the run directory or in one of the subdirectories of the test directory. However, if desired, the wrf.exe program can be run in a different directory; simply copy all of the files in, or linked to from, the run directory into the directory in which you wish to run the model.

 

Run the model according to the system-specific procedures for executing jobs on your system. The model will generate output data to files beginning with wrfout_d01 and, for distributed memory runs, a set of rsl.out.dddd and rsl.error.dddd files where dddd is the MPI task number. You can use either RSL or RSL_LITE for distributed memory runs (RSL_LITE is recommended). When running with RSL, a file show_domain_0000 will also be generated, which contains decomposition information. With RSL_LITE the decomposition information appears in the form of patch starting and ending indices at the beginning of each rsl.error.* file. For non-distributed memory runs, capture the output from the running program to a file to be turned in along with the benchmark results.

                     

Since one of the benchmark measures is scalability, you will typically run the model for a series of processor counts. Save the files from each of the runs.
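A minimal sketch of such a series of runs is shown below. The mpirun launch command, the process counts, and the per-count results directories are illustrative assumptions; substitute the job-submission procedure appropriate for your system and scheduler:

    for np in 16 32 64 128 ; do
        mpirun -np $np ./wrf.exe
        mkdir -p results.$np
        mv rsl.out.* rsl.error.* results.$np/
    done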

 

Modifications to code. Modifications to the WRF source code are allowed, but the modified source code and configure.wrf file should be returned when the benchmark is submitted. A description of the modifications and their purpose should be included.

 

Acceptance or rejection of benchmark submissions: Benchmark results may be rejected at our sole discretion. Submitters are advised to contact us first before preparing a benchmark submission. Reasons for rejection may include:

 

·            Incomplete submissions

·            Incorrect, suspect, or unrealistic results

·            Non-uniqueness or unwarranted duplication of existing results

·            Lack of relevance to WRF user community

 

Posting of results: Results will be updated on the page on a scheduled basis, not to exceed twice per year. We may, at our discretion, occasionally update on an as-received basis.

 

 

Benchmark Cases and Instructions

                                                      

1.        Single domain, medium size. 12km CONUS, Oct. 2001

 

Description: a 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain, beginning October 24, 2001, that uses the Eulerian Mass (EM) dynamics. The computational cost for this domain is about 22 billion floating-point operations per average time step (72 seconds). The benchmark period is hours 25-27 (3 hours), starting from a restart file written at the end of hour 24 (provided). Click here for an animation. Click here for input data: http://box.mmm.ucar.edu/wrf/WG2/benchv2/conus12km_2001.

 

Instructions:

 

Download, configure, and compile the WRF code (see above).

 

Obtain the input data for the benchmark case.

 

Read the README.BENCHMARK file included with that distribution.

 

Put the files from the distribution in the test/em_real directory or in another directory that contains all the files from test/em_real. 

 

Use the namelist.input file from the benchmark distribution, not the one that comes with the WRF code. You should only edit the namelist.input file to change the decomposition (number of processors in X and Y), the tiling (for shared- or hybrid shared/distributed-memory runs), or the number of I/O processes. 

 

Run the model on a series of different numbers of processes and submit the following files from each run:

 

·            namelist.input

·            configure.wrf

·            Either:

o          rsl.error.0000 and rsl.out.0000 (distributed memory parallel)

o          terminal output redirected from wrf.exe (non-distributed memory)

·            diffout_tag (from wrfout_d01_2001-10-25_03:00:00; see below)

 

Also submit a tar file of the WRFV2 source directory with any source code modifications you may have made. Only one such file is needed unless you used different versions of the code for different runs. Please run clean -a in the WRFV2 directory and delete any wrfinput, wrfbdy, wrfout, and other extraneous large files before archiving. Gzip the tar file; it should not be larger than 10 MB if you have cleaned and removed data files. It is only necessary to submit one tar file of the WRFV2 source directory if you are submitting benchmarks for more than one case and have not modified the source code from case to case.
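After deleting the large data files described above, the archive can be prepared roughly as follows (the archive name is only an example):

    cd WRFV2
    ./clean -a
    cd ..
    tar cf WRFV2_source.tar WRFV2
    gzip WRFV2_source.tar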

 

Do not submit wrfout_d01 files; they are too large. Instead, please submit the output from diffwrf for the wrfout_d01 files generated by each of your runs. The input data contains a reference output file named wrfout_reference. Diffwrf will compare your output with the reference output file and generate difference statistics for each field that is not bit-for-bit identical to the reference output. Run diffwrf as follows:

 

diffwrf your_output wrfout_reference > diffout_tag

 

and return the captured output. The diffwrf program is distributed and compiled automatically with WRF. The executable file is external/io_netcdf/diffwrf.

 

Click here to see a sample diffout_tag file. This file was produced by comparing two output files for this benchmark generated on the same machine, one of which was generated with input perturbed in the lowest-order bit to simulate the effect of floating-point round-off error. The number of digits of agreement in your diffout_tag files should not differ significantly from that shown in this sample output.

 

 

2.        Single domain, large size. 2.5 km CONUS,  June 4, 2005

 

Description: the latter 3 hours of a 6-hour, 2.5km resolution case covering the Continental U.S. (CONUS) domain on June 4, 2005, using the Eulerian Mass (EM) dynamics with a 15-second time step. The benchmark period is hours 3-6 (3 hours), starting from a restart file written at the end of the initial 3-hour period. As an alternative, the model may be run for the full 6 hours from a cold start.

 

Click here for input data: http://www.mmm.ucar.edu/wrf/WG2/benchv2/conus2.5km_2005.

 

Instructions:

 

Download, configure, and compile the WRF code (see above).

 

Obtain the input data for the benchmark case.

 

Read the README.BENCHMARK file included with that distribution.

 

Put the files from the distribution in the test/em_real directory or in another directory that contains all the files from test/em_real. 

 

Use the appropriate namelist.input file from the benchmark distribution, not the one that comes with the WRF code. The namelist.input file you use will depend on whether you are running the benchmark from a restart or from a cold start; see README.BENCHMARK. You should only edit the namelist.input file to change the decomposition (number of processors in X and Y), the tiling (for shared- or hybrid shared/distributed-memory runs), or the number of I/O processes. Note: using exactly one I/O process with this large case will not work; the code hangs. A fix is in progress for the next WRF release. For this benchmark, use either no I/O processes or more than one [JM, 2005-11-09].

 

Run the model on a series of different numbers of processes and submit the following files from each run:

 

·            namelist.input

·            configure.wrf

·            Either:

o          rsl.error.0000 and rsl.out.0000 (distributed memory parallel)

o          terminal output redirected from wrf.exe (non-distributed memory)

·            diffout_tag (from wrfout_d01_2005-06-04_06:00:00; see below)

 

Also submit a tar file of the WRFV2 source directory with any source code modifications you may have made. Only one such file is needed unless you used different versions of the code for different runs. Please run clean -a in the WRFV2 directory and delete any wrfinput, wrfbdy, wrfout, and other extraneous large files before archiving. Gzip the tar file; it should not be larger than 10 MB if you have cleaned and removed data files. It is only necessary to submit one tar file of the WRFV2 source directory if you are submitting benchmarks for more than one case and have not modified the source code from case to case.

 

Do not submit wrfout_d01 files; they are too large. Instead, please submit the output from diffwrf for the wrfout_d01 files generated by each of your runs. The input data contains a reference output file named wrfout_reference. Diffwrf will compare your output with the reference output file and generate difference statistics for each field that is not bit-for-bit identical to the reference output. Run diffwrf as follows:

 

diffwrf your_output wrfout_reference > diffout_tag

 

and return the captured output. The diffwrf program is distributed and compiled automatically with WRF. The executable file is external/io_netcdf/diffwrf.

 

Click here to see a sample diffout_tag file for this case. This file was produced by comparing two output files for this benchmark generated on the same machine, one of which was generated with input perturbed in the lowest-order bit to simulate the effect of floating-point round-off error. The number of digits of agreement in your diffout_tag files should not differ significantly from that shown in this sample output.

 

3.            Nested case, medium.  Hurricane Ivan 12km/4km moving nest, September, 2004.

 

Description: a 5-day forecast of Hurricane Ivan beginning at 00Z September 11, 2004. Both the 12km coarse domain and the 4km moving nested domain are approximately the same size as the CONUS benchmark above; therefore, this is classified as a “medium” case. However, the length of the run and the fact that the two domains run synchronously and interact in two-way fashion make it a very costly simulation. Click here for input data. Sample output from a run is viewable at http://www.mmm.ucar.edu/wrf/WG2/ivan_5day.gif. Some additional features of this simulation are:

 

1)      Nest moves automatically to keep the vortex centered within the 4km domain.

2)      Excessive vertical motion may trigger automatic damping (CFL messages may appear on some processors).

3)      The nested domain is initialized solely from interpolated coarse domain data (hi-resolution ingest is not used in this benchmark since the topography files are very large and because the ingest code has not been generally released to the WRF user community).

4)      In addition to nearest neighbor communication for finite-differencing on individual domains, the run also exercises MPI_Alltoall and MPI_Gather communications for the nest forcing and feedback.

5)      Model output is kept to a minimum. Each of the domains writes history only every 24 hours. This cost is included in the benchmark, however.

 

 

Instructions:

 

Download, configure, and compile the WRF code (see above).  Note that to use nesting, the code must be compiled with either RSL or RSL_LITE. Be sure to select one of these options from the ./configure script. Hint: our preliminary tests on IBM systems have shown that RSL_LITE was faster than RSL for this case, but this is a preliminary result and may differ on other systems/installations.

 

Obtain the input data for the benchmark case.

 

Read the README.BENCHMARK file included with the distribution.

 

Put the files from the distribution in the test/em_real directory or in another directory that contains all the files from test/em_real. 

 

Use the namelist.input file from the benchmark distribution, not the one that comes with the WRF code. You should only edit the namelist.input file to change the decomposition (number of processors in X and Y), the tiling (for shared- or hybrid shared/distributed-memory runs), or the number of I/O processes (not typically needed for this case, which does not measure I/O performance). 

 

Run the model on a series of different numbers of processes and submit the following files from each run:

 

·            namelist.input

·            configure.wrf

·            Either:

o          rsl.error.0000 and rsl.out.0000 (distributed memory parallel)

o          terminal output redirected from wrf.exe (non-distributed memory)

·            The file: tracking.txt (see below)

 

Also submit a tar file of the WRFV2 source directory with any source code modifications you may have made. Only one such file is needed unless you used different versions of the code for different runs. Please run clean -a in the WRFV2 directory and delete any wrfinput, wrfbdy, wrfout, and other extraneous large files before archiving. Gzip the tar file; it should not be larger than 10 MB if you have cleaned and removed data files. It is only necessary to submit one tar file of the WRFV2 source directory if you are submitting benchmarks for more than one case and have not modified the source code from case to case.

                                                                                                                            

For the Hurricane Ivan benchmark, track information is used as a measure of correctness. The track data is written to the rsl.error.0000 file (or to the terminal output) for the run. It is extracted with grep 'ATCF' rsl.error.0000 > tracking.txt. Click here for sample tracking data.