WRF V3 Parallel Benchmark Page

Click here for previous WRF benchmark pages.

Contents

Disclaimer

Introduction

Purpose:

Definitions:

Link to Benchmark Results

Running and Submitting Benchmark Results

Downloadable code, data and configuration

Instructions for compiling WRF

Running Benchmark: General

Modifications to code

Acceptance or rejection of benchmark submissions

Posting of results:

Benchmark Cases and Instructions

Single domain, medium size. 12km CONUS, Oct. 2001

Single domain, large size. 2.5 km CONUS,  June 4, 2005

 

Disclaimer

We make no claims regarding the accuracy or appropriateness of these results for any purpose. All results are for the fixed-size WRF cases described here, on the specific machines listed, and no claim is made of representativeness for other model scenarios, configurations, code versions, or hardware installations.

Introduction

Purpose:

To demonstrate computational performance and scaling of the WRF model on target architectures.  This is a measure of integration speed and, unless otherwise noted for a specific case, ignores I/O and initialization costs. The benchmarks are intended to provide a means for comparing the performance of different architectures and for comparing WRF computational performance and scaling with other models.

Definitions:

Performance is model speed, ignoring I/O and initialization cost, directly measured as the average cost per time step over a representative period of model integration. Performance is presented as simulation speed and floating-point rate when an operation count is available.

Floating-point rate provides a measure of efficiency relative to the theoretical peak capability of the computing system. Floating-point rate is the average number of floating-point operations per time step divided by the average number of seconds per time step over a representative model run.  Average floating-point operations per time step is determined by executing the test case over the integration period, obtaining an operations count from the hardware, and then dividing by the number of time steps in the integration period. Operations counts vary from system to system. Therefore, the minimum operations count over all systems measured is used for determining floating-point rate. Using a minimum avoids overstating the performance and efficiency of the WRF code. The average time per time step is the sum of the times for each time step in the integration period divided by the number of time steps. The first time step, which may include initialization and I/O, is discarded.

Simulation speed is a measure of actual time-to-solution. Simulation speed is the ratio of model time simulated to actual time. Simulation speed is defined as the ratio of the model time step to the average time per time step as defined above.
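The definitions above can be sketched in a few lines of Python. This is an illustration only: the function names and the sample timings are hypothetical, and the 72-second time step and the 30.1 billion operations per time step are borrowed from the 12km CONUS case described later on this page.

```python
from statistics import mean

def average_step_time(step_times):
    """Mean wall-clock seconds per time step, discarding the first step,
    which may include initialization and I/O."""
    if len(step_times) < 2:
        raise ValueError("need at least two time steps")
    return mean(step_times[1:])

def simulation_speed(model_time_step, avg_step_time):
    """Ratio of model time simulated to actual (wall-clock) time."""
    return model_time_step / avg_step_time

def floating_point_rate(ops_per_step, avg_step_time):
    """Floating-point operations per second of wall-clock time."""
    return ops_per_step / avg_step_time

# Hypothetical per-step timings in seconds; the first includes start-up cost.
times = [9.0, 2.0, 2.0, 2.0, 2.0]
avg = average_step_time(times)
speed = simulation_speed(72.0, avg)
gflops = floating_point_rate(30.1e9, avg) / 1e9
print(avg, speed, gflops)
```

With these made-up timings, the model simulates 36 seconds per second of wall-clock time.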

Scaling is the ratio of increase in simulation speed (or floating point rate) to the increase in the number of parallel processes.
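One possible reading of this definition, taking "increase" to mean a multiplicative factor relative to a baseline run, is sketched below. The function name and the measurements are hypothetical; a value of 1.0 would correspond to perfect (linear) scaling under this reading.

```python
def scaling(base_procs, base_speed, procs, speed):
    """Increase in speed divided by increase in process count,
    both taken as factors relative to the baseline run."""
    if base_procs <= 0 or procs <= 0 or base_speed <= 0:
        raise ValueError("process counts and baseline speed must be positive")
    return (speed / base_speed) / (procs / base_procs)

# Hypothetical runs: (parallel processes, simulation speed)
runs = [(16, 20.0), (32, 38.0), (64, 64.0)]
base_p, base_s = runs[0]
results = [(p, scaling(base_p, base_s, p, s)) for p, s in runs]
for p, value in results:
    print(p, round(value, 3))
```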

A parallel process is the independent variable of this experiment. It is the unit of parallelism that is scaled up or down when running WRF on a parallel system.

Link to Benchmark Results

Click here for 12 KM CONUS benchmark results pages.

Click here for 2.5 KM CONUS benchmark results pages.

Running and Submitting Benchmark Results

Anyone may download and run the benchmark code and datasets from this page for their own purposes or for the purpose of submitting results to us for posting, subject to the terms and conditions outlined below. This section provides information on downloading, building, and running WRF on the benchmark cases and on submitting candidate results for posting.

Downloadable code, data and configuration

The standard benchmark code is the v3.0 release of the WRF model. Click here to download.

Notes, fixes:

        None Currently

Instructions for compiling WRF

This section provides general instructions for building WRF.  Also refer to any instructions that may be listed with the specific benchmark case you are running (see Benchmark Cases and Instructions).

Download the standard WRF benchmark code from the link above. Unzip and untar: 

     gzip -c -d WRFV3.TAR.gz | tar xf -

creating the WRFV3 directory. 

WRF must be compiled with the NetCDF library in order to read the input data files provided with the benchmark. If NetCDF is not on your system, obtain and install it.  Click here for more information. NetCDF may need to be compiled using the same compilers as WRF.

Compiling WRF with a version of the Message Passing Interface (MPI) library may be needed on your system if you intend to run WRF in parallel distributed-memory mode. If it is not on your system, obtain and install MPI. Note that there are a number of implementations of MPI, some for specific interconnection networks or vendor systems. Click here for information on the freely available MPICH reference implementation of MPI from Argonne National Laboratory. MPI may need to be compiled with the same compilers as WRF. Note too, most build configurations for WRF assume that MPI provides the compile commands mpif90 and mpicc. These should be accessible through your shell command path when compiling WRF for MPI.

In the WRFV3 directory, run ./configure. On systems already supported by WRF, this script will present a list of compile options for your computer; select the option you wish to compile. A configure.wrf file will be created. This configure.wrf file is to be submitted with the benchmark results. It may be modified if desired. For other systems, it may be necessary to construct a configure.wrf file. Refer to the templates listed in the configure.defaults file in the arch directory.

Once a configure.wrf file has been generated, compile the code by typing 

    ./compile em_real

in the top-level WRFV3 directory.  The resulting executable is wrf.exe in the run directory. A symbolic link to the executable is created in test/em_real, where the model is typically run.

When reconfiguring the code to compile using a different configure option, type 

    ./clean -a

before issuing the configure and compile commands again.  The previous configure.wrf file will be deleted by clean -a.

For more information on compiling WRF, please see http://www.mmm.ucar.edu/wrf/users/docs/user_guide_V3/users_guide_chap5.htm.

Running Benchmark: General

Choose and download the case you wish to run from the next section.  Refer to additional case-specific information listed on this page or in a READ-ME.txt file which may be included in the tar file.

Unless otherwise directed, it is not necessary to edit the namelist.input file that is provided with the case; however, for distributed-memory parallel runs, the default domain decomposition over processors in X and Y can be overridden, if desired. Add nproc_x and nproc_y variables to the domains section of namelist.input and set these to the number of processors in X and Y, keeping in mind that the product of these two variables must equal the number of MPI tasks specified with mpirun minus the number of I/O processes (if any). 
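As a quick check of that constraint, the following sketch lists the (nproc_x, nproc_y) pairs whose product equals the number of compute tasks. The function name and the task counts are hypothetical; only the rule that the product must equal the MPI tasks minus the I/O processes comes from the text above.

```python
def valid_decompositions(mpi_tasks, io_tasks=0):
    """Return every (nproc_x, nproc_y) pair whose product equals the
    number of compute tasks, i.e. MPI tasks minus I/O tasks."""
    compute = mpi_tasks - io_tasks
    if compute < 1:
        raise ValueError("no compute tasks left after removing I/O tasks")
    return [(x, compute // x) for x in range(1, compute + 1) if compute % x == 0]

# 66 MPI tasks with 2 I/O processes leaves 64 compute tasks to decompose.
pairs = valid_decompositions(66, io_tasks=2)
print(pairs)
```

Any pair from the resulting list satisfies the product rule; which one performs best is something to determine by experiment on the system being benchmarked.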

The namelist.input file also provides control over the number and shape of the tiles WRF uses within each patch. On most systems, the default tiling will be only over the Y dimension; the X dimension is left undecomposed over tiles. This default behavior is defined in frame/module_machine.F in the routine init_module_machine. The default number of tiles per patch is 1 if the code is not compiled for OpenMP; otherwise it is the value returned by the OpenMP function omp_get_max_threads(). This default is defined in frame/module_tiles.F in the routine set_tiles2. The default number of tiles may be overridden by setting the variable numtiles to a value greater than 1 in the domains section of the namelist.input file.  Alternatively, the shape of a tile can be specified by setting tile_sz_x and tile_sz_y in the domains section of the namelist.input file. These will override both the default tiling and numtiles, if it is set.
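The order of precedence described in this paragraph can be summarized in a short sketch. This is not WRF code: the function name is hypothetical, and using None to mean "not set" (or "not compiled for OpenMP") is an assumption of the example rather than how WRF represents unset namelist variables.

```python
def resolve_tiling(openmp_threads=None, numtiles=None, tile_sz_x=None, tile_sz_y=None):
    """Return which tiling setting takes effect under the rules above.
    None means 'not set' (an assumption of this sketch)."""
    # An explicit tile shape overrides both numtiles and the default.
    if tile_sz_x is not None and tile_sz_y is not None:
        return ("tile_size", tile_sz_x, tile_sz_y)
    # numtiles overrides the default only when greater than 1.
    if numtiles is not None and numtiles > 1:
        return ("numtiles", numtiles)
    # Default: 1 without OpenMP, else the maximum OpenMP thread count.
    return ("numtiles", 1 if openmp_threads is None else openmp_threads)

print(resolve_tiling())
print(resolve_tiling(openmp_threads=8))
print(resolve_tiling(openmp_threads=8, numtiles=16))
print(resolve_tiling(openmp_threads=8, numtiles=16, tile_sz_x=20, tile_sz_y=10))
```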

The namelist.input file also provides an option for specifying additional MPI tasks to act as output servers. Unless the benchmark case specifically measures I/O performance, there is probably no benefit to doing so.

WRF runs are typically done in the run directory or in one of the subdirectories of the test directory. However, if desired, the wrf.exe program can be run in a different directory. Simply copy all of the files in or linked to from this directory to the directory you wish to run the model in.

Run the model according to the system-specific procedures for executing jobs on your system. The model will generate output data to files beginning with wrfout_d01 and, for distributed memory runs, a set of rsl.out.dddd and rsl.error.dddd files where dddd is the MPI task number. The domain decomposition information appears in the form of patch starting and ending indices at the beginning of each rsl.error.* file. For non-distributed memory runs, capture the output from the running program to a file to be turned in along with the benchmark results. 

Since one of the benchmark measures is scalability, you will typically run the model for a series of processor counts. Save the files from each of the runs. 

For more information on running WRF, please see http://www.mmm.ucar.edu/wrf/users/docs/user_guide_V3/users_guide_chap5.htm.

Modifications to code

Modifications to the WRF source code are allowed, but the modified source code and configure.wrf file should be returned when the benchmark is submitted. A description of the modifications and their purpose should be included.

Acceptance or rejection of benchmark submissions

Benchmark results may be rejected at our sole discretion. Submitters are advised to contact us first before preparing a benchmark submission. Reasons for rejection may include:

Posting of results:

Results will be updated on the page on a scheduled basis, not to exceed twice per year. We may, at our discretion, occasionally update on an as-received basis.

Benchmark Cases and Instructions

Single domain, medium size. 12km CONUS, Oct. 2001

Description: 48-hour, 12km resolution case over the Continental U.S. (CONUS) domain October 24, 2001 with a time step of 72 seconds. The benchmark period is hours 25-27 (3 hours), starting from a restart file from the end of hour 24 (provided). Click here for an animation.

Click here for input data: http://www.mmm.ucar.edu/WG2bench/conus12km_data_v3

Instructions: 

Download, configure, and compile the WRF code (see above). 

Obtain the input data for the benchmark case. 

Put the files from the distribution in the test/em_real directory or in another directory that contains all the files from test/em_real.

Use the namelist.input file from the benchmark distribution, not the one that comes with the WRF code. You should only edit the namelist.input file to change the decomposition (number of processors in X and Y), the tiling (for shared- or hybrid shared/distributed-memory runs), or the number of I/O processes.  

Run the model on a series of different numbers of processes and submit the following files from each run: 

Include with your submission a system description containing the following:

1.     Name of system (product name, hostname, institution)

2.     Model version of WRF (found in main README of distribution)

3.     Operating system and version

4.     Compiler and version

5.     Processor: manufacturer, type and speed; include cache sizes if known

6.     Cores per socket and sockets per node

7.     Main memory per core

8.     Interconnect: type (e.g. Infiniband, GigE), product name, and network topology (if known)

9.     Other relevant information

Please also submit a tar file of the WRFV3 source directory with any source code modifications you may have made. Only one such file is needed unless you used different versions of the code for different runs. Please run clean -a in the WRFV3 directory and delete any wrfinput, wrfbdy, wrfout, and any other extraneous large files before archiving. Gzip the tar file. It will not be larger than 10MB if you have cleaned and removed data files. It is only necessary to submit one tar file of the WRFV3 source directory if you are submitting benchmarks for more than one case and if you have not modified the source code from case to case.

Do not submit wrfout_d01 files; they are too large. Instead, please submit the output from diffwrf for the wrfout_d01 files generated by each of your runs. The input data contains a reference output file named wrfout_reference. Diffwrf will compare your output with the reference output file and generate difference statistics for each field that is not bit-for-bit identical to the reference output. Run diffwrf as follows: 

        diffwrf your_output wrfout_reference > diffout_tag

and return the captured output (tag is just a name you give the file to indicate which run the diffout file is for). The diffwrf program is distributed and compiled automatically with WRF. The executable file is external/io_netcdf/diffwrf.

Click here to see a sample diffout_tag file. This file was produced by comparing two output files for this benchmark generated on the same machine, one of which was generated with input perturbed in the lowest order bit to simulate the effect of floating point round-off error. The number of digits of agreement in your diffout_tag files should not differ significantly from that shown in this sample output.

Finally, please also submit a tar file of the WRFV3 source.

Addendum (20090724): If you would like to compute the GF/s from a benchmark run yourself, use the following command:

grep 'Timing for main' rsl.error.0000 | tail -149 | awk '{print $9}' | awk -f stats.awk

This command will output the average time per time step as the mean value. Simulation speed is the model time step, 72 seconds, divided by average time per time step. Gigaflops per second is simulation speed times 0.418 for this case based on a measured operation count of 30.1 billion floating point operations per time step (or simply the operation count divided by the average time per time step).
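The same calculation can be sketched in Python for readers who prefer it to the awk pipeline. Like the command above, it assumes that the elapsed seconds are the ninth whitespace-separated field of each 'Timing for main' line. The function name and the sample log lines are made up for illustration; for a real run, pass an open rsl.error.0000 file and keep the default of 149 lines.

```python
def mean_step_time(lines, last_n=149):
    """Mean of field 9 over the last `last_n` 'Timing for main' lines."""
    times = [float(line.split()[8]) for line in lines if "Timing for main" in line]
    times = times[-last_n:]
    if not times:
        raise ValueError("no 'Timing for main' lines found")
    return sum(times) / len(times)

# Made-up log lines; the first step is slower and falls outside the window.
sample = [
    "Timing for main: time 2001-10-25_00:01:12 on domain   1:    9.00000 elapsed seconds.",
    "Timing for main: time 2001-10-25_00:02:24 on domain   1:    2.50000 elapsed seconds.",
    "Timing for main: time 2001-10-25_00:03:36 on domain   1:    1.50000 elapsed seconds.",
]
avg = mean_step_time(sample, last_n=2)
speed = 72.0 / avg               # simulation speed for the 72-second time step
gflops = speed * 0.418           # factor quoted for this case
gflops_direct = 30.1 / avg       # billions of operations per step / seconds per step
print(avg, speed, gflops, gflops_direct)
```

The two GF/s figures differ only by the rounding of 30.1/72 to 0.418.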

Single domain, large size. 2.5 km CONUS,  June 4, 2005

Description: Latter 3 hours of a 9-hour, 2.5km resolution case covering the Continental U.S. (CONUS) domain June 4, 2005 with a 15 second time step.  The benchmark period is hours 6-9 (3 hours), starting from a restart file from the end of the initial 6 hour period. As an alternative, the model may be run 9 hours from cold start.

Click here for input data: http://www.mmm.ucar.edu/WG2bench/conus_2.5_v3/.  Be sure to read the READ-ME.txt file in that directory.

Instructions: 

Download, configure, and compile the WRF code (see above). 

Obtain the input data for the benchmark case. 

Read the README.BENCHMARK file in that distribution. 

Put the files from the distribution in the test/em_real directory or in another directory that contains all the files from test/em_real.

Use the appropriate namelist.input file from the benchmark distribution, not the one that comes with the WRF code. The namelist.input file you use will depend on whether you are doing the benchmark as a restart or a cold start. See README.BENCHMARK. You should only edit the namelist.input file to change the decomposition (number of processors in X and Y), the tiling (for shared- or hybrid shared/distributed-memory runs), or the number of I/O processes.  

Addendum 20090922: the namelist.input file from the benchmark distribution is set to use parallel NetCDF (http://trac.mcs.anl.gov/projects/parallel-netcdf). If you prefer to use regular NetCDF, change the io_form_* settings in the time_control section of the namelist.input file from 11 to 2.

Run the model on a series of different numbers of processes and submit the following files from each run:

Include with your submission a system description containing the following:

1.     Name of system (product name, hostname, institution)

2.     Model version of WRF (found in main README of distribution)

3.     Operating system and version

4.     Compiler and version

5.     Processor: manufacturer, type and speed; include cache sizes if known

6.     Cores per socket and sockets per node

7.     Main memory per core

8.     Interconnect: type (e.g. Infiniband, GigE), product name, and network topology (if known)

9.     Other relevant information

Finally, please also submit a tar file of the WRFV3 source directory with any source code modifications you may have made. Only one such file is needed unless you used different versions of the code for different runs. Please run clean -a in the WRFV3 directory and delete any wrfinput, wrfbdy, wrfout, and any other extraneous large files before archiving. Gzip the tar file. It will not be larger than 10MB if you have cleaned and removed data files. It is only necessary to submit one tar file of the WRFV3 source directory if you are submitting benchmarks for more than one case and if you have not modified the source code from case to case.

Do not submit wrfout_d01 files; they are too large. Instead, please submit the output from diffwrf for the wrfout_d01 files generated by each of your runs. The input data contains a reference output file named wrfout_reference. Diffwrf will compare your output with the reference output file and generate difference statistics for each field that is not bit-for-bit identical to the reference output. Run diffwrf as follows: 

        diffwrf your_output wrfout_reference > diffout_tag

and return the captured output (tag is just a name you give the file to indicate which run the diffout file is for). The diffwrf program is distributed and compiled automatically with WRF. The executable file is external/io_netcdf/diffwrf. 

Click here to see a sample diffout_tag file for this case. This file was produced by comparing two output files for this benchmark generated on the same machine, one of which was generated with input perturbed in the lowest order bit to simulate the effect of floating point round-off error. The number of digits of agreement in your diffout_tag files should not differ significantly from that shown in this sample output.

Addendum (20090724): If you would like to compute the GF/s from a benchmark run yourself, use the following command:

grep 'Timing for main' rsl.error.0000 | tail -149 | awk '{print $9}' | awk -f stats.awk

This command will output the average time per time step as the mean value. Simulation speed is the model time step, 15 seconds, divided by average time per time step. Gigaflops per second is simulation speed times 27.45 for this case based on an operation count of 411.7 billion floating point operations per time step (or simply the operation count divided by the average time per time step).

 

 

Page created (6/18/08) by Chris Eldred, Pittsburgh Supercomputing Center

Maintained by John Michalakes, NCAR