Parallel MM5 benchmarks, January-August 2000

http://www2.mmm.ucar.edu/mm5/mpp/helpdesk
Disclaimer
Some results on this page were contributed (contributors are thanked below). Contributed results are presented alongside results from runs we conducted ourselves, with the following caveat: reasonable efforts have been made to verify contributed results, including consulting with the contributors and inspecting configuration and raw benchmark result files. We have complete confidence in the integrity and competence of the contributors; however, we assume no responsibility for, nor make any claims regarding, the accuracy, veracity, or appropriateness for any purpose whatsoever of contributed results. Further, all results, whether contributed or generated by us, are for a single fixed-size MM5 case on the specific machines listed, and no claim is made of representativeness for other model scenarios, code versions, or hardware installations.

Click here for the newer Parallel MM5 2003 Benchmarks Page.
Click here for the newer Parallel MM5 2002 Benchmarks Page.
Click here for the newer Parallel MM5 2001 Benchmarks Page.


Click here to download the input data for this MM5 benchmark case.
For additional information on MM5 benchmarks, please click here.
For information on the MM5 distributed-memory parallel code, please click here.
Scroll down this page for additional explanation of the figures shown here.



Figure 1a. MM5 floating-point performance on various platforms. Some platforms are not shown (see the next figure).

Figure 1b. MM5 floating-point performance on various platforms (zoomed).


Figures 1a-b show performance results in Mflop/second and in simulated hours per wall-clock hour on a variety of platforms.

The AlphaServerSC/667 timings were completed on an AlphaServer configuration at Compaq. Each SMP node in the cluster contains four 667 MHz EV67 processors. The model was run straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over Quadrics for communication between nodes. Contributed. Thanks to Steve Leventer, Compaq. Note: these timings have been updated since first being posted 5/19/00; they are now all full-node timings (i.e., 4 CPUs per node). (6/26/00)

The Fujitsu VPP5000 is a distributed-memory machine with vector processors linked by a high-speed crossbar interconnect. The model was run using Fujitsu's implementation of MPI and with a one-dimensional data decomposition to preserve vector length in the I-dimension. This is an update of the earlier Fujitsu VPP5000 benchmarks first posted 1/5/00; the new data involve more processors and vector optimization of several key MM5 physics routines (optimizations that will be made available as part of the next release of MM5).
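
To illustrate why a one-dimensional decomposition preserves vector length, here is a minimal sketch (not the actual MM5 decomposition code) that splits the benchmark case's 136 x 112 horizontal grid only along the J (north/south) dimension, so every processor keeps full-length I (east/west) vectors. The grid dimensions are taken from the case description further down this page; the function name is ours, for illustration only.

    # A minimal sketch, not the actual MM5 code: split the 136 x 112 horizontal
    # grid along J only, so every processor keeps full-length I vectors.
    IM, JM = 136, 112                      # east/west, north/south cells (benchmark case)

    def decompose_1d(nproc):
        """Return (I length, local J rows) per processor for a 1-D decomposition."""
        base, extra = divmod(JM, nproc)
        rows = [base + (1 if p < extra else 0) for p in range(nproc)]
        return [(IM, nj) for nj in rows]   # I stays whole: vector length 136 everywhere

    print(decompose_1d(10))                # 10 PEs, each holding an 11- or 12-row slab of full-length rows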

The IBM SP WH2 timings were obtained on the NCAR/SCD IBM Winterhawk-II machine (four 375 MHz Power3 CPUs per node); this is the post-upgrade blackforest.ucar.edu. The model was run using four MPI tasks per node (no OpenMP). Note: these numbers have been updated since 5/19/00; we reran using the MP_SHARED_MEMORY environment variable, improving the 256-CPU time by approximately 14 percent. (5/22/00)

The HPTi ACL/667 timings were conducted on an Alpha Linux machine at the NOAA Forecast Systems Laboratory with a single 667 MHz Alpha processor per node. The model was run using MPI-over-Myrinet message passing between single-threaded processes, one per node. The code was compiled using Compaq compilers on Linux. Contributed. Thanks to Greg Lindahl, HPTi. (3/10/00)

The EV6 cluster is an eight-node configuration at NCAR/SCD with four 500 MHz Alpha processors per node. MM5 was run on this machine in several configurations using MPI over QSW between processes that were up to four-way multithreaded (OpenMP). (3/10/00)
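
As a purely illustrative aside (these layouts are not taken from the benchmark scripts), the hybrid MPI/OpenMP configurations possible on such a machine can be enumerated as below, assuming all 32 CPUs (8 nodes x 4 CPUs) are filled and the thread count per MPI process ranges from 1 to 4:

    # Illustrative only: enumerate MPI-task / OpenMP-thread layouts that fill
    # the 8-node, 4-CPU-per-node EV6 cluster described above.
    NODES, CPUS_PER_NODE = 8, 4

    def layouts():
        """Yield (mpi_tasks, omp_threads) pairs that use all 32 CPUs exactly."""
        total = NODES * CPUS_PER_NODE
        for threads in range(1, CPUS_PER_NODE + 1):
            if total % threads == 0:
                yield total // threads, threads

    for tasks, threads in layouts():
        print(f"{tasks:2d} MPI tasks x {threads} OpenMP thread(s) each")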

The Origin3000 400 MHz timings were obtained using MPI message passing. Because the curves bunch together at this scale, these results appear only in the zoomed figure and in the table below. Contributed. Thanks to Elizabeth Hayes, SGI. (3/14/01)

The Origin2000 300 MHz timings were obtained in dedicated (exclusive-access) mode using MPI message passing on 64- and 128-processor configurations of R12000 processors. The plot is drawn in color only for readability. Contributed. Thanks to Wesley Jones, SGI. (1/5/00)

The Origin2000 400 MHz timings were obtained in dedicated (exclusive-access) mode using MPI message passing on a 64-processor configuration of R12000 processors. The plot is drawn in color only for readability. Contributed. Thanks to Wesley Jones, SGI. (6/4/00)

The Beowulf EV56 cluster timings were obtained on the Centurion machine at the University of Virginia (part of the Legion project), which has a single Alpha processor per node. The model was run on this machine using MPI-over-Myrinet message passing between single-threaded processes, one per node. Thanks to Greg Lindahl, HPTi. (1999)

The IBM SP WH1 timings were obtained on the NCAR/SCD IBM Winterhawk-I machine (two 200 MHz Power3 CPUs per node); this is the pre-upgrade version of blackforest.ucar.edu. The model was run using MPI over the SP switch between two-way multithreaded (OpenMP) processes, one per node. (3/10/00)

The Pentium-III Beowulf timings were conducted on the Argonne National Laboratory "Chiba City" cluster (http://www.mcs.anl.gov/chiba). The Chiba City compute nodes are 100BaseT-Ethernet-connected dual 500 MHz Pentium-III machines with 512 MB of memory per node. The runs shown are straight MPI, using only one processor per node. The code was compiled using Portland Group compilers. (1/5/00)

The Cray T90 timings were obtained on the Cray T932 at the NOAA Geophysical Fluid Dynamics Laboratory. All timings were obtained in dedicated mode and represent elapsed time, except the one-processor timing, which was obtained in non-dedicated mode and represents CPU time. All runs were in shared-memory mode (Cray Microtasking, not MPI). The T90 runs were also used to determine the floating-point operation count on which the Mflop/second estimates are based. (1/5/00)

All runs were of a 36-kilometer-resolution domain over Europe; the grid consisted of 136 cells in the east/west dimension, 112 cells in the north/south dimension, and 33 vertical layers (approximately 503,000 cells). There is a link to the input data for this case at the top of this page. The operation count for this scenario is 2,398 million floating-point operations per average time step. I/O and model initialization were not included in the timings. All timing runs were performed at single (32-bit) floating-point precision except the T90. Scaling was measured as the speedup divided by the factor of increase in the number of processors; a worked sketch of this calculation appears after the list below. The results were as follows:

- Fujitsu VPP5000 (8/9/00), 1 to 40 CPU (2156 to 40638 Mflop/sec), 47 percent
- Fujitsu VPP5000 (8/9/00), 1 to 20 CPU (2156 to 27815 Mflop/sec), 77 percent
- Fujitsu VPP5000 (8/9/00), 1 to 10 CPU (2156 to 16650 Mflop/sec), 89 percent
- Fujitsu VPP5000 (1/5/00), 1 to 10 CPU (1512 to 11951 Mflop/sec), 79 percent
- Compaq AlphaServerSC/667, 4 to 512 CPU (1255 to 45317 Mflop/sec), 28 percent
- Compaq AlphaServerSC/667, 4 to 256 CPU (1255 to 32471 Mflop/sec), 40 percent
- Compaq AlphaServerSC/667, 4 to 128 CPU (1255 to 22015 Mflop/sec), 55 percent
- Compaq AlphaServerSC/667, 4 to 64 CPU (1255 to 11709 Mflop/sec), 58 percent
- HPTi ACL/667, 1 to 128 CPU (330 to 22960 Mflop/sec), 54 percent
- HPTi ACL/667, 1 to 64 CPU (330 to 12680 Mflop/sec), 60 percent
- HPTi ACL/667, 1 to 32 CPU (330 to 7900 Mflop/sec), 75 percent
- EV56 Alpha Beowulf, 4 to 64 CPU (575 to 5938 Mflop/sec), 65 percent
- EV6 Alpha Cluster, 1 to 32 CPU (266 to 6244 Mflop/sec), 73 percent
- SGI O2000 300 MHz, 1 to 120 CPU (158 to 15080 Mflop/sec), 80 percent
- SGI O2000 400 MHz, 4 to 64 CPU (804 to 12686 Mflop/sec), 98 percent (1)
- SGI O3000 400 MHz, 4 to 64 CPU (815 to 14045 Mflop/sec), 108 percent (1)
- IBM WH2, 4 to 256 CPU (674 to 24219 Mflop/sec), 55 percent
- IBM WH2, 4 to 128 CPU (674 to 16767 Mflop/sec), 77 percent
- IBM WH2, 4 to 64 CPU (674 to 9082 Mflop/sec), 83 percent
- IBM WH1, 2 to 128 CPU (155 to 8594 Mflop/sec), 87 percent
- Pentium-III Beowulf, 1 to 64 CPU (99 to 2858 Mflop/sec), 45 percent
- Pentium-III Beowulf, 1 to 32 CPU (99 to 1988 Mflop/sec), 63 percent
- Cray T90, 1 to 20 CPU (569 to 8472 Mflop/sec), 74 percent
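
As a cross-check, the scaling percentages above can be reproduced from the listed Mflop/second figures, and the Mflop/second figures themselves follow from the 2,398 Mflop-per-step operation count divided by the wall-clock seconds per average time step. The following minimal sketch illustrates both calculations; the function names are ours, chosen for illustration only:

    # Quantities from the text above; helper names are ours, for illustration only.
    MFLOP_PER_STEP = 2398.0                 # million floating-point operations per average time step

    def mflops(seconds_per_step):
        """Sustained Mflop/second for one run, from wall-clock seconds per average step."""
        return MFLOP_PER_STEP / seconds_per_step

    def scaling_efficiency(mflops_small, cpus_small, mflops_large, cpus_large):
        """Speedup divided by the factor of increase in the number of processors."""
        speedup = mflops_large / mflops_small
        return speedup / (cpus_large / cpus_small)

    # Example: the HPTi ACL/667 entry, 1 to 128 CPU (330 to 22960 Mflop/sec)
    print(f"{scaling_efficiency(330.0, 1, 22960.0, 128):.0%}")   # prints 54%, matching the list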

Performance and scaling of the same-source parallel MM5 are consistent with earlier, hand-parallelized implementations of the model.

Notes

(1) Linear or superlinear scaling. This does not necessarily mean superior scalability; rather, it may indicate that the runs with smaller numbers of processors were slowed by memory/cache effects.


John Michalakes, michalak@ucar.edu
Created: January 14, 2000
---
Updated: March 14, 2001. Added data for O3000 400MHz 1-64p.

Updated: October 18, 2000. Added Figure 1a, showing results out to 512 processors. This information was previously listed in the text but not shown in a figure.

Updated: August 9, 2000. New Fujitsu VPP5000 benchmarks.

Updated: June 26, 2000. Updated Compaq AlphaServerSC numbers with full-node timings.

Updated: June 5, 2000. Added Origin2000 400 MHz plot.

Updated: May 23-24, 2000. Noted that the Compaq AlphaServerSC numbers were partial-node timings and generated a separate slide for partial-node results.

Updated: May 22, 2000. Corrected IBM WH-II benchmarks; the set of benchmarks posted May 19, 2000 had been run without an important environment setting, MP_SHARED_MEMORY, which allows MPI processes on the same node to communicate through shared memory.