Click here to return to http://www2.mmm.ucar.edu/mm5/mpp/helpdesk.
Some results on this page are contributed (as thanked below). Contributed results are provided on this page along with results from runs we conducted ourselves with the following caveat: Reasonable efforts have been made to verify contributed results, including consulting with the contributors and inspection of configuration and raw benchmark result files. We have complete confidence in the integrity and competence of the contributors; however, we assume no responsibility for nor make any claims regarding the accuracy, veracity, or appropriateness for any purpose whatsoever of contributed results. Further, all results, whether contributed or generated by us, are for a single fixed-size MM5 case on the specific machines listed, and no claim is made of representativeness for other model scenarios, code versions, or hardware installations. The explanatory text, written in consultation with vendor/contributors, is the work and responsibility of John Michalakes (michalak@ucar.edu).
NOTE: Some older results have been discarded in this September 2003 revision of the MM5 benchmarks page.
Press here for the older (pre-September 2003) version of the Parallel MM5 2003 Benchmarks Page.
Press here for the older Parallel MM5 2002 Benchmarks Page.
Press here for the older Parallel MM5 2001 Benchmarks Page.
Press here for the older Parallel MM5 2000 Benchmarks Page.
Click here to download the input data for this MM5 benchmark case.
For additional information on MM5 benchmarks, please click here.
For information on the MM5 distributed-memory parallel code, please click here.
Scroll down this page for additional explanation of the figures shown here.
Figure 1a. MM5 floating-point performance on various platforms. (1) See notes below. (Updated October 30, 2003).
Figure 1b. MM5 floating-point performance on various platforms (zoomed). (1) See notes below. (Updated October 30, 2003).
All runs were of a 36-kilometer resolution domain over Europe; the grid consisted of 136 cells in the east/west dimension, 112 north/south, and 33 vertical layers (503,000 cells). The time step is 81 seconds. There is a link to the input data for this case at the top of this page. The operation count for this scenario is 2,398 million floating point operations per average time step. I/O and model initialization were not included in the timings. All timing runs were performed at single (32-bit) floating-point precision. Scaling is calculated as the speedup divided by the factor of increase in the number of processors.
Figures 1a-b show performance results in Mflop/second and in simulated hours per hour on a variety of platforms.
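Because the time step (81 seconds) and operation count (2,398 Mflop per average step) are fixed for this case, the two measures are directly related: a run sustaining R Mflop/sec spends about 2398/R seconds of wall clock per model time step and therefore simulates roughly 81 x R / 2398 model hours per wall-clock hour. At 6587 Mflop/sec, for example, that works out to roughly 220 simulated hours per hour.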
The Cray X1 timings were obtained on a 128-CPU, 800 MHz X1 system. The x-axis is the number of single-streaming processors (SSPs) used for each run. Contributed; thanks Peter Johnsen, Cray, and Tony Meys, Army High Performance Computing Research Center (10/2003).
Timings for the Pittsburgh Supercomputing Center Terascale Computing System were conducted by J. Michalakes on October 20-21, 2001; additional runs Oct. 26 and 29. Thanks to Ralph Roskies, Sergiu Sanielevici, Roberto Gomez, and others at PSC. The 6 TFlop/s peak TCS comprises 3000 1 GHz HP Alpha EV68 processors (750 ES45 nodes). The model was run straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over Quadrics for communication between nodes. These are full-node timings (i.e., using all 4 CPUs on each node) on a dedicated system. Two rails of the Quadrics interconnect were used for the Feb. 2002 updated runs. (10/23/2001; updated 10/26/2001, 2/18/2002)
Timings for the IBM Power 4, P690 1.3 GHz were conducted on an IBM system. Each Regatta node used in the test contained 32 processors subdivided into 8-way logical partitions (LPARs). The interconnection network is the IBM Colony switch. Contributed; thanks Jim Abeles, IBM (10/2003).(1)
Timings for the IBM Power 4, P655+ 1.7 GHz, were conducted using 4-way LPARs and the Colony switch. Contributed; thanks Jim Abeles, IBM (9/2003).(1)
The SGI Origin 3900 timings were obtained using MPI on a system with 64 700 MHz MIPS processors, 4 CPUs per node. The interconnection network is SGI NUMAlink. Contributed; thanks Peter Johnsen, SGI (4/2003).
The SGI Altix timings were obtained using MPI on a system with 1.5 GHz Intel Itanium-2 processors, 2 CPUs per node. Compiled with Intel Fortran95. The interconnection network is SGI NUMAlink. Contributed; thanks Peter Johnsen, SGI (4/2003).
The HP Superdome timings were obtained on a system with 1.5 GHz Itanium-2 processors. Contributed; thanks Logan Sankaran, HP (9/2003).
The Fujitsu VPP5000 is a distributed-memory machine with vector processors, linked by a high-speed crossbar interconnect. The model was run using Fujitsu's implementation of MPI and with a one-dimensional data decomposition to preserve vector length in the I-dimension.
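To illustrate what a one-dimensional decomposition means for vector length, here is a minimal, self-contained sketch (the task count is hypothetical, and this is not the actual MM5 decomposition code): each task receives a contiguous band of north/south (J) rows but the entire east/west (I) dimension, so the innermost I loops retain the full vector length of 136.

    program one_d_decomposition_demo
      ! Sketch only -- not the actual MM5 decomposition code.  Partitions the
      ! 136 x 112 benchmark grid across tasks in the J (north/south) dimension
      ! only, so every task keeps all 136 I (east/west) points and the inner
      ! I loops retain their full vector length.
      implicit none
      integer, parameter :: ni = 136, nj = 112  ! horizontal grid of this benchmark case
      integer, parameter :: ntasks = 4          ! hypothetical number of MPI tasks
      integer :: task, jts, jte
      do task = 0, ntasks - 1
        jts = task * nj / ntasks + 1       ! first J row owned by this task
        jte = (task + 1) * nj / ntasks     ! last J row owned by this task
        print *, 'task', task, 'owns J rows', jts, 'to', jte, 'and all', ni, 'I points'
      end do
    end program one_d_decomposition_demo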
The IBM Power3 SP timings were obtained on the NCAR/SCD IBM Winterhawk-II machine (four 375 MHz Power3 CPUs per node). The model was run using four MPI tasks per node, i.e., all 4 processors on each node. (5/22/00)
The Jazz timings were conducted on a 350-node Pentium/Linux cluster in the Laboratory Computing Resource Center at Argonne National Laboratory. Each node has a single Xeon 2.4 GHz processor and either 1 GB or 2 GB of memory. The model was run over Myrinet 2000. The MM5 code was compiled using the Intel compiler. Contributed; thanks John Taylor, ANL. (3/2003)
The iJet HPTi Xeon timings were conducted on iJet, a Pentium/Linux cluster at the NOAA Forecast Systems Laboratory. Each node has dual Xeon 2.2 GHz processors. The model was run straight MPI (no OpenMP) over Myrinet. The MM5 code was compiled using the Intel compiler. Contributed; thanks Craig Tierney, HPTi. (11/2002)
Note:
(1) The runs were conducted using a special optimization in the Reisner explicit moisture routine, EXMOISR, originally included in the MM5 benchmark code for optimization on the NEC SX/5 system. The modification entails replacing the FORTRAN power operator ‘**’ with EXP and LOG: that is, x ** y is replaced with EXP( LOG(x)*y). This option is turned on at compile time by compiling the EXMOISR routine with KMA defined as a CPP preprocessor macro.
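Below is a minimal, self-contained sketch of the substitution (the operand values and program name are hypothetical; this is not the actual EXMOISR code). The EXP/LOG form is algebraically equivalent to the power operator only for positive bases, and on some vector systems it can be evaluated faster.

    program power_substitution_demo
      ! Illustrative only -- not the actual EXMOISR code.  Compares the
      ! Fortran power operator with the EXP/LOG form that is used when the
      ! routine is compiled with KMA defined (e.g. -DKMA).
      implicit none
      real :: x, y, p_standard, p_kma
      x = 0.003    ! hypothetical positive operand
      y = 0.875    ! hypothetical exponent
      p_standard = x ** y
    #ifdef KMA
      p_kma = exp( log(x) * y )    ! equal to x**y when x > 0
    #else
      p_kma = x ** y
    #endif
      print *, p_standard, p_kma
    end program power_substitution_demo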
The following table shows scaling information for the machines tested. Scaling efficiency is computed as the percentage of ideal speedup as the number of processors increases, relative to some minimum number of processors. This minimum may vary from machine to machine based on node memory limits for the individual system, or it may be chosen to avoid the appearance of superlinear scaling, which can occur when the benchmark code does not fit well into memory on a small number of processors but nevertheless runs (slowly) out of virtual memory; cache problems on small numbers of processors may also cause superlinear scaling.
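For example, in the table below the Cray X1 sustains 6587 Mflop/sec on 8 SSPs and 11472 Mflop/sec on 16 SSPs: the speedup is 11472/6587, or about 1.74, against an ideal factor of 2, giving a scaling efficiency of roughly 87%.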
Cray X1, SSPs (9/23/2003):
  P = 8 to 384 (6587 to 102465 Mflop/sec), 32% scaling efficiency
  P = 8 to 192 (6587 to 82112 Mflop/sec), 52% scaling efficiency
  P = 8 to 96 (6587 to 48340 Mflop/sec), 61% scaling efficiency
  P = 8 to 64 (6587 to 36328 Mflop/sec), 69% scaling efficiency
  P = 8 to 32 (6587 to 21218 Mflop/sec), 81% scaling efficiency
  P = 8 to 16 (6587 to 11472 Mflop/sec), 87% scaling efficiency

Fujitsu VPP 5000 (8/9/2000):
  P = 1 to 40 (2156 to 40638 Mflop/sec), 47% scaling efficiency
  P = 1 to 20 (2156 to 27815 Mflop/sec), 65% scaling efficiency
  P = 1 to 10 (2156 to 16650 Mflop/sec), 77% scaling efficiency

Pittsburgh SC TCS (11/26/2001):
  P = 1 to 512 (586 to 105923 Mflop/sec), 35% scaling efficiency
  P = 1 to 256 (586 to 74346 Mflop/sec), 50% scaling efficiency
  P = 1 to 128 (586 to 49252 Mflop/sec), 66% scaling efficiency
  P = 1 to 64 (586 to 29119 Mflop/sec), 78% scaling efficiency
  P = 1 to 32 (586 to 16063 Mflop/sec), 86% scaling efficiency
  P = 1 to 4 (586 to 2038 Mflop/sec), 87% scaling efficiency

IBM P690 1.3 GHz (10/2003):
  P = 1 to 512 (791 to 95907 Mflop/sec), 24% scaling efficiency
  P = 1 to 256 (791 to 70520 Mflop/sec), 35% scaling efficiency
  P = 1 to 128 (791 to 47953 Mflop/sec), 47% scaling efficiency
  P = 1 to 64 (791 to 29240 Mflop/sec), 58% scaling efficiency
  P = 1 to 32 (791 to 17005 Mflop/sec), 67% scaling efficiency
  P = 1 to 16 (791 to 9014 Mflop/sec), 71% scaling efficiency
  P = 1 to 8 (791 to 4863 Mflop/sec), 77% scaling efficiency
  P = 1 to 4 (791 to 2808 Mflop/sec), 89% scaling efficiency
  P = 1 to 2 (791 to 1539 Mflop/sec), 97% scaling efficiency

IBM P655+ 1.7 GHz (10/2003):
  P = 1 to 64 (1704 to 45239 Mflop/sec), 41% scaling efficiency
  P = 1 to 32 (1704 to 26641 Mflop/sec), 49% scaling efficiency
  P = 1 to 16 (1704 to 14620 Mflop/sec), 54% scaling efficiency
  P = 1 to 8 (1704 to 7710 Mflop/sec), 57% scaling efficiency
  P = 1 to 4 (1704 to 3970 Mflop/sec), 58% scaling efficiency
  P = 1 to 2 (1704 to 2098 Mflop/sec), 62% scaling efficiency

Xeon 2.4 GHz, Myrinet, IFC (2/2003):
  P = 4 to 336 (1541 to 58480 Mflop/sec), 45% scaling efficiency
  P = 4 to 256 (1541 to 49951 Mflop/sec), 51% scaling efficiency
  P = 4 to 128 (1541 to 30350 Mflop/sec), 62% scaling efficiency
  P = 4 to 64 (1541 to 18028 Mflop/sec), 73% scaling efficiency
  P = 4 to 32 (1541 to 10203 Mflop/sec), 83% scaling efficiency
  P = 4 to 16 (1541 to 5235 Mflop/sec), 85% scaling efficiency
  P = 4 to 8 (1541 to 2928 Mflop/sec), 95% scaling efficiency

Xeon 2.2 GHz, Myrinet, IFC (11/2002):
  P = 2 to 256 (596 to 39242 Mflop/sec), 51% scaling efficiency
  P = 2 to 128 (596 to 26204 Mflop/sec), 69% scaling efficiency
  P = 2 to 64 (596 to 14272 Mflop/sec), 75% scaling efficiency
  P = 2 to 32 (596 to 8443 Mflop/sec), 89% scaling efficiency
  P = 2 to 16 (596 to 4229 Mflop/sec), 89% scaling efficiency

SGI Altix, 1.5 GHz (7/2003):
  P = 2 to 32 (1795 to 24466 Mflop/sec), 85% scaling efficiency
  P = 2 to 24 (1795 to 19029 Mflop/sec), 88% scaling efficiency
  P = 2 to 16 (1795 to 13031 Mflop/sec), 91% scaling efficiency
  P = 2 to 8 (1795 to 6697 Mflop/sec), 93% scaling efficiency
  P = 2 to 4 (1795 to 3406 Mflop/sec), 95% scaling efficiency

SGI Origin 3900 700 MHz (1/2003):
  P = 16 to 64 (5848 to 21797 Mflop/sec), 93% scaling efficiency
  P = 16 to 48 (5848 to 17126 Mflop/sec), 98% scaling efficiency
  P = 16 to 32 (5848 to 11417 Mflop/sec), 98% scaling efficiency

HP Superdome IA-64 1.5 GHz (9/2003):
  P = 1 to 64 (903 to 45091 Mflop/sec), 78% scaling efficiency
  P = 1 to 32 (903 to 23463 Mflop/sec), 81% scaling efficiency
  P = 1 to 16 (903 to 12406 Mflop/sec), 86% scaling efficiency
  P = 1 to 8 (903 to 6336 Mflop/sec), 88% scaling efficiency
  P = 1 to 4 (903 to 3204 Mflop/sec), 89% scaling efficiency
  P = 1 to 2 (903 to 1797 Mflop/sec), 100% scaling efficiency
John Michalakes, michalak@ucar.edu
---
New page: September 24, 2003
Revised, Cray X1 numbers, September 29, 2003
Revised, restored missing p=128 point on X1 plots, October 14, 2003
Revised, removed Cray X1 numbers. October 28, 2003
Revised, restored and updated Cray X1 numbers; new IBM P690 and P655+ numbers. October 30, 2003