Parallel MM5 benchmarks, 2003 (Updated: click here for latest)

Click here to return to http://www2.mmm.ucar.edu/mm5/mpp/helpdesk .


Disclaimer

Some results on this page are contributed (as thanked below). Contributed results are provided on this page along with results from runs we conducted ourselves with the following caveat: Reasonable efforts have been made to verify contributed results, including consulting with the contributors and inspection of configuration and raw benchmark result files. We have complete confidence in the integrity and competence of the contributors; however, we assume no responsibility for nor make any claims regarding the accuracy, veracity, or appropriateness for any purpose whatsoever of contributed results. Further, all results, whether contributed or generated by us, are for a single fixed-size MM5 case on the specific machines listed, and no claim is made of representativeness for other model scenarios, code versions, or hardware installations. The explanatory text, written in consultation with vendor/contributors, is the work and responsibility of John Michalakes (michalak@ucar.edu).

NOTE: Some older results have been discarded in this September 2003 revision of the MM5 benchmarks page.

Press here for older (pre-September 2003) version of Parallel MM5 2003 Benchmarks Page

Press here for the older Parallel MM5 2002 Benchmarks Page.
Press here for the older Parallel MM5 2001 Benchmarks Page.
Press here for the older Parallel MM5 2000 Benchmarks Page.


Click here to download the input data for this MM5 benchmark case.
For additional information on MM5 benchmarks, please click here.
For information on the MM5 distributed-memory parallel code, please click here.
Scroll down this page for additional explanation of the figures shown here.




Figure 1a. MM5 floating-point performance on various platforms. (1)See notes below. (Updated October 30, 2003).





Figure 1b. MM5 floating-point performance on various platforms (zoomed). (1)See notes below. (Updated October 30, 2003)



All runs were of a 36-kilometer resolution domain over Europe; the grid consisted of 136 cells in the east/west dimension, 112 north/south, and 33 vertical layers (approximately 503,000 cells). The time step is 81 seconds. There is a link to the input data for this case at the top of this page. The operation count for this scenario is 2,398 million floating-point operations per average time step. I/O and model initialization were not included in the timings. All timing runs were performed at single (32-bit) floating-point precision. Scaling is calculated as the speedup divided by the factor of increase in the number of processors.
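As an illustration of how the plotted quantities follow from these fixed numbers, the short sketch below (not part of the benchmark code; the 0.5-second time per step is an invented example value) converts a measured average wall-clock time per step into sustained Mflop/second and simulated hours per wall-clock hour.

PROGRAM MM5RATE
  ! Sketch only: converts a measured average wall-clock time per MM5 step
  ! into the two quantities plotted in Figures 1a-b.  The value of TSTEP
  ! below is a made-up example, not a benchmark result.
  REAL, PARAMETER :: FLOP_PER_STEP = 2398.0E6   ! operations per average time step
  REAL, PARAMETER :: DT            = 81.0       ! model time step in seconds
  REAL :: TSTEP, MFLOPS, SIMRATE
  TSTEP   = 0.5                                 ! example: measured wall-clock seconds per step
  MFLOPS  = FLOP_PER_STEP / TSTEP / 1.0E6       ! sustained Mflop/second
  SIMRATE = DT / TSTEP                          ! simulated hours per wall-clock hour
  PRINT *, 'Sustained Mflop/second   :', MFLOPS
  PRINT *, 'Simulated hours per hour :', SIMRATE
END PROGRAM MM5RATE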

Figures 1a and 1b show performance results in Mflop/second and in simulated hours per wall-clock hour on a variety of platforms.

The Cray X1 timings were obtained on a 128-CPU, 800 MHz X1 system. The x-axis is the number of single-streaming processors (SSPs) used for each run. Contributed; thanks Peter Johnsen, Cray, and Tony Meys, Army High Performance Computing Research Center (10/2003).

Timings for the Pittsburgh Supercomputing Center Terascale Computing System (TCS) were conducted by J. Michalakes on October 20-21, 2001, with additional runs on October 26 and 29. Thanks to Ralph Roskies, Sergiu Sanielevici, Roberto Gomez, and others at PSC. The 6 TFlop/s peak TCS comprises 3000 1 GHz HP Alpha EV68 processors (750 ES45 nodes). The model was run straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over the Quadrics interconnect for communication between nodes. These are full-node timings (i.e., using all 4 CPUs on each node) on a dedicated system. Two rails of the Quadrics interconnect were used for the February 2002 updated runs. (10/23/2001; updated 10/26/2001, 2/18/2002)

Timings for the IBM Power 4 P690, 1.3 GHz, were conducted on an IBM system. Each Regatta node used in the test contained 32 processors subdivided into 8-way logical partitions (LPARs). The interconnection network is the IBM Colony switch. Contributed; thanks Jim Abeles, IBM (10/2003). (1)

Timings for the IBM Power 4 P655+, 1.7 GHz, were conducted using 4-way LPARs and the Colony switch. Contributed; thanks Jim Abeles, IBM (9/2003). (1)

The SGI Origin 3900 timings were obtained using MPI on a system with 64 700 MHz MIPS processors (4 CPUs per node). The interconnection network is SGI NUMAlink. Contributed; thanks Peter Johnsen, SGI (4/2003).

The SGI Altix timings were obtained using MPI on a system with 1.5 GHz Intel Itanium-2 processors (2 CPUs per node). The code was compiled with Intel Fortran95. The interconnection network is SGI NUMAlink. Contributed; thanks Peter Johnsen, SGI (4/2003).

The HP Superdome timings were obtained on a system with 1.5 GHz Itanium-2 processors. Contributed; thanks Logan Sankaran, HP (9/2003).

The Fujitsu VPP5000 is a distributed-memory machine with vector processors, linked by a high-speed crossbar interconnect. The model was run using Fujitsu's implementation of MPI and with a one-dimensional data decomposition to preserve vector length in the I-dimension.

The IBM Power3 SP timings were obtained on the NCAR/SCD IBM Winterhawk-II machine (four 375 MHz Power3 CPUs per node). The model was run using four MPI tasks per node, i.e., all four processors on each node. (5/22/00)

The Jazz timings were conducted on a 350-node Pentium/Linux cluster in the Laboratory Computing Resource Center at Argonne National Laboratory. Each node has a single Xeon 2.4 GHz processor and either 1 GB or 2 GB of memory. The model was run over Myrinet 2000. The MM5 code was compiled using the Intel compiler. Contributed; thanks John Taylor, ANL. (3/2003)

The iJet HPTi Xeon timings were conducted on iJet, a Pentium/Linux cluster at the NOAA Forecast Systems Laboratory. Each node has dual Xeon 2.2 GHz processors. The model was run straight MPI (no OpenMP) over Myrinet. The MM5 code was compiled using the Intel compiler. Contributed; thanks Craig Tierney, HPTi. (11/2002)

Note:

(1) The runs were conducted using a special optimization in the Reisner explicit moisture routine, EXMOISR, originally included in the MM5 benchmark code for optimization on the NEC SX/5 system. The modification entails replacing the Fortran power operator '**' with EXP and LOG: that is, x ** y is replaced with EXP( LOG(x)*y ). This option is turned on at compile time by compiling the EXMOISR routine with KMA defined as a CPP preprocessor macro.
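A minimal sketch of the substitution is shown below. The variables and values are placeholders, not the actual EXMOISR code, and the source must be passed through the C preprocessor (with KMA defined, e.g. via a -DKMA compile flag or your compiler's equivalent) for the optimized branch to be selected.

PROGRAM POWDEMO
  ! Placeholder demonstration of the KMA option; X and Y are arbitrary
  ! values, not quantities from the moisture scheme.
  REAL :: X, Y, R
  X = 1.2E-3
  Y = 0.8
#ifdef KMA
  ! Optimized form: valid only for X > 0.
  R = EXP( LOG( X ) * Y )
#else
  ! Original form using the Fortran power operator.
  R = X ** Y
#endif
  PRINT *, 'X ** Y =', R
END PROGRAM POWDEMO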



Scaling

The following table shows scaling information for the machines tested. Scaling efficiency is computed as the percentage of ideal speedup as the number of processors increases, relative to some minimum number of processors. This minimum may vary from machine to machine, based on node memory limits for the individual system or to avoid the appearance of superlinear scaling. Superlinear scaling can occur when the benchmark code does not fit well into memory on a small number of processors but nevertheless runs (slowly) out of virtual memory; cache effects on small numbers of processors may also cause it.
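As a worked example of that calculation (a sketch only, using the Cray X1 8-to-16 SSP entry from the table below), efficiency is the achieved speedup divided by the ideal speedup P/Pmin:

PROGRAM SCALEFF
  ! Sketch of the scaling-efficiency formula used in the table below,
  ! worked for the Cray X1 8-to-16 SSP entry.
  INTEGER :: PMIN, P
  REAL    :: RATE_MIN, RATE_P, EFF
  PMIN = 8
  P    = 16
  RATE_MIN = 6587.0                     ! Mflop/sec at PMIN processors
  RATE_P   = 11472.0                    ! Mflop/sec at P processors
  EFF = ( RATE_P / RATE_MIN ) / ( REAL(P) / REAL(PMIN) )
  PRINT *, 'Scaling efficiency (%):', 100.0 * EFF   ! approximately 87
END PROGRAM SCALEFF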

SYSTEM                                    PROCESSORS    MFLOP/SEC           SCALING
Cray X1, SSPs (9/23/2003)                 8 to 384      6587 to 102465      32%
                                          8 to 192      6587 to 82112       52%
                                          8 to 96       6587 to 48340       61%
                                          8 to 64       6587 to 36328       69%
                                          8 to 32       6587 to 21218       81%
                                          8 to 16       6587 to 11472       87%
Fujitsu VPP 5000 (8/9/2000)               1 to 40       2156 to 40638       47%
                                          1 to 20       2156 to 27815       65%
                                          1 to 10       2156 to 16650       77%
Pittsburgh SC TCS (11/26/2001)            1 to 512      586 to 105923       35%
                                          1 to 256      586 to 74346        50%
                                          1 to 128      586 to 49252        66%
                                          1 to 64       586 to 29119        78%
                                          1 to 32       586 to 16063        86%
                                          1 to 4        586 to 2038         87%
IBM P690 1.3 GHz (10/2003)                1 to 512      791 to 95907        24%
                                          1 to 256      791 to 70520        35%
                                          1 to 128      791 to 47953        47%
                                          1 to 64       791 to 29240        58%
                                          1 to 32       791 to 17005        67%
                                          1 to 16       791 to 9014         71%
                                          1 to 8        791 to 4863         77%
                                          1 to 4        791 to 2808         89%
                                          1 to 2        791 to 1539         97%
IBM P655+ 1.7 GHz (10/2003)               1 to 64       1704 to 45239       41%
                                          1 to 32       1704 to 26641       49%
                                          1 to 16       1704 to 14620       54%
                                          1 to 8        1704 to 7710        57%
                                          1 to 4        1704 to 3970        58%
                                          1 to 2        1704 to 2098        62%
Xeon 2.4 GHz, Myrinet, IFC (2/2003)       4 to 336      1541 to 58480       45%
                                          4 to 256      1541 to 49951       51%
                                          4 to 128      1541 to 30350       62%
                                          4 to 64       1541 to 18028       73%
                                          4 to 32       1541 to 10203       83%
                                          4 to 16       1541 to 5235        85%
                                          4 to 8        1541 to 2928        95%
Xeon 2.2 GHz, Myrinet, IFC (11/2002)      2 to 256      596 to 39242        51%
                                          2 to 128      596 to 26204        69%
                                          2 to 64       596 to 14272        75%
                                          2 to 32       596 to 8443         89%
                                          2 to 16       596 to 4229         89%
SGI Altix, 1.5 GHz (7/2003)               2 to 32       1795 to 24466       85%
                                          2 to 24       1795 to 19029       88%
                                          2 to 16       1795 to 13031       91%
                                          2 to 8        1795 to 6697        93%
                                          2 to 4        1795 to 3406        95%
SGI Origin 3900 700 MHz (1/2003)          16 to 64      5848 to 21797       93%
                                          16 to 48      5848 to 17126       98%
                                          16 to 32      5848 to 11417       98%
HP Superdome IA-64 1.5 GHz (9/2003)       1 to 64       903 to 45091        78%
                                          1 to 32       903 to 23463        81%
                                          1 to 16       903 to 12406        86%
                                          1 to 8        903 to 6336         88%
                                          1 to 4        903 to 3204         89%
                                          1 to 2        903 to 1797         100%


John Michalakes, michalak@ucar.edu
---
New page:  September 24, 2003

Revised, Cray X1 numbers, September 29, 2003

Revised, restored missing p=128 point on X1 plots, October 14, 2003

Revised, removed Cray X1 numbers. October 28, 2003

Revised, restored and updated Cray X1 numbers; new IBM P690 and P655+ numbers. October 30, 2003