Parallel MM5 benchmarks, 2003 (Updated: click here for latest)

Click here to return to http://www2.mmm.ucar.edu/mm5/mpp/helpdesk .


Disclaimer

Some results on this page are contributed (as thanked below). Contributed results are provided on this page along with results from runs we conducted ourselves with the following caveat: Reasonable efforts have been made to verify contributed results, including consulting with the contributors and inspection of configuration and raw benchmark result files. We have complete confidence in the integrity and competence of the contributors; however, we assume no responsibility for nor make any claims regarding the accuracy, veracity, or appropriateness for any purpose whatsoever of contributed results. Further, all results, whether contributed or generated by us, are for a single fixed-size MM5 case on the specific machines listed, and no claim is made of representativeness for other model scenarios, code versions, or hardware installations. The explanatory text, written in consultation with vendor/contributors, is the work and responsibility of John Michalakes (michalak@ucar.edu).

NOTE: Some older results have been discarded in this September 2003 revision of the MM5 benchmarks page.

Press here for older (pre-September 2003) version of Parallel MM5 2003 Benchmarks Page

Press here for the older Parallel MM5 2002 Benchmarks Page.
Press here for the older Parallel MM5 2001 Benchmarks Page.
Press here for the older Parallel MM5 2000 Benchmarks Page.


Click here to download the input data for this MM5 benchmark case.
For additional information on MM5 benchmarks, please click here.
For information on the MM5 distributed-memory parallel code, please click here.
Scroll down this page for additional explanation of the figures shown here.




Figure 1a. MM5 floating-point performance on various platforms. (1)See notes below. (Updated October 30, 2003).





Figure 1b. MM5 floating-point performance on various platforms (zoomed). (1)See notes below. (Updated October 30, 2003)



All runs were of a 36-kilometer resolution domain over Europe; the grid consisted of 136 cells in the east/west dimension, 112 north/south, and 33 vertical layers (approximately 503,000 cells). The time step is 81 seconds. There is a link to the input data for this case at the top of this page. The operation count for this scenario is 2,398 million floating-point operations per average time step. I/O and model initialization were not included in the timings. All timing runs were performed at single (32-bit) floating-point precision. Scaling is calculated as the speedup divided by the factor of increase in the number of processors.
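As an illustration of how the plotted quantities follow from these fixed numbers, the short sketch below (not part of the benchmark code; the 0.5-second time per step is an invented example value) converts a measured average wall-clock time per step into sustained Mflop/second and simulated hours per wall-clock hour.

PROGRAM MM5RATE
  ! Sketch only: converts a measured average wall-clock time per MM5 step
  ! into the two quantities plotted in Figures 1a-b.  The value of TSTEP
  ! below is a made-up example, not a benchmark result.
  REAL, PARAMETER :: FLOP_PER_STEP = 2398.0E6   ! operations per average time step
  REAL, PARAMETER :: DT            = 81.0       ! model time step in seconds
  REAL :: TSTEP, MFLOPS, SIMRATE
  TSTEP   = 0.5                                 ! example: measured wall-clock seconds per step
  MFLOPS  = FLOP_PER_STEP / TSTEP / 1.0E6       ! sustained Mflop/second
  SIMRATE = DT / TSTEP                          ! simulated hours per wall-clock hour
  PRINT *, 'Sustained Mflop/second   :', MFLOPS
  PRINT *, 'Simulated hours per hour :', SIMRATE
END PROGRAM MM5RATE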

Figures 1a and 1b show performance results in Mflop/second and in simulated hours per wall-clock hour on a variety of platforms.

The Cray X1 timings were obtained on a 128-CPU, 800 MHz X1 system. The x-axis is the number of single-streaming processors (SSPs) used for each run. Contributed; thanks Peter Johnsen, Cray, and Tony Meys, Army High Performance Computing Research Center (10/2003).

Timings for the Pittsburgh Supercomputing Center Terascale Computing System (TCS) were conducted by J. Michalakes on October 20-21, 2001, with additional runs on October 26 and 29. Thanks to Ralph Roskies, Sergiu Sanielevici, Roberto Gomez, and others at PSC. The 6 TFlop/s peak TCS comprises 3000 1 GHz HP Alpha EV68 processors (750 ES45 nodes). The model was run straight MPI (no OpenMP), using MPI over shared memory for communication within nodes and MPI over the Quadrics interconnect for communication between nodes. These are full-node timings (i.e., using all 4 CPUs on each node) on a dedicated system. Two rails of the Quadrics interconnect were used for the February 2002 updated runs. (10/23/2001; updated 10/26/2001, 2/18/2002)

Timings for the IBM Power 4 P690, 1.3 GHz, were conducted on an IBM system. Each Regatta node used in the test contained 32 processors subdivided into 8-way logical partitions (LPARs). The interconnection network is the IBM Colony switch. Contributed; thanks Jim Abeles, IBM (10/2003). (1)

Timings for the IBM Power 4 P655+, 1.7 GHz, were conducted using 4-way LPARs and the Colony switch. Contributed; thanks Jim Abeles, IBM (9/2003). (1)

The SGI Origin 3900 timings were obtained using MPI on a system with 64 700 MHz MIPS processors (4 CPUs per node). The interconnection network is SGI NUMAlink. Contributed; thanks Peter Johnsen, SGI (4/2003).

The SGI Altix timings were obtained using MPI on a system with 1.5 GHz Intel Itanium-2 processors (2 CPUs per node). The code was compiled with Intel Fortran95. The interconnection network is SGI NUMAlink. Contributed; thanks Peter Johnsen, SGI (4/2003).

The HP Superdome timings were obtained on a system with 1.5 GHz Itanium-2 processors. Contributed; thanks Logan Sankaran, HP (9/2003).

The Fujitsu VPP5000 is a distributed-memory machine with vector processors, linked by a high-speed crossbar interconnect. The model was run using Fujitsu's implementation of MPI and with a one-dimensional data decomposition to preserve vector length in the I-dimension.

The IBM Power3 SP timings were obtained on the NCAR/SCD IBM Winterhawk-II machine (four 375 MHz Power3 CPUs per node). The model was run using four MPI tasks per node, i.e., all four processors on each node. (5/22/00)

The Jazz timings were conducted on a 350-node Pentium/Linux cluster in the Laboratory Computing Resource Center at Argonne National Laboratory. Each node has a single Xeon 2.4 GHz processor and either 1 GB or 2 GB of memory. The model was run over Myrinet 2000. The MM5 code was compiled using the Intel compiler. Contributed; thanks John Taylor, ANL. (3/2003)

The iJet HPTi Xeon timings were conducted on iJet, a Pentium/Linux cluster at the NOAA Forecast Systems Laboratory. Each node has dual Xeon 2.2 GHz processors. The model was run straight MPI (no OpenMP) over Myrinet. The MM5 code was compiled using the Intel compiler. Contributed; thanks Craig Tierney, HPTi. (11/2002)

Note:

(1) The runs were conducted using a special optimization in the Reisner explicit moisture routine, EXMOISR, originally included in the MM5 benchmark code for optimization on the NEC SX/5 system. The modification entails replacing the Fortran power operator '**' with EXP and LOG: that is, x ** y is replaced with EXP( LOG(x)*y ). This option is turned on at compile time by compiling the EXMOISR routine with KMA defined as a CPP preprocessor macro.
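A minimal sketch of the substitution is shown below. The variables and values are placeholders, not the actual EXMOISR code, and the source must be passed through the C preprocessor (with KMA defined, e.g. via a -DKMA compile flag or your compiler's equivalent) for the optimized branch to be selected.

PROGRAM POWDEMO
  ! Placeholder demonstration of the KMA option; X and Y are arbitrary
  ! values, not quantities from the moisture scheme.
  REAL :: X, Y, R
  X = 1.2E-3
  Y = 0.8
#ifdef KMA
  ! Optimized form: valid only for X > 0.
  R = EXP( LOG( X ) * Y )
#else
  ! Original form using the Fortran power operator.
  R = X ** Y
#endif
  PRINT *, 'X ** Y =', R
END PROGRAM POWDEMO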



Scaling

The following table shows scaling information for the machines tested. Scaling efficiency is computed as the percentage of ideal speedup as the number of processors increases, relative to some minimum number of processors. This minimum may vary from machine to machine, based on node memory limits for the individual system or to avoid the appearance of superlinear scaling. Superlinear scaling can occur when the benchmark code does not fit well into memory on a small number of processors but nevertheless runs (slowly) out of virtual memory; cache effects on small numbers of processors may also cause it.
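As a worked example of that calculation (a sketch only, using the Cray X1 8-to-16 SSP entry from the table below), efficiency is the achieved speedup divided by the ideal speedup P/Pmin:

PROGRAM SCALEFF
  ! Sketch of the scaling-efficiency formula used in the table below,
  ! worked for the Cray X1 8-to-16 SSP entry.
  INTEGER :: PMIN, P
  REAL    :: RATE_MIN, RATE_P, EFF
  PMIN = 8
  P    = 16
  RATE_MIN = 6587.0                     ! Mflop/sec at PMIN processors
  RATE_P   = 11472.0                    ! Mflop/sec at P processors
  EFF = ( RATE_P / RATE_MIN ) / ( REAL(P) / REAL(PMIN) )
  PRINT *, 'Scaling efficiency (%):', 100.0 * EFF   ! approximately 87
END PROGRAM SCALEFF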

SYSTEM                                    PROCESSORS    MFLOP/SEC           SCALING
Cray X1, SSPs (9/23/2003)                 8 to 384      6587 to 102465      32%
                                          8 to 192      6587 to 82112       52%
                                          8 to 96       6587 to 48340       61%
                                          8 to 64       6587 to 36328       69%
                                          8 to 32       6587 to 21218       81%
                                          8 to 16       6587 to 11472       87%
Fujitsu VPP 5000 (8/9/2000)               1 to 40       2156 to 40638       47%
                                          1 to 20       2156 to 27815       65%
                                          1 to 10       2156 to 16650       77%
Pittsburgh SC TCS (11/26/2001)            1 to 512      586 to 105923       35%
                                          1 to 256      586 to 74346        50%
                                          1 to 128      586 to 49252        66%
                                          1 to 64       586 to 29119        78%
                                          1 to 32       586 to 16063        86%
                                          1 to 4        586 to 2038         87%
IBM P690 1.3 GHz (10/2003)                1 to 512      791 to 95907        24%
                                          1 to 256      791 to 70520        35%
                                          1 to 128      791 to 47953        47%
                                          1 to 64       791 to 29240        58%
                                          1 to 32       791 to 17005        67%
                                          1 to 16       791 to 9014         71%
                                          1 to 8        791 to 4863         77%
                                          1 to 4        791 to 2808         89%
                                          1 to 2        791 to 1539         97%
IBM P655+ 1.7 GHz (10/2003)               1 to 64       1704 to 45239       41%
                                          1 to 32       1704 to 26641       49%
                                          1 to 16       1704 to 14620       54%
                                          1 to 8        1704 to 7710        57%
                                          1 to 4        1704 to 3970        58%
                                          1 to 2        1704 to 2098        62%
Xeon 2.4 GHz, Myrinet, IFC (2/2003)       4 to 336      1541 to 58480       45%
                                          4 to 256      1541 to 49951       51%
                                          4 to 128      1541 to 30350       62%
                                          4 to 64       1541 to 18028       73%
                                          4 to 32       1541 to 10203       83%
                                          4 to 16       1541 to 5235        85%
                                          4 to 8        1541 to 2928        95%
Xeon 2.2 GHz, Myrinet, IFC (11/2002)      2 to 256      596 to 39242        51%
                                          2 to 128      596 to 26204        69%
                                          2 to 64       596 to 14272        75%
                                          2 to 32       596 to 8443         89%
                                          2 to 16       596 to 4229         89%
SGI Altix, 1.5 GHz (7/2003)               2 to 32       1795 to 24466       85%
                                          2 to 24       1795 to 19029       88%
                                          2 to 16       1795 to 13031       91%
                                          2 to 8        1795 to 6697        93%
                                          2 to 4        1795 to 3406        95%
SGI Origin 3900 700 MHz (1/2003)          16 to 64      5848 to 21797       93%
                                          16 to 48      5848 to 17126       98%
                                          16 to 32      5848 to 11417       98%
HP Superdome IA-64 1.5 GHz (9/2003)       1 to 64       903 to 45091        78%
                                          1 to 32       903 to 23463        81%
                                          1 to 16       903 to 12406        86%
                                          1 to 8        903 to 6336         88%
                                          1 to 4        903 to 3204         89%
                                          1 to 2        903 to 1797         100%


John Michalakes, michalak@ucar.edu
---
New page:  September 24, 2003

Revised, Cray X1 numbers, September 29, 2003

Revised, restored missing p=128 point on X1 plots, October 14, 2003

Revised, removed Cray X1 numbers. October 28, 2003

Revised, restored and updated Cray X1 numbers; new IBM P690 and P655+ numbers. October 30, 2003