Parallel performance of CM1:
This page presents information about the performance of CM1 on distributed-memory supercomputers.
Contents: | A Strong Scaling Test | A Weak Scaling Test |
A Strong Scaling Test with cm1r19 on NCAR's cheyenne: Supercell thunderstorm simulation (posted 27 June 2017)
System: NCAR's cheyenne: SGI ICE XA Cluster with Intel Broadwell processors.
Compiler: ifort 16.0.3
Code: cm1r19.1
CM1 Configuration: MPI
Case: Idealized supercell thunderstorm, 2 h integration, 250 m horizontal grid spacing
Total domain dimensions: 512 × 512 × 128
Time steps: 2,880
NOTE: This is a STRONG SCALING test (i.e., problem size is fixed) that includes moisture as well as Input/Output
Results with I/O: (full 3d output every 15 min; 8 output times total; 28 GB total)
Comments: This test demonstrates that the CM1 solver scales reasonably well to more than 10,000 cores (black line above). However, when using more than 1,000 cores, the time required to write output begins to affect performance. For binary GrADS-format output ("output_format = 1") (red line), performance degrades beyond roughly 4,000 cores. For netCDF output ("output_format = 2") (blue line), parallel performance does not scale as well.
Results will vary depending on the frequency of output, the total size of output, and (especially) the physical parameterization schemes used. For example, simulations with more expensive microphysics schemes and simulations with atmospheric radiation will require more CPU time to complete.
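As a minimal sketch of how a strong-scaling result like this one could be quantified, the Python below computes speedup and parallel efficiency relative to a reference run. The function name and the timings are hypothetical and chosen only for illustration; they are not measured CM1 results.

```python
def strong_scaling(ref_cores, ref_seconds, cores, seconds):
    """Return (speedup, efficiency) of a run relative to a reference run.

    Strong scaling keeps the problem size fixed, so ideal speedup equals
    cores / ref_cores and ideal efficiency equals 1.0.
    """
    if min(ref_cores, cores) <= 0 or min(ref_seconds, seconds) <= 0:
        raise ValueError("core counts and timings must be positive")
    speedup = ref_seconds / seconds
    efficiency = speedup * ref_cores / cores
    return speedup, efficiency


# Hypothetical wallclock timings in seconds (NOT measured CM1 data).
speedup, efficiency = strong_scaling(
    ref_cores=144, ref_seconds=6000.0, cores=1152, seconds=900.0
)
print(f"speedup = {speedup:.2f} (ideal 8.00), efficiency = {efficiency:.2f}")
```

Computing the same quantities separately for solver-only timings and for timings that include output would separate the two effects described above.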
Recommendation: For this configuration (that is, with the Morrison microphysics scheme, the LES subgrid turbulence model, and no radiation scheme), a good formula for estimating the cheyenne core-hours required for a simulation is:
C = 2.5 × 10^-9 × Nx × Ny × Nz × Nt
where C is the total number of cheyenne core-hours, Nx, Ny, and Nz are the numbers of grid points in the x, y, and z directions, and Nt is the total number of time steps.
For example: a 512 × 512 × 128 domain, integrated for 2 hours with a 2.5-s time step (and thus 2,880 total time steps), would require approximately 242 cheyenne core-hours. Assuming 144 cores (i.e., 4 nodes) are used, roughly 1.7 wallclock hours would be needed to run this simulation on cheyenne.
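The arithmetic in this example can be sketched in a few lines of Python. The coefficient 2.5e-9 is treated as an assumption here: it is the value that reproduces the 242 core-hour figure above for this configuration, and the function names are illustrative rather than part of CM1.

```python
# Assumed coefficient (core-hours per grid point per time step), chosen because
# it reproduces the 242 core-hour example above; valid only for this configuration.
CORE_HOURS_PER_POINT_STEP = 2.5e-9


def estimate_core_hours(nx, ny, nz, nt, coeff=CORE_HOURS_PER_POINT_STEP):
    """Estimate cheyenne core-hours as coeff * Nx * Ny * Nz * Nt."""
    return coeff * nx * ny * nz * nt


def estimate_wallclock_hours(core_hours, cores):
    """Wallclock hours, assuming the work divides evenly across all cores."""
    return core_hours / cores


nt = round(2 * 3600 / 2.5)  # 2 h integration with a 2.5-s time step
core_hours = estimate_core_hours(512, 512, 128, nt)
wallclock = estimate_wallclock_hours(core_hours, cores=144)
print(f"{nt} steps, {core_hours:.0f} core-hours, {wallclock:.1f} wallclock hours")
```

The wallclock estimate assumes near-ideal scaling at the chosen core count, so it is most reliable at modest core counts where output time has little effect on the total.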
A Weak Scaling Test for cm1r18 on NCAR's yellowstone: Large Eddy Simulation of a convective boundary layer (posted 10 September 2015)
System: NCAR's yellowstone: IBM iDataPlex with Intel Sandy Bridge processors.
CM1 Configuration: MPI
Case: LES of convective boundary layer, 20 m horizontal grid spacing
Domain dimensions: varies with number of cores (i.e., processors). Each core has 8 × 8 × 256 grid points. The largest total domain size is 1024 × 1024 × 256 grid points.
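To make the relationship between core count and domain size concrete, here is a small Python sketch. It assumes the domain is split among cores only in the two horizontal directions, which is inferred from the numbers above (every core holds all 256 vertical levels); the function names are illustrative.

```python
POINTS_PER_CORE = (8, 8, 256)  # (nx, ny, nz) held by each core in this test


def total_domain(cores_x, cores_y, per_core=POINTS_PER_CORE):
    """Total grid dimensions for a cores_x by cores_y horizontal decomposition."""
    nx, ny, nz = per_core
    return (nx * cores_x, ny * cores_y, nz)


def cores_for_domain(total_nx, total_ny, per_core=POINTS_PER_CORE):
    """Number of cores needed to cover a total_nx by total_ny horizontal grid."""
    nx, ny, _ = per_core
    if total_nx % nx or total_ny % ny:
        raise ValueError("domain must be a whole multiple of the per-core tile")
    return (total_nx // nx) * (total_ny // ny)


print(total_domain(128, 128))        # (1024, 1024, 256)
print(cores_for_domain(1024, 1024))  # 16384
```

Under this assumption, the largest domain in the test corresponds to 16,384 cores.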
Time steps: 7,200
NOTE: This is a WEAK SCALING test (i.e., problem size scales with number of processors) that does not account for moisture/microphysics but does include Input/Output
Results with I/O: (full 3d output every 5 min; 12 output times total; 14 MB per core)
Comments: For this test the domain size increases as cores are added, with the goal of testing how CM1 performs for very large domains. In this case, the total run time should (ideally) remain the same as cores are added.
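One way to express "the total run time should remain the same" as a number is weak-scaling efficiency: the run time at the smallest core count divided by the run time at a larger core count, so that 1.0 is ideal. The Python below is a minimal sketch; the timings are hypothetical placeholders, not measured CM1 results.

```python
def weak_scaling_efficiency(base_seconds, seconds):
    """Ratio of baseline run time to run time at a larger core count (1.0 is ideal)."""
    if base_seconds <= 0 or seconds <= 0:
        raise ValueError("timings must be positive")
    return base_seconds / seconds


# Hypothetical wallclock times in seconds, keyed by core count (NOT measured data).
timings = {64: 1000.0, 1024: 1050.0, 16384: 1250.0}
base = timings[min(timings)]
efficiencies = {n: weak_scaling_efficiency(base, t) for n, t in sorted(timings.items())}
for cores, eff in efficiencies.items():
    print(f"{cores:6d} cores: efficiency = {eff:.2f}")
```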
Results in the figure above show that the CM1 solver performs well to at least 16,000 cores (black line). For more than 1,000 cores, however, the time required to write output begins to degrade performance. For the "output_filetype = 2" option (in which all MPI processes write to one output file) (red line), performance degrades beyond 1,000 cores. For the "output_filetype = 3" option (in which every MPI process writes a separate output file) (blue line), performance remains reasonable up to 16,000 cores.
Last updated: 27 June 2017