On Monday, April 26, 1999 8:12 PM, Rick Knabb [SMTP:knabb@soest.hawaii.edu] wrote:
> Hello,
>
> Thanks for starting the new MPP mailing list... I hope it's OK
> to send inquiries to you as well.
>
> I have the model compiled and running on a Fujitsu VPP in a
> four-processor queue submitted via NQS. The sys admin would
> like me to provide him info on how efficient the parallel
> processing is for MM5-MPP running on this machine... goal is
> to determine if I need more processors or changes to the code
> to get a particular configuration to run faster. I'm somewhat
> familiar with NQS, etc.... but I need suggestions for getting
> some numbers on how well the model is using the resources I'm
> asking for. Does MM5 output any info of this nature, or do
> you know of NQS commands that provide such details?
>
> Thanks very much!
> Rick Knabb
>
> --
> Rick Knabb, Research Meteorologist
> University of Hawaii
> Institute for Astronomy
> 2680 Woodlawn Drive
> Honolulu, HI 96822

Dear Rick,

What a coincidence. I've been spending quite a bit of time lately trying to understand and improve performance and scaling on a Fujitsu VPP.

What I have discovered so far is that single-domain scenarios scale well, at least up to small numbers of processors (I only have 6 available on the machine I am using, and the runs so far have been on 4 or fewer). Processor performance is "ok" -- definitely not stellar: somewhere between 250 and 400 Mflop/sec/PE.

The real bottleneck has been the parallel efficiency of the code when nesting is turned on, in particular the nest-forcing code in the MPP version (see MPP/RSL/parallel_src/mp_stotndt.F). It's load imbalance, which has always been a problem with the MPP nesting code, made more acute by the fact that per-PE performance is pretty poor on the VPP vector nodes for this part of the code. The nest forcing is taking 25-35% of the time on this machine in a two-domain, 2-way nested case I've been testing; the usual hit is 10-15%. I haven't even started looking at I/O performance yet.

The other thing I've been looking into is whether a 2-d or 1-d processor decomposition is better on the VPP. My original thinking -- that 1-d is better because it avoids decomposing the domain in the i-dimension and so preserves vector performance -- seems to be correct. Vector performance does suffer, dropping below 200 Mflop/s/PE, if we decompose in i as well as j.

The command I've been using to gauge processor performance is in the following run script:

# NQS directives (shell, output handling, batch queue)
# @$-s /bin/csh
# @$-eo
# @$-q vqpe3
##setenv VPP_STATS 8           # profile tool (disabled)
setenv VPP_MBX_SIZE 64000000   # mailbox size for MPI
setenv FJSAMP "file:samp.dat,type:vtime,interval:10,pe:on"   # sampling profiler options
/bin/rm -f samp.out
cd data.t3a
timex mm5.mpp                              # run the model and report elapsed time
/usr/lang/bin/fjsamp mm5.mpp >& samp.out   # format the sampling data

The fjsamp utility will put a vector profile in the file samp.out when the script completes.

Another technique I've been using is to analyze the timing information that the model writes to the rsl.error.0000 file. The following script converts that raw timing info into milliseconds per coarse-domain time step:

#!/bin/csh
grep '\*\*\*[0-9]* 1 ' rsl.error.0000 | \
awk '{if(prev!=0)print $4-prev;prev=$4}'

Finally, I have been doing performance plots using Upshot and MPE (part of MPICH). This part has been a little tricky because I have had to hack on MPICH, MPE, and the model a little to get the MPE_LOGEVENT stuff in MPE to work correctly with Fujitsu's version of MPI. If this is crucial for you I can coach you through it.
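One more thing on the rsl.error.0000 timing: if you want a single summary number to hand your sys admin, a small variation on the script above should do it. This is just a rough sketch (I haven't run this exact version), and it assumes the cumulative timing value is in the fourth column, as in the script above:

#!/bin/csh
# Same extraction as the script above, but report the number of coarse-domain
# steps and the average milliseconds per step instead of the raw differences.
grep '\*\*\*[0-9]* 1 ' rsl.error.0000 | \
awk 'prev!=0{sum+=$4-prev;n++}{prev=$4}END{if(n)printf "steps: %d  avg ms/step: %.1f\n",n,sum/n}'

Comparing that average across runs with different processor counts gives you a quick scaling curve without any extra tooling.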
Tell me a little more about your work on this machine when you have a chance.

John