On Monday, April 26, 1999 8:12 PM, Rick Knabb [SMTP:knabb@soest.hawaii.edu] wrote:
> Hello,
>
> Thanks for starting the new MPP mailing list... I hope it's OK
> to send inquiries to you as well.
>
> I have the model compiled and running on a Fujitsu VPP in a
> four-processor queue submitted via NQS. The sys admin would
> like me to provide him info on how efficient the parallel
> processing is for MM5-MPP running on this machine... goal is
> to determine if I need more processors or changes to the code
> to get a particular configuration to run faster. I'm somewhat
> familiar with NQS, etc.... but I need suggestions for getting
> some numbers on how well the model is using the resources I'm
> asking for. Does MM5 output any info of this nature, or do
> you know of NQS commands that provide such details?
>
> Thanks very much!
> Rick Knabb
>
> --
> Rick Knabb, Research Meteorologist
> University of Hawaii
> Institute for Astronomy
> 2680 Woodlawn Drive
> Honolulu, HI 96822

Dear Rick,

What a coincidence. I've been spending quite a bit of time lately trying to understand and improve performance and scaling on a Fujitsu VPP.

What I have discovered so far is that single-domain scenarios scale well, at least up to small numbers of processors (I only have 6 available on the machine I am using, and the runs so far have been on 4 or fewer). Processor performance is "ok" -- definitely not stellar: somewhere between 250 and 400 Mflop/sec/PE.

The real bottleneck has been the parallel efficiency of the code when nesting is turned on, in particular the nest-forcing code in the MPP version (see MPP/RSL/parallel_src/mp_stotndt.F). It's load imbalance, which has always been a problem with the MPP nesting code, made more acute by the fact that per-PE performance is pretty poor on the VPP vector nodes for this part of the code. The nest forcing is taking 25-35% of the time on this machine in a two-domain, 2-way nested case I've been testing; the usual hit is 10-15%. I haven't even started looking at I/O performance yet.

The other thing I've been looking into is whether a 2-d or 1-d processor decomposition is better on the VPP. My original thinking -- that 1-d is better because it avoids decomposing the domain in the i-dimension and so preserves vector performance -- seems to be correct. Vector performance does suffer, dropping below 200 Mflop/s/PE, if we decompose in i as well as j.

The command I've been using to gauge processor performance is in the following run script:

# NQS directives (shell, output handling, batch queue)
# @$-s /bin/csh
# @$-eo
# @$-q vqpe3
##setenv VPP_STATS 8           # profile tool (disabled)
setenv VPP_MBX_SIZE 64000000   # mailbox size for MPI
setenv FJSAMP "file:samp.dat,type:vtime,interval:10,pe:on"   # sampling profiler options
/bin/rm -f samp.out
cd data.t3a
timex mm5.mpp                              # run the model and report elapsed time
/usr/lang/bin/fjsamp mm5.mpp >& samp.out   # format the sampling data

The fjsamp utility will put a vector profile in the file samp.out when the script completes.

Another technique I've been using is to analyze the timing information that the model writes to the rsl.error.0000 file. The following script converts that raw timing info into milliseconds per coarse-domain time step:

#!/bin/csh
grep '\*\*\*[0-9]* 1 ' rsl.error.0000 | \
awk '{if(prev!=0)print $4-prev;prev=$4}'

Finally, I have been doing performance plots using Upshot and MPE (part of MPICH). This part has been a little tricky because I have had to hack on MPICH, MPE, and the model a little to get the MPE_LOGEVENT stuff in MPE to work correctly with Fujitsu's version of MPI. If this is crucial for you I can coach you through it.
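One more thing on the rsl.error.0000 timing: if you want a single summary number to hand your sys admin, a small variation on the script above should do it. This is just a rough sketch (I haven't run this exact version), and it assumes the cumulative timing value is in the fourth column, as in the script above:

#!/bin/csh
# Same extraction as the script above, but report the number of coarse-domain
# steps and the average milliseconds per step instead of the raw differences.
grep '\*\*\*[0-9]* 1 ' rsl.error.0000 | \
awk 'prev!=0{sum+=$4-prev;n++}{prev=$4}END{if(n)printf "steps: %d  avg ms/step: %.1f\n",n,sum/n}'

Comparing that average across runs with different processor counts gives you a quick scaling curve without any extra tooling.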
Tell me a little more about your work on this machine when you have a chance.

John