Sender: wesley@sgi.com
Date: Mon, 19 Jun 2000 14:42:34 -0700
From: Wesley Jones
To: "John G. Michalakes", toigo@gps.caltech.edu
Subject: Re: Fwd: Question on an MPI error

Hi Anthony,

Per the MPI man page, "man mpi," MPI_MSGS_PER_PROC is an environment
variable:

     MPI_MSGS_PER_PROC
          Sets the maximum number of message headers to be allocated
          from sending process space for outbound messages going to
          the same host.  (This variable might be required by
          standard-compliant programs.)  MPI allocates buffer space
          for local messages based on the message destination.  Space
          for messages that are destined for local processes is
          allocated as additional process space for the sending
          process.
          Default: 1024

A large number of message headers is only required at the very beginning
of MM5, where it is distributing a bunch of information to the rest of
the MPI ranks via a call to broadcast.  You should be able to overcome
the problem with

     csh:  setenv MPI_MSGS_PER_PROC 4096
     ksh:  export MPI_MSGS_PER_PROC=4096

You might want to try setting the MPI_STATS environment variable and see
if you have any RETRIES.  If you do, you can try to increase the
appropriate buffer by looking at the end of the error output file, where
the information about retries will be located, and by looking at the man
page.  After testing for RETRIES you will want to unset the MPI_STATS
environment variable.

Let me know if you have other problems,

Wes

"John G. Michalakes" wrote:
>
> Wes, do these error messages coming out of the SGI version of MPI make any
> sense to you?  Any ideas about what to do?
>
> John
>
> >From: Anthony Toigo
> >Date: Sun, 18 Jun 2000 22:57:30 -0700 (PDT)
> >To: "John G. Michalakes"
> >Subject: Question on an MPI error
> >X-Mailer: VM 6.43 under 20.4 "Emerald" XEmacs Lucid
> >Reply-To: toigo@gps.caltech.edu
> >
> > . . . .
> >
> >I tried taking your advice and running my modified model with as few
> >modifications as possible, and I came up with a strange error.  I was
> >wondering if you could identify it for me.
> >
> >Although the question is simple, the setup is a little long:
> >
> >My test run is (up to) 4 domains, with the following settings (digested
> >from the deck file, extra values removed for ease of reading):
> >
> >LEVIDN = 0,1,2,3,          ; level of nest for each domain
> >NUMNC  = 1,1,2,3,          ; ID of mother domain for each nest
> >NESTIX = 72, 76, 76, 76,   ; domain size i
> >NESTJX = 72, 76, 76, 76,   ; domain size j
> >NESTI  = 1, 24, 26, 26,    ; start location i
> >NESTJ  = 1, 24, 26, 26,    ; start location j
> >
> >Using an SGI Origin 2000, I got the code to run successfully with one and
> >two domains.  However, when I switched to 3 or 4 domains, I got the
> >following errors:
> >
> >% timex mpirun -v -np 64 ./mm5.mpp
> >MPI: libxmpi.so 'SGI MPI 3.2.0.7 01/31/00 12:48:41'
> >MPI: libmpi.so 'SGI MPI 3.2.0.7 01/31/00 11:47:23 (N32_M4)'
> >MPI: MPI_MSGS_PER_HOST= 0
> >MPI: MPI_MSGS_PER_PROC= 1024
> >MPI: MPI_MSG_RETRIES= 500
> >MPI: MPI_BUFS_PER_HOST= 0
> >MPI: MPI_BUFS_PER_PROC= 32
> >MPI: MPI_MSG_LISTS= 8
> >MPI: MPI_BUF_LISTS= 0
> >.
> >. (allocated processor chatter removed)
> >.
> >MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
> >MPI: aborting job
> >
> >Looking in the rsl.error.0000 file:
> >
> >% more rsl.error.0000
> >*** MPI has run out of PER_PROC message headers.
> >*** The current allocation levels are:
> >*** MPI_MSGS_PER_HOST = 0
> >*** MPI_MSGS_PER_PROC = 1024
> >*** MPI_MSG_RETRIES = 500
> >IOT Trap
> >
> >Finally, I get a core file that I don't usually get on the Beowulf clusters:
> >
> >The core file reports were different for the 3 and 4 domain runs, although
> >all of the above information (output from the mpirun command and the
> >contents of rsl.error.0000) was identical.
> >
> >For 3 domains:
> >
> >% cvdump mm5.mpp core.277591
> >Executable: /tmp/toigo/MM5/Run/mm5.mpp
> >Core file: /tmp/toigo/MM5/Run/core.277591
> >Core from signal SIGABRT: Abort (see abort(3c))
> >=========================================
> >
> >_kill() ["kill.s":15, 0x0fad4928]
> >_raise() ["raise.c":27, 0x0fad52a4]
> >abort() ["abort.c":52, 0x0fa3de40]
> >sigdie() ["main.c":156, 0x0ad99d24]
> >sigidie() ["main.c":117, 0x0ad99c00]
> >_sigtramp() ["sigtramp.s":71, 0x0fad4dcc]
> >_kill() ["kill.s":15, 0x0fad4928]
> >_raise() ["raise.c":27, 0x0fad52a4]
> >abort() ["abort.c":44, 0x0fa3de0c]
> >MPI_SGI_request_send() ["req.c":235, 0x030c8968]
> >PMPI_Send() ["send.c":82, 0x03102734]
> >rsl_mon_bcast_() ["rsl_mon_bcast.c":131, 0x100de070]
> >DM_BCAST_INTEGERS() ["dm_io.f":249, 0x10089108]
> >PARAM() ["param.f":1437, 0x10030910]
> >MM5() ["mm5.f":895, 0x1001c704]
> >main() ["main.c":97, 0x0ad99b20]
> >__start() ["crt1text.s":177, 0x100086e8]
> >
> >where param.f:1437 is:
> >      CALL DM_BCAST_INTEGERS(START_INDEX,4)
> >
> >and dm_io.f:249 is:
> >      CALL RSL_MON_BCAST( BUF, N*4 )
> >in:
> >      SUBROUTINE DM_BCAST_INTEGERS( BUF, N )
> >      IMPLICIT NONE
> >      INTEGER BUF(*)
> >      INTEGER N
> >      CALL RSL_MON_BCAST( BUF, N*4 )
> >      RETURN
> >      END
> >
> >For 4 domains, it stopped on a different part of param.f:
> >
> >% cvdump mm5.mpp core.273381
> >Executable: /tmp/toigo/MM5/Run/mm5.mpp
> >Core file: /tmp/toigo/MM5/Run/core.273381
> >Core from signal SIGABRT: Abort (see abort(3c))
> >=========================================
> >
> >_kill() ["kill.s":15, 0x0fad4928]
> >_raise() ["raise.c":27, 0x0fad52a4]
> >abort() ["abort.c":52, 0x0fa3de40]
> >sigdie() ["main.c":156, 0x0ad99d24]
> >sigidie() ["main.c":117, 0x0ad99c00]
> >_sigtramp() ["sigtramp.s":71, 0x0fad4dcc]
> >_kill() ["kill.s":15, 0x0fad4928]
> >_raise() ["raise.c":27, 0x0fad52a4]
> >abort() ["abort.c":44, 0x0fa3de0c]
> >MPI_SGI_request_send() ["req.c":235, 0x030c8968]
> >PMPI_Send() ["send.c":82, 0x03102734]
> >rsl_mon_bcast_() ["rsl_mon_bcast.c":131, 0x100de5e0]
> >DM_BCAST_STRING() ["dm_io.f":287, 0x10089880]
> >PARAM() ["param.f":1440, 0x10030e60]
> >MM5() ["mm5.f":895, 0x1001cb9c]
> >main() ["main.c":97, 0x0ad99b20]
> >__start() ["crt1text.s":177, 0x10008708]
> >
> >Line 1440 of param.f is:
> >      CALL DM_BCAST_STRING(STAGGERING,4)
> >
> >Line 287 of dm_io.f is:
> >      CALL RSL_MON_BCAST( IBUF, N*4 )
> >in:
> >      SUBROUTINE DM_BCAST_STRING( BUF, N )
> >      IMPLICIT NONE
> >      INTEGER N
> >      CHARACTER*(*) BUF
> >      INTEGER IBUF(256),I
> >      IF (N .GT. 256) N = 256
> >      IF (N .GT. 0 ) THEN
> >        DO I = 1, N
> >          IBUF(I) = ICHAR(BUF(I:I))
> >        ENDDO
> >        CALL RSL_MON_BCAST( IBUF, N*4 )
> >        DO I = 1, N
> >          BUF(I:I) = CHAR(IBUF(I))
> >        ENDDO
> >      ENDIF
> >      RETURN
> >      END
> >
> >and I guess the rest you would know ...
> >
> >My question is: is this some error caused by my modifications, or is it
> >something about setting up parallel runs that I don't understand?
> >
> >I'm hoping it's a set-up error, since I got it to work with one or two
> >domains.  If you need any further information, please let me know.
> >
> >Once again, thank you very much for taking the time to help me out.
> >
> >Anthony Toigo
>
> ----------------------------------------------------------------------
> John Michalakes, michalak@ucar.edu, http://www.mcs.anl.gov/~michalak
> ----------------------------------------------------------------------
> MCS Division                | MMM Division
> Argonne National Laboratory | National Center for Atmospheric Research
>                             | 3450 Mitchell Lane, Boulder, CO 80301
>                             | 303-497-8199
> ----------------------------------------------------------------------

--
Wesley B. Jones, PhD           wesley@sgi.com
SGI, Boulder, CO               Phone: (303)-448-1165
Performance Engineering        FAX:
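
Putting Wes's suggestions together, a minimal csh session for the failing
run might look like the sketch below.  The 4096 value, the RETRIES check,
and the final unset come from Wes's message, and the mpirun command line
from Anthony's; using the value 1 to enable MPI_STATS is an assumption
here, so check "man mpi" for the exact convention on your system.

     # csh: raise the per-process message-header limit before launching MM5
     setenv MPI_MSGS_PER_PROC 4096

     # optionally collect MPI statistics so any RETRIES are reported in the
     # error output (value convention assumed; see "man mpi")
     setenv MPI_STATS 1

     # rerun the failing 3- or 4-domain case
     timex mpirun -v -np 64 ./mm5.mpp

     # once the RETRIES check is done, turn statistics back off
     unsetenv MPI_STATS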