How to use dbx with MPICH 1.2 to debug the parallel MM5 on a network of workstations.

Gentle Reader,

Parallel debugging is probably one of the most vexing enterprises you
will deal with, but sometimes it is unavoidable. Say, for example, your
parallel MM5 runs correctly on one process, but when you switch to two
or more processes you get a nondescript floating point exception or
segmentation violation on one of the other processes. The steps on this
page will allow you to start a parallel debugging session in which each
process is under the control of a separate dbx session and can be
debugged individually as part of the running parallel program.

PREREQUISITES and NOTES: You need a windowing terminal with the ability
to open as many windows as there are processes you wish to debug. MPICH
must have been configured and compiled to use the ch_p4 ADI. Please note
that the example shown here was run on an Alpha workstation running
Tru64 UNIX. I have also tested this on a Linux PC cluster
(beowulf1.pnl.com); the only change there is that you need to use gdb
instead of dbx. The steps on other types of UNIX workstation should be
similar.

Steps:

1. Run the program once just using the regular mpirun command so 
   that it generates a P4 procgroup file. The name will be something
   like PI28726 and it will contain something like:

     maple 0 /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp
     maple 1 /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp

   You will use the name of this file in subsequent steps.

   (Note that maple is the host I am using, and in this case,
   I'm actually running both parallel processes on the same workstation.
   That is not a requirement; you can start up debugging sessions on 
   multiple hosts if you wish, by editing the contents of this file.
   Note that your debugging windows should be opened on the host on
   which the process sits.)
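
If you edit the procgroup file to spread processes over several hosts, it
can help to list which hosts need a debugger window. The following is a
minimal sketch, assuming every line has the three whitespace-separated
fields seen in the example above (host, a number, program path). The names
parse_procgroup and hosts_needing_windows are hypothetical helpers, not part
of MPICH, and the second field is carried along verbatim without being
interpreted.

```python
from typing import List, NamedTuple


class ProcgroupEntry(NamedTuple):
    host: str
    second_field: str  # kept verbatim; its meaning is not interpreted here
    program: str


def parse_procgroup(text: str) -> List[ProcgroupEntry]:
    """Split each non-blank line into host, second field, and program path."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"unexpected procgroup line: {line!r}")
        entries.append(ProcgroupEntry(fields[0], fields[1], fields[2]))
    return entries


def hosts_needing_windows(entries: List[ProcgroupEntry]) -> List[str]:
    """Return the distinct hosts in file order; open debugger windows there."""
    seen: List[str] = []
    for entry in entries:
        if entry.host not in seen:
            seen.append(entry.host)
    return seen


# Contents of the example procgroup file from step 1.
SAMPLE = """\
maple 0 /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp
maple 1 /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp
"""

if __name__ == "__main__":
    for entry in parse_procgroup(SAMPLE):
        print(entry.host, entry.program)
```

In practice you would read the text from the generated file (PI28726 in this
example) instead of the SAMPLE string.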


2. Start up the first MPI process under dbx:

   % dbx mm5.mpp
   (dbx) run -p4norem -p4pg PI28726
   thread 0x3 signal [signal 33554443] at >*[aio_init, 0x3ff81115b5c] ...
   (dbx) cont
   New child attached.  Use switch to gain access to process 15798
   waiting for process on host maple:
   /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp maple 1969 -p4amslave
   (dbx)

At this point process zero is started and waiting for the other
process(es) to start up. I am running on only two processes in this
example, so I only have to start up one additional process. Note
that the output from the first process gives information that will
be used to start the second one.
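
Copying those arguments by hand is easy to get wrong, so here is a small
sketch of how they could be pulled out of the first window's output. It
assumes the output looks exactly like the transcript above: a line starting
with "waiting for process on host", followed by a line holding the program
path and then the arguments. The function names are hypothetical, and the
"run" prefix is the debugger command used in this example.

```python
from typing import List

WAITING_PREFIX = "waiting for process on host"


def slave_run_args(output: str) -> List[str]:
    """Return the arguments that follow the program path on the line
    after the 'waiting for process' message."""
    # Drop blank lines and surrounding whitespace from the pasted output.
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    for i, line in enumerate(lines):
        if line.startswith(WAITING_PREFIX):
            if i + 1 >= len(lines):
                raise ValueError("no command line after the waiting message")
            fields = lines[i + 1].split()
            if len(fields) < 2:
                raise ValueError(f"unexpected command line: {lines[i + 1]!r}")
            return fields[1:]  # everything after the program path
    raise ValueError("no 'waiting for process' message found")


def debugger_run_command(output: str) -> str:
    """Build the 'run ...' line to type in the second debugger window."""
    return "run " + " ".join(slave_run_args(output))


# Output of process zero, pasted from the first window in step 2.
SAMPLE = """\
New child attached.  Use switch to gain access to process 15798
waiting for process on host maple:
/maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp maple 1969 -p4amslave
"""

if __name__ == "__main__":
    print(debugger_run_command(SAMPLE))
```

With more than two processes, the sketch only handles the first waiting
message it finds; you would repeat it on the fresh output for each
additional process.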

3. In another window, start up the second MPI process under dbx:

   maple% dbx mm5.mpp
   (dbx) run maple 1969 -p4amslave
   thread 0x3 signal [signal 33554443] at >*[aio_init, 0x3ff81115b5c] ...
   (dbx) cont
   New child attached.  Use switch to gain access to process 28693
   maple -- rsl_nproc_all 2, rsl_myproc 1

The last line of the output above is an indication that the
code is now running on process 1 (the second process). You'll notice
that in the original window, where you started process 0, there is
also a new line of output:

   maple -- rsl_nproc_all 2, rsl_myproc 0

At this point you can debug the program on each process as a separate job.

Happy Hacking....

-Rotang

Feb. 13, 2001