Gentle Reader,

Parallel debugging is probably one of the most vexing enterprises you will deal with, but sometimes it is unavoidable. Say, for example, your parallel MM5 is running correctly on one process, but when you switch to two or more processes you get a nondescript floating point exception or segmentation violation on one of the other processes. The steps on this page will allow you to start a parallel debugging session in which each process is under the control of a separate dbx session and can be debugged individually as part of the running parallel program.

PREREQUISITES and NOTES:

A windowing terminal with the ability to open as many windows as there are processes you wish to debug.

MPICH must have been configured and compiled to use the ch_p4 ADI.

Please note that the example shown here was run on an Alpha workstation running Tru64 UNIX. I have also tested this on a Linux PC cluster (beowulf1.pnl.com); the only change there is that you need to use gdb instead of dbx. The steps on other types of UNIX workstation should be similar.

Steps:

1. Run the program once using the regular mpirun command so that it generates a P4 procgroup file. The name will be something like PI28726 and it will contain something like:

    maple 0 /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp
    maple 1 /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp

You will use the name of this file in subsequent steps. (Note that maple is the host I am using, and in this case I'm actually running both parallel processes on the same workstation. That is not a requirement; you can start up debugging sessions on multiple hosts if you wish, by editing the contents of this file. Note that each debugging window should be opened on the host on which its process sits.)

2. Start up the first MPI process under dbx:

    % dbx mm5.mpp
    (dbx) run -p4norem -p4pg PI28726
    thread 0x3 signal [signal 33554443] at >*[aio_init, 0x3ff81115b5c] ...
    (dbx) cont
    New child attached.
    Use switch to gain access to process 15798
    waiting for process on host maple:
    /maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp maple 1969 -p4amslave
    (dbx)

At this point process zero is started and waiting for the other process(es) to start up. I am running on only two processes in this example, so I only have to start up one additional process. Note that the output from the first process gives the information that will be used to start the second one: the arguments "maple 1969 -p4amslave" are exactly what you pass to run in the next step.

3. In another window, start up the second MPI process under dbx:

    maple% dbx mm5.mpp
    (dbx) run maple 1969 -p4amslave
    thread 0x3 signal [signal 33554443] at >*[aio_init, 0x3ff81115b5c] ...
    (dbx) cont
    New child attached.
    Use switch to gain access to process 28693
    maple -- rsl_nproc_all 2, rsl_myproc 1

The last line of the output above is an indication that the code is now running on process 1 (the second process). You'll notice that in the original window, where you started process 0, there is also a new line of output:

    maple -- rsl_nproc_all 2, rsl_myproc 0

At this point you can debug the program on each process as a separate job.

Happy Hacking....

-Rotang
Feb. 13, 2001
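P.S. If you get tired of copying the host and port from the first window by hand, the hand-off between steps 2 and 3 can be sketched in a few lines of Python. This is only a rough helper, not part of MPICH or MM5: the function name slave_run_command is made up for illustration, and the pattern it matches is assumed from the single sample of "waiting for process on host" output shown on this page, so it may need adjusting for other MPICH versions or debuggers.

```python
import re

# Pattern assumed from the one sample shown on this page:
#   waiting for process on host maple: /path/to/mm5.mpp maple 1969 -p4amslave
# Other MPICH builds may print this differently.
WAITING_RE = re.compile(
    r"waiting for process on host\s+(?P<host>\S+):\s*"
    r"(?P<exe>\S+)\s+(?P<master>\S+)\s+(?P<port>\d+)\s+-p4amslave"
)


def slave_run_command(output_text):
    """Return the debugger 'run' command for the waiting process.

    Hypothetical helper: returns None if the expected text is not found.
    """
    match = WAITING_RE.search(output_text)
    if match is None:
        return None
    return "run {} {} -p4amslave".format(match.group("master"), match.group("port"))


if __name__ == "__main__":
    # Text pasted from the process-zero window in step 2.
    sample = (
        "waiting for process on host maple:\n"
        "/maple/michalak/CWO5/parwork/MM5/Run/mm5.mpp maple 1969 -p4amslave"
    )
    print(slave_run_command(sample))
```

Paste whatever process zero printed into the function and type the resulting command at the (dbx) or (gdb) prompt in the second window, on the host named in the "waiting for process on host" message.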