On Tuesday, May 11, 1999 10:20 AM, XXXXX wrote:
> Hi,
>
> I'm running the MPP version of MM5 V2 release 10 on a cluster of linux
> PCs. The program runs fine when all the nodes run kernel 2.0.x. However,
> on kernel 2.2.x (the latest stable kernel) the execution enters a deadlock
> at random.
>
> After analysing the stuck processes, it seems that 2 or 3 processes try
> to send a message to each other concurrently (usually 3). Looking at their
> stacks, I could see that the deadlock is deep inside the MPI code. When
> this happens, these 2 or 3 processes remain busy-waiting (repeatedly
> attempting to read from the socket), while the remaining processes are
> sleeping (waiting for data on their sockets).
>
> The MPI support suggested that this might be due to 'unsafe' use of MPI by
> the RSL library, because buffering with MPI_Send is implementation
> dependent.
>
> This suggests that either:
>
> 1. The RSL library (using MPI) is broken. Maybe it should use MPI_Bsend
> (with appropriate buffer allocation).
>
> 2. The MPICH code is flawed.
>
> 3. The linux kernel 2.2.x is causing the problem (evidently there is a
> difference between 2.2.x and 2.0.x which works ok)
>
> or
>
> 4. A combination of 2 and 3: incompatible behaviour of the kernel and
> MPICH incorrectly handling this feature.
>
>
> I'm trying to further investigate case no. 1, by using MPI_Bsend and
> adding buffering. I'll be happy to provide more information on this upon
> request.
>
> Any comments? Has anybody experienced such behaviour? I'd appreciate if
> other users of MM5 under linux would share their experience.
>
> Thanks,
>

Hi XXXXX,

Is it possible to identify which sends are jamming? That is, where is MPI_Send being called from?

If you'd like to try using MPI_Bsend instead of MPI_Send, edit the file MPP/RSL/RSL/rsl_comm.h and change the definition of the RSL_SEND CPP macro at or around line 283.
Then make clean in the MPP/RSL/RSL directory, remake the library for your architecture, and then cd ../../.. and make mpp. It shouldn't be necessary to recompile MM5 itself.

Please let me know how it goes...

John
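
For illustration only: the 'unsafe' pattern mentioned in the quoted message, where peers all issue a blocking send before any of them receives, works only as long as something underneath buffers the data. The C sketch below is not MPICH or RSL code, and the thread does not establish that this is the cause of the reported hang. It uses a plain local socket pair to show that a socket accepts only a limited number of bytes while nobody is reading. The function name bytes_accepted_without_reader is hypothetical, and the actual limit depends on the kernel and its settings.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to push msg_len bytes into one end of a local
 * stream socket pair while the other end never reads. Returns the number
 * of bytes the kernel accepted before a blocking send would have stalled,
 * or -1 on error.
 */
long bytes_accepted_without_reader(size_t msg_len)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Non-blocking mode turns "would block forever" into EAGAIN. */
    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    char chunk[4096];
    memset(chunk, 0, sizeof chunk);

    size_t sent = 0;
    long result = 0;
    while (sent < msg_len) {
        size_t want = msg_len - sent;
        if (want > sizeof chunk)
            want = sizeof chunk;

        ssize_t n = send(sv[0], chunk, want, 0);
        if (n > 0) {
            sent += (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;               /* interrupted, retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                  /* socket buffer is full */
        } else {
            result = -1;            /* unexpected failure */
            break;
        }
    }

    close(sv[0]);
    close(sv[1]);
    return result < 0 ? -1 : (long)sent;
}
```

A short message is accepted in full, while a sufficiently large one stops partway. Two peers that each issue a large blocking send before receiving would therefore both wait on each other. MPI_Bsend does not rely on this kind of implicit buffering: it copies the message into a buffer that the application supplies with MPI_Buffer_attach, which is the "appropriate buffer allocation" referred to in the quoted message.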