Gentle user,

This is a follow-up note that we received in response to a note we sent
concerning timeout problems with MPI message passing under Linux. It turns
out this is a Linux/MPI problem, and the MPICH group at ANL is aware of it
and working on a solution.

See also Helpdesk item: "1999 05 11, MPI_Send deadlock on Linux Cluster".

Rotang

-------

Date: Thu, 10 Jun 1999 11:29:56 -0500
To: michalak@mmm.mmm.ucar.EDU
Cc: mpi-maint@mcs.anl.gov
Reply-To: mpi-maint@mcs.anl.gov
Subject: Re: [MPI #4442] p4 error on Hong Kong beowulf cluster

> I'm trying to help some parallel MM5 users at the U. Hong Kong, who are
> running on a Pentium/Linux/MPICH configuration. The model runs a number
> of time steps, and then dies with this message to standard output on
> processor 0:
>
> net_recv failed for fd = 8
> p0_1048: p4_error: net_recv read, errno = : 104
>
> Does this suggest possible sources of the problem?

The errno means "Connection reset by peer" and is the LINUX TCP problem
that we have been having. Basically, the LINUX TCP implementation becomes
unhappy and closes the connection. We're planning to fix this in two ways:
we are adding our own flow control (for the next release), and also
planning to add automatic reconnection (probably not in the next release).

Bill
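
For anyone who wants to check what a numeric errno such as the 104 in the p4_error message means on their own machine, here is a minimal sketch. It is not part of MPICH or of the reply above; the helper name describe_errno is made up for illustration, and errno numbers are platform-specific, so the value 104 is assumed to be read on a Linux system.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, system message) for a numeric errno value.

    describe_errno is a hypothetical helper for this note, not an MPICH call.
    """
    # errno.errorcode maps numeric values to names such as "ECONNRESET".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the operating system's message for the code.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 is the value reported in the p4_error line; on Linux it
    # corresponds to ECONNRESET ("Connection reset by peer").
    name, message = describe_errno(104)
    print(f"errno 104: {name} ({message})")
```

On other operating systems the same number may map to a different error, so look the value up on the machine that produced the message.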