Gentle user,

This is a follow-up note that we received in response to one we
sent concerning timeout problems with MPI message passing under
Linux.  It turns out this is a Linux/MPI problem, and the MPICH group
at ANL is aware of it and working on a solution.  See also Helpdesk
item: "1999 05 11, MPI_Send deadlock on Linux Cluster".

Rotang

-------

Date: Thu, 10 Jun 1999 11:29:56 -0500
To: michalak@mmm.mmm.ucar.EDU
Cc: mpi-maint@mcs.anl.gov
Reply-To: mpi-maint@mcs.anl.gov
Subject: Re: [MPI #4442] p4 error on Hong Kong beowulf cluster

> I'm trying to help some parallel MM5 users at the U. Hong Kong, who are
> running on a Pentium/Linux/MPICH configuration.  The model runs a number
> of time steps, and then dies with this message to standard output on 
> processor 0:
> 
>    net_recv failed for fd = 8
>    p0_1048:  p4_error: net_recv read, errno = : 104
> 
> Does this suggest possible sources of the problem?

The errno means "Connection reset by peer" and is the LINUX TCP problem
that we have been having.  Basically, the LINUX TCP implementation becomes
unhappy and closes the connection.  We're planning to fix this in two
ways: we are adding our own flow control (for the next release), and
also planning to add automatic reconnection (probably not in the next
release).

Bill
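
-------

Helpdesk addendum: if you see a similar p4_error with a different
errno, the number can be decoded on the machine that reported it.  The
following is only a minimal sketch in Python, not part of MPICH or p4;
the helper name describe_errno is ours and purely illustrative.  Errno
numbering is platform-specific (104 is the Linux value for ECONNRESET),
so the lookup should be run on the same kind of system that produced
the message.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform.

    describe_errno is a hypothetical helper for illustration only.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Use the symbolic constant rather than a hard-coded 104, because the
# numeric value of ECONNRESET differs between operating systems.
name, message = describe_errno(errno.ECONNRESET)
print(errno.ECONNRESET, name, "-", message)
```

On a Linux node, calling describe_errno(104) should report the same
"Connection reset by peer" condition described in the reply above.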