Date: 	Thu, 24 Jun 1999 19:41:18 +0800
From: David Yeung <dyeung@ust.hk>
Organization: HKUST
To: John Michalakes <michalak@mmm.mmm.ucar.EDU>
Subject: Re: MM5 on PC cluster fails after modifying value in XENNES

John

Your solution works! I have run the same cases twice, and
both runs were successful.

Thanks

david


> David,
> 
> I believe this is the same problem as before, having to do with MPI and
> p4.  I have one idea to try.
> 
> In the file MPP/RSL/RSL/makefile.linux, add -DRSL_SYNCIO to the CFLAGS
> line.  While still in that directory, type 'make clean' and 'make
> linux', then cd back up to the top level and make mpp.
> 
> This will cause node zero to send a message to each processor, allowing
> that processor to send its data to be output.  Ordinarily the nodes just
> send their data whether node 0 is ready or not.  The change relieves MPI
> of having to buffer up all of these sends before node 0 can pull off
> the data.
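> The idea can be sketched roughly like this (an illustration only, not
> the actual RSL code -- the names and structure here are made up):

```python
# Sketch of the flow-control idea behind -DRSL_SYNCIO.  Without the
# handshake, every worker pushes its output eagerly and MPI has to buffer
# all of it; with the handshake, node 0 releases one worker at a time, so
# at most one send is in flight.  Threads and queues stand in for MPI
# ranks and messages here.
import queue
import threading

def gather_with_handshake(n_workers):
    """Node 0 asks each worker in turn for its output slab."""
    go = [queue.Queue() for _ in range(n_workers)]  # "go" token per worker
    data = queue.Queue()                            # shared output channel

    def worker(rank):
        go[rank].get()                    # block until node 0 says "send"
        data.put((rank, f"slab-{rank}"))  # then send this rank's output

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_workers)]
    for t in threads:
        t.start()

    received = []
    for rank in range(n_workers):    # node 0 pulls one slab at a time
        go[rank].put("send")         # release exactly one worker...
        received.append(data.get())  # ...and wait for its slab
    for t in threads:
        t.join()
    return received

print(gather_with_handshake(4))
# prints [(0, 'slab-0'), (1, 'slab-1'), (2, 'slab-2'), (3, 'slab-3')]
```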
> 
> Please try this and let me know if it helps.
> 
> Thanks,
> 
> John
> 
>>
>> Dear John
>>
>> We have recently modified the XENNES values in mm5.deck from:
>>
>>   XENNES = 4320., 2880.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.
>>
>> to
>>
>>   XENNES = 4320., 4320.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.
>>
>>
>> The execution time becomes much longer, and the run now fails much more
>> frequently with a p4_error. The error message is slightly different from
>> the one I reported to you last time. This time the message displayed by
>> mm5.deck on the screen is:
>>
>>  p4_error: net_recv read:  probable EOF on socket: 1
>>
>> I am not sure whether it is the same problem as last time. However,
>> rebooting all of the cluster PCs does not help this time, and the error
>> always occurs near the end of the run. I have run the case about 5 times,
>> and only one run was successful. The successful run took almost 4 hours
>> to finish; the other runs always failed at around 3:57 (elapsed) or later.
>>
>> Here are the error messages (from mm5.deck) for my last two runs:
>>
>> -------------------------
>> running /home/dyeung/mm5/Run/mm5.mpp on 8 LINUX ch_p4 processors
>> Created /home/dyeung/mm5/Run/PI1111
>> hqlxcl01 -- rsl_nproc_all 8, rsl_myproc 0
>> mpi02.clhq -- rsl_nproc_all 8, rsl_myproc 1
>> mpi04.clhq -- rsl_nproc_all 8, rsl_myproc 3
>> mpi03.clhq -- rsl_nproc_all 8, rsl_myproc 2
>> mpi05.clhq -- rsl_nproc_all 8, rsl_myproc 4
>> mpi06.clhq -- rsl_nproc_all 8, rsl_myproc 5
>> mpi07.clhq -- rsl_nproc_all 8, rsl_myproc 6
>> mpi08.clhq -- rsl_nproc_all 8, rsl_myproc 7
>> rm_l_4_752:  p4_error: interrupt SIGINT: 2
>> rm_l_7_746:  p4_error: interrupt SIGINT: 2
>> bm_list_1302:  p4_error: interrupt SIGINT: 2
>> rm_l_5_746:  p4_error: interrupt SIGINT: 2
>> Command exited with non-zero status 1
>> 10566.13user 1033.14system 3:57:26elapsed 81%CPU (0avgtext+0avgdata 0maxresident)k
>> 0inputs+0outputs (16602major+326718minor)pagefaults 115swaps
>> --------------------------
>> running /home/dyeung/mm5/Run/mm5.mpp on 8 LINUX ch_p4 processors
>> Created /home/dyeung/mm5/Run/PI876
>> hqlxcl01 -- rsl_nproc_all 8, rsl_myproc 0
>> mpi04.clhq -- rsl_nproc_all 8, rsl_myproc 3
>> mpi02.clhq -- rsl_nproc_all 8, rsl_myproc 1
>> mpi03.clhq -- rsl_nproc_all 8, rsl_myproc 2
>> mpi05.clhq -- rsl_nproc_all 8, rsl_myproc 4
>> mpi06.clhq -- rsl_nproc_all 8, rsl_myproc 5
>> mpi07.clhq -- rsl_nproc_all 8, rsl_myproc 6
>> mpi08.clhq -- rsl_nproc_all 8, rsl_myproc 7
>> Broken pipe
>> rm_l_5_1230:  p4_error: net_recv read:  probable EOF on socket: 1
>> rm_l_7_1228:  p4_error: interrupt SIGINT: 2
>> rm_l_3_1316:  p4_error: interrupt SIGINT: 2
>> rm_l_4_1242:  p4_error: interrupt SIGINT: 2
>> rm_l_1_1319:  p4_error: interrupt SIGINT: 2
>> bm_list_1067:  p4_error: interrupt SIGINT: 2
>> Command exited with non-zero status 1
>> 10556.35user 994.07system 3:57:36elapsed 81%CPU (0avgtext+0avgdata 0maxresident)k
>> 0inputs+0outputs (16653major+258186minor)pagefaults 651swaps
>> --------------------------
>> Thanks
>>
>> david
>>

