[Note: this problem is not strictly MPP-related; however, it will affect
 the MPP code as well when OpenMP is used. -Rotang]

----------------

From: "Kelly, Michael A." <makelly@tasc.com>
To: "mm5-users@UCAR. EDU (E-mail)" <mm5-users@UCAR.EDU>
Subject: FW: FW: Problem running on dual processors
Date: Mon, 28 Aug 2000 15:39:51 -0400

To the MM5 Users' Group:

Thanks to you all for your excellent suggestions. I have forwarded Marc
Michelsen's e-mail (see below) to the entire group, because his suggestion
turned out to be correct. I hope Marc does not mind.

The problem that prevented me from running on dual processors was that the
contiguous virtual memory (defined by vm-vpagemax) was too small for this
nest configuration. After I increased vpagemax from 16384 to 128000, this
configuration ran successfully. [To increase vpagemax, simply click on vm in
the kernel GUI, and then enter 128000 for the vpagemax entry. You will then
have to reboot.] 

David Ovens reported that he had a similar error when he used the
Burk-Thompson PBL scheme. In this case, I did not use that scheme.
 
Thanks again.

Mike 

-----Original Message-----
From: Marc Michelsen [mailto:marc@atmos.washington.edu]
Sent: Monday, August 28, 2000 12:47 PM
To: makelly@tasc.com
Subject: Re: FW: Problem running on dual processors


Hi Michael,

I ran into this problem before on an es40 running V4.0 unix:

> more t.f
      parameter (m = 34000000)
C      parameter (m = 33000000)  
      common /d/ b(m)

!$omp paralleldo
      do i= 1, m
         b(i) = i
      end do
      end

> f90 -omp t.f
> a.out
%DECthreads bugcheck (version V3.15-397), terminating execution.
% Reason:  Failure initializing the manager thread tcb (12)
% Running on OSF1 V4.0 on Compaq AlphaServer ES40, 2048Mb; 4 CPUs
IOT trap
> 

but it works fine if m = 33000000.
The amount of memory needed to run the program (with m = 34000000) is 136MB:
> size a.out
text    data    bss     dec     hex
8192    8192    135992320       136008704       81b5400
> 

Tracing the system calls it makes, it dies trying to change the protection
of the page of memory at the bottom of a thread's stack to no access.
This is done so that if one thread's stack grows too much, it will hit the
no-access page and seg fault rather than extending into the next thread's
stack, which would only postpone the inevitable problem and make it more
difficult to debug.

mprotect (0x1481c6000, 8192, PROT_NONE) = -1, Errno 12 (Not enough space)
write (2, 0xc01b6f90, 65) = %DECthreads bugcheck (version V3.15-397),
terminating execution.

The mprotect man page mentions that the most common cause of ENOMEM
(Errno 12) is that the process exceeds the vpagemax kernel config
variable. This variable limits the size of a contiguous virtual address
space that can have individual permissions for each page within it.

That variable is currently set to:
> /sbin/sysconfig -q vm | grep pagemax
vm-vpagemax = 16384
> 
That is 16,384 pages. The page size is 8K, so the largest amount of
contiguous virtual address space is 16,384 * 8K = 131072K, or about 134MB
(134,217,728 bytes). So the first mprotect call fails when m = 34000000,
i.e. when the amount of memory used is 136MB, which is > 134MB; with
m = 33000000 the array takes 132MB and fits.
I believe it puts all your static storage (common blocks) and your thread
stacks, with a no-access page after each thread stack, in one contiguous
virtual address space.
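
The arithmetic can be checked with a few lines of shell. This is only a
sketch: it assumes b(m) is a default 4-byte REAL (which agrees with the bss
size reported by "size a.out" above), and the variable names are made up
for illustration.

```sh
#!/bin/sh
# Compare the vm-vpagemax limit with the size of common block /d/.
# Assumption: b(m) is default REAL, i.e. 4 bytes per element.
vpagemax=16384          # pages, as reported by sysconfig -q vm
page_bytes=8192         # 8K page size
limit_bytes=$((vpagemax * page_bytes))
echo "limit: $limit_bytes bytes"

for m in 33000000 34000000; do
    array_bytes=$((m * 4))
    if [ "$array_bytes" -gt "$limit_bytes" ]; then
        echo "m=$m: $array_bytes bytes, exceeds limit"
    else
        echo "m=$m: $array_bytes bytes, fits"
    fi
done
```

The limit comes out between the two array sizes, which is consistent with
the run working for m = 33000000 and failing for m = 34000000.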

To fix the problem, I believe you just need to increase the vm-vpagemax
kernel variable. I was never able to test this, however, because it wasn't
my machine.
I found the following doing a web search for vpagemax:

   This parameter is described in the Tru64 UNIX System
   Configuration and Tuning manual. Refer to the section on Tuning
   Virtual Memory Limits. Relevant tools are briefly described by man
   sysconfig and man sysconfigdb. To change the vm-vpagemax
   parameter, use the /sbin/sysconfigdb command to update a
   section of /etc/sysconfigtab. For example, to modify
   vm-vpagemax to be 128000, the input file to the
   /sbin/sysconfigdb command (specified using the -f option)
   would contain the following:

   vm:
   vm-vpagemax = 128000

   The default value for vm-vpagemax is 16384. Use the following
   command to determine the value of vm-vpagemax on a running
   system:

   % sysconfig -q vm vm-vpagemax
   vm:
   vm-vpagemax = 16384
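
As a rough sketch of the steps described in that excerpt: the script below
writes the stanza file and prints the sysconfigdb command instead of running
it. The stanza file name is hypothetical, and the -m (merge) option is an
assumption that should be checked against "man sysconfigdb" before use; the
excerpt itself only mentions -f. The attribute name is vm-vpagemax, as
reported by sysconfig -q.

```sh
#!/bin/sh
# Write a stanza file for the vm subsystem and show the update command.
# Assumptions: the stanza path is hypothetical, and -m (merge) should be
# verified against the sysconfigdb man page on the target system.
stanza="${TMPDIR:-/tmp}/vm_vpagemax.stanza"
cat > "$stanza" <<'EOF'
vm:
        vm-vpagemax = 128000
EOF

# Printed rather than executed: it has to be run as root on the Alpha.
cmd="/sbin/sysconfigdb -m -f $stanza vm"
echo "$cmd"
```

After running the printed command as root, a reboot is probably needed for
the new value to take effect, and "sysconfig -q vm vm-vpagemax" can then be
used to confirm it.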

Apparently V5.0 unix does not have this kernel variable
(/sbin/sysconfig -q vm does not show it), and every page of
the address space can have individual permissions regardless
of how large it is, so it's not a problem on V5.0.

Marc


> 
> Hi All,
> 
> I am attempting to run a 6-nest MM5 simulation on dual processors on a DS20
> Dec Alpha. This setup runs successfully on a single processor. But on dual
> processors, I get a core file and the run bombs. The error in mm5.print.out
> is shown below. I have successfully run on dual processors for simpler MM5
> nest configurations.
> 
> Wei Wang at NCAR suggested: "setenv MP_STACK_SIZE 32000000". I set the
> MP_STACK_SIZE as suggested (and tried much larger numbers as well), but
> these changes had no effect. The run still bombs.
> 
> Has anyone seen this error before? Does anyone have any suggestions?
> 
> Thank you.
> 
> 
> Michael A. Kelly, Ph.D.
> Atmospheric Scientist
> Litton-TASC
> 4801 Stonecroft Blvd.
> Chantilly VA 20151
> (703) 633-8300 x8695
> 
> 
> > *******************************************************************
> > Here is the error message in mm5.print.out when we try to run on two
> > processors:
> > 
> > %DECthreads bugcheck (version 3.15-397), terminating execution
> > %Reason: Failure initializing the manager thread tcb (12)
> > %Running on OSF1 v4.0 on AlphaServer DS20 500MHz, 2048MB, 2 CPUs
> > 
> > 
> > The setup of the run is as follows:
> > LEVIDN = 0, 1, 1, 2, 2, 2
> > NUMNC= 1, 1, 1, 2, 2, 3
> > NESTIX=  80, 112, 55, 91, 91, 91
> > NESTJX=100, 109, 55, 91, 91, 91.
> > 
>