[Note: this problem is not strictly MPP-related; however, it will affect the MPP code as well when OpenMP is used. -Rotang]

----------------

From: "Kelly, Michael A."
To: "mm5-users@UCAR.EDU (E-mail)"
Subject: FW: FW: Problem running on dual processors
Date: Mon, 28 Aug 2000 15:39:51 -0400

To the MM5 Users' Group:

Thanks to you all for your excellent suggestions. I have forwarded Marc Michelsen's e-mail (see below) to the entire group, because his suggestion turned out to be correct. I hope Marc does not mind.

The problem that prevented me from running on dual processors was that the contiguous virtual memory (defined by vm-vpagemax) was too small for this nest configuration. After I increased vpagemax from 16384 to 128000, this configuration ran successfully. [To increase vpagemax, simply click on vm in the kernel GUI, and then enter 128000 for the vpagemax entry. You will then have to reboot.]

David Ovens reported that he had a similar error when he used the Burk-Thompson PBL scheme. In this case, I did not use that scheme.

Thanks again.

Mike

-----Original Message-----
From: Marc Michelsen [mailto:marc@atmos.washington.edu]
Sent: Monday, August 28, 2000 12:47 PM
To: makelly@tasc.com
Subject: Re: FW: Problem running on dual processors

Hi Michael,

I ran into this problem before on an es40 running V4.0 unix:

> more t.f
      parameter (m = 34000000)
C     parameter (m = 33000000)
      common /d/ b(m)
!$omp paralleldo
      do i= 1, m
        b(i) = i
      end do
      end
> f90 -omp t.f
> a.out
%DECthreads bugcheck (version V3.15-397), terminating execution.
% Reason: Failure initializing the manager thread tcb (12)
% Running on OSF1 V4.0 on Compaq AlphaServer ES40, 2048Mb; 4 CPUs
IOT trap
>

But it works fine if m = 33000000. The amount of memory to run the program is 136MB:

> size a.out
text    data    bss         dec         hex
8192    8192    135992320   136008704   81b5400
>

Tracing the system calls it makes, it dies trying to modify the protection of the page of memory that's at the bottom of a thread's stack to no access.
This is done so that if one thread's stack grows too much it will hit the no-access page and seg fault, rather than extending into the next thread's stack, which would postpone the inevitable problem and make it more difficult to debug.

mprotect (0x1481c6000, 8192, PROT_NONE) = -1, Errno 12 (Not enough space)
write (2, 0xc01b6f90, 65) = %DECthreads bugcheck (version V3.15-397), terminating execution.

Looking at the mprotect man page, it mentions that the most common cause of ENOMEM (Errno 12) is that the process exceeds the vpagemax kernel config variable. This variable limits the size of a contiguous virtual address space that can have individual permissions for each page within it. That variable is currently set to:

> /sbin/sysconfig -q vm | grep pagemax
vm-vpagemax = 16384
>

16,384 pages. The page size is 8K, so the largest amount of contiguous virtual address space is 16,384 * 8K = 131072K, or about 134MB (134,217,728 bytes). So on the first mprotect call it makes, it dies when m = 34000000, i.e. the amount of memory used is 136MB, which is > 134MB. I believe it puts all your static storage (common blocks) and your thread stacks, with a no-access page after each thread stack, in one contiguous virtual address space.

To fix the problem I believe you just need to increase the vm-vpagemax kernel variable. I was never able to test this, however, because it wasn't my machine. I found the following doing a web search for vpagemax:

This parameter is described in the Tru64 UNIX System Configuration and Tuning manual. Refer to the section on Tuning Virtual Memory Limits. Relevant tools are briefly described by man sysconfig and man sysconfigdb.

To change the vm-vpagemax parameter, use the /sbin/sysconfigdb command to update a section of /etc/sysconfigtab. For example, to modify vm-vpagemax to be 128000, the input file to the /sbin/sysconfigdb command (specified using the -f option) would contain the following:

vm:
        vm-vpagemax = 128000

The default value for vm-vpagemax is 16384.
Use the following command to determine the value of vm-vpagemax on a running system:

% sysconfig -q vm vm-vpagemax
vm:
vm-vpagemax = 16384

Apparently V5.0 unix does not have this kernel variable (/sbin/sysconfig -q vm does not show it), and every page of the address space can have individual permissions regardless of how large it is, so it's not a problem on V5.0.

Marc

>
> Hi All,
>
> I am attempting to run a 6-nest MM5 simulation on dual processors on a DS20
> Dec Alpha. This setup runs successfully on a single processor. But on dual
> processors, I get a core file and the run bombs. The error in mm5.print.out
> is shown below. I have successfully run on dual processors for simpler MM5
> nest configurations.
>
> Wei Wang at NCAR suggested: "setenv MP_STACK_SIZE 32000000". I set the
> MP_STACK_SIZE as suggested (and tried much larger numbers as well), but
> these changes had no effect. The run still bombs.
>
> Has anyone seen this error before? Does anyone have any suggestions?
>
> Thank you.
>
>
> Michael A. Kelly, Ph.D.
> Atmospheric Scientist
> Litton-TASC
> 4801 Stonecroft Blvd.
> Chantilly VA 20151
> (703) 633-8300 x8695
>
>
>
> **************************************************************************
> > *******************************************************************
> > Here is the error message in mm5.print.out when we try to run on two
> > processors:
> >
> > %DECthreads bugcheck (version 3.15-397), terminating execution
> > %Reason: Failure initializing the manager thread tcb (12)
> > %Running on OSF1 v4.0 on AlphaServer DS20 500MHz, 2048MB,
> > 2 CPUs
> >
> >
> > The setup of the run is as follows:
> > LEVIDN = 0, 1, 1, 2, 2, 2
> > NUMNC= 1, 1, 1, 2, 2, 3
> > NESTIX= 80, 112, 55, 91, 91, 91
> > NESTJX=100, 109, 55, 91, 91, 91.
> >
>
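
[Worked check of the limit arithmetic in Marc's message, as a minimal shell sketch. It assumes a POSIX shell and that b(m) is a default 4-byte REAL array, which is consistent with the size output in his message; the variable names are illustrative and do not come from the thread.]

```shell
#!/bin/sh
# Values quoted in the thread: vm-vpagemax (in pages) and the 8K page size.
vpagemax=16384
page_bytes=8192

# Assumption: each element of b(m) occupies 4 bytes (default REAL).
elem_bytes=4

# Largest contiguous region that can carry per-page permissions.
limit=$((vpagemax * page_bytes))

# Compare the two array sizes from the test program against the limit.
for m in 33000000 34000000; do
    need=$((m * elem_bytes))
    if [ "$need" -gt "$limit" ]; then
        echo "m=$m needs $need bytes: exceeds limit of $limit bytes"
    else
        echo "m=$m needs $need bytes: fits within limit of $limit bytes"
    fi
done
```

The limit works out to 134217728 bytes (131072K), so the m = 33000000 case (132000000 bytes) fits and the m = 34000000 case (136000000 bytes) does not, which matches the behavior Marc reported. This counts only the common block; thread stacks and guard pages would add to the total.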