Troubleshooting¶
When a simulation stops before completing, check the error log(s) for details. For WRF compiled with distributed memory (dmpar), rsl.out.* and rsl.error.* files are written, one of each for every processor used. Otherwise, redirect the model's output and error messages to a file when running it, using the >& syntax. For example:
./wrf.exe >& output.log
When multiple rsl* files exist, often the rsl.error.0000 file contains the error, but not always. If no error is listed at the end of rsl.error.0000, check whether any other rsl* files are larger in size, which indicates more information (possibly the error) is printed to that file. To see the file size, issue:
ls -ls rsl.*
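As a minimal sketch of the size check described above, the largest rsl.* file can be picked out automatically. The helper name largest_rsl is hypothetical and not part of WRF; it assumes the rsl.* files sit in the directory passed as its argument (the current directory by default).

```shell
# Print the path of the largest rsl.* file in a run directory.
# $1: run directory (defaults to the current directory)
largest_rsl() {
    dir="${1:-.}"
    # ls -S sorts by file size, largest first; keep only the first entry
    ls -S "$dir"/rsl.* 2>/dev/null | head -n 1
}
```

A possible use is `tail -n 20 "$(largest_rsl .)"`, which shows the last lines of the largest file, where an error message is most likely to appear.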
Model Stops - No Error or Segmentation Fault¶
A quick abort of the model likely indicates insufficient memory for the configuration or issues with the input data.
Inadequate Memory¶
Small Systems¶
e.g., desktop workstation or laptop
Prior to configuring and compiling, try one of the following to determine whether more memory and/or stack size can be utilized (the first form is for csh/tcsh, the second for bash/sh):
unlimit
**OR**
ulimit -s unlimited
Note
For OpenMP (smpar-compiled code), stack size should be set to a large value, but not unlimited, as this may crash the system.
HPC systems¶
Typically adding additional processors will resolve these issues. See Choosing an Appropriate Number of Processors, which provides a method to roughly estimate a reasonable number of processors, based on domain size.
Input Data Issue¶
To check whether the input data are causing the problem, use ncview or another netCDF file browser to check fields in the met_em* or wrfinput_d0* files. Look at all times, variables, and levels for any missing or unrealistic data.
Segmentation Fault¶
Segmentation faults can be difficult to track down because the error messages are not specific, so it may take several steps to isolate the cause.
A segmentation fault is often the result of using too many or too few processors, or a bad decomposition. See Choosing an Appropriate Number of Processors, which provides a method to roughly estimate a reasonable number of processors, based on domain size.
A lack of disk space can result in a segmentation fault. Check the available disk space, and whether that is sufficient for writing the files. For large and/or high-resolution domains, output files are much larger (sometimes a few GB).
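A rough way to check the available space is sketched below. The helper name free_kb is hypothetical; it relies only on the POSIX output format of df, and it is left to the user to compare the result against the expected size of the output files.

```shell
# Print the free space, in kilobytes, on the file system holding a directory.
# $1: directory to check (defaults to the current directory)
free_kb() {
    dir="${1:-.}"
    # df -P gives a stable POSIX layout; -k reports 1024-byte blocks.
    # Line 2, column 4 is the "Available" field.
    df -Pk "$dir" | awk 'NR==2 {print $4}'
}
```

For a human-readable view, `df -h .` in the run directory gives the same information.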
A segmentation fault can be caused by CFL errors, which occur when the model becomes numerically unstable because the time_step used to advance the model is too large for a stable solution. The most common causes are complex terrain, model layers that are too thin, or a large domain whose corners have a large map-scale factor (it should be ~1.0), which makes the equivalent earth distance much smaller than the model grid size. To check for this error, issue the command:
grep cfl rsl.error*
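When many processors are used, it can help to see which rsl.error.* files mention cfl and how often. The following is a minimal sketch under that assumption; cfl_counts is a hypothetical helper name, not a WRF utility.

```shell
# List each rsl.error.* file that contains "cfl", with the number of matching lines.
# $1: run directory (defaults to the current directory)
cfl_counts() {
    dir="${1:-.}"
    # grep -c prints "file:count"; /dev/null is added so file names are
    # printed even when only one rsl.error.* file exists.
    # The second grep drops files with a count of zero.
    grep -c cfl /dev/null "$dir"/rsl.error.* 2>/dev/null | grep -v ':0$'
}
```

No output means no file contained the string.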
If any CFL messages are found, use one (or a combination) of the following steps to attempt to resolve the issue:
Reduce the time_step. The standard time_step recommendation, in seconds, is \(\leq\) 6xDX with DX in km (e.g., for dx=30000, time_step should be \(\leq\) 180), but when CFL errors occur, it may need to be reduced to 4xDX or 3xDX to attempt to get past the instability.
If CFL errors occur along boundary zones, try adding smooth_cg_topo=.true. to the namelist.input &domains record prior to running real.exe. This smoothes the coarse grid’s outer rows/columns to match the low resolution topography included with the driving data.
If CFL errors occur near complex terrain, try adding epssm=0.2 (up to 0.9) to the namelist.input &dynamics record to slightly forward the centering of the vertical pressure gradient (or sound waves) in an effort to damp 3D divergence.
Set w_damping=1 in the namelist.input &dynamics record.
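The time_step rule in the first step above can be worked out with a short calculation. This is only a sketch of the arithmetic; suggest_time_step is a hypothetical helper name, and it assumes dx is given in meters, as in namelist.input.

```shell
# Print an upper bound for time_step (seconds) from the grid spacing.
# $1: dx in meters
# $2: multiplier applied to dx in km (6 by default; try 4 or 3 after CFL errors)
suggest_time_step() {
    dx_m="$1"
    factor="${2:-6}"
    # Integer arithmetic: factor x (dx converted to km)
    echo $(( $factor * $dx_m / 1000 ))
}

# suggest_time_step 30000      prints 180
# suggest_time_step 30000 4    prints 120
```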
Complex Topography at High Model Resolutions¶
High model resolutions (dx/dy \(\leq\) ~3000) may cause issues due to one of the following:
Relatively steep terrain
Un-representative data, due to its origin from a coarser external source
Imbalances at the initial time
Add the namelist parameter epssm to the namelist.input &dynamics record, setting it to a value from 0.1 to 0.9, to attempt to overcome this issue.
Debugging¶
If the model stops and none of the above suggestions are helpful, it may be necessary to add debugging statements to the code to determine the issue. Following are two debugging options:
For a small domain capable of running on a single processor, the “GNU” debugger can be used by issuing the following prior to recompiling WRF:
./clean -a
./configure -D (choose a serial compilation)
After recompiling, run the model with the following command:
gdb ./wrf.exe
When prompted, enter:
run
The model should stop on the line causing the error. Typing
list
will provide additional information. Type quit when done.
For larger domains, and/or to turn on bounds checking, tracebacks, etc., issue the following commands prior to recompiling WRF:
./clean -a
./configure -D
After recompiling, run the model. When it fails, check the error logs (e.g., rsl.error.0000 or a user-initiated error output log), which should print the line of code that caused the model to fail.
Note
It is NOT recommended to set debug_level in namelist.input. This option is removed from default namelists because it rarely provides useful information and adds numerous prints to log files, making them difficult to read, and occasionally causing model failures due to their large size.
Namelist Issues¶
“ERRORS while reading namelists…”¶
This error indicates that errors or typos exist in namelist.input. In the error log, the lines just above the error message should indicate where in the namelist the issue resides. Check and modify the line(s) mentioned. When using nested domains, this error is commonly caused by setting a value for each domain for a parameter that requires only a single entry. For example, run_days requires only a single value, so the following would result in this error:
run_days = 2, 2, 2
Fix this by removing the values for columns 2+ (i.e., set run_days=2), saving namelist.input, and running again. If unsure, always start with a default namelist template, or find the namelist parameter in the WRF/Registry/* files to determine how many entries it requires for a nested simulation. If settings for each domain are required, max_dom will be listed in the parameter’s line in the registry file.
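The registry lookup described above can be sketched as a small search. The helper name per_domain_param is hypothetical, and the sketch assumes the parameter's definition and the string max_dom appear on the same registry line, as the text describes; a match in an unrelated line or comment would give a false positive, so the matching lines are worth reading by eye.

```shell
# Succeed (exit status 0) if a namelist parameter appears on a registry line
# that also mentions max_dom, i.e., it likely takes one value per domain.
# $1: parameter name, e.g. run_days
# $2: registry directory (defaults to ./Registry)
per_domain_param() {
    param="$1"
    registry_dir="${2:-Registry}"
    # -h omits file names; -w matches the parameter as a whole word
    grep -h -w "$param" "$registry_dir"/* 2>/dev/null | grep -q max_dom
}
```

For example, from the top-level WRF directory, `per_domain_param run_days && echo "per-domain" || echo "single value"` reports which form the parameter is likely to need.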
“SIZE MISMATCH”¶
If this error occurs, there should also be information like the following included in the error log.
input_wrf.F:SIZE MISMATCH:namelist e_we = 70
input_wrf.F:SIZE MISMATCH:input file WEST-EAST_GRID_DIMENSION = 74
The above message indicates a discrepancy in information between the input file and the namelist. The input file has a west-east grid dimension of 74 grid spaces, while the namelist’s west-east dimension (e_we) is set to 70. The namelist should be set according to the input files. In this example, setting e_we=74 corrects the issue.
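With many log files, the two mismatched values can be pulled out in one step. The following is a minimal sketch that assumes the messages have the form shown above; show_size_mismatch is a hypothetical helper name.

```shell
# Print the detail part of every SIZE MISMATCH line in the given log files.
# $@: one or more log files, e.g. rsl.error.*
show_size_mismatch() {
    # -h omits file names; sed strips everything up to and including the tag
    grep -h 'SIZE MISMATCH' "$@" | sed 's/.*SIZE MISMATCH: *//'
}
```

Running `show_size_mismatch rsl.error.*` would then list the namelist value and the input file value next to each other for comparison.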
Best Practices¶
The following resources include recommendations for setting up a model domain, and how to use runtime options to help avoid errors and improve results.
Namelist.wps: Best Practices : Defines common namelist.wps parameters and includes best practice guidance for setting up reasonable domains
Namelist.input: Best Practices : Defines common namelist.input parameters for running real.exe and wrf.exe, and includes best practice recommendations
Best Practice Presentation : Lecture from biannual WRF tutorials
Best Practice Presentations : Best Practice presentations delivered during previous WRF Workshops
Frequently Asked Questions (FAQ)¶
To see a full library of frequently asked questions, see the FAQ section of the WRF & MPAS-A Users’ Forum. It may also be beneficial to use the “search” utility on the forum to see other inquiries and responses related to run-time issues.