.. role:: underline
    :class: underline

===============
Troubleshooting
===============

|

When a simulation stops before completing, check the error log(s) for details. For WRF compiled with distributed memory (dmpar), *rsl.out.\** and *rsl.error.\** files are written, one of each per processor used. Otherwise, redirect the model's standard output and error to a file when running it, using the ``>&`` syntax. For example:

.. code-block::

        ./wrf.exe >& output.log

|

|

When multiple *rsl\** files exist, often the *rsl.error.0000* file contains the error, but not always. If no error is listed at the end of *rsl.error.0000*, check whether any other *rsl\** files are larger in size, which indicates more information (possibly the error) is printed to that file. To see the file size, issue:

.. code-block::

        ls -ls rsl.*
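
The size comparison described above can also be scripted. The following is a minimal sketch, not part of WRF: the function name ``rsl_by_size`` is hypothetical, and it assumes the directory passed in (default: the current directory) is the WRF run directory.

```shell
# List rsl.* files in a directory, largest first, as "<bytes> <name>".
# Assumption: the directory argument (default ".") is the WRF run directory.
rsl_by_size() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/rsl.*; do
        [ -f "$f" ] || continue          # skip when the glob matches nothing
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "${f##*/}"
    done | sort -k1,1nr -k2,2
}

# Example: show the five largest logs, then the end of the largest one.
# rsl_by_size . | head -n 5
# tail -n 20 "$(rsl_by_size . | head -n 1 | cut -d' ' -f2)"
```

A larger file is only a hint; the end of that file still needs to be read to confirm it contains the error.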

|

|

|

|

|

Model Stops - No Error or Segmentation Fault
============================================

A quick abort of the model likely indicates insufficient memory for the configuration or issues with the input data. 

|

|

Inadequate Memory
-----------------

|

Small Systems
+++++++++++++
*e.g., desktop workstation or laptop*

Prior to configuring and compiling, try one of the following commands to determine whether more memory and/or stack size can be utilized (the first is for csh/tcsh, the second for sh/bash):

.. code-block::

        unlimit

            **OR**

        ulimit -s unlimited

|

|

.. note::
   For OpenMP (smpar-compiled code), set the stack size to a large value, but not to unlimited, because an unlimited stack may crash the system.
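
Before changing any limit, it can help to see what the shell currently allows. The sketch below is illustrative only: ``show_stack_limits`` is a hypothetical helper name, and it assumes an sh/bash shell, where the ``ulimit`` builtin reports the stack size in kilobytes or as ``unlimited``.

```shell
# Print the current soft and hard stack limits (in KB, or "unlimited").
# Assumption: sh/bash, where "ulimit -S -s" is the soft limit and
# "ulimit -H -s" is the hard limit that the soft limit cannot exceed.
show_stack_limits() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -s)" "$(ulimit -H -s)"
}

# Example:
# show_stack_limits
```

If the hard limit is lower than the value needed, raising the soft limit will fail, and the limit must be changed by a system administrator.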

|

|

|

HPC systems
+++++++++++

Adding processors typically resolves these issues. See `Choosing an Appropriate Number of Processors <https://forum.mmm.ucar.edu/threads/how-many-processors-should-i-use-to-run-wrf.5082/>`_, which provides a method to roughly estimate a reasonable number of processors, based on domain size.
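
As a rough illustration of that kind of estimate, the sketch below computes a lower and upper processor count from a domain's grid dimensions. The specific rule used here (at least ``(e_we/100)*(e_sn/100)`` and at most ``(e_we/25)*(e_sn/25)`` processors) is an assumption that should be confirmed against the linked forum post, and ``estimate_procs`` is a hypothetical helper name.

```shell
# Rough processor-count bounds for one domain from its grid dimensions.
# Assumption: rule of thumb of at least (e_we/100)*(e_sn/100) and at most
# (e_we/25)*(e_sn/25) processors; confirm against the linked forum post.
estimate_procs() {
    local e_we="$1" e_sn="$2"
    local min=$(( (e_we / 100) * (e_sn / 100) ))
    local max=$(( (e_we / 25) * (e_sn / 25) ))
    [ "$min" -lt 1 ] && min=1            # always allow a single processor
    [ "$max" -lt 1 ] && max=1
    printf '%s %s\n' "$min" "$max"       # prints "<min> <max>"
}

# Example for a 425 x 300 domain:
# estimate_procs 425 300
```

The sketch handles a single domain; for nested runs, the linked post should be consulted for which domain to base each bound on.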

|

|

|

|

Input Data Issue
----------------

To check whether the input data are causing the problem, use *ncview* or another netCDF file browser to check fields in the *met_em*\* or *wrfinput_d0*\* files. Look at all times, variables, and levels for any missing or unrealistic data.

|

|

|

|

Segmentation Fault
------------------

Segmentation faults can be difficult to diagnose because the error messages are not specific, so it may take several steps to find the cause.

*   A segmentation fault is often the result of using too many or too few processors, or a bad decomposition. See `Choosing an Appropriate Number of Processors <https://forum.mmm.ucar.edu/threads/how-many-processors-should-i-use-to-run-wrf.5082/>`_, which provides a method to roughly estimate a reasonable number of processors, based on domain size.

    |

*   A lack of disk space can result in a segmentation fault. Check the available disk space, and whether that is sufficient for writing the files. For large and/or high-resolution domains, output files are much larger (sometimes a few GB).

    |

*   A segmentation fault can be caused by **CFL errors**, which occur when the model becomes numerically unstable because the *time_step* used to advance the model is too large for a stable solution. The most common causes are complex terrain, model layers that are too thin, or a large domain whose corners have a large map-scale factor (it should be *~1.0*), which makes the equivalent earth distance much smaller than the model grid size. To check for this error, issue the command:

    .. code-block::

            grep cfl rsl.error*

    |

    If CFL notifications print to the screen, use one (or a combination) of the following steps to attempt to resolve the issue:

    #.   Reduce the *time_step*. The standard *time_step* recommendation is :math:`\leq` *6xDX*, with *DX* in km (e.g., for *dx=30000*, *time_step* should be :math:`\leq` 180), but when CFL errors occur, it may need to be reduced to *4xDX* or *3xDX* to attempt to get past the instability.
    #.   If CFL errors occur along boundary zones, try adding *smooth_cg_topo=.true.* to the *namelist.input* *&domains* record prior to running *real.exe*. This smooths the coarse grid's outer rows/columns to match the low-resolution topography included with the driving data.
    #.   If CFL errors occur near complex terrain, try adding *epssm=0.2* (up to *0.9*) to the *namelist.input* *&dynamics* record to slightly forward the centering of the vertical pressure gradient (or sound waves) in an effort to damp 3D divergence. 
    #.   Set *w_damping=1* in the *namelist.input* *&dynamics* record.
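
The *time_step* arithmetic in the first step can be sketched as a small shell function. This is an illustration only: ``max_time_step`` is a hypothetical helper name, and it assumes the multiplier applies to the grid spacing in km, as implied by the example above (*dx=30000* gives 180).

```shell
# Upper bound on time_step (seconds) for a grid spacing dx given in meters.
# factor is the multiplier on dx in km: 6 is the standard value from the text,
# 4 or 3 are the reduced values to try when CFL errors occur.
max_time_step() {
    local dx_m="$1" factor="${2:-6}"
    printf '%s\n' $(( factor * dx_m / 1000 ))   # integer division rounds down
}

# Examples for dx=30000:
# max_time_step 30000      # standard 6xDX bound
# max_time_step 30000 4    # reduced 4xDX bound
# max_time_step 30000 3    # reduced 3xDX bound
```

For *dx=30000* these give 180, 120, and 90 seconds, respectively.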

|

|

|

|

Complex Topography at High Model Resolutions
--------------------------------------------

High model resolutions (*dx/dy* :math:`\leq` *~3000*) may cause issues due to one of the following:

*   Relatively steep terrain
*   Un-representative data, due to its origin from a coarser external source
*   Imbalances at the initial time

|

To attempt to overcome this issue, add the namelist parameter *epssm* to the *namelist.input* *&dynamics* record, setting it to a value between *0.1* and *0.9*.

|

|

|

|

Debugging
---------

If the model stops and none of the above suggestions are helpful, it may be necessary to use debugging tools to determine the issue. The following are two debugging options:

#.   For a small domain capable of running on a single processor, the GNU debugger (*gdb*) can be used by issuing the following prior to recompiling WRF:

     |

     .. code-block::

             ./clean -a
             ./configure -D       # choose a serial compilation

     |

     |

     After recompiling, run the model with the following command:

     .. code-block::

             gdb ./wrf.exe

     |

     When prompted, enter: ``run``

     |

     The model should stop on the line causing the error. Typing ``list`` will provide additional information. Type ``quit`` when done.

     |

     |

     |

#.   For larger domains, and/or to turn on bounds checking, tracebacks, etc., issue the following commands prior to recompiling WRF:

     .. code-block::

             ./clean -a
             ./configure -D

     |

     After recompiling, run the model. When it fails, check the error logs (e.g., *rsl.error.0000* or a user-initiated error output log), which should print the line of code that caused the model to fail.

|

|

.. note::
   It is NOT recommended to set *debug_level* in *namelist.input*. This option has been removed from the default namelists because it rarely provides useful information and adds numerous prints to the log files, making them difficult to read and occasionally causing model failures due to their large size.

|

|

|

|

|

Namelist Issues
===============

|

"ERRORS while reading namelists..."
-----------------------------------

This error indicates that errors or typos exist in *namelist.input*. In the error log, the lines just above the error message should indicate where in the namelist the issue resides. Check and modify the line(s) mentioned. When using nested domains, this error is commonly due to setting a value for each domain for a parameter that requires only a single entry. For example, *run_days* requires only a single value, so the following would result in this error:

.. code-block::
        
        run_days = 2, 2, 2

|

Fix this by removing the values for columns *2+* (i.e., set *run_days=2*), saving *namelist.input*, and running again. If unsure, always start with a default namelist template, or find the namelist parameter in the *WRF/Registry/\** files to determine how many entries it requires for a nested simulation. If settings for each domain are required, *max_dom* will be listed in the parameter's line in the registry file.
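
The registry check described above can be sketched with ``grep``. This is a hedged illustration: ``entries_for`` is a hypothetical helper name, and it assumes that namelist parameters are declared in the registry files on lines beginning with ``rconfig``, followed by a type and the parameter name.

```shell
# Report whether a namelist parameter takes one value per domain.
# Prints "per-domain" if its rconfig line in the registry mentions max_dom,
# "single" if it does not, and "not-found" if no rconfig line matches.
# Assumption: registry files live under <wrf_dir>/Registry and declare
# namelist parameters on lines of the form "rconfig <type> <name> ...".
entries_for() {
    local param="$1" wrf_dir="${2:-.}"
    local line
    line="$(grep -h -E "^rconfig[[:space:]]+[a-z]+[[:space:]]+${param}[[:space:]]" \
            "$wrf_dir"/Registry/* 2>/dev/null | head -n 1)"
    if [ -z "$line" ]; then
        echo "not-found"
    elif printf '%s\n' "$line" | grep -q 'max_dom'; then
        echo "per-domain"
    else
        echo "single"
    fi
}

# Example, run from the top-level WRF directory:
# entries_for run_days .
# entries_for e_we .
```

If the function prints ``not-found``, search the registry files manually, since the declaration may be formatted differently than assumed here.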

|

|

|

|

"SIZE MISMATCH"
---------------

If this error occurs, the error log should also include information like the following:

.. code-block::

        input_wrf.F:SIZE MISMATCH:namelist e_we = 70
        input_wrf.F:SIZE MISMATCH:input file WEST-EAST_GRID_DIMENSION = 74

|

The above message indicates a discrepancy in information between the input file and the namelist. The input file has a west-east grid dimension of *74* grid spaces, while the namelist's west-east dimension (*e_we*) is set to *70*. The namelist should be set according to the input files. In this example, setting *e_we=74* corrects the issue.
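
When the log is long, the two values can be pulled out automatically. The sketch below is illustrative only: ``size_mismatch_fix`` is a hypothetical helper name, and it assumes the log lines follow exactly the format shown in the example above. It reports only the first mismatch found.

```shell
# Extract the values from "SIZE MISMATCH" log lines and print the fix.
# Input: a log file containing lines in the format shown above.
# Returns non-zero if no mismatch lines are found.
size_mismatch_fix() {
    local log="$1"
    local name nml_val file_val
    # Parameter name and value from "...SIZE MISMATCH:namelist e_we = 70"
    name="$(sed -n 's/.*SIZE MISMATCH:namelist[[:space:]]*\([A-Za-z_0-9]*\)[[:space:]]*=.*/\1/p' "$log" | head -n 1)"
    nml_val="$(sed -n 's/.*SIZE MISMATCH:namelist.*=[[:space:]]*\([0-9]*\).*/\1/p' "$log" | head -n 1)"
    # Value from "...SIZE MISMATCH:input file WEST-EAST_GRID_DIMENSION = 74"
    file_val="$(sed -n 's/.*SIZE MISMATCH:input file.*=[[:space:]]*\([0-9]*\).*/\1/p' "$log" | head -n 1)"
    [ -n "$name" ] && [ -n "$file_val" ] || return 1
    printf 'set %s=%s (namelist has %s)\n' "$name" "$file_val" "$nml_val"
}

# Example:
# size_mismatch_fix rsl.error.0000
```

For the example messages above, the function prints ``set e_we=74 (namelist has 70)``.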

|

|

|

|

|

Best Practices
==============

The following resources include recommendations for setting up a model domain, and how to use runtime options to help avoid errors and improve results.

*   `Namelist.wps: Best Practices <https://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wps.html>`_ : Defines common *namelist.wps* parameters and includes best practice guidance for setting up reasonable domains
*   `Namelist.input: Best Practices <https://www2.mmm.ucar.edu/wrf/users/namelist_best_prac_wrf.html>`_ : Defines common *namelist.input* parameters for running real.exe and wrf.exe, and includes best practice recommendations
*   `Best Practice Presentation <https://www2.mmm.ucar.edu/wrf/users/tutorial/presentation_pdfs/202101/chen_better_performance.pdf>`_ : Lecture from biannual WRF tutorials
*   `Best Practice Presentations <https://www2.mmm.ucar.edu/wrf/users/supports/best_practices_lectures_workshop.html>`_ : Best Practice presentations delivered during previous WRF Workshops

|

|

|

|

|

Frequently Asked Questions (FAQ)
================================

For a full library of frequently asked questions, see the `FAQ section <https://forum.mmm.ucar.edu/forums/frequently-asked-questions.115/>`_ of the `WRF & MPAS-A Users' Forum <https://forum.mmm.ucar.edu/>`_. It may also be beneficial to use the "search" utility on the forum to see other inquiries and responses related to run-time issues.

|

|

|

|

|