Hangwatch | James Tanner

Recovered from the older tannerjc.net wiki snapshot dated January 23, 2016.

Purpose

Hangwatch must be tuned before it can capture sysrq data at the right moment. Without proper tuning, it will trigger at the wrong times and provide no useful data.

http://people.redhat.com/astokes/hangwatch/

Background

Before tuning, it helps to understand what each tunable does.

From the hangwatch sysconfig file …

-s to set sysrq keys, default is:
     m - memory allocation
     t - dump thread state
     p - current cpu registers and flags

-t to set threshold based on load average
-i sets your interval (in minutes).

Sysrq

The -s flag is straightforward: it lists the keys that you want to issue to sysrq. The common (default) keys are mtp, which dump memory, thread and CPU information. If sysrq is unable to write to syslog, c can be useful: it triggers a panic, and the netdump/kdump facilities can then send the vmcore elsewhere.

A full listing of sysrq keys: http://en.wikipedia.org/wiki/Magic_SysRq_key
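
As an illustration of what issuing keys to sysrq means, here is a minimal Python sketch that writes each key to /proc/sysrq-trigger, the kernel's standard interface for this. It is not taken from hangwatch, which may do this differently; the function name trigger_sysrq is hypothetical. Running it against the real trigger file requires root and will really dump data (or panic the box if c is included), so the path is a parameter and can point at a scratch file for testing.

```python
def trigger_sysrq(keys, path="/proc/sysrq-trigger"):
    """Write each sysrq key to the trigger file, one key per write.

    `keys` is a string such as "mtp" or a list of single characters.
    """
    keys = list(keys)
    # Validate everything first so a bad key does not leave a partial dump.
    for key in keys:
        if len(key) != 1 or not key.isalnum():
            raise ValueError(f"not a sysrq key: {key!r}")
    for key in keys:
        # The kernel acts on each write, so the file is opened once per key.
        with open(path, "w") as trigger:
            trigger.write(key)
```

Calling trigger_sysrq("mtp") as root would request the same three dumps as the default -s setting.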

Threshold

The -t flag sets the threshold. The number you provide is compared against the first column of /proc/loadavg, which is the 1-minute load average.
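
The comparison can be sketched in a few lines of Python. This is only an illustration, not hangwatch's own code: the function names are hypothetical, and the use of ">=" is an assumption, since hangwatch may use a strict comparison.

```python
def first_loadavg_field(loadavg_text):
    """Return the first field of /proc/loadavg (the 1-minute load average)."""
    return float(loadavg_text.split()[0])


def over_threshold(loadavg_text, threshold):
    # Assumption: ">=" is used here; hangwatch's own comparison may be strict.
    return first_loadavg_field(loadavg_text) >= threshold


def check_system(threshold, path="/proc/loadavg"):
    """Read the live load average (Linux only) and compare it to the threshold."""
    with open(path) as loadavg:
        return over_threshold(loadavg.read(), threshold)


# Sample line in /proc/loadavg format: three load averages,
# running/total tasks, and the most recent PID.
sample = "0.92 0.61 0.53 1/155 4321"
print(over_threshold(sample, 0.9))  # True
print(over_threshold(sample, 1.0))  # False
```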

To work out what threshold to use, profile the system with sar data or have the customer watch /proc/loadavg. Here’s an example using sar data:

# sar -q -s 15:00:00 -e 19:00:00 -f sa21

XX:XX:XX PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
03:34:01 PM         1       152      0.01      0.48      0.51
03:36:01 PM         2       158      0.42      0.42      0.47
03:38:01 PM         1       155      0.92      0.61      0.53
03:40:01 PM         1       155      0.99      0.73      0.58
03:42:01 PM         1       155      0.99      0.82      0.63
03:44:01 PM         3       155      1.05      0.91      0.68
03:46:01 PM         2       155      1.04      0.95      0.72
03:48:01 PM         1       154      0.66      0.88      0.72
03:50:01 PM         0       156      0.09      0.58      0.63
03:52:01 PM         1       158      0.80      0.65      0.64

Use the ldavg-1 column to determine the peak load. In this case the peak is 1.05. I’ll set my threshold to 0.9 so that sysrq data is captured just before the peak is reached.
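
For longer sar files, picking out the peak by eye gets tedious. The following Python sketch finds the peak ldavg-1 value in the rows shown above and reports when a candidate threshold would first have been met. It assumes the 12-hour layout shown here (seven whitespace-separated fields per row); sar output in 24-hour format has one field fewer and would need a different column index. The function names are hypothetical.

```python
SAR_OUTPUT = """\
03:34:01 PM         1       152      0.01      0.48      0.51
03:36:01 PM         2       158      0.42      0.42      0.47
03:38:01 PM         1       155      0.92      0.61      0.53
03:40:01 PM         1       155      0.99      0.73      0.58
03:42:01 PM         1       155      0.99      0.82      0.63
03:44:01 PM         3       155      1.05      0.91      0.68
03:46:01 PM         2       155      1.04      0.95      0.72
03:48:01 PM         1       154      0.66      0.88      0.72
03:50:01 PM         0       156      0.09      0.58      0.63
03:52:01 PM         1       158      0.80      0.65      0.64
"""


def parse_sar_q(text):
    """Return a list of (timestamp, ldavg_1) pairs from `sar -q` data rows."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: time, AM/PM, runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15
        if len(fields) != 7:
            continue
        try:
            samples.append((f"{fields[0]} {fields[1]}", float(fields[4])))
        except ValueError:
            continue  # header row, not data
    return samples


def peak_sample(samples):
    """Return the sample with the highest 1-minute load average."""
    return max(samples, key=lambda sample: sample[1])


def first_at_or_above(samples, threshold):
    """Return the first sample whose load meets the threshold, or None."""
    return next((s for s in samples if s[1] >= threshold), None)


samples = parse_sar_q(SAR_OUTPUT)
print(peak_sample(samples))             # ('03:44:01 PM', 1.05)
print(first_at_or_above(samples, 0.9))  # ('03:38:01 PM', 0.92)
```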

Interval

The final flag is -i, for interval. The unit is minutes, so hangwatch polls /proc/loadavg every N minutes. Two things can go wrong when setting the interval: if it is too long, the system could be completely hung before the next check is reached; if it is too short, the sysrq dumps themselves can create more load at the wrong time. Sar data can be used to determine an appropriate interval. In the sar data from the Threshold example, the load increases rapidly and peaks within 8 minutes. If my interval were 10, the machine could theoretically hang completely before sysrq is triggered. A better interval would be around 4-6 minutes. If I were to choose too low an interval, I might create unnecessary additional load and put syslog into a D state.
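
To see how the interval interacts with the threshold, the polling can be replayed against the sar sample from the Threshold section. The Python sketch below is a rough simulation, not hangwatch's logic: it assumes the first poll happens at the chosen start minute, that a poll sees the most recent 2-minute sar reading, and that ">=" is the comparison. Function names are hypothetical.

```python
# (minutes since 03:34, ldavg-1) taken from the sar sample in the Threshold section.
SAMPLES = [(0, 0.01), (2, 0.42), (4, 0.92), (6, 0.99), (8, 0.99),
           (10, 1.05), (12, 1.04), (14, 0.66), (16, 0.09), (18, 0.80)]


def load_at(minute, samples):
    """Return the most recent recorded load at or before the given minute."""
    latest = samples[0][1]
    for sample_minute, load in samples:
        if sample_minute > minute:
            break
        latest = load
    return latest


def first_trigger(interval, threshold, samples, start=0):
    """Return the minute of the first poll that sees load >= threshold, or None."""
    last_minute = samples[-1][0]
    minute = start
    while minute <= last_minute:
        if load_at(minute, samples) >= threshold:
            return minute
        minute += interval
    return None


# The peak (1.05) is at minute 10.
print(first_trigger(4, 0.9, SAMPLES))            # 4
print(first_trigger(6, 0.9, SAMPLES))            # 6
print(first_trigger(10, 0.9, SAMPLES))           # 10
print(first_trigger(10, 0.9, SAMPLES, start=2))  # 12
```

With a 10-minute interval the first trigger lands on or after the peak, depending on when polling happened to start, while 4 and 6 minutes both catch the load on the way up.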

Caveats

Be aware that once sysrq is triggered, it adds a lot of load of its own. Example from 06:18 to 06:22 …

06:16:01 PM         2       154      0.91      0.76      0.43
06:18:01 PM         0       153      1.75      1.09      0.58
06:20:01 PM         0       153      2.58      1.56      0.81
06:22:01 PM         0       152      2.49      1.87      1.01
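
The size of that extra load can be read off by subtracting the last reading before the trigger from each later one. A small Python sketch, assuming the 06:16 row is the pre-trigger baseline and the same seven-field sar layout as above (function names are hypothetical):

```python
ROWS = """\
06:16:01 PM         2       154      0.91      0.76      0.43
06:18:01 PM         0       153      1.75      1.09      0.58
06:20:01 PM         0       153      2.58      1.56      0.81
06:22:01 PM         0       152      2.49      1.87      1.01
"""


def ldavg1_by_time(text):
    """Return (time, ldavg_1) pairs from seven-field `sar -q` rows."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 7:
            samples.append((fields[0], float(fields[4])))
    return samples


def increase_over_baseline(samples):
    """Return each later sample's ldavg-1 minus the first sample's value."""
    baseline = samples[0][1]  # assumed to be the last reading before sysrq fired
    return [(time, round(load - baseline, 2)) for time, load in samples[1:]]


for time, delta in increase_over_baseline(ldavg1_by_time(ROWS)):
    print(time, delta)
# 06:18:01 0.84
# 06:20:01 1.67
# 06:22:01 1.58
```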