New Specification Method for Thresholds
Ton Voon, March 17, 2008
Overview
The method for defining thresholds via the command line is inconsistent and difficult to interpret. This proposal suggests a different way of specifying thresholds, which will also change the metrics of performance data returned.
Problem
The current method of specifying thresholds is confusing when different checks are required. For instance, in check_http, to check page size and time, you can specify -w {warn time}, -c {crit time}, -m {minpagesize}[:maxpagesize], -M {maxage of document}.
The ways of defining the range are also inconsistent: some options alert above the value (time, maxage), while others alert below the value (pagesize). This is inconsistent within the same plugin!
So, to check that a web page is returned within 5 seconds, the minimum page size is 10K and the maximum age is 1 day, you would invoke:
check_http -H $HOSTADDRESS$ -c 5 -m 10000 -M 1d
Furthermore, the current specification for ranges in the developer guidelines fails the “obviousness” test: a range of 3:5 will alert if the value is outside that range, rather than inside as you would expect.
Also, the performance data returned by check_http is always time and size. Perhaps you want only time, or you want age as well.
Proposal
Thresholds
This document proposes that threshold arguments are specified like:
--threshold={threshold definition}
--th={threshold definition}
The threshold definition is a subgetopt format of the form:
metric={metric},ok={range},warn={range},crit={range},unit={unit},prefix={SI prefix}
Where:
- ok, warn, crit are called “levels”
- any of ok, warn, crit, unit or prefix are optional
- if ok, warning and critical are not specified, then no alert is raised, but the performance data will be returned
- the unit can be specified with plugins that do not know about the type of value returned (SNMP, Windows performance counters, etc.)
- the prefix is used to multiply the input range and possibly for display data. The prefixes allowed are defined by NIST:
http://physics.nist.gov/cuu/Units/prefixes.html
http://physics.nist.gov/cuu/Units/binary.html
- ok, warning or critical can be repeated to define an additional range. This allows non-continuous ranges to be defined
- warning can be abbreviated to warn or w
- critical can be abbreviated to crit or c
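To illustrate the definition format, here is a minimal Python sketch of how one threshold definition could be split into its parts. The function name parse_threshold, the dictionary layout and the error handling are assumptions made for illustration only; the proposal does not define them. Treating metric as mandatory is inferred from the fact that only the other keys are listed as optional.

```python
# Hypothetical sketch: names and error handling are illustrative assumptions,
# not part of the proposal.
LEVEL_ALIASES = {
    "ok": "ok",
    "warning": "warning", "warn": "warning", "w": "warning",
    "critical": "critical", "crit": "critical", "c": "critical",
}


def parse_threshold(definition):
    """Split 'metric=...,ok=...,warn=...' into a dictionary.

    Levels may be repeated, so each level maps to a list of raw range strings.
    """
    result = {"metric": None, "unit": None, "prefix": None,
              "ok": [], "warning": [], "critical": []}
    for item in definition.split(","):
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not value:
            raise ValueError(f"expected key=value, got {item!r}")
        if key in LEVEL_ALIASES:
            # Normalise w/warn/warning and c/crit/critical to one name.
            result[LEVEL_ALIASES[key]].append(value)
        elif key in ("metric", "unit", "prefix"):
            result[key] = value
        else:
            raise ValueError(f"unknown key {key!r}")
    if result["metric"] is None:
        # Assumption: metric is required because it is not listed as optional.
        raise ValueError("metric is required")
    return result
```

The range strings are kept as raw text here; parsing them is a separate step.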
Simple Range
The range values have two specifications: simple and complex. Simple ranges are of the format:
start..end
Where:
- start and end must be defined
- start and end match the regular expression /^[+-]?[0-9]+\.?[0-9]*$|^inf$/ (ie, a numeric or “inf”)
- start ≤ end
- if start = inf, this is negative infinity. This can also be written as -inf
- if end = inf, this is positive infinity
- endpoints are inclusive of the range
- alert is raised if value is inside start and end range
(Note: this may be extended in future for adding multiple ranges using a separator - I think this is catered for by repeating ok=,warn=,crit=.)
This simple range does not require quoting at the shell.
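The simple range rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the planned C routine: the function names are hypothetical, and "-inf" is accepted as a special case for the start because the text allows that spelling even though the quoted regular expression does not match it.

```python
import math
import re

# Regular expression quoted in the text (a numeric or "inf").
NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$|^inf$")


def _endpoint(token, is_start):
    # Assumption: "-inf" is accepted only as a start value.
    if is_start and token == "-inf":
        return -math.inf
    if not NUMBER.match(token):
        raise ValueError(f"bad endpoint {token!r}")
    if token == "inf":
        # "inf" means negative infinity at the start, positive at the end.
        return -math.inf if is_start else math.inf
    return float(token)


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) pair of floats."""
    start_s, sep, end_s = text.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = _endpoint(start_s, True)
    end = _endpoint(end_s, False)
    if start > end:
        raise ValueError("start must be <= end")
    return start, end


def in_simple_range(value, text):
    """True if value lies inside the range; endpoints are inclusive."""
    start, end = parse_simple_range(text)
    return start <= value <= end
```

For example, in_simple_range(5, "0..5") is true because the endpoints are inclusive, while in_simple_range(5.01, "0..5") is false.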
Complex Range
Complex ranges are defined as:
[^](start..end)
or
[^]start..end
Where:
- start and end must be defined
- start and end match the regular expression /^[+-]?[0-9]+\.?[0-9]*$|^inf$/ (ie, a numeric or “inf”)
- start ≤ end
- if start = inf, this is negative infinity. This can also be written as -inf
- if end = inf, this is positive infinity
- endpoints are excluded from the range if () are used, otherwise endpoints are included in the range
- alert is raised if value is within start and end range, unless ^ is used, in which case alert is raised if outside the range
Note that due to shell characters, quoting may be required.
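As a sketch of how the complex form could be handled, the Python below parses the optional ^ and the optional parentheses, then decides whether a value raises an alert. The function names and the returned tuple are assumptions for illustration; as in the text, "-inf" is accepted as an alternative spelling for the start only.

```python
import math
import re

NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def _endpoint(token, infinity):
    # "inf" takes the sign of its position; "-inf" is accepted at the start only.
    if token == "inf" or (token == "-inf" and infinity < 0):
        return infinity
    if not NUMBER.match(token):
        raise ValueError(f"bad endpoint {token!r}")
    return float(token)


def parse_complex_range(text):
    """Parse '[^](start..end)' or '[^]start..end'.

    Returns (start, end, exclusive, inverted).
    """
    inverted = text.startswith("^")
    body = text[1:] if inverted else text
    # Parentheses mean the endpoints are excluded from the range.
    exclusive = body.startswith("(") and body.endswith(")")
    if exclusive:
        body = body[1:-1]
    start_s, sep, end_s = body.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = _endpoint(start_s, -math.inf)
    end = _endpoint(end_s, math.inf)
    if start > end:
        raise ValueError("start must be <= end")
    return start, end, exclusive, inverted


def raises_alert(value, text):
    """True if value triggers the range: inside it, or outside it when ^ is used."""
    start, end, exclusive, inverted = parse_complex_range(text)
    if exclusive:
        inside = start < value < end
    else:
        inside = start <= value <= end
    return inside != inverted
```

With this sketch, raises_alert(3, "3..5") is true but raises_alert(3, "(3..5)") is false, and raises_alert(6, "^3..5") is true. On the command line, a range such as '^(3..5)' would need quoting because of the parentheses.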
Rules for Determining State
Given a numeric value, the state of the threshold is calculated from the following ordered rules:
- If no levels are specified, return OK
- If an ok level is specified and value is within range, return OK
- If a critical level is specified and value is within range, return CRITICAL
- If a warning level is specified and value is within range, return WARNING
- If an ok level is specified, return CRITICAL
- Otherwise return OK
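The ordered rules translate almost line for line into code. The sketch below is one possible reading in Python; the function name get_state and the representation of each level as a list of inclusive (start, end) pairs are assumptions made to keep the example self-contained. The numeric values are the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def in_any(value, ranges):
    """True if value falls in any of the inclusive (start, end) pairs."""
    return any(start <= value <= end for start, end in ranges)


def get_state(value, ok=(), warning=(), critical=()):
    """Apply the ordered rules; each level is a list of (start, end) pairs."""
    if not ok and not warning and not critical:
        return OK                      # rule 1: no levels specified
    if ok and in_any(value, ok):
        return OK                      # rule 2: inside an ok range
    if critical and in_any(value, critical):
        return CRITICAL                # rule 3: inside a critical range
    if warning and in_any(value, warning):
        return WARNING                 # rule 4: inside a warning range
    if ok:
        return CRITICAL                # rule 5: ok given but not matched
    return OK                          # rule 6: nothing matched
```

Because the ok rule is checked first, a value on a shared boundary (for example 1.0 with ok=0..1.0 and warn=1.0..1.5) is reported as OK. When an ok level is given and nothing matches, the result is CRITICAL, which is why a threshold such as ok=1..1 needs no explicit crit range.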
Looking Back …
So the check_http example becomes:
check_http -H $HOSTADDRESS$ \
--th metric=time,ok=0..5 \
--th metric=size,ok=10..inf,prefix=Ki \
--th metric=age,ok=0..1,unit=d
I believe this is more readable (I’m interested in the time, the size and the age) and more consistent (I’m alerting above 5, less than 10 and above 1, respectively).
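The size threshold in this example relies on prefix=Ki to multiply the range, so ok=10..inf is read as 10240 bytes and above. A minimal Python sketch of that scaling follows; the function name scale_range is hypothetical and the table is only a subset of the NIST decimal and binary prefixes.

```python
# Subset of the NIST prefixes, for illustration only.
PREFIX_MULTIPLIERS = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,      # decimal (SI)
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,   # binary
}


def scale_range(start, end, prefix=None):
    """Multiply both endpoints by the prefix; infinities stay infinite."""
    if prefix is None:
        return start, end
    try:
        multiplier = PREFIX_MULTIPLIERS[prefix]
    except KeyError:
        raise ValueError(f"unknown prefix {prefix!r}") from None
    return start * multiplier, end * multiplier
```

For instance, scale_range(10, float("inf"), "Ki") gives a range starting at 10240 with an unbounded upper end.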
In addition, performance data will only be output if the metric has been specified. So time performance data is only shown if --th metric=time has been specified on the command line. Both the warning range and the critical range can be left unspecified - this effectively means “I am not going to alert on this value, but I’d like to be informed about it in the performance data”.
Because the specification for a range has changed, the warning and critical parts of the performance data can no longer be guaranteed. There is an additional piece of work required to fix a new format for performance data. However, the basic label=value[uom] will still be valid.
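As a small illustration of the two points above (output only for requested metrics, in the basic label=value[uom] form), here is a hedged Python sketch. The function name, the input layout and the example values are assumptions; the final performance data format is explicitly left open by this proposal.

```python
def format_perfdata(measurements, requested):
    """Return 'label=value[uom]' items for the requested metrics only.

    measurements maps a metric name to a (value, uom) pair; requested is the
    list of metrics named via --th on the command line.
    """
    parts = []
    for metric in requested:
        value, uom = measurements[metric]
        parts.append(f"{metric}={value}{uom}")
    # Nagios performance data items are separated by spaces.
    return " ".join(parts)
```

Given measurements for time, size and age, requesting only time yields a single item such as time=0.42s, and the other metrics are not reported.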
Examples
Other examples.
To check httpd processes are OK if the virtual size is under 8096 bytes. Warn until they reach 16182, but bigger than that is CRITICAL.
# old
check_procs -w 8096 -c 16182 -C httpd --metric VSZ
# new
check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182
There should always be one and only one ‘tnslsnr’ process. Otherwise CRITICAL.
# old
check_procs -w 1:1 -c 1:1 -C tnslsnr
# new
check_procs -C tnslsnr --th metric=count,ok=1..1
Load averages (1,5,15 minute) should be within reasonable ranges.
# old
check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0
# new
check_load --th metric=1min,ok=0..1.0,warn=1.0..1.5 \
--th metric=5min,ok=0..0.8,warn=0.8..1.3 \
--th metric=15min,ok=0..0.7,warn=0.7..1.0
Plan
I personally plan on updating check_procs.
The basic syntax is:
check_procs [filter options] [threshold options]
Where filter options are the current -u {username}, -C {command}, etc. This reduces the set of processes that are to be calculated.
The new threshold metrics will be:
- number - alert on number of matching processes. Performance data returns number of processes
- rss-threshold - alert on rss size if any matching process is in range. Perf data returns average rss
- rss-max - Same as --rss, but perf data returns max rss
- rss-sum - alert on the total rss of all matching processes. Perf data returns rss_sum
- vsz-threshold - alert on vsz size if any matching process is in range. Perf data returns average vsz
- vsz-max - Same as --vsz, but perf data returns max vsz
- vsz-sum - alert on the total vsz of all matching processes. Perf data returns vsz_sum
- cpu-threshold - alert on cpu % of all matching processes. Perf data returns average cpu
- cpu-max - Same as --cpu, but perf data returns max cpu
- cpu-sum - alert on total cpu. Perf data returns cpu_sum
There will be C library routines for parsing the threshold values.
There will be C library routines for the collection and output of the performance data.
Terminology
- metric - Something that a check is going to be measured against. For example, for disk checks, it could be used or free or inodes_free; for HTTP checks, it could be time taken or size; for process checks, it could be cpu or number of processes or vsz
- range - This defines a continuous range of values when an alert would be raised
- level - This is an alert level within Nagios - OK, WARNING or CRITICAL
- threshold - This consists of a level with a range
Limitations
This assumes that you are always comparing numbers as the metric values.
There may be some limitations in the precision of values. All internal logic should use double precision.
If there are multiple metrics, the alert will be on an OR basis, that is, any single metric which passes its threshold will cause the plugin to return a failed state.
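One way to read the OR rule is that the plugin returns the most severe state found across its metrics. The Python sketch below shows that reading; the function name is hypothetical, and the "worst state wins" interpretation is an assumption that goes slightly beyond what the text states.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def overall_state(states):
    """Combine per-metric states; any non-OK metric makes the result non-OK.

    Assumption: the most severe state wins. An empty list means no
    thresholds were given, which is treated as OK.
    """
    return max(states, default=OK)
```

So a check with time OK, size WARNING and age OK would return WARNING overall.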