Diffstat (limited to 'web/input')
-rw-r--r-- web/input/doc/new-threshold-syntax.md 256
1 files changed, 256 insertions, 0 deletions
diff --git a/web/input/doc/new-threshold-syntax.md b/web/input/doc/new-threshold-syntax.md
new file mode 100644
index 0000000..c3eb8b7
--- /dev/null
+++ b/web/input/doc/new-threshold-syntax.md
@@ -0,0 +1,256 @@
1title: New Threshold Syntax
2parent: Documentation
3---
4
5<!--% # Auto-imported from: http://nagiosplugins.org/rfc/new_threshold_syntax # %-->
6
7# New Specification Method for Thresholds
8
9_Ton Voon, March 17, 2008_
10
11## Overview
12
13The method for defining thresholds via the command line is inconsistent and
14difficult to interpret. This proposal suggests a different way of specifying
15thresholds, which will also change the metrics of performance data returned.
16
17## Problem
18
19The current method of specifying thresholds is confusing when there are
20different checks required. For instance, in check\_http, to check page size
21and time, you can specify -w {warn time}, -c {crit time}, -m
22{minpagesize}[:maxpagesize], -M {maxage of document}.
23
24Also, note the ways of defining the range are inconsistent. Some alert above
25the value (time, maxage), some alert below the value (pagesize). This is
26inconsistent for the same plugin!
27
28So, to check that a web page is returned within 5 seconds, the minimum page
29size is 10K and the maximum age is 1 day, you would invoke:
30
31 check_http -H $HOSTADDRESS$ -c 5 -m 10000 -M 1d
32
33Furthermore, the current specification for ranges in the developer guidelines
34fails the “obviousness” test: a range of 3:5 will alert if the value is
35outside that range, rather than inside as you would expect.
36
37Also, the performance data returned by check\_http is always time and size.
38Perhaps you want only time, or you want age as well.
39
40## Proposal
41
42### Thresholds
43
44This document proposes that threshold arguments are specified like:
45
46 --threshold={threshold definition}
47 --th={threshold definition}
48
49The threshold definition is a subgetopt format of the form:
50
51 metric={metric},ok={range},warn={range},crit={range},unit={unit},prefix={SI prefix}
52
53Where:
54
55- ok, warn, crit are called “levels”
56- any of ok, warn, crit, unit or prefix are optional
57- if ok, warning and critical are not specified, then no alert is raised,
58 but the performance data will be returned
59- the unit can be specified with plugins that do not know about the type of
60 value returned (SNMP, Windows performance counters, etc.)
61- the prefix is used to multiply the input range and possibly for display
62 data. The prefixes allowed are defined by NIST:
63 <http://physics.nist.gov/cuu/Units/prefixes.html>
64 <http://physics.nist.gov/cuu/Units/binary.html>
65- ok, warning or critical can be repeated to define an additional range.
66 This allows non-continuous ranges to be defined
67- warning can be abbreviated to warn or w
68- critical can be abbreviated to crit or c
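
As a rough illustration of the definition format above, here is a minimal Python sketch that splits one threshold definition into its parts. The names `parse_threshold`, `ALIASES` and `SINGLE_KEYS` are hypothetical and not part of the proposal (which only promises C library routines). Treating `metric` as required and rejecting a repeated `unit` or `prefix` are assumptions inferred from the list above.

```python
# Hypothetical helper, not part of the proposal.
# Maps every accepted spelling of a level to one canonical name.
ALIASES = {
    "ok": "ok",
    "w": "warn", "warn": "warn", "warning": "warn",
    "c": "crit", "crit": "crit", "critical": "crit",
}
# Keys assumed to appear at most once per definition.
SINGLE_KEYS = {"metric", "unit", "prefix"}


def parse_threshold(definition):
    """Split 'metric=...,ok=...,warn=...' into a dict.

    Levels collect into lists because they may be repeated to
    describe non-continuous ranges. Range strings are kept as text.
    """
    parsed = {"ok": [], "warn": [], "crit": []}
    for item in definition.split(","):
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not value:
            raise ValueError(f"expected key=value, got {item!r}")
        if key in ALIASES:
            parsed[ALIASES[key]].append(value)
        elif key in SINGLE_KEYS:
            if key in parsed:
                raise ValueError(f"{key} given more than once")
            parsed[key] = value
        else:
            raise ValueError(f"unknown key {key!r}")
    if "metric" not in parsed:
        # Assumption: metric is the only mandatory key.
        raise ValueError("metric is required")
    return parsed
```

For example, `parse_threshold("metric=x,w=1..2,warning=5..6")` collects both warning ranges under the single key `warn`.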
69
70### Simple Range
71
72The range values have two specifications: simple and complex. Simple ranges
73are of the format:
74
75 start..end
76
77Where:
78
79- start and end must be defined
80- start and end match the regular expression
81 /^[+-]?[0-9]+\\.?[0-9]\*$|^inf$/ (ie, a numeric or “inf”)
82- start ≤ end
83- if start = “inf”, this is negative infinity. This can also be written as
84 “-inf”
85- if end = “inf”, this is positive infinity
86- endpoints are inclusive of the range
87- alert is raised if value is inside start and end range
88
89(Note: this may be extended in future for adding multiple ranges using a
90separator - I think this is catered for by repeating ok=,warn=,crit=.)
91
92This simple range does not require quoting at the shell.
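
The simple range rules can be sketched in Python as follows. This is only an illustration under the rules listed above, not the planned C routines; the function names are hypothetical. Accepting "-inf" only as a start value is an assumption, since the text mentions it only for the start.

```python
import math
import re

# Regular expression for a plain number, as quoted in the rules above.
NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def parse_endpoint(text, is_start):
    """Convert one endpoint to a float; 'inf' depends on its position."""
    if text == "inf":
        return -math.inf if is_start else math.inf
    if text == "-inf" and is_start:
        return -math.inf
    if NUMBER.match(text):
        return float(text)
    raise ValueError(f"bad endpoint: {text!r}")


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) tuple of floats."""
    start_text, sep, end_text = text.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = parse_endpoint(start_text, is_start=True)
    end = parse_endpoint(end_text, is_start=False)
    if start > end:
        raise ValueError(f"start must not exceed end in {text!r}")
    return start, end


def in_simple_range(value, text):
    """True if value lies within the range, endpoints included."""
    start, end = parse_simple_range(text)
    return start <= value <= end
```

With this sketch, `in_simple_range(5, "0..5")` is true because endpoints are inclusive, while `in_simple_range(5.1, "0..5")` is false.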
93
94### Complex Range
95
96Complex ranges are defined as:
97
98 [^](start..end)
99
100or
101
102 [^]start..end
103
104Where:
105
106- start and end must be defined
107- start and end match the regular expression
108 /\^[+-]?[0-9]+\\.?[0-9]\*\$|\^inf\$/ (ie, a numeric or “inf”)
109- start ≤ end
110- if start = “inf”, this is negative infinity. This can also be written as
111 “-inf”
112- if end = “inf”, this is positive infinity
113- endpoints are excluded from the range if () are used, otherwise endpoints
114 are included in the range
115- alert is raised if value is within start and end range, unless \^ is used,
116 in which case alert is raised if outside the range
117
118Note that due to shell characters, quoting may be required.
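
A complex range adds two modifiers to the simple form: parentheses for exclusive endpoints and a leading `^` to invert the match. The Python sketch below illustrates that reading of the rules; `matches_complex_range` and its helper are hypothetical names, and requiring both parentheses together is an assumption.

```python
import math
import re

NUMBER = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$")


def _endpoint(text, infinity):
    """Parse an endpoint; 'infinity' is -inf for start, +inf for end."""
    if text == "inf" or (text == "-inf" and infinity < 0):
        return infinity
    if NUMBER.match(text):
        return float(text)
    raise ValueError(f"bad endpoint: {text!r}")


def parse_complex_range(text):
    """Return (negate, exclusive, start, end) for '[^](start..end)'."""
    negate = text.startswith("^")
    body = text[1:] if negate else text
    # Assumption: parentheses must be given as a matching pair.
    exclusive = body.startswith("(") and body.endswith(")")
    if exclusive:
        body = body[1:-1]
    start_text, sep, end_text = body.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = _endpoint(start_text, -math.inf)
    end = _endpoint(end_text, math.inf)
    if start > end:
        raise ValueError(f"start must not exceed end in {text!r}")
    return negate, exclusive, start, end


def matches_complex_range(value, text):
    """True if the range would raise its alert for this value."""
    negate, exclusive, start, end = parse_complex_range(text)
    if exclusive:
        inside = start < value < end
    else:
        inside = start <= value <= end
    # '^' flips the result: alert when the value is outside the range.
    return inside != negate
```

For instance, `matches_complex_range(1, "(1..5)")` is false because the endpoint is excluded, and `matches_complex_range(7, "^1..5")` is true because 7 lies outside the range.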
119
120### Rules for Determining State
121
122Given a numeric value, the state of the threshold is calculated from the
123following ordered rules:
124
1251. If no levels are specified, return OK
1262. If an ok level is specified and value is within range, return OK
1273. If a critical level is specified and value is within range, return
128 CRITICAL
1294. If a warning level is specified and value is within range, return WARNING
1305. If an ok level is specified, return CRITICAL
1316. Otherwise return OK
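
The ordered rules translate almost line for line into code. The Python sketch below is an illustration only: `get_state` is a hypothetical name, and each range is simplified to an inclusive `(start, end)` tuple rather than the full range syntax. The numeric values are the usual Nagios plugin exit codes.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def in_any(value, ranges):
    """True if value falls inside at least one inclusive (start, end) pair."""
    return any(start <= value <= end for start, end in ranges)


def get_state(value, ok=(), warn=(), crit=()):
    """Apply the six ordered rules to a single numeric value."""
    if not (ok or warn or crit):
        return OK                      # rule 1: no levels specified
    if ok and in_any(value, ok):
        return OK                      # rule 2
    if crit and in_any(value, crit):
        return CRITICAL                # rule 3
    if warn and in_any(value, warn):
        return WARNING                 # rule 4
    if ok:
        return CRITICAL                # rule 5: ok given but not matched
    return OK                          # rule 6
```

Note how rule 5 makes an explicit ok range act as a whitelist: with `ok=[(0, 8096)]` and `warn=[(8097, 16182)]`, a value of 20000 matches neither range and is therefore CRITICAL.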
132
133### Looking Back …
134
135So the check\_http example becomes:
136
137 check_http -H $HOSTADDRESS$ \
138 --th metric=time,ok=0..5 \
139 --th metric=size,ok=10..inf,prefix=Ki \
140 --th metric=age,ok=0..1,unit=d
141
142I believe this is more readable (I’m interested in the time, the size and the
143age) and more consistent (I’m alerting above 5, less than 10 and above 1,
144respectively).
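
The size threshold in this example depends on the prefix multiplying the range, so `ok=10..inf,prefix=Ki` means "10240 or more". A small Python sketch of that scaling step follows; `scale_range` and the `PREFIXES` table are hypothetical, and the table lists only a few of the NIST prefixes.

```python
# A partial table of SI and binary prefixes (assumption: a real
# implementation would cover the full NIST lists).
PREFIXES = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def scale_range(start, end, prefix=None):
    """Multiply both endpoints by the prefix factor; infinities stay infinite."""
    factor = PREFIXES[prefix] if prefix else 1
    return start * factor, end * factor
```

So `scale_range(10, float("inf"), "Ki")` yields a range starting at 10240 with no upper bound.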
145
146In addition, performance data will only be output if the metric has been
147specified. So only show time performance data if “--th metric=time” has been
148specified on the command line. Both warning\_range and critical\_range can be
149unspecified - this effectively means “I am not going to alert on this value,
150but I’d like to be informed about it in the performance data”.
151
152Because the specification for a range has changed, the warning and critical
153parts of the performance data can no longer be guaranteed. There is an
154additional piece of work required to fix a new format for performance data.
155However, the basic
156
157 label=value[uom]
158
159will still be valid.
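
To make the "only output what was asked for" behaviour concrete, here is a minimal Python sketch that emits the basic `label=value[uom]` form for requested metrics only. The function name and the shape of its arguments are assumptions made for illustration; the final performance data format is still to be defined.

```python
def format_perfdata(measurements, requested):
    """Build space-separated 'label=value[uom]' items.

    measurements: dict mapping label -> (value, unit of measurement)
    requested:    metric names given via --th on the command line
    """
    parts = []
    for label in requested:
        value, uom = measurements[label]
        parts.append(f"{label}={value}{uom}")
    return " ".join(parts)
```

Given measurements for time, size and age, calling it with `requested=["time"]` returns only the time item, even though the other values were collected.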
160
161### Examples
162
163Other examples.
164
165Check that httpd processes are OK while the virtual size is at most 8096
166bytes, WARNING up to 16182, and CRITICAL above that.
167
168 # old
169 check_procs -w 8096 -c 16182 -C httpd --metric VSZ
170
171 # new
172 check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182
173
174There should always be one and only one ‘tnslsnr’ process. Otherwise
175critical.
176
177 # old
178 check_procs -w 1:1 -c 1:1 -C tnslsnr
179
180 # new
181 check_procs -C tnslsnr --th metric=count,ok=1..1
182
183Load averages (1,5,15 minute) should be within reasonable ranges.
184
185 # old
186 check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0
187
188 # new
189 check_load --th metric=1min,ok=0..1.0,warn=1.0..1.5 \
190 --th metric=5min,ok=0..0.8,warn=0.8..1.3 \
191 --th metric=15min,ok=0..0.7,warn=0.7..1.0
192
193## Plan
194
195I personally plan on updating check\_procs.
196
197The basic syntax is:
198
199 check_procs [filter options] [threshold options]
200
201Where filter options are the current -u {username}, -C {command}, etc. This
202reduces the set of processes that are to be calculated.
203
204The new threshold metrics will be:
205
206- number - alert on number of matching processes. Performance data returns
207 number of processes
208- rss-threshold - alert on rss size if any matching process is in range.
209 Perf data returns average rss
210- rss-max - Same as --rss, but perf data returns max rss
211- rss-sum - alert on the total rss of all matching processes. Perf data
212 returns rss\_sum
213- vsz-threshold - alert on vsz size if any matching process is in range.
214 Perf data returns average vsz
215- vsz-max - Same as --vsz, but perf data returns max vsz
216- vsz-sum - alert on the total vsz of all matching processes. Perf data
217 returns vsz\_sum
218- cpu-threshold - alert on cpu % of all matching processes. Perf data
219 returns average cpu
220- cpu-max - Same as --cpu, but perf data returns max cpu
221- cpu-sum - alert on total cpu. Perf data returns cpu\_sum
222
223There will be C library routines for parsing the threshold values.
224
225There will be C library routines for the collection and output of the
226performance data.
227
228## Terminology
229
230**metric**
231: Something that a check is going to be measured against. For example, for
232 disk checks, it could be used or free or inodes\_free; for http checks, it
233 could be time [taken] or size; for process checks, it could be cpu or
234 number [of processes] or vsz
235
236**range**
237: This defines a continuous range of values for which an alert would be raised
238
239**level**
240: This is an alert level within Nagios - OK, WARNING or CRITICAL
241
242**threshold**
243: This consists of a level with a range
244
245## Limitations
246
247This assumes that you are always comparing numbers as the metric values.
248
249There may be some limitations in the precision of values. All internal logic
250should use double precision.
251
252If there are multiple metrics, the alert will be on an OR basis, that is, any
253single metric which passes its threshold will cause the plugin to return a
254failed state.
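
One way to read the OR rule is that the plugin returns the most severe state found across all metrics. The Python sketch below assumes that reading; `overall_state` is a hypothetical name, and "worst state wins" is an interpretation rather than something the text spells out.

```python
# Nagios plugin exit codes; a larger number is more severe.
OK, WARNING, CRITICAL = 0, 1, 2


def overall_state(metric_states):
    """Combine per-metric states; any failing metric fails the check."""
    return max(metric_states.values(), default=OK)
```

For example, `overall_state({"time": OK, "size": CRITICAL, "age": WARNING})` gives CRITICAL, and an empty set of metrics gives OK.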
255
256<!--% # vim:set filetype=markdown textwidth=78 joinspaces: # %-->