Diffstat (limited to 'web/input')
-rw-r--r-- web/input/doc/new-threshold-syntax.md 171
1 files changed, 86 insertions, 85 deletions
diff --git a/web/input/doc/new-threshold-syntax.md b/web/input/doc/new-threshold-syntax.md
index c3eb8b7..4cf8cf6 100644
--- a/web/input/doc/new-threshold-syntax.md
+++ b/web/input/doc/new-threshold-syntax.md
@@ -11,18 +11,18 @@ _Ton Voon, March 17, 2008_
11## Overview 11## Overview
12 12
13The method for defining thresholds via the command line is inconsistent and 13The method for defining thresholds via the command line is inconsistent and
14difficult to interpret. This proposal suggests a different way of specifying 14difficult to interpret. This proposal suggests a different way of specifying
15thresholds, which will also changes the metrics of performance data returned. 15thresholds, which will also change the metrics of performance data returned.
16 16
17## Problem 17## Problem
18 18
19The current method of specifying thresholds is confusing when there are 19The current method of specifying thresholds is confusing when there are
20different checks required. For instance, in check\_http, to check page size 20different checks required. For instance, in `check_http`, to check page size
21and time, you can specify -w {warn time}, -c {crit time}, -m 21and time, you can specify `-w {warn time}`, `-c {crit time}`,
22{minpagesize}[:maxpagesize], -M {maxage of document}. 22`-m {minpagesize}[:maxpagesize]`, `-M {maxage of document}`.
23 23
24Also, note the ways of defining the range are inconsistent. Some alert above 24Also, note the ways of defining the range are inconsistent. Some alert above
25the value (time, maxage), some alert below the value (pagesize). This is 25the value (time, maxage), some alert below the value (pagesize). This is
26inconsistent for the same plugin! 26inconsistent for the same plugin!
27 27
28So, to check that a web page is returned within 5 seconds, the minimum page 28So, to check that a web page is returned within 5 seconds, the minimum page
@@ -34,7 +34,7 @@ Furthermore, the current specification for ranges in the developer guidelines
34fails the “obviousness” test: a range of 3:5 will alert if the value is 34fails the “obviousness” test: a range of 3:5 will alert if the value is
35outside that range, rather than inside as you would expect. 35outside that range, rather than inside as you would expect.
36 36
37Also, the performance data returned by check\_http is always time and size. 37Also, the performance data returned by `check_http` is always time and size.
38Perhaps you want only time, or you want age as well. 38Perhaps you want only time, or you want age as well.
39 39
40## Proposal 40## Proposal
@@ -52,42 +52,42 @@ The threshold definition is a subgetopt format of the form:
52 52
53Where: 53Where:
54 54
55- ok, warn, crit are called “levels” 55- `ok`, `warn`, `crit` are called “levels”
56- any of ok, warn, crit, unit or prefix are optional 56- any of `ok`, `warn`, `crit`, `unit` or `prefix` are optional
57- if ok, warning and critical are not specified, then no alert is raised, 57- if `ok`, `warning` and `critical` are not specified, then no alert is
58 but the performance data will be returned 58 raised, but the performance data will be returned
59- the unit can be specified with plugins that do not know about the type of 59- the `unit` can be specified with plugins that do not know about the type of
60 value returned (SNMP, Windows performance counters, etc.) 60 value returned (SNMP, Windows performance counters, etc.)
61- the prefix is used to multiply the input range and possibly for display 61- the `prefix` is used to multiply the input range and possibly for display
62 data. The prefixes allowed are defined by NIST: 62 data. The prefixes allowed are defined by NIST:
63 <http://physics.nist.gov/cuu/Units/prefixes.html> 63 <http://physics.nist.gov/cuu/Units/prefixes.html>
64 <http://physics.nist.gov/cuu/Units/binary.html> 64 <http://physics.nist.gov/cuu/Units/binary.html>
65- ok, warning or critical can be repeated to define an additional range. 65- `ok`, `warning` or `critical` can be repeated to define an additional range.
66 This allows non-continuous ranges to be defined 66 This allows non-continuous ranges to be defined
67- warning can be abbreviated to warn or w 67- `warning` can be abbreviated to `warn` or `w`
68- critical can be abbreviated to crit or c 68- `critical` can be abbreviated to `crit` or `c`
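The bullet points above can be sketched as a small parser. This is an illustrative sketch only, not the planned C library routine: the function name, the returned dictionary layout, and the assumption that range strings never contain commas are all hypothetical, and the key=value form is inferred from the examples elsewhere in this document.

```python
# Hypothetical helper: names and structure are illustrative, not from the proposal.
ALIASES = {"warn": "warning", "w": "warning", "crit": "critical", "c": "critical"}
LEVELS = ("ok", "warning", "critical")
SCALARS = ("metric", "unit", "prefix")


def parse_threshold_definition(definition):
    """Split a string such as 'metric=time,ok=0..5' into its named parts."""
    parsed = {"metric": None, "unit": None, "prefix": None,
              "ok": [], "warning": [], "critical": []}
    # Assumption: parts are comma-separated and ranges contain no commas.
    for part in definition.split(","):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        key = ALIASES.get(key, key)  # warn/w -> warning, crit/c -> critical
        if key in LEVELS:
            parsed[key].append(value)  # a repeated level adds another range
        elif key in SCALARS:
            parsed[key] = value
        else:
            raise ValueError(f"unknown key {key!r}")
    return parsed
```

Keeping each level as a list is one way to support repeated levels for non-continuous ranges; the range strings are left unparsed here.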
69 69
70### Simple Range 70### Simple Range
71 71
72The range values have two specifications: simple and complex. Simple ranges 72The range values have two specifications: simple and complex. Simple ranges
73are of the format: 73are of the format:
74 74
75 start..end 75 start..end
76 76
77Where: 77Where:
78 78
79- start and end must be defined 79- `start` and `end` must be defined
80- start and end match the regular expression 80- `start` and `end` match the regular expression
81 /^[+-]?[0-9]+\\.?[0-9]\*$|^inf$/ (ie, a numeric or “inf”) 81 `/^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (ie, a numeric or “inf”)
82- start ≤ end 82- `start ≤ end`
83- if start = inf, this is negative infinity. This can also be written as 83- if `start` = `inf`, this is negative infinity. This can also be written as
84 -inf 84 `-inf`
85- if end = inf, this is positive infinity 85- if `end` = `inf`, this is positive infinity
86- endpoints are inclusive of the range 86- endpoints are inclusive of the range
87- alert is raised if value is inside start and end range 87- alert is raised if value is inside `start` and `end` range
88 88
89(Note: this may be extended in future for adding multiple ranges using a 89(Note: this may be extended in future for adding multiple ranges using a
90separator - I think this is catered for by repeating ok=,warn=,crit=.) 90separator - I think this is catered for by repeating `ok=,warn=,crit=`.)
91 91
92This simple range does not require quoting at the shell. 92This simple range does not require quoting at the shell.
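A minimal sketch of the simple range rules listed above, assuming a `start..end` string is split at the first `..`. The function names are hypothetical, and the numeric pattern is the one quoted in the list, applied with a full match rather than explicit anchors.

```python
import math
import re

# Numeric part of the pattern quoted in the list above.
NUMBER = re.compile(r"[+-]?[0-9]+\.?[0-9]*")


def parse_endpoint(text, is_start):
    """Convert one endpoint to a float; 'inf' depends on its position."""
    if text == "inf":
        return -math.inf if is_start else math.inf
    if text == "-inf" and is_start:
        return -math.inf  # alternative spelling of negative infinity
    if NUMBER.fullmatch(text):
        return float(text)
    raise ValueError(f"invalid endpoint {text!r}")


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) tuple of floats."""
    start_text, sep, end_text = text.partition("..")
    if not sep or not start_text or not end_text:
        raise ValueError("start and end must both be defined")
    start = parse_endpoint(start_text, is_start=True)
    end = parse_endpoint(end_text, is_start=False)
    if start > end:
        raise ValueError("start must be <= end")
    return start, end


def in_simple_range(value, bounds):
    """Endpoints are inclusive, so a value equal to either one is inside."""
    start, end = bounds
    return start <= value <= end
```

For example, `parse_simple_range("inf..10")` yields negative infinity as the start, so every value up to and including 10 is inside the range.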
93 93
@@ -103,17 +103,17 @@ or
103 103
104Where: 104Where:
105 105
106- start and end must be defined 106- `start` and `end` must be defined
107- start and end match the regular expression 107- `start` and `end` match the regular expression
108 /\^[+-]?[0-9]+\\.?[0-9]\*\$|\^inf\$/ (ie, a numeric or “inf”) 108 `/^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (ie, a numeric or “inf”)
109- start ≤ end 109- `start` ≤ `end`
110- if start = inf, this is negative infinity. This can also be written as 110- if `start` = `inf`, this is negative infinity. This can also be written as
111 -inf 111 `-inf`
112- if end = inf, this is positive infinity 112- if `end` = `inf`, this is positive infinity
113- endpoints are excluded from the range if () are used, otherwise endpoints 113- endpoints are excluded from the range if () are used, otherwise endpoints
114 are included in the range 114 are included in the range
115- alert is raised if value is within start and end range, unless \^ is used, 115- alert is raised if value is within `start` and `end` range, unless `^` is
116 in which case alert is raised if outside the range 116 used, in which case alert is raised if outside the range
117 117
118Note that due to shell characters, quoting may be required. 118Note that due to shell characters, quoting may be required.
119 119
@@ -122,17 +122,18 @@ Note that due to shell characters, quoting may be required.
122Given a numeric value, the state of the threshold is calculated from the 122Given a numeric value, the state of the threshold is calculated from the
123following ordered rules: 123following ordered rules:
124 124
1251. If no levels are specified, return OK 1251. If no levels are specified, return `OK`
1262. If an ok level is specified and value is within range, return OK 1262. If an `ok` level is specified and value is within range, return `OK`
1273. If a critical level is specified and value is within range, return 1273. If a `critical` level is specified and value is within range, return
128 CRITICAL 128 `CRITICAL`
1294. If a warning level is specified and value is within range, return WARNING 1294. If a `warning` level is specified and value is within range, return
1305. If an ok level is specified, return CRITICAL 130 `WARNING`
1316. Otherwise return OK 1315. If an `ok` level is specified, return `CRITICAL`
1326. Otherwise return `OK`
132 133
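The ordered rules above can be sketched as follows. This is an illustration under assumptions, not the proposed library code: each level is assumed to be a list of already-parsed inclusive `(start, end)` tuples, and the function names are hypothetical.

```python
def in_any(value, ranges):
    """True if value lies inside at least one inclusive (start, end) range."""
    return any(start <= value <= end for start, end in ranges)


def threshold_state(value, ok=(), warning=(), critical=()):
    """Apply the six ordered rules and return the resulting state name."""
    if not ok and not warning and not critical:
        return "OK"        # rule 1: no levels specified
    if ok and in_any(value, ok):
        return "OK"        # rule 2: inside an ok range
    if critical and in_any(value, critical):
        return "CRITICAL"  # rule 3: inside a critical range
    if warning and in_any(value, warning):
        return "WARNING"   # rule 4: inside a warning range
    if ok:
        return "CRITICAL"  # rule 5: ok was specified but nothing matched
    return "OK"            # rule 6: fall through
```

Rule 5 is what lets a definition list only `ok` and `warn` ranges: any value that matches neither falls through to `CRITICAL`.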
133### Looking Back … 134### Looking Back …
134 135
135So the check\_http example becomes: 136So the `check_http` example becomes:
136 137
137 check_http -H $HOSTADDRESS$ \ 138 check_http -H $HOSTADDRESS$ \
138 --th metric=time,ok=0..5 \ 139 --th metric=time,ok=0..5 \
@@ -144,26 +145,26 @@ age) and more consistent (I’m alerting above 5, less than 10 and above 1,
144respectively). 145respectively).
145 146
146In addition, performance data will only be output if the metric has been 147In addition, performance data will only be output if the metric has been
147specified. So only show time performance data if --th metric=time has been 148specified. So only show time performance data if `--th metric=time` has been
148specified on the command line. Both warning\_range and critical\_range can be 149specified on the command line. Both warning range and critical range can be
149unspecified - this effectively means “I am not going to alert on this value, 150unspecified - this effectively means “I am not going to alert on this value,
150but I’d like to be informed about it in the performance data”. 151but I’d like to be informed about it in the performance data”.
151 152
152Because the specification for a range has changed, the warning and critical 153Because the specification for a range has changed, the warning and critical
153parts of the performance data can no longer be guaranteed. There is an 154parts of the performance data can no longer be guaranteed. There is an
154additional piece of work required to fix a new format for performance data. 155additional piece of work required to fix a new format for performance data.
155However, the basic 156However, the basic
156 157
157 label=value[uom] 158 label=value[uom]
158 159
159Will still be valid. 160will still be valid.
160 161
161### Examples 162### Examples
162 163
163Other examples. 164Other examples.
164 165
165To check httpd processes are OK if the virtual size is under 8096 bytes. Warn 166To check httpd processes are `OK` if the virtual size is under 8096 bytes.
166until they reach 16182, but bigger than that is CRITICAL. 167Warn until they reach 16182, but bigger than that is `CRITICAL`.
167 168
168 # old 169 # old
169 check_procs -w 8096 -c 16182 -C httpd --metric VSZ 170 check_procs -w 8096 -c 16182 -C httpd --metric VSZ
@@ -171,8 +172,8 @@ until they reach 16182, but bigger than that is CRITICAL.
171 # new 172 # new
172 check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182 173 check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182
173 174
174There should always be one and only one ‘tnslsnr’ process. Otherwise 175There should always be one and only one ‘tnslsnr’ process. Otherwise
175critical. 176`CRITICAL`.
176 177
177 # old 178 # old
178 check_procs -w 1:1 -c 1:1 -C tnslsnr 179 check_procs -w 1:1 -c 1:1 -C tnslsnr
@@ -192,33 +193,33 @@ Load averages (1,5,15 minute) should be within reasonable ranges.
192 193
193## Plan 194## Plan
194 195
195I personally plan on updating check\_procs. 196I personally plan on updating `check_procs`.
196 197
197The basic syntax is: 198The basic syntax is:
198 199
199 check_procs [filter options] [threshold options] 200 check_procs [filter options] [threshold options]
200 201
201Where filter options are the current -u {username}, -C {command}, etc. This 202Where filter options are the current `-u {username}`, `-C {command}`, etc.
202reduces the set of processes that are to be calculated. 203This reduces the set of processes that are to be calculated.
203 204
204The new threshold metrics will be: 205The new threshold metrics will be:
205 206
206- number - alert on number of matching processes. Performance data returns 207- number - alert on number of matching processes. Performance data returns
207 number of processes 208 number of processes
208- rss-threshold - alert on rss size if any matching process is in range. 209- rss-threshold - alert on rss size if any matching process is in range. Perf
209 Perf data returns average rss 210 data returns average rss
210- rss-max - Same as --rss, but perf data returns max rss 211- rss-max - Same as `--rss`, but perf data returns max rss
211- rss-sum - alert on the total rss of all matching processes. Perf data 212- rss-sum - alert on the total rss of all matching processes. Perf data
212 returns rss\_sum 213 returns rss\_sum
213- vsz-threshold - alert on vsz size if any matching process is in range. 214- vsz-threshold - alert on vsz size if any matching process is in range. Perf
214 Perf data returns average vsz 215 data returns average vsz
215- vsz-max - Same as --vsz, but perf data returns max vsz 216- vsz-max - Same as `--vsz`, but perf data returns max vsz
216- vsz-sum - alert on the total vsz of all matching processes. Perf data 217- vsz-sum - alert on the total vsz of all matching processes. Perf data
217 returns vsz\_sum 218 returns vsz\_sum
218- cpu-threshold - alert on cpu % of all matching processes. Perf data 219- cpu-threshold - alert on cpu % of all matching processes. Perf data returns
219 returns average cpu 220 average cpu
220- cpu-max - Same as --cpu, but perf data returns max cpu 221- cpu-max - Same as `--cpu`, but perf data returns max cpu
221- cpu-sum - alert on total cpu. Perf data returns cpu\_sum 222- cpu-sum - alert on total cpu. Perf data returns cpu\_sum
222 223
223There will be C library routines for parsing the threshold values. 224There will be C library routines for parsing the threshold values.
224 225
@@ -228,16 +229,16 @@ performance data.
228## Terminology 229## Terminology
229 230
230**metric** 231**metric**
231: Something that a check is going to be measured against. For example, for 232: Something that a check is going to be measured against. For example, for
232 disk checks, it could be used or free or inodes\_free; for http checks, it 233 disk checks, it could be used or free or inodes\_free; for HTTP checks, it
233 could be time [taken] or size; for process checks, it could be cpu or 234 could be time taken or size; for process checks, it could be cpu or
234 number [of processes] or vsz 235 number of processes or vsz
235 236
236**range** 237**range**
237: This defines a continuous range of values when an alert would be raised 238: This defines a continuous range of values when an alert would be raised
238 239
239**level** 240**level**
240: This is an alert level within Nagios - OK, WARNING or CRITICAL 241: This is an alert level within Nagios - `OK`, `WARNING` or `CRITICAL`
241 242
242**threshold** 243**threshold**
243: This consists of a level with a range 244: This consists of a level with a range
@@ -246,7 +247,7 @@ performance data.
246 247
247This assumes that you are always comparing numbers as the metric values. 248This assumes that you are always comparing numbers as the metric values.
248 249
249There may be some limitations in the precision of values. All internal logic 250There may be some limitations in the precision of values. All internal logic
250should use double precision. 251should use double precision.
251 252
252If there are multiple metrics, the alert will be on an OR basis, that is, any 253If there are multiple metrics, the alert will be on an OR basis, that is, any