Diffstat (limited to 'web')
-rw-r--r-- | web/input/doc/new-threshold-syntax.md | 171 |
1 files changed, 86 insertions, 85 deletions
diff --git a/web/input/doc/new-threshold-syntax.md b/web/input/doc/new-threshold-syntax.md index c3eb8b7..4cf8cf6 100644 --- a/web/input/doc/new-threshold-syntax.md +++ b/web/input/doc/new-threshold-syntax.md | |||
@@ -11,18 +11,18 @@ _Ton Voon, March 17, 2008_ | |||
11 | ## Overview | 11 | ## Overview |
12 | 12 | ||
13 | The method for defining thresholds via the command line is inconsistent and | 13 | The method for defining thresholds via the command line is inconsistent and |
14 | difficult to interpret. This proposal suggests a different way of specifying | 14 | difficult to interpret. This proposal suggests a different way of specifying |
15 | thresholds, which will also changes the metrics of performance data returned. | 15 | thresholds, which will also change the metrics of performance data returned. |
16 | 16 | ||
17 | ## Problem | 17 | ## Problem |
18 | 18 | ||
19 | The current method of specifying thresholds is confusing when there are | 19 | The current method of specifying thresholds is confusing when there are |
20 | different checks required. For instance, in check\_http, to check page size | 20 | different checks required. For instance, in `check_http`, to check page size |
21 | and time, you can specify -w {warn time}, -c {crit time}, -m | 21 | and time, you can specify `-w {warn time}`, `-c {crit time}`, |
22 | {minpagesize}[:maxpagesize], -M {maxage of document}. | 22 | `-m {minpagesize}[:maxpagesize]`, `-M {maxage of document}`. |
23 | 23 | ||
24 | Also, note the ways of defining the range are inconsistent. Some alert above | 24 | Also, note the ways of defining the range are inconsistent. Some alert above |
25 | the value (time, maxage), some alert below the value (pagesize). This is | 25 | the value (time, maxage), some alert below the value (pagesize). This is |
26 | inconsistent for the same plugin! | 26 | inconsistent for the same plugin! |
27 | 27 | ||
28 | So, to check that a web page is returned within 5 seconds, the minimum page | 28 | So, to check that a web page is returned within 5 seconds, the minimum page |
@@ -34,7 +34,7 @@ Furthermore, the current specification for ranges in the developer guidelines | |||
34 | fails the “obviousness” test: a range of 3:5 will alert if the value is | 34 | fails the “obviousness” test: a range of 3:5 will alert if the value is |
35 | outside that range, rather than inside as you would expect. | 35 | outside that range, rather than inside as you would expect. |
36 | 36 | ||
37 | Also, the performance data returned by check\_http is always time and size. | 37 | Also, the performance data returned by `check_http` is always time and size. |
38 | Perhaps you want only time, or you want age as well. | 38 | Perhaps you want only time, or you want age as well. |
39 | 39 | ||
40 | ## Proposal | 40 | ## Proposal |
@@ -52,42 +52,42 @@ The threshold definition is a subgetopt format of the form: | |||
52 | 52 | ||
53 | Where: | 53 | Where: |
54 | 54 | ||
55 | - ok, warn, crit are called “levels” | 55 | - `ok`, `warn`, `crit` are called “levels” |
56 | - any of ok, warn, crit, unit or prefix are optional | 56 | - any of `ok`, `warn`, `crit`, `unit` or `prefix` are optional |
57 | - if ok, warning and critical are not specified, then no alert is raised, | 57 | - if `ok`, `warning` and `critical` are not specified, then no alert is |
58 | but the performance data will be returned | 58 | raised, but the performance data will be returned |
59 | - the unit can be specified with plugins that do not know about the type of | 59 | - the `unit` can be specified with plugins that do not know about the type of |
60 | value returned (SNMP, Windows performance counters, etc.) | 60 | value returned (SNMP, Windows performance counters, etc.) |
61 | - the prefix is used to multiply the input range and possibly for display | 61 | - the `prefix` is used to multiply the input range and possibly for display |
62 | data. The prefixes allowed are defined by NIST: | 62 | data. The prefixes allowed are defined by NIST: |
63 | <http://physics.nist.gov/cuu/Units/prefixes.html> | 63 | <http://physics.nist.gov/cuu/Units/prefixes.html> |
64 | <http://physics.nist.gov/cuu/Units/binary.html> | 64 | <http://physics.nist.gov/cuu/Units/binary.html> |
65 | - ok, warning or critical can be repeated to define an additional range. | 65 | - `ok`, `warning` or `critical` can be repeated to define an additional range. |
66 | This allows non-continuous ranges to be defined | 66 | This allows non-continuous ranges to be defined |
67 | - warning can be abbreviated to warn or w | 67 | - `warning` can be abbreviated to `warn` or `w` |
68 | - critical can be abbreviated to crit or c | 68 | - `critical` can be abbreviated to `crit` or `c` |
69 | 69 | ||
70 | ### Simple Range | 70 | ### Simple Range |
71 | 71 | ||
72 | The range values have two specifications: simple and complex. Simple ranges | 72 | The range values have two specifications: simple and complex. Simple ranges |
73 | are of the format: | 73 | are of the format: |
74 | 74 | ||
75 | start..end | 75 | start..end |
76 | 76 | ||
77 | Where: | 77 | Where: |
78 | 78 | ||
79 | - start and end must be defined | 79 | - `start` and `end` must be defined |
80 | - start and end match the regular expression | 80 | - `start` and `end` match the regular expression |
81 | /^[+-]?[0-9]+\\.?[0-9]\*$|^inf$/ (ie, a numeric or “inf”) | 81 | `/^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (ie, a numeric or “inf”) |
82 | - start ≤ end | 82 | - `start ≤ end` |
83 | - if start = “inf”, this is negative infinity. This can also be written as | 83 | - if `start` = `inf`, this is negative infinity. This can also be written as |
84 | “-inf” | 84 | `-inf` |
85 | - if end = “inf”, this is positive infinity | 85 | - if `end` = `inf`, this is positive infinity |
86 | - endpoints are inclusive of the range | 86 | - endpoints are inclusive of the range |
87 | - alert is raised if value is inside start and end range | 87 | - alert is raised if value is inside `start` and `end` range |
88 | 88 | ||
89 | (Note: this may be extended in future for adding multiple ranges using a | 89 | (Note: this may be extended in future for adding multiple ranges using a |
90 | separator - I think this is catered for by repeating ok=,warn=,crit=.) | 90 | separator - I think this is catered for by repeating `ok=,warn=,crit=`.) |
91 | 91 | ||
92 | This simple range does not require quoting at the shell. | 92 | This simple range does not require quoting at the shell. |
93 | 93 | ||
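The simple range rules in this hunk can be illustrated with a small parser. This is only a sketch in Python, not the proposal's implementation (the plan is for C library routines); the function names `parse_endpoint`, `parse_simple_range` and `in_simple_range` are hypothetical. It also assumes `-inf` is accepted as a special case for `start`, since the text allows it even though the quoted regular expression does not match it.

```python
import math
import re

# Pattern quoted in the proposal: a plain number or the literal "inf".
ENDPOINT_RE = re.compile(r"^[+-]?[0-9]+\.?[0-9]*$|^inf$")


def parse_endpoint(text, is_start):
    """Convert one endpoint to a float.

    A bare "inf" means negative infinity for start and positive
    infinity for end. "-inf" is accepted for start only (assumption).
    """
    if is_start and text == "-inf":
        return -math.inf
    if not ENDPOINT_RE.match(text):
        raise ValueError(f"invalid endpoint: {text!r}")
    if text == "inf":
        return -math.inf if is_start else math.inf
    return float(text)


def parse_simple_range(spec):
    """Parse 'start..end' into a (start, end) tuple of floats."""
    start_text, sep, end_text = spec.partition("..")
    if not sep or not start_text or not end_text:
        raise ValueError(f"start and end must both be defined: {spec!r}")
    start = parse_endpoint(start_text, is_start=True)
    end = parse_endpoint(end_text, is_start=False)
    if start > end:
        raise ValueError(f"start must not exceed end: {spec!r}")
    return start, end


def in_simple_range(value, spec):
    """Return True if value lies inside the range, endpoints included."""
    start, end = parse_simple_range(spec)
    return start <= value <= end
```

For instance, `in_simple_range(5, "0..5")` is true because the endpoints are inclusive, while `parse_simple_range("5..3")` is rejected because `start` exceeds `end`.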
@@ -103,17 +103,17 @@ or | |||
103 | 103 | ||
104 | Where: | 104 | Where: |
105 | 105 | ||
106 | - start and end must be defined | 106 | - `start` and `end` must be defined |
107 | - start and end match the regular expression | 107 | - `start` and `end` match the regular expression |
108 | /\^[+-]?[0-9]+\\.?[0-9]\*\$|\^inf\$/ (ie, a numeric or “inf”) | 108 | `/\^[+-]?[0-9]+\.?[0-9]*$|^inf$/` (ie, a numeric or “inf”) |
109 | - start ≤ end | 109 | - `start` ≤ `end` |
110 | - if start = “inf”, this is negative infinity. This can also be written as | 110 | - if `start` = `inf`, this is negative infinity. This can also be written as |
111 | “-inf” | 111 | `-inf` |
112 | - if end = “inf”, this is positive infinity | 112 | - if `end` = `inf`, this is positive infinity |
113 | - endpoints are excluded from the range if () are used, otherwise endpoints | 113 | - endpoints are excluded from the range if () are used, otherwise endpoints |
114 | are included in the range | 114 | are included in the range |
115 | - alert is raised if value is within start and end range, unless \^ is used, | 115 | - alert is raised if value is within `start` and `end` range, unless `^` is |
116 | in which case alert is raised if outside the range | 116 | used, in which case alert is raised if outside the range |
117 | 117 | ||
118 | Note that due to shell characters, quoting may be required. | 118 | Note that due to shell characters, quoting may be required. |
119 | 119 | ||
@@ -122,17 +122,18 @@ Note that due to shell characters, quoting may be required. | |||
122 | Given a numeric value, the state of the threshold is calculated from the | 122 | Given a numeric value, the state of the threshold is calculated from the |
123 | following ordered rules: | 123 | following ordered rules: |
124 | 124 | ||
125 | 1. If no levels are specified, return OK | 125 | 1. If no levels are specified, return `OK` |
126 | 2. If an ok level is specified and value is within range, return OK | 126 | 2. If an `ok` level is specified and value is within range, return `OK` |
127 | 3. If a critical level is specified and value is within range, return | 127 | 3. If a `critical` level is specified and value is within range, return |
128 | CRITICAL | 128 | `CRITICAL` |
129 | 4. If a warning level is specified and value is within range, return WARNING | 129 | 4. If a `warning` level is specified and value is within range, return |
130 | 5. If an ok level is specified, return CRITICAL | 130 | `WARNING` |
131 | 6. Otherwise return OK | 131 | 5. If an `ok` level is specified, return `CRITICAL` |
132 | 6. Otherwise return `OK` | ||
132 | 133 | ||
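The six ordered rules above might be sketched as follows. This is an illustrative Python sketch only: the function name `threshold_state` is hypothetical, and each level is assumed to be given as a list of already-parsed `(start, end)` pairs with inclusive endpoints, as in the simple range format.

```python
def threshold_state(value, ok=(), warning=(), critical=()):
    """Return "OK", "WARNING" or "CRITICAL" for a numeric value.

    ok, warning and critical are sequences of (start, end) ranges;
    an empty sequence means that level was not specified.
    """
    def within(ranges):
        # Inclusive endpoints are an assumption of this sketch.
        return any(start <= value <= end for start, end in ranges)

    if not ok and not warning and not critical:
        return "OK"        # rule 1: no levels specified
    if ok and within(ok):
        return "OK"        # rule 2: value is inside an ok range
    if critical and within(critical):
        return "CRITICAL"  # rule 3
    if warning and within(warning):
        return "WARNING"   # rule 4
    if ok:
        return "CRITICAL"  # rule 5: ok was given but did not match
    return "OK"            # rule 6


# Ranges corresponding to ok=0..8096,warn=8097..16182
print(threshold_state(5000, ok=[(0, 8096)], warning=[(8097, 16182)]))
print(threshold_state(10000, ok=[(0, 8096)], warning=[(8097, 16182)]))
print(threshold_state(20000, ok=[(0, 8096)], warning=[(8097, 16182)]))
```

The order of the checks matters: a value that matches no range is `CRITICAL` when an `ok` level was given (rule 5), but `OK` when only `warning` or `critical` levels were given (rule 6).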
133 | ### Looking Back … | 134 | ### Looking Back … |
134 | 135 | ||
135 | So the check\_http example becomes: | 136 | So the `check_http` example becomes: |
136 | 137 | ||
137 | check_http -H $HOSTADDRESS$ \ | 138 | check_http -H $HOSTADDRESS$ \ |
138 | --th metric=time,ok=0..5 \ | 139 | --th metric=time,ok=0..5 \ |
@@ -144,26 +145,26 @@ age) and more consistent (I’m alerting above 5, less than 10 and above 1, | |||
144 | respectively). | 145 | respectively). |
145 | 146 | ||
146 | In addition, performance data will only be output if the metric has been | 147 | In addition, performance data will only be output if the metric has been |
147 | specified. So only show time performance data if “--th metric=time” has been | 148 | specified. So only show time performance data if `--th metric=time` has been |
148 | specified on the command line. Both warning\_range or critical\_range can be | 149 | specified on the command line. Both warning range or critical range can be |
149 | unspecified - this effectively means “I am not going to alert on this value, | 150 | unspecified - this effectively means “I am not going to alert on this value, |
150 | but I’d like to be informed about it in the performance data”. | 151 | but I’d like to be informed about it in the performance data”. |
151 | 152 | ||
152 | Because the specification for a range has changed, the warning and critical | 153 | Because the specification for a range has changed, the warning and critical |
153 | parts of the performance data can no longer be guaranteed. There is an | 154 | parts of the performance data can no longer be guaranteed. There is an |
154 | additional piece of work required to fix a new format for performance data. | 155 | additional piece of work required to fix a new format for performance data. |
155 | However, the basic | 156 | However, the basic |
156 | 157 | ||
157 | label=value[uom] | 158 | label=value[uom] |
158 | 159 | ||
159 | Will still be valid. | 160 | will still be valid. |
160 | 161 | ||
161 | ### Examples | 162 | ### Examples |
162 | 163 | ||
163 | Other examples. | 164 | Other examples. |
164 | 165 | ||
165 | To check httpd processes are OK if the virtual size is under 8096 bytes. Warn | 166 | To check httpd processes are `OK` if the virtual size is under 8096 bytes. |
166 | until they reach 16182, but bigger than that is CRITICAL. | 167 | Warn until they reach 16182, but bigger than that is `CRITICAL`. |
167 | 168 | ||
168 | # old | 169 | # old |
169 | check_procs -w 8096 -c 16182 -C httpd --metric VSZ | 170 | check_procs -w 8096 -c 16182 -C httpd --metric VSZ |
@@ -171,8 +172,8 @@ until they reach 16182, but bigger than that is CRITICAL. | |||
171 | # new | 172 | # new |
172 | check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182 | 173 | check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182 |
173 | 174 | ||
174 | There should always be one and only one ‘tnslsnr’ process. Otherwise | 175 | There should always be one and only one ‘tnslsnr’ process. Otherwise |
175 | critical. | 176 | `CRITICAL`. |
176 | 177 | ||
177 | # old | 178 | # old |
178 | check_procs -w 1:1 -c 1:1 -C tnslsnr | 179 | check_procs -w 1:1 -c 1:1 -C tnslsnr |
@@ -192,33 +193,33 @@ Load averages (1,5,15 minute) should be within reasonable ranges. | |||
192 | 193 | ||
193 | ## Plan | 194 | ## Plan |
194 | 195 | ||
195 | I personally plan on updating check\_procs. | 196 | I personally plan on updating `check_procs`. |
196 | 197 | ||
197 | The basic syntax is: | 198 | The basic syntax is: |
198 | 199 | ||
199 | check_procs [filter options] [threshold options] | 200 | check_procs [filter options] [threshold options] |
200 | 201 | ||
201 | Where filter options are the current -u {username}, -C {command}, etc. This | 202 | Where filter options are the current `-u {username}`, `-C {command}`, etc. |
202 | reduces the set of processes that are to be calculated. | 203 | This reduces the set of processes that are to be calculated. |
203 | 204 | ||
204 | The new threshold metrics will be: | 205 | The new threshold metrics will be: |
205 | 206 | ||
206 | - number - alert on number of matching processes. Performance data returns | 207 | - number - alert on number of matching processes. Performance data returns |
207 | number of processes | 208 | number of processes |
208 | - rss-threshold - alert on rss size if any matching process is in range. | 209 | - rss-threshold - alert on rss size if any matching process is in range. Perf |
209 | Perf data returns average rss | 210 | data returns average rss |
210 | - rss-max - Same as --rss, but perf data returns max rss | 211 | - rss-max - Same as `--rss`, but perf data returns max rss |
211 | - rss-sum - alert on the total rss of all matching processes. Perf data | 212 | - rss-sum - alert on the total rss of all matching processes. Perf data |
212 | returns rss\_sum | 213 | returns rss\_sum |
213 | - vsz-threshold - alert on vsz size if any matching process is in range. | 214 | - vsz-threshold - alert on vsz size if any matching process is in range. Perf |
214 | Perf data returns average vsz | 215 | data returns average vsz |
215 | - vsz-max - Same as --vsz, but perf data returns max rss | 216 | - vsz-max - Same as `--vsz`, but perf data returns max rss |
216 | - vsz-sum - alert on the total vsz of all matching processes. Perf data | 217 | - vsz-sum - alert on the total vsz of all matching processes. Perf data |
217 | returns vsz\_sum | 218 | returns vsz\_sum |
218 | - cpu-threshold - alert on cpu % of all matching processes. Perf data | 219 | - cpu-threshold - alert on cpu % of all matching processes. Perf data returns |
219 | returns average cpu | 220 | average cpu |
220 | - cpu-max - Same as --cpu, but perf data returns max cpu | 221 | - cpu-max - Same as `--cpu`, but perf data returns max cpu |
221 | - cpu-sum - alert on total cpu. Perf data returns cpu\_sum | 222 | - cpu-sum - alert on total cpu. Perf data returns cpu\_sum |
222 | 223 | ||
223 | There will be C library routines for parsing the threshold values. | 224 | There will be C library routines for parsing the threshold values. |
224 | 225 | ||
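As a rough idea of what such parsing routines would need to do, here is a sketch in Python (not the planned C code). It splits a `--th` argument such as `metric=vsize,ok=0..8096,warn=8097..16182` into its parts. The function name `parse_threshold` is hypothetical, the key names `unit` and `prefix` are assumed from the wording of the proposal, and the sketch assumes a range never contains a comma. Range strings are kept as text rather than parsed.

```python
# Abbreviations listed in the proposal, mapped to their full level name.
LEVEL_ALIASES = {
    "ok": "ok",
    "warning": "warning", "warn": "warning", "w": "warning",
    "critical": "critical", "crit": "critical", "c": "critical",
}

# "metric" appears in the examples; "unit" and "prefix" are assumed key names.
SCALAR_KEYS = ("metric", "unit", "prefix")


def parse_threshold(arg):
    """Split a comma-separated key=value threshold definition.

    Levels may be repeated, so each level collects a list of range
    strings; metric, unit and prefix hold a single value or None.
    """
    result = {"metric": None, "unit": None, "prefix": None,
              "ok": [], "warning": [], "critical": []}
    for part in arg.split(","):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        if key in LEVEL_ALIASES:
            result[LEVEL_ALIASES[key]].append(value)
        elif key in SCALAR_KEYS:
            result[key] = value
        else:
            raise ValueError(f"unknown key: {key!r}")
    return result


parsed = parse_threshold("metric=vsize,ok=0..8096,warn=8097..16182")
print(parsed["metric"], parsed["ok"], parsed["warning"], parsed["critical"])
```

Collecting each level into a list reflects the rule that `ok`, `warning` or `critical` can be repeated to define non-continuous ranges.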
@@ -228,16 +229,16 @@ performance data. | |||
228 | ## Terminology | 229 | ## Terminology |
229 | 230 | ||
230 | **metric** | 231 | **metric** |
231 | : Something that a check is going to be measured against. For example, for | 232 | : Something that a check is going to be measured against. For example, for |
232 | disk checks, it could be used or free or inodes\_free; for http checks, it | 233 | disk checks, it could be used or free or inodes\_free; for HTTP checks, it |
233 | could be time [taken] or size; for process checks, it could be cpu or | 234 | could be time taken or size; for process checks, it could be cpu or |
234 | number [of processes] or vsz | 235 | number of processes or vsz |
235 | 236 | ||
236 | **range** | 237 | **range** |
237 | : This defines a continuous range of values when an alert would be raised | 238 | : This defines a continuous range of values when an alert would be raised |
238 | 239 | ||
239 | **level** | 240 | **level** |
240 | : This is an alert level within Nagios - OK, WARNING or CRITICAL | 241 | : This is an alert level within Nagios - `OK`, `WARNING` or `CRITICAL` |
241 | 242 | ||
242 | **threshold** | 243 | **threshold** |
243 | : This consists of a level with a range | 244 | : This consists of a level with a range |
@@ -246,7 +247,7 @@ performance data. | |||
246 | 247 | ||
247 | This assumes that you are always comparing numbers as the metric values. | 248 | This assumes that you are always comparing numbers as the metric values. |
248 | 249 | ||
249 | There maybe some limitations in the precision of values. All internal logic | 250 | There maybe some limitations in the precision of values. All internal logic |
250 | should use double precision. | 251 | should use double precision. |
251 | 252 | ||
252 | If there are multiple metrics, the alert will be on an OR basis, that is, any | 253 | If there are multiple metrics, the alert will be on an OR basis, that is, any |