Diffstat (limited to 'web/input/doc')
-rw-r--r-- | web/input/doc/new-threshold-syntax.md | 256 |
1 files changed, 256 insertions, 0 deletions
diff --git a/web/input/doc/new-threshold-syntax.md b/web/input/doc/new-threshold-syntax.md new file mode 100644 index 0000000..c3eb8b7 --- /dev/null +++ b/web/input/doc/new-threshold-syntax.md | |||
@@ -0,0 +1,256 @@ | |||
1 | title: New Threshold Syntax | ||
2 | parent: Documentation | ||
3 | --- | ||
4 | |||
5 | <!--% # Auto-imported from: http://nagiosplugins.org/rfc/new_threshold_syntax # %--> | ||
6 | |||
7 | # New Specification Method for Thresholds | ||
8 | |||
9 | _Ton Voon, March 17, 2008_ | ||
10 | |||
11 | ## Overview | ||
12 | |||
13 | The method for defining thresholds via the command line is inconsistent and | ||
14 | difficult to interpret. This proposal suggests a different way of specifying | ||
15 | thresholds, which will also change the metrics of performance data returned. | ||
16 | |||
17 | ## Problem | ||
18 | |||
19 | The current method of specifying thresholds is confusing when there are | ||
20 | different checks required. For instance, in check\_http, to check page size | ||
21 | and time, you can specify -w {warn time}, -c {crit time}, -m | ||
22 | {minpagesize}[:maxpagesize], -M {maxage of document}. | ||
23 | |||
24 | Also, note the ways of defining the range are inconsistent. Some alert above | ||
25 | the value (time, maxage), some alert below the value (pagesize). This is | ||
26 | inconsistent within a single plugin! | ||
27 | |||
28 | So, to check that a web page is returned within 5 seconds, the minimum page | ||
29 | size is 10K and the maximum age is 1 day, you would invoke: | ||
30 | |||
31 | check_http -H $HOSTADDRESS$ -c 5 -m 10000 -M 1d | ||
32 | |||
33 | Furthermore, the current specification for ranges in the developer guidelines | ||
34 | fails the “obviousness” test: a range of 3:5 will alert if the value is | ||
35 | outside that range, rather than inside as you would expect. | ||
36 | |||
37 | Also, the performance data returned by check\_http is always time and size. | ||
38 | Perhaps you want only time, or you want age as well. | ||
39 | |||
40 | ## Proposal | ||
41 | |||
42 | ### Thresholds | ||
43 | |||
44 | This document proposes that threshold arguments are specified like: | ||
45 | |||
46 | --threshold={threshold definition} | ||
47 | --th={threshold definition} | ||
48 | |||
49 | The threshold definition is a subgetopt format of the form: | ||
50 | |||
51 | metric={metric},ok={range},warn={range},crit={range},unit={unit},prefix={SI prefix} | ||
52 | |||
53 | Where: | ||
54 | |||
55 | - ok, warn, crit are called “levels” | ||
56 | - any of ok, warn, crit, unit or prefix are optional | ||
57 | - if ok, warning and critical are not specified, then no alert is raised, | ||
58 | but the performance data will be returned | ||
59 | - the unit can be specified with plugins that do not know about the type of | ||
60 | value returned (SNMP, Windows performance counters, etc.) | ||
61 | - the prefix is used to multiply the input range and possibly for display | ||
62 | data. The prefixes allowed are defined by NIST: | ||
63 | <http://physics.nist.gov/cuu/Units/prefixes.html> | ||
64 | <http://physics.nist.gov/cuu/Units/binary.html> | ||
65 | - ok, warning or critical can be repeated to define an additional range. | ||
66 | This allows non-continuous ranges to be defined | ||
67 | - warning can be abbreviated to warn or w | ||
68 | - critical can be abbreviated to crit or c | ||
69 | |||
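As a rough illustration of the definition format above, the following Python sketch splits a threshold definition into its parts. The function name `parse_threshold` and the returned dictionary layout are hypothetical and not part of this proposal. Treating `metric` as mandatory is an inference from the list of optional keys, and ranges are kept as unparsed strings here.

```python
def parse_threshold(definition):
    """Split a threshold definition string into metric, levels, unit and prefix."""
    # Accepted spellings for each level, following the abbreviations listed above.
    level_names = {
        "ok": "ok",
        "warning": "warning", "warn": "warning", "w": "warning",
        "critical": "critical", "crit": "critical", "c": "critical",
    }
    result = {
        "metric": None,
        "unit": None,
        "prefix": None,
        "levels": {"ok": [], "warning": [], "critical": []},
    }
    for part in definition.split(","):
        key, sep, value = part.partition("=")
        if not sep or not value:
            raise ValueError(f"expected key=value, got {part!r}")
        if key in level_names:
            # A level may be repeated; each occurrence adds another range.
            result["levels"][level_names[key]].append(value)
        elif key in ("metric", "unit", "prefix"):
            result[key] = value
        else:
            raise ValueError(f"unknown key {key!r}")
    if result["metric"] is None:
        # Assumption: metric is the only key not listed as optional.
        raise ValueError("metric is required")
    return result
```

Repeating a level, as in `metric=count,w=1..2,warn=5..6`, yields two warning ranges, which is how non-continuous ranges would be expressed.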
70 | ### Simple Range | ||
71 | |||
72 | The range values have two specifications: simple and complex. Simple ranges | ||
73 | are of the format: | ||
74 | |||
75 | start..end | ||
76 | |||
77 | Where: | ||
78 | |||
79 | - start and end must be defined | ||
80 | - start and end match the regular expression | ||
81 | /^[+-]?[0-9]+\\.?[0-9]\*$|^inf$/ (ie, a numeric or “inf”) | ||
82 | - start ≤ end | ||
83 | - if start = “inf”, this is negative infinity. This can also be written as | ||
84 | “-inf” | ||
85 | - if end = “inf”, this is positive infinity | ||
86 | - endpoints are inclusive of the range | ||
87 | - alert is raised if value is inside start and end range | ||
88 | |||
89 | (Note: this may be extended in future for adding multiple ranges using a | ||
90 | separator - I think this is catered for by repeating ok=,warn=,crit=.) | ||
91 | |||
92 | This simple range does not require quoting at the shell. | ||
93 | |||
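The simple range rules could be sketched in Python as follows. The names `parse_simple_range` and `in_simple_range` are hypothetical. The sketch accepts "-inf" as a start value because the text above permits it, even though the regular expression as written only matches "inf".

```python
import re

# Numeric endpoint pattern taken from the regular expression above.
NUMBER = r"[+-]?[0-9]+\.?[0-9]*"


def parse_endpoint(text, is_start):
    """Convert one endpoint to a float, mapping 'inf' by its position."""
    if text == "inf" or (is_start and text == "-inf"):
        return float("-inf") if is_start else float("inf")
    if not re.fullmatch(NUMBER, text):
        raise ValueError(f"not a number or 'inf': {text!r}")
    return float(text)


def parse_simple_range(text):
    """Parse 'start..end' into a (start, end) tuple of floats."""
    start_text, sep, end_text = text.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    start = parse_endpoint(start_text, is_start=True)
    end = parse_endpoint(end_text, is_start=False)
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return start, end


def in_simple_range(value, rng):
    """Endpoints are inclusive for simple ranges."""
    start, end = rng
    return start <= value <= end
```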
94 | ### Complex Range | ||
95 | |||
96 | Complex ranges are defined as: | ||
97 | |||
98 | [^](start..end) | ||
99 | |||
100 | or | ||
101 | |||
102 | [^]start..end | ||
103 | |||
104 | Where: | ||
105 | |||
106 | - start and end must be defined | ||
107 | - start and end match the regular expression | ||
108 | /\^[+-]?[0-9]+\\.?[0-9]\*\$|\^inf\$/ (ie, a numeric or “inf”) | ||
109 | - start ≤ end | ||
110 | - if start = “inf”, this is negative infinity. This can also be written as | ||
111 | “-inf” | ||
112 | - if end = “inf”, this is positive infinity | ||
113 | - endpoints are excluded from the range if () are used, otherwise endpoints | ||
114 | are included in the range | ||
115 | - alert is raised if value is within start and end range, unless \^ is used, | ||
116 | in which case alert is raised if outside the range | ||
117 | |||
118 | Note that due to shell characters, quoting may be required. | ||
119 | |||
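A complex range adds two modifiers to the simple form: parentheses for exclusive endpoints and a leading `^` to invert the match. The Python sketch below is one possible reading of those rules. The `Range` class and `parse_range` function are hypothetical names, and "-inf" is again accepted for the start value as the text allows.

```python
import re
from dataclasses import dataclass

NUMBER = r"[+-]?[0-9]+\.?[0-9]*"


@dataclass
class Range:
    start: float
    end: float
    exclusive: bool = False  # True when the range was written with ()
    inverted: bool = False   # True when the range was prefixed with ^

    def matches(self, value):
        """Return True if this range would raise an alert for value."""
        if self.exclusive:
            inside = self.start < value < self.end
        else:
            inside = self.start <= value <= self.end
        # With ^ the alert is raised outside the range instead of inside.
        return inside != self.inverted


def _endpoint(text, is_start):
    if text == "inf" or (is_start and text == "-inf"):
        return float("-inf") if is_start else float("inf")
    if not re.fullmatch(NUMBER, text):
        raise ValueError(f"not a number or 'inf': {text!r}")
    return float(text)


def parse_range(text):
    """Parse '[^](start..end)' or '[^]start..end' into a Range."""
    inverted = text.startswith("^")
    body = text[1:] if inverted else text
    exclusive = body.startswith("(") and body.endswith(")")
    if exclusive:
        body = body[1:-1]
    start_text, sep, end_text = body.partition("..")
    if not sep:
        raise ValueError(f"missing '..' in {text!r}")
    rng = Range(_endpoint(start_text, True), _endpoint(end_text, False),
                exclusive, inverted)
    if rng.start > rng.end:
        raise ValueError(f"start is greater than end in {text!r}")
    return rng
```

On the command line, forms such as `'^(3..5)'` would need quoting, since `^`, `(` and `)` are special to many shells.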
120 | ### Rules for Determining State | ||
121 | |||
122 | Given a numeric value, the state of the threshold is calculated from the | ||
123 | following ordered rules: | ||
124 | |||
125 | 1. If no levels are specified, return OK | ||
126 | 2. If an ok level is specified and value is within range, return OK | ||
127 | 3. If a critical level is specified and value is within range, return | ||
128 | CRITICAL | ||
129 | 4. If a warning level is specified and value is within range, return WARNING | ||
130 | 5. If an ok level is specified, return CRITICAL | ||
131 | 6. Otherwise return OK | ||
132 | |||
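The ordered rules above translate almost directly into code. This Python sketch is illustrative only: `get_state` is a hypothetical name, ranges are simplified to inclusive `(start, end)` tuples, and the numeric values are the conventional Nagios plugin exit codes.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def get_state(value, ok=(), warning=(), critical=()):
    """Apply the six ordered rules; each level is a list of (start, end) ranges."""
    def within(ranges):
        return any(start <= value <= end for start, end in ranges)

    if not (ok or warning or critical):
        return OK        # rule 1: no levels specified
    if ok and within(ok):
        return OK        # rule 2: an ok range matches
    if critical and within(critical):
        return CRITICAL  # rule 3: a critical range matches
    if warning and within(warning):
        return WARNING   # rule 4: a warning range matches
    if ok:
        return CRITICAL  # rule 5: ok was specified but did not match
    return OK            # rule 6: nothing matched and no ok level exists
```

Rule 5 is what lets a definition such as `ok=1..1` stand on its own: any value outside the ok range becomes CRITICAL without a separate critical range.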
133 | ### Looking Back … | ||
134 | |||
135 | So the check\_http example becomes: | ||
136 | |||
137 | check_http -H $HOSTADDRESS$ \ | ||
138 | --th metric=time,ok=0..5 \ | ||
139 | --th metric=size,ok=10..inf,prefix=Ki \ | ||
140 | --th metric=age,ok=0..1,unit=d | ||
141 | |||
142 | I believe this is more readable (I’m interested in the time, the size and the | ||
143 | age) and more consistent (I’m alerting above 5, less than 10 and above 1, | ||
144 | respectively). | ||
145 | |||
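In the size threshold above, `prefix=Ki` multiplies the range endpoints, so `ok=10..inf` is evaluated as 10240 and upwards. A minimal Python sketch of that scaling follows. `PREFIX_FACTORS` and `scale_range` are hypothetical names, and the table lists only a few of the SI and binary prefixes defined by NIST.

```python
# A subset of the SI (decimal) and binary prefixes; extend as needed.
PREFIX_FACTORS = {
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def scale_range(rng, prefix=None):
    """Multiply both endpoints of a (start, end) range by the prefix factor."""
    if prefix is None:
        return rng
    if prefix not in PREFIX_FACTORS:
        raise ValueError(f"unknown prefix {prefix!r}")
    factor = PREFIX_FACTORS[prefix]
    start, end = rng
    # Infinite endpoints stay infinite after multiplication.
    return (start * factor, end * factor)
```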
146 | In addition, performance data will only be output if the metric has been | ||
147 | specified. So only show time performance data if “--th metric=time” has been | ||
148 | specified on the command line. The warning and critical ranges can both be | ||
149 | left unspecified - this effectively means “I am not going to alert on this value, | ||
150 | but I’d like to be informed about it in the performance data”. | ||
151 | |||
152 | Because the specification for a range has changed, the warning and critical | ||
153 | parts of the performance data can no longer be guaranteed. There is an | ||
154 | additional piece of work required to fix a new format for performance data. | ||
155 | However, the basic | ||
156 | |||
157 | label=value[uom] | ||
158 | |||
159 | will still be valid. | ||
160 | |||
161 | ### Examples | ||
162 | |||
163 | Other examples. | ||
164 | |||
165 | Check that httpd processes are OK while the virtual size is at most 8096 | ||
166 | bytes. Warn up to 16182; anything bigger than that is CRITICAL. | ||
167 | |||
168 | # old | ||
169 | check_procs -w 8096 -c 16182 -C httpd --metric VSZ | ||
170 | |||
171 | # new | ||
172 | check_procs -C httpd --th metric=vsize,ok=0..8096,warn=8097..16182 | ||
173 | |||
174 | There should always be one and only one ‘tnslsnr’ process. Otherwise | ||
175 | critical. | ||
176 | |||
177 | # old | ||
178 | check_procs -w 1:1 -c 1:1 -C tnslsnr | ||
179 | |||
180 | # new | ||
181 | check_procs -C tnslsnr --th metric=count,ok=1..1 | ||
182 | |||
183 | Load averages (1,5,15 minute) should be within reasonable ranges. | ||
184 | |||
185 | # old | ||
186 | check_load -w 1.0,0.8,0.7 -c 1.5,1.3,1.0 | ||
187 | |||
188 | # new | ||
189 | check_load --th metric=1min,ok=0..1.0,warn=1.0..1.5 \ | ||
190 | --th metric=5min,ok=0..0.8,warn=0.8..1.3 \ | ||
191 | --th metric=15min,ok=0..0.7,warn=0.7..1.0 | ||
192 | |||
193 | ## Plan | ||
194 | |||
195 | I personally plan on updating check\_procs. | ||
196 | |||
197 | The basic syntax is: | ||
198 | |||
199 | check_procs [filter options] [threshold options] | ||
200 | |||
201 | Where filter options are the current -u {username}, -C {command}, etc. This | ||
202 | reduces the set of processes that are to be calculated. | ||
203 | |||
204 | The new threshold metrics will be: | ||
205 | |||
206 | - number - alert on number of matching processes. Performance data returns | ||
207 | number of processes | ||
208 | - rss-threshold - alert on rss size if any matching process is in range. | ||
209 | Perf data returns average rss | ||
210 | - rss-max - Same as --rss, but perf data returns max rss | ||
211 | - rss-sum - alert on the total rss of all matching processes. Perf data | ||
212 | returns rss\_sum | ||
213 | - vsz-threshold - alert on vsz size if any matching process is in range. | ||
214 | Perf data returns average vsz | ||
215 | - vsz-max - Same as --vsz, but perf data returns max vsz | ||
216 | - vsz-sum - alert on the total vsz of all matching processes. Perf data | ||
217 | returns vsz\_sum | ||
218 | - cpu-threshold - alert on cpu % of all matching processes. Perf data | ||
219 | returns average cpu | ||
220 | - cpu-max - Same as --cpu, but perf data returns max cpu | ||
221 | - cpu-sum - alert on total cpu. Perf data returns cpu\_sum | ||
222 | |||
223 | There will be C library routines for parsing the threshold values. | ||
224 | |||
225 | There will be C library routines for the collection and output of the | ||
226 | performance data. | ||
227 | |||
228 | ## Terminology | ||
229 | |||
230 | **metric** | ||
231 | : Something that a check is going to be measured against. For example, for | ||
232 | disk checks, it could be used or free or inodes\_free; for http checks, it | ||
233 | could be time [taken] or size; for process checks, it could be cpu or | ||
234 | number [of processes] or vsz | ||
235 | |||
236 | **range** | ||
237 | : This defines a continuous range of values for which an alert would be raised | ||
238 | |||
239 | **level** | ||
240 | : This is an alert level within Nagios - OK, WARNING or CRITICAL | ||
241 | |||
242 | **threshold** | ||
243 | : This consists of a level with a range | ||
244 | |||
245 | ## Limitations | ||
246 | |||
247 | This assumes that you are always comparing numbers as the metric values. | ||
248 | |||
249 | There may be some limitations in the precision of values. All internal logic | ||
250 | should use double precision. | ||
251 | |||
252 | If there are multiple metrics, the alert will be on an OR basis, that is, any | ||
253 | single metric which passes its threshold will cause the plugin to return a | ||
254 | failed state. | ||
255 | |||
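One way to read the OR rule for multiple metrics is that the plugin returns the worst state found across them. The Python sketch below assumes that reading. `combine_states` is a hypothetical name, and the choice that CRITICAL outranks WARNING follows the usual Nagios convention rather than anything stated in this document.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def combine_states(states):
    """Return the worst state among the per-metric results (OK if there are none)."""
    # The codes are ordered by severity, so the worst state is the maximum.
    return max(states, default=OK)
```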
256 | <!--% # vim:set filetype=markdown textwidth=78 joinspaces: # %--> | ||