[Nagiosplug-help] Usage of check_procs
Ralph.Grothe at itdz-berlin.de
Tue Sep 11 17:03:00 CEST 2007
Dear Nagiosplug Users/Hackers,
I am currently puzzled about the intended correct usage of the
check_procs plugin.
The help screen of the plugin isn't all that helpful.
(I haven't yet looked at the implementation in the code)
Actually, I need to monitor a process whose command name is known
beforehand, as it is prone (due to an unfixed bug in the release we
run?) to hog an entire CPU at 100%. The platform is an HP9000
multi-CPU server running HP-UX 11.11.
From the plugin's help screen, I started like this:
$ /usr/local/nagios/libexec/check_procs -m CPU -w 5:10 -c 11:
CPU CRITICAL: 339 crit, 0 warn out of 339 processes
But reading the process table with ps gives quite different results.
(OK, I acknowledge that check_procs might use another syscall, maybe
pstat(), but the differences shouldn't be that blatant.)
$ UNIX95= ps -e -o pid,ppid,uid,time,state,cpu,pcpu,comm|awk 'NR==1||$7>1'|sort -n -k 7,7
  PID  PPID UID     TIME S  C  %CPU COMMAND
28985     1   0    25:22 S  0  1.37 saposcol
27337 27113 203 04:54:28 S  8  1.64 oracleZ01
27336 27113 203 02:57:16 S  8  2.55 oracleZ01
 6953     1 203    01:27 S  1  2.78 oracleZ01
17430     1 203    07:40 S 29  3.40 oracleZ01
14016     1 203    26:51 S  0  6.99 oracleZ01
29566     1 203    11:34 S 66 12.46 oracleZ01
27335 27113 203 14:12:13 R 67 22.81 oracleZ01
27334 27113 203 14:21:29 S 64 22.93 oracleZ01
Maybe I forgot the % unit specifier?
But that makes no difference:
$ /usr/local/nagios/libexec/check_procs -m CPU -w 5:10% -c 11:%
CPU CRITICAL: 337 crit, 0 warn out of 337 processes
Well, at least the proc count seems right ;-)
$ UNIX95= ps -e -o pid=|wc -w
328
Then I tried the ominous -P switch.
But I cannot fathom why the (mandatory) warn and crit ranges are then
still necessary.
Anyway, no difference.
$ /usr/local/nagios/libexec/check_procs -m CPU -P 1 -w 5:10% -c 11:%
CPU OK: 0 processes with PCPU >= 1.00
But what I really want to achieve is to monitor this beast:
$ UNIX95= ps -C dmisp -o pid,ppid,uid,time,state,cpu,pcpu,comm
PID PPID UID TIME S C %CPU COMMAND
1347 1 0 02:17 R 0 0.21 dmisp
As can be seen, it is behaving right now, but it will eventually grab
100%.
So I tried this:
$ /usr/local/nagios/libexec/check_procs -m CPU -C dmisp -w 5:10% -c 11:%
CPU CRITICAL: 1 crit, 0 warn out of 1 process with command name 'dmisp'
Why critical when it is still down at 0.21%?
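One possible reading, assuming check_procs applies the range notation from the Nagios plugin guidelines (a state is raised when the value lies OUTSIDE the given range): "11:" would then mean "critical when below 11", which every idle process satisfies. The helper below is a hypothetical sketch of that convention, not code taken from check_procs, and it leaves out the "@" (inverted range) form.

```shell
# Hypothetical helper, NOT part of check_procs: prints "alert" when VALUE
# lies outside RANGE, using the Nagios plugin guidelines' range notation.
#   "N"    means 0..N        (alert if value < 0 or value > N)
#   "N:"   means N..infinity (alert if value < N)
#   "~:N"  means -inf..N     (alert if value > N)
#   "N:M"  means N..M        (alert if value < N or value > M)
outside_range() {
    awk -v v="$1" -v r="$2" 'BEGIN {
        n = split(r, part, ":")
        if (n == 1) { lo = "0"; hi = part[1] }   # bare "N" means 0..N
        else        { lo = part[1]; hi = part[2] }
        if (lo == "") lo = "0"                   # omitted start defaults to 0
        below = (lo != "~" && v + 0 < lo + 0)    # "~" means no lower bound
        above = (hi != ""  && v + 0 > hi + 0)    # empty end means no upper bound
        print ((below || above) ? "alert" : "ok")
    }'
}

outside_range 0.21 11:    # prints "alert": 0.21 is below 11
outside_range 0.21 5:10   # prints "alert": 0.21 is below 5
outside_range 0.21 10     # prints "ok":    0.21 is within 0..10
outside_range 22.81 10    # prints "alert": 22.81 is above 10
```

If that reading is right, upper-bound-only thresholds such as "-w 5 -c 10" would be the form to try for catching a CPU hog; this is an assumption to verify against the plugin source.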
This also makes no sense:
$ /usr/local/nagios/libexec/check_procs -m CPU -C dmisp -P 0.2
CPU OK: 0 processes with command name 'dmisp', PCPU >= 0.20
Could anyone demystify check_procs for me and show its correct usage
to catch the CPU hog?
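Until that is cleared up, a stopgap could be built on the ps call that already returns plausible numbers. The sketch below is only an illustration: cpu_state is a hypothetical name, the thresholds 5 and 10 are examples, and the exit codes follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical wrapper, NOT part of the Nagios plugins.
# Reads one %CPU value per line on stdin and compares the highest one
# against the WARN and CRIT upper bounds given as arguments.
cpu_state() {
    awk -v w="$1" -v c="$2" '
        BEGIN { max = 0 }
        $1 + 0 > max { max = $1 + 0 }          # track the highest %CPU seen
        END {
            if      (NR == 0)     { print "UNKNOWN: no matching process"; exit 3 }
            else if (max > c + 0) { print "CRITICAL: max pcpu " max;      exit 2 }
            else if (max > w + 0) { print "WARNING: max pcpu " max;       exit 1 }
            else                  { print "OK: max pcpu " max;            exit 0 }
        }'
}

# Intended use on the HP-UX box (not executed here):
#   UNIX95= ps -C dmisp -o pcpu= | cpu_state 5 10
```

Feeding it the value from the dmisp listing above, `printf '0.21\n' | cpu_state 5 10` prints "OK: max pcpu 0.21" and returns 0.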
Regards
Ralph