[Nagiosplug-help] How to setup "delayed" host down/unreachable notifications with e.g. check_icmp?
Ralph.Grothe at itdz-berlin.de
Mon Jan 8 11:39:55 CET 2007
Dear Nagios Users,
Although I have carefully consulted three different sources of
Nagios documentation (including, in particular, the "Notifications"
section of the provided online docs), the behaviour of my Nagios
setup with regard to the check_icmp plugin that I employ, and the
related notifications, remains totally unclear to me.
Because I feared some strange state retention effects across
restarts of the Nagios daemon after config changes,
I made sure to have this explicitly set in the main config:
$ grep ^retain_state nagios.cfg
retain_state_information=0
Because I have read in several places in the docs that it is
better to let Nagios decide when to perform host checks,
I also made sure that check_interval is not defined in any of my
host definitions.
As check_command they use check_host, i.e. the check_icmp plugin
invoked under that name through a hard or soft link.
In my host and service definitions notifications_enabled is also
1, and all notification_options apart from flapping
(whose detection I disabled globally in nagios.cfg) are set, i.e.
d,u,r for hosts and w,u,c,r for services.
Check periods as well as notification periods are 24x7, and
max_check_attempts is 5 for both.
Only for the contacts did I set host_notification_options to n.
I did this because otherwise there would be a risk of floods of
host down or unreachable notifications whenever a host was
unpingable for a relatively short time, for example during a quick
reboot, or during a network or router outage on the route from the
Nagios server
(or would this already count as flapping?).
With these settings in place, this is what happens, for example,
when I temporarily take down a NIC on a host:
[1168094094] HOST ALERT: tiber;DOWN;SOFT;1;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094105] HOST ALERT: tiber;DOWN;SOFT;2;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094116] HOST ALERT: tiber;DOWN;SOFT;3;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094127] HOST ALERT: tiber;DOWN;SOFT;4;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094139] HOST ALERT: tiber;DOWN;HARD;5;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094139] SERVICE ALERT: tiber;icmp-host-alive;CRITICAL;HARD;1;CRITICAL - 123.123.123.123: rta nan, lost 100%
From the docs I would have assumed that a service notification
would be sent,
because the icmp-host-alive service went straight into a hard
critical state (i.e. a hard state change).
But this didn't happen.
Admittedly, even such a service notification wouldn't alleviate
anything, as it would still come too early,
immediately after the 5th host check attempt. Nagios scheduled
these attempts on its own at very short intervals
(from what I have read, I assume the intervals are so narrow
because all other checks are deferred while the host checks run).
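As a rough sketch (not part of my Nagios setup), the "very short
intervals" can be quantified from the log excerpt above with a few
lines of Python. The regular expression and the helper names
host_alerts and gaps are only illustrative assumptions, and only the
HOST ALERT lines are considered.

```python
import re

# HOST ALERT lines copied from the log excerpt above (wrapped lines rejoined).
LOG = """\
[1168094094] HOST ALERT: tiber;DOWN;SOFT;1;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094105] HOST ALERT: tiber;DOWN;SOFT;2;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094116] HOST ALERT: tiber;DOWN;SOFT;3;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094127] HOST ALERT: tiber;DOWN;SOFT;4;123.123.123.123 is DOWN - rta: nan, lost 100%
[1168094139] HOST ALERT: tiber;DOWN;HARD;5;123.123.123.123 is DOWN - rta: nan, lost 100%
"""

# Assumed layout: [epoch] HOST ALERT: host;state;state_type;attempt;output
ALERT = re.compile(r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);([^;]+);(\d+);")


def host_alerts(text):
    """Return (epoch, state_type, attempt) for every HOST ALERT line."""
    alerts = []
    for line in text.splitlines():
        match = ALERT.match(line)
        if match:
            alerts.append((int(match.group(1)), match.group(4), int(match.group(5))))
    return alerts


def gaps(alerts):
    """Return the seconds elapsed between consecutive alerts."""
    return [later[0] - earlier[0] for earlier, later in zip(alerts, alerts[1:])]


alerts = host_alerts(LOG)
intervals = gaps(alerts)
time_to_hard = alerts[-1][0] - alerts[0][0]
print(intervals, time_to_hard)
```

For the excerpt above this gives gaps of 11, 11, 11 and 12 seconds,
i.e. 45 seconds from the first SOFT alert to the HARD state.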
On the other hand, once a host has been confirmed to be down (or
unreachable),
I would assume that Nagios no longer schedules the icmp-host-alive
service for this host,
but instead retries its own (randomly?) scheduled host checks until
one host check returns OK
and the host returns to a HARD OK state, which in turn would
reactivate the regularly scheduled service checks.
While that host was down (until I brought the NIC up again), no
other service checks
were performed either (which seems quite in order, since they
would make little sense).
The downside, however, was that not a single notification
about the sudden unavailability of
any service related to this host was sent to the configured contacts.
So at worst such an outage would pass totally unnoticed by the
responsible admins,
which defeats the whole purpose of monitoring.
So how can one reconcile the seemingly contradictory requirements
of delayed host down notifications
and service critical notifications?
Regards
Ralph