[Nagiosplug-help] How to setup "delayed" host down/unreachable notifications with e.g. check_icmp?
Andreas Ericsson
ae at op5.se
Mon Jan 8 13:20:22 CET 2007
Ralph.Grothe at itdz-berlin.de wrote:
> Dear Nagios Users,
>
>
> As check_command in them I have used the check_host designation
> of the check_icmp plugin (i.e. hard or soft link).
>
> In my host and service definitions also notifications_enabled is
> 1, all notification_options apart from flapping
> (whose detection I disabled globally in nagios.cfg) are set (i.e.
> d,u,r for hosts, and w,u,c,r for services),
> check as well as notification periods are 24x7, and
> max_check_attempts for both is 5.
>
> Only for the contacts did I set host_notification_options to n.
> This was because otherwise there could be the peril of down or
> unreachable host notification
> floods in case a host was unpingable for a relatively short time,
> like during a quick reboot
> or network or router outage with respect to the route from the
> nagios server
> (or would this already account for flapping?).
>
No. Flapping is when something changes state more than a configured
percentage of the time (the high flap threshold, e.g.
high_host_flap_threshold) over the last 21 executions of its check.
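
To make the flapping definition above concrete, here is a rough Python sketch. It is an illustration only, not Nagios code: the function names are made up, the 20.0 threshold is just an example value, and the calculation is unweighted (Nagios itself gives more recent state changes more weight).

```python
from typing import Sequence


def percent_state_change(states: Sequence[str]) -> float:
    """Unweighted percentage of transitions between consecutive checks."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(states: Sequence[str],
                high_threshold: float = 20.0,  # example value, an assumption
                history: int = 21) -> bool:
    """True if the last `history` check results changed state too often."""
    recent = list(states)[-history:]
    return percent_state_change(recent) > high_threshold


# One short outage in an otherwise stable history: 2 changes in 20 slots.
single_blip = ["UP"] * 10 + ["DOWN"] + ["UP"] * 10
# A host that changes state on every check.
bouncing = ["UP", "DOWN"] * 10 + ["UP"]

print(percent_state_change(single_blip), is_flapping(single_blip))
print(percent_state_change(bouncing), is_flapping(bouncing))
```

Under these assumptions a single quick reboot stays well below the threshold, so it would not be treated as flapping.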
>
> With these in place this is what happens for example if I down a
> NIC on a host temporarily:
>
> [1168094094] HOST ALERT: tiber;DOWN;SOFT;1;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094105] HOST ALERT: tiber;DOWN;SOFT;2;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094116] HOST ALERT: tiber;DOWN;SOFT;3;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094127] HOST ALERT: tiber;DOWN;SOFT;4;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094139] HOST ALERT: tiber;DOWN;HARD;5;123.123.123.123 is
> DOWN - rta: nan, lost 100%
> [1168094139] SERVICE ALERT:
> tiber;icmp-host-alive;CRITICAL;HARD;1;CRITICAL - 123.123.123.123:
> rta nan, lost 100%
>
>
> From the docs I would have assumed that a service notification
> would be emitted
> because the icmp-host-alive service transited right into a hard
> critical state (i.e. hard state change).
> But this didn't happen.
>
Service notifications are suppressed for hosts that are down. This is to
prevent a flood of notifications when hosts go down.
> Admittedly, even such a service notification wouldn't alleviate
> anything as it would still come too early,
Precisely, and you wouldn't get just one, but several notifications (one
for each service).
>
> On the other hand, once a host was confirmed to be down (or
> unreachable)
> I would assume that nagios wouldn't schedule the icmp-host-alive
> service for this host anymore
> but would instead reattempt its own (randomly?) scheduled host checks
> until one host_check packet returned OK
> and the host went back to a HARD OK state, which in turn would
> reactivate regularly scheduled service checks.
>
> While that host was down (until I brought the NIC up again) no
> other service checks
> were performed either (which seems quite in order, because what sense
> would they make?).
> The downside, however, was that not a single notification
> about the sudden unavailability of
> any service related to this host was sent to the configured contacts.
> So such an outage would at worst pass totally unnoticed by the
> responsible admins
> which defeats the whole purpose of monitoring.
>
Errors passing unnoticed certainly defeats the purpose of monitoring,
but you explicitly told Nagios not to notify you about host down events,
so it's doing The Right Thing(tm).
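
As a rough illustration of why nothing was sent at all, here is a simplified Python model of the two rules described above: service notifications are suppressed while the host is down, and a contact with host_notification_options set to n gets no host notifications. The function and its arguments are hypothetical and ignore everything else Nagios looks at (time periods, escalations, renotification and so on).

```python
from typing import Iterable, List, Set


def notifications_sent(host_is_up: bool,
                       critical_services: Iterable[str],
                       contact_host_options: Set[str]) -> List[str]:
    """Simplified model of which notifications reach one contact."""
    sent: List[str] = []
    if host_is_up:
        # Host is fine, so each failing service notifies on its own.
        sent += [f"SERVICE {name} CRITICAL" for name in critical_services]
    elif "d" in contact_host_options:
        # Host is down: service notifications are suppressed and only
        # the host notification goes out, if the contact accepts it.
        sent.append("HOST DOWN")
    return sent


services = ["icmp-host-alive"]
# host_notification_options n, modelled here as an empty set
print(notifications_sent(False, services, set()))
# host_notification_options d,u,r
print(notifications_sent(False, services, {"d", "u", "r"}))
```

With the empty option set the list is empty, which matches what was observed; with d enabled the contact gets a single host notification instead of one per service.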
> So how can one reconcile the seemingly contradicting requirements
> of delayed host down notifications
> and service critical notifications?
>
Enable host down notifications for all contacts.
To prevent host notifications going out for temporary glitches (e.g. a
reboot), use the first_host_notification_delay patch coded by Mathias
Sundman and sent in by me to the nagios-devel list. You'll find it in
the archives somewhere and it has been incorporated into the Nagios 3
codebase.
What it does, basically, is to add a new variable called
first_host_notification_delay to host objects. When a host goes down,
the first notification for that host is delayed until *at least* the
configured time has passed. I say at least, because Nagios doesn't even
look at the value until it does another check of the same host and
notices that it's still down. If, by then, first_host_notification_delay
* interval_length seconds have passed, it will send a notification.
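
The "at least" part can be sketched in Python like this. It is a model of the behaviour described above, not the patch itself: the function names are hypothetical, interval_length=60 is assumed, and went_down_at is assumed to be the time of the HARD DOWN alert (the exact starting point of the timer is not spelled out here).

```python
from typing import Iterable, Optional


def should_notify(went_down_at: int, checked_at: int,
                  first_host_notification_delay: int,
                  interval_length: int = 60) -> bool:
    """Evaluated only when a host check finds the host still down."""
    required = first_host_notification_delay * interval_length
    return checked_at - went_down_at >= required


def first_notification_time(went_down_at: int, check_times: Iterable[int],
                            first_host_notification_delay: int,
                            interval_length: int = 60) -> Optional[int]:
    """Time of the first still-down check at which the delay has passed."""
    for checked_at in sorted(check_times):
        if should_notify(went_down_at, checked_at,
                         first_host_notification_delay, interval_length):
            return checked_at
    return None  # the host recovered, or was not rechecked, before the delay


down = 1168094139  # HARD DOWN timestamp from the log quoted above
# Assume the host is rechecked every 240 seconds while it stays down.
checks = [down + 240 * k for k in range(1, 5)]
when = first_notification_time(down, checks, first_host_notification_delay=5)
print(when - down)
```

With a delay of 5 (300 seconds) and a recheck every 240 seconds, the notification goes out at the 480 second check, not at exactly 300 seconds.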
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231