[Nagiosplug-help] nrpe on solaris stops reporting exit code without nscd
Todd Fleisher
todd at fleish.org
Fri Apr 11 21:44:08 CEST 2008
I use nrpe in a mixed environment. My nagios servers that run
check_nrpe are Debian Linux and they poll a variety of systems running
mostly Debian Linux or Solaris 10 Update 4 i86. The nrpe versions vary
between 2.6 and 2.8.1, but the problem was common to all of them and
occurred only on the Solaris platform. At a certain point, I found that
although the text result of an nrpe check would report WARNING or
CRITICAL, the exit code was always set to 0. The result was that
nagios would not change the status field from OK to WARNING or
CRITICAL, but would display the text that showed the check was WARNING
or CRITICAL. This resulted in many missed notifications of alerts from
Solaris machines.
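The mismatch described above (status text says WARNING or CRITICAL, exit code says 0) can be detected mechanically. What follows is only a minimal sketch in Python, not part of nrpe or nagios: it assumes the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and assumes the status word appears somewhere in the first line of the check output. The function names and the check_nrpe path in the comment are hypothetical.

```python
import re
import subprocess

# Standard Nagios plugin return codes.
NAGIOS_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}
_STATUS_RE = re.compile(r"\b(OK|WARNING|CRITICAL|UNKNOWN)\b")


def status_from_text(output):
    """Return the exit code implied by the first status word found in the
    first line of the check output, or None if no status word is present."""
    lines = output.splitlines()
    first_line = lines[0] if lines else ""
    match = _STATUS_RE.search(first_line)
    return NAGIOS_CODES[match.group(1)] if match else None


def is_mismatch(output, exit_code):
    """True when the text names a status that disagrees with the exit code."""
    implied = status_from_text(output)
    return implied is not None and implied != exit_code


def run_and_check(argv):
    """Run a check command and report (output, exit_code, mismatch).

    argv is an assumption, e.g. a hypothetical
    ["/usr/lib/nagios/plugins/check_nrpe", "-H", "somehost", "-c", "check_disk"].
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout, result.returncode, is_mismatch(result.stdout, result.returncode)
```

Running something like this against each Solaris host from the nagios server would flag the hosts where the text and the exit code disagree, which is the symptom described here.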
Making matters worse was the fact that the problem wasn't consistent
across the environment. Though all Solaris nodes are running identical
versions of code, some would have the issue and others would not. In
the end, I found that turning on the name-service-cache service (nscd)
in Solaris fixed the issue. I then pieced together the timeline of
what must have happened:
- We originally deployed Solaris & left nscd turned on
- We installed & started nrpe
- Sometime later we disabled nscd to keep Solaris from caching DNS
information
- nrpe continued to function until it was restarted
- Hosts where nrpe had been running since before nscd was disabled
were fine, while hosts where nrpe had been restarted, or newly
installed, after nscd was disabled experienced the issue
Now for the kicker: to fix the issue while still keeping Solaris from
caching DNS information, I configured /etc/nscd.conf to disable caching
for every cache it claims to support. I then started the
name-service-cache service and confirmed that DNS was not being cached.
Here is an excerpt from /etc/nscd.conf:
# Currently supported cache names:
# audit_user, auth_attr, bootparams, ethers
# exec_attr, group, hosts, ipnodes, netmasks
# networks, passwd, printers, prof_attr, project
# protocols, rpc, services, tnrhdb, tnrhtp, user_attr
#
logfile /var/adm/nscd.log
enable-cache hosts no
enable-cache audit_user no
enable-cache auth_attr no
enable-cache bootparams no
enable-cache ethers no
enable-cache exec_attr no
enable-cache group no
enable-cache ipnodes no
enable-cache netmasks no
enable-cache networks no
enable-cache passwd no
enable-cache printers no
enable-cache prof_attr no
enable-cache project no
enable-cache protocols no
enable-cache rpc no
enable-cache services no
enable-cache tnrhdb no
enable-cache tnrhtp no
enable-cache user_attr no
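With twenty cache names it is easy to miss one. The following is a minimal sketch in Python, not an official tool, that checks an nscd.conf text for caches that lack an explicit "enable-cache NAME no" line. The function name is hypothetical, the list of names is copied from the comment in the excerpt above, and the sketch assumes that a cache with no such line is left enabled.

```python
# Cache names as listed in the "Currently supported cache names" comment.
CACHE_NAMES = [
    "audit_user", "auth_attr", "bootparams", "ethers",
    "exec_attr", "group", "hosts", "ipnodes", "netmasks",
    "networks", "passwd", "printers", "prof_attr", "project",
    "protocols", "rpc", "services", "tnrhdb", "tnrhtp", "user_attr",
]


def caches_not_disabled(conf_text, names=CACHE_NAMES):
    """Return the cache names that have no 'enable-cache NAME no' line.

    Assumption: a cache without an explicit 'no' line is treated as enabled.
    """
    disabled = set()
    for line in conf_text.splitlines():
        tokens = line.split()
        # Skip blank lines and comments.
        if not tokens or tokens[0].startswith("#"):
            continue
        if len(tokens) == 3 and tokens[0] == "enable-cache" and tokens[2] == "no":
            disabled.add(tokens[1])
    return [name for name in names if name not in disabled]
```

Fed the contents of /etc/nscd.conf, an empty result would mean every listed cache is explicitly disabled.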
I then started nrpe, and the issue was gone. My next step is to truss
the process to see if I can determine what differs between the two
scenarios. But I wanted to post this first to see if others have already
experienced the same issue. I couldn't find anything in the mailing list
archives that matched.
Thanks,
Todd