[Nagiosplug-help] Tracking down pthread/check_dns problem on CentOS4 w/ 1.4.2 plugins.

Ton Voon ton.voon at altinity.com
Tue Nov 29 13:24:01 CET 2005


On 29 Nov 2005, at 16:40, John P. Rouillard wrote:

>> Are you saying that if you run it 10 times, it is 100% successful?
>
> If I run "run_tests 10" 10 times, about 2 of the 10-iteration runs
> fail on average, but I have had a streak of 15 error-free runs. I am
> just guessing, but it may be load related. If I pause between the runs,
> it seems less likely to happen. However, I never had a run of 1000 pass.
>
>> I'm happy with increasing the number of iterations if it catches the
>> problem more of the time.
>
> While 1000 may be overkill, I am seeing a 50% detection of failure
> when running it in a while loop. The 10 iteration version is failing
> less often. I didn't try 100 or 500.
>
> However, I did a bit more testing. The results aren't reliable. I have
> had 20 runs of "run_tests 10" fail in a row and 20 pass in a row. As
> the number passed to run_tests goes up, I have fewer passes, but no
> definite way of determining if the problem exists. E.g. with
> a single run of "run_tests 500" I got the following distribution:
>
>       1 Success=372 Fail=128
>       1 Success=400 Fail=100
>       2 Success=496 Fail=4
>       1 Success=498 Fail=2
>       1 Success=499 Fail=1
>      14 Success=500 Fail=0
> 70% success (14 of 20 runs clean). For a "run_tests 10", I get:
>
>      19 Success=10 Fail=0
>       1 Success=7 Fail=3
> 95% success or
>
>       2 Success=10 Fail=0
>       5 Success=5 Fail=5
>       3 Success=6 Fail=4
>       4 Success=7 Fail=3
>       6 Success=8 Fail=2
> 10% success or
>
>       5 Success=5 Fail=5
>       4 Success=6 Fail=4
>       4 Success=7 Fail=3
>       5 Success=8 Fail=2
>       2 Success=9 Fail=1
> 0% success.
>
> For a count of 1000 I got:
>       5 Success=1000 Fail=0
>       1 Success=780 Fail=220
>       1 Success=986 Fail=14
>       1 Success=990 Fail=10
>       1 Success=995 Fail=5
>       2 Success=996 Fail=4
>       6 Success=997 Fail=3
>       3 Success=999 Fail=1
> 25% success or
>
>       9 Success=1000 Fail=0
>       1 Success=833 Fail=167
>       1 Success=944 Fail=56
>       1 Success=990 Fail=10
>       1 Success=996 Fail=4
>       1 Success=997 Fail=3
>       2 Success=998 Fail=2
>       4 Success=999 Fail=1
> 45% success.
>
> Not sure if the data is of any use, but more runs seem to be better.
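
For anyone repeating these measurements, the "success" percentages quoted above can be recomputed from the tallies. The following is only a sketch: it assumes each tally line has the form "<runs> Success=<n> Fail=<m>" as shown above, and the function name clean_run_fraction is made up for illustration.

```python
import re

# Matches one tally line such as "      9 Success=1000 Fail=0".
TALLY_LINE = re.compile(r"\s*(\d+)\s+Success=(\d+)\s+Fail=(\d+)")

def clean_run_fraction(tally: str) -> float:
    """Return the fraction of runs that finished with Fail=0."""
    total = 0
    clean = 0
    for line in tally.splitlines():
        match = TALLY_LINE.match(line)
        if match is None:
            continue  # skip blank or unrelated lines
        runs, _successes, failures = (int(g) for g in match.groups())
        total += runs
        if failures == 0:
            clean += runs
    if total == 0:
        raise ValueError("no tally lines found")
    return clean / total

# Second "count of 1000" distribution, copied from the message above.
sample = """
      9 Success=1000 Fail=0
      1 Success=833 Fail=167
      1 Success=944 Fail=56
      1 Success=990 Fail=10
      1 Success=996 Fail=4
      1 Success=997 Fail=3
      2 Success=998 Fail=2
      4 Success=999 Fail=1
"""

print(f"{clean_run_fraction(sample):.0%}")  # 9 clean runs out of 20 -> 45%
```

Note that this counts a run as a "success" only when every iteration in it passed, which matches how the percentages above appear to be derived.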

I agree this is a pain to detect. If there are any ideas on a better
test, I'm all ears.

What about running 100 batches of 10 iterations each? If there is any
failure, break out and apply the fix. If all 100 are okay, then assume
the system is okay.
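
A rough sketch of that stop-on-first-failure loop, with some assumptions spelled out: the helper names (run_batch, first_failing_batch, miss_probability) are invented for illustration, and run_batch assumes that "./run_tests 10" exits non-zero when any iteration fails, which has not been confirmed in this thread.

```python
import subprocess
from typing import Callable, Optional

def run_batch() -> bool:
    """Hypothetical wrapper: run one batch of 10 iterations.

    Assumes ./run_tests exits non-zero if any iteration failed.
    """
    return subprocess.run(["./run_tests", "10"]).returncode == 0

def first_failing_batch(batch: Callable[[], bool],
                        max_batches: int = 100) -> Optional[int]:
    """Run up to max_batches batches and stop at the first failure.

    Returns the 1-based number of the failing batch, or None if
    every batch passed.
    """
    for n in range(1, max_batches + 1):
        if not batch():
            return n
    return None

def miss_probability(p_batch_fail: float, batches: int = 100) -> float:
    """Chance that an affected system passes every batch.

    Assumes batches fail independently with probability p_batch_fail.
    """
    return (1.0 - p_batch_fail) ** batches

# Demonstration with a stub instead of the real test script.
outcomes = iter([True, True, False])
print(first_failing_batch(lambda: next(outcomes)))  # 3

# Best case reported above for "run_tests 10": 1 failing run in 20 (5%).
print(miss_probability(0.05, 100))  # roughly 0.006
```

On a real system the call would be first_failing_batch(run_batch). The miss estimate is only as good as the independence assumption: the streaks of 20 failures and 20 passes reported above suggest the failures may be clustered (possibly load related), so the true chance of missing the problem could be higher.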

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon





