[Deganawida-sysadmin] deganawida reliability
ski at indymedia.org
Thu Sep 25 01:19:12 PDT 2003
Ok.
So far dega has had a really shitty reliability record, and I'm really
sorry about that. I was responsible for arkansas IMC's downtime a little
over a week ago - that was a change I made without testing. I will be more
careful in the future. However, I'm afraid the other 3 downtimes we've had
don't match any kind of pattern I'm aware of, and thus I'm uncertain what
positive steps we can take to prevent further trouble:
1) 9/11 - about 5 hours downtime - libphp4.so corrupted -> apache crash
2) 9/15 - about 6 hours downtime - emergency relocation
3) 9/25 - about 6 hours downtime - freebsd arp security patch bug
Certainly we can't prepare for emergency physical relocation of the system.
And we can't *not* patch the system for security holes out of fear that
a patch will contain a bug, as that would leave the system insecure. For what
it's worth, I am pushing for the creation of a list to discuss this sort
of problem on freebsd.
The only downtime that still worries me is the libphp4.so corruption that
we experienced. We still don't know why that file got corrupted/changed. I
have set both the system immutable flag and the system undeletable flag on
this file (which is somewhat like setting a file to be read only on a
windows machine). That is to say, ordinary programs won't be able to
change or delete this file unless someone with sudo access runs "chflags
noschg libphp4.so" and "chflags nosunlnk libphp4.so" first (ie if you need
to recompile php4, take note of this). It is possible that the corruption
was due to a latent hardware bug or an unusual load condition, but it's
also possible it was just a one-time software fluke (we've upgraded php and
all of the programs in the stock freebsd install since, so if that's the
case we shouldn't see the problem again).
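For anyone who needs to rebuild php4, the lock/unlock steps described above
can be sketched roughly as follows. This assumes FreeBSD's chflags(1) and
sufficient privileges (e.g. via sudo); the function names are made up for
illustration, and the path to libphp4.so is left for the caller to supply.

```shell
# Sketch only: assumes FreeBSD chflags(1); function names are hypothetical.

# Set the system immutable (schg) and system undeletable (sunlnk) flags.
lock_module() {
    chflags schg,sunlnk "$1"
}

# Clear both flags, e.g. right before recompiling php4.
unlock_module() {
    chflags noschg,nosunlnk "$1"
}

# List the file with its current flags (the -o option on FreeBSD ls).
show_flags() {
    ls -lo "$1"
}
```

A rebuild would then look something like: unlock_module /path/to/libphp4.so,
recompile, then lock_module /path/to/libphp4.so again, with show_flags to
confirm the flags are back in place.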
Anyway, please accept my apologies for the downtimes on dega (we're
looking at about a 97.4% uptime so far - ick). I am confident that these
problems have all just been flukes and we've just been damn unlucky so
far.
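As a rough check on how a figure like 97.4% comes about: the three incidents
listed above add up to 5 + 6 + 6 = 17 hours. The length of time dega has
been in service is not stated in this message, so the 648 hours (27 days)
used below is purely an assumption, picked because it reproduces the quoted
number. The helper name is hypothetical.

```shell
# Uptime percentage from hours of downtime and total hours in service.
uptime_pct() {
    awk -v down="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", 100 * (1 - down / total) }'
}

# 17 = 5 + 6 + 6 hours from the incident list.
# 648 = 27 days * 24 hours: an ASSUMED service period, not a stated fact.
uptime_pct 17 648   # prints 97.4
```

Counting the arkansas IMC outage as well, or using a different service
period, would shift the result accordingly.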
I realize in retrospect that I "oversold" dega to some of you as a way to
solve the problems on zero, and so far it hasn't been delivering. I really
do believe that in the future we won't have this sort of problem, but I
still accept the blame for this situation.
Any ideas/rants/comments?
--
Brian Szymanski
ski at indymedia.org
bks10 at cornell.edu