[axxs-sysadmin] borg analysis and sudo request.
Andrew McNaughton
andrew at scoop.co.nz
Tue Mar 9 02:25:24 PST 2010
Dave Fregon wrote:
> On Mon, 2010-03-08 at 23:34 -0800, chummy fleming wrote:
>
>> ...is down...
>>
> slackbastard is on borg, I couldn't ssh in but after it ground away for
> a bit it sent me a number of emails like this, then started up again
>
> Dave
>
> -------- Forwarded Message --------
> From: Cron Daemon <root at borg.axxs.org>
> To: root at borg.axxs.org
> Subject: Cron <munin at borg> if [ -x /usr/bin/munin-cron ];
> then /usr/bin/munin-cron; fi
> Date: Tue, 09 Mar 2010 07:44:59 +0000
>
> Lock already exists: /var/run/munin/munin-update.lock. Dying.
> Lock already exists: /var/run/munin/munin-graph.lock. Dying.
> Lock already exists: /var/run/munin/munin-html.lock. Dying
I'd interpret that as probably meaning one of two things:
1) Munin processes from one cron run were still running when the next
munin cron job came due, i.e. they ran slowly, presumably because of the
same system load issue that made the website unavailable (presumably
other sites were down too?).
2) Munin could have died mid-run and left its lock files behind, e.g. if
memory and swap ran out entirely and the OOM killer came into play.
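One way to tell those two cases apart is to check whether whatever created each lock is still alive. The following is only a rough sketch, not something I have run on borg: it assumes each lock file contains nothing but the PID of the process that created it (I have not confirmed that for the munin version installed there), and the function name lock_state is made up for illustration.

```python
import os


def lock_state(path):
    """Classify a lock file as 'absent', 'held', 'stale' or 'unknown'.

    Assumption: the file holds only the PID of its creator.
    """
    try:
        with open(path) as fh:
            text = fh.read().strip()
    except FileNotFoundError:
        return "absent"
    try:
        pid = int(text)
    except ValueError:
        return "unknown"  # not a bare PID, so the assumption does not hold
    if pid <= 0:
        return "unknown"  # never signal process groups by accident
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return "stale"  # creator is gone: consistent with case 2
    except PermissionError:
        return "held"  # process exists but belongs to another user
    return "held"  # creator still running: consistent with case 1


if __name__ == "__main__":
    # Lock paths taken from the cron mail quoted above.
    for name in ("update", "graph", "html"):
        path = "/var/run/munin/munin-%s.lock" % name
        print(path, lock_state(path))
```

A "held" result while the box is struggling would point at slow runs overlapping; "stale" would point at munin having been killed. PIDs can be reused, so treat "held" as a hint rather than proof.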
Munin data is visible at
http://borg.axxs.org/munin/localdomain/localhost.localdomain.html
Just before the outage, there's a significant spike in application
memory use. There's also a spike in interrupts from "3w-xxxx". After a
bit of digging around, I found this in /var/log/dmesg.
[ 25.423440] 3w-xxxx: AEN: ERROR: Unit degraded: Unit #0.
[ 25.872495] 3w-xxxx: scsi0: Found a 3ware Storage Controller at 0xa800,
IRQ: 18.
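For going through older kernel logs, something like the following could pull out the controller's AEN messages. It is a sketch that assumes the other log lines follow the same "[ timestamp] 3w-xxxx: AEN: LEVEL: message" shape as the line above; I have only seen the ERROR level here, so other levels are a guess, and find_aen_events is a made-up name.

```python
import re

# Pattern modelled on the single AEN line seen in /var/log/dmesg above.
AEN_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+3w-xxxx: AEN: (?P<level>[A-Z]+): (?P<msg>.*)$"
)


def find_aen_events(log_text):
    """Return (seconds_since_boot, level, message) for each AEN line."""
    events = []
    for line in log_text.splitlines():
        match = AEN_RE.match(line.strip())
        if match:
            events.append(
                (float(match["ts"]), match["level"], match["msg"])
            )
    return events


if __name__ == "__main__":
    with open("/var/log/dmesg") as fh:
        for ts, level, msg in find_aen_events(fh.read()):
            print("%12.6f %s %s" % (ts, level, msg))
```

If "Unit degraded" shows up across several boots, that would suggest the array has been degraded for a while rather than just since this outage.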
The kernel logs might be interesting, as would the drives' S.M.A.R.T.
data. This thread suggests that drives can get out of sync without
actually having failed:
http://lists.leap-cf.org/pipermail/leaplist/2009-May/006703.html
Also of interest, this post has some commands that may be of use for
re-syncing the drive:
https://secure.bonkabonka.com/blog/2008/01/03/and_remember_this_is_for_posterity_so_be_honest.html
This bug report suggests a kernel bug may be involved, which a kernel
update would correct. The problematic kernel, driver and hardware
versions line up with what we have at present:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=518812
There's a bunch of things I would have looked at if I had access. Can I
have sudo on borg please?
Andrew