[axxs-sysadmin] borg analysis and sudo request.
andrew at scoop.co.nz
Tue Mar 9 07:24:03 PST 2010
Andrew McNaughton wrote:
> dave at netaxxs.com.au wrote:
>> I added you to sudo, sorry I can't do much atm racing to get ready for
>> a convention heading to melbourne tomorrow morning ..
> SMART isn't telling us much. The out-of-date version of smartctl in the
> smartmontools distribution isn't good, but probably isn't critical.
> In the kernel log I can see this:
> Mar 9 07:46:17 borg kernel: [902049.129135] Out of memory: kill process
> 9727 (mysqld) score 59930 or a child
> Mar 9 07:46:17 borg kernel: [902049.129193] Killed process 9727 (mysqld)
> I.e. we did run out of memory, including swap. Looks like it's been
> happening for a while. Surprising we haven't seen more problems.
> Disk performance could lead to a pile up of processes, so it's not ruled
> out as a cause. If the disk error is correct, then we don't have
> functional RAID redundancy, and performance may be impacted both by lack
> of the redundant drive and perhaps by the RAID controller attempting to
> rebuild. Or this could be a furphy.
> In any case, where the system doesn't keep up, it should not just keep
> spawning processes. Will look into memory tuning next before worrying
> about the disk.
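
To check how long the OOM kills have been going on, something like the
following could pull the "Killed process" entries out of the kernel log.
This is only a sketch: the regular expression is modelled on the two
log lines quoted above, and the function name find_oom_kills is made up
for illustration.

```python
import re

# Pattern modelled on the "Killed process" line quoted from borg's
# kernel log; other kernel versions may word this differently.
OOM_KILL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*[\d.]+\]\s+Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(lines):
    """Return (timestamp, pid, process name) for each OOM kill found."""
    kills = []
    for line in lines:
        match = OOM_KILL.match(line)
        if match:
            kills.append(
                (match.group("stamp"), int(match.group("pid")),
                 match.group("name"))
            )
    return kills


# The two log lines from the message above, rejoined onto single lines.
sample = [
    "Mar 9 07:46:17 borg kernel: [902049.129135] Out of memory: "
    "kill process 9727 (mysqld) score 59930 or a child",
    "Mar 9 07:46:17 borg kernel: [902049.129193] Killed process 9727 "
    "(mysqld)",
]
print(find_oom_kills(sample))
# prints [('Mar 9 07:46:17', 9727, 'mysqld')]
```

Feeding it the lines of the full kernel log instead of the sample would
list every kill with its timestamp.
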
Apache was set to spawn more processes than it had space for in main
memory, given the size of the processes. I've limited MaxProcesses to 40.
I have also set the number of requests each Apache process will serve
before respawning, which will reduce the size of the processes. One or
both of these may be able to be relaxed a bit in future.
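
The sizing arithmetic behind a process limit like this can be sketched
as below. The RAM, reserve and per-process figures are hypothetical,
chosen only so the example comes out at 40; they were not measured on
borg, and max_apache_workers is an illustrative name.

```python
def max_apache_workers(total_ram_mb, reserved_mb, per_process_mb):
    """Return how many Apache processes fit in RAM without swapping.

    reserved_mb is memory set aside for everything else on the box
    (database, OS, cron jobs).
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        return 0
    return usable_mb // per_process_mb


# Hypothetical figures: 4096 MB of RAM, 1536 MB reserved for mysqld and
# the OS, 64 MB per Apache process.
print(max_apache_workers(4096, 1536, 64))  # prints 40
```

Measuring the real per-process size under load and rerunning the sum
would show whether the limit can be relaxed.
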
The timing of our outage coincides with the /etc/cron.daily/* jobs,
which create a bit of load.
I tracked it down to /etc/cron.daily/locate, which runs a find process
that takes up about 850MB of memory and takes ages. Things spiral down
from there.
I've replaced the locate package with the mlocate package, which is more
efficient in many respects. The cron job runs *heaps* faster, and
watching in top, the highest memory use I saw was 107MB.
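
Rather than watching top, the peak memory of a job can also be read
back after it finishes. The sketch below is one way to do that and
assumes Linux, where ru_maxrss is reported in kilobytes; the helper
name peak_child_rss_mb is made up, and the command it runs here is a
harmless stand-in rather than the real cron script.

```python
import resource
import subprocess
import sys


def peak_child_rss_mb(argv):
    """Run argv to completion and return peak child RSS in MB.

    RUSAGE_CHILDREN covers every child this process has waited for,
    so the figure is the largest child seen so far, not necessarily
    the one just run. Assumes Linux, where ru_maxrss is in kilobytes.
    """
    subprocess.run(argv, check=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak_kb / 1024.0


# Stand-in command; on the server this would be the cron job itself,
# e.g. the script under /etc/cron.daily/.
print(peak_child_rss_mb([sys.executable, "-c", "pass"]))
```

Running it from a fresh interpreter per job keeps the figure specific
to that job.
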
I haven't figured out why this job was taking so much memory, or why it
was different to before. It could be a change in disk performance, and
the disk error report is still a significant concern.