danny_b85 n00b
Joined: 16 Jun 2008 Posts: 10
Posted: Mon Sep 01, 2008 10:09 am Post subject: High load average caused by high %iowait
Hi everyone,
I'm having a problem with a couple of servers on which the load average climbs to values such as 17 and even 22. I've checked and rechecked, and the problem wasn't being caused by high CPU usage from programs (that's what surprised me) or by the RAM being too full and pushing the system into swap, but by high %iowait values in the output of iostat.
From what I know (and please correct me if I'm wrong), %iowait goes up when information needs to be written to disk and either there's too much of it at once and the disk bandwidth gets eaten up, or the information is scattered across different parts of the disk and the writes make the HDD's actuators move the read/write heads frantically, which causes the same problem.
I've noticed that %iowait goes up when one of the HDDs (for some reason) drops out of the software RAID 1 array and the rebuild process is running, or when the RAM is full and swap starts getting used, but neither is the case now. Because the problem has manifested itself on several servers, I'm thinking it could be some kind of exploit of one or several of the programs running on them. It may be a long shot and I may just be paranoid, but being paranoid is part of the daily life of a sysadmin.
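For reference, a quick way to confirm that no array is currently degraded or rebuilding (a rough sketch; the md device name is just an example):

Code:
# Show the state of all software RAID arrays; a rebuild in progress
# appears as a "recovery = ..." progress line, a dropped disk as [U_]:
cat /proc/mdstat

# Per-array detail (md0 is only an example):
mdadm --detail /dev/md0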
Now for my question: how can I track which process is causing the high disk usage? I need something similar to the output of iostat, but at the process level, since iostat reports disk usage for the whole system.
Please also advise on any other possibilities that could lead to solving this problem.
Thank you in advance.
Later edit: this is what I mean by high %iowait:
Code:
avg-cpu:  %user   %nice    %sys  %iowait   %idle
           3.74    0.00    2.99    82.04   11.22

Device:  rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz    await  svctm  %util
hda        0.00   79.21  15.84  132.67   110.89  1063.37   55.45   531.68      7.91    188.89  1308.77   6.67  99.11
hdb        0.00    0.00   0.00    0.00     0.00     0.00    0.00     0.00      0.00      0.00     0.00   0.00   0.00
hdc        0.00   79.21  12.87  174.26   102.97  1071.29   51.49   535.64      6.28    119.43  1100.41   5.30  99.11
md0        0.00    0.00   3.96   59.41    31.68   475.25   15.84   237.62      8.00      0.00     0.00   0.00   0.00
md4        0.00    0.00   0.00   24.75     0.00    49.50    0.00    24.75      2.00      0.00     0.00   0.00   0.00
md1        0.00    0.00   0.00    0.00     0.00     0.00    0.00     0.00      0.00      0.00     0.00   0.00   0.00
md3        0.00    0.00  22.77   64.36   182.18   514.85   91.09   257.43      8.00      0.00     0.00   0.00   0.00
md2        0.00    0.00   0.00    0.00     0.00     0.00    0.00     0.00      0.00      0.00     0.00   0.00   0.00
md5        0.00    0.00   0.00    0.00     0.00     0.00    0.00     0.00      0.00      0.00     0.00   0.00   0.00
ibins n00b
Joined: 27 Jul 2007 Posts: 27
Posted: Mon Sep 01, 2008 8:36 pm Post subject:
Hi,
you can try "sys-process/iotop", but you will need kernel 2.6.20 or higher with TASK_IO_ACCOUNTING support enabled.
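For what it's worth, a quick way to check whether the running kernel qualifies before emerging it (just a sketch; it assumes the kernel exports its config via /proc/config.gz, i.e. CONFIG_IKCONFIG_PROC is set; otherwise grep the .config in the kernel source tree):

Code:
# The running kernel must be 2.6.20 or newer:
uname -r
# Per-task I/O accounting must be compiled in:
zgrep TASK_IO_ACCOUNTING /proc/config.gz
# If both check out, install and run it; -o limits the display
# to processes that are actually doing I/O at that moment:
emerge sys-process/iotop
iotop -o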
HeissFuss Guru
Joined: 11 Jan 2005 Posts: 414
Posted: Tue Sep 02, 2008 10:13 pm Post subject:
You may also be able to tell by looking at process states, e.g. with 'ps aux' or htop sorted by the S (state) column.
A state of D indicates uninterruptible sleep. If there's a non-filesystem process consistently in that state, that's probably the culprit.
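For example, a rough way to keep sampling for D-state processes (just a sketch; any polling interval will do):

Code:
# Print a timestamp plus every process currently in uninterruptible
# sleep (state D); names that keep reappearing are the suspects:
while true; do
    ps -eo state,pid,comm | awk -v t="$(date +%T)" '$1 ~ /^D/ {print t, $0}'
    sleep 1
done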
danny_b85 n00b
Joined: 16 Jun 2008 Posts: 10
Posted: Wed Sep 03, 2008 2:01 pm Post subject:
@ibins
Great suggestion with iotop; unfortunately it doesn't help me, because I'm not running a 2.6.20 or newer kernel.
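One possible alternative on older kernels is the kernel's block_dump facility: writing 1 to /proc/sys/vm/block_dump makes the kernel log every block read/write (and inode dirtying) together with the name of the responsible process, readable via dmesg. A rough sketch; the syslog init script name is just an example, and syslog should be stopped first or its own writes will flood the log:

Code:
# Stop syslog first, otherwise it ends up logging its own log writes:
/etc/init.d/sysklogd stop
echo 1 > /proc/sys/vm/block_dump    # start logging block I/O
sleep 30                            # let some disk activity accumulate
dmesg | grep -E 'READ|WRITE|dirtied'
echo 0 > /proc/sys/vm/block_dump    # turn it off again
/etc/init.d/sysklogd start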
Quote: If there's a non-filesystem process consistently in that state, that's probably the culprit.
A non-filesystem process such as Apache?
HeissFuss Guru
Joined: 11 Jan 2005 Posts: 414
Posted: Wed Sep 03, 2008 2:22 pm Post subject:
danny_b85 wrote: A non-filesystem process such as Apache?
Even on a webserver, Apache shouldn't be consistently in that state; if it is, that's probably the cause of the I/O wait.
Is Apache serving pages/files over NFS or another network filesystem, by chance?
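If you're not sure offhand, listing network mounts from the server is one quick check (a sketch):

Code:
# List NFS mounts, if any; an empty result suggests the docroot
# lives on local disk:
mount -t nfs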