jba (n00b; joined 22 Sep 2005; 35 posts; new york, ny)
Posted: Thu Oct 29, 2009 5:30 am    Post subject: Weird behavior with iozone, ext4, 2.6.30-r5, megaraid_sas.
Hi all,
Fairly weird problem. I'm benchmarking a new server that will run Postgres: a Dell R710 with 48 GB of RAM and dual Intel X5570 CPUs, one of the bigger boxes I've had Gentoo on.
It's connected to a Dell MD1120 via a PERC 6/E SAS RAID controller, with 13 15k RPM drives in it (12 in a RAID 10 set, plus a hot spare).
I'm running 2.6.30-gentoo-r5 with a pretty normal config, and generally the system seems stable.
Through the benchmarks I've been throwing at it, I've run into a weird issue as they finish.
Here's a sample IOzone run with a 64k stripe size:
Code: |
db3 data # iozone -i 0 -i 1 -i 2 -i 8 -s 10g -T -t 8
Iozone: Performance Test of File I/O
Version $Revision: 3.242 $
Compiled for 64 bit mode.
Build: linux-AMD64
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million,
Jean-Marc Zucconi, Jeff Blomberg,
Erik Habbinga, Kris Strecker, Walter Wong.
Run began: Thu Oct 29 00:50:20 2009
File size set to 10485760 KB
Command line used: iozone -i 0 -i 1 -i 2 -i 8 -s 10g -T -t 8
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
Throughput test with 8 threads
Each thread writes a 10485760 Kbyte file in 4 Kbyte records
Children see throughput for 8 initial writers = 588562.83 KB/sec
Parent sees throughput for 8 initial writers = 474765.53 KB/sec
Min throughput per thread = 70407.44 KB/sec
Max throughput per thread = 75926.89 KB/sec
Avg throughput per thread = 73570.35 KB/sec
Min xfer = 9727024.00 KB
Children see throughput for 8 rewriters = 578675.23 KB/sec
Parent sees throughput for 8 rewriters = 513802.44 KB/sec
Min throughput per thread = 71805.98 KB/sec
Max throughput per thread = 72727.81 KB/sec
Avg throughput per thread = 72334.40 KB/sec
Min xfer = 10352288.00 KB
Children see throughput for 8 readers = 231325.92 KB/sec
Parent sees throughput for 8 readers = 231319.97 KB/sec
Min throughput per thread = 28134.68 KB/sec
Max throughput per thread = 30442.98 KB/sec
Avg throughput per thread = 28915.74 KB/sec
Min xfer = 9690864.00 KB
Children see throughput for 8 re-readers = 232281.25 KB/sec
Parent sees throughput for 8 re-readers = 232276.11 KB/sec
Min throughput per thread = 28333.96 KB/sec
Max throughput per thread = 29996.86 KB/sec
Avg throughput per thread = 29035.16 KB/sec
Min xfer = 9904752.00 KB
|
Not bad numbers (the fact that readers are slower than writers is another issue), but as the last test finishes the system gets very slow, and I start to see this in dstat:
Code: |
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 93 6 0 0| 12M 0 | 560B 428B| 0 0 |5621 5083
0 0 94 6 0 0| 12M 32k| 696B 1128B| 0 0 |5638 5042
0 0 93 6 0 0| 12M 0 | 560B 492B| 0 0 |5698 5117
0 0 93 7 0 0| 13M 0 |1132B 428B| 0 0 |5783 5188
0 0 94 6 0 0| 12M 0 |1211B 428B| 0 0 |5638 5050
0 0 93 7 0 0| 12M 0 | 876B 428B| 0 0 |5666 5131
0 0 93 6 0 0| 12M 0 | 706B 428B| 0 0 |5611 5014
0 0 93 6 0 0| 12M 0 | 684B 428B| 0 0 |5652 5098
0 0 93 7 0 0| 12M 0 | 816B 428B| 0 0 |5662 5125
0 0 94 6 0 0| 12M 0 | 684B 428B| 0 0 |5664 5125
0 0 92 7 0 0| 12M 0 | 368B 428B| 0 0 |5668 5016
0 0 94 6 0 0| 12M 0 | 620B 428B| 0 0 |5639 5089
0 0 92 7 0 0| 12M 0 | 706B 428B| 0 0 |5670 5079
0 0 94 6 0 0| 12M 0 | 766B 428B| 0 0 |5675 5122
0 0 94 6 0 0| 12M 0 | 624B 428B| 0 0 |5672 5129
0 0 93 7 0 0| 12M 0 | 620B 428B| 0 0 |5628 5022
0 0 93 6 0 0| 12M 0 | 368B 428B| 0 0 |5678 5136
0 0 94 6 0 0| 12M 0 | 556B 428B| 0 0 |5730 5126
0 0 92 7 0 0| 12M 0 | 624B 428B| 0 0 |5688 5164
0 0 94 6 0 0| 12M 0 | 620B 428B| 0 0 |5626 5090
0 0 93 7 0 0| 12M 0 | 496B 428B| 0 0 |5633 5077
0 0 93 6 0 0| 12M 0 | 894B 428B| 0 0 |5684 5094
0 0 94 6 0 0| 12M 0 | 578B 428B| 0 0 |5671 5103
0 0 93 7 0 0| 12M 0 | 906B 428B| 0 0 |5687 5128
0 0 94 6 0 0| 12M 0 |1072B 428B| 0 0 |5724 5113
0 0 94 6 0 0| 12M 0 | 812B 428B| 0 0 |5640 5080
0 0 93 7 0 0| 12M 0 | 752B 428B| 0 0 |5671 5125
0 0 93 6 0 0| 12M 0 | 748B 428B| 0 0 |5723 5177
0 0 93 6 0 0| 12M 0 | 624B 428B| 0 0 |5633 5082
0 0 93 7 0 0| 12M 0 | 958B 428B| 0 0 |5622 5080
0 0 94 6 0 0| 11M 0 | 560B 428B| 0 0 |5579 4952
0 0 92 8 0 0| 12M 0 | 574B 428B| 0 0 |5677 5057
0 0 94 6 0 0| 12M 0 | 496B 428B| 0 0 |5628 5056
0 0 93 7 0 0| 12M 0 | 876B 428B| 0 0 |5639 5051
|
Almost perfectly consistent 12 MB/s reads, and any access to the tested volume becomes extremely lagged, to the point where the system almost locks up. Even an 'ls' on that volume can take 10 minutes to complete, yet iowait stays below 10%. (wtf?)
Eventually the iozone run completes, but many minutes (hours?) after the last test has already finished.
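I haven't captured this during the hang yet, but next time it happens I'll check how much dirty/writeback data the kernel is sitting on, and what the dirty thresholds are set to, something like:
Code: |
# how much dirty / in-flight writeback data is the kernel holding?
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
# and what are the dirty-page thresholds?
sysctl vm.dirty_ratio vm.dirty_background_ratio
|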
If I strace the iozone processes, you can see them doing tons of lseeks and reads:
Code: |
[pid 4710] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4710] lseek(6, 7012470784, SEEK_SET) = 7012470784
[pid 4710] read(6, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4710] lseek(6, 5325385728, SEEK_SET) = 5325385728
[pid 4710] read(6, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4710] lseek(6, 2260058112, SEEK_SET) = 2260058112
[pid 4710] read(6, <unfinished ...>
[pid 4708] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4708] lseek(5, 9403219968, SEEK_SET) = 9403219968
[pid 4708] read(5, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4708] lseek(5, 2561925120, SEEK_SET) = 2561925120
[pid 4708] read(5, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4708] lseek(5, 2125836288, SEEK_SET) = 2125836288
[pid 4708] read(5, <unfinished ...>
[pid 4715] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4715] lseek(10, 9672151040, SEEK_SET) = 9672151040
[pid 4715] read(10, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4715] lseek(10, 3074637824, SEEK_SET) = 3074637824
[pid 4715] read(10, <unfinished ...>
[pid 4713] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4713] lseek(7, 5254766592, SEEK_SET) = 5254766592
[pid 4713] read(7, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4713] lseek(7, 1041141760, SEEK_SET) = 1041141760
[pid 4713] read(7, <unfinished ...>
[pid 4709] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4709] lseek(4, 2355765248, SEEK_SET) = 2355765248
[pid 4709] read(4, <unfinished ...>
[pid 4711] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4711] lseek(3, 632438784, SEEK_SET) = 632438784
[pid 4711] read(3, <unfinished ...>
[pid 4712] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4712] lseek(8, 8557805568, SEEK_SET) = 8557805568
[pid 4712] read(8, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4712] lseek(8, 8371425280, SEEK_SET) = 8371425280
[pid 4712] read(8, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4712] lseek(8, 4976214016, SEEK_SET) = 4976214016
[pid 4712] read(8, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4712] lseek(8, 1170989056, SEEK_SET) = 1170989056
[pid 4712] read(8, <unfinished ...>
[pid 4714] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4714] lseek(9, 2156314624, SEEK_SET) = 2156314624
[pid 4714] read(9, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4714] lseek(9, 9273806848, SEEK_SET) = 9273806848
[pid 4714] read(9, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4714] lseek(9, 4972687360, SEEK_SET) = 4972687360
[pid 4714] read(9, "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4714] lseek(9, 2344914944, SEEK_SET) = 2344914944
[pid 4714] read(9, <unfinished ...>
[pid 4715] <... read resumed> "\245\245\245\245\245\245\245\245\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
[pid 4715] lseek(10, 72925184, SEEK_SET) = 72925184
[pid 4715] read(10, <unfinished ...>
|
Nothing there explains why the system is so lagged; as far as I can tell that's just the random-read phase doing 4 KB reads at random offsets.
I haven't tested with other filesystems yet. Maybe it's an ext4 bug, but that seems dubious.
Any ideas where else I can look?
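In the meantime, here's what I'm planning to poke at the next time it's wedged (assuming magic sysrq is enabled and sysstat is installed):
Code: |
# dump the stacks of tasks stuck in uninterruptible sleep
echo w > /proc/sysrq-trigger && dmesg | tail -n 80
# per-device utilisation / await while the volume is lagged
iostat -x 1 10
|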
TIA, as always.
jba
Posted: Thu Oct 29, 2009 9:25 pm
An update: the problem appears to be extremely slow flushing of filesystem buffers. The RAID device is extremely fast, so I don't quite understand this:
Code: |
db3 foo # time dd if=/dev/zero of=bigfile bs=8k count=44779372 ; time sync
44779372+0 records in
44779372+0 records out
366832615424 bytes (367 GB) copied, 657.944 s, 558 MB/s
real 10m58.016s
user 0m5.992s
sys 7m7.905s
real 12m8.595s
user 0m0.000s
sys 12m8.202s
|
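For a rough sense of how slow that flush is (back-of-the-envelope, assuming the worst case that every byte of the 48 GB of RAM was still dirty page cache when sync started):
Code: |
# 48 GB flushed over the ~728 s (12m08s) that sync took:
echo $(( 48 * 1024 / 728 ))   # => 67, i.e. at most ~67 MB/s effective flush rate
# in reality vm.dirty_ratio caps dirty pages at a fraction of RAM, so the real
# flush rate is even lower -- nowhere near the ~550 MB/s the dd itself sustained
|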
Keep in mind the box has 48 GB of RAM, so given the size of that write (367 GB), buffering shouldn't inflate the initial dd numbers much. Any ideas? Anyone!? Bueller?!
jba
Posted: Fri Oct 30, 2009 12:51 am
Wow, this looks like a pretty serious regression with ext4. Reformatted with XFS:
Code: |
[root@db3][08:07:51][/data]> time dd if=/dev/zero of=bigfile bs=8k count=44779372 ; time sync
44779372+0 records in
44779372+0 records out
366832615424 bytes (367 GB) copied, 660.52 s, 555 MB/s
real 11m7.174s
user 0m5.829s
sys 7m20.474s
real 0m23.431s
user 0m0.000s
sys 0m4.006s
|
That's much more like it. I'm going to screw around with other kernel versions now.
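Before I pin it all on the kernel, one ext4 variable I want to isolate when I put ext4 back on this volume is write barriers (just an idea at this point, I haven't tried it yet):
Code: |
# quick A/B test with barriers off (volume mounted at /data)
mount -o remount,barrier=0 /data
time dd if=/dev/zero of=bigfile bs=8k count=44779372 ; time sync
# put barriers back on afterwards
mount -o remount,barrier=1 /data
|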
jba
Posted: Fri Oct 30, 2009 1:17 am
Oh yeah, confirmed: massive ext4 regression. With the volume formatted as ext4 again:
Code: |
[root@db3][08:54:21][/data]> time dd if=/dev/zero of=bigfile bs=8k count=44779372 ; time sync
44779372+0 records in
44779372+0 records out
366832615424 bytes (367 GB) copied, 650.501 s, 564 MB/s
real 10m50.589s
user 0m6.356s
sys 7m16.458s
real 10m28.523s
user 0m0.000s
sys 10m28.211s
|
Pretty giant fail. I wonder if this is worthy of a Gentoo kernel bug report.
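If I do file one, this is roughly what I'd attach (note: /dev/sdX1 below is a placeholder, I'd substitute whatever the array actually shows up as):
Code: |
uname -a
emerge --info
grep data /proc/mounts
dumpe2fs -h /dev/sdX1   # placeholder device node for the RAID 10 volume
|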
jba
Posted: Fri Oct 30, 2009 4:09 am
Btw, downgrading to 2.6.28-hardened fixed this issue.
With 2.6.28-hardened and ext4:
Code: |
[root@db3][09:58:34][/data]> time dd if=/dev/zero of=bigfile bs=8k count=44779372 ; time sync
44779372+0 records in
44779372+0 records out
366832615424 bytes (367 GB) copied, 666.51 s, 550 MB/s
real 11m6.589s
user 0m7.502s
sys 7m24.853s
real 0m6.865s
user 0m0.000s
sys 0m0.001s
|
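To sum up the sync times from the dd runs above:
Code: |
2.6.30-gentoo-r5 + ext4  :  sync 12m08s (10m28s on the re-test)
2.6.30-gentoo-r5 + XFS   :  sync  0m23s
2.6.28-hardened  + ext4  :  sync  0m07s
|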