sdauth l33t
Joined: 19 Sep 2018 Posts: 658 Location: Ásgarðr
Posted: Wed Jun 26, 2024 11:01 pm Post subject: SAS HBA, random drive reset under load, io scheduler issue ? |
Hello,
I have a server with 8 HDDs attached to a SAS HBA (Dell PERC H310 flashed with IT firmware). No RAID is used: each hard disk has its own filesystem and mountpoint.
When the system is under high load with a lot of reads and writes, one disk (not the same one each time!) resets itself for no apparent reason.
This time it was /dev/sdg. When it happened, I was running a long SMART test on 4 drives and had just started a copy to another disk. /dev/sdg was totally idle.
Other than that, the airflow is good, temperatures are OK, and I have already replaced the mini-SAS and power cables.
Code: | [111738.379589] sd 0:0:6:0: device_block, handle(0x000f)
[111740.379542] sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
[111740.406332] sd 0:0:6:0: [sdg] Synchronizing SCSI cache
[111740.406369] sd 0:0:6:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[111740.407030] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[111740.407040] mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
[111740.407044] mpt2sas_cm0: enclosure logical id(0x5d8ae120b146b700), slot(5)
[111758.130732] mpt2sas_cm0: handle(0xf) sas_address(0x4433221106000000) port_type(0x1)
[111758.385681] scsi 0:0:8:0: Direct-Access ATA WDC WD180EDGZ-11 0A85 PQ: 0 ANSI: 6
[111758.385704] scsi 0:0:8:0: SATA: handle(0x000f), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[111758.385707] scsi 0:0:8:0: enclosure logical id (0x5d8ac140b146b500), slot(5)
[111758.385808] scsi 0:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[111758.385812] scsi 0:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[111758.454308] sd 0:0:8:0: Power-on or device reset occurred
[111758.455266] sd 0:0:8:0: Attached scsi generic sg6 type 0
[111758.457603] sd 0:0:8:0: [sdj] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)
[111758.457611] sd 0:0:8:0: [sdj] 4096-byte physical blocks
[111758.475549] end_device-0:8: add: handle(0x000f), sas_addr(0x4433221106000000)
[111758.481002] sd 0:0:8:0: [sdj] Write Protect is off
[111758.481011] sd 0:0:8:0: [sdj] Mode Sense: 7f 00 10 08
[111758.487995] sd 0:0:8:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[111758.561591] sdj: sdj1
[111758.561773] sd 0:0:8:0: [sdj] Attached SCSI disk |
Can it be related to the I/O scheduler in use (MQ_IOSCHED_DEADLINE)? Should I switch to BFQ? Any ideas or tips to troubleshoot this?
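To see how often this happens and which slot is affected, the kernel log can be filtered down to the lines that mark a drive dropping off the HBA and coming back. This is only a rough sketch: the patterns are copied from the log above, and the function name `filter_resets` is made up for illustration.

```shell
#!/bin/sh
# Keep only the kernel log lines (read from stdin) that mark a drive being
# dropped by the HBA or re-attached after a reset.
# The patterns are taken from the log excerpt in this post.
filter_resets() {
    grep -E 'device_block|removing handle|DID_NO_CONNECT|Power-on or device reset occurred'
}

# Typical use (reading the kernel log usually needs root):
#   dmesg | filter_resets
```

Comparing the `sd H:C:T:L` address and `slot(N)` across several incidents should show whether the resets follow one port or cable, or really hit random drives.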
Last edited by sdauth on Fri Jun 28, 2024 9:16 am; edited 1 time in total |
sdauth l33t
Joined: 19 Sep 2018 Posts: 658 Location: Ásgarðr
Posted: Thu Jun 27, 2024 7:44 am Post subject: |
I recompiled my kernel with BFQ built-in and added this udev rule :
/etc/udev/rules.d/60-ioschedulers.rules
Code: | # HDD
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# SSD
ACTION=="add|change", KERNEL=="sd[a-z]*|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="bfq" |
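The rules only fire on `add` or `change` events, so on a running system they need to be reloaded and re-triggered (or the machine rebooted). A minimal sketch, assuming `udevadm` is available; the wrapper name `apply_udev_rules` and the `UDEVADM` override are made up for illustration.

```shell
#!/bin/sh
# Reload the udev rules and replay a "change" event for all block devices so
# that the scheduler rules are applied without a reboot.
# UDEVADM can be overridden (e.g. UDEVADM=echo) for a dry run.
apply_udev_rules() {
    "${UDEVADM:-udevadm}" control --reload &&
    "${UDEVADM:-udevadm}" trigger --subsystem-match=block --action=change
}

# Typical use (needs root):
#   apply_udev_rules
```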
grep "" /sys/block/*/queue/scheduler
Code: | /sys/block/sda/queue/scheduler:none [bfq]
/sys/block/sdb/queue/scheduler:none [bfq]
/sys/block/sdc/queue/scheduler:none [bfq]
/sys/block/sdd/queue/scheduler:none [bfq]
/sys/block/sde/queue/scheduler:none [bfq]
/sys/block/sdf/queue/scheduler:none [bfq]
/sys/block/sdg/queue/scheduler:none [bfq]
/sys/block/sdh/queue/scheduler:none [bfq]
/sys/block/sdi/queue/scheduler:none [bfq] |
Looks good.
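With more disks, checking this by eye gets tedious. The sketch below pulls the active scheduler (the name in brackets) out of each `queue/scheduler` file and reports any `sd*` disk that is not on the expected one. The function names and the optional sysfs root argument are made up for illustration.

```shell
#!/bin/sh
# Print the active scheduler of one queue/scheduler file, i.e. the name shown
# in [brackets] in content such as "none [bfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Report every sd* disk under the given sysfs root (default: /sys/block)
# whose active scheduler differs from the expected one.
check_schedulers() {
    want=$1
    root=${2:-/sys/block}
    for f in "$root"/sd*/queue/scheduler; do
        [ -r "$f" ] || continue
        got=$(active_scheduler "$f")
        [ "$got" = "$want" ] || echo "$f: $got (expected $want)"
    done
}

# Typical use; prints nothing when every disk is on bfq:
#   check_schedulers bfq
```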
Right now I'm finishing copying a lot of files to one drive. So far so good, but the system isn't under much stress yet. Once that's done, I will start an extended SMART test again on the same four drives and stress the card with some file copies between disks to see if the problem happens again.
On the other hand, I wonder if the Dell PERC H310 isn't simply overheating. There is no temperature sensor on the card, but the last time I powered off this server, the card was hot as hell. Unfortunately I can't mount a fan on it because of a USB3 PCI Express card just below it in a PCIe x1 slot. Maybe I can mount a 120mm fan on the case panel to push some air across the PCIe slots, though.
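For the repeat test, the extended self-tests can be started in one go. A minimal sketch, assuming smartmontools is installed; the function name, the `SMARTCTL` override and the device names in the example are placeholders, not the actual four drives.

```shell
#!/bin/sh
# Start an extended (long) SMART self-test on every device given as argument.
# SMARTCTL can be overridden (e.g. SMARTCTL=echo) for a dry run.
start_long_tests() {
    for dev in "$@"; do
        "${SMARTCTL:-smartctl}" -t long "$dev"
    done
}

# Typical use (needs root, device names are placeholders):
#   start_long_tests /dev/sda /dev/sdb /dev/sdc /dev/sdd
#   dmesg --follow    # watch for drive resets while tests and copies run
```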
sdauth l33t
Joined: 19 Sep 2018 Posts: 658 Location: Ásgarðr
Posted: Fri Jun 28, 2024 1:39 pm Post subject: |
Switching to the bfq I/O scheduler instead of mq-deadline seems to do wonders. Almost 48 hours of uptime with concurrent SMART tests (1 drive still not finished) and various rsync copies (local and remote) on different HDDs, and not a single drive reset yet.