sdauth l33t
Joined: 19 Sep 2018 Posts: 667 Location: Ásgarðr
Posted: Wed Jun 26, 2024 11:01 pm Post subject: SAS HBA, random drive reset under load, io scheduler issue ? |
Hello,
I have a server with 8 HDDs attached to a SAS HBA (Dell PERC H310 flashed with IT firmware). No RAID is used: each hard disk has its own filesystem and mountpoint.
When the system is under high load with a lot of reads and writes, one disk (not the same one each time!) resets itself for no apparent reason.
This time it was /dev/sdg. When it happened, I was running a long SMART test on 4 drives and had started a copy to another disk. /dev/sdg was totally idle.
Other than that, the airflow is good, temperatures are OK, and I have already replaced the mini-SAS and power cables.
Code: | [111738.379589] sd 0:0:6:0: device_block, handle(0x000f)
[111740.379542] sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
[111740.406332] sd 0:0:6:0: [sdg] Synchronizing SCSI cache
[111740.406369] sd 0:0:6:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[111740.407030] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[111740.407040] mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
[111740.407044] mpt2sas_cm0: enclosure logical id(0x5d8ae120b146b700), slot(5)
[111758.130732] mpt2sas_cm0: handle(0xf) sas_address(0x4433221106000000) port_type(0x1)
[111758.385681] scsi 0:0:8:0: Direct-Access ATA WDC WD180EDGZ-11 0A85 PQ: 0 ANSI: 6
[111758.385704] scsi 0:0:8:0: SATA: handle(0x000f), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[111758.385707] scsi 0:0:8:0: enclosure logical id (0x5d8ac140b146b500), slot(5)
[111758.385808] scsi 0:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[111758.385812] scsi 0:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[111758.454308] sd 0:0:8:0: Power-on or device reset occurred
[111758.455266] sd 0:0:8:0: Attached scsi generic sg6 type 0
[111758.457603] sd 0:0:8:0: [sdj] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)
[111758.457611] sd 0:0:8:0: [sdj] 4096-byte physical blocks
[111758.475549] end_device-0:8: add: handle(0x000f), sas_addr(0x4433221106000000)
[111758.481002] sd 0:0:8:0: [sdj] Write Protect is off
[111758.481011] sd 0:0:8:0: [sdj] Mode Sense: 7f 00 10 08
[111758.487995] sd 0:0:8:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[111758.561591] sdj: sdj1
[111758.561773] sd 0:0:8:0: [sdj] Attached SCSI disk |
Can it be related to the I/O scheduler in use (MQ_IOSCHED_DEADLINE)? Should I switch to BFQ? Any ideas or tips for troubleshooting this?
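As a troubleshooting aid, a small sketch like the one below can tally how often each HBA port drops out, based on the kernel log. It keys on the "removing handle(...), sas_addr(...)" lines exactly as they appear in the log above, because the SCSI address changes after a reset (0:0:6:0 became 0:0:8:0 here) while the sas_addr stays the same. The function name count_port_removals is made up for this example, and the pattern assumes the mpt2sas/mpt3sas message format shown above; other drivers log differently.

```shell
#!/bin/sh
# Read kernel log text on stdin and count device removals per SAS address.
# Assumes lines of the form:
#   mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
count_port_removals() {
    sed -n 's/.*removing handle([^)]*), sas_addr(\(0x[0-9a-fA-F]*\)).*/\1/p' |
        sort | uniq -c
}

# Example (run as a user allowed to read the kernel log):
#   dmesg | count_port_removals
```

If one sas_addr dominates the count, the problem is more likely tied to that port, cable, or drive than to the card as a whole.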
Last edited by sdauth on Fri Jun 28, 2024 9:16 am; edited 1 time in total |
sdauth l33t
Joined: 19 Sep 2018 Posts: 667 Location: Ásgarðr
Posted: Thu Jun 27, 2024 7:44 am
I recompiled my kernel with BFQ built in and added this udev rule:
/etc/udev/rules.d/60-ioschedulers.rules
Code: | # HDD
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# SSD
ACTION=="add|change", KERNEL=="sd[a-z]*|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="bfq" |
grep "" /sys/block/*/queue/scheduler
Code: | /sys/block/sda/queue/scheduler:none [bfq]
/sys/block/sdb/queue/scheduler:none [bfq]
/sys/block/sdc/queue/scheduler:none [bfq]
/sys/block/sdd/queue/scheduler:none [bfq]
/sys/block/sde/queue/scheduler:none [bfq]
/sys/block/sdf/queue/scheduler:none [bfq]
/sys/block/sdg/queue/scheduler:none [bfq]
/sys/block/sdh/queue/scheduler:none [bfq]
/sys/block/sdi/queue/scheduler:none [bfq] |
Looks good.
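The same check can be scripted, which is handy after adding or re-attaching disks. This is only a sketch: the function names active_scheduler and check_schedulers are invented for the example, and it assumes the usual sysfs format where the active scheduler is shown in square brackets.

```shell
#!/bin/sh
# Print the active scheduler from a sysfs scheduler line on stdin,
# e.g. "none [bfq]" prints "bfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report every block device whose active scheduler differs from the expected one.
# Usage: check_schedulers <expected> [sysfs block root, default /sys/block]
check_schedulers() {
    expected="$1"
    sysroot="${2:-/sys/block}"
    for f in "$sysroot"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        active=$(active_scheduler < "$f")
        if [ "$active" != "$expected" ]; then
            echo "$f: active scheduler is '$active', expected '$expected'"
        fi
    done
}
```

Calling check_schedulers bfq prints nothing when every disk uses bfq, and one line per disk otherwise.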
Right now I'm finishing copying a lot of files to one drive; so far so good, but the load is not very high yet. Once that is done, I will run an extended SMART test on the same four drives again and stress the card with some file copies between disks to see whether the problem comes back.
On the other hand, I wonder whether the Dell PERC H310 is simply overheating. There is no sensor for the card, but the last time I powered off this server, the card was hot as hell. Unfortunately I can't mount a fan on it because of a USB 3 PCI Express card just below it in a PCIe x1 slot. Maybe I can mount a 120 mm fan on the case panel to push some air over the PCIe slots, though.
sdauth l33t
Joined: 19 Sep 2018 Posts: 667 Location: Ásgarðr
Posted: Fri Jun 28, 2024 1:39 pm
Switching to the bfq I/O scheduler instead of mq-deadline seems to do wonders. Almost 48 hours of uptime with concurrent SMART tests (1 drive still not finished) and various rsync copies (local and remote) on different HDDs, and not a single drive reset yet.
sdauth l33t
Joined: 19 Sep 2018 Posts: 667 Location: Ásgarðr
Posted: Wed Nov 27, 2024 12:00 pm
Just a quick update in case someone finds this thread through a web search and thinks that setting bfq is enough: it is not.
In early September, a drive on the SAS HBA reset again, strangely the same 18 TB drive as before.
You can find a couple of threads online about this drive, so maybe it is buggy firmware? Who knows.
Here is what I did:
- Replaced the thermal paste on the SAS HBA. The LSI chipset can get extremely hot. With IT firmware you can't read the temperature sensors (only with IR firmware), but you can literally burn yourself by touching the heatsink for a fraction of a second, so something is not right.
After removing the heatsink, I found the old thermal pad was almost completely gone, so I cleaned the heatsink surface and the chipset, then applied new thermal paste. It is still hot, but much less than before. (I still need to find a way to push some air over it, though.)
- Replaced the mini-SAS to SATA cables again, and the Molex to SATA cables too.
While this seemed to help for a few weeks, it happened again in mid-October. I started to get really mad.
Finally, what seems to have helped was turning off TLER on this specific drive.
Actually, I turned it off for all WDC drives with this udev rule:
/etc/udev/rules.d/50-disks.rules
Code: | ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC*", RUN+="/usr/sbin/smartctl -l scterc,0,0 /dev/%k" |
This might not be desirable if you're using RAID or ZFS (not used in my case). Please comment on that if you have more info.
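To confirm that the setting was actually applied after a reboot or a hot re-attach, something like the following sketch may help. It lists disks whose sysfs model string matches a glob, mirroring the ATTRS{model}=="WDC*" match in the rule above, so that each one can be queried with smartctl -l scterc (without values, this option only reports the current setting). The helper name disks_by_model is hypothetical.

```shell
#!/bin/sh
# Print the names of sd* disks whose sysfs model string matches a shell glob.
# Usage: disks_by_model <glob> [sysfs block root, default /sys/block]
disks_by_model() {
    pattern="$1"
    sysroot="${2:-/sys/block}"
    for m in "$sysroot"/sd*/device/model; do
        [ -r "$m" ] || continue
        model=$(cat "$m")
        case "$model" in
            $pattern)
                disk=${m#"$sysroot"/}   # strip the sysfs root prefix
                echo "${disk%%/*}"      # keep only the disk name, e.g. sdg
                ;;
        esac
    done
}

# Example (needs root and smartmontools):
#   for d in $(disks_by_model 'WDC*'); do smartctl -l scterc "/dev/$d"; done
```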
Another possibility is to raise the SCSI timeouts (the defaults are 30 seconds for timeout and 10 seconds for eh_timeout).
(Note that this example only targets WDC drives.)
Code: | ACTION=="add|change", SUBSYSTEM=="block", ATTRS{model}=="WDC*", RUN+="/bin/sh -c 'echo 180 >/sys/$DEVPATH/device/timeout'"
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{model}=="WDC*", RUN+="/bin/sh -c 'echo 60 >/sys/$DEVPATH/device/eh_timeout'" |
To check that it is correctly applied:
Code: | find /sys/class/scsi_disk/*/device/*timeout -exec grep -H . '{}' \; |
But for now, I have not tried that, since turning off TLER *seems to* (= for now) fix the issue. (Hopefully I will not have to update this thread with bad news.)
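For a quick trial before committing to udev rules, the same two sysfs attributes can be written by hand as root. A minimal sketch, assuming the disk exposes device/timeout and device/eh_timeout under /sys/block; the function name set_scsi_timeouts and the optional sysfs-root argument (only there to make it testable) are made up for this example. Values written this way are lost on reboot or when the disk re-attaches.

```shell
#!/bin/sh
# Write the SCSI command timeout and error-handler timeout (in seconds) for one disk.
# Usage: set_scsi_timeouts <disk> <timeout> <eh_timeout> [sysfs block root]
set_scsi_timeouts() {
    disk="$1"
    timeout="$2"
    eh_timeout="$3"
    sysroot="${4:-/sys/block}"
    devdir="$sysroot/$disk/device"
    if [ ! -w "$devdir/timeout" ] || [ ! -w "$devdir/eh_timeout" ]; then
        echo "cannot write timeouts for $disk (not a SCSI disk, or not root?)" >&2
        return 1
    fi
    echo "$timeout" > "$devdir/timeout"
    echo "$eh_timeout" > "$devdir/eh_timeout"
    # Read the values back so the result is visible.
    echo "$disk: timeout=$(cat "$devdir/timeout") eh_timeout=$(cat "$devdir/eh_timeout")"
}

# Example (as root): set_scsi_timeouts sdg 180 60
```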
Cheers
edit: (Little update) After one month of uptime with heavy disk activity, not a single drive reset; the system is really stable. Hopefully it stays like this.