Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
SAS HBA, random drive reset under load, io scheduler issue ?
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Other Things Gentoo
View previous topic :: View next topic  
Author Message
sdauth
l33t
l33t


Joined: 19 Sep 2018
Posts: 666
Location: Ásgarðr

PostPosted: Wed Jun 26, 2024 11:01 pm    Post subject: SAS HBA, random drive reset under load, io scheduler issue ? Reply with quote

Hello,
I have a server with 8 hdd attached to a SAS HBA (Dell Perc H310 with IT firmware), no raid is used : for each hard disk, a filesystem and a mountpoint.
When the system is under high load with lot of r/w, I have a disk (not the same each time !) which reset itself for no reason.
This time it was /dev/sdg, when it happened, I was running a (long) smart test on 4 drives and started a copy to an other disk. /dev/sdg was totally idle..
Other than that, the airflow is good, temp are ok and I already replaced the mini sas and power cables.. :o

Code:
[111738.379589] sd 0:0:6:0: device_block, handle(0x000f)
[111740.379542] sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
[111740.406332] sd 0:0:6:0: [sdg] Synchronizing SCSI cache
[111740.406369] sd 0:0:6:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[111740.407030] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[111740.407040] mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
[111740.407044] mpt2sas_cm0: enclosure logical id(0x5d8ae120b146b700), slot(5)
[111758.130732] mpt2sas_cm0: handle(0xf) sas_address(0x4433221106000000) port_type(0x1)
[111758.385681] scsi 0:0:8:0: Direct-Access     ATA      WDC WD180EDGZ-11 0A85 PQ: 0 ANSI: 6
[111758.385704] scsi 0:0:8:0: SATA: handle(0x000f), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[111758.385707] scsi 0:0:8:0: enclosure logical id (0x5d8ac140b146b500), slot(5)
[111758.385808] scsi 0:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[111758.385812] scsi 0:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[111758.454308] sd 0:0:8:0: Power-on or device reset occurred
[111758.455266] sd 0:0:8:0: Attached scsi generic sg6 type 0
[111758.457603] sd 0:0:8:0: [sdj] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)
[111758.457611] sd 0:0:8:0: [sdj] 4096-byte physical blocks
[111758.475549]  end_device-0:8: add: handle(0x000f), sas_addr(0x4433221106000000)
[111758.481002] sd 0:0:8:0: [sdj] Write Protect is off
[111758.481011] sd 0:0:8:0: [sdj] Mode Sense: 7f 00 10 08
[111758.487995] sd 0:0:8:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[111758.561591]  sdj: sdj1
[111758.561773] sd 0:0:8:0: [sdj] Attached SCSI disk


Can it be related to the io scheduler used (MQ_IOSCHED_DEADLINE) ? Should I switch to BFQ ? Any idea or tip to troubleshoot this ?


Last edited by sdauth on Fri Jun 28, 2024 9:16 am; edited 1 time in total
Back to top
View user's profile Send private message
sdauth
l33t
l33t


Joined: 19 Sep 2018
Posts: 666
Location: Ásgarðr

PostPosted: Thu Jun 27, 2024 7:44 am    Post subject: Reply with quote

I recompiled my kernel with BFQ built-in and added this udev rule :
/etc/udev/rules.d/60-ioschedulers.rules
Code:
# HDD
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# SSD
ACTION=="add|change", KERNEL=="sd[a-z]*|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="bfq"


grep "" /sys/block/*/queue/scheduler
Code:
/sys/block/sda/queue/scheduler:none [bfq]
/sys/block/sdb/queue/scheduler:none [bfq]
/sys/block/sdc/queue/scheduler:none [bfq]
/sys/block/sdd/queue/scheduler:none [bfq]
/sys/block/sde/queue/scheduler:none [bfq]
/sys/block/sdf/queue/scheduler:none [bfq]
/sys/block/sdg/queue/scheduler:none [bfq]
/sys/block/sdh/queue/scheduler:none [bfq]
/sys/block/sdi/queue/scheduler:none [bfq]


Looks good.

Right now, I'm finishing copying a lot of files to a drive, so far so good. But there isn't too much stress for now. Once done, I will start again an extended smart test on the four previous drives and stress the card a little with some file copy between disks to see if the problem happens again.
On the other hand, I wonder if the Dell Perc H310 isn't just overheating. There is no sensor for the card but last time I powered off this server, the card was hot as hell. Unfortunately I can't mount a fan on it because of a usb3 pci express card just below in a pcie-x1 slot. Maybe I can mount a 120mm fan to the panel of the case to push some air on the pcie slots though.
Back to top
View user's profile Send private message
sdauth
l33t
l33t


Joined: 19 Sep 2018
Posts: 666
Location: Ásgarðr

PostPosted: Fri Jun 28, 2024 1:39 pm    Post subject: Reply with quote

Switching to bfq io scheduler instead of mq-deadline seems to does wonders. Almost 48hours uptime with concurrent smart test (1 drive still not finished..) and various rsync (local and remote) copy on different hdd and not a single drive reset yet.
Back to top
View user's profile Send private message
sdauth
l33t
l33t


Joined: 19 Sep 2018
Posts: 666
Location: Ásgarðr

PostPosted: Wed Nov 27, 2024 12:00 pm    Post subject: Reply with quote

Just a quick update just in case someone find this thread through web search and think setting bfq is enough, it is not :o
Indeed, in early September, a drive reset again from the SAS HBA, strangely the same drive again (18tb)
You can find online a couple of threads about this drive, so maybe a buggy firmware ? Who knows.
Here is what I did :
- Replacing the thermal paste on the SAS HBA, the LSI chipset can reach freaking hot temperatures. When one is using IT firmware, you can't access temp sensors (only with IR firmware) but you can literally burn yourself when touching the heatsink for a fraction of second so something is not right :lol:
Indeed after removing the heatsink, the old thermal pad was almost completely gone, so I cleaned the heatsink surface & chipset then applied new thermal paste. Still hot but much less than before. (I need to find a way to push some air on it though..)
- I also replaced the mini sas to sata cables again. molex to sata cables too.

While it seemed to help for a few weeks, it happened again mid October. I started to get really mad :lol:
Finally, what seems to help was to turn off TLER on this specific drive.
Actually, I turned it off for all WDC drives with this udev rule :

/etc/udev/rules.d/50-disks.rules
Code:
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC*", RUN+="/usr/sbin/smartctl -l scterc,0,0 /dev/%k"


It might not be desirable if you're using a RAID / ZFS (not used in my case) Please comment on that if you have more info.

Another possibility is to raise the scsi timeout: (default to 30 for timeout & 10 for eh_timeout)
(Note that it only targets WDC drives in this example)
Code:
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{model}=="WDC*", RUN+="/bin/sh -c 'echo 180 >/sys/$DEVPATH/device/timeout'"
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{model}=="WDC*", RUN+="/bin/sh -c 'echo 60 >/sys/$DEVPATH/device/eh_timeout'"

to check it is correctly applied:
Code:
find /sys/class/scsi_disk/*/device/*timeout -exec grep -H . '{}' \;


But for now, i have not tried that since turning off TLER *seems to* (= for now) fix the issue. (hopefully I will not have to update this thread with bad news :lol: )

Cheers

edit: (Little update) After one month uptime with heavy disk activity, not a single drive reset, system really stable. Hopefully it stays like this.
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Other Things Gentoo All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum