sdauth l33t
Joined: 19 Sep 2018 Posts: 659 Location: Ásgarðr
Posted: Wed Jun 26, 2024 11:01 pm Post subject: SAS HBA, random drive reset under load, I/O scheduler issue?
|
|
Hello,
I have a server with 8 HDDs attached to a SAS HBA (Dell PERC H310 with IT firmware). No RAID is used: each hard disk gets its own filesystem and mount point.
When the system is under heavy load with lots of reads/writes, one disk (not the same one each time!) resets itself for no apparent reason.
This time it was /dev/sdg. When it happened, I was running a (long) SMART test on 4 drives and had started a copy to another disk. /dev/sdg was totally idle at the time.
Other than that, the airflow is good, temperatures are OK, and I have already replaced the mini SAS and power cables.
Code: | [111738.379589] sd 0:0:6:0: device_block, handle(0x000f)
[111740.379542] sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
[111740.406332] sd 0:0:6:0: [sdg] Synchronizing SCSI cache
[111740.406369] sd 0:0:6:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[111740.407030] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[111740.407040] mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
[111740.407044] mpt2sas_cm0: enclosure logical id(0x5d8ae120b146b700), slot(5)
[111758.130732] mpt2sas_cm0: handle(0xf) sas_address(0x4433221106000000) port_type(0x1)
[111758.385681] scsi 0:0:8:0: Direct-Access ATA WDC WD180EDGZ-11 0A85 PQ: 0 ANSI: 6
[111758.385704] scsi 0:0:8:0: SATA: handle(0x000f), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[111758.385707] scsi 0:0:8:0: enclosure logical id (0x5d8ac140b146b500), slot(5)
[111758.385808] scsi 0:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[111758.385812] scsi 0:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[111758.454308] sd 0:0:8:0: Power-on or device reset occurred
[111758.455266] sd 0:0:8:0: Attached scsi generic sg6 type 0
[111758.457603] sd 0:0:8:0: [sdj] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)
[111758.457611] sd 0:0:8:0: [sdj] 4096-byte physical blocks
[111758.475549] end_device-0:8: add: handle(0x000f), sas_addr(0x4433221106000000)
[111758.481002] sd 0:0:8:0: [sdj] Write Protect is off
[111758.481011] sd 0:0:8:0: [sdj] Mode Sense: 7f 00 10 08
[111758.487995] sd 0:0:8:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[111758.561591] sdj: sdj1
[111758.561773] sd 0:0:8:0: [sdj] Attached SCSI disk |
From the log, the drive briefly drops off the bus (Synchronize Cache fails with DID_NO_CONNECT), gets removed, and then re-attaches as a new device (sdj). Could this be related to the I/O scheduler in use (MQ_IOSCHED_DEADLINE)? Should I switch to BFQ? Any ideas or tips to troubleshoot this?
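For reference, here is how I can check and flip the scheduler on a single drive at runtime, without rebooting (assuming BFQ is compiled in or the bfq module is loaded; sdg is just the drive from the log above):
Code: | # show the available schedulers; the active one is in brackets
cat /sys/block/sdg/queue/scheduler
# switch this one drive to bfq (as root)
echo bfq > /sys/block/sdg/queue/scheduler |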
Last edited by sdauth on Fri Jun 28, 2024 9:16 am; edited 1 time in total
sdauth l33t
Joined: 19 Sep 2018 Posts: 659 Location: Ásgarðr
Posted: Thu Jun 27, 2024 7:44 am Post subject: |
|
|
I recompiled my kernel with BFQ built in and added this udev rule:
/etc/udev/rules.d/60-ioschedulers.rules
Code: | # HDD
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# SSD
ACTION=="add|change", KERNEL=="sd[a-z]*|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="bfq" |
grep "" /sys/block/*/queue/scheduler
Code: | /sys/block/sda/queue/scheduler:none [bfq]
/sys/block/sdb/queue/scheduler:none [bfq]
/sys/block/sdc/queue/scheduler:none [bfq]
/sys/block/sdd/queue/scheduler:none [bfq]
/sys/block/sde/queue/scheduler:none [bfq]
/sys/block/sdf/queue/scheduler:none [bfq]
/sys/block/sdg/queue/scheduler:none [bfq]
/sys/block/sdh/queue/scheduler:none [bfq]
/sys/block/sdi/queue/scheduler:none [bfq] |
Looks good.
Right now I'm finishing copying a lot of files to a drive; so far so good, but there isn't much stress yet. Once that's done, I will start an extended SMART test again on the four previous drives (roughly as below) and stress the card a little with some file copies between disks, to see if the problem happens again.
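For the record, this is how I start the extended (long) self-tests and check on them later; /dev/sdb is just an example device:
Code: | # start an extended offline self-test (runs inside the drive, non-destructive)
smartctl -t long /dev/sdb
# progress: shows the self-test execution status with % remaining
smartctl -c /dev/sdb
# results once finished
smartctl -l selftest /dev/sdb |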
On the other hand, I wonder if the Dell PERC H310 isn't just overheating. There is no sensor on the card, but the last time I powered off this server the card was hot as hell. Unfortunately I can't mount a fan on it because of a USB 3 PCI Express card just below it in a PCIe x1 slot. Maybe I can mount a 120 mm fan on the case panel to push some air over the PCIe slots, though.
sdauth l33t
Joined: 19 Sep 2018 Posts: 659 Location: Ásgarðr
Posted: Fri Jun 28, 2024 1:39 pm Post subject: |
|
|
Switching to the BFQ I/O scheduler instead of mq-deadline seems to do wonders. Almost 48 hours of uptime with concurrent SMART tests (one drive still not finished...) and various rsync copies (local and remote) to different HDDs, and not a single drive reset yet.
sdauth l33t
Joined: 19 Sep 2018 Posts: 659 Location: Ásgarðr
Posted: Wed Nov 27, 2024 12:00 pm Post subject: |
|
|
Just a quick update, in case someone finds this thread through a web search and thinks setting BFQ is enough: it is not.
Indeed, in early September a drive reset again off the SAS HBA; strangely, it was the same drive again (the 18 TB one).
You can find a couple of threads online about this drive, so maybe a buggy firmware? Who knows.
Here is what I did:
- Replaced the thermal paste on the SAS HBA. The LSI chipset can reach freaking hot temperatures; with IT firmware you can't read the temperature sensors (only with IR firmware), but you can literally burn yourself touching the heatsink for a fraction of a second, so something is not right. Indeed, after removing the heatsink, the old thermal pad was almost completely gone, so I cleaned the heatsink surface and the chipset, then applied new thermal paste. Still hot, but much less than before. (I still need to find a way to push some air over it, though.)
- I also replaced the mini SAS to SATA cables again, and the Molex to SATA power cables too.
While that seemed to help for a few weeks, it happened again in mid-October. I started to get really mad.
Finally, what seems to have helped was turning off TLER on this specific drive.
Actually, I turned it off for all WDC drives with this udev rule:
/etc/udev/rules.d/50-disks.rules
Code: | # disable SCT ERC (TLER): 0,0 = no time limit on read/write error recovery
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{model}=="WDC*", RUN+="/usr/sbin/smartctl -l scterc,0,0 /dev/%k" |
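To double-check that it actually took effect, the setting can be read back (sdg is just an example; it should report both Read and Write as Disabled):
Code: | smartctl -l scterc /dev/sdg |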
It might not be desirable if you're using RAID / ZFS (not my case): with TLER off, the drive can retry a bad sector internally for minutes instead of failing fast and letting the array repair it. Please comment if you have more info; the usual RAID setting is sketched below.
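For comparison, on RAID/ZFS the common advice is the opposite: keep ERC enabled with a short limit. The value is in units of 100 ms, so 70 = 7 seconds (/dev/sdX is a placeholder):
Code: | smartctl -l scterc,70,70 /dev/sdX |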
Another possibility is to raise the SCSI timeouts (defaults are 30 s for timeout and 10 s for eh_timeout):
(Note that it only targets WDC drives in this example)
Code: | # allow commands up to 180 s before the kernel aborts them
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{model}=="WDC*", RUN+="/bin/sh -c 'echo 180 >/sys/$DEVPATH/device/timeout'"
# allow error handling up to 60 s before escalating to a device reset
ACTION=="add|change", SUBSYSTEM=="block", ATTRS{model}=="WDC*", RUN+="/bin/sh -c 'echo 60 >/sys/$DEVPATH/device/eh_timeout'" |
To check that it was applied correctly:
Code: | find /sys/class/scsi_disk/*/device/*timeout -exec grep -H . '{}' \; |
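If applied, that should print one line per value, something like (SCSI addresses will differ):
Code: | /sys/class/scsi_disk/0:0:6:0/device/eh_timeout:60
/sys/class/scsi_disk/0:0:6:0/device/timeout:180 |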
But for now I have not tried that, since turning off TLER *seems* (for now) to have fixed the issue. (Hopefully I will not have to update this thread with bad news.)
Cheers