Gentoo Forums
SAS HBA, random drive reset under load, io scheduler issue ?
sdauth
l33t


Joined: 19 Sep 2018
Posts: 611
Location: Ásgarðr

Posted: Wed Jun 26, 2024 11:01 pm    Post subject: SAS HBA, random drive reset under load, io scheduler issue ?

Hello,
I have a server with 8 HDDs attached to a SAS HBA (Dell PERC H310 flashed with IT firmware). No RAID is used: each hard disk has its own filesystem and mountpoint.
When the system is under high load with a lot of reads and writes, one disk (not the same one each time!) resets itself for no apparent reason.
This time it was /dev/sdg. When it happened, I was running a long SMART test on 4 drives and had just started a copy to another disk. /dev/sdg was completely idle.
Other than that, the airflow is good, temperatures are OK, and I have already replaced the mini-SAS and power cables. :o

Code:
[111738.379589] sd 0:0:6:0: device_block, handle(0x000f)
[111740.379542] sd 0:0:6:0: device_unblock and setting to running, handle(0x000f)
[111740.406332] sd 0:0:6:0: [sdg] Synchronizing SCSI cache
[111740.406369] sd 0:0:6:0: [sdg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[111740.407030] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[111740.407040] mpt2sas_cm0: removing handle(0x000f), sas_addr(0x4433221106000000)
[111740.407044] mpt2sas_cm0: enclosure logical id(0x5d8ae120b146b700), slot(5)
[111758.130732] mpt2sas_cm0: handle(0xf) sas_address(0x4433221106000000) port_type(0x1)
[111758.385681] scsi 0:0:8:0: Direct-Access     ATA      WDC WD180EDGZ-11 0A85 PQ: 0 ANSI: 6
[111758.385704] scsi 0:0:8:0: SATA: handle(0x000f), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[111758.385707] scsi 0:0:8:0: enclosure logical id (0x5d8ac140b146b500), slot(5)
[111758.385808] scsi 0:0:8:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[111758.385812] scsi 0:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[111758.454308] sd 0:0:8:0: Power-on or device reset occurred
[111758.455266] sd 0:0:8:0: Attached scsi generic sg6 type 0
[111758.457603] sd 0:0:8:0: [sdj] 35156656128 512-byte logical blocks: (18.0 TB/16.4 TiB)
[111758.457611] sd 0:0:8:0: [sdj] 4096-byte physical blocks
[111758.475549]  end_device-0:8: add: handle(0x000f), sas_addr(0x4433221106000000)
[111758.481002] sd 0:0:8:0: [sdj] Write Protect is off
[111758.481011] sd 0:0:8:0: [sdj] Mode Sense: 7f 00 10 08
[111758.487995] sd 0:0:8:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[111758.561591]  sdj: sdj1
[111758.561773] sd 0:0:8:0: [sdj] Attached SCSI disk
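
A rough sketch for tallying such drops per SAS address, to see whether one port is responsible or the resets really move between drives. It assumes the mpt2sas "removing handle(...), sas_addr(...)" message format shown in the log above; count_removals is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# count_removals (hypothetical helper): read kernel log text on stdin and
# print "<count> <sas_addr>" for every device removal reported by the HBA driver.
count_removals() {
    # Keep only the SAS address from "removing handle(...), sas_addr(...)" lines,
    # then count how often each address appears.
    sed -n 's/.*removing handle(0x[0-9a-fA-F]*), sas_addr(\(0x[0-9a-fA-F]*\)).*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

Typical use would be something like `dmesg | count_removals`. The SAS address is used as the key because the SCSI address and the sdX name change when the drive comes back (0:0:6:0 / sdg became 0:0:8:0 / sdj above).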


Could this be related to the I/O scheduler in use (MQ_IOSCHED_DEADLINE)? Should I switch to BFQ? Any ideas or tips for troubleshooting this?
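
For testing a different scheduler on a single disk without a reboot, the scheduler can be changed at runtime through sysfs. A minimal sketch, assuming the wanted scheduler is already built into the kernel or loaded as a module; set_scheduler and the SYSFS_ROOT override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# set_scheduler DEV SCHED (hypothetical helper): select the I/O scheduler for
# one block device by writing its name to the device's queue/scheduler file.
# SYSFS_ROOT is an illustrative override so the function can be tried against
# a scratch directory; it defaults to the real /sys.
set_scheduler() {
    dev=$1
    sched=$2
    path="${SYSFS_ROOT:-/sys}/block/$dev/queue/scheduler"
    if [ ! -w "$path" ]; then
        echo "cannot write $path" >&2
        return 1
    fi
    printf '%s\n' "$sched" > "$path"
}
```

Run as root, for example `set_scheduler sdg bfq`, then `cat /sys/block/sdg/queue/scheduler` to see which name is shown in brackets. The change does not survive a reboot or a re-attach of the drive.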


Last edited by sdauth on Fri Jun 28, 2024 9:16 am; edited 1 time in total
sdauth
l33t


Joined: 19 Sep 2018
Posts: 611
Location: Ásgarðr

Posted: Thu Jun 27, 2024 7:44 am

I recompiled my kernel with BFQ built in and added this udev rule:
/etc/udev/rules.d/60-ioschedulers.rules
Code:
# HDD
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
# SSD
ACTION=="add|change", KERNEL=="sd[a-z]*|mmcblk[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="bfq"
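
A new rule file like this can be applied without rebooting by reloading the rules and replaying "change" events for block devices. A small sketch, assuming udevadm from a reasonably recent udev; reload_block_rules and the UDEVADM override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# reload_block_rules (hypothetical helper): make udev re-read its rule files,
# then replay "change" events for block devices so the rules are applied.
# UDEVADM is an illustrative override for the udevadm binary to call.
reload_block_rules() {
    udevadm_bin=${UDEVADM:-udevadm}
    "$udevadm_bin" control --reload &&
        "$udevadm_bin" trigger --action=change --subsystem-match=block
}
```

This needs to be run as root. Since the rules match on ACTION=="add|change", a drive that drops off the HBA and is re-attached under a new name should also get the scheduler set again.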


grep "" /sys/block/*/queue/scheduler
Code:
/sys/block/sda/queue/scheduler:none [bfq]
/sys/block/sdb/queue/scheduler:none [bfq]
/sys/block/sdc/queue/scheduler:none [bfq]
/sys/block/sdd/queue/scheduler:none [bfq]
/sys/block/sde/queue/scheduler:none [bfq]
/sys/block/sdf/queue/scheduler:none [bfq]
/sys/block/sdg/queue/scheduler:none [bfq]
/sys/block/sdh/queue/scheduler:none [bfq]
/sys/block/sdi/queue/scheduler:none [bfq]


Looks good.
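
With more disks, the same check can be scripted instead of read by eye. A rough sketch that takes the grep output format shown above and lists only the devices whose active (bracketed) scheduler is not the expected one; not_using is a made-up helper name.

```shell
#!/bin/sh
# not_using SCHED (hypothetical helper): read lines such as
#   /sys/block/sda/queue/scheduler:none [bfq]
# on stdin and print "<device> <active scheduler>" for every device whose
# active scheduler (the name in square brackets) differs from SCHED.
not_using() {
    awk -F: -v want="$1" '
        {
            s = index($2, "[")
            e = index($2, "]")
            active = (s > 0 && e > s) ? substr($2, s + 1, e - s - 1) : ""
            if (active != want) {
                # The device name is the third-from-last path component.
                n = split($1, parts, "/")
                print parts[n - 2], active
            }
        }'
}
```

Usage would be `grep "" /sys/block/*/queue/scheduler | not_using bfq`. No output means every listed device is on bfq.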

Right now I'm finishing copying a lot of files to one drive, and so far so good, but that doesn't put much stress on the system. Once that is done, I will start an extended SMART test again on the same four drives and stress the card with some file copies between disks to see if the problem happens again.
On the other hand, I wonder if the Dell PERC H310 is simply overheating. The card has no temperature sensor, but the last time I powered off this server it was hot as hell. Unfortunately I can't mount a fan on it because a USB3 PCI Express card sits just below it in a PCIe x1 slot. Maybe I can mount a 120mm fan on the case panel to push some air over the PCIe slots, though.
sdauth
l33t


Joined: 19 Sep 2018
Posts: 611
Location: Ásgarðr

Posted: Fri Jun 28, 2024 1:39 pm

Switching to the bfq I/O scheduler instead of mq-deadline seems to do wonders. Almost 48 hours of uptime with concurrent SMART tests (1 drive still not finished) and various rsync copies (local and remote) on different HDDs, and not a single drive reset yet.
All times are GMT