Gentoo Forums
AMD-Vi: Event logged IO_PAGE_FAULT
engineermdr
Guru


Joined: 08 Nov 2003
Posts: 305
Location: Altoona, WI, USA

Posted: Fri Feb 09, 2024 5:21 am    Post subject: AMD-Vi: Event logged IO_PAGE_FAULT

My fairly new NAS suffered an error today while I was at work. I came home to find that any access to ZFS would hang, including "zpool status". I looked through syslog and found:

Code:
Feb  7 23:11:23 neroon kernel: ahci 0000:0b:00.0: Using 64-bit DMA addresses
Feb  7 23:11:23 neroon kernel: ahci 0000:0b:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0016 address=0x7ffffe00000 flags=0x0000]
Feb  7 23:11:23 neroon kernel: ahci 0000:0b:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0016 address=0x7ffffe00500 flags=0x0000]
Feb  7 23:11:54 neroon kernel: ata9.00: exception Emask 0x0 SAct 0x78fff8 SErr 0x0 action 0x6 frozen


Then this gets reported over and over with only slightly different numbers:

Code:
Feb  7 23:11:54 neroon kernel: ata9.00: failed command: WRITE FPDMA QUEUED
Feb  7 23:11:54 neroon kernel: ata9.00: cmd 61/58:18:78:5b:19/06:00:e5:00:00/40 tag 3 ncq dma 831488 out\x0a         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb  7 23:11:54 neroon kernel: ata9.00: status: { DRDY }
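
For anyone reading the "cmd" line above, here is a minimal sketch of how the fields can be unpacked. It assumes the usual libata print order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and a 512-byte logical sector. The function name decode_ncq_cmd is hypothetical, not a kernel or library API. For NCQ commands the sector count is carried in the feature fields and the tag in the upper bits of nsect, so the result can be checked against the "tag 3" and "dma 831488" values on the same log line.

```python
def decode_ncq_cmd(cmd: str, sector_size: int = 512) -> dict:
    """Decode the 'cmd' field of a libata NCQ error line.

    Assumed field order (libata's error print format):
      command / feature:nsect:lbal:lbam:lbah /
      hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah / device
    sector_size=512 is an assumption about the drive's logical sector size.
    """
    command_s, low_s, high_s, device_s = cmd.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low_s.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high_s.split(":")
    )

    # NCQ: the sector count lives in the feature registers (high byte in hob_feature).
    sectors = (hob_feature << 8) | feature
    # NCQ: the queue tag is stored in bits 7:3 of the sector-count register.
    tag = nsect >> 3
    # 48-bit LBA, most significant byte first.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": int(command_s, 16),
        "device": int(device_s, 16),
        "tag": tag,
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }


decoded = decode_ncq_cmd("61/58:18:78:5b:19/06:00:e5:00:00/40")
print(decoded)
```

Command 0x61 is WRITE FPDMA QUEUED, which agrees with the "failed command" line. The decoded tag and byte count agreeing with the kernel's own "tag" and "dma" values is a reasonable sanity check that the field order assumed here is right.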


After a reboot, the system came up error-free, ZFS resilvered the two failed drives, and all appears well so far. I'm definitely going to exercise the system for a while before I trust any data to it.

Now, I'm wondering if this is an IOMMU issue or a drive/controller issue? It seems strange that days after boot the kernel would start "Using 64-bit DMA addresses". What would cause that? Maybe I have something wrongly configured. Which way should I investigate first? SMART is not showing any issues. If it happens again though, I'll start swapping drives or cables to try and isolate the problem.
engineermdr
Guru


Joined: 08 Nov 2003
Posts: 305
Location: Altoona, WI, USA

Posted: Fri Feb 09, 2024 3:42 pm

I found this related post where someone else has the same problem with the ASM1061 that I have, although mine is on my motherboard.
https://lkml.iu.edu/hypermail/linux/kernel/2401.3/00559.html

Edit: There is also https://lkml.org/lkml/2024/1/23/1196, which is easier to read and partly overlaps the first link. I'm not sure I'm following all of it, but it appears I'm having a kernel driver problem.
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23062

Posted: Fri Feb 09, 2024 4:06 pm

What kernel version is this? My initial attempts to find "Using 64-bit DMA" turned up no relevant hits, so either this is oddly line-wrapped in the kernel source (defeating git grep), it is an out-of-tree driver, or the message is composed in parts (and I failed to guess the right composition).

My first guess would be that the "Using 64-bit DMA" line is just debug output that happens whenever the kernel (re)initializes the device.

As I interpret the linked mailing list posts, that reporter (and possibly you, too) has a buggy firmware problem. The device claims it can do 64-bit DMA, so the kernel proceeds to use 64-bit DMA with it. In practice, it can only do 43-bit DMA, and it zeroes the remaining bits. If any of the upper (64 - 43) = 21 bits are set in the address the kernel picks, the device clears them and then writes to the wrong address, when it needs to write to exactly the address it was given. An appropriate entry in the kernel's quirk table can instruct the kernel to ignore the claimed 64-bit support and ensure that the kernel gives the device only addresses that can be represented in 43 bits. Then the bits that the device wrongly zeroes are already zero (because the kernel took care to pick an address with that property), so the truncated address is still the "right" address, and no fault occurs.

Assuming the above paragraph is right, then this suddenly started failing because ordinary operations caused the kernel to finally pick an address that the device cannot handle. Your prior successes were because the kernel happened, by luck, to be picking addresses the device handled properly.
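
To make the truncation concrete, here is a minimal sketch of the arithmetic, assuming the device really does keep only the low 43 bits as described above. The names DEVICE_ADDR_BITS, device_view and fits_device are hypothetical, and the "picked" address is invented for illustration. The log only shows the faulting address, not the address the kernel originally handed out.

```python
DEVICE_ADDR_BITS = 43                       # assumption: device honours only bits 0..42
DEVICE_MASK = (1 << DEVICE_ADDR_BITS) - 1   # low 43 bits set


def device_view(addr: int) -> int:
    """Address the device would actually use if it zeroes the upper 21 bits."""
    return addr & DEVICE_MASK


def fits_device(addr: int) -> bool:
    """True when truncation leaves the address unchanged, so no fault is expected."""
    return device_view(addr) == addr


# Fault address taken from the syslog excerpt in the first post.
logged_fault = 0x7FFFFE00000
# Hypothetical address the kernel might have picked: same value with bit 43 set.
picked = logged_fault | (1 << 43)

print(f"upper bits dropped: {64 - DEVICE_ADDR_BITS}")
print(f"picked       = {picked:#x}, fits: {fits_device(picked)}")
print(f"device sees  = {device_view(picked):#x}")
print(f"logged fault = {logged_fault:#x}, fits: {fits_device(logged_fault)}")
```

The logged fault address fits in 43 bits, which is at least consistent with it being an already-truncated address, though that alone does not prove the theory. A quirk that limits the device to 43-bit addresses amounts to making the kernel choose only addresses for which fits_device would be true.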
engineermdr
Guru


Joined: 08 Nov 2003
Posts: 305
Location: Altoona, WI, USA

Posted: Fri Feb 09, 2024 7:51 pm

This all started after upgrading to gentoo-sources-6.6.16. I had previously been using 6.1.67. I'm also having nfsd issues with the 6.6.16 update, so I'm going back to 6.1 for the time being.