AMD-Vi: Event logged IO_PAGE_FAULT

engineermdr

My fairly new NAS suffered an error today while I was gone at work. I came home to find any access to ZFS would hang, including a "zpool status". I looked through syslog and found

engineermdr · Posted: Fri Feb 09, 2024 3:42 pm

I found this related post where someone else has the same problem with the ASM1061 that I have, although mine is on my motherboard.
https://lkml.iu.edu/hypermail/linux/kernel/2401.3/00559.html

Edit: See also https://lkml.org/lkml/2024/1/23/1196, which is easier to read and partly overlaps the first link. I'm not sure I follow all of it, but it appears I'm hitting a kernel driver problem.

Hu · Administrator

What kernel version is this? My initial attempts to find "Using 64-bit DMA" in the kernel source turned up no relevant hits, so either the string is oddly line-wrapped in the kernel (defeating git grep), it comes from an out-of-tree driver, or the message is composed in parts (and I failed to guess the right composition).

My first guess would be that the "Using 64-bit DMA" line is just debug output that appears whenever the kernel (re)initializes the device.

As I interpret the linked mailing list posts, that reporter (and possibly you, too) has a buggy firmware problem. The device claims it can do 64-bit DMA, so the kernel proceeds to use 64-bit DMA with it. In practice, it can only do 43-bit DMA, and zeroes the remaining upper bits. If any of the upper (64 - 43) = 21 bits are set in the address the kernel picks, the device clears them, then proceeds to write to the wrong address. The device needs to write to exactly the address it was given. An appropriate entry in the kernel's quirk table can instruct the kernel to ignore the claimed 64-bit support, and ensure that the kernel gives the device only addresses that can be represented in 43 bits. Then the bits that the device wrongly zeroes are already zero (because the kernel took care to pick an address with that property), so the truncated address is still the "right" address, and no fault occurs.
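To make the arithmetic concrete, here is a minimal Python sketch of that interpretation. It is only a model of the behavior described in the linked posts, not kernel or driver code; the names DEVICE_ADDRESS_BITS, address_seen_by_device and is_safe_for_device, and both sample addresses, are made up for illustration.

```python
# Illustrative model only: assumes the device honors just the low 43 bits
# of a DMA address, as described in the linked mailing list posts.
DEVICE_ADDRESS_BITS = 43
DEVICE_MASK = (1 << DEVICE_ADDRESS_BITS) - 1


def address_seen_by_device(dma_addr: int) -> int:
    """Return the address after the device clears every bit above bit 42."""
    return dma_addr & DEVICE_MASK


def is_safe_for_device(dma_addr: int) -> bool:
    """True when truncation leaves the address unchanged."""
    return address_seen_by_device(dma_addr) == dma_addr


# Hypothetical addresses, chosen only to show both cases.
low_addr = 0x0123_4567_8000        # fits in 43 bits
high_addr = (1 << 47) | 0x1000     # bit 47 set, above the 43-bit limit

for addr in (low_addr, high_addr):
    seen = address_seen_by_device(addr)
    verdict = "ok" if is_safe_for_device(addr) else "WRONG ADDRESS"
    print(f"kernel picked {addr:#018x}, device uses {seen:#018x}: {verdict}")
```

In this model, a quirk that limits the device to 43-bit addresses amounts to the kernel only ever handing out addresses for which is_safe_for_device returns True.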

Assuming the above paragraph is right, then this suddenly started failing because ordinary operations caused the kernel to finally pick an address that the device cannot handle. Your prior successes were because the kernel happened, by luck, to be picking addresses the device handled properly.

engineermdr · Posted: Fri Feb 09, 2024 7:51 pm

This all started after upgrading to gentoo-sources-6.6.16. I had previously been using 6.1.67. I'm also having nfsd issues with the 6.6.16 update. So, I'm going back to 6.1 for the time being.