Gentoo Forums
[solved] PCIE bus errors with kernel 6.1.6/7
steve_v
Guru


Joined: 20 Jun 2004
Posts: 416
Location: New Zealand

Posted: Fri Jan 20, 2023 10:20 am    Post subject: [solved] PCIE bus errors with kernel 6.1.6/7

So I thought I'd give the 6.1 branch a spin, just for funsies and all that. Built gentoo-sources-6.1.7 with genkernel (from the generated config), nothing peculiar.

Then there I am, seeking through a vijeo (from fileserver, over NFS), and *poof*, my X11 goes away.
Inspecting dmesg/syslog, I see a bunch of rather, errm, interesting things have been going on since shortly after boot, such as:
Code:

kernel: pcieport 0000:00:1b.4: AER: Multiple Corrected error received: 0000:06:02.0
kernel: pcieport 0000:06:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
kernel: pcieport 0000:06:02.0:   device [111d:8061] error status/mask=00000080/00002000
kernel: pcieport 0000:06:02.0:    [ 7] BadDLLP               
kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:06:04.0
kernel: pcieport 0000:06:04.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
kernel: pcieport 0000:06:04.0:   device [111d:8061] error status/mask=00000001/00002000
kernel: pcieport 0000:06:04.0:    [ 0] RxErr


Usually followed by a whole lot of:
Code:
 
kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:08:00.0
kernel: bnx2 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
kernel: bnx2 0000:08:00.0:   device [14e4:1639] error status/mask=00003000/00002000
kernel: bnx2 0000:08:00.0:    [12] Timeout
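
For anyone reading the status/mask pairs in these logs: each is a pair of 32-bit hex values from the AER Correctable Error Status and Mask registers, and the kernel only names the bits that are set in status and clear in mask. A minimal Python sketch of that decoding follows. The function name decode_corrected is made up for illustration; the labels RxErr, BadDLLP and Timeout are the ones visible in the logs here, and the rest are filled in from the AER register layout as I understand it, so treat them as an assumption.

```python
# Short labels for the AER Correctable Error Status bits, as printed in dmesg.
# Only RxErr, BadDLLP and Timeout appear in this thread; the others are assumed
# from the PCIe AER register layout.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_corrected(status_mask: str) -> dict:
    """Split a 'status/mask' pair such as '00003000/00002000' into bit names."""
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)
    result = {"reported": [], "masked": []}
    for bit in range(32):
        if not status & (1 << bit):
            continue  # bit not set in the status register
        name = AER_CORRECTABLE_BITS.get(bit, f"bit{bit}")
        # A bit that is also set in the mask is suppressed from reporting.
        key = "masked" if mask & (1 << bit) else "reported"
        result[key].append(name)
    return result


if __name__ == "__main__":
    for pair in ("00000080/00002000", "00000001/00002000", "00003000/00002000"):
        print(pair, decode_corrected(pair))
```

Fed the bnx2 pair 00003000/00002000, this gives Timeout as reported and NonFatalErr as masked, which lines up with the single [12] Timeout line in the log above.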

Sometimes a bit of:
Code:

kernel: pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.1
kernel: snd_hda_intel 0000:03:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
kernel: snd_hda_intel 0000:03:00.1:   device [1002:ab28] error status/mask=00100000/00000000
kernel: snd_hda_intel 0000:03:00.1:    [20] UnsupReq               (First)
kernel: snd_hda_intel 0000:03:00.1: AER:   TLP Header: 6000000a 000000ff 00000046 fc4fd140

And eventually, the GPU crash that brought it to my attention:
Code:

kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:24 vmid:3 pasid:32811, for process mpv pid 28669 thread mpv:cs0 pid 28682)
kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080010dcfb000 from client 0x12 (VMC)
kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00305631
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: VCN0 (0x2b)
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x1
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:24 vmid:3 pasid:32811, for process mpv pid 28669 thread mpv:cs0 pid 28682)
kernel: amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x000080010dcfb000 from client 0x12 (VMC)
kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00000000
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 Faulty UTCL2 client ID: unknown (0x0)
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MORE_FAULTS: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 WALKER_ERROR: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 PERMISSION_FAULTS: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 MAPPING_ERROR: 0x0
kernel: amdgpu 0000:03:00.0: amdgpu: \x09 RW: 0x0
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec_0 timeout, signaled seq=74377, emitted seq=74379
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mpv pid 28669 thread mpv:cs0 pid 28682
kernel: amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
kernel: [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x00000100 != 0x00000340
kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
kernel: amdgpu 0000:03:00.0: amdgpu: free PSP TMR buffer
kernel: amdgpu 0000:03:00.0: amdgpu: MODE1 reset
kernel: amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
kernel: amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset


Where:
Code:

00:00.0 Host bridge: Intel Corporation Comet Lake-S 6c Host Bridge/DRAM Controller (rev 05)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 05)
00:14.0 USB controller: Intel Corporation Comet Lake USB 3.1 xHCI Host Controller
00:14.2 RAM memory: Intel Corporation Comet Lake PCH Shared SRAM
00:15.0 Serial bus controller: Intel Corporation Comet Lake PCH Serial IO I2C Controller #0
00:15.1 Serial bus controller: Intel Corporation Comet Lake PCH Serial IO I2C Controller #1
00:16.0 Communication controller: Intel Corporation Comet Lake HECI Controller
00:17.0 SATA controller: Intel Corporation Comet Lake SATA AHCI Controller
00:1b.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #17 (rev f0)
00:1b.4 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #21 (rev f0)
00:1c.0 PCI bridge: Intel Corporation Device 06b8 (rev f0)
00:1c.4 PCI bridge: Intel Corporation Device 06bc (rev f0)
00:1c.7 PCI bridge: Intel Corporation Device 06bf (rev f0)
00:1d.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Z490 Chipset LPC/eSPI Controller
00:1f.3 Audio device: Intel Corporation Comet Lake PCH cAVS
00:1f.4 SMBus: Intel Corporation Comet Lake PCH SMBus Controller
00:1f.5 Serial bus controller: Intel Corporation Comet Lake PCH SPI Controller
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch (rev c1)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M] (rev c1)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
05:00.0 PCI bridge: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch (rev 01)
06:02.0 PCI bridge: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch (rev 01)
06:04.0 PCI bridge: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch (rev 01)
07:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
07:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
08:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
08:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
0a:00.0 Ethernet controller: Intel Corporation Ethernet Controller I225-V (rev 01)
0b:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
*

Everything appears completely stable with 5.15.xx (and always has been), no scary messages either. Hardware hasn't changed, AFAICT the only variable is the kernel.

I'm not sure exactly where that PCIE switch on 06:02.0 physically is either (motherboard? network card?), and there are too many changes from 5.15.88 - 6.1.7 to go looking for suspects. Looks like there has been some work on the bnx2 driver recently though. Maybe? My guessing does suck...
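
On the question of where a bridge like 06:02.0 sits: the resolved sysfs path of a PCI device lists every bridge above it, so the chain from root port down to the device can be read straight off the path (lspci -tv shows the same thing as a tree). A small Python sketch, assuming the usual /sys/bus/pci/devices layout; upstream_chain and chain_for are hypothetical helper names, and what the chain looks like on this particular box is not something the thread confirms.

```python
import os
import re

# Matches a full PCI address (domain:bus:device.function), e.g. 0000:06:02.0
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def upstream_chain(sysfs_path: str) -> list:
    """Return the PCI addresses found in a resolved sysfs path, root port first."""
    return [part for part in sysfs_path.split("/") if BDF_RE.match(part)]


def chain_for(bdf: str) -> list:
    """Resolve /sys/bus/pci/devices/<bdf> and list the device and every bridge above it."""
    return upstream_chain(os.path.realpath(f"/sys/bus/pci/devices/{bdf}"))


if __name__ == "__main__":
    # Prints only the address itself if the device does not exist on this machine.
    print(" -> ".join(chain_for("0000:06:02.0")))
```

If the chain for the switch and for the NIC functions share the same parents, the switch is most likely on the card itself rather than on the motherboard.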

It's not a big deal ATM, I'm back on 5.15.88 for now and all is well.
I can probably get some cleaner logs if I actually try to repro it on 6.1.7, but for the moment I'm just fishing. Anyone seen this kind of thing?
6.1.7 kernel config.


*Yes, I have much ethernet. More ethernet more better, don't ask.
_________________
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.


Last edited by steve_v on Sat Jan 21, 2023 9:54 am; edited 1 time in total
alamahant
Advocate


Joined: 23 Mar 2019
Posts: 3948

Posted: Fri Jan 20, 2023 12:42 pm

Try
Code:

lspci -nn | grep -E "111d:8061|14e4:1639"

To see which is the culprit.
_________________
:)
steve_v
Guru


Joined: 20 Jun 2004
Posts: 416
Location: New Zealand

Posted: Fri Jan 20, 2023 1:12 pm

alamahant wrote:
lspci -nn | grep -E "111d:8061|14e4:1639"

Points to the broadcom NIC, as expected:
Code:
05:00.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
06:02.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
06:04.0 PCI bridge [0604]: Microsemi / PMC / IDT PES12T3G2 PCI Express Gen2 Switch [111d:8061] (rev 01)
07:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
07:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
08:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)
08:00.1 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme II BCM5709 Gigabit Ethernet [14e4:1639] (rev 20)

Not really explaining why it doesn't cause any drama with 5.15.x, but it's something.

I reseated it and threw a random fan from ye-box-o-random-fans at it for good measure (the "Hot Surface" printed on these cards is absolutely not a joke), but still. Totally stable with 5.15 (and 5.10 for that matter), same environment, same load, same everything really. :?

Compiling 6.1.7 again now, I guess we will see what excitement ensues.
steve_v
Guru


Joined: 20 Jun 2004
Posts: 416
Location: New Zealand

Posted: Fri Jan 20, 2023 1:41 pm

Aaand there it is.

Code:
pcieport 0000:00:1b.4: AER: Corrected error received: 0000:08:00.0
[Sat Jan 21 02:35:54 2023] bnx2 0000:08:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[Sat Jan 21 02:35:54 2023] bnx2 0000:08:00.0:   device [14e4:1639] error status/mask=00003000/00002000
[Sat Jan 21 02:35:54 2023] bnx2 0000:08:00.0:    [12] Timeout               
[Sat Jan 21 02:35:55 2023] pcieport 0000:00:1b.4: AER: Corrected error received: 0000:07:00.0
[Sat Jan 21 02:35:55 2023] bnx2 0000:07:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[Sat Jan 21 02:35:55 2023] bnx2 0000:07:00.0:   device [14e4:1639] error status/mask=00003000/00002000
[Sat Jan 21 02:35:55 2023] bnx2 0000:07:00.0:    [12] Timeout


On 6.1, all that's needed to trigger that is an iperf run. On 5.15, silence.
Off to swap cards I go...
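
To compare kernels (or cards) with something better than eyeballing, the AER reports can be tallied per source device. A rough Python sketch, assuming the message format seen in the logs in this thread; count_aer is a hypothetical helper, not part of any existing tool.

```python
import re
from collections import Counter

# Matches lines like:
#   pcieport 0000:00:1b.4: AER: Corrected error received: 0000:08:00.0
#   pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.1
AER_RE = re.compile(
    r"AER: (?:Multiple )?(Corrected|Uncorrected \([^)]*\)) error received: "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def count_aer(log_text: str) -> Counter:
    """Count AER reports per (severity, source device) pair in kernel log text."""
    return Counter(match.groups() for match in AER_RE.finditer(log_text))


def format_counts(counts: Counter) -> str:
    """Render the tally as one line per device, most frequent first."""
    return "\n".join(
        f"{n:6d}  {severity}  {bdf}" for (severity, bdf), n in counts.most_common()
    )
```

Run it over saved dmesg output from the same iperf test on each kernel (dmesg may need root), and the difference between 5.15 and 6.1 becomes a number per device rather than a scrollback impression.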
steve_v
Guru


Joined: 20 Jun 2004
Posts: 416
Location: New Zealand

Posted: Fri Jan 20, 2023 2:12 pm

Well, rats.

Swapped out that NIC with an identical (known good) unit from another box, and nothing whatsoever changed.
That's gotta mean it's either the kernel's fault all along, or this motherboard just hates me today. Not sure how I feel about that TBH, but at least I'm not shopping for a new NIC.

Next?


Ed.

It's power-management. Because of course it's frickin' power management.

Took that sucker out behind the shed, shot it:
Code:
pcie_aspm=off

And now everything appears to be working fine. Good riddance I say.
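
To confirm the parameter actually took effect after a reboot, the kernel command line and the ASPM policy file can be checked. A minimal Python sketch, assuming the standard /proc/cmdline and /sys/module/pcie_aspm/parameters/policy locations (the policy file marks the active entry with square brackets); the helper names are invented for illustration.

```python
def active_aspm_policy(policy_text: str) -> str:
    """Return the bracketed (active) entry from the pcie_aspm policy file contents."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return ""


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the kernel command line carries pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def read_text(path: str) -> str:
    """Read a small text file, returning '' if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    cmdline = read_text("/proc/cmdline")
    policy = read_text("/sys/module/pcie_aspm/parameters/policy")
    print("pcie_aspm=off on cmdline:", aspm_disabled_on_cmdline(cmdline))
    print("active ASPM policy:", active_aspm_policy(policy) or "unavailable")
```

Per-link state is also visible in the LnkCtl line of lspci -vv (run as root), which is the more direct check for whether ASPM is really off on the link to the NIC.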

I'll slap a [solved] on this mess once I've thrashed it a bit more, but so far, so good.

Ed.

Figure a day is long enough soak, and I can't get the system to miss a beat with aspm disabled.
Still none the wiser as to what changed in the newer kernel releases, but for now I'm chalking it up to "probably just another BIOS bug, and probably specific to this hardware configuration."