Gentoo Forums
ixgbe unable to reset after PCI-e fault
DocJava
n00b


Joined: 25 Oct 2017
Posts: 1

Posted: Wed Oct 25, 2017 4:55 am    Post subject: ixgbe unable to reset after PCI-e fault

Hello,

I'm having a networking driver issue that occurs anywhere from about once a day to once every two weeks. The system remains stable when it happens, but the network devices never come back up until I do a full power cycle. I used to run Gentoo, so I'm posting the question here instead of on the Ubuntu forums, just out of preference.

Any thoughts on whether this is a hardware failure or a solvable kernel bug would be appreciated.

My motherboard is an Asus X99-E-10G-WS with dual 10GbE ports. I'm maxed out with 64 GiB of RAM and four dual-slot x16 graphics cards used for deep learning. In addition, 4 PCI-e lanes are dedicated to the Ethernet NICs, which is fewer than the 20 lanes they should normally get. The CPU is an Intel Core i7-6850K Broadwell-E 6-Core 3.6 GHz LGA 2011-V3, with 44 PCI-e lanes.

In addition to "rmmod ixgbe; modprobe ixgbe", I've already tried removing the PCI devices through /sys/bus/pci and issuing an "echo 1 > rescan", as well as issuing a reset directly to the PCI devices themselves. None of these approaches worked, and they left the system without network devices at the end of the process (especially the remove + rescan). Other attempts resulted in the ixgbe driver complaining in dmesg that the network devices were in the D3 power state and that it was unable to add an adapter during probing. Once the bug has presented itself, every attempt left the devices invisible in ifconfig.
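
For reference, the remove + rescan sequence described above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name remove_and_rescan and the sysfs_root parameter are made up for illustration, while the files it writes to (devices/<address>/remove and rescan under /sys/bus/pci) are the standard Linux sysfs entries. It needs root on a real system, and it is not a fix: this is the sequence that left the system without network devices.

```python
from pathlib import Path


def remove_and_rescan(addresses, sysfs_root="/sys"):
    """Detach each PCI function, then ask the kernel to re-enumerate the bus.

    `sysfs_root` is a hypothetical parameter that exists only so the function
    can be pointed at a scratch directory for a dry run.
    Returns the list of files that were written, in order.
    """
    pci = Path(sysfs_root) / "bus" / "pci"
    written = []
    for addr in addresses:
        remove_file = pci / "devices" / addr / "remove"
        remove_file.write_text("1")   # same as: echo 1 > .../devices/<addr>/remove
        written.append(str(remove_file))
    rescan_file = pci / "rescan"
    rescan_file.write_text("1")       # same as: echo 1 > /sys/bus/pci/rescan
    written.append(str(rescan_file))
    return written
```

On the affected machine this would be called as root with the two ixgbe functions from the dmesg output below, i.e. remove_and_rescan(["0000:02:00.0", "0000:02:00.1"]).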

I've only had a few chances to poke around when the issue occurs, and it occurs more or less randomly, so there's no way to reproduce it on demand and I haven't been able to find a workaround. I've already tried isolating the system on its own power circuit with a dedicated UPS, so I don't think power is the issue (the PSU is a Corsair AXi Series AX1500i), and all available power leads are connected to the motherboard.

More details are enclosed below.

-Doc

Running kernel:
Code:

Linux workstation 4.4.0-97-generic #120-Ubuntu SMP Tue Sep 19 17:28:18 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


dmesg output when bug occurs:
Code:

[1582877.105447] pcieport 0000:00:01.1: AER: Uncorrected (Non-Fatal) error received: id=0009
[1582877.105454] pcieport 0000:00:01.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0009(Requester ID)
[1582877.105456] pcieport 0000:00:01.1:   device [8086:6f03] error status/mask=00004000/00000000
[1582877.105457] pcieport 0000:00:01.1:    [14] Completion Timeout     (First)
[1582877.105460] pcieport 0000:00:01.1: broadcast error_detected message
[1582877.122936] ixgbe 0000:02:00.1: Adapter removed
[1582877.128974] ixgbe 0000:02:00.0: Adapter removed
[1582879.163633] pcieport 0000:00:01.1: broadcast slot_reset message
[1582879.163910] bridge-enp2s0f1: disabling the bridge
[1582879.176041] bridge-enp2s0f1: down
[1582879.176048] bridge-enp2s0f1: detached
[1582879.176063] ixgbe 0000:02:00.0: Refused to change power state, currently in D3
[1582879.236400] ixgbe 0000:02:00.0: Adapter removed
[1582879.363746] userif-4: sent link down event.
[1582879.363749] userif-4: sent link up event.
[1582880.553604] ixgbe 0000:02:00.0 enp2s0f0: eeprom read at offset 40 failed
[1582880.768797] ixgbe 0000:02:00.0 enp2s0f0: eeprom read at offset 39 failed
[1582880.768806] ixgbe 0000:02:00.0: Hardware Error: -15
[1582880.980561] ixgbe 0000:02:00.0: pci_cleanup_aer_uncorrect_error_status failed 0xfffffffb
[1582880.996022] ixgbe 0000:02:00.1: Refused to change power state, currently in D3
[1582881.056250] ixgbe 0000:02:00.1: Adapter removed
[1582881.700030] nfs: server 10.101.0.3 not responding, timed out
[1582882.373013] ixgbe 0000:02:00.1 enp2s0f1: eeprom read at offset 40 failed
[1582882.589641] ixgbe 0000:02:00.1 enp2s0f1: eeprom read at offset 39 failed
[1582882.589648] ixgbe 0000:02:00.1: Hardware Error: -15
[1582882.804186] ixgbe 0000:02:00.1: pci_cleanup_aer_uncorrect_error_status failed 0xfffffffb
[1582882.804193] pcieport 0000:00:01.1: broadcast resume message
[1582887.851668] pcieport 0000:00:01.1: AER: Device recovery successful
[1582887.851681] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Non-Fatal) error received: id=0009
[1582887.851893] pcieport 0000:00:01.1: can't find device of ID0009
[1582887.851896] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: id=0009
[1582887.852136] pcieport 0000:00:01.1: can't find device of ID0009
[1582887.852139] pcieport 0000:00:01.1: AER: Uncorrected (Fatal) error received: id=0009
[1582887.852365] pcieport 0000:00:01.1: can't find device of ID0009
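
As a reading aid for the AER lines above, here is a small Python sketch that decodes the two hex fields in them. It assumes the usual PCIe layout of a 16-bit requester ID (bus in bits 15..8, device in bits 7..3, function in bits 2..0) and the AER Uncorrectable Error Status bit positions; the bit names follow the spec wording and may differ slightly from what the kernel prints. The function names are made up for illustration.

```python
# Bit positions in the AER Uncorrectable Error Status register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}


def decode_requester_id(req_id):
    """Split a 16-bit requester ID into bus:device.function notation."""
    bus = (req_id >> 8) & 0xFF
    device = (req_id >> 3) & 0x1F
    function = req_id & 0x7
    return f"{bus:02x}:{device:02x}.{function}"


def decode_uncorrectable_status(status):
    """Return the names of all known bits set in the status word."""
    return [name for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
            if status & (1 << bit)]


print(decode_requester_id(0x0009))              # id=0009 from the log
print(decode_uncorrectable_status(0x00004000))  # error status from the log
```

With these assumptions, id=0009 decodes to 00:01.1, the same address as the pcieport device that logs the error, and status 00004000 has only bit 14 set, matching the "[14] Completion Timeout" line.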


dmesg output after the bug occurs and after "rmmod ixgbe; modprobe ixgbe":
Code:

[1583237.512365] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 4.2.1-k
[1583237.512367] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
[1583237.528043] ixgbe 0000:02:00.0: Refused to change power state, currently in D3
[1583237.528362] ixgbe 0000:02:00.0: Adapter removed
[1583237.528388] ixgbe: probe of 0000:02:00.0 failed with error -5
[1583237.544034] ixgbe 0000:02:00.1: Refused to change power state, currently in D3
[1583237.544301] ixgbe 0000:02:00.1: Adapter removed
[1583237.544321] ixgbe: probe of 0000:02:00.1 failed with error -5
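
The numeric codes in these logs can be decoded with a short Python sketch, assuming they are ordinary negative errno values, which is how the kernel normally reports such failures. Under that assumption, "probe ... failed with error -5" and the earlier "pci_cleanup_aer_uncorrect_error_status failed 0xfffffffb" are the same value, the latter printed as an unsigned 32-bit number. The helper names are made up for illustration. The "Hardware Error: -15" lines are not decoded here, since that may be a driver-internal code rather than an errno.

```python
import errno
import os


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def describe_errno(code):
    """Map a (possibly negative) errno value to its symbolic name and message."""
    n = abs(code)
    return f"{errno.errorcode.get(n, 'unknown')} ({os.strerror(n)})"


print(to_signed32(0xFFFFFFFB))                  # value from the AER cleanup line
print(describe_errno(-5))                       # value from the probe failure
print(describe_errno(to_signed32(0xFFFFFFFB)))
```

Both values come out as EIO, a generic I/O error, so they do not say much beyond the fact that the device stopped responding.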
xaviermiller
Bodhisattva


Joined: 23 Jul 2004
Posts: 8710
Location: ~Brussels - Belgique

Posted: Wed Oct 25, 2017 6:34 am

Moved from Kernel & Hardware to Unsupported Software (Ubuntu)
_________________
Kind regards,
Xavier Miller