vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Fri Sep 13, 2024 10:55 am Post subject: Troubles with Intel GPU on sys-kernel/gentoo-sources-6.10.* |
|
|
I'm trying the sys-kernel/gentoo-sources-6.10 branch.
On headless machines, this seems to work fine.
On an Intel i7-11800H mini-PC desktop, the machine occasionally reboots abruptly (with no panic message, as far as I know) or gradually slows down and freezes. I am not 100% sure of the cause, but I suspect the GPU, as this happened more often while I was running Steam games.
Has anybody had similar problems? What can I do to debug this issue?
cpuinfo & lspci :
model name : 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
00:02.0 VGA compatible controller: Intel Corporation TigerLake-H GT1 [UHD Graphics] (rev 01) |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
|
fkobi n00b
Joined: 21 Jul 2024 Posts: 3 Location: Poland
|
Posted: Mon Sep 16, 2024 5:05 pm Post subject: |
|
|
Does this happen with gentoo-kernel? |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Wed Sep 18, 2024 8:31 am Post subject: |
|
|
fkobi wrote: | Does this happen with gentoo-kernel? |
I did not try it. |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Wed Sep 18, 2024 8:50 am Post subject: |
|
|
Sorry for the late answer. I'm now running 6.10.10, which seems more stable.
It includes fixes for AMD GPUs but not Intel AFAIK. Odd...
pietinger wrote: | Why do you suspect the GPU ? |
Because I had several other issues:
severe GUI slowdown while running a game (Civ6, if that matters), odd dmesg messages (that I unfortunately did not copy) ...
emerge --info: https://bpa.st/AIODO
Why can't I upload the config files with wgetpaste?
Last edited by vm666 on Wed Sep 18, 2024 9:58 am; edited 1 time in total |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
Posted: Wed Sep 18, 2024 9:44 am Post subject: |
|
|
Hmm ... you have a nice and fast system .... but your swap partition is REALLY too big ...
vm666 wrote: | Why can't I upload the config files with wgetpaste? |
Try another service:
Code: | $ wgetpaste -v --service 0x0 /usr/src/linux/.config
Your paste can be seen here: http://0x0.st/X3eU.txt |
(my old config for my i9) _________________ https://wiki.gentoo.org/wiki/User:Pietinger |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Wed Sep 18, 2024 9:50 am Post subject: |
|
|
pietinger wrote: | Hmm ... you have a nice and fast system .... but your swap partition is REALLY too big ... |
Actually during some experiments I had to add swap.
By the way, I'm looking for a simple way to limit RAM usage for some processes (I mean resident memory, not virtual memory). I could not do it with ulimit; I have to use cgroups.
.config for 6.10.7 http://0x0.st/X3ek.txt
.config for 6.10.10 http://0x0.st/X3en.txt
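On the cgroups remark above, here is a minimal sketch of capping the memory of one command with cgroup v2. It assumes cgroup v2 is mounted at /sys/fs/cgroup, that the memory controller is available, and that it runs as root. The function name `run_limited`, the variables `CGROOT` and `CGNAME`, and the group name "limited" are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Sketch: run a command inside a cgroup v2 group with a memory cap.
# Assumptions (not from the thread): cgroup v2 is mounted at /sys/fs/cgroup,
# the memory controller is available, and the caller is root.
# "limited" is a hypothetical group name.

# run_limited BYTES COMMAND [ARGS...]
run_limited() {
    cgroot="${CGROOT:-/sys/fs/cgroup}"
    cg="$cgroot/${CGNAME:-limited}"
    limit="$1"
    shift
    # Make the memory controller available to child groups; failures are
    # ignored here (the next writes will fail visibly if it really is missing).
    { echo "+memory" > "$cgroot/cgroup.subtree_control"; } 2>/dev/null || true
    mkdir -p "$cg" || return 1
    # memory.max is a hard cap, in bytes, on the memory charged to the group.
    echo "$limit" > "$cg/memory.max" || return 1
    # A fresh shell moves itself into the group, then becomes the command.
    sh -c 'echo $$ > "$0/cgroup.procs" && exec "$@"' "$cg" "$@"
}
```

A call could look like `run_limited $((2 * 1024 * 1024 * 1024)) some_command`. Note that memory.max also counts page cache charged to the group, so it is close to, but not exactly, a resident-memory limit.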
Last edited by vm666 on Wed Sep 18, 2024 4:27 pm; edited 1 time in total |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
Posted: Wed Sep 18, 2024 10:08 am Post subject: |
|
|
I see you have a lot of experience with kernel configuration, but I would change these:
Code: | 1.
# CONFIG_X86_X2APIC is not set
2.
CONFIG_I2C_I801=m
3.
CONFIG_DRM_XE=m
4.
CONFIG_DRM_SIMPLEDRM=m
5.
CONFIG_FB=m
6.
CONFIG_FB_UVESA=m
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
7.
# CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON is not set |
1. Enable it; you have an i7-11800
2. This is the only one you need (you can disable the others)
3. Disable it
4. Disable it
5. You must enable it statically to get EFI-FB. See: https://wiki.gentoo.org/wiki/User:Pietinger/Experimental/Manual_Configuring_Current_Kernel#Framebuffer_Device_and_Console
6. Disable them (after you have enabled VESA and EFI)
7. Enable it |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Wed Sep 18, 2024 4:26 pm Post subject: |
|
|
pietinger wrote: | I see you have a lot of experience with kernel configuration, but I would change these:
Code: | 1.
# CONFIG_X86_X2APIC is not set
7.
# CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON is not set |
1. Enable it; you have an i7-11800
|
Actually I should enable it on all my machines :-/
(at least 3 others where it is not enabled for whatever stupid reason) |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Fri Sep 20, 2024 8:04 am Post subject: |
|
|
vm666 wrote: | Sorry for the late answer. I'm now running 6.10.10, which seems more stable. |
More stable but not entirely stable. The machine rebooted during the night.
$ uptime -s
2024-09-20 01:59:39
Nothing significant in the logs I'm afraid.
Code: |
Sep 20 01:30:00 grillepain CROND[233439]: (root) CMD (/usr/lib/sa/sa1 1 1)
Sep 20 01:30:00 grillepain CROND[233438]: (root) CMDEND (/usr/lib/sa/sa1 1 1)
Sep 20 01:40:00 grillepain CROND[236093]: (root) CMD (/usr/lib/sa/sa1 1 1)
Sep 20 01:40:00 grillepain CROND[236092]: (root) CMDEND (/usr/lib/sa/sa1 1 1)
Sep 20 01:41:19 grillepain root[236471]: ACPI event unhandled: button/up UP 00000080 00000000 K
Sep 20 01:50:00 grillepain CROND[238862]: (root) CMD (/usr/lib/sa/sa1 1 1)
Sep 20 01:50:00 grillepain CROND[238861]: (root) CMDEND (/usr/lib/sa/sa1 1 1)
Sep 20 01:59:49 grillepain syslog-ng[2076]: syslog-ng starting up; version='4.6.0'
Sep 20 01:59:49 grillepain acpid[2107]: starting up with netlink and the input layer
Sep 20 01:59:49 grillepain acpid[2107]: 1 rule loaded
Sep 20 01:59:49 grillepain acpid[2107]: waiting for events: event logging is off
Sep 20 01:59:49 grillepain dhcpcd[2278]: dhcpcd-10.0.8 starting
Sep 20 01:59:49 grillepain dhcpcd[2284]: dev: loaded udev
Sep 20 01:59:49 grillepain dhcpcd[2284]: DUID 00:01:00:01:2c:c3:07:49:68:1d:ef:35:cd:59
Sep 20 01:59:49 grillepain kernel: 8021q: 802.1Q VLAN Support v1.8
Sep 20 01:59:49 grillepain dhcpcd[2284]: no interfaces have a carrier
Sep 20 01:59:49 grillepain kernel: Loading firmware: rtl_nic/rtl8168h-2.fw
|
Moderation note: Fixed code block formatting. -- Banana
EDIT: It crashed again, I was not in front of the machine unfortunately.
$ uptime -s
2024-09-20 13:59:30
$ |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
Posted: Fri Sep 20, 2024 2:53 pm Post subject: |
|
|
vm666 wrote: | Nothing significant in the logs I'm afraid. |
A reboot without any error ... hmm ... what is that -> ?
vm666 wrote: | Code: | Sep 20 01:30:00 grillepain CROND[233439]: (root) CMD (/usr/lib/sa/sa1 1 1)
Sep 20 01:30:00 grillepain CROND[233438]: (root) CMDEND (/usr/lib/sa/sa1 1 1) |
|
(maybe clear your crontab?) |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Fri Sep 20, 2024 5:37 pm Post subject: |
|
|
pietinger wrote: | vm666 wrote: | Nothing significant in the logs I'm afraid. |
A reboot without any error ... hmm ... what is that -> ?
|
Or there is an error but it is not saved on the file system.
Quote: | (maybe clear your crontab?) |
I suspected that it could be triggered by some cron job, but they all look innocuous.
I had problems with scrub a while ago, but this is not the same issue: I tried a full scrub and it worked fine.
https://forums.gentoo.org/viewtopic-t-1165800-highlight-scrub+balance.html |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
Posted: Fri Sep 20, 2024 7:25 pm Post subject: |
|
|
You said you think there is no kernel panic; are you sure? Maybe it would make sense to go from 6 seconds to 0 (wait forever) to be sure? ->
Code: | CONFIG_PANIC_TIMEOUT=6 |
(also make sure that there are no settings in sysctl.conf ... like a -1, which reboots immediately) |
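To check what the running kernel actually uses, the value can be read from /proc/sys/kernel/panic (the runtime counterpart of CONFIG_PANIC_TIMEOUT). The following is a small sketch; the function names `current_panic_timeout` and `describe_panic_timeout` are hypothetical helpers, not existing tools.

```shell
#!/bin/sh
# Sketch: read the kernel's panic timeout and explain what the value means.
# The helper names below are made up for this example.

# current_panic_timeout [FILE]   (FILE defaults to the real sysctl file)
current_panic_timeout() {
    cat "${1:-/proc/sys/kernel/panic}"
}

# describe_panic_timeout VALUE
describe_panic_timeout() {
    case "$1" in
        0)  echo "0: wait forever after a panic" ;;
        -*) echo "$1: reboot immediately after a panic" ;;
        *)  echo "$1: reboot $1 seconds after a panic" ;;
    esac
}
```

For example, `describe_panic_timeout "$(current_panic_timeout)"` prints the current behaviour, `sysctl -w kernel.panic=0` changes it until the next boot, and `grep -rs kernel.panic /etc/sysctl.conf /etc/sysctl.d/` shows whether a configuration file overrides it.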
|
Nowa Developer
Joined: 25 Jun 2014 Posts: 447 Location: Nijmegen
|
Posted: Fri Sep 20, 2024 7:57 pm Post subject: |
|
|
Quote: | Or there is an error but it is not saved on the file system. |
It could be that whatever triggers it happens at a very low level. Have you checked whether the firmware was updated when the kernel was updated?
Possibly a silly suggestion, but have you already verified that the machine is not simply overheating? _________________ OS: Gentoo 6.10.12-gentoo-dist, ~amd64, 23.0/desktop/plasma/systemd
MB: MSI Z370-A PRO
CPU: Intel Core i9-9900KS
GPU: Intel Arc A770 16GB & Intel UHD Graphics 630
SSD: Samsung 970 EVO Plus 2 TB
RAM: Crucial Ballistix 32GB DDR4-2400 |
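For the overheating check suggested above, one option is to read the kernel's thermal zones from sysfs while the game is running. This is a sketch under the assumption that the machine exposes zones under /sys/class/thermal; the function name `show_thermal_zones` is hypothetical.

```shell
#!/bin/sh
# Sketch: print the temperature of every thermal zone the kernel exposes.
# Assumes zones appear as /sys/class/thermal/thermal_zone*; which zones
# exist (and what they are called) depends on the hardware.

# show_thermal_zones [BASE_DIR]
show_thermal_zones() {
    base="${1:-/sys/class/thermal}"
    for z in "$base"/thermal_zone*; do
        [ -r "$z/temp" ] || continue
        t=$(cat "$z/temp")                          # millidegrees Celsius
        kind=$(cat "$z/type" 2>/dev/null || echo unknown)
        echo "$kind: $((t / 1000)) C"
    done
}
```

Running it in a loop from a second machine over SSH, for example `while sleep 5; do show_thermal_zones; done`, would show whether the temperature climbs shortly before a freeze.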
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Tue Sep 24, 2024 7:29 pm Post subject: |
|
|
I don't use KDE
I had another crash 2 days ago, just before 2 pm. I have a cron job that starts at *:59.
It just copies ~/Dropbox to an NFS share through rsync, but I don't believe this is a coincidence. The job runs every hour and there are often new files, so it is not just the copy that triggers it.
Could I have an issue with the soft or hard lockups detection, or with my watchdog?
I disabled the NMI watchdog, just in case. I'm not sure I need it. |
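The lockup detector settings mentioned above can be inspected under /proc/sys/kernel. The sketch below only prints the files that exist, since some of them depend on the kernel configuration; the function name `show_lockup_settings` is hypothetical.

```shell
#!/bin/sh
# Sketch: show the current soft/hard lockup detector settings.
# Files that the running kernel does not provide are silently skipped.

# show_lockup_settings [DIR]
show_lockup_settings() {
    dir="${1:-/proc/sys/kernel}"
    for k in nmi_watchdog soft_watchdog watchdog_thresh \
             softlockup_panic hardlockup_panic; do
        [ -r "$dir/$k" ] && echo "kernel.$k = $(cat "$dir/$k")"
    done
    return 0
}
```

If the freezes are lockups, setting `kernel.softlockup_panic=1` and `kernel.hardlockup_panic=1` with sysctl would turn a detected lockup into a panic, which may make it easier to notice and record. Whether that applies here is an assumption.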
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Wed Sep 25, 2024 4:37 pm Post subject: |
|
|
vm666 wrote: | I disabled the NMI watchdog, just in case. I'm not sure I need it. |
It froze again during the night. The X11 GUI was frozen and the machine did not respond to ping; I could only reboot it.
How can I preserve the last kernel messages after a crash?
One detail:
After some investigation, I discovered that the iTCO_wdt watchdog did not work on this mini PC. I have another machine in the same situation.
AFAIK, iTCO_wdt works on all my other (old) machines.
If I understand correctly, iTCO_wdt is provided by the chipset, and the motherboard manufacturer has to wire it correctly. |
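On preserving the last kernel messages: one possible approach is the kernel's pstore facility, which can store the end of the kernel log in persistent storage (for example UEFI variables) when the kernel oopses or panics. The sketch below assumes a kernel built with CONFIG_PSTORE and an EFI backend (CONFIG_EFI_VARS_PSTORE) and a pstore filesystem mounted at /sys/fs/pstore; the function name `list_pstore_records` is hypothetical. A hard freeze that never reaches a panic may leave no record at all.

```shell
#!/bin/sh
# Sketch: after rebooting from a crash, list saved kernel log records.
# Assumes pstore is enabled and mounted, e.g. with:
#   mount -t pstore pstore /sys/fs/pstore

# list_pstore_records [DIR]
list_pstore_records() {
    dir="${1:-/sys/fs/pstore}"
    for f in "$dir"/dmesg-*; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}
```

Each listed file can be read with cat and removed afterwards to free the backing storage. If nothing is ever saved, sending the kernel log to another machine over the network with netconsole is an alternative worth considering.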
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Mon Oct 28, 2024 12:37 pm Post subject: |
|
|
I am pretty sure now that it only freezes when I am playing Civ6 through Steam. Maybe this is related to Proton (Steam version of Wine) and not the Intel GPU driver.
I could not make any progress debugging this. |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Mon Oct 28, 2024 3:08 pm Post subject: |
|
|
pietinger wrote: | I think I cannot help here any further ... sorry (I have no experience with steam games) ... |
If the cause is the GPU driver, it must be some rarely used 3D function. I would be surprised if it were only used by Proton. |
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
Posted: Sat Dec 14, 2024 11:24 pm Post subject: |
|
|
The problem still exists in 6.11.x and 6.12.x kernels.
So I recompiled a 6.12.4 with CONFIG_DRM_I915_DEBUG and I got this when Civ6 froze:
Code: | [Sun Dec 15 00:06:43 2024] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[Sun Dec 15 00:06:43 2024] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849ffefc, in Civ6 (WinID 2) [181554]
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5c6!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5ca!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5c8!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5cc!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5ce!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5d0!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5d2!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5d4!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5d6!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5d8!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5da!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5dc!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5de!
[Sun Dec 15 00:06:54 2024] Fence expiration time out i915-0000:00:02.0:Civ6 (WinID 2)[181554]:1ac5e0!
[Sun Dec 15 00:06:57 2024] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849ffefc, in Civ6 (WinID 2) [181554]
[Sun Dec 15 00:06:57 2024] i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
[Sun Dec 15 00:06:57 2024] i915 0000:00:02.0: [drm] GT0: Resetting chip for stopped heartbeat on rcs0
[Sun Dec 15 00:06:58 2024] usb 1-1.2: USB disconnect, device number 51
[Sun Dec 15 00:06:58 2024] i915 0000:00:02.0: [drm] *ERROR* GT0: Failed to reset chip
[Sun Dec 15 00:06:58 2024] i915 0000:00:02.0: [drm] CI tainted: 0x9 by intel_gt_reset+0x31f/0x350 [i915]
[Sun Dec 15 00:06:58 2024] i915 0000:00:02.0: [drm] Civ6 (WinID 2)[181554] context reset due to GPU hang
|
The GUI froze. I connected from another PC and could get this dmesg output, and then the machine froze entirely. |
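After a GPU hang like the one logged above, the i915 driver normally keeps an error state that can be read from sysfs and attached to a bug report. This sketch assumes the GPU is card0 (the index may differ) and that the kernel was built with error capture enabled; the function name `save_gpu_error_state` is hypothetical.

```shell
#!/bin/sh
# Sketch: copy the i915 GPU error state to a file before the machine dies.
# Assumption: the Intel GPU is /sys/class/drm/card0 on this machine.

# save_gpu_error_state [SOURCE] [DESTINATION]
save_gpu_error_state() {
    src="${1:-/sys/class/drm/card0/error}"
    dest="${2:-./i915-error-$(date +%Y%m%d-%H%M%S).txt}"
    if [ ! -r "$src" ]; then
        echo "no readable error state at $src" >&2
        return 1
    fi
    # Print the destination path on success so it can be reused in scripts.
    cat "$src" > "$dest" && echo "$dest"
}
```

Since the machine freezes shortly after the hang, running this over SSH as root and writing to a network share, or piping the file straight through SSH, gives the best chance of keeping the dump.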
|
Nowa Developer
Joined: 25 Jun 2014 Posts: 447 Location: Nijmegen
|
Posted: Sun Dec 15, 2024 9:19 am Post subject: |
|
|
Bugs like this one are best reported directly to the upstream i915 driver developers. |
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5219 Location: Bavaria
|
Posted: Sun Dec 15, 2024 11:30 am Post subject: |
|
|
vm666 wrote: | The problem still exists in 6.11.x and 6.12.x kernels.
So I recompiled a 6.12.4 with CONFIG_DRM_I915_DEBUG and I got this when Civ6 froze:
Code: | [Sun Dec 15 00:06:43 2024] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[Sun Dec 15 00:06:43 2024] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849ffefc, in Civ6 (WinID 2) [181554]
...
[Sun Dec 15 00:06:57 2024] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:849ffefc, in Civ6 (WinID 2) [181554]
[Sun Dec 15 00:06:57 2024] i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
[Sun Dec 15 00:06:57 2024] i915 0000:00:02.0: [drm] GT0: Resetting chip for stopped heartbeat on rcs0
[Sun Dec 15 00:06:58 2024] usb 1-1.2: USB disconnect, device number 51
[Sun Dec 15 00:06:58 2024] i915 0000:00:02.0: [drm] *ERROR* GT0: Failed to reset chip
[Sun Dec 15 00:06:58 2024] i915 0000:00:02.0: [drm] CI tainted: 0x9 by intel_gt_reset+0x31f/0x350 [i915]
[Sun Dec 15 00:06:58 2024] i915 0000:00:02.0: [drm] Civ6 (WinID 2)[181554] context reset due to GPU hang |
|
Hmmm ... I had the same problem with older kernel versions, with my browser running in full screen and playing a 4k movie from youtube ... at that time these kernel command line parameters helped me:
Code: | i915.enable_guc=2 i915.enable_psr=0 |
Might be worth a try? |
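To confirm that the two parameters above were actually applied after a reboot, the values the driver is using can be read back from sysfs. This is a sketch; the function name `show_i915_params` and the variable `I915_PARAM_DIR` are hypothetical, and reading these files usually requires root.

```shell
#!/bin/sh
# Sketch: print the current value of selected i915 module parameters.
# Parameters that cannot be read are reported as unavailable.

# show_i915_params NAME [NAME...]
show_i915_params() {
    dir="${I915_PARAM_DIR:-/sys/module/i915/parameters}"
    for p in "$@"; do
        if [ -r "$dir/$p" ]; then
            echo "$p=$(cat "$dir/$p")"
        else
            echo "$p=<unavailable>"
        fi
    done
}
```

For example, `show_i915_params enable_guc enable_psr`. How the parameters get onto the kernel command line depends on the boot loader; with GRUB they would typically be added to GRUB_CMDLINE_LINUX in /etc/default/grub before regenerating grub.cfg, which is an assumption about this setup.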
|
vm666 n00b
Joined: 24 Oct 2003 Posts: 69
|
|
Nowa Developer
Joined: 25 Jun 2014 Posts: 447 Location: Nijmegen
|
Posted: Sun Dec 15, 2024 7:41 pm Post subject: |
|
|
Not every GPU hang is the same issue; the hang is just a symptom of some bug triggered by some application. I do not see a bug report for Civ6 yet, so it is likely a new issue. Note also that your log reveals a second issue: normally the system should be able to recover, at least partially, from a GPU hang, but on your system we see "Failed to reset chip", which indicates that the system was not able to recover from the hang and get the GPU back into a working state. |