c00l.wave Apprentice
Joined: 24 Aug 2003    Posts: 268
Posted: Thu Apr 13, 2023 11:09 am    Post subject: kernel panic triggered by certain ebuilds (MCE broadcast)
|
|
Compiling certain ebuilds can reliably trigger kernel panics on my machine. I can observe the following kernel messages through netconsole:
Code:
mce: CPUs not responding to MCE broadcast (may include false positives): 0,4
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: disabled
The message always blames CPUs 0 and 4.
The issue is most likely to be triggered by net-fs/samba, but it has also just happened on dev-python/PyQt5. It's reproducible in about 80% of unattended compilations [edit: only on the problematic ebuilds]. It seems less likely to occur if I'm doing something else on the (desktop) machine while one of those ebuilds is compiling, and also less likely (but not impossible) when retrying right after a reboot.

It also seems to be unrelated to cooling: the system does not get unusually hot while compiling these particular packages, and the panic occurs regardless of room temperature or whether additional fans are disabled or running at high RPM. Apart from issues during compilation of those few affected ebuilds, the system is completely stable under various kinds of workloads (incl. sustained full load across all hardware components for multiple hours).
I'm not sure when that problem started but it has been present for at least a year.
Has anyone heard of this issue, or can confirm/deny that it's a general problem on i7-4790K CPUs with recent microcode updates? Could it be caused by a bad option in my kernel config? The message suggests a possible workaround might be to increase the timeout, but I would first have to know which setting that actually refers to.
memtest86 has been run several times, incl. directly after one of those panics happened, but did not report any errors.
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &

Last edited by c00l.wave on Thu Apr 13, 2023 12:38 pm; edited 1 time in total
logrusx Advocate
Joined: 22 Feb 2018    Posts: 2695
Posted: Thu Apr 13, 2023 12:28 pm    Post subject:
|
|
To be sure, you need to run at least 4 passes of memtest, which can take quite a lot of time. However, in my experience memory errors usually just hang the system because something crashes. MCEs are raised by the hardware itself, and that sounds like bad news. There should be a way to decode those errors, but I'm not knowledgeable enough to point you in the right direction - maybe app-admin/mcelog.
What I'd suggest is to make sure temperatures are within limits and all hardware connections are stable. Also, if this is a desktop PC, inspect the motherboard carefully for faulty electrolytic capacitors - the big cylindrical ones with a metal shell and a cross scored into the top. When one of those goes bad it swells; the cross makes the swelling easy to see, and the shell vents along it if the capacitor is badly swollen.
Best Regards,
Georgi
c00l.wave Apprentice
Joined: 24 Aug 2003    Posts: 268
Posted: Thu Apr 13, 2023 12:51 pm    Post subject:
|
|
I've inspected the mainboard before; no visible issues with the capacitors. I noticed I might have been a bit unclear with my "80%" comment - that rate applies only to the affected ebuilds, everything else compiles fine. I can run large system updates for an entire day and, unless one of the affected ebuilds is in between, everything will be fine.

The only excessive temperature is reported for the CPU package (hitting 100°C under full load, at which point the CPU tends to go into thermal throttling), but it has been like that ever since I built this PC and has never caused any issues before. The MCE kernel panic has also occurred right after dusting off the entire machine, so I doubt it can be related to thermal issues. The other system components report relatively low temperatures during compilation compared to full system load, which would also involve the GPU.

It feels more like some special sequence of CPU instructions during compilation triggers a bug in the CPU while under load - or a kernel bug - but then it should probably be occurring on other people's systems as well...?
mcelog is installed but doesn't log anything useful in recent versions. I remember that in the past I saw occasional cache memory errors reported by the CPU, but those messages disappeared after some microcode update, so that was probably an unrelated false positive or got fixed through microcode.
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &
Goverp Advocate
Joined: 07 Mar 2007    Posts: 2202
Posted: Fri Apr 14, 2023 7:55 am    Post subject:
|
|
Not directly relevant, but I recently had a kernel problem that manifested as MCE entries in dmesg and turned out to be a bug in a WiFi driver. The driver was putting the system into an illegal state, so the kernel reported a hardware error. My point being that you might be right to suspect samba or something else in software rather than hardware, though usually I'd bet on the latter.

I guess as it's over a year old, strategies such as bisection won't work. It should be possible to get useful diagnostic info from the MCE occurrence, provided you have the relevant software compiled with symbol tables and maybe debugging support - but of course making those changes might change the code enough to "cure" the problem. I've no experience with trying to debug an MCE event though...
_________________
Greybeard
rab0171610 Guru
Joined: 24 Dec 2022    Posts: 472
Posted: Fri Apr 14, 2023 12:13 pm    Post subject:
|
|
If you have time on your hands and want to do some digging, I'd look into:
app-admin/mcelog
https://github.com/andikleen/mcelog
and
app-admin/rasdaemon
https://github.com/mchehab/rasdaemon
I don't have any practical experience in this area. The only time I had MCE errors with reboots was when a memory stick was going bad. I took the sticks out one at a time and by process of elimination was quickly able to determine which module needed to be replaced.
Goverp Advocate
Joined: 07 Mar 2007    Posts: 2202
Posted: Fri Apr 14, 2023 3:44 pm    Post subject:
|
|
I'd give rasdaemon a miss; it's a program that collects ECC errors from memory (and possibly the PCI bus) and stores them in an SQL database, but as far as I can see that's not what's being reported here.
(I've tried rasdaemon now that I have a desktop machine with ECC memory. It's pretty horrible, as it demands the DEBUG_FS filesystem. I plan to rewrite it as a POSIX shell script - all it needs to do is scan a bit of the /sys tree.)
_________________
Greybeard
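For what it's worth, the shell rewrite I have in mind would be little more than the sketch below. The `edac_report` name is just my own; the `ce_count`/`ue_count` files are the standard corrected/uncorrected error counters the EDAC drivers expose per memory controller under /sys. The sysfs root is taken as an optional argument so the logic can be exercised against a fake directory tree:

```shell
#!/bin/sh
# Sketch: report corrected/uncorrected ECC error counts straight from
# the EDAC sysfs tree - no daemon, no database.
# The real tree lives at /sys/devices/system/edac/mc; an alternative
# root can be passed as $1 (useful for testing).

edac_report() {
    root=${1:-/sys/devices/system/edac/mc}
    if [ ! -d "$root" ]; then
        echo "no EDAC memory controllers under $root"
        return 1
    fi
    for mc in "$root"/mc*; do
        [ -d "$mc" ] || continue
        # per-controller totals; missing attributes count as 0
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo 0)
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo 0)
        printf '%s: corrected=%s uncorrected=%s\n' "$(basename "$mc")" "$ce" "$ue"
    done
}
```

Run `edac_report` with no argument on the real system; a non-zero "uncorrected" count is the one to worry about.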
kolAflash n00b
Joined: 16 Jun 2023    Posts: 1
c00l.wave Apprentice
Joined: 24 Aug 2003    Posts: 268
Posted: Sat Oct 14, 2023 9:42 am    Post subject:
|
|
Sorry for the late reply, somehow I did not get notified. I had another look after my system crashed again while compiling Samba. I'm not 100% sure it's really the same issue, but from what I found now (I didn't come across these threads before), it appears to be a CPU/microcode bug in power-state management, in combination with a kernel change, that has been documented and acknowledged on Intel forums for some older CPUs (although mine is older still). As my CPU no longer receives microcode updates, it cannot be fixed that way.
https://askubuntu.com/questions/1222766/sporadic-kernel-panic-not-syncing
https://community.intel.com/t5/Processors/Frequent-crashes-on-i5-11500/td-p/1280709?profile.language=ja
The recommended workaround is to boot with processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll, but since that effectively disables power management through ACPI, the CPU will run "at full power" all the time and never throttle back while idle, which wastes energy and heats the system unnecessarily. It may still be a valid workaround if the whole session is supposed to run at 100% CPU anyway (i.e. reboot with those options to perform a long-running world update).
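For reference, on a GRUB-based install those options can either be typed in once from the GRUB boot-entry editor for a single "compile session" reboot, or made persistent via /etc/default/grub. This is only a sketch - merge the parameters with whatever your line already contains:

```shell
# /etc/default/grub (sketch; keep any parameters you already have)
GRUB_CMDLINE_LINUX_DEFAULT="processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll"
```

followed by regenerating the config with `grub-mkconfig -o /boot/grub/grub.cfg` and rebooting.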
I also seem to have had success simply by switching to the "performance" CPU governor while compiling. That's not really the same thing, so the system may still run into the bug, but it appears to be less likely. It would also explain my earlier observation that I have less trouble compiling the affected ebuilds while I'm actively using the system.
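Concretely, switching the governor boils down to writing to the cpufreq sysfs files. `set_governor` is just a helper name I made up, and the sysfs root is parameterized so the loop can be tested against a fake tree; on the real machine, run it as root with only the governor argument (sys-power/cpupower's `cpupower frequency-set -g performance` does the same thing):

```shell
#!/bin/sh
# Sketch: set every CPU's cpufreq governor, e.g. to "performance"
# before a long compile and back to "powersave"/"schedutil" afterwards.

set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue   # skip CPUs without cpufreq (or run as root)
        echo "$gov" > "$f"
    done
}
```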
It could still be a coincidence and caused by something else, but judging from how well the last two updates have gone since I discovered those threads, it may have been the correct clue.
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &