c00l.wave Apprentice
Joined: 24 Aug 2003    Posts: 268
Posted: Thu Apr 13, 2023 11:09 am    Post subject: kernel panic triggered by certain ebuilds (MCE broadcast)
|
|
Compiling certain ebuilds can reliably trigger kernel panics on my machine. I can observe the following kernel messages through netconsole:
Code:
mce: CPUs not responding to MCE broadcast (may include false positives): 0,4
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: disabled
The message always blames CPUs 0 and 4.
The issue is most likely to be triggered by net-fs/samba, but it has also just happened on dev-python/PyQt5. It's reproducible in about 80% of unattended compilations [edit: only on the problematic ebuilds]. It seems less likely to occur if I'm doing something else on the (desktop) machine while one of those ebuilds is compiling, and also less likely (but not impossible) when retrying right after a reboot.

It also seems to be unrelated to cooling: the system does not get unusually hot while compiling these particular packages, and the panic occurs regardless of room temperature or whether additional fans are disabled or running at high RPM. Apart from issues during compilation of those few affected ebuilds, the system is completely stable under various kinds of workloads (incl. sustained full load across all hardware components for multiple hours).
I'm not sure when that problem started but it has been present for at least a year.
Has anyone heard of this issue, or can confirm/deny that it's a general problem on i7-4790K CPUs with recent microcode updates? Could it be caused by a bad option in my kernel config? The message suggests a possible workaround might be to increase the timeout, but I would first have to know which setting that actually refers to.
memtest86 has been run several times, incl. directly after one of those panics happened, but did not report any errors.
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &

Last edited by c00l.wave on Thu Apr 13, 2023 12:38 pm; edited 1 time in total
logrusx Advocate
Joined: 22 Feb 2018    Posts: 2695
Posted: Thu Apr 13, 2023 12:28 pm    Post subject:
|
|
To be sure, you need to run at least 4 passes of memtest, which can take quite a lot of time. However, in my experience memory errors usually just hang the system because something crashes. MCEs are raised by the hardware itself, and that sounds like bad news. There should be a way to decode those errors, but I'm not knowledgeable enough to point you in the right direction - maybe app-admin/mcelog.
What I'd suggest is to make sure temperatures are within limits and all hardware connections are stable. Also, if this is a desktop PC, inspect the motherboard carefully for faulty electrolytic capacitors - the big cylindrical ones with a metal shell and a cross scored into the top. When one of those goes bad it swells; the cross makes the swelling easy to see, and the shell vents along it if the capacitor is badly swollen.
Best Regards,
Georgi
c00l.wave Apprentice
Joined: 24 Aug 2003    Posts: 268
Posted: Thu Apr 13, 2023 12:51 pm    Post subject:
|
|
I've inspected the mainboard before; no visible issues with the capacitors. I noticed I might have been a bit unclear with my "80%" comment - that rate applies only to the affected ebuilds, everything else compiles fine. I can run large system updates for an entire day and, unless one of the affected ebuilds is in between, everything will be fine.

The only excessive temperature is reported for the CPU package (hitting 100°C under full load, at which point the CPU tends to go into thermal throttling), but it has been like that ever since I built this PC and has never caused any issues before. The MCE kernel panic has also occurred right after dusting off the entire machine, so I doubt it can be related to thermal issues. The other system components report relatively low temperatures during compilation compared to full system load, which would also involve the GPU.

It feels more like some special sequence of CPU instructions during compilation triggers a bug in the CPU while under load - or a kernel bug - but then it should probably be occurring on other people's systems as well...?
mcelog is installed but doesn't log anything useful in recent versions. I remember that in the past I saw occasional cache memory errors reported by the CPU, but those messages disappeared after some microcode update, so that was probably an unrelated false positive or got fixed through microcode.
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &
Goverp Advocate
Joined: 07 Mar 2007    Posts: 2202
Posted: Fri Apr 14, 2023 7:55 am    Post subject:
|
|
Not directly relevant, but I recently had a kernel problem that manifested as MCE entries in dmesg and turned out to be a bug in a WiFi driver. The driver was putting the system into an illegal state, so the kernel reported a hardware error. My point being that you might be right to suspect samba or something else in software rather than hardware, though usually I'd bet on the latter.

I guess as it's over a year old, strategies such as bisection won't work. It should be possible to get useful diagnostic info from the MCE occurrence, provided you have the relevant software compiled with symbol tables and maybe debugging support - but of course making those changes might change the code enough to "cure" the problem. I've no experience with trying to debug an MCE event though...
_________________
Greybeard
rab0171610 Guru
Joined: 24 Dec 2022    Posts: 472
Posted: Fri Apr 14, 2023 12:13 pm    Post subject:
|
|
If you have time on your hands and want to do some digging, I'd look into:
app-admin/mcelog
https://github.com/andikleen/mcelog
and
app-admin/rasdaemon
https://github.com/mchehab/rasdaemon
I don't have any practical experience in this area. The only time I had MCE errors with reboots was when a memory stick was going bad. I took the sticks out one at a time and by process of elimination was quickly able to determine which module needed to be replaced.
Goverp Advocate
Joined: 07 Mar 2007    Posts: 2202
Posted: Fri Apr 14, 2023 3:44 pm    Post subject:
|
|
I'd give rasdaemon a miss; it's a program that collects ECC errors from memory (and possibly the PCI bus) and stores them in an SQL database, but as far as I can see that's not what's being reported here.
(I've tried rasdaemon now that I have a desktop machine with ECC memory. It's pretty horrible, as it demands the DEBUG_FS filesystem. I plan to rewrite it as a POSIX shell script - all it needs to do is scan a bit of the /sys tree.)
_________________
Greybeard
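For what it's worth, the shell rewrite I have in mind would be little more than the sketch below. The `edac_report` name is just my own; the `ce_count`/`ue_count` files are the standard corrected/uncorrected error counters the EDAC drivers expose per memory controller under /sys. The sysfs root is taken as an optional argument so the logic can be exercised against a fake directory tree:

```shell
#!/bin/sh
# Sketch: report corrected/uncorrected ECC error counts straight from
# the EDAC sysfs tree - no daemon, no database.
# The real tree lives at /sys/devices/system/edac/mc; an alternative
# root can be passed as $1 (useful for testing).

edac_report() {
    root=${1:-/sys/devices/system/edac/mc}
    if [ ! -d "$root" ]; then
        echo "no EDAC memory controllers under $root"
        return 1
    fi
    for mc in "$root"/mc*; do
        [ -d "$mc" ] || continue
        # per-controller totals; missing attributes count as 0
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo 0)
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo 0)
        printf '%s: corrected=%s uncorrected=%s\n' "$(basename "$mc")" "$ce" "$ue"
    done
}
```

Run `edac_report` with no argument on the real system; a non-zero "uncorrected" count is the one to worry about.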
kolAflash n00b
Joined: 16 Jun 2023    Posts: 1
c00l.wave Apprentice
Joined: 24 Aug 2003    Posts: 268
Posted: Sat Oct 14, 2023 9:42 am    Post subject:
|
|
Sorry for the late reply, somehow I did not get notified. I had another look after my system crashed again while compiling Samba. I'm not 100% sure it's really the same issue, but from what I found now (I didn't come across these threads before), it appears to be a CPU/microcode bug in power-state management, in combination with a kernel change, that has been documented and acknowledged on Intel forums for some older CPUs (although mine is older still). As my CPU no longer receives microcode updates, it cannot be fixed that way.
https://askubuntu.com/questions/1222766/sporadic-kernel-panic-not-syncing
https://community.intel.com/t5/Processors/Frequent-crashes-on-i5-11500/td-p/1280709?profile.language=ja
The recommended workaround is to boot with processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll, but since that effectively disables power management through ACPI, the CPU will run "at full power" all the time and never throttle back while idle, which wastes energy and heats the system unnecessarily. It may still be a valid workaround if the whole session is supposed to run at 100% CPU anyway (i.e. reboot with those options to perform a long-running world update).
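For reference, on a GRUB-based install those options can either be typed in once from the GRUB boot-entry editor for a single "compile session" reboot, or made persistent via /etc/default/grub. This is only a sketch - merge the parameters with whatever your line already contains:

```shell
# /etc/default/grub (sketch; keep any parameters you already have)
GRUB_CMDLINE_LINUX_DEFAULT="processor.max_cstate=0 intel_idle.max_cstate=0 idle=poll"
```

followed by regenerating the config with `grub-mkconfig -o /boot/grub/grub.cfg` and rebooting.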
I also seem to have had success simply by switching to the "performance" CPU governor while compiling. That's not really the same thing, so the system may still run into the bug, but it appears to be less likely. It would also explain my earlier observation that I have less trouble compiling the affected ebuilds while I'm actively using the system.
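Concretely, switching the governor boils down to writing to the cpufreq sysfs files. `set_governor` is just a helper name I made up, and the sysfs root is parameterized so the loop can be tested against a fake tree; on the real machine, run it as root with only the governor argument (sys-power/cpupower's `cpupower frequency-set -g performance` does the same thing):

```shell
#!/bin/sh
# Sketch: set every CPU's cpufreq governor, e.g. to "performance"
# before a long compile and back to "powersave"/"schedutil" afterwards.

set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue   # skip CPUs without cpufreq (or run as root)
        echo "$gov" > "$f"
    done
}
```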
It could still be a coincidence and caused by something else, but judging from how well the last two updates have gone since I discovered those threads, it may have been the correct clue.
_________________
nohup nice -n -20 cp /dev/urandom /dev/null &