Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Sat Apr 27, 2024 11:22 am Post subject: Gentoo install ISO boot with 'Kernel Panic' [ROOT CAUSED]
Hi all !
When I boot the minimal ISO (install-amd64-minimal-20240421T170413Z.iso), which is built on the 6.6.21-gentoo-x86_64 kernel, it fails with a kernel panic just after udev starts (sometimes later). Here are screenshots:
https://ibb.co/zhpWQZg
https://ibb.co/D5Vy11t
My config is:
Core i7-12700K on an MSI Z690 DDR4 motherboard, 32 GB of DDR4-3200 RAM, 2x Lexar NVMe drives, 1x SATA drive.
I tried all available minimal and admin ISO images. Most of the time they fail with that kernel panic; sometimes they just hang during boot, and occasionally they even boot and everything seems fine. I tried nosmp and also tried booting on a single core (all but one disabled in the BIOS), but it made no difference.
I also tried the Funtoo installation ISO and it boots every time (it uses a 5.x Linux kernel). My Windows 11 Professional installation also works correctly. I just wanted to redo my Gentoo installation from scratch. Previously I installed from an ISO built on kernel 5.5.x and that installation worked really well. Unfortunately I deleted the old ISO - my bad. Is there an archive of old Gentoo install ISOs?
Second question: does anybody have an older minimal ISO image? I remember that when I installed Gentoo on this system with kernel 5.5.x it was really stable.
Thanks in advance _________________ Slawomir Gasiorowski
email: sgasiorowski@gmail.com
Last edited by Sławomir Gąsiorowski on Wed May 01, 2024 7:02 pm; edited 2 times in total

NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54550 Location: 56N 3W
Posted: Sat Apr 27, 2024 12:07 pm
Sławomir Gąsiorowski,
I can host any or all of
Code:
 16K -rw-r--r-- 1 roy  roy    15K May 16  2021 livedvd-amd64-gentoo-nomultilib-20200902.iso.CONTENTS
5.3M -rw-r--r-- 1 roy  roy   5.3M May 16  2021 livedvd-amd64-gentoo-nomultilib-20200902.iso.CONTENTS-squashfs.gz
4.0K -rw-r--r-- 1 roy  roy    973 May 16  2021 livedvd-amd64-gentoo-nomultilib-20200902.iso.DIGESTS
4.2G -rw-r--r-- 1 roy  roy   4.2G May 16  2021 livedvd-amd64-gentoo-nomultilib-20200902.iso
188M -rw-r--r-- 1 root root  188M Sep  5  2021 stage3-amd64-nomultilib-openrc-20210905T170549Z.tar.xz
433M -rw-r--r-- 1 roy  roy   432M Sep  6  2021 install-amd64-minimal-20210829T170531Z.iso
284M -rw-r--r-- 1 roy  users 284M Dec 26  2021 install-alpha-minimal-20210728T195334Z.iso
556M -rw-r--r-- 1 roy  users 556M Dec 28  2021 livecd-alpha-installer-2006.1.iso
4.8G -rw-r--r-- 1 roy  users 4.8G Apr  9  2022 livegui-amd64-20220403T220339Z.iso
454M -rw-r--r-- 1 roy  users 454M Apr 28  2022 install-arm64-minimal-20220424T234808Z.iso
454M -rw-r--r-- 1 roy  users 454M May 22  2022 install-arm64-minimal-20220515T234802Z.iso
3.7G -rw-r--r-- 1 roy  users 3.7G May 23  2023 livegui-amd64-20230101T164658Z.iso
205M -rw-r--r-- 1 roy  users 205M Jul  2  2023 stage3-armv6j-openrc-20230701T201658Z.tar.xz
3.3G -rw-r--r-- 1 roy  users 3.3G Aug 15  2023 livegui-amd64-20230806T163139Z.iso
and more too :)
But before we go there: "Fatal exception in interrupt" suggests a hardware bug, in that it's not dealing with IRQs properly.
You may be able to fix that when you build your own kernel.
Meanwhile there are some kernel command line options you can try.
They are listed on the help screens attached to some of the Fx keys.
pci=nomsi and irqpoll come to mind. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Sat Apr 27, 2024 6:53 pm
Thank you for the response. I tried some kernel options, but none of them helped (irqpoll, noapic, nolapic, nohotplug, nosata, nosmp ...). The installer sporadically boots fine, but mostly it fails. I tried the Debian Live DVD 12.5, which is built on Linux kernel 6.1.x, and it runs great. I don't think it's a hardware problem. I'm going to use the Debian Live USB to install Gentoo and will try different kernel versions and configs. I will post my feedback here.

NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54550 Location: 56N 3W
Posted: Sat Apr 27, 2024 7:00 pm
Sławomir Gąsiorowski,
That sounds like a plan.
You can probably put the Debian kernel under your Gentoo install as a get-U-going measure too.
That's the kernel, initrd and modules, not just the kernel.

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Sun Apr 28, 2024 10:44 am
Hi, I have very interesting feedback. First of all, it turned out that the Debian Live USB I use is built on kernel 6.x, and I was able to reproduce exactly the same kernel error; it is related to the NVMe driver. The problem appears when I start working on a partition, just after mounting it, no matter which filesystem I use. In the end I could reproduce it very often, always either just after mounting an NVMe partition or later during heavy I/O on it.
smartctl shows no errors and the disk seems to be in good condition. I also ran an extended SMART self-test, which found no errors either. I decided to test under Windows 11: I created an NTFS partition and was able to fill it to 100% with data, then copy, delete and fill it again. No problems observed.
In my opinion Linux 6.x has some kind of regression in the NVMe driver, perhaps specific to the model I use: Lexar NM620 512GB 2280 PCI-E 3.0. @NeddySeagoon, can you share a Gentoo install ISO from 2022/2023 that contains Linux 5.x? It's a pity that Gentoo doesn't host older installation ISOs...

NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54550 Location: 56N 3W
Posted: Sun Apr 28, 2024 11:07 am
Sławomir Gąsiorowski,
Help yourself.

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Sun Apr 28, 2024 12:55 pm
I have another finding and have probably root-caused this issue. I tried an old Funtoo install ISO with a 5.18.x kernel and reproduced the problem again !!! So the regression was definitely introduced by something else. I bet it was a BIOS update, so I downgraded the BIOS to a very old version from Feb 2022, and it looks like that was it !!! So far I haven't observed any problems. Before the BIOS downgrade the problem (hangs and kernel crashes) appeared even during partition mount or stage file unpacking.
Thanks for the link, but first I will try to complete my installation using the latest ISO installer with the downgraded BIOS.

NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54550 Location: 56N 3W
Posted: Sun Apr 28, 2024 2:16 pm
Sławomir Gąsiorowski,
That's a good idea. That server has a lot of old stuff on it, including all my distfiles back to mid 2006.

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Sun Apr 28, 2024 9:37 pm
Hi !
I can officially confirm: the kernel crashes were caused by one of the BIOS updates. I performed a successful Gentoo install using the old BIOS 7D25v12 (E7D25IMS.120) from Feb 2022. So far I don't know which BIOS update introduced this regression, but I will find it. It looks like a setup like mine was not tested enough by MSI, or was never validated under Linux.
This is definitely not a defect in the Linux kernel. The defect is in the BIOS and should be fixed by MSI. I will try to report it to them with better evidence than one screenshot. Be careful when updating the BIOS if you have a setup like mine (MSI PRO Z690-A DDR4 with a Core i7-12700K and a Lexar NM620 NVMe disk) and want to use Linux.
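For a vendor report it helps to record the exact board and firmware identification next to the kernel version. What follows is only a minimal sketch, assuming the standard DMI export under /sys/class/dmi/id; the function name firmware_report and its optional sysfs-root argument are made up for illustration and are not part of any tool.

```shell
# Print board and firmware identification from the kernel's DMI export.
# The optional argument (an alternative sysfs root) is an assumption added
# so the function can be tried against a fake tree; it defaults to /sys.
firmware_report() {
    dmi="${1:-/sys}/class/dmi/id"
    for field in board_vendor board_name bios_vendor bios_version bios_date; do
        # Skip fields this firmware does not expose.
        [ -r "$dmi/$field" ] || continue
        printf '%s: %s\n' "$field" "$(cat "$dmi/$field")"
    done
    # The running kernel version belongs in the same report.
    printf 'kernel: %s\n' "$(uname -r)"
}
```

Calling `firmware_report` once on the old BIOS and once after each update gives a small log that can be attached to the report.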
Please mark this thread as [Solved]

NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54550 Location: 56N 3W
Posted: Mon Apr 29, 2024 10:03 am
Sławomir Gąsiorowski,
It's the "Does it boot Windows? Ship it!" school of BIOS quality control.

Hu Administrator
Joined: 06 Mar 2007 Posts: 22447
Posted: Mon Apr 29, 2024 2:51 pm
Sławomir Gąsiorowski wrote: "Please mark this thread as [Solved]"
You can (and usually should) do this yourself, by editing the title of the opening post.

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Mon Apr 29, 2024 4:35 pm Post subject: Gentoo install ISO not boot with 'Kernel Panic' [ROOTCAUSED]
I have more info from my testing. I narrowed the problem down to my 512GB drive (Lexar NM620 512GB M.2 2280 PCI-E x4 Gen3 NVMe). I have a second NM620, but the 1TB model. In theory these are the same drive in different capacities, but HwInfo shows that they use different controllers and different firmware versions. Unfortunately Lexar has not provided any firmware update for the NM620 series.
The failing one is: Lexar NM620 512GB M.2 2280 PCI-E x4 Gen3 NVMe (LNM620X512G-RNNNG) -> Innogrit Shasta+ IG5216 PCIe 3.0 x4 NVMe 1.4 4-channel SSD Controller (Original Device Name: Shenzhen Longsys Electronics, Device ID: 5216)
The passing one is: Lexar NM620 1TB M.2 2280 PCI-E x4 Gen3 NVMe (LNM620X001T-RNNNG) -> Shenzhen Longsys Electronics, Device ID: 1D97 (Original Device Name is the same)
When I unplug the 512GB drive, the Gentoo installer ISO boots from USB without any problem, and in the kernel log I see that it detects the 1TB Lexar NVMe. When the 512GB NVMe is present, the kernel can crash immediately while drivers are loading. I will check the Linux kernel sources; maybe there are some "magic" knobs for that NVMe controller ?
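The same controller comparison can be made from Linux instead of HwInfo. This is a minimal sketch, assuming PCIe-attached drives and the usual sysfs layout under /sys/class/nvme; the function name list_nvme_controllers and its optional sysfs-root argument are hypothetical helpers, not an existing tool.

```shell
# List each NVMe controller with its model, firmware revision and PCI IDs.
# The optional argument (an alternative sysfs root) is an assumption added
# so the function can be tried against a fake tree; it defaults to /sys.
list_nvme_controllers() {
    sysfs_root="${1:-/sys}"
    for ctrl in "$sysfs_root"/class/nvme/nvme*; do
        # An unmatched glob stays literal and is not a directory: skip it.
        [ -d "$ctrl" ] || continue
        name=$(basename "$ctrl")
        # model and firmware_rev are space-padded; trim the trailing blanks.
        model=$(sed 's/[[:space:]]*$//' "$ctrl/model" 2>/dev/null)
        fw=$(sed 's/[[:space:]]*$//' "$ctrl/firmware_rev" 2>/dev/null)
        # device/ points at the PCI function; vendor and device hold hex IDs.
        vendor=$(cat "$ctrl/device/vendor" 2>/dev/null)
        device=$(cat "$ctrl/device/device" 2>/dev/null)
        printf '%s: model="%s" fw="%s" pci=%s:%s\n' \
            "$name" "$model" "$fw" "${vendor#0x}" "${device#0x}"
    done
}
```

Running `list_nvme_controllers` with both drives installed should print one line per controller, which makes the differing device IDs and firmware revisions easy to quote in a bug report.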
Anyway, I'm setting this thread to [ROOTCAUSED]. I will update it when I find anything interesting.

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Wed May 01, 2024 5:49 pm
Hi all !
I have some very interesting results from my investigation. I was close to giving up when I decided to just try the Gentoo binary kernel... So I installed one and IT WORKS !!!
I tried two stable gentoo-kernel-bin versions:
1. gentoo-kernel-bin-6.1.87 -> PASS
2. gentoo-kernel-bin-6.6.28 -> PASS
So I took the config from the 6.6.28 binary kernel (/proc/config.gz) and reused it with gentoo-sources-6.6.21, and it produces a kernel that boots and is stable. After that I ran make mrproper, started from x86_64_defconfig in gentoo-sources-6.6.21, and after a little tuning (building NVMe and filesystem support into the kernel image) that also gives me a fully functional system !!!
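The config reuse described above can be sketched roughly as follows. It is an illustration under stated assumptions, not necessarily the exact commands used here: the function name seed_kernel_config and the /usr/src/linux default are hypothetical, and /proc/config.gz only exists when the running kernel was built with CONFIG_IKCONFIG_PROC.

```shell
# Seed a kernel source tree's .config from a known-good configuration.
# Both defaults are illustrative assumptions: the config exported by the
# running kernel and the conventional source tree symlink.
seed_kernel_config() {
    src="${1:-/proc/config.gz}"
    tree="${2:-/usr/src/linux}"
    case "$src" in
        # Compressed config (e.g. /proc/config.gz): decompress into place.
        *.gz) gzip -dc "$src" > "$tree/.config" ;;
        # Plain-text config (e.g. a saved copy): copy it as-is.
        *)    cp "$src" "$tree/.config" ;;
    esac
}

# Typical use, as root, while booted into the working binary kernel:
#   seed_kernel_config /proc/config.gz /usr/src/linux
#   cd /usr/src/linux && make olddefconfig && make -j"$(nproc)"
```

`make olddefconfig` fills in defaults for any options that differ between the two kernel versions, so the 6.6.28 config can be applied to the 6.6.21 sources without interactive prompts.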
So finally I have Gentoo with a manually compiled kernel from the 6.6.21 sources, fully functional with the latest MSI BIOS for my MSI Z690 motherboard.
I also had a fruitful discussion with a friend who works for Solidigm (previously for Intel, like me). He analyzed my kernel crash and pointed out that it looks like a timeout in RCU (Read-Copy-Update); they are familiar with such errors. Most probably the real root cause is that the Lexar NVMe controller or its firmware does not meet the specification, and after the BIOS update and the RCU code refactoring in the Linux kernel, NVMe devices that don't meet the specification may crash...
The last stage of my investigation is to find which kernel option causes the problem. Probably we are hitting a race condition at the boundary between the NVMe device and the kernel's handshake logic. When I find something interesting I will update this thread.
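One way to narrow down the responsible option is to diff a working config against a failing one and bisect the differences. The helper below is a minimal sketch; the name kconfig_diff is hypothetical, and it assumes both files are plain-text kernel configs.

```shell
# Print every option whose value differs between two kernel configs.
# Lines prefixed "-" exist only in the first file, "+" only in the second.
kconfig_diff() {
    a_sorted=$(mktemp) && b_sorted=$(mktemp) || return 1
    # Keep set options and "# CONFIG_X is not set" lines; drop other comments.
    grep -E '^(CONFIG_|# CONFIG_)' "$1" | LC_ALL=C sort > "$a_sorted"
    grep -E '^(CONFIG_|# CONFIG_)' "$2" | LC_ALL=C sort > "$b_sorted"
    # comm needs sorted input; -23 keeps lines unique to the first file,
    # -13 keeps lines unique to the second.
    LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | sed 's/^/-/'
    LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | sed 's/^/+/'
    rm -f "$a_sorted" "$b_sorted"
}

# Example with hypothetical file names:
#   kconfig_diff working.config failing.config | less
```

The kernel source tree also ships scripts/diffconfig for the same purpose, with more compact output.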

NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54550 Location: 56N 3W
Posted: Wed May 01, 2024 7:14 pm
Sławomir Gąsiorowski,
Try a faulty kernel with
Code:
rcutree.use_softirq=0
on the kernel command line.
I have an arm64 server that exhibits the RCU problem you describe.
Nothing after kernel 5.15.x will boot without it. At least, nothing I've tried.
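After booting with the extra parameter, it is worth confirming that it actually reached the kernel before drawing conclusions from the test. A minimal sketch, assuming the command line is read from /proc/cmdline; the function name has_kernel_param and its optional file argument are made up for illustration.

```shell
# Succeed if the given parameter appears on the kernel command line.
# The second argument (file to read) is an assumption added so the function
# can be tried on a sample file; it defaults to /proc/cmdline.
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Put each whitespace-separated word on its own line, then look for an
    # exact, whole-word match of the parameter.
    tr -s '[:space:]' '\n' < "$cmdline_file" | grep -Fxq -- "$param"
}

# Example:
#   has_kernel_param rcutree.use_softirq=0 && echo "parameter is active"
```

The exact-match test avoids false positives from parameters that merely share a prefix.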

Sławomir Gąsiorowski n00b
Joined: 21 Jul 2004 Posts: 50 Location: Poland
Posted: Thu May 02, 2024 10:48 am
I tried this option and got a bunch of other, different errors instead. So this is not a solution in my case.