Gentoo Forums
gentoo-sources-5.6 fail to boot without kernel panic
Hupf
Tux's lil' helper


Joined: 11 Sep 2005
Posts: 112
Location: Germany

Posted: Tue May 05, 2020 6:36 am    Post subject: gentoo-sources-5.6 fail to boot without kernel panic

Hi,
I'm trying to debug why after upgrading from gentoo-sources-5.5.x to 5.6.x, my machine fails to boot. Rollback is easy at the moment, but at some point I'd like to be able to make the switch.
On 5.5.x, my working setup is as detailed below. I've verified that my toolchain should be properly configured by recently upgrading to gentoo-sources-5.5.19 without issues.

Boot stack (working with 5.5):
  • Kernel is manually configured* with embedded firmware (AMD Picasso GPU + CPU microcode), EFI stub, embedded command line and embedded initramfs
  • UEFI loads EFI stub from NVMe
  • CONFIG_CMDLINE omits the root=, init= and initrd= options, nor are they passed via UEFI/efibootmgr (i.e. defaults are used)
  • EFI stub loads embedded initramfs
  • initramfs built using custom initramfs.list including custom /init script, which invokes cryptsetup on the actual root partition (on NVMe)
  • custom /init hands over to systemd
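A minimal sketch for checking that a given .config actually carries the pieces of this boot stack. The mapping from the list above to the symbols EFI_STUB, CMDLINE_BOOL, CMDLINE, INITRAMFS_SOURCE and EXTRA_FIRMWARE is an assumption (the post itself only names CONFIG_CMDLINE), and the function names are hypothetical helpers, not part of any kernel tooling.

```shell
#!/bin/sh
# Print the state of one kernel config symbol from a .config-style file.
# Usage: kconfig_state FILE SYMBOL   (SYMBOL without the CONFIG_ prefix)
kconfig_state() {
    file=$1
    sym=$2
    # A set symbol appears as "CONFIG_FOO=value".
    line=$(grep -E "^CONFIG_${sym}=" "$file" | tail -n 1)
    if [ -n "$line" ]; then
        printf '%s\n' "${line#*=}"
    elif grep -q -E "^# CONFIG_${sym} is not set\$" "$file"; then
        echo n
    else
        echo missing
    fi
}

# Report the symbols assumed to be behind the boot stack described above.
boot_stack_report() {
    for sym in EFI_STUB CMDLINE_BOOL CMDLINE INITRAMFS_SOURCE EXTRA_FIRMWARE; do
        printf '%s=%s\n' "$sym" "$(kconfig_state "$1" "$sym")"
    done
}
```

Running boot_stack_report against both the 5.5 and the 5.6 .config and comparing the two outputs would show whether any of these settings changed silently during the upgrade.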


What I see with 5.5.19:
  • printk's relating to various hardware, e.g. USB keyboard, NVMe partitions, unused SATA slots etc
  • "Freeing unused kernel image [...] memory: xxxK"
  • "Run /init as init process"
  • <output from initramfs's /init script>


Config-wise,

What I see with 5.6.10:
  • printk's relating to various hardware, e.g. USB keyboard, NVMe partitions, unused SATA slots etc
  • no further boot progress, only a blinking cursor after the last printk

So, with 5.6.10, there is no panic/error message (as far as the scrollback goes - right now I don't have hardware for serial console over USB available). The boot process is missing / stopping before the steps "Freeing unused kernel image memory" and "Running /init as init process".
The NVMe device seems to be recognized properly and partitions listed. The embedded initramfs seems to be there just fine - it is built during make and the resulting bzImage is of very similar size to the 5.5.19 one. I'm uncertain if the handover to the initramfs is the problem at all, since then I would expect complaints about the kernel being unable to find/call /init (or the root device). Also I wonder why I'm not seeing the "Freeing unused kernel memory".

I tried scanning the Documentation/, kernel.org and gentoo bugs as well as changelogs/commit histories for anything related to e.g. initramfs, EFI stub and NVMe but couldn't make out changes specific to my problem. I also tried explicitly specifying root=/dev/ram0 init=/init without seeing changes in the output.


* .config for 5.6.10 (not booting)
dr_wulsen
Tux's lil' helper


Joined: 21 Aug 2013
Posts: 146
Location: Austria

Posted: Tue May 05, 2020 7:12 pm

Have you tried setting the kernel config option CONFIG_CONSOLE_LOGLEVEL_DEFAULT to 15?
With some luck, the extra output will help you narrow down the issue.

Of course, you could try to enable any other relevant VERBOSE option in your kernel for fun and see if output becomes more meaningful.
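As a sketch of the "more output" idea: besides the config option, the kernel parameters ignore_loglevel and initcall_debug make the console print everything and trace each initcall as it runs, so the last line on screen before a silent hang tends to name the initcall that got stuck. The helper below is hypothetical; it only builds the command line string, which on a setup with an embedded command line would then have to go into CONFIG_CMDLINE by hand.

```shell
#!/bin/sh
# Append debug parameters to an existing kernel command line string,
# skipping any that are already present.
add_debug_params() {
    cmdline=$1
    for p in ignore_loglevel initcall_debug; do
        case " $cmdline " in
            *" $p "*) ;;                              # already there
            *) cmdline="${cmdline:+$cmdline }$p" ;;   # append with a space
        esac
    done
    printf '%s\n' "$cmdline"
}
```

The output is verbose, so this is probably only worth enabling for the failing boot and removing again afterwards.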
_________________
There's no stupid questions, only stupid answers.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 55015
Location: 56N 3W

Posted: Tue May 05, 2020 7:33 pm

Hupf,

Wild thought. It's all working except the console.

If you can ssh in and read dmesg, you may find that you need a new (extra) firmware file for your GPU.
I have a Polaris11 card and that's bitten me twice since the 4.16 kernel
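If the box does come up on the network, a quick way to test this theory might look like the sketch below. The filter function is a hypothetical helper, and the search terms are an assumption about what a missing-firmware message would contain.

```shell
#!/bin/sh
# Filter a kernel log on stdin for lines that mention firmware or amdgpu.
firmware_lines() {
    grep -i -E 'firmware|amdgpu'
}
```

Something like `ssh user@host dmesg | firmware_lines` (with user@host as a placeholder for the affected machine) would then show only the relevant lines.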
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Hupf
Tux's lil' helper


Joined: 11 Sep 2005
Posts: 112
Location: Germany

Posted: Tue Jun 02, 2020 3:38 pm

Some updates.
After a session of https://wiki.gentoo.org/wiki/Kernel_git-bisect I identified https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=33960acccfbd7f24d443cb3d0312ac28abe62bae as the offending commit.
So I started playing around (on the most recent gentoo-sources:5.7.0) with the kernel config options related to crypto devices and trusted execution environment.
The culprit seems to be CONFIG_CRYPTO_DEV_CCP_CRYPTO.
Before 33960acccfbd7f24d443cb3d0312ac28abe62bae, my system will boot with CONFIG_CRYPTO_DEV_CCP_CRYPTO=y. After the commit (up until 5.7.0), the system will boot with CONFIG_CRYPTO_DEV_CCP_CRYPTO=n or CONFIG_CRYPTO_DEV_CCP_CRYPTO=m, but not with CONFIG_CRYPTO_DEV_CCP_CRYPTO=y. Also, with CONFIG_CRYPTO_DEV_CCP_CRYPTO=m, the module will load just fine after boot (i.e. without halting/crashing the system for example - I haven't tried actually using the functionality).
Here is the working config-5.7.0-gentoo for reference.
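The finding above can be turned into a small check for any .config. This is only a sketch of the observation reported in this post for this particular hardware, not a general rule, and the function name is a hypothetical helper.

```shell
#!/bin/sh
# Warn if a .config builds the CCP crypto driver into the kernel image,
# the one setting reported above as not booting after the bisected commit.
check_ccp_crypto() {
    if grep -q '^CONFIG_CRYPTO_DEV_CCP_CRYPTO=y$' "$1"; then
        echo "built-in (=y): reported not to boot on this hardware"
        return 1
    fi
    echo "module or disabled: reported to boot"
}
```

Inside a kernel source tree, `./scripts/config --module CRYPTO_DEV_CCP_CRYPTO` followed by `make olddefconfig` is one way to switch the option to =m without going through menuconfig.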
Furthermore, as hinted by the commit's diff, I indeed have a
Code:
~ # lspci -nn | grep -i 15df
0c:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor [1022:15df]

on my
Code:
~ # cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 24
model name      : AMD Ryzen 5 3400G with Radeon Vega Graphics
stepping        : 1
microcode       : 0x8108109
cpu MHz         : 1295.399
cache size      : 512 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass


I believe the appropriate next step would be to file a bug report upstream, which will be something new for me to learn :P - or is there anything else I should do/try before that?
Goverp
Advocate


Joined: 07 Mar 2007
Posts: 2216

Posted: Wed Jun 03, 2020 8:08 am

Don't assume filing a bug on bugs.kernel.org will achieve anything. You need to look at /usr/src/MAINTAINERS to see who looks after the bit of code you have a problem with. When I had a problem with the AMDGPU driver, I needed to send a polite explanation of the bug, a git bisect and context info to the mailing list. A patch would be even more successful! They were very quick to fix it then. Before that, I and others had mistakenly put a bug on bugs.kernel.org and waited, and waited....
_________________
Greybeard
Ant P.
Watchman


Joined: 18 Apr 2009
Posts: 6920

Posted: Wed Jun 03, 2020 8:17 am

On my system the ccp driver outright refuses to load, complaining that the BIOS is playing tricks. It's possible yours is similar and the kernel doesn't detect it.
toralf
Developer


Joined: 01 Feb 2004
Posts: 3943
Location: Hamburg

Posted: Wed Jun 03, 2020 8:35 am

You bisected it - good work!
Now just identify the To: and Cc: by doing something like
Code:
$ ./scripts/get_maintainer.pl -f drivers/crypto/ccp/Makefile
Tom Lendacky <thomas.lendacky@amd.com> (supporter:AMD CRYPTOGRAPHIC COPROCESSOR (CCP) DRIVER)
Herbert Xu <herbert@gondor.apana.org.au> (maintainer:CRYPTO API)
"David S. Miller" <davem@davemloft.net> (maintainer:CRYPTO API)
linux-crypto@vger.kernel.org (open list:AMD CRYPTOGRAPHIC COPROCESSOR (CCP) DRIVER)
linux-kernel@vger.kernel.org (open list)
for the affected files and send an email with the details.

FWIW the kernel bugzilla isn't used by kernel folks, they tend to communicate via email still.
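To get from that output to something that can be pasted into a mail client, a small filter along these lines might help. It is a sketch that assumes the output format shown above ("Name <address> (role)" for people, "address (role)" for lists), and the function name is a hypothetical helper.

```shell
#!/bin/sh
# Turn get_maintainer.pl output (on stdin) into a comma-separated
# address list usable in a To:/Cc: header.
maintainer_addresses() {
    # First expression: keep only the address between < and > for people.
    # Second expression: drop the trailing "(role)" from bare list addresses.
    sed -E -e 's/^.*<([^>]+)>.*$/\1/' -e 's/ .*$//' | paste -s -d, -
}
```

Usage would be `./scripts/get_maintainer.pl -f drivers/crypto/ccp/Makefile | maintainer_addresses`, run from the top of the kernel source tree.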