RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Sun May 29, 2022 11:12 pm Post subject: Problem shutting down, crashing boot sector |
|
|
My system has a new issue which cropped up in the last six weeks or so.
When I shut down or reboot (for example after an emerge @world), the system sometimes hangs. It doesn't hang every time, and I have been trying to figure out why; as I said above, it did not have this issue before.
And, worst of all, sometimes when it hangs, if I press the reset button to get it to reboot, the NVMe (bootsector?) is corrupted and the system won't boot anymore. I have to boot a sysrescue USB key, mount my root file system, chroot, mount /boot and then rerun grub to install the boot sector and then it will boot.
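A rough sketch of that recovery sequence is below. The device names, the /mnt/gentoo mount point, and the BIOS-style grub-install call are assumptions for illustration only (the post does not give them), and the bind mounts are the usual handbook-style chroot preparation rather than something stated above. The function only prints the commands so they can be checked before anything is run.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery commands instead of running them.
# All device names below are hypothetical defaults; substitute your own.
print_recovery_steps() {
    root_dev=${1:-/dev/nvme0n1p3}   # assumed root partition
    boot_dev=${2:-/dev/nvme0n1p1}   # assumed /boot partition
    disk=${3:-/dev/nvme0n1}         # assumed disk that holds the boot sector
    target=/mnt/gentoo              # assumed mount point on the rescue system

    cat <<EOF
mount $root_dev $target
mount --types proc /proc $target/proc
mount --rbind /sys $target/sys
mount --rbind /dev $target/dev
chroot $target /bin/bash -c 'mount $boot_dev /boot && grub-install $disk'
EOF
}

print_recovery_steps "$@"
```

Run without arguments it prints the assumed defaults; pass root partition, boot partition, and disk as three arguments to see the commands for a different layout.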
The issue is not 100% reproducible. At first I suspected a hardware failure, but I replaced the NVMe with a different brand and it failed exactly the same way.
I then thought it was a motherboard issue and updated my BIOS. That was a huge mistake, as the new BIOS would fail to warm boot every time. I've since rolled it back.
Clues:
1. If I switch to console 1, log in as root and type 'restart', the system still hangs. It hangs with the following last two messages:
Code: | * Stopping NFS daemon
* start-stop-daemon: 1 process refused to stop
|
And it stays there. I can't log in or do anything with the system, and the NVMe is clearly being accessed because the drive activity LED is blinking.
If I press the power button, I get this:
Code: | * openrc: caught SIGTERM, aborting
* nfs: caught SIGTERM, aborting
INIT: version 3.01 reloading
* Deactivating swap devices ...
* Stopping sshd ...
* Stopping NFS mountd ...
* start-stop-daemon: no matching processes found
* Stopping NFS daemon
* start-stop-daemon: 1 process refused to stop |
and it stays there hung with nothing left to do but press reset or hold the power button down for six seconds.
After that, the boot sector might be corrupted.
After getting it to reboot, fsck runs on all the filesystems and they are all fine because they are ext4.
What is this BS about a process refusing to stop? Is the kernel not the boss anymore? I mean kill -9 the process and let the shutdown proceed or let me login and fix it. This feels like openRC could throw me a bone if it behaved differently.
I have tried going to an older kernel: I'm running 5.17.1-r1 (gentoo-sources) right now and will reboot it in a bit to see if the problem persists. The trick is that I have to do it more than once, because sometimes everything is okay.
I have a laptop that has server NFS folders mounted with files open, but I have closed those and unmounted and still experienced the issue.
I have also deliberately left everything connected and open and had it reboot without problem.
It may happen on the second reboot while the files are all open or something strange like that.
If I can get it to fail again with 5.17.1-r1, I will go back to 5.15.latest to see if it has the issue.
I'm wondering if anyone has any other idea what to check.
Thanks much for reading.
Update: 5.15.39 also had problems shutting down. It is currently in a state where it can't shut down properly. I suspect if I reboot my laptop it will work. I'm getting it back to the latest 5.17.7 and will see if rebooting the laptop fixes the server's NFS issues.
Update 2: shutting down the laptop that mounts the NFS shares did not fix the issue.
If it really is the NFS server that's failing, I need to roll that back, or perhaps OpenRC... I have been getting pretty good at timing the reset button press so the drive doesn't get corrupted...
Update 3: I went through emerge.log and found a likely candidate for the upgrade that started the problem.
nfs-utils-2.5.4-r4 upgrade to nfs-utils-2.6.1
So I downgraded it and so far I cannot get the shutdown to hang. I'll test it again now (it's been up for a few minutes) and if it boots okay this time, I'll check it again after 24 hours because it seems like time might make a difference.
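Searching emerge.log for merges that completed around the time the hangs started can be sketched as below. It assumes the usual log location /var/log/emerge.log and the standard "::: completed emerge" line format; the function name and the 42-day window are only illustrative.

```shell
#!/bin/sh
# List packages whose merge completed within the last N days.
# Arguments: log file, number of days, optional "now" as epoch seconds
# (defaults to the current time; passing it makes the function testable).
recent_merges() {
    log=$1; days=$2; now=${3:-$(date +%s)}
    awk -v now="$now" -v days="$days" '
        # Lines look like:
        # "1653850000:  ::: completed emerge (1 of 3) cat/pkg-1.0 to /"
        /::: completed emerge/ {
            ts = $1
            sub(/:$/, "", ts)              # strip the trailing colon
            if (now - ts <= days * 86400)  # keep entries inside the window
                print ts, $(NF - 2)        # epoch timestamp and package atom
        }' "$log"
}

# Example: recent_merges /var/log/emerge.log 42
```

The output is one line per completed merge, epoch timestamp first, so it can be sorted or narrowed further with grep.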
If it does work, I'll file a bug against nfs-utils-2.6.1.
_________________
Some day there will only be free software.
|
alamahant Advocate
Joined: 23 Mar 2019 Posts: 3918
|
Posted: Mon May 30, 2022 4:59 pm Post subject: |
|
|
Hi
Two thoughts.
1. What happens when you stop nfs first and then reboot?
2. To avoid hard resets, add this to the kernel command line:
Code: |
sysrq_always_enabled=1
|
so the next time it hangs, hold Alt+PrtSc and type r, e, i, s, u, b in sequence to reboot cleanly.
Try
Code: |
lsof | grep "/path/to/nfs-share"
|
to check what might be preventing the system from rebooting.
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Mon May 30, 2022 9:51 pm Post subject: |
|
|
alamahant wrote: | Hi
Two thoughts.
1. What happens when you stop nfs first and then reboot?
2. To avoid hard resets, add this to the kernel command line:
Code: |
sysrq_always_enabled=1
|
so the next time it hangs, hold Alt+PrtSc and type r, e, i, s, u, b in sequence to reboot cleanly.
Try
Code: |
lsof | grep "/path/to/nfs-share"
|
to check what might be preventing the system from rebooting. |
I have tried unmounting all the nfs partitions from the other machine, but I haven't tried shutting down nfs server by hand. I'll do that as well as your idea.
Thanks much. The older version of NFS did not fix the problem.
I tried rebooting and it hung as before.
I have gotten pretty good at pressing reset at a time when the nvme is not affected. The trick is to notice the pattern on the HD activity LED.
I'll try to enable sysrq_always_enabled
I found a doc about how to use sysrq. I'll give it a shot next time.
Last edited by RayDude on Mon May 30, 2022 10:03 pm; edited 1 time in total |
|
alamahant Advocate
Joined: 23 Mar 2019 Posts: 3918
|
Posted: Mon May 30, 2022 10:01 pm Post subject: |
|
|
Quote: |
What is the reisub key? Never heard of it.
|
It's not a key. It's the characters
r+e+i+s+u+b
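For anyone else who has not met the sequence before, here is a small sketch that spells out what each letter asks the kernel to do, following the kernel's SysRq documentation. The function name is made up for illustration; the script only prints text and does not trigger anything.

```shell
#!/bin/sh
# Print what each letter of the REISUB sequence asks the kernel to do.
# Purely descriptive: nothing is written to /proc/sysrq-trigger here.
sysrq_action() {
    case $1 in
        r) echo "take the keyboard out of raw mode" ;;
        e) echo "send SIGTERM to all processes except init" ;;
        i) echo "send SIGKILL to all processes except init" ;;
        s) echo "sync all mounted filesystems" ;;
        u) echo "remount all filesystems read-only" ;;
        b) echo "reboot immediately, without syncing or unmounting" ;;
        *) echo "not part of REISUB"; return 1 ;;
    esac
}

for key in r e i s u b; do
    printf '%s: %s\n' "$key" "$(sysrq_action "$key")"
done
```

The s and u steps are the ones that matter for the filesystems: they flush pending writes and remount read-only before b resets the machine, so it helps to pause a second or two between keys.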
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Mon May 30, 2022 10:04 pm Post subject: |
|
|
Thanks. I just figured that out by reading docs. I'll give it a shot.
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Tue May 31, 2022 12:53 am Post subject: |
|
|
I was able to reproduce the failure from the command line:
Code: | server ~ # /etc/init.d/nfs stop
* Stopping NFS mountd ... [ ok ]
* Stopping NFS daemon ...
* start-stop-daemon: 1 process refused to stop [ !! ] |
The command line is currently hung.
lsof shows that one of the folders that I'm sharing with nfs is also shared with samba:
Code: | server ~ # lsof /mnt/raid6/server
lsof: WARNING: can't stat() fuse.portal file system /run/user/1002/doc
Output information may be incomplete.
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
smbd 8354 root 15u DIR 9,127 4096 81461249 /mnt/raid6/server |
I don't know if this helps, but this was in dmesg:
Code: | [10756.668232] lockd: couldn't shutdown host module for net f0000000! |
Any ideas?
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Tue May 31, 2022 1:41 am Post subject: |
|
|
Well. The rpc.nfsd 0 command hangs and the script deadlocks.
In fact, if there are mounts on the NFS folders, the system hangs. Right now my system is partially hung: many rpc.nfsd 0 commands are attempting to run and none of them can be killed with kill -9.
In fact, I can't kill -9 nfsd either.
The process won't quit. Which seems fundamentally wrong to me. No process should refuse to shut down when the boss says "go away."
Any ideas as to how to fix this? Should I stop using NFS? It used to work...
I can't have my server crashing when it needs a reboot, especially since the NVMes are so sensitive.
|
Hu Administrator
Joined: 06 Mar 2007 Posts: 22666
|
Posted: Tue May 31, 2022 11:27 am Post subject: |
|
|
If a process is blocked in certain ways in the kernel, then the kernel will defer delivery of the SIGKILL. That SIGKILL is pending, and will be processed when the process exits the locked kernel section. No further userspace code from that process will run. The process will die of the SIGKILL instead of resuming userspace execution. However, depending on why the process is blocked in the kernel, it may remain there for a very long time. You need to find why the process is blocked in the kernel, and fix that. |
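One way to start looking, as a hedged sketch: processes stuck this way show up in state D (uninterruptible sleep) in ps, and the wchan column hints at the kernel function they are sleeping in. The filter below assumes procps-style ps output with the columns pid, stat, wchan, comm; the function name is hypothetical.

```shell
#!/bin/sh
# Filter "ps" output down to tasks in uninterruptible sleep (state D).
# Expects lines of: PID STAT WCHAN COMMAND, with a header line first.
blocked_tasks() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3, $4 }'
}

# Typical use on a live system (procps ps):
#   ps -eo pid,stat,wchan:32,comm | blocked_tasks
# For a single PID, the kernel-side stack can often be read as root,
# if the kernel was built to expose it:
#   cat /proc/<pid>/stack
```

With SysRq enabled, writing w to /proc/sysrq-trigger as root also asks the kernel to dump all blocked tasks to dmesg, which can be captured before the machine is reset.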
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Tue May 31, 2022 2:53 pm Post subject: |
|
|
Hu wrote: | If a process is blocked in certain ways in the kernel, then the kernel will defer delivery of the SIGKILL. That SIGKILL is pending, and will be processed when the process exits the locked kernel section. No further userspace code from that process will run. The process will die of the SIGKILL instead of resuming userspace execution. However, depending on why the process is blocked in the kernel, it may remain there for a very long time. You need to find why the process is blocked in the kernel, and fix that. |
Thanks Hu.
That is way above my skill level, I suspect.
I used the magic SysRq keys last night to get the PC to shut down and you know what? The NVMe boot sector crashed. I had to boot from the USB key to fix it.
This is the least stable this system has been since I was running a Ryzen 5 1600 with bad silicon and no kernel workarounds set.
I'll see if I can figure out how / why the process is blocked.
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Wed Jun 01, 2022 6:55 am Post subject: |
|
|
With respect to NFS, I found this article: https://www.suse.com/support/kb/doc/?id=000019722
Which implies that the only way to fix this is to reboot the client before rebooting the server. This is a PITA.
I'll keep digging.
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Wed Jun 01, 2022 8:06 am Post subject: |
|
|
I determined that if I shut off the machine running the NFS client, the server shuts down with no problem.
My company uses nfs in production for logging tests. I had a heck of a time getting it to be reliable until I disabled NFSv4.
I just disabled NFSv4 on my server and removed the nfsv4 USE flag. I'll test that tomorrow to see if it hangs.
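To confirm which NFS versions the running server actually offers, the kernel exposes a list in /proc/fs/nfsd/versions while nfsd is running, with a + or - prefix per version. A minimal sketch for reading it follows; the function name is hypothetical and the sample line in the comment is only an example of the format.

```shell
#!/bin/sh
# Print the NFS versions marked as enabled ("+") in a versions line.
# Example of the input format: "-2 +3 +4 +4.1 +4.2"
enabled_nfs_versions() {
    # One token per line, then keep only the "+" entries without the sign.
    tr ' ' '\n' | sed -n 's/^+//p'
}

# On the server, while nfsd is running:
#   enabled_nfs_versions < /proc/fs/nfsd/versions
```

If any 4.x entry still appears in the output after the change, NFSv4 is still being served and the test would not be measuring what it is meant to.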
|
RayDude Advocate
Joined: 29 May 2004 Posts: 2078 Location: San Jose, CA
|
Posted: Sun Jun 05, 2022 6:40 am Post subject: |
|
|
Disabling NFSv4 did not help.
|