guido-pe n00b

Joined: 10 May 2004 Posts: 74
|
Posted: Wed Sep 30, 2020 11:32 am Post subject: System hangs after some days of operation with kernel 5.4.x |
|
|
I have a small always-on Gentoo system at home that acts as a home server for all kinds of things (mostly as a file server), currently running on the 4.19 line of kernels. Every time I try to upgrade to a kernel from the 5.4.x line, the system ends up hanging after some days of operation (usually less than a week).
These hangs seem to be related to filesystem I/O. When they happen, things that run mostly in RAM and don't require disk access, like responding to pings, DNS queries, or NTP queries, continue to work, but things that require filesystem access, like logging in via SSH or accessing a file via NFS, hang and eventually time out. On the rare occasions when I am connected to the system via SSH when the hang happens, the already open shell continues to work until I try to start a program that needs to be loaded from the filesystem, at which point that process hangs in state D indefinitely.
There is nothing unusual in the system logs or even in dmesg when this happens. It's like the filesystem just randomly decided to stop serving some requests for no apparent reason.
The disk setup is four SATA hard disks, each with one big LUKS partition, combined into one big RAID5 array, with one LVM volume group on top of that and ext4 filesystems on most of the logical volumes; the one exception holds swap.
This freeze happens with all kernels of the 5.4 line from gentoo-sources that I tried, including 5.4.66, but not with any kernels of the 4.19 line or earlier.
Does anybody have any idea what might be going wrong here or how I could debug this further? |
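One way to dig further when the hang strikes is to see which tasks are stuck in state D and what the kernel reports they are waiting on. The following is a minimal Python sketch, not a tested diagnostic tool: it reads /proc/<pid>/stat and /proc/<pid>/wchan directly. The function names are made up for illustration, and it assumes the interpreter was started before the hang, since launching any new program needs the filesystem.

```python
#!/usr/bin/env python3
"""List processes in uninterruptible sleep (state D) by reading /proc directly."""
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST closing parenthesis.
    """
    left, _, right = stat_text.rpartition(")")
    pid_text, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_text), comm, state


def blocked_processes(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every process currently in state D."""
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
            if state != "D":
                continue
            # wchan names the kernel function the task sleeps in, if known.
            with open(os.path.join(base, "wchan")) as fh:
                wchan = fh.read().strip() or "?"
        except OSError:
            continue  # process exited (or access was denied) while we looked
        yield pid, comm, wchan


if __name__ == "__main__":
    for pid, comm, wchan in blocked_processes():
        print(f"{pid:>7}  {comm:<20}  {wchan}")
```

If SysRq is enabled, writing w to /proc/sysrq-trigger as root also asks the kernel to dump all blocked tasks into the kernel log, which needs no external binary once a root shell is open.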
 |
RayDude Advocate


Joined: 29 May 2004 Posts: 2119 Location: San Jose, CA
|
Posted: Wed Sep 30, 2020 5:45 pm Post subject: |
|
|
Can you post your hardware information? CPU, amount of memory, if you are overclocking, motherboard, video card, etc.
Post the output of dmesg, lsmod, lspci, lsusb for the 4.19 kernels and for the 5.4 kernel.
For the heck of it, I'd like to see the output of dmidecode as well.
I'm guessing there is an issue with the kernel configuration.
When you configure the 5.4 kernel, are you copying the .config from 4.19 into its folder and running make, and then checking all the new options to make sure they are configured correctly? Generally I use the defaults because the default config knows what needs to be enabled.
What version of the kernel are you using? gentoo-sources?
Debugging a hang is difficult, but at least you know it's not hardware because it's stable on 4.19. _________________ Some day there will only be free software. |
 |
guido-pe n00b

Joined: 10 May 2004 Posts: 74
|
Posted: Wed Sep 30, 2020 7:48 pm Post subject: |
|
|
RayDude wrote: | Can you post your hardware information? CPU, amount of memory, if you are overclocking, motherboard, video card, etc. |
CPU: Intel Pentium J4205
RAM: 16 GiB
Not overclocking
Mainboard: ASRock J4205-ITX
Video chip: Intel HD 505 (integrated in the CPU)
Quote: | Post the output of dmesg, lsmod, lspci, lsusb for the 4.19 kernels and for the 5.4 kernel.
For the heck of it, I'd like to see the output of dmidecode as well. |
That's a lot of stuff, I don't even know if that will fit in one post...
I will need to reboot the system to get at the values on 5.4.66, I will post those later.
Quote: | I'm guessing there is an issue with the kernel configuration.
When you configure the 5.4 kernel, are you copying the .config from 4.19 into its folder and running make, and then checking all the new options to make sure they are configured correctly? Generally I use the defaults because the default config knows what needs to be enabled. |
I copied .config over and then I ran "make oldconfig".
Quote: | What version of the kernel are you using? gentoo-sources? |
Yes, gentoo-sources. I haven't tried any other yet. |
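To review what "make oldconfig" actually changed between the 4.19 and 5.4 configurations, the two .config files can be compared symbol by symbol. This is a small Python sketch under the assumption that both files use the usual "CONFIG_X=value" and "# CONFIG_X is not set" line formats; the function names and the command-line arguments are illustrative only.

```python
#!/usr/bin/env python3
"""Compare two kernel .config files and report options that differ."""
import re
import sys

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each CONFIG_ symbol to its value; disabled symbols map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old, new):
    """Return sorted (symbol, old_value, new_value) for every difference.

    A symbol missing from one side is reported with None on that side.
    """
    changes = []
    for symbol in sorted(old.keys() | new.keys()):
        before, after = old.get(symbol), new.get(symbol)
        if before != after:
            changes.append((symbol, before, after))
    return changes


# Usage: script.py <old .config> <new .config>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fh:
        old_options = parse_config(fh.read())
    with open(sys.argv[2]) as fh:
        new_options = parse_config(fh.read())
    for symbol, before, after in diff_configs(old_options, new_options):
        print(f"{symbol}: {before} -> {after}")
```

Symbols that show None on the old side are the new options that "make oldconfig" asked about, which are the ones worth double-checking. The kernel source tree also ships a scripts/diffconfig helper for the same job.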
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 55052 Location: 56N 3W
|
Posted: Wed Sep 30, 2020 7:55 pm Post subject: |
|
|
guido-pe,
It puts file contents and command output onto the web for you and returns a link so you can tell us where it is.
The pastes on the web don't last very long but your helpers will quote the interesting bits here.
I do something similar to you but without the LUKS. My file server is my CD/DVD/Blu-ray collection, so I don't need at-rest data protection.
My system has been solid since 2011, barring hardware failures. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
 |
guido-pe n00b

Joined: 10 May 2004 Posts: 74
|
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 55052 Location: 56N 3W
|
Posted: Wed Sep 30, 2020 9:12 pm Post subject: |
|
|
guido-pe,
Very thorough. There is nothing that stands out.
Keep an eye on free and dmesg while kernel 5.4.66 is in use.
If free shows used memory or swap use increasing, you may have found a memory leak.
Eventually, there is very little free RAM and things slow to a crawl. It looks like a lockup, but it's operating normally, just very slowly.
top will show the memory hog, if there is one.
dmesg may show errors and error recovery before it locks up.
You will want to save these things so you can compare them and salvage them after a lockup.
An exercise for the reader.
Create hourly cron jobs to do the work for you, saving dmesg-<timestamp> and free-<timestamp> somewhere in /var _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
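The hourly snapshot idea above could be sketched roughly as follows. This is a minimal Python example, assuming it is started from cron with a log directory as its only argument; the directory, the file naming, and the choice of fields are assumptions made for illustration. It reads /proc/meminfo instead of calling free, and also records Slab, on the assumption that a leak inside the kernel would more likely show up there than in user-space memory.

```python
#!/usr/bin/env python3
"""Save a timestamped snapshot of memory and swap usage, for use from cron."""
import os
import sys
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer value (mostly KiB)}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def summarize(meminfo):
    """Return the numbers worth comparing between snapshots."""
    return {
        "mem_available_kib": meminfo["MemAvailable"],
        "swap_used_kib": meminfo["SwapTotal"] - meminfo["SwapFree"],
        "slab_kib": meminfo.get("Slab", 0),
    }


def write_snapshot(log_dir):
    """Write one free-<timestamp> file into log_dir and return its path."""
    with open("/proc/meminfo") as fh:
        summary = summarize(parse_meminfo(fh.read()))
    os.makedirs(log_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = os.path.join(log_dir, f"free-{stamp}")
    with open(path, "w") as out:
        for key, value in sorted(summary.items()):
            out.write(f"{key} {value}\n")
    return path


# Usage from cron: script.py /var/log/hang-debug   (hypothetical directory)
if __name__ == "__main__" and len(sys.argv) == 2:
    print(write_snapshot(sys.argv[1]))
```

Snapshots taken before a hang still show the trend: steadily falling mem_available_kib or rising swap_used_kib would point toward a leak, while flat numbers would argue against one.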
 |
RayDude Advocate


Joined: 29 May 2004 Posts: 2119 Location: San Jose, CA
|
Posted: Thu Oct 01, 2020 7:21 am Post subject: |
|
|
guido-pe,
dmesg and the rest of the dumps look good. I checked to see if you are doing any special CPU optimization in .config because I thought maybe that might have something to do with it.
I have to ask, are you overclocking?
It's an ITX system, have you checked temperatures?
The only other thing I can suggest is to disable features while running the 5.4 kernel and see if the freezing stops. You might be able to isolate it to a particular piece of hardware.
You should also try a 5.7 or 5.8 kernel to see if it hangs in the same way. You might have found an actual driver bug that was fixed in a subsequent release (although generally they back port those).
As Neddy suggested, logging over serial port is your next step. You need to catch the failure in the act and hope that the kernel is spitting something out that might give you a clue. _________________ Some day there will only be free software. |
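For the temperature check suggested above, the kernel exposes thermal zones under /sys/class/thermal, with values in millidegrees Celsius. The sketch below is a minimal Python example; which zones exist on this particular board is unknown, so the output is whatever the kernel happens to expose, and the function names are illustrative.

```python
#!/usr/bin/env python3
"""Print the temperature of every thermal zone the kernel exposes in sysfs."""
import glob
import os


def millidegrees_to_celsius(raw):
    """Convert the text of a sysfs 'temp' file to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_thermal_zones(sysfs_root="/sys/class/thermal"):
    """Return {"thermal_zoneN (type)": degrees_celsius} for readable zones."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(sysfs_root, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as fh:
                kind = fh.read().strip()
            with open(os.path.join(zone, "temp")) as fh:
                temperature = millidegrees_to_celsius(fh.read())
        except (OSError, ValueError):
            continue  # zone vanished or returned something unparsable
        readings[f"{os.path.basename(zone)} ({kind})"] = temperature
    return readings


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} C")
```

Logging these values alongside the memory numbers would show whether the hangs line up with unusually high temperatures.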
 |
pietinger Moderator

Joined: 17 Oct 2006 Posts: 5517 Location: Bavaria
|
Posted: Thu Oct 01, 2020 10:59 am Post subject: |
|
|
guido-pe,
In most cases I know of, the reason for a problem like yours is a memory leak. Try this: recompile your server applications (that is, everything that runs all the time) AND your glibc again under 5.4 (with the new linux-headers for 5.4, of course). |
 |
guido-pe n00b

Joined: 10 May 2004 Posts: 74
|
Posted: Thu Oct 01, 2020 7:55 pm Post subject: |
|
|
Several people here are now suggesting that the cause is memory exhaustion caused by a memory leak in some of the user space processes. I don't think that is what is happening, since the symptoms do not quite match up:
- On some occasions, I had a working SSH shell open to the machine and could observe some of what was going on there in top. I do not recall seeing any signs of memory exhaustion on those occasions. I'm pretty sure I would have noticed that, since it's a fairly obvious thing to look out for.
- I have a swap partition configured, so the system would have started swapping, which would have been observable both in top and through the HDD LED blinking a lot. Since I am using mechanical hard disks, it would even have been audible. None of that happened, though; the HDD LED showed no activity whatsoever.
- If I didn't have swap configured, or if even the swap had been exhausted, the oomkiller would have eventually started killing random processes. That did not happen either.
- When a system starts swapping because of memory exhaustion, usually all its processes will be affected. Maybe not all equally, there is some random factor involved after all, but there won't be any that are just safe from swapping. Again, that was not what I observed. In my case, only some processes were affected, and strictly only those that were doing local IO. Any already running process that was not doing local IO (only network IO) was unaffected, like my already established SSH session, my NTP server and my DNS server. (Starting new processes, though, usually involves some IO somewhere, so that didn't work much.)
- When a process is affected by swapping, it will slow down dramatically, but with a lot of patience, you can still see some progress. In my case, the affected processes were not slowed down, they were completely frozen.
Granted, I have no idea what a kernel-internal memory leak would look like, but I am fairly certain this problem is not caused by a userspace memory leak.
It's probably still a good idea to keep an eye on that. The problem with that, though, is, I don't have any idea how to log that information. I cannot just write that information to disk, since disk IO breaks down when this problem strikes, and I cannot call "free" to get the data in the first place, because calling free might involve disk IO. That really only leaves the option of writing a small daemon program that will read the data directly from /proc (I think /proc is unaffected by this) and send it via UDP to a reader process on some other machine (maybe netcat), that will in turn log this. I might use my laptop for that...
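The small daemon described above might look roughly like this in Python. It is a sketch under several assumptions: the collector address and port are placeholders, the selection of fields is arbitrary, and the process must already be running when the hang strikes. It keeps the /proc files open and rewinds them on every round, so the loop performs no path lookups after startup. The load average is included because tasks in state D count toward it on Linux.

```python
#!/usr/bin/env python3
"""Send periodic memory statistics from /proc to a remote host over UDP."""
import socket
import sys
import time

INTERVAL_SECONDS = 60
FIELDS = ("MemAvailable", "SwapFree", "Slab", "Dirty", "Writeback")


def build_report(meminfo_text, loadavg_text, now):
    """Build one single-line report from raw /proc file contents."""
    wanted = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        values = rest.split()
        if name in FIELDS and values:
            wanted[name] = values[0]
    parts = [f"t={int(now)}", f"load={loadavg_text.split()[0]}"]
    parts += [f"{name}={wanted[name]}" for name in FIELDS if name in wanted]
    return " ".join(parts)


def main(host, port):
    """Loop forever, sending one report per interval to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    with open("/proc/meminfo") as meminfo, open("/proc/loadavg") as loadavg:
        while True:
            meminfo.seek(0)
            loadavg.seek(0)
            report = build_report(meminfo.read(), loadavg.read(), time.time())
            sock.sendto(report.encode("ascii"), (host, port))
            time.sleep(INTERVAL_SECONDS)


# Usage: script.py <collector host> <collector port>, e.g. 192.0.2.10 5140
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], int(sys.argv[2]))
```

On the receiving machine, a netcat instance listening on the same UDP port and redirecting to a file would be enough to collect the reports. Whether reads from /proc really keep working during the hang is untested here.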
BTW, does anyone know how to increase the verbosity of dmesg output? Or how to get at dmesg output without calling some external binary? Maybe there is something relevant happening in the system when this problem occurs, which I just don't see because the kernel's logging verbosity is too low...
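On the two questions above: the console log level lives in /proc/sys/kernel/printk, and the kernel log itself can be read from /dev/kmsg without any external binary. The Python sketch below shows both; it assumes root privileges, and the helper names are invented for this example. Note that the console log level only controls which messages reach the console (including netconsole); dmesg already shows everything stored in the ring buffer, so truly more detail would need additional debug options in the kernel.

```python
#!/usr/bin/env python3
"""Inspect the kernel log level and follow the kernel log via /dev/kmsg."""
import os

PRINTK_PATH = "/proc/sys/kernel/printk"
PRINTK_NAMES = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four numbers in /proc/sys/kernel/printk to their names."""
    return dict(zip(PRINTK_NAMES, (int(value) for value in text.split())))


def set_console_loglevel(level):
    """Write a new console log level (root only); 8 lets every level through."""
    with open(PRINTK_PATH, "w") as fh:
        fh.write(str(level))


def parse_kmsg_record(record):
    """Parse one /dev/kmsg record: 'prio,seq,usec,flags;message' plus extras."""
    header, _, body = record.partition(";")
    prio, seq, usec = header.split(",")[:3]
    return {
        "level": int(prio) & 7,  # low three bits hold the log level
        "seq": int(seq),
        "seconds": int(usec) / 1e6,
        "message": body.split("\n", 1)[0],
    }


def follow_kmsg(path="/dev/kmsg"):
    """Yield parsed records forever; each read() returns exactly one record."""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                data = os.read(fd, 8192)
            except BrokenPipeError:
                continue  # older records were overwritten; keep reading
            yield parse_kmsg_record(data.decode("utf-8", "replace"))
    finally:
        os.close(fd)


if __name__ == "__main__" and os.path.exists(PRINTK_PATH):
    with open(PRINTK_PATH) as fh:
        print(parse_printk(fh.read()))
```

A long-running process that iterates over follow_kmsg() and forwards each message over the network would keep delivering kernel messages even after local logging has stalled, provided it was started before the hang.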
RayDude wrote: | I have to ask, are you overclocking? |
I am not overclocking this system.
RayDude wrote: | It's an ITX system, have you checked temperatures? |
I haven't. I guess I should start...
RayDude wrote: | The only other thing I can suggest is to disable features while running the 5.4 kernel and see if the freezing stops. You might be able to isolate it to a particular hardware. |
Randomly disabling features may leave my system unusable for what I want to use it for. Also, the way this bug manifests only after a couple of days, this could take months.
I was thinking more of trying the other major releases between 4.19 and 5.4 to see where this problem started, and then doing a git bisect.
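A rough estimate shows why this plan is slow. Between 4.19 and 5.4 there are five intermediate major releases (4.20, 5.0, 5.1, 5.2, 5.3), and a bisect over N candidate commits needs about log2(N) test kernels. The Python sketch below does the arithmetic; the commit count per release and the days per test are assumed figures for illustration, not measured values.

```python
#!/usr/bin/env python3
"""Estimate how long a git bisect takes when each test needs days of uptime."""
import math


def bisect_steps(candidate_commits):
    """Worst-case number of kernels to build and test for a bisect."""
    if candidate_commits <= 1:
        return 0
    return math.ceil(math.log2(candidate_commits))


def estimate_days(candidate_commits, days_per_test):
    """Total waiting time if every test kernel must run days_per_test days."""
    return bisect_steps(candidate_commits) * days_per_test


if __name__ == "__main__":
    assumed_commits_per_release = 14000  # assumption, order of magnitude only
    assumed_days_per_test = 7            # hangs usually showed up within a week
    steps = bisect_steps(assumed_commits_per_release)
    days = estimate_days(assumed_commits_per_release, assumed_days_per_test)
    print(f"{steps} test kernels, about {days} days of waiting")
```

Finding the first bad major release adds up to three more week-long tests on top of that, and a "good" verdict is only as reliable as the waiting time is long.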
RayDude wrote: | You should also try a 5.7 or 5.8 kernel to see if it hangs in the same way. |
Yeah, I think I will try that next. Right after I check if there are any new BIOS versions for the board.
RayDude wrote: | As Neddy suggested, logging over serial port is your next step. |
Did I miss something? Where did he suggest serial ports? Anyway, this board does not have any serial ports, at least none that are externally accessible. Maybe there are some connectors on the PCB somewhere, but I'm not about to put soldering iron to that board. I'm not good with that sort of thing and would probably destroy the board in the process.
RayDude wrote: | You need to catch the failure in the act and hope that the kernel is spitting something out that might give you a clue. |
Yes, but how? Netconsole might be an option, but the last time I tried setting that up, I failed miserably.
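For reference, a netconsole setup has two halves: a module or boot parameter on the server of the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr], and something listening for UDP on the target machine. The Python sketch below builds such a parameter string and implements a tiny receiver. All addresses and the interface name are placeholders, the helper names are invented, and the ports shown are the netconsole defaults (6665 source, 6666 target).

```python
#!/usr/bin/env python3
"""Build a netconsole parameter and receive kernel messages sent over UDP."""
import socket
import sys
import time


def netconsole_param(src_ip, dev, tgt_ip, tgt_mac="", src_port=6665, tgt_port=6666):
    """Return the netconsole= string; an empty MAC means broadcast."""
    return f"netconsole={src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def format_line(host, payload, now):
    """Prefix one received datagram with a local timestamp and the sender."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    text = payload.decode("utf-8", "replace").rstrip("\n")
    return f"{stamp} {host} {text}\n"


def receive(port, out_path):
    """Append every datagram arriving on the UDP port to out_path, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    with open(out_path, "a") as out:
        while True:
            data, (host, _) = sock.recvfrom(65535)
            out.write(format_line(host, data, time.time()))
            out.flush()  # keep the file current in case the sender dies


# Usage on the receiving machine: script.py <udp port> <output file>
if __name__ == "__main__" and len(sys.argv) == 3:
    receive(int(sys.argv[1]), sys.argv[2])
```

With the placeholder values 192.0.2.2 (server), eth0, and 192.0.2.10 (laptop), netconsole_param() yields netconsole=6665@192.0.2.2/eth0,6666@192.0.2.10/ which could be passed to modprobe netconsole or put on the kernel command line. Only messages at or above the console log level are sent, so raising that level first may be necessary.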
NeddySeagoon wrote: | An exercise for the reader.
Create hourly cron jobs to do the work for you, saving dmesg-<timestamp> and free-<timestamp> somewhere in /var |
Good idea in principle, but as I said earlier, when this problem strikes, disk IO tends to stop working abruptly. That means that to find this problem, everything that needs disk IO is out: logging to disk is out, cron jobs are out, shell scripts are mostly out (unless they run continuously and are written carefully to call only shell-internal commands), and calling external commands like free or even dmesg is out (unless I get lucky and binaries that are already in cache can still be started). At some point, even syslog() will probably stall, because the syslog-ng daemon has frozen. |
 |
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 55052 Location: 56N 3W
|
Posted: Thu Oct 01, 2020 8:04 pm Post subject: |
|
|
guido-pe,
There have been kernel bugs (at least one) that caused excessive IO waits.
Try the 5.8, or even 5.9 kernel if it's out.
The default kernel verbosity can be set in the kernel.
I would need to read the help in the kernel to know how to change it at run time.
Beware excessive logspam. The kernel may write so much to logs, it has no time for anything else. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
 |
guido-pe n00b

Joined: 10 May 2004 Posts: 74
|
Posted: Thu Oct 01, 2020 9:47 pm Post subject: |
|
|
NeddySeagoon wrote: | guido-pe,
There have been kernel bugs (at least one) that caused excessive IO waits.
Try the 5.8, or even 5.9 kernel if it's out.
The default kernel verbosity can be set in the kernel.
I would need to read the help in the kernel to know how to change it at run time.
Beware excessive logspam. The kernel may write so much to logs, it has no time for anything else. |
Okay, I'm running this system on 5.8 now and have added a note to my calendar to check back in a week. Let's see how this works out. |
 |
RayDude Advocate


Joined: 29 May 2004 Posts: 2119 Location: San Jose, CA
|
Posted: Fri Oct 02, 2020 3:04 am Post subject: |
|
|
USB serial is a thing!
https://www.amazon.com/Sabrent-Converter-Prolific-Chipset-CB-DB9P/dp/B00IDSM6BW
You can set that up to dump kernel logs to another machine. I've only had to do that once, a long time ago, for an embedded system I was working on, but you will see the death throes of the kernel.
Good luck with the 5.8 kernel! _________________ Some day there will only be free software. |
 |
guido-pe n00b

Joined: 10 May 2004 Posts: 74
|
Posted: Sun Nov 08, 2020 8:43 pm Post subject: |
|
|
Update:
I have been running 5.8 for over a month now, and the problem has not resurfaced. It seems this was just a problem in the 5.4.x line. |
 |
RayDude Advocate


Joined: 29 May 2004 Posts: 2119 Location: San Jose, CA
|
Posted: Mon Nov 09, 2020 4:44 am Post subject: |
|
|
guido-pe wrote: | Update:
I have been running 5.8 for over a month now, and the problem has not resurfaced. It seems this was just a problem in the 5.4.x line. |
Great! Glad to hear it. _________________ Some day there will only be free software. |