System hangs after some days of operation with kernel 5.4.x

guido-pe · n00b Joined: 10 May 2004 Posts: 74

I have a small always-on Gentoo system at home that acts as a home server for all kinds of things (mostly as a file server), currently running a kernel from the 4.19 line. Every time I try to upgrade to a kernel from the 5.4.x line, the system ends up hanging after some days of operation (usually less than a week).

These hangs seem to be related to filesystem I/O. When they happen, things that happen mostly in RAM and don't require disk access, like responding to pings, DNS queries or NTP queries, continue to work, but things that require filesystem access, like logging in via SSH or accessing a file via NFS, hang and eventually time out. On the rare occasion when I am connected to the system via SSH when the hang happens, the already open shell continues to work until I try to start a program that needs to be loaded from the filesystem, at which point that process hangs in state D indefinitely.
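
To see which processes are stuck in state D without starting a new binary, the state can be read straight from /proc. The following is only a sketch in Python, and it assumes an interpreter that was started before the hang (launching one afterwards would itself need disk I/O). The function names are made up for illustration; /proc/<pid>/status and /proc/<pid>/wchan are the standard procfs files.

```python
import os

def parse_status(text):
    """Return (name, state_letter) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    name = fields.get("Name", "?")
    state = fields.get("State", "?")[:1]  # "D (disk sleep)" -> "D"
    return name, state

def blocked_processes(proc="/proc"):
    """Yield (pid, name, wchan) for each process in uninterruptible sleep."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                name, state = parse_status(f.read())
        except OSError:
            continue  # process exited or is not readable
        if state != "D":
            continue
        try:
            # wchan names the kernel function the task is sleeping in,
            # when the kernel exposes it to the caller.
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"
        yield int(entry), name, wchan
```

Calling `list(blocked_processes())` from an already open Python prompt during a hang would show which tasks are blocked and, where wchan is available, a hint at what they are waiting on.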

There is nothing unusual in the system logs or even in dmesg when this happens. It's like the filesystem just randomly decided to stop serving some requests for no apparent reason.

The disk setup is four SATA hard disks, each with one big LUKS partition, combined into one big RAID5 array, with one LVM volume group on top of that and ext4 filesystems on most of the logical volumes; the one exception holds swap.
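
With a stack like this, one thing worth checking during a hang is whether the RAID layer still reports all members as active. Below is a rough Python sketch that parses the text of /proc/mdstat. The array name and member names in the usage note are invented, and the helper names are hypothetical.

```python
import re

# Status tail of an md array line, e.g. "[4/4] [UUUU]":
# first number = devices the array should have, second = devices active.
_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
_ARRAY = re.compile(r"^(md\S*)\s*:")

def parse_mdstat(text):
    """Return {array_name: (expected, active, flags)} from /proc/mdstat text."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS.search(line)
        if status and current:
            arrays[current] = (int(status.group(1)),
                               int(status.group(2)),
                               status.group(3))
            current = None
    return arrays

def degraded(arrays):
    """Names of arrays with fewer active members than expected."""
    return sorted(name for name, (expected, active, flags) in arrays.items()
                  if active < expected or "_" in flags)
```

For a healthy four-disk array, `parse_mdstat(open("/proc/mdstat").read())` would be expected to give something like `{"md0": (4, 4, "UUUU")}`, and `degraded(...)` an empty list.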

This freeze happens with all kernels of the 5.4 line from gentoo-sources that I tried, including 5.4.66, but not with any kernels of the 4.19 line or earlier.

Does anybody have any idea what might be going wrong here or how I could debug this further?

RayDude · Posted: Wed Sep 30, 2020 5:45 pm Post subject:

Can you post your hardware information? CPU, amount of memory, if you are overclocking, motherboard, video card, etc.

Post the output of dmesg, lsmod, lspci, lsusb for the 4.19 kernels and for the 5.4 kernel.

For the heck of it, I'd like to see the output of dmidecode as well.

I'm guessing there is an issue with the kernel configuration.

When you configure the 5.4 kernel, are you copying the .config from 4.19 into its folder and running make, and then checking all the new options to make sure they are configured correctly? Generally I use the defaults because the default config knows what needs to be enabled.

What version of the kernel are you using? gentoo-sources?

Debugging a hang is difficult, but at least you know it's not hardware because it's stable on 4.19.
_________________
Some day there will only be free software.

guido-pe · n00b Joined: 10 May 2004 Posts: 74

NeddySeagoon · Posted: Wed Sep 30, 2020 7:55 pm Post subject:

guido-pe,

guido-pe · n00b Joined: 10 May 2004 Posts: 74

NeddySeagoon · Posted: Wed Sep 30, 2020 9:12 pm Post subject:

guido-pe,

Very thorough. There is nothing that stands out.

Keep an eye on free and dmesg while kernel 5.4.66 is in use.

If free shows used memory or swap use increasing, you may have found a memory leak.
Eventually, there is very little free RAM and things slow to a crawl. It looks like a lockup, but it's operating normally, just very slowly.
top will show the memory hog, if there is one.

dmesg may show errors and error recovery before it locks up.
You will want to save these things so you can compare them and salvage them after a lockup.

An exercise for the reader.
Create hourly cron jobs to do the work for you, saving dmesg-<timestamp> and free-<timestamp> somewhere in /var
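
One possible shape for that exercise, as a Python sketch rather than a finished tool. The output directory and the function name are assumptions. The only external commands are the `dmesg` and `free` tools named above, and `dmesg` may need root depending on the system's settings.

```python
import os
import subprocess
import time

# Commands to capture, keyed by the file name prefix to use.
COMMANDS = {"dmesg": ["dmesg"], "free": ["free", "-m"]}

def snapshot(out_dir, commands=COMMANDS, now=None):
    """Run each command and save its output as <name>-<timestamp> in out_dir."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        path = os.path.join(out_dir, f"{name}-{stamp}")
        with open(path, "w") as f:
            f.write(result.stdout)
            f.write(result.stderr)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before a possible hang
        written.append(path)
    return written
```

An hourly cron entry would then call `snapshot("/var/log/hang-debug")` (the directory is a placeholder). The snapshots taken before a lockup are the useful ones; a run that starts during the lockup will most likely block on disk I/O like everything else.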
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

RayDude · Posted: Thu Oct 01, 2020 7:21 am Post subject:

guido-pe,

dmesg and the rest of the dumps look good. I checked to see if you are doing any special CPU optimization in .config because I thought maybe that might have something to do with it.

I have to ask, are you overclocking?

It's an ITX system, have you checked temperatures?

The only other thing I can suggest is to disable features while running the 5.4 kernel and see if the freezing stops. You might be able to isolate it to a particular piece of hardware.

You should also try a 5.7 or 5.8 kernel to see if it hangs in the same way. You might have found an actual driver bug that was fixed in a subsequent release (although generally they backport those).

As Neddy suggested, logging over serial port is your next step. You need to catch the failure in the act and hope that the kernel is spitting something out that might give you a clue.

pietinger · Posted: Thu Oct 01, 2020 10:59 am Post subject:

guido-pe,

In most cases I know, the reason for a problem like yours is a memory leak. Try this: recompile your server applications (= everything that runs all the time) AND your glibc again under 5.4 (of course with the new linux-headers for 5.4).

guido-pe · n00b Joined: 10 May 2004 Posts: 74

Several people here are now suggesting that the cause is memory exhaustion caused by a memory leak in some of the user space processes. I don't think that is what is happening, since the symptoms do not quite match up:

On some occasions, I had a working SSH shell open to the machine and could observe some of what was going on there in top. I do not recall seeing any signs of memory exhaustion on those occasions. I'm pretty sure I would have noticed that, since it's a fairly obvious thing to look out for.
I have a swap partition configured, so the system would have started swapping, which would have been observable both in top and through the HDD LED blinking a lot. Since I am using mechanical hard disks, it would even have been audible. None of that happened, though; the HDD LED showed no activity whatsoever.
If I didn't have swap configured, or if even the swap had been exhausted, the OOM killer would eventually have started killing processes. That did not happen either.
When a system starts swapping because of memory exhaustion, usually all its processes will be affected. Maybe not all equally, there is some random factor involved after all, but there won't be any that are just safe from swapping. Again, that was not what I observed. In my case, only some processes were affected, and strictly only those that were doing local IO. Any already running process that was not doing local IO (only network IO) was unaffected, like my already established SSH session, my NTP server and my DNS server. (Starting new processes, though, usually involves some IO somewhere, so that didn't work much.)
When a process is affected by swapping, it will slow down dramatically, but with a lot of patience, you can still see some progress. In my case, the affected processes were not slowed down, they were completely frozen.

Granted, I have no idea what a kernel-internal memory leak would look like, but I am fairly certain this problem is not caused by a userspace memory leak.

It's probably still a good idea to keep an eye on that. The problem is that I don't know how to log that information. I cannot just write it to disk, since disk IO breaks down when this problem strikes, and I cannot call "free" to get the data in the first place, because calling free might involve disk IO. That really only leaves the option of writing a small daemon program that will read the data directly from /proc (I think /proc is unaffected by this) and send it via UDP to a reader process on some other machine (maybe netcat), which will in turn log it. I might use my laptop for that...
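
A minimal sketch of such a daemon in Python, under a few assumptions: the interpreter is started before the hang and stays resident, the choice of /proc/meminfo fields is only a guess at what is worth watching, and the host and port are whatever the receiving machine listens on. The function names are hypothetical.

```python
import socket
import time

FIELDS = ("MemFree", "MemAvailable", "SwapFree", "Dirty", "Writeback")

def read_meminfo(path="/proc/meminfo", keys=FIELDS):
    """Return the selected /proc/meminfo fields as {name: value_in_kB}."""
    values = {}
    with open(path) as f:
        for line in f:
            name, sep, rest = line.partition(":")
            if sep and name in keys:
                values[name] = int(rest.split()[0])
    return values

def format_report(values, now=None):
    """Encode one sample as a single text line prefixed with epoch seconds."""
    stamp = int(time.time() if now is None else now)
    body = " ".join(f"{k}={v}kB" for k, v in sorted(values.items()))
    return f"{stamp} {body}\n".encode()

def run(host, port, interval=60):
    """Send one sample per interval to host:port over UDP, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        try:
            sock.sendto(format_report(read_meminfo()), (host, port))
        except OSError:
            pass  # a lost sample is better than a dead sender
        time.sleep(interval)
```

On the other machine, any UDP listener that appends what it receives to a file would do. Each datagram is one self-contained line, so lost packets only leave gaps rather than corrupting the log.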

BTW, does anyone know how to increase the verbosity of dmesg output? Or how to get at dmesg output without calling some external binary? Maybe there is something relevant happening in the system when this problem occurs, which I just don't see because the kernel's logging verbosity is too low...
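
On getting at the kernel log without an external binary: the log is also exposed as the character device /dev/kmsg, where each read returns one record of the form `priority,sequence,timestamp,flags;message`. The Python sketch below reads whatever is currently buffered. It assumes an already running interpreter and enough privileges to open the device (usually root); the function names are made up.

```python
import os

def parse_record(raw):
    """Split one /dev/kmsg record into (level, seq, usec, message)."""
    text = raw.decode("utf-8", "replace")
    header, _, rest = text.partition(";")
    fields = header.split(",")
    prio, seq, usec = int(fields[0]), int(fields[1]), int(fields[2])
    message = rest.split("\n", 1)[0]     # drop trailing " KEY=value" lines
    return prio & 7, seq, usec, message  # low 3 bits hold the log level

def read_available(path="/dev/kmsg"):
    """Return all records currently in the kernel ring buffer."""
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    records = []
    try:
        while True:
            try:
                raw = os.read(fd, 8192)
            except BlockingIOError:
                break      # nothing more buffered right now
            except BrokenPipeError:
                continue   # older records were overwritten; keep reading
            if not raw:
                break
            records.append(parse_record(raw))
    finally:
        os.close(fd)
    return records
```

The parsed records could be sent over the network in the same way as any other sample, instead of being written to the local disk.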

NeddySeagoon · Posted: Thu Oct 01, 2020 8:04 pm Post subject:

guido-pe,

There have been kernel bugs (at least one) that caused excessive IO waits.
Try the 5.8, or even 5.9 kernel if it's out.

The default kernel verbosity can be set in the kernel.
I would need to read the help in the kernel to know how to change it at run time.

Beware excessive logspam. The kernel may write so much to logs, it has no time for anything else.
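
For the run-time side, the console log level is exposed in /proc/sys/kernel/printk as four numbers, the first being the current console log level. A rough Python sketch for reading and changing it follows; the helper names are invented and writing requires root. Note that this only controls which messages are printed to the console, not how much the kernel code itself logs.

```python
NAMES = ("console_loglevel", "default_message_loglevel",
         "minimum_console_loglevel", "default_console_loglevel")

def parse_printk(text):
    """Map the four values of /proc/sys/kernel/printk to their names."""
    return dict(zip(NAMES, (int(x) for x in text.split())))

def with_console_level(text, level):
    """Return the file content with only the console log level replaced."""
    values = text.split()
    values[0] = str(level)
    return "\t".join(values) + "\n"

def set_console_level(level, path="/proc/sys/kernel/printk"):
    """Rewrite the sysctl, keeping the other three values as they are."""
    with open(path) as f:
        current = f.read()
    with open(path, "w") as f:
        f.write(with_console_level(current, level))
```

A level of 8 lets every message through to the console, including debug messages, which is where the logspam warning applies.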

guido-pe · n00b Joined: 10 May 2004 Posts: 74

RayDude · Posted: Fri Oct 02, 2020 3:04 am Post subject:

USB serial is a thing!

https://www.amazon.com/Sabrent-Converter-Prolific-Chipset-CB-DB9P/dp/B00IDSM6BW

You can set that up to dump kernel logs to another machine. I've only had to do that once, a long time ago, for an embedded system I was working on, but you will see the death throes of the kernel.

Good luck with the 5.8 kernel!

guido-pe · n00b Joined: 10 May 2004 Posts: 74

Update:

I have been running 5.8 for over a month now, and the problem has not resurfaced. It seems this was just a problem in the 5.4.x line.

RayDude · Posted: Mon Nov 09, 2020 4:44 am Post subject: