Zucca Moderator
Joined: 14 Jun 2007 Posts: 3893 Location: Rasi, Finland
|
Posted: Mon Jan 13, 2025 11:40 pm Post subject: networkmanager reproduces zombies |
|
|
Firstly:
grknight wrote: | Zucca wrote: | pingtoo wrote: | One of "init" duty for linux is reap zombie process. | Ok. I thought this was kernel's job. Is it really so? That would explain one oddity I have... |
Yes, it is init's job. See a simple example from openrc-init that catches SIGCHLD and reaps |
That ^ link contains the following function:
Code: |
static void signal_handler(int sig)
{
    switch (sig) {
    case SIGINT:
        handle_shutdown("reboot", RB_AUTOBOOT);
        break;
    case SIGTERM:
#ifdef SIGPWR
    case SIGPWR:
#endif
        handle_shutdown("shutdown", RB_HALT_SYSTEM);
        break;
    case SIGCHLD:
        reap_zombies();
        break;
    default:
        printf("Unknown signal received, %d\n", sig);
        break;
    }
}
|
I don't know much of C, but it looks like if openrc-init receives SIGCHLD signal it'll start cleaning out zombie processes.
I tried, but nothing happened. I have one box where dhcpcd processes spawned by networkmanager are eventually left as zombies. So I'm trying to solve it here too.
Also I don't quite understand the following either:
reap_zombies(): |
static void reap_zombies(void)
{
    pid_t pid;

    for (;;) {
        pid = waitpid(-1, NULL, WNOHANG);
        if (pid == 0)
            break;
        else if (pid == -1) {
            if (errno == ECHILD)
                break;
            perror("waitpid");
            continue;
        }
    }
}
|
for (;;)? Is this like while true? Loops forever? No condition passed.
The waitpid should allow zombified process to exit? _________________ ..: Zucca :..
My gentoo installs: | init=/sbin/openrc-init
-systemd -logind -elogind seatd |
Quote: | I am NaN! I am a man! |
Last edited by Zucca on Thu Jan 16, 2025 10:42 am; edited 1 time in total |
|
|
|
GDH-gentoo Veteran
Joined: 20 Jul 2019 Posts: 1797 Location: South America
|
Posted: Tue Jan 14, 2025 12:27 am Post subject: Re: openrc-init - zombie reaping |
|
|
Zucca wrote: | I don't know much of C, but it looks like if openrc-init receives SIGCHLD signal it'll start cleaning out zombie processes. |
It is a POSIX thing more than a C thing. Yes, it will reap zombie processes when receiving a SIGCHLD signal. The signal itself is just an 'alarm' though ("hey, you have a new zombie"), actual reaping happens when appropriate functions are called (such as waitpid()).
Reception of SIGCHLD happens automatically (it is implemented by the kernel) when a child process terminates. openrc-init also has the peculiarity that it runs as process 1, so it 'inherits' as children all processes whose parents terminate, as a special property.
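To make the "alarm versus actual reaping" distinction concrete, here is a minimal sketch. It is not taken from openrc-init: the helper names proc_state() and zombie_demo() are made up for illustration, and reading /proc/<pid>/stat is Linux-specific. The child has already terminated, yet it stays around in state Z until the parent calls waitpid():

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return the one-letter state from /proc/<pid>/stat
 * (Linux-specific), or '?' if the entry cannot be read. */
static char proc_state(pid_t pid)
{
    char path[64], buf[512];
    char state = '?';
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return state;
    if (fgets(buf, sizeof buf, f)) {
        /* The state letter follows the ") " that closes the command name. */
        char *p = strrchr(buf, ')');
        if (p && p[1] == ' ' && p[2] != '\0')
            state = p[2];
    }
    fclose(f);
    return state;
}

/* Hypothetical demo: fork a child that exits at once and report the state
 * it is in after terminating but before being reaped. */
static char zombie_demo(void)
{
    siginfo_t info;
    char before;
    pid_t pid = fork();

    if (pid < 0)
        return '?';
    if (pid == 0)
        _exit(0);               /* child: terminate immediately */

    /* Block until the child has terminated, but leave it unreaped. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
        return '?';
    before = proc_state(pid);   /* a terminated, unreaped child shows 'Z' */
    waitpid(pid, NULL, 0);      /* this call is what reaps the zombie */
    return before;
}
```

Calling zombie_demo() from main() and printing the result should show Z, the same state letter ps prints for defunct processes.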
Zucca wrote: | I tried, but nothing happened. |
Tried what?
Zucca wrote: | I have one box where dhcpcd processes spawned by networkmanager are eventually left as zombies. So I'm trying to solve it here too. |
How does the process tree (ps axf) look like when it happens?
Zucca wrote: | Also I don't quite understand the following either:
[...]
for (;;)? Is this like while true? Loops forever? |
Yes. But there are break statements in the loop body that are executed on certain conditions. When they are executed, execution flow continues after the loop.
Zucca wrote: | The waitpid should allow zombified process to exit? |
It is one of the functions that results in reaping a zombie when called, yes. _________________
NeddySeagoon wrote: | I'm not a witch, I'm a retired electronics engineer |
Ionen wrote: | As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though |
Last edited by GDH-gentoo on Tue Jan 14, 2025 1:42 am; edited 1 time in total |
|
|
|
Hu Administrator
Joined: 06 Mar 2007 Posts: 23056
|
Posted: Tue Jan 14, 2025 12:48 am Post subject: |
|
|
Building on the hint from GDH-gentoo, my guess is that the zombies that Zucca saw are the children of some not-yet-exited process that is not reaping them in a timely manner. When the parent of a zombie exits, the zombie's parent is changed to the nearest subreaper (which by default is pid 1). The new parent would then reap the zombie. However, if the original parent has not exited yet, then the zombie remains under the parent, because the parent might yet want to reap the zombie, and each zombie can only be reaped once. |
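The reparenting rule can be tried out without touching pid 1. The sketch below is only an illustration under stated assumptions: subreaper_demo() is a made-up name, and prctl(PR_SET_CHILD_SUBREAPER) is Linux-specific (3.4 and later). The process marks itself as a subreaper, lets its child exit while a grandchild is still running, and then reaps both:

```c
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical demo: become a subreaper, create an orphaned grandchild,
 * and count how many processes this process ends up reaping.
 * Returns -1 on setup failure. */
static int subreaper_demo(void)
{
    int reaped = 0;
    pid_t child;

    /* Orphaned descendants are now reparented to us instead of to pid 1. */
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == -1)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (fork() == 0) {
            /* grandchild: outlive its parent briefly, then exit */
            struct timespec ts = { 0, 100 * 1000 * 1000 };  /* 100 ms */
            nanosleep(&ts, NULL);
            _exit(0);
        }
        _exit(0);   /* child: exit without waiting for the grandchild */
    }

    /* Reap everything that is, or becomes, our child. */
    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;
        break;      /* ECHILD: nothing left to reap */
    }
    return reaped;
}
```

If both forks succeed, the count covers the direct child and the grandchild that was handed over when its original parent exited.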
|
|
|
sublogic Guru
Joined: 21 Mar 2022 Posts: 303 Location: Pennsylvania, USA
|
Posted: Tue Jan 14, 2025 3:09 am Post subject: |
|
|
To elaborate a bit on the above:
- In C's "for" statement, for(init ; test ; postbody), all three of init, test and postbody can be omitted, and a missing test means "true". So for( ; ; ) is an infinite loop.
- Zombies exist because the parent of a terminated process has the right to collect its resource usage statistics with the wait4() system call.
- Zombies are just tiny stubs, most of the terminated process' resources are already reclaimed. The parent's wait4() reclaims the rest.
- If the parent exits without wait()ing, init inherits the zombie and promptly reaps it without bothering to collect statistics.
- If the parent hangs around and never wait()s, zombies accumulate. Kill the parent !
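As a rough illustration of the statistics a parent can collect, here is a minimal sketch using wait4(). The helper name reap_with_rusage() is made up for this example, and it assumes a libc that exposes the BSD-style wait4() by default, as glibc does:

```c
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a short CPU-bound child, reap it with wait4(),
 * and store its resource usage in *ru. Returns the child's exit status,
 * or -1 on error. */
static int reap_with_rusage(struct rusage *ru)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* child: burn a little CPU so there is something to account for */
        volatile unsigned long sink = 0;
        unsigned long i;

        for (i = 0; i < 20000000UL; i++)
            sink += i;
        _exit(0);
    }
    /* Reaping and collecting the statistics happen in the same call. */
    if (wait4(pid, &status, 0, ru) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After a successful call, fields such as ru->ru_utime and ru->ru_maxrss describe the terminated child. Once the zombie is reaped, that information is gone.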
|
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3893 Location: Rasi, Finland
|
Posted: Tue Jan 14, 2025 8:49 am Post subject: Re: openrc-init - zombie reaping |
|
|
Ok. Thanks guys. I learnt a lot about zombie processes and how those are normally handled.
GDH-gentoo wrote: | Zucca wrote: | I tried, but nothing happened. |
Tried what? | I sent the signal manually to openrc-init, but that's obviously not how it works (after reading the replies in this topic).
GDH-gentoo wrote: | How does the process tree (ps axf) look like when it happens?
| It happens all the time. I think it occurs when the box loses its wifi connection. I haven't (yet) counted if "zombies reproduce" when the connection is lost.
Code: | M710q ~ # pstree -A 1519
NetworkManager-+-107*[dhcpcd]
`-3*[{NetworkManager}] | ... where 1519 is the networkmanager main process PID.
*snip* of ps output: | PID TTY STAT TIME COMMAND
1519 ? Ssl 5:21 /usr/sbin/NetworkManager --pid-file /run/NetworkManager/NetworkManager.pid
1905 ? Z 0:00 \_ [dhcpcd] <defunct>
2375 ? Z 0:00 \_ [dhcpcd] <defunct>
4412 ? Z 0:00 \_ [dhcpcd] <defunct>
4591 ? Z 0:00 \_ [dhcpcd] <defunct>
4875 ? Z 0:00 \_ [dhcpcd] <defunct>
5040 ? Z 0:00 \_ [dhcpcd] <defunct>
5188 ? Z 0:00 \_ [dhcpcd] <defunct>
5462 ? Z 0:00 \_ [dhcpcd] <defunct>
5616 ? Z 0:00 \_ [dhcpcd] <defunct>
5787 ? Z 0:00 \_ [dhcpcd] <defunct>
6072 ? Z 0:00 \_ [dhcpcd] <defunct>
8101 ? Z 0:00 \_ [dhcpcd] <defunct>
8585 ? Z 0:00 \_ [dhcpcd] <defunct>
11034 ? Z 0:00 \_ [dhcpcd] <defunct>
11085 ? Z 0:00 \_ [dhcpcd] <defunct>
11117 ? Z 0:00 \_ [dhcpcd] <defunct>
12902 ? Z 0:00 \_ [dhcpcd] <defunct>
13179 ? Z 0:00 \_ [dhcpcd] <defunct>
13559 ? Z 0:00 \_ [dhcpcd] <defunct>
13898 ? Z 0:00 \_ [dhcpcd] <defunct>
15548 ? Z 0:00 \_ [dhcpcd] <defunct>
15835 ? Z 0:00 \_ [dhcpcd] <defunct>
16108 ? Z 0:00 \_ [dhcpcd] <defunct>
16268 ? Z 0:00 \_ [dhcpcd] <defunct>
16287 ? Z 0:00 \_ [dhcpcd] <defunct>
16494 ? Z 0:00 \_ [dhcpcd] <defunct>
17157 ? Z 0:00 \_ [dhcpcd] <defunct>
17181 ? Z 0:00 \_ [dhcpcd] <defunct>
17201 ? Z 0:00 \_ [dhcpcd] <defunct>
17230 ? Z 0:00 \_ [dhcpcd] <defunct>
17259 ? Z 0:00 \_ [dhcpcd] <defunct>
17299 ? Z 0:00 \_ [dhcpcd] <defunct>
17579 ? Z 0:00 \_ [dhcpcd] <defunct>
19249 ? Z 0:00 \_ [dhcpcd] <defunct>
19288 ? Z 0:00 \_ [dhcpcd] <defunct>
19746 ? Z 0:00 \_ [dhcpcd] <defunct>
20047 ? Z 0:00 \_ [dhcpcd] <defunct>
20080 ? Z 0:00 \_ [dhcpcd] <defunct>
21784 ? Z 0:00 \_ [dhcpcd] <defunct>
22523 ? Z 0:00 \_ [dhcpcd] <defunct>
22802 ? Z 0:00 \_ [dhcpcd] <defunct>
23082 ? Z 0:00 \_ [dhcpcd] <defunct>
23116 ? Z 0:00 \_ [dhcpcd] <defunct>
23399 ? Z 0:00 \_ [dhcpcd] <defunct>
23422 ? Z 0:00 \_ [dhcpcd] <defunct>
23459 ? Z 0:00 \_ [dhcpcd] <defunct>
25189 ? Z 0:00 \_ [dhcpcd] <defunct>
25214 ? Z 0:00 \_ [dhcpcd] <defunct>
25497 ? Z 0:00 \_ [dhcpcd] <defunct>
25652 ? Z 0:00 \_ [dhcpcd] <defunct>
25833 ? Z 0:00 \_ [dhcpcd] <defunct>
26164 ? Z 0:00 \_ [dhcpcd] <defunct>
28591 ? Z 0:00 \_ [dhcpcd] <defunct>
28910 ? Z 0:00 \_ [dhcpcd] <defunct>
30632 ? Z 0:00 \_ [dhcpcd] <defunct>
1034 ? Z 0:00 \_ [dhcpcd] <defunct>
1055 ? Z 0:00 \_ [dhcpcd] <defunct>
1368 ? Z 0:00 \_ [dhcpcd] <defunct>
1705 ? Z 0:00 \_ [dhcpcd] <defunct>
3757 ? Z 0:00 \_ [dhcpcd] <defunct>
3920 ? Z 0:00 \_ [dhcpcd] <defunct>
4209 ? Z 0:00 \_ [dhcpcd] <defunct>
4236 ? Z 0:00 \_ [dhcpcd] <defunct>
4254 ? Z 0:00 \_ [dhcpcd] <defunct>
4283 ? Z 0:00 \_ [dhcpcd] <defunct>
4557 ? Z 0:00 \_ [dhcpcd] <defunct>
4870 ? Z 0:00 \_ [dhcpcd] <defunct>
7186 ? Z 0:00 \_ [dhcpcd] <defunct>
13633 ? Z 0:00 \_ [dhcpcd] <defunct>
13906 ? Z 0:00 \_ [dhcpcd] <defunct>
14181 ? Z 0:00 \_ [dhcpcd] <defunct>
14464 ? Z 0:00 \_ [dhcpcd] <defunct>
14755 ? Z 0:00 \_ [dhcpcd] <defunct>
15046 ? Z 0:00 \_ [dhcpcd] <defunct>
15330 ? Z 0:00 \_ [dhcpcd] <defunct>
15615 ? Z 0:00 \_ [dhcpcd] <defunct>
15897 ? Z 0:00 \_ [dhcpcd] <defunct>
16181 ? Z 0:00 \_ [dhcpcd] <defunct>
16465 ? Z 0:00 \_ [dhcpcd] <defunct>
16745 ? Z 0:00 \_ [dhcpcd] <defunct>
17036 ? Z 0:00 \_ [dhcpcd] <defunct>
17328 ? Z 0:00 \_ [dhcpcd] <defunct>
17605 ? Z 0:00 \_ [dhcpcd] <defunct>
21517 ? Z 0:00 \_ [dhcpcd] <defunct>
21552 ? Z 0:00 \_ [dhcpcd] <defunct>
21837 ? Z 0:00 \_ [dhcpcd] <defunct>
21861 ? Z 0:00 \_ [dhcpcd] <defunct>
21890 ? Z 0:00 \_ [dhcpcd] <defunct>
21909 ? Z 0:00 \_ [dhcpcd] <defunct>
21937 ? Z 0:00 \_ [dhcpcd] <defunct>
22003 ? Z 0:00 \_ [dhcpcd] <defunct>
24205 ? Z 0:00 \_ [dhcpcd] <defunct>
24486 ? Z 0:00 \_ [dhcpcd] <defunct>
24523 ? Z 0:00 \_ [dhcpcd] <defunct>
24807 ? Z 0:00 \_ [dhcpcd] <defunct>
24828 ? Z 0:00 \_ [dhcpcd] <defunct>
25206 ? Z 0:00 \_ [dhcpcd] <defunct>
25493 ? Z 0:00 \_ [dhcpcd] <defunct>
25786 ? Z 0:00 \_ [dhcpcd] <defunct>
26074 ? Z 0:00 \_ [dhcpcd] <defunct>
26360 ? Z 0:00 \_ [dhcpcd] <defunct>
26646 ? Z 0:00 \_ [dhcpcd] <defunct>
26930 ? Z 0:00 \_ [dhcpcd] <defunct>
27215 ? Z 0:00 \_ [dhcpcd] <defunct>
27784 ? Z 0:00 \_ [dhcpcd] <defunct>
28067 ? Z 0:00 \_ [dhcpcd] <defunct>
28370 ? S 0:00 \_ dhcpcd: wlp0s20f0u8 [ip4]
1567 ? S 7:58 /usr/sbin/wpa_supplicant -u
1569 ? Sl 0:00 /usr/sbin/ModemManager |
GDH-gentoo wrote: | Zucca wrote: | Also I don't quite understand the following either:
[...]
for (;;)? Is this like while true? Loops forever? |
Yes. But there are break statements in the loop body that are executed on certain conditions. | Ok. Thanks. I'm familiar with break, continue and so on. The (;;) -part was the uncertain part for me. Would while () do the same?
GDH-gentoo wrote: | Zucca wrote: | The waitpid should allow zombified process to exit? |
It is one of the functions that results in reaping a zombie when called, yes. | ... and I assume waitpid is some standard function from libc? Or even a syscall?
Thanks Hu for the insights to the mechanism.
sublogic wrote: | Zombies are just tiny stubs, most of the terminated process' resources are already reclaimed. The parent's wait4() reclaims the rest. |
sublogic wrote: | If the parent hangs around and never wait()s, zombies accumulate. Kill the parent ! | Indeed. By running rc-config restart NetworkManager, zombies got reaped.
So... The issue lies (probably) in networkmanager, as it's the direct parent of dhcpcd.
I could recompile networkmanager to use dhclient instead of dhcpcd to rule out dhcpcd as the culprit.
|
|
|
|
GDH-gentoo Veteran
Joined: 20 Jul 2019 Posts: 1797 Location: South America
|
Posted: Tue Jan 14, 2025 12:57 pm Post subject: Re: openrc-init - zombie reaping |
|
|
Zucca wrote: | *snip* of ps output: | PID TTY STAT TIME COMMAND
1519 ? Ssl 5:21 /usr/sbin/NetworkManager --pid-file /run/NetworkManager/NetworkManager.pid
1905 ? Z 0:00 \_ [dhcpcd] <defunct>
2375 ? Z 0:00 \_ [dhcpcd] <defunct>
4412 ? Z 0:00 \_ [dhcpcd] <defunct>
... |
|
That's a lot of zombie child processes, indeed, that process 1519 seems to not be reaping (in a timely manner at least).
Zucca wrote: | Ok. Thanks. I'm familiar with break, continue and so on. The (;;) -part was the uncertain part for me. Would while () do the same? |
Okay. Yes, for (;;) is valid C syntax that means "loop forever" for the reasons given by sublogic; however while () is not. The condition can't be omitted, so you have to write while (1) or equivalent condition to loop forever.
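A small sketch of the two equivalent spellings, with made-up function names. Both loops have no exit condition of their own and rely on break to leave:

```c
/* Count up to limit with for (;;): the omitted test counts as true. */
static int count_with_for(int limit)
{
    int n = 0;

    for (;;) {
        if (n >= limit)
            break;      /* the only way out of the loop */
        n++;
    }
    return n;
}

/* Same loop with while: here the condition must be written out. */
static int count_with_while(int limit)
{
    int n = 0;

    while (1) {
        if (n >= limit)
            break;
        n++;
    }
    return n;
}
```

Both functions return limit for any non-negative limit. Which spelling to use is a matter of style.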
Zucca wrote: | ... and I assume waitpid is some standard function from libc? Or even a syscall? |
It is a "system interface" specified by the POSIX standard that, on amd64, the libc implements as a wrapper around a Linux system call named wait4().
Zucca wrote: | So... The issue lies (probably) in networkmanager, as it's the direct parent of dhcpcd. |
Yeah, looks like.
|
|
|
|
Hu Administrator
Joined: 06 Mar 2007 Posts: 23056
|
Posted: Tue Jan 14, 2025 2:59 pm Post subject: Re: openrc-init - zombie reaping |
|
|
Zucca wrote: | So... The issue lies (probably) in networkmanager, as it's the direct parent of dhcpcd.
I could recompile networkmanager to use dhclient instead of dhcpcd to rule out dhcpcd as the culprit. | From the output shown, dhcpcd cannot be the culprit, because it exited in a presumably normal manner. Zombie cleanup is the responsibility of the survivor, not the zombie, so no bugs in dhcpcd could justify the observed results. At worst, a dhcpcd bug might let it die and be replaced "too often", but even then, if NetworkManager were reaping zombies properly, you wouldn't see this accumulation. To me, the only question is whether NetworkManager has separate code paths for dhclient versus dhcpcd, and only one of them is buggy (in which case using the other would work around the problem) or whether it uses equivalent code paths for both, and you will just see the same problem with dhclient zombies instead of dhcpcd zombies. |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3893 Location: Rasi, Finland
|
Posted: Wed Jan 15, 2025 7:08 am Post subject: Re: openrc-init - zombie reaping |
|
|
Hu wrote: | To me, the only question is whether NetworkManager has separate code paths for dhclient versus dhcpcd, and only one of them is buggy | Yes. This is my point.
But it looks like there are, in fact, three possible dhcp configurations for networkmanager.
Code: | * Messages for package net-misc/networkmanager-1.48.10-r1:
* You have enabled USE=dhclient and/or USE=dhcpcd, but NetworkManager since
* version 1.20 defaults to the internal DHCP client. If the internal client
* works for you, and you're happy with, the alternative USE flags can be
* disabled. If you want to use dhclient or dhcpcd, then you need to tweak
* the main.dhcp configuration option to use one of them instead of internal. | ... as we can see from the networkmanager message.
Next I'll conduct the tests...
|
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3893 Location: Rasi, Finland
|
Posted: Thu Jan 16, 2025 10:41 am Post subject: |
|
|
Well. I'm baffled.
Code: | =================================================================
Package Settings
=================================================================
net-misc/networkmanager-1.48.10-r1::gentoo was built with the following:
USE="bluetooth concheck connection-sharing dhclient introspection modemmanager nftables nss ppp tools wifi -audit -debug -dhcpcd -elogind -gnutls -gtk-doc -iptables -iwd -libedit -ofono -ovs -policykit -psl -resolvconf (-selinux) -syslog -systemd -teamd -test -vala -wext" ABI_X86="(64) -32 (-x32)"
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,-z,pack-relative-relocs -Wl,--undefined-version" | ... now for little over 24h of usage with newly USE-flagged networkmanager ... no zombified dhclient processes. Code: | M710q ~ # pstree -As 18287
openrc-init---NetworkManager-+-dhclient
`-3*[{NetworkManager}] | Normally during a day I would assume at least five zombie processes under networkmanager. My rough guess is that a dhcpcd process gets zombified during dhcp re-lease or when the box loses wifi connection. I haven't dug that deep yet.
But what's happening with networkmanager+dhcpcd that does not happen anymore (or at least hasn't happened yet)?
- Has dhcpcd's signaling changed at some point that networkmanager doesn't know when dhcpcd is "done"?
- ... if they even "talk" to each other...
- Or is there really
Hu wrote: | separate code paths for dhclient versus dhcpcd | ?
I'm now tempted to recompile networkmanager again to use its internal dhcp client implementation and see what then happens.
|
|
|
|
Hu Administrator
Joined: 06 Mar 2007 Posts: 23056
|
Posted: Thu Jan 16, 2025 2:37 pm Post subject: |
|
|
For the dhclient case, is that the same dhclient process over an extended period? Check its process start time relative to when you inspect the process tree. If yes, then that says that dhcpcd sometimes dies (and NetworkManager leaves a zombie), and that dhclient does not intermittently die (so there is no zombie, and no way to tell if NetworkManager would clean up the zombies). If no, then dhclient is being restarted in a way that NetworkManager handles well, but dhcpcd is being restarted in a way that NetworkManager does not handle well. |
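One way to compare start times programmatically is to read field 22 (starttime) of /proc/<pid>/stat. The sketch below is Linux-specific and the helper name process_start_ticks() is made up for illustration. If two inspections of the DHCP client give different values (or the PID itself differs), a new process was started in between:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return field 22 (starttime, in clock ticks since
 * boot) of /proc/<pid>/stat, or -1 if it cannot be read. */
static long long process_start_ticks(pid_t pid)
{
    char path[64], buf[1024];
    char *p, *tok;
    int field;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    p = fgets(buf, sizeof buf, f);
    fclose(f);
    if (!p)
        return -1;

    /* Field 2 (the command name) may contain spaces, so resume parsing
     * after the last ')'. The next token is field 3. */
    p = strrchr(buf, ')');
    if (!p)
        return -1;
    field = 3;
    for (tok = strtok(p + 1, " "); tok; tok = strtok(NULL, " "), field++) {
        if (field == 22)
            return strtoll(tok, NULL, 10);
    }
    return -1;
}
```

Dividing the result by sysconf(_SC_CLK_TCK) converts the ticks to seconds since boot.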
|
|
|
grknight Retired Dev
Joined: 20 Feb 2015 Posts: 1994
|
Posted: Thu Jan 16, 2025 3:03 pm Post subject: |
|
|
It certainly would be good to avoid dhclient long term as ISC has abandoned that software.
As dhcpcd appears broken with NM (it would be nice if it did work), NM's internal implementation seems to be the best choice, if it works |
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3893 Location: Rasi, Finland
|
Posted: Thu Jan 16, 2025 5:38 pm Post subject: |
|
|
Hu wrote: | For the dhclient case, is that the same dhclient process over an extended period? Check its process start time relative to when you inspect the process tree. | Will do!
This is the process tree right now: Code: | M710q ~ # pstree -Asp 18287
openrc-init(1)---NetworkManager(18287)-+-dhclient(22372)
|-{NetworkManager}(18288)
|-{NetworkManager}(18289)
`-{NetworkManager}(18291) |
I'll check back after I've done several wifi disconnects. To be precise, by turning the WiFi AP off and on.
|
|
|
|
Zucca Moderator
Joined: 14 Jun 2007 Posts: 3893 Location: Rasi, Finland
|
Posted: Thu Jan 16, 2025 6:30 pm Post subject: |
|
|
It seems networkmanager does indeed run a new dhcp client command when a reconnect happens.
Code: | M710q ~ # pstree -Asp 18287
openrc-init(1)---NetworkManager(18287)-+-dhclient(22762)
|-{NetworkManager}(18288)
|-{NetworkManager}(18289)
`-{NetworkManager}(18291) |
I wonder what's then between networkmanager and dhcpcd that doesn't work. I mean, it works, but leaves zombies roaming around.
|
|