djdunn l33t
Joined: 26 Dec 2004 Posts: 812
|
Posted: Fri Sep 23, 2022 6:02 pm Post subject: random unidentified reboots |
|
|
I'm getting random reboots and I'm not really seeing anything in messages. The reboots can come within 3 minutes of each other, or up to 3 days apart.
I'm not seeing any kernel panics in pstore, and yes, I am sure that pstore can and does save kernel panics.
I disabled reboot on hard/soft lockup in the kernel.
I replaced and upgraded my PSU from 650 W to 1000 W.
I've stressed my CPU at 100% for hours.
I've stressed the GPU with Cyberpunk 2077 at max settings.
/var/log/messages doesn't say anything useful around the reboots:
Code: |
Sep 23 00:07:50 Iris zed[4575]: Finished "pool_import-led.sh" eid=3 pid=4589 time=0.052742s exit=0
Sep 23 00:07:54 Iris dhcpcd[3715]: wlan0: no IPv6 Routers available
Sep 23 00:10:26 Iris syslog-ng[3467]: syslog-ng starting up; version='3.37.1'
Sep 22 23:45:47 Iris smartd[4250]: Device: /dev/disk/by-id/ata-WDC_WD4003FRYZ-01F0DB0_V6KSWESD [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 139 to 142
Sep 23 00:07:36 Iris syslog-ng[3536]: syslog-ng starting up; version='3.37.1'
|
Here is my dmesg:
http://dpaste.com/D5CSVYR77
Here is my emerge --info:
http://dpaste.com/D5SUTGD8C
Kernel: 5.15.67-gentoo _________________ “Music is a moral law. It gives a soul to the Universe, wings to the mind, flight to the imagination, a charm to sadness, gaiety and life to everything. It is the essence of order, and leads to all that is good and just and beautiful.”
― Plato |
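For anyone following along who wants to double-check the pstore claim, here is a minimal Python sketch that lists whatever records are sitting in the pstore directory. It assumes pstore is mounted at the usual /sys/fs/pstore and that you run it as root; the function name list_pstore_records is made up for this example. An empty result only means no records are present right now, not that pstore is working.

```python
from pathlib import Path


def list_pstore_records(pstore_dir="/sys/fs/pstore"):
    """Return (name, size_in_bytes) for each pstore record, oldest first.

    The default path is the usual pstore mount point; that is an
    assumption, so pass another directory if yours differs.
    """
    directory = Path(pstore_dir)
    if not directory.is_dir():
        # pstore is not mounted here, or the path is wrong.
        return []
    records = [p for p in directory.iterdir() if p.is_file()]
    # Sort by modification time, then by name to keep ties stable.
    records.sort(key=lambda p: (p.stat().st_mtime, p.name))
    return [(p.name, p.stat().st_size) for p in records]
```

Calling list_pstore_records() after an unexpected reboot and printing the result shows at a glance whether the kernel left anything behind.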
|
|
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9886 Location: almost Mile High in the USA
|
Posted: Fri Sep 23, 2022 9:50 pm Post subject: |
|
|
How do you test your CPU for hours when the machine crashes in 3 minutes? Or are you saying it's stable only under 100% load? Solution is easy then, just run your CPU at 100% load with nice(1)...
Or are you really not giving us the real situation? _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
|
|
djdunn l33t
Joined: 26 Dec 2004 Posts: 812
|
Posted: Sat Sep 24, 2022 12:26 am Post subject: |
|
|
eccerr0r wrote: | How do you test your CPU for hours when the machine crashes in 3 minutes? Or are you saying it's stable only under 100% load? Solution is easy then, just run your CPU at 100% load with nice(1)...
Or are you really not giving us the real situation? |
Maybe I'm not being clear.
I ran emerge -e world, and it would complete, running at 100% for hours at a time.
It just randomly reboots. Sometimes the reboots happen 2-3 minutes apart: I'll see the boot messages in the logs just a couple of minutes apart, so it rebooted maybe 2-3 times within a 10-minute window. Sometimes there are 24 hours between reboots, and sometimes it goes as long as 2-3 days. |
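The "boot messages a couple of minutes apart" observation can be turned into numbers. Below is a rough Python sketch that pulls the boot timestamps out of syslog-style lines and computes the gaps between them. It assumes the "syslog-ng starting up" line from the log excerpt in the first post appears once per boot, that the log uses the classic "Sep 23 00:10:26" prefix with English month names, and that the year (which syslog does not record) is supplied by hand. The names boot_times and boot_gaps are invented for this example.

```python
from datetime import datetime

# Assumption: this message is logged once per boot, as in the posted excerpt.
BOOT_MARKER = "syslog-ng starting up"


def boot_times(lines, year=2022):
    """Return sorted datetimes of every line that contains BOOT_MARKER."""
    times = []
    for line in lines:
        if BOOT_MARKER not in line:
            continue
        # Classic syslog prefix is 15 characters and carries no year,
        # so the year is prepended before parsing.
        stamp = f"{year} {line[:15]}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return sorted(times)


def boot_gaps(times):
    """Return the timedelta between each pair of consecutive boots."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of /var/log/messages (for example via open("/var/log/messages", errors="replace")) and printing boot_gaps(boot_times(lines)) gives the spacing of the reboots, which makes it easier to compare against UPS events or anything else with a timestamp.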
|
|
|
pietinger Moderator
Joined: 17 Oct 2006 Posts: 5370 Location: Bavaria
|
Posted: Sat Sep 24, 2022 7:10 am Post subject: |
|
|
I would guess a power problem. You have already changed your PSU. Do you have a backup battery (you would need an ONLINE UPS)? If not, can you rent one? Yes, you can have very short power failures that you won't notice (from your lights). |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54813 Location: 56N 3W
|
Posted: Sat Sep 24, 2022 8:50 am Post subject: |
|
|
djdunn,
I'll guess that you have a watchdog somewhere and when the w'dog isn't patted, it times out and forces a reboot.
OK, it's not a complete guess. From dmesg: |
[ 0.601493] watchdog: Disabling watchdog on nohz_full cores by default
[ 0.601521] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter. |
You get an NMI at timeout.
So what is supposed to pat your watchdog, and why isn't it?
This is still in the realms of guesswork, as we don't know that it's a watchdog timeout.
You could try disabling kernel support for watchdog timers.
That's not really a fix but it may help with some circumstantial evidence.
There may even be a kernel parameter ... poke about in /usr/src/linux/Documentation/admin-guide/... _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
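Before rebuilding the kernel, it may be worth looking at what the running kernel says about its lockup watchdogs. This is a minimal Python sketch, not a diagnosis: it reads a handful of sysctl files under /proc/sys/kernel (nmi_watchdog, watchdog, watchdog_thresh, hardlockup_panic, softlockup_panic, panic). Which of these exist depends on the kernel configuration, so a missing file is reported as None. The function name read_watchdog_sysctls is made up for this example, and it says nothing about a hardware watchdog timer on the board.

```python
from pathlib import Path

# Lockup-detector and panic related sysctls; some may be absent
# depending on how the kernel was configured.
WATCHDOG_SYSCTLS = (
    "nmi_watchdog",
    "watchdog",
    "watchdog_thresh",
    "hardlockup_panic",
    "softlockup_panic",
    "panic",
)


def read_watchdog_sysctls(root="/proc/sys/kernel"):
    """Return {sysctl_name: value_string or None if unreadable}."""
    settings = {}
    for name in WATCHDOG_SYSCTLS:
        try:
            settings[name] = (Path(root) / name).read_text().strip()
        except OSError:
            # File missing (feature not built in) or not readable.
            settings[name] = None
    return settings
```

Printing read_watchdog_sysctls() shows, for instance, whether the NMI watchdog is currently on and whether a detected lockup is set to panic; a non-zero panic value means the kernel reboots that many seconds after a panic.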
|
|
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9886 Location: almost Mile High in the USA
|
Posted: Sat Sep 24, 2022 1:28 pm Post subject: |
|
|
It appears that the NMI watchdog is just using the hardware PMU counter to pet the watchdog via the NMI line. That could explain reboots, but it doesn't quite explain the random nature of the problem... |
|
|
|
djdunn l33t
Joined: 26 Dec 2004 Posts: 812
|
Posted: Sun Sep 25, 2022 3:00 pm Post subject: |
|
|
NeddySeagoon wrote: | djdunn,
I'll guess that you have a watchdog somewhere and when the w'dog isn't patted, it times out and forces a reboot.
OK, it's not a complete guess. From dmesg: |
[ 0.601493] watchdog: Disabling watchdog on nohz_full cores by default
[ 0.601521] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter. |
You get an NMI at timeout.
So what is supposed to pat your watchdog, and why isn't it?
This is still in the realms of guesswork, as we don't know that it's a watchdog timeout.
You could try disabling kernel support for watchdog timers.
That's not really a fix but it may help with some circumstantial evidence.
There may even be a kernel parameter ... poke about in /usr/src/linux/Documentation/admin-guide/... |
I can't remember what I was even using the watchdog for, or even whether I "need" it for anything. |
|
|
|
djdunn l33t
Joined: 26 Dec 2004 Posts: 812
|
Posted: Sun Sep 25, 2022 11:17 pm Post subject: |
|
|
pietinger wrote: | I would guess a power problem. You have already changed your PSU. Do you have a backup battery (you would need an ONLINE UPS)? If not, can you rent one? Yes, you can have very short power failures that you won't notice (from your lights). |
Yeah, I have a UPS. With the power quality in Florida, outages can last several minutes, or the lights can flicker for several seconds. |
|