Gentoo Forums
djdunn
l33t

Joined: 26 Dec 2004
Posts: 812

Posted: Fri Sep 23, 2022 6:02 pm    Post subject: random unidentified reboots

I'm getting random reboots and I'm not really seeing anything in /var/log/messages. Reboots can happen within 3 minutes of each other, or up to 3 days apart.

I'm not seeing any kernel panics in pstore, and yes, I am sure that pstore can and does save kernel panics.

I disabled reboot on hard/soft lockup in the kernel.

I replaced and upgraded my PSU from 650 W to 1000 W.

I've stressed my CPU at 100% for hours.

I've stressed the GPU with Cyberpunk 2077 at max settings.

/var/log/messages doesn't say anything:

Code:
Sep 23 00:07:50 Iris zed[4575]: Finished "pool_import-led.sh" eid=3 pid=4589 time=0.052742s exit=0
Sep 23 00:07:54 Iris dhcpcd[3715]: wlan0: no IPv6 Routers available
Sep 23 00:10:26 Iris syslog-ng[3467]: syslog-ng starting up; version='3.37.1'

Sep 22 23:45:47 Iris smartd[4250]: Device: /dev/disk/by-id/ata-WDC_WD4003FRYZ-01F0DB0_V6KSWESD [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 139 to 142
Sep 23 00:07:36 Iris syslog-ng[3536]: syslog-ng starting up; version='3.37.1'


here is my dmesg
http://dpaste.com/D5CSVYR77

here is my emerge --info
http://dpaste.com/D5SUTGD8C

Kernel: 5.15.67-gentoo
_________________
“Music is a moral law. It gives a soul to the Universe, wings to the mind, flight to the imagination, a charm to sadness, gaiety and life to everything. It is the essence of order, and leads to all that is good and just and beautiful.”

― Plato
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9886
Location: almost Mile High in the USA

Posted: Fri Sep 23, 2022 9:50 pm

How do you test your CPU for hours when the machine crashes in 3 minutes? Or are you saying it's stable only under 100% load? Solution is easy then, just run your CPU at 100% load with nice(1)...
Or are you really not giving us the real situation?
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
djdunn
l33t

Joined: 26 Dec 2004
Posts: 812

Posted: Sat Sep 24, 2022 12:26 am

eccerr0r wrote:
How do you test your CPU for hours when the machine crashes in 3 minutes? Or are you saying it's stable only under 100% load? Solution is easy then, just run your CPU at 100% load with nice(1)...
Or are you really not giving us the real situation?


Maybe I'm not being clear.

I did emerge -e world, and it would complete, running at 100% for hours at a time.

It just randomly reboots. Sometimes the reboots happen 2-3 minutes apart: I'll see the boot messages in the logs just a couple of minutes apart, so it rebooted maybe 2-3 times within a 10-minute window. Sometimes there are 24 hours between reboots, and sometimes it goes as long as 2-3 days.
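
The spacing between boots can be measured from the log rather than eyeballed. Below is a minimal sketch that treats every "syslog-ng starting up" line as a boot marker and reports the gaps between them. The function names are hypothetical, the year is an assumption (classic syslog timestamps carry none), and a plain restart of the syslog-ng service would also be counted as a boot, so the result is only a rough indicator.

```python
from datetime import datetime

# Assumption: syslog-ng logs this message once per boot (it also logs it
# whenever the service itself is restarted, which would be a false positive).
STARTUP_MARKER = "syslog-ng starting up"


def boot_times(lines, year=2022):
    """Return the sorted timestamps of all lines containing STARTUP_MARKER."""
    times = []
    for line in lines:
        if STARTUP_MARKER not in line:
            continue
        # The first three fields look like "Sep 23 00:10:26"; the year is
        # not in the log, so it is supplied by the caller.
        stamp = " ".join(line.split()[:3])
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return sorted(times)


def boot_gaps(times):
    """Return the time differences between consecutive boot timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Lines taken from the log excerpt in the first post.
sample = [
    "Sep 23 00:07:54 Iris dhcpcd[3715]: wlan0: no IPv6 Routers available",
    "Sep 23 00:10:26 Iris syslog-ng[3467]: syslog-ng starting up; version='3.37.1'",
    "Sep 23 00:07:36 Iris syslog-ng[3536]: syslog-ng starting up; version='3.37.1'",
]

for gap in boot_gaps(boot_times(sample)):
    print(gap)  # prints 0:02:50
```

To run it over the whole log, pass `open("/var/log/messages")` instead of `sample`; a long list of gaps makes it easier to see whether the reboots cluster around particular times of day.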
pietinger
Moderator

Joined: 17 Oct 2006
Posts: 5370
Location: Bavaria

Posted: Sat Sep 24, 2022 7:10 am

I would guess a power problem. You have already changed your PSU. Do you have a backup battery (you would need an ONLINE UPS)? If not, can you rent one? Yes, you can have very short power failures that you won't see (from your lights).
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54810
Location: 56N 3W

Posted: Sat Sep 24, 2022 8:50 am

djdunn,

I'll guess that you have a watchdog somewhere and when the w'dog isn't patted, it times out and forces a reboot.

OK, it's not a complete guess
dmesg:
 
[    0.601493] watchdog: Disabling watchdog on nohz_full cores by default
[    0.601521] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

You get an NMI at timeout.

So what is supposed to pat your watchdog and why isn't it?

This is still in the realms of guesswork, as we don't know that it's a watchdog timeout.

You could try disabling kernel support for watchdog timers.
That's not really a fix but it may help with some circumstantial evidence.
There may even be a kernel parameter ... poke about in /usr/src/linux/Documentation/admin-guide/...
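
For what it's worth, the lockup detector can usually be inspected at runtime before rebuilding anything. The sketch below reads the relevant sysctl files under /proc/sys/kernel and reports whether a detected lockup could turn into a reboot at all. It assumes a kernel built with the lockup detector, so that files such as nmi_watchdog, hardlockup_panic, softlockup_panic and panic exist; the function names are made up for illustration. A missing file is reported as None rather than treated as an error.

```python
from pathlib import Path

# Sysctl files that are normally present when the kernel lockup detector is
# built in. Any of them may be absent on a given kernel configuration.
WATCHDOG_SYSCTLS = (
    "nmi_watchdog",
    "watchdog",
    "watchdog_thresh",
    "hardlockup_panic",
    "softlockup_panic",
    "panic",
)


def read_watchdog_settings(base="/proc/sys/kernel"):
    """Return {sysctl name: value string, or None if the file is unreadable}."""
    settings = {}
    for name in WATCHDOG_SYSCTLS:
        try:
            settings[name] = (Path(base) / name).read_text().strip()
        except OSError:
            settings[name] = None
    return settings


def reboot_on_lockup_possible(settings):
    """True if a lockup would panic AND a panic would reboot the machine.

    kernel.panic == 0 means "stay halted after a panic"; any other value
    means the kernel reboots (after a delay if positive).
    """
    def as_int(name):
        value = settings.get(name)
        return int(value) if value not in (None, "") else 0

    panics = as_int("hardlockup_panic") or as_int("softlockup_panic")
    return bool(panics) and as_int("panic") != 0


if __name__ == "__main__":
    current = read_watchdog_settings()
    for name, value in current.items():
        print(f"kernel.{name} = {value}")
    print("lockup could cause a reboot:", reboot_on_lockup_possible(current))
```

If this reports False, a lockup caught by the NMI watchdog should leave a trace or a hung machine rather than a silent reboot, which would point away from the watchdog theory. To take the NMI watchdog out of the picture without a kernel rebuild, booting with `nmi_watchdog=0` or writing 0 to `kernel.nmi_watchdog` should do it.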
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9886
Location: almost Mile High in the USA

Posted: Sat Sep 24, 2022 1:28 pm

It appears that the NMI watchdog is just using the hardware PMU counter to pet the watchdog via the NMI line. That could explain reboots, but it doesn't quite explain the random nature of the problem...
djdunn
l33t

Joined: 26 Dec 2004
Posts: 812

Posted: Sun Sep 25, 2022 3:00 pm

NeddySeagoon wrote:
djdunn,

I'll guess that you have a watchdog somewhere and when the w'dog isn't patted, it times out and forces a reboot.

OK, it's not a complete guess
dmesg:
 
[    0.601493] watchdog: Disabling watchdog on nohz_full cores by default
[    0.601521] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.

You get an NMI at timeout.

So what is supposed to pat your watchdog and why isn't it?

This is still in the realms of guesswork, as we don't know that it's a watchdog timeout.

You could try disabling kernel support for watchdog timers.
That's not really a fix but it may help with some circumstantial evidence.
There may even be a kernel parameter ... poke about in /usr/src/linux/Documentation/admin-guide/...


I can't remember what I was even using the watchdog for, or even if I "need" it for anything.
djdunn
l33t

Joined: 26 Dec 2004
Posts: 812

Posted: Sun Sep 25, 2022 11:17 pm

pietinger wrote:
I would guess a power problem. You have already changed your PSU. Do you have a backup battery (you would need an ONLINE UPS)? If not, can you rent one? Yes, you can have very short power failures that you won't see (from your lights).


Yeah, I have a UPS. Because of the power quality in Florida, outages can last several minutes, or the lights can flicker for several seconds.