Gentoo Forums
[REOPENED] Auto-Rebooting System

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Fri May 22, 2009 9:20 pm    Post subject: [REOPENED] Auto-Rebooting System

Recently my system started to reboot instantly without any kind of warning. I don't know what it could be related to, 'cause I don't know if there is any log where the kernel would record this kind of thing, so the only thing I could do was check every part of my system I could think of.

I noticed that it crashes only when it is under heavy load, like compiling or a heavy math calculation (I use matlab frequently). Otherwise, it's running fine.

First I thought it had something to do with the kernel. I tested 4 different kernel versions: zen-2.6.29, zen-2.6.30-rc5, gentoo-2.6.26-rc4, gentoo-2.6.28.5. All of them showed the same behaviour.

Then I considered some weird bug in compositing or other graphics-related stuff. I tried running compilations in a pure console, and got the same behaviour.

The crashes are kinda random... there's no approximate time for them to happen, it just happens at some point if I leave my computer doing hard work, sometimes in 4min, sometimes in 4s, sometimes in 4h. Since there's no pattern I suspected a memory failure... but I don't know how to test the memory.

Appreciate any help.
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)


Last edited by Holysword on Wed Jun 10, 2009 12:48 am; edited 2 times in total

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9932
Location: almost Mile High in the USA

Posted: Fri May 22, 2009 9:31 pm

Some motherboards are designed to do something when the processor overheats. Might want to check if this is happening. If you can underclock your cpu, that's another thing to try.

Usually RAM errors don't behave like this. Normally it's motherboard or possibly CPU issues.

You can download and test memtest86+ to check your RAM. It's a good idea to check it anyway.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

AaronPPC
Guru

Joined: 29 May 2005
Posts: 522
Location: Tucson, AZ

Posted: Fri May 22, 2009 10:43 pm

I have had 3 computers that had CPU problems and they exhibited symptoms very similar to yours.
_________________
--Aaron

ronmon
Veteran

Joined: 15 Apr 2002
Posts: 1043
Location: Key West, FL

Posted: Fri May 22, 2009 11:37 pm

I'll bet that it's heat. Mine started doing this a few weeks ago when the weather started getting warmer here in the sub-tropics. Watching in gkrellm, I saw my CPU temp go up to almost 70C under heavy loads.

Going into BIOS and dropping the CPU voltage from 1.35 to 1.30 did the trick. Just .05 volt difference, and now it never goes over 40C.
_________________
Ask Questions the Smart Way - by ESR

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Sat May 23, 2009 9:52 am

I didn't think it was the heat 'cause I've never overclocked my CPU. But when you guys mentioned the BIOS I remembered that I'd enabled some "fancy" options in there, like CPU Fan warnings that are disabled by default. "Load Optimal Defaults" fixed it. Maybe this kind of "Warning" signal was confusing the kernel?

Anyway, the temperature sensors are not found here even though I enabled them in the kernel... where am I supposed to find them?
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

ronmon
Veteran

Joined: 15 Apr 2002
Posts: 1043
Location: Key West, FL

Posted: Sat May 23, 2009 11:08 am

Mine wasn't overclocked either.

You need to emerge lm_sensors, start it and set it to start automatically on boot.
Code:

/etc/init.d/lm_sensors start
rc-update add lm_sensors default

After that, you can check it from a terminal with the "sensors" command or use one of the many desktop applets available to monitor it constantly.

Edit: Just one more thought. If disabling fan and/or temperature warnings in your BIOS stops the machine from shutting down when it really is overheating you could fry your stuff. Get your sensors working to give yourself some peace of mind.
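
Since the machine goes down without warning, one way to see what the sensors said just before a reboot is to log snapshots to disk. A minimal sketch, assuming lm_sensors is already working; the function name `log_snapshot`, the `SNAPSHOT_CMD` variable and the log path are made up for illustration:

```shell
#!/bin/sh
# Append a timestamped snapshot of a command's output to a log file and
# flush it to disk, so the last reading has a chance to survive a sudden reboot.
# SNAPSHOT_CMD defaults to "sensors" (from lm_sensors); the variable name
# is a hypothetical knob, not something lm_sensors itself reads.
SNAPSHOT_CMD=${SNAPSHOT_CMD:-sensors}

log_snapshot() {
    logfile=$1
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        $SNAPSHOT_CMD 2>&1
    } >> "$logfile"
    sync   # push buffered writes to disk before the next possible crash
}

# Usage (runs until interrupted; the path is only an example):
#   while log_snapshot /var/log/sensors-snapshots.log; do sleep 5; done
```

After an unexpected reboot, the tail of the log file should show the last readings taken before the machine went down, which can help confirm or rule out a temperature spike.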
_________________
Ask Questions the Smart Way - by ESR

entrophie
n00b

Joined: 24 Aug 2006
Posts: 13

Posted: Tue May 26, 2009 6:55 am

Holysword: you don't need to overclock your CPU to get overheating. The silver thermal paste between the CPU and the heatsink tends to degrade over time. So after you check the temperature with sensors, you can try to improve the contact.

szczerb
Veteran

Joined: 24 Feb 2007
Posts: 1709
Location: Poland => Lodz

Posted: Tue May 26, 2009 6:59 am

First of all, check that your heatsink isn't clogged with thick dust that blocks the airflow.

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Sun May 31, 2009 12:19 pm

Thank you guys for answering, but I still don't think it's overheating. I've checked the temperature with sensors and it seems okay.
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9932
Location: almost Mile High in the USA

Posted: Sun May 31, 2009 1:28 pm

I guess if you rule out all the normal reasons that cause reboots then the only things that remain are the non-normal...

and those are

1. Hackers. But very unlikely.

2. Your hardware is broken. You need to buy a new power supply most likely, or possibly motherboard.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Mon Jun 01, 2009 6:07 pm

I still don't think it's any of those. As I stated before, I disabled some non-default bios options and the problem went away.
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Wed Jun 10, 2009 12:51 am

Turns out that the problem came back, but more frequently.

Again, it's not about the temperature (I still check the temperature sensors, nothing odd. Sometimes it crashes at 47°C). The difference is that now I don't need to be doing something aggressive, which makes this problem even more annoying.
I'll try to install memtest86+ the fancy grub way and check the memory, and I'll post the results here.
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

rjw8703
Apprentice

Joined: 14 Aug 2004
Posts: 246
Location: Auburn, Al

Posted: Wed Jun 10, 2009 2:34 pm

I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem.

Gusar
Advocate

Joined: 09 Apr 2005
Posts: 2665
Location: Slovenia

Posted: Wed Jun 10, 2009 3:13 pm

Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away.

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Thu Jun 11, 2009 11:20 am

Gusar wrote:
Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away.


It can't be the same problem I have, since my machine turns off but doesn't reboot again within a few minutes (there can actually be hours or days between reboots). It's really random.

rjw8703 wrote:
I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem.

Is there any way to test that, to make sure it's the voltage regulator?
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9932
Location: almost Mile High in the USA

Posted: Sat Jun 13, 2009 7:51 am

Holysword wrote:
Gusar wrote:
Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away.


It can't be the same problem I have, since my machine turns off but doesn't reboot again within a few minutes (there can actually be hours or days between reboots). It's really random.

rjw8703 wrote:
I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem.

Is there any way to test that, to make sure it's the voltage regulator?


Bad power will produce random results.

There's really no way to cheaply test/check PSUs and motherboard voltage regulators. The equipment needed is basically a high speed DSO (digital storage oscilloscope). A simple voltage check is not enough; it won't detect intermittent spikes. Replacing the motherboard/psu is _much_ cheaper, even for diagnostics, unless you just so happen to have a DSO burning a hole in your pocket (or garage or something).
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Thu Jun 25, 2009 1:43 am

Well, while I look for someone to test my motherboard, I was checking dmesg and suddenly realized that from time to time it complains about the voltage of in5 being 0. Here's the relevant part of the "sensors" output:

Code:
in0:       +1.15 V  (min =  +0.00 V, max =  +4.08 V)       
in1:       +2.14 V  (min =  +0.00 V, max =  +4.08 V)       
in2:       +3.39 V  (min =  +0.00 V, max =  +4.08 V)       
in3:       +2.96 V  (min =  +0.00 V, max =  +4.08 V)       
in4:       +0.48 V  (min =  +0.00 V, max =  +0.74 V)       
in5:       +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in6:       +1.06 V  (min =  +0.00 V, max =  +4.08 V)
in7:       +3.06 V  (min =  +0.00 V, max =  +4.08 V)
in8:       +3.31 V


I'm not sure what it means, and I don't know if it's something to worry about, but the word "ALARM" never sounds good...
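
For a longer listing, the flagged inputs can be pulled out with a small filter. This is only a sketch that relies on the ALARM marker sitting at the end of the line, as in the output above; `list_alarms` is a made-up helper name, not part of lm_sensors:

```shell
#!/bin/sh
# Print the label of every line that ends with the ALARM marker.
# Reads "sensors"-style text on stdin; the label is the first field
# with its trailing colon stripped (e.g. "in5:" becomes "in5").
list_alarms() {
    awk '/ALARM[ \t]*$/ { sub(/:.*/, "", $1); print $1 }'
}

# Usage:
#   sensors | list_alarms
```

Fed the listing above, this would print only `in5`. It says nothing about whether the alarm is meaningful, just which inputs the chip flagged.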
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9932
Location: almost Mile High in the USA

Posted: Thu Jun 25, 2009 11:32 pm

Well that goes to show that lm-sensors is unreliable, and nothing else. There are no transistors that will work at 0V so obviously something's wrong with the detection there, or that input is simply unused. Or perhaps your PSU doesn't supply -5V, as most mb's don't use it nowadays, and it's perfectly fine for it to be at 0V.

Not only that, likely the multiplier constants in your sensors.conf do not match your motherboard and are producing really unreliable results - where's your 12V line? Where's your -12V line? Which line is which?

"ALARM" just means the chip detected a number from the voltage line that was outside its bounds. But what are the bounds? The bounds are set up by software which once again may or may not match your motherboard to produce valid results.

I'm sorry, the cheapest way is to buy a new MB or PSU to test. You might be able to get away with a multimeter but again, multimeters also tend to be slow and won't detect fast glitches in your power. There's no other way you can really tell for sure. The on board sensor chips are not only inaccurate, but also slow - at most a few samples every second versus millions of samples for a DSO.
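
To put rough numbers on the sampling-rate point: if a sampler takes instantaneous readings at a fixed rate and a glitch starts at a random moment, the chance of one sample landing inside the glitch is about glitch length times sample rate, capped at 1. This is a simplified model, and the 10 microsecond glitch and the two rates below are illustrative assumptions, not measurements from any board; `catch_chance` is a made-up helper name:

```shell
#!/bin/sh
# catch_chance GLITCH_SECONDS SAMPLES_PER_SECOND
# Rough chance (0..1) that a periodic sampler lands inside a single glitch
# of the given length, assuming instantaneous samples and random timing.
catch_chance() {
    awk -v d="$1" -v f="$2" 'BEGIN { p = d * f; if (p > 1) p = 1; printf "%.6f\n", p }'
}

catch_chance 0.00001 2         # 10 us glitch at 2 samples/s   -> 0.000020
catch_chance 0.00001 1000000   # same glitch at 1 M samples/s  -> 1.000000
```

Under these assumptions a sensor chip polled twice a second would miss almost every such glitch, while a scope sampling a million times a second would see it every time.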
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

Holysword
l33t

Joined: 19 Nov 2006
Posts: 946
Location: Greece

Posted: Mon Jun 29, 2009 2:22 am

eccerr0r wrote:
Well that goes to show that lm-sensors is unreliable, and nothing else.

Maybe it proves how n00b I am, 'cause I haven't configured those things properly :S Anyway, in the future I will try to configure it.
_________________
"Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)

eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 9932
Location: almost Mile High in the USA

Posted: Mon Jun 29, 2009 10:57 pm

Holysword wrote:
Anyway, in the future I will try to configure [lm-sensors].

Unless someone else has figured out the numbers for the exact board you have -- there's really no way for a "mere mortal" to configure it, without knowing exactly how it's wired on the motherboard -- some reverse engineering or motherboard manufacturer support is needed.

Just because another motherboard has the same chip as yours means nothing for configuration. It needs to be the exact same motherboard and revision of motherboard to use the same config. This is because the resistors used may be hooked up differently on different boards and different revisions.

I basically ignore lm-sensors numbers. One of my machines looks like

Code:

subaru:/root# sensors
it8718-isa-0290
Adapter: ISA adapter
in0:       +1.25 V  (min =  +0.00 V, max =  +4.08 V)   
in1:       +1.84 V  (min =  +0.00 V, max =  +4.08 V)   
in2:       +3.31 V  (min =  +0.00 V, max =  +4.08 V)   
in3:       +2.93 V  (min =  +0.00 V, max =  +4.08 V)   
in4:       +3.06 V  (min =  +0.00 V, max =  +4.08 V)   
in5:       +0.00 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in6:       +1.18 V  (min =  +0.00 V, max =  +4.08 V)   
in7:       +4.08 V  (min =  +0.00 V, max =  +4.08 V)   ALARM
in8:       +3.18 V
fan1:     2096 RPM  (min =   10 RPM)                   
fan2:        0 RPM  (min =    0 RPM)                   
fan3:        0 RPM  (min =    0 RPM)                   
temp1:       +53 C  (low  =  +127 C, high =  +127 C)   sensor = thermistor   
temp2:        -2 C  (low  =  +127 C, high =  +127 C)   sensor = thermistor   
temp3:       +45 C  (low  =  +127 C, high =   +70 C)   sensor = diode   
vid:      +0.000 V


yet this machine runs perfectly fine despite double the number of alarms...

Another of my machines is a bit more correct after hacking /etc/sensors.conf slightly and running sensors -s to reload:
Code:

doujima:~$ sensors
it87-isa-0290
Adapter: ISA adapter
CPU:       +1.58 V  (min =  +0.00 V, max =  +4.08 V)   
RAM:       +2.50 V  (min =  +0.00 V, max =  +4.08 V)   
+3.3V:     +3.20 V  (min =  +0.00 V, max =  +4.08 V)   
+5V:       +4.87 V  (min =  +0.00 V, max =  +6.85 V)   
+12V:     +12.16 V  (min =  +0.00 V, max = +16.32 V)   
-12V:     -12.38 V  (min =  -0.00 V, max = -14.69 V)   
-5V:       -5.54 V  (min =  -0.00 V, max =  -6.12 V)   
Stdby:     +5.05 V  (min =  +0.00 V, max =  +6.85 V)   
VBat:      +3.42 V
fan1/CPU: 3375 RPM  (min =    0 RPM, div = 8)         
fan2/PS:  2220 RPM  (min =    0 RPM, div = 8)         
Temp1/MB:    +34 C  (low  =    -2 C, high =  +254 C)   sensor = thermistor   
Temp2/VRM:   +40 C  (low  =    -1 C, high =  +127 C)   sensor = thermistor   
Temp3/CPU:   +53 C  (low  =    -1 C, high =  +127 C)   sensor = thermistor   


yet I still would not bet a penny those numbers are correct. They merely "look" decent but could still be totally inaccurate (BTW the first is a core2 board (gigabyte g31 board), the second is an athlon in a fairly popular ECS k7s5a). There could be some correlation from model to model, but there's definitely no guarantee the same input number is attached to the same rail.
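
The scaling behind the second listing can be checked with a little arithmetic. The raw inputs in the first listing all top out at 4.08 V, so a higher rail has to pass through a resistor divider before it reaches the chip, and the reading is then multiplied back up by (R1/R2 + 1). A sketch of that calculation follows; the resistor values (30k/10k and 6.8k/10k) are assumptions chosen because they reproduce the +12V and +5V max values shown above, they are not read from either board, and `scale_rail` is a made-up helper name:

```shell
#!/bin/sh
# scale_rail RAW R1 R2
# Convert a raw chip reading RAW (volts), taken behind an R1/R2 divider,
# back to the rail voltage: rail = raw * (R1/R2 + 1).
scale_rail() {
    awk -v raw="$1" -v r1="$2" -v r2="$3" 'BEGIN { printf "%.2f\n", raw * (r1 / r2 + 1) }'
}

scale_rail 4.08 30 10    # full-scale reading, assumed 30k/10k divider  -> 16.32
scale_rail 4.08 6.8 10   # full-scale reading, assumed 6.8k/10k divider -> 6.85
```

The same multiplier applied to the wrong input, or to a board wired with different resistors, gives numbers that look plausible but mean nothing, which is the point about needing the exact board and revision.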
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
All times are GMT