Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Fri May 22, 2009 9:20 pm Post subject: [REOPENED] Auto-Rebooting System |
|
|
Recently my system started rebooting instantly, without any kind of warning. I don't know what it's related to, 'cause I don't know of any log where the kernel would register this kind of thing, so the only thing I could do was check every piece of my system I could find.
I noticed that it crashes only when it's doing heavy work, like compiling or a heavy math calculation (I use MATLAB frequently). For anything else, it's running fine.
First I thought it had something to do with the kernel. I ran tests with 4 different kernel versions: zen-2.6.29, zen-2.6.30-rc5, gentoo-2.6.26-rc4, gentoo-2.6.28.5. All of them showed the same behaviour.
Then I considered some weird bug in compositing or other graphics-related stuff. I tried running compilations in a pure console, and got the same behaviour.
The crashes are kinda random... there's no approximate time for them to happen, it just happens at some point if I leave my computer doing hard work, sometimes in 4min, sometimes in 4s, sometimes in 4h. Since there's no pattern I suspected a memory failure... but I don't know how to test the memory.
Appreciate any help. _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach)
Last edited by Holysword on Wed Jun 10, 2009 12:48 am; edited 2 times in total |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9932 Location: almost Mile High in the USA
|
Posted: Fri May 22, 2009 9:31 pm Post subject: |
|
|
Some motherboards are designed to do something when the processor overheats. Might want to check if this is happening. If you can underclock your cpu, that's another thing to try.
Usually RAM errors don't behave like this. Normally it's motherboard or possibly CPU issues.
You can download and run memtest86+ to check your RAM. It's a good idea to check it anyway. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
AaronPPC Guru

Joined: 29 May 2005 Posts: 522 Location: Tucson, AZ
|
Posted: Fri May 22, 2009 10:43 pm Post subject: |
|
|
I have had 3 computers that had CPU problems and they exhibited symptoms very similar to yours. _________________ --Aaron |
|
ronmon Veteran


Joined: 15 Apr 2002 Posts: 1043 Location: Key West, FL
|
Posted: Fri May 22, 2009 11:37 pm Post subject: |
|
|
I'll bet that it's heat. Mine started doing this a few weeks ago when the weather started getting warmer here in the sub-tropics. Watching in gkrellm, I saw my CPU temp go up to almost 70C under heavy loads.
Going into BIOS and dropping the CPU voltage from 1.35 to 1.30 did the trick. Just .05 volt difference, and now it never goes over 40C. _________________ Ask Questions the Smart Way - by ESR |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Sat May 23, 2009 9:52 am Post subject: |
|
|
I didn't think it was the heat 'cause I've never overclocked my CPU. But when you guys mentioned the BIOS I remembered that I'd enabled some "fancy" options in there, like CPU fan warnings that are disabled by default. "Load Optimal Defaults" fixed it. Maybe this kind of "warning" signal was confusing the kernel?
Anyway, the temperature sensors are not found here even though I enabled them in the kernel... where am I supposed to find them? _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
ronmon Veteran


Joined: 15 Apr 2002 Posts: 1043 Location: Key West, FL
|
Posted: Sat May 23, 2009 11:08 am Post subject: |
|
|
Mine wasn't overclocked either.
You need to emerge lm_sensors, start it and set it to start automatically on boot.
Code: |
/etc/init.d/lm_sensors start
rc-update add lm_sensors default
|
After that, you can check it from a terminal with the "sensors" command or use one of the many desktop applets available to monitor it constantly.
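Since the reboots here come without warning, it can help to keep a log of the readings so the last sample before a crash is still on disk afterwards. What follows is only a rough sketch: `log_sensors` and the `SENSORS_CMD` variable are made-up names for illustration (not part of lm_sensors), and it assumes the `sensors` command is already working.

```shell
#!/bin/sh
# Append timestamped sensor readings to a log file and flush them to disk,
# so the last sample taken before a sudden reboot survives.
# log_sensors and SENSORS_CMD are illustrative names, not part of lm_sensors.
log_sensors() {
    logfile=$1          # file to append readings to
    interval=${2:-5}    # seconds between samples
    count=${3:-0}       # number of samples; 0 means run until interrupted
    cmd=${SENSORS_CMD:-sensors}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        {
            date '+%Y-%m-%d %H:%M:%S'
            $cmd
        } >> "$logfile"
        sync            # push the data out of the page cache before the next sample
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Something like `log_sensors /root/sensors.log 5` (the path is just an example) could be left running in a spare console during a compile; after the next reboot, the tail of the log shows the last temperatures recorded before the crash.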
Edit: Just one more thought. If disabling fan and/or temperature warnings in your BIOS stops the machine from shutting down when it really is overheating you could fry your stuff. Get your sensors working to give yourself some peace of mind. _________________ Ask Questions the Smart Way - by ESR |
|
entrophie n00b

Joined: 24 Aug 2006 Posts: 13
|
Posted: Tue May 26, 2009 6:55 am Post subject: |
|
|
Holysword: you don't need to overclock your CPU for it to overheat. Some silver thermal pastes between the CPU and the heatsink tend to leak. So after you check the temperature with sensors, you can try to improve the contact. |
|
szczerb Veteran

Joined: 24 Feb 2007 Posts: 1709 Location: Poland => Lodz
|
Posted: Tue May 26, 2009 6:59 am Post subject: |
|
|
First of all, check that your heatsink isn't full of thick dust, which can very well block the airflow. |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Sun May 31, 2009 12:19 pm Post subject: |
|
|
Thank you guys for answering, but I still don't think it's overheating. I've checked the temperature with sensors and it seems okay. _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9932 Location: almost Mile High in the USA
|
Posted: Sun May 31, 2009 1:28 pm Post subject: |
|
|
I guess if you rule out all the normal causes of reboots, then the only things that remain are the non-normal ones...
and those are
1. Hackers. But very unlikely.
2. Your hardware is broken. You need to buy a new power supply most likely, or possibly motherboard. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Mon Jun 01, 2009 6:07 pm Post subject: |
|
|
I still don't think it's any of those. As I stated before, I disabled some non-default BIOS options and the problem went away. _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Wed Jun 10, 2009 12:51 am Post subject: |
|
|
Turns out that the problem came back, but more frequently.
Again, it's not about the temperature (I still check the temperature sensors, nothing odd. Sometimes it crashes at 47°C). The difference is that now I don't need to be doing something aggressive, which makes this problem even more annoying.
I'll try installing memtest86+ the fancy way, as a GRUB entry, to check the memory, and I'll post the results here. _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
rjw8703 Apprentice

Joined: 14 Aug 2004 Posts: 246 Location: Auburn, Al
|
Posted: Wed Jun 10, 2009 2:34 pm Post subject: |
|
|
I had this problem a while back. It turned out that the voltage regulators on the m/b were failing intermittently. Replacing the m/b fixed the problem. |
|
Gusar Advocate

Joined: 09 Apr 2005 Posts: 2665 Location: Slovenia
|
Posted: Wed Jun 10, 2009 3:13 pm Post subject: |
|
|
Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away. |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Thu Jun 11, 2009 11:20 am Post subject: |
|
|
Gusar wrote: | Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away. |
It cannot be the same problem that I have, since my machine turns off but does not reboot every few minutes (actually there can be hours or days between the reboots). It's really random.
rjw8703 wrote: | I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem. |
Is there any way to test that, to make sure it's the voltage regulator? _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9932 Location: almost Mile High in the USA
|
Posted: Sat Jun 13, 2009 7:51 am Post subject: |
|
|
Holysword wrote: | Gusar wrote: | Faulty power supply would be my guess. I had a problem kinda like that. After an hour or so, the machine would reboot. And then it would reboot every few minutes. Leaving it off for a while, it would work for an hour again. Changing the power supply made the problem go away. |
It cannot be the same problem that I have, since my machine turns off but does not reboot every few minutes (actually there can be hours or days between the reboots). It's really random.
rjw8703 wrote: | I had this problem a while back. It turned out to be the voltage regulators on the m/b were dying intermittently. Replacing the m/b fixed the problem. |
Is there any way to test that, to make sure it's the voltage regulator? |
Bad power will produce random results.
There's really no way to cheaply test/check PSUs and motherboard voltage regulators. The equipment needed is basically a high speed DSO (digital storage oscilloscope). A simple voltage check is not enough; it won't detect intermittent spikes. Replacing the motherboard/PSU is _much_ cheaper, even for diagnostics, unless you just so happen to have a DSO burning a hole in your pocket (or garage or something). _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Thu Jun 25, 2009 1:43 am Post subject: |
|
|
Well, while I haven't found anyone to test my motherboard, I was checking dmesg and suddenly realized that from time to time it complains about the voltage of in5 being 0. Here is the relevant part of the "sensors" output:
Code: |
in0: +1.15 V (min = +0.00 V, max = +4.08 V)
in1: +2.14 V (min = +0.00 V, max = +4.08 V)
in2: +3.39 V (min = +0.00 V, max = +4.08 V)
in3: +2.96 V (min = +0.00 V, max = +4.08 V)
in4: +0.48 V (min = +0.00 V, max = +0.74 V)
in5: +0.00 V (min = +0.00 V, max = +4.08 V) ALARM
in6: +1.06 V (min = +0.00 V, max = +4.08 V)
in7: +3.06 V (min = +0.00 V, max = +4.08 V)
in8: +3.31 V
|
I'm not sure about its meaning, and I don't know if that's something to worry about too, but the "ALARM" word never sounds good... _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9932 Location: almost Mile High in the USA
|
Posted: Thu Jun 25, 2009 11:32 pm Post subject: |
|
|
Well that proves to show that lm-sensors is unreliable, and nothing else. There are no transistors that will work at 0V so obviously something's wrong with the detection there, or that input is simply unused. Or perhaps your PSU doesn't supply -5V as most mb's don't use it nowadays, and it's perfectly fine for it to be at 0V.
Not only that, the multiplier constants in your sensors.conf likely do not match your motherboard and are producing really unreliable results - where's your 12V line? Where's your -12V line? Which line is which?
"ALARM" just means the chip detected a number from the voltage line that was outside its bounds. But what are the bounds? The bounds are set up by software which once again may or may not match your motherboard to produce valid results.
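To make that concrete, here is a rough sketch of the comparison the ALARM flag reflects, redone by hand on `sensors`-style text. It assumes the output layout posted in this thread, treats a reading that sits at or beyond a limit as out of bounds (that matches the in5 line above, but is an assumption about how the chip compares), and only makes sense for the positive inN inputs. `check_bounds` is a made-up helper name.

```shell
#!/bin/sh
# check_bounds: read "sensors"-style voltage lines on stdin and print the
# label of every reading that sits at or beyond its min/max limit.
# Lines without both a min and a max (such as in8) are skipped.
check_bounds() {
    awk '
    /min = / && /max = / {
        label = $1
        sub(/:$/, "", label)          # "in5:" -> "in5"
        value = $2 + 0                # "+0.00" -> 0
        for (i = 3; i <= NF; i++) {
            if ($i == "(min") min = $(i + 2) + 0
            if ($i == "max")  max = $(i + 2) + 0
        }
        if (value <= min || value >= max) print label
    }'
}
```

It would be used as `sensors | check_bounds`. Note that it says nothing about whether the limits themselves are sensible: with min = +0.00 V on every input, an unused input reading 0 V will always be flagged.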
I'm sorry, the cheapest way is to buy a new MB or PSU to test. You might be able to get away with a multimeter but again, multimeters also tend to be slow and won't detect fast glitches in your power. There's no other way you can really tell for sure. The on board sensor chips are not only inaccurate, but also slow - at most a few samples every second versus millions of samples for a DSO. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
Holysword l33t


Joined: 19 Nov 2006 Posts: 946 Location: Greece
|
Posted: Mon Jun 29, 2009 2:22 am Post subject: |
|
|
eccerr0r wrote: | Well that proves to show that lm-sensors is unreliable, and nothing else. |
Maybe it proves how n00b I am, 'cause I haven't configured those things properly :S Anyway, in the future I will try to configure it. _________________ "Nolite arbitrari quia venerim mittere pacem in terram non veni pacem mittere sed gladium" (Yeshua Ha Mashiach) |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 9932 Location: almost Mile High in the USA
|
Posted: Mon Jun 29, 2009 10:57 pm Post subject: |
|
|
Holysword wrote: | Anyway, in the future I will try to configure [lm-sensors]. |
Unless someone else has figured out the numbers for the exact board you have -- there's really no way for a "mere mortal" to configure it, without knowing exactly how it's wired on the motherboard -- some reverse engineering or motherboard manufacturer support is needed.
Just because another motherboard has the same chip as yours means nothing for the configuration. It needs to be the exact same motherboard, and the same revision of it, to use the same config. This is because the resistors used may be hooked up differently on different boards and different revisions.
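The resistor point can be put in numbers. The sketch below assumes the simplest case, a two-resistor divider (rail, then R1, then the chip pin, then R2 to ground), so the rail voltage is the pin voltage times (R1 + R2) / R2; this is the kind of scaling a compute line in sensors.conf expresses. `rail_voltage` is a made-up helper, and the resistor values are purely illustrative, not taken from any board in this thread.

```shell
#!/bin/sh
# rail_voltage PIN_VOLTS R1 R2
# Scale the voltage seen at the sensor chip pin back up to the rail voltage,
# assuming a plain divider: rail -> R1 -> pin -> R2 -> ground.
# R1 and R2 only need to be in the same unit (e.g. both in kOhm).
rail_voltage() {
    awk -v v="$1" -v r1="$2" -v r2="$3" \
        'BEGIN { printf "%.2f\n", v * (r1 + r2) / r2 }'
}

# The same raw 2.96 V pin reading under two hypothetical dividers:
# rail_voltage 2.96 6.8 10   # prints 4.97  (would look like a +5V rail)
# rail_voltage 2.96 30 10    # prints 11.84 (would look like a +12V rail)
```

So an identical raw reading can stand for quite different rails depending on how the board is wired, which is why a config borrowed from a different board can produce plausible-looking but wrong numbers.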
I basically ignore lm-sensors numbers. One of my machines looks like
Code: |
subaru:/root# sensors
it8718-isa-0290
Adapter: ISA adapter
in0: +1.25 V (min = +0.00 V, max = +4.08 V)
in1: +1.84 V (min = +0.00 V, max = +4.08 V)
in2: +3.31 V (min = +0.00 V, max = +4.08 V)
in3: +2.93 V (min = +0.00 V, max = +4.08 V)
in4: +3.06 V (min = +0.00 V, max = +4.08 V)
in5: +0.00 V (min = +0.00 V, max = +4.08 V) ALARM
in6: +1.18 V (min = +0.00 V, max = +4.08 V)
in7: +4.08 V (min = +0.00 V, max = +4.08 V) ALARM
in8: +3.18 V
fan1: 2096 RPM (min = 10 RPM)
fan2: 0 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
temp1: +53 C (low = +127 C, high = +127 C) sensor = thermistor
temp2: -2 C (low = +127 C, high = +127 C) sensor = thermistor
temp3: +45 C (low = +127 C, high = +70 C) sensor = diode
vid: +0.000 V
|
yet this machine runs perfectly fine despite double the number of alarms...
Another of my machines is a bit more correct after hacking /etc/sensors.conf slightly and running sensors -s to reload:
Code: |
doujima:~$ sensors
it87-isa-0290
Adapter: ISA adapter
CPU: +1.58 V (min = +0.00 V, max = +4.08 V)
RAM: +2.50 V (min = +0.00 V, max = +4.08 V)
+3.3V: +3.20 V (min = +0.00 V, max = +4.08 V)
+5V: +4.87 V (min = +0.00 V, max = +6.85 V)
+12V: +12.16 V (min = +0.00 V, max = +16.32 V)
-12V: -12.38 V (min = -0.00 V, max = -14.69 V)
-5V: -5.54 V (min = -0.00 V, max = -6.12 V)
Stdby: +5.05 V (min = +0.00 V, max = +6.85 V)
VBat: +3.42 V
fan1/CPU: 3375 RPM (min = 0 RPM, div = 8)
fan2/PS: 2220 RPM (min = 0 RPM, div = 8)
Temp1/MB: +34 C (low = -2 C, high = +254 C) sensor = thermistor
Temp2/VRM: +40 C (low = -1 C, high = +127 C) sensor = thermistor
Temp3/CPU: +53 C (low = -1 C, high = +127 C) sensor = thermistor
|
yet I still would not bet a penny that those numbers are correct. They merely "look" decent but may still be totally inaccurate (BTW the first is a Core 2 board (Gigabyte G31 board), the second is an Athlon in a fairly popular ECS K7S5A). There could be some correlation from model to model, but there's definitely no guarantee the same input number is attached to the same rail. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|