Gentoo Forums
Machine keeps rebooting
curmudgeon
Veteran


Joined: 08 Aug 2003
Posts: 1744

Posted: Wed Jun 22, 2011 3:21 am    Post subject: Machine keeps rebooting

I have a machine that has become a major headache.

I had noticed an occasional random reboot in the past (the system just restarts without any warning, maybe once a week or so), which should have warned me, but now the problem has become much worse.

The facts:

1. Asus P5VD2-MX running amd64 Gentoo with any recent kernel (2.6.36-gentoo-r8, 2.6.37-gentoo-r4, 2.6.38-gentoo-r6).

2. Machine will (apparently) stay up indefinitely when not under load.

3. When stressed (in particular when compiling), the machine will reboot (always) within five to twenty minutes.

4. I am attempting to install the KDE 4.6 upgrade, and I cannot get past some of the bigger pieces (I have simply run emerge --resume over thirty times now to continue, but that is obviously not a viable solution).

5. I am not doing anything funny to the machine. It is a standard Intel E6300 running at the standard 1.86 GHz.

6. I have run memtest86 (from the Gentoo live DVD) overnight, and it has not detected any problems.

7. I don't believe it is a thermal problem (the sensors command shows peak temperatures of about 60 C when compiling).

Does anyone have any idea what could be causing this and/or how to track it down? Thank you in advance.
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23082

Posted: Wed Jun 22, 2011 4:25 am

Even though you say you checked the thermal sensors, my first suspects would be overheating or possibly inadequate power supply. Bad RAM tends to manifest as program crashes, not spontaneous system reboots.

You say the machine always crashes when stressed. Does this apply even for trivial stresses, such as running a CPU hog on each core (with no load on the disk or RAM, just spinning the CPU)?
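
A CPU-only hog of that kind could look roughly like the sketch below. It is untested on this board and assumes a POSIX shell plus nproc from coreutils; the function name spin_cores and its two arguments (seconds to run, number of loops) are made up for illustration.

```shell
# Hypothetical helper (name and arguments are illustrative, not from the thread):
# spin one busy loop per core for a fixed time, with no disk I/O and
# essentially no memory traffic.
spin_cores() {
    seconds=${1:-60}
    cores=${2:-$(nproc)}      # nproc is part of coreutils
    pids=""
    i=0
    while [ "$i" -lt "$cores" ]; do
        # ':' is a shell builtin, so this loop forks nothing and only burns CPU
        ( while :; do :; done ) &
        pids="$pids $!"
        i=$((i + 1))
    done
    sleep "$seconds"
    # Word splitting of $pids is intended here
    kill $pids 2>/dev/null
    wait $pids 2>/dev/null || true
    echo "spun $cores loop(s) for ${seconds}s"
}
```

Running spin_cores 3600 would keep every core busy for an hour. Unlike reading /dev/urandom, this spends its time in user space rather than in the kernel, so it may heat the CPU somewhat differently.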
curmudgeon
Veteran


Joined: 08 Aug 2003
Posts: 1744

Posted: Wed Jun 22, 2011 5:47 am

Hu wrote:
Even though you say you checked the thermal sensors, my first suspects would be overheating or possibly inadequate power supply.


I don't have any reason to suspect overheating, though I suppose some fan could have quit. I can't find any sensors other than coretemp, which seems strange for an Asus board. There is fan information available in the BIOS pages, which I will have to check next time.

The power supply theory (which I hadn't thought of before) sounds intriguing. Not inadequate, but perhaps failing (to explain the recent deterioration). Actually, I did get the impression (I bought the machine used) that it was a cheap power supply (local voltage only, not 120/240), so problems there would not surprise me.

Is there some way of testing it (short of buying another one and plugging it in :) )? I do have a cheap Chinese "power supply tester," but I actually wouldn't know what to look for.

Hu wrote:
You say the machine always crashes when stressed. Does this apply even for trivial stresses, such as running a CPU hog on each core (with no load on the disk or RAM, just spinning the CPU)?


I had not tried that, but at the moment, I am running (in two terminals):

Code:

dd if=/dev/urandom of=/dev/null


I have the temperatures up to 74 C and 75 C on the cores (higher than I have ever seen before, and "high" is 74 C). Top is showing

Code:

top - 05:42:05 up 12 min,  3 users,  load average: 2.44, 2.02, 1.12
Tasks:  94 total,   3 running,  91 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us, 99.5%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3088560k total,   404408k used,  2684152k free,   147848k buffers
Swap:  2056284k total,        0k used,  2056284k free,   110728k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2108 user      20   0 11472  688  568 R  100  0.0  10:18.76 dd
 2109 user      20   0 11472  692  568 R  100  0.0  10:08.76 dd


Let's assume this doesn't reboot (I will run it for an hour or so). Does that indicate that hard drive accesses are overstressing the power supply?
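
Since the machine resets without warning, it may help to log the core temperatures to disk while a test runs, so that the last reading before a reboot survives. The following is only a sketch: log_temps is a made-up name, and it assumes the sensor values are exposed as hwmon-style temp*_input files in millidegrees Celsius. The directory that holds them differs between kernels and has to be found by hand.

```shell
# Hypothetical helper: append one timestamped line per sensor to a log file
# and flush it to disk, so the last reading survives a sudden reset.
# $1: directory holding temp*_input files (millidegrees Celsius, hwmon style)
# $2: log file
log_temps() {
    dir=$1
    log=$2
    now=$(date '+%Y-%m-%d %H:%M:%S')
    for f in "$dir"/temp*_input; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        milli=$(cat "$f")
        echo "$now ${f##*/} $((milli / 1000)) C" >> "$log"
    done
    sync                                 # push the log out before the next reset
}
```

Calling it from a loop with a few seconds of sleep between calls, in a third terminal, gives a temperature trace right up to the reset.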
Hu
Administrator


Joined: 06 Mar 2007
Posts: 23082

Posted: Wed Jun 22, 2011 10:24 pm

If that does not reboot, it would tend to rule out CPU overheating. Hard drive access stressing the power supply remains a possibility, but is not proved by the failure to reboot when stressing the CPU. You could stress the disk with something like (untested):
Code:
while :;
do
    dd if=/dev/sda of=/dev/null skip=$RANDOM count=1 bs=4096 iflag=direct
done
This should seek to random places on the disk, read 4K, and repeat, as fast as possible. It is non-destructive. If that does not elicit a reboot, try running both this fragment and your previous RNG stress concurrently, which would be characteristic of a heavy compilation workload.
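
One caveat about the loop above: $RANDOM never exceeds 32767, so with bs=4096 the reads all land in the first 128 MiB of the disk and the heads barely move. A variant that spreads the reads over the whole device might look like the sketch below. It is equally untested on real hardware; random_block and seek_stress are made-up names, and it assumes od, dd and (for block devices) blockdev from util-linux are available.

```shell
# Hypothetical helper: print a random block number in [0, total_blocks).
random_block() {
    total_blocks=$1
    # 4 random bytes read as one unsigned integer (0 .. 4294967295)
    r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    echo $((r % total_blocks))
}

# Hypothetical helper: read `count` random 4 KiB blocks from a device or file.
# Read-only, so just as non-destructive as the original loop.
seek_stress() {
    target=$1
    count=${2:-1000}
    if [ -b "$target" ]; then
        bytes=$(blockdev --getsize64 "$target")   # block devices only
        flags="iflag=direct"                      # bypass the page cache
    else
        bytes=$(wc -c < "$target")                # regular file, for a dry run
        flags=""
    fi
    blocks=$((bytes / 4096))
    n=0
    while [ "$n" -lt "$count" ]; do
        dd if="$target" of=/dev/null bs=4096 count=1 $flags \
           skip="$(random_block "$blocks")" 2>/dev/null
        n=$((n + 1))
    done
    echo "read $n block(s) from $target"
}
```

As root, seek_stress /dev/sda 100000 would issue one hundred thousand scattered reads. Pointing it at an ordinary file first is a cheap way to check that it runs.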
curmudgeon
Veteran


Joined: 08 Aug 2003
Posts: 1744

Posted: Thu Jun 23, 2011 4:11 am

Well, now I am more confused than ever. I ran that script with two of the CPU stress tests, and the machine has stayed up for two hours (the longest I have seen it stay up when trying to compile something over the past several days is thirty minutes). Any other ideas?
curmudgeon
Veteran


Joined: 08 Aug 2003
Posts: 1744

Posted: Tue Jun 28, 2011 8:43 am

Still getting nowhere with this.

At someone's suggestion, I downloaded and tried to run the live CD from http://www.inquisitor.ru/

It didn't impress me too much. I kept seeing a lot of errors attempting to load modules, and a majority of the tests were destructive.

I tried the smart test (I have smartd running on the machine and do surface tests regularly, and have never encountered any problems).

It returned:

Code:

Test hdd-smart[CALLED]
Initial /dev/sda testTest hdd-smart[FAILED]

Fatal failure: testing stopped
Reason:


Not very helpful.

Anyway, the more interesting event occurred when performing one of the stress tests that does video transcoding. That test failed (spectacularly) after a few moments with:

Code:

Test benchmark-hqrip[CALLED]
Creating filesystem...[ DONE ]
Copying source file...[ DONE ]
Transcoding... [ 9873.505196]
[ 9873.505196] HARDWARE ERRORo irq handler for vector
[ 9873.505196] Kernel panic - not syncing: Machine checkcontact your hardware ve


Note that I copied the output EXACTLY (typos and all). I don't know if this represents the same problem or not (why is gentoo rebooting rather than just stopping with a kernel panic?), but it might provide someone with just enough information to come up with an idea.
MacGyver031
Tux's lil' helper


Joined: 11 Jul 2004
Posts: 141
Location: Ilavalai, Sri Lanka

Posted: Tue Jun 28, 2011 9:03 am

curmudgeon wrote:
Well, now I am more confused than ever. I ran that script with two of the CPU stress tests, and the machine has stayed up for two hours (the longest I have seen it stay up when trying to compile something over the past several days is thirty minutes). Any other ideas?


I had a similar situation: I had replaced my original power supply with another one, and the box rebooted at every compilation. My workaround was to swap the 3.5-inch disk for a 2.5-inch laptop disk and take out the DVD drive.
After this, the box only reboots if a compilation runs long enough to drive the CPU fan to 100% capacity.

If you have checked memory (memtest86) and disk (badblocks), you should consider using a better (higher-wattage) power supply.
In my case, the mainboard is running at the limit of the 3.3 V rail.
_________________
Sincerely your
Joanand K.

MacBook Pro 5.1: 2.4GHz Core2 Duo, 4096MB, 500GB, NVidia 9400/9600 M GT
Gentoo, Kernel 3.4.9, XOrg, Fluxbox.
curmudgeon
Veteran


Joined: 08 Aug 2003
Posts: 1744

Posted: Wed Jul 06, 2011 11:53 pm

This is still not solved (though I have made progress).

I tried a brand-new high-wattage power supply, and it was noticeably better, but not a fix: instead of rebooting every 10-30 minutes, the machine rebooted every 10-50 minutes. I did a lot more tests, disabling devices and whatnot, and nothing seemed to help.

Finally, I came up with the obvious idea of locking the CPU to the lower speed (it only runs at 1.87 GHz and 1.60 GHz, and obviously was always using 1.87 GHz when compiling). That actually made a huge difference. The machine did not reboot even after several hours of compilation.
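
For anyone wanting to repeat the experiment, one way to pin the clock low is through the cpufreq sysfs files. This is a sketch under assumptions, not necessarily how it was done on this machine: it assumes a cpufreq driver is loaded and exposes cpuinfo_min_freq and scaling_max_freq for each core, and cap_to_min_freq is a made-up name.

```shell
# Hypothetical helper: cap every core at its lowest advertised frequency by
# writing cpuinfo_min_freq into scaling_max_freq. Needs root on a real system.
# $1 is the cpufreq sysfs root; it is overridable only to allow a dry run.
cap_to_min_freq() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$dir/cpuinfo_min_freq" ] || continue   # no cpufreq for this core
        min=$(cat "$dir/cpuinfo_min_freq")           # value is in kHz
        echo "$min" > "$dir/scaling_max_freq"
        echo "${dir%/cpufreq}: capped at $min kHz"
    done
}
```

The cap lasts until the next reboot or until a larger value is written back to scaling_max_freq.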

Still want to get whatever the problem is fixed. Any newer ideas about the possible causes of this?
All times are GMT