Gentoo Forums
Scanning for broken hardware?
mariourk
l33t


Joined: 11 Jul 2003
Posts: 807
Location: Urk, Netherlands

Posted: Tue May 01, 2007 9:01 am    Post subject: Scanning for broken hardware?

I'm having trouble with an old server. I think it's a hardware problem,
but I'm not sure which part is causing the trouble. Does anyone know of
a program or script that can be used to scan for broken or
malfunctioning hardware? :?
_________________
If there is one thing to learn from history, it's that we usually don't learn anything from it, at all.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54317
Location: 56N 3W

Posted: Tue May 01, 2007 12:04 pm    Post subject: Scanning for broken hardware?

mariourk,

Tell us your symptoms.
There are a few applications for stress testing various parts of the system, but they don't take the place of experience.

Anyway, a few things to try.
Carefully clean the CPU fan and heatsink using a stiff brush - never a hoover.
Overheating shows up during compiles and other heavy CPU use.

Run memtest86 from the liveCD. Errors reported at the same address on several passes indicate faulty RAM.
Random errors (not repeatable) point to other causes - e.g. overheating, CPU, PSU, or North Bridge chip.
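
The rule of thumb above can be written down as a small helper. This is only a sketch: memtest86 does not export its results in any format assumed here, so the input is a hand-typed list of (pass number, failing address) pairs copied from the screen, and the function name and sample addresses are made up for illustration.

```python
from collections import defaultdict


def classify_memtest_errors(errors, min_passes=2):
    """Split failing addresses into repeatable and sporadic ones.

    errors: iterable of (pass_number, address) pairs noted by hand from
            the memtest86 screen (hypothetical input format).
    min_passes: how many different passes must hit the same address
                before it counts as repeatable.
    """
    passes_by_address = defaultdict(set)
    for pass_number, address in errors:
        passes_by_address[address].add(pass_number)

    repeatable = sorted(a for a, p in passes_by_address.items()
                        if len(p) >= min_passes)
    sporadic = sorted(a for a, p in passes_by_address.items()
                      if len(p) < min_passes)
    return repeatable, sporadic


if __name__ == "__main__":
    # Invented sample data, not taken from any real machine.
    notes = [(1, 0x1F3A5C10), (2, 0x1F3A5C10), (2, 0x0A000044), (3, 0x1F3A5C10)]
    repeatable, sporadic = classify_memtest_errors(notes)
    print("same address on several passes (suspect RAM):",
          [hex(a) for a in repeatable])
    print("seen once only (suspect heat, CPU, PSU, North Bridge):",
          [hex(a) for a in sporadic])
```

An address that fails on only one pass is not proof that the RAM is good; it just means the other causes listed above should be looked at as well.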

Look at the Vcore PSU capacitors (tall cylinders) mounted close to the CPU. There will be approx 10.
Do they all have flat tops and can you see any signs of fluid leaking onto the motherboard at the bottoms of them?
What about rubber bungs being pushed out of the bottom?
Domed tops, leaks, visible rubber bungs are all signs of failure. You need to replace all the capacitors in the Vcore regulator with good quality parts if even one has failed, since the rest will not be far behind. You need intermediate soldering skills for that.

If you have a spare PSU, swap it in. There is no easy DIY test for PSUs.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
mariourk
l33t


Joined: 11 Jul 2003
Posts: 807
Location: Urk, Netherlands

Posted: Tue May 01, 2007 1:10 pm    Post subject: Scanning for broken hardware?

The symptom was a completely frozen system, which often indicates
that the problem is in the hardware and not in the software.
Because the network connection was no longer usable, I had to hook up
a keyboard and a monitor to see what was going on. The screen was filled
with errors, which I don't recall. It was not possible to copy and paste
them, and there was a lot of pressure to get the server back online. You know... :?

The logs don't show much relevant info either. Just a gap between
03:45:08 and 09:22:20, which was the time I rebooted the server. Booting the server
took a long time. This was probably because the RAID (software RAID-1) became corrupt
and began rebuilding itself. This takes huge amounts of CPU and I/O, so it slows down
the system a lot. The rebuild is still going on. I couldn't find any errors in the
logs from before the crash that were related to the software RAID. So I think the crash
itself was responsible for the RAID corruption, not the other way around.
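
A silent stretch like the one described above can be located in a log automatically, which helps pin down when the machine stopped writing. The following is a minimal sketch, assuming classic syslog lines that begin with a timestamp such as "May  1 03:45:08" and carry no year. The function name find_log_gaps, the 30 minute threshold and the sample log messages are all made up for illustration; only the two timestamps come from the post.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, threshold=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs where the log went quiet.

    lines: iterable of syslog-style lines starting with 'Mon DD HH:MM:SS'.
    year:  supplied by the caller because this timestamp format has none.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}",
                                      "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps


if __name__ == "__main__":
    # Invented log lines; only the times 03:45:08 and 09:22:20 are from the post.
    sample = [
        "May  1 03:40:01 server cron[4211]: (root) CMD (test)",
        "May  1 03:45:08 server sshd[4302]: session closed",
        "May  1 09:22:20 server syslog-ng[310]: syslog-ng starting up",
    ]
    for start, end in find_log_gaps(sample, 2007):
        print(f"no log entries between {start} and {end} ({end - start})")
```

The last line before a gap is worth reading closely, although a hard freeze often leaves nothing useful there because the final messages never reach the disk.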

The system seems to work fine for now. I just have no clue what went wrong,
and that concerns me. So I would like to check for broken hardware.
The problem is that the server is built in between some other servers, so taking
it out to look inside, check the heatsink, clean things a bit, etc. is... not easy. :wink:
So I would prefer to do some software tests first, if possible.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54317
Location: 56N 3W

Posted: Tue May 01, 2007 1:59 pm    Post subject: Scanning for broken hardware?

mariourk,

You can check the temperatures and, to some extent, the PSU voltages with lm_sensors. However, the voltages shown are average values, which is not what you need to know, since brief dips and ripple are averaged away. The CPU temperatures will be useful and may indicate how good your cooling is.
lm_sensors can also show fan speeds, but not all fans provide a tacho output, so a reading of zero may just indicate that a tacho output is not fitted rather than a failed fan.
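
For logging these readings over time, the values can also be read straight from sysfs. This is a rough sketch, assuming a kernel whose sensor drivers expose the hwmon class layout (/sys/class/hwmon/hwmon*/temp*_input in millidegrees Celsius and fan*_input in RPM). Older kernels or some drivers may place the files elsewhere, and the function names here are made up for illustration.

```python
from pathlib import Path


def _read_int(path):
    """Return the integer stored in a sysfs file, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_hwmon(base="/sys/class/hwmon"):
    """Collect (chip, sensor, value, unit) tuples for temperatures and fans."""
    readings = []
    for chip in sorted(Path(base).glob("hwmon*")):
        try:
            chip_name = (chip / "name").read_text().strip()
        except OSError:
            chip_name = chip.name  # fall back to the directory name
        for sensor in sorted(chip.glob("temp*_input")):
            raw = _read_int(sensor)
            if raw is not None:
                # temperatures are reported in millidegrees Celsius
                readings.append((chip_name, sensor.name, raw / 1000.0, "C"))
        for sensor in sorted(chip.glob("fan*_input")):
            raw = _read_int(sensor)
            if raw is not None:
                readings.append((chip_name, sensor.name, float(raw), "RPM"))
    return readings


def describe(reading):
    """Format one reading, flagging the ambiguous zero-RPM case."""
    chip, sensor, value, unit = reading
    note = ""
    if unit == "RPM" and value == 0:
        note = "  (0 RPM: fan stopped, or no tacho output fitted)"
    return f"{chip} {sensor}: {value:g} {unit}{note}"


if __name__ == "__main__":
    for reading in read_hwmon():
        print(describe(reading))
```

Running it from cron during a big compile gives a temperature history to compare against the time of the next freeze. A zero fan reading still needs a look inside the case to tell a dead fan from a missing tacho wire.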
All times are GMT