Gentoo Forums
Remote System Freeze - How To Analyze?
SmokeyPete
n00b
Joined: 26 Oct 2004
Posts: 9

Posted: Thu Dec 16, 2004 7:34 am    Post subject: Remote System Freeze - How To Analyze?

I'm having a very tough time with a recurring system freeze and I'm not sure what to do about it. I have a machine sitting in a server farm far, far away from where I am now (I installed it for a customer and returned home). I had no trouble with the installation (I've done a number of Gentoo installs) or with the initial config, so I put the machine into the server farm and left to complete the setup remotely. Here's a breakdown of the situation:

1. Twice, after running for a couple of days each time, the system has suddenly and unexpectedly frozen. Most system freezes I've read about on the forums deal with problems with X, but I'm not running X, which throws all those possibilities out the window.

2. The first time it froze I had stepped away from my machine for a minute, and when I came back the server had locked up. I had no apps running in the terminal and nothing else running aside from the basic daemons.

3. The second time I was doing an emerge of a package and my connection dropped because the system had frozen. I had done numerous emerges, updates, un-emerges, and starts, stops and restarts of daemons with no problems before, so I don't hold emerge responsible.

4. The log files have no information. In the first install I had metalog running, but I read on the forums that because it buffers its output, kernel panic messages (or whatever else) don't make it to the logs on a sudden system crash. After the first crash I switched to syslog-ng, hoping to catch anything if the crash happened again. It did, and there's nothing in /var/log/messages or anywhere else. The log just shows the last action on the machine before the freeze and then, hours later when the system came back online, the boot sequence.

5. I'm running a software RAID 5 with reiserfs on /dev/md1 but my /proc/mdstat shows no problems:
md1 : active raid5 ide/host0/bus1/target1/lun0/part3[2] ide/host0/bus0/target1/lun0/part3[1] ide/host0/bus0/target0/lun0/part3[0]
154233856 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU]

6. debugreiserfs /dev/md1 doesn't offer any help:
root@www log # debugreiserfs /dev/md1
debugreiserfs 3.6.18 (2003 www.namesys.com)
Filesystem state: consistency is not checked after last mounting
Reiserfs super block in block 16 on 0x901 of format 3.6 with standard journal
Count of blocks on the device: 38558464
Number of bitmaps: 1177
Blocksize: 4096
Free blocks (count of blocks - used [journal, bitmaps, data, reserved] blocks): 38137593
Root block: 6227425
Filesystem marked as NOT cleanly umounted
Tree height: 5
Hash function used to sort names: "r5"
Objectid map size 92, max 972
Journal parameters:
Device [0x0]
Magic [0x68979a65]
Size 8193 blocks (including 1 for journal header) (first block 18)
Max transaction length 1024 blocks
Max batch size 900 blocks
Max commit age 30
Blocks reserved by journal: 0
Fs state field: 0x0:
sb_version: 2
inode generation number: 858612
UUID: 37cc912e-a5da-4dfc-bab9-04326e873eb0
LABEL:
Set flags in SB:
ATTRIBUTES CLEAN

7. I cannot run reiserfsck on the individual disk partitions because as a RAID they don't have their own superblock:
root@www log # reiserfsck /dev/hda2
reiserfsck 3.6.18 (2003 www.namesys.com)
reiserfs_open: the reiserfs superblock cannot be found on /dev/hda2.
Failed to open the filesystem.

8. I cannot run reiserfsck on the RAID because it isn't mounted read-only:
root@www log # reiserfsck /dev/md1
reiserfsck 3.6.18 (2003 www.namesys.com)
reiserfsck --check started at Wed Dec 15 17:14:38 2004
###########
Partition /dev/md1 is mounted with write permissions, cannot check it

9. I've run badblocks on each of the three RAID partitions with no bad blocks found:
root@www log # badblocks -v /dev/hda3
Checking blocks 0 to 77117040
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.

root@www log # badblocks -v /dev/hdb3
Checking blocks 0 to 77117040
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.

root@www log # badblocks -v /dev/hdd3
Checking blocks 0 to 77118552
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
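
Since the array can only be inspected after the fact, the /proc/mdstat status line from item 5 could also be checked from a script (for example from cron) so that a degraded array is noticed between freezes. The following is only a rough sketch: the function name degraded_arrays is made up for illustration, and it assumes the status line keeps the "[3/3] [UUU]" shape shown above, where "_" marks a missing member.

```python
import re

# Sample text copied from the /proc/mdstat output in item 5.
SAMPLE = (
    "md1 : active raid5 ide/host0/bus1/target1/lun0/part3[2] "
    "ide/host0/bus0/target1/lun0/part3[1] ide/host0/bus0/target0/lun0/part3[0]\n"
    "154233856 blocks level 5, 32k chunk, algorithm 0 [3/3] [UUU]\n"
)

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member.

    Hypothetical helper, not part of any md tool.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[3/3] [UUU]": configured/active counts, then one
        # letter per member ("U" = up, "_" = missing).
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            configured = int(status.group(1))
            active = int(status.group(2))
            if active < configured or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded

print(degraded_arrays(SAMPLE))  # → []
```

On the server itself the text would come from open("/proc/mdstat").read() instead of SAMPLE. A clean result only says the array looked healthy at that moment; it does not rule out a disk or controller problem.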




The system reboots cleanly, aside from reiserfs checking its internal tree at boot, so there doesn't seem to be an obvious critical error anywhere. I suspect it is a hardware problem but do not know what it might be. It could be a memory leak, but I have no idea how to trace that. Then again, it could be anything; I just don't know and have never run into anything like it before. Does anyone have a clue about what this could be?
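
One way the memory-leak theory might be tested is to record free memory at regular intervals and look at the trend leading up to the next freeze. This is a minimal sketch under assumptions: the helper names and the log path are hypothetical, and it relies on /proc/meminfo containing "MemFree:" and "SwapFree:" lines in the usual "Key: value kB" form. Each sample is fsync'ed so it is not sitting in a buffer when the machine locks up.

```python
import os
import time

def parse_meminfo(text):
    """Parse 'Key:   12345 kB' lines into a {key: kB} dict (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def format_sample(meminfo_text, now=None):
    """Build one timestamped log line from the text of /proc/meminfo."""
    info = parse_meminfo(meminfo_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s MemFree=%s SwapFree=%s\n" % (
        stamp, info.get("MemFree"), info.get("SwapFree"))

def append_sample(log_path, meminfo_text):
    """Append one sample and force it to disk before returning."""
    with open(log_path, "a") as log:
        log.write(format_sample(meminfo_text))
        log.flush()
        os.fsync(log.fileno())  # push the sample out of the page cache
```

Run from cron once a minute, the call would look something like append_sample("/var/log/memtrace.log", open("/proc/meminfo").read()), where the log path is just an example. If MemFree and SwapFree are stable right up to the last sample before a freeze, running out of memory becomes a much less likely explanation.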


Pete
adaptr
Watchman
Joined: 06 Oct 2002
Posts: 6730
Location: Rotterdam, Netherlands

Posted: Thu Dec 16, 2004 9:26 am

You can easily run reiserfsck at bootup, when you have the option of mounting the filesystem read-only.
But this is not a HD problem, since no disk error will cause this kind of behaviour - the system runs in memory, after all, and disk I/O is just that - I/O.

To perform any sort of useful forensics on the machine you'll have to pull it out and examine and test it physically.

Stresslinux is good for testing various hardware issues such as memory, CPU functionality, burn-in (continuous run) performance, and hard disk performance consistency.

Also check out Knoppix or one of its derivatives - no, you don't need or use X, but booting it will exercise the hardware to the maximum.

Then be prepared to let it run overnight for several tests, since that is when you say the freezes occur - never after 5 minutes uptime.

If at all possible, monitor it through an external syslogger - log everything to another host so you will get messages right up to the point the network fails.
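
Alongside forwarding the real logs, a small heartbeat sent to the other host can pin down the time of the freeze to within one interval. The sketch below is not syslog-ng forwarding itself but a separate script using Python's standard SysLogHandler over UDP. It assumes the remote host accepts syslog on UDP port 514; the function names and the host name in the usage note are made up for illustration.

```python
import logging
import logging.handlers
import time

def make_remote_logger(host, port=514):
    """Create a logger that sends each record to a remote syslog daemon via UDP."""
    logger = logging.getLogger("heartbeat")
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("heartbeat: %(message)s"))
    logger.addHandler(handler)
    return logger

def heartbeat_message(uptime_text, loadavg_text):
    """Summarise the contents of /proc/uptime and /proc/loadavg in one line."""
    uptime_seconds = float(uptime_text.split()[0])
    load1 = loadavg_text.split()[0]
    return "up=%ds load1=%s" % (uptime_seconds, load1)

def run(host, interval=60):
    """Send one heartbeat per interval until the process (or the machine) stops."""
    logger = make_remote_logger(host)
    while True:
        with open("/proc/uptime") as up, open("/proc/loadavg") as load:
            logger.info(heartbeat_message(up.read(), load.read()))
        time.sleep(interval)
```

Started as run("loghost.example.com") on the server, the last heartbeat received on the other machine marks the latest moment the box was still alive, and the load figure hints at whether it was busy or idle when it went down.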

Also sometimes just randomly replacing key hardware can "solve" it - for no apparent reason, but at least it'll be solved.
_________________
>>> emerge (3 of 7) mcse/70-293 to /
Essential tools: gentoolkit eix profuse screen
All times are GMT