Gentoo Forums
Xorg hangs: ping-able, but not telnet-able?
toofastforyahuh
Apprentice


Joined: 18 May 2004
Posts: 164

Posted: Sun Jun 06, 2004 7:27 pm    Post subject: Xorg hangs: ping-able, but not telnet-able?

I've been getting random hangs on my system under Xorg. It happens sporadically, but it is especially prone to happen while playing DVDs in xine, usually within 15 minutes.

The system is SK8V, FX-53, 2 GB RAM, beefy 520 W PSU, multiple fans and Thermaltake SilentBoost K8, Revolution 7.1 sound card, and AiW Radeon AGP card, vanilla kernel 2.6.6.

When X hangs, the system seems to lock up more than I'd expect. I can still ping the machine, but when I tried to telnet in (please, no rants on SSH) to kill X, I found I couldn't get through. Ditto with FTP. Remote xterms also hang.

I also had kernel panics twice in console mode during emerge (next time I will write them down and try to decode them with the parsemce tool), but I'm not sure if that's related.

The system runs hot (apartment is also hot) but it is constant and never above 57 deg C which is within spec for the CPU. (Sorry, cpudyn can't help with this--FX-53 cannot throttle speed.) The RAM has been memtested under sweltering heat several times without any errors. I never overclock, and voltages seem OK.

There are no details in the Xorg.0.log or /var/log/messages. (Is this another kernel panic?)

I tried switching to the fbdev driver and also booting with idle=poll noapic pci=noacpi, but I still get the same problem.

Previously I got around my kernel-panic-at-boot by disabling Legacy USB in BIOS, so that shouldn't be the problem. I also disabled ACPI 2.0 in BIOS.

Any ideas? Is this kernel level or Xorg doing something really bad? Thanks!
toofastforyahuh
Apprentice


Joined: 18 May 2004
Posts: 164

Posted: Sun Jun 06, 2004 11:46 pm    Post subject: Re: Xorg hangs: ping-able, but not telnet-able?

toofastforyahuh wrote:
I've been getting random hangs on my system under Xorg. It happens sporadically, but it is especially prone to happen while playing DVDs in xine, usually within 15 minutes.


I'm now on my second pass through a DVD that, until now, has always locked up within 20 minutes. Obviously I must have done *something* right.

I think I have a magic recipe to fix this, but I don't understand why it works. I think it's an interaction between the BIOS, CPU, and kernel. I do NOT think it's marginal quality RAM because not only does it pass memtest86 but the hang even occurred at 200 MHz/PC1600 speed.

First let me say this: I believe firmly in ECC. It's a lesson I learned from years of SGI/Sun work and other activities. RAM problems are nasty to diagnose; memtest86 does a good job but I think there's more going on behind the scenes. Being forced to buy registered (and ECC!) memory for socket 940 systems is a true blessing, not a curse!

With socket 940 CPUs we have ECC on L1 cache, L2 cache, and RAM, plus we have options for chip kill and scrubbing. Chip Kill I am avoiding because I thought one of the K8 errata suggested turning off Chip Kill to prevent a hardware bug. Scrubbing is something I do not believe in as strongly. It is an attempt to not only correct ECC failures on the fly, but rewrite the corrected data back to RAM in the hope of preventing multibit errors over time that ECC cannot deal with. (ECC can only fix 1-bit errors and detect 2-bit errors.)
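
To make the 1-bit/2-bit distinction and the point of scrubbing concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions, not how the K8 memory controller works: it uses an (8,4) extended Hamming code on 4-bit values (real ECC DIMMs protect 64-bit words with 8 check bits), and the names encode, decode, scrub and flip are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1         # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    odd_parity = sum(code) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1             # single flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"    # even parity but nonzero syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def scrub(memory):
    """One scrub pass: read every word and write the corrected codeword back."""
    for addr, word in enumerate(memory):
        nibble, status = decode(word)
        if status == "corrected":
            memory[addr] = encode(nibble)


def flip(word, pos):
    """Simulate a single-bit upset at the given position."""
    word[pos] ^= 1


# No scrubbing: two separate upsets pile up in the same word.
unscrubbed = [encode(0xA)]
flip(unscrubbed[0], 3)
flip(unscrubbed[0], 6)
result_unscrubbed = decode(unscrubbed[0])

# A scrub pass between the two upsets repairs the first one in memory.
scrubbed = [encode(0xA)]
flip(scrubbed[0], 3)
scrub(scrubbed)
flip(scrubbed[0], 6)
result_scrubbed = decode(scrubbed[0])

print(result_unscrubbed)
print(result_scrubbed)
```

In the first case the word ends up with two bad bits and can only be flagged, not fixed. In the second case the scrub pass rewrites the word after the first upset, so the later upset is again an ordinary single-bit error.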

This seems like a lot of work for preventing a multibit error--an event that even I would consider very rare. So until now I had never enabled scrubbing in BIOS, although the SK8V BIOS 1002 defaults to enabling DRAM scrubbing every 640 ns. The only feature I had enabled was the master ECC setting.

So I had ECC turned on and figured it was enough. And memtest86 passes 100% even with ECC disabled, so I figured that was all I needed. But as this hang problem started ticking me off I got desperate enough to try anything.

...or everything. The next thing I tried was turning on scrubbing at all levels. Not so much for DRAM, but for the CPU's caches! Why? Because I found a Google Groups thread about a "bank 0" kernel panic that said bank 0 is the L1 data cache on K7 CPUs, and I think my console kernel panics were bank 0 too. And let's face it--K8 is basically a K7 on steroids.

Attempt #1:
Code:

L2 cache BG Scrub -- 10 us
Data cache BG Scrub -- 640 ns
DRAM BG Scrub -- 640 ns
DRAM Scrub Redirect -- Enabled
ECC Chip Kill  -- Disabled


With this attempt the kernel panicked immediately on boot. I think the kernel's idle loop (??) did not like it. The error was something like this (I only wrote down part of it):
Code:

CPU0 7 Bank 4 f422210000000a13
RIP 10 <ffffffff8010f784> default idle 0x24/0x30
TSC 1ac98de14a ADDR 7fff04d0
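
For anyone who wants to pick apart a status word like the one above by hand, here is a rough sketch in Python. It only decodes the top flag bits and the low 16-bit error code, following the generic x86 machine-check status register layout. Which hardware unit a given bank number maps to, and the meaning of the remaining bits, are model-specific and not decoded here. The names FLAGS and decode_status are made up for this example.

```python
# Generic flag bits at the top of an x86 machine-check status register.
FLAGS = {
    63: "VAL",    # the register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the ADDR register holds an address for the error
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status):
    """Split a 64-bit status value into set flags and the two 16-bit code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


info = decode_status(0xF422210000000A13)
print(info["flags"])
print(hex(info["mca_error_code"]))
```

For the value I wrote down, this reports VAL, OVER, UC, EN and ADDRV as set, which at least matches the fact that the panic printed an ADDR line.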


So I decided to pare down my settings one at a time.

Attempt #2:
Code:

L2 cache BG Scrub -- Disabled
Data cache BG Scrub -- 640 ns
DRAM BG Scrub -- 640 ns
DRAM Scrub Redirect -- Enabled
ECC Chip Kill  -- Disabled


This appears to be the magic formula, although it may not be a minimal magic formula. I've booted twice without any kernel panic and I'm now almost done with the second pass of this DVD. Somehow the kernel did not like background scrubbing on the L2 cache, but scrubbing the L1 cache and/or DRAM puts everything in a happy state.

Either the BIOS does something weird, the kernel does something weird, xine is testing RAM better than memtest86, or AMD is shipping CPUs with marginal quality caches.

Or all of the above. I will keep testing this machine to see if any more hangs appear, but I hope this is it. Rebooting every 10 minutes through a movie is not fun!
toofastforyahuh
Apprentice


Joined: 18 May 2004
Posts: 164

Posted: Tue Jun 08, 2004 6:51 pm

I take it partially back. I still haven't had a DVD/X/kernel hang. I just had xine on pause all night and it still resumes playback.

However, on initial boot last night I did get another kernel panic like before (problem resyncing idle state or something like that). I think the kernel is just being wacky. The 2.6.7 changelog so far does list fixes for MCEs, so maybe this will be fixed in the kernel.
All times are GMT