ksool Guru
Joined: 27 May 2006 Posts: 337 Location: Cambridge, MA
|
Posted: Mon Sep 17, 2007 11:29 pm Post subject: Machine Check Exception on 2xAMD Opteron |
|
|
Hey all,
I've just been working with a Tyan Thunder K8S Pro mobo with 2x AMD Opteron 248 and 4x2GB RAM. The machine will randomly lock up, giving this error on the console:
Code: |
HARDWARE ERROR
CPU 1: Machine Check Exception 4 bank 4: b607a00100000813
TSC 34064b0a5f910 ADDR 1dcebfdb0
This is not a software problem !
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
|
I'll post the mcelog output in a bit; the machine is currently running Memtest86+ V1.70, since I'm thinking this is probably a memory problem. If that is the case, is the whole stick definitely wasted, or can I salvage it in any way (blacklisting the dead section or something)? The 2GB sticks are relatively expensive.
Anyway, until I get the mcelog up, has anybody seen anything like this before or know what this is?
TIA |
|
ksool Guru
Joined: 27 May 2006 Posts: 337 Location: Cambridge, MA
|
Posted: Tue Sep 18, 2007 12:38 am Post subject: |
|
|
Here's the tail of mcelog:
Code: |
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache TSC 76e2f811f9c8e
ADDR 1d09b7f40
Data cache ECC error (syndrome a4)
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS 9452400000000833 MCGSTATUS 0
MCE 2
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 7742a4bf1d2c3
ADDR 1f307ff70
Northbridge ECC error
ECC syndrome = 57
bit32 = err cpu0
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 942bc00100000813 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC d4a65cbc90c
RIP 33:5ce0b4 ADDR 1f321f778
Northbridge ECC error
ECC syndrome = 6
bit45 = uncorrected ecc error
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS f403200000000a13 MCGSTATUS 7
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 8b4f97d40194f
RIP 33:5caceb ADDR 1dd5b74b8
Northbridge ECC error
ECC syndrome = c
bit45 = uncorrected ecc error
bit61 = error uncorrected
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS b406200000000a13 MCGSTATUS 7
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit TSC 229d6e2c9cb09
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
prefetch mem transaction
memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 4 northbridge TSC 229d6e2c9d58c
ADDR 1f42777b8
Northbridge ECC error
ECC syndrome = 5d
bit32 = err cpu0
bit46 = corrected ecc error
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS 942ec00100000813 MCGSTATUS 0
|
|
|
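The bit annotations in the mcelog output above can be reproduced by hand from the raw STATUS words. Here is a minimal Python sketch, assuming only the bit positions that mcelog itself prints in this thread (bits 32, 45, 46, 61 and 62). The syndrome field at bits 47-54 is inferred from the dumps here (it reproduces a4, 57, 6, c and 5d) rather than taken from AMD's documentation, and `decode_status` is a hypothetical helper name. The "err cpu0" and syndrome values are probably only meaningful for the northbridge (bank 4) records.

```python
# Bit names copied from the mcelog output in this thread.
FLAG_BITS = {
    32: "err cpu0",
    45: "uncorrected ecc error",
    46: "corrected ecc error",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def decode_status(status):
    """Return (flags, syndrome) for a 64-bit machine-check STATUS word."""
    flags = [name for bit, name in sorted(FLAG_BITS.items())
             if (status >> bit) & 1]
    # Assumption: 8-bit ECC syndrome in bits 47..54, inferred from the
    # dumps in this thread, not from vendor documentation.
    syndrome = (status >> 47) & 0xFF
    return flags, syndrome


if __name__ == "__main__":
    # STATUS word from the kernel panic in the first post.
    flags, syndrome = decode_status(0xB607A00100000813)
    print(flags, hex(syndrome))
```

Fed the STATUS word from the kernel panic in the first post, this reports the uncorrected-ECC bits (45 and 61) as set and the corrected-ECC bit (46) as clear, consistent with the machine having panicked rather than logged and carried on.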
SnEptUne l33t
Joined: 23 Aug 2004 Posts: 656
|
Posted: Tue Sep 18, 2007 1:22 am Post subject: |
|
|
Yeah. Looks like a memory problem. I don't know how reliable memtest is, since it didn't catch the problem with my RAM two years ago, when my computer would infrequently and randomly segfault, even though I ran the test overnight.
But since your RAM has ECC, I suppose it should be much easier to find the error(s). |
|
dmpogo Advocate
Joined: 02 Sep 2004 Posts: 3468 Location: Canada
|
Posted: Tue Sep 18, 2007 1:33 am Post subject: |
|
|
SnEptUne wrote: | Yeah. Looks like a memory problem. I don't know how reliable memtest is, since it didn't catch the problem with my RAM two years ago, when my computer would infrequently and randomly segfault, even though I ran the test overnight.
But since your RAM has ECC, I suppose it should be much easier to find the error(s). |
I had random lockups a couple of years ago on a quad Opteron (Celestica board). I learned all about MCE then, but could not get it 100% stable.
memtest did not give any errors. I called the guys who built it for me, and after two days of tests they said that several memory chips were busted
(and I have 32 GB of RAM, 16x2GB). I replaced all the memory with a different brand, and it is now in its second year of uptime. |
|
ksool Guru
Joined: 27 May 2006 Posts: 337 Location: Cambridge, MA
|
Posted: Tue Sep 18, 2007 1:53 am Post subject: |
|
|
I might try to ship it back, but I doubt I'll have much luck on that front, as I already sent it back for a busted BIOS and it's probably out of warranty by now.
Anyway, the problem only seems to occur once every couple of weeks, but that was when the machine was barely in use. I was hoping to use it as a backend for some thin X clients. Hopefully, given the rate of failure, it's only one of the sticks that's bad and the machine is still salvageable.
Does anybody know whether, once a stick has been deemed bad, there is anything you can do for it, or is it just trash? Might the logs give some insight into where the stick is bad and allow that physical area to be blacklisted or something? It's not every day a guy like me comes across a free server with 8GB of RAM, and I'm not letting it go that easily. |
|
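On whether the logs can point at a bad region: most of the mcelog records posted above carry an ADDR, so a first step is to tally them by severity and see how widely they spread. Below is a rough Python sketch, assuming the record layout shown in this thread (a `HARDWARE ERROR` banner starts each record). `summarize` is a hypothetical helper, and mapping a physical address to a particular DIMM slot depends on the board and BIOS memory layout, which this does not attempt.

```python
import re


def summarize(log_text):
    """Group ADDR values from mcelog text by corrected/uncorrected."""
    result = {"corrected": [], "uncorrected": []}
    # Each record starts with a "HARDWARE ERROR" banner line.
    for record in log_text.split("HARDWARE ERROR")[1:]:
        addr = re.search(r"\bADDR ([0-9a-f]+)", record)
        if addr is None:
            continue  # e.g. the bus unit record carries no ADDR
        # Check the longer phrase: "uncorrected ..." contains "corrected ...".
        # Records without the bit45 line are treated as corrected.
        kind = "uncorrected" if "uncorrected ecc error" in record else "corrected"
        result[kind].append(int(addr.group(1), 16))
    return result


# Two records abridged from the mcelog output posted earlier in the thread.
SAMPLE = """\
HARDWARE ERROR. This is *NOT* a software problem!
CPU 1 4 northbridge TSC 7742a4bf1d2c3
ADDR 1f307ff70
bit46 = corrected ecc error
HARDWARE ERROR. This is *NOT* a software problem!
CPU 1 4 northbridge TSC d4a65cbc90c
RIP 33:5ce0b4 ADDR 1f321f778
bit45 = uncorrected ecc error
"""

if __name__ == "__main__":
    for kind, addrs in summarize(SAMPLE).items():
        print(kind, [hex(a) for a in addrs])
```

The six addresses posted in this thread range from 1d09b7f40 to 1f42777b8, a span of well over half a gigabyte, so on their own they do not pinpoint one small defective area.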
dmpogo Advocate
Joined: 02 Sep 2004 Posts: 3468 Location: Canada
|
Posted: Tue Sep 18, 2007 3:06 am Post subject: |
|
|
krs1ars wrote: | I might try to ship it back, but I doubt I'll have much luck on that front, as I already sent it back for a busted BIOS and it's probably out of warranty by now.
Anyway, the problem only seems to occur once every couple of weeks, but that was when the machine was barely in use. I was hoping to use it as a backend for some thin X clients. Hopefully, given the rate of failure, it's only one of the sticks that's bad and the machine is still salvageable.
Does anybody know whether, once a stick has been deemed bad, there is anything you can do for it, or is it just trash? Might the logs give some insight into where the stick is bad and allow that physical area to be blacklisted or something? It's not every day a guy like me comes across a free server with 8GB of RAM, and I'm not letting it go that easily. |
I recently read about a kernel patch that masks out bad RAM sections (it is supposed to be submitted for inclusion in the official tree soon). Memtest86+ actually generates output in the format this patch takes. I don't remember more; the Memtest86+ documentation had some references. |
|
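Separately from the patch mentioned above, the mainline kernel has a `memmap=nn[KMG]$ss[KMG]` boot parameter that marks the region from `ss` to `ss+nn` as reserved. Below is a minimal Python sketch that turns fault addresses into such arguments. It assumes 4 KiB pages and that reserving only the exact faulting pages is enough, which may well not hold for intermittent errors spread over a wide range. `memmap_args` is a hypothetical helper, not part of any tool.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def memmap_args(addresses, page_size=PAGE_SIZE):
    """Build memmap=nn$ss arguments reserving the pages that hold addresses."""
    pages = sorted({addr // page_size for addr in addresses})
    runs = []  # [first_page, last_page] for each run of adjacent pages
    for page in pages:
        if runs and page == runs[-1][1] + 1:
            runs[-1][1] = page  # extend the current run
        else:
            runs.append([page, page])
    args = []
    for first, last in runs:
        size_kib = (last - first + 1) * page_size // 1024
        args.append(f"memmap={size_kib}K${first * page_size:#x}")
    return args


if __name__ == "__main__":
    # One of the addresses from the mcelog output in this thread.
    print(memmap_args([0x1D09B7F40]))
```

Depending on the boot loader, the `$` may need escaping on the kernel command line.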
xbmodder Guru
Joined: 25 Feb 2004 Posts: 404
|
Posted: Tue Sep 18, 2007 3:54 am Post subject: |
|
|
I'd eBay the chips and buy some new ones. They're cheap. |
|
ksool Guru
Joined: 27 May 2006 Posts: 337 Location: Cambridge, MA
|
Posted: Wed Sep 19, 2007 2:22 am Post subject: |
|
|
Cheap? http://store.memory4less.net/36500132.html
Either way, I don't think I could rationalize buying any more RAM for the machine. Assuming I had to chuck one stick, that box would still have 6x as much RAM as the runner-up on my local network.
That kernel patch sounds super bad-ass. I'll have to look into it. The problem, though, is that Memtest86+ doesn't seem to catch the error every time, so I may be stuck running it for days just to find the bad region. |
|
Akkara Bodhisattva
Joined: 28 Mar 2006 Posts: 6702 Location: &akkara
|
Posted: Wed Sep 19, 2007 2:48 am Post subject: |
|
|
Sometimes the RAM isn't outright bad, just a bit marginal on timing, which only shows up on a heavily loaded memory bus like the one you have.
If this is the problem, underclocking it by 5 or 10% often fixes it. Relaxing (increasing) the RAS, CAS, etc. timings can also work, especially if just one timing is marginal.
Alternatively, if you have another box with a 2GB stick in it, try swapping sticks. Chances are your old one will work in the less-loaded box, and the other one will hopefully be better on timing. |
|
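The arithmetic behind that suggestion: a latency specified in clock cycles lasts longer in absolute time when the clock is slower, or when the cycle count is raised, which gives a marginal part more headroom. Here is a small Python sketch, assuming a 200 MHz memory clock and a CAS latency of 3 cycles purely for illustration (neither figure comes from this thread).

```python
def cycle_ns(clock_mhz):
    """Length of one memory clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def latency_ns(cycles, clock_mhz):
    """Absolute time taken by a timing parameter given in clock cycles."""
    return cycles * cycle_ns(clock_mhz)


if __name__ == "__main__":
    base_mhz = 200.0  # assumed stock memory clock (hypothetical)
    cas = 3           # assumed CAS latency in cycles (hypothetical)
    cases = [
        ("stock", base_mhz, cas),
        ("10% underclock", base_mhz * 0.9, cas),
        ("stock, CAS + 1", base_mhz, cas + 1),
    ]
    for label, mhz, cycles in cases:
        print(f"{label}: {latency_ns(cycles, mhz):.2f} ns")
```

Either change stretches the time the chips get to respond; which one a given BIOS exposes varies by board.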