bexamous n00b
Joined: 26 Feb 2004 Posts: 17
Posted: Thu Nov 15, 2007 12:39 am Post subject: MCE Errors / DMI Information (AMD systems)
MCE errors are such a pain, especially on systems with lots and lots of memory. I'm trying to get a better understanding of how mcelog uses DMI information to locate the bad DIMM. How accurate is this? Does it ever actually work?
http://man-wiki.net/index.php/8:mcelog
With the --dmi option mcelog will look up the addresses reported in machine checks in the SMBIOS/DMI tables of the BIOS. This can sometimes tell you which DIMM or memory controller has developed a problem. More often the information reported by the BIOS is either subtly or obviously wrong or useless. This option requires that mcelog has read access to /dev/mem (normally requires root) and runs on the same machine in the same hardware configuration as when the machine check event happened.
Okay, assuming the DMI information is all available and correct, running `mcelog --dmi --ascii` should tell me which DIMM is bad, correct?
Which DMI information does mcelog actually need? I'd like to know that before I try to verify it is correct.
Handle 0x001A, DMI type 17, 27 bytes
Memory Device
Array Handle: 0x0018
Error Information Handle: No Error
Total Width: 72 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: 1
Locator: DIMM2
Bank Locator: DIMM2
Type: DDR2
Type Detail: Synchronous
Speed: 533 MHz (1.9 ns)
Manufacturer: Not Specified
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Handle 0x0023, DMI type 20, 19 bytes
Memory Device Mapped Address
Starting Address: 0x00080000000
Ending Address: 0x000FFFFFFFF
Range Size: 2 GB
Physical Device Handle: 0x001A
Memory Array Mapped Address Handle: 0x0021
Partition Row Position: Unknown
Interleave Position: Unknown
Interleaved Data Depth: Unknown
DMI type 17: entries are available for all the DIMMs on the board, and all the 'Locator' labels match the silkscreens on the board.
DMI type 20: entries have memory ranges for all the DIMMs.
I'm *GUESSING* that mcelog decodes an MCE error, gets a memory address, and uses these two pieces of DMI information to print out the 'Locator' of the bad DIMM.
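To make that guess concrete, here is a minimal Python sketch of the lookup I have in mind. It is only my assumption about the logic, not mcelog's actual code: the class and function names (`MemoryDevice`, `MappedAddress`, `locate_dimm`) are made up for illustration, and the only data in it is the one type 17 / type 20 pair pasted above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryDevice:
    """Subset of a DMI type 17 record (hypothetical helper type)."""
    handle: int
    locator: str


@dataclass
class MappedAddress:
    """Subset of a DMI type 20 record (hypothetical helper type)."""
    start: int          # "Starting Address"
    end: int            # "Ending Address", inclusive
    device_handle: int  # "Physical Device Handle", points at a type 17 record


def locate_dimm(addr: int,
                devices: List[MemoryDevice],
                ranges: List[MappedAddress]) -> Optional[str]:
    """Return the Locator of the first DIMM whose mapped range contains addr."""
    by_handle = {d.handle: d for d in devices}
    for r in ranges:
        if r.start <= addr <= r.end:
            dev = by_handle.get(r.device_handle)
            return dev.locator if dev else None
    return None  # no type 20 entry covers this address


# Values taken from the dmidecode output above.
devices = [MemoryDevice(handle=0x001A, locator="DIMM2")]
ranges = [MappedAddress(start=0x80000000, end=0xFFFFFFFF, device_handle=0x001A)]

print(locate_dimm(0x9ABCDEF0, devices, ranges))  # DIMM2
print(locate_dimm(0x00001000, devices, ranges))  # None
```

If this is roughly right, then without any type 20 entries there is nothing to match the address against, and the type 17 'Locator' strings alone are no help.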
Problem:
DMI type 20 rarely seems to be available.
http://www.dmtf.org/standards/published_documents/DSP0134.pdf
Type 20 was changed to 'optional' rather than required a while ago... if no one is adding this anymore, how the heck is mcelog ever supposed to work?
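For checking a pile of machines, a quick way to see whether a box has any type 20 entries at all is to count the record types in saved dmidecode output. This is a rough sketch: `count_dmi_types` is a hypothetical helper name, and it assumes the "Handle ..., DMI type N, M bytes" header lines look like the ones pasted above.

```python
import re
from collections import Counter

# Matches record headers such as "Handle 0x0023, DMI type 20, 19 bytes".
HANDLE_RE = re.compile(
    r"^Handle 0x[0-9A-Fa-f]+, DMI type (\d+), \d+ bytes", re.MULTILINE
)


def count_dmi_types(dmidecode_text: str) -> Counter:
    """Count how many records of each DMI type appear in dmidecode output."""
    return Counter(int(t) for t in HANDLE_RE.findall(dmidecode_text))


# Trimmed sample based on the output above; real input would be the
# full text saved from running dmidecode as root.
sample = """Handle 0x001A, DMI type 17, 27 bytes
Memory Device
Handle 0x0023, DMI type 20, 19 bytes
Memory Device Mapped Address
"""

counts = count_dmi_types(sample)
print("type 17 records:", counts[17])
print("type 20 records:", counts[20])
if counts[17] and not counts[20]:
    print("DIMMs are described, but no mapped address ranges are present")
```

A board that reports type 17 records but zero type 20 records is the case I keep running into.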
How does mcelog deal with dual-channel memory? If DIMM1 and DIMM2 are in dual-channel mode (or ganged mode, whatever they call it), will an MCE error be able to distinguish which DIMM is bad? If not, how does mcelog always print out a single DIMM location?
If anyone can help me at all, that would be awesome. I've been going nuts trying to figure this mcelog crap out and not getting very far. mcelog's man page suggests complaining to the motherboard company to get the DMI information fixed... yet if it's no longer required information, what are the chances of getting someone to add it?