Gentoo Forums
Isolating memory failures
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1471
Location: Berlin, Germany

Posted: Sat Apr 13, 2019 1:19 pm

Well, I've been running now for a week or two with a new PSU, and it hasn't frozen yet. So I don't have any further information, except for this: I saw in dmesg the following error messages:

Code:
[ 8410.034472] mce: [Hardware Error]: Machine check events logged
[ 8410.034478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 2: 98254000000c0176
[ 8410.034482] mce: [Hardware Error]: TSC 0 MISC c008000100000000
[ 8410.034488] mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1555132314 SOCKET 0 APIC 0 microcode 6000822
[15258.199168] mce: [Hardware Error]: Machine check events logged
[15258.199173] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 2: dc25407000040136
[15258.199176] mce: [Hardware Error]: TSC 0 ADDR 7b039ad38 MISC c008000300000000
[15258.199178] mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1555139162 SOCKET 0 APIC 1 microcode 6000822
[15569.479162] mce: [Hardware Error]: Machine check events logged
[15569.479164] mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 2: dc2540e000040136
[15569.479168] mce: [Hardware Error]: TSC 0 ADDR 705799f78 MISC c008000700000000
[15569.479171] mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1555139474 SOCKET 0 APIC 1 microcode 6000822


Any idea what that is? The CPU is an AMD FX-9590, from 2016.

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54834
Location: 56N 3W

Posted: Sat Apr 13, 2019 4:07 pm

ExecutorElassus,

At face value it's a CPU problem.

However, if you have ECC RAM, it can be a RAM problem too.
The CPU has ECC on its internal caches; without ECC, errors go undetected.

The good news is that the error was detected and corrected.
Detected uncorrectable errors get you a panic.
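
Whether an event was corrected can be read from the status word that follows "Bank 2:" in the log. Below is a minimal sketch, assuming the standard x86 machine-check layout in which bits 63 down to 57 of the status register are the VAL, OVER, UC, EN, MISCV, ADDRV and PCC flags and the low 16 bits hold the error code. The `FLAGS` table and the `decode_status` helper are names made up for this illustration; they are not part of the kernel or of any tool.

```python
# Flag bits of an x86 machine-check status word (assumed standard layout).
FLAGS = {
    "VAL": 63,    # the register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # the error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status: int) -> dict:
    """Return each flag as a bool plus the low 16-bit error code."""
    result = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    result["error_code"] = status & 0xFFFF
    return result


# The three status words copied from the dmesg output earlier in the thread.
for raw in ("98254000000c0176", "dc25407000040136", "dc2540e000040136"):
    info = decode_status(int(raw, 16))
    set_flags = [name for name in FLAGS if info[name]]
    kind = "uncorrected" if info["UC"] else "corrected"
    print(f"{raw}: {kind}, flags={set_flags}, code=0x{info['error_code']:04x}")
```

Under that assumption, all three values have UC clear, which is why they count as corrected. ADDRV is set only on the second and third, which lines up with those being the only entries that print an ADDR field.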
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
JustAnother
Apprentice


Joined: 23 Sep 2016
Posts: 197

Posted: Sun Apr 14, 2019 5:12 am

I had my own little freezing issue lately.

-- A dual core AMD machine, ~2008.

-- The "cpu fan bracket" has two hooks that live under a lot of stress.

-- About three years ago I heard a loud sound like a bolt being thrown against the case. Then the machine started shutting down. I realized that a fan bracket hook had broken with great gusto, and the cpu was overheating. For a few bucks I got it running. Case closed - or so I thought.

-- Recently the same machine started freezing without warning. No messages. No pings. No ssh's. No nothing. Just stone silence.

-- Aha! It must be firefox - it had just upgraded. Turned out not to be the problem.

-- Aha! I had just dockerized the kernel. I must have messed up a working kernel with all those fancy schmancy docker switches. Turned out not to be the case.

-- Aha! Electromigration -- people say the processor is only good for about 10 years, and those damn atoms have jostled around one too many times. Nope.

-- Aha! A cosmic ray nailed the cpu. Nope.

-- Aha! It must be the memory going senile because it dune wore out. I had never seen anything special happen when I ran memtest, but "they" said to do this. To my surprise, the computer seemed to freeze during memtest. After several more runs memtest failed, saying there was a bogus hardware interrupt on cpu 1, shutting it down. Okay, so it's the hardware, not the dockerized kernel.

-- It's only 11 years old, so maybe it's time to upgrade. But I can't stand the thought of rebuilding this thing from scratch.

-- So I figured it might be wise to open the case and look for anything obvious, like some sparks or some ugly black stains.

-- Aha! I found something. The heat sink and fan assembly didn't feel right -- it was too loose. Apparently one of the two plastic fan bracket hooks had failed, but unlike the previous failure, where the hook went flying like a bullet, this time two sides failed and the hook pivoted along the third side. The net effect was that one side of the cpu was held too loosely, and the other side was held way too loosely. This puts a gradient onto the thermal conductance per unit area, leading to an asymmetric cpu failure. This sneaky little problem was making the computer freeze, not shut down.

-- A new part for $5 seemed to fix everything. I just smeared around that nasty grease with my finger.
All times are GMT
Page 2 of 2

 