Help Debugging Hardware Failure? [solved]

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Hello All,
I realize this isn't necessarily the ideal forum for this, but thought I'd try anyway. I just built a system that keeps hanging on me and I'm trying to debug the problem. Here's some pertinent data:

I ran memtest86 from a USB for 1 pass; no problems showed up (2 days).
I ran stress on the machine for about 10 minutes, no issues.
When I try to emerge world, the machine inevitably dies within 30 minutes.
Last time I tried, I had sensors running in a loop; CPU temps were all below 60 degrees Celsius.
The only possibly related message I see in /var/log/messages is:

Quote:

alaya kernel: [ 1219.857741] kworker/dying (203) used greatest stack depth: 12288 bytes left
but I don't know if this is relevant.
The system was actually dying when I was trying to finish up the stage3 install; i.e. when compiling grub2 while the system was mounted from a minimal bootdisk. I got around this by removing all but one of the RAM sticks and then it compiled o.k. But then I assumed the RAM wasn't the issue because of the memtest.
Some relevant hardware specs: Xeon W-2295 CPU, Asus WS C422 PRO/SE motherboard, 512 GB RAM, Nvidia 1080 Ti GPU, Samsung 970 EVO SSD, Corsair 1200 W PSU.

So my guess is the issue is the CPU, the motherboard or the RAM, but I'm not sure how to narrow it down. Any tips? As I write this I'm thinking I should maybe remove a bunch of the RAM sticks and try emerge world again. Then maybe try memtest86+?

Can I rule any of those 3 out? Anybody have educated guesses as to where I should focus my efforts? Should I run stress for longer?
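
For anyone wanting to reproduce the "sensors in a loop" logging mentioned above, here is a minimal sketch in Python. The function name, the log file name and the 10 second default are assumptions for illustration; the only external command it relies on is `sensors` from lm_sensors. Each sample is flushed and synced to disk so the last reading has a better chance of surviving a hard hang.

```python
import os
import subprocess
import time
from datetime import datetime


def log_sensors(path="sensors.log", interval=10.0, samples=None, command=("sensors",)):
    """Append timestamped output of `command` to `path` every `interval` seconds.

    `samples=None` means run until interrupted. `command` defaults to the
    lm_sensors CLI; it is a parameter only so the loop can be tried out
    on a machine without lm_sensors installed.
    """
    taken = 0
    while samples is None or taken < samples:
        result = subprocess.run(command, capture_output=True, text=True)
        with open(path, "a") as log:
            log.write(f"=== {datetime.now().isoformat()} ===\n")
            log.write(result.stdout)
            # Push the data to disk now; after a hang there is no second chance.
            log.flush()
            os.fsync(log.fileno())
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval)
    return taken
```

Calling `log_sensors()` in one terminal while the build or stress run goes in another leaves a file whose last block is the final reading before the hang.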

NeddySeagoon · Posted: Fri Sep 25, 2020 4:45 pm Post subject:

justin_brody,

We have a few retired hardware guys/girls here and some not yet retired too, so this will do. :)

To save us all a lot of typing, follow the advice in "Method to test hardware functionality of crashing system?". While the things you can try are already in that topic, the questions you need to ask may be different.

Follow that topic for investigation ideas but post your questions and findings here.
They may reach a different conclusion to that topic.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Thanks so much NeddySeagoon! I think you actually helped me diagnose hardware failures on the last system I built, so that's probably why I had the idea this was a good place. Plus, this feels more like home than anywhere else.

I will go through the link as you suggested and post updates; thanks again!

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Hi Mika,

Thanks for that tip! I'm hoping I can figure out my issues without having to get that sophisticated, but it might be fun to play with if I need to.

Some updates on my debugging efforts, based on the post NeddySeagoon pointed to:

smartctl -x reports no errors on the SSD
I spent a fair amount of time swapping RAM modules in and out. With just one module, I was able to emerge chromium without any issues. I tried this with each of the 8 modules; everything seemed o.k. I read somewhere that this might indicate the DDR needed higher voltage, so I manually set it to 1.4 V everywhere. This caused the machine to hang partway through the Gentoo boot-up. I set it back to [auto] and the system seemed more stable; I was able to run it (with all 8 RAM modules) for a few hours, but then it hung again.
As I mentioned in the (now old) original post, I ran memtest86 for a couple of days. It completed the first cycle without any trouble.
BIOS is updated to the latest; June of 2020 as I recall.
Right now, lm_sensors is only giving me CPU temps and they always look o.k. (< 60 Celsius even during heavy compilation). It is *not* giving me motherboard temps; I'll see if there's something I need to do on the motherboard to get those reported.
Overall, this thing produces a lot of heat. But with this CPU and the GPU I would think that's to be expected. When I originally pulled the RAM modules out they were hot to touch; is that normal? For what it's worth the modules don't have any kind of heat spreaders or anything else on them. Is there a way to tell if they're simply overheating?

My guess right now is that it's something with the RAM; but I'm not sure how to verify. Should I try running it for a few days with only half the RAM in and see if it's more stable? Any advice will be much appreciated!

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

OK, running stress --vm 20 seems to hang the machine very quickly. Not usually happy to see my computer crash.

So this sounds like a memory issue. How do I debug it?

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Took all but one module out, now I can run stress --vm 20 without issue.

Given that memtest86 reported nothing and that I was able to emerge chromium with each individual module, I'm guessing this is some kind of motherboard-related issue? Voltages? Power? Heat?

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

In case it's relevant, the modules are all the same make and model but were packaged as 8 individual sticks rather than being packaged as a kit. I've seen people say that can be an issue. I'm also noticing that the response time of the machine seems better with a single stick than it did with 8.

NeddySeagoon · Posted: Thu Oct 01, 2020 1:37 pm Post subject:

justin_brody,

Stress stresses lots of things. The memory cannot be used without the memory controller, which is a part of the CPU.
The CPU communicates with the RAM using the motherboard.
The whole thing requires power.

Do a binary search on your RAM. Remove half of it and leave half fitted to test.
Identify the sticks; you need to be able to tell them apart later.
If that fails, the defect is still present, so swap the fitted sticks with the ones you removed.
If that also fails, it's probably not RAM.
If that works, you have likely removed a faulty RAM stick.
However, it could be an address bit. With half the RAM removed, you are using less of the address bus.

With 4 sticks you test 2, then 1, and so on.
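
The halving procedure above can be written down as a small sketch. This is only an illustration of the decision logic, not a tool: `passes_stress` is a hypothetical callback standing in for "fit exactly these sticks, boot, run the stress test, report whether it survived", and the stick labels are made up. It assumes the full set is already known to fail.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_faulty_stick(
    sticks: Sequence[str],
    passes_stress: Callable[[Sequence[str]], bool],
) -> Tuple[str, Optional[str]]:
    """Binary search over RAM sticks, assuming the full set already fails.

    Returns one of:
      ("faulty_stick", label)           one stick was isolated
      ("no_single_faulty_stick", None)  both halves pass on their own
      ("probably_not_ram", None)        both halves fail on their own
    """
    candidates = list(sticks)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        first_ok = passes_stress(first)
        second_ok = passes_stress(second)
        if first_ok and second_ok:
            # Each half is fine alone: look at loading, address bits, the board.
            return ("no_single_faulty_stick", None)
        if not first_ok and not second_ok:
            # The defect follows neither half: probably not a RAM stick.
            return ("probably_not_ram", None)
        # Keep searching inside whichever half failed.
        candidates = second if first_ok else first
    return ("faulty_stick", candidates[0])
```

The middle outcome matters as much as the first: when both halves pass alone but the full set fails, the search has ruled out a single bad stick rather than found one.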

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Oh, that makes very good sense -- thanks! I'll give it a shot and report back.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

O.k, I tried what you suggested and both sets of 4 work. Tried adding in two more and I was able to run stress --vm 20 with both pairs from the other 4 (10 minute run). In the process of looking up which sockets to put the sticks in, I saw this in the manual:

x90e · n00b Joined: 30 Sep 2020 Posts: 40

NeddySeagoon · Posted: Thu Oct 01, 2020 7:16 pm Post subject:

justin_brody,

Temperature matters. It's one of a whole range of physical properties that affect the performance of electronics.

The hotter the silicon, the higher the leakage currents and the harder the memory controller has to work.
The more DRAMs you fit, the higher the stray (parasitic) capacitance that the RAM presents to the DRAM controller, so the slower the signals to the RAM must be.
As motherboard manufacturers go to great lengths to make all the traces transmission lines of the same length (yes, the wavy tracks are there for a reason), the more the stray capacitance, the slower the RAMs can be driven.
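
As a rough feel for why more loads mean slower edges, here is a deliberately crude sketch that treats the bus as a single lumped RC circuit, where the 10% to 90% rise time is R·C·ln(9), about 2.2·R·C. Real DDR4 buses are terminated transmission lines, so this ignores reflections and termination entirely; the function name and any values you feed it are assumptions for illustration, not figures for any particular board or DIMM.

```python
import math


def rc_rise_time_ns(driver_ohms: float, load_pf_each: float, n_loads: int,
                    trace_pf: float = 0.0) -> float:
    """10%-90% rise time of a lumped RC node, in nanoseconds.

    driver_ohms  : output resistance of the driver (ohms)
    load_pf_each : stray capacitance added per attached load (pF)
    n_loads      : number of loads hanging on the line
    trace_pf     : capacitance of the trace itself (pF)
    """
    total_pf = trace_pf + load_pf_each * n_loads
    # ohms * picofarads gives picoseconds; divide by 1000 for nanoseconds.
    return math.log(9) * driver_ohms * total_pf / 1000.0
```

In this simplified model the rise time scales linearly with the total capacitance, so every extra load eats directly into the timing margin.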

The stray capacitance even has a temperature coefficient. It gets worse with rising temperature.

Temperature is about the only thing you can usefully influence.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

O.k. -- thanks so much! I will look into cooling things down!

NeddySeagoon · Posted: Thu Oct 01, 2020 10:30 pm Post subject:

justin_brody,

You may find that your DRAMs have a temperature sensor, so you can actually measure your success.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

oh, very interesting! is that something i should be able to pick up with lm_sensors if it's configured correctly?

NeddySeagoon · Posted: Thu Oct 01, 2020 10:41 pm Post subject:

justin_brody,

If it's there and you have kernel support, yes.

figueroa · Posted: Fri Oct 02, 2020 2:47 am Post subject:

Be sure your CPU cooler is mounted properly on the chip. Thin coat of thermal paste, cooler securely mounted.

Put a digital thermometer inside your box.

Take the cover off and aim a fan at the hardware. This last trick has helped a lot of machines get through stressful processes. If it works, then you know you have a heat problem.

Are you sure about the power supply? (Swap it out as a test.)
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi

NeddySeagoon · Posted: Fri Oct 02, 2020 10:05 am Post subject:

figueroa,

All good stuff.

As the silicon temperature increases, it takes more power, which makes it get hotter ...
Luckily, silicon does not exhibit 'thermal runaway', not at useful temperatures anyway, unlike germanium.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

I was going to ask about blowing a fan into the case for testing this out.

The BIOS is reporting stats (like motherboard temperature) that I'm not getting through lm_sensors, but that's probably another thread if I can't figure it out. Otherwise I'll see if I can borrow a good thermometer from someone.

Thanks for the tips!

NeddySeagoon · Posted: Fri Oct 02, 2020 11:35 am Post subject:

justin_brody,

Set the fan up first and observe the effects, if any.
It will be easy enough to make temperature measurements later.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

O.k., that turned out to be pretty illuminating! Ran the machine with 6 DIMMs in over the weekend, doing some moderately heavy work. No problems. Put the last two in this morning and ran with a big box fan blowing into it. It stayed cool but still died after around 18 minutes of stress --vm 20.

I was saving lm_sensors data into a file every 10 seconds; here's the last reading:

NeddySeagoon · Posted: Mon Oct 05, 2020 3:01 pm Post subject:

justin_brody,

Is it OK with any 6 from 8 DIMMs or is it a particular pair?

Setting up the DIMM drives is a bit hit and miss.
The BIOS reads settings from the SPD ROM on the DIMM. Some brain-dead systems only read one, rather than using the slowest of all fitted DIMMs.
That's just for starters, now the BIOS does trial and error to establish limits.
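
To make "use the slowest of all fitted DIMMs" concrete, here is a minimal sketch. It assumes each DIMM reports a handful of minimum times in nanoseconds; the class, the field names and any numbers used with it are hypothetical and not read from real hardware. Taking the largest value of each field across all sticks gives a starting point that no stick is asked to beat.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DimmTimings:
    """Minimum times reported by one DIMM, in nanoseconds (hypothetical fields)."""
    tck_ns: float   # minimum clock period
    taa_ns: float   # minimum CAS latency time
    trcd_ns: float  # minimum RAS-to-CAS delay
    trp_ns: float   # minimum row precharge time


def common_timings(dimms: Sequence[DimmTimings]) -> DimmTimings:
    """Pick, for every field, the slowest (largest) value among all DIMMs."""
    if not dimms:
        raise ValueError("no DIMMs fitted")
    return DimmTimings(
        tck_ns=max(d.tck_ns for d in dimms),
        taa_ns=max(d.taa_ns for d in dimms),
        trcd_ns=max(d.trcd_ns for d in dimms),
        trp_ns=max(d.trp_ns for d in dimms),
    )
```

A firmware that reads only the first stick would effectively return `dimms[0]` here, which is fine until a slower stick is fitted elsewhere.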

It's not just timings, it's signal drive strength too.
Not strong enough, and it won't work because it's too slow. Too strong and, despite the transmission line traces on the PCB, you get overshoot and again it won't work.
It's all very much like Baby Bear's porridge[1].

[1]Search for Goldilocks and the Three Bears if you don't know the story.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Hi NeddySeagoon,

So I just swapped the 2 DIMMs that had been left out back in and was able to run stress for over an hour without any issues.

It sounds like your suggestion is that I play around with voltage and speed settings in BIOS? Since it's DDR4 can I just set all the voltages manually to 1.2V?

NeddySeagoon · Posted: Mon Oct 05, 2020 8:25 pm Post subject:

justin_brody,

It's much more complex than the DRAM power supply voltage.
The DRAM timings and signal drive strengths matter too. That's a lot of things to play with.

Personally, I would just leave two sticks of RAM out and call it a day.
Overvolting can destroy your RAM. If you are lucky, the damage will stop there.

justin_brody · Apprentice Joined: 26 Jan 2005 Posts: 283

Understood -- thanks again NeddySeagoon. I think I'm going to RMA the motherboard; very much appreciate the help!!!