justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Fri Sep 25, 2020 12:14 pm Post subject: Help Debugging Hardware Failure? [solved] |
|
|
Hello All,
I realize this isn't necessarily the ideal forum for this, but thought I'd try anyway. I just built a system that keeps hanging on me and I'm trying to debug the problem. Here's some pertinent data:
- I ran memtest86 from a USB stick for 1 pass (about 2 days); no problems showed up.
- I ran stress on the machine for about 10 minutes, no issues.
- When I try to emerge world, the machine inevitably dies within 30 minutes.
- Last time I tried, I had sensors running in a loop; CPU temps were all below 60 °C.
- The only possibly related message I see in /var/log/messages is:
Quote: | alaya kernel: [ 1219.857741] kworker/dying (203) used greatest stack depth: 12288 bytes left | but I don't know if this is relevant
The system was actually dying when I was trying to finish up the stage3 install; i.e. when compiling grub2 while the system was mounted from a minimal bootdisk. I got around this by removing all but one of the RAM sticks and then it compiled o.k. But then I assumed the RAM wasn't the issue because of the memtest.
Some relevant hardware specs: Xeon W-2295 CPU, Asus WS C422 PRO/SE motherboard, 512 GB RAM, Nvidia 1080 Ti GPU, Samsung 970 EVO SSD, Corsair 1200 W PSU
So my guess is the issue is the CPU, the motherboard or the RAM, but I'm not sure how to narrow it down. Any tips? As I write this I'm thinking I should maybe remove a bunch of the RAM sticks and try emerge world again. Then maybe try memtest86+?
Can I rule any of those 3 out? Anybody have educated guesses as to where I should focus my efforts? Should I run stress for longer?
Last edited by justin_brody on Wed Oct 07, 2020 4:39 pm; edited 1 time in total |
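The "sensors running in a loop" check from the list above could be done along these lines. This is only a minimal sketch, not what was actually run: the function name `log_readings`, the log file name and the 10-second default are all assumptions, and it simply shells out to whatever command it is given (for lm_sensors that would be the `sensors` binary). It forces each reading to disk so that the last one is more likely to survive a hard hang.

```python
import os
import subprocess
import time
from datetime import datetime


def log_readings(command, log_path, interval_s=10.0, samples=None):
    """Append the timestamped output of `command` to `log_path`.

    samples=None keeps going until interrupted (or until the machine dies);
    an integer stops after that many readings. Returns the number written.
    """
    count = 0
    while samples is None or count < samples:
        result = subprocess.run(command, capture_output=True, text=True)
        with open(log_path, "a") as log:
            log.write(f"=== {datetime.now().isoformat(timespec='seconds')} ===\n")
            log.write(result.stdout)
            log.flush()
            os.fsync(log.fileno())  # push to disk before the next sleep
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
    return count
```

Typical use would be `log_readings(["sensors"], "sensors.log")`, started in a second terminal before kicking off the emerge, assuming `sensors` is on the PATH.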
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Fri Sep 25, 2020 4:45 pm Post subject: |
|
|
justin_brody,
We have a few retired hardware guys/girls here and some not yet retired too, so this will do. :)
To save us all a lot of typing, follow the advice in Method to test hardware functionality of crashing system?. While the things you can try are already in that topic, the questions you need to ask may be different.
Follow that topic for investigation ideas but post your questions and findings here.
They may reach a different conclusion to that topic. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Fri Sep 25, 2020 5:18 pm Post subject: |
|
|
Thanks so much NeddySeagoon! I think you actually helped me diagnose hardware failures on the last system I built, so that's probably why I had the idea this was a good place. Plus this feels more like home than anywhere else.
I will go through the link as you suggested and post updates; thanks again! |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 12:47 pm Post subject: |
|
|
Hi Mika,
Thanks for that tip! I'm hoping I can figure out my issues without having to get that sophisticated, but it might be fun to play with if I need to.
Some updates on my debugging efforts, based on the post NeddySeagoon pointed to:
- smartctl -x reports no errors on the SSD
- I spent a fair amount of time swapping RAM modules in and out. With just one module, I was able to emerge chromium without any issues. I tried this with each of the 8 modules; everything seemed o.k. I read somewhere that this might indicate the DDR needed higher voltage, so I manually set it to 1.4 V everywhere. This caused the machine to hang partway through the Gentoo boot-up. I set it back to [auto] and the system seemed more stable; I was able to run it (with all 8 RAM modules) for a few hours, but then it hung again.
- As I mentioned in the (now old) original post, I ran memtest86 for a couple of days. It completed the first cycle without any trouble.
- BIOS is updated to the latest; June of 2020 as I recall.
- Right now, lm_sensors is only giving me CPU temps and they always look o.k. (< 60 Celsius even during heavy compilation). It is *not* giving me motherboard temps; I'll see if there's something I need to do on the motherboard to get those reported.
- Overall, this thing produces a lot of heat. But with this CPU and the GPU I would think that's to be expected. When I originally pulled the RAM modules out they were hot to touch; is that normal? For what it's worth the modules don't have any kind of heat spreaders or anything else on them. Is there a way to tell if they're simply overheating?
My guess right now is that it's something with the RAM; but I'm not sure how to verify. Should I try running it for a few days with only half the RAM in and see if it's more stable? Any advice will be much appreciated! |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 12:55 pm Post subject: |
|
|
OK, running stress --vm 20 seems to hang the machine very quickly. Not usually happy to see my computer crash.
So this sounds like a memory issue. How to debug? |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 1:18 pm Post subject: |
|
|
Took all but one module out, now I can run stress --vm 20 without issue.
Given that memtest86 reported nothing and that I was able to emerge chromium with each individual module, I'm guessing this is some kind of motherboard-related issue? Voltages? Power? Heat? |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 1:23 pm Post subject: |
|
|
In case it's relevant, the modules are all the same make and model but were packaged as 8 individual sticks rather than being packaged as a kit. I've seen people say that can be an issue. I'm also noticing that the response time of the machine seems better with a single stick than it did with 8. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Thu Oct 01, 2020 1:37 pm Post subject: |
|
|
justin_brody,
stress stresses lots of things. The memory cannot be used without the memory controller, which is a part of the CPU.
The CPU communicates with the RAM using the motherboard.
The whole thing requires power.
Do a binary search on your RAM. Remove half of it and leave half fitted to test.
Identify the sticks; you need to be able to tell them apart later.
If that fails, the defect is still present, so swap the fitted sticks with the ones you removed.
If that fails, it's probably not RAM.
If that works, you have likely removed a faulty RAM stick.
However, it could be an address bit. With half the RAM removed, you are using less of the address bus.
With 4 sticks you test 2 then 1 .. and so on. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
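The binary search described in this post can be written down as a small decision procedure. The following is only a sketch under assumptions: `bisect_sticks` and the `passes` callback are hypothetical names, and `passes` stands for "fit exactly these sticks, run the stress test, report whether the machine survived", which in practice is a manual step. It also assumes the full set has already been seen to fail.

```python
def bisect_sticks(sticks, passes):
    """Narrow a fault down by repeatedly testing each half of the suspects.

    sticks: list of labels, one per physically marked module.
    passes: callable taking a list of labels; returns True if the machine
            survives the stress test with exactly those sticks fitted.
    Returns (verdict, suspects).
    """
    suspects = list(sticks)
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first, second = suspects[:mid], suspects[mid:]
        first_ok, second_ok = passes(first), passes(second)
        if first_ok and second_ok:
            # Each half is fine on its own, so no single stick is bad by itself.
            return "fault needs the full set (loading, slot, address bit, power or heat)", suspects
        if not first_ok and not second_ok:
            # The defect follows the machine, not the modules.
            return "fails whatever is fitted: probably not the RAM", suspects
        suspects = second if first_ok else first
    return "single suspect stick", suspects
```

With 8 labelled sticks this tests 4, then 2, then 1. The "both halves pass" branch is the awkward case mentioned above: the fault only shows with more of the bus populated, so halving the RAM hides it.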
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 1:38 pm Post subject: |
|
|
Oh, that makes very good sense -- thanks! I'll give it a shot and report back. |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 3:11 pm Post subject: |
|
|
O.k, I tried what you suggested and both sets of 4 work. Tried adding in two more and I was able to run stress --vm 20 with both pairs from the other 4 (10 minute run). In the process of looking up which sockets to put the sticks in, I saw this in the manual:
Quote: | For system stability, use a more efficient memory cooling system to support a full memory load (8 DIMMS) |
So I think that's my answer.
Now I have to figure out what that looks like; but I'm optimistically thinking that none of my hardware is bad, which is very good news. I think I can somehow live on 384GB of RAM for a bit.
Thanks again for the helpful advice NeddySeagoon! Unless you tell me I'm not done I'll go ahead and mark this [solved].
Or perhaps not??? Other info online seems to indicate that RAM overheating isn't a real issue. Perhaps it's just something in motherboard/cpu on the last two sockets? But wouldn't that show up more reliably??? |
|
|
|
x90e n00b
Joined: 30 Sep 2020 Posts: 40
|
Posted: Thu Oct 01, 2020 5:10 pm Post subject: |
|
|
RAM overheating can definitely be a thing. High speeds and full slots when you've got 8 DIMMs, plus a full RAM load, are more stressful on any memory controller. For example, with the memory controller on my 2700X CPU (B450-F motherboard, BIOS 3013), I can run 2 sticks of DDR4 at 3600, but with all four slots full I'm lucky to get 2933 with single-rank DIMMs. With all four slots full of dual-rank DIMMs, I'm looking at only being able to run at 2133 MHz; maybe 2400 MHz, but then I start getting random errors and crashes, causing all sorts of system stability issues.
They make coolers with fans that attach over your memory sticks; you might be able to find one that fits in your case. I'm not sure how that works with 8 DIMMs, but you can't be the first person to have this problem. |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Thu Oct 01, 2020 7:16 pm Post subject: |
|
|
justin_brody,
Temperature matters. It's one of a whole range of physical properties that affect the performance of electronics.
The hotter the silicon, the higher the leakage currents and the harder the memory controller has to work.
The more DRAMs you fit, the higher the stray (parasitic) capacitance that the RAM presents to the DRAM controller, so the slower the signals to the RAM must be.
Motherboard manufacturers go to great lengths to make all the traces transmission lines of the same length (yes, the wavy tracks are there for a reason), but the more stray capacitance there is, the slower the RAMs can be driven.
The stray capacitance even has a temperature coefficient. It gets worse with rising temperature.
Temperature is about the only thing you can usefully influence. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
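As a rough illustration of the capacitance point in this post, here is a first-order lumped model. It is an assumption made purely for intuition: a real DDR4 bus is a terminated transmission line and behaves in a more complicated way. The symbols are illustrative and not taken from any datasheet: a driver with output resistance $R$ charges the trace capacitance plus the input capacitance of each of the $n$ DIMMs sharing the channel.

```latex
C_{\text{total}} = C_{\text{trace}} + n\,C_{\text{DIMM}},
\qquad
t_r \approx 2.2\,R\,C_{\text{total}}
```

Here $t_r$ is the 10% to 90% rise time of a simple RC stage (the factor 2.2 is $\ln 9$). In this model every extra DIMM on a channel adds to $C_{\text{total}}$ and stretches the edges, which is why a fully populated board may need slower timings than a half-populated one.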
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 10:28 pm Post subject: |
|
|
O.k. -- thanks so much! I will look into cooling things down! |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Thu Oct 01, 2020 10:30 pm Post subject: |
|
|
justin_brody,
You may find that your DRAMs have a temperature sensor, so you can actually measure your success. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Thu Oct 01, 2020 10:33 pm Post subject: |
|
|
oh, very interesting! is that something i should be able to pick up with lm_sensors if it's configured correctly? |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Thu Oct 01, 2020 10:41 pm Post subject: |
|
|
justin_brody,
If it's there and you have kernel support, yes. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
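For reference, on-DIMM temperature sensors that follow the JEDEC JC-42.4 standard are handled by the kernel's `jc42` hwmon driver, and anything that driver finds shows up under `/sys/class/hwmon`. Whether the modules in this thread carry such a sensor is not known, so treat the following as a sketch under that assumption; the function name `read_dimm_temps` is made up for illustration.

```python
from pathlib import Path


def read_dimm_temps(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon device name: temperature in °C} for every jc42 sensor found."""
    temps = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        temp_file = dev / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        if name_file.read_text().strip() != "jc42":
            continue  # some other chip (coretemp, Super I/O, ...)
        # hwmon reports temperatures in millidegrees Celsius.
        temps[dev.name] = int(temp_file.read_text().strip()) / 1000.0
    return temps
```

An empty result means either the DIMMs have no such sensor or the driver (and the SMBus driver it sits on) is not built or loaded; the sketch cannot tell those cases apart.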
|
|
|
figueroa Advocate
Joined: 14 Aug 2005 Posts: 3007 Location: Edge of marsh USA
|
Posted: Fri Oct 02, 2020 2:47 am Post subject: |
|
|
Be sure your CPU cooler is mounted properly on the chip. Thin coat of thermal paste, cooler securely mounted.
Put a digital thermometer inside your box.
Take the cover off and aim a fan at the hardware. This last trick has helped a lot of machines get through stressful processes. If it works, then you know you have a heat problem.
Are you sure about the power supply? (Swap it out as a test.) _________________ Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Fri Oct 02, 2020 10:05 am Post subject: |
|
|
figueroa,
All good stuff.
As the silicon temperature increases, it takes more power, which makes it get hotter ...
Luckily, silicon does not exhibit 'thermal runaway', not at useful temperatures anyway, unlike germanium. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Fri Oct 02, 2020 11:31 am Post subject: |
|
|
I was going to ask about blowing a fan into the case for testing this out.
The BIOS is reporting stats (like motherboard temperature) that I'm not getting through lm_sensors, but that's probably another thread if I can't figure it out. Otherwise I'll see if I can borrow a good thermometer from someone.
Thanks for the tips! |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Fri Oct 02, 2020 11:35 am Post subject: |
|
|
justin_brody,
Set the fan up first and observe the effects, if any.
It will be easy enough to make temperature measurements later. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Mon Oct 05, 2020 2:42 pm Post subject: |
|
|
O.k., that turned out to be pretty illuminating! Ran the machine with 6 DIMMs in over the weekend, doing some moderately heavy work. No problems. Put the last two in this morning and ran with a big box fan blowing into it. It stayed cool but still died after around 18 minutes of stress --vm 20.
I was saving lm_sensors data into a file every 10 seconds; here's the last reading:
Quote: |
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +51.0°C (high = +76.0°C, crit = +86.0°C)
Core 0: +40.0°C (high = +76.0°C, crit = +86.0°C)
Core 1: +45.0°C (high = +76.0°C, crit = +86.0°C)
Core 2: +49.0°C (high = +76.0°C, crit = +86.0°C)
Core 3: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 4: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 8: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 9: +49.0°C (high = +76.0°C, crit = +86.0°C)
Core 10: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 11: +45.0°C (high = +76.0°C, crit = +86.0°C)
Core 16: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 17: +45.0°C (high = +76.0°C, crit = +86.0°C)
Core 18: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 19: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 20: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 24: +51.0°C (high = +76.0°C, crit = +86.0°C)
Core 25: +49.0°C (high = +76.0°C, crit = +86.0°C)
Core 26: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 27: +47.0°C (high = +76.0°C, crit = +86.0°C)
nct6796-isa-0290
Adapter: ISA adapter
Vcore: 896.00 mV (min = +0.00 V, max = +1.74 V)
in1: 1000.00 mV (min = +0.00 V, max = +0.00 V) ALARM
AVCC: 3.39 V (min = +0.00 V, max = +0.00 V) ALARM
+3.3V: 3.26 V (min = +0.00 V, max = +0.00 V) ALARM
in4: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in5: 0.00 V (min = +0.00 V, max = +0.00 V)
in6: 592.00 mV (min = +0.00 V, max = +0.00 V) ALARM
3VSB: 3.39 V (min = +0.00 V, max = +0.00 V) ALARM
Vbat: 3.10 V (min = +0.00 V, max = +0.00 V) ALARM
in9: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in10: 600.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in11: 440.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in12: 1.01 V (min = +0.00 V, max = +0.00 V) ALARM
in13: 0.00 V (min = +0.00 V, max = +0.00 V)
in14: 512.00 mV (min = +0.00 V, max = +0.00 V) ALARM
fan1: 0 RPM (min = 0 RPM)
fan2: 0 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
fan4: 0 RPM (min = 0 RPM)
fan5: 0 RPM (min = 0 RPM)
fan6: 0 RPM (min = 0 RPM)
fan7: 0 RPM (min = 0 RPM)
SYSTIN: +30.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
CPUTIN: +46.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
AUXTIN0: +50.5°C sensor = thermistor
AUXTIN1: +50.0°C sensor = thermistor
AUXTIN2: +63.0°C sensor = thermistor
AUXTIN3: +57.0°C sensor = thermistor
PECI Agent 0 Calibration: +11.0°C
PCH_CHIP_CPU_MAX_TEMP: +0.0°C
PCH_CHIP_TEMP: +0.0°C
PCH_CPU_TEMP: +0.0°C
intrusion0: ALARM
intrusion1: ALARM
beep_enable: disabled
|
From what I could tell, nothing looked significantly different from the readings with the machine having 6 DIMMs in (I can post that as well if it would be helpful).
So is this looking like a motherboard or CPU issue?
I will note that I only stressed the RAM modules for about 10 minutes earlier; so perhaps I should try again with 20 minutes and the last two modules swapped in? |
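Comparing a 6-DIMM reading against an 8-DIMM reading by eye is error-prone, so a small script can do the diff. This is a sketch only: `parse_temps`, `diff_temps` and the 5 °C default threshold are assumptions, and the parser relies on the layout shown in the quoted output, where a line without a colon is a chip name and temperature lines look like `label: +NN.N°C`.

```python
import re

# Matches e.g. "Core 0: +40.0°C (high = ...)" or "AUXTIN2: +63.0°C sensor = thermistor".
TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<value>-?\d+(?:\.\d+)?)°C")


def parse_temps(text):
    """Return {(chip, label): °C} for every temperature line in `sensors` output."""
    temps = {}
    chip = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" not in line:
            chip = line  # chip header such as "coretemp-isa-0000"
            continue
        match = TEMP_LINE.match(line)
        if match:  # voltage, fan and adapter lines do not match and are skipped
            temps[(chip, match.group("label").strip())] = float(match.group("value"))
    return temps


def diff_temps(before, after, threshold=5.0):
    """List (key, before, after) for readings that moved by at least `threshold` °C."""
    changed = []
    for key in sorted(before.keys() & after.keys()):
        if abs(after[key] - before[key]) >= threshold:
            changed.append((key, before[key], after[key]))
    return changed
```

Feeding it the last saved reading from each run, for example `diff_temps(parse_temps(six_dimm_text), parse_temps(eight_dimm_text))`, gives a short list of the sensors worth looking at; an empty list would support the "nothing looked significantly different" impression.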
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Mon Oct 05, 2020 3:01 pm Post subject: |
|
|
justin_brody,
Is it OK with any 6 from 8 DIMMs or is it a particular pair?
Setting up the DIMM drives is a bit hit and miss.
The BIOS reads settings from the SPD EEPROM on the DIMM. Some brain dead systems only read one, rather than use the slowest of all fitted DIMMs.
That's just for starters, now the BIOS does trial and error to establish limits.
It's not just timings, it's signal drive strength too.
Not strong enough, it won't work because it's too slow. Too strong and despite the transmission line traces on the PCB, you get overshoot and again it won't work.
It's all very much like Baby Bear's porridge[1].
[1]Search for Goldilocks and the Three Bears if you don't know the story. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
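To answer the "any 6 from 8, or a particular pair?" question systematically, the possible test configurations can simply be listed and ticked off. A minimal sketch, with the helper name `left_out_pairs` and the stick labels being assumptions:

```python
from itertools import combinations


def left_out_pairs(sticks):
    """Yield (fitted, left_out) for every way of leaving two sticks out."""
    for left_out in combinations(sticks, 2):
        fitted = [s for s in sticks if s not in left_out]
        yield fitted, list(left_out)
```

For 8 sticks that is 28 configurations, which is a lot of stress runs, so in practice one would probably sample a few. Note that this only varies which sticks are fitted; to separate a bad stick from a bad slot, the pair that was left out also has to be tried in different sockets.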
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Mon Oct 05, 2020 4:14 pm Post subject: |
|
|
Hi NeddySeagoon,
So I just swapped the 2 DIMMs that had been left out back in and was able to run stress for over an hour without any issues.
It sounds like your suggestion is that I play around with voltage and speed settings in BIOS? Since it's DDR4 can I just set all the voltages manually to 1.2V? |
|
|
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54815 Location: 56N 3W
|
Posted: Mon Oct 05, 2020 8:25 pm Post subject: |
|
|
justin_brody,
It's much more complex than just the DRAM power supply voltage.
The DRAM timings and signal drive strengths matter too. That's a lot of things to play with.
Personally, I would just leave two sticks of RAM out and call it a day.
Overvolting can destroy your RAM. If you are lucky, the damage will stop there. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
|
|
justin_brody Apprentice
Joined: 26 Jan 2005 Posts: 283
|
Posted: Wed Oct 07, 2020 3:07 pm Post subject: |
|
|
Understood -- thanks again NeddySeagoon. I think I'm going to RMA the motherboard; very much appreciate the help!!! |
|
|
|
|