Random Hardware Errors

SATURN_RINGS · n00b Joined: 23 Apr 2021 Posts: 6 Location: Joe

So I recently finished my first install of Gentoo Linux, but I've run into a bit of a snag. Every now and then while I am using my system, my computer case's speaker will beep and this error will pop up in the terminal:

eccerr0r · Posted: Sat Apr 24, 2021 8:11 pm Post subject:

Software does not fix hardware problems.

Software can hide hardware problems; you can just ignore the errors if you wish.

You can try downclocking your hardware to make the errors go away, but more likely you need to check your power supply, motherboard, CPU and heatsink, etc. to see if there's any problem with them.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?

NeddySeagoon · Posted: Sat Apr 24, 2021 8:14 pm Post subject:

SATURN_RINGS,

It really looks like a hardware error. The cache level: L1, tx: DATA error can only be the CPU.
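
To see how often this particular error recurs, a minimal sketch like the one below could tally the "cache level" and "tx" fields from saved kernel log text. The pattern is built only from the fields quoted in this thread; the rest of the log line format is an assumption and varies between kernel versions, and `tally_cache_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern based on the fields quoted in this thread ("cache level: L1, tx: DATA").
# Assumption: both fields appear on the same log line.
CACHE_ERR = re.compile(r"cache level: (\w+), tx: (\w+)")

def tally_cache_errors(log_text):
    """Count (cache level, transaction type) pairs found in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CACHE_ERR.search(line)
        if match:
            counts[match.groups()] += 1
    return counts
```

Feeding it the text of a saved `dmesg` dump returns a counter keyed by pairs such as `("L1", "DATA")`, which makes it easier to tell whether the same unit is failing every time.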

Tell us about your hardware and whether any of it is overclocked, even if it's 'tested to be overclocked'.

Linux is very good at finding errors due to overclocking that other operating systems don't tell you about.
That's not to say they are not there; they may just go unreported.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

SATURN_RINGS · n00b Joined: 23 Apr 2021 Posts: 6 Location: Joe

Hu · Administrator Joined: 06 Mar 2007 Posts: 23103

What power supply are you using, and how many powered devices (video cards (1, I assume), disk drives (?), etc.) is it serving? If you're lucky, the problem might be that your power supply cannot reliably service all its consumers, in which case reducing load or improving the power supply might help. Before you change that, we should evaluate what it is rated to be able to do versus what your computer needs it to do.

NeddySeagoon · Posted: Sat Apr 24, 2021 10:59 pm Post subject:

SATURN_RINGS,

I'm with Hu on potential PSU problems.
What exact model of PSU do you have, so we can check the specs?

It can be temperature related too. Do the errors correlate with CPU temperature or with the CPU working hard?
Install lm-sensors if you don't have it and keep an eye on temperatures.
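
For logging temperatures alongside the errors, a rough sketch like this reads the kernel's hwmon sysfs files directly, which is the same interface lm-sensors reports from. The function name `read_temps` is hypothetical, and which chips and labels show up depends on the drivers loaded on your system.

```python
from pathlib import Path

def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every temp*_input file found."""
    temps = {}
    for chip in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            label_file = chip / sensor.name.replace("_input", "_label")
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[f"{chip_name}/{label}"] = int(sensor.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # some sensors refuse to read; skip them
    return temps
```

Calling `read_temps()` once a minute and writing the result to a file with a timestamp would give you something to line up against the times the errors appear.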

When you assembled the heatsink onto the CPU, was the thermal compound the one provided with the heat sink, or did you apply your own?

SATURN_RINGS · n00b Joined: 23 Apr 2021 Posts: 6 Location: Joe

OldTango · l33t Joined: 21 Feb 2004 Posts: 718

All other hardware aside, your Ryzen9-3900x CPU and Nvidia-RTX2080 GPU are the most power hungry. For Nvidia's RTX2080 Super, the minimum recommended power supply is 600 watts, but 650 watts or greater is preferred. The RM850x PSU should be able to handle your hardware.
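
The rated-versus-needed comparison can be sketched as simple arithmetic. The wattages below are nominal figures assumed for illustration, not measurements from this system, and the 20% margin and the `psu_headroom` helper are likewise assumptions.

```python
def psu_headroom(psu_watts, loads_watts, margin=0.2):
    """Return spare watts after summing the loads and reserving a safety margin."""
    required = sum(loads_watts.values()) * (1 + margin)
    return psu_watts - required

# Assumed nominal figures, for illustration only.
loads = {
    "cpu": 142,    # assumed package power limit for a Ryzen 9 3900X
    "gpu": 250,    # assumed board power for an RTX 2080 Super
    "other": 100,  # assumed allowance for motherboard, RAM, drives, fans
}

print(psu_headroom(850, loads))  # spare watts on an 850 W unit
```

With these assumed numbers an 850 W supply is left with roughly 260 W to spare, which is consistent with the point that raw capacity is unlikely to be the issue. It says nothing about how an aging supply behaves under sudden load changes.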

As Neddy has said, your errors are CPU related. So either you have a heating problem, a CPU/RAM compatibility issue, a motherboard/BIOS problem, or possibly a broken CPU.

The Ryzen9-3900x came shipped with a Wraith Prism with RGB LED cooler. I have used that cooler to cool a Ryzen9-3950x successfully in the past. If it is installed properly it should keep the CPU cool enough, even during major package builds.

What is the make and model of your motherboard and what BIOS revision is installed?
What is the make, model and rated speed of your RAM, and in what configuration is it: 4x16GB or 2x32GB?

Best Tango.....

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 197

You might want to see this:

https://forums.gentoo.org/viewtopic-t-1131909-highlight-.html

Power supplies can go bad after about 10 years. Strange things can happen. Lack of heat removal can make parts fail.

molletts · Tux's lil' helper Joined: 16 Feb 2013 Posts: 131

What I'd do if I were you: sell 32 or 48GB of the RAM and buy a good case with high airflow. Your CPU and GPU both need a lot of cooling, with cold air from the outside, not hot, recirculated air from inside the case. The PSU must vent to the outside, and preferably draw in its own supply of external air.

What temperatures are being reported by the CPU's sensors at idle and under load? Likewise the GPU. They will both throttle to lower speeds the hotter they get so cooling them better will improve performance as well as reducing the chances of long-term damage (assuming that the hardware errors don't indicate that this damage has already happened).

If you are getting ECC errors on the CPU cache, you're also likely getting errors in data transferred to/from memory - is your RAM ECC-protected and, if so, have you enabled ECC mode? This won't fix the underlying fault but it might reduce the chance of data corruption (and warn you that it is happening) or at least enable the kernel to halt the system rather than write back uncorrectable corrupt data to the disk.
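
If ECC is active and the kernel's EDAC driver for your memory controller is loaded, the corrected and uncorrected error counts are exposed through sysfs. A minimal sketch for reading them follows; `read_edac_counts` is a hypothetical helper name, and an empty result only suggests that no EDAC memory controller is registered.

```python
from pathlib import Path

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())  # corrected errors
            ue = int((mc / "ue_count").read_text())  # uncorrected errors
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[mc.name] = (ce, ue)
    return counts
```

A corrected count that keeps climbing means ECC is catching memory errors; any uncorrected count is a sign that data may already have been damaged.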

steve_v · Guru Joined: 20 Jun 2004 Posts: 416 Location: New Zealand

NeddySeagoon · Posted: Sun Apr 25, 2021 11:08 am Post subject:

molletts,

I think we need more analysis before spending any money.
It may yet be a dead CPU. The internal ECC errors are purely internal.
RAM to CPU ECC errors look quite different.

My money is on the PSU (in the can), the 12v to CPU Vcore PSU on the motherboard, or the CPU, if it's a hardware issue.
Followed by overheating, possibly due to an overdose of thermal paste.
I wouldn't be surprised to discover some accidental overclocking too.

SATURN_RINGS · n00b Joined: 23 Apr 2021 Posts: 6 Location: Joe

eccerr0r · Posted: Mon Apr 26, 2021 12:34 am Post subject:

Usually external memory errors do not get promoted like what you're seeing.
It's still motherboard, PSU, or CPU; not likely to be RAM unless it's a second order effect. Reducing your clock speed may show some more information about the nature of the issue.

You may remove some RAM to see if the problem still shows up. Again, this is a second order effect, but it may show details of the issue at hand.

Also be sure to update to the latest motherboard BIOS if one is available. An outdated BIOS may program your CPU incorrectly and cause issues like this.

figueroa · Posted: Mon Apr 26, 2021 3:08 am Post subject:

1. Put a real thermometer in your case. Then you'll really know the internal air temperature. I think I paid about US$10 for mine, which has an LCD display and a remote probe connected with a braided metal wire.

2. Increase ventilation -- hack your case if necessary. It's an old case; doesn't have to be pretty. Need fresh air in, hot air out. Or leave the cover off.

I have a server with its UPS that is in a cabinet cubby, which is nice and out of the way, but it doesn't get good ventilation there because only the front of the cubby is open. So, I have a small low-speed household fan blowing cool air into the cubby. It's crude, but simple and it works.
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi

NeddySeagoon · Posted: Mon Apr 26, 2021 9:22 am Post subject:

Even cruder ... open the case for a few days and see if it has any effect.

3600MHz with 4 sticks of RAM may be a problem, but it's not the internal CPU error problem.

wjb · Posted: Mon Apr 26, 2021 9:57 am Post subject:

OldTango · l33t Joined: 21 Feb 2004 Posts: 718

SATURN_RINGS · n00b Joined: 23 Apr 2021 Posts: 6 Location: Joe

figueroa · Posted: Tue Apr 27, 2021 2:05 am Post subject:

You can check:

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 197

NeddySeagoon · Posted: Tue Apr 27, 2021 8:51 am Post subject:

SATURN_RINGS,

OldTango · l33t Joined: 21 Feb 2004 Posts: 718

NeddySeagoon · Posted: Wed Apr 28, 2021 8:26 am Post subject:

SATURN_RINGS,

Don't waste your money on a PSU tester. They can check the PSU under static load conditions but as PSUs age, that's not where problems show.

You need to know what happens to the supply voltage as the CPU goes from idle to full load in one CPU clock. That's less than 1ns.
If the supply can't deliver that gulp, the voltages go out of spec and strange things may happen.
That's called dynamic regulation, and it's very hard to measure.
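
The "less than 1ns" figure follows from the clock frequency. As a quick check, assuming a clock of 3.8 GHz (an illustrative value, not one read from this system):

```python
def clock_period_ns(freq_hz):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / freq_hz

period = clock_period_ns(3.8e9)  # assumed 3.8 GHz clock
print(f"{period:.3f} ns per cycle")
print("under 1 ns:", period < 1.0)
```

Any clock above 1 GHz gives a period under 1 ns, so the load step described here happens in a fraction of a nanosecond.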

JustAnother · Apprentice Joined: 23 Sep 2016 Posts: 197

I've been planning to buy one of those $25 oscilloscopes on eBay. They go to 200 kHz. That, and a terminal strip adapter, can say a lot.

Even if things happen on a much faster time scale there would be artifacts below 200 kHz (I hope).

By the way, most circuit elements like resistors, capacitors, inductors, etc. only act that way up into tens of MHz (or less). After that they become self resonant and behave totally differently. One's first experience with an impedance analyser is a humbling experience. Been there.
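
To put a number on self resonance, here is a small sketch using the standard LC resonance formula f = 1 / (2π√(LC)). The component values are assumptions chosen for illustration: a 100 nF capacitor with 1 nH of parasitic series inductance.

```python
import math

def self_resonant_hz(capacitance_f, parasitic_inductance_h):
    """Self-resonant frequency of a capacitor with series parasitic inductance."""
    return 1.0 / (2.0 * math.pi * math.sqrt(parasitic_inductance_h * capacitance_f))

# Assumed example values: 100 nF capacitor, 1 nH parasitic inductance.
f_res = self_resonant_hz(100e-9, 1e-9)
print(f"{f_res / 1e6:.1f} MHz")
```

With these assumed values the part resonates at about 16 MHz, in line with the "tens of MHz (or less)" rule of thumb. Above that frequency the parasitic inductance dominates and the capacitor stops behaving like one.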