Gentoo Forums
Random Hardware Errors
SATURN_RINGS
n00b
Joined: 23 Apr 2021
Posts: 6
Location: Joe
Posted: Sat Apr 24, 2021 7:07 pm    Post subject: Random Hardware Errors

So I recently finished my first install of Gentoo Linux, but I've run into a bit of a snag. Every now and then while I am using my system, my computer case's speaker will beep and this error will pop up in the terminal:
Code:
Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656467] [Hardware Error]: Corrected error, no action required.

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656473] [Hardware Error]: CPU:7 (17:71:0) MC0_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd820000000100015

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656478] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000003a032802

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656481] [Hardware Error]: Load Store Unit Ext. Error Code: 16, Level 2 TLB parity error.

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656483] [Hardware Error]: cache level: L1, tx: DATA

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656488] [Hardware Error]: Corrected error, no action required.

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656490] [Hardware Error]: CPU:19 (17:71:0) MC0_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd820000000100015

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656493] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000003a032802

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656495] [Hardware Error]: Load Store Unit Ext. Error Code: 16, Level 2 TLB parity error.

Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656496] [Hardware Error]: cache level: L1, tx: DATA


I had the same problem with my previous Arch Linux install and had hoped it would be fixed in Gentoo, but this is not the case. Assistance would be greatly appreciated, thank you. :)
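
To see which logical CPUs report the events and when, log lines like the ones above can be tallied with a short script. This is only a sketch: the regular expression is written against the two MC0_STATUS lines pasted here and may need adjusting for other kernels, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Two status lines copied from the log above.
SAMPLE = """\
: [  627.656473] [Hardware Error]: CPU:7 (17:71:0) MC0_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd820000000100015
: [  627.656490] [Hardware Error]: CPU:19 (17:71:0) MC0_STATUS[Over|CE|MiscV|-|-|-|SyndV|-|-|-]: 0xd820000000100015
"""

# Matches the kernel timestamp, the reporting CPU, the MCA bank and the raw status word.
EVENT = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)\] \[Hardware Error\]: CPU:(?P<cpu>\d+) "
    r".*?MC(?P<bank>\d+)_STATUS\[[^\]]*\]: (?P<status>0x[0-9a-f]+)"
)

def summarize(text):
    """Return (events per logical CPU, list of kernel timestamps in seconds)."""
    per_cpu = Counter()
    times = []
    for m in EVENT.finditer(text):
        per_cpu[int(m["cpu"])] += 1
        times.append(float(m["t"]))
    return per_cpu, times

per_cpu, times = summarize(SAMPLE)
print(dict(per_cpu), times)
```

On a live system the same function could be fed the output of dmesg. Whether CPU 7 and CPU 19 are two threads of one physical core can be checked in /sys/devices/system/cpu/cpu7/topology/thread_siblings_list.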
_________________
Saturn's calling!
eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9895
Location: almost Mile High in the USA
Posted: Sat Apr 24, 2021 8:11 pm

Software does not fix hardware problems.

Software can hide hardware problems; you can just ignore the errors if you wish.

You can try downclocking your hardware to make them go away, but likely you need to check your power supply, motherboard, CPU and heatsink, etc. to see if there's any problem with them.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54851
Location: 56N 3W
Posted: Sat Apr 24, 2021 8:14 pm

SATURN_RINGS,

It really looks like a hardware error. The cache level: L1, tx: DATA error can only be the CPU.

Tell us about your hardware and whether any of it is overclocked, even if it was 'tested to be overclocked'.

Linux is very good at finding errors due to overclocking that other operating systems don't tell you about.
That's not to say they are not there; they may just go unreported.
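
For reference, the status word in the log can be unpacked by hand. The following is a minimal Python sketch, assuming the bit positions of the Linux kernel's MCI_STATUS_* definitions and a 6-bit extended error code field as used on AMD's scalable MCA; the function and table names are made up for illustration. It reproduces the fields the kernel printed (Over, MiscV, SyndV, extended code 16, L1, DATA) from 0xd820000000100015.

```python
# Bit positions assumed from the Linux kernel's MCI_STATUS_* macros.
FLAGS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten before being read
    "UC": 61,     # uncorrected error (clear here, hence "CE" in the log)
    "En": 60,     # reporting was enabled for this error
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
}

TX_TYPE = ["INSN", "DATA", "GEN", "RESV"]    # transaction type, bits 3:2
CACHE_LEVEL = ["RESV", "L1", "L2", "L3/GEN"]  # cache level, bits 1:0

def decode_status(status):
    """Split an MCA status word into flags, error code and extended code."""
    flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF          # low 16 bits: architectural error code
    ext = (status >> 16) & 0x3F     # extended error code (6-bit field assumed)
    out = {"flags": flags, "error_code": code, "ext_code": ext}
    if (code & 0xFFF0) == 0x0010:   # 0000 0000 0001 TTLL encodes a TLB error
        out["kind"] = "TLB"
        out["tx"] = TX_TYPE[(code >> 2) & 3]
        out["level"] = CACHE_LEVEL[code & 3]
    return out

info = decode_status(0xd820000000100015)
print(info)
```

The UC bit is clear, which is why the kernel labels the event CE (corrected).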
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
SATURN_RINGS
n00b
Joined: 23 Apr 2021
Posts: 6
Location: Joe
Posted: Sat Apr 24, 2021 9:04 pm

Dear NeddySeagoon,

Thank you for the reply! No, none of my hardware is overclocked. My PC is custom built though, and the case isn't very... suitable for the hardware I am using. Due to a tight budget as of late, though, I cannot change that. My specifications are as follows:

- 64GB RAM
- Bluray Drive
- Ryzen 3900x
- NVIDIA 2080 Super
- Floppy Drive (For archiving)

My case is unfortunately a hand-me-down from the year 2000, so air flow is limited...
_________________
Saturn's calling!
Hu
Administrator
Joined: 06 Mar 2007
Posts: 23103
Posted: Sat Apr 24, 2021 10:16 pm

What power supply are you using, and how many powered devices (video cards (1, I assume), disk drives (?), etc.) is it serving? If you're lucky, the problem might be that your power supply cannot reliably service all its consumers, in which case reducing load or improving the power supply might help. Before you change that, we should evaluate what it is rated to be able to do versus what your computer needs it to do.
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54851
Location: 56N 3W
Posted: Sat Apr 24, 2021 10:59 pm

SATURN_RINGS,

I'm with Hu on potential PSU problems.
What exact model of PSU do you have, so we can check the specs?

It can be temperature related too. Do the errors correlate with CPU temperature or with the CPU working hard?
Install lm-sensors if you don't have it and keep an eye on temperatures.

When you assembled the heatsink onto the CPU, was the thermal compound the one provided with the heatsink, or did you apply your own?
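
One way to check for a correlation is to log temperatures with timestamps and compare them with the error times in dmesg. Below is a rough Python sketch that reads the kernel's hwmon sysfs files directly (the same data lm-sensors displays); values there are in millidegrees Celsius. The function names and the sampling interval are illustrative assumptions, and on a Ryzen system the CPU sensor is typically exposed by the k10temp driver.

```python
import time
from pathlib import Path

def read_temps(root="/sys/class/hwmon"):
    """Return {"chip/tempN": degrees C} for every hwmon temperature input found."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the driver exposes no "name" file.
        name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse reads or return nothing usable
            label = sensor.name[: -len("_input")]
            temps[f"{name}/{label}"] = millideg / 1000.0
    return temps

def log_temps(interval=5.0, samples=3, root="/sys/class/hwmon"):
    """Print a timestamped reading every `interval` seconds, `samples` times."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(stamp, read_temps(root))
        time.sleep(interval)
```

Calling `log_temps(interval=5, samples=120)` in one terminal while compiling in another gives roughly ten minutes of readings to compare with the kernel log.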
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
SATURN_RINGS
n00b
Joined: 23 Apr 2021
Posts: 6
Location: Joe
Posted: Sat Apr 24, 2021 11:32 pm

Dear NeddySeagoon and Hu,

The power supply in the case is a Corsair RM850x 80 Plus Gold. As far as I know, it's powering about 5 items. This includes:

- Floppy Drive
- Bluray Drive
- DVD Drive (Forgot to mention previously)
- Motherboard
- CPU

Also, the power supply has to be faced inwards to the computer in my situation since there are no vent holes for the power supply.
_________________
Saturn's calling!
OldTango
l33t
Joined: 21 Feb 2004
Posts: 718
Posted: Sun Apr 25, 2021 2:30 am

All other hardware aside, your Ryzen 9 3900X CPU and Nvidia RTX 2080 GPU are the most power hungry. For Nvidia's RTX 2080 Super, the minimum power supply is 600 watts, but 650 watts or greater is recommended. The RM850x PSU should be able to handle your hardware.
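
The arithmetic behind that conclusion can be sketched roughly as follows. The CPU and GPU figures are the nominal ratings (105 W TDP, 250 W board power); the other two entries are loose guesses added for illustration, and real peak draw can exceed the nominal ratings.

```python
# Nominal ratings plus rough guesses; none of these are measurements.
LOADS_W = {
    "Ryzen 9 3900X (105 W TDP)": 105,
    "RTX 2080 Super (250 W board power)": 250,
    "motherboard, RAM, fans (assumed)": 75,
    "optical and floppy drives (assumed)": 40,
}
PSU_RATED_W = 850  # Corsair RM850x

total = sum(LOADS_W.values())
headroom = PSU_RATED_W - total
share = total / PSU_RATED_W
print(f"estimated load: {total} W, headroom: {headroom} W ({share:.0%} of rating)")
```

Even with generous guesses for the smaller consumers, the estimate stays well under the unit's rating, which is consistent with looking beyond raw PSU capacity.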

As Neddy has said, your errors are CPU related. So either you do have a heating problem, or you have a CPU/RAM compatibility issue, a motherboard/BIOS problem, or possibly a broken CPU.

The Ryzen 9 3900X shipped with a Wraith Prism RGB LED cooler. I have used that cooler to cool a Ryzen 9 3950X successfully in the past. If it is installed properly it should keep the CPU cool enough, even during major package builds.

What is the make and model of your motherboard, and what BIOS revision is installed?
What is the make, model and rated speed of your RAM, and in what configuration is it? 16x4-GB or 32x2-GB?

Best Tango..... :)
JustAnother
Apprentice
Joined: 23 Sep 2016
Posts: 197
Posted: Sun Apr 25, 2021 3:46 am

You might want to see this:

https://forums.gentoo.org/viewtopic-t-1131909-highlight-.html

Power supplies can go bad after about 10 years. Strange things can happen. Lack of heat removal can make parts fail.
molletts
Tux's lil' helper
Joined: 16 Feb 2013
Posts: 131
Posted: Sun Apr 25, 2021 9:00 am

What I'd do if I were you: sell 32 or 48GB of the RAM and buy a good case with high airflow. Your CPU and GPU both need a lot of cooling, with cold air from the outside, not hot, recirculated air from inside the case. The PSU must vent to the outside, and preferably draw in its own supply of external air.

What temperatures are being reported by the CPU's sensors at idle and under load? Likewise the GPU. They will both throttle to lower speeds the hotter they get, so cooling them better will improve performance as well as reduce the chances of long-term damage (assuming that the hardware errors don't indicate that this damage has already happened).

If you are getting ECC errors on the CPU cache, you're also likely getting errors in data transferred to/from memory - is your RAM ECC-protected and, if so, have you enabled ECC mode? This won't fix the underlying fault but it might reduce the chance of data corruption (and warn you that it is happening) or at least enable the kernel to halt the system rather than write back uncorrectable corrupt data to the disk.
steve_v
Guru
Joined: 20 Jun 2004
Posts: 416
Location: New Zealand
Posted: Sun Apr 25, 2021 10:10 am

SATURN_RINGS wrote:
Corsair RM850x 80 Plus Gold.

That's a nice PSU, and probably overkill for your system. It'll be barely ticking over unless the GPU is fully loaded, and even then it's got plenty in reserve on the 12v rail.
Unless it's very old (and those have a 10 year warranty IIRC) or you're very unlucky, I doubt it's the problem... It is a modular PSU though, so double check all those molex connectors are properly inserted and locked at both ends, and all the EATX 12v connectors on the board are plugged in.
It sounds silly I know, but I've seen this kind of thing more than once before. Flaky power delivery to the motherboard can absolutely cause the kind of MCE errors you're seeing.

SATURN_RINGS wrote:
the power supply has to be faced inwards to the computer in my situation since there are no vent holes for the power supply.

PSU up or down is largely personal preference, so long as case airflow is adequate. That PSU should be able to handle ~40C inlet temp with nothing more alarming than a bunch of fan noise, and your case would have to be exceptionally horrible to get anywhere near that.
If you mean that the PSU exhaust side is inside the case... Then I'm kinda confused as to how you plug the power cable in, and what exactly you're using as a case to begin with.

OldTango wrote:
either you do have a heating problem or you have a CPU/RAM compatibility issue, a motherboard/bios problem or possibly a broken CPU.

This. ^
I'd start with the easy things, namely monitoring system (particularly CPU) temps and voltages under load, and clearing and updating your BIOS. Try some extremely conservative memory clocks for good measure, and hit up your board manufacturer and/or their discussion boards for any known issues.
If those don't reveal anything, then you're probably looking at swapping out parts until you can isolate the cause. It sucks, but that's how it is: those are hardware errors, and the problem is almost certainly faulty or misconfigured hardware. Distro-hopping sure won't change anything.
It's probably also worth checking that the defaults your BIOS sets are actually sane; it seems to be a thing at the moment for motherboard manufacturers (with the exception of ASUS, apparently) to pull values out of their ass with no regard for CPU spec sheets at all...

molletts wrote:
sell 32 or 48GB of the RAM and buy a good case with high airflow.

A mite premature methinks, only recording temps will tell the real story WRT airflow. I have an overclocked 10900KF in a ~80USD case (with one exhaust fan) right here, and it's just fine.
A literal shoebox will provide perfectly acceptable airflow with the right fans in the right places, and a fancy RGB-lit tempered-glass case with 12 fans won't fix a bad CPU or a dodgy BIOS. IME shiny "gaming" cases are the #1 place people waste money better spent on more useful components.
_________________
Once is happenstance. Twice is coincidence. Three times is enemy action. Four times is Official GNOME Policy.
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54851
Location: 56N 3W
Posted: Sun Apr 25, 2021 11:08 am

molletts,

I think we need more analysis before spending any money.
It may yet be a dead CPU. The internal ECC errors are purely internal.
RAM to CPU ECC errors look quite different.

My money is on the PSU (in the can), the 12v to CPU Vcore PSU on the motherboard, or the CPU, if it's a hardware issue.
Followed by overheating, possibly due to an overdose of thermal paste.
I wouldn't be surprised to discover some accidental overclocking too.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
SATURN_RINGS
n00b
Joined: 23 Apr 2021
Posts: 6
Location: Joe
Posted: Sun Apr 25, 2021 11:14 pm

Dear OldTango,

Apologies for the late response! From what I know, my motherboard is an MSI X570 Gaming Edge WiFi, and it runs BIOS version 7C37v1D. The RAM comes from two different manufacturers, as the computer was upgraded last year. The two manufacturers are G.Skill and Corsair, but the modules are identical specification-wise. The RAM is set up in a 16x4-GB configuration (half being G.Skill, the other half being Corsair) and runs at 3600MHz. And as far as I know, the power supply was purchased last year.
_________________
Saturn's calling!
eccerr0r
Watchman
Joined: 01 Jul 2004
Posts: 9895
Location: almost Mile High in the USA
Posted: Mon Apr 26, 2021 12:34 am

Usually external memory errors do not get promoted like what you're seeing.
It's still motherboard, PSU, or CPU; not likely to be RAM unless it's a second order effect. Reducing your clock speed may show some more information about the nature of the issue.

You may remove some RAM to see if the problem still shows up; again, this is a second order effect but may show details of the issue at hand.

Also be sure to update to the latest motherboard BIOS if one is available. An outdated BIOS may program your CPU wrong and cause issues like this.
_________________
Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching?
figueroa
Advocate
Joined: 14 Aug 2005
Posts: 3007
Location: Edge of marsh USA
Posted: Mon Apr 26, 2021 3:08 am

1. Put a real thermometer in your case. Then you'll really know the internal air temperature. I think I paid about US$10 for mine, which has an LCD display and a remote probe connected with a braided metal wire.

2. Increase ventilation -- hack your case if necessary. It's an old case; doesn't have to be pretty. Need fresh air in, hot air out. Or leave the cover off.

I have a server with its UPS that is in a cabinet cubby, which is nice and out of the way, but it doesn't get good ventilation there because only the front of the cubby is open. So, I have a small low-speed household fan blowing cool air into the cubby. It's crude, but simple and it works.
_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54851
Location: 56N 3W
Posted: Mon Apr 26, 2021 9:22 am

Even cruder ... open the case for a few days and see if it has any effect.

3600MHz with 4 sticks of RAM may be a problem, but it's not the internal CPU error problem.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
wjb
l33t
Joined: 10 Jul 2005
Posts: 644
Location: Fife, Scotland
Posted: Mon Apr 26, 2021 9:57 am

Code:
- Floppy Drive (For archiving)


8O

Disconnect the floppy drive? It's as much use as a chocolate teapot these days, and a possible source of h/w errors if it's an antique.
OldTango
l33t
Joined: 21 Feb 2004
Posts: 718
Posted: Mon Apr 26, 2021 8:28 pm

SATURN_RINGS wrote:
Apologies for the late response! From what I know, my motherboard is an MSI X570 Gaming Edge WiFi, and it runs BIOS version 7C37v1D. The RAM comes from two different manufacturers, as the computer was upgraded last year. The two manufacturers are G.Skill and Corsair, but the modules are identical specification-wise. The RAM is set up in a 16x4-GB configuration (half being G.Skill, the other half being Corsair) and runs at 3600MHz. And as far as I know, the power supply was purchased last year.
That's the latest version available for your motherboard without installing the absolute latest beta version which I don't recommend unless it addresses your particular issue. I haven't found any info on the beta version as yet.

For reference:
I am using an MSI MEG X570 ACE motherboard, and it is using the same BIOS revision as yours. The processor is a Zen 3 AMD Ryzen 9 5950X with 128GB G.Skill Trident Z Royal @ 3200MHz. The power supply is an older Corsair HX1100. Because the default JEDEC standard speed for the RAM is 2133MHz, getting my RAM to run at 3200MHz required me to enable XMP. While this seems like a normal step to take, it is in fact a form of OVERCLOCKING and requires user intervention to complete in most cases. The RAM needs to be XMP compatible as well, or issues can arise.

My system idles around 34C, runs between 40C and 50C under average load, and reaches around 65C under heavy load and major compiles. The CPU has a max temp rating of 90C, and if this temp is reached the system would simply shut down. These CPUs run hot by their nature, so cooling is very important. AMD's Precision Boost attempts to balance core clock speeds with workload. Excessive heat can result in unwanted performance loss and may produce CPU errors.

I have been building custom PCs for 22 years and have seen a host of problems in that time.
When PSUs fail, they will almost always cause random reboots or shutdowns (which are much harder to diagnose because other components can cause the same symptoms), the system will no longer power on, or, worst of all, they go POP and start smoking.
Failing motherboards either won't POST or won't boot the system up. They can fail to find components connected to the system buses, etc. Usually, if they POST and boot without errors and you encounter random errors while using the system, a BIOS update may fix the problem if it's a known issue and if it is BIOS related.

Where to look:
Your CPU has a max temp of 95C.
Your Corsair RM850x PSU is designed for silent operation. Under normal operation or at idle the PSU fan should not be running. If it is, there could be a heating issue of some kind or a PSU issue. You need to know for sure that the CPU and its cooling system are mounted properly with a high quality thermal compound. Follow Neddy's advice and remove the cover; also place a window fan in a location that can help provide better cooling. You need to eliminate HEAT as a potential problem.

Your RAM:
It's not recommended to mix RAM from different manufacturers. While it may work, and I've done it myself in the past, it could introduce problems. MSI lists a lot of RAM speeds that the board can run, but AMD rates your CPU's RAM support at 3200MHz.

I noticed the errors you reported have a time stamp of at least four months ago.
Have they happened since, and how often?
Have they appeared in "dmesg" as well?
Have you encountered any other errors of any kind, like during system updates, where packages failed to build for any reason not bug related?
When you installed "Gentoo" did you compile AMD's Firmware into the kernel?
What's the output of:
Code:
emerge --info


As Neddy stated, Linux, and Gentoo in particular (being source based), is very good at exposing hardware issues.

Best Tango..... :)
SATURN_RINGS
n00b
Joined: 23 Apr 2021
Posts: 6
Location: Joe
Posted: Tue Apr 27, 2021 12:59 am

Dear OldTango,

It's funny you should mention random shutdowns, reboots, and trouble starting the system in general, because I've been running into a lot of those lately. I'll be in the middle of a task and the computer just shuts off sometimes. And sometimes it takes several tries for the computer to boot past the blinking underscore, and even then it sometimes crashes on the BIOS Boot Menu screen. It's not unusable, but it can be a time consumer. I'll try fan cooling tomorrow; it's a bit late right now. :)

As for the errors, I have just noticed it says the date is December 31st of 1969. That's obviously wrong, but when I run the date command it shows the right time. But to answer your question, they are still happening and pop up every 1 to 5 minutes, and the errors do appear in dmesg.
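
The odd date is worth a second look. A quick check, assuming the machine's local time zone is UTC-5 (a guess based on the 19:10 wall time), shows the logged date sits only about ten minutes after the Unix epoch, close to the 627 second kernel timestamp in the same messages. That suggests the logger treated a seconds-since-boot value as a calendar time rather than the clock being wrong, though this is an inference, not something confirmed in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumed local offset of the machine that produced the log (not stated in the thread).
LOCAL = timezone(timedelta(hours=-5))

# "Wed Dec 31 19:10:08 1969" as printed by syslogd.
logged = datetime(1969, 12, 31, 19, 10, 8, tzinfo=LOCAL)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

seconds_after_epoch = (logged - epoch).total_seconds()
print(seconds_after_epoch)  # 608.0
```

If that reading is right, the system clock itself is fine, which matches the date command showing the correct time.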

Here is the output of 'emerge --info':
Code:
Portage 3.0.17 (python 3.8.8-final-0, default/linux/amd64/17.1/desktop, gcc-10.2.0, glibc-2.32-r7, 5.10.27-gentoo-x86_64 x86_64)
=================================================================
System uname: Linux-5.10.27-gentoo-x86_64-x86_64-AMD_Ryzen_9_3900X_12-Core_Processor-with-glibc2.2.5
KiB Mem:    65846880 total,  64064412 free
KiB Swap:  134217724 total, 134217724 free
Timestamp of repository gentoo: Sat, 24 Apr 2021 00:30:01 +0000
Head commit of repository gentoo: dd069ebac8b0f15edc1dee19bb77f9611b5a812a
sh bash 5.0_p18
ld GNU ld (Gentoo 2.35.2 p1) 2.35.2
app-shells/bash:          5.0_p18::gentoo
dev-lang/perl:            5.30.3::gentoo
dev-lang/python:          3.8.8_p1::gentoo, 3.9.2_p1::gentoo
dev-lang/rust:            1.51.0-r2::gentoo
dev-util/cmake:           3.18.5::gentoo
sys-apps/baselayout:      2.7::gentoo
sys-apps/openrc:          0.42.1-r1::gentoo
sys-apps/sandbox:         2.20::gentoo
sys-devel/autoconf:       2.13-r1::gentoo, 2.69-r5::gentoo
sys-devel/automake:       1.16.2-r1::gentoo
sys-devel/binutils:       2.35.2::gentoo
sys-devel/gcc:            10.2.0-r5::gentoo
sys-devel/gcc-config:     2.4::gentoo
sys-devel/libtool:        2.4.6-r6::gentoo
sys-devel/make:           4.3::gentoo
sys-kernel/linux-headers: 5.10::gentoo (virtual/os-headers)
sys-libs/glibc:           2.32-r7::gentoo
Repositories:

gentoo
    location: /var/db/repos/gentoo
    sync-type: rsync
    sync-uri: rsync://rsync.gentoo.org/gentoo-portage
    priority: -1000
    sync-rsync-verify-max-age: 24
    sync-rsync-extra-opts:
    sync-rsync-verify-metamanifest: yes
    sync-rsync-verify-jobs: 1

ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=znver2 -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-march=znver2 -O2 -pipe"
DISTDIR="/var/cache/distfiles"
ENV_UNSET="CARGO_HOME DBUS_SESSION_BUS_ADDRESS DISPLAY GOBIN GOPATH PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR"
FCFLAGS="-march=znver2 -O2 -pipe"
FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles ipc-sandbox merge-sync multilib-strict network-sandbox news parallel-fetch pid-sandbox preserve-libs protect-owned qa-unresolved-soname-deps sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-march=znver2 -O2 -pipe"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="en_US.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j12"
PKGDIR="/var/cache/binpkgs"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_TMPDIR="/var/tmp"
USE="X a52 aac acl acpi alsa amd64 berkdb branding bzip2 cairo cdda cdr cli crypt cups dbus dri dts dvd dvdr elogind emboss encode exif flac fortran gdbm gif gpm gtk gui iconv icu ipv6 jpeg lcms libglvnd libnotify libtirpc mad mng mp3 mp4 mpeg multilib ncurses nls nptl ogg opengl openmp pam pango pcre pdf png policykit ppds qt5 readline sdl seccomp spell split-usr ssl startup-notification svg tcpd tiff truetype udev udisks unicode upower usb vorbis wxwidgets x264 xattr xcb xml xv xvid zlib" ABI_X86="64" ADA_TARGET="gnat_2018" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon sheets words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="mmx mmxext sse sse2" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 timing tsip tripmate tnt ublox ubx" GRUB_PLATFORMS="efi-64" INPUT_DEVICES="libinput" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LUA_SINGLE_TARGET="lua5-1" LUA_TARGETS="lua5-1" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php7-3 php7-4" POSTGRES_TARGETS="postgres10 postgres11" PYTHON_SINGLE_TARGET="python3_8" 
PYTHON_TARGETS="python3_8" RUBY_TARGETS="ruby26" USERLAND="GNU" VIDEO_CARDS="amdgpu fbdev intel nouveau radeon radeonsi vesa dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq proto steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CC, CPPFLAGS, CTARGET, CXX, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, LINGUAS, PORTAGE_BINHOST, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, RUSTFLAGS


EDIT: Apologies! I nearly forgot to mention, I'm not 100% certain but I do think I compiled the AMD Firmware into the kernel. I used genkernel so I'm not entirely positive, sorry.
_________________
Saturn's calling!
figueroa
Advocate
Advocate


Joined: 14 Aug 2005
Posts: 3007
Location: Edge of marsh USA

PostPosted: Tue Apr 27, 2021 2:05 am    Post subject: Reply with quote

You can check:
Code:
grep -i firmware /usr/src/linux/.config

_________________
Andy Figueroa
hp pavilion hpe h8-1260t/2AB5; spinning rust x3
i7-2600 @ 3.40GHz; 16 gb; Radeon HD 7570
amd64/23.0/split-usr/desktop (stable), OpenRC, -systemd -pulseaudio -uefi
JustAnother
Apprentice
Apprentice


Joined: 23 Sep 2016
Posts: 197

PostPosted: Tue Apr 27, 2021 4:57 am    Post subject: Reply with quote

Quote:
It's funny you should mention random shutdowns, reboots, and trouble starting the system in general, because I've been running into a lot of those lately. I'll be in the middle of a task and the computer just shuts off sometimes. And sometimes it takes several tries for the computer to boot past the blinking underscore, and even then it sometimes crashes on the BIOS Boot Menu screen.


I once had the same problem. The blinking underscore in the upper left was "obviously" the hard drive not talking to the BIOS.
This turned out not to be the case. The computer had a front panel module where you could plug in all those weird obsolete memory gadgets, as well as USB. It was the module that was acting up, not the hard drive. Once I unplugged the module, the BIOS worked.

The ability to permute hardware is the best tactic for finding some of these problems. Each problem however is unique. PITA.
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54851
Location: 56N 3W

PostPosted: Tue Apr 27, 2021 8:51 am    Post subject: Reply with quote

SATURN_RINGS,

Code:
Message from syslogd@  at Wed Dec 31 19:10:08 1969 ...
: [  627.656488] [Hardware Error]: Corrected error, no action required.


That's a very peculiar time. *NIX time started at 00:00:00 on 1-Jan-1970 UTC.
System time was set [ 627.656488] seconds ago, or 10 min 27.65.. seconds ago.

From that we can deduce that the system has not yet synchronised the clock to the BIOS, which happens very early in boot, within a few seconds or less, not minutes.
The only way to get a timestamp before time began is to apply the timezone correction, but that's set after system time has been synced from the BIOS.
In this case, UTC-5, the East Coast of the USA timezone. That's a whole strip. Other places use it too.
How do you get the timezone set but system time not synced to the BIOS?

There's more. What happened to the approx 20 sec between Wed Dec 31 19:10:08 1969 and [ 627.656488]?
The messages may not be atomic but 20 sec is a huge lag.
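
For anyone following along, here is a rough sketch of the arithmetic behind that 20 sec figure, in plain sh. It assumes the syslogd stamp is local time at UTC-5 and that the kernel's [ 627.656488] counts from a system clock that started at zero. The variable names are made up for illustration.

```shell
#!/bin/sh
# syslogd stamp: Wed Dec 31 19:10:08 1969, assumed to be local time at UTC-5.
# Shifted to UTC that is 00:10:08 on 1 Jan 1970, i.e. this many seconds
# after the start of *NIX time:
wall_secs=$(( 0 * 3600 + 10 * 60 + 8 ))

# Kernel stamp from the same message, [  627.656488], truncated to whole seconds.
kernel_secs=627

# Gap between the two clocks for what should be the same event.
lag=$(( kernel_secs - wall_secs ))

echo "wall clock: ${wall_secs}s  kernel: ${kernel_secs}s  lag: ${lag}s"
```

With the fractional part included, the gap is a little under 20 sec, which matches the estimate above.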

The symptoms you describe point to cooling and power, in that order.

Tell us how you assembled the heatsink to the CPU.
Did it come with a thermal pad in place, so you only removed the protective film or did you apply your own thermal paste?

I'm asking as I have seen many home builds go wrong here.
Protective film still in place ... excessive thermal compound, so it keeps the heat in.

-- Edit --

Genkernel will do what it's told with firmware. If you didn't tell it, you don't have it.
dmesg will show the firmware being updated.
Code:
$ dmesg | grep updated
[    6.604250] microcode: microcode updated early to new patch_level=0x010000dc

That's on my Phenom II.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
OldTango
l33t
l33t


Joined: 21 Feb 2004
Posts: 718

PostPosted: Wed Apr 28, 2021 2:08 am    Post subject: Reply with quote

SATURN_RINGS wrote:
It's funny you should mention random shutdowns, reboots, and trouble starting the system in general, because I've been running into a lot of those lately. I'll be in the middle of a task and the computer just shuts off sometimes. And sometimes it takes several tries for the computer to boot past the blinking underscore, and even then it sometimes crashes on the BIOS Boot Menu screen. It's not unusable, but it can be a time consumer. I'll try fan cooling tomorrow, it's a bit late right now. :)
All of this new information is confirming to me that you have a HEAT or PSU issue. You have to eliminate or confirm HEAT as the problem first.

I will address the HEAT issue first:
As I have said, your processor has a max temp limit of 90C. It can run hotter than that before it is physically damaged beyond repair, but the CPU along with the BIOS will prevent that from happening: once the system hits that threshold it will instantly shut down. If you have the BIOS set to recover from a power failure then the system will attempt to reboot. If it shut down because of excessive HEAT it can't and won't reboot until the temps have reached a safe level, at which point the BIOS will POST, but it should then halt before any OS is booted and ask whether you want to enter the BIOS setup menu or boot an OS.

Without some way to monitor the CPU temps, like lm-sensors, this can be very difficult to verify, because if it is HEAT related it also appears to be progressive: temps start off normal, but over time and system use they continue to climb until the max temp is reached. The cooling system is not performing its job well enough to cool the CPU.
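
If lm-sensors isn't installed yet, the kernel's hwmon files under /sys can be read directly. This is only a minimal sketch: it assumes a hwmon driver for the CPU is loaded (on this family of AMD CPUs that would normally be k10temp, but that is an assumption about the kernel config) and it uses the 90C figure quoted above as the threshold. The function names are made up for illustration.

```shell
#!/bin/sh
# hwmon reports temperatures in millidegrees Celsius; convert to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print every temperature input the kernel exposes and flag anything at or
# above the limit (default 90, the max temp quoted in this thread).
report_temps() {
    limit=${1:-90}
    for f in /sys/class/hwmon/hwmon*/temp*_input; do
        [ -r "$f" ] || continue                  # glob matched nothing, or unreadable
        raw=$(cat "$f" 2>/dev/null) || continue  # some sensors refuse reads
        [ -n "$raw" ] || continue
        c=$(millideg_to_c "$raw")
        name=$(cat "${f%/*}/name" 2>/dev/null)   # driver name, e.g. k10temp
        flag=""
        if [ "$c" -ge "$limit" ]; then
            flag="  <-- at or above ${limit}C"
        fi
        echo "${name:-unknown} ${f##*/}: ${c}C${flag}"
    done
}

report_temps 90
```

Running it every couple of seconds while the machine is under load (for example from a loop with sleep) should show whether the temps creep up over time, as described above.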

You could shut the system down for 30 minutes or so. Boot and go into the BIOS then open the SYSTEM MONITOR and monitor the temps. The BIOS alone won't really put any stress on the CPU. The temps should remain stable, with very little fluctuation but if they climb and continue to climb then for sure you have a cooling issue.

The PSU:
The simplest way for you to determine whether the PSU is faulty would be to borrow or buy a 650W or 750W unit, hook it up, and see if it resolves the problems. If it does, replace your PSU. Corsair's warranty on that unit is 10 years, so request an RMA. You can buy a power supply tester for around $15 US on Amazon. It can at least tell you if the PSU is providing proper voltages on all rails. I prefer to use a known working unit attached to a live system where I can put it under a load. A PSU might test good on a bench but fail or be unstable under a load.

SATURN_RINGS wrote:
As for the errors, I have just noticed it says the date is December 31st of 1969. That's obviously wrong, but when I entered the date command it shows up with the right time. But to answer your question, they are still happening and pop-up every 1 to 5 minutes, and the errors do appear in dmesg.
Sorry, I skipped right over the year part of that error. Either the person who built the system never set the BIOS clock, the CMOS battery is shot, or the BIOS is resetting itself to factory defaults because of the hard shutdowns. I am assuming you have NTP installed and running. During system boot up NTP will sync your system clock to some time server and adjustments are made for your time zone. During a normal reboot or shutdown operation NTP will sync your hardware clock with your system clock. If the system is shut down because of hardware failures, NTP will never be run and the hardware clock won't be synchronized. After a hard shutdown you will see this as a CLOCK SKEW error during the early initialization process when booting. Neddy can confirm or correct me here. It's been a while since I last looked at the NTP specs.
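
One way to see whether the hardware clock has drifted or been reset is to compare it with system time directly. A rough sketch, assuming the RTC shows up as rtc0 and is kept in UTC. If it is kept in local time, the skew will simply show the timezone offset (about 18000 s for UTC-5). The function name is made up for illustration.

```shell
#!/bin/sh
# Absolute difference in seconds between two epoch timestamps.
clock_skew() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# The kernel exposes the hardware clock as seconds since the epoch here.
# rtc0 is an assumption; a board could expose a different rtcN.
rtc_file=/sys/class/rtc/rtc0/since_epoch

if [ -r "$rtc_file" ]; then
    rtc=$(cat "$rtc_file")
    sys=$(date +%s)
    echo "RTC: $rtc  system: $sys  skew: $(clock_skew "$rtc" "$sys")s"
else
    echo "no readable $rtc_file on this machine"
fi
```

If the skew is large after a hard shutdown, running hwclock --systohc as root writes the current system time back to the hardware clock.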

Please answer Neddy's questions.
Also I am just assuming you used the cooler that was shipped with your CPU. Please confirm this or provide the information on the cooler installed.

Best Tango..... :)
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 54851
Location: 56N 3W

PostPosted: Wed Apr 28, 2021 8:26 am    Post subject: Reply with quote

SATURN_RINGS,

Don't waste your money on a PSU tester. They can check the PSU under static load conditions but as PSUs age, that's not where problems show.

You need to know what happens to the supply voltage as the CPU goes from idle to full load in one CPU clock. That's less than 1ns.
If the PSU can't supply that gulp, the voltages go out of spec and strange things may happen.
That's called the dynamic regulation and it's very hard to measure.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
JustAnother
Apprentice
Apprentice


Joined: 23 Sep 2016
Posts: 197

PostPosted: Wed Apr 28, 2021 7:01 pm    Post subject: Reply with quote

I've been planning to buy one of those $25 oscilloscopes on ebay. They go to 200 kHz. That, and a terminal strip adapter, can say a lot.

Even if things happen on a much faster time scale there would be artifacts below 200 kHz (I hope).

By the way, most circuit elements like resistors, capacitors, inductors, etc. only act that way up into tens of MHz (or less). After that they become self resonant and behave totally differently. One's first experience with an impedance analyser is a humbling experience. Been there.
All times are GMT