ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Thu Mar 14, 2019 12:40 pm Post subject: Isolating memory failures |
|
|
I've had a randomly recurring problem with system freezes, usually when doing Gaming Things (shoutout to wine's excellent progress with supporting modern games). I ran furmark without issue, and ran 'stress' on the CPU, also without issue. However, running memtest86+ caused the computer to switch off and restart, usually about 10% into the test. I pulled out all the RAM sticks, and put them in one by one, getting the same problem (at different points, lasting longer if I told memtest to use SMP) on every stick in any slot.
This was suggested to me elsewhere to be a fault with the mobo's RAM bus. Is there a way to isolate this better? The problem with system freezes occurs more frequently when it's hot out, which suggests to me some component on the mobo is burned, but I'd like to know a bit better before I plunk down another €500 for a new CPU/mobo combo.
Cheers,
EE |
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9895 Location: almost Mile High in the USA
|
Posted: Thu Mar 14, 2019 3:31 pm Post subject: |
|
|
If the computer powers down while running a RAM test, I'd look into power problems.
Since you mention that it fails more often during hot weather, you should see if the cooling system is working. Ensure everything is clean of dust.
Your RAM is probably fine, but the motherboard and the power supplies (including the ones on the motherboard) are suspect. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Thu Mar 14, 2019 5:23 pm Post subject: |
|
|
CPU and GPU are on their own water loop with an external radiator. The case is vertically oriented (that is, the ports are on the top of the case, not the back, so airflow moves bottom-to-top over the cards).
I was told that PSUs only last "5-8 years," and this one's over a decade old. Is there any way to test that the PSU is faulty besides swapping it out?
Cheers,
EE
(this would be, honestly, a preferable diagnosis to a bad mobo, because the PSU is cheap to replace and I'd have to replace the entire CPU/RAM/mobo set otherwise) |
|
Jaglover Watchman
Joined: 29 May 2005 Posts: 8291 Location: Saint Amant, Acadiana
|
Posted: Thu Mar 14, 2019 5:49 pm Post subject: |
|
|
You could measure the voltages on the ATX connector. Don't unplug it: it is important to measure under load. However, this test would not reveal whether some rail provides "dirty" power. But then again, if it is that old, why not replace it? There are PSU testers, but the cheap ones are useless because they do not put any load on the PSU. _________________ My Gentoo installation notes.
Please learn how to denote units correctly! |
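For anyone taking the multimeter route described above, a minimal sketch of the comparison step follows. It assumes the usual ATX tolerances as I understand them (±5 % on the positive rails, ±10 % on -12 V); the readings in the example are hypothetical, not measurements from this machine, and a steady reading in range still says nothing about ripple or transient response.

```python
# Nominal ATX rail voltages and their allowed tolerance (fraction of nominal).
# Assumption: +/-5 % on the positive rails, +/-10 % on the -12 V rail.
ATX_RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
}


def check_rail(name, measured):
    """Return (in_spec, low, high) for one multimeter reading in volts."""
    nominal, tol = ATX_RAILS[name]
    # sorted() keeps low <= high for the negative rail as well.
    low, high = sorted((nominal * (1 - tol), nominal * (1 + tol)))
    return low <= measured <= high, low, high


# Hypothetical readings, taken with the connector plugged in and the system under load.
readings = {"+3.3V": 3.28, "+5V": 5.02, "+12V": 11.30}
for rail, volts in readings.items():
    ok, low, high = check_rail(rail, volts)
    status = "OK" if ok else "OUT OF SPEC"
    print(f"{rail}: {volts:.2f} V ({status}, allowed {low:.2f} to {high:.2f} V)")
```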
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Thu Mar 14, 2019 6:52 pm Post subject: |
|
|
ExecutorElassus,
If it's the PSU, you need to test the dynamic regulation. That's hard.
The problem is that the voltages need to stay within spec when the CPU goes from almost nothing to full power in one CPU clock.
With a 3 GHz CPU clock that's not very long (about 333 ps).
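As a quick check of the arithmetic: the length of one clock cycle is simply 1/f. A small sketch, with a helper name chosen only for illustration:

```python
def clock_period_ps(freq_hz):
    """Return the duration of one clock cycle in picoseconds."""
    return 1.0 / freq_hz * 1e12


# One cycle of a 3 GHz clock, rounded to whole picoseconds.
print(f"{clock_period_ps(3e9):.0f} ps")
```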
It gets worse. The CPU and memory subsystem have their own PSU on the motherboard. This takes 12 V out of the tin can PSU and turns it into the voltages required by the CPU and memory.
This bit gets a very hard life and, as a result, fails more often than the PSU you are thinking of replacing.
At over 10 years old, if the rest of the system is of an age with the PSU, failures here can often be spotted with the Mk1 eyeball.
Look at the capacitors around the CPU. Be sure that they are not leaking, bulging, or tipped over. Those are all signs of failure.
Replacing these parts (they must all be done together) is a job requiring intermediate soldering skills.
So far, you have identified a systems problem that is probably not RAM.
Test the RAM with memtest86+ in another system. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Thu Mar 14, 2019 7:03 pm Post subject: |
|
|
Hi Neddy!
the mobo/CPU/RAM were all from 2016; the GPU from 2009, the PSU from around 2008.
The intermittent problem I'm having is that the system will freeze, requiring a hard reboot. It happens more often during warm weather, and more often when doing memory-intensive things (or at least I assume that's the case, since the game I play really only uses one core and doesn't do a lot of HDD writes).
As I said, another contact suggested the RAM bus is failing. But maybe it's the PSU on the mobo?
I'll have to ask around to see if I can find anybody who even has a desktop.
Is there any other way to test besides finding another machine?
Cheers,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Thu Mar 14, 2019 7:17 pm Post subject: |
|
|
ExecutorElassus,
In 2016, the memory controller was built into the CPU. The bus is the tracking and terminating resistors at each end.
These resistors are on the RAM sticks at the RAM stick end and on the motherboard at the CPU end.
Assuming you have 4 or 6 memory sockets, then 4 or 6 parts of the RAM bus have failed.
That's unlikely.
Can you post images of the region of the motherboard around the CPU?
Don't remove the CPU heatsink yet, let's see what we can see first. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Thu Mar 14, 2019 7:55 pm Post subject: |
|
|
Hi Neddy,
the best I could do without cracking the case is this photo. Sorry for all the tubing (and yes, the green coolant suggests to me that the GPU could probably do with a replacement; I'm gonna try that by year's end).
See anything useful there?
Cheers,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Thu Mar 14, 2019 8:27 pm Post subject: |
|
|
ExecutorElassus,
It's difficult to see in that image.
There are 11 tubular silver things at the top and top right side of the water block.
There are more above the finned black and red heatsink. Those are the bits we need to see.
The finned heatsink carries the switching transistors for the CPU power supply.
The black D on the tops indicates polarity. That's important if/when you come to replace the parts.
The connector in the top right of the image, with the black and yellow wires, is the 12 V input to the CPU PSU.
My connector is charred and it has most of the plastic missing from the PSU cable. Every now and again it goes high resistance and I have to clean it.
Yours looks OK. It's worth pulling it apart to inspect.
Better images would be useful. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Tue Mar 19, 2019 9:22 am Post subject: |
|
|
Hi Neddy,
I finally cracked the case open and got some better photos. I don't see any obvious damage here, and the cable appears to be properly seated (though there's a second cable, lower down and smaller, going into the mobo that I didn't check). Can you see anything here that looks like an obvious culprit?
Thanks for the help,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Tue Mar 19, 2019 10:21 am Post subject: |
|
|
ExecutorElassus,
Both photos look good.
-- edit --
All look good. - Missed one. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Wed Mar 20, 2019 6:12 pm Post subject: |
|
|
All right. So should I go back to swapping out the PSU, or is there some other check I can make to try to narrow it down?
Cheers,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Wed Mar 20, 2019 6:24 pm Post subject: |
|
|
ExecutorElassus,
Testing by substitution is indeed the next step.
Order doesn't matter but just one thing at a time. If you have a PSU to hand, swap it.
Likewise, try parts from this system elsewhere. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
artbody Guru
Joined: 15 Sep 2006 Posts: 494 Location: LB
|
Posted: Thu Mar 21, 2019 3:50 pm Post subject: |
|
|
I don't know a lot about PSUs,
but I've always had sys-apps/lm_sensors, with GKrellM for visualisation, installed and configured on my PC,
so I can always see what temperature the GPU and CPU are at.
The other thing I would suggest is a memtest. _________________ Never give up
WM : E16 the true enlightenment
achim |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Fri Mar 22, 2019 5:54 am Post subject: |
|
|
hi artbody,
I have gkrellm running. Neither CPU nor GPU ever get near redline (CPU maxes out around 55°C under 100% load, occasionally spiking to a bit over 60°C; GPU never gets above about 45°C).
As I said upthread, running memtest causes the machine to switch off and restart.
Unfortunately, I don't have a spare machine, and I don't know anybody who has a PC. I could maybe ask at the University if the lab has a spare PSU they could lend, and try that out. I'll let you know.
Cheers,
EE |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Thu Apr 04, 2019 11:42 am Post subject: |
|
|
Update: I installed a new PSU. Same problem with memtest. I don't know yet if the machine freezes (it does it randomly). However, I also dusted off all the fans, and now both CPU and GPU temps are much lower, even when gaming.
I wonder if the problem might be heat-related. With both the CPU and GPU water-cooled, they don't register very high temperature, and since the case fans are motherboard-controlled, maybe there's not enough airflow in the case and some other component is overheating? It seems to freeze more when it's hot out, and less after I clean the fan filters. That still doesn't explain the switch-off running memtest.
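One way to look beyond the CPU and GPU readings is to list every temperature sensor the kernel exposes, not only the ones a monitor happens to display. The sketch below assumes the standard Linux hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius); which chips and labels show up depends on the board and the loaded drivers, and many boards have no sensor near the voltage regulators at all.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip, label): degrees_C} for every temp*_input found under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # temp1_input may have a matching temp1_label with a readable name.
            label_file = input_file.with_name(input_file.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else input_file.name
            try:
                # The kernel reports temperatures in millidegrees Celsius.
                temps[(chip, label)] = int(input_file.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # some sensors exist but cannot be read
    return temps


if __name__ == "__main__":
    for (chip, label), deg in read_hwmon_temps().items():
        print(f"{chip:12s} {label:20s} {deg:5.1f} °C")
```

Logging this output periodically while gaming would show whether any board-level sensor climbs while the water-cooled parts stay cool.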
But in any case, with a new PSU I still don't know what the problem is, because it evidently isn't solved completely yet. Neddy, any idea what to try next? I was going to see if I could borrow a vid card from the University lab, and see if maybe that might be the problem. The GPU is now the oldest component (it's almost 9 years old), and I heard that vid card problems can affect memtest.
Any other suggestions?
Cheers,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Thu Apr 04, 2019 8:14 pm Post subject: |
|
|
ExecutorElassus,
Try turning off Message Signalled Interrupts. Add to your kernel line in grub.conf.
You get a small performance penalty for that.
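The parameter itself is not shown in the post as it appears here. The usual kernel option for disabling MSI system-wide is `pci=nomsi`, and the sketch below assumes that is what was meant. After a reboot it checks the running kernel's command line to confirm the option was actually picked up; the function names are illustrative only.

```python
from pathlib import Path


def pci_options(cmdline):
    """Collect every option passed via pci=... on a kernel command line string."""
    options = []
    for token in cmdline.split():
        if token.startswith("pci="):
            # pci= accepts a comma-separated list, e.g. pci=nomsi,noaer
            options.extend(token[len("pci="):].split(","))
    return options


def msi_disabled(cmdline):
    """True if the (assumed) pci=nomsi option is present."""
    return "nomsi" in pci_options(cmdline)


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        print("MSI disabled via pci=nomsi:", msi_disabled(cmdline_file.read_text()))
```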
Normal IRQs and MSI work quite differently.
In the old way, the address of the interrupt service routine is stored in a look up table. If the IRQ is shared, the service routine has to query every device in the list until it finds the device that raised the IRQ.
With MSI, the device is programmed with the address of the IRQ when the interrupt service routine is installed.
When the IRQ is acknowledged, the device puts this address on the bus and the CPU jumps to it.
It's more complex and has tighter timing constraints than the old way.
Sometimes, turning it off fixes hard-to-track-down lockups. When MSIs fail, the CPU can jump anywhere.
Test: when the system locks up, is the CPU halted?
In the halt state, pressing the reset button will not restart the CPU. Only the power button will bring it out of halt.
If the CPU is halted, it's got itself in a big mess, like it would if it jumped to something that was not code in response to an interrupt.
GPUs generate a lot of IRQs ...
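To see which devices raise the most interrupts, and whether they use MSI, the counters in /proc/interrupts can be summed per line. This is a minimal sketch that assumes the usual layout (a header of CPU columns, then one row per interrupt with per-CPU counts followed by a description); the exact description text varies between kernels and drivers.

```python
from pathlib import Path


def parse_interrupts(text):
    """Return a list of (irq, total_count, description) from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        irq, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break  # rows such as ERR carry fewer counters than CPUs
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        rows.append((irq.strip(), sum(counts), description))
    return rows


def busiest(rows, n=5):
    """Return the n rows with the highest total interrupt count."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


if __name__ == "__main__":
    source = Path("/proc/interrupts")
    if source.exists():
        for irq, total, description in busiest(parse_interrupts(source.read_text())):
            kind = "MSI" if "MSI" in description else "other"
            print(f"{irq:>5} {total:>12} [{kind}] {description}")
```

Comparing the output before and after disabling MSI shows whether the GPU's interrupt has really moved off MSI.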
Look at /proc/interrupts _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Thu Apr 04, 2019 8:33 pm Post subject: |
|
|
Hi Neddy,
I'll try disabling MSI on next reboot. Would that also affect memtest?
When the system locks up (which so far has only once happened outside of gaming, and then immediately after closing the game), the reset button restarts the machine.
So far, though, after dusting off my fan filters, I haven't had any lockups. But since it's quite random, I don't know if that means anything.
I'm going to try borrowing a vid card and see if that solves the memtest issue.
Stay tuned,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Fri Apr 05, 2019 8:29 pm Post subject: |
|
|
ExecutorElassus,
Turning off MSI fixes lots of marginal timing things. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
C5ace Guru
Joined: 23 Dec 2013 Posts: 489 Location: Brisbane, Australia
|
Posted: Sat Apr 06, 2019 4:58 pm Post subject: |
|
|
Had the same problem during our summer with my 9 year old 24/7 system. Fixed it by replacing the dried out thermal paste with fresh thermal paste between the CPU and heatsink. _________________ Observation after 30 years working with computers:
All software has known and unknown bugs and vulnerabilities. Especially software written in complex, unstable and object oriented languages such as perl, python, C++, C#, Rust and the likes. |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Sat Apr 06, 2019 5:31 pm Post subject: |
|
|
yup, that too. I've since learned that good thermal paste only lasts maybe 6 months, so now I replace it regularly. I also (again) discovered that I need to give the whole case, and especially the fan filters, a thorough vacuuming at least a few times a year. Drops the running temp down a good 30°C.
But since the freezes I was having happened apparently randomly (and only really when gaming) I have no way to know whether I've resolved the problem, except by inference and probability. The longer it goes without freezing, the more likely it looks that my problem, perhaps unrelated to memtest, was simply a problem with something in the case overheating.
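The "inference and probability" can be put into rough numbers. The sketch below assumes, purely for illustration, that freezes arrive independently at a constant average rate (a Poisson process), which a heat-dependent fault will not strictly obey; the rate of one freeze per week is a hypothetical figure, not a measurement. Under that model the chance of seeing no freeze in t days while the fault is still present is exp(-rate * t).

```python
import math


def prob_no_freeze(rate_per_day, days):
    """P(no freeze in `days`) if the fault is still present and freezes are Poisson."""
    return math.exp(-rate_per_day * days)


def days_for_confidence(rate_per_day, confidence=0.95):
    """Freeze-free days after which a still-present fault would be this unlikely."""
    return -math.log(1.0 - confidence) / rate_per_day


rate = 1 / 7  # hypothetical: about one freeze per week before the cleaning
print(f"P(no freeze in 14 days, fault still present) = {prob_no_freeze(rate, 14):.3f}")
print(f"Freeze-free days needed for 95 %: {days_for_confidence(rate):.1f}")
```

The takeaway is that a quiet stretch only becomes convincing once it is several times longer than the typical gap between freezes.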
I'll keep y'all posted.
Cheers,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Sat Apr 06, 2019 5:37 pm Post subject: |
|
|
ExecutorElassus,
It really helps if you can generate a simple test case.
Keep in mind too that absence of evidence is not evidence of absence.
So you can't prove the problem is not there any more. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Last edited by NeddySeagoon on Sat Apr 06, 2019 8:27 pm; edited 1 time in total |
|
ExecutorElassus Veteran
Joined: 11 Mar 2004 Posts: 1471 Location: Berlin, Germany
|
Posted: Sat Apr 06, 2019 7:18 pm Post subject: |
|
|
Neddy, you are exactly right. I may have worded it poorly, but that's what I was getting at: all I really know so far is that it hasn't frozen again yet. And my main problem has been, from the beginning, that I can't isolate what's causing the system to freeze in the first place. I just know that it happens when gaming, and seems to happen less when the case/fans/radiator have been dusted. But the proximate cause remains unknown.
So, I guess, I'll just keep trying to get it to freeze, and keep trying different tests (should be able to borrow a vid card soon), and keep you posted.
Cheers,
EE |
|
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54851 Location: 56N 3W
|
Posted: Sat Apr 06, 2019 8:29 pm Post subject: |
|
|
ExecutorElassus,
I think that phrase was attributed to Carl Sagan.
I'm sure I heard him use it with regards to SETI and SETI@home. _________________ Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail. |
|
eccerr0r Watchman
Joined: 01 Jul 2004 Posts: 9895 Location: almost Mile High in the USA
|
Posted: Sat Apr 06, 2019 10:29 pm Post subject: |
|
|
I kind of doubt a video card could cause memtest failures especially in newer machines where busses are mostly separated from each other. But it could be a first especially if the video card for whatever reason causes overload.
On older machines with a separate northbridge, I have had instances of the northbridge overheating or failing, causing memory failures. This shouldn't be the case for modern machines however - unless the CPU has gone bad. BTW do you get memory failures on cacheable vs uncacheable tests?
When I see overheating problems on my PC, it's usually due to dust clogging heatsink fins -- which shouldn't be a problem with water blocks -- but the heatsink compound is a common denominator... IMHO if heatsink compound is applied properly (versus slathered all over the place), I think it should last longer between applications. Then again I really try hard not to need to remove the heatsink so I don't need to reapply heatsink compound, and I have machines that have yet to have their heatsink compound replaced since initial assembly when new. _________________ Intel Core i7 2700K/Radeon R7 250/24GB DDR3/256GB SSD
What am I supposed watching? |
|