Gentoo Forums
[tutorial] BadRAM 2025 update
ecko
Tux's lil' helper


Joined: 04 Jul 2010
Posts: 116

Posted: Sun Feb 09, 2025 12:59 am    Post subject: [tutorial] BadRAM 2025 update

I have been dealing with RAM issues, and I found the existing information online to be slightly outdated, so I decided to write a short tutorial.

Previous references on this forum:


Symptoms

I had compiler segfaults when building big packages, such as dev-qt/qtwebengine-6.8.2 or other Qt or KDE applications. With --keep-going in the options, the run would continue to the end; when I relaunched it, the failed packages would often build successfully, though very large packages like qtwebengine or chromium would fail again.

Memtest86+

I installed sys-apps/memtest86+-7.20 and rebooted into it. Memtest86+ runs 10 different tests (e.g. simple read/write, block move, modulo 20). On my machine it runs at about 1 GB/min, meaning about 2 hours for the 128 GB I have.

At any time during the test, press <F1><F4><F4> to switch to the badram mode, whose output is the most usable for this solution. Instead of listing individual failed bytes (which could number in the thousands), memtest86+ tries to summarize them into ranges using a mask. The output is limited to 10 lines.

Here is the output in my case, copied from a mobile phone picture:
Code:

badram=0x00000010168001b8,0xfffffffffffdb8,
       0x0000001016800538,0xfffffffffffd38,
       0x00000010168009f8,0xfffffffffffff8,
       0x0000001016801178,0xfffffffffff578,
       0x00000010168040b8,0xfffffffffff3f8,
       0x0000001016804638,0xfffffffffffff8,
       0x0000001016804cb8,0xfffffffffffdb8,
       0x0000001016805038,0xfffffffffff978,
       0x00000011af842960,0xfffffffffffff8,
       0x00000011af847ea0,0xfffffffffffff8,


Each entry of the output is composed of an address and a mask. The 1 bits in the mask (each f digit is four ones) mark the address bits that are fixed across the range, and the 0 bits mark the address bits that can vary.
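
The mask comparison described above can be sketched in Python. This is only an illustration of the address/mask convention (an address is covered when it agrees with the base address on every bit that is set in the mask); the function name matches_badram is hypothetical and is not part of memtest86+ or GRUB. The mask is used exactly as transcribed above, with 14 hex digits, which does not matter here because the addresses are far below 2^56.

```python
def matches_badram(addr: int, base: int, mask: int) -> bool:
    """Return True if addr is covered by the (base, mask) pattern."""
    # Only the bits set in the mask are compared; cleared bits may vary.
    return (addr & mask) == (base & mask)


# First pattern from the memtest86+ output, as transcribed in the post.
base = 0x00000010168001B8
mask = 0xFFFFFFFFFFFDB8

print(matches_badram(0x10168001B8, base, mask))  # the base address itself
print(matches_badram(0x10168001F8, base, mask))  # differs only in bit 6, which is 0 in the mask
print(matches_badram(0x10168000B8, base, mask))  # differs in bit 8, which is 1 in the mask
```

The first two calls print True and the third prints False, since bit 8 is fixed by the mask while bit 6 is free to vary.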

It can be seen that the addresses are very close together and can be summarized by two 64 KiB blocks:
Code:

0x0000001016800000,0xffffffffffff0000
0x00000011af840000,0xffffffffffff0000


We have to exclude blocks of at least 4 KiB, which is the page size of the kernel, but it is safer to exclude bigger blocks. The number of failing bytes is likely to increase over time near the ones that have already failed (and 64 KiB is very small anyway).
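
The rounding to 64 KiB blocks can be reproduced with a few lines of Python. This is a minimal sketch: the helper name covering_blocks is hypothetical, and it only looks at the base addresses reported by memtest86+. That is enough here because the bits cleared in the transcribed masks all fall in the low 12 bits, well inside a 64 KiB block.

```python
def covering_blocks(addresses, block_size=64 * 1024):
    """Return sorted (base, mask) pairs of aligned blocks covering the addresses."""
    # block_size must be a power of two; the mask clears the offset bits.
    mask = ~(block_size - 1) & 0xFFFFFFFFFFFFFFFF
    return [(base, mask) for base in sorted({addr & mask for addr in addresses})]


# Base addresses from the memtest86+ output in the post.
reported = [
    0x00000010168001B8, 0x0000001016800538, 0x00000010168009F8,
    0x0000001016801178, 0x00000010168040B8, 0x0000001016804638,
    0x0000001016804CB8, 0x0000001016805038, 0x00000011AF842960,
    0x00000011AF847EA0,
]

for base, mask in covering_blocks(reported):
    print(f"0x{base:016x},0x{mask:016x}")
```

With the default block size this prints the two pairs shown above. Passing a larger power of two as block_size (for example 1024 * 1024) gives a wider safety margin around the failing addresses.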

GRUB2
We add the ranges to /etc/default/grub, then update the GRUB configuration:

Code:

# in /etc/default/grub
GRUB_BADRAM=0x0000001016800000,0xffffffffffff0000,0x00000011af840000,0xffffffffffff0000

# then regenerate /boot/grub/grub.cfg
grub-mkconfig -o /boot/grub/grub.cfg

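To avoid typos when writing the long GRUB_BADRAM line by hand, the value can be generated from the (address, mask) pairs. This is a small sketch, not an official tool: the function name grub_badram_line is hypothetical, and the sanity check simply rejects a base address that has bits set where the mask is zero.

```python
def grub_badram_line(patterns):
    """Format (base, mask) pairs as a GRUB_BADRAM line for /etc/default/grub."""
    parts = []
    for base, mask in patterns:
        # A base with bits outside the mask is most likely a typo.
        if base & ~mask:
            raise ValueError(f"base 0x{base:x} has bits outside mask 0x{mask:x}")
        parts.append(f"0x{base:016x}")
        parts.append(f"0x{mask:016x}")
    return "GRUB_BADRAM=" + ",".join(parts)


patterns = [
    (0x0000001016800000, 0xFFFFFFFFFFFF0000),
    (0x00000011AF840000, 0xFFFFFFFFFFFF0000),
]
print(grub_badram_line(patterns))
```

For the two blocks of this tutorial, the printed line is identical to the one added to /etc/default/grub above.
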

This calls the script /etc/grub.d/00_header, which copies the GRUB_BADRAM parameter from /etc/default/grub into /boot/grub/grub.cfg:

Code:


play 60 800 1
badram 0x0000001016800000,0xffffffffffff0000,0x00000011af840000,0xffffffffffff0000
### END /etc/grub.d/00_header ###

### BEGIN /etc/grub.d/10_linux ###
menuentry 'Gentoo GNU/Linux, with Linux 6.13.2-gentoo-x86_64' --class gentoo --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-6.13.2-gentoo-x86_64-advanced-36f0c021-f626-4ac4-b1f1-b3dc23d8f85d' {


Reboot
The badram parameter is not a command-line parameter to the Linux kernel; it is a GRUB command, as seen in the grub.cfg excerpt above. Nothing special is visible in the GRUB menu, even when using "e" to edit an entry.

After the reboot, we check that the kernel takes the excluded addresses into account. We search for "11af", which is part of the address (we could also check the other range, which contains "10168").

Code:

dmesg | grep 11af # to be adapted to each address range
[    0.000000] BIOS-e820: [mem 0x0000001016810000-0x00000011af83ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000011af840000-0x00000011af84ffff] unusable
[    0.000000] BIOS-e820: [mem 0x00000011af850000-0x000000201f2fffff] usable
[    0.000000] reserve setup_data: [mem 0x0000001016810000-0x00000011af83ffff] usable
[    0.000000] reserve setup_data: [mem 0x00000011af840000-0x00000011af84ffff] unusable
[    0.000000] reserve setup_data: [mem 0x00000011af850000-0x000000201f2fffff] usable
[    0.167868]   node   0: [mem 0x0000001016810000-0x00000011af83ffff]
[    0.167869]   node   0: [mem 0x00000011af850000-0x000000201f2fffff]
[    0.173324] PM: hibernation: Registered nosave memory: [mem 0x11af840000-0x11af84ffff]
[    0.679391] e820: reserve RAM buffer [mem 0x11af840000-0x11afffffff]

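The same check can be scripted instead of read by eye. The sketch below parses BIOS-e820 lines in the format shown above and tests whether a whole block lies inside a range marked unusable. The helper names unusable_ranges and block_excluded are hypothetical, and the sample text is copied from the dmesg output of this post; in practice the text would come from the real dmesg output.

```python
import re

# Matches lines such as:
# [    0.000000] BIOS-e820: [mem 0x...-0x...] unusable
E820 = re.compile(r"BIOS-e820: \[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\] (\w+)")


def unusable_ranges(dmesg_text):
    """Return (start, end) pairs, end inclusive, of e820 ranges marked unusable."""
    ranges = []
    for m in E820.finditer(dmesg_text):
        if m.group(3) == "unusable":
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return ranges


def block_excluded(base, size, ranges):
    """Return True if [base, base + size) lies entirely inside one unusable range."""
    return any(start <= base and base + size - 1 <= end for start, end in ranges)


sample = """\
[    0.000000] BIOS-e820: [mem 0x0000001016810000-0x00000011af83ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000011af840000-0x00000011af84ffff] unusable
[    0.000000] BIOS-e820: [mem 0x00000011af850000-0x000000201f2fffff] usable
"""

ranges = unusable_ranges(sample)
print(block_excluded(0x11AF840000, 0x10000, ranges))  # the excluded 64 KiB block
print(block_excluded(0x11AF850000, 0x10000, ranges))  # the next block, still usable
```

With the sample lines, the first call prints True and the second prints False.
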


A different solution would be to add "memtest=4" as a kernel parameter, which instructs the kernel to run its own memory test at boot (this needs a kernel built with CONFIG_MEMTEST). I have not tested it, as I preferred to find the numbers myself, but I guess it would be better to do both, so that any additional failure in the future is detected sooner than by an occasional manual run of memtest86+.
pietinger
Moderator


Joined: 17 Oct 2006
Posts: 5382
Location: Bavaria

Posted: Sun Feb 09, 2025 9:58 am

Moved from Kernel & Hardware to Documentation, Tips & Tricks.
_________________
https://wiki.gentoo.org/wiki/User:Pietinger
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 54845
Location: 56N 3W

Posted: Sun Feb 09, 2025 1:44 pm

ecko,

Two things:
A memtest failure does not always mean the RAM is faulty.
Often, removing the RAM and refitting it fixes the problem. It's called 'wiping the contacts'.

This information would be better as a Wiki page. It can get lost/forgotten here.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
ecko
Tux's lil' helper


Joined: 04 Jul 2010
Posts: 116

Posted: Sun Feb 09, 2025 11:16 pm

In my case I tested with the 4 DIMMs together (initial configuration), then tested them one by one (I numbered them with a pencil), then two by two, then all 4 again. The results were consistent with 2 particular DIMMs reporting errors, and not the other 2. I did not attempt to clean the contacts though. You're right, reseating the components should be mentioned as part of the procedure.

I will consider creating an account on the wiki and moving the howto there so more contributors can improve the instructions.
All times are GMT