deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Sun Dec 10, 2023 10:55 pm Post subject: Rootfs on tmpfs |
|
|
Hey all,
I'm setting up my rootfs as a tmpfs and figured I'd ask for any tips on the setup. The machine is a Dell PowerEdge R730: dual Xeon, 36 cores / 72 threads, with 512 GB of RAM. I have an SSD off the PERC controller, with sda1 as the EFI boot partition and sda2 as a 64 GB XFS root. The plan is to modify the initramfs I generated with genkernel so it creates a tmpfs root, rsyncs the rootfs over from sda2, and boots from the tmpfs. Once booted, a cronjob rsyncs the tmpfs back to sda2 every hour, and I'll add an extra rsync in /etc/local.d/mirror-root.stop to run at shutdown.
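Something like this is what I'm picturing for the shutdown-time sync (untested sketch; the mount point and exclude list are placeholders):
Code: | #!/bin/sh
# /etc/local.d/mirror-root.stop -- untested sketch
# /mnt/slowroot is a placeholder mount point for the on-disk root (sda2)
mount /dev/sda2 /mnt/slowroot || exit 1
rsync -aHAXx --delete \
    --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' \
    --exclude='/run/*' --exclude='/tmp/*' \
    / /mnt/slowroot/
umount /mnt/slowroot |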
Anyone with tips or recommendations? As a side note, I'll be getting a UPS battery backup for the system eventually, and I'm well aware I could lose the past hour of work. It's mainly used as a personal server, but we'll also be running a Windows 11 VM for gaming (with PCIe passthrough and an AMD RX 7800 XT). Nothing mission critical, so the main goal is speed.
_________________
echo deadram; fortune
pingtoo Veteran
Joined: 10 Sep 2021 Posts: 1236 Location: Richmond Hill, Canada
Posted: Mon Dec 11, 2023 12:12 am Post subject: |
|
|
deadram,
Instead of running the entire thing in memory, maybe consider using overlayfs to combine a tmpfs with your storage, or a tmpfs with a squashfs image.
Then you only need a cronjob to sync the tmpfs upper layer back to storage, which is much less I/O.
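Rough sketch of what I mean (mount points and sizes are only examples):
Code: | # rough sketch: keep the on-disk root as the read-only lower layer,
# with a tmpfs upper layer that absorbs all new writes
mount -t tmpfs -o size=32G tmpfs /mnt/rw
mkdir -p /mnt/rw/upper /mnt/rw/work
mount -t overlay overlay \
    -o lowerdir=/mnt/ro,upperdir=/mnt/rw/upper,workdir=/mnt/rw/work /mnt/root
# the cronjob then only has to rsync /mnt/rw/upper back to storage |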
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Mon Dec 11, 2023 2:40 am Post subject: |
|
|
The idea is to have both fast read and write times, with a slower mirror for reboots.
With overlayfs, if I understand it correctly, a read of a file would hit sda2, and a second read of the same file would again hit sda2 (and be as slow as that SSD). Once the file was modified it would exist on the tmpfs, and subsequent reads would come from tmpfs (and be fast).
Can overlayfs be set up to move a file into the upper tmpfs after the first read?
Can overlayfs be set up to slowly mirror the upper fs back onto the lower fs?
_________________
echo deadram; fortune
pingtoo Veteran
Joined: 10 Sep 2021 Posts: 1236 Location: Richmond Hill, Canada
Posted: Mon Dec 11, 2023 3:17 am Post subject: |
|
|
deadram,
You are correct, overlayfs does not help with cached reads. I hope that in your use case the default Linux page cache covers that.
For mirroring the upper layer back down, I think aufs has this function. However, aufs is not in the mainline kernel tree, so you would need to patch your kernel source to use it.
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Mon Dec 11, 2023 3:47 am Post subject: |
|
|
OK, I did come across aufs in my searches; I'll look into it more deeply. Worst case, I'll look into writing my own patch for overlayfs, but that would be a long-term plan.
_________________
echo deadram; fortune
Hu Administrator
Joined: 06 Mar 2007 Posts: 22619
Posted: Mon Dec 11, 2023 4:01 am Post subject: |
|
|
Generally, solid state drives are decently fast, and with that much RAM, Linux caching should cover any drive latency. What are you seeing that you think you need to force everything into memory?
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Mon Dec 11, 2023 6:15 am Post subject: |
|
|
Linux caching runs into the other issue with SSD/NVMe: limited writes. If I keep my system up for two weeks to a month on average, and have a UPS for brownouts, I should be able to tune things so that most writes go to RAM rather than to the SSD/NVMe.
The second issue with the cache: the first read comes from the SSD, and only the second read comes from cache.
When this setup is finished, I'm hoping to read from the SSD into tmpfs while I/O is idle, so that most things run from RAM. When I/O is busy, reads go from the SSD into whatever program needs them (say, loading sshd at boot time), but also into tmpfs. The machine boots quickly, and then over the first 5 to 30 minutes, whenever I/O goes idle, it keeps moving SSD data into tmpfs (for things like Firefox or GIMP that I might use daily, but not at boot time). Once in tmpfs, data is only flushed back to the SSD on reboot, or when the UPS tells the computer its power is out. Like pre-caching the whole filesystem, but also write-caching it (almost) indefinitely.
That should let the SSD/HDD/NVMe last much longer, and give much faster read and write speeds, even compared to NVMe. NVMe = southbridge, RAM = northbridge; the northbridge is always faster.
For the time being I'm without a UPS, so I'll settle for flushing the tmpfs back to the SSD once an hour or once a day.
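The closest existing knobs I've found so far are the vm.dirty_* sysctls, which can hold dirty data in RAM for a long time before writeback, but don't pin clean data the way I'm describing (values below are examples, not recommendations):
Code: | # /etc/sysctl.d/99-writeback.conf -- example values only
# let dirty pages sit in RAM much longer before background writeback kicks in
vm.dirty_background_ratio = 40
vm.dirty_ratio = 60
vm.dirty_expire_centisecs = 360000     # ~1 hour before a dirty page must be written out
vm.dirty_writeback_centisecs = 60000   # flusher thread wakes every ~10 minutes |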
If there is a cleaner way to do all of this with the kernel's read/write cache settings, I'd like to know so I can look into them further, but from what I can tell they are not made for this use case.
_________________
echo deadram; fortune
Hu Administrator
Joined: 06 Mar 2007 Posts: 22619
Posted: Mon Dec 11, 2023 4:09 pm Post subject: |
|
|
Your concern then is that you are afraid of wearing out the SSD through repeated writes. How much write workload are you expecting if you do not use tmpfs? What kind of SSD are you using here? It's my understanding that current generation name-brand SSDs are rated for sufficient writes that you will need to make an unreasonable amount of writes to actually wear one out. It's not impossible to destroy one, but for how I would expect a typical root filesystem to be used, I expect the drive will be replaced for old age before you manage to wear it out.
As for read caching, if you have sufficient RAM and nothing else to do with it, just reading every file should prompt the kernel to cache those files in memory on its own.
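For example, something as crude as this after boot should pull most of the root filesystem into the page cache (sketch only):
Code: | # crude page-cache pre-warm: read every file on the root filesystem once
find / -xdev -type f -print0 | xargs -0 cat > /dev/null 2>&1 |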
Yes, RAM is faster than going to disk, but modern SSDs can achieve very high read throughput. Some crude tests here suggest I can exceed 1 GB/s sustained. For typical use of the root filesystem, that seems very good.
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Mon Dec 11, 2023 4:42 pm Post subject: |
|
|
If I were doing this for a practical reason, or on a work machine, then yes, what you've said is correct. I'm doing this more to squeeze the last little bit of speed and longevity out of my system than for any practical reason. For about $100 I could have a sufficiently large NVMe drive on the PCIe bus. But Firefox will still load about 0.02 seconds faster, and compiles will finish about 5 seconds faster, with my setup.
_________________
echo deadram; fortune
pingtoo Veteran
Joined: 10 Sep 2021 Posts: 1236 Location: Richmond Hill, Canada
Posted: Mon Dec 11, 2023 10:24 pm Post subject: |
|
|
deadram,
While I was contemplating your idea of placing the rootfs in memory, a wild thought came to mind: have you considered using a mirrored volume?
One fixed disk mirrored to one RAM disk; this would match your idea and may be an easier setup:
- Create a zram device
- Mark it as a PV
- Mark your fixed disk as a PV
- Create a volume group with the above two PVs
- Create a mirrored LV
- Run mkfs on the mirrored LV
- Build the root filesystem on top of the mirrored LV
Have another fixed disk (or volume, if you like) for filesystem backups.
It should be relatively easy to create the zram device in the initrd, or even just to relax the checks so a reduced LV is allowed to continue booting; then, after boot, create the zram device and add the mirror back.
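Rough sketch of those steps (device names and sizes are only examples):
Code: | # rough sketch only -- names and sizes are examples
modprobe zram num_devices=1
echo 64G > /sys/block/zram0/disksize
pvcreate /dev/zram0
pvcreate /dev/sda2
vgcreate vgroot /dev/sda2 /dev/zram0
lvcreate --type raid1 -m 1 -l 100%FREE -n lvroot vgroot
mkfs.xfs /dev/vgroot/lvroot |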
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Tue Dec 12, 2023 12:05 am Post subject: |
|
|
Excellent solution; an LV mirror would work perfectly. I shouldn't even need to bother playing with the initramfs, just add the dolvm option to the genkernel setup.
_________________
echo deadram; fortune
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Sun Dec 17, 2023 6:01 am Post subject: |
|
|
For anyone who's curious: there are still a few things to work out, and these directions and scripts are far from polished. This is version 0.0.0.0.0.0.1alpha, so don't try this at home, and definitely don't do this at a workplace you enjoy returning to each day. This system has 512 GB of RAM, so there's plenty to spare, and it's mainly going to be a gaming computer, so nothing mission critical. You have been warned.
/dev/sda1 1G EFI
/dev/sda2 64G LVM RAID1 mirror
/dev/sdb1 1G EFI2 (not actually booting, just for backup)
/dev/sdb2 64G LVM RAID1 mirror
Create your LVM RAID1 mirrored logical volume Code: | $ pvcreate /dev/sda2
$ pvcreate /dev/sdb2
$ vgcreate vgrootraid1 /dev/sd[ab]2
$ lvcreate --mirrors 1 --type raid1 -l 100%FREE -n lvrootraid1 vgrootraid1
$ mkfs.xfs /dev/vgrootraid1/lvrootraid1 |
Now install your system, or, like me, boot into an ISO and copy your root partition data over to the new lvrootraid1. Don't forget to modify fstab and your boot configuration on the lvrootraid1 volume. Then create the following files.
/etc/local.d/ramraid.start Code: | #!/bin/sh
if [ -f /etc/ramraid/ramraid.start ] ; then
    /etc/ramraid/ramraid.start
fi |
/etc/local.d/ramraid.stop Code: | #!/bin/sh
if [ -f /etc/ramraid/ramraid.stop ] ; then
    /etc/ramraid/ramraid.stop
fi |
/etc/ramraid/config Code: | # The ssd or hdd or even nvme partition UUIDs you're loading the root partition from. These are all slower than ram in most configurations.
# Make sure you're using the partuuid; "ls -la /dev/disk/by-partuuid" will show which /dev/sdXN drive each one links to.
# /dev/sda2 (ssd) /dev/sdb2 (hw raid0 7TB)
SLOW_PARTUUID="51470111-07db-4ade-ad83-9fa7161b62e7 4080cf96-abf9-894a-bf6f-fef248666f7b"
# The size of your root partition, in KiB (brd's rd_size parameter is in KiB).
# To calculate it, use fdisk to find the number of sectors and the logical sector size.
# In my case that's 134217728 sectors and a 512 (logical) / 4096 (physical) sector size:
# 134217728*512=68719476736 bytes
# then divide by 1024:
# 68719476736/1024=67108864
ROOT_SIZE="67108864"
# Number of mirrors once the ram drive is up (this is one less than the number of drives, so 2 drives + 1 ram drive = 2 mirrors)
NUM_MIRROR_RAM="2"
# Number of mirrors once the ram drive is down and removed from the raid
NUM_MIRROR_NORAM="1"
# Name of the vg
VG_NAME="vgrootraid1"
# lv device path
LV_DEV="/dev/vgrootraid1/lvrootraid1"
# ram dev path
RAM_DEV="/dev/ram0"
# TODO:
# - cancel a shutdown if our ssd/hdd mirror has failed. Hopefully soon enough to still have a functioning system, and a message somewhere to remind us to mirror the ram drive before anything else fails!
# - figure out more info on the writebehind setting, and a sane value to use
# - disable read caching for the ram drive, maybe for the lv raid1 drive too
# - disable the write cache for the ram drive. Just flush that data to ram, no need to buffer it.
# - increase the write cache for the slow drives, maybe even delay the flush, and only flush every x hours
# - clean up logging, and add more error checking |
/etc/ramraid/ramraid.start Code: | #!/bin/sh
if [ -f /etc/ramraid/config ] ; then
    . /etc/ramraid/config
else
    echo "$0 failed to load /etc/ramraid/config" >> /var/log/ramraid
    exit 1
fi
# add the /dev/ram0 block ram device
modprobe brd rd_nr=1 rd_size=${ROOT_SIZE}
# add ram0 as a physical volume (PV) for lvm use
pvcreate ${RAM_DEV}
# add ram0 as a drive in the vg
vgextend ${VG_NAME} ${RAM_DEV}
# convert the lv to use the ram drive as a mirror (this starts a sync from the ssd/hdd to your ram drive)
lvconvert -y -m ${NUM_MIRROR_RAM} ${LV_DEV}
# Use these commands to view sync status:
# dmsetup status /dev/vgrootraid1/lvrootraid1
# lvs -a
# Fork our script to turn on writemostly for the slow drives. We need to wait until the sync has completed before we do this, which is why we fork the script.
if [ -f /etc/ramraid/ramraid.forkwaitforsync.thenwritemostly ] ; then
    /etc/ramraid/ramraid.forkwaitforsync.thenwritemostly &
else
    echo "Error forking ramraid.forkwaitforsync.thenwritemostly" >> /var/log/ramraid
fi |
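A quick way to keep an eye on the resync after boot (sync_percent is the same lvs field the fork script polls):
Code: | # watch the ram mirror catch up after boot
watch -n 5 'lvs -a -o name,sync_percent,devices vgrootraid1' |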
/etc/ramraid/ramraid.stop Code: | #!/bin/sh
if [ -f /etc/ramraid/config ] ; then
    . /etc/ramraid/config
else
    echo "$0 failed to load /etc/ramraid/config" >> /var/log/ramraid
    exit 1
fi
# set the ssd and hdd drives back to read/write. No need to check whether writemostly was turned on, since we force "n"
for i in $SLOW_PARTUUID; do
    lvchange --writemostly /dev/disk/by-partuuid/${i}:n ${LV_DEV}
done
# remove the ram mirror, so the lv doesn't look degraded and we don't get "missing" pv errors for stale ram drives after reboot.
# The -y is ok in this case, because we're telling lvconvert which drive to remove.
# Keep in mind that if the ssd/hdd mirror drive is in a failed state and we bring the system down, we're losing data.
#
# Maybe add a check to see if we have at least one other online mirror. If not, cancel the shutdown.
# Once the shutdown is cancelled, we can hopefully back up the root partition from the ram drive.
lvconvert -y -m ${NUM_MIRROR_NORAM} ${LV_DEV} ${RAM_DEV}
# remove the ram drive from the vg
vgreduce ${VG_NAME} ${RAM_DEV}
# remove the ram drive's pv label
pvremove ${RAM_DEV}
# Keep in mind the 'vgreduce --removemissing --force' command for power failures or bad shutdowns/reboots |
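If the box ever goes down hard with the ram leg still attached, my rough (untested) recovery plan is:
Code: | # untested recovery sketch after a power failure or bad shutdown:
# the ram0 PV no longer exists, so drop the missing leg before anything else
vgreduce --removemissing --force vgrootraid1
# check the state afterwards; ramraid.start re-adds the ram mirror on the next boot
lvs -a -o name,attr,devices vgrootraid1 |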
/etc/ramraid/ramraid.forkwaitforsync.thenwritemostly Code: | #!/bin/sh
if [ -f /etc/ramraid/config ] ; then
    . /etc/ramraid/config
else
    echo "$0 failed to load /etc/ramraid/config" >> /var/log/ramraid
    exit 1
fi
# While the sync is not at 100 percent, sleep for 5 seconds
TEST=0
while [ "x$TEST" != "x100" ] ; do
    sleep 5 # Sleep for 5 seconds
    TEST=$(lvs -o sync_percent ${LV_DEV} | tail -n 1 | sed -ne 's/ \+//gm;s/\.[0-9][0-9]//;s/ \+//m;p')
done
# Just take our time
sleep 5
# For each slow drive
for i in $SLOW_PARTUUID ; do
    # mark the drive write-mostly, so reads come from the ram drive (now that it has finished syncing)
    lvchange --writemostly /dev/disk/by-partuuid/${i}:y ${LV_DEV}
done
# use "lvs -a" to see the writemostly bit on your drives
# set the lv to use writebehind, allowing the ssd/hdd to fall behind the ram drive on writes.
# This speeds up writes, because they no longer block once the ram drive has finished its write.
lvchange --writebehind 4096 ${LV_DEV} |
_________________
echo deadram; fortune
NeddySeagoon Administrator
Joined: 05 Jul 2003 Posts: 54577 Location: 56N 3W
Posted: Sun Dec 17, 2023 12:11 pm Post subject: |
|
|
deadram,
When I worked out that the write life of my oldest SSD would be over 80 years, I stopped worrying about write life.
It's 10 years old and already too small.
Another SSD, same part number and a little younger, failed the other day with a rash of bad blocks in /var.
I managed to save world, but portage lost its mind.
That drive still had 80 years of write life left too.
I recovered /etc and world and rebuilt the install on a bigger SSD.
_________________
Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
pingtoo Veteran
Joined: 10 Sep 2021 Posts: 1236 Location: Richmond Hill, Canada
Posted: Sun Dec 17, 2023 3:07 pm Post subject: |
|
|
deadram,
Lovely work!
I presume it is functioning as you designed. So, does everything run faster? Is the difference even noticeable (start/shutdown time)? I mean, is anything still hindering performance?
I hope you have a backup system in place; all this work deserves a good place to be preserved. I use app-backup/rsnapshot, which I set up to do hourly backups.
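The relevant bits of an rsnapshot setup look roughly like this (NOTE: rsnapshot requires literal tabs between fields in rsnapshot.conf, not spaces):
Code: | # /etc/rsnapshot.conf (excerpt; fields are tab-separated)
config_version  1.2
snapshot_root   /mnt/backup/rsnapshot/
retain  hourly  24
retain  daily   7
backup  /       localhost/
# plus a cron entry to drive it:
# 0 * * * *  root  /usr/bin/rsnapshot hourly |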
deadram n00b
Joined: 20 Dec 2006 Posts: 44
Posted: Mon Dec 18, 2023 3:52 am Post subject: |
|
|
NeddySeagoon wrote: | When I worked out that the write life of my oldest SSD would be over 80 years, ... |
Give me that same drive and a month building Chia plots, using it as a temporary plot drive. Heat + use = time. Anything can burn out long before it's supposed to when abused, and I can abuse almost anything.
pingtoo wrote: | lovely work.. | Yeah, except it doesn't work. I just ran a bunch of write and read tests, and the LVM performs within +/- 1 second whether the ram drive is online or offline. I have a feeling it's down to the kernel read/write cache settings and LVM tuning. Once Christmas is over I'll look into it more deeply. I might start with an initramfs that does "dd if=/dev/vggroup/lvdrive of=/dev/ram0" and then boots off the ram drive, just to see whether the issue is LVM related or kernel read/write cache related. It could just be that my tests used a few 0.5 GB to 4.4 GB files and the kernel read/write cache is bigger. I know my L2 and L3 CPU cache was fully used up; then I shut down the program using it, and even with the cache fully free, I/O was faster, but equally so with the ram drive online or offline. Anyway, I'll plug away and update you when I've figured it out.
_________________
echo deadram; fortune