Gentoo Forums
Why not just put all of portage on read-only NBD volumes?
nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Sat Jan 11, 2025 5:12 pm    Post subject: Why not just put all of portage on read-only NBD volumes?

The tree, the distfiles, all of it.

Run mirrors just as is done now with http, https, rsync....

Only now, you're inviting the public to mount the network block device and access it like it's a local drive.

Go further, and make it part of the handbook, in the section where everybody sets up the chroot. It would work the same way as using nfs for portage: you do those mounts before the chroot.

The final system could use dm-cache, lvm's cache, or whatever so that files that are effectively downloaded from the NBD volume can be reused later without having to re-fetch them.

A NBD server should be a simple affair. You're identifying bandwidth abusers at the block level ffs, people who try dd get throttled quickly and easily.
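
One way the block-level throttling could look, as a sketch only: a per-client token bucket that charges each read request by its byte count. The class name, the helper `may_read`, and the rate and burst numbers are all illustrative assumptions, not part of any existing NBD server.

```python
import time


class TokenBucket:
    """Per-client byte budget: refills at `rate` bytes/s, capped at `burst` bytes."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable so the behaviour can be tested
        self.tokens = burst
        self.stamp = clock()

    def allow(self, nbytes):
        """Return True and charge the bucket if nbytes fits the current budget."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


buckets = {}  # client IP -> TokenBucket (hypothetical bookkeeping)


def may_read(ip, nbytes, rate=1_000_000, burst=4_000_000):
    """Decide whether a read of nbytes from this client should be served now."""
    if ip not in buckets:
        buckets[ip] = TokenBucket(rate, burst)
    return buckets[ip].allow(nbytes)
```

Someone running dd against the whole device drains the burst allowance almost immediately and is then held to the refill rate, while a client fetching a handful of distfiles never notices.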

And you'd have the only distro as far as I can tell that allows users some peace of mind as they perform updates of their system. Everybody else seems to either demand simultaneous access to your home directory with the tls connection back to the mothership or work hard to break any proxy solution the community comes up with to gain some measure of control over their data.

The install cd combined with the read-only repository drive should provide excellent privacy for its users.

Moved from "Portage & Programming" to "Gentoo Chat". --Zucca
_________________
Today is the first day of the rest of your Gentoo installation.

John R. Graham
Administrator

Joined: 08 Mar 2005
Posts: 10691
Location: Somewhere over Atlanta, Georgia

Posted: Sat Jan 11, 2025 7:45 pm

I've tried it and I didn't like it—with the repositories, at least. My own personal experience is that Portage performance with a remotely mounted repository is dramatically lower than with a local copy. This is principally during the dependency resolution process. I do sync my home server against Gentoo and then sync all my other machines against my home server, but that's more out of respect for Gentoo's donated bandwidth than anything else. My home server also shares its /var/cache/distfiles via NFS with the rest of the network for the same reason.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.


Last edited by John R. Graham on Mon Jan 13, 2025 12:31 am; edited 1 time in total

Hu
Administrator

Joined: 06 Mar 2007
Posts: 22990

Posted: Sat Jan 11, 2025 8:25 pm

What problem is this solving? If users need to avoid revealing what they are reading, then asking them to read it over the Internet from a network block device is not an improvement over asking them to read it over the Internet using a TLS-encrypted connection. They need a way to get a purely local copy and do so without revealing what they are reading. Portage already caches everything it reasonably can, using local files, not a mirror of a block device.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Sun Jan 12, 2025 12:30 am

Performance would be solved using dm-cache. Yes, the first sync pulls in the whole tree, but we're doing that already anyways. Once all of the bits are sitting in your local cache, speeds should be back to normal.

And nobody is asking to avoid revealing what THEY are reading. The point is to be able to update a Linux installation without letting THE PACKAGE MANAGER read any of the data the user has on the system.

I should be able to update my Linux system without exposing my data. This does that.

And it seems so simple.
_________________
Today is the first day of the rest of your Gentoo installation.

Hu
Administrator

Joined: 06 Mar 2007
Posts: 22990

Posted: Sun Jan 12, 2025 1:24 am

I don't see how this solves your stated problem. If you don't want the package manager poking around in the user's home directories, give those home directories permission bits that prevent the Linux user portage from searching them. That can be done just as readily with the current Portage design as with your proposal.
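
As a rough illustration of the permission-bit approach: the sketch below sets a directory to mode 700, so only its owner can search it. The function name is made up for the example, and it has to run as the directory's owner or as root.

```python
import os
import stat


def lock_down(path):
    """Set a directory to mode 700 (rwx for the owner, nothing for group/other)."""
    os.chmod(path, stat.S_IRWXU)
    # Report the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `lock_down` on a home directory should return `0o700` once the change has taken effect.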

If you want everything cached by dm-cache, would it not be better to download everything in one go to local storage that knows it is primary, rather than letting the cache slowly populate itself as it is accessed? Your proposal does not seem simple to me. It relies on at least two kernel features that most people otherwise do not need to enable. It relies on infrastructure no one is running yet. It seems to me that its performance will at best be equal to the current setup, but likely worse due to demand-loading parts of the tree over time instead of eagerly loading the entire tree into local storage at emerge --sync time.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Sun Jan 12, 2025 1:27 am

John R. Graham wrote:
I do sync my home server against Gentoo and then sync all my other machines against my home server, but that's more out of respect for Gentoo's donated bandwidth than anything else.


I just want to expand on this because I'm trying to do something similar.

I have the home server running Gentoo and it is hosting both /var/db/repos/gentoo and /var/cache/distfiles over nfs. Going to install Gentoo on a new machine connected to that server is a delight; I just remember to perform the nfs mounts prior to the chroot, and then get to skip the sections of the handbook covering how to set up portage and distfiles.

The problem of course is that the home server is running with a limited number of installed packages. The new machine is the laptop and I hope to have sway, Firefox, whichever Doom we're on, etc. I also hope to never access the Internet with this device, leaving all such things to the home server.

Now we arrive at the fork in the road. When you say you sync your home server against Gentoo, do you also sync your distfiles?

If you do, you have a very easy time of it going forward. Any package you want to emerge on the workstation you get to, easily and without fuss.

If, on the other hand, you are not syncing your distfiles, then you have to manually fetch the needed packages from the home server. The current solution to that appears to be running the output of emerge -p on the target machine through some sed/perl/python and then sneakernetting the result onto the host machine where you do an emerge -f --nodeps to actually get the files.
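
The text-munging step might look something like this sketch. It assumes the usual `[ebuild ...] category/package-version` layout of emerge -p output and just prefixes each match with `=` to form an exact-version atom. The function name and the sample lines (including the version numbers) are invented for illustration.

```python
import re

# Matches lines such as "[ebuild  N     ] app-editors/vim-core-9.1.0366  USE=..."
# and stops at the first ':' so that slot or "::repo" suffixes are dropped.
EBUILD_LINE = re.compile(r'^\[ebuild[^\]]*\]\s+([^\s:]+)')


def atoms_from_pretend(text):
    """Turn emerge -p output into a list of '=category/package-version' atoms."""
    atoms = []
    for line in text.splitlines():
        match = EBUILD_LINE.match(line)
        if match:
            atoms.append('=' + match.group(1))
    return atoms


sample = """
These are the packages that would be merged, in order:

[ebuild  N     ] app-editors/vim-core-9.1.0366  USE="nls"
[ebuild     U  ] www-client/firefox-128.5.0::gentoo [127.0.1::gentoo]
"""

for atom in atoms_from_pretend(sample):
    print(atom)
```

The printed list can be saved to a file, carried over, and handed to emerge -f --nodeps on the machine that has the connection.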

And it works! I am happily on my way to a Gentoo-powered home network!

I was in this situation before; I think I had the big bulky machine at home without Internet I wanted to update via the laptop I could take to the cafe. My solution then was to sync all of distfiles. This was frowned upon here. So I stopped doing it.

But it was great! I could be out in the sticks and have all of Gentoo at my disposal. I could wipe out my install through stupidity, but if the drive holding portage and the distfiles was still intact, I could claw my way back to a running system. Without the Internet.

Anyways, I am very happy with the way my setup is working out, but it could have been so much simpler, is I guess what I'm saying. I want to respect Gentoo's bandwidth too. There's some weird stuff in the distfiles, have you looked? 99% of it is stuff I will never install.

But I don't know in advance which packages make up the 1% that I will install. The ability to simply mount that critical volume over the Internet to quickly and safely satisfy those dependencies reeks of elegance.
_________________
Today is the first day of the rest of your Gentoo installation.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Sun Jan 12, 2025 1:57 am

Hu wrote:
I don't see how this solves your stated problem. If you don't want the package manager poking around in the user's home directories, give those home directories permission bits that prevent the Linux user portage from searching them. That can be done just as readily with the current Portage design as with your proposal.


It involves my having the Gentoo machine connected to the Internet. I appreciate what you're saying about the effectiveness of discretionary access control, but the surface area of that solution is enormous when compared to the solution of simply never connecting to the Internet, or, in that case, making that connection as safe as it can possibly be.

Hu wrote:
If you want everything cached by dm-cache, would it not be better to download everything in one go to local storage that knows it is primary, rather than letting the cache slowly populate itself as it is accessed?


Ok, I can withdraw that part of the recommendation. Nothing wrong with just downloading the snapshot to the local device. Being able to have just one copy on the local network shared across all devices is nice and all, but you're right, no real advantage over the local copy. Other than having to maintain a local copy that is.

Simplifying access to the distfiles addresses the pain point.

Hu wrote:
Your proposal does not seem simple to me. It relies on at least two kernel features that most people otherwise do not need to enable...

nbd and dm-cache should already be part of any self-respecting kernel, c'mon.

Hu wrote:
It relies on infrastructure no one is running yet.

A nbd server is a nothing thing. Ten lines of Python, tops. Twenty if you want to throttle abusers.

But wait, there's more!

Why can't the Gentoo install CD simply be GRUB and a kernel with nbd enabled, that's given a nbd server on the Internet to load the initramfs from? EDIT: ok the initramfs would have to be on the stick but it can mount the nbd device and then you can switch_root to that.

Once a Gentoo user has this thing burned onto a USB drive, it's good for life and they never have to touch another thing. The image used for the initramfs can be updated constantly. It loads them into a shell where /var/db/repos/gentoo and /var/cache/distfiles are already mounted and they can then do partitioning/formatting, etc. stage3 would be hosted on yet another nbd volume.

Two little [x] things in the kernel but they give so much!
_________________
Today is the first day of the rest of your Gentoo installation.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Sun Jan 12, 2025 5:42 am

Immediately prior to this install I was running on Fedora's Sway Atomic spin.

rpm-ostree. Sure you guys know all about it. But aren't they missing the point?

If everybody is running the same root, and /home and /mnt are all stuffed into /var and made into overlays, then why aren't we just publishing that root as a nbd volume and letting everyone use a combination of a local dm-cache over the nbd mount and then put the overlays on top of that, hosted on the local system?

It isn't an install CD anymore. You're letting people download boot devices. The internal ssd gets divided between dm-cache, the root overlay, and home.
_________________
Today is the first day of the rest of your Gentoo installation.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54692
Location: 56N 3W

Posted: Sun Jan 12, 2025 12:28 pm

nokilli,

Distfiles and the repo(s) are not static. The master mirror updates every 30 minutes. The mirrors update at the whim of the mirror admins, which is not under Gentoo's control.

How do you deal with the cases where things change under you?
e.g.
Portage calculates the dependency tree, but an ebuild has been removed before it's needed for the build?
Similarly for distfiles?

Non-atomic mirror updates are a problem now; emerge --sync can fail because the mirror being used changed during the process.
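
To make the question concrete, here is one hedged way a client could at least detect that the volume changed underneath it: fingerprint some cheap marker file before the operation and compare afterwards. The marker path (for instance a timestamp file published on the volume), the function names, and the retry policy are all assumptions. This only detects a change; it does nothing to make mirror updates atomic.

```python
import hashlib


def fingerprint(path):
    """Hash the marker file's contents."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()


def run_if_stable(marker_path, work, retries=3):
    """Run work() and accept its result only if the marker did not change meanwhile."""
    for _ in range(retries):
        before = fingerprint(marker_path)
        result = work()
        if fingerprint(marker_path) == before:
            return result
    raise RuntimeError('tree kept changing during the operation')
```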
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

John R. Graham
Administrator

Joined: 08 Mar 2005
Posts: 10691
Location: Somewhere over Atlanta, Georgia

Posted: Mon Jan 13, 2025 2:01 am

nokilli wrote:
Now we arrive at the fork in the road. When you say you sync your home server against Gentoo, do you also sync your distfiles?
If you do, you have a very easy time of it going forward. Any package you want to emerge on the workstation you get to, easily and without fuss.

Of course not. That would be a massive overuse of Gentoo's donated bandwidth, and would be fetching a lot of files I'd never use. /var/cache/distfiles from the server is mounted on all my satellite machines via NFS. Whichever machine builds a package first fetches it--just once--for potential use of all the rest.

nokilli wrote:
If, on the other hand, you are not syncing your distfiles, then you have to manually fetch the needed packages from the home server. The current solution to that appears to be running the output of emerge -p on the target machine through some sed/perl/python and then sneakernetting the result onto the host machine where you do an emerge -f --nodeps to actually get the files.

Not necessary for the same reason as above. Incidentally, I see very little performance degradation with NFS when reading one big file (the source tarball) as compared to when I was reading thousands of little files (the repository). It all works extremely well.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

Hu
Administrator

Joined: 06 Mar 2007
Posts: 22990

Posted: Mon Jan 13, 2025 3:36 am

nokilli wrote:
It involves my having the Gentoo machine connected to the Internet.

If you don't want to be on the Internet, how does having servers on the Internet expose their data as a Network Block Device help you? You would need to be on the Internet to reach them.

nokilli wrote:
I appreciate what you're saying about the effectiveness of discretionary access control, but the surface area of that solution is enormous when compared to the solution of simply never connecting to the Internet, or, in that case, making that connection as safe as it can possibly be.

The discretionary access control is very easy though: set all user home directories to mode 700. Alternatively, run Portage in a namespace where the home directories are not visible, in which case you don't need users to manage their permissions.

nokilli wrote:
Ok, I can withdraw that part of the recommendation. Nothing wrong with just downloading the snapshot to the local device. Being able to have just one copy on the local network shared across all devices is nice and all, but you're right, no real advantage over the local copy. Other than having to maintain a local copy that is.

This is a bit confusing. You seem to be mixing whether we are caching a partial local mirror or caching a complete public mirror. As others explained, caching a mirror of the ebuilds is practical (although caching it at the block layer is needless complexity). Caching all the distfiles is not practical. There is a reason Portage only downloads distfiles as needed.

nokilli wrote:
Hu wrote:
It relies on at least two kernel features that most people otherwise do not need to enable...
nbd and dm-cache should already be part of any self-respecting kernel, c'mon.

Any self-respecting kernel should enable only the features that are needed to serve its assigned workload. Enabling more than that is needless attack surface. I cannot recall ever needing to enable DM-Cache in any of my kernels, for any of my workloads. I think I experimented briefly with NBD for virtual machines, and ultimately found it not to be worthwhile.

nokilli wrote:
Hu wrote:
It relies on infrastructure no one is running yet.
A nbd server is a nothing thing. Ten lines of Python, tops. Twenty if you want to throttle abusers.

In that case, could you post the 10 line version here? I know Python's standard library is extensive, but I was not aware it included an NBD server.

nokilli wrote:
Why can't the Gentoo install CD simply be GRUB and a kernel with nbd enabled, that's given a nbd server on the Internet to load the initramfs from?

Such an install CD would be useless for offline work. You previously stated you want to not connect your machine to the Internet, so such a CD could not work for you.

nokilli wrote:
Once a Gentoo user has this thing burned onto a USB drive, it's good for life and they never have to touch another thing.

As long as no one renames the host server, or stops hosting it, or replaces the kernel on the server with one that is not compatible with the user's hardware.

Also, your proposal assumes the user's Internet bandwidth is plentiful, fast, and unmetered. Some people would prefer to download everything to local storage once, then use it across multiple machines in a repeatable fashion. A system that relies on an Internet server at every boot fails for that use case.

nokilli wrote:
The image used for the initramfs can be updated constantly.

Gentoo does provide relatively frequent stage3 updates. Users are welcome to download those, or not, as needed.

nokilli wrote:
It loads them into a shell where /var/db/repos/gentoo and /var/cache/distfiles are already mounted and they can then do partitioning/formatting, etc. stage3 would be hosted on yet another nbd volume.
Two little [x] things in the kernel but they give so much!

I fail to see how the substantial server support you are requesting is worthwhile considering the, in my opinion, minimal benefit it provides to users.

nokilli wrote:
rpm-ostree. Sure you guys know all about it.

No, I don't know all about it. I have not looked seriously at Fedora in many years, because they are not Gentoo.

nokilli wrote:
If everybody is running the same root, and /home and /mnt are all stuffed into /var and made into overlays, then why aren't we just publishing that root as a nbd volume and letting everyone use a combination of a local dm-cache over the nbd mount and then put the overlays on top of that, hosted on the local system?

Why would we? That seems like added complexity for no benefit. Also, as above, if this is an NBD volume, then it can only be accessed if you have a working Internet connection. Some people want their system to be usable offline.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Mon Jan 13, 2025 10:44 am

Let's imagine this nbd server exists and holds the entire set of distfiles.

So to be clear, this is a block device that has been formatted as ext4. I can mount this on my home server, do a directory listing, and see that all of the distfiles are there.

Now I create a new block device. I'm going to give this to dm-cache, and it's going to sit in front of the nbd. I will mount this volume as /var/cache/distfiles.

Every time I read a file from /var/cache/distfiles, if it's in the cache I get the cached version, if it's not, it gets fetched from the nbd device.

My home server is connected to the Internet.

My laptop is not. My laptop can connect to the home server, but that's it, e.g., ip_forwarding is off, and the firewall otherwise makes access impossible.

But because my laptop is connected to the home server, I can mount the /var/cache/distfiles that sits on the server from my laptop and do my emerges from that. It isn't connected to the Internet, but I can still easily emerge any file I like. Most of the time the requests are satisfied by the cache, but for the few packages that aren't, it still works because the home server can fall back to the nbd volume because IT is connected to the Internet.

You guys seem to think this means moving around big amounts of data. This is just like a hard drive. Just because I'm mounting it, doesn't mean I'm reading every byte off of the device. And the same holds true for the nbd device. It's a lazy cache.
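
To show what "lazy" means here, a toy read-through cache in Python. This is only the idea; the real dm-cache lives in the kernel's device-mapper layer with its own policies and metadata. `fetch_block` stands in for a read from the nbd device, and the class name and 4096-byte block size are assumptions for the example.

```python
class ReadThroughCache:
    """Serve blocks locally when possible; fetch from the backing device on a miss."""

    def __init__(self, fetch_block, block_size=4096):
        self.fetch_block = fetch_block  # callable: block number -> bytes
        self.block_size = block_size
        self.blocks = {}                # block number -> cached bytes
        self.misses = 0

    def read(self, offset, length):
        """Return `length` bytes from `offset`, pulling in only the blocks touched."""
        out = bytearray()
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for n in range(first, last + 1):
            if n not in self.blocks:
                self.misses += 1
                self.blocks[n] = self.fetch_block(n)
            out += self.blocks[n]
        start = offset - first * self.block_size
        return bytes(out[start:start + length])
```

Reading 200 bytes that straddle a block boundary costs two remote fetches the first time and none after that; the rest of the device is never transferred.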
_________________
Today is the first day of the rest of your Gentoo installation.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Mon Jan 13, 2025 10:59 am

This is what a LLM gives us as a simple nbd server in Python:

Code:

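import asyncio
import os
import struct

# Protocol sketch only: this speaks the old "oldstyle" NBD handshake.
# Current clients generally expect the newer "newstyle" negotiation.
NBD_REQUEST_MAGIC = 0x25609513
NBD_REPLY_MAGIC = 0x67446698
NBD_CMD_READ = 0
NBD_CMD_DISC = 2
EXPORT_FLAGS = 0x1 | 0x2  # NBD_FLAG_HAS_FLAGS | NBD_FLAG_READ_ONLY
EPERM = 1
EINVAL = 22

class SimpleNBDServer:
    def __init__(self, filename, host='127.0.0.1', port=10809):
        self.filename = filename
        self.host = host
        self.port = port

    async def start(self):
        self.server = await asyncio.start_server(self.handle_client, self.host, self.port)
        print(f'Serving {self.filename} on {self.host}:{self.port}')
        async with self.server:
            await self.server.serve_forever()

    def reply(self, writer, error, handle, data=b''):
        # Simple reply: magic, error code, the handle echoed back, then any data
        writer.write(struct.pack('>LL8s', NBD_REPLY_MAGIC, error, handle) + data)

    async def handle_client(self, reader, writer):
        # Size of the image file (a real block device would need another method)
        size = os.path.getsize(self.filename)
        # Handshake: "NBDMAGIC", cliserv magic, export size, flags, 124 reserved bytes
        writer.write(b'NBDMAGIC' + struct.pack('>QQL', 0x00420281861253, size, EXPORT_FLAGS) + b'\x00' * 124)
        await writer.drain()

        try:
            with open(self.filename, 'rb') as f:  # read-only export
                while True:
                    # Request header is 28 bytes: magic, flags, type, handle, offset, length
                    req = await reader.readexactly(28)
                    magic, _flags, request_type, handle, offset, length = struct.unpack('>LHH8sQL', req)
                    if magic != NBD_REQUEST_MAGIC:
                        print('Invalid request')
                        break
                    if request_type == NBD_CMD_DISC:
                        break
                    if request_type != NBD_CMD_READ:
                        # Anything but a read is refused and the connection dropped
                        self.reply(writer, EPERM, handle)
                        await writer.drain()
                        break
                    if offset + length > size:
                        self.reply(writer, EINVAL, handle)
                    else:
                        f.seek(offset)
                        self.reply(writer, 0, handle, f.read(length))
                    await writer.drain()
        except asyncio.IncompleteReadError:
            pass  # client disconnected mid-request
        finally:
            writer.close()

    def run(self):
        asyncio.run(self.start())

if __name__ == "__main__":
    server = SimpleNBDServer("example.img")
    server.run()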


That's about right, I'm familiar with asyncio, and this looks pretty good. I'd put a priority queue in there, hold tasks in there indexed by ip address, and then emit blocks based on who is using the server the least. That's probably another 10 lines.
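
Roughly what I mean by the priority queue, as a sketch and not tested against a real server: pending requests go into a heap keyed by how many bytes that client has already been sent, so the lightest user is served first. The class and method names are made up, and the priority is a snapshot taken when the request is queued.

```python
import heapq
import itertools
from collections import defaultdict


class FairQueue:
    """Hand out pending requests, least-served client first."""

    def __init__(self):
        self.served = defaultdict(int)  # client IP -> bytes already sent
        self.heap = []                  # (bytes served at enqueue, seq, ip, request)
        self.seq = itertools.count()    # tie-breaker keeps FIFO order among equals

    def push(self, ip, request):
        heapq.heappush(self.heap, (self.served[ip], next(self.seq), ip, request))

    def pop(self):
        """Return (ip, request) for the client that had received the least data."""
        _, _, ip, request = heapq.heappop(self.heap)
        return ip, request

    def account(self, ip, nbytes):
        """Record that nbytes were sent to this client."""
        self.served[ip] += nbytes
```

The read loop would push each incoming request, and a single sender task would pop, serve, and then call `account` with the number of bytes written.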

It's a super simple thing. It is true that nbd devices aren't commonly used in this way. Most people don't compile their own gcc either. How this gets exploited in a way that a web server or an rsync server doesn't isn't obvious to me at all. This is a much simpler protocol. The input is easily checked. You're reading data off of a block device and sending it unmolested over the Internet. Remember that it's a read-only device and it doesn't even have to think about file systems or anything like that.
_________________
Today is the first day of the rest of your Gentoo installation.

pingtoo
Veteran

Joined: 10 Sep 2021
Posts: 1386
Location: Richmond Hill, Canada

Posted: Mon Jan 13, 2025 11:51 am

nokilli wrote:
Let's imagine this nbd server exists and holds the entire set of distfiles.

So to be clear, this is a block device that has been formatted as ext4. I can mount this on my home server, do a directory listing, and see that all of the distfiles are there.

Now I create a new block device. I'm going to give this to dm-cache, and it's going to sit in front of the nbd. I will mount this volume as /var/cache/distfiles.

Every time I read a file from /var/cache/distfiles, if it's in the cache I get the cached version, if it's not, it gets fetched from the nbd device.

My home server is connected to the Internet.

My laptop is not. My laptop can connect to the home server, but that's it, e.g., ip_forwarding is off, and the firewall otherwise makes access impossible.

But because my laptop is connected to the home server, I can mount the /var/cache/distfiles that sits on the server from my laptop and do my emerges from that. It isn't connected to the Internet, but I can still easily emerge any file I like. Most of the time the requests are satisfied by the cache, but for the few packages that aren't, it still works because the home server can fall back to the nbd volume because IT is connected to the Internet.

You guys seem to think this means moving around big amounts of data. This is just like a hard drive. Just because I'm mounting it, doesn't mean I'm reading every byte off of the device. And the same holds true for the nbd device. It's a lazy cache.

Why the complexity of having dm-cache on top of nbd-client to nbd-server? If you have network connectivity, why not just use NFS and maybe add fs-cache on top? That covers your local network file distribution. And for Internet access, do fuse over rsync or fuse over git; this way the Gentoo servers do not need to spin up any new service.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 205

Posted: Mon Jan 13, 2025 12:36 pm

How the hell did I miss Neddy's post?

Yes, dm-cache is stupid for the distfiles-only use case, my apologies for what appears to be wasting valuable developer time.

I would point out that this still sort-of works for my use case if you replace dm-cache and just daisy-chain the nbd servers.

So Gentoo publishes the distfiles over nbd somewhere, my home server mounts that, and publishes that using a local nbd server, and my laptop then mounts that.

That satisfies my desire to not make direct connections to the Internet, but then it always goes to the nbd server.

pingtoo mentions fs-cache; honestly I have no experience with that, but that sounds like an obvious first place to look.

Please do not waste any more of your time with this unless/until I have a demo that works.

Or even then. ;)

---

You cannot be too paranoid today. But I guess it doesn't hurt to dress it up a little nicer.
_________________
Today is the first day of the rest of your Gentoo installation.
All times are GMT