Gentoo Forums
Why not just put all of portage on read-only NBD volumes?
nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Sat Jan 11, 2025 5:12 pm    Post subject: Why not just put all of portage on read-only NBD volumes?

The tree, the distfiles, all of it.

Do mirrors just like is being done now with http, https, rsync....

Only now, you're inviting the public to mount the network block device and access it like it's a local drive.

Go further, and make it part of the handbook, where you have everybody setting up the chroot. In the same way as if you were using nfs for portage: you do those mounts before the chroot.

The final system could do something which uses dm-cache or lvm's cache or whatever to make it so that files that are effectively downloaded from the NBD volume can be reused later without having to re-fetch.

A NBD server should be a simple affair. You're identifying bandwidth abusers at the block level ffs, people who try dd get throttled quickly and easily.
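
To make the throttling idea concrete, here is a minimal sketch of a per-client token bucket. It assumes the server can see each request's byte length and the client's address. The class names, the rates, and the injectable clock are illustrative assumptions, not part of any existing NBD server.

```python
import time


class TokenBucket:
    """Per-client byte budget: refills at `rate` bytes/s up to `burst` bytes."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self, nbytes):
        """Return True and spend tokens if the read fits the current budget."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


class Throttle:
    """Keeps one bucket per client address (hypothetical policy values)."""

    def __init__(self, rate=1_000_000, burst=4_000_000, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets = {}

    def allow(self, client_ip, nbytes):
        if client_ip not in self.buckets:
            self.buckets[client_ip] = TokenBucket(self.rate, self.burst, self.clock)
        return self.buckets[client_ip].allow(nbytes)
```

A client that reads the whole device in one go drains its bucket quickly and is held to the refill rate, while occasional readers never notice the limit.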

And you'd have the only distro as far as I can tell that allows users some peace of mind as they perform updates of their system. Everybody else seems to either demand simultaneous access to your home directory with the tls connection back to the mothership or works hard to break any proxy solution the community comes up with to gain some measure of control over their data.

The install cd combined with the read-only repository drive should provide excellent privacy for its users.

Moved from "Portage & Programming" to "Gentoo Chat". --Zucca
_________________
We are the block device. The kernel is our client.

John R. Graham
Administrator
Joined: 08 Mar 2005
Posts: 10716
Location: Somewhere over Atlanta, Georgia

Posted: Sat Jan 11, 2025 7:45 pm

I've tried it and I didn't like it—with the repositories, at least. My own personal experience is that Portage performance with a remotely mounted repository is dramatically lower than with a local copy. This is principally during the dependency resolution process. I do sync my home server against Gentoo and then sync all my other machines against my home server, but that's more out of respect for Gentoo's donated bandwidth than anything else. My home server also shares its /var/cache/distfiles via NFS with the rest of the network for the same reason.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.


Last edited by John R. Graham on Mon Jan 13, 2025 12:31 am; edited 1 time in total

Hu
Administrator
Joined: 06 Mar 2007
Posts: 23046

Posted: Sat Jan 11, 2025 8:25 pm

What problem is this solving? If users need to avoid revealing what they are reading, then asking them to read it over the Internet from a network block device is not an improvement over asking them to read it over the Internet using a TLS-encrypted connection. They need a way to get a purely local copy and do so without revealing what they are reading. Portage already caches everything it reasonably can, using local files, not a mirror of a block device.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Sun Jan 12, 2025 12:30 am

Performance would be solved using dm-cache. Yes the first sync pulls in the whole tree but we're doing that already anyways. Once all of the bits are sitting in your local cache, speeds should be back to normal.

And nobody is asking to avoid revealing what THEY are reading. The point is to be able to update a Linux installation without letting THE PACKAGE MANAGER read any of the data the user has on the system.

I should be able to update my Linux system without exposing my data. This does that.

And it seems so simple.
_________________
We are the block device. The kernel is our client.

Hu
Administrator
Joined: 06 Mar 2007
Posts: 23046

Posted: Sun Jan 12, 2025 1:24 am

I don't see how this solves your stated problem. If you don't want the package manager poking around in the user's home directories, give those home directories permission bits that prevent the Linux user portage from searching them. That can be done just as readily with the current Portage design as with your proposal.
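
As an illustration of the permission-bit approach, here is a minimal sketch that checks whether a directory is closed to group and others, and tightens it to mode 700 if not. The function names are hypothetical helpers written for this example.

```python
import os
import stat


def is_private(path):
    """True if neither group nor others have any access bits on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def make_private(path):
    """Set `path` to mode 700 so only its owner can enter or list it."""
    os.chmod(path, stat.S_IRWXU)
```

Mode bits only constrain unprivileged accounts. Anything running as root is not stopped by them, which is where the namespace approach would come in.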

If you want everything cached by dm-cache, would it not be better to download everything in one go to local storage that knows it is primary, rather than letting the cache slowly populate itself as it is accessed? Your proposal does not seem simple to me. It relies on at least two kernel features that most people otherwise do not need to enable. It relies on infrastructure no one is running yet. It seems to me that its performance will at best be equal to the current setup, but likely worse due to demand-loading parts of the tree over time instead of eagerly loading the entire tree into local storage at emerge --sync time.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Sun Jan 12, 2025 1:27 am

John R. Graham wrote:
I do sync my home server against Gentoo and then sync all my other machines against my home server, but that's more out of respect for Gentoo's donated bandwidth than anything else.


I just want to expand on this because I'm trying to do something similar.

I have the home server running Gentoo and it is hosting both /var/db/repos/gentoo and /var/cache/distfiles over nfs. Going to install Gentoo on a new machine connected to that server is a delight; I just remember to perform the nfs mounts prior to the chroot, and then get to skip the sections of the handbook covering how to set up portage and distfiles.

The problem of course is that the home server is running with a limited number of installed packages. The new machine is the laptop and I hope to have sway, Firefox, whichever Doom we're on, etc. I also hope to never access the Internet with this device, leaving all such things to the home server.

Now we arrive at the fork in the road. When you say you sync your home server against Gentoo, do you also sync your distfiles?

If you do, you have a very easy time of it going forward. Any package you want to emerge on the workstation you get to, easily and without fuss.

If, on the other hand, you are not syncing your distfiles, then you have to manually fetch the needed packages from the home server. The current solution to that appears to be running the output of emerge -p on the target machine through some sed/perl/python and then sneakernetting the result onto the host machine where you do an emerge -f --nodeps to actually get the files.
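
The filtering step could look roughly like the sketch below. It assumes the usual emerge -p layout, where each package line starts with a bracketed "[ebuild ...]" status field followed by category/package-version. The function name and the regular expression are illustrative, and other output modes (for example --columns) are not handled.

```python
import re

# Matches lines such as "[ebuild  N     ] app-editors/vim-9.0.2167  USE=..."
_LINE = re.compile(r"^\[ebuild[^\]]*\]\s+(\S+)")


def atoms_from_pretend(output):
    """Turn `emerge -p` text into a list of exact-version atoms (=cat/pkg-ver)."""
    atoms = []
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            # Drop any ":slot" or "::repo" suffix that follows the version.
            cpv = match.group(1).split(":", 1)[0]
            atoms.append("=" + cpv)
    return atoms
```

The resulting list is what would be carried over to the connected machine and handed to emerge -f --nodeps there.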

And it works! I am happily on my way to a Gentoo-powered home network!

I was in this situation before; I think I had the big bulky machine at home without Internet I wanted to update via the laptop I could take to the cafe. My solution then was to sync all of distfiles. This was frowned upon here. So I stopped doing it.

But it was great! I could be out in the sticks and have all of Gentoo at my disposal. I could wipe out my install through stupidity, but if the drive holding portage and the distfiles was still intact, I could claw my way back to a running system. Without the Internet.

Anyways, I am very happy with the way my setup is working out, but it could have been so much simpler, is I guess what I'm saying. I want to respect Gentoo's bandwidth too. There's some weird stuff in the distfiles, have you looked? 99% of it is stuff I will never install.

But I don't know in advance which packages make up the 1% that I will install. The ability to simply mount that critical volume over the Internet to quickly and safely satisfy those dependencies reeks of elegance.
_________________
We are the block device. The kernel is our client.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Sun Jan 12, 2025 1:57 am

Hu wrote:
I don't see how this solves your stated problem. If you don't want the package manager poking around in the user's home directories, give those home directories permission bits that prevent the Linux user portage from searching them. That can be done just as readily with the current Portage design as with your proposal.


It involves my having the Gentoo machine connected to the Internet. I appreciate what you're saying about the effectiveness of discretionary access control, but the surface area of that solution is enormous when compared to the solution of simply never connecting to the Internet, or, in that case, making that connection as safe as it can possibly be.

Hu wrote:
If you want everything cached by dm-cache, would it not be better to download everything in one go to local storage that knows it is primary, rather than letting the cache slowly populate itself as it is accessed?


Ok, I can withdraw that part of the recommendation. Nothing wrong with just downloading the snapshot to the local device. Being able to have just one copy on the local network shared across all devices is nice and all, but you're right, no real advantage over the local copy. Other than having to maintain a local copy that is.

Simplifying access to the distfiles addresses the pain point.

Hu wrote:
Your proposal does not seem simple to me. It relies on at least two kernel features that most people otherwise do not need to enable...

nbd and dm-cache should already be part of any self-respecting kernel, c'mon.

Hu wrote:
It relies on infrastructure no one is running yet.

A nbd server is a nothing thing. Ten lines of Python, tops. Twenty if you want to throttle abusers.

But wait, there's more!

Why can't the Gentoo install CD simply be GRUB and a kernel with nbd enabled, that's given a nbd server on the Internet to load the initramfs from? EDIT: ok the initramfs would have to be on the stick but it can mount the nbd device and then you can switch_root to that.

Once a Gentoo user has this thing burned onto a USB drive, it's good for life and they never have to touch another thing. The image used for the initramfs can be updated constantly. It loads them into a shell where /var/db/repos/gentoo and /var/cache/distfiles are already mounted and they can then do partitioning/formatting, etc. stage3 would be hosted on yet another nbd volume.

Two little [x] things in the kernel but they give so much!
_________________
We are the block device. The kernel is our client.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Sun Jan 12, 2025 5:42 am

Immediately prior to this install I was running on Fedora's Sway Atomic spin.

rpm-ostree. Sure you guys know all about it. But aren't they missing the point?

If everybody is running the same root, and /home and /mnt are all stuffed into /var and made into overlays, then why aren't we just publishing that root as a nbd volume and letting everyone use a combination of a local dm-cache over the nbd mount and then put the overlays on top of that, hosted on the local system?

It isn't an install CD anymore. You're letting people download boot devices. The internal ssd gets divided between dm-cache, the root overlay, and home.
_________________
We are the block device. The kernel is our client.

NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 54766
Location: 56N 3W

Posted: Sun Jan 12, 2025 12:28 pm

nokilli,

Distfiles and the repo(s) are not static. The master mirror updates every 30 minutes. The mirrors update at the whim of the mirror admins, which is not under Gentoo's control.

How do you deal with the cases where things change under you?
e.g.
Portage calculates the dependency tree but an ebuild has been removed before it's needed for the build?
Similarly for distfiles?

Non atomic mirror updates are a problem now, emerge --sync can fail as the mirror being used changed during the process.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

John R. Graham
Administrator
Joined: 08 Mar 2005
Posts: 10716
Location: Somewhere over Atlanta, Georgia

Posted: Mon Jan 13, 2025 2:01 am

nokilli wrote:
Now we arrive at the fork in the road. When you say you sync your home server against Gentoo, do you also sync your distfiles?

If you do, you have a very easy time of it going forward. Any package you want to emerge on the workstation you get to, easily and without fuss.
Of course not. That would be a massive overuse of Gentoo's donated bandwidth, and would be fetching a lot of files I'd never use. /var/cache/distfiles from the server is mounted on all my satellite machines via NFS. Whichever machine builds a package first fetches it--just once--for potential use of all the rest.

nokilli wrote:
If, on the other hand, you are not syncing your distfiles, then you have to manually fetch the needed packages from the home server. The current solution to that appears to be running the output of emerge -p on the target machine through some sed/perl/python and then sneakernetting the result onto the host machine where you do an emerge -f --nodeps to actually get the files
Not necessary for the same reason as above. Incidentally, I see very little performance degradation with NFS when reading one big file (the source tarball) as compared to when I was reading thousands of little files (the repository). It all works extremely well.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

Hu
Administrator
Joined: 06 Mar 2007
Posts: 23046

Posted: Mon Jan 13, 2025 3:36 am

nokilli wrote:
It involves my having the Gentoo machine connected to the Internet.
If you don't want to be on the Internet, how does having servers on the Internet expose their data as a Network Block Device help you? You would need to be on the Internet to reach them.
nokilli wrote:
I appreciate what you're saying about the effectiveness of discretionary access control, but the surface area of that solution is enormous when compared to the solution of simply never connecting to the Internet, or, in that case, making that connection as safe as it can possibly be.
The discretionary access control is very easy though: set all user home directories to mode 700. Alternatively, run Portage in a namespace where the home directories are not visible, in which case you don't need users to manage their permissions.
nokilli wrote:
Ok, I can withdraw that part of the recommendation. Nothing wrong with just downloading the snapshot to the local device. Being able to have just one copy on the local network shared across all devices is nice and all, but you're right, no real advantage over the local copy. Other than having to maintain a local copy that is.
This is a bit confusing. You seem to be mixing whether we are caching a partial local mirror or caching a complete public mirror. As others explained, caching a mirror of the ebuilds is practical (although caching it at the block layer is needless complexity). Caching all the distfiles is not practical. There is a reason Portage only downloads distfiles as needed.
nokilli wrote:
Hu wrote:
It relies on at least two kernel features that most people otherwise do not need to enable...
nbd and dm-cache should already be part of any self-respecting kernel, c'mon.
Any self-respecting kernel should enable only the features that are needed to serve its assigned workload. Enabling more than that is needless attack surface. I cannot recall ever needing to enable DM-Cache in any of my kernels, for any of my workloads. I think I experimented briefly with NBD for virtual machines, and ultimately found it not to be worthwhile.
nokilli wrote:
Hu wrote:
It relies on infrastructure no one is running yet.
A nbd server is a nothing thing. Ten lines of Python, tops. Twenty if you want to throttle abusers.
In that case, could you post the 10 line version here? I know Python's standard library is extensive, but I was not aware it included an NBD server.
nokilli wrote:
Why can't the Gentoo install CD simply be GRUB and a kernel with nbd enabled, that's given a nbd server on the Internet to load the initramfs from?
Such an install CD would be useless for offline work. You previously stated you want to not connect your machine to the Internet, so such a CD could not work for you.
nokilli wrote:
Once a Gentoo user has this thing burned onto a USB drive, it's good for life and they never have to touch another thing.
As long as no one renames the host server, or stops hosting it, or replaces the kernel on the server with one that is not compatible with the user's hardware.

Also, your proposal assumes the user's Internet bandwidth is plentiful, fast, and unmetered. Some people would prefer to download everything to local storage once, then use it across multiple machines in a repeatable fashion. A system that relies on an Internet server at every boot fails for that use case.
nokilli wrote:
The image used for the initramfs can be updated constantly.
Gentoo does provide relatively frequent stage3 updates. Users are welcome to download those, or not, as needed.
nokilli wrote:
It loads them into a shell where /var/db/repos/gentoo and /var/cache/distfiles are already mounted and they can then do partitioning/formatting, etc. stage3 would be hosted on yet another nbd volume.

Two little [x] things in the kernel but they give so much!
I fail to see how the substantial server support you are requesting is worthwhile considering the, in my opinion, minimal benefit it provides to users.
nokilli wrote:
rpm-ostree. Sure you guys know all about it.
No, I don't know all about it. I have not looked seriously at Fedora in many years, because they are not Gentoo.
nokilli wrote:
If everybody is running the same root, and /home and /mnt are all stuffed into /var and made into overlays, then why aren't we just publishing that root as a nbd volume and letting everyone use a combination of a local dm-cache over the nbd mount and then put the overlays on top of that, hosted on the local system?
Why would we? That seems like added complexity for no benefit. Also, as above, if this is an NBD volume, then it can only be accessed if you have a working Internet connection. Some people want their system to be usable offline.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Mon Jan 13, 2025 10:44 am

Let's imagine this nbd server exists and holds the entire set of distfiles.

So to be clear, this is a block device that has been formatted as ext4. I can mount this on my home server, do a directory listing, and see that all of the distfiles are there.

Now I create a new block device. I'm going to give this to dm-cache, and it's going to sit in front of the nbd. I will mount this volume as /var/cache/distfiles.

Every time I read a file from /var/cache/distfiles, if it's in the cache I get the cached version, if it's not, it gets fetched from the nbd device.

My home server is connected to the Internet.

My laptop is not. My laptop can connect to the home server, but that's it, e.g., ip_forwarding is off, and the firewall otherwise makes access impossible.

But because my laptop is connected to the home server, I can mount the /var/cache/distfiles that sits on the server from my laptop and do my emerges from that. It isn't connected to the Internet, but I can still easily emerge any file I like. Most of the time the requests are satisfied by the cache, but for the few packages that aren't, it still works because the home server can fall back to the nbd volume because IT is connected to the Internet.

You guys seem to think this means moving around big amounts of data. This is just like a hard drive. Just because I'm mounting it, doesn't mean I'm reading every byte off of the device. And the same holds true for the nbd device. It's a lazy cache.
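
The lazy behaviour described here can be modelled with a small read-through cache: a read touches only the blocks it covers, and each block crosses the network at most once. This is a deliberately simplified sketch. The class name and the fetch_block callback are assumptions, and the real dm-cache decides what to keep through its own policy rather than caching every block on first read.

```python
class ReadThroughCache:
    """Serve fixed-size blocks locally, fetching each from the backend once."""

    def __init__(self, fetch_block, block_size=4096):
        self.fetch_block = fetch_block   # callable: block index -> bytes
        self.block_size = block_size
        self.blocks = {}                 # stand-in for the local cache device
        self.remote_reads = 0            # how many blocks came over the network

    def _block(self, index):
        if index not in self.blocks:
            self.blocks[index] = self.fetch_block(index)
            self.remote_reads += 1
        return self.blocks[index]

    def read(self, offset, length):
        """Return `length` bytes at `offset`, touching only the blocks needed."""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        data = b"".join(self._block(i) for i in range(first, last + 1))
        start = offset - first * self.block_size
        return data[start:start + length]
```

Mounting a filesystem on top of such a device would pull in the superblock and directory blocks first, and file contents only when something actually reads them.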
_________________
We are the block device. The kernel is our client.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Mon Jan 13, 2025 10:59 am

This is what a LLM gives us as a simple nbd server in Python:

Code:

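import asyncio
import os
import struct

NBD_REQUEST_MAGIC = 0x25609513
NBD_REPLY_MAGIC = 0x67446698
NBD_CMD_READ = 0
NBD_FLAG_HAS_FLAGS, NBD_FLAG_READ_ONLY = 1, 2
EINVAL = 22

class SimpleNBDServer:
    def __init__(self, filename, host='127.0.0.1', port=10809):
        self.filename = filename
        self.host = host
        self.port = port

    async def start(self):
        self.server = await asyncio.start_server(self.handle_client, self.host, self.port)
        print(f'Serving {self.filename} on {self.host}:{self.port}')
        async with self.server:
            await self.server.serve_forever()

    async def handle_client(self, reader, writer):
        # Old-style handshake: "NBDMAGIC", cliserv magic, export size,
        # flags, then 124 reserved zero bytes (152 bytes in total).
        # Old-style negotiation is deprecated; current clients may require newstyle.
        size = os.path.getsize(self.filename)
        flags = NBD_FLAG_HAS_FLAGS | NBD_FLAG_READ_ONLY
        writer.write(b'NBDMAGIC' + struct.pack('>QQL', 0x00420281861253, size, flags) + b'\x00' * 124)
        await writer.drain()

        with open(self.filename, 'rb') as f:  # the export is read-only
            while True:
                try:
                    # A request header is exactly 28 bytes
                    req = await reader.readexactly(28)
                except asyncio.IncompleteReadError:
                    break

                # magic (4), command flags (2), type (2), handle (8), offset (8), length (4)
                magic, _, request_type, handle, offset, length = struct.unpack('>LHH8sQL', req)
                if magic != NBD_REQUEST_MAGIC:
                    print('Invalid request')
                    break
                if request_type != NBD_CMD_READ:
                    break  # disconnect, or a command this read-only export does not serve

                if offset + length > size:
                    # Out-of-range read: reply with an error and no payload
                    writer.write(struct.pack('>LL8s', NBD_REPLY_MAGIC, EINVAL, handle))
                else:
                    f.seek(offset)
                    data = f.read(length)
                    writer.write(struct.pack('>LL8s', NBD_REPLY_MAGIC, 0, handle) + data)
                await writer.drain()
        writer.close()

    def run(self):
        asyncio.run(self.start())

if __name__ == "__main__":
    server = SimpleNBDServer("example.img")
    server.run()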


That's about right, I'm familiar with asyncio, and this looks pretty good. I'd put a priority queue in there, hold tasks in there indexed by ip address, and then emit blocks based on who is using the server the least. That's probably another 10 lines.
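
A rough sketch of that scheduling idea follows: pending reads are held per client address, and the next one served belongs to whichever client has received the fewest bytes so far. For brevity it uses a linear scan over clients where a real server would use a priority queue. FairQueue and its method names are hypothetical.

```python
from collections import defaultdict, deque


class FairQueue:
    """Hand out pending reads, favouring the client that has received the least."""

    def __init__(self):
        self.sent = defaultdict(int)        # client ip -> bytes served so far
        self.pending = defaultdict(deque)   # client ip -> queued (offset, length)

    def push(self, ip, offset, length):
        """Queue a read request on behalf of `ip`."""
        self.pending[ip].append((offset, length))

    def pop(self):
        """Return (ip, offset, length) for the least-served client with work.

        Raises ValueError if nothing is queued.
        """
        waiting = (ip for ip, queue in self.pending.items() if queue)
        ip = min(waiting, key=lambda candidate: self.sent[candidate])
        offset, length = self.pending[ip].popleft()
        self.sent[ip] += length
        return ip, offset, length
```

With this policy a client streaming the whole device only gets served between the requests of lighter users, so it slows itself down without a hard limit.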

It's a super simple thing. It is true that nbd devices aren't commonly used in this way. Most people don't compile their own gcc either. How this gets exploited in a way that a web server or an rsync server doesn't is not obvious to me at all. This is a much simpler protocol. The input is easily checked. You're reading data off of a block device and sending it unmolested over the Internet. Remember that it's a read-only device and it doesn't have to even think about file systems or anything like that.
_________________
We are the block device. The kernel is our client.

pingtoo
Veteran
Joined: 10 Sep 2021
Posts: 1444
Location: Richmond Hill, Canada

Posted: Mon Jan 13, 2025 11:51 am

nokilli wrote:
Let's imagine this nbd server exists and holds the entire set of distfiles.

So to be clear, this is a block device that has been formatted as ext4. I can mount this on my home server, do a directory listing, and see that all of the distfiles are there.

Now I create a new block device. I'm going to give this to dm-cache, and it's going to sit in front of the nbd. I will mount this volume as /var/cache/distfiles.

Every time I read a file from /var/cache/distfiles, if it's in the cache I get the cached version, if it's not, it gets fetched from the nbd device.

My home server is connected to the Internet.

My laptop is not. My laptop can connect to the home server, but that's it, e.g., ip_forwarding is off, and the firewall otherwise makes access impossible.

But because my laptop is connected to the home server, I can mount the /var/cache/distfiles that sits on the server from my laptop and do my emerges from that. It isn't connected to the Internet, but I can still easily emerge any file I like. Most of the time the requests are satisfied by the cache, but for the few packages that aren't, it still works because the home server can fall back to the nbd volume because IT is connected to the Internet.

You guys seem to think this means moving around big amounts of data. This is just like a hard drive. Just because I'm mounting it, doesn't mean I'm reading every byte off of the device. And the same holds true for the nbd device. It's a lazy cache.
Why the complexity of having dm-cache on top of nbd-client to nbd-server? If you have network connectivity, why not just use NFS and maybe add fs-cache on top? That covers your local network file distribution. For Internet access, use FUSE over rsync or FUSE over git; this way the Gentoo servers do not need to spin up any new service.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Mon Jan 13, 2025 12:36 pm

How the hell did I miss Neddy's post.

Yes, dm-cache is stupid for the distfiles-only use case, my apologies for what appears to be wasting valuable developer time.

I would point out that this still sort-of works for my use case if you replace dm-cache and just daisy-chain the nbd servers.

So Gentoo publishes the distfiles over nbd somewhere, my home server mounts that, and publishes that using a local nbd server, and my laptop then mounts that.

That satisfies my desire to not make direct connections to the Internet, but then it always goes to the nbd server.

pingtoo mentions fs-cache; honestly I have no experience with that, but that sounds like an obvious first place to look.

Please do not waste any more of your time with this unless/until I have a demo that works.

Or even then. ;)

---

You cannot be too paranoid today. But I guess it doesn't hurt to dress it up a little nicer.
_________________
We are the block device. The kernel is our client.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Tue Jan 14, 2025 3:06 am

pingtoo suggests fuse over rsync or fuse over git. Since my concern is the distfiles I don't believe fuse over git works (the git repositories don't hold distfiles, right?)

Fuse over rsync looks interesting. There's also a couple of fuse over http implementations.

Can I just think out loud here?

Taking Gentoo's bandwidth concerns seriously, isn't the problem that a package gets updated with a few small file changes and then it gets tarred up with everything else and we're all made to download this huge thing when what we could really be just downloading the file changes?

So it isn't an nbd volume holding distfiles. It's an nbd volume holding all of the project trees. Shouldn't portage updating the local project then be a matter of just rsyncing with the new project tree? In other words, ebuild fetch which does some sort of wget thing now would instead do an rsync from the project tree on the remotely mounted nbd volume onto the project tree stored on the local drive.

Shouldn't this mean dramatically lower bandwidth use?

It's a huge volume, yes, but because of the way we access it less data is actually retrieved.
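
A back-of-the-envelope way to test that claim is to compare two versions of a source tree and count only the files that are new or changed. The sketch below does this over in-memory trees; the helper names are hypothetical. One caveat worth stating: when both source and destination are local paths, which is how a mounted NBD volume would look, rsync defaults to copying changed files whole rather than sending deltas, so whole-file copies are what the sketch counts.

```python
import hashlib


def manifest(tree):
    """Map each path in `tree` (path -> bytes) to (size, sha256 hex digest)."""
    return {path: (len(data), hashlib.sha256(data).hexdigest())
            for path, data in tree.items()}


def changed_files(old, new):
    """Paths in `new` that are absent from `old` or differ in content."""
    return sorted(path for path, meta in new.items() if old.get(path) != meta)


def transfer_size(old, new):
    """Bytes to fetch when only new or changed files are copied whole."""
    return sum(new[path][0] for path in changed_files(old, new))
```

Comparing transfer_size against the size of the new release tarball for a few real packages would show whether the saving is as large as hoped. Note that deciding which files changed also costs metadata reads on the remote volume.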

And you get your caching for free from portage as it does the rsync from the remote mount to the local drive.

Stop tarring and zipping things please. :)
_________________
We are the block device. The kernel is our client.

nokilli
Apprentice
Joined: 25 Feb 2004
Posts: 235

Posted: Tue Jan 14, 2025 10:28 am

Ok, so you'd keep the tarballs for initial installations.

The thing is, if you were to do this, you'd need to recreate the tarballs on the client both for long-term storage but also to verify the signatures because it's the tarballs that get signed.

I imagine that tar is stable, but then untarring, rsyncing, and retarring... would that always result in the same image? It would need to. If it does, we're good.
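
Byte-identical output is not automatic, because a tar archive records entry order, modification times, and owner names alongside the file contents. The sketch below shows the normalization that makes an archive depend only on the tree's contents and mode bits. It is an assumption-laden illustration, not Portage's or any upstream's procedure: it archives regular files only, and it cannot reproduce an upstream tarball unless upstream's exact tar and compression settings are known and matched.

```python
import io
import os
import tarfile


def _normalize(info):
    """Strip metadata that varies between machines and runs."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def deterministic_tar(root):
    """Return uncompressed tar bytes for `root` with sorted, normalized entries."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # fix the traversal order
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, root)
                tar.add(full, arcname=arcname, recursive=False, filter=_normalize)
    return buf.getvalue()
```

Two machines holding the same files would then produce the same bytes, which is the property signature checking needs. Whether those bytes match the signed original is a separate question.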

NeddySeagoon wrote:
nokilli,

Distfiles and the repo(s) are not static. The master mirror updates every 30 minutes. The mirrors update at the whim of the mirror admins, which is not under Gentoo's control.

How do you deal with the cases where things change under you?
e.g.
Portage calculates the dependency tree but an ebuild has been removed before it's needed for the build?
Similarly for distfiles?

Non atomic mirror updates are a problem now, emerge --sync can fail as the mirror being used changed during the process.


If I'm right about the latest spin on this approach, the bandwidth savings might justify making some kind of architectural change.

I'm not a server admin. I screwed around with AWS a bunch of times and I know they allow you to create static images which can then immediately be served. My guess/my hope would be that you would always have two images hosted. One for the benefit of users who are newly downloading material, and the other for the benefit of those who are still in progress with their downloads, so either nobody ever gets into that trouble, or if they do it's rare.

You'd be trading doubling the disk space requirement for whatever the potential reductions in bandwidth use you'd gain, but you'd be giving the user the better experience.
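
The two-image idea can be sketched as a store of immutable generations: a client pins the newest generation when it starts and keeps reading from it even if a newer one is published in the meantime. SnapshotStore and its methods are hypothetical names for illustration, with in-memory dictionaries standing in for the hosted images.

```python
class SnapshotStore:
    """Keep the current and previous published images so readers can finish."""

    def __init__(self, keep=2):
        self.keep = keep
        self.generations = {}   # generation id -> mapping of path -> bytes
        self.latest = 0

    def publish(self, files):
        """Publish a new generation and retire any older than the last `keep`."""
        self.latest += 1
        self.generations[self.latest] = dict(files)
        for old in [g for g in self.generations if g <= self.latest - self.keep]:
            del self.generations[old]
        return self.latest

    def open(self):
        """Pin the newest generation; a client uses this id for its whole session."""
        return self.latest

    def read(self, generation, path):
        """Read from a pinned generation; KeyError means it has been retired."""
        return self.generations[generation][path]
```

With two generations kept, a session survives one update of the mirror. A session that spans two updates would still fail, which matches the "rare, but possible" outcome described above.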

What is more expensive? Disk space or bandwidth? Honestly I don't know.

I would point out that in this latest iteration rsync is performed locally. So I have to imagine it's not just bandwidth that is being saved.
_________________
We are the block device. The kernel is our client.

pingtoo
Veteran
Joined: 10 Sep 2021
Posts: 1444
Location: Richmond Hill, Canada

Posted: Tue Jan 14, 2025 12:04 pm

nokilli wrote:
pingtoo suggests fuse over rsync or fuse over git. Since my concern is the distfiles I don't believe fuse over git works (the git repositories don't hold distfiles, right?)

Fuse over rsync looks interesting. There's also a couple of fuse over http implementations.

Can I just think out loud here?

Taking Gentoo's bandwidth concerns seriously, isn't the problem that a package gets updated with a few small file changes and then it gets tarred up with everything else and we're all made to download this huge thing when what we could really be just downloading the file changes?

So it isn't an nbd volume holding distfiles. It's an nbd volume holding all of the project trees. Shouldn't portage updating the local project then be a matter of just rsyncing with the new project tree? In other words, ebuild fetch which does some sort of wget thing now would instead do an rsync from the project tree on the remotely mounted nbd volume onto the project tree stored on the local drive.

Shouldn't this mean dramatically lower bandwidth use?

It's a huge volume, yes, but because of the way we access it less data is actually retrieved.

And you get your caching for free from portage as it does the rsync from the remote mount to the local drive.

Stop tarring and zipping things please. :)


So your main concern is distfiles.

Are you just proposing another transport method (the NBD protocol) instead of the manual on-demand fetch (wget/curl)? I am at a loss trying to follow your conversation.

It seems strange to me to want the entire content copied locally instead of having only the portion that I want to use. Maybe I am misunderstanding your thoughts.

Hu
Administrator

Joined: 06 Mar 2007
Posts: 23046

Posted: Tue Jan 14, 2025 2:53 pm

Bandwidth is a bigger concern than disk space. First, some users are on metered connections, so every gibibyte counts for them. Second, the server-side bandwidth is being donated to Gentoo by organizations that are (probably) paying their ISPs for the transfers, so etiquette dictates that Gentoo users should consume as little of this bandwidth (and, by proxy, the donor's money) as we can.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 235

Posted: Tue Jan 14, 2025 3:04 pm

The proposal at this point is to host distfiles on a nbd server, with a file layout just as it is now, with one exception: every tarball is accompanied by the source directory that was used to create it.

So this is a really big block device that's been formatted as ext4(?) that holds this huge amount of content.

The idea is to let users mount this just as if it's a local drive. When they do this they're not downloading the entire drive! They're just downloading a limited number of blocks to load the root directory, in a quantity that is probably little different than the html index page alternative.
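To illustrate the point that mounting is not the same as downloading, here is a minimal sketch in Python. It is a toy stand-in and not NBD itself: `CountingBlockDevice` and `BLOCK_SIZE` are hypothetical names, and the 4096-byte block size is an assumption. All it shows is that a read touches only the blocks it covers.

```python
BLOCK_SIZE = 4096  # assumed block size, for illustration only


class CountingBlockDevice:
    """Toy stand-in for a remote read-only volume that records fetched blocks."""

    def __init__(self, image):
        self.image = image
        self.fetched = set()  # block numbers that had to cross the "network"

    def read(self, offset, length):
        """Return length bytes at offset (length > 0) and note the blocks touched."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        self.fetched.update(range(first, last + 1))
        return self.image[offset:offset + length]


# A pretend 4 MiB volume made of 1024 blocks.
device = CountingBlockDevice(bytes(1024 * BLOCK_SIZE))
total_blocks = len(device.image) // BLOCK_SIZE

device.read(0, 1024)                       # small read inside block 0
device.read(10 * BLOCK_SIZE + 100, 8000)   # read that spans blocks 10 and 11

print(len(device.fetched), "of", total_blocks, "blocks fetched")
```

A real file system on a real NBD volume also reads its own metadata (superblock, inodes, directory blocks), so the count would be higher than in this toy, but it still scales with what is accessed rather than with the size of the volume.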

Why do it this way? When a user installs a new package, they go to fetch the tarball, just as they do now. Nothing is different.

But when they go to update that package, we can do the following:

  • Extract the source from the tarball that sits on the local drive for the previous version of the package being emerged.
  • Rsync from the new version of the source for that project that sits on the nbd server to the source tree we just extracted.
  • Then create a new tarball from this new source tree, both to be able to repeat this process later, but also to allow us the ability to verify the signature in the next step...
  • Do your portage stuff.
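The rsync step above can be sketched in Python as follows. This is not rsync and not anything Portage does today: `sync_tree`, `quick_check_differs` and the file names are hypothetical. It mimics rsync's default quick check (same size and modification time means the file is skipped) and assumes the extracted local tree keeps the same modification times as the tree on the server.

```python
import os
import shutil
import tempfile


def quick_check_differs(src, dst):
    """Like rsync's default quick check: compare size and modification time."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_tree(new_tree, local_tree):
    """Copy changed files from new_tree to local_tree, drop extras, return bytes copied."""
    copied = 0
    wanted = set()
    for root, _dirs, names in os.walk(new_tree):
        rel_root = os.path.relpath(root, new_tree)
        os.makedirs(os.path.join(local_tree, rel_root), exist_ok=True)
        for name in names:
            rel = os.path.normpath(os.path.join(rel_root, name))
            wanted.add(rel)
            src = os.path.join(new_tree, rel)
            dst = os.path.join(local_tree, rel)
            if quick_check_differs(src, dst):
                shutil.copy2(src, dst)  # copy2 also carries over the mtime
                copied += os.path.getsize(src)
    # Remove local files that no longer exist in the new version.
    for root, _dirs, names in os.walk(local_tree):
        for name in names:
            rel = os.path.relpath(os.path.join(root, name), local_tree)
            if os.path.normpath(rel) not in wanted:
                os.remove(os.path.join(root, name))
    return copied


def write(path, data, mtime):
    """Create a file with fixed content and modification time."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(data)
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    remote = os.path.join(tmp, "remote", "prboom-a-b-d")  # stands in for the NBD mount
    local = os.path.join(tmp, "local", "prboom")          # extracted from the old tarball
    write(os.path.join(remote, "src", "main.c"), b"x" * 100_000, 1_700_000_000)
    write(os.path.join(remote, "data", "imp.skin"), b"new skin!", 1_700_000_500)
    write(os.path.join(local, "src", "main.c"), b"x" * 100_000, 1_700_000_000)
    write(os.path.join(local, "data", "imp.skin"), b"old skin", 1_700_000_000)
    write(os.path.join(local, "data", "stale.txt"), b"gone upstream", 1_700_000_000)

    copied = sync_tree(remote, local)
    stale_exists = os.path.exists(os.path.join(local, "data", "stale.txt"))
    print("bytes copied:", copied)
```

Only the changed skin file is copied; the large unchanged file is skipped on the strength of its metadata alone. Empty directories left behind are not cleaned up in this sketch.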

A portion of the distfiles directory on the local machine looks like this:
Code:

/distfiles
    ...
    /prboom
    /prboom-a-b-c.tar.bz2
    /prboom-a-b-d.tar.bz2
    /prboom-a-b-e.tar.bz2
    ...

And on the nbd server:
Code:

/distfiles
    ...
    /prboom-a-b-c
    /prboom-a-b-c.tar.bz2
    /prboom-a-b-d
    /prboom-a-b-d.tar.bz2
    /prboom-a-b-e
    /prboom-a-b-e.tar.bz2
    ...

The advantage?

Because right now, to update prboom you need to download the entire tarball. Even though you've got this other tarball sitting right there that contains 99% of the same stuff. It's a huge download, and for what? A new imp skin?

Whereas this way, to update prboom, you're doing an rsync from the new version on the nbd server to the local version you just extracted from the previous tarball.

You're just downloading the imp skin.

Yes, we'd be asking the user to bear a new burden of recreating the tarball. But it's a sharing the load kind of situation because the bandwidth savings could be enormous.

It's also optional. They can spend the blocks to keep the source tree available instead (I'd be one of these.)

---

We don't respect the network block device because it's a useless thing in Mac and Windows land. And Mac and Windows land is where all of the money is.

But this is Linux. This is Gentoo. The read-only network block device is state of the art as far as I'm concerned. Super simple. Super powerful.
_________________
We are the block device. The kernel is our client.


Last edited by nokilli on Tue Jan 14, 2025 3:54 pm; edited 1 time in total

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 235

Posted: Tue Jan 14, 2025 3:22 pm

Ok, enjoy the enthusiasm in my previous post, but there are still outstanding issues.

It's actually a wild assumption now that I think about it.

Can you sign a tarball, then see that tarball expanded into one project directory, rsync'd onto another project directory, re-tarred, and then pass signature verification?

There are the --delete-* options in rsync. Would that be enough?

Just one little bit out of place...
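The worry can be made concrete with a small Python sketch. It is an illustration only: `make_tar` and `digest` are made-up helpers, the archives are uncompressed and built in memory, and SHA-512 stands in for whatever checksum or signature is computed over the tarball bytes. The point it shows is that identical file contents do not give an identical tarball unless the metadata (here the modification time) and the member order match as well.

```python
import hashlib
import io
import tarfile


def make_tar(files, mtime):
    """Build an uncompressed tar in memory; files maps member name to bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):          # fixed member order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime              # same timestamp for every member
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def digest(data):
    """Checksum over the raw archive bytes."""
    return hashlib.sha512(data).hexdigest()


files = {
    "prboom/imp.skin": b"new imp skin",
    "prboom/main.c": b"int main(void) { return 0; }\n",
}

same_meta_a = digest(make_tar(files, mtime=0))
same_meta_b = digest(make_tar(files, mtime=0))
other_mtime = digest(make_tar(files, mtime=1))

print("same contents, same metadata: ", same_meta_a == same_meta_b)
print("same contents, different mtime:", same_meta_a == other_mtime)
```

So the --delete options only take care of stray files. Ownership, permissions, timestamps, member order, the tar format and the compressor's output would all have to be reproduced exactly for the re-created tarball to match byte for byte.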
_________________
We are the block device. The kernel is our client.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 235

Posted: Tue Jan 14, 2025 3:34 pm

Hu wrote:
Bandwidth is a bigger concern than disk space. First, some users are on metered connections, so every gibibyte counts for them. Second, the server-side bandwidth is being donated to Gentoo by organizations that are (probably) paying their ISPs for the transfers, so etiquette dictates that Gentoo users should consume as little of this bandwidth (and, by proxy, the donor's money) as we can.

Do you have any idea as to the breakdown of bandwidth usage to support first installations as opposed to updates?

The Linux kernel as a network client. It takes a while to get used to the idea.
_________________
We are the block device. The kernel is our client.

pingtoo
Veteran

Joined: 10 Sep 2021
Posts: 1444
Location: Richmond Hill, Canada

Posted: Tue Jan 14, 2025 3:38 pm

nokilli wrote:
The proposal at this point is to host distfiles on a nbd server, with a file layout just as it is now, with one exception: every tarball is accompanied by the source directory that was used to create it.

So this is a really big block device that's been formatted as ext4(?) that holds this huge amount of content.

The idea is to let users mount this just as if it's a local drive. When they do this they're not downloading the entire drive! They're just downloading a limited number of blocks to load the root directory, in a quantity that is probably little different than the html index page alternative.

Why do it this way? When a user installs a new package, they go to fetch the tarball, just as they do now. Nothing is different.

But when they go to update that package, we can do the following:

  • Extract the source from the tarball that sits on the local drive for the previous version of the package being emerged.
  • Rsync from the new version of the source for that project that sits on the nbd server to the source tree we just extracted.
  • Then create a new tarball from this new source tree, both to be able to repeat this process later, but also to allow us the ability to verify the signature in the next step...
  • Do your portage stuff.

A portion of the distfiles directory looks like this:

Code:

/distfiles
    ...
    /prboom
    /prboom-a-b-c.tar.bz2
    /prboom-a-b-d.tar.bz2
    /prboom-a-b-e.tar.bz2
    ...


The advantage?

Because right now, to update prboom you need to download the entire tarball. Even though you've got this other tarball sitting right there that contains 99% of the same stuff. It's a huge download, and for what? A new imp skin?

Whereas this way, to update prboom, you're doing an rsync from the new version on the nbd server to the local version you just extracted from the previous tarball.

You're just downloading the imp skin.

Yes, we'd be asking the user to bear a new burden of recreating the tarball. But it's a sharing the load kind of situation because the bandwidth savings could be enormous.

It's also optional. They can spend the blocks to keep the source tree available instead (I'd be one of these.)

---

We don't respect the network block device because it's a useless thing in Mac and Windows land. And Mac and Windows land is where all of the money is.

But this is Linux. This is Gentoo. The read-only network block device is state of the art as far as I'm concerned. Super simple. Super powerful.


Thank you for the explanation; I think I get the idea.

However, I don't see the benefit of doing this.

In your point 2, it seems to me you are suggesting that "distfiles" contain the upstream project's source code tree. But why? Under this suggestion, Gentoo would have to perform the sync from upstream and then generate a signature for the sync result, whereas in the current process Gentoo obtains an already-signed release tarball from upstream. So why put this burden on the Gentoo repository management team? Isn't it kind of redundant?

I get the feeling you are thinking of a method Portage already supports. It is called a "live" version and is usually marked as version 9999. Its function is to fetch the source from upstream as it stands at that moment and build from there. It does not guarantee a successful build, and not every package supports it. But I think what your suggestion may amount to is a modification to the current emerge program: a command-line option that pulls source code the way a live version does, but at a specific version tag in the upstream source RCS. That might do just what you suggested.

Is your proposal based in part on being able to review the distfiles tree content? Do you find there is a benefit to ad hoc viewing of the tree?

If we think of "distfiles" as a black box and don't care what its content is, does it still matter how the black box got filled? In other words, suppose my "distfiles" is on a ramdisk: every reboot I lose everything in it, and every time I rebuild or upgrade, something gets pulled into it without my having to understand or prepare the system (enabling NBD, mounting NBD, etc.). I would imagine this is much more beneficial for system development (by system I mean Gentoo Portage), and from the user's point of view it avoids having to be aware of additional requirements (for example, maybe needing to open an additional firewall port to support yet another protocol).

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 54766
Location: 56N 3W

Posted: Tue Jan 14, 2025 3:52 pm

nokilli,

This has to work on memory and CPU constrained systems too.
Not just modern AMD64 systems.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

nokilli
Apprentice

Joined: 25 Feb 2004
Posts: 235

Posted: Tue Jan 14, 2025 4:22 pm

NeddySeagoon wrote:
nokilli,

This has to work on memory and CPU constrained systems too.
Not just modern AMD64 systems.

Can the selection of file system influence this outcome?

Some combination of a file system and layout that allows the fewest blocks to be loaded in order to start loading other files and directories? Maybe even the CDROM file systems? It will be read-only after all.

It's already a crazy architectural change. And I'm so very, very sorry... my itch sort of led me to stumble into this other thing and now I don't know what I'm doing.

But it looks good!

Is it possible that memory and CPU constrained systems deserve their own trees in any case? What percentage of packages in the main tree are they capable of running?

The question is: what percentage of Gentoo's bandwidth is devoted to new installations as opposed to updating existing installations? What is that times money? I don't have these answers.

But I do love simple answers to hard problems. Especially when the same answer solves multiple hard problems at once. I have a privacy issue. You have a bandwidth issue. If this can seriously be a twofer, then why not look hard at it?

You have no idea how much respect I have for you and the developers.

Quote:
The goal of Gentoo is to design tools and systems that allow a user to do that work as pleasantly and efficiently as possible, as they see fit.


I want to do that.
_________________
We are the block device. The kernel is our client.

All times are GMT