g2user2024 n00b
Joined: 19 Nov 2024 Posts: 4
Posted: Wed Nov 20, 2024 1:38 am Post subject: Good fault tolerant and distributed fs
Looking for an open-source, Gentoo-packaged cluster filesystem. The last forum post about this is from around 2015.
The goal is for a small development cluster to be able to replicate data and to handle a system or disk being unavailable without clients noticing or pausing. To ease vm management and relocation, the vm images would be on the cluster filesystem. The filesystem must also support NFS-like semantics (almost POSIX, except for file locking).
The hardware (s1-s6 are the server names):
Code:
s1, s2  ryzen 5700 / 128G ram
s3      ryzen 5500 / 64G ram
s4-s6   odroid h3+ (pentium silver 2GHz) / 64G ram
3 physical networks (can handle 10GB rsyncs at 2.3Gb/s on the 2.5G nics):
Code:
lan  2.5G ethernet (s1-s6 w/ 2.5G nics, clients w/ 1G nics)
san  2.5G ethernet (s1-s6 w/ 2.5G nics)
dmz  1G ethernet (s1-s3 w/ 1G nics)
s1-s3 each have a 10Gb/s USB external enclosure (non-raid) w/ three 12T or 16T 7200rpm Seagate IronWolf drives. The disk performance is good enough at ~450MB/s using direct access. For future secure asset disposal, all of the drives are encrypted with LUKS.
All servers, vms and clients are time-synced via ntp to +/-80 microseconds. Backup is to separate media and cloud storage, so the filesystem is not intended as a backup mechanism.
We have tried / looked at:
- DRBD w/ raw /dev/drbdX passed to a vm which then exports via nfs
Works to a degree, but the root filesystems of the vms become read-only during a failover. No commands can be executed from a shell, so a vm reboot is required.
- VM creating a Raid 1 over three iSCSI targets, file system is on the raid, exported via nfs
Works, but slowly. The iSCSI traffic is on the san network but clients see 60MB/s transfers at best. iscsid can be finicky when the target goes away while there is a logged-in session. Otherwise pretty resilient, even with an abrupt target server halt. There is a single point of failure: the server running the export vm. As long as the nfs export for the vm images is up, vm relocation is just stopping a vm on one server and restarting it on another.
- CEPH with daemons in vms w/ raw partitions passed to the vms (OSD-OSD replication traffic on the san network)
Worked ok, but the monitors hammered their database disk: ~1TB written in 24hrs. We don't have space to add SSDs for the monitor databases and think that SSDs would get killed by write amplification. We don't know if it's possible to increase the interval between heartbeat checks for the monitors. In any case, the lan network will see large file transfers (up to 150GB) which congest it, and ceph doesn't seem to handle the congestion very well. Backup monitors kept getting kicked out of the quorum.
- Glusterfs a few years ago
We had lockups during large file transfers (3GB or larger). Red Hat will stop support for gluster at the end of this year and there appears to be no future for the project. This would be a good candidate if it was going to be supported and the packaging wasn't difficult.
- NFS exports via a vm running on a server w/ raw raid 1 passed to the vm
Performance is pretty good and as long as a vm comes back in a few minutes, the clients don't end up with stale file handles because of a server reboot for a kernel upgrade (servers/clients using NFS4). The downside is a single host is still a critical failure point and vm relocation requires moving disks/enclosures.
- Samba
As far as we can tell, samba doesn't by itself provide a cluster filesystem but relies on exporting a cluster filesystem so it can redirect clients when a server goes down.
- Direct server nfs exports
Fragile in that moving a filesystem requires moving the disks/enclosures and then hoping that the persistent tcp connections from the clients don't disrupt client mounts when a mount point's ip moves to a different server. In the case of a crash, the clients won't know to close their connections and then try reconnecting.
- gfarm
Somewhat like gluster in that the data is stored on a traditional fs but the metadata is in a postgres database (possibly replicated). Creating the server-part ebuild wasn't too bad. The project's tools assume systemd, and manual setup isn't described so we gave up on trying it out. Setting up the database seems to be embedded in the tools instead of separate sql files one could execute with psql.
We are curious as to what the Gentoo infrastructure team does for resiliency. Some other filesystems exist, e.g. xtreemfs, but they aren't in portage, do not support NFS-like file access, or are defunct.
We did find that ethernet flow control (aka pause) had to be turned off to make iSCSI reliable (see ethtool -a). We would get hangups and corruption otherwise. Found this info in an article about QNAP NAS performance tuning. We had flow control enabled while trying ceph and don't know if that affected the monitors.
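For reference, a minimal sketch of the flow-control change described above. ethtool -a shows a NIC's pause settings and ethtool -A changes them. The interface name san0 is a placeholder, and whether a given driver accepts "autoneg off" for pause frames is an assumption to check per NIC. The function only prints the commands unless DRY_RUN=0 is set.

```sh
# Sketch: turn Ethernet flow control (pause frames) off on a list of NICs.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

disable_pause() {
    for dev in "$@"; do
        cmd="ethtool -A $dev autoneg off rx off tx off"
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            # Apply the change, then show the resulting pause parameters.
            $cmd
            ethtool -a "$dev"
        fi
    done
}
```

Run as root with something like DRY_RUN=0 disable_pause san0. Settings made this way do not survive a reboot, so they would need to be reapplied from the network startup scripts.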

szatox Advocate
Joined: 27 Aug 2013 Posts: 3477
Posted: Wed Nov 20, 2024 1:07 pm
Quote: | CEPH with daemons in vms w/ raw partitions passed to the vms (OSD-OSD replication traffic on the san network)
Worked ok, but the monitors hammered their data base disk - ~1TB written in 24hrs. Don't have space to add SSDs for the monitor databases and think that SSDs would get killed because of write amplification. Don't know if its possible to increase the interval between heartbeat checks for the monitors. In any case, the lan network will see large file transfers (up to a 150GB) which congest the lan network. ceph doesn't seem to handle the congestion very well. Backup monitors kept getting kicked out the quorum. |
This is probably your best option, but there are ways to make it perform well or poorly.
CEPH monitors are very sensitive to disk latency (must be on ssd) and the whole thing is sensitive to network latency (a single 10Gb line is better than an aggregate link of multiple 1Gb lines even if you don't need the bandwidth).
Each OSD also needs fast SSD storage for its logs (journal/WAL) in addition to data storage (which can be HDD). If you lack disk slots, you _could_ get away with sharing a single SSD between a monitor and OSDs. You shouldn't do that, but you could, if you really need to.
Don't mix HDD-based OSDs with SSD-based OSDs in a single data pool; hdds will make ssds feel slow rather than the other way around.
Ceph also requires clocks to be synchronized with each other; it's more important for all nodes to agree on what the current time is than for them to be correct. Use a local ntp server or make all nodes ntp peers.
Killing SSDs hasn't been a problem on farms I used to work with.
How many monitors did you deploy to choke that network with heartbeats? 3 is all you need to maintain the quorum in case any of them disappears without a goodbye. Don't go above that number unless you actually have a reason to do so.
There is also a bunch of parameters you can fine-tune. Like, AFAIR scrubbing was too aggressive by default.
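As one example of that kind of tuning, here is a hedged sketch that prints ceph config set commands to confine scrubbing to a night window and slow it down. The option names (osd_scrub_begin_hour, osd_scrub_end_hour, osd_scrub_sleep, osd_max_scrubs) are standard OSD settings; the values are assumptions for illustration, not recommendations from this thread.

```sh
# Sketch: print commands that restrict Ceph scrubbing to a time window.
# Arguments: start hour and end hour (0-23); defaults 22 and 6 are assumed.
scrub_tuning_cmds() {
    begin=${1:-22} end=${2:-6}
    echo "ceph config set osd osd_scrub_begin_hour $begin"
    echo "ceph config set osd osd_scrub_end_hour $end"
    # Pause between scrub chunks (seconds) and cap concurrent scrubs per OSD.
    echo "ceph config set osd osd_scrub_sleep 0.1"
    echo "ceph config set osd osd_max_scrubs 1"
}
```

Review the output of scrub_tuning_cmds first; on a node with an admin keyring it could then be piped to sh.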
Quote: | Glusterfs a few years ago
We had lockups during large file transfers (3GB or larger). Red Hat will stop support for gluster at the end of this year and there appears to be no future for the project. This would be a good candidate if it was going to be supported and the packaging wasn't difficult. |
I think it was designed with at most 10 nodes in mind and can't maintain performance past that point. I don't know at which size it begins to degrade. Effectively, it has been replaced by ceph.
Quote: | DRBD w/ raw /dev/drbdX passed to vm which then exports via nfs
Works to a degree, but the root filesystems of the vms become readonly during a failover. No commands can be executed by a shell so a vm reboot is required.
NFS exports via a vm running on a server w/ raw raid 1 passed to the vm
Performance is pretty good and as long as a vm comes back in a few minutes, the clients don't end up with stale file handles because of a server reboot for a kernel upgrade (servers/clients using NFS4). The downside is a single host is still a critical failure point and vm relocation requires moving disks/enclosures. |
How 'bout NFS-exported DRBD? Sounds really stupid, but from what you write it could just work for you. If storage fails, you just reboot the NFS server on the other physical server and let the clients reconnect.
I see an increased potential for data corruption during failover though, so synchronous replication is a must.
_________________ Make Computing Fun Again

pingtoo Veteran
Joined: 10 Sep 2021 Posts: 1351 Location: Richmond Hill, Canada
Posted: Wed Nov 20, 2024 4:56 pm
An interesting topic. I am not involved with the Gentoo project at all; I am just a Gentoo user. I used to be an infrastructure architect before retirement.
Your description seems to mix block devices and file systems together. IMHO they should be discussed separately. Maybe you can describe the access pattern a bit more (as in which is the client and which is the server).
I gather your infrastructure is pure ethernet; is there no other layer-two technology (FC, for example) that can be used in this project?
Have you tried jumbo frames and/or RDMA?
Do you (the project) have full control of the network, or is a separate network team involved?

g2user2024 n00b
Joined: 19 Nov 2024 Posts: 4
Posted: Thu Nov 21, 2024 11:56 pm
Thanks for the replies.
Some additional info: This is a development environment separate from the production data center. There is no dedicated infra admin and things have evolved over 15 years. Think of a wire rack in the corner of a storage area. There are space, power and budget constraints for hardware changes and this project winds down Dec 2025.
Without a dedicated admin, we have been putting services into vms to simplify service management. The vms run development versions of web apps (go-based), automated tests against the apps, proxies to the internet, code building, nfs for video files and general data, web-mail clients, binpkg builders, binhosts, backups, etc. The vms have their own addresses on the lan (and san where needed).
We have custom filesystem-based tools for creating vm images, chroot setup using a vm image file, cloning from a directory or a running vm, and performing package updates from binhosts. So RBD storage of the vm images doesn't work with the tools we have. The vms run as openrc services. This allows setting dependencies on vm startup.
We have been running the general-usage nfs server in a vm to simplify ip management as the ip just moves with the vm. So no need to also manage floating ips that are associated with physical hosts.
The qemu+kvm block and net devices as seen on the vms (via lspci):
- SCSI storage controller: Red Hat, Inc. Virtio block device
- Ethernet controller: Red Hat, Inc. Virtio network device
@szatox
I agree that CEPH has the features we want. It looks like CEPH is more for larger sites with a dedicated admin and more hardware resources than we have. Moving the network to 10G isn't workable for us since the existing hardware can't be upgraded to that.
The CEPH test had 1 active mon and 2 standbys, 3 active osds, 2 mds. All using 7200 rpm sata disks. So looks like a setup bound to perform poorly.
@pingtoo
We have control over the local assets/network, but there is no budget for serious upgrades. The six servers (s1-s6) host ~40 vms, some of which are infrastructure. The clients do sw development, office apps, web email, graphics & some video editing. There are at least a few >10GB file copies a day.
There are three isolated 2.5g switches for the lan, san and dmz networks. The lan nics for servers s1-s6 are all on the same lan switch, the san nics for servers s1-s6 are all on the same san switch. The clients (laptops, desktops) connect to the lan via a secondary 2.5g switch.
On the lan, server-server ping times average ~0.55ms with spikes up to 2.5ms. Server to vm ping times average ~0.675ms. Client ping times are usually 0.030ms higher. The san ping times are about the same as the lan. The host kvm net devices use vhost-net, the vms virtio.
The nics/switches in theory support jumbo frames. Used ip link set mtu 9000 dev netX on all nics connected to the san network. No noticeable difference in throughput.

pingtoo Veteran
Joined: 10 Sep 2021 Posts: 1351 Location: Richmond Hill, Canada
Posted: Fri Nov 22, 2024 12:57 am
g2user2024 wrote: | The nics/switches in theory support jumbo frames. Used ip link set mtu 9000 dev netX on all nics connected to the san network. No noticeable difference in throughput. |
Sometimes network gear needs a nudge (configuration) to make it work right, so you might want to make sure jumbo frames are actually in effect. In my past experience, jumbo frames improved performance significantly for VMs on nfs storage. You need to verify that storage traffic goes through the right path and that nothing in between is doing magic tricks.
Use traceroute to make sure the traffic path looks ok.
Use ping -s <right jumbo frame size> -M do <target>
-M do sets the Don't Fragment (DF) bit in the IP header, instructing routers along the path not to fragment the packet.
I have seen nics/switches advertise 9000, but in a real test it got reduced to 89XX because of some weird overhead.
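A small sketch of that test, assuming IPv4 and the Linux iputils ping. The "right jumbo frame size" for -s is the MTU minus 20 bytes of IPv4 header (no options) and 8 bytes of ICMP header, so 8972 for an MTU of 9000. The function names and the target host name are placeholders.

```sh
# IPv4 ICMP echo payload size that exactly fills a given MTU:
# MTU - 20 (IPv4 header without options) - 8 (ICMP header).
icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Probe a target with the Don't Fragment bit set.
# The ping only succeeds if the whole path carries packets of that MTU.
probe_mtu() {
    target=$1 mtu=${2:-9000}
    ping -c 3 -M do -s "$(icmp_payload "$mtu")" "$target"
}
```

For example, probe_mtu s2-san 9000 (host name hypothetical). If that fails while probe_mtu s2-san 1500 works, something on the path is not passing jumbo frames; retrying with slightly smaller values shows where the effective limit is.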
With jumbo frames, some tuning of the TCP/IP stack usually also needs to happen. Things like send/receive buffers should be adjusted to accommodate the larger packets.
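One rough way to judge buffer sizes is the bandwidth-delay product: the amount of data in flight on a link at full speed. The sketch below plugs in the 2.5G links and the ~0.55ms average and 2.5ms peak ping times reported earlier in the thread. Treating ping round-trip time as the TCP round-trip time is an assumption, and the function name is made up for illustration.

```sh
# Bandwidth-delay product in bytes.
# $1 = link speed in Mbit/s, $2 = round-trip time in microseconds.
# (Mbit/s * 1e6) * (us * 1e-6) / 8 simplifies to Mbit/s * us / 8.
bdp_bytes() {
    echo $(( $1 * $2 / 8 ))
}

bdp_bytes 2500 550    # average RTT case
bdp_bytes 2500 2500   # latency spike case
```

The result can be compared with the maximum (third) field of sysctl net.ipv4.tcp_rmem and net.ipv4.tcp_wmem; buffers smaller than the bandwidth-delay product can limit a single connection's throughput.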
If you absolutely need the "live migration" feature, I think your options are very limited. Using a block device protocol like iSCSI or DRBD (NBD) with careful planning might work. A cluster FS requires some sort of clusterware to support and manage everything together. I think this kind of setup is overkill for non-production.
If you can live with failover, then some kind of simple heartbeat monitoring (e.g. keepalived) with some scripting will likely do the job.
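To illustrate the "simple heartbeat with some scripting" idea, here is a minimal sketch in plain shell. It is not keepalived and does not use VRRP: it just runs a check command in a loop and fires a takeover command after a number of consecutive failures. The function name, host name and service name are hypothetical, and a real setup would also need fencing so that two hosts never serve the same storage at once.

```sh
# heartbeat CHECK THRESHOLD TAKEOVER [INTERVAL]
# Runs CHECK every INTERVAL seconds. After THRESHOLD consecutive failures
# it runs TAKEOVER once and returns. Any successful check resets the count.
heartbeat() {
    check=$1 threshold=$2 takeover=$3 interval=${4:-2}
    fails=0
    while :; do
        if $check >/dev/null 2>&1; then
            fails=0
        else
            fails=$((fails + 1))
            if [ "$fails" -ge "$threshold" ]; then
                $takeover
                return 0
            fi
        fi
        sleep "$interval"
    done
}
```

On a standby host this could look like heartbeat "ping -c 1 -W 1 s1-san" 3 "rc-service vm.nfs start", where s1-san and the vm.nfs OpenRC service are placeholders for the primary's san address and the service that starts the NFS vm.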

szatox Advocate
Joined: 27 Aug 2013 Posts: 3477
Posted: Fri Nov 22, 2024 11:41 am
Quote: | The CEPH test had 1 active mon and 2 standbys, 3 active osds, 2 mds. All using 7200 rpm sata disks. So looks like a setup bound to perform poorly. |
Yes, that's your bottleneck right there. With ceph's typical data redundancy of 3, everything that happens on your system hammers all your OSDs at the same time. You need more leaves than that, plus a bunch of placement groups which will be assigned to different sets of OSDs to spread the load.
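To put rough numbers on the placement groups: a commonly cited rule of thumb from the Ceph documentation is about 100 placement groups per OSD, divided by the replica count and rounded up to a power of two. The sketch below is only that arithmetic; the function name is made up, and recent Ceph releases can also manage this through the PG autoscaler, so treat the result as a ballpark.

```sh
# Rule-of-thumb placement group count.
# $1 = number of OSDs, $2 = replica count, $3 = target PGs per OSD (default 100).
pg_count() {
    osds=$1 replicas=$2 target=${3:-100}
    raw=$(( osds * target / replicas ))
    # Round up to the next power of two.
    pg=1
    while [ "$pg" -lt "$raw" ]; do
        pg=$((pg * 2))
    done
    echo "$pg"
}

pg_count 3 3   # the test setup from this thread: 3 OSDs, 3 replicas
pg_count 6 3   # hypothetically one OSD on each of s1-s6
```

With 3 OSDs and 3 replicas, every placement group lands on the same three disks no matter how many groups there are, which matches the point that every write hits all OSDs at once.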
The single biggest upgrade you could do on this one is replacing HDD with SSD. Rotating disks surely are the primary source of latency, so chopping that down would be a massive improvement.
Quote: | All servers, vms and clients are time-synced via ntp to +/-80microseconds |
It's been a while, but AFAIR ceph wants machines hosting its daemons to be within 20ms of each other.
Our monitoring would light up with warnings when time difference exceeded 5ms.
Quote: | Direct server nfs exports
Fragile in that moving a filesystem requires moving the disks/enclosures and then hoping that the persistent tcp-connections from the clients don't disrupt client mounts when a mount point's ip moves to a different server. In the case of a crash, the clients won't know to close their connections and then try reconnecting. |
I've found NFS over UDP to be much more reliable than TCP.
TCP is expected to stand strong. UDP is expected to fail and recover.
If network glitches during migrations were your only problem with this approach, switching protocols might be worth a shot.