Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Sun Nov 13, 2011 12:07 am Post subject: Cluster filesystem for HPC |
Hello,
I'm setting up a small scientific cluster of five nodes. CPU and RAM on each node are good enough, but disk space is limited: 120 GB per node, and no hardware upgrade is possible.
I want to use an ~80-100 GB partition from each node to build a single distributed file system, mounted on all nodes simultaneously for user data and software; the internal network is 1 Gb/s and should handle this well. The remaining space will be used for OS files, swap, local caches and scratch space.
There are dozens of cluster file systems available, so it is hard for me to choose.
Given the hardware limitations and the project goals (mostly PBS, MPI and distcc), these are the features I need:
1) The replication ratio should be controllable; most likely I will disable replication entirely to save space. (Data will be backed up regularly to external storage, but that storage is slow (no more than 100 Mb/s), so I need a fast internal cluster file system.)
2) The FS should be supported in kernel space. No FUSE solutions are acceptable: heavy I/O is expected and I do not want to waste the CPU cycles available.
3) A solution should be open source, no commercial licenses.
4) User/group quota support is desirable (to protect the file system from overflow due to program errors or malicious actions).
5) Nodes holding the FS can't be dedicated to storage: the very same nodes will be used for computations.
6) It is highly desirable that blocks of the distributed FS stored on the current computing node are accessed directly, instead of being transferred to the master node (mount source) and back to the client node, to avoid wasting network bandwidth on no-ops.
I have considered several cluster file systems; here are my findings:
1) ceph — a good design, very good documentation, full replication control, load and disk-fill balancing; it provides everything I need except quotas. It is already packaged in Gentoo, so it will be easier to support. Currently I'm leaning toward this FS as the best choice available from my point of view (a sketch of the replication control I have in mind follows this list).
2) ocfs2 — I can't find any way to control or disable the replication ratio, and the documentation is poor; otherwise it looks good.
3) pvfs — looks good, but there is no mention of quotas and it can't intelligently distribute files across servers according to free space.
4) lustre — good, powerful stuff, but too complex; perhaps overkill for a small cluster. It requires heavy kernel patching on the server side, so with any problems or urgent (security) kernel updates I will be on my own. I'd like something less intrusive.
5) gfs2 — looks interesting, but I have bad reports from colleagues about both stability and performance outside of RH builds. It would also require GNBD, with a very complicated setup.
6) gpfs — requires a commercial license.
7) glusterfs — userspace solution, unacceptable.
8) moosefs — userspace solution, unacceptable.
9) fhgfs — free to use, but no sources are available, and I do not want the pain of dealing with binaries.
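For illustration, this is roughly what I mean by replication control in ceph (a minimal sketch only, assuming the default pool names; the exact commands may differ between versions):
Code: |
# keep a single copy of each object, i.e. no replication
ceph osd pool set data size 1
ceph osd pool set metadata size 1
# verify the pool replication settings
ceph osd dump
|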
I have no experience setting up a distributed FS myself, beyond a setup where a separate system handles local disks and exports all the storage space via NFS, so advice is welcome.
ATM I'm leaning toward the first five mentioned, with ceph strongly preferred, but any opinion will be considered. _________________ Per aspera ad astra! |
chithanh Developer
Joined: 05 Aug 2006 Posts: 2158 Location: Berlin, Germany
Posted: Sun Nov 13, 2011 12:20 am Post subject: Re: Cluster filesystem for HPC |
10) afs
Bircoph wrote: | 2) The FS should be supported in kernel space. No FUSE solutions are acceptable: heavy I/O is expected and I do not want to waste the CPU cycles available. | You are very quick to discard the userspace solutions. If your jobs are I/O bound anyway, then it won't matter if some extra CPU cycles are used for the filesystem.
Benchmarks with a representative subset of your particular workload will probably answer your questions better than any speculation about the best cluster fs. |
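For example (only a sketch, not a definitive methodology; /mnt/testfs and the sizes are placeholders), a couple of fio runs started from several nodes at once would already tell you a lot:
Code: |
# large sequential streams
fio --name=seqwrite --directory=/mnt/testfs --rw=write --bs=1M --size=4G
fio --name=seqread --directory=/mnt/testfs --rw=read --bs=1M --size=4G
# many small random requests from several workers
fio --name=randrw --directory=/mnt/testfs --rw=randrw --bs=4k --size=1G --numjobs=4 --group_reporting
|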
Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Sun Nov 13, 2011 2:08 am Post subject: Re: Cluster filesystem for HPC |
Thank you for the suggestion. I had forgotten that AFS can combine different volumes in the same cell.
But the AFS volume model itself is a blocking problem: with 100 GB partitions it will be impossible to create a 110 GB volume; moreover, with 49 GB free on every node I won't be able to write 50 GB into any single volume. I want more flexible storage control, so that I can temporarily allow users to consume large amounts of storage. AFS is also really bad at storage balancing: in general you have to move volumes by hand. I found a balancing application for AFS, but it is not realtime and would run from a cron job at rather large intervals.
Besides, AFS is redundant here, and it is questionable whether the AFS cache would improve performance, because in my case the network bandwidth is higher than the HDD I/O rate. AFS is excellent for long-range distributed or even geographically separated setups, but when the whole cluster sits in the same blade rack, other approaches may be more efficient.
Quote: |
Benchmarks with a representative subset of your particular workload will probably answer your questions better than any speculation about the best cluster fs. |
You are right, but it would kill me to build, install, configure and test all of them, and the choice of cluster FS may affect the overall design of the setup. That's why I'm asking for people's experience and suggestions. _________________ Per aspera ad astra! |
John R. Graham Administrator
Joined: 08 Mar 2005 Posts: 10655 Location: Somewhere over Atlanta, Georgia
Posted: Tue Nov 15, 2011 1:04 pm Post subject: |
Is strict POSIX filesystem semantics one of your requirements? That seems to reduce the playing field somewhat.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Tue Nov 15, 2011 4:22 pm Post subject: |
John R. Graham wrote: | Is strict POSIX filesystem semantics one of your requirements? That seems to reduce the playing field somewhat.
|
Frankly speaking, I do not understand everything behind "strict POSIX filesystem semantics"; I looked into the Single UNIX Specification, Version 4, but the definitions look too vague to me.
But most probably yes, I need these semantics, because I need the distributed file system to behave as close to a local filesystem as possible from the point of view of users' applications. In particular, I have these requirements:
- standard Unix permissions; ACLs are welcome, but not strictly required;
- quota support;
- consistency guarantees for multiple and possibly overlapping writes (so POSIX locks should be supported, I guess; a quick lock test is sketched after this list);
- xattr support and probably other features commonly used by local system applications.
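For a first smoke test of locking on a candidate FS I would probably do something like this (a sketch only; /mnt/clusterfs is a placeholder, and flock(1) exercises flock(2) rather than fcntl() POSIX locks, so it is only a rough indication):
Code: |
touch /mnt/clusterfs/locktest
# on node A: take an exclusive lock and hold it for 30 seconds
flock /mnt/clusterfs/locktest -c 'echo A has the lock; sleep 30'
# on node B: this should block until node A releases the lock
flock /mnt/clusterfs/locktest -c 'echo B got the lock'
|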
This cluster is not designed for a single computation task; it may be used in a variety of ways for many computing tasks, within the limits it can technically handle, including but not limited to educational purposes. _________________ Per aspera ad astra! |
John R. Graham Administrator
Joined: 08 Mar 2005 Posts: 10655 Location: Somewhere over Atlanta, Georgia
Posted: Tue Nov 15, 2011 4:28 pm Post subject: |
I was referring to things like this from the Ceph web site: Differences from POSIX. I've by no means made a comprehensive survey but OCFS, for instance, does not suffer from these corner case limitations.
- John _________________ I can confirm that I have received between 0 and 499 National Security Letters. |
Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Tue Nov 15, 2011 5:54 pm Post subject: |
Hmm, it's hard to say... Of those two cases I definitely do not care about wrong df reports for sparse files, but the issue with simultaneous writes around an object boundary may be a problem... or may not be; I can't be sure, but it is better to take some precautions.
Quote: |
OCFS, for instance, does not suffer from these corner case limitations.
|
According to the ocfs2 docs, they aim to be as close to local filesystem behaviour as possible; I hope this will work well. So ocfs2 will be the first one to try. _________________ Per aspera ad astra! |
Mad Merlin Veteran
Joined: 09 May 2005 Posts: 1155
Posted: Thu Nov 17, 2011 1:43 am Post subject: |
You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.
In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.
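For illustration, exporting one node's spare partition so the other nodes see the same block device could look roughly like this (a sketch using tgt and open-iscsi; device name, IQN and hostname are placeholders):
Code: |
# on the node that owns the disk: publish /dev/sdb3 as an iSCSI target
tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2011-11.cluster:node1.data
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb3
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
# on the other nodes: discover the target and log in
iscsiadm -m discovery -t sendtargets -p node1
iscsiadm -m node --login
|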
In contrast, a clustered filesystem like Ceph stores data across multiple nodes on its own. You generally gain (storage) performance and/or space by adding more nodes. _________________ Game! - Where the stick is mightier than the sword! |
Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Thu Nov 17, 2011 8:50 pm Post subject: |
Mad Merlin wrote: | You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.
In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.
|
Yes, thanks for reminding me that I need some distributed block device beneath OCFS2. And that is a problem, because I can't find an appropriate one: DRBD implies RAID-1 mirroring, NBD can't be mounted rw on several hosts, and the RADOS block device requires Ceph, so using it under OCFS2 would be irrational: it would make more sense to use Ceph itself. Another approach might be to put something like AUFS3 on top of OCFS2, but that may lead to disaster, since the upper FS would not be cluster-aware.
However, this does not hold for GFS2, because it can work on top of CLVM (and OCFS2 can't). CLVM makes it possible to create a logical, network-distributed block device that can sit beneath GFS2, roughly as sketched below. So this is a good combination to try, but gfs2+clvm has poor performance even with concurrent writes from just two nodes.
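Roughly what I have in mind, as a sketch only (it assumes the underlying block devices are already visible on all nodes and the cluster is called "hpc"; all names are placeholders):
Code: |
# create a clustered volume group so CLVM coordinates LVM metadata across nodes
pvcreate /dev/sdb3
vgcreate -cy vg_cluster /dev/sdb3
lvcreate -n lv_data -l 100%FREE vg_cluster
# GFS2 with DLM locking and one journal per node (5 nodes)
mkfs.gfs2 -p lock_dlm -t hpc:data -j 5 /dev/vg_cluster/lv_data
mount -t gfs2 /dev/vg_cluster/lv_data /mnt/data
|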
Quote: |
In contrast, a clustered filesystem like Ceph stores data across multiple nodes on its own. You generally gain (storage) performance and/or space by adding more nodes. |
Well, perhaps I should give Ceph a try. Maybe someone has an idea how to implement disk quotas on Ceph via some external layer (selinux, aufs, nfs... I'm running out of ideas)?
Please note, though, that for my setup scalability means nothing, because the hardware setup is not scalable and any external disk storage will be at least an order of magnitude slower; even if such storage is used in the future, it will be deliberately mounted at a separate mount point. _________________ Per aspera ad astra! |
Mad Merlin Veteran
Joined: 09 May 2005 Posts: 1155
Posted: Thu Nov 17, 2011 10:59 pm Post subject: |
Bircoph wrote: | Mad Merlin wrote: | You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.
In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.
|
Yes, thanks for reminding me that I need some distributed block device beneath OCFS2. And that is a problem, because I can't find an appropriate one: DRBD implies RAID-1 mirroring, NBD can't be mounted rw on several hosts, and the RADOS block device requires Ceph, so using it under OCFS2 would be irrational: it would make more sense to use Ceph itself. Another approach might be to put something like AUFS3 on top of OCFS2, but that may lead to disaster, since the upper FS would not be cluster-aware.
However, this does not hold for GFS2, because it can work on top of CLVM (and OCFS2 can't). CLVM makes it possible to create a logical, network-distributed block device that can sit beneath GFS2. So this is a good combination to try, but gfs2+clvm has poor performance even with concurrent writes from just two nodes. |
Ah, you appear to be right; a fascinating option for GFS2 that I had not been aware of until now.
In that case, the distinction is that with a shared filesystem the filesystem itself is only aware of a single copy of the data. _________________ Game! - Where the stick is mightier than the sword! |
tmgoblin n00b
Joined: 05 Jan 2012 Posts: 2
Posted: Thu Jan 05, 2012 11:43 pm Post subject: |
Perhaps a ceph setup wherein you create per user or per use (sparse) disk images, mounted via loop device and formatted with a quota supporting file system if needed? |
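A rough sketch of that idea (paths, sizes and the ext4 choice are placeholders; see the caveats in the next post):
Code: |
# sparse per-user image stored on the ceph mount
truncate -s 50G /mnt/ceph/users/alice.img
mkfs.ext4 -F /mnt/ceph/users/alice.img
# loop-mount it with quotas enabled inside the image
mount -o loop,usrquota,grpquota /mnt/ceph/users/alice.img /home/alice
quotacheck -cug /home/alice && quotaon /home/alice
|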
Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Fri Jan 06, 2012 9:14 am Post subject: |
tmgoblin wrote: | Perhaps a ceph setup wherein you create per user or per use (sparse) disk images, mounted via loop device and formatted with a quota supporting file system if needed? |
1) As discussed with the ceph developers, if one needs to mount a ceph filesystem on the same node where a data server runs (which is my case), FUSE must be used, otherwise a deadlock will occur under memory pressure. And I want to avoid FUSE if possible.
2) The same applies to any loopback mounts such as you propose; this approach is unsafe:
https://docs.google.com/a/dreamhost.com/viewer?a=v&q=cache:ONtIKJFSC7QJ:https://tao.truststc.org/Members/hweather/advanced_storage/Public%2520resources/network/nfs_user+nfs+loopback+deadlock+linux&hl=en&gl=us&pid=bl&srcid=ADGEESgpaVYYNoh2pmvPVQ9I_bpLLcoF3GJIMKavomIHNgTb-cbii6RVtWg28poJKdHBqQgKGXzVA2NOsC25FtWMP3yywTfNkX9N26IrKVIcVA9eRz6ZGBx1_Ur0JerUrfBQlPcmcBBz&sig=AHIEtbSjGX_hCVny345iF
3) If you put a non-cluster-aware filesystem on top of a distributed disk image, it will degrade your I/O to a miserable state; and the filesystem on top must be cluster-aware anyway, like ocfs2 or gfs2.
ATM I have given up on user quotas; this is not so critical after all, and throughput and a sane CPU load are more important now.
It looks like every distributed filesystem has its very own bugs, perils, benefits and drawbacks, and none of them works ideally. Currently I plan to try OrangeFS, the PVFS2 successor, though the work is postponed for other reasons. _________________ Per aspera ad astra! |
Bircoph Retired Dev
Joined: 27 Jun 2008 Posts: 261 Location: Moscow
Posted: Tue Mar 27, 2012 3:30 am Post subject: |
Well, I finally finished this project. In short: after a thorough analysis and tests of a selected few, I chose OrangeFS (a branch of PVFS2).
The main constraint in my work was the need to use the same nodes for both storage and computing. This ruled out most of the available solutions at once, because they turn out to be either incapable of being mounted (and working well) on the same nodes that serve the data, or not designed to share the available CPU/RAM resources sanely with other applications.
After theoretical considerations and a search through various sources I tested three installations in practice:
1) Ceph. A good thing in theory, maybe even the future of data storage, but it is currently immature:
a) Ceph support for multiple active metadata servers is unstable at the moment; that means all client metadata requests are served by a single host, and this bottleneck greatly limits scalability and performance under peak load.
b) The Ceph kernel client can't be safely used on nodes that also act as data servers. See the discussion in the previous posts.
2) GlusterFS. On the plus side, this distributed filesystem is perhaps the simplest one to deploy and manage. Everything else is a drawback.
a) Performance is acceptable only in linear mode, where the client gets no benefit from parallel I/O to multiple nodes, so using this mode in HPC is a mockery. Even in this mode, and even for large data chunks (dd bs=1M), CPU consumption is about 30% of a single core (Xeon E5410). I/O speed for large chunks is about 52 MB/s, which is close to the limit of the local hard drives (57 MB/s).
b) Even in simple stripe mode performance is horrible: while good on writes (100 MB/s), it drops to an absolutely unacceptable 30 MB/s on reads. CPU utilization is terrible: about 250% of a single core, so on a single node with 2 Xeons and 8 cores one throws away 2.5 cores for nothing. This was tested on large chunks; on small I/O requests all distributed filesystems behave worse.
3) OrangeFS. It supports neither quotas nor file locks (though all I/O operations are atomic, so consistency is kept without locks). But it works, and it works well and stably. Furthermore, it is not a general-purpose file store but an HPC-dedicated system targeted at parallel I/O, including ROMIO support. All tests were done with stripe data distribution.
a) No quotas — to hell with quotas. I gave up on them anyway; even glusterfs supports not ordinary uid/gid-based quotas but directory size limits, which work more like LVM.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata storage (a single node) this gives +50% performance on small files and no significant difference on large ones.
c) Excellent performance on large data chunks (dd bs=1M; the exact commands are sketched after this list). It is limited by the sum of the local hard drive speed (do not forget that each node participates as a data server as well) and the available network bandwidth. CPU consumption under such load is decent: about 50% of a single core on the client node and about 10% on each of the other data server nodes.
d) Fair performance on large sets of small files. For this test I untarred the linux 3.1 kernel sources: it took 5 minutes over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison. CPU load is about 50% of a single core (distributed between cores, of course) on the client and a few percent on each other node.
e) Support for the ROMIO MPI-IO API. This is a sweet treat for MPI-aware applications, since it lets them use the PVFS2/OrangeFS parallel input-output features directly.
f) No support for special files (sockets, fifos, block devices). Thus it can't be safely used as /home, and I use NFSv4 for that task, giving users a small quota-restricted home space. Most distributed filesystems don't support special files anyway, though.
Having spent a considerable amount of time analysing different distributed filesystems, I found that the solution closest to PVFS2/OrangeFS, both in design and in the features provided, is the famous LustreFS. The main difference between them is that Lustre lives in kernel space, while OrangeFS is in userspace except for the kernel VFS part. As a result of the kernel-space approach, Lustre supports only a few rather old and heavily patched kernels, like the RHEL ones. It would be a pain to install that on Gentoo, and an even bigger pain to deal with such old and heavily patched kernels. On top of that, Lustre is targeted at dedicated storage nodes only and consumes all available resources for itself. So I didn't even try it.
I found that PVFS2 was once in portage, so I used that ebuild as a starting point. It was removed from the tree because it lacked support for recent kernels, though the tree cleaners overdid their job and removed the package just days before a new release came out that fixed support for recent kernels. The removal was all the more pointless because PVFS2/OrangeFS provides a FUSE client which can be used on not-yet-supported kernels.
I have since heavily reworked that ebuild, accommodating new or previously missed features of the filesystem, and patched it heavily from trunk to fix some bugs, all in cooperation with upstream. I also fixed some build issues and sent them patches. This ebuild (orangefs-2.8.5) can be found in my overlay ("bircoph", registered in layman). I plan to post it on bugzilla and probably move it to the science overlay, where I can support it; this program will fit the science overlay really well, because the filesystem is targeted at scientific applications. It is less likely to go into the main tree, because it is extremely hard to get anything into the tree these days, though proxy maintenance is still a thinkable option.
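If anyone wants to try it, something along these lines should work (the sys-cluster category is my assumption for where the ebuild sits; check the overlay if it differs):
Code: |
# add my overlay and install the OrangeFS ebuild from it
layman -a bircoph
emerge -av sys-cluster/orangefs
|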
Currently all kernels up to 3.2 are supported. Support for 3.3 is under way, but it will take time, because considerable changes are required. The filesystem is fully functional even without kernel support, though:
1) The FUSE client may be used instead of the kernel-based one; functionality is about the same.
2) The OrangeFS server runs in userspace and does not need kernel support at all.
3) It is possible to use the filesystem without the VFS at all (yes, no mounts and no usual file access): the ROMIO API allows applications to access data directly using parallel I/O routines. _________________ Per aspera ad astra! |
TequilaTR n00b
Joined: 01 Feb 2005 Posts: 67
Posted: Sat Apr 07, 2012 4:47 pm Post subject: |
Very nice summary. The OrangeFS solution really sounds promising. |