Cluster filesystem for HPC

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

Hello,

I'm setting up a small scientific cluster: it contains five nodes, while a CPU and a RAM onboard are good enough, a disk space is limited: 120 GB per node, no hardware upgrade is possible.

I want to use ~80-100 GB partitions from each node to create a single distributed file system to use it on all nodes simultaneously for user data and software, the internal network is 1Gb/s and should handle this well. Left space will be used for OS files, swap, local caches and a scratch space.

There are dozens of cluster file systems available, so it is hard for me to choose.
So, in spite of hardware limitations and project goals (mostly pbs, mpi and distcc), these are the features I need:

1) The replication ratio should be controllable, the most likely I will disable replication at all to save space. (Data will be backed up regularly using external storage. But external storage will be slow (no more than 100 Mb/s), so I need to use internal fast cluster file system.)

2) FS should be kernel space supported. No fuse solutions are acceptable: heavy I/O is expected, I do not want to waste CPU cycles available.

3) A solution should be open source, no commercial licenses.

4) User/group quotas support is desirable (to protect a file system from overflow due to program error or malicious actions).

5) Nodes holding FS can't be exclusive: the very same nodes will be used for computations.

6) It is highly desirable that distributed FS blocks which are on current computing node will be used directly instead of transferring to the master node (mount source) and back to the client node to save network bandwidth from noops.

I have considered several cluster file systems, so there are my findings:

1) ceph — a good design, a very good documentation, full replication control, load and disk fill balancing; it provides everything I need except for quotas. It is already present in Gentoo, so it will be easier to support. Currently I'm looking forward to use this FS as the best choise available from my point of view.

2) ocfs2 — can't find any way to control or disable the replication ratio, a poor documentation, otherwise it looks good.

3) pvfs — looks good, but no mention of quotas and it can't intelligently distribute files over free space servers.

4) lustre — a good, powerful stuff, but it is too complex, perhaps this is an overkill for a small cluster. It requires heavy kernel patching for the server side, with any problems or urgent (security) kernel updates I will be on my own. I'd like to use something less intrusive.

5) gfs2 — looks interesting, but bad reports from my colleagues about both stability and performance outside of RH builds. Also this will required gnbd with very complicated setup.

6) gpfs — a commercial license required.

7) glusterfs — userspace solution, unacceptable.

8) moosefs — userspace solution, unacceptable.

9) fhgfs — free for use, but no sources available, and I do not want a pain with binaries.

I have no experience in setting up distributed FS myself, except for a setup where separate systems is used to handle local discs and export all storage space via nfs, so advice will be welcomed.

ATM I'm looking toward the first five mentioned with ceph in great preference, but any opinion will be considered.
_________________
Per aspera ad astra!

chithanh · Posted: Sun Nov 13, 2011 12:20 am Post subject: Re: Cluster filesystem for HPC

10) afs

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

After some research I found more info:

There is a good comparison between nfs, gfs2 and ocfs2:
http://www.gpaterno.com/publications/2010/dublin_ossbarcamp_2010_fs_comparison.pdf
and between nfs and ceph (on btrfs):
http://ceph.newdream.net/2009/01/some-performance-comparisons/

It looks like ocfs2 beats gfs2 almost everywhere, and this is not surprising, because gfs2 works on top of clvm, and lvm is a tough troublemaker both in terms of performance and stability. Also I prefer to use the most simple design with the smallest number of additional layers, because all of them have non-zero failure probabilities.

And I took back my words about ocfs2 documentation: rather good one may be found here:
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.6/ocfs2-1_6-usersguide.pdf
The only problem with this docs was to understand that ocfs2 doesn't support replication itself (in the same way as gfs2).

I'll try ocfs2 after some preparations and will report my experience here in several weeks or so.
_________________
Per aspera ad astra!

John R. Graham · Posted: Tue Nov 15, 2011 1:04 pm Post subject:

Is strict POSIX filesystem semantics one of your requirements? That seems to reduce the playing field somewhat.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

John R. Graham · Posted: Tue Nov 15, 2011 4:28 pm Post subject:

I was referring to things like this from the Ceph web site: Differences from POSIX. I've by no means made a comprehensive survey but OCFS, for instance, does not suffer from these corner case limitations.

- John
_________________
I can confirm that I have received between 0 and 499 National Security Letters.

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

Mad Merlin · Veteran Joined: 09 May 2005 Posts: 1155

You seem to have found this already, but it's worth reiterating that GFS2 and OCFS2 are shared filesystems, not clustered filesystems.

In a shared filesystem there is exactly one copy of the data on exactly one node with the caveat that all nodes can access it concurrently once the underlying block device is exported over the network (typically via iSCSI). You can add additional copies by putting something like DRBD underneath the filesystem, but I don't think that's what you're looking for. You gain no additional (storage) performance or space by adding more nodes.

In contrast, a clustered filesystem like Ceph stores data across multiple nodes on its own. You generally gain (storage) performance and/or space by adding more nodes.
_________________
Game! - Where the stick is mightier than the sword!

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

Mad Merlin · Veteran Joined: 09 May 2005 Posts: 1155

tmgoblin · n00b Joined: 05 Jan 2012 Posts: 2

Perhaps a ceph setup wherein you create per user or per use (sparse) disk images, mounted via loop device and formatted with a quota supporting file system if needed?

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

Well, I finally finished this project. In short: after thorough analysis and tests for selected few I chose
OrangeFS (branch of PVFS2).

The main limitation in my work was a need to use the same nodes for both storage and computing.
This through away most of available solutions at once, because they appear to be either incapable to
mount (and work well) on the same nodes as data server nodes or they are not designed to share
available CPU/RAM resources with other applications sanely.

After theoretical considerations and search through various sources I tested three installations on
practice:

1) Ceph. Good thing in theory, maybe even the future of data storage, but it is immature currently:
a) Ceph support for multiple active metadata servers is unstable now; that means all client metadata
requests will be served by a single host and this bottleneck limits scalability and performance on peak
loads greatly.
b) Ceph kernel client can't be safely used on the nodes serving as data servers. See discussion in
previous posts.

2) GlusterFS. As of benefits this distributed filesystem is, perhaps, the simplest one to deploy and
manage. Everything else are drawbacks.
a) Performance is acceptable only in linear mode, where client get no benefits from parallel i/o to
multiple nodes, so use of this mode in HPC is a mockery. And even in this mode even for large data
chunks (dd bs=1M) CPU consumption is 30% of single core (Xeon E5410). I/O speed for large chunks
is about 52 MB/s, which is close to a limit of local hard drives (57 MB/s).
b) Even in simple stripe mode performance is horrible: while good on write (100 MB/s) it drops to
absolutely unacceptable 30 MB/s on read operations. CPU utilization is terrible: it consumes about
250% of single core, so on single node with 2 Xeons and 8 cores one must drop 2.5 cores for
nothing. Tested on large chunks. On small i/o requests all distributed filesystems behave worse.

3) OrangeFS. It does not support quotas nor file locks (though all i/o operations are atomic and this
way consistency is kept without locks). But it works, and works well and stable. Furthermore this is
not a general file storage oriented system, but HPC dedicated one, targeted on parallel I/O including
ROMIO support. All test were done for stripe data distribution.
a) No quotas — to hell quotas. I gave up on them anyway, even glusterfs supports not common
uid/gid based quotas, but directory size limitations, more like LVM works.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata
storage (single node) this gives +50% performance on small files and no significant difference on
large ones.
c) Excellent performance on large data chunks (dd bs=1M). It is limited by a sum of local hard drive
(do not forget each node participates as a data server as well) speed and available network bandwidth.
CPU consumption on such load is decent and is about 50% of single core on a client node and about
10% percents on each other data server nodes.
d) Fair performance on large sets of small files. For the test I untared linux kernel 3.1. It took 5 minutes
over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison.
CPU load is about 50% of single core (of course, it is actually distributed between cores) on the client and
about several percents on each node.
e) Support of ROMIO MPI I/O API. This is a sweet yummy for MPI aware applications, which allows to use
PVFS2/OrangeFS parallel input-output features directly from applications.
f) No support for special files (sockets, fifo, block devices). Thus can't be safely used as /home and I use
NFSv4 for that task providing users quota-restricted small home space. Though most distributed
filesystems don't support special files anyway.

As I spend some considerable amount of time on analysis of different distributed filesystems, I found
that the most close solution to PVFS2/OrangeFS is famous LustreFS both by design and features provided.
The main difference between them is that lustre is in kernel space, while orangefs is in userspace except
for kernel VFS part. As a result of kernelspace approach lustre supports only few rather old and heavily
patched kernels like RHEL ones. It will be a pain to install that on Gentoo, it will be even more pain to deal
with such old and heavily patched kernels. On top of that lustre is targeted on dedicated storage nodes
only and consumes all available resources for good. So I'm out of even trying it.

I found that PVFS2 was once in portage, so I used that ebuild for a start. It was removed from tree
because lacked support for recent kernels, though tree cleaners overdone their job and removed
package days before new release was out and fixed support for recent kernels. This removal was even
more stupid, because PVFS2/OrangeFS provides fuse client which can be used on not yet supported kernels.

As of now I heavily remade that ebuild accommodating new or earlier missed features of the filesystem,
as well as heavily patching it from trunk to fix some bugs, all this with interaction with upstream. Also I
fixed some build issues and send them patches as well. This ebuild (orangefs-2.8.5) may be found in my
overlay ("bircoph", registered in layman). I plan to post it on bugzilla and probably to move it to the
science overlay where I can support it, and this program will fit the science overlay really good, because
this filesystem is targeted on scientific applications. Less likely it will go upstream because it is extremely
hard to put anything upstream this day, though proxy maintaining is still a thinkable solution.

Currently all kernels up to 3.2 are supported. Support for 3.3 is under way, but this will take time, because
considerable changes are required. Though filesystem is fully functional even without kernel support:
1) Fuse client may be used instead of kernel-based one. Functionality is about the same.
2) OrangeFS server is userspace and does not need kernel support at all.
3) It is possible to use filesystem without VFS at all (yes, no mouts and usual file access). ROMIO API
allows to access data directly from application using parallel I/O routines.
_________________
Per aspera ad astra!

TequilaTR · n00b Joined: 01 Feb 2005 Posts: 67

Very nice summary. The orange-fs solution really sounds promising.

Bircoph · Developer Joined: 27 Jun 2008 Posts: 261 Location: Moscow

I opened bug 411173 for this package:
https://bugs.gentoo.org/show_bug.cgi?id=411173
_________________
Per aspera ad astra!