Gentoo Forums
Syncing very large number of files to another server
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Fri Jul 18, 2008 10:45 am    Post subject: Syncing very large number of files to another server

I've got a folder with a very large number of files and subdirectories (it's full of maildir folders for my company, actually); the file count runs into the millions. I need to replicate this to another server as a backup.

The number of files is not the only consideration: the data volume is also significant, even over gigabit.

I've tried using rsync to avoid re-copying a large volume of data by only taking the differences across to the backup server, but the extremely large number of files really hurts rsync and it takes forever while building its file lists.

Does anybody have a better idea or another tool for replicating a directory structure with both large volume and a large number of files?
_________________
The Human Equation:

value(geeks) > value(mundanes)
SnakeByte
Apprentice


Joined: 04 Oct 2002
Posts: 177
Location: Europe - Germany

PostPosted: Fri Jul 18, 2008 11:02 am

hi,

as you have
Quote:
a folder with a very large number of files and subdirectories

you can still use rsync.

just do a
Code:
for subfolder in */*/; do rsync -aR "$subfolder" <target>; done

this will reduce the number of files to sync in each step.


regards
jsosic
Guru


Joined: 02 Aug 2004
Posts: 510
Location: Split (Croatia)

PostPosted: Fri Jul 18, 2008 2:23 pm

I'm syncing 6 GB of data (small files, a web server) with rsync, and it works quite fast. The rsync run finishes after approximately 15 seconds.

If the other server is empty, do the first copy with scp, and later use rsync to sync only the differences.
_________________
I avenge with darkness, the blood is the life
The Order of the Dragon, I feed on human life
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Fri Jul 18, 2008 11:24 pm

SnakeByte: thanks for the idea, but I already did this to split the file list into multiple smaller file lists. It was still too much of a burden and takes hours to run, when I know a lot of that time is simply file-list generation.

To put this in perspective for you and jsosic, I'll update you with a count from my server... as soon as it finishes counting!!!
_________________
The Human Equation:

value(geeks) > value(mundanes)
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Sun Jul 20, 2008 1:47 pm

OK here are my current counts to give you an idea of what I am trying to manage:
Code:
find .|wc -l
12537631

du -cshx .
179G    .
179G    total

So there you go: 12 million files/directories and 179G. I had to leave that du running overnight just for it to finish...!

Now you can see why my file lists take so long to build (even when I loop over each subdir independently), and why I don't want to tar the whole thing etc., as that wastes time on a large volume of data that is already on my other server. Recopying everything every time is really out of the question...

All ideas welcome at this point.
_________________
The Human Equation:

value(geeks) > value(mundanes)
think4urs11
Bodhisattva


Joined: 25 Jun 2003
Posts: 6659
Location: above the cloud

PostPosted: Sun Jul 20, 2008 9:05 pm

-W might help a bit, but with numbers like that I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.
_________________
Nothing is secure / Security is always a trade-off with usability / Do not assume anything / Trust no-one, nothing / Paranoia is your friend / Think for yourself
lucapost
Veteran


Joined: 24 Nov 2005
Posts: 1419
Location: <ud|me|ts> - Italy

PostPosted: Sun Jul 20, 2008 10:36 pm

You can use an FTP client that supports a mirror function, like lftp (mirror -R)....
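
Roughly something like this (hostname and paths are just placeholders; --only-newer skips files that look unchanged by timestamp/size):
Code:
# reverse mirror: push the local maildir tree to the backup server
lftp -e "mirror -R --only-newer /data/maildirs/ /backups/maildirs/; quit" sftp://user@backupserver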
_________________
LP
SnakeByte
Apprentice


Joined: 04 Oct 2002
Posts: 177
Location: Europe - Germany

PostPosted: Mon Jul 21, 2008 4:05 pm

lucapost wrote:
You can use an FTP client that supports a mirror function, like lftp (mirror -R)....


This would result in a full copy, wouldn't it?


@humbletech99

Can you give some more information about the general directory layout?

Is it symmetric? Is there a given pattern for directory and file names?


regards
aronparsons
Tux's lil' helper


Joined: 04 Oct 2004
Posts: 117
Location: Virginia

PostPosted: Mon Jul 21, 2008 8:32 pm

Have you tried using rsync over NFS as opposed to tunneling via SSH (I'm assuming you're doing something like "rsync /data/. backupserver:/backups/", since you didn't specify otherwise)? If you do this, export it read-only and asynchronous (ro,async).

What is your hard drive configuration (drive speed, interface, RAID, LVM, filesystem, etc)?

Something else that might help is to disable access time updates (the 'noatime' and 'nodiratime' mount options); this may not apply to your filesystems, but it will for ext3 and ReiserFS.
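
A rough sketch of what that setup could look like (hostnames and paths are only placeholders, and it assumes /data is its own filesystem):
Code:
# /etc/exports on the mail server: export the maildir tree read-only and async
/data/maildirs  backupserver(ro,async,no_subtree_check)

# on the backup server: mount the export and pull the tree with rsync
mount -t nfs mailserver:/data/maildirs /mnt/maildirs
rsync -a --delete /mnt/maildirs/ /backups/maildirs/

# on the mail server: remount the source filesystem without atime updates
mount -o remount,noatime,nodiratime /data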
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Thu Jul 24, 2008 1:28 pm

Think4UrS11 wrote:
-W might help a bit, but with numbers like that I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.

I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.

aronparsons: Both servers have 7200 RPM SATA RAID arrays spanning several TB. There is no LVM in use on them.

SnakeByte: the structure is like user/maildirs/..., where the top directory is split into per-user subdirs, each containing files and further subdirs as per the maildir structure.
_________________
The Human Equation:

value(geeks) > value(mundanes)
think4urs11
Bodhisattva


Joined: 25 Jun 2003
Posts: 6659
Location: above the cloud

PostPosted: Thu Jul 24, 2008 9:50 pm

humbletech99 wrote:
Think4UrS11 wrote:
-W might help a bit, but with numbers like that I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.

I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.

E.g. if you have a high number of small files, or not many big files whose content changes, -W simply copies the whole file instead of checking its content; it lowers processor usage but probably needs a bit more bandwidth.
Depending on your exact type of data this might, as said, be an option.
Actually, since you have a mail server, normally not many of the existing files will change, as roughly speaking 1 file = 1 mail.
There will always be new files and deleted files, but not many changed files. So there's no real need to have the server check each and every file in depth; it might be quicker to simply transfer changed files completely.
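
So roughly something like this (paths are placeholders):
Code:
# -W/--whole-file skips the delta-transfer check and copies changed files in full
rsync -aW --delete /data/maildirs/ backupserver:/backups/maildirs/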
_________________
Nothing is secure / Security is always a trade-off with usability / Do not assume anything / Trust no-one, nothing / Paranoia is your friend / Think for yourself
SnakeByte
Apprentice


Joined: 04 Oct 2002
Posts: 177
Location: Europe - Germany

PostPosted: Sun Jul 27, 2008 4:51 pm

humbletech99 wrote:
OK here are my current counts to give you an idea of what I am trying to manage:
Code:
find .|wc -l
12537631

du -cshx .
179G    .
179G    total



So 12 million files split by how many users?

You could give rsync on a per user directory a try.

Or tar and bzip2 each user directory, copy it over and untar it, to save both CPU (for the change check) and bandwidth.
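
Roughly like this, with placeholder paths (swap -j for -z if gzip turns out to be lighter on the CPU):
Code:
# stream one compressed tarball per user directory straight to the backup server
for user in /data/maildirs/*/; do
    tar -C /data/maildirs -cjf - "$(basename "$user")" \
      | ssh backupserver "tar -xjf - -C /backups/maildirs"
done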


regards
cyrillic
Watchman


Joined: 19 Feb 2003
Posts: 7313
Location: Groton, Massachusetts USA

PostPosted: Sun Jul 27, 2008 5:57 pm    Post subject: Re: Syncing very large number of files to another server

humbletech99 wrote:
I've tried using rsync to avoid re-copying a large volume of data by only taking the differences across to the backup server, but the extremely large number of files really hurts rsync and it takes forever while building its file lists.

I think rsync is a good choice in this situation.

Your bottleneck is most likely filesystem performance (or lack thereof).
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Sun Jul 27, 2008 7:07 pm

SnakeByte wrote:
You could give rsync on a per user directory a try.
Tried that already, in a bash script iterating over each user directory individually; it's still too slow and the file lists are too big.
Quote:
Or tar and bzip2 each user directory, copy it over and untar it, to save both CPU (for the change check) and bandwidth.
I've done a very similar thing elsewhere and have to say it is really not quick at all; even slower than the rsync method, I think, because you have to process a huge amount of data needlessly each time. I even did timing tests and found bzip2 to be a poor choice for this due to its extreme CPU usage; gzip was better, since with bzip2 it was the CPU on a dual-Opteron server, rather than the gigabit network, that became the bottleneck!

cyrillic: that's a good guess. I have observed CPU as the bottleneck on streaming zip -> copy -> unzip type operations, and RAM as the bottleneck with this rsync operation, as it chews up the entire gig of RAM on the server just to build the file list before it ever starts transferring files. I expect that tarring would definitely leave disk as the bottleneck, although I haven't tested that last one.

So I'm back to rsync. I should probably get some more RAM, but I'd love to find a superior solution to any of those discussed so far.... I'm tempted to use drbd for this; it's not trivial (not rocket science, but perhaps a little awkward) to make the existing servers and data support it. I have no free block devices, and as far as I know it's not supported to fake one with loopback devices either.

Open to suggestions
_________________
The Human Equation:

value(geeks) > value(mundanes)
sschlueter
Guru


Joined: 26 Jul 2002
Posts: 578
Location: Dortmund, Germany

PostPosted: Mon Jul 28, 2008 12:40 am

Are you already using rsync 3.0? If not, upgrading might help a bit:

Quote:
Beginning with rsync 3.0.0, the recursive algorithm used is now an incremental scan that uses much less memory than before and begins the transfer after the scanning of the first few directories have been completed. This incremental scan only affects our recursion algorithm, and does not change a non-recursive transfer. It is also only possible when both ends of the transfer are at least version 3.0.0.
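
A quick way to check what both ends are running (assuming you can ssh to the backup box):
Code:
rsync --version | head -n1
ssh backupserver 'rsync --version | head -n1'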
sschlueter
Guru


Joined: 26 Jul 2002
Posts: 578
Location: Dortmund, Germany

PostPosted: Mon Jul 28, 2008 1:38 am

I think the main problem here is that rsync lacks the feature to maintain and utilize an index. It must scan both local and remote directories each time it is run in order to determine the differences.

That being said, I would suggest git as a solution that uses an index.

This would include the following steps:

1) creating a git repository on the original machine
2) regularly creating new revisions by committing new/changed/deleted files
3) cloning the repository on the backup machine (by using git clone)
4) regularly syncing the cloned repository (by using git pull)


Step 2) is greatly simplified by an additional tool called gibak: "gibak commit" is all you have to do here. This step still requires a filesystem scan, but the author claims that it's faster than rsync's scan method.

The major advantage is that step 4) is way more efficient than using rsync because everything is indexed now.

Step 1) may not be space efficient (I haven't checked that) but keep in mind that step 2) gets you a versioned backup in addition to mere replication for free. Step 2) in itself is space efficient, by the way.
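
A minimal sketch of those steps with placeholder paths and hostnames (plain git commands here; gibak would wrap step 2):
Code:
# step 1, on the mail server: turn the maildir tree into a git repository
cd /data/maildirs
git init
git add -A && git commit -m "initial snapshot"

# step 2, run regularly: commit new/changed/deleted files
git add -A && git commit -m "snapshot $(date +%F)"

# step 3, on the backup server: clone once over ssh
git clone mailserver:/data/maildirs /backups/maildirs

# step 4, run regularly on the backup server: pull the new revisions
cd /backups/maildirs && git pull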


Edit: Even the scan in step 2) could be avoided if someone created a daemon that utilized the kernel's inotify feature to create the list of new/changed/deleted files. Any volunteers? :wink:
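
A very rough starting point using inotifywait from inotify-tools (not a real daemon, and the inotify watch limits would need raising for a tree this size):
Code:
# append the path of every created/modified/deleted/moved file to a change list
inotifywait -m -r -e create,modify,delete,move --format '%w%f' /data/maildirs \
  >> /var/tmp/maildir-changes.list

That list could then be fed to rsync --files-from instead of doing a full scan.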
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Mon Jul 28, 2008 8:33 am

sschlueter: thanks for the recommendations. It looks like rsync's algorithm improvements will be well worth it; I'm still on 2.x on both ends, but will try rsync 3 when I get a chance.

I've not used git, but I have used subversion, and I have to say I still have my reservations about such a method... I'll look into git when I get a chance though.

Thanks again.
_________________
The Human Equation:

value(geeks) > value(mundanes)
i92guboj
Bodhisattva


Joined: 30 Nov 2004
Posts: 10315
Location: Córdoba (Spain)

PostPosted: Mon Jul 28, 2008 8:48 am

Some random bits.

- If performance is an issue, NFS will beat SSH because of the lack of encryption.
- Git might be an option to consider.
- If you use compression, use gzip instead of bzip2. It will for sure be a tad faster, and you will save a lot of bandwidth.

Sorry if some or all of these have already been mentioned; I don't have the time right now to read the whole thread.