Gentoo Forums
Syncing very large number of files to another server
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Fri Jul 18, 2008 10:45 am    Post subject: Syncing very large number of files to another server

I've got a folder with a very large number of files and subdirectories (it's full of maildir folders for my company, actually); the file count runs into the millions. I need to replicate this to another server as a backup.

The number of files is not the only consideration: the data volume is also significant, even over gigabit.

I've tried using rsync to avoid re-copying a large volume of data by only taking the differences across to the backup server, but the extremely large number of files really hurts rsync and it takes forever while building its file lists.

Does anybody have a better idea or another tool for replicating a directory structure with both large volume and a large number of files?
_________________
The Human Equation:

value(geeks) > value(mundanes)
SnakeByte
Apprentice


Joined: 04 Oct 2002
Posts: 177
Location: Europe - Germany

PostPosted: Fri Jul 18, 2008 11:02 am

hi,

as you have
Quote:
a folder with a very large number of files and subdirectories

you can still use rsync.

just do a
Code:
for subfolder in */*/; do rsync -aR "$subfolder" <target>; done

this will reduce the number of files to sync in each step.


regards
jsosic
Guru


Joined: 02 Aug 2004
Posts: 510
Location: Split (Croatia)

PostPosted: Fri Jul 18, 2008 2:23 pm

I'm syncing 6 GB of data (small files, a web server) with rsync, and it works quite fast. The rsync run finishes after approximately 15 seconds.

If the other server is empty, do the first copy with scp, and later use rsync to sync only the differences.
_________________
I avenge with darkness, the blood is the life
The Order of the Dragon, I feed on human life
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Fri Jul 18, 2008 11:24 pm

SnakeByte: thanks for the idea, but I already did this to split the file list into multiple smaller file lists. It was still too much of a burden and takes hours to run, when I know a lot of that time is simply file-list generation.

To put this in perspective for you and jsosic, I'll update you with a count from my server... as soon as it finishes counting!!!
_________________
The Human Equation:

value(geeks) > value(mundanes)
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Sun Jul 20, 2008 1:47 pm

OK here are my current counts to give you an idea of what I am trying to manage:
Code:
find .|wc -l
12537631

du -cshx .
179G    .
179G    total

So there you go: 12 million files/directories and 179G. I had to leave that du running overnight just for it to finish...!

Now you can see why my file lists take so long to build (even when I loop over each subdir independently), and why I don't want to tar the whole thing etc., as that wastes time on a large volume of data that is already on my other server. Recopying everything every time is really out of the question...

All ideas welcome at this point.
_________________
The Human Equation:

value(geeks) > value(mundanes)
think4urs11
Bodhisattva


Joined: 25 Jun 2003
Posts: 6659
Location: above the cloud

PostPosted: Sun Jul 20, 2008 9:05 pm

-W might help a bit, but with numbers like that I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.
_________________
Nothing is secure / Security is always a trade-off with usability / Do not assume anything / Trust no-one, nothing / Paranoia is your friend / Think for yourself
lucapost
Veteran


Joined: 24 Nov 2005
Posts: 1419
Location: <ud|me|ts> - Italy

PostPosted: Sun Jul 20, 2008 10:36 pm

You can use an FTP client that supports a mirror function, like lftp (mirror -R)....
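
Roughly something like this (hostname and paths are just placeholders; --only-newer skips files that look unchanged by timestamp/size):
Code:
# reverse mirror: push the local maildir tree to the backup server
lftp -e "mirror -R --only-newer /data/maildirs/ /backups/maildirs/; quit" sftp://user@backupserver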
_________________
LP
SnakeByte
Apprentice


Joined: 04 Oct 2002
Posts: 177
Location: Europe - Germany

PostPosted: Mon Jul 21, 2008 4:05 pm

lucapost wrote:
You can use an FTP client that supports a mirror function, like lftp (mirror -R)....


This would result in a full copy, wouldn't it?


@humbletech99

Can you give some more information about the general directory layout?

Is it symmetric? Is there a given pattern for directory and file names?


regards
aronparsons
Tux's lil' helper


Joined: 04 Oct 2004
Posts: 117
Location: Virginia

PostPosted: Mon Jul 21, 2008 8:32 pm

Have you tried using rsync over NFS as opposed to tunneling via SSH (I'm assuming you're doing something like "rsync /data/. backupserver:/backups/", since you didn't specify otherwise)? If you do this, export it read-only and asynchronous (ro,async).

What is your hard drive configuration (drive speed, interface, RAID, LVM, filesystem, etc)?

Something else that might help is to disable access time updates (the 'noatime' and 'nodiratime' mount options); this may not apply to your filesystems, but it will for ext3 and ReiserFS.
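
A rough sketch of what that setup could look like (hostnames and paths are only placeholders, and it assumes /data is its own filesystem):
Code:
# /etc/exports on the mail server: export the maildir tree read-only and async
/data/maildirs  backupserver(ro,async,no_subtree_check)

# on the backup server: mount the export and pull the tree with rsync
mount -t nfs mailserver:/data/maildirs /mnt/maildirs
rsync -a --delete /mnt/maildirs/ /backups/maildirs/

# on the mail server: remount the source filesystem without atime updates
mount -o remount,noatime,nodiratime /data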
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Thu Jul 24, 2008 1:28 pm

Think4UrS11 wrote:
-W might help a bit, but with numbers like that I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.

I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.

aronparsons: Both servers have 7200 RPM SATA RAID arrays spanning several TB. There is no LVM in use on them.

SnakeByte: the structure is like user/maildirs/..., where the top directory is split into per-user subdirs, each containing files and further subdirs as per the maildir structure.
_________________
The Human Equation:

value(geeks) > value(mundanes)
think4urs11
Bodhisattva


Joined: 25 Jun 2003
Posts: 6659
Location: above the cloud

PostPosted: Thu Jul 24, 2008 9:50 pm

humbletech99 wrote:
Think4UrS11 wrote:
-W might help a bit, but with numbers like that I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.

I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.

E.g. if you have a high number of small files, or not many big files whose content changes, -W simply copies the whole file instead of checking its content; it lowers processor usage but probably needs a bit more bandwidth.
Depending on your exact type of data this might, as said, be an option.
Actually, since you have a mail server, normally not many of the existing files will change, as roughly speaking 1 file = 1 mail.
There will always be new files and deleted files, but not many changed files. So there's no real need to have the server check each and every file in depth; it might be quicker to simply transfer changed files completely.
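
So roughly something like this (paths are placeholders):
Code:
# -W/--whole-file skips the delta-transfer check and copies changed files in full
rsync -aW --delete /data/maildirs/ backupserver:/backups/maildirs/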
_________________
Nothing is secure / Security is always a trade-off with usability / Do not assume anything / Trust no-one, nothing / Paranoia is your friend / Think for yourself
SnakeByte
Apprentice


Joined: 04 Oct 2002
Posts: 177
Location: Europe - Germany

PostPosted: Sun Jul 27, 2008 4:51 pm

humbletech99 wrote:
OK here are my current counts to give you an idea of what I am trying to manage:
Code:
find .|wc -l
12537631

du -cshx .
179G    .
179G    total



So 12 million files split by how many users?

You could give rsync on a per user directory a try.

Or tar and bzip2 each user directory, copy it over and untar it, to save both CPU (for the change check) and bandwidth.
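
Roughly like this, with placeholder paths (swap -j for -z if gzip turns out to be lighter on the CPU):
Code:
# stream one compressed tarball per user directory straight to the backup server
for user in /data/maildirs/*/; do
    tar -C /data/maildirs -cjf - "$(basename "$user")" \
      | ssh backupserver "tar -xjf - -C /backups/maildirs"
done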


regards
cyrillic
Watchman


Joined: 19 Feb 2003
Posts: 7313
Location: Groton, Massachusetts USA

PostPosted: Sun Jul 27, 2008 5:57 pm    Post subject: Re: Syncing very large number of files to another server

humbletech99 wrote:
I've tried using rsync to avoid re-copying a large volume of data by only taking the differences across to the backup server, but the extremely large number of files really hurts rsync and it takes forever while building its file lists.

I think rsync is a good choice in this situation.

Your bottleneck is most likely filesystem performance (or lack thereof).
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Sun Jul 27, 2008 7:07 pm

SnakeByte wrote:
You could give rsync on a per user directory a try.
Tried that already, in a bash script iterating over each user directory individually; it's still too slow and the file lists are too big.
Quote:
Or tar and bzip2 each user directory, copy it over and untar it, to save both CPU (for the change check) and bandwidth.
I've done a very similar thing elsewhere and have to say it is really not quick at all; even slower than the rsync method, I think, because you have to process a huge amount of data needlessly each time. I even did timing tests and found bzip2 to be a poor choice for this due to its extreme CPU usage; gzip was better, since with bzip2 it was the CPU on a dual-Opteron server, rather than the gigabit network, that became the bottleneck!

cyrillic: that's a good guess. I have observed CPU as the bottleneck on streaming zip -> copy -> unzip type operations, and RAM as the bottleneck with this rsync operation, as it chews up the entire gig of RAM on the server just to build the file list before it ever starts transferring files. I expect that tarring would definitely leave disk as the bottleneck, although I haven't tested that last one.

So I'm back to rsync. I should probably get some more RAM, but I'd love to find a superior solution to any of those discussed so far.... I'm tempted to use drbd for this; it's not trivial (not rocket science, but perhaps a little awkward) to make the existing servers and data support it. I have no free block devices, and as far as I know it's not supported to fake one with loopback devices either.

Open to suggestions
_________________
The Human Equation:

value(geeks) > value(mundanes)
sschlueter
Guru


Joined: 26 Jul 2002
Posts: 578
Location: Dortmund, Germany

PostPosted: Mon Jul 28, 2008 12:40 am

Are you already using rsync 3.0? If not, upgrading might help a bit:

Quote:
Beginning with rsync 3.0.0, the recursive algorithm used is now an incremental scan that uses much less memory than before and begins the transfer after the scanning of the first few directories have been completed. This incremental scan only affects our recursion algorithm, and does not change a non-recursive transfer. It is also only possible when both ends of the transfer are at least version 3.0.0.
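
A quick way to check what both ends are running (assuming you can ssh to the backup box):
Code:
rsync --version | head -n1
ssh backupserver 'rsync --version | head -n1'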
sschlueter
Guru


Joined: 26 Jul 2002
Posts: 578
Location: Dortmund, Germany

PostPosted: Mon Jul 28, 2008 1:38 am

I think the main problem here is that rsync lacks the feature to maintain and utilize an index. It must scan both local and remote directories each time it is run in order to determine the differences.

That being said, I would suggest git as a solution that uses an index.

This would include the following steps:

1) creating a git repository on the original machine
2) regularly creating new revisions by committing new/changed/deleted files
3) cloning the repository on the backup machine (by using git clone)
4) regularly syncing the cloned repository (by using git pull)


Step 2) is greatly simplified by an additional tool called gibak: "gibak commit" is all you have to do here. This step still requires a filesystem scan, but the author claims that it's faster than rsync's scan method.

The major advantage is that step 4) is way more efficient than using rsync because everything is indexed now.

Step 1) may not be space efficient (I haven't checked that) but keep in mind that step 2) gets you a versioned backup in addition to mere replication for free. Step 2) in itself is space efficient, by the way.
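
A minimal sketch of those steps with placeholder paths and hostnames (plain git commands here; gibak would wrap step 2):
Code:
# step 1, on the mail server: turn the maildir tree into a git repository
cd /data/maildirs
git init
git add -A && git commit -m "initial snapshot"

# step 2, run regularly: commit new/changed/deleted files
git add -A && git commit -m "snapshot $(date +%F)"

# step 3, on the backup server: clone once over ssh
git clone mailserver:/data/maildirs /backups/maildirs

# step 4, run regularly on the backup server: pull the new revisions
cd /backups/maildirs && git pull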


Edit: Even the scan in step 2) could be avoided if someone created a daemon that utilized the kernel's inotify feature to create the list of new/changed/deleted files. Any volunteers? :wink:
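
A very rough starting point using inotifywait from inotify-tools (not a real daemon, and the inotify watch limits would need raising for a tree this size):
Code:
# append the path of every created/modified/deleted/moved file to a change list
inotifywait -m -r -e create,modify,delete,move --format '%w%f' /data/maildirs \
  >> /var/tmp/maildir-changes.list

That list could then be fed to rsync --files-from instead of doing a full scan.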
humbletech99
Veteran


Joined: 26 May 2005
Posts: 1229
Location: London

PostPosted: Mon Jul 28, 2008 8:33 am

sschlueter: thanks for the recommendations. It looks like rsync's algorithm improvements will be well worth it; I'm still on 2.x on both ends, but will try rsync 3 when I get a chance.

I've not used git, but I have used subversion, and I have to say I still have my reservations about such a method... I'll look into git when I get a chance though.

Thanks again.
_________________
The Human Equation:

value(geeks) > value(mundanes)
i92guboj
Bodhisattva


Joined: 30 Nov 2004
Posts: 10315
Location: Córdoba (Spain)

PostPosted: Mon Jul 28, 2008 8:48 am

Some random bits.

- If performance is an issue, NFS will beat SSH because of the lack of encryption.
- Git might be an option to consider.
- If you use compression, use gzip instead of bzip2. It will for sure be a tad faster, and you will save a lot of bandwidth.

Sorry if some or all of these have already been mentioned; I don't have the time right now to read the whole thread.