humbletech99 Veteran
Joined: 26 May 2005 Posts: 1229 Location: London
Posted: Fri Jul 18, 2008 10:45 am Post subject: Syncing very large number of files to another server
I've got a folder with a very large number of files and subdirectories (it's full of maildir folders for my company, actually); the file count runs into the millions. I need to replicate this to another server as a backup.
The number of files is not the only consideration; the data volume is also significant, even over gigabit.
I've tried rsync, to avoid re-copying a large volume of data by only sending the differences across to the backup server, but the extremely large number of files hurts rsync and it takes forever building its file lists.
Does anybody have a better idea or another tool for replicating a directory structure with both a large volume and a large number of files?
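For reference, a minimal sketch of the kind of rsync invocation being described here; the paths, hostname and flags are assumptions, not taken from the thread:
Code: | rsync -a --delete /srv/maildirs/ backupserver:/srv/maildirs/ |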
_________________ The Human Equation: value(geeks) > value(mundanes)
SnakeByte Apprentice
Joined: 04 Oct 2002 Posts: 177 Location: Europe - Germany
Posted: Fri Jul 18, 2008 11:02 am
Hi,
as you have Quote: | a folder with a very large number of files and subdirectories |, you can still use rsync.
Just do a Code: | for subfolder in */*; do rsync -aR "$subfolder" <target>; done |
This will reduce the number of files rsync has to handle in each step.
Regards
jsosic Guru
Joined: 02 Aug 2004 Posts: 510 Location: Split (Croatia)
Posted: Fri Jul 18, 2008 2:23 pm
I'm syncing 6 GB of data (small files, web server) with rsync, and it works quite fast: a run finishes in approximately 15 seconds.
If the other server is empty, do the first copy with scp, and later use rsync to sync only the differences.
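A minimal sketch of that two-step approach (paths and hostname are hypothetical):
Code: | # one-off full copy while the target is still empty
scp -rp /srv/maildirs backupserver:/srv/
# later runs only transfer the differences
rsync -a --delete /srv/maildirs/ backupserver:/srv/maildirs/ |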
_________________ I avenge with darkness, the blood is the life
The Order of the Dragon, I feed on human life
humbletech99 Veteran
Joined: 26 May 2005 Posts: 1229 Location: London
Posted: Fri Jul 18, 2008 11:24 pm
SnakeByte: thanks for the idea, but I already do this, splitting the file list into multiple smaller lists; it is still too much of a burden and takes hours to run, and I know a lot of that time is simply file list generation.
To put this in perspective for you and jsosic, I'll update you with a count from my server... as soon as it finishes counting!!!
_________________ The Human Equation: value(geeks) > value(mundanes)
humbletech99 Veteran
Joined: 26 May 2005 Posts: 1229 Location: London
Posted: Sun Jul 20, 2008 1:47 pm
OK here are my current counts to give you an idea of what I am trying to manage:
Code: | find .|wc -l
12537631
du -cshx .
179G .
179G total |
So there you go: 12 million files/directories and 179G. I had to leave that du running overnight just for it to finish...!
Now you can see why my file lists take so long to build (even when I loop over each subdir independently), and why I don't want to tar the whole thing and so on: that wastes time on a large volume of data that is already on my other server, and re-copying everything every time is really out of the question...
All ideas welcome at this point.
_________________ The Human Equation: value(geeks) > value(mundanes)
think4urs11 Bodhisattva
Joined: 25 Jun 2003 Posts: 6659 Location: above the cloud
Posted: Sun Jul 20, 2008 9:05 pm
-W might help a bit, but with those numbers I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers.
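As an illustration only, such regular snapshots could for example be taken with LVM on the storage backend; a rough sketch, assuming a hypothetical volume group vg0 holding the mail data:
Code: | lvcreate --snapshot --size 10G --name maildirs-$(date +%F) /dev/vg0/maildirs |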
_________________ Nothing is secure / Security is always a trade-off with usability / Do not assume anything / Trust no-one, nothing / Paranoia is your friend / Think for yourself
lucapost Veteran
Joined: 24 Nov 2005 Posts: 1419 Location: <ud|me|ts> - Italy
Posted: Sun Jul 20, 2008 10:36 pm
You can use an FTP client that supports a mirror function, like lftp (mirror -R)....
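A rough sketch of a reverse mirror with lftp (hostname and paths are hypothetical):
Code: | lftp -e "mirror -R /srv/maildirs /backups/maildirs; quit" ftp://backupserver |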
_________________ LP
SnakeByte Apprentice
Joined: 04 Oct 2002 Posts: 177 Location: Europe - Germany
Posted: Mon Jul 21, 2008 4:05 pm
lucapost wrote: | You can use an FTP client that supports a mirror function, like lftp (mirror -R).... |
This would result in a full copy, wouldn't it?
@humbletech99
Can you give some more information about the general directory layout?
Is it symmetric? Is there a given pattern for directory and file names?
Regards
aronparsons Tux's lil' helper
Joined: 04 Oct 2004 Posts: 117 Location: Virginia
Posted: Mon Jul 21, 2008 8:32 pm
Have you tried running rsync over NFS rather than tunneling it via SSH? (I'm assuming you're doing something like "rsync /data/. backupserver:/backups/", since you didn't specify otherwise.) If you do this, export the share read-only and asynchronous (ro,async).
What is your hard drive configuration (drive speed, interface, RAID, LVM, filesystem, etc.)?
Something else that might help is disabling access time updates (the 'noatime' and 'nodiratime' mount options); this may not apply to your filesystems, but it will for ext3 and ReiserFS.
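A rough sketch of those settings; the device, paths and hostnames are hypothetical:
Code: | # /etc/exports on the mail server: read-only, asynchronous export
/srv/maildirs  backupserver(ro,async)
# on the backup server: mount the export and pull the differences with rsync
mount -o ro mailserver:/srv/maildirs /mnt/maildirs
rsync -a --delete /mnt/maildirs/ /backups/maildirs/
# /etc/fstab entry on the mail server with atime updates disabled (ext3 shown)
/dev/sda3  /srv/maildirs  ext3  defaults,noatime,nodiratime  0 2 |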
humbletech99 Veteran
Joined: 26 May 2005 Posts: 1229 Location: London
Posted: Thu Jul 24, 2008 1:28 pm
Think4UrS11 wrote: | -W might help a bit, but with those numbers I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers. |
I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that.
aronparsons: both servers have 7200rpm SATA RAID arrays spanning several TB. There is no LVM in use on them.
SnakeByte: the structure is like user/maildirs/..., where the top directory is split into per-user subdirs, each containing files and further subdirs as per the maildir structure.
_________________ The Human Equation: value(geeks) > value(mundanes)
think4urs11 Bodhisattva
Joined: 25 Jun 2003 Posts: 6659 Location: above the cloud
Posted: Thu Jul 24, 2008 9:50 pm
humbletech99 wrote: | Think4UrS11 wrote: | -W might help a bit, but with those numbers I'd use a file storage server as the backend and have it take regular snapshots for backup, instead of holding the same data on individual servers. |
I'm not sure how this -W (whole file) option would help. The man page says it does not use the rsync algorithm. In that case, does it mean copying all the files regardless? At 179GB I can't even try to do that. |
For example, if you have a high number of small files, or not too many big files whose content changes, -W simply copies each changed file whole instead of checking its content; that lowers processor usage but probably needs a bit more bandwidth.
Depending on your exact type of data this might, as said, be an option.
Actually, since you have a mail server, normally not many of the existing files will change, as 1 file = 1 mail (roughly speaking).
There will always be new files and deleted files, but not many changed files. So there's no real need to have the server check each and every file in depth; it might be quicker to simply transfer changed files completely.
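A sketch of a run using -W; paths and hostname are hypothetical, the other flags assumed:
Code: | rsync -aW --delete /srv/maildirs/ backupserver:/srv/maildirs/ |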
_________________ Nothing is secure / Security is always a trade-off with usability / Do not assume anything / Trust no-one, nothing / Paranoia is your friend / Think for yourself
SnakeByte Apprentice
Joined: 04 Oct 2002 Posts: 177 Location: Europe - Germany
Posted: Sun Jul 27, 2008 4:51 pm
humbletech99 wrote: | OK here are my current counts to give you an idea of what I am trying to manage:
Code: | find .|wc -l
12537631
du -cshx .
179G .
179G total |
|
So, 12 million files split across how many users?
You could give rsync a try on a per-user-directory basis.
Or tar and bzip each user directory, copy it over and untar it, to save both CPU (for the change check) and bandwidth.
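A rough per-user sketch of that tar-and-copy idea; $user, the paths and the hostname are hypothetical:
Code: | tar -cjf /tmp/"$user".tar.bz2 -C /srv/maildirs "$user"
scp /tmp/"$user".tar.bz2 backupserver:/tmp/
ssh backupserver "tar -xjf /tmp/$user.tar.bz2 -C /backups/maildirs" |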
Regards
cyrillic Watchman
Joined: 19 Feb 2003 Posts: 7313 Location: Groton, Massachusetts USA
Posted: Sun Jul 27, 2008 5:57 pm Post subject: Re: Syncing very large number of files to another server
humbletech99 wrote: | I've tried rsync, to avoid re-copying a large volume of data by only sending the differences across to the backup server, but the extremely large number of files hurts rsync and it takes forever building its file lists. |
I think rsync is a good choice in this situation.
Your bottleneck is most likely filesystem performance (or lack thereof).
humbletech99 Veteran
Joined: 26 May 2005 Posts: 1229 Location: London
Posted: Sun Jul 27, 2008 7:07 pm
SnakeByte wrote: | You could give rsync a try on a per-user-directory basis. |
Tried that already, in a bash script iterating over each user directory individually; it's still too slow and the file lists are too big.
Quote: | Or tar and bzip each user directory, copy it over and untar it, to save both CPU (for the change check) and bandwidth. |
I've done a very similar thing elsewhere and have to say it is really not quick at all; even slower than the rsync method, I think, because you have to process a huge amount of data needlessly each time. I even did timing tests and found bzip2 to be a poor choice for this due to its extreme CPU usage; gzip was better, as with bzip2 it was no longer my gigabit network that was the bottleneck so much as the CPU on a dual Opteron server!
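For reference, the kind of streaming pipeline being compared here might look like this; paths and hostname are hypothetical, gzip shown per the timing result above:
Code: | tar -cf - -C /srv/maildirs . | gzip -1 | ssh backupserver "gzip -d | tar -xf - -C /backups/maildirs" |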
cyrillic: that's a good guess. I have observed CPU as the bottleneck in streaming zip -> copy -> unzip type operations, and RAM as the bottleneck with this rsync operation, since it chews up the entire gig of RAM on the server just to build the file list before it ever starts transferring files; I expect that tarring would leave disk as the bottleneck, although I haven't tested that last one.
So I'm back to rsync. I should probably get some more RAM, but I'd love to find a solution superior to any of those discussed so far.... I'm tempted to use drbd for this; it's not trivial (not rocket science, but perhaps a little awkward) to make the existing servers and data support it. I have no free block devices, and as far as I know it's not supported to fake it with loopback devices either.
Open to suggestions.
_________________ The Human Equation: value(geeks) > value(mundanes)
sschlueter Guru
Joined: 26 Jul 2002 Posts: 578 Location: Dortmund, Germany
Posted: Mon Jul 28, 2008 12:40 am
Are you already using rsync 3.0? If not, upgrading might help a bit:
Quote: | Beginning with rsync 3.0.0, the recursive algorithm used is now an incremental scan that uses much less memory than before and begins the transfer after the scanning of the first few directories have been completed. This incremental scan only affects our recursion algorithm, and does not change a non-recursive transfer. It is also only possible when both ends of the transfer are at least version 3.0.0. |
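Since the incremental scan requires 3.0.0 on both ends, a quick way to check both sides (hostname hypothetical):
Code: | rsync --version | head -n1
ssh backupserver "rsync --version | head -n1" |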
sschlueter Guru
Joined: 26 Jul 2002 Posts: 578 Location: Dortmund, Germany
Posted: Mon Jul 28, 2008 1:38 am
I think the main problem here is that rsync lacks the ability to maintain and use an index: it must scan both the local and the remote directories each time it is run in order to determine the differences.
That being said, I would suggest git as a solution that uses an index.
This would involve the following steps:
1) creating a git repository on the original machine
2) regularly creating new revisions by committing new/changed/deleted files
3) cloning the repository onto the backup machine (using git clone)
4) regularly syncing the cloned repository (using git pull)
Step 2) is greatly simplified by an additional tool called gibak: "gibak commit" is all you have to do here. This step still requires a filesystem scan, but the author claims that it's faster than rsync's scan method.
The major advantage is that step 4) is far more efficient than using rsync, because everything is indexed now.
Step 1) may not be space efficient (I haven't checked), but keep in mind that step 2) gets you a versioned backup, in addition to mere replication, for free. Step 2) in itself is space efficient, by the way.
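A rough sketch of those four steps with plain git, leaving gibak aside; hostnames and paths are hypothetical:
Code: | # 1) on the mail server, once
cd /srv/maildirs && git init && git add -A && git commit -m "initial snapshot"
# 3) on the backup server, once
git clone mailserver:/srv/maildirs /backups/maildirs
# 2) on the mail server, on a schedule (git add -A also stages deletions)
git add -A && git commit -m "snapshot $(date +%F)"
# 4) on the backup server, on a schedule
cd /backups/maildirs && git pull |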
Edit: Even the scan in step 2) could be avoided if someone created a daemon that used the kernel's inotify feature to build the list of new/changed/deleted files. Any volunteers? ;)
humbletech99 Veteran
Joined: 26 May 2005 Posts: 1229 Location: London
Posted: Mon Jul 28, 2008 8:33 am
sschlueter: thanks for the recommendations. It looks like rsync's algorithm improvements will be well worth it; I'm still on 2.x on both ends, but will try rsync 3 when I get a chance.
I've not used git, but I have used Subversion, and I have to say I still have my reservations about such a method... I'll look into git when I get a chance, though.
Thanks again.
_________________ The Human Equation: value(geeks) > value(mundanes)
i92guboj Bodhisattva
Joined: 30 Nov 2004 Posts: 10315 Location: Córdoba (Spain)
Posted: Mon Jul 28, 2008 8:48 am
Some random bits:
- If performance is an issue, NFS will beat SSH because of the lack of encryption.
- Git might be an option to consider.
- If you use compression, use gzip instead of bzip2. It will for sure be faster, and you will still save a lot of bandwidth.
Sorry if some or all of these have already been mentioned; I don't have the time right now to read the whole thread.