hbmartin Guru
Joined: 12 Sep 2003 Posts: 386 Location: Home is where the boxen are
Posted: Sun Feb 13, 2005 4:11 am    Post subject: Spidering a website to d/l certain filetypes

I'd like to get some sort of spider that can crawl a site but only download certain filetypes. I've been browsing indie music sites with mp3 archives (totally legal) and I'd like to be able to download the mp3s linked to from various pages without having to go through all the pages manually. Any ideas on how to find or set up such a spider?
Thanks,
Harold

moocha Watchman
Joined: 21 Oct 2003 Posts: 5722
Posted: Sun Feb 13, 2005 4:48 am

Code:
wget -r -np -l 20 -A mp3,MP3 http://domain.or.ip.address/path
For more info, consult the wget manual.
Beware though that recursive downloading / spidering like this may be breaking that site's terms of use.

hbmartin Guru
Joined: 12 Sep 2003 Posts: 386 Location: Home is where the boxen are
Posted: Thu Feb 17, 2005 2:56 am

Thanks for your reply. Most of the actual mp3 files are hosted on a different site, so this command doesn't appear to d/l them. Is there any way to force the spider to d/l from other sites?
Thanks,
Harold

teknomage1 Veteran
Joined: 05 Aug 2003 Posts: 1239 Location: Los Angeles, CA
Posted: Thu Feb 17, 2005 4:21 am

Are you any good with Perl? The LWP modules make spidering by filetype very doable. Of course, spidering the web as opposed to parsing a single HTML page is always going to be somewhat tricky. O'Reilly's website has some great stuff on using WWW::Mechanize.
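For example, a minimal single-page sketch with WWW::Mechanize could look roughly like this (the URL is a placeholder, and a real spider would still need to recurse through the archive's other pages):

Code:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Placeholder: the archive page whose mp3 links you want to grab.
my $start = 'http://www.example.com/mp3archive/';

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get($start);

# find_all_links() can filter on the link URL with a regex,
# so only the .mp3 links are kept.
my @mp3_links = $mech->find_all_links( url_regex => qr/\.mp3$/i );

for my $link (@mp3_links) {
    my $url    = $link->url_abs;          # absolute URL even if the href was relative
    my ($file) = $url =~ m{([^/]+)$};     # last path component as the local filename
    print "Fetching $url -> $file\n";
    $mech->get( $url, ':content_file' => $file );
}

find_all_links() plus url_abs() is what makes the cross-site case easy here: the mp3 links resolve to absolute URLs no matter which host they point at.
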
moocha Watchman
Joined: 21 Oct 2003 Posts: 5722
Posted: Thu Feb 17, 2005 4:49 am

hbmartin wrote:
Thanks for your reply. Most of the actual mp3 files are hosted on a different site, so this command doesn't appear to d/l them. Is there any way to force the spider to d/l from other sites?
Thanks,
Harold
The wget manual as indicated in my previous post wrote:
-H
--span-hosts
    Enable spanning across hosts when doing recursive retrieving.

-D domain-list
--domains=domain-list
    Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
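Putting that together with the earlier command, an invocation along these lines (untested here; the second domain is a placeholder for whichever host actually serves the mp3s) should also follow the off-site links:

Code:
wget -r -np -l 20 -H -D domain.or.ip.address,mp3host.example.com -A mp3,MP3 http://domain.or.ip.address/path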