Gentoo Forums
Spidering a website to d/l certain filetypes
hbmartin
Guru
Joined: 12 Sep 2003
Posts: 386
Location: Home is where the boxen are

Posted: Sun Feb 13, 2005 4:11 am    Post subject: Spidering a website to d/l certain filetypes

I'd like to get some sort of spider that can crawl a site but only download certain filetypes. I've been browsing indie music sites with mp3 archives (totally legal) and I'd like to be able to download the mp3s linked to from various pages without having to go through all the pages manually. Any ideas on how to find or set up such a spider?

Thanks,
Harold
moocha
Watchman
Joined: 21 Oct 2003
Posts: 5722

Posted: Sun Feb 13, 2005 4:48 am    Post subject:

Code:
wget -r -np -l 20 -A mp3,MP3 http://domain.or.ip.address/path

For more info, consult the wget manual.
Beware though that recursive downloading / spidering like this may be breaking that site's terms of use.
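
If the site's terms are a concern, wget can at least be told to go gently. A rough sketch, assuming a placeholder URL (example.com is not from this thread) and arbitrary wait and rate values picked only for illustration. It prints the command rather than running it, so it can be checked before anything is fetched:

```shell
#!/bin/sh
# Placeholder URL; the wait and rate values are arbitrary assumptions.
url="http://www.example.com/path/"

# -r             recurse through links
# -np            never ascend to the parent directory
# -l 20          follow links at most 20 levels deep
# -A mp3,MP3     keep only files with these suffixes
# -w 2           wait 2 seconds between retrievals
# --random-wait  vary that wait so requests are less regular
# --limit-rate   cap the download speed
polite_cmd="wget -r -np -l 20 -A mp3,MP3 -w 2 --random-wait --limit-rate=200k $url"

# Print only; paste the line into a shell once it looks right.
echo "$polite_cmd"
```

Note that with -A in a recursive run, wget still fetches the HTML pages to find links and deletes the ones that don't match the list afterwards, so the pages themselves are still requested.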
_________________
Military Commissions Act of 2006: http://tinyurl.com/jrcto

"Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety."
-- attributed to Benjamin Franklin
hbmartin
Guru
Joined: 12 Sep 2003
Posts: 386
Location: Home is where the boxen are

Posted: Thu Feb 17, 2005 2:56 am    Post subject:

Thanks for your reply. Most of the actual mp3 files are hosted on a different site, so this command doesn't appear to d/l them. Is there any way to force the spider to d/l from other sites?

Thanks,
Harold
teknomage1
Veteran
Joined: 05 Aug 2003
Posts: 1239
Location: Los Angeles, CA

Posted: Thu Feb 17, 2005 4:21 am    Post subject:

Are you any good with Perl? The LWP modules make spidering by type very doable. Of course, web spidering versus HTML is always going to be somewhat tricky. O'Reilly's website has some great stuff on using WWW::Mechanize.
moocha
Watchman
Joined: 21 Oct 2003
Posts: 5722

Posted: Thu Feb 17, 2005 4:49 am    Post subject:

hbmartin wrote:
Thanks for your reply. Most of the actual mp3 files are hosted on a different site, so this command doesn't appear to d/l them. Is there any way to force the spider to d/l from other sites?

Thanks,
Harold

The wget manual as indicated in my previous post wrote:
-H
--span-hosts
Enable spanning across hosts when doing recursive retrieving.

-D domain-list
--domains=domain-list
Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
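
Put together with the command from earlier in the thread, that might look something like the sketch below. Both hostnames are made up for illustration (the real sites aren't named in this thread), and the `spider` function and `DRY_RUN` switch are just helpers for this example, not part of wget. -H lets the recursion leave the starting host, and -D keeps it from wandering off to every site that happens to be linked:

```shell
#!/bin/sh
# Sketch only: both hostnames are placeholders, not the sites from this thread.
DRY_RUN=1   # 1 = print the command, anything else = actually run wget

spider() {
    # $1 = start URL, $2 = comma-separated domains wget may follow
    set -- wget -r -np -l 20 -H -D "$2" -A mp3,MP3 "$1"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Pages on example.com, mp3 files assumed to live on example.net
spider "http://www.example.com/music/" "example.com,example.net"
```

Set DRY_RUN to 0 once the printed command looks right. The domain list should name both the site with the pages and the site that actually hosts the files, since -D restricts everything wget follows.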

All times are GMT