Gentoo Forums
Is it possible to download the forum? For offline reading
vitaly-zdanevich
Tux's lil' helper


Joined: 01 Dec 2019
Posts: 106
Location: Belarus

Posted: Tue Jul 09, 2024 10:22 am    Post subject: Is it possible to download the forum? For offline reading

Hi.
Banana
Moderator


Joined: 21 May 2004
Posts: 1720
Location: Germany

Posted: Tue Jul 09, 2024 6:19 pm

No.

You will have to consult your favourite search engine to find a tool that can crawl websites and store the pages as HTML files.
_________________
Forum Guidelines

PFL - Portage file list - find which package a file or command belongs to.
My delta-labs.org snippets do expire
Hu
Administrator


Joined: 06 Mar 2007
Posts: 22618

Posted: Tue Jul 09, 2024 9:56 pm

Before trying this, review very carefully what the tool will download. The forums are massive, and a careless attempt to spider the forums will download far more than you want, as well as likely putting enough load on the server to irritate the host.
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3420

Posted: Tue Jul 09, 2024 10:13 pm

https://forums.gentoo.org/robots.txt
wget's mirror mode respects robots.txt, so it should be fine if you put a speed limit on it. If you don't, you will certainly irritate the infra guys.
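
For anyone who would rather script this than rely on wget, here is a minimal Python sketch of the same two ideas: honour robots.txt and pause between requests. The robots rules, the forums.example.org URLs, the "offline-reader" user agent and the function names are all made up for illustration; they are not the real forums.gentoo.org file or a recommended configuration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real forums.gentoo.org file.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search.php
Disallow: /posting.php
"""


def allowed_urls(urls, robots_text, user_agent="offline-reader"):
    """Return only the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one at a time, sleeping between them to cap the request rate."""
    for index, item in enumerate(items):
        if index:  # no pause before the very first item
            sleep(delay_seconds)
        yield item
```

A crawler loop would then iterate over throttled(allowed_urls(candidates, robots_text)) and fetch each URL in turn. With wget itself, the corresponding knobs are the --wait and --limit-rate options alongside --mirror.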

Any idea why it bans magpie-crawler though? It's a funny way to ban a spider too; if one wants to misbehave, it can start with simply ignoring the kind request anyway.
_________________
Make Computing Fun Again
Hu
Administrator


Joined: 06 Mar 2007
Posts: 22618

Posted: Wed Jul 10, 2024 12:20 am

I was not involved in the creation of that file, so I cannot comment on magpie. However, having looked at it, I don't think that exclusion is sufficient to address my earlier warning. Once a spider finds a topic, it also finds links to the individual posts of that thread, so a 25-post topic will be downloaded 26 times: once as a topic, and once as each of its 25 posts. As the number of posts increases, the problem gets worse. Those robots.txt exclusions might keep a crawler from getting topic index lists, but if the user provides a link to a topic, the crawler can hop from thread to thread using the next/previous links.
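
To put rough numbers on the "26 times" figure, here is a small Python sketch. It assumes, purely for illustration, that every per-post link returns the whole topic on a single page; pagination would change the size of each response but not the number of requests. The function names are made up.

```python
def topic_fetches(post_count):
    """Requests a naive mirror makes for one topic: the topic page plus one per post link."""
    return 1 + post_count


def post_copies_downloaded(post_count):
    """Post renderings transferred, assuming each request returns the full single-page topic."""
    return topic_fetches(post_count) * post_count


for n in (5, 25, 100):
    print(n, topic_fetches(n), post_copies_downloaded(n))
```

For the 25-post topic above, this gives 26 requests and 650 post renderings to end up with 25 distinct posts, and under that assumption the waste grows roughly with the square of the topic length.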
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3420

Posted: Wed Jul 10, 2024 12:54 am

No, I think the first part of robots.txt (for all user agents) is a good instruction for well-behaved crawlers, since it does allow indexing of content with permanent addresses, while filtering out the volatile queries.
Links to individual posts shouldn't be a problem either, since that #part is only a reference to the ID attribute of an HTML element within the same document. Clicking on those links doesn't refresh the page either; the browser just jumps to the position. Unless there are some other links to posts I'm not aware of?
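
The fragment behaviour is easy to check with Python's standard library. This is only a sketch of the general rule, using a made-up forums.example.org URL and a hypothetical helper name; it holds only when two links differ in nothing but the #part.

```python
from urllib.parse import urldefrag


def same_document(url_a, url_b):
    """True when two URLs differ at most in their fragment, so one download covers both."""
    return urldefrag(url_a).url == urldefrag(url_b).url


topic = "https://forums.example.org/viewtopic-t-100.html"  # hypothetical URL
print(urldefrag(topic + "#8833110"))
print(same_document(topic + "#8833110", topic + "#8833113"))  # prints True
```

A crawler that strips fragments before queuing links would fetch such a page once. Links whose paths differ are still separate documents as far as the crawler is concerned.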

It's the second part, for magpie, that baffles me.
Hu
Administrator


Joined: 06 Mar 2007
Posts: 22618

Posted: Wed Jul 10, 2024 2:07 am

When viewing the thread, for each post, there is an icon that links to post 8833110, post 8833113, etc. Those are all separate URLs from the thread itself, although their content is largely duplicative. You are right that those also have a fragment to jump the browser to the specific post, but my concern is that a crawler that is trying to mirror everything would retrieve each of those [post] links individually.
szatox
Advocate


Joined: 27 Aug 2013
Posts: 3420

Posted: Wed Jul 10, 2024 10:10 am

Wow, you got me on this one. It actually uses the post ID in the path followed by the post ID of the element, instead of the thread and page in the path followed by the post ID of the element. OK, now that IS actually a problem.
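
Since the post ID sits in the path, stripping the fragment is not enough; a mirror script would have to skip those links by pattern. A rough Python sketch follows. The "viewtopic-p-<id>.html" shape, the forums.example.org host and the function names are assumptions for illustration and should be checked against the links the forum really emits.

```python
import re
from urllib.parse import urlsplit

# Assumed URL shape for per-post links; adjust to whatever the forum really emits.
PER_POST_PATH = re.compile(r"/viewtopic-p-\d+\.html$")


def is_per_post_link(url):
    """True when the path looks like a link to a single post rather than a topic."""
    return bool(PER_POST_PATH.search(urlsplit(url).path))


def links_worth_fetching(urls):
    """Drop fragments, per-post links and duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for url in urls:
        bare = urlsplit(url)._replace(fragment="").geturl()
        if is_per_post_link(bare) or bare in seen:
            continue
        seen.add(bare)
        kept.append(bare)
    return kept
```

With wget, the --reject-regex option can express a similar path filter.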
All times are GMT