hds Advocate
Joined: 21 Aug 2004 Posts: 2629 Location: Sprockhoevel [GER]
|
Posted: Thu Apr 14, 2005 10:29 am Post subject: |
|
|
i just realized that "screen" is a stopword. could it be removed from the list? there is an application called screen (everyone probably knows that) and i have answered questions about it at least 2 or 3 times. |
|
|
|
kbranch n00b
Joined: 17 Nov 2004 Posts: 40
|
Posted: Fri May 13, 2005 3:45 am Post subject: |
|
|
I can understand the reasons for having a list like this, but I'd say that having this many words on it is just overkill. Many of the words on the list are very useful under some circumstances.
For example, let's say you're having a problem compiling kde 3.4. Such a problem would usually mean that there's something wrong with the ebuild, so other users have likely seen the same problem and posted a fix for it. I'd expect a search for "kde 3.4 compile error" to turn up some useful posts within the first few results, but the current search would just turn up useless crap about kde 3.4.
I've run into the problem several times without knowing why I got useless results. I'd say that slightly slower servers would be better than people thinking that the forums are just full of crap. |
|
|
|
pjp Administrator
Joined: 16 Apr 2002 Posts: 20551
|
Posted: Fri May 13, 2005 4:36 am Post subject: |
|
|
kbranch wrote: | the current search would just turn up useless crap about kde 3.4. | Actually, it'll just turn up useless crap about kde. Numbers such as "3.4" aren't indexed. The stopwords list was generated based on how often each word is used. For example, "the" and "gentoo" are not likely to be helpful search words. _________________ Quis separabit? Quo animo? |
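To make the effect on kbranch's query concrete, here is a minimal sketch of how a query could shrink before it reaches the index. It is not the forum software's actual code. The `STOPWORDS` sample and the `searchable_terms` helper are hypothetical; the sample only uses words that posters in this thread name as stopwords or as examples, and the rule that tokens without letters are dropped is an assumption based on the remark that numbers aren't indexed.

```python
import re

# Hypothetical sample; the real stopword list is much longer.
STOPWORDS = {"the", "gentoo", "compile", "error", "screen"}


def searchable_terms(query):
    """Return the query terms that survive filtering, in order."""
    terms = []
    for token in re.findall(r"[a-z0-9.]+", query.lower()):
        token = token.strip(".")
        if not token:
            continue
        # Assumption: tokens without any letter (such as "3.4") are not indexed.
        if not re.search(r"[a-z]", token):
            continue
        # Stopwords are dropped from the query as well.
        if token in STOPWORDS:
            continue
        terms.append(token)
    return terms


print(searchable_terms("kde 3.4 compile error"))  # ['kde']
```

Under these assumptions, only "kde" is left to search on, which matches the behaviour described above.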
|
|
|
kbranch n00b
Joined: 17 Nov 2004 Posts: 40
|
Posted: Fri May 13, 2005 5:02 am Post subject: |
|
|
pjp wrote: | kbranch wrote: | the current search would just turn up useless crap about kde 3.4. | Actually, it'll just turn up useless crap about kde. Numbers such as "3.4" aren't indexed. The stopwords list was generated based on how often each word is used. For example, "the" and "gentoo" are not likely to be helpful search words. |
Well, I guess that makes my example somewhat less than useful, but the underlying point still stands.
If the list was just automatically generated and there aren't any specific objections to removing the more useful words, I'd be glad to go through and find such words. |
|
|
|
pjp Administrator
Joined: 16 Apr 2002 Posts: 20551
|
Posted: Fri May 13, 2005 5:04 am Post subject: |
|
|
Well, it wasn't a script that did it without human intervention. Someone looked at the worst offenders and filtered out the obviously useful terms, such as kde. _________________ Quis separabit? Quo animo? |
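The process described here (rank words by how often they are used, then have a person exempt the useful ones) could be sketched roughly as follows. This is an illustration only, not the script that was used; `stopword_candidates`, its parameters and the sample posts are all hypothetical.

```python
import re
from collections import Counter


def stopword_candidates(posts, top_n, keep):
    """Rank words by frequency and return the top_n, minus exempted terms.

    posts: iterable of post texts
    top_n: how many of the most frequent words to consider
    keep:  words a human reviewer decided are still useful (e.g. "kde")
    """
    counts = Counter()
    for post in posts:
        counts.update(re.findall(r"[a-z]+", post.lower()))
    ranked = [word for word, _ in counts.most_common(top_n)]
    return [word for word in ranked if word not in keep]


sample_posts = [
    "the kde build failed on the laptop",
    "gentoo kde error in the log",
    "the gentoo screen session died",
    "gentoo emerge hangs",
]
print(stopword_candidates(sample_posts, top_n=3, keep={"kde"}))
```

In the sample, "the", "gentoo" and "kde" are the three most frequent words; the manual exemption keeps "kde" searchable while the other two become stopword candidates.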
|
|
|
kbranch n00b
Joined: 17 Nov 2004 Posts: 40
|
Posted: Fri May 13, 2005 5:11 am Post subject: |
|
|
So then why are some of those words still on the list? Is there another side to things that I haven't seen in this thread or is it just that nobody's done it yet?
Again, if it's just a question of effort, I'll be glad to submit a revised list. |
|
|
|
cokey Advocate
Joined: 23 Apr 2004 Posts: 3355
|
Posted: Fri May 13, 2005 5:15 am Post subject: |
|
|
what happens when you put a phrase in speech marks like in a search engine? _________________ https://otw20.com/ OTW20 The new place for off the wall chat |
|
|
|
tomk Bodhisattva
Joined: 23 Sep 2003 Posts: 7221 Location: Sat in front of my computer
|
Posted: Fri May 13, 2005 7:19 am Post subject: |
|
|
cokehabit wrote: | what happens when you put a phrase in speech marks like in a search engine? |
That doesn't make any difference with the forums search, as it only knows about individual words, not groups of words.
kbranch wrote: | So then why are some of those words still on the list? Is there another side to things that I haven't seen in this thread or is it just that nobody's done it yet?
Again, if it's just a question of effort, I'll be glad to submit a revised list. |
It's not just a matter of removing words from the list: the stopwords aren't indexed, so existing posts would have to be re-indexed before those words became searchable. Re-indexing isn't an option on forums this big. _________________ Search | Read | Answer | Report | Strip |
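Both points in this post can be illustrated with a toy word-level index. This is a sketch under assumptions, not the forum's real schema: `build_index` and `search` are hypothetical helpers, and the index only maps single words to post ids, so it has no idea of word order or adjacency.

```python
import re
from collections import defaultdict


def build_index(posts, stopwords):
    """Map each non-stopword to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            if word not in stopwords:  # stopwords never enter the index
                index[word].add(post_id)
    return index


def search(index, query):
    """Return ids of posts containing every query word, in any order."""
    results = None
    for word in re.findall(r"[a-z]+", query.lower()):
        ids = index.get(word, set())
        results = set(ids) if results is None else results & ids
    return results or set()


posts = {
    1: "screen detaches when ssh drops",
    2: "you have mail",
    3: "do you mail what you have",
}
index = build_index(posts, stopwords={"screen"})

# Quoting a phrase changes nothing: both posts contain all three words.
print(sorted(search(index, '"you have mail"')))

# "screen" was skipped at indexing time, so dropping it from the stopword
# list later finds nothing until the posts are indexed again.
print(sorted(search(index, "screen")))
reindexed = build_index(posts, stopwords=set())
print(sorted(search(reindexed, "screen")))
```

The last step is the expensive one: on a real board it means running every existing post through the indexer again.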
|
|
|
Gherald Veteran
Joined: 23 Aug 2004 Posts: 1399 Location: CLUAConsole
|
|
|
|
TXTad Tux's lil' helper
Joined: 15 Jan 2004 Posts: 108 Location: Texas
|
Posted: Wed May 18, 2005 3:01 pm Post subject: |
|
|
Wow. No wonder searching for anything in the forums is such a whippin'. I agree with this concept for keywords, but this is completely counterproductive for phrases, such as the "you have mail" example. As a sysadmin, I can sympathize with the resources being consumed, but that's what we have computers for: to do things that are hard for humans to do. If the computer has to work hard, then it's doing its job. Breaking the system for humans so that things are easy on the computer doesn't seem like the proper way to handle a problem.
Tad |
|
|
|
pjp Administrator
Joined: 16 Apr 2002 Posts: 20551
|
Posted: Wed May 18, 2005 3:04 pm Post subject: |
|
|
TXTad wrote: | Breaking the system for humans so that things are easy on the computer doesn't seem like the proper way to handle a problem. | When the hardware can no longer perform the tasks demanded by humans, and there is no reasonable means to ensure power is available as load increases, tradeoffs need to be made. If you'd like to donate an 8-way dual-core Opteron system with SCSI disks and many gigs of RAM, that might delay it for a while. _________________ Quis separabit? Quo animo? |
|
|
|
killfire l33t
Joined: 04 Oct 2003 Posts: 618
|
Posted: Thu May 19, 2005 10:53 pm Post subject: |
|
|
is it possible to say, not index those words in the actual posts (but take them from the post subject), and get rid of the stopwords altogether? or block them when alone, but for some, use it when other (valid) words are combined? like have a complete block list (like RTFM, the, or LOL) and have a partial block list (error, screen, compile...) for things that are combined?
because just stripping them seems like it will lower the accuracy of the posts a lot...
then again, i have no idea how the db works... _________________ my website, built in HAppS: http://dbpatterson.com
an art (oil painting) website I built a pure python backend for: http://www.lydiajohnston.com |
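The two-list idea in this post could look roughly like the sketch below. It is only an illustration of the proposal, not something the forum software does. The names `ALWAYS_BLOCKED`, `BLOCKED_WHEN_ALONE` and `filter_query` are hypothetical, and the word samples are the ones given in the post. It also assumes the second-tier words are present in the index, which this thread says is not the case for current stopwords.

```python
import re

# Hypothetical samples taken from the post above.
ALWAYS_BLOCKED = {"the", "rtfm", "lol"}
BLOCKED_WHEN_ALONE = {"error", "screen", "compile"}


def filter_query(query):
    """Drop fully blocked words; keep partially blocked words only when
    at least one ordinary (unblocked) word accompanies them."""
    words = [
        w for w in re.findall(r"[a-z0-9]+", query.lower())
        if w not in ALWAYS_BLOCKED
    ]
    has_ordinary_word = any(w not in BLOCKED_WHEN_ALONE for w in words)
    return words if has_ordinary_word else []


print(filter_query("kde compile error"))  # ['kde', 'compile', 'error']
print(filter_query("compile error"))      # []
print(filter_query("the screen"))         # []
```

A query made up only of second-tier words is still rejected, so the very broad searches stay blocked while combined queries keep their extra precision.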
|
|
|
rhill Retired Dev
Joined: 22 Oct 2004 Posts: 1629 Location: sk.ca
|
Posted: Sat May 21, 2005 6:10 am Post subject: |
|
|
killfire wrote: | is it possible to say, not index those words in the actual posts (but take them from the post subject), and get rid of the stopwords altogether? |
Quote: | because just stripping them seems like it will lower the accuracy of the posts a lot... |
and only searching thread subjects won't? you don't search for titles, you search for content.
Quote: | or block them when alone, but for some, use it when other (valid) words are combined? |
tomk wrote: | That doesn't make any difference with the forums search, as it only knows about individual words, not groups of words. |
_________________ by design, by neglect
for a fact or just for effect |
|
|
|
mallchin l33t
Joined: 21 Jan 2003 Posts: 655 Location: United Kingdom
|
Posted: Sun May 22, 2005 12:24 pm Post subject: |
|
|
I agree with cokehabit, a heavy forum needs an advanced search widget to help sift through the crap and find what you want. Encapsulating whole strings (as google can) would soon make the word shitlist irrelevant, and I have always felt this is a missing feature from phpBB.
I understand we might not make such changes, and this may be the only acceptable solution for now, but a better solution does exist. _________________ 6700 @ 2.66GHz, 4Gb RAM, 2 x 500Gb, 8800 GTX, PhysX, X-Fi, 24" Widescreen, Tux mascot |
|
|
|
My_World Guru
Joined: 01 Sep 2003 Posts: 339 Location: Kalahari Desert
|
Posted: Sun May 22, 2005 3:19 pm Post subject: |
|
|
Just a few questions and suggestions I would like to voice....
Having some of those words removed from searches WILL lead to more double postings, and in the end, very frustrated users who cannot find the solution they want to a "serious" problem. We cater not only for the home user, but for companies also, and sysadmins having to go through pages upon pages of worthless search results will land Gentoo in a bit of hot water at the end of the day. I have also been there this week, trying to find a cure for why GLX was suddenly broken, and all the search strings I entered landed me less than helpful results. It was only by chance that I found the topic that helped me while browsing the forums a bit. I have resorted, in many cases now, to using Google to find an answer rather than the forums search.
For example, look at when I joined and at the number of posts I have made till now. Very few, and most of them were made this year because the search function did not return usable results anymore. Get my drift?
I think "strings" are important, like searching the error string output. That will lead you exactly to the right place almost every time.
That said, I know that the servers are taking an enormous load, but isn't there a way to make the searching a bit more effective? Limiting the results to say 3 months (with maybe an advanced option for 6, 9, 12 months), upgrading the board maybe to better software, rallying for donations to buy a better server?
We once bought, from donations, an Opteron server for a Bit-Torrent tracker here in South Africa with only 300 members! US-$10 might not sound much, but add a few thousand users x $10 and you have yourself a new server. There must be a better way to either filter the stopwords, better the search engine software or something?
_________________ "Ubuntu" - an African word meaning "Gentoo is too hard for me". |
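The suggestion of limiting results to a recent window could be sketched like this. It is a hypothetical illustration, not an existing board option: `recent_posts`, `ALLOWED_MONTHS` and the post records are made up, and a month is approximated as 30 days.

```python
from datetime import datetime, timedelta

# Default window plus the "advanced" options suggested in the post.
ALLOWED_MONTHS = (3, 6, 9, 12)


def recent_posts(posts, now, months=3):
    """Keep only posts made within the last `months` months.

    A month is approximated as 30 days (an assumption of this sketch).
    """
    if months not in ALLOWED_MONTHS:
        raise ValueError(f"months must be one of {ALLOWED_MONTHS}")
    cutoff = now - timedelta(days=30 * months)
    return [p for p in posts if p["posted"] >= cutoff]


posts = [
    {"id": 1, "posted": datetime(2005, 4, 14)},
    {"id": 2, "posted": datetime(2004, 11, 17)},
]
now = datetime(2005, 5, 22)
print([p["id"] for p in recent_posts(posts, now)])             # [1]
print([p["id"] for p in recent_posts(posts, now, months=12)])  # [1, 2]
```

The point of such a window would be to shrink the set of posts each search has to touch; whether that helps depends on how the board stores and indexes posts, which this thread does not describe.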
|
|
|
ian! Bodhisattva
Joined: 25 Feb 2003 Posts: 3829 Location: Essen, Germany
|
Posted: Sun May 22, 2005 7:36 pm Post subject: |
|
|
Hardware isn't the problem. (Webserver: dual Xeon 3 GHz, 2 GB RAM, RAID5 with 10k rpm 120 GB hard disks; DB: an even faster machine)
It's a software problem. _________________ "To have a successful open source project, you need to be at least somewhat successful at getting along with people." -- Daniel Robbins |
|
|
|
mcspiff Tux's lil' helper
Joined: 24 Oct 2004 Posts: 109
|
Posted: Sun May 22, 2005 8:01 pm Post subject: |
|
|
My_World wrote: | Just a few questions and suggestions I would like to voice....
Having some of those words removed from searches WILL lead to more double postings, and in the end, very frustrated users who cannot find the solution they want to a "serious" problem. We cater not only for the home user, but for companies also, and sysadmins having to go through pages upon pages of worthless search results will land Gentoo in a bit of hot water at the end of the day. I have also been there this week, trying to find a cure for why GLX was suddenly broken, and all the search strings I entered landed me less than helpful results. It was only by chance that I found the topic that helped me while browsing the forums a bit. I have resorted, in many cases now, to using Google to find an answer rather than the forums search.
For example, look at when I joined and at the number of posts I have made till now. Very few, and most of them were made this year because the search function did not return usable results anymore. Get my drift?
I think "strings" are important, like searching the error string output. That will lead you exactly to the right place almost every time.
That said, I know that the servers are taking an enormous load, but isn't there a way to make the searching a bit more effective? Limiting the results to say 3 months (with maybe an advanced option for 6, 9, 12 months), upgrading the board maybe to better software, rallying for donations to buy a better server?
We once bought, from donations, an Opteron server for a Bit-Torrent tracker here in South Africa with only 300 members! US-$10 might not sound much, but add a few thousand users x $10 and you have yourself a new server. There must be a better way to either filter the stopwords, better the search engine software or something?
|
Wow, just wow. If your business relies on the gentoo forums for tech. support, maybe they should fire the tech guys and use the saved cash on a redhat contract. |
|
|
|
My_World Guru
Joined: 01 Sep 2003 Posts: 339 Location: Kalahari Desert
|
Posted: Sun May 22, 2005 8:44 pm Post subject: |
|
|
mcspiff wrote: |
Wow, just wow. If your business relies on the gentoo forums for tech. support, maybe they should fire the tech guys and use the saved cash on a redhat contract. |
Not everybody knows all there is to know of Gentoo Linux...
_________________ "Ubuntu" - an African word meaning "Gentoo is too hard for me". |
|
|
|
tecknojunky Veteran
Joined: 19 Oct 2002 Posts: 1937 Location: Montréal
|
Posted: Tue May 24, 2005 9:37 am Post subject: |
|
|
Well, I ended up here because I could not get relevant search results anymore and I wanted to know if there was a problem with the search. I guess I've found the answer.
My opinion: bad idea. Like many said, it's the combination of words that matters. This solution will not hold up in the long term, I'm afraid. Like one poster said, more irrelevant search results will lead to more posting, which will lead to more words, which will lead to a bigger database anyway, and back to square one.
The cause is quantity and the problem is twofold, depending on how you see things:
1- More data requires a better search algorithm. I wonder how the db is set up. Maybe the problem is right there. Ever thought of trying Oracle instead of mysql? How are the indexes made?
2- Do you really need to index all the posts made since April 2002? I know it's been mentioned before, and other big forums do it: after some time, the threads should be made static. It has to be. You can't keep accumulating and preserving live threads and posts ad vitam aeternam. Makes no sense (for proof).
So either find (make) a good search algo, or diminish the amount of data. Don't pretend that removing words is a solution, because any of these words in combination with other words is relevant. With time, you'll block all the words from the dictionary and then we'll be able to say "wow man, searches are blazing fast now!" _________________ (7 of 9) Installing star-trek/species-8.4.7.2::talax. |
|
|
|
chrib Guru
Joined: 27 Sep 2003 Posts: 558 Location: Berlin, Germany
|
Posted: Tue May 24, 2005 2:18 pm Post subject: |
|
|
tecknojunky wrote: |
1- More data requires a better search algorithm. I wonder how the db is set up. Maybe the problem is right there. Ever thought of trying Oracle instead of mysql? |
Oracle is rather expensive and I don't think that the Gentoo Project is willing to spend their money on a license for an Oracle Database-Server.
YMMV _________________ Der Mensch kämpft um zu überleben, und nicht, um zu Grunde zu gehen. - Paulo Coelho
It is the end of all hope. To lose the child, the faith. To end all the innocence. To be someone like me. - Nightwish - End of all hope |
|
|
|
tecknojunky Veteran
Joined: 19 Oct 2002 Posts: 1937 Location: Montréal
|
Posted: Tue May 24, 2005 7:33 pm Post subject: |
|
|
chrib wrote: | Oracle is rather expensive... |
Oh. I was under the impression that it was free (beer) if used by a not-for-profit. _________________ (7 of 9) Installing star-trek/species-8.4.7.2::talax. |
|
|
|
mallchin l33t
Joined: 21 Jan 2003 Posts: 655 Location: United Kingdom
|
Posted: Tue May 24, 2005 9:05 pm Post subject: |
|
|
tecknojunky wrote: | chrib wrote: | Oracle is rather expensive... |
Oh. I was under the impression that it was free (beer) if used by a not-for-profit. |
I'd be surprised if it was. _________________ 6700 @ 2.66GHz, 4Gb RAM, 2 x 500Gb, 8800 GTX, PhysX, X-Fi, 24" Widescreen, Tux mascot |
|
|
|
Omega21 l33t
Joined: 14 Feb 2004 Posts: 788 Location: Canada (brrr. Its cold up here)
|
Posted: Thu May 26, 2005 3:33 am Post subject: |
|
|
I'm surprised not to see much profanity in there... _________________ iMac G4 1GHz :: q6600 //2x 500GB//2GB RAM//8600GT//Gentoo :: MacBook Pro//2.53GHz |
|
|
|
Given M. Sur l33t
Joined: 03 Feb 2004 Posts: 648 Location: No such file or directory
|
Posted: Fri May 27, 2005 7:53 am Post subject: |
|
|
pjp wrote: | For example, "the" and "gentoo" are not likely to be helpful search words. |
Yeah? Try finding the Gentoo Desktops for May 2005 thread. Luckily Earthwings merged my mistakenly created dupe.
I have to agree with some of the others here that have said that these stopwords suck. Obviously the desktop thread isn't a great example since it's not a support thread (and can be found by skimming through the pages in Desktop Environments), but I have had problems trying to find support issues too.
And I don't understand something. If it's not a hardware problem (as ian! mentioned) what exactly is wrong with the software? I, like others, would rather have a slow relevant search than a quick irrelevant one.
Anyways, I just wanted to voice my objections. Thanks for reading. _________________ What is the best [insert-type-of-program-here]? |
|
|
|
amne Bodhisattva
Joined: 17 Nov 2002 Posts: 6378 Location: Graz / EU
|
Posted: Fri May 27, 2005 9:08 am Post subject: |
|
|
Given M. Sur wrote: |
Anyways, I just wanted to voice my objections. Thanks for reading. |
We are of course aware that the stopwords list isn't the perfect solution and has some limitations. However, we think that the positive effects outweigh the negative ones. _________________ Dinosaur week! (Ok, this thread is so last week) |
|
|
|
|
|
|
|