Gentoo Forums :: View topic - DSPAM + classification group

DSPAM + classification group

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Fri May 12, 2006 3:18 pm Post subject: DSPAM + classification group

I've got a user called global that has quite a good spam/ham corpus. I would like to test the corpus, but I don't want to apply the group globally to all users (2000+). A number of users would like to test the corpus, but I don't know how to configure DSPAM to use the global user for only selected users, i.e. user1, user2, user3. Is there a way?

_________________
There is, a not-born, a not-become, a not-made, a not-compounded. If that unborn, not-become, not-made, not-compounded were not, there would be no escape from this here that is born, become, made and compounded. - Gautama Siddharta

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Fri May 12, 2006 3:26 pm Post subject:

Hallo petrjanda

You could create a merged group in your DSPAM home directory containing this:

Code:

global:merged:user1,user2,user3

On a normal DSPAM installation this should be /var/spool/dspam/data/group. I don't know if it is really there when you install with the Gentoo ebuild, since the ebuild is a mess (this is MY viewpoint).

To see where you need to create this group file, you can execute the following command:

Code:

dspam --version|sed -n "s:^.*\-\-with\-dspam\-home=\([^ ]*\).*:\1/data/group:gIp"

The merged group allows the "global" data to be merged in real time for user1, user2 and user3. If you also want user1, user2 and user3 to share a single quarantine and other data, then consider using managed groups.
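As a minimal sketch of the step described above, the snippet below writes the merged-group line into a group file and lists the members it names. A temporary directory stands in for the real DSPAM home (an assumption, since the actual location depends on how DSPAM was built), and the variable names dspam_home, group_file and members are made up for this example.

```shell
#!/bin/sh
# Temporary directory standing in for the DSPAM home (assumption for this sketch;
# on a real system use the path reported by the dspam --version command above).
dspam_home="$(mktemp -d)"
mkdir -p "${dspam_home}/data"
group_file="${dspam_home}/data/group"

# One line per group: <global user>:merged:<comma-separated members>
printf '%s\n' 'global:merged:user1,user2,user3' > "${group_file}"

# List the members of every merged group, one per line, as a sanity check.
members="$(awk -F: '$2 == "merged" { print $3 }' "${group_file}" | tr ',' '\n')"
printf '%s\n' "${members}"
```

Users not named in the member list are unaffected, which is what limits the test to the selected accounts.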

cheers

SteveB

BTW: Brno? Nice! The parents of my ex girlfriend are from Brno.

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Fri May 12, 2006 4:43 pm Post subject:

steveb wrote:

Hallo petrjanda

You could create a merged group in your DSPAM home directory containing this:

Code:

global:merged:user1,user2,user3

Code:

dspam --version|sed -n "s:^.*\-\-with\-dspam\-home=\([^ ]*\).*:\1/data/group:gIp"

Thanks, I'll give that a whirl. What if the testing users receive misclassified innocent/spam messages? My setup uses username@spam.xxxx and username@ham.xxx to train on error. If user1 replies to user1@ham.xxxx to report a misclassified innocent message, will that also train the global user's corpus? I'm using DSPAM 3.4.9, as I found that 3.6.x has rather terrible accuracy. If not, what is the best way to update the global user's tokens in the future to reflect changes/innovations in spam?

Also, another question concerning the global classification group. I find that if mail is incorrectly classified by the global user, and user1 tries to train his own tokens by replying to user1@ham.xxx or user1@spam.xxx respectively, the global classification group is used even when user1 is actually trying to train that the email was incorrectly classified. Looking at the DSPAM stats, there's a +1 on innocent misclassified, but the next time the same email is received DSPAM still misclassifies it based on the global group, even though user1 apparently trained his tokens for this kind of message. Is this correct behaviour?

What nationality are you?


steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Fri May 12, 2006 10:35 pm Post subject:

petrjanda wrote:

NO! It will not train the data for your global user at all. It will only train the data for user1.

petrjanda wrote:

I'm using DSPAM 3.4.9, as I found that 3.6.x has rather terrible accuracy.

Funny! I use DSPAM 3.6.5. Anyway... If you find DSPAM 3.6.x to have terrible accuracy, then you don't understand what DSPAM is. DSPAM is not a normal anti-spam filter. DSPAM is a lot more than that. It includes several algorithms for classifying ham/spam mail, whitelisting, a web interface, etc. All of that is DSPAM. So telling me that DSPAM 3.4.9 has better accuracy than DSPAM 3.6.5 is strange to me. It would be more useful to tell me that using algorithm "burton" with pvalue "graham" gave better results in DSPAM 3.4.9 than in DSPAM 3.6.5.

Anyway... you are in luck. Today I am about to publish new data on the DSPAM mailing list about my progress in training. My old stats are here.

My current training, with the same data as at the above-mentioned URL, has the following accuracy:

Code:

nautilus / # dspam_stats -H globaluser
globaluser:
TP True Positives: 1769983
TN True Negatives: 2951556
FP False Positives: 8185
FN False Negatives: 403
SC Spam Corpusfed: 447
NC Nonspam Corpusfed: 3605
TL Training Left: 0
SHR Spam Hit Rate 99.98%
HSR Ham Strike Rate: 0.28%
OCA Overall Accuracy: 99.82%

nautilus / #
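
For anyone wondering how the three percentages relate to the raw counters: the figures shown are reproduced by the usual ratios, SHR = TP / (TP + FN), HSR = FP / (FP + TN) and OCA = (TP + TN) / all messages. That dspam_stats computes them in exactly this way is an assumption; the sketch below only checks that the arithmetic matches the output above.

```shell
#!/bin/sh
# Counters copied from the dspam_stats output above.
tp=1769983; tn=2951556; fp=8185; fn=403

rates="$(awk -v tp="$tp" -v tn="$tn" -v fp="$fp" -v fn="$fn" 'BEGIN {
    printf "SHR %.2f%%\n", 100 * tp / (tp + fn)                   # share of spam caught
    printf "HSR %.2f%%\n", 100 * fp / (fp + tn)                   # share of ham wrongly flagged
    printf "OCA %.2f%%\n", 100 * (tp + tn) / (tp + tn + fp + fn)  # share of all mail classified correctly
}')"
printf '%s\n' "${rates}"
```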

Don't get confused about the high TP and TN. I don't have that much data! I only have:

Code:

nautilus / # for foo in /var/spam/vunet.training.maildir.00?/set??/nonspam/*;do echo 1;done|wc -l
285176
nautilus / # for foo in /var/spam/vunet.training.maildir.00?/set??/spam/*;do echo 1;done|wc -l
185529
nautilus / #

This is 285'176 ham and 185'529 spam messages.

But this time I trained differently than before (it's too much to explain what I did, but the term for that method of training is "Training to Exhaustion"), and I am still not finished. However, running all the spam data from spamarchive.org through my current DSPAM installation produces this:

Code:

nautilus submit # _total_spam_count=0 ;_total_false_negative_count=0 ; for foo in *.r2.gz ; do _total_spam="$(mboxgrep --basic-regexp . --mailbox-format=zmbox --count ${foo})" ; let _total_spam_count=$((_total_spam_count+_total_spam)) ; _total_false_negative=$(mboxgrep --basic-regexp . --mailbox-format=zmbox --pipe="dspam --user globaluser --classify --deliver=summary 2>/dev/null" ${foo} | grep -v "result=\"Spam" | wc -l) ; let _total_false_negative_count=$((_total_false_negative_count+_total_false_negative)) ; echo "${foo} : Total SPAM: ${_total_spam} - Total FN: ${_total_false_negative}" ; done ; echo ; echo "Total SPAM: ${_total_spam_count} - Total FN: ${_total_false_negative_count}" ; echo "Accurancy: "$(echo "scale=20 ; 100 - ( $_total_false_negative_count * 100 / $_total_spam_count )" | bc)"%"
800.r2.gz : Total SPAM: 1815 - Total FN: 0
801.r2.gz : Total SPAM: 228 - Total FN: 0
802.r2.gz : Total SPAM: 144 - Total FN: 0
803.r2.gz : Total SPAM: 2391 - Total FN: 0
804.r2.gz : Total SPAM: 74 - Total FN: 0
805.r2.gz : Total SPAM: 1537 - Total FN: 0
806.r2.gz : Total SPAM: 553 - Total FN: 1
807.r2.gz : Total SPAM: 137 - Total FN: 0
808.r2.gz : Total SPAM: 234 - Total FN: 0
809.r2.gz : Total SPAM: 2098 - Total FN: 0
810.r2.gz : Total SPAM: 218 - Total FN: 0
811.r2.gz : Total SPAM: 124 - Total FN: 0
812.r2.gz : Total SPAM: 1701 - Total FN: 0
813.r2.gz : Total SPAM: 94 - Total FN: 0
814.r2.gz : Total SPAM: 1715 - Total FN: 0
815.r2.gz : Total SPAM: 101 - Total FN: 0
816.r2.gz : Total SPAM: 378 - Total FN: 0
817.r2.gz : Total SPAM: 150 - Total FN: 0
818.r2.gz : Total SPAM: 122 - Total FN: 0
819.r2.gz : Total SPAM: 183 - Total FN: 0
820.r2.gz : Total SPAM: 145 - Total FN: 0
821.r2.gz : Total SPAM: 3157 - Total FN: 0
822.r2.gz : Total SPAM: 94 - Total FN: 0
823.r2.gz : Total SPAM: 77 - Total FN: 0
824.r2.gz : Total SPAM: 738 - Total FN: 0
825.r2.gz : Total SPAM: 182 - Total FN: 0
826.r2.gz : Total SPAM: 249 - Total FN: 0
827.r2.gz : Total SPAM: 98 - Total FN: 0
828.r2.gz : Total SPAM: 169 - Total FN: 0
829.r2.gz : Total SPAM: 169 - Total FN: 0
830.r2.gz : Total SPAM: 115 - Total FN: 0
831.r2.gz : Total SPAM: 127 - Total FN: 0
832.r2.gz : Total SPAM: 188 - Total FN: 0
833.r2.gz : Total SPAM: 928 - Total FN: 0
834.r2.gz : Total SPAM: 134 - Total FN: 0
835.r2.gz : Total SPAM: 137 - Total FN: 0
836.r2.gz : Total SPAM: 113 - Total FN: 0
837.r2.gz : Total SPAM: 99 - Total FN: 0
838.r2.gz : Total SPAM: 114 - Total FN: 0
839.r2.gz : Total SPAM: 511 - Total FN: 0
840.r2.gz : Total SPAM: 302 - Total FN: 2
841.r2.gz : Total SPAM: 13623 - Total FN: 0
842.r2.gz : Total SPAM: 9209 - Total FN: 0
843.r2.gz : Total SPAM: 305 - Total FN: 0
844.r2.gz : Total SPAM: 155 - Total FN: 0
845.r2.gz : Total SPAM: 20924 - Total FN: 0
846.r2.gz : Total SPAM: 2316 - Total FN: 0
847.r2.gz : Total SPAM: 213 - Total FN: 0
848.r2.gz : Total SPAM: 1250 - Total FN: 0
849.r2.gz : Total SPAM: 146 - Total FN: 0
850.r2.gz : Total SPAM: 103 - Total FN: 0
851.r2.gz : Total SPAM: 87 - Total FN: 0
852.r2.gz : Total SPAM: 7199 - Total FN: 0
853.r2.gz : Total SPAM: 1593 - Total FN: 0
854.r2.gz : Total SPAM: 116 - Total FN: 0
855.r2.gz : Total SPAM: 2074 - Total FN: 0
856.r2.gz : Total SPAM: 1884 - Total FN: 0
857.r2.gz : Total SPAM: 1734 - Total FN: 0
858.r2.gz : Total SPAM: 1372 - Total FN: 0
859.r2.gz : Total SPAM: 1728 - Total FN: 0
860.r2.gz : Total SPAM: 854 - Total FN: 0
861.r2.gz : Total SPAM: 2023 - Total FN: 0
862.r2.gz : Total SPAM: 1130 - Total FN: 0
863.r2.gz : Total SPAM: 125 - Total FN: 0
864.r2.gz : Total SPAM: 1943 - Total FN: 0
865.r2.gz : Total SPAM: 118 - Total FN: 0
866.r2.gz : Total SPAM: 3481 - Total FN: 0
867.r2.gz : Total SPAM: 483 - Total FN: 0
868.r2.gz : Total SPAM: 94 - Total FN: 0
869.r2.gz : Total SPAM: 709 - Total FN: 0
870.r2.gz : Total SPAM: 48 - Total FN: 0
871.r2.gz : Total SPAM: 32 - Total FN: 0
872.r2.gz : Total SPAM: 52 - Total FN: 0
873.r2.gz : Total SPAM: 332 - Total FN: 0
874.r2.gz : Total SPAM: 215 - Total FN: 0
875.r2.gz : Total SPAM: 176 - Total FN: 0
876.r2.gz : Total SPAM: 460 - Total FN: 0
877.r2.gz : Total SPAM: 196 - Total FN: 0
878.r2.gz : Total SPAM: 476 - Total FN: 0
879.r2.gz : Total SPAM: 102 - Total FN: 0
880.r2.gz : Total SPAM: 702 - Total FN: 0
881.r2.gz : Total SPAM: 3650 - Total FN: 0
882.r2.gz : Total SPAM: 6933 - Total FN: 0
883.r2.gz : Total SPAM: 118 - Total FN: 0
884.r2.gz : Total SPAM: 76 - Total FN: 0
886.r2.gz : Total SPAM: 3408 - Total FN: 0
887.r2.gz : Total SPAM: 75 - Total FN: 0
888.r2.gz : Total SPAM: 966 - Total FN: 0
889.r2.gz : Total SPAM: 115 - Total FN: 0
890.r2.gz : Total SPAM: 102 - Total FN: 0
891.r2.gz : Total SPAM: 180 - Total FN: 0
892.r2.gz : Total SPAM: 85 - Total FN: 0
893.r2.gz : Total SPAM: 202 - Total FN: 0
894.r2.gz : Total SPAM: 9 - Total FN: 0
895.r2.gz : Total SPAM: 161 - Total FN: 0
896.r2.gz : Total SPAM: 61 - Total FN: 0
897.r2.gz : Total SPAM: 302 - Total FN: 0
898.r2.gz : Total SPAM: 196 - Total FN: 0
899.r2.gz : Total SPAM: 213 - Total FN: 0
900.r2.gz : Total SPAM: 358 - Total FN: 0
901.r2.gz : Total SPAM: 97 - Total FN: 0
902.r2.gz : Total SPAM: 70 - Total FN: 0
903.r2.gz : Total SPAM: 89 - Total FN: 0
904.r2.gz : Total SPAM: 403 - Total FN: 0
905.r2.gz : Total SPAM: 195 - Total FN: 0
906.r2.gz : Total SPAM: 261 - Total FN: 0
907.r2.gz : Total SPAM: 55 - Total FN: 0
908.r2.gz : Total SPAM: 78 - Total FN: 0
909.r2.gz : Total SPAM: 244 - Total FN: 0
910.r2.gz : Total SPAM: 170 - Total FN: 0
911.r2.gz : Total SPAM: 98 - Total FN: 0
912.r2.gz : Total SPAM: 233 - Total FN: 0
913.r2.gz : Total SPAM: 138 - Total FN: 0
914.r2.gz : Total SPAM: 63 - Total FN: 0
915.r2.gz : Total SPAM: 245 - Total FN: 0
916.r2.gz : Total SPAM: 30 - Total FN: 0
917.r2.gz : Total SPAM: 188 - Total FN: 0
918.r2.gz : Total SPAM: 52 - Total FN: 0
919.r2.gz : Total SPAM: 163 - Total FN: 0
920.r2.gz : Total SPAM: 359 - Total FN: 0
921.r2.gz : Total SPAM: 25 - Total FN: 0
922.r2.gz : Total SPAM: 4 - Total FN: 0
923.r2.gz : Total SPAM: 164 - Total FN: 0
924.r2.gz : Total SPAM: 37 - Total FN: 0
925.r2.gz : Total SPAM: 14 - Total FN: 0
926.r2.gz : Total SPAM: 166 - Total FN: 0
927.r2.gz : Total SPAM: 121 - Total FN: 0
928.r2.gz : Total SPAM: 330 - Total FN: 0
929.r2.gz : Total SPAM: 323 - Total FN: 0
930.r2.gz : Total SPAM: 308 - Total FN: 0
931.r2.gz : Total SPAM: 285 - Total FN: 0
932.r2.gz : Total SPAM: 41 - Total FN: 0
933.r2.gz : Total SPAM: 43 - Total FN: 0
934.r2.gz : Total SPAM: 252 - Total FN: 0
935.r2.gz : Total SPAM: 116 - Total FN: 0
936.r2.gz : Total SPAM: 42 - Total FN: 0
937.r2.gz : Total SPAM: 269 - Total FN: 0
938.r2.gz : Total SPAM: 292 - Total FN: 0
939.r2.gz : Total SPAM: 235 - Total FN: 0
940.r2.gz : Total SPAM: 141 - Total FN: 0
941.r2.gz : Total SPAM: 257 - Total FN: 0
942.r2.gz : Total SPAM: 262 - Total FN: 0
943.r2.gz : Total SPAM: 106 - Total FN: 0
944.r2.gz : Total SPAM: 152 - Total FN: 0
945.r2.gz : Total SPAM: 204 - Total FN: 0
946.r2.gz : Total SPAM: 109 - Total FN: 0
947.r2.gz : Total SPAM: 79 - Total FN: 0
948.r2.gz : Total SPAM: 1064 - Total FN: 0

Total SPAM: 127507 - Total FN: 3
Accurancy: 99.99764718799752170470%
nautilus submit # cd ../submitautomated/
nautilus submitautomated # _total_spam_count=0 ;_total_false_negative_count=0 ; for foo in *.r2.gz ; do _total_spam="$(mboxgrep --basic-regexp . --mailbox-format=zmbox --count ${foo})" ; let _total_spam_count=$((_total_spam_count+_total_spam)) ; _total_false_negative=$(mboxgrep --basic-regexp . --mailbox-format=zmbox --pipe="dspam --user globaluser --classify --deliver=summary 2>/dev/null" ${foo} | grep -v "result=\"Spam" | wc -l) ; let _total_false_negative_count=$((_total_false_negative_count+_total_false_negative)) ; echo "${foo} : Total SPAM: ${_total_spam} - Total FN: ${_total_false_negative}" ; done ; echo ; echo "Total SPAM: ${_total_spam_count} - Total FN: ${_total_false_negative_count}" ; echo "Accurancy: "$(echo "scale=20 ; 100 - ( $_total_false_negative_count * 100 / $_total_spam_count )" | bc)"%"
800.r2.gz : Total SPAM: 118 - Total FN: 0
801.r2.gz : Total SPAM: 145 - Total FN: 0
802.r2.gz : Total SPAM: 135 - Total FN: 0
803.r2.gz : Total SPAM: 156 - Total FN: 0
804.r2.gz : Total SPAM: 186 - Total FN: 0
805.r2.gz : Total SPAM: 133 - Total FN: 0
806.r2.gz : Total SPAM: 89 - Total FN: 0
807.r2.gz : Total SPAM: 88 - Total FN: 0
808.r2.gz : Total SPAM: 215 - Total FN: 0
809.r2.gz : Total SPAM: 176 - Total FN: 0
810.r2.gz : Total SPAM: 175 - Total FN: 0
811.r2.gz : Total SPAM: 122 - Total FN: 0
812.r2.gz : Total SPAM: 104 - Total FN: 0
813.r2.gz : Total SPAM: 128 - Total FN: 0
814.r2.gz : Total SPAM: 139 - Total FN: 0
815.r2.gz : Total SPAM: 80 - Total FN: 0
816.r2.gz : Total SPAM: 60 - Total FN: 0
817.r2.gz : Total SPAM: 80 - Total FN: 0
818.r2.gz : Total SPAM: 92 - Total FN: 0
819.r2.gz : Total SPAM: 112 - Total FN: 0
820.r2.gz : Total SPAM: 175 - Total FN: 0
821.r2.gz : Total SPAM: 214 - Total FN: 0
822.r2.gz : Total SPAM: 197 - Total FN: 0
823.r2.gz : Total SPAM: 117 - Total FN: 0
824.r2.gz : Total SPAM: 89 - Total FN: 0
825.r2.gz : Total SPAM: 125 - Total FN: 0
826.r2.gz : Total SPAM: 57 - Total FN: 0
827.r2.gz : Total SPAM: 60 - Total FN: 0
828.r2.gz : Total SPAM: 128 - Total FN: 0
829.r2.gz : Total SPAM: 165 - Total FN: 0
830.r2.gz : Total SPAM: 132 - Total FN: 0
831.r2.gz : Total SPAM: 171 - Total FN: 0
832.r2.gz : Total SPAM: 132 - Total FN: 0
833.r2.gz : Total SPAM: 131 - Total FN: 0
834.r2.gz : Total SPAM: 113 - Total FN: 0
835.r2.gz : Total SPAM: 158 - Total FN: 0
836.r2.gz : Total SPAM: 92 - Total FN: 0
837.r2.gz : Total SPAM: 81 - Total FN: 0
838.r2.gz : Total SPAM: 79 - Total FN: 0
839.r2.gz : Total SPAM: 99 - Total FN: 0
840.r2.gz : Total SPAM: 64 - Total FN: 0
841.r2.gz : Total SPAM: 113 - Total FN: 0
842.r2.gz : Total SPAM: 92 - Total FN: 0
843.r2.gz : Total SPAM: 42 - Total FN: 0
844.r2.gz : Total SPAM: 55 - Total FN: 0
845.r2.gz : Total SPAM: 65 - Total FN: 0
846.r2.gz : Total SPAM: 49 - Total FN: 0
847.r2.gz : Total SPAM: 42 - Total FN: 0
849.r2.gz : Total SPAM: 34 - Total FN: 0
850.r2.gz : Total SPAM: 69 - Total FN: 0
851.r2.gz : Total SPAM: 33 - Total FN: 0
852.r2.gz : Total SPAM: 27 - Total FN: 0
853.r2.gz : Total SPAM: 20 - Total FN: 0
854.r2.gz : Total SPAM: 45 - Total FN: 0
855.r2.gz : Total SPAM: 64 - Total FN: 0
856.r2.gz : Total SPAM: 40 - Total FN: 0
857.r2.gz : Total SPAM: 50 - Total FN: 0
858.r2.gz : Total SPAM: 63 - Total FN: 0
859.r2.gz : Total SPAM: 57 - Total FN: 0
860.r2.gz : Total SPAM: 58 - Total FN: 0
861.r2.gz : Total SPAM: 56 - Total FN: 0
862.r2.gz : Total SPAM: 77 - Total FN: 0
863.r2.gz : Total SPAM: 62 - Total FN: 0
864.r2.gz : Total SPAM: 67 - Total FN: 0
865.r2.gz : Total SPAM: 42 - Total FN: 0
866.r2.gz : Total SPAM: 77 - Total FN: 0
867.r2.gz : Total SPAM: 53 - Total FN: 0
868.r2.gz : Total SPAM: 81 - Total FN: 0
869.r2.gz : Total SPAM: 69 - Total FN: 0
870.r2.gz : Total SPAM: 85 - Total FN: 0
871.r2.gz : Total SPAM: 76 - Total FN: 0
872.r2.gz : Total SPAM: 173 - Total FN: 0
873.r2.gz : Total SPAM: 84 - Total FN: 0
874.r2.gz : Total SPAM: 82 - Total FN: 0
875.r2.gz : Total SPAM: 84 - Total FN: 0
876.r2.gz : Total SPAM: 66 - Total FN: 0
877.r2.gz : Total SPAM: 76 - Total FN: 0
878.r2.gz : Total SPAM: 52 - Total FN: 0
879.r2.gz : Total SPAM: 107 - Total FN: 0
880.r2.gz : Total SPAM: 120 - Total FN: 0
881.r2.gz : Total SPAM: 70 - Total FN: 0
882.r2.gz : Total SPAM: 30 - Total FN: 0
883.r2.gz : Total SPAM: 44 - Total FN: 0
884.r2.gz : Total SPAM: 52 - Total FN: 0
885.r2.gz : Total SPAM: 55 - Total FN: 0
886.r2.gz : Total SPAM: 64 - Total FN: 0
887.r2.gz : Total SPAM: 448 - Total FN: 0
888.r2.gz : Total SPAM: 23 - Total FN: 0
889.r2.gz : Total SPAM: 42 - Total FN: 0
890.r2.gz : Total SPAM: 5 - Total FN: 0
891.r2.gz : Total SPAM: 52 - Total FN: 0
892.r2.gz : Total SPAM: 21 - Total FN: 0
893.r2.gz : Total SPAM: 5 - Total FN: 0
894.r2.gz : Total SPAM: 13 - Total FN: 0
895.r2.gz : Total SPAM: 3 - Total FN: 0
896.r2.gz : Total SPAM: 3 - Total FN: 0
897.r2.gz : Total SPAM: 5 - Total FN: 0
898.r2.gz : Total SPAM: 17 - Total FN: 0
899.r2.gz : Total SPAM: 75 - Total FN: 0
900.r2.gz : Total SPAM: 25 - Total FN: 0
901.r2.gz : Total SPAM: 22 - Total FN: 0
902.r2.gz : Total SPAM: 21 - Total FN: 0
903.r2.gz : Total SPAM: 50 - Total FN: 0
904.r2.gz : Total SPAM: 69 - Total FN: 0
905.r2.gz : Total SPAM: 26 - Total FN: 0
906.r2.gz : Total SPAM: 175 - Total FN: 0
907.r2.gz : Total SPAM: 75 - Total FN: 0
908.r2.gz : Total SPAM: 48 - Total FN: 0
909.r2.gz : Total SPAM: 85 - Total FN: 0
910.r2.gz : Total SPAM: 58 - Total FN: 0
911.r2.gz : Total SPAM: 36 - Total FN: 0
912.r2.gz : Total SPAM: 52 - Total FN: 0
913.r2.gz : Total SPAM: 43 - Total FN: 0
914.r2.gz : Total SPAM: 80 - Total FN: 0
915.r2.gz : Total SPAM: 306 - Total FN: 0
916.r2.gz : Total SPAM: 23 - Total FN: 0
917.r2.gz : Total SPAM: 33 - Total FN: 0
918.r2.gz : Total SPAM: 64 - Total FN: 0
919.r2.gz : Total SPAM: 71 - Total FN: 0
920.r2.gz : Total SPAM: 19 - Total FN: 0
921.r2.gz : Total SPAM: 25 - Total FN: 0
922.r2.gz : Total SPAM: 23 - Total FN: 0
923.r2.gz : Total SPAM: 3 - Total FN: 0
924.r2.gz : Total SPAM: 4 - Total FN: 0
925.r2.gz : Total SPAM: 46 - Total FN: 0
926.r2.gz : Total SPAM: 44 - Total FN: 0
927.r2.gz : Total SPAM: 3 - Total FN: 0
928.r2.gz : Total SPAM: 23 - Total FN: 0
929.r2.gz : Total SPAM: 42 - Total FN: 0
930.r2.gz : Total SPAM: 6 - Total FN: 0
931.r2.gz : Total SPAM: 7 - Total FN: 0
932.r2.gz : Total SPAM: 140 - Total FN: 0
933.r2.gz : Total SPAM: 22 - Total FN: 0
934.r2.gz : Total SPAM: 33 - Total FN: 0
935.r2.gz : Total SPAM: 29 - Total FN: 0
936.r2.gz : Total SPAM: 20 - Total FN: 0
937.r2.gz : Total SPAM: 7 - Total FN: 0
938.r2.gz : Total SPAM: 5 - Total FN: 0
939.r2.gz : Total SPAM: 16 - Total FN: 0
940.r2.gz : Total SPAM: 54 - Total FN: 0
941.r2.gz : Total SPAM: 37 - Total FN: 0
942.r2.gz : Total SPAM: 24 - Total FN: 0
943.r2.gz : Total SPAM: 53 - Total FN: 0
944.r2.gz : Total SPAM: 18 - Total FN: 0
945.r2.gz : Total SPAM: 7 - Total FN: 0
946.r2.gz : Total SPAM: 82 - Total FN: 0
947.r2.gz : Total SPAM: 27 - Total FN: 0
948.r2.gz : Total SPAM: 13 - Total FN: 0
949.r2.gz : Total SPAM: 62 - Total FN: 0

Total SPAM: 11002 - Total FN: 0
Accurancy: 100.00000000000000000000%
nautilus submitautomated #
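
The last line of each run is simply 100 minus the false-negative percentage. A minimal sketch of that arithmetic, using the totals printed above; the function name fn_accuracy is made up for this example, and awk with four decimal places is used here instead of bc with scale=20:

```shell
#!/bin/sh
# fn_accuracy TOTAL_SPAM FALSE_NEGATIVES
# Prints the percentage of spam messages that were classified as spam.
fn_accuracy() {
    awk -v total="$1" -v missed="$2" 'BEGIN { printf "%.4f\n", 100 - missed * 100 / total }'
}

fn_accuracy 127507 3   # totals from the submit run
fn_accuracy 11002 0    # totals from the submitautomated run
```

Note that this figure only measures missed spam. It says nothing about false positives, because both archives contain spam only.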

As you can see, I get 99.99764718799752170470% on the submit set and 100% accuracy on the submitautomated set. And I don't use the data from spamarchive.org. I have my own data. But I use spamarchive.org as a reference to know how good my filter is.

I am sure that my DSPAM setup is special. For example, I have 169 IgnoreHeader entries in my dspam.conf and other stuff you probably don't have. As for the preferences used for my globaluser, they are:

Code:

nautilus / # dspam_admin list preference globaluser
enableBNR=on
enableWhitelist=off
fallbackDomain=
ignoreGroups=off
localStore=globaluser
makeCorpus=off
optIn=on
optOut=off
optOutClamAV=on
processorBias=off
showFactors=off
signatureLocation=header
spamAction=deliver
spamSubject=
statisticalSedation=0
storeFragments=off
trainingMode=TEFT
trainPristine=off
whitelistThreshold=9999999
nautilus / #

petrjanda wrote:

If not, what is the best way to update the global user's tokens in the future to reflect changes/innovations in spam?

You can merge the user tokens with those of your global user. Or you could use spam traps or blocking lists to inoculate data into DSPAM. There are many ways to keep the data for your global user up to date. If you need more info on that, let me know.

petrjanda wrote:

Also, another question is concerning global classification group.

Do you use classification or merged group?

petrjanda wrote:

I find that if mail is incorrectly classified by the global user, and user1 tries to train his own tokens by replying to user1@ham.xxx or user1@spam.xxx respectively, the global classification group is used even when user1 is actually trying to train that the email was incorrectly classified. Looking at the DSPAM stats, there's a +1 on innocent misclassified, but the next time the same email is received DSPAM still misclassifies it based on the global group, even though user1 apparently trained his tokens for this kind of message. Is this correct behaviour?

Yes. But only for all the other users except user1, because user1 has trained his data to include the correct tokens. If you don't like that, you could use a managed group (on top of the merged group), but then everyone will have the same corpus and the same quarantine. I personally find it very bad to do that, since spam for user1 could be ham for user2.

petrjanda wrote:

What nationality are you?

Swiss

cheers

SteveB

Last edited by steveb on Sun May 21, 2006 4:56 pm; edited 1 time in total

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sat May 13, 2006 2:29 am Post subject:

steveb wrote:

petrjanda wrote:

I'm using DSPAM 3.4.9, as I found that 3.6.x has rather terrible accuracy.

Could you send me your dspam.conf to janda.petr@gmail.com so I can have a look at your configuration and use it as a reference for mine? I would like to use 3.6, but because it gave me such bad results (in a semi-default DSPAM configuration), I had to downgrade.

Quote:

More info would be nice!

Quote:

Do you use classification or merged group?

Up until now I was using a classification group. I'm going to use a merged group now.

Quote:

So do you have an idea why user1 still couldn't filter it? I remember looking at dspam.debug: it basically started off OK, as if it was going to train for user1, but then the global classification group was added, it checked global's tokens, incorrectly classified the message, and never returned to train user1.

Thanks!

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sat May 13, 2006 9:22 am Post subject:

petrjanda wrote:

This is one of my dspam.conf files (slightly modified for you). Keep in mind that this is 3.6.5 installed with my own ebuild:

Code:

## $Id: dspam.conf.in,v 1.70 2006/02/15 18:19:40 jonz Exp $
## dspam.conf -- DSPAM configuration file
##

#
# DSPAM Home: Specifies the base directory to be used for DSPAM storage
#
Home /var/spool/dspam

#
# StorageDriver: Specifies the storage driver backend (library) to use.
# You'll only need to set this if you are using dynamic storage driver plugins.
# The default when one storage driver is specified is to statically link. Be
# sure to include the path to the library if necessary, and some systems may
# use an extension other than .so.
#
# Options include:
#
# libmysql_drv.so libpgsql_drv.so libsqlite_drv.so
# libsqlite3_drv.so libora_drv.so libdb4_drv.so
# libdb3_drv.so libhash_drv.so
#
# IMPORTANT: Switching storage drivers requires more than merely changing
# this option. If you do not wish to lose all of your data, you will need to
# migrate it to the new backend before making this change.
#
StorageDriver /usr/lib/libmysql_drv.so

#
# Trusted Delivery Agent: Specifies the local delivery agent DSPAM should call
# when delivering mail as a trusted user. Use %u to specify the user DSPAM is
# processing mail for. It is generally a good idea to allow the MTA to specify
# the pass-through arguments at run-time, but they may also be specified here.
#
# Most operating system defaults:
#TrustedDeliveryAgent "/usr/bin/procmail" # Linux
#TrustedDeliveryAgent "/usr/bin/mail" # Solaris
#TrustedDeliveryAgent "/usr/libexec/mail.local" # FreeBSD
#TrustedDeliveryAgent "/usr/bin/procmail" # Cygwin
#
# Other popular configurations:
#TrustedDeliveryAgent "/usr/cyrus/bin/deliver" # Cyrus
#TrustedDeliveryAgent "/bin/maildrop"    # Maildrop
#TrustedDeliveryAgent "/usr/local/sbin/exim -oMr spam-scanned" # Exim
#
#UntrustedDeliveryAgent "/usr/bin/procmail -d %u"
TrustedDeliveryAgent "/usr/sbin/sendmail"

#
# Untrusted Delivery Agent: Specifies the local delivery agent and arguments
# DSPAM should use when delivering mail and running in untrusted user mode.
# Because DSPAM will not allow pass-through arguments to be specified to
# untrusted users, all arguments should be specified here. Use %u to specify
# the user DSPAM is processing mail for. This configuration parameter is only
# necessary if you plan on allowing untrusted processing.
#
#UntrustedDeliveryAgent "/usr/bin/procmail -d %u"
UntrustedDeliveryAgent "/usr/sbin/sendmail"

#
# SMTP or LMTP Delivery: Alternatively, you may wish to use SMTP or LMTP
# delivery to deliver your message to the mail server. You will need to
# configure with --enable-daemon to use host delivery, however you do not need
# to operate in daemon mode. Specify an IP address or UNIX path to a domain
# socket below as a host.
#
# If you would like to set up DeliveryHost's on a per-domain basis, use
# the syntax: DeliveryHost.domain.com 1.2.3.4
#
#DeliveryHost 127.0.0.1
#DeliveryPort 24
#DeliveryIdent localhost
#DeliveryProto LMTP

#
# Quarantine Agent: DSPAM's default behavior is to quarantine all mail it
# thinks is spam. If you wish to override this behavior, you may specify
# a quarantine agent which will be called with all messages DSPAM thinks is
# spam. Use %u to specify the user DSPAM is processing mail for.
#
#QuarantineAgent "/usr/bin/procmail -d spam"

#
# DSPAM can optionally process "plused users" (addresses in the user+detail
# form) by truncating the username just before the "+", so all internal
# processing occurs for "user", but delivery will be performed for
# "user+detail". This is only useful if the LDA can handle "plused users"
# (for example Cyrus IMAP) and when configured for LMTP delivery above
#
# NOTE: Plused detail presently only works when usernames are provided and
# not fully qualified email address (@domain).
#
#EnablePlusedDetail on

#
# Quarantine Mailbox: DSPAM's LMTP code can send spam mail using LMTP to a
# "plused" mailbox (such as user+quarantine) leaving quarantine processing
# for retraining or deletion to be performed by the LDA and the mail client.
# "plused" mailboxes are supported by Cyrus IMAP and possibly other LDAs.
# The mailbox name must have the +
#
#QuarantineMailbox +quarantine

#
# OnFail: What to do if local delivery or quarantine should fail. If set
# to "unlearn", DSPAM will unlearn the message prior to exiting with an
# unsuccessful return code. The default option, "error" will not unlearn
# the message but return the appropriate error code. The unlearn option
# is useful on some systems where local delivery failures will cause the
# message to be requeued for delivery, and could result in the message
# being processed multiple times. During a very large failure, however,
# this could cause a significant load increase.
#
OnFail error

# Trusted Users: Only the users specified below will be allowed to perform
# administrative functions in DSPAM such as setting the active user and
# accessing tools. All other users attempting to run DSPAM will be restricted;
# their uids will be forced to match the active username and they will not be
# able to specify delivery agent privileges or use tools.
#
Trust root
Trust mail
Trust mailnull
Trust smmsp
Trust daemon
Trust nobody
Trust majordomo
Trust apache
Trust mailman
Trust postfix
Trust dspam

#
# Debugging: Enables debugging for some or all users. IMPORTANT: DSPAM must
# be compiled with debug support in order to use this option. DSPAM should
# never be running in production with debug active unless you are
# troubleshooting problems.
#
# DebugOpt: One or more of: process, classify, spam, fp, inoculation, corpus
# process standard message processing
# classify message classification using --classify
# spam error correction of missed spam
# fp error correction of false positives
# inoculation message inoculations (source=inoculation)
# corpus corpusfed messages (source=corpus)
#
#Debug *
#Debug bob bill
#
#DebugOpt process spam fp
#DebugOpt process classify spam fp inoculation corpus

#
# ClassAlias: Alias a particular class to spam/nonspam. This is useful if
# classifying things other than spam.
#ClassAliasSpam badstuff
#ClassAliasNonspam goodstuff

#
# Training Mode: The default training mode to use for all operations, when
# one has not been specified on the commandline or in the user's preferences.
# Acceptable values are: toe, tum, teft, notrain
#
TrainingMode toe

#
# TestConditionalTraining: By default, dspam will retrain certain errors
# until the condition is no longer met. This usually accelerates learning.
# Some people argue that this can increase the risk of errors, however.
#
TestConditionalTraining on

#
# Features: Specify features to activate by default; can also be specified
# on the commandline. See the documentation for a list of available features.
# If _any_ features are specified on the commandline, these are ignored.
#
# NOTE: For standard "CRM114" Markovian weighting, use sbph
#
#Feature sbph
Feature noise
Feature chained
Feature whitelist

# Training Buffer: The training buffer waters down statistics during training.
# It is designed to prevent false positives, but can also dramatically reduce
# dspam's catch rate during initial training. This can be a number from 0
# (no buffering) to 10 (maximum buffering). If you are paranoid about false
# positives, you should probably enable this option.
Feature tb=5

#
# Algorithms: Specify the statistical algorithms to use, overriding any
# defaults configured in the build. The options are:
# naive Naive-Bayesian (All Tokens)
# graham Graham-Bayesian ("A Plan for Spam")
# burton Burton-Bayesian (SpamProbe)
# robinson Robinson's Geometric Mean Test (Obsolete)
# chi-square Fisher-Robinson's Chi-Square Algorithm
#
# You may have multiple algorithms active simultaneously, but it is strongly
# recommended that you group Bayesian algorithms with other Bayesian
# algorithms, and any use of Chi-Square remain exclusive.
#
# NOTE: For standard "CRM114" Markovian weighting, use 'naive', or consider
# using 'burton' for slightly better accuracy
#
# Don't mess with this unless you know what you're doing
#
#Algorithm chi-square
#Algorithm naive
Algorithm burton graham naive
#Algorithm burton

#
# PValue: Specify the technique used for calculating PValues, overriding any
# defaults configured in the build. These options are:
#   graham      Graham's Technique ("A Plan for Spam")
#   robinson    Robinson's Technique
#   markov      Markovian Weighted Technique
#
# Unlike algorithms, you may only have one of these defined. Use of the
# chi-square algorithm automatically changes this to robinson.
#
# Don't mess with this unless you know what you're doing.
#
#PValue robinson
#PValue markov
PValue graham

#
# SupressWebStats: Enable this if you are not using the CGI, and don't want
# .stats files written.
#SupressWebStats on

#
# ImprobabilityDrive: Calculate odds-ratios for ham/spam, and add to
# X-DSPAM-Improbability headers
ImprobabilityDrive on

#
# Preferences: Specify any preferences to set by default, unless otherwise
# overridden by the user (see next section) or a default.prefs file.
# If user or default.prefs are found, the user's preferences will override any
# defaults.
#
Preference "trainingMode=TOE"    # TEFT, TUM, TOE
Preference "spamAction=tag"    # tag, quarantine, deliver
Preference "signatureLocation=message" # 'message' or 'headers'
Preference "spamSubject=[SPAM]"
Preference "statisticalSedation=5" # 0 to 9
Preference "enableBNR=on"    # on, off
Preference "showFactors=off"    # on, off
Preference "enableWhitelist=on"    # on, off
Preference "whitelistThreshold=10"

#
# Overrides: Specifies the user preferences which may override configuration
# and commandline defaults. Any other preferences supplied by an untrusted user
# will be ignored.
#
AllowOverride enableBNR
AllowOverride enableWhitelist
AllowOverride fallbackDomain
AllowOverride ignoreGroups
AllowOverride localStore
AllowOverride makeCorpus
AllowOverride optIn
AllowOverride optOut
AllowOverride optOutClamAV
AllowOverride processorBias
AllowOverride showFactors
AllowOverride signatureLocation
AllowOverride spamAction
AllowOverride spamSubject
AllowOverride statisticalSedation
AllowOverride storeFragments
AllowOverride trainPristine
AllowOverride trainingMode
AllowOverride whitelistThreshold

# --- MySQL ---

#
# Storage driver settings: Specific to a particular storage driver. Uncomment
# the configuration specific to your installation, if applicable.
#
MySQLServer    /var/run/mysqld/mysqld.sock
MySQLPort
MySQLUser       dspam
MySQLPass    XXXXXXXXXXXXXXXXXXX
MySQLDb    dspam
MySQLCompress    true

# If you are using replication for clustering, you can also specify a separate
# server to perform all writes to.
#
#MySQLWriteServer /var/run/mysqld/mysqld.sock
#MySQLWritePort
#MySQLWriteUser    dspam
#MySQLWritePass    changeme
#MySQLWriteDb    dspam_write
#MySQLCompress    true

# If your replication isn't close to real-time, your retraining might fail if
# the signature isn't found. One workaround for this is to use the write
# database for all signature reads:
#
#MySQLReadSignaturesFromWriteDb on

# Use this if you have the 4.1 quote bug (see doc/mysql.txt)
#MySQLSupressQuote on

# If you're running DSPAM in client/server (daemon) mode, uncomment the
# setting below to override the default connection cache size (the number
# of connections the server pools between all clients). The connection cache
# represents the maximum number of database connections *available* and should
# be set based on the maximum number of concurrent connections you're likely
# to have. Each connection may be used by only one thread at a time, so all
# other threads _will block_ until another connection becomes available.
#
#MySQLConnectionCache 10

# If you're using vpopmail or some other type of virtual setup and wish to
# change the table dspam uses to perform username/uid lookups, you can over-
# ride it below

#MySQLVirtualTable dspam_virtual_uids
#MySQLVirtualUIDField uid
#MySQLVirtualUsernameField username

# UIDInSignature: MySQL supports the insertion of the user id into the DSPAM
# signature. This allows you to create one single spam or fp alias
# (pointing to some arbitrary user), and the uid in the signature will
# switch to the correct user. Result: you need only one spam alias

MySQLUIDInSignature on

# --- PostgreSQL ---

#PgSQLServer 127.0.0.1
#PgSQLPort 5432
#PgSQLUser dspam
#PgSQLPass changeme
#PgSQLDb dspam

# If you're running DSPAM in client/server (daemon) mode, uncomment the
# setting below to override the default connection cache size (the number
# of connections the server pools between all clients).
#
#PgSQLConnectionCache 3

# UIDInSignature: PgSQL supports the insertion of the user id into the DSPAM
# signature. This allows you to create one single spam or fp alias
# (pointing to some arbitrary user), and the uid in the signature will
# switch to the correct user. Result: you need only one spam alias

#PgSQLUIDInSignature on

# If you're using vpopmail or some other type of virtual setup and wish to
# change the table dspam uses to perform username/uid lookups, you can over-
# ride it below

#PgSQLVirtualTable dspam_virtual_uids
#PgSQLVirtualUIDField uid
#PgSQLVirtualUsernameField username

# --- Oracle ---

#OraServer "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=127.0.0.1)(PORT=1521))(CONNECT_DATA=(SID=PROD)))"
#OraUser dspam
#OraPass changeme
#OraSchema dspam

# --- SQLite ---

#SQLitePragma "synchronous = OFF"

# --- Hash ---

# HashRecMax: Default number of records to create in the initial segment when
# building hash files. 100,000 yields files 1.6MB in size, but can fill up
# fast, so be sure to increase this (to a million or more) if you're not using
# autoextend.
#
# Primes List:
# 53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157, 98317, 196613,
# 393241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843, 50331653,
# 100663319, 201326611, 402653189, 805306457, 1610612741, 3221225473,
# 4294967291
#
HashRecMax    98317

# HashAutoExtend: Autoextend hash databases when they fill up. This allows
# them to continue to train by adding extents (extensions) to the file. There
# will be a small delay during the growth process, as everything needs to be
# closed and remapped.
#
HashAutoExtend    on

# HashMaxExtents: The maximum number of extents that may be created in a single
# hash file. Set this to zero for unlimited
#
HashMaxExtents    0

# HashExtentSize: The record size for newly created extents. Creating this too
# small could result in many extents being created. Creating this too large
# could result in excessive disk space usage.
#
HashExtentSize    49157

# HashMaxSeek: The maximum number of records to seek to insert a new record
# before failing or adding a new extent. Setting this too high will exhaustively
# scan each segment and kill performance. Typically, a low value is acceptable
# as even older extents will continue to fill over time.
#
HashMaxSeek    100

# HashConcurrentUser: If you are using a single, stateful hash database in
# daemon mode, specifying a concurrent user will cause the user to be
# permanently mapped into memory and shared via rwlocks.
#
#HashConcurrentUser user

# HashConnectionCache: If running in daemon mode, this is the max # of
# concurrent connections that will be supported. NOTE: If you are using
# HashConcurrentUser, this option is ignored, as all connections are read-
# write locked instead of mutex locked.
HashConnectionCache 10

# LDAP: Perform various LDAP functions depending on LDAPMode variable.
# Presently, the only mode supported is 'verify', which will verify the existence
# of an unknown user in LDAP prior to creating them as a new user in the system.
# This is useful on some systems acting as gateway machines.
#
#LDAPMode verify
#LDAPHost ldaphost.mydomain.com
#LDAPFilter "(mail=%u)"
#LDAPBase ou=people,dc=domain,dc=com

# Optionally, you can specify storage profiles, and specify the server to
# use on the commandline with --profile. For example:
#
Profile Nautilus
MySQLServer.Nautilus    /var/run/mysqld/mysqld.sock
MySQLPort.Nautilus    3306
MySQLUser.Nautilus    dspam
MySQLPass.Nautilus    XXXXXXXXXXXXXXXXXXX
MySQLDb.Nautilus    dspam
MySQLCompress.Nautilus    true
MySQLUIDInSignature.Nautilus on
#
#Profile DECAlpha
#MySQLServer.DECAlpha 10.0.0.1
#MySQLPort.DECAlpha 3306
#MySQLUser.DECAlpha dspam
#MySQLPass.DECAlpha changeme
#MySQLDb.DECAlpha dspam
#MySQLCompress.DECAlpha true
#
#Profile Sun420R
#MySQLServer.Sun420R 10.0.0.2
#MySQLPort.Sun420R 3306
#MySQLUser.Sun420R dspam
#MySQLPass.Sun420R changeme
#MySQLDb.Sun420R dspam
#MySQLCompress.Sun420R false
#
DefaultProfile Nautilus

#
# If you're using storage profiles, you can set failovers for each profile.
# Of course, if you'll be failing over to another database, that database
# must have the same information as the first. If you're using a global
# database with no training, this should be relatively simple. If you're
# configuring per-user data, however, you'll need to set up some type of
# replication between databases.
#
#Failover.DECAlpha Sun420R
#Failover.Sun420R DECAlpha

# If the storage fails, the agent will follow each profile's failover up to
# a maximum number of failover attempts. This should be set to a maximum of
# the number of profiles you have, otherwise the agent could loop and try
# the same profile multiple times (unless this is your desired behavior).
#
#FailoverAttempts 1

#
# Ignored headers: If DSPAM is behind other tools which may add a header to
# incoming emails, it may be beneficial to ignore these headers - especially
# if they are coming from another spam filter. If you are _not_ using one of
# these tools, however, leaving the appropriate headers commented out will
# allow DSPAM to use them as telltale signs of forged email.
#
IgnoreHeader X--MailScanner-SpamCheck
IgnoreHeader X-Admission-MailScanner-SpamCheck
IgnoreHeader X-Admission-MailScanner-SpamScore
IgnoreHeader X-Amavis-Alert
IgnoreHeader X-Antispam
IgnoreHeader X-AntiVirus
IgnoreHeader X-Antivirus-Scanner
IgnoreHeader X-Antivirus-Status
IgnoreHeader X-Assp-Spam-Prob
IgnoreHeader X-AV-Scanned
IgnoreHeader X-AVAS-Spam-Level
IgnoreHeader X-AVAS-Spam-Score
IgnoreHeader X-AVAS-Spam-Status
IgnoreHeader X-AVAS-Spam-Symbols
IgnoreHeader X-AVAS-Virus-Status
IgnoreHeader X-AVK-Virus-Check
IgnoreHeader X-Barracuda-Spam-Flag
IgnoreHeader X-Barracuda-Spam-Report
IgnoreHeader X-Barracuda-Spam-Score
IgnoreHeader X-Barracuda-Spam-Status
IgnoreHeader X-BTI-AntiSpam
IgnoreHeader X-Bogosity
IgnoreHeader X-ClamAntiVirus-Scanner
IgnoreHeader X-CRM114-Status
IgnoreHeader X-Despammed-Tracer
IgnoreHeader X-ELTE-SpamCheck
IgnoreHeader X-ELTE-SpamCheck-Details
IgnoreHeader X-ELTE-SpamScore
IgnoreHeader X-ELTE-SpamVersion
IgnoreHeader X-ELTE-VirusStatus
IgnoreHeader X-GMX-Antispam
IgnoreHeader X-GMX-Antivirus
IgnoreHeader X-Greylist
IgnoreHeader X-GWSPAM
IgnoreHeader X-HTMLM
IgnoreHeader X-HTMLM-Info
IgnoreHeader X-HTMLM-Score
IgnoreHeader X-iHateSpam-Checked
IgnoreHeader X-iHateSpam-Quarantined
IgnoreHeader X-IMAIL-SPAM-STATISTICS
IgnoreHeader X-IMAIL-SPAM-URL-DBL
IgnoreHeader X-IMAIL-SPAM-VALFROM
IgnoreHeader X-IMAIL-SPAM-VALHELO
IgnoreHeader X-IMAIL-SPAM-VALREVDNS
IgnoreHeader X-IronPort-Anti-Spam-Filtered
IgnoreHeader X-IronPort-Anti-Spam-Result
IgnoreHeader X-Kaspersky-Antivirus
IgnoreHeader X-KSV-Antispam
IgnoreHeader X-Mailer
IgnoreHeader X-MailScanner
IgnoreHeader X-MailScanner-Information
IgnoreHeader X-MailScanner-SpamCheck
IgnoreHeader X-MDaemon-Deliver-To
IgnoreHeader X-MDAV-Processed
IgnoreHeader X-MDRemoteIP
IgnoreHeader X-MIE-MailScanner-SpamCheck
IgnoreHeader X-MIMEOLE
IgnoreHeader X-Mlf-Spam-Status
IgnoreHeader X-MSMail-Priority
IgnoreHeader X-NAI-Spam-Checker-Version
IgnoreHeader X-NAI-Spam-Flag
IgnoreHeader X-NAI-Spam-Level
IgnoreHeader X-NAI-Spam-Route
IgnoreHeader X-NAI-Spam-Rules
IgnoreHeader X-NAI-Spam-Score
IgnoreHeader X-NAI-Spam-Threshold
IgnoreHeader X-NetcoreISpam1-ECMScanner
IgnoreHeader X-NetcoreISpam1-ECMScanner-From
IgnoreHeader X-NetcoreISpam1-ECMScanner-Information
IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamCheck
IgnoreHeader X-NetcoreISpam1-ECMScanner-SpamScore
IgnoreHeader X-NEWT-spamscore
IgnoreHeader X-No-Spam
IgnoreHeader X-Olypen-Virus
IgnoreHeader X-OWM-SpamCheck
IgnoreHeader X-OWM-VirusCheck
IgnoreHeader X-PAA-AntiVirus
IgnoreHeader X-PAA-AntiVirus-Message
IgnoreHeader X-PIRONET-NDH-MailScanner-SpamCheck
IgnoreHeader X-PIRONET-NDH-MailScanner-SpamScore
IgnoreHeader X-PN-SPAMFiltered
IgnoreHeader X-Priority
IgnoreHeader X-Proofpoint-Spam-Details
IgnoreHeader X-purgate
IgnoreHeader X-purgate-Ad
IgnoreHeader X-purgate-ID
IgnoreHeader X-RAV-AntiVirus
IgnoreHeader X-Rc-Spam
IgnoreHeader X-Rc-Virus
IgnoreHeader X-RedHat-Spam-Score
IgnoreHeader X-RedHat-Spam-Warning
IgnoreHeader X-RegEx
IgnoreHeader X-RegEx-Score
IgnoreHeader X-RITmySpam
IgnoreHeader X-RITmySpam-IP
IgnoreHeader X-RITmySpam-Spam
IgnoreHeader X-Rocket-Spam
IgnoreHeader X-SA-GROUP
IgnoreHeader X-SA-RECEIPTSTATUS
IgnoreHeader X-Sohu-Antivirus
IgnoreHeader X-Spam
IgnoreHeader X-Spam-Check
IgnoreHeader X-Spam-Checked-By
IgnoreHeader X-Spam-Checker
IgnoreHeader X-Spam-Checker-Version
IgnoreHeader X-Spam-DCC
IgnoreHeader X-Spam-Details
IgnoreHeader X-Spam-detection-level
IgnoreHeader X-Spam-Filter
IgnoreHeader X-Spam-Filtered
IgnoreHeader X-Spam-Flag
IgnoreHeader X-Spam-Level
IgnoreHeader X-Spam-OrigSender
IgnoreHeader X-Spam-Pct
IgnoreHeader X-Spam-Prev-Subject
IgnoreHeader X-Spam-Processed
IgnoreHeader X-Spam-Pyzor
IgnoreHeader X-Spam-Rating
IgnoreHeader X-Spam-Report
IgnoreHeader X-Spam-Scanned
IgnoreHeader X-Spam-Score
IgnoreHeader X-Spam-Status
IgnoreHeader X-Spam-Tagged
IgnoreHeader X-Spam-Tests
IgnoreHeader X-Spam-Tests-Failed
IgnoreHeader X-Spam-Virus
IgnoreHeader X-Spamadvice
IgnoreHeader X-Spamarrest-noauth
IgnoreHeader X-Spamarrest-speedcode
IgnoreHeader X-SpamBouncer
IgnoreHeader X-Spambayes-Classification
IgnoreHeader X-SpamCatcher-Score
IgnoreHeader X-SpamCop-Checked
IgnoreHeader X-SpamCop-Disposition
IgnoreHeader X-SpamCop-Whitelisted
IgnoreHeader X-Spamcount
IgnoreHeader X-SpamDetected
IgnoreHeader X-SpamInfo
IgnoreHeader X-SpamPal
IgnoreHeader X-SpamPal-Timeout
IgnoreHeader X-SpamReason
IgnoreHeader X-SpamScore
IgnoreHeader X-Spamsensitivity
IgnoreHeader X-SpamTest-Categories
IgnoreHeader X-SpamTest-Info
IgnoreHeader X-SpamTest-Method
IgnoreHeader X-SpamTest-Status
IgnoreHeader X-SpamTest-Version
IgnoreHeader X-STA-NotSpam
IgnoreHeader X-STA-Spam
IgnoreHeader X-TERRACE-SPAMMARK
IgnoreHeader X-TERRACE-SPAMRATE
IgnoreHeader X-to-viruscore
IgnoreHeader X-Text-Classification
IgnoreHeader X-Text-Classification-Data
IgnoreHeader X-UCD-Spam-Score
IgnoreHeader x-uscspam
IgnoreHeader X-Virus-Check
IgnoreHeader X-Virus-Checked
IgnoreHeader X-Virus-Checker-Version
IgnoreHeader X-Virus-Scan
IgnoreHeader X-Virus-Scanned
IgnoreHeader X-Virus-Scanner
IgnoreHeader X-Virus-Scanner-Result
IgnoreHeader X-Virus-Status
IgnoreHeader X-VirusChecked
IgnoreHeader X-Virusscan
IgnoreHeader X-WinProxy-AntiVirus
IgnoreHeader X-WinProxy-AntiVirus-Message

#
# Lookup: Perform lookups on streamlined blackhole list servers (see
# http://www.nuclearelephant.com/projects/sbl/). The streamlined blacklist
# server is a machine-automated, unsupervised blacklisting system designed to
# provide real-time and highly accurate blacklisting based on network spread.
# When performing a lookup, DSPAM will automatically learn the inbound message
# as spam if the source IP is listed. Until an official public RABL server is
# available, this feature is only useful if you are running your own
# streamlined blackhole list server for internal reporting among multiple mail
# servers. Provide the name of the lookup zone below to use.
#
# This function performs standard reverse-octet.domain lookups, and while it
# will function with many RBLs, it's strongly discouraged to use those
# maintained by humans as they're often inaccurate and could hurt filter
# learning and accuracy.
#
Lookup "sbl-xbl.spamhaus.org"

#
# RBLInoculate: If you want to inoculate the user from RBL'd messages it would
# have otherwise missed, set this to on.
#
RBLInoculate on

#
# Notifications: Enable the sending of notification emails to users (first
# message, quarantine full, etc.)
#
Notifications on

#
# Purge configuration: Set dspam_clean purge default options, if not otherwise
# specified on the commandline
#
#PurgeSignatures 14 # Stale signatures
#PurgeNeutral 90 # Tokens with neutralish probabilities
#PurgeUnused 90 # Unused tokens
#PurgeHapaxes 30 # Tokens with less than 5 hits (hapaxes)
#PurgeHits1S 15 # Tokens with only 1 spam hit
#PurgeHits1I 15 # Tokens with only 1 innocent hit

#
# Purge configuration for SQL-based installations using purge.sql
#
PurgeSignatures off # Specified in purge.sql
PurgeNeutral 90
PurgeUnused off # Specified in purge.sql
PurgeHapaxes off # Specified in purge.sql
PurgeHits1S off # Specified in purge.sql
PurgeHits1I off # Specified in purge.sql

#
# Local Mail Exchangers: Used for source address tracking, tells DSPAM which
# mail exchangers are local and therefore should be ignored in the Received:
# header when tracking the source of an email. Note: you should use the address
# of the host as appears between brackets [ ] in the Received header.
#
LocalMX 127.0.0.1

#
# Logging: Disabling logging for users will make usage graphs unavailable to
# them. Disabling system logging will make admin graphs unavailable.
#
SystemLog on
UserLog on

#
# TrainPristine: for systems where the original message remains server side
# and can therefore be presented in pristine format for retraining. This option
# will cause DSPAM to cease all writing of signatures and DSPAM headers to the
# message, and deliver the message in as pristine format as possible. This mode
# REQUIRES that the original message in its pristine format (as of delivery)
# be presented for retraining, as in the case of webmail, imap, or other
# applications where the message is actually kept server-side during reading,
# and is preserved. DO NOT use this switch unless the original message can be
# presented for retraining with the ORIGINAL HEADERS and NO MODIFICATIONS.
#
#TrainPristine on

#
# Opt: in or out; determines DSPAM's default filtering behavior. If this value
# is set to in, users must opt-in to filtering by dropping a .dspam file in
# /var/dspam/opt-in/user.dspam (or if you have homedirs configured, a .dspam
# folder in their home directory). The default is opt-out, which means all
# users will be filtered unless a .nodspam file is dropped in
# /var/dspam/opt-out/user.nodspam
#
Opt in

#
# TrackSources: specify which (if any) source addresses to track and report
# them to syslog (mail.info). This is useful if you're running a firewall or
# blacklist and would like to use this information. Spam reporting also drops
# RABL blacklist files (see http://www.nuclearelephant.com/projects/rabl/).
#
TrackSources spam nonspam

#
# ParseToHeaders: In lieu of setting up individual aliases for each user,
# DSPAM can be configured to automatically parse the To: address for spam and
# false positive forwards. From there, it can be configured to either set the
# DSPAM user based on the username specified in the header and/or change the
# training class and source accordingly. The options below can be used to
# customize most common types of header parsing behavior to avoid the need for
# multiple aliases, or if using LMTP, aliases entirely.
#
# ParseToHeaders: Parse the To: headers of an incoming message. This must be
# set to 'on' to use either of the following features.
#
# ChangeModeOnParse: Automatically change the class (to spam or innocent)
# depending on whether spam- or notspam- was specified, and change the source
# to 'error'. This is convenient if you're not using aliases at all, but
# are delivering via LMTP.
#
# ChangeUserOnParse: Automatically change the username to match that specified
# in the To: header. For example, spam-bob@domain.tld will set the username
# to bob, ignoring any --user passed in. This may not always be desirable if
# you are using virtual email addresses as usernames. Options:
#   on or user  take the portion before the @ sign only
#   full        take everything after the initial {spam,notspam}-.
#
ParseToHeaders on
ChangeModeOnParse on
ChangeUserOnParse off

#
# Broken MTA Options: Some MTAs don't support the proper functionality
# necessary. In these cases you can activate certain features in DSPAM to
# compensate. 'returnCodes' causes DSPAM to return an exit code of 99 if
# the message is spam, 0 if not, or a negative code if an error has occurred.
# Specifying 'case' causes DSPAM to force the input usernames to lowercase.
# Specifying 'lineStripping' causes DSPAM to strip ^M's from messages passed
# in.
#
#Broken returnCodes
Broken case
Broken lineStripping

#
# MaxMessageSize: You may specify a maximum message size for DSPAM to process.
# If the message is larger than the maximum size, it will be delivered
# without processing. Value is in bytes.
#
MaxMessageSize 20971520

#
# Virus Checking: If you are running clamd, DSPAM can perform stream-based
# virus checking using TCP. Uncomment the values below to enable virus
# checking.
#
# ClamAVResponse: reject (reject or drop the message with a permanent failure)
#                 accept (accept the message and quietly drop the message)
#                 spam   (treat as spam and quarantine/tag/whatever)
#
#ClamAVPort 3310
#ClamAVHost 127.0.0.1
#ClamAVResponse accept

#
# Daemonized Server: If you are running DSPAM as a daemonized server using
# --daemon, the following parameters will override the default. Use the
# ServerPass option to set up accounts for each client machine. The DSPAM
# server will process and deliver the message based on the parameters
# specified. If you want the client machine to perform delivery, use
# the --stdout option in conjunction with a local setup.
#
#ServerPort    24
ServerQueueSize    32
ServerPID    /var/run/dspam/dspam.pid

#
# ServerMode specifies the type of LMTP server to start. This can be one of:
#   dspam:    DSPAM-proprietary DLMTP server, for communicating with dspamc
#   standard: Standard LMTP server, for communicating with Postfix or other MTA
#   auto:     Speak both DLMTP and LMTP; auto-detect by ServerPass.IDENT
#
ServerMode auto

# If supporting DLMTP (dspam) mode, dspam clients will require authentication
# as they will be passing in parameters. The idents below will be used to
# determine which clients will be speaking DLMTP, so if you will be using
# both LMTP and DLMTP from the same host, be sure to use something other
# than the server's hostname below (which will be sent by the MTA during a
# standard LMTP LHLO).
#
#ServerPass.Relay1 "secret"
#ServerPass.Relay2 "password"
#
ServerPass.Nautilus "XXXXXXXXXXXXXXXXXXX"

# If supporting standard LMTP mode, server parameters will need to be specified
# here, as they will not be passed in by the mail server. The ServerIdent
# specifies the 250 response code ident sent back to connecting clients and
# should be set to the hostname of your server, or an alias.
#
# NOTE: If you specify --user in ServerParameters, the RCPT TO will be
# used only for delivery, and not set as the active user for processing.
#
ServerParameters "--deliver=innocent,spam -d %u"
ServerIdent    "XXXXXXXXXXXXXXXXXX"

# If you wish to use a local domain socket instead of a TCP socket, uncomment
# the following. It is strongly recommended you use local domain sockets if
# you are running the client and server on the same machine, as it eliminates
# much of the bandwidth overhead.
#
ServerDomainSocketPath "/var/run/dspam/dspam.sock"

#
# Client Mode: If you are running DSPAM in client/server mode, uncomment and
# set these variables. A ClientHost beginning with a / will be treated as
# a domain socket.
#
#ClientHost /tmp/dspam.sock
#ClientIdent "secret@Relay1"
#
#ClientHost 127.0.0.1
#ClientPort 24
#ClientIdent "secret@Relay1"

# RABLQueue: Touch files in the RABL queue
# If you are a reporting streamlined blackhole list participant, you can
# touch ip addresses within the directory the rabl_client process is watching.
#
#RABLQueue /var/spool/rabl
ClientHost /var/run/dspam/dspam.sock
ClientIdent "XXXXXXXXXXXXXXXXXXXXXXXXXXX@Nautilus"

# DataSource: If you are using any type of data source that does not include
# email-like headers (such as documents), uncomment the line below. This
# will cause the entire input to be treated like a message "body"
#
#DataSource document

# ProcessorWordFrequency: By default, words are only counted once per message.
# If you are classifying large documents, however, you may wish to count once
# per occurrence instead.
#
#ProcessorWordFrequency occurrence

# ProcessorBias: Bias causes the filter to lean more toward 'innocent', and
# usually greatly reduces false positives. It is the default behavior of
# most Bayesian filters (including dspam).
#
# NOTE: You probably DON'T want this if you're using Markovian Weighting, unless
# you are paranoid about false positives.
#
ProcessorBias on

## EOF

petrjanda wrote:

More info would be nice!

Okay... In my dspam.conf you can see that I enabled "RBLInoculate" and "Lookup". This is one way to keep the global user up to date with new spam data. But you still need to have the mail from your spam traps delivered to your "global" user.
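
To make the "Lookup" part a bit more concrete: the comment in dspam.conf says it performs standard reverse-octet.domain lookups. The sketch below only shows how such a query name is put together. The helper name rbl_query_name and the address 192.0.2.10 are made up for illustration; DSPAM does this internally and this is not its actual code.

```bash
#!/bin/bash
# Hypothetical helper (not part of DSPAM): build the DNS name that a
# reverse-octet blacklist lookup would query for a given IPv4 address.
rbl_query_name() {
    local ip="$1" zone="$2"
    local old_ifs="$IFS"
    # Split the dotted quad into $1..$4 by using "." as field separator.
    IFS=.
    set -- $ip
    IFS="$old_ifs"
    # Reverse the octets and append the lookup zone.
    echo "$4.$3.$2.$1.${zone}"
}

# Example address from the documentation range, zone taken from the config above.
rbl_query_name 192.0.2.10 sbl-xbl.spamhaus.org
```

The resulting name is what would be resolved against the zone given in "Lookup"; an answer means the source IP is listed.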

Another way of doing it could be to set up some spam traps and have your MTA (I use Postfix) deliver any mail captured in those traps to be automatically learned by your global user. I would not blindly push everything into the global user, though. This is bad unless you feed ham mails to that global user as well; otherwise you will end up with too many spam tokens, and this has a negative influence on accuracy. Keep in mind that it is best to have 50% ham and 50% spam in your token data. It is okay to have 2/3 ham and 1/3 spam. A good document describing how to implement that in Postfix can be found here.
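
A rough way to keep an eye on that ratio, as a sketch only: the two functions below take the number of spam and ham messages you have fed to the global user (you have to supply these counts yourself, nothing here reads them from DSPAM) and tell you the spam share. The names spam_share and corpus_ok are made up, and treating "at most 50% spam" as acceptable is just my reading of the rule of thumb above.

```bash
#!/bin/bash
# Hypothetical helpers: the counts are supplied by hand, not read from DSPAM.

# Print the spam share of a corpus as a whole percentage (integer division).
spam_share() {
    local spam="$1" ham="$2"
    local total=$((spam + ham))
    if [ "${total}" -eq 0 ]; then
        echo "0"
        return 1
    fi
    echo $((spam * 100 / total))
}

# Succeed when spam makes up at most half of the corpus.
corpus_ok() {
    local share
    share="$(spam_share "$1" "$2")" || return 1
    [ "${share}" -le 50 ]
}

spam_share 1000 2000    # prints 33 (1/3 spam, 2/3 ham)
corpus_ok 3000 1000 || echo "too much spam, feed more ham first"
```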

Another way is to use dspam_merge. What you could do (from time to time and only if you trust the source data/user):

Code:

dspam_merge user1 user2 ... userN -o global
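
If you do this regularly, a small dry-run wrapper can help to avoid merging the wrong users by accident. This is only a sketch: merge_command is a made-up helper that prints the dspam_merge call shown above instead of running it, so you can check the user list first.

```bash
#!/bin/bash
# Hypothetical dry-run helper: print (do not run) the dspam_merge call that
# would fold the given trusted users into the target user.
merge_command() {
    local target="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "merge_command: no source users given" >&2
        return 1
    fi
    # "$*" joins the remaining arguments with spaces.
    echo "dspam_merge $* -o ${target}"
}

merge_command global user1 user2 user3
```

Once the printed command looks right, you can run it by hand.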

Another way is to use inoculation groups. Read the README to see what exactly that is and how it works.

petrjanda wrote:

Up until now I was using a classification group. I'm gonna use a merged group now.

Switch to merged groups now! A classification group is not what you are looking for. The problem with classification groups is that a mail could be tagged as "spam" while the header still says it is "ham", and vice versa. A merged group is way, way better.
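
For reference, the merged group entry from earlier in this thread has the form global:merged:user1,user2,user3. With a longer list of test users it is easy to make a typo, so here is a small sketch that builds the line from a list of names. group_line is a made-up helper; it only prints the line and does not touch your group file.

```bash
#!/bin/bash
# Hypothetical helper: build a merged-group line "owner:merged:u1,u2,..."
# from an owner and a list of member users. It only prints the line.
group_line() {
    local owner="$1"
    shift
    local members="" user
    for user in "$@"; do
        # Add a comma before every member except the first.
        members="${members:+${members},}${user}"
    done
    echo "${owner}:merged:${members}"
}

group_line global user1 user2 user3
```

Check the printed line before you add it to the group file in your DSPAM home.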

petrjanda wrote:

This is a problem with classification groups. Use merged groups. They give much better results and do exactly what you want.

I would set your "global" user to TEFT mode while learning or building up the corpus data, but after that you need to set it to TOE. If you don't set it to TOE, then the purge script will delete tokens, and you don't want that to happen to the "global" user.
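
One possible way to switch the mode per user is dspam_admin, assuming it is installed and your setup uses the preferences extension. The sketch below only prints the command. training_mode_command is a made-up helper, and the "change preference" form is what I would expect from the DSPAM documentation, so verify it against your version before running anything.

```bash
#!/bin/bash
# Hypothetical dry-run helper: print the dspam_admin call that would set the
# trainingMode preference for one user. Assumes the preferences extension is
# in use; check the syntax against your DSPAM version.
training_mode_command() {
    local user="$1" mode="$2"
    case "${mode}" in
        TEFT|TOE|TUM) ;;
        *)
            echo "training_mode_command: unknown mode '${mode}'" >&2
            return 1
            ;;
    esac
    echo "dspam_admin change preference ${user} trainingMode ${mode}"
}

training_mode_command global TEFT   # while building up the corpus
training_mode_command global TOE    # once the corpus is in good shape
```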

If you want to keep the data small, then switch everyone to TOE. This is slightly less accurate than TEFT, but it helps to keep the data in your storage to a minimum.

If you want a faster and more flexible way of purging your old tokens from DSPAM, then use this dspam.cron script (you should have one already in /etc/cron.daily/dspam.cron):

Code:

#!/bin/bash
# Copyright 1999-2005 Gentoo Foundation
# Distributed under the terms of the GNU General Public License v2
#
# Remove old signatures and unimportant tokens from the DSPAM database
#

#
# Function to run dspam_clean
#
run_dspam_clean() {
if [[ ! -f "/usr/bin/dspam_clean" ]]
then
echo "/usr/bin/dspam_clean not found!"
return 1
else
/usr/bin/dspam_clean -s -p -u >/dev/null 2>&1
return 0
fi
}

#
# Function to check if we have all needed tools
#
check_for_tools() {
local myrc=0
for foo in awk head tail cut sed
do
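# "command -v" reports whether the tool is in PATH without relying on the
# wording of the shell's error message or on the tool supporting --version.
if ! command -v "${foo}" >/dev/null 2>&1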
then
echo "Command ${foo} not found!"
myrc=1
fi
done
return ${myrc}
}

#
# Check for needed tools
#
check_for_tools
if [[ "$?" -ne "0" ]]
then
# We have not all needed tools installed. Run just the dspam_clean part.
run_dspam_clean
exit $?
fi

#
# Try to get DSPAM home directory
#
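# Match the user name exactly, so users like "dspamd" are not picked up too.
DSPAM_HOMEDIR="$(awk -F : '$1 == "dspam" {print $6}' /etc/passwd)"
# compgen -G succeeds if the glob matches at least one file; a plain
# [ -f glob ] test breaks as soon as more than one *.data file exists.
if ! compgen -G "${DSPAM_HOMEDIR}/*.data" >/dev/null
then
# Something is wrong in passwd! Check if /etc/mail/dspam exists instead.
if compgen -G "/etc/mail/dspam/*.data" >/dev/null
then
DSPAM_HOMEDIR="/etc/mail/dspam"
fi
fi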

if [[ -f "${DSPAM_HOMEDIR}/mysql.data" ]]
then
if [[ ! -f "/usr/bin/mysql_config" ]]
then
echo "Can not run MySQL purge script:"
echo " /usr/bin/mysql_config does not exist"
run_dspam_clean
exit 1
fi
DSPAM_MySQL_PURGE_SQL=""
DSPAM_MySQL_VER="$(/usr/bin/mysql_config --version | sed "s:[^0-9.]*::g")"
DSPAM_MySQL_MAJOR="$(echo "${DSPAM_MySQL_VER}" | cut -d. -f1)"
DSPAM_MySQL_MINOR="$(echo "${DSPAM_MySQL_VER}" | cut -d. -f2)"
DSPAM_MySQL_MICRO="$(echo "${DSPAM_MySQL_VER}" | cut -d. -f3)"
DSPAM_MySQL_INT="$((DSPAM_MySQL_MAJOR * 65536 + DSPAM_MySQL_MINOR * 256 + DSPAM_MySQL_MICRO))"

# For MySQL >= 4.1 use the new purge script
if [[ "${DSPAM_MySQL_INT}" -ge "262400" ]]
then
if [[ -f "${DSPAM_HOMEDIR}/config/mysql_purge-4.1-optimized.sql" || -f "${DSPAM_HOMEDIR}/mysql_purge-4.1-optimized.sql" ]]
then
[[ -f "${DSPAM_HOMEDIR}/config/mysql_purge-4.1-optimized.sql" ]] && DSPAM_MySQL_PURGE_SQL="${DSPAM_HOMEDIR}/config/mysql_purge-4.1-optimized.sql"
[[ -f "${DSPAM_HOMEDIR}/mysql_purge-4.1-optimized.sql" ]] && DSPAM_MySQL_PURGE_SQL="${DSPAM_HOMEDIR}/mysql_purge-4.1-optimized.sql"
else
[[ -f "${DSPAM_HOMEDIR}/config/mysql_purge-4.1.sql" ]] && DSPAM_MySQL_PURGE_SQL="${DSPAM_HOMEDIR}/config/mysql_purge-4.1.sql"
[[ -f "${DSPAM_HOMEDIR}/mysql_purge-4.1.sql" ]] && DSPAM_MySQL_PURGE_SQL="${DSPAM_HOMEDIR}/mysql_purge-4.1.sql"
fi
else
[[ -f "${DSPAM_HOMEDIR}/config/mysql_purge.sql" ]] && DSPAM_MySQL_PURGE_SQL="${DSPAM_HOMEDIR}/config/mysql_purge.sql"
[[ -f "${DSPAM_HOMEDIR}/mysql_purge.sql" ]] && DSPAM_MySQL_PURGE_SQL="${DSPAM_HOMEDIR}/mysql_purge.sql"
fi

if [[ "${DSPAM_MySQL_PURGE_SQL}" == "" ]]
then
echo "Can not run MySQL purge script:"
echo " No mysql_purge SQL script found"
run_dspam_clean
exit 1
fi

if [[ ! -f "/usr/bin/mysql" ]]
then
echo "Can not run MySQL purge script:"
echo " /usr/bin/mysql does not exist"
run_dspam_clean
exit 1
fi

# Get DSPAM MySQL username and password
DSPAM_MySQL_HOST="$(cat ${DSPAM_HOMEDIR}/mysql.data|head -n 1|tail -n 1)"
DSPAM_MySQL_PORT="$(cat ${DSPAM_HOMEDIR}/mysql.data|head -n 2|tail -n 1)"
DSPAM_MySQL_USER="$(cat ${DSPAM_HOMEDIR}/mysql.data|head -n 3|tail -n 1)"
DSPAM_MySQL_PWD="$(cat ${DSPAM_HOMEDIR}/mysql.data|head -n 4|tail -n 1)"
DSPAM_MySQL_DB="$(cat ${DSPAM_HOMEDIR}/mysql.data|head -n 5|tail -n 1)"

# Run the MySQL purge script
(/usr/bin/mysql --user="${DSPAM_MySQL_USER}" --password="${DSPAM_MySQL_PWD}" ${DSPAM_MySQL_DB} < ${DSPAM_MySQL_PURGE_SQL}) 1>/dev/null 2>&1

# Run the dspam_clean command
run_dspam_clean

# Optimize the MySQL tables for DSPAM
for foo in $(/usr/bin/mysql --user="${DSPAM_MySQL_USER}" --password="${DSPAM_MySQL_PWD}" --silent --skip-column-names --batch ${DSPAM_MySQL_DB} -e 'SHOW TABLES;' 2>&1)
do
(/usr/bin/mysql --user="${DSPAM_MySQL_USER}" --password="${DSPAM_MySQL_PWD}" ${DSPAM_MySQL_DB} -e "OPTIMIZE TABLE ${foo};") 1>/dev/null 2>&1
done
exit 0
elif [[ -f "${DSPAM_HOMEDIR}/pgsql.data" ]]
then
DSPAM_PgSQL_PURGE_SQL=""
[[ -f "${DSPAM_HOMEDIR}/config/pgsql_purge.sql" ]] && DSPAM_PgSQL_PURGE_SQL="${DSPAM_HOMEDIR}/config/pgsql_purge.sql"
[[ -f "${DSPAM_HOMEDIR}/pgsql_purge.sql" ]] && DSPAM_PgSQL_PURGE_SQL="${DSPAM_HOMEDIR}/pgsql_purge.sql"

if [[ "${DSPAM_PgSQL_PURGE_SQL}" == "" ]]
then
echo "Can not run PostgreSQL purge script:"
echo " No pgsql_purge SQL script found"
run_dspam_clean
exit 1
fi

if [[ ! -f "/usr/bin/psql" ]]
then
echo "Can not run PostgreSQL purge script:"
echo " /usr/bin/psql does not exist"
run_dspam_clean
exit 1
fi

# Get DSPAM PostgreSQL username and password
DSPAM_PgSQL_HOST="$(cat ${DSPAM_HOMEDIR}/pgsql.data|head -n 1|tail -n 1)"
DSPAM_PgSQL_PORT="$(cat ${DSPAM_HOMEDIR}/pgsql.data|head -n 2|tail -n 1)"
DSPAM_PgSQL_USER="$(cat ${DSPAM_HOMEDIR}/pgsql.data|head -n 3|tail -n 1)"
DSPAM_PgSQL_PWD="$(cat ${DSPAM_HOMEDIR}/pgsql.data|head -n 4|tail -n 1)"
DSPAM_PgSQL_DB="$(cat ${DSPAM_HOMEDIR}/pgsql.data|head -n 5|tail -n 1)"

# Run the PostgreSQL purge script
(PGUSER=${DSPAM_PgSQL_USER} PGPASSWORD=${DSPAM_PgSQL_PWD} /usr/bin/psql -U ${DSPAM_PgSQL_USER} -d ${DSPAM_PgSQL_DB} -p ${DSPAM_PgSQL_PORT} -h ${DSPAM_PgSQL_HOST} -f ${DSPAM_PgSQL_PURGE_SQL}) 1>/dev/null 2>&1

# Run the dspam_clean command
run_dspam_clean

exit 0
elif [[ -f "${DSPAM_HOMEDIR}/oracle.data" ]]
then
DSPAM_Oracle_PURGE_SQL=""
[[ -f "${DSPAM_HOMEDIR}/config/ora_purge.sql" ]] && DSPAM_Oracle_PURGE_SQL="${DSPAM_HOMEDIR}/config/ora_purge.sql"
[[ -f "${DSPAM_HOMEDIR}/ora_purge.sql" ]] && DSPAM_Oracle_PURGE_SQL="${DSPAM_HOMEDIR}/ora_purge.sql"

if [[ "${DSPAM_Oracle_PURGE_SQL}" == "" ]]
then
echo "Can not run Oracle purge script:"
echo " No ora_purge SQL script found"
run_dspam_clean
exit 1
fi

if [[ ! -f "/usr/bin/sqlplus" ]]
then
echo "Can not run Oracle purge script:"
echo " /usr/bin/sqlplus does not exist"
run_dspam_clean
exit 1
fi

# Get DSPAM Oracle username and password
DSPAM_Oracle_DBLINK="$(cat ${DSPAM_HOMEDIR}/oracle.data|head -n 1|tail -n 1)"
DSPAM_Oracle_USER="$(cat ${DSPAM_HOMEDIR}/oracle.data|head -n 2|tail -n 1)"
DSPAM_Oracle_PWD="$(cat ${DSPAM_HOMEDIR}/oracle.data|head -n 3|tail -n 1)"
DSPAM_Oracle_SCHEMA="$(cat ${DSPAM_HOMEDIR}/oracle.data|head -n 4|tail -n 1)"

# Run the Oracle purge script
(/usr/bin/sqlplus -s ${DSPAM_Oracle_USER}/${DSPAM_Oracle_PWD} @${DSPAM_Oracle_PURGE_SQL}) 1>/dev/null 2>&1

# Run the dspam_clean command
run_dspam_clean

exit 0
else
run_dspam_clean
exit $?
fi

As you see, I use an optimized purge script (mysql_purge-4.1-optimized.sql). This is the content of my optimized script:

Code:

# $Id: purge-4.1.sql,v 1.5 2005/07/14 13:50:10 jonz Exp $

# => http://www.solidcore.dk/blog/2006/02/optimizing_dspam.html

set @a=to_days(current_date());

START TRANSACTION;
delete from dspam_token_data
where (innocent_hits*2) + spam_hits < 5
and from_days(@a-60) > last_hit;
COMMIT;

START TRANSACTION;
delete from dspam_token_data
where innocent_hits = 1 and spam_hits = 0
and @a-to_days(last_hit) > 15;
COMMIT;

START TRANSACTION;
delete from dspam_token_data
where innocent_hits = 0 and spam_hits = 1
and @a-to_days(last_hit) > 15;
COMMIT;

START TRANSACTION;
delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value in('TOE','TUM','NOTRAIN')
WHERE from_days(@a-90) > dspam_token_data.last_hit
AND dspam_preferences.uid IS NULL;
COMMIT;

START TRANSACTION;
delete from dspam_token_data
USING
dspam_token_data LEFT JOIN dspam_preferences
ON dspam_token_data.uid = dspam_preferences.uid
AND dspam_preferences.preference = 'trainingMode'
AND dspam_preferences.value = 'TUM'
WHERE from_days(@a-90) > dspam_token_data.last_hit
AND innocent_hits + spam_hits < 50
AND dspam_preferences.uid IS NOT NULL;
COMMIT;

START TRANSACTION;
delete from dspam_signature_data
where from_days(@a-14) > created_on;
COMMIT;

You can read more about it at the URL in the script's header comment (http://www.solidcore.dk/blog/2006/02/optimizing_dspam.html).
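
The first three DELETE statements boil down to a simple rule per token. As a rough sketch only (would_purge is a hypothetical helper name, not part of DSPAM, and it deliberately ignores the trainingMode-dependent statements and the signature cleanup), the same logic in bash could look like this, with the third argument standing for the number of days since last_hit:

```shell
#!/bin/bash
# Sketch of the first three purge rules from the SQL above.
# would_purge is a hypothetical helper, not a DSPAM tool.
# Arguments: innocent_hits spam_hits age_days (days since last_hit)
would_purge() {
    local innocent="$1" spam="$2" age="$3"
    # Rule 1: weak tokens not seen for more than 60 days
    if (( innocent * 2 + spam < 5 && age > 60 )); then return 0; fi
    # Rule 2: tokens seen once in ham only, not seen for more than 15 days
    if (( innocent == 1 && spam == 0 && age > 15 )); then return 0; fi
    # Rule 3: tokens seen once in spam only, not seen for more than 15 days
    if (( innocent == 0 && spam == 1 && age > 15 )); then return 0; fi
    return 1
}
```

For example, "would_purge 1 0 20" succeeds (a single-hit token older than 15 days), while "would_purge 2 0 30" fails, because that token is weak but has not yet aged past 60 days.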

If you need a script to delete in MySQL a user with all his data, then you could use this script:

Code:

#!/bin/bash

[[ "$1" == "" ]] && echo "Missing username" && exit 1

_dspam_sysconfdir=$(dspam --version|sed -n "s:^.*\-\-sysconfdir\=\([^ ]*\).*:\1:gIp");

if [[ "${_dspam_sysconfdir}" == "" ]];
then
echo "Error: Could not get DSPAM system config directory";
exit 1
elif [[ ! -d "${_dspam_sysconfdir}" ]];
then
echo "Error: DSPAM system config directory does not exist";
exit 1
elif [[ ! -f "${_dspam_sysconfdir}/mysql.data" ]];
then
echo "Error: DSPAM mysql.data file does not exist";
exit 1
fi

_dspam_mysql_host="$(cat ${_dspam_sysconfdir}/mysql.data|head -n 1|tail -n 1)"
_dspam_mysql_port="$(cat ${_dspam_sysconfdir}/mysql.data|head -n 2|tail -n 1)"
_dspam_mysql_user="$(cat ${_dspam_sysconfdir}/mysql.data|head -n 3|tail -n 1)"
_dspam_mysql_password="$(cat ${_dspam_sysconfdir}/mysql.data|head -n 4|tail -n 1)"
_dspam_mysql_db="$(cat ${_dspam_sysconfdir}/mysql.data|head -n 5|tail -n 1)"
_dspam_user_uid=$(mysql --user="${_dspam_mysql_user}" --password="${_dspam_mysql_password}" --host="localhost" --batch --skip-column-names -e "USE ${_dspam_mysql_db};SELECT uid FROM dspam_virtual_uids WHERE 1 AND username='${1}';")

if [[ "${_dspam_user_uid}" == "" ]];
then
echo "Error: Can not get UID for user ${1}";
exit 1
fi

_dspam_delete_uid="USE ${_dspam_mysql_db};$(mysql --user="${_dspam_mysql_user}" --password="${_dspam_mysql_password}" --host="localhost" --batch --skip-column-names -e "USE ${_dspam_mysql_db};SHOW TABLES;" | sed "s:^\(.*\)$:DELETE FROM \1 WHERE uid='${_dspam_user_uid}';:g")";
echo "Executing:"
echo ${_dspam_delete_uid}

mysql --user="${_dspam_mysql_user}" --password="${_dspam_mysql_password}" --host="localhost" --batch --skip-column-names -e "${_dspam_delete_uid};"
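
Both scripts above read mysql.data with five separate cat|head|tail pipelines. A possible alternative, shown here only as a sketch, reads the file once. It assumes the same five-line layout the scripts use (host, port, user, password, database); load_mysql_data and the DSPAM_MYSQL_* variable names are hypothetical and not part of DSPAM:

```shell
#!/bin/bash
# Sketch: read the five-line mysql.data file in a single pass.
# Assumed line order (taken from the scripts above):
#   1 host, 2 port, 3 user, 4 password, 5 database
# load_mysql_data is a hypothetical helper name.
load_mysql_data() {
    local file="$1"
    [[ -r "${file}" ]] || return 1
    {
        # IFS= and -r keep leading spaces and backslashes in passwords intact
        IFS= read -r DSPAM_MYSQL_HOST
        IFS= read -r DSPAM_MYSQL_PORT
        IFS= read -r DSPAM_MYSQL_USER
        IFS= read -r DSPAM_MYSQL_PWD
        IFS= read -r DSPAM_MYSQL_DB
    } < "${file}"
    # Report failure if the database name (line 5) is missing or empty
    [[ -n "${DSPAM_MYSQL_DB}" ]]
}
```

It would be called as load_mysql_data "${_dspam_sysconfdir}/mysql.data" before running the mysql commands, with the script exiting if the call fails.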

I have written other scripts (for example for setting trainingMode for all users and other stuff like that). If you need more scripts, then let me know.

If you want, I could upload the current data from my training system (the system where I do my training and where all the config files I posted come from). The dspam_token_data table is not that big (around 1'200'000 tokens), and if I dump the data with mysqldump and compress it with bzip2, the result is only 13MB.
But I would not suggest that you use that data. It is far better to use your own data, since mine contains a lot of stuff which is probably not needed in your environment (it includes a lot of German mail). And one last piece of advice: DON'T TRUST ANYONE when it comes to anti-spam. Use your own data and your own training. That's the best thing to do! DSPAM relies on statistical data, and if you use data from someone else (for example from me), then you are spoiling your accuracy.

If you need spam corpus data, then look here:

If you need spam and ham corpus data, then look here:

Ling-spam
SpamAssassin public mail corpus
Synthetic (Annexia/Xpert) Corpus
TREC 2005 Public Spam Corpus (this is mostly the same as the Enron Email Dataset, but machine sorted (therefore not that accurate) into spam/ham)
Anti-Spam-SMTP-Proxy (ASSP) corpus data

If you need unsorted mails, then look here (this data is unsorted! You need to classify it yourself into spam/ham):

BIG warning about the public corpora: They contain a lot of malware (trojans, viruses, etc.). Clean them with anti-virus software!

petrjanda wrote:

Thanks!

No problem

cheers

SteveB

Last edited by steveb on Sat May 13, 2006 10:55 am; edited 1 time in total

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sat May 13, 2006 10:20 am Post subject:

Quote:

And how do i go about pointing it to the global user? Where is this done?

Thanks so much for all this!!!

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sat May 13, 2006 10:32 am Post subject:

petrjanda wrote:

And how do i go about pointing it to the global user? Where is this done?

What MTA do you use?

petrjanda wrote:

Thanks so much for all this!!!

No problem

cheers

SteveB

magic919
Advocate

Joined: 17 Jun 2005
Posts: 2182
Location: Berkshire, UK

Posted: Sat May 13, 2006 1:53 pm Post subject:

Great post Stevee. I'm sure some of this will help me with DSPAM too.

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sat May 13, 2006 2:22 pm Post subject:

magic919 wrote:

Great post Stevee. I'm sure some of this will help me with DSPAM too.

Thanks! I am so happy with DSPAM that I cannot understand why so many people still use SpamAssassin. A heuristic approach (like the one in SA) is in no way as good as the statistical approach.

I have read so much about classifying mail into spam/ham that I know such a cocktail of heuristic and handwritten rules as the one in SA will never be flexible enough or accurate enough to beat something like DSPAM, CRM114 or another statistical filter.

In my experience, the Markovian algorithm (used in CRM114, DSPAM and some other filters) is the absolute fastest at learning and produces very accurate results. The Markovian algorithm in DSPAM (you need to use the hash driver) is good, but does not offer as much flexibility as the other drivers in DSPAM. But I am very sure that this will change in future releases of DSPAM. When this happens, I will probably switch everything in DSPAM to use the hash driver. But for now, I use the other algorithms.

cheers

SteveB

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sat May 13, 2006 6:00 pm Post subject:

steveb wrote:

petrjanda wrote:

And how do i go about pointing it to the global user? Where is this done?

What MTA do you use?

petrjanda wrote:

Thanks so much for all this!!!

No problem

cheers

SteveB

Postfix.

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sat May 13, 2006 6:21 pm Post subject:

petrjanda wrote:

Postfix.

Do you have Procmail on that server? If so, then you could just pass the message to procmail. Something like this:

Code:

# relearn misclassified spam messages
:0
* ^X-DSPAM-Result: Innocent
{
:0
| /usr/bin/dspam --user $USER --class=spam --source=error

:0
/dev/null
}

# delete correctly classified spam
:0
* ^X-DSPAM-Result: Spam
{
:0
/dev/null
}

# delete everything else
:0
/dev/null

Or you could pipe every mail getting to your spam trap, directly to DSPAM:

In /etc/mail/aliases add something like this:

Code:

spam_user@domain.tld "| /usr/bin/dspam --user global --class=spam --source=corpus"

cheers

SteveB

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sun May 14, 2006 2:44 am Post subject:

steveb wrote:

petrjanda wrote:

Postfix.

Do you have Procmail on that server? If so, then you could just pass the message to procmail. Something like this:

Code:

Or you could pipe every mail getting to your spam trap, directly to DSPAM:

In /etc/mail/aliases add something like this:

Code:

spam_user@domain.tld "| /usr/bin/dspam --user global --class=spam --source=corpus"

cheers

SteveB

Once again, Thanks!

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sun May 14, 2006 6:38 am Post subject:

Steve, I've actually got another question. What if I wanted some users not to be merged? For example: global:merged:* would merge global's tokens into all 2000 users, but what if I don't want user1 and user2 to be merged and want them to start with an empty set of tokens?

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sun May 14, 2006 9:14 am Post subject:

petrjanda wrote:

Steve,
Ive actually got another question. What if I wanted some users not to be merged?

For example:

global:merged:* would merge global's tokens into all 2000 users, however what if i dont want user1 and user2 to be merged, let them have an empty set of tokens?

Then you have two possibilities:

1. Add all the users (1998 of them) without user1 and user2 (you can use wildcards to make the list smaller):
Code:

global:merged:user3,user4,user5,user6,user7,.....,a*,b*,c*,d*,e*,userx

2. Use
Code:

global:merged:*
but for user1 and user2 you add the preference to ignore any group membership:
Code:

dspam_admin add preference user1 ignoreGroups on
and
Code:

dspam_admin add preference user2 ignoreGroups on

BTW: Putting users in a merged group does not mean that the tokens of your "global" user are copied into their data. The tokens are merged, but only at run time, not in the database. In the database the members still have no tokens from your "global" user. Do you understand that?

Best would be to use the merged group with TOE as the training mode. That way only errors are trained, which keeps the database very small while still providing good accuracy.
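
If you go with the first possibility and do not want to type the member list by hand, it can be generated. The following is only a sketch: build_merged_group is a hypothetical helper name, it expects one username per line on stdin, and skipping the group's own user is an assumption on my side rather than something DSPAM documents:

```shell
#!/bin/bash
# Sketch: build a "global:merged:..." line from a user list on stdin,
# leaving out the users given in the second argument.
# build_merged_group is a hypothetical helper, not a DSPAM tool.
# Usage: build_merged_group GLOBAL_USER "excluded1 excluded2" < userlist
build_merged_group() {
    local global="$1" excluded=" $2 " user members=""
    while IFS= read -r user || [[ -n "${user}" ]]; do
        # Skip empty lines and (assumption) the group's own user
        [[ -z "${user}" || "${user}" == "${global}" ]] && continue
        # Skip every user named in the exclusion list
        [[ "${excluded}" == *" ${user} "* ]] && continue
        # Append with a comma separator except for the first member
        members+="${members:+,}${user}"
    done
    printf '%s:merged:%s\n' "${global}" "${members}"
}
```

Feeding it the names user1 to user4 plus global, with "user1 user2" as the exclusion list, prints global:merged:user3,user4, which can then be written to the group file.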

cheers

SteveB

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Sun May 14, 2006 9:35 am Post subject:

steveb wrote:

petrjanda wrote:

Then you have two possibilities:

Add all the users (1998 of them) without user1 and user2 (you can use wildcards to make the list smaller):
Code:

global:merged:user3,user4,user5,user6,user7,.....,a*,b*,c*,d*,e*,userx
Use
Code:

global:merged:*
but for user1 and user2 you add the preference to ignore any group membership:
Code:

dspam_admin add preference user1 ignoreGroups on
and
Code:

dspam_admin add preference user2 ignoreGroups on

Yep, I understand that. Thanks!

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sun May 14, 2006 9:55 am Post subject:

petrjanda wrote:

Yep, I understand that. Thanks!

Any other questions about DSPAM? I feel in the mood to answer more stuff

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Sun May 14, 2006 6:20 pm Post subject:

FYI: DSPAM 3.6.6 is out

Code:

3.6.6 is a maintenance release

MAINT: Phased out deprecated Berkeley DB drivers
MAINT: Phased out legacy tools (dspam_corpus, dspam_genaliases)
BUGFIX: When using logfile, write errors result in segfault
BUGFIX: Compiler warnings with sqlite_drv and sqlite3_drv
BUGFIX: MySQLUIDInSignature causes segfault on retrain
BUGFIX: trainPristine preference "off" does not override default

Works so far without any problem over here (even with multiple storage drivers (currently I have mysql_drv and hash_drv)):

Code:

nautilus / # dspam --version

DSPAM Anti-Spam Suite 3.6.6 (agent/library)

Copyright (c) 2002-2006 Jonathan A. Zdziarski
http://dspam.nuclearelephant.com

DSPAM may be copied only under the terms of the GNU General Public License,
a copy of which can be found with the DSPAM distribution kit.

Configuration parameters: --prefix=/usr --host=i686-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --enable-long-username --with-delivery-agent=/usr/bin/procmail --enable-large-scale --with-dspam-home=/var/spool/dspam --sysconfdir=/etc/mail/dspam --with-mysql-includes=/usr/include/mysql --with-mysql-libraries=/usr/lib/mysql --enable-preferences-extension --enable-daemon --enable-virtual-users --with-storage-driver=mysql_drv,hash_drv --build=i686-pc-linux-gnu

nautilus / #
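
The sed one-liners used earlier in this thread pull a single configure flag out of exactly this output. As a small sketch (dspam_configure_value is a hypothetical helper name, and the flag name is assumed to contain no ':' characters), the same idea wrapped in a function:

```shell
#!/bin/bash
# Sketch: extract the value of one configure flag from "dspam --version"
# output read on stdin. dspam_configure_value is a hypothetical helper.
# Usage: dspam --version | dspam_configure_value with-dspam-home
dspam_configure_value() {
    local flag="$1"
    # The leading ".*" is greedy, so if a flag appears more than once
    # the LAST occurrence on the line is the one that gets printed.
    sed -n "s:^.*--${flag}=\([^ ]*\).*:\1:p"
}
```

Note that the output above lists --sysconfdir twice (/etc and /etc/mail/dspam). Because of the greedy match, this sketch returns the last one, /etc/mail/dspam.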

cheers

SteveB

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Mon May 15, 2006 9:25 am Post subject:

steveb wrote:

petrjanda wrote:

Yep, I understand that. Thanks!

Any other question about DSPAM? I feel in the mood to answer more stuff

Ok, then, Why do some people find that DSPAM 3.6 is less accurate than 3.4 with the same amount of corpused data?

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Mon May 15, 2006 9:30 am Post subject:

petrjanda wrote:

Ok, then, Why do some people find that DSPAM 3.6 is less accurate than 3.4 with the same amount of corpused data?

Probably because they don't read the README and keep the defaults they had with 3.4.x. Some algorithms are deprecated in 3.6.x.

Generally I would say that 3.6.x with the hash driver (Markovian algorithm) is the most accurate (but also the most inflexible) option in DSPAM. 3.4.x does not have such an accurate algorithm.

Besides that, 3.6.x has some nice features to raise the accuracy (like naive, noise, etc.).

And another thing: the 3.6 series discourages you from using corpus feeding. It is better to use dspam_train than to feed spam/ham with dspam_corpus.

cheers

SteveB

petrjanda
Veteran

Joined: 05 Sep 2003
Posts: 1557
Location: Brno, Czech Republic

Posted: Mon May 15, 2006 10:03 am Post subject:

When I first installed DSPAM 3.6, I used a DSPAM 3.6 conf file and it was still very inaccurate. I then installed DSPAM 3.4 and used its default conf file and it was clearly much more accurate. That's what I can't explain.

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Mon May 15, 2006 10:22 am Post subject:

petrjanda wrote:

Ok, then, Why do some people find that DSPAM 3.6 is less accurate than 3.4 with the same amount of corpused data?

You can see in my posts above that I have very high accuracy for my user "globaluser". And for this user I deliberately turned off whitelisting, processor bias, training buffer and all that stuff. And I still manage to get 99.997% accuracy (3 errors in 127'507) on the spamarchive.org submit spam and 100% accuracy (0 errors in 11'002) on the spamarchive.org submitautomated spam.

I would not call those numbers bad accuracy. What do you think?
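
For anyone who wants to compare their own numbers, the percentages are simply (total - errors) / total. A minimal sketch, where accuracy is a hypothetical helper name and awk is assumed to be available:

```shell
#!/bin/bash
# Sketch: accuracy in percent from an error count and a message count.
# accuracy is a hypothetical helper name.
# Usage: accuracy ERRORS TOTAL
accuracy() {
    local errors="$1" total="$2"
    # LC_ALL=C keeps the decimal separator a dot regardless of locale
    LC_ALL=C awk -v e="${errors}" -v t="${total}" 'BEGIN {
        if (t <= 0) exit 1
        printf "%.4f\n", 100 * (t - e) / t
    }'
}
```

With the figures above, "accuracy 3 127507" prints 99.9976 and "accuracy 0 11002" prints 100.0000.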

cheers

SteveB

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Mon May 15, 2006 10:24 am Post subject:

petrjanda wrote:

When I first installed DSPAM 3.6, I used a DSPAM 3.6 conf file and it was still very inaccurate. I then installed DSPAM 3.4 and used its default conf file and it was clearly much more accurate. That's what I can't explain.

Can you post your dspam.conf file and say how you trained? Did you use dspam_train? (It was not available in 3.4.x, but it was available as a Perl script from the DSPAM mailing list, or you could check it out from CVS and still use it for 3.6.x.)

cheers

SteveB

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Mon May 15, 2006 10:35 am Post subject:

BTW: Can you run the command I mentioned above against the spamarchive.org files? What accuracy do you get with your 3.4.x installation?

steveb
Advocate

Joined: 18 Sep 2002
Posts: 4564

Posted: Mon May 15, 2006 10:53 am Post subject:

petrjanda wrote:

Ok, then, Why do some people find that DSPAM 3.6 is less accurate than 3.4 with the same amount of corpused data?

When you write about "some people", do you mean Gentoo users or other users? I ask because I don't use the Gentoo ebuild for DSPAM. I have my own ebuild for DSPAM (don't ask! It's a long story... I am just not that happy with the official ebuild, and after complaining on bugzilla for a while, I started to maintain my own). Maybe this has an influence on the accuracy as well?

cheers

SteveB
