[olug] Bogofilter
William E. Kempf
wekempf at cox.net
Thu Mar 27 17:15:08 UTC 2003
Eric Penne said:
> I wanted to test out bayesian filtering on my email. I downloaded,
> compiled and installed bogofilter from Eric Raymond at
> http://bogofilter.sf.net on the olug server.
I use Ifile for the same thing.
> The program is standalone the way I'm using it but I bet there would be
> economies of scale if it was installed system wide. I don't really have
> any information on cpu/disk usage and such.
>
> No matter what though, it works wonderfully.
Bayes filtering is nice... though spammers are starting to figure out ways
around it.
> It bases its filtering on keywords and phrases that it stores in files
> in my .bogofilter/ directory. Words it deems as good go into a
> goodlist.db file and spam words go into a spamlist.db file. Bogofilter
> gives the mail a spamicity. The spamicity is between 0 and 1. In the
> bogofilter file I have set the threshold value of the spamicity at 0.95.
> >= 0.95 is spam and < 0.95 is not spam. When bogofilter considers
> something to be spam it adds those words and phrases to the spamlist.db
> file. The same thing happens for the goodlist.db file. Occasionally I
> get False Positives and False Negatives which I then file accordingly
> into directories named FalsePositive and FalseNegative. Every day I log
> in to olug.org and run a script that reassociates the mail from bad to
> good or good to bad.
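
For anyone following along, the scoring and threshold rule described above
can be sketched roughly as below. This is only a toy, not bogofilter's
actual algorithm or database format: the ToyFilter class, its two in-memory
word counters (standing in for goodlist.db and spamlist.db), and the naive
Bayes scoring with add-one smoothing are all assumptions made for
illustration. Only the 0.95 cutoff and the ">= is spam" rule come from the
message.

```python
import math
from collections import Counter

SPAM_THRESHOLD = 0.95  # cutoff quoted in the message above


class ToyFilter:
    """Toy word-count filter; NOT bogofilter's real algorithm or file format."""

    def __init__(self):
        self.good = Counter()  # hypothetical stand-in for goodlist.db
        self.spam = Counter()  # hypothetical stand-in for spamlist.db

    def train(self, text, is_spam):
        """Add the words of one message to the spam or good counts."""
        target = self.spam if is_spam else self.good
        target.update(text.lower().split())

    def spamicity(self, text):
        """Return a score in [0, 1]; 0.5 means no evidence either way."""
        vocab = len(set(self.good) | set(self.spam))
        if vocab == 0:
            return 0.5  # nothing learned yet
        good_total = sum(self.good.values())
        spam_total = sum(self.spam.values())
        log_odds = 0.0
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing a probability.
            p_spam = (self.spam[word] + 1) / (spam_total + vocab)
            p_good = (self.good[word] + 1) / (good_total + vocab)
            log_odds += math.log(p_spam) - math.log(p_good)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def classify(spamicity, threshold=SPAM_THRESHOLD):
    """Apply the rule from the message: >= threshold is spam, else not."""
    return "spam" if spamicity >= threshold else "good"
```

Refiling a false positive or false negative would then amount to calling
train() again on the misfiled message with the corrected label.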
Ifile allows for unlimited categorization. This means that not only do I
catch spam, but ifile also sorts my mail into various other folders. For
instance, mail from this group automatically goes to an "olug" folder.
> False Positives and False Negatives happen because the email has a lot
> of info that looks like spam. In the beginning before my spamlist.db
> was built up it put all spam in my goodlist.db because it didn't have
> anything in my spamlist.db. I pull a bunch of email down from Yahoo! so
> after reassociating 30 or 40 spam it quickly caught on to what I
> considered spam. Now my spamlist.db file is approx 20MB and my
> goodlist.db file is approx 1MB and it has about a 0.5% error rate.
> The error is usually when I get Yahoo! reminders that get put in spam
> but I want to get them in my Inbox.
I've got spam corpuses you can use to "seed" the engine, if you're
interested.
For refiling with ifile, I use a different approach. My procmail scripts
inject a new header, Ifile-hint, into the mail. If it's misfiled, I just
move it to the correct folder. A cron script then finds mail whose
Ifile-hint header doesn't match the folder it sits in and "relearns" it.
This means I don't need to have extra folders cluttering my system that
aren't really used.
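
A rough sketch of the mismatch check the cron script performs, for
illustration only. It is not my actual script: the find_misfiled function
and the idea of passing folders as a name-to-messages mapping are
assumptions, and the relearn step itself is left out rather than guessing
at command lines. Only the Ifile-hint header name comes from the setup
described above.

```python
from email.message import Message
from typing import Dict, Iterable, List, Tuple

HINT_HEADER = "Ifile-hint"  # header injected at delivery time


def find_misfiled(
    folders: Dict[str, Iterable[Message]]
) -> List[Tuple[str, str, str]]:
    """Return (message_id, hinted_folder, actual_folder) for each mismatch.

    `folders` maps a folder name to the messages currently stored in it.
    Messages without the hint header are skipped, since there is no
    recorded classification to compare against.
    """
    misfiled = []
    for folder, messages in folders.items():
        for msg in messages:
            hint = msg.get(HINT_HEADER)
            if hint is None:
                continue
            hint = hint.strip()
            if hint != folder:
                # The message was moved by hand after delivery, so the
                # classifier guessed wrong and should relearn it.
                message_id = msg.get("Message-ID", "<unknown>")
                misfiled.append((message_id, hint, folder))
    return misfiled
```

With mbox files, something like mailbox.mbox(path) from the Python standard
library could supply the messages for each folder. Each tuple returned
tells you which folder to unlearn the message from and which to learn it
into.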
--
William E. Kempf