
How Does Bayesian Learn ?

Printed From: LogSat Software
Category: Spam Filter ISP
Forum Name: Spam Filter ISP Support
Forum Description: General support for Spam Filter ISP
URL: https://www.logsat.com/spamfilter/forums/forum_posts.asp?TID=4749
Printed Date: 30 December 2025 at 8:45am


Topic: How Does Bayesian Learn ?
Posted By: Lee
Subject: How Does Bayesian Learn ?
Date Posted: 13 December 2004 at 4:03pm

I am trying to better understand how SpamFilter's Bayesian filter works and whether it is actually improving over time.

My question is: where does the corpus db actually learn new words, and how does it weight them?

Does it do this only from keywords that are entered?
Can it also scan and learn from all quarantined mail?

It would seem to me that the Bayesian filter should scan every rejected email and consider that to be spam. For example, if spam is sent to an unknown address, shouldn't the filter add words from those emails to its db?

Lee




Replies:
Posted By: LogSat
Date Posted: 13 December 2004 at 10:11pm
Lee,

SpamFilter examines every single email that it receives (good and spam), and breaks it apart into tokens. Tokens from spam email are weighed and assigned a certain score; tokens from clean emails are weighed differently and assigned a different score. All tokens are inserted into the corpus database, and as new emails arrive, the corpus is updated in real time and reloaded by SpamFilter every 10 minutes or so.
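As an illustration only (this is not SpamFilter's actual code), the tokenize-and-count idea described above can be sketched in Python. The `Corpus` class, the token pattern, and the add-one smoothing in `token_score` are all assumptions made for this example; the real scoring formula is not given in this thread.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of letters, digits, $, ' and -
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(source):
    """Split a raw message source into lowercase tokens."""
    return [t.lower() for t in TOKEN_RE.findall(source)]

class Corpus:
    """Per-token tallies of how many spam and good messages contained it."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()
        self.n_spam = 0
        self.n_good = 0

    def learn(self, source, is_spam):
        # Count each distinct token once per message.
        tokens = set(tokenize(source))
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.good.update(tokens)
            self.n_good += 1

    def token_score(self, token):
        """Smoothed estimate of how 'spammy' a token is, between 0 and 1."""
        s = (self.spam[token] + 1) / (self.n_spam + 2)
        g = (self.good[token] + 1) / (self.n_good + 2)
        return s / (s + g)

corpus = Corpus()
corpus.learn("Buy cheap pills now", is_spam=True)
corpus.learn("Meeting notes attached", is_spam=False)
print(corpus.token_score("pills"), corpus.token_score("meeting"))
```

In this sketch a token seen only in spam scores above 0.5, a token seen only in good mail scores below 0.5, and an unseen token scores exactly 0.5.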

The tokens are retrieved from the email's source, not from any of the keywords or filters you have configured. The SpamFilter keywords and filters simply help the statistical filter determine what is spam and what is not; they are not themselves added to the corpus.

Furthermore, when a user force-delivers a valid email that was mistakenly added to the quarantine, that is also further processed by SpamFilter, and its tokens are "tagged" to reflect a false positive, and the corpus database is updated to account for that fact to reduce the chance of a similar mistake happening in the future.
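A rough sketch of what such a false-positive correction could look like, assuming the corpus is just two token tallies. The function name `relearn_as_good` and the counts are hypothetical, and how SpamFilter actually "tags" tokens is not described here.

```python
from collections import Counter

def relearn_as_good(spam_counts, good_counts, tokens):
    """Move one message's tokens from the spam tally to the good tally."""
    for tok in set(tokens):
        if spam_counts[tok] > 0:
            spam_counts[tok] -= 1   # undo the earlier spam count
        good_counts[tok] += 1       # record it as seen in good mail

# Made-up tallies for illustration
spam_counts = Counter({"invoice": 3, "viagra": 5})
good_counts = Counter({"invoice": 1})

# A quarantined message containing these tokens is force-delivered
relearn_as_good(spam_counts, good_counts, ["invoice", "attached"])
print(spam_counts, good_counts)
```

Tokens that appear only in the corrected message shift toward "good", while unrelated spam tokens are left untouched.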

Roberto F. LogSat Software


Posted By: mikek
Date Posted: 16 December 2004 at 3:55am
I am still convinced that for the bayesian filter to be effective, there should be some way to report false-negatives. Because right now, if you don't keep a lengthy list of keywords, too much spam will still get through, being tagged as "good" e-mail in the bayesian filter and therefore rendering the bayesian filter useless.


Posted By: keizersozay
Date Posted: 16 December 2004 at 2:37pm

I've been thinking about what you said and I am going to try something. Hopefully Roberto can tell me if this will work or not too.

What about adding an email address to SpamFilter in the 'blacklist to' file, say 'reportspam@spam.mycompanyname' or something like that, and adding the ':nondr' option? Notice I am not using a real domain. Now set up a distribution list in your email server called 'reportspam' and include the above email address. (You may have to first set the above email address as a contact, then add the contact to the distribution list.) Now set up your email server to forward email for that email address to SpamFilter using a smarthost... I think that can be done...

This would automatically add the contents of the email to the Bayesian filter and help it be more effective.



Posted By: LogSat
Date Posted: 17 December 2004 at 12:08am
If I understood your idea correctly, end-users would be forwarding the spam to the reporting email. If so, unfortunately I do not believe that will yield accurate results. The statistical filter works on the email's source, and if a user has Microsoft Outlook for example, that client *completely* changes the email's source. When the user forwards a message, it's completely different from the original, so new messages similar to the original won't even contain the tokens learned from the end-user's forwarded copy.

Unless the user is able to forward the *full*, unmodified, original email's source this will not work as expected.
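To make the point concrete, here is a small sketch comparing the token sets of an original message source and a forwarded copy. Both messages are invented for the example, and the tokenizer is an assumption, not SpamFilter's.

```python
import re

def tokens(source):
    """Set of lowercase tokens found in a raw message source."""
    return set(re.findall(r"[a-z0-9$-]+", source.lower()))

def overlap(a, b):
    """Jaccard similarity of two sources' token sets (1.0 = identical sets)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Invented original spam source, with headers and HTML markup
original = ("Received: from mx.example.net\n"
            "Subject: Cheap meds\n"
            "Content-Type: text/html\n\n"
            "<a href=http://pills.example/buy>Buy now</a>")

# Invented forwarded copy: new headers, markup stripped by the mail client
forwarded = ("Subject: FW: Cheap meds\n"
             "Content-Type: text/plain\n\n"
             "-----Original Message-----\nBuy now")

lost = tokens(original) - tokens(forwarded)
print(sorted(lost))
print(overlap(original, forwarded))
```

The header and markup tokens (for example `received` and `href`) never reach the corpus via the forwarded copy, so training on it teaches the filter little about what the next original will look like.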

Roberto F. LogSat Software


Posted By: LogSat
Date Posted: 17 December 2004 at 12:13am
Mike,

While your statement is correct (SpamFilter's statistical filter could be trained better if we could feed it false-negatives), please note that a properly configured SpamFilter will block huge amounts of spam. During the past months, for example, our own server has had the following email counts:

[Messages] Spam=9105146 Good=559984

We receive roughly 16x more spam than good emails. Still, under these conditions, during the past 3 days we blocked 100,000 emails using the MAPS filter, 90,000 using blacklisted countries, and only 1,000 using Bayesian filtering.

The statistical filter is one of the last ones to be used as email arrives; other filters are applied first. And even though 1,000 may seem small compared to 100,000, it still means that we received one thousand fewer spam emails in three days...
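For anyone who wants to check the arithmetic, the figures quoted above can be worked through in a few lines of Python. Only the three filters named in this post are included, so the shares are relative to those three and not to all blocked mail.

```python
# Counts quoted in this post
spam, good = 9_105_146, 559_984
ratio = spam / good  # spam messages per good message

# Blocks over three days, for the three filters mentioned
blocked = {
    "MAPS": 100_000,
    "blacklisted countries": 90_000,
    "Bayesian": 1_000,
}
total = sum(blocked.values())
shares = {name: count / total for name, count in blocked.items()}

print(f"spam:good ratio = {ratio:.2f}")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

On these numbers the spam-to-good ratio is about 16.26, and the Bayesian filter accounts for roughly half a percent of the blocks among the three filters listed.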

Roberto F. LogSat Software


