Spam Filter ISP Support Forum

How Does Bayesian Learn ?

Lee
Posted: 13 December 2004 at 4:03pm

I am trying to better understand how SpamFilter's Bayesian filter works and whether it actually improves over time.

My question is: where does the corpus database learn new words from, and how does it weight them?

Does it learn only from the keywords that are entered?
Can it also scan and learn from all quarantined mail?

It would seem to me that the Bayesian filter should scan every rejected email and consider it to be spam. For example, if spam is sent to an unknown address, shouldn't the filter add the words from those emails to its database?

Lee

LogSat
Posted: 13 December 2004 at 10:11pm

Lee,

SpamFilter examines every single email that it receives (good and spam) and breaks it apart into tokens. Tokens from spam emails are weighed and assigned a certain score; tokens from clean emails are weighed differently and assigned a different score. All tokens are inserted into the corpus database. As new emails arrive, the corpus is updated in real time, and SpamFilter reloads it every 10 minutes or so.
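
As a rough illustration of the learning process described above, here is a minimal Python sketch of a token corpus that learns from every message. It is not SpamFilter's actual code: the tokenizer, the `Corpus` class, and the simple frequency-ratio score are all assumptions made for illustration, and the real weighting may differ.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, digits, and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(source: str) -> set:
    """Split raw message source into a set of lowercase tokens."""
    return {t.lower() for t in TOKEN_RE.findall(source)}


class Corpus:
    """Hypothetical corpus: per-token message counts for spam and good mail."""

    def __init__(self) -> None:
        self.spam_tokens = Counter()
        self.good_tokens = Counter()
        self.spam_msgs = 0
        self.good_msgs = 0

    def learn(self, source: str, is_spam: bool) -> None:
        """Add one message's tokens to the spam or good side of the corpus."""
        tokens = tokenize(source)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.good_tokens.update(tokens)
            self.good_msgs += 1

    def token_spam_probability(self, token: str) -> float:
        """Share of a token's relative frequency that comes from spam."""
        spam_freq = self.spam_tokens[token] / max(self.spam_msgs, 1)
        good_freq = self.good_tokens[token] / max(self.good_msgs, 1)
        if spam_freq + good_freq == 0:
            return 0.5  # unseen token: treat as neutral
        return spam_freq / (spam_freq + good_freq)
```

In this sketch a token seen only in spam scores 1.0, a token seen only in good mail scores 0.0, and a token never seen before scores a neutral 0.5.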

The tokens are retrieved from the email's source, not from any of the keywords or filters you have configured. The SpamFilter keywords and filters simply help the statistical filter determine what is spam and what is not; they are not themselves added to the corpus.

Furthermore, when a user force-delivers a valid email that was mistakenly quarantined, SpamFilter processes that email again. Its tokens are "tagged" to reflect a false positive, and the corpus database is updated accordingly to reduce the chance of a similar mistake in the future.
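
One way to picture the false-positive correction is as moving a message's tokens from the spam side of the corpus to the good side. The sketch below assumes exactly that; how SpamFilter actually "tags" tokens is not described in this thread, so the function name `correct_false_positive` and the count-moving approach are hypothetical.

```python
from collections import Counter


def correct_false_positive(spam_tokens: Counter, good_tokens: Counter,
                           tokens: set) -> None:
    """Undo a wrong 'spam' verdict for one message (assumed approach).

    Each token loses one spam occurrence and gains one good occurrence.
    """
    for token in tokens:
        if spam_tokens[token] > 1:
            spam_tokens[token] -= 1
        else:
            # Drop the entry entirely rather than keeping a zero count.
            spam_tokens.pop(token, None)
        good_tokens[token] += 1
```

After the correction, the same tokens weigh less toward spam the next time a similar message arrives.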

Roberto F. LogSat Software

mikek
Posted: 16 December 2004 at 3:55am

I am still convinced that for the Bayesian filter to be effective, there should be some way to report false negatives. Right now, if you don't keep a lengthy list of keywords, too much spam still gets through and is tagged as "good" email in the Bayesian filter, which renders the filter useless.

keizersozay
Posted: 16 December 2004 at 2:37pm

I've been thinking about what you said and I am going to try something. Hopefully Roberto can tell me whether this will work.

What about adding an email address to SpamFilter's 'blacklist to' file, for example 'reportspam@spam.mycompanyname', with the ':nondr' option? Notice that I am not using a real domain. Then set up a distribution list on your email server called 'reportspam' and include the above address. (You may have to first create the address as a contact, then add the contact to the distribution list.) Finally, set up your email server to forward mail for that address to SpamFilter using a smarthost. I think that can be done.

This would automatically add the contents of the email to the Bayesian filter and help it be more effective.

LogSat
Posted: 17 December 2004 at 12:08am

If I understood your idea correctly, end users would forward spam to the reporting address. If so, unfortunately I do not believe that will yield accurate results. The statistical filter works on the email's source, and if a user has Microsoft Outlook, for example, that client *completely* changes the email's source. When the user forwards a message, the forwarded copy is completely different from the original, so new messages similar to the original will not match the tokens learned from the forwarded copy.

Unless the user is able to forward the *full*, unmodified, original email's source this will not work as expected.
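
To make the problem concrete, the sketch below compares the tokens of an original message source with those of a forwarded copy. The tokenizer and both sample messages are invented for illustration; real clients alter the source in their own ways.

```python
import re


def tokenize(source: str) -> set:
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", source.lower()))


def lost_tokens(original: str, forwarded: str) -> set:
    """Tokens present in the original source but absent from the forward."""
    return tokenize(original) - tokenize(forwarded)


# Invented example: HTML spam as received by the server.
ORIGINAL = (
    "Received: from mail.example.net\n"
    "Subject: Cheap meds\n"
    "Content-Type: text/html\n\n"
    '<html><body><a href="http://pills.example">Buy now</a></body></html>'
)

# Invented example: the same message after a client forwards it as text.
FORWARDED = (
    "Subject: FW: Cheap meds\n"
    "Content-Type: text/plain\n\n"
    "-----Original Message-----\n"
    "Buy now"
)
```

Calling `lost_tokens(ORIGINAL, FORWARDED)` on these samples returns the header and markup tokens (such as `received`, `href`, and `pills`) that the forwarded copy no longer carries, which are exactly the tokens a filter working on raw source would have wanted to learn.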

Roberto F. LogSat Software

LogSat
Posted: 17 December 2004 at 12:13am

Mike,

While your statement is correct, and SpamFilter's statistical filter could be trained better if we could feed it false negatives, please note that a properly configured SpamFilter will block huge amounts of spam. Over the past months, for example, our own server has recorded the following email counts:

[Messages] Spam=9105146 Good=559984

We receive roughly 16 times more spam than good email (9,105,146 / 559,984). Still, under these conditions, during the past 3 days we blocked 100,000 emails using the MAPS filter, 90,000 using blacklisted countries, and only 1,000 using Bayesian filtering.

The statistical filter is one of the last to be applied as email arrives; other filters run first. And even though 1,000 may seem small compared to 100,000, it still means that we received one thousand fewer spam emails in three days.
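
The effect of filter order on these counts can be sketched as follows. Each blocked message is credited only to the first filter that rejects it, so a filter that runs last gets a small count even if it would have caught many of the same messages. The filter names, their order, the message fields, and the thresholds are all hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# A filter is a (name, predicate) pair; the predicate returns True to reject.
Filter = Tuple[str, Callable[[dict], bool]]


def run_filters(messages: Iterable[dict], filters: List[Filter]) -> Counter:
    """Credit each blocked message to the first filter that rejects it."""
    blocked = Counter()
    for msg in messages:
        for name, rejects in filters:
            if rejects(msg):
                blocked[name] += 1
                break  # later filters never see this message
    return blocked


# Hypothetical filters in an assumed order, with the statistical filter last.
FILTERS: List[Filter] = [
    ("maps", lambda m: m["ip_listed"]),
    ("country", lambda m: m["country"] in {"XX"}),
    ("bayesian", lambda m: m["spam_score"] > 0.9),
]
```

With the three-day figures quoted above, the Bayesian filter accounts for 1,000 of the 191,000 blocked messages, or about 0.5%, which reflects its position in the chain as much as its accuracy.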

Roberto F. LogSat Software